# Testing Superpowers
Superpowers has two distinct kinds of tests, each in its own directory:
- **`tests/`** — does the plugin's non-LLM code work? Bash, Node, and Python integration tests for brainstorm-server JS, OpenCode plugin loading, codex-plugin sync, and analysis utilities.
- **`evals/`** — do agents behave correctly on real LLM sessions? Python harness driving real tmux sessions of Claude Code / Codex / Gemini CLI, with an LLM actor and verifier judging skill compliance.
## Plugin tests
Live in `tests/`. Currently:
- `tests/brainstorm-server/` — node test suite for the brainstorm server JS code.
- `tests/opencode/` — bash tests for OpenCode plugin loading, bootstrap caching, and tool registration.
- `tests/codex-plugin-sync/` — bash sync verification.
- `tests/claude-code/test-helpers.sh`, `analyze-token-usage.py` — utilities used by the remaining bash tests.
- `tests/claude-code/test-subagent-driven-development.sh` — agent-can-describe-SDD test (no drill counterpart; tests description-recall, not behavior).
- `tests/claude-code/test-subagent-driven-development-integration.sh` — extended SDD integration with token analysis (drill covers the YAGNI subset; bash adds commit-count, TodoWrite, and token telemetry assertions).
- `tests/claude-code/test-worktree-native-preference.sh` — RED-GREEN-REFACTOR validation for worktree skill (drill covers the PRESSURE phase; bash also covers RED/GREEN baselines).
- `tests/explicit-skill-requests/` — Haiku-specific, multi-turn, and skill-name-prompted tests not covered by drill.
Run plugin tests via the relevant directory's `run-*.sh` or `npm test`.
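The bash suites can also be swept in one loop. A minimal sketch, assuming the `run-*.sh` naming convention above and a POSIX shell, run from the repo root (the existence check skips the literal pattern when the glob matches nothing):

```shell
#!/usr/bin/env bash
# Run every bash suite under tests/ (assumes the run-*.sh naming above).
set -u
for t in tests/*/run-*.sh; do
  [ -e "$t" ] || continue                        # glob matched nothing; skip
  bash "$t" || { echo "FAIL: $t" >&2; exit 1; }
done
echo "all plugin suites passed"
```

This stops at the first failing suite; drop the `exit 1` to collect all failures in one pass instead.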
## Skill behavior evals
Live in `evals/`. Drill is the harness; scenarios live at `evals/scenarios/*.yaml`. See `evals/README.md` for setup. Quick start:
```bash
cd evals
uv sync --extra dev
export ANTHROPIC_API_KEY=sk-...
uv run drill run triggering-test-driven-development -b claude
```
Drill scenarios are slow (3-30+ minutes each) and run real LLM sessions. They are not part of CI today; the natural follow-up is a tiered model (fast subset on PR, full sweep nightly + on-demand).
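One way such a tiered model might look as a GitHub Actions workflow. This is a sketch only: no such workflow exists in the repo yet, and the file path, job layout, and choice of fast-subset scenario are all assumptions:

```yaml
# .github/workflows/evals.yml (hypothetical)
on:
  pull_request: {}        # fast subset gates PRs
  schedule:
    - cron: "0 3 * * *"   # full sweep nightly
  workflow_dispatch: {}   # on-demand full sweep
jobs:
  drill:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v5
      - name: fast subset on PRs, full sweep otherwise
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          cd evals
          uv sync --extra dev
          if [ "${{ github.event_name }}" = "pull_request" ]; then
            # Assumed fast subset: one quick scenario as the PR gate.
            uv run drill run triggering-test-driven-development -b claude
          else
            # Full-sweep invocation TBD; depends on drill's scenario selection.
            echo "full sweep goes here"
          fi
```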