The skill that addresses the Proof-or-Bluff gap: a self-verified 85.7% IMO score collapses to <5% under human grading. It uses fresh-context verifiers armed with specific failure patterns (not a generic "check the logic"). Validated: 17/18 IMO+Putnam 2025 problems solved, 0 false positives, 2 novel proofs. See eval data in anthropic monorepo sandbox/sandbox/ralph/math_skills/.
Model Tier Defaults
Parameters scale with model capability. Budget is not the constraint — the constraints are diminishing returns (more voters stop helping past a point) and the asymmetric noise floor (Haiku verifiers are individually less reliable, so the right response is width not depth).
Haiku 4.5
Width compensates for per-sample noise. Scaffolding is where the leverage is.
- Parallel solvers: 12 (wide fan — each individual solve is weaker, so cast a wider net)
- Vote budget: 7 verifiers, need 5-confirm / 3-refute (pigeonhole exit: stop when outcome decided)
- Abstain threshold: 3 consecutive revise cycles fail
- Pattern sweep: all 12 patterns — Haiku can follow a checklist, the patterns are the scaffold
- Presentation pass: yes, 3 drafts, comparator picks cleanest. Haiku's raw output is rougher, so this matters MORE not less.
- Rationale: The skill's value is highest where the base model is weakest. Give Haiku the full harness. The 3-refute threshold (higher than Sonnet's 2) accounts for Haiku verifiers being individually noisier — don't let 2 confused Haikus kill a correct proof.
Sonnet 4.6
Balanced.
- Parallel solvers: 6
- Vote budget: 5 verifiers, need 4-confirm / 2-refute
- Abstain threshold: 3 consecutive revise cycles fail
- Pattern sweep: all 12
- Presentation pass: 2 drafts, comparator picks cleaner
- Rationale: 4-of-5 tolerates one flake. 2 dissents is signal.
Opus 4.6 / Capybara
Depth. Each sample is strong, so invest in making the adversarial pass harder.
- Parallel solvers: 4
- Vote budget: 5 general verifiers (4-confirm / 2-refute) PLUS one dedicated verifier per pattern in verifier_patterns.md (12 targeted attacks). Any pattern-specific HOLE FOUND counts toward refute.
- Abstain threshold: 5 consecutive revise cycles fail (trust the model's ability to eventually fix)
- Pattern sweep: all 12, each with its own dedicated agent
- Presentation pass: 3 drafts with different instructions ("most elegant," "most elementary," "shortest"), comparator picks the best. Strong models can genuinely produce different styles of proof.
- Rationale: Opus/Capybara can execute the deep patterns (#19 base-vs-derived, #22 mean-first) that need real mathematical judgment. The 12 dedicated pattern passes are where the model's capability is best spent — it's the difference between "be skeptical" and "check THIS specific thing."
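The three tier defaults above can be captured in one small config table. A minimal sketch — the class and field names here are illustrative, not the skill's actual schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TierConfig:
    parallel_solvers: int
    confirm_needed: int          # confirmations required to accept the proof
    refute_needed: int           # refutations required to reject it
    abstain_after: int           # consecutive failed revise cycles before abstaining
    presentation_drafts: int     # drafts fed to the comparator
    per_pattern_verifiers: bool  # one dedicated verifier per pattern (Opus tier only)

    @property
    def vote_budget(self) -> int:
        # By construction, exactly one of the two thresholds can be
        # reached within this budget (pigeonhole).
        return self.confirm_needed + self.refute_needed - 1

# Values taken directly from the tier sections above.
TIER_DEFAULTS = {
    "haiku":  TierConfig(12, 5, 3, 3, 3, False),
    "sonnet": TierConfig(6, 4, 2, 3, 2, False),
    "opus":   TierConfig(4, 4, 2, 5, 3, True),
}
```

Note that the vote budget is derived rather than stored, which keeps the pigeonhole invariant true by construction.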
On the pigeonhole exit
Kept at all tiers — not because of cost, but because the vote budget is exactly confirm_needed + refute_needed - 1, so by pigeonhole exactly one threshold can be reached. Once either threshold is hit, the remaining votes carry no information regardless of how they land. Launching them anyway is pure latency.
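The exit condition can be sketched as a sequential vote loop. This is a simplification — the real verifiers run in parallel and in-flight ones would be cancelled rather than never launched — and the function name is illustrative:

```python
def run_votes(verdicts, confirm_needed, refute_needed):
    """Tally verifier verdicts (True = confirm, False = refute),
    stopping as soon as either threshold is reached.
    Returns (accepted, votes_consumed)."""
    confirms = refutes = used = 0
    for verdict in verdicts:
        if confirms >= confirm_needed or refutes >= refute_needed:
            break  # pigeonhole exit: outcome already decided
        used += 1
        if verdict:
            confirms += 1
        else:
            refutes += 1
    return confirms >= confirm_needed, used
```

With the Sonnet budget (4-confirm / 2-refute, 5 votes), four straight confirms decide the outcome and the fifth vote is never consumed.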
Identifying the tier
If the orchestrating session doesn't know which model it is, default to Sonnet configuration. A reasonable heuristic: ask the model to self-identify in its first response and match against haiku/sonnet/opus/capybara in the output.
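That heuristic amounts to a substring match over the model's self-report, falling back to Sonnet. A sketch, with an assumed mapping of Capybara onto the Opus configuration per the heading above:

```python
def detect_tier(self_report: str) -> str:
    """Match a model's self-identification against known tier names.
    Unknown or ambiguous reports fall back to the Sonnet defaults."""
    text = self_report.lower()
    for tier in ("haiku", "sonnet", "opus", "capybara"):
        if tier in text:
            # Capybara uses the Opus-tier configuration.
            return "opus" if tier == "capybara" else tier
    return "sonnet"  # default when the session can't identify itself
```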