The skill that addresses the Proof-or-Bluff gap: a self-verified 85.7% IMO score collapses to <5% under human grading. It uses fresh-context verifiers armed with specific failure patterns (not a generic "check the logic"). Validated: 17/18 IMO+Putnam 2025 problems solved, 0 false positives, 2 novel proofs. See eval data in anthropic monorepo sandbox/sandbox/ralph/math_skills/.
Model Tier Defaults
Parameters scale with model capability. Budget is not the constraint — the constraints are diminishing returns (more voters stop helping past a point) and the asymmetric noise floor (Haiku verifiers are individually less reliable, so the right response is width not depth).
Haiku 4.5
Width compensates for per-sample noise. Scaffolding is where the leverage is.
- Parallel solvers: 12 (wide fan — each individual solve is weaker, so cast a wider net)
- Vote budget: 7 verifiers, need 5-confirm / 3-refute (pigeonhole exit: stop when outcome decided)
- Abstain threshold: 3 consecutive revise cycles fail
- Pattern sweep: all 12 patterns — Haiku can follow a checklist, the patterns are the scaffold
- Presentation pass: yes, 3 drafts, comparator picks cleanest. Haiku's raw output is rougher, so this matters MORE not less.
- Rationale: The skill's value is highest where the base model is weakest. Give Haiku the full harness. The 3-refute threshold (higher than Sonnet's 2) accounts for Haiku verifiers being individually noisier — don't let 2 confused Haikus kill a correct proof.
Sonnet 4.6
Balanced.
- Parallel solvers: 6
- Vote budget: 5 verifiers, need 4-confirm / 2-refute
- Abstain threshold: 3 consecutive revise cycles fail
- Pattern sweep: all 12
- Presentation pass: 2 drafts, comparator picks cleaner
- Rationale: 4-of-5 tolerates one flake. 2 dissents is signal.
Opus 4.6 / Capybara
Depth. Each sample is strong, so invest in making the adversarial pass harder.
- Parallel solvers: 4
- Vote budget: 5 general verifiers (4-confirm / 2-refute) PLUS one dedicated verifier per pattern in verifier_patterns.md (12 targeted attacks). Any pattern-specific HOLE FOUND counts toward refute.
- Abstain threshold: 5 consecutive revise cycles fail (trust the model's ability to eventually fix)
- Pattern sweep: all 12, each with its own dedicated agent
- Presentation pass: 3 drafts with different instructions ("most elegant," "most elementary," "shortest"), comparator picks the best. Strong models can genuinely produce different styles of proof.
- Rationale: Opus/Capybara can execute the deep patterns (#19 base-vs-derived, #22 mean-first) that need real mathematical judgment. The 12 dedicated pattern passes are where the model's capability is best spent — it's the difference between "be skeptical" and "check THIS specific thing."
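The three tier defaults above can be captured in one small config table. A minimal sketch — the class and field names here are illustrative, not the skill's actual schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TierConfig:
    parallel_solvers: int
    confirm_needed: int          # confirmations required to accept the proof
    refute_needed: int           # refutations required to reject it
    abstain_after: int           # consecutive failed revise cycles before abstaining
    presentation_drafts: int     # drafts fed to the comparator
    per_pattern_verifiers: bool  # one dedicated verifier per pattern (Opus tier only)

    @property
    def vote_budget(self) -> int:
        # By construction, exactly one of the two thresholds can be
        # reached within this budget (pigeonhole).
        return self.confirm_needed + self.refute_needed - 1

# Values taken directly from the tier sections above.
TIER_DEFAULTS = {
    "haiku":  TierConfig(12, 5, 3, 3, 3, False),
    "sonnet": TierConfig(6, 4, 2, 3, 2, False),
    "opus":   TierConfig(4, 4, 2, 5, 3, True),
}
```

Note that the vote budget is derived rather than stored, which keeps the pigeonhole invariant true by construction.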
On the pigeonhole exit
Kept at all tiers — not because of cost, but because the vote budget is exactly confirm_needed + refute_needed - 1, so by pigeonhole exactly one threshold can be reached. Once either threshold is hit, the remaining votes carry no information regardless of how they land. Launching them anyway is pure latency.
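The exit condition can be sketched as a sequential vote loop. This is a simplification — the real verifiers run in parallel and in-flight ones would be cancelled rather than never launched — and the function name is illustrative:

```python
def run_votes(verdicts, confirm_needed, refute_needed):
    """Tally verifier verdicts (True = confirm, False = refute),
    stopping as soon as either threshold is reached.
    Returns (accepted, votes_consumed)."""
    confirms = refutes = used = 0
    for verdict in verdicts:
        if confirms >= confirm_needed or refutes >= refute_needed:
            break  # pigeonhole exit: outcome already decided
        used += 1
        if verdict:
            confirms += 1
        else:
            refutes += 1
    return confirms >= confirm_needed, used
```

With the Sonnet budget (4-confirm / 2-refute, 5 votes), four straight confirms decide the outcome and the fifth vote is never consumed.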
Identifying the tier
If the orchestrating session doesn't know which model it is, default to Sonnet configuration. A reasonable heuristic: ask the model to self-identify in its first response and match against haiku/sonnet/opus/capybara in the output.
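That heuristic amounts to a substring match over the model's self-report, falling back to Sonnet. A sketch, with an assumed mapping of Capybara onto the Opus configuration per the heading above:

```python
def detect_tier(self_report: str) -> str:
    """Match a model's self-identification against known tier names.
    Unknown or ambiguous reports fall back to the Sonnet defaults."""
    text = self_report.lower()
    for tier in ("haiku", "sonnet", "opus", "capybara"):
        if tier in text:
            # Capybara uses the Opus-tier configuration.
            return "opus" if tier == "capybara" else tier
    return "sonnet"  # default when the session can't identify itself
```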