The A-rate benchmark: how we grade crypto LLMs
Our 45-question crypto benchmark is graded on an A–F scale by two reviewers plus a rubric matcher. Here's exactly how, and why it's harder to game than multiple-choice.
Most public LLM benchmarks are multiple-choice quizzes. MMLU, HellaSwag, ARC — they're useful for comparing raw capability, but they don't tell you whether a model can write a coherent paragraph explaining MEV to a non-expert. Ours does.
The setup
- 45 questions across 8 categories: DeFi mechanics, MEV/PBS/FCFS, tokenomics, L2 design, Solana-specific, smart-money / wallet clustering, regime detection, security.
- Each question has a rubric: a list of facts that must appear, common misconceptions that must not, and required caveats (e.g. "must mention that slashing is a real risk in LRTs").
- Grading scale: A (complete + correct), B (correct with minor gaps), C (partially correct), D (mostly wrong), F (wrong).
- Two human graders + one rubric matcher. Disagreement among the three triggers a third human reviewer.
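To make the rubric-matcher half of the pipeline concrete, here is a minimal sketch of how a rubric with required facts, banned misconceptions, and required caveats can map an answer to a letter grade. The `Rubric` class, the phrase-matching by substring, and the grade thresholds are illustrative assumptions, not our production matcher (which does fuzzier matching and feeds disagreements to the human reviewers).

```python
from dataclasses import dataclass

@dataclass
class Rubric:
    required_facts: list[str]     # phrases that must appear
    misconceptions: list[str]     # phrases that must NOT appear
    required_caveats: list[str]   # risk disclaimers that must appear

def grade(answer: str, rubric: Rubric) -> str:
    """Map an answer to A-F against a rubric (naive substring matching)."""
    text = answer.lower()
    # Stating a known misconception is an automatic F.
    if any(m.lower() in text for m in rubric.misconceptions):
        return "F"
    facts_hit = sum(f.lower() in text for f in rubric.required_facts)
    caveats_hit = sum(c.lower() in text for c in rubric.required_caveats)
    if facts_hit == len(rubric.required_facts):
        # All facts present: A if caveats are also complete, else B.
        return "A" if caveats_hit == len(rubric.required_caveats) else "B"
    if facts_hit >= len(rubric.required_facts) / 2:
        return "C"
    return "D"
```

For example, an answer that explains the mechanism but skips the required slashing caveat lands on a B, which is exactly the "correct with minor gaps" bucket above.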
Why A–F, not 0–1
Numeric-only grading forces reviewers to collapse nuance. A response that gets the mechanism right but omits the risk disclaimer is a B in our system; in a 0–1 numeric system it might score 0.87, which looks fine even though it is a meaningful miss. Letter grades capture the kind of failure, not just its size.
How we prevent gaming
- The eval set is small (45 questions) but the rubrics are strict. A model fine-tuned on our exact questions would also need to match our rubric style without being explicitly trained on it, which is hard to do without us noticing.
- Reviewers rotate. No reviewer sees the same question twice in a row.
- We publish the eval set — anyone can reproduce it.
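The reviewer-rotation rule can be implemented as a simple offset round-robin; this sketch (names and function are hypothetical, not our actual scheduler) shows one way to guarantee that the reviewer pair for any question changes between consecutive runs.

```python
def assign_reviewers(reviewers: list[str], question_ids: list[str],
                     run_index: int) -> dict[str, tuple[str, str]]:
    """Round-robin pairing offset by run index, so the pair grading a
    given question differs between consecutive runs."""
    n = len(reviewers)
    out = {}
    for i, q in enumerate(question_ids):
        first = (i + run_index) % n
        second = (i + run_index + 1) % n
        out[q] = (reviewers[first], reviewers[second])
    return out
```

With three or more reviewers, incrementing `run_index` each run shifts every pairing, so no one grades the same question twice in a row.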
Current scores
| Model | A-rate | A+B |
| --- | --- | --- |
| Sovereign v2 v6-SFT (14B) | 87% | 98% |
| GPT-4o (closed) | 68% | 91% |
| Claude 3.5 Haiku | 64% | 88% |
| Llama 3 70B | 59% | 82% |
| Qwen3-14B (no fine-tune) | 47% | 69% |
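The two columns above reduce to simple counts over the 45 per-question grades. A minimal sketch of that aggregation (the helper name is ours, for illustration):

```python
from collections import Counter

def a_rates(grades: list[str]) -> tuple[int, int]:
    """Return (A-rate, A+B rate) as rounded percentages."""
    counts = Counter(grades)
    n = len(grades)
    a = round(100 * counts["A"] / n)
    ab = round(100 * (counts["A"] + counts["B"]) / n)
    return a, ab
```

For instance, 39 A's, 5 B's, and 1 C out of 45 yields an 87% A-rate and a 98% A+B rate.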
Where we fail
Our model averages a C on smart-money / wallet clustering. Our training data underrepresents this category; we're addressing it with a curriculum-weighted v7.
Why publish this
Because if we don't, no one can argue with our numbers — and if no one can argue, the numbers are worthless. Run the benchmark yourself. Argue with the rubrics. File PRs to add questions.