Reproducible. Skeptical. Public.
A continuously updated crypto-domain benchmark, graded A–F by a cross-model LLM judge panel. Same questions, same grader, every model. The pipeline re-runs every 6 hours and this page redeploys.
Headline results
| Model | Size | A-rate | A+B | N | Last run |
|---|---|---|---|---|---|
| Sovereign v2 (v6-canary) sovereign-v2:v6-canary | 14B | 86.7% | 97.8% | 45 | 2026-04-16 |
| Sovereign v2 (canary) sovereign-v2:canary | 14B | 82.5% | 95.0% | 40 | 2026-04-16 |
| Sovereign v2 (dpo-candidate) sovereign-v2:dpo-candidate | 14B | 82.2% | 93.3% | 45 | 2026-04-16 |
| Sovereign v2 (current production) sovereign-v2:latest | 14B | 78.2% | 93.6% | 78 | 2026-05-10 |
| Sovereign v2 (v6-dpo) sovereign-v2:v6-dpo | 14B | 77.8% | 97.8% | 45 | 2026-04-16 |
| Sovereign v2 (v10-sft) sovereign-v2:v10-sft | 14B | 72.0% | 94.0% | 100 | 2026-04-20 |
Generated 2026-05-10. Results/ directory has every run since April 2026. See the research posts for methodology and failure analysis.
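For readers checking the table by hand, here is a minimal sketch of how the A-rate and A+B columns follow from a list of letter grades. The function name `headline_rates` is hypothetical and not taken from the eval repository. The example grade list is only one split consistent with the v6-canary row (39 A and 44 A-or-B out of 45); the grade of the remaining response is an assumption.

```python
from collections import Counter


def headline_rates(grades):
    """Return (A-rate %, A+B %, N), with rates rounded to one decimal place."""
    n = len(grades)
    if n == 0:
        raise ValueError("no grades to summarize")
    counts = Counter(grades)
    a_rate = 100 * counts["A"] / n
    ab_rate = 100 * (counts["A"] + counts["B"]) / n
    return round(a_rate, 1), round(ab_rate, 1), n


# Hypothetical grade list matching the v6-canary row: 39 A, 5 B, and one
# lower grade (assumed to be a C here; the page does not say which).
grades = ["A"] * 39 + ["B"] * 5 + ["C"]
print(headline_rates(grades))  # (86.7, 97.8, 45)
```

With N between 40 and 100, a single response moves the A-rate by 1.0 to 2.5 percentage points, which is worth keeping in mind when comparing adjacent rows.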
Top model — Sovereign v2 (v6-canary)
Strongest categories (score = fraction of A/B grades on that slice):
| Category | Grade | Score |
|---|---|---|
| tech-analysis | A | 100.0% |
| error-handling | A | 100.0% |
| agent-architecture | A | 100.0% |
| growth-hacking | A | 100.0% |
| resource-scheduling | A | 100.0% |
| market-micro | A | 100.0% |
| analytics-metrics | A | 100.0% |
| community-management | A | 100.0% |
| brand-strategy | A | 100.0% |
| quant-trading | A | 100.0% |
Weak spots we're training against
These are the lowest-scoring categories. The MAB curriculum sends more prompts here until the A-rate catches up.
| Category | Grade | Score |
|---|---|---|
| security | C | 57.1% |
| wallet-cluster | B | 73.2% |
| regime-detect | B | 75.0% |
| prompt-engineering | B | 75.0% |
| predict-market | B | 75.0% |
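The page does not describe the curriculum's actual policy, so the following is only a sketch of the general idea of sending more prompts to weaker categories. It uses shortfall-proportional weighting (weight proportional to 1 minus score), which is a simplified stand-in and not a real bandit algorithm. The name `curriculum_weights`, the `floor` parameter, and the batch size are all assumptions.

```python
import random


def curriculum_weights(scores, floor=0.01):
    """Map each category to a sampling probability proportional to (1 - score).

    `floor` keeps categories that already score 100% from dropping to zero
    probability, so they are still sampled occasionally.
    """
    if not scores:
        raise ValueError("no categories given")
    shortfalls = {cat: max(1.0 - score, floor) for cat, score in scores.items()}
    total = sum(shortfalls.values())
    return {cat: gap / total for cat, gap in shortfalls.items()}


# Scores taken from the weak-spots table above, as fractions.
weak_spots = {
    "security": 0.571,
    "wallet-cluster": 0.732,
    "regime-detect": 0.750,
    "prompt-engineering": 0.750,
    "predict-market": 0.750,
}
weights = curriculum_weights(weak_spots)

# Draw a hypothetical batch of 8 training prompts' categories.
rng = random.Random(0)
batch = rng.choices(list(weights), weights=list(weights.values()), k=8)
print(batch)
```

Under this weighting, security receives the largest share of prompts, and the three categories tied at 75.0% receive equal shares.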
Methodology
The benchmark is a fixed set of crypto-domain questions spanning DeFi mechanics, Solana-specific tooling, MEV, tokenomics, risk management, quant trading, and smart-money / wallet-clustering patterns. Each question has a rubric of required keywords and expected technical concepts.
Every model's responses are graded A (accurate + thorough), B (mostly accurate, some gaps), C (vague or partial), D (wrong), or F (incoherent). Grading combines automated rubric keyword matching with a cross-model LLM judge running on a separate GPU (currently gemma2-9b-instruct on an RTX 2070). The judge deliberately uses a different architecture from the Qwen3 base of Sovereign v2 to break correlated errors.
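As an illustration of the rubric half of grading, the sketch below computes what fraction of a question's required keywords appear in a response. The function name, the whole-word matching rule, and the sample question are assumptions, and the page does not say how this signal is combined with the LLM judge's grade, so that step is left out.

```python
import re


def keyword_coverage(response, required_keywords):
    """Return (fraction of keywords found, list of matched keywords).

    Matching is case-insensitive and requires whole words or phrases,
    so "MEV" does not match inside a longer token.
    """
    if not required_keywords:
        raise ValueError("rubric has no required keywords")
    text = response.lower()
    hits = [
        kw for kw in required_keywords
        if re.search(r"(?<!\w)" + re.escape(kw.lower()) + r"(?!\w)", text)
    ]
    return len(hits) / len(required_keywords), hits


# Hypothetical rubric and response for an MEV question.
rubric = ["sandwich attack", "slippage", "MEV", "private mempool"]
response = (
    "A sandwich attack front-runs and back-runs a swap; setting tight "
    "slippage limits the MEV a searcher can extract."
)
coverage, hits = keyword_coverage(response, rubric)
print(coverage, hits)  # 0.75 ['sandwich attack', 'slippage', 'MEV']
```

Keyword matching alone cannot tell a correct explanation from a wrong one that uses the right vocabulary, which is presumably why the judge model is layered on top.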
Baseline models (Qwen 2.5 Coder 7B, Gemma 2 9B) run the same benchmark on the same hardware every 6 hours. Running baselines alongside Sovereign v2 lets us separate "base model got smarter" from "our fine-tune got better" — the delta between sovereign-v2 and its base model is what training is actually worth.
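The delta described above can be sketched as a difference in percentage points between two grade lists from the same run. The helper names are hypothetical, and the grade lists are small placeholders chosen for easy arithmetic, not real results for any model.

```python
def rate(grades, wanted):
    """Percentage of grades that fall in the `wanted` set."""
    if not grades:
        raise ValueError("no grades to summarize")
    return 100.0 * sum(g in wanted for g in grades) / len(grades)


def finetune_delta(tuned_grades, base_grades):
    """Fine-tune minus baseline, in percentage points, for A-rate and A+B."""
    return {
        "a_rate_pp": rate(tuned_grades, {"A"}) - rate(base_grades, {"A"}),
        "ab_rate_pp": rate(tuned_grades, {"A", "B"}) - rate(base_grades, {"A", "B"}),
    }


# Placeholder grades on the same five questions (illustrative only).
tuned = ["A", "A", "A", "B", "C"]
base = ["A", "B", "B", "C", "D"]
delta = finetune_delta(tuned, base)
print(delta)  # {'a_rate_pp': 40.0, 'ab_rate_pp': 20.0}
```

Because both lists come from the same questions, grader, and hardware, a change in the delta between runs points at the fine-tune and not at the question set or the judge.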
Reproduce
```bash
git clone https://github.com/sovereignai/eval
cd eval
pip install -r requirements.txt

# Run against our API
python run_benchmark.py --model sovereign-v2 --api-base https://api.sovereignai.rip

# Or any Ollama-compatible endpoint
python run_benchmark.py --model qwen2.5-coder:7b --host localhost --port 11434
```