Public benchmark · live data

Reproducible. Skeptical. Public.

A continuously updated crypto-domain benchmark, graded A–F by a cross-model LLM judge. Same questions, same grader, every model. The pipeline re-runs every 6 hours and this page redeploys.

Headline results

Model	Tag	Size	A-rate	A+B	N	Last run
Sovereign v2 (v6-canary)	sovereign-v2:v6-canary	14B	86.7%	97.8%	45	2026-04-16
Sovereign v2 (canary)	sovereign-v2:canary	14B	82.5%	95.0%	40	2026-04-16
Sovereign v2 (dpo-candidate)	sovereign-v2:dpo-candidate	14B	82.2%	93.3%	45	2026-04-16
Sovereign v2 (current production)	sovereign-v2:latest	14B	78.2%	93.6%	78	2026-05-10
Sovereign v2 (v6-dpo)	sovereign-v2:v6-dpo	14B	77.8%	97.8%	45	2026-04-16
Sovereign v2 (v10-sft)	sovereign-v2:v10-sft	14B	72.0%	94.0%	100	2026-04-20

Generated 2026-05-10. The Results/ directory contains every run since April 2026. See the research posts for methodology and failure analysis.

Top model — Sovereign v2 (v6-canary)

Strongest categories (score = fraction of A/B grades on that slice):

Category	Grade	Score
tech-analysis	A	100.0%
error-handling	A	100.0%
agent-architecture	A	100.0%
growth-hacking	A	100.0%
resource-scheduling	A	100.0%
market-micro	A	100.0%
analytics-metrics	A	100.0%
community-management	A	100.0%
brand-strategy	A	100.0%
quant-trading	A	100.0%
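
As a rough illustration of the per-slice score defined above (fraction of A/B grades in a category), here is a minimal sketch. The function name `slice_scores`, the input shape, and the sample grades are assumptions made for this example; they are not taken from the eval repository or from real benchmark output.

```python
from collections import defaultdict

def slice_scores(results):
    """Aggregate (category, grade) pairs into per-category rates.

    a_rate = share of A grades; score = share of A or B grades.
    """
    counts = defaultdict(lambda: {"n": 0, "a": 0, "ab": 0})
    for category, grade in results:
        c = counts[category]
        c["n"] += 1
        c["a"] += grade == "A"          # bool adds as 0 or 1
        c["ab"] += grade in ("A", "B")
    return {
        cat: {"n": c["n"], "a_rate": c["a"] / c["n"], "score": c["ab"] / c["n"]}
        for cat, c in counts.items()
    }

# Hypothetical graded answers, not real benchmark output.
results = [
    ("security", "A"), ("security", "B"), ("security", "C"), ("security", "D"),
    ("quant-trading", "A"), ("quant-trading", "A"),
]
scores = slice_scores(results)
print(scores["security"])
```

The same two ratios, taken over all questions instead of one category, would give the A-rate and A+B columns in the headline table.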

Weak spots we're training against

Lowest-scoring categories — the MAB curriculum sends more prompts here until A-rate catches up.

Category	Grade	Score
security	C	57.1%
wallet-cluster	B	73.2%
regime-detect	B	75.0%
prompt-engineering	B	75.0%
predict-market	B	75.0%
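
The page does not spell out what "MAB" means or how the curriculum works. Assuming it refers to a multi-armed bandit, one common way to route more prompts to weak categories is Thompson sampling over per-category A-rates. The sketch below is that generic technique, not the project's actual scheduler; `pick_category` and the counts are hypothetical.

```python
import random
from collections import Counter

def pick_category(stats, rng):
    """Thompson-style pick of the next category to train on.

    For each category, draw a plausible A-rate from a Beta posterior
    (uniform prior) and return the category with the lowest draw.
    """
    best, best_draw = None, 2.0  # any Beta draw is below 2.0
    for name, (a_count, total) in stats.items():
        draw = rng.betavariate(a_count + 1, total - a_count + 1)
        if draw < best_draw:
            best, best_draw = name, draw
    return best

# Hypothetical (A grades, questions asked) per category.
stats = {"security": (4, 7), "regime-detect": (6, 8), "quant-trading": (9, 9)}

rng = random.Random(0)  # seeded so the run is repeatable
picks = Counter(pick_category(stats, rng) for _ in range(2000))
print(picks.most_common())
```

Because the draws are random, a strong category still gets an occasional prompt, which keeps its estimate fresh. As a weak category's A-rate rises, its draws rise too and it is picked less often.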

Methodology

The benchmark is a fixed set of crypto-domain questions spanning DeFi mechanics, Solana-specific tooling, MEV, tokenomics, risk management, quant trading, and smart-money / wallet-clustering patterns. Each question has a rubric of required keywords and expected technical concepts.
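
To make the rubric idea concrete, here is a minimal sketch of a keyword check: it reports which required keywords appear in a response and what fraction was covered. The matching rule (case-insensitive, whole word or phrase), the name `keyword_coverage`, and the sample rubric are assumptions for illustration; the real rubric format may differ.

```python
import re

def keyword_coverage(response, required_keywords):
    """Return (fraction of required keywords found, list of found keywords).

    Matching is case-insensitive and anchored on word boundaries, so
    "mempool" matches "mempool." but not "mempools".
    """
    if not required_keywords:
        raise ValueError("rubric needs at least one keyword")
    text = response.lower()
    hits = [
        kw for kw in required_keywords
        if re.search(r"\b" + re.escape(kw.lower()) + r"\b", text)
    ]
    return len(hits) / len(required_keywords), hits

# Hypothetical rubric and answer.
rubric = ["sandwich attack", "mempool", "slippage"]
answer = "A sandwich attack front-runs a swap seen in the mempool."
coverage, hits = keyword_coverage(answer, rubric)
print(hits, round(coverage, 2))
```

Keyword matching alone cannot tell a correct explanation from a keyword-stuffed one, which is presumably why it is paired with an LLM judge.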

Each response is graded A (accurate and thorough), B (mostly accurate, some gaps), C (vague or partial), D (wrong), or F (incoherent). Grading combines automated rubric keyword matching with a cross-model LLM judge running on a separate GPU. The judge is currently gemma2-9b-instruct on an RTX 2070. It uses a different architecture from the Qwen3 base of Sovereign v2, which reduces correlated errors between the model under test and the judge.

Baseline models (Qwen 2.5 Coder 7B, Gemma 2 9B) run the same benchmark on the same hardware every 6 hours. Running baselines alongside Sovereign v2 lets us separate "base model got smarter" from "our fine-tune got better" — the delta between sovereign-v2 and its base model is what training is actually worth.
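
The delta argument is simple arithmetic, sketched below with made-up numbers. The A-rates and the helper `training_delta` are hypothetical; no baseline results are quoted on this page.

```python
def training_delta(finetune_rate, base_rate):
    """Gap between fine-tune and base A-rates, in percentage points."""
    return round((finetune_rate - base_rate) * 100, 1)

# Hypothetical A-rates for two consecutive runs on the same hardware.
runs = [
    {"finetune": 0.70, "base": 0.40},
    {"finetune": 0.75, "base": 0.50},
]
deltas = [training_delta(r["finetune"], r["base"]) for r in runs]
print(deltas)
```

In this example the fine-tune's A-rate rose by 5 points between runs, but the base rose by 10, so the share attributable to training shrank from 30 to 25 points. Looking at the fine-tune's score alone would have hidden that.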

Reproduce

git clone https://github.com/sovereignai/eval
cd eval
pip install -r requirements.txt

# Run against our API
python run_benchmark.py --model sovereign-v2 --api-base https://api.sovereignai.rip

# Or any Ollama-compatible endpoint
python run_benchmark.py --model qwen2.5-coder:7b --host localhost --port 11434