Sovereign AI
Public benchmark · live data

Reproducible. Skeptical. Public.

A continuously updated crypto-domain benchmark, graded A–F by a cross-model LLM judge panel. Same questions, same grader, every model. The pipeline re-runs every 6 hours and this page redeploys.

Headline results

| Model | Tag | Size | A-rate | A+B | N | Last run |
|---|---|---|---|---|---|---|
| Sovereign v2 (v6-canary) | sovereign-v2:v6-canary | 14B | 86.7% | 97.8% | 45 | 2026-04-16 |
| Sovereign v2 (canary) | sovereign-v2:canary | 14B | 82.5% | 95.0% | 40 | 2026-04-16 |
| Sovereign v2 (dpo-candidate) | sovereign-v2:dpo-candidate | 14B | 82.2% | 93.3% | 45 | 2026-04-16 |
| Sovereign v2 (current production) | sovereign-v2:latest | 14B | 78.2% | 93.6% | 78 | 2026-05-10 |
| Sovereign v2 (v6-dpo) | sovereign-v2:v6-dpo | 14B | 77.8% | 97.8% | 45 | 2026-04-16 |
| Sovereign v2 (v10-sft) | sovereign-v2:v10-sft | 14B | 72.0% | 94.0% | 100 | 2026-04-20 |

Generated 2026-05-10. The Results/ directory contains every run since April 2026. See the research posts for methodology and failure analysis.
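The two rate columns can be recomputed from a run's raw grades. The sketch below assumes that A-rate is the share of responses graded A and that A+B is the share graded A or B; the page does not spell out either definition, and the `summarize` helper and the grade counts are illustrative, not published data.

```python
from collections import Counter


def summarize(grades):
    """Return (a_rate, ab_rate, n) for a list of letter grades 'A'..'F'."""
    n = len(grades)
    if n == 0:
        raise ValueError("no grades to summarize")
    counts = Counter(grades)
    a_rate = counts["A"] / n                      # assumed definition of A-rate
    ab_rate = (counts["A"] + counts["B"]) / n     # assumed definition of A+B
    return a_rate, ab_rate, n


# Hypothetical run: 39 A, 5 B, 1 C (illustrative counts only)
grades = ["A"] * 39 + ["B"] * 5 + ["C"]
a_rate, ab_rate, n = summarize(grades)
print(f"A-rate {a_rate:.1%}  A+B {ab_rate:.1%}  N {n}")
# prints: A-rate 86.7%  A+B 97.8%  N 45
```

With N this small, a single regraded answer moves either rate by roughly two percentage points, which is worth keeping in mind when comparing rows.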

Top model — Sovereign v2 (v6-canary)

Strongest categories (score = fraction of A/B grades on that slice):

| Category | Grade | Score |
|---|---|---|
| tech-analysis | A | 100.0% |
| error-handling | A | 100.0% |
| agent-architecture | A | 100.0% |
| growth-hacking | A | 100.0% |
| resource-scheduling | A | 100.0% |
| market-micro | A | 100.0% |
| analytics-metrics | A | 100.0% |
| community-management | A | 100.0% |
| brand-strategy | A | 100.0% |
| quant-trading | A | 100.0% |

Weak spots we're training against

Lowest-scoring categories. The MAB (multi-armed bandit) curriculum sends more prompts here until the A-rate catches up.

| Category | Grade | Score |
|---|---|---|
| security | C | 57.1% |
| wallet-cluster | B | 73.2% |
| regime-detect | B | 75.0% |
| prompt-engineering | B | 75.0% |
| predict-market | B | 75.0% |
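The idea of sending more prompts to weak categories can be illustrated with a small sampler. The page does not say which bandit algorithm the curriculum uses, so the sketch below is not that algorithm: it is a plain weakness-proportional sampler, assuming MAB stands for multi-armed bandit. The `sampling_weights` function and its `floor` parameter are hypothetical; the scores are taken from the tables on this page.

```python
import random


def sampling_weights(scores, floor=0.05):
    """Map category -> sampling probability, proportional to (1 - score).

    `floor` keeps already-strong categories from dropping to zero,
    so regressions there can still be noticed.
    """
    raw = {cat: max(1.0 - score, floor) for cat, score in scores.items()}
    total = sum(raw.values())
    return {cat: weight / total for cat, weight in raw.items()}


# Scores from the category tables (fractions, not percentages)
scores = {
    "security": 0.571,
    "wallet-cluster": 0.732,
    "regime-detect": 0.750,
    "prompt-engineering": 0.750,
    "predict-market": 0.750,
    "quant-trading": 1.000,
}

weights = sampling_weights(scores)
rng = random.Random(0)  # seeded so the draw is repeatable
batch = rng.choices(list(weights), weights=list(weights.values()), k=10)
print(weights)
print(batch)
```

A real bandit would also update its estimates after each graded response and balance exploration against exploitation; this sketch only shows the allocation step.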

Methodology

The benchmark is a fixed set of crypto-domain questions spanning DeFi mechanics, Solana-specific tooling, MEV, tokenomics, risk management, quant trading, and smart-money / wallet-clustering patterns. Each question has a rubric of required keywords and expected technical concepts.

Every model's responses are graded A (accurate + thorough), B (mostly accurate, some gaps), C (vague or partial), D (wrong), or F (incoherent). Grading combines automated rubric keyword matching with a cross-model LLM judge on a separate GPU (currently gemma2-9b-instruct on an RTX 2070, different architecture from the Sovereign v2 Qwen3 base to break correlated errors).
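As a rough illustration of the automated half of grading, the sketch below computes rubric keyword coverage for one response. How the pipeline actually merges that signal with the judge's grade is not described here, so the `combine` rule, its `min_coverage` threshold, and the sample rubric are all hypothetical.

```python
GRADE_ORDER = "ABCDF"  # best to worst


def keyword_coverage(response, required_keywords):
    """Fraction of rubric keywords found in the response (case-insensitive substring match)."""
    if not required_keywords:
        return 1.0
    text = response.lower()
    hits = sum(1 for kw in required_keywords if kw.lower() in text)
    return hits / len(required_keywords)


def combine(judge_grade, coverage, min_coverage=0.5):
    """Hypothetical rule: demote the judge's grade one step when coverage is low."""
    idx = GRADE_ORDER.index(judge_grade)
    if coverage < min_coverage:
        idx = min(idx + 1, len(GRADE_ORDER) - 1)
    return GRADE_ORDER[idx]


# Illustrative response and rubric (not taken from the benchmark)
response = ("A sandwich attack is a form of MEV where a searcher "
            "front-runs and back-runs a swap.")
rubric = ["MEV", "sandwich", "slippage", "front-run"]

coverage = keyword_coverage(response, rubric)
print(coverage)                  # 0.75
print(combine("A", coverage))    # A
print(combine("A", 0.25))        # B
```

Substring matching is deliberately crude: it rewards mentioning a term, not using it correctly, which is one reason a separate judge model is useful alongside it.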

Baseline models (Qwen 2.5 Coder 7B, Gemma 2 9B) run the same benchmark on the same hardware every 6 hours. Running baselines alongside Sovereign v2 lets us separate "base model got smarter" from "our fine-tune got better" — the delta between sovereign-v2 and its base model is what training is actually worth.
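The fine-tune-versus-base delta is a difference of two proportions, so it helps to look at its sampling noise before reading much into it. The sketch below uses a normal approximation for the standard error; the `rate_delta` helper and the counts passed to it are hypothetical and do not come from any published run.

```python
import math


def rate_delta(a_tuned, n_tuned, a_base, n_base):
    """Return (delta, se): tuned A-rate minus base A-rate, and its approximate standard error."""
    p_tuned = a_tuned / n_tuned
    p_base = a_base / n_base
    # Normal approximation to the difference of two independent binomial proportions
    se = math.sqrt(p_tuned * (1 - p_tuned) / n_tuned
                   + p_base * (1 - p_base) / n_base)
    return p_tuned - p_base, se


# Hypothetical counts: 39/45 A grades for the fine-tune, 27/45 for its base
delta, se = rate_delta(a_tuned=39, n_tuned=45, a_base=27, n_base=45)
print(f"delta {delta:+.1%}, roughly ±{1.96 * se:.1%} at 95%")
```

The approximation is weakest when a rate is near 0% or 100% or when N is small, so it should be treated as a sanity check on whether a delta is plausibly real, not as a formal test.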

Reproduce

git clone https://github.com/sovereignai/eval
cd eval
pip install -r requirements.txt

# Run against our API
python run_benchmark.py --model sovereign-v2 --api-base https://api.sovereignai.rip

# Or any Ollama-compatible endpoint
python run_benchmark.py --model qwen2.5-coder:7b --host localhost --port 11434