# Open vs closed models on crypto Q&A: what the numbers actually say
Sovereign v2 (14B) beats GPT-4o and Claude on our 45-question crypto benchmark — but the interesting stories are in *where* and *why*.
Closed models like GPT-4o and Claude have larger parameter counts, more training compute, and access to better instruction data than any individual team. So why does a fine-tuned 14B model beat them on crypto Q&A?
Because they're not specialized, and crypto knowledge in their training data is diluted by everything else.
## Headline numbers
| Model | A-rate | Notes |
| --- | --- | --- |
| Sovereign v2 (14B SFT) | 87% | Fine-tuned on a crypto-specific corpus |
| GPT-4o | 68% | General-purpose frontier model |
| Claude 3.5 Haiku | 64% | General-purpose, optimized for speed |
| Llama 3 70B | 59% | Larger open model, not crypto-fine-tuned |
| Qwen3-14B (our base, no fine-tune) | 47% | Proves the fine-tuning did the work |
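"A-rate" here is the share of answers graded "A" under our rubric (the post later mentions a C grade, so grading is letter-based). A minimal sketch of how such a rate could be computed, assuming a list of per-question letter grades (the grade distribution below is illustrative):

```python
from collections import Counter

def a_rate(grades: list[str]) -> float:
    """Fraction of questions whose answer was graded 'A' (letter-grade rubric assumed)."""
    if not grades:
        return 0.0
    counts = Counter(g.upper() for g in grades)
    return counts["A"] / len(grades)

# Illustrative 45-question run where 39 answers earned an A.
grades = ["A"] * 39 + ["B"] * 4 + ["C"] * 2
print(f"A-rate: {a_rate(grades):.0%}")  # → A-rate: 87%
```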
## Where Sovereign wins
- MEV / PBS / FCFS mechanics — GPT-4o often confuses Solana's priority-fee auction with Ethereum's mev-boost. We explicitly trained on this distinction.
- Tokenomics details — vesting schedules, emission curves, airdrop mechanics. Closed models hedge; the specialist answers directly.
- L2 / rollup design — optimistic vs zk trade-offs, data availability. Closed models give textbook answers; we add failure modes.
- Solana-specific topics — validator economics, stake weights, restaking flavors. Closed models are weaker here because Solana makes up a smaller share of their training data.
## Where closed models win
- Smart-money / wallet clustering — requires more math and multi-step reasoning; we currently grade a C here. GPT-4o does better because it is stronger at general reasoning chains.
- Novel topics that emerged after our training cutoff — we know nothing about events after April 2026. GPT-4o's knowledge cutoff is similarly fixed, but its training data was richer when the cutoff hit.
- Rarely-discussed protocols — if a protocol launched last week and isn't in our fine-tune data, a 14B model has less "general intuition" to fall back on than a 70B.
## What this means
Specialized fine-tuning of a modest model can beat a frontier generalist on the specific task you care about. This isn't new — it's been true since BERT was state of the art. But it's underused in crypto-native applications, which still default to calling GPT-4o with a "you are an expert in crypto" prompt.
It also means no single model wins everything. For the best of both worlds, use a specialized model for domain questions and a frontier generalist for reasoning chains, and route between them.
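The routing idea can be sketched as a keyword-based dispatcher. The keyword list and model names below are illustrative, not the post's actual implementation:

```python
# Hypothetical domain vocabulary — questions touching these go to the specialist.
DOMAIN_KEYWORDS = {"mev", "tokenomics", "vesting", "rollup", "validator",
                   "airdrop", "restaking", "solana", "staking"}

def route(question: str) -> str:
    """Send domain-specific questions to the specialist model,
    everything else (e.g. multi-step reasoning) to the generalist."""
    words = set(question.lower().split())
    if words & DOMAIN_KEYWORDS:
        return "sovereign-v2"   # specialist (model id assumed)
    return "gpt-4o"             # frontier generalist

print(route("Explain Solana validator economics"))  # → sovereign-v2
print(route("Cluster these wallets by behavior"))   # → gpt-4o
```

A production router would more likely use an embedding classifier or a cheap LLM call; keyword matching is only the simplest possible sketch of the idea.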
## How to reproduce
- Clone https://github.com/sovereignai/eval
- Run the benchmark against our endpoint:

  ```shell
  python run_benchmark.py --model sovereign-v2 --api-base https://api.sovereignai.xyz
  ```

  The same script works against any OpenAI-compatible endpoint (including your own Ollama server).
- Compare your numbers with the table above.
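Since the endpoint is OpenAI-compatible, a single question can also be sent by hand. A minimal stdlib-only sketch, assuming the standard `/v1/chat/completions` route under the `--api-base` URL from the steps above (the exact path is an assumption, not confirmed by the post):

```python
import json
import urllib.request

def build_payload(question: str, model: str = "sovereign-v2") -> dict:
    """Standard OpenAI-style chat-completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
    }

def ask(question: str, api_base: str = "https://api.sovereignai.xyz") -> str:
    """POST one question to an OpenAI-compatible endpoint and return the answer text."""
    req = urllib.request.Request(
        f"{api_base}/v1/chat/completions",  # assumed OpenAI-compatible route
        data=json.dumps(build_payload(question)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Point `api_base` at `http://localhost:11434` to run the same code against a local Ollama server (with `model` set to whatever model you have pulled).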
If our numbers are wrong, argue with them in public. That's the point of a reproducible benchmark.