Sovereign AI
2026-04-21 · 2 min read

Training a 14B LLM on a Tesla V100 in 2026

How we fine-tune Qwen3-14B on a single V100 16GB without renting H100s. Unsloth, LoRA, and the unglamorous bits nobody writes about.

Most "how we trained X" posts skip the part where you spend four hours debugging a torch.cuda.OutOfMemoryError that turns out to be caused by max_length=1024. This post doesn't.

Why the V100

In 2026, a V100 16GB feels quaint next to H100s and B200s. But for a 14B SFT, it's more than enough — if you treat memory as a first-class concern.

  • Cost to own: we already had the hardware.
  • Cost per step: essentially zero (electricity).
  • Cost per model version: tens of dollars in power versus thousands in H100 rental.
  • Constraint it imposes: no massive batch sizes, no full-parameter training. LoRA or bust.
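The "LoRA or bust" constraint falls out of simple arithmetic. A back-of-envelope sketch — the per-parameter byte counts are rough conventions (fp16 weights/grads, fp32 Adam moments, ~0.5 byte/param for 4-bit), not measurements from our runs:

```python
GB = 1024**3

def full_finetune_vram_gb(n_params: float) -> float:
    # fp16 weights + fp16 grads + fp32 Adam m and v:
    # 2 + 2 + 4 + 4 = 12 bytes per parameter, before activations.
    return n_params * 12 / GB

def lora_vram_gb(n_params: float, trainable_fraction: float = 0.005) -> float:
    # 4-bit base weights (~0.5 byte/param), full optimizer state
    # only for the small LoRA adapter.
    base = n_params * 0.5 / GB
    adapter = n_params * trainable_fraction * 12 / GB
    return base + adapter

n = 14e9
print(f"full fine-tune: ~{full_finetune_vram_gb(n):.0f} GB")  # way past 16 GB
print(f"4-bit + LoRA:   ~{lora_vram_gb(n):.0f} GB")
```

Even ignoring activations, full-parameter training of a 14B model wants an order of magnitude more VRAM than the V100 has; a quantized base plus a small adapter fits.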

Stack

  • Base model: Qwen3-14B
  • Framework: Unsloth (faster than raw Hugging Face, more reliable than Axolotl in our hands)
  • Method: LoRA SFT, rank 32, learning rate 1e-5
  • Sequence length: 2048 for SFT, 512 for DPO (1024 reliably crashes the V100 with CUDA illegal memory access — see DPO postmortem)
  • Dataset size: 2,500 records, curriculum-weighted to push weak categories
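The stack above translates into roughly the following Unsloth setup. This is a sketch, not our exact script: the hub model id, lora_alpha, and the gradient-checkpointing flag are assumptions; rank 32, learning rate 1e-5, and the 2048 sequence length come from the list above.

```python
from unsloth import FastLanguageModel

# Sketch of the SFT setup; model id and lora_alpha are assumptions.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-14B",   # assumed hub id; 4-bit loading is
    load_in_4bit=True,                # what makes 14B fit in 16 GB at all
    max_seq_length=2048,              # per the stack list above
)
model = FastLanguageModel.get_peft_model(
    model,
    r=32,                             # LoRA rank from the stack list
    lora_alpha=32,                    # assumption: alpha = rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",  # trades compute for memory
)
# Training then runs through trl's SFTTrainer with learning_rate=1e-5.
```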

The unglamorous parts

Memory fragmentation

DPO in particular would OOM on the second epoch despite having headroom on the first. Fix: PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. Should be the default; isn't.
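The ordering matters: the CUDA caching allocator reads this variable when torch initializes, so exporting it in the shell (or setting it at the very top of the script, before importing torch) is the only placement that works. A minimal sketch:

```python
import os

# Must be set before torch is imported; the allocator reads it once at
# CUDA initialization, so setting it later in the process has no effect.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

# import torch  # only after the env var is in place
```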

NUL bytes in training data

At one point, a scraping pipeline silently wrote \x00 bytes into our SFT JSONL. The trainer accepted them, tokenized them, and produced gibberish weights. We added a prechecker; if your training code trains through NUL bytes without warning you, add one too.
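A prechecker in this spirit is a few lines. This is a generic sketch, not our pipeline's code: it scans the raw bytes of each JSONL line for NUL bytes and rejects lines that don't parse as JSON.

```python
import json

def precheck_jsonl(path: str) -> list[str]:
    """Return a list of problems found in a JSONL training file."""
    problems = []
    with open(path, "rb") as f:
        for lineno, raw in enumerate(f, start=1):
            if b"\x00" in raw:
                # Catch NUL bytes before they reach the tokenizer.
                problems.append(f"line {lineno}: NUL byte")
                continue
            try:
                json.loads(raw)
            except json.JSONDecodeError as e:
                problems.append(f"line {lineno}: invalid JSON ({e.msg})")
    return problems
```

Run it as a gate: refuse to start training if the returned list is non-empty.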

venv is broken

We learned the hard way that source venv/bin/activate plus python plus CUDA_VISIBLE_DEVICES plus systemd user services equals "does this use GPU or CPU?" We now invoke the Python binary by absolute path, set CUDA_VISIBLE_DEVICES explicitly in the service environment, and never trust activation.
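The same discipline in launcher form: pass the interpreter by absolute path and hand the child an explicit environment, so nothing depends on what a shell happened to source. Here sys.executable stands in for the venv's python; under systemd you would hard-code /path/to/venv/bin/python in the unit.

```python
import subprocess
import sys

# Explicit environment: the child sees exactly what we put here,
# no inherited surprises from an activated shell.
env = {
    "CUDA_VISIBLE_DEVICES": "0",
    "PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True",
}
out = subprocess.run(
    [sys.executable, "-c",
     "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"],
    env=env,
    capture_output=True,
    text=True,
    check=True,
)
print(out.stdout.strip())  # the child sees exactly the GPU we chose
```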

The merge step

LoRA adapters need to be merged back into the base weights before GGUF export. This is another GPU-heavy step — the merge eats more memory than the training itself in some cases. Same expandable_segments flag required.
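The memory cost makes sense once you see what the merge does: per adapted matrix it computes W' = W + (alpha / r) * B @ A, which means materializing a full-size delta for every weight while the base weights are also resident. A toy illustration on 2x2 matrices with rank 1 (pure Python so the arithmetic is visible; real merges run over thousands of large tensors):

```python
def matmul(A, B):
    # Naive matrix multiply for the toy example.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight
B = [[1.0], [2.0]]             # LoRA "up" factor, shape (2, r=1)
A = [[0.5, 0.5]]               # LoRA "down" factor, shape (r=1, 2)
alpha, r = 32, 32              # scaling from our config

delta = matmul(B, A)           # full-size (2, 2) delta is materialized here
W_merged = [[w + (alpha / r) * d for w, d in zip(wr, dr)]
            for wr, dr in zip(W, delta)]
print(W_merged)  # → [[1.5, 0.5], [1.0, 2.0]]
```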

Deployment

  • Export merged weights to GGUF Q6_K (~12 GB)
  • Import into Ollama with OLLAMA_HOST=http://localhost:1234
  • Tag with version number; only promote to :latest after full benchmark
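The "only promote after a full benchmark" rule can be made mechanical. A hypothetical gate — the category names, scores, and regression threshold here are invented for illustration; plug in your own suite:

```python
def should_promote(scores: dict[str, float],
                   baseline: dict[str, float],
                   max_regression: float = 0.02) -> bool:
    """Promote only if no benchmark category regresses past the threshold."""
    return all(scores[cat] >= base - max_regression
               for cat, base in baseline.items())

baseline = {"reasoning": 0.71, "extraction": 0.83}
assert should_promote({"reasoning": 0.73, "extraction": 0.82}, baseline)
assert not should_promote({"reasoning": 0.65, "extraction": 0.84}, baseline)
```

Only when the gate passes does the new tag get copied to :latest (with Ollama, something like `ollama cp model:v12 model:latest`).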

Numbers

| Metric | Value |
| --- | --- |
| Train wall-clock (SFT) | ~6 hours |
| DPO (when it worked) | ~10 minutes |
| GGUF export | ~8 minutes |
| Benchmark run | ~12 minutes |
| Total cycle | ~7 hours end to end |

Takeaway

A single previous-generation GPU like the V100 can still produce competitive domain-specialized models in 2026. The bottleneck isn't the hardware; it's data quality, evaluation rigor, and the boring plumbing work nobody posts about.

We'll write up the DPO postmortem separately. Short version: DPO can regress a strong SFT model if your preference pairs aren't diverse enough.