Training a 14B LLM on a Tesla V100 in 2026
How we fine-tune Qwen3-14B on a single V100 16GB without renting H100s. Unsloth, LoRA, and the unglamorous bits nobody writes about.
Most "how we trained X" posts skip the part where you spent four hours debugging a torch.cuda.OutOfMemoryError that turned out to be caused by max_length=1024. This post doesn't.
Why the V100
In 2026, a V100 16GB feels quaint next to H100s and B200s. But for a 14B SFT, it's more than enough — if you treat memory as a first-class concern.
- Cost to own: we already had the hardware.
- Cost per step: essentially zero (electricity).
- Cost per model version: tens of dollars in power versus thousands in H100 rental.
- Constraint it imposes: no massive batch sizes, no full-parameter training. LoRA or bust.
Stack
- Base model: Qwen3-14B
- Framework: Unsloth (faster than raw Hugging Face, more reliable than Axolotl in our hands)
- Method: LoRA SFT, rank 32, learning rate 1e-5
- Sequence length: 2048 for SFT, 512 for DPO (1024 reliably crashes the V100 with CUDA illegal memory access — see DPO postmortem)
- Dataset size: 2,500 records, curriculum-weighted to push weak categories
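To make "LoRA or bust" concrete, here is a back-of-envelope sketch of why rank-32 adapters fit where full-parameter training can't. The layer dimensions and projection count below are illustrative placeholders, not Qwen3-14B's actual config.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for one LoRA-adapted weight: A (d_in x r) plus B (r x d_out)."""
    return rank * (d_in + d_out)

# Hypothetical transformer: hidden size 5120, 40 layers, 4 adapted projections per layer.
hidden, layers, proj_per_layer = 5120, 40, 4
trainable = layers * proj_per_layer * lora_params(hidden, hidden, 32)
full = 14_000_000_000  # full-parameter training touches all 14B weights

print(f"LoRA trainable params: {trainable:,}")
print(f"Fraction of full model: {trainable / full:.4%}")
```

Under these made-up dimensions the adapters come to roughly 52M trainable parameters, a fraction of a percent of the model, which is why optimizer state and gradients stay inside 16 GB.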
The unglamorous parts
Memory fragmentation
DPO in particular would OOM on the second epoch despite having headroom on the first. Fix: PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. Should be the default; isn't.
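In script form, the flag has to land in the environment before torch initializes its caching allocator, so it goes at the very top of the entry point. A minimal sketch:

```python
import os

# Must be set before torch is imported; once the CUDA caching allocator
# initializes, changing this variable has no effect on the running process.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

# import torch  # only after the allocator config is in place
```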
NUL bytes in training data
At one point, a scraping pipeline silently wrote \x00 bytes into our SFT JSONL. The trainer accepted them, tokenized them, and produced gibberish weights. We added a prechecker; if your training code trains through NUL bytes without warning you, add one too.
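A prechecker for this is only a few lines. This sketch (file path and record shape are assumptions, not our pipeline's) rejects any JSONL line containing a NUL before it reaches the tokenizer:

```python
import json

def check_jsonl_line(line: str, lineno: int) -> dict:
    """Reject NUL bytes, then parse; the error names the offending line."""
    if "\x00" in line:
        raise ValueError(f"line {lineno}: NUL byte in training data")
    return json.loads(line)

def load_clean_jsonl(path: str) -> list[dict]:
    with open(path, encoding="utf-8") as f:
        return [check_jsonl_line(line, i) for i, line in enumerate(f, 1) if line.strip()]
```

Failing loudly with a line number beats training through garbage: the broken record gets fixed at the source instead of becoming gibberish weights six hours later.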
venv is broken
We learned the hard way that source venv/bin/activate plus python plus CUDA_VISIBLE_DEVICES plus systemd user services adds up to "is this using the GPU or the CPU?" We now invoke the Python binary by absolute path, pass CUDA_VISIBLE_DEVICES as an explicit environment variable, and never rely on activation.
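In unit-file terms that looks like the fragment below; every path here is a hypothetical example, not our actual layout:

```ini
[Service]
# Absolute interpreter path: no activate script, no PATH lookup, no ambiguity.
ExecStart=/opt/train/venv/bin/python /opt/train/sft.py
Environment=CUDA_VISIBLE_DEVICES=0
Environment=PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```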
The merge step
LoRA adapters need to be merged back into the base weights before GGUF export. This is another GPU-heavy step — the merge eats more memory than the training itself in some cases. Same expandable_segments flag required.
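A quick back-of-envelope shows why the merge is heavy: it needs the full base weight set materialized, and 14B parameters in fp16 alone exceed the card (assuming no sharding or offload):

```python
params = 14e9          # parameter count of the base model
fp16_bytes = 2         # bytes per parameter at half precision
merged_gib = params * fp16_bytes / 1024**3
vram_gib = 16

print(f"Merged fp16 weights: {merged_gib:.1f} GiB vs {vram_gib} GiB VRAM")
# The full tensor set never fits on the V100 at once, hence the need for
# offloading and the expandable_segments allocator flag during the merge.
```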
Deployment
- Export merged weights to GGUF Q6_K (~12 GB)
- Import into Ollama with OLLAMA_HOST=http://localhost:1234
- Tag with version number; only promote to :latest after full benchmark
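The import step can be driven by a minimal Modelfile; the GGUF filename and tag below are illustrative, not our real naming:

```
# Modelfile: point Ollama at the exported quantized weights
FROM ./qwen3-14b-sft-q6_k.gguf
```

Then something like ollama create mymodel:v3 -f Modelfile builds the versioned tag, and the :latest tag moves only after the benchmark run passes.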
Numbers
| Metric | Value |
| --- | --- |
| Train wall-clock (SFT) | ~6 hours |
| DPO (when it worked) | ~10 minutes |
| GGUF export | ~8 minutes |
| Benchmark run | ~12 minutes |
| Total cycle | ~7 hours end to end |
Takeaway
A single aging datacenter GPU (a V100) can still produce competitive domain-specialized models in 2026. The bottleneck isn't the hardware; it's data quality, evaluation rigor, and the boring plumbing work nobody posts about.
We'll write up the DPO postmortem separately. Short version: DPO can regress a strong SFT model if your preference pairs aren't diverse enough.