“How much does it cost to train an LLM?” has no single answer: it ranges from pocket change to hundreds of millions of dollars. But the math is simple, and you can estimate your own number.
The formula
Cost = GPU-hours x price per GPU-hour.
GPU-hours come from compute:
- Total training FLOPs for a dense transformer ≈ 6 x parameters x training tokens (the “6ND” rule).
- GPU-hours = total FLOPs / (per-GPU FLOP/s x MFU x 3600), where MFU (model FLOPs utilization) is realistically 30-55%.
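An H100 delivers roughly 1,000 TFLOP/s of peak dense BF16/FP16 compute (about 989 TFLOP/s for the SXM version); the MFU factor in the formula discounts that peak to what a real training run achieves. At ~$2.30/hr per H100 (cheapest on Vultr), an 8xH100 node is ~$18/hr; see the cheapest 8xH100 node ranking.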
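As a rough sketch of the formula above, the Python below chains the 6ND rule into GPU-hours and then cost. The function names and the 3B-parameter, 100B-token job are hypothetical, chosen only for illustration. The defaults (1,000 TFLOP/s peak, 40% MFU, $2.30 per H100-hour) are this article's round figures, not measurements.

```python
def training_flops(params: float, tokens: float) -> float:
    """Total training FLOPs for a dense transformer via the 6ND rule."""
    return 6.0 * params * tokens


def gpu_hours(total_flops: float, peak_flops_per_s: float, mfu: float) -> float:
    """GPU-hours = total FLOPs / (per-GPU FLOP/s x MFU x 3600)."""
    if not 0.0 < mfu <= 1.0:
        raise ValueError("mfu must be in (0, 1]")
    return total_flops / (peak_flops_per_s * mfu * 3600.0)


def training_cost(
    params: float,
    tokens: float,
    peak_flops_per_s: float = 1e15,    # assumed: ~1,000 TFLOP/s peak per H100
    mfu: float = 0.40,                 # assumed: 40% model FLOPs utilization
    price_per_gpu_hour: float = 2.30,  # assumed: snapshot price, verify first
) -> tuple[float, float]:
    """Return (GPU-hours, cost in dollars) for one training run."""
    hours = gpu_hours(training_flops(params, tokens), peak_flops_per_s, mfu)
    return hours, hours * price_per_gpu_hour


# Hypothetical job: 3B parameters trained on 100B tokens.
hours, cost = training_cost(params=3e9, tokens=100e9)
print(f"{hours:,.0f} GPU-hours, ${cost:,.0f}")  # 1,250 GPU-hours, $2,875
```

Because cost scales linearly with parameters, tokens and price, and inversely with MFU, changing any single input rescales the result by the same factor.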
Worked examples (neocloud H100 prices)
| Job | Rough GPU-hours | Est. cost @ ~$2.30/H100-hr |
|---|---|---|
| LoRA fine-tune, 7B model, 1 GPU, 4 h | 4 | ~$10 |
| Full fine-tune, 13B, 8 GPUs, 6 h | 48 | ~$110 |
| Pretrain 1B model on 25B tokens | ~105 | ~$240 |
| Pretrain 7B model on 1T tokens | ~29,000 | ~$67,000 |
| Frontier model (proxy) | tens of millions | tens of millions of dollars+ |
These assume ~40% MFU and exclude data prep, failed runs, storage and egress. Snapshot June 2026 — verify current prices before budgeting. Plug your own numbers into the cost calculator.
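For jobs specified by GPU count and wall-clock time, such as the two fine-tune rows, the arithmetic is just a product. A minimal sketch, assuming the $2.30 per H100-hour snapshot price; `job_cost` is a hypothetical helper name:

```python
PRICE_PER_H100_HOUR = 2.30  # snapshot price from the text; verify before budgeting


def job_cost(num_gpus: int, wall_clock_hours: float,
             price: float = PRICE_PER_H100_HOUR) -> tuple[float, float]:
    """Return (GPU-hours, cost in dollars) for a job of known size and duration."""
    total_gpu_hours = num_gpus * wall_clock_hours
    return total_gpu_hours, total_gpu_hours * price


for name, gpus, hours in [("LoRA fine-tune, 7B", 1, 4), ("Full fine-tune, 13B", 8, 6)]:
    total_gpu_hours, cost = job_cost(gpus, hours)
    print(f"{name}: {total_gpu_hours:g} GPU-hours, ${cost:.2f}")
```

Like the table, this leaves out data prep, failed runs, storage and egress.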
What actually moves the bill
- GPU-hours dominate. Halving them (better data quality, higher MFU, a smaller model that’s “good enough”) halves the bill, which is usually far more than you save by switching between similarly priced providers.
- MFU is free money. Going from 30% to 50% utilization cuts cost by ~40% with no extra hardware.
- Spot for restartable phases. Checkpointed pretraining on spot or community-cloud instances can cut the bill by 30-65%.
- Right-size the GPU. Fine-tunes rarely need an H100 — an A100 or even an inference card is often plenty.
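The MFU and spot figures above follow from simple proportionality: for fixed FLOPs, cost scales as 1/MFU, and a discount scales the price directly. A small sketch; the function names and the $10,000 baseline are hypothetical, and the cost of redoing work after spot interruptions is not modeled:

```python
def saving_from_mfu(old_mfu: float, new_mfu: float) -> float:
    """Fractional cost saving when MFU rises, since cost is proportional to 1/MFU."""
    return 1.0 - old_mfu / new_mfu


def discounted_cost(cost: float, discount: float) -> float:
    """Cost after a fractional price discount (e.g. 0.30 for 30% off)."""
    return cost * (1.0 - discount)


# Raising MFU from 30% to 50% on the same hardware.
print(f"MFU saving: {saving_from_mfu(0.30, 0.50):.0%}")  # MFU saving: 40%

# Hypothetical $10,000 on-demand bill at both ends of the 30-65% spot range.
baseline = 10_000.0
for discount in (0.30, 0.65):
    print(f"{discount:.0%} off: ${discounted_cost(baseline, discount):,.0f}")
```

The two effects multiply: a run that gains both the MFU improvement and a 30% spot discount pays 0.6 x 0.7 = 42% of the original bill.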
The takeaway
Pick a cheap provider (the neoclouds are 2-4x cheaper than hyperscalers), then spend your energy on efficiency and GPU-hours — that’s where the real money is.