The biggest mistake in inference budgets is renting a flagship GPU you don’t need. For Stable Diffusion and most model serving, a cheaper card delivers more throughput per dollar, even though a flagship is faster in absolute terms. Here’s what to rent.
Cheapest inference GPUs (June 2026)
| GPU | VRAM | Cheapest price per hour | Best for |
|---|---|---|---|
| RTX 4090 | 24GB | ~$0.34 (RunPod Community) | Stable Diffusion, small-model serving |
| L4 | 24GB | ~$0.33 (Vast) / $0.39 (RunPod) | High-volume, low-power inference |
| A6000 | 48GB | ~$0.49 (RunPod) | Budget 48GB VRAM workhorse |
| RTX 6000 Ada | 48GB | ~$0.50 (Lambda) | Faster 48GB Ada card |
| L40S | 48GB | ~$0.86 (RunPod) | High-throughput inference, bigger models |
Snapshot from June 2026. Prices change weekly, so verify on each provider’s pricing page. See the live cheapest GPUs for inference ranking.
For Stable Diffusion: the RTX 4090
The RTX 4090 (24GB) is the value champion for image generation: it is fast, has enough VRAM for SDXL and most ComfyUI workflows, and is dirt cheap on community and marketplace clouds (RunPod Community ~$0.34/hr, Vast.ai ~$0.20-0.52/hr depending on the listing). It’s a consumer card (no ECC, rarely on enterprise clouds), but per dollar nothing beats it for single-GPU generation.
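To make "per dollar" concrete, here is a minimal sketch that converts an hourly price and a measured generation time into cost per 1,000 images. The function name and both timing figures are hypothetical placeholders, not benchmarks; substitute numbers measured on your own workflow.

```python
def cost_per_1000_images(usd_per_hour: float, seconds_per_image: float) -> float:
    """Cost in USD to generate 1,000 images on one GPU at full utilization."""
    if seconds_per_image <= 0:
        raise ValueError("seconds_per_image must be positive")
    images_per_hour = 3600.0 / seconds_per_image
    return usd_per_hour / images_per_hour * 1000.0


# Hourly price from the snapshot table; the 4.0 s/image timing is an ASSUMED placeholder.
budget_card = cost_per_1000_images(usd_per_hour=0.34, seconds_per_image=4.0)

# A made-up flagship: twice as fast, but at an ASSUMED $2.50/hr.
flagship = cost_per_1000_images(usd_per_hour=2.50, seconds_per_image=2.0)

print(f"budget card: ${budget_card:.2f} per 1,000 images")
print(f"flagship:    ${flagship:.2f} per 1,000 images")
```

Under these placeholder inputs the faster card still costs more per image, because its price rises faster than its speed. The comparison only holds for your workload if your own timings show the same pattern.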
For LLM serving: match VRAM to the model
- Small/medium models (up to ~13B), steady traffic: the L4 (24GB, 72W) gives the best cost-per-token.
- Bigger models or higher throughput: the L40S (48GB) or A6000 (48GB).
- Only when you truly need it: an H100/H200 for very large models or memory-bandwidth-bound serving.
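The rule: pick the cheapest GPU whose VRAM fits your model and batch size. A 7B model in 16-bit precision needs roughly 14GB for its weights, so it fits on a 24GB card; over-provisioning to an 80GB H100 for it means paying for memory and bandwidth you will not use.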
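The rule can be sketched as a small lookup. This is a rough illustration under stated assumptions, not a sizing tool: the prices are copied from the June 2026 snapshot table, and `estimate_vram_gb`, its 2 bytes per parameter default (16-bit weights), and its 1.2 overhead factor for KV cache and activations are hypothetical simplifications. Real memory use depends on batch size, context length, and the serving stack.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Gpu:
    name: str
    vram_gb: int
    usd_per_hour: float


# Cheapest listed prices from the June 2026 snapshot table; verify before relying on them.
GPUS = [
    Gpu("RTX 4090", 24, 0.34),
    Gpu("L4", 24, 0.33),
    Gpu("A6000", 48, 0.49),
    Gpu("RTX 6000 Ada", 48, 0.50),
    Gpu("L40S", 48, 0.86),
]


def estimate_vram_gb(
    params_billion: float,
    bytes_per_param: float = 2.0,   # ASSUMPTION: 16-bit weights
    overhead_factor: float = 1.2,   # ASSUMPTION: flat 20% margin for KV cache and activations
) -> float:
    """Very rough VRAM estimate: weight size times a flat overhead margin."""
    return params_billion * bytes_per_param * overhead_factor


def cheapest_fit(required_gb: float, gpus: Sequence[Gpu] = GPUS) -> Optional[Gpu]:
    """Return the cheapest GPU with enough VRAM, or None if nothing fits."""
    candidates = [g for g in gpus if g.vram_gb >= required_gb]
    return min(candidates, key=lambda g: g.usd_per_hour) if candidates else None


for label, need in [
    ("7B, 16-bit", estimate_vram_gb(7)),
    ("13B, 16-bit", estimate_vram_gb(13)),
    ("13B, 8-bit", estimate_vram_gb(13, bytes_per_param=1.0)),
]:
    pick = cheapest_fit(need)
    print(f"{label}: needs ~{need:.1f} GB -> {pick.name if pick else 'no single card in this table'}")
```

Under this estimate a 13B model only fits a 24GB card when quantized to 8-bit or lower; at 16-bit it lands on a 48GB card. Price is the only tiebreaker here, so extend the selection if throughput or power draw matters for your traffic.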
Spot for batch, on-demand for live endpoints
For batch or offline inference, use spot or community instances to cut costs by roughly 50%. For latency-critical live endpoints, prefer on-demand so that a reclaimed instance doesn’t take your service down, or run a small on-demand baseline and add spot capacity for bursts.
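A quick way to price the baseline-plus-burst setup is to bill baseline hours at the on-demand rate and burst hours at the discounted spot rate. The sketch below assumes the roughly 50% spot discount mentioned above; the function name, the $1.00/hr rate, and the hour counts are hypothetical inputs chosen for easy arithmetic.

```python
def blended_cost(
    baseline_gpu_hours: float,
    burst_gpu_hours: float,
    on_demand_rate: float,
    spot_discount: float = 0.5,  # ASSUMPTION: spot is ~50% cheaper than on-demand
) -> float:
    """Total cost of an on-demand baseline plus spot capacity for bursts."""
    if not 0.0 <= spot_discount < 1.0:
        raise ValueError("spot_discount must be in [0, 1)")
    spot_rate = on_demand_rate * (1.0 - spot_discount)
    return baseline_gpu_hours * on_demand_rate + burst_gpu_hours * spot_rate


# Hypothetical month: one always-on GPU (720 h) plus 300 GPU-hours of burst traffic.
mixed = blended_cost(baseline_gpu_hours=720, burst_gpu_hours=300, on_demand_rate=1.00)
all_on_demand = blended_cost(baseline_gpu_hours=1020, burst_gpu_hours=0, on_demand_rate=1.00)

print(f"baseline + spot burst: ${mixed:.2f}")          # $870.00
print(f"all on-demand:         ${all_on_demand:.2f}")  # $1020.00
```

The saving scales with the share of hours that can tolerate a reclaim, which is why fully offline batch jobs benefit the most.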
Try it
Price your inference GPU and hours across every provider in the calculator, or browse all the inference cards.