What's the cheapest GPU for Stable Diffusion?

An RTX 4090 (24GB) on RunPod Community (~$0.34/hr) or Vast.ai (~$0.20-0.52/hr) is the best value for Stable Diffusion image generation. It's fast, has enough VRAM for SDXL and most workflows, and costs a fraction of a datacenter card.

Do I need an H100 to serve an LLM?

Rarely. For 7-13B models an L40S (48GB) or even an A6000 (48GB) handles inference fine at a fraction of H100 cost. Reserve an H100/H200 for very large models or memory-bandwidth-bound serving. Match VRAM to model size and batch size first.

L4 vs L40S for inference?

The L4 (24GB, 72W) is the cheapest and most power-efficient for steady serving of small/medium models. The L40S (48GB) is much faster and fits bigger models, at a higher hourly rate. Pick L4 for cost-per-token at volume, L40S for throughput and larger models.

Cheapest GPUs for Stable Diffusion and AI inference in 2026

The biggest mistake in inference budgets is renting a flagship GPU you don’t need. For Stable Diffusion and most model serving, a cheaper card delivers more throughput per dollar. Here’s what to rent.

Cheapest inference GPUs (June 2026)

GPU	VRAM	Cheapest /hr	Best for
RTX 4090	24GB	~$0.34 (RunPod Community)	Stable Diffusion, small-model serving
L4	24GB	~$0.33 (Vast) / $0.39 (RunPod)	High-volume, low-power inference
A6000	48GB	~$0.49 (RunPod)	Budget 48GB VRAM workhorse
RTX 6000 Ada	48GB	~$0.50 (Lambda)	Faster 48GB Ada card
L40S	48GB	~$0.86 (RunPod)	High-throughput inference, bigger models

Snapshot June 2026 — prices change weekly; verify on each provider’s pricing page. See the live cheapest GPUs for inference ranking.

For Stable Diffusion: the RTX 4090

The RTX 4090 (24GB) is the value champion for image generation — fast, enough VRAM for SDXL and most ComfyUI workflows, and dirt cheap on community/marketplace clouds (RunPod Community ~$0.34/hr, Vast.ai ~$0.20-0.52/hr). It’s a consumer card (no ECC, rarely on enterprise clouds), but per dollar nothing beats it for single-GPU generation.

For LLM serving: match VRAM to the model

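Small/medium models (up to ~13B), steady traffic: the L4 (24GB, 72W) gives the best cost-per-token. A 13B model at FP16 needs ~26GB for weights alone, so on a 24GB card that size requires 8-bit or 4-bit quantization.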
Bigger models or higher throughput: the L40S (48GB) or A6000 (48GB).
Only when you truly need it: an H100/H200 for very large models or memory-bandwidth-bound serving.
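
To put numbers on the cost-per-token comparison, here is a minimal Python sketch. It assumes you have measured sustained tokens per second on your own workload. The function names are illustrative, and the hourly rates are the June 2026 snapshot figures from the table above, not live quotes.

```python
def cost_per_million_tokens(usd_per_hr: float, tokens_per_sec: float) -> float:
    """USD to generate one million tokens at a sustained throughput."""
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    tokens_per_hr = tokens_per_sec * 3600
    return usd_per_hr / tokens_per_hr * 1_000_000


def breakeven_speedup(cheap_usd_per_hr: float, fast_usd_per_hr: float) -> float:
    """How many times faster the pricier card must be to match cost-per-token."""
    return fast_usd_per_hr / cheap_usd_per_hr


# L4 at ~$0.33/hr vs L40S at ~$0.86/hr (snapshot prices from the table).
speedup_needed = breakeven_speedup(0.33, 0.86)
print(f"L40S must sustain {speedup_needed:.1f}x the L4's tokens/sec to tie on cost")
```

At these rates the L40S has to deliver roughly 2.6x the L4's throughput before it wins on cost-per-token. Below that ratio, the L4 is cheaper per token as long as the model fits in 24GB.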

The rule: pick the cheapest GPU whose VRAM fits your model and batch size. Over-provisioning to an H100 for a 7B model wastes most of your budget.
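
The rule can be sketched in a few lines of Python. This is an illustration rather than a sizing tool: the catalog copies the June 2026 snapshot prices from the table above, and the VRAM estimate (parameters times bytes per parameter, plus a 20% allowance for KV cache and activations) is a rough assumption that you should replace with a measurement from your own workload and batch size.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Gpu:
    name: str
    vram_gb: int
    usd_per_hr: float  # snapshot price from the table, not a live quote


CATALOG = [
    Gpu("RTX 4090", 24, 0.34),
    Gpu("L4", 24, 0.33),
    Gpu("A6000", 48, 0.49),
    Gpu("RTX 6000 Ada", 48, 0.50),
    Gpu("L40S", 48, 0.86),
]


def estimate_vram_gb(params_billion: float,
                     bytes_per_param: float = 2.0,  # 2.0 = FP16, 1.0 = 8-bit
                     overhead: float = 1.2) -> float:
    """Rough VRAM need in GB; the 1.2 overhead factor is an assumption."""
    return params_billion * bytes_per_param * overhead


def cheapest_fit(required_gb: float, catalog=CATALOG) -> Optional[Gpu]:
    """Cheapest card whose VRAM covers the requirement, or None if none fits."""
    fits = [g for g in catalog if g.vram_gb >= required_gb]
    return min(fits, key=lambda g: g.usd_per_hr) if fits else None


for size in (7, 13, 70):
    need = estimate_vram_gb(size)
    pick = cheapest_fit(need)
    label = f"{pick.name} at ${pick.usd_per_hr:.2f}/hr" if pick else "no single card in this list"
    print(f"{size}B at FP16 (~{need:.0f}GB): {label}")
```

Under these assumptions a 7B model at FP16 lands on the L4, a 13B model at FP16 moves up to the A6000, and a 70B model at FP16 exceeds every single card in this list.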

Spot for batch, on-demand for live endpoints

For batch/offline inference, use spot or community instances to cut cost by roughly 50%. For latency-critical live endpoints, prefer on-demand so a reclaim doesn’t drop your service, or run a small on-demand baseline and add spot capacity for bursts.
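
A rough way to budget the baseline-plus-burst setup is sketched below. The function name, the 730 hours per month, and the example workload are illustrative assumptions. The 50% spot discount is the approximate figure quoted above, and real spot prices vary by provider and by hour.

```python
HOURS_PER_MONTH = 730  # 8760 hours per year / 12, an averaging assumption


def monthly_cost(on_demand_usd_per_hr: float,
                 baseline_gpus: int,
                 burst_gpu_hours: float,
                 spot_discount: float = 0.5) -> float:
    """Monthly USD for an always-on on-demand baseline plus spot burst hours."""
    if not 0 <= spot_discount < 1:
        raise ValueError("spot_discount must be in [0, 1)")
    baseline = on_demand_usd_per_hr * baseline_gpus * HOURS_PER_MONTH
    burst = on_demand_usd_per_hr * (1 - spot_discount) * burst_gpu_hours
    return baseline + burst


# Example: one always-on GPU at $0.86/hr plus 200 GPU-hours of spot burst,
# compared with serving the same burst hours at the on-demand rate.
mixed = monthly_cost(0.86, baseline_gpus=1, burst_gpu_hours=200)
all_on_demand = monthly_cost(0.86, baseline_gpus=1, burst_gpu_hours=200, spot_discount=0.0)
print(f"baseline + spot burst: ${mixed:.2f}/month")
print(f"all on-demand:         ${all_on_demand:.2f}/month")
```

The saving applies only to the burst hours, so the larger the share of traffic that can tolerate a reclaim, the closer the total gets to the full spot discount.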

Try it

Price your inference GPU and hours across every provider in the calculator, or browse all the inference cards.