GPU VRAM Requirement Calculator for LLMs

Before you rent GPU time or buy a new card, estimate how much VRAM your model needs for inference and fine-tuning.

Model Weights = Parameters × Bytes per Parameter
KV Cache ≈ (Model Size × Factor) / 4096 × Seq Len × Batch
Inference VRAM = Weights + KV Cache + 10% Overhead
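Example: a 7B model (for example Llama 2 7B) at FP16 needs ~14GB for weights and ~0.5GB for the KV cache at a short context; adding 10% overhead gives ~16GB total. A single RTX 4090 (24GB) can run it. At INT4 the weights shrink to ~3.5GB, so an RTX 3080 (10GB) is sufficient.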
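The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch, not the calculator's actual code: the function names are hypothetical, 1GB is taken as 10^9 bytes, the KV cache size is passed in directly, and the 10% overhead is assumed to apply to weights plus KV cache.

```python
def weights_gb(params: float, bytes_per_param: float) -> float:
    """Weight memory in GB, taking 1 GB = 1e9 bytes."""
    return params * bytes_per_param / 1e9


def inference_vram_gb(params: float, bytes_per_param: float,
                      kv_cache_gb: float, overhead: float = 0.10) -> float:
    """Weights + KV cache, plus a fractional overhead on the sum (assumption)."""
    return (weights_gb(params, bytes_per_param) + kv_cache_gb) * (1 + overhead)


# 7B parameters at FP16 (2 bytes each) with a 0.5 GB KV cache
fp16_total = inference_vram_gb(7e9, 2.0, kv_cache_gb=0.5)
# The same model quantized to INT4 (0.5 bytes per parameter), weights only
int4_weights = weights_gb(7e9, 0.5)

print(f"FP16 total:   {fp16_total:.2f} GB")
print(f"INT4 weights: {int4_weights:.2f} GB")
```

The FP16 case comes out just under 16GB, which matches the rounded figure above; the INT4 weights come to 3.5GB before any quantization metadata is counted.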

Why does LLM inference require so much VRAM?

LLMs keep billions of parameters in GPU memory. Each parameter (a weight or bias) is stored as a number, typically 2 bytes (FP16) unless the model is quantized to a lower precision. A 7B parameter model therefore needs ~14GB just to load the weights at FP16, plus additional memory for the KV cache during generation.

What's the difference between loading a model and actual inference?

Loading a model requires memory equal to the model size in the chosen precision. During inference, additional VRAM is needed for the KV cache (key-value pairs from attention layers), which scales with sequence length and batch size. This is why longer contexts require significantly more VRAM.
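To see how the KV cache grows with context length, the sketch below uses the standard per-token accounting for a transformer: one key and one value vector per layer, per KV head. It is an illustration rather than the calculator's own formula. The function name is hypothetical, and the configuration values (32 layers, 32 KV heads, head dimension 128, FP16 cache) are assumptions chosen to resemble a 7B model without grouped-query attention.

```python
GIB = 1024 ** 3


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int,
                   bytes_per_value: int = 2) -> int:
    """KV cache size in bytes.

    The leading factor of 2 counts the key and the value tensor
    stored for every token in every layer.
    """
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Assumed 7B-style configuration: 32 layers, 32 KV heads, head_dim 128
for seq_len in (1024, 4096, 32768):
    size = kv_cache_bytes(32, 32, 128, seq_len, batch_size=1)
    print(f"{seq_len:>6} tokens: {size / GIB:.1f} GiB")
```

Under these assumptions a 1,024-token context costs about 0.5 GiB, and the cost grows linearly from there, so a 32K context needs 32 times as much. Models that use grouped-query attention have fewer KV heads and a correspondingly smaller cache.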

What precision should I use for inference?

FP16 (16-bit): the standard baseline, 2 bytes per parameter. FP32 (32-bit): higher precision at 2x the VRAM of FP16, rarely needed for inference. INT8: about 50% less VRAM than FP16 with slight quality loss. INT4: about 75% less VRAM than FP16. GPTQ/GGUF quantized models: can reduce VRAM by 70-90% with minimal quality impact on 7B-13B models. For consumer GPUs, 4-bit quantization is often the practical choice for larger models.
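The savings for the plain numeric precisions follow directly from bytes per parameter. The sketch below tabulates weight memory for a 7B model relative to FP16; the names are hypothetical, and it assumes idealized sizes (real quantized files also store scales and other metadata, so they come out somewhat larger).

```python
# Idealized storage cost per parameter for each precision
BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}


def weight_table(params: float) -> dict:
    """Map precision -> (weight size in GB, fractional saving vs FP16)."""
    fp16_bytes = params * BYTES_PER_PARAM["FP16"]
    table = {}
    for name, bytes_per_param in BYTES_PER_PARAM.items():
        size = params * bytes_per_param
        table[name] = (size / 1e9, 1 - size / fp16_bytes)
    return table


for name, (gb, saving) in weight_table(7e9).items():
    print(f"{name}: {gb:5.1f} GB  ({saving:+.0%} vs FP16)")
```

A negative saving means the precision costs more than FP16, which is the case for FP32 at twice the size.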

How does batch size affect VRAM requirements?

Batch size multiplies the KV cache memory requirement, while the weights are loaded only once. Batch size 1 is most memory-efficient for single queries. Batch size 4-8 is typical for serving multiple users simultaneously. Larger batches improve throughput but require proportionally more KV cache memory. For consumer GPUs, batch size 1-2 is often the limit without quantization.
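Because only the KV cache scales with batch size, the largest batch that fits on a card can be estimated by subtracting the weights from the usable VRAM and dividing by the per-sequence cache size. The sketch below is a rough estimate under stated assumptions: the function name is hypothetical, the 10% overhead is applied to weights plus KV cache, and the per-sequence KV cache figure is supplied by the caller.

```python
def max_batch_size(vram_gb: float, weights_gb: float,
                   kv_gb_per_sequence: float, overhead: float = 0.10) -> int:
    """Largest batch whose (weights + KV cache) * (1 + overhead) fits in VRAM."""
    # Memory left for the KV cache once overhead and weights are accounted for
    usable = vram_gb / (1 + overhead) - weights_gb
    if usable <= 0:
        return 0  # the weights alone do not fit
    return int(usable // kv_gb_per_sequence)


# 24 GB card, 14 GB of FP16 weights
short_ctx = max_batch_size(24, 14, kv_gb_per_sequence=0.5)
long_ctx = max_batch_size(24, 14, kv_gb_per_sequence=2.0)
print(short_ctx, long_ctx)
```

With these inputs the estimate drops sharply as the per-sequence cache grows, which is why long contexts and large batches rarely fit together on a single consumer card.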

What VRAM do I need for fine-tuning?

Fine-tuning requires significantly more VRAM than inference because you need to store the model weights, the gradients (computed in the backward pass), the optimizer states (Adam keeps 2 extra values per parameter), and the activations saved for the backward pass. Full fine-tuning of a 7B model typically needs 40-80GB VRAM. Parameter-efficient methods like LoRA reduce this to 8-24GB.
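The components listed above can be added up per parameter. The sketch below is a simplified estimate, not a measurement: the function name and default byte sizes are assumptions, activations are excluded because they depend on batch size and sequence length, and mixed-precision setups that keep an extra FP32 copy of the weights would need more than shown here.

```python
def full_finetune_gb(params: float, weight_bytes: float = 2,
                     grad_bytes: float = 2, optimizer_state_bytes: float = 2,
                     n_optimizer_states: int = 2) -> float:
    """Weights + gradients + optimizer states in GB (activations excluded)."""
    per_param = (weight_bytes + grad_bytes
                 + n_optimizer_states * optimizer_state_bytes)
    return params * per_param / 1e9


# Everything in 16-bit, Adam with two states per parameter (assumption)
all_16bit = full_finetune_gb(7e9)
# Same setup, but Adam states kept in FP32 (4 bytes each)
fp32_states = full_finetune_gb(7e9, optimizer_state_bytes=4)

print(f"16-bit states: {all_16bit:.0f} GB")
print(f"FP32 states:   {fp32_states:.0f} GB")
```

Under the all-16-bit assumption a 7B model lands at 56GB before activations, inside the range quoted above; keeping the Adam states in FP32 pushes the total higher.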