GPU VRAM Requirement Calculator for LLMs
Before you rent GPU time or buy a new card, estimate how much VRAM your model needs for inference and fine-tuning.
Why does LLM inference require so much VRAM?
LLMs store billions of parameters in memory. Each parameter (a learned weight) is stored as a floating-point number, typically 2 bytes in FP16 for unquantized inference. A 7B-parameter model therefore needs ~14GB (7 billion parameters times 2 bytes) just to load the weights, plus additional memory for the KV cache during generation.
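The weight arithmetic above can be sketched in a few lines of Python. This is a rough estimate only: the function name is hypothetical, and it assumes decimal gigabytes (1 GB = 1e9 bytes) to match the ~14GB figure quoted above.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float = 2.0) -> float:
    """Estimate memory for the model weights alone.

    Assumes decimal gigabytes (1 GB = 1e9 bytes) and ignores the KV cache,
    activations, and framework overhead.
    """
    return num_params * bytes_per_param / 1e9


# 7B parameters at 2 bytes each (FP16)
print(f"{weight_memory_gb(7e9):.1f} GB")
```

Treat the result as a lower bound: real usage is higher once generation starts.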
What's the difference between loading a model and actual inference?
Loading a model requires memory equal to the parameter count times the bytes per parameter in the chosen precision. During inference, additional VRAM is needed for the KV cache (the key and value tensors that each attention layer stores for every token processed so far), which grows linearly with both sequence length and batch size. This is why longer contexts require significantly more VRAM.
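A minimal sketch of the KV cache estimate, assuming a standard decoder-only transformer in which every layer caches one key and one value vector per token and per KV head. The function name and the example configuration (32 layers, 32 KV heads, head dimension 128) are illustrative assumptions, not values taken from this page.

```python
def kv_cache_gb(num_layers: int, num_kv_heads: int, head_dim: int,
                seq_len: int, batch_size: int = 1,
                bytes_per_value: float = 2.0) -> float:
    """Estimate KV cache size in decimal gigabytes (1 GB = 1e9 bytes)."""
    # Factor 2: each layer stores one key tensor and one value tensor.
    num_values = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return num_values * bytes_per_value / 1e9


# Assumed 7B-class configuration with a 4096-token context in FP16.
short_ctx = kv_cache_gb(32, 32, 128, seq_len=4096)
long_ctx = kv_cache_gb(32, 32, 128, seq_len=8192)
print(f"4096 tokens: {short_ctx:.2f} GB, 8192 tokens: {long_ctx:.2f} GB")
```

Doubling the sequence length doubles the estimate. Models that use grouped-query attention have fewer KV heads than attention heads, which shrinks the cache accordingly.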
What precision should I use for inference?
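FP16 (16-bit): Standard quality, 2 bytes per parameter; the usual baseline. FP32 (32-bit): Higher precision, 2x the VRAM of FP16, and rarely needed for inference. INT8 quantization: about 50% less VRAM than FP16 with slight quality loss. 4-bit quantization (commonly distributed in GPTQ or GGUF formats): about 70-75% less VRAM than FP16, usually with minor quality impact on 7B-13B models; more aggressive 2-3 bit variants save more but degrade quality noticeably. For consumer GPUs, 4-bit quantization is often the practical choice for larger models.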
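To compare precisions side by side, the sketch below multiplies a parameter count by nominal bytes per parameter. The table and function names are hypothetical, and the 4-bit figure assumes exactly 0.5 bytes per parameter; real quantized formats store extra scale data, so actual files run somewhat larger.

```python
# Nominal storage cost per parameter; quantization metadata is ignored.
BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}


def weights_by_precision(num_params: float) -> dict:
    """Return estimated weight memory in decimal GB for each precision."""
    return {name: num_params * size / 1e9 for name, size in BYTES_PER_PARAM.items()}


for name, gb in weights_by_precision(7e9).items():
    print(f"{name}: {gb:.1f} GB")
```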
How does batch size affect VRAM requirements?
Batch size multiplies the KV cache memory requirement; the model weights are loaded once and do not grow with batch size. Batch size 1 is the most memory-efficient for single queries. Batch sizes of 4-8 are typical for serving multiple users simultaneously. Larger batches improve throughput, but KV cache and activation memory grow in proportion. On consumer GPUs, batch size 1-2 is often the limit without quantization.
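One way to turn this into a planning number is to ask how many sequences fit in the VRAM left after the weights are loaded. The sketch below is an approximation under stated assumptions: the function name is hypothetical, the 1 GB overhead allowance (runtime context, activations, fragmentation) is a guess, and the example inputs are assumed values rather than measurements.

```python
def max_batch_size(vram_gb: float, weights_gb: float,
                   kv_gb_per_sequence: float, overhead_gb: float = 1.0) -> int:
    """Estimate how many sequences fit at once, given per-sequence KV cache size."""
    if kv_gb_per_sequence <= 0:
        raise ValueError("kv_gb_per_sequence must be positive")
    free_gb = vram_gb - weights_gb - overhead_gb
    if free_gb <= 0:
        return 0  # the weights alone do not fit
    return int(free_gb // kv_gb_per_sequence)


# Assumed: 14 GB of FP16 weights and about 2.15 GB of KV cache per sequence.
print(max_batch_size(vram_gb=24, weights_gb=14, kv_gb_per_sequence=2.15))
print(max_batch_size(vram_gb=16, weights_gb=14, kv_gb_per_sequence=2.15))
```

Under these assumptions a 24 GB card holds a handful of full-length sequences, while a 16 GB card cannot hold even one alongside FP16 weights, which is where quantization becomes necessary.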
How much VRAM do I need for fine-tuning?
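Fine-tuning requires significantly more VRAM than inference because you need to store model weights, gradients (from the backward pass), optimizer states (Adam keeps 2 extra states per parameter), and activations saved for the backward pass. Full fine-tuning of a 7B model typically needs 40-80GB of VRAM even with memory-saving setups such as 16-bit or 8-bit optimizer states; with FP32 Adam states it can exceed that. Parameter-efficient methods like LoRA reduce this to 8-24GB, with the low end generally requiring a quantized base model (as in QLoRA).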
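The components listed above can be added up per parameter. The sketch below is a simplified estimate: the function name is hypothetical, activations are deliberately left out, and the byte counts (2-byte weights and gradients, 8 bytes for two FP32 Adam states) and the 20 million trainable LoRA parameters are assumptions chosen for illustration.

```python
def finetune_state_gb(total_params: float, trainable_params: float = None,
                      weight_bytes: float = 2.0, grad_bytes: float = 2.0,
                      optimizer_bytes: float = 8.0) -> float:
    """Estimate weights + gradients + optimizer states in decimal GB.

    Activations are NOT included; they depend on batch size and sequence length.
    """
    if trainable_params is None:
        trainable_params = total_params  # full fine-tuning
    weights = total_params * weight_bytes
    # Gradients and optimizer states exist only for trainable parameters.
    training_state = trainable_params * (grad_bytes + optimizer_bytes)
    return (weights + training_state) / 1e9


full_fp32_adam = finetune_state_gb(7e9)
full_16bit_adam = finetune_state_gb(7e9, optimizer_bytes=4.0)
lora = finetune_state_gb(7e9, trainable_params=20e6)
print(f"full, FP32 Adam states:   {full_fp32_adam:.1f} GB")
print(f"full, 16-bit Adam states: {full_16bit_adam:.1f} GB")
print(f"LoRA, FP16 base model:    {lora:.1f} GB")
```

The LoRA estimate stays close to the size of the base weights because only the small adapter carries gradients and optimizer states; quantizing the base model lowers it further.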