Quantization

Simple Definition

Quantization compresses an AI model by representing its internal numbers (its parameters, or weights) with lower precision. Instead of storing each value as a 32-bit or 16-bit floating-point number, quantization might store it as an 8-bit or even 4-bit integer.

The result is a model that:

Takes up less storage space
Uses less memory (RAM/VRAM) to run
Runs faster, especially on consumer hardware
Loses a small amount of accuracy compared to the original
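
To make the idea concrete, here is a minimal sketch of one simple scheme, symmetric per-tensor quantization, written in plain Python. It is an illustration only: the function names (quantize, dequantize, max_error) and the sample weights are made up for this example, and real tools typically use more refined methods, such as quantizing weights in small groups.

```python
def quantize(weights, bits=8):
    """Map floats to signed integers using one shared scale factor."""
    qmax = 2 ** (bits - 1) - 1              # 127 for 8-bit, 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round each weight to the nearest integer step and clamp to the range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the stored integers."""
    return [v * scale for v in q]


def max_error(weights, bits):
    """Largest difference between a weight and its round-tripped value."""
    q, scale = quantize(weights, bits)
    restored = dequantize(q, scale)
    return max(abs(a - b) for a, b in zip(weights, restored))


weights = [0.12, -0.53, 0.98, -0.07, 0.31]  # hypothetical example values

for bits in (8, 4):
    q, scale = quantize(weights, bits)
    print(f"{bits}-bit integers: {q}")
    print(f"{bits}-bit max round-trip error: {max_error(weights, bits):.4f}")
```

Running it shows the trade-off from the list above: the 4-bit version stores each weight in half the space of the 8-bit version, but its round-trip error is larger.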

A Simple Analogy

Imagine you have a high-resolution photo at 10MB. If you reduce it to a lower-resolution version at 2MB, it looks nearly as good at normal sizes but loads much faster and takes less space. Quantization does the same thing to AI model weights, trading a tiny bit of quality for a large gain in efficiency.

Why Quantization Matters

Without quantization, running large AI models requires expensive, specialized hardware. With quantization, you can:

Run a capable model on a laptop with a consumer-grade GPU
Run models locally without sending data to the cloud
Deploy AI on phones and edge devices
Reduce cloud hosting costs significantly

Common Quantization Formats

FP16 (16-bit): half the size of full 32-bit precision (FP32), minimal quality loss
INT8 (8-bit): smaller and faster, slight quality drop
INT4 (4-bit): very small, runs on consumer hardware, noticeable but acceptable quality trade-off
GGUF: a popular file format for quantized models used with tools like Ollama and LM Studio

Quantization and Local AI

If you’ve used tools like Ollama, LM Studio, or Jan to run AI on your computer, you’ve almost certainly used quantized models. A 70B-parameter model stored in 16-bit precision needs about 140GB of VRAM for its weights alone (roughly 280GB at full 32-bit precision), which is far beyond consumer hardware. A 4-bit quantized version of the same model might need only about 40GB, and a 4-bit 7B model can run in under 6GB.
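
The arithmetic behind these figures is simple: multiply the number of parameters by the bits per weight and divide by 8 to get bytes. The sketch below does that for a few common precisions. The helper name weight_memory_gb is made up for this example, and the result covers the weights only. Real memory use is somewhat higher because of runtime overhead and the extra data that quantized formats store alongside the weights, which is why a 4-bit 70B model is usually quoted nearer 40GB than 35GB.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Memory for the weights alone, in gigabytes (1 GB = 10**9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


models = [("70B", 70e9), ("7B", 7e9)]
formats = [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]

for model_name, num_params in models:
    for format_name, bits in formats:
        gb = weight_memory_gb(num_params, bits)
        print(f"{model_name} at {format_name}: about {gb:.1f} GB")
```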