Quantization
Simple Definition
Quantization compresses an AI model by representing its internal numbers (parameters) with lower precision. Instead of storing each value as a 32-bit or 16-bit number, quantization might store it as an 8-bit or even 4-bit number.
The result is a model that:
- Takes up less storage space
- Uses less memory (RAM/VRAM) to run
- Often runs faster, especially on consumer hardware where memory bandwidth is the bottleneck
- Loses a small amount of accuracy compared to the original
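To make the idea concrete, here is a minimal sketch of one simple scheme, symmetric round-to-nearest quantization, in plain Python. It is an illustration under stated assumptions, not how any particular tool does it: the function names and the example weights are made up, and real quantizers typically work on blocks of weights with more elaborate schemes.

```python
def quantize(weights, bits=8):
    """Map floats to signed integers of the given bit width (symmetric scheme)."""
    qmax = 2 ** (bits - 1) - 1                  # 127 for 8-bit, 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round each weight to the nearest integer step and clamp to the valid range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the stored integers and the scale."""
    return [v * scale for v in q]


def max_error(weights, bits):
    """Largest absolute difference between original and round-tripped weights."""
    q, scale = quantize(weights, bits)
    restored = dequantize(q, scale)
    return max(abs(a - b) for a, b in zip(weights, restored))


# Hypothetical example values, chosen only for illustration.
weights = [0.8213, -0.4107, 0.0532, -1.2745, 0.3391]

q8, scale8 = quantize(weights, bits=8)
print("8-bit integers:", q8)
print("8-bit max error:", max_error(weights, 8))
print("4-bit max error:", max_error(weights, 4))
```

Only the small integers and one scale value need to be stored, which is where the space saving comes from. The round-trip error grows as the bit width shrinks, which is the accuracy loss described above.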
A Simple Analogy
Imagine you have a high-resolution photo at 10MB. If you reduce it to a lower-resolution version at 2MB, it looks nearly as good at normal sizes but loads much faster and takes less space. Quantization does the same thing to AI model weights — trading a tiny bit of quality for a large gain in efficiency.
Why Quantization Matters
Without quantization, running large AI models requires expensive, specialized hardware. With quantization, you can:
- Run a capable model on a laptop with a standard GPU
- Run models locally without sending data to the cloud
- Deploy AI on phones and edge devices
- Reduce cloud hosting costs significantly
Common Quantization Formats
- FP16 (16-bit) — half the size of full precision, minimal quality loss
- INT8 (8-bit) — smaller and faster, slight quality drop
- INT4 (4-bit) — very small, runs on consumer hardware, noticeable but acceptable quality trade-off
- GGUF — a popular file format for quantized models used with tools like Ollama and LM Studio
Quantization and Local AI
If you’ve used tools like Ollama, LM Studio, or Jan to run AI on your computer, you’ve almost certainly used quantized models. A 70B-parameter model stored at 16-bit precision needs about 140GB of VRAM for its weights alone (about 280GB at 32-bit), which is far beyond consumer hardware. A 4-bit quantized version of the same model needs roughly 35 to 40GB, and a 4-bit 7B model can run in under 6GB.
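The arithmetic behind these figures is simple: parameters times bits per parameter, divided by 8 to get bytes. The sketch below uses a hypothetical helper, `weight_memory_gb`, and counts the weights only. Actual usage is somewhat higher because of the context cache, activations, and the scale values stored alongside quantized weights.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


for name, params in [("70B", 70e9), ("7B", 7e9)]:
    for bits in (32, 16, 8, 4):
        print(f"{name} at {bits}-bit: {weight_memory_gb(params, bits):.1f} GB")
```

By this estimate a 70B model needs 140 GB at 16-bit and 35 GB at 4-bit, while a 7B model needs 3.5 GB at 4-bit, which leaves headroom within a 6GB budget.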
Related Terms
- Model Parameters — the values that get compressed during quantization
- SLM — small models that often pair with quantization for on-device use
- Inference — the process that quantization makes faster and cheaper
- LLM — large models that often need quantization to run on accessible hardware