Benchmark
Simple Definition
A benchmark is a set of standardized tests that measure how well an AI model performs at specific tasks. Benchmarks let researchers and companies compare models against each other using a common measuring stick.
When AI companies announce a new model, they typically publish benchmark scores to show how it compares to competing models.
Why Benchmarks Exist
Without benchmarks, comparing models would be subjective — everyone would just say their model is “better.” Benchmarks create a shared standard so that performance claims can be verified and compared across organizations.
Common AI Benchmarks
| Benchmark | What It Tests |
|---|---|
| MMLU | General knowledge across 57 subjects |
| HumanEval | Ability to write correct Python functions, checked by unit tests |
| GSM8K | Grade-school math word problems |
| MATH | Competition-level math problems |
| HellaSwag | Commonsense reasoning |
| TruthfulQA | Tendency to avoid repeating common misconceptions and falsehoods |
| GPQA | PhD-level science questions |
| SWE-bench | Real-world software engineering tasks drawn from GitHub issues |
How to Read Benchmark Scores
Benchmark scores are usually expressed as a percentage of questions answered correctly. A score of 90% on MMLU means the model answered 90% of the benchmark's questions correctly. Higher is better.
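As a rough illustration of how a percentage score comes about, here is a minimal sketch in Python. The function name `accuracy_percent` and the sample answers are made up for this example; real benchmarks use their own scoring scripts, which may normalize answers or score in other ways.

```python
def accuracy_percent(predictions, answers):
    """Return the percentage of predictions that exactly match the answer key."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot score an empty benchmark")
    # Count positions where the model's answer equals the reference answer.
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


# Hypothetical multiple-choice results: the model's picks vs. the answer key.
model_picks = ["B", "A", "D", "C", "A", "B", "C", "D", "A", "B"]
answer_key = ["B", "A", "D", "C", "A", "B", "C", "D", "A", "C"]

score = accuracy_percent(model_picks, answer_key)
print(f"{score:.1f}%")  # 9 of the 10 picks match the key
```

The same arithmetic applies whether a benchmark has ten questions or ten thousand. What changes is how much a few lucky or unlucky answers can move the final number.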
However, scores alone don’t tell the whole story.
Why Benchmark Numbers Can Be Misleading
- Contamination (overfitting): a model may have seen benchmark questions during training, inflating its score
- Task gap: a model can score well on benchmarks but underperform on real-world tasks
- Cherry-picking: companies tend to highlight the benchmarks where their model performs best
- Context matters: a model that is great at coding may be poor at nuanced writing
Benchmark scores are a useful starting point, not the final word.
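One way to spot the first problem in the list above (benchmark questions leaking into training data) is to look for long runs of words shared between a question and the training text. The sketch below shows that idea in Python. It is an illustrative heuristic, not the method any particular lab uses, and the function names, the 5-word window, and the sample texts are all assumptions made for this example.

```python
def word_ngrams(text, n):
    """Return the set of lowercase word n-grams in text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_fraction(question, training_text, n=5):
    """Fraction of the question's n-grams that also appear in training_text."""
    question_grams = word_ngrams(question, n)
    if not question_grams:
        return 0.0  # question shorter than n words: nothing to compare
    training_grams = word_ngrams(training_text, n)
    return len(question_grams & training_grams) / len(question_grams)


# Hypothetical data for illustration only.
training_text = (
    "study guide: a train leaves the station with 120 passengers "
    "and 45 get off at the first stop how many remain"
)
leaked = "A train leaves the station with 120 passengers and 45 get off at the first stop"
fresh = "A farmer plants 12 rows of corn with 30 stalks in each row"

print(overlap_fraction(leaked, training_text))  # every 5-gram of the question is in the training text
print(overlap_fraction(fresh, training_text))   # none of the question's 5-grams are in the training text
```

A high overlap suggests the question may have been seen during training. A low overlap does not prove the opposite, since paraphrased or translated copies would slip past a word-for-word check like this one.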
Practical Takeaway
When comparing models for a specific use case, look for benchmarks relevant to that use case — not just overall scores. And if possible, test the model on your actual task rather than relying solely on published numbers.
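Testing on your own task can be as simple as a handful of real examples and a pass/fail check. The Python sketch below assumes a hypothetical `toy_model` function standing in for a real model call, and a made-up support-ticket classification task. Swap in your own model client, examples, and correctness check.

```python
def evaluate_on_own_task(model_fn, examples, is_correct):
    """Run model_fn on each (prompt, expected) pair and return the fraction judged correct."""
    if not examples:
        raise ValueError("need at least one example")
    passed = sum(
        1 for prompt, expected in examples if is_correct(model_fn(prompt), expected)
    )
    return passed / len(examples)


def toy_model(prompt):
    """Stand-in for a real model call (hypothetical); replace with your own client code."""
    canned = {
        "Classify: 'Refund not received'": "billing",
        "Classify: 'App crashes on login'": "bug",
    }
    return canned.get(prompt, "other")


def exact_match(output, expected):
    """Judge an output correct if it equals the expected label, ignoring case and outer spaces."""
    return output.strip().lower() == expected.lower()


# Hypothetical examples taken from the task you actually care about.
examples = [
    ("Classify: 'Refund not received'", "billing"),
    ("Classify: 'App crashes on login'", "bug"),
    ("Classify: 'How do I export data?'", "how-to"),
]

rate = evaluate_on_own_task(toy_model, examples, exact_match)
print(f"{rate:.0%} of examples passed")
```

Even a few dozen examples from real usage can reveal gaps that a published score hides. Running the same examples against each candidate model gives a like-for-like comparison on the task that matters to you.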
Related Terms
- LLM: the models that benchmarks measure
- Foundation Model: large models benchmarked before and after release
- Fine-Tuning: can improve a model’s score on task-specific benchmarks
- Hallucination: closely related to what TruthfulQA and similar truthfulness benchmarks try to measure