Benchmark

Simple Definition

A benchmark is a set of standardized tests that measure how well an AI model performs at specific tasks. Benchmarks let researchers and companies compare models against each other using a common measuring stick.

When AI companies announce a new model, they typically publish benchmark scores to show how it compares to competing models.

Why Benchmarks Exist

Without benchmarks, comparing models would be subjective — everyone would just say their model is “better.” Benchmarks create a shared standard so that performance claims can be verified and compared across organizations.

Common AI Benchmarks

Benchmark	What It Tests
MMLU	General knowledge across 57 subjects
HumanEval	Ability to write Python functions that pass unit tests
GSM8K	Grade-school math word problems
MATH	Competition-level math problems
HellaSwag	Commonsense reasoning
TruthfulQA	Tendency to give truthful answers instead of repeating common misconceptions
GPQA	PhD-level science questions
SWE-bench	Real-world software engineering tasks drawn from GitHub issues

How to Read Benchmark Scores

Benchmark scores are usually expressed as the percentage of questions answered correctly. A score of 90% on MMLU means the model answered 90% of its multiple-choice questions correctly. Higher is better.
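As a rough illustration of how a percentage score is computed, the sketch below compares a model's answers against a reference answer key. The function name `accuracy` and the five sample answers are made up for this example; real benchmarks use their own scoring scripts.

```python
def accuracy(predictions, answers):
    """Return the fraction of predictions that exactly match the answer key."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot score an empty benchmark")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


# Hypothetical multiple-choice results for five questions (illustrative data only).
answers = ["B", "D", "A", "C", "B"]
predictions = ["B", "D", "A", "A", "B"]

score = accuracy(predictions, answers)
print(f"{score:.0%}")  # 4 of 5 correct, prints 80%
```

Exact matching works for multiple-choice answers; benchmarks with free-form answers or code typically need a more involved checker.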

However, scores alone don’t tell the whole story.

Why Benchmark Numbers Can Be Misleading

Overfitting (data contamination): a model may have seen benchmark questions during training, which inflates its score
Task gap: a model can score well on benchmarks but underperform on real-world tasks
Cherry-picking: companies tend to highlight the benchmarks where their model performs best
Context matters: a model that excels at coding may be weak at nuanced writing
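The first point, a model having trained on benchmark questions, is sometimes probed by checking whether a question's word sequences appear verbatim in training text. The sketch below is one simple heuristic of that kind, not a definitive detection method. The function names, the 5-word window, and the sample strings are all assumptions chosen for illustration.

```python
import re


def ngrams(text, n):
    """Return the set of n-word sequences in text, ignoring case and punctuation."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(question, corpus_text, n=5):
    """Fraction of the question's n-grams that also occur in corpus_text."""
    question_ngrams = ngrams(question, n)
    if not question_ngrams:
        return 0.0  # question is shorter than n words
    shared = question_ngrams & ngrams(corpus_text, n)
    return len(shared) / len(question_ngrams)


# Illustrative strings only; a real check would scan actual training data.
question = "If a train travels 60 miles in 2 hours, what is its average speed?"
leaked = "Practice set: If a train travels 60 miles in 2 hours, what is its average speed? Answer: 30 mph."
unrelated = "The weather today is sunny with a light breeze from the west."

print(overlap_ratio(question, leaked))     # every 5-gram is found, prints 1.0
print(overlap_ratio(question, unrelated))  # no 5-gram is found, prints 0.0
```

A high ratio suggests the question may have leaked into the training text. A low ratio does not rule it out, since paraphrased or translated copies would not match.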

Benchmark scores are a useful starting point, not the final word.

Practical Takeaway

When comparing models for a specific use case, look for benchmarks relevant to that use case, not just overall scores. If possible, test the model on your actual task rather than relying solely on published numbers.
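Testing on your own task can start with a handful of prompts and expected answers. The sketch below assumes the model is available as a plain function that takes a prompt and returns text. `toy_model` is a stand-in invented for this example, and in practice you would replace it with a call to whichever model you are evaluating.

```python
from typing import Callable, List, Tuple


def evaluate(model: Callable[[str], str], cases: List[Tuple[str, str]]) -> float:
    """Return the fraction of cases where the model output matches the expected answer."""
    if not cases:
        raise ValueError("need at least one test case")
    passed = sum(
        model(prompt).strip().lower() == expected.strip().lower()
        for prompt, expected in cases
    )
    return passed / len(cases)


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; returns canned answers."""
    canned = {
        "Classify the sentiment: 'I love this product.'": "positive",
        "Classify the sentiment: 'This broke after one day.'": "negative",
    }
    return canned.get(prompt, "neutral")


# Illustrative test cases; use examples drawn from your actual workload.
cases = [
    ("Classify the sentiment: 'I love this product.'", "positive"),
    ("Classify the sentiment: 'This broke after one day.'", "negative"),
    ("Classify the sentiment: 'Worst purchase ever.'", "negative"),
]

task_score = evaluate(toy_model, cases)
print(f"{task_score:.0%}")  # 2 of 3 correct, prints 67%
```

Case-insensitive exact matching suits short labels like these. Open-ended outputs such as summaries usually need human review or a task-specific scoring rule.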