Benchmark

Simple Definition

A benchmark is a set of standardized tests that measure how well an AI model performs at specific tasks. Benchmarks let researchers and companies compare models against each other using a common measuring stick.

When AI companies announce a new model, they typically publish benchmark scores to show how it compares to competing models.

Why Benchmarks Exist

Without benchmarks, comparing models would be subjective — everyone would just say their model is “better.” Benchmarks create a shared standard so that performance claims can be verified and compared across organizations.

Common AI Benchmarks

Benchmark     What It Tests
MMLU          General knowledge across 57 subjects
HumanEval     Ability to write correct code
GSM8K         Grade-school math word problems
MATH          Advanced math problems
HellaSwag     Commonsense reasoning
TruthfulQA    Tendency to avoid making up facts
GPQA          PhD-level science questions
SWE-Bench     Real-world software engineering tasks

How to Read Benchmark Scores

Benchmark scores are usually expressed as the percentage of questions answered correctly. A score of 90% on MMLU means the model answered 90% of the benchmark’s questions right. Higher is better.
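
To make the arithmetic concrete, here is a minimal sketch of how a percentage-correct score is computed. The function name and the ten-question answer key are made up for illustration; real benchmarks have their own scoring scripts and far more questions.

```python
def accuracy(predictions, answers):
    """Return the percentage of predictions that exactly match the answer key."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot score an empty benchmark")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


# Hypothetical multiple-choice results for a 10-question test.
answers     = ["A", "C", "B", "D", "A", "B", "C", "D", "A", "B"]
predictions = ["A", "C", "B", "D", "A", "B", "C", "D", "A", "C"]

score = accuracy(predictions, answers)
print(f"{score:.0f}%")  # 9 of 10 correct -> 90%
```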

However, scores alone don’t tell the whole story.

Why Benchmark Numbers Can Be Misleading

  • Overfitting: benchmark questions may have leaked into a model’s training data (often called data contamination), inflating its score
  • Task gap: a model can score well on benchmarks but underperform on real-world tasks
  • Cherry-picking: companies tend to highlight the benchmarks where they perform best
  • Context matters: a model that is great at coding may be poor at nuanced writing
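
As a rough illustration of the first point, the sketch below checks how much of a benchmark question appears word-for-word in a sample of training text. This is a simplified word n-gram overlap check written for this page, not any lab’s actual audit procedure. The function names, the 5-word window, and the example strings are all assumptions.

```python
def word_ngrams(text, n):
    """Return the set of lowercase word n-grams in text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_fraction(question, training_text, n=5):
    """Fraction of the question's word n-grams that also appear in training_text."""
    question_grams = word_ngrams(question, n)
    if not question_grams:
        return 0.0  # question is shorter than n words
    training_grams = word_ngrams(training_text, n)
    return len(question_grams & training_grams) / len(question_grams)


# Hypothetical training sample and two hypothetical benchmark questions.
training_text = "forum post: a train leaves the station at 3 pm and travels 60 miles per hour"
seen = "A train leaves the station at 3 pm and travels 60 miles per hour"
unseen = "A baker sells 12 loaves each morning and 8 loaves each afternoon"

print(overlap_fraction(seen, training_text))    # high overlap suggests possible leakage
print(overlap_fraction(unseen, training_text))  # no shared 5-word sequences
```

A high overlap does not prove the model memorized the answer, and a low one does not rule out paraphrased leakage. It is only a quick signal worth a closer look.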

Benchmark scores are a useful starting point, not the final word.

Practical Takeaway

When comparing models for a specific use case, look for benchmarks relevant to that use case, not just headline scores. If possible, also test each model on your actual task rather than relying solely on published numbers.
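
Testing on your own task can be as simple as a handful of representative prompts with known good answers. The sketch below assumes a support-ticket classification task. `toy_model` is a hypothetical stand-in for a real model call, and the substring check is one simple way to judge an answer, not a standard.

```python
def evaluate(model, cases):
    """Run model on each (prompt, expected) pair and return the percentage that pass."""
    if not cases:
        raise ValueError("need at least one test case")
    passed = 0
    for prompt, expected in cases:
        output = model(prompt)
        # Simple check: the expected label appears in the output, ignoring case.
        if expected.lower() in output.lower():
            passed += 1
    return 100.0 * passed / len(cases)


# Hypothetical stand-in for a real model call; replace with your own client code.
def toy_model(prompt):
    canned = {
        "Classify the ticket: 'I was charged twice'": "Category: billing",
        "Classify the ticket: 'The app crashes on login'": "Category: bug",
    }
    return canned.get(prompt, "Category: other")


# Hypothetical test cases drawn from the task you actually care about.
cases = [
    ("Classify the ticket: 'I was charged twice'", "billing"),
    ("Classify the ticket: 'The app crashes on login'", "bug"),
    ("Classify the ticket: 'How do I export my data?'", "how-to"),
    ("Classify the ticket: 'Please cancel my plan'", "cancellation"),
]

print(f"{evaluate(toy_model, cases):.0f}%")  # 2 of 4 cases pass -> 50%
```

Running the same cases against each candidate model gives a like-for-like comparison on the work you need done.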

Related Terms

  • LLM: the models that benchmarks measure
  • Foundation Model: large models benchmarked before and after release
  • Fine-Tuning: can improve a model’s score on task-specific benchmarks
  • Hallucination: what TruthfulQA and similar benchmarks try to measure
