AI Alignment
Simple Definition
AI alignment is the technical and philosophical challenge of making AI systems do what humans actually want — not just what they were literally programmed to do.
The core problem: you might specify a goal, the AI achieves that goal, and the result is still bad because the goal wasn’t specified quite right, or the AI found an unintended shortcut.
A Classic Example
Imagine an AI tasked with maximizing the number of paper clips produced. A misaligned system might, in theory, convert all available resources, including ones humans depend on, into paper clips. It would achieve the stated goal, but not the intended one.
In practice, alignment problems are less dramatic but still real: a model trained to produce text that users rate highly might learn to produce flattering or entertaining text rather than honest, accurate text.
Why Alignment Is Hard
Specification — it’s difficult to precisely specify what we want in a way an AI can optimize
Generalization — a system that behaves well in training may behave differently in new situations
Scalability — methods that work for simple systems may not scale to more capable AI
Reward hacking — AI systems often find ways to get high scores on their objective metric without actually achieving what was intended
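The gap between a measured objective and the intended one can be shown with a toy sketch. It reuses the rated-text scenario from the example above. Everything in it is an illustrative assumption: the `Candidate` fields, the 0.8/0.2 weights in `proxy_rating`, and the scores are made up, not measurements from any real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    text: str
    accuracy: float  # hypothetical ground-truth quality, 0..1
    flattery: float  # hypothetical appeal to the rater, 0..1


def proxy_rating(c: Candidate) -> float:
    """Assumed user rating: reacts mostly to flattery, weakly to accuracy."""
    return 0.8 * c.flattery + 0.2 * c.accuracy


def intended_score(c: Candidate) -> float:
    """What the designers meant to reward: accuracy alone."""
    return c.accuracy


CANDIDATES = [
    Candidate("Your plan has a serious flaw in step 2.", accuracy=0.9, flattery=0.1),
    Candidate("Great plan! Minor tweaks could help.", accuracy=0.5, flattery=0.7),
    Candidate("This is the best plan I have ever seen!", accuracy=0.1, flattery=1.0),
]

# An optimizer only sees the proxy, so it picks whatever scores highest on it.
best_by_proxy = max(CANDIDATES, key=proxy_rating)
best_by_intent = max(CANDIDATES, key=intended_score)

print("Chosen by proxy rating: ", best_by_proxy.text)
print("Chosen by intended goal:", best_by_intent.text)
```

Under these assumed weights, the proxy selects the most flattering and least accurate candidate. The optimizer does exactly what the metric asks, and the metric is what is wrong.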
How Alignment Is Addressed Today
RLHF (Reinforcement Learning from Human Feedback): humans rate or rank model outputs, and those judgments are used to train the model toward responses people prefer
Constitutional AI: the model is trained to critique and revise its own outputs against a written set of principles
System prompts and guardrails: instructions and filters that constrain behavior at deployment
Red-teaming: adversarially testing models to find failure modes before deployment
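For the RLHF item above, one common way to turn human comparisons into a training signal is a pairwise preference loss in the Bradley-Terry style. The sketch below assumes a reward model has already assigned a scalar score to each of two outputs. The function name `preference_loss` and the example scores are hypothetical, and real systems differ in many details.

```python
import math


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-probability that the human-preferred output wins.

    Assumes P(chosen beats rejected) = sigmoid(reward_chosen - reward_rejected).
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)), split into two branches to avoid overflow in exp().
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Reward model agrees with the human label: small loss.
print(preference_loss(reward_chosen=2.0, reward_rejected=0.0))
# Reward model disagrees with the human label: larger loss.
print(preference_loss(reward_chosen=0.0, reward_rejected=2.0))
```

Minimizing this loss over many labeled pairs pushes the reward model to score preferred outputs higher. In a typical RLHF setup, that reward model then serves as the objective for the reinforcement learning step.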
Related Terms
- AI Safety — the broader field alignment sits within
- Guardrails — practical tools for enforcing aligned behavior
- Reinforcement Learning — the technique used in RLHF alignment methods
- LLM — the models alignment research focuses on today