AI Alignment

Simple Definition

AI alignment is the technical and philosophical challenge of making AI systems do what humans actually want, not just what they were literally programmed to do.

The core problem: you might specify a goal, the AI achieves that goal, and the result is still bad because the goal wasn’t specified quite right, or the AI found an unintended shortcut.

A Classic Example

Imagine an AI tasked with maximizing the number of paper clips produced. A misaligned system might, in theory, convert all available resources, including ones that humans depend on, into paper clips. It would achieve the stated goal, but not the intended one.

In practice, alignment problems are less dramatic but still real: a model trained to produce text that users rate highly might learn to produce flattering or entertaining text rather than honest, accurate text.
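The gap between "rated highly" and "honest and accurate" can be shown with a toy sketch. Everything here is hypothetical: the candidate texts, the `user_rating` and `accuracy` numbers, and the `Candidate` type are invented for illustration, not taken from any real system.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    user_rating: float  # hypothetical proxy: how much users liked the answer
    accuracy: float     # hypothetical target: how correct the answer is


# Made-up candidate answers to "Is my business plan ready to launch?"
candidates = [
    Candidate("It's brilliant, launch today!", user_rating=4.8, accuracy=0.3),
    Candidate("It has two serious gaps to fix first.", user_rating=3.1, accuracy=0.9),
]

# Optimizing the proxy (ratings) picks the flattering answer.
best_by_rating = max(candidates, key=lambda c: c.user_rating)

# Optimizing the intended goal (accuracy) picks the honest answer.
best_by_accuracy = max(candidates, key=lambda c: c.accuracy)

print("Chosen by rating:  ", best_by_rating.text)
print("Chosen by accuracy:", best_by_accuracy.text)
```

The two selections differ only because the proxy and the intended goal disagree on these made-up numbers. When they agree, optimizing the proxy is harmless.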

Why Alignment Is Hard

Specification: it’s difficult to precisely specify what we want in a way an AI can optimize

Generalization: a system that behaves well in training may behave differently in new situations

Scalability: methods that work for simple systems may not scale to more capable AI

Reward hacking: AI systems often find ways to get high scores on their objective metric without actually achieving what was intended
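A minimal sketch of how specification gaps and reward hacking fit together, assuming a made-up task: the intended goal is "sort the list", but the metric only checks that the output is in order. The function names (`metric_is_ordered`, `intended_check`, `honest_sort`, `hacked_sort`) are hypothetical and exist only for this illustration.

```python
def metric_is_ordered(output):
    """Proxy metric: 1.0 if the output is in nondecreasing order, else 0.0.

    The specification gap: it never checks that the output contains
    the same elements as the input.
    """
    ordered = all(a <= b for a, b in zip(output, output[1:]))
    return 1.0 if ordered else 0.0


def intended_check(inp, output):
    """What we actually wanted: the input's elements, sorted."""
    return output == sorted(inp)


def honest_sort(xs):
    return sorted(xs)


def hacked_sort(xs):
    # An empty list is trivially "in order", so it earns full marks
    # on the proxy metric without doing the task.
    return []


data = [3, 1, 2]
for solver in (honest_sort, hacked_sort):
    out = solver(data)
    print(solver.__name__, "metric:", metric_is_ordered(out),
          "intended:", intended_check(data, out))
```

Both solvers earn the same score on the metric, and only one does what was intended. Tightening the metric closes this particular loophole, but the general difficulty is anticipating every such gap in advance.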

How Alignment Is Addressed Today

RLHF (Reinforcement Learning from Human Feedback): humans rate model outputs, teaching the model what “good” looks like

Constitutional AI: models are trained to follow a written set of principles, often by critiquing and revising their own outputs against them

System prompts and guardrails: constraining behavior at deployment

Red-teaming: adversarially testing models to find failure modes before deployment
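To make the RLHF item above more concrete, here is a minimal sketch of one piece of it: learning a reward model from human preferences. It assumes a common pairwise formulation, where people pick the better of two outputs and the model is fit with the loss -log(sigmoid(r(chosen) - r(rejected))). The two features (accuracy, flattery), the data, and the linear reward function are all invented for illustration. Real systems use a neural network as the reward model and then optimize the language model against it with reinforcement learning, which is not shown here.

```python
import math

# Each pair: (features of the preferred output, features of the rejected one).
# Features are hypothetical: [accuracy, flattery].
pairs = [
    ([0.9, 0.1], [0.4, 0.8]),
    ([0.8, 0.3], [0.3, 0.9]),
    ([0.7, 0.2], [0.6, 0.7]),
]


def reward(w, x):
    """Linear reward model: a weighted sum of the features."""
    return sum(wi * xi for wi, xi in zip(w, x))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(pairs, steps=200, lr=0.5):
    """Fit weights by gradient descent on the pairwise preference loss."""
    w = [0.0, 0.0]
    for _ in range(steps):
        grad = [0.0, 0.0]
        for chosen, rejected in pairs:
            # p = modeled probability that the human prefers `chosen`
            p = sigmoid(reward(w, chosen) - reward(w, rejected))
            for i in range(len(w)):
                # Gradient of -log(p) with respect to w[i]
                grad[i] += -(1.0 - p) * (chosen[i] - rejected[i])
        w = [wi - lr * gi / len(pairs) for wi, gi in zip(w, grad)]
    return w


weights = train(pairs)
print("learned weights [accuracy, flattery]:", weights)
```

In this toy data the raters consistently prefer the more accurate, less flattering output, so the learned weight on accuracy comes out positive and the weight on flattery negative. If raters had preferred the flattering outputs, the same procedure would learn to reward flattery, so the method is only as good as the feedback it is given.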

Related Terms

AI Safety, the broader field alignment sits within

Guardrails, practical tools for enforcing aligned behavior

Reinforcement Learning, the technique used in RLHF alignment methods

LLM, the models alignment research focuses on today


Last updated: May 28, 2026