Inference

Simple Definition

Inference is the process of using a trained AI model to generate a response or prediction. When you type a message into ChatGPT and it replies — that’s inference. The model is applying what it learned during training to your specific input.

Training and inference are the two phases of an AI model’s life:

  • Training — the model learns from data (expensive, done once or periodically)
  • Inference — the model uses what it learned to respond to inputs (done every time you use it)
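
The two phases can be seen in miniature with a toy model. The sketch below is purely illustrative: `train` and `predict` are hypothetical names, and a one-variable least-squares line stands in for a real AI model.

```python
def train(xs, ys):
    """Training: learn parameters (slope, intercept) from data. Done once."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form least-squares fit for a line y = slope * x + intercept.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(params, x):
    """Inference: apply the learned parameters to a new input. Done per request."""
    slope, intercept = params
    return slope * x + intercept


# Train once on example data that follows y = 2x + 1 ...
params = train([0, 1, 2, 3], [1, 3, 5, 7])

# ... then run inference as many times as needed without retraining.
predictions = [predict(params, x) for x in (10, 20)]
```

The expensive step (`train`) runs once, while the cheap step (`predict`) runs for every new input. Real models follow the same pattern at a vastly larger scale.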

Why the Distinction Matters

Training a large AI model can cost millions of dollars in compute, but it is a one-time or occasional expense. Inference is much cheaper per request, yet because millions of people use these tools, the cost recurs with every request and the total becomes enormous. Inference efficiency, meaning how quickly and cheaply a model can respond, is therefore a major area of AI engineering.
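
A quick back-of-the-envelope calculation shows how a small per-request cost adds up. Every figure in this sketch is hypothetical and chosen only to make the arithmetic easy to follow; none of them describes a real model or provider.

```python
def total_inference_cost(cost_per_request, requests_per_day, days):
    """Total inference spend over a period, in the currency of cost_per_request."""
    return cost_per_request * requests_per_day * days


# All figures below are hypothetical, chosen only to show the arithmetic.
TRAINING_COST = 10_000_000      # one-time training cost, in dollars
COST_PER_REQUEST = 0.002        # cost of a single inference request, in dollars
REQUESTS_PER_DAY = 50_000_000   # daily request volume

# Inference spend accumulated over one year of use.
yearly_inference = total_inference_cost(COST_PER_REQUEST, REQUESTS_PER_DAY, 365)

# Number of days until cumulative inference spend equals the training cost.
days_to_match_training = TRAINING_COST / (COST_PER_REQUEST * REQUESTS_PER_DAY)
```

With these made-up numbers, inference costs about $100,000 per day, so cumulative inference spending matches the one-time training cost after roughly 100 days.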

Inference Speed

Inference speed is measured in tokens per second, meaning how fast the model generates output. When an AI response streams in piece by piece, you’re watching inference happen in real time: each piece is one or more newly generated tokens.
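
Tokens per second can be measured by timing a stream of tokens as it arrives. In this minimal sketch, `fake_token_stream` is a hypothetical stand-in for a real model (it simply waits a fixed delay before yielding each token), and `measure_tokens_per_second` is an illustrative helper, not part of any library.

```python
import time


def fake_token_stream(tokens, delay_seconds):
    """Hypothetical stand-in for a model: yields tokens with a fixed delay."""
    for token in tokens:
        time.sleep(delay_seconds)
        yield token


def measure_tokens_per_second(stream):
    """Consume a token stream and return (token_count, tokens_per_second)."""
    start = time.perf_counter()
    count = 0
    for _ in stream:
        count += 1
    elapsed = time.perf_counter() - start
    return count, (count / elapsed if elapsed > 0 else 0.0)


count, rate = measure_tokens_per_second(
    fake_token_stream(["In", "ference", " is", " fast", "."], delay_seconds=0.01)
)
```

The same timing loop would work with any iterator that yields tokens, such as a streaming response from a real model.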

Factors that affect inference speed:

  • Model size (larger = slower)
  • Hardware (specialized chips = faster)
  • Batch size (how many requests are processed together) and optimization techniques
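
To see why model size and hardware matter, one common rule of thumb can be sketched in a few lines. It assumes (this assumption is brought in for illustration and is not from the text above) that generating each token for a single request requires reading all model weights from memory once, so memory bandwidth divided by model size gives a rough upper bound on speed. It ignores batching, caching, and compute limits, and the hardware figure is hypothetical.

```python
def estimated_max_tokens_per_second(num_parameters, bytes_per_parameter,
                                    memory_bandwidth_bytes_per_s):
    """Rough ceiling on single-request speed under the rule of thumb above."""
    model_bytes = num_parameters * bytes_per_parameter
    return memory_bandwidth_bytes_per_s / model_bytes


# Hypothetical hardware: 1,000 GB/s of memory bandwidth.
BANDWIDTH = 1_000e9

# Two hypothetical models stored at 2 bytes per parameter.
small = estimated_max_tokens_per_second(7e9, 2, BANDWIDTH)    # 7 billion parameters
large = estimated_max_tokens_per_second(70e9, 2, BANDWIDTH)   # 70 billion parameters
```

Under this estimate the model with ten times as many parameters has a ceiling ten times lower on the same hardware, while faster memory raises the ceiling for both.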

Local vs. Cloud Inference

Cloud inference — the model runs on a company’s servers (ChatGPT, Claude, Gemini)

Local inference — the model runs on your own device (Ollama, LM Studio with smaller models)

Local inference is usually slower, and consumer hardware often cannot run the largest models at all, but it offers privacy because your data never leaves your machine.
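
As an illustration of local inference, the sketch below sends a prompt to an Ollama server on the same machine, so the request goes to `localhost` and never to a remote service. It assumes that Ollama is installed and running on its default port and that a model tagged `llama3.2` has already been downloaded; the function names are hypothetical helpers written for this example.

```python
import json
import urllib.request

# Assumptions: Ollama is running locally on its default port, and a model
# tagged "llama3.2" has already been pulled. Adjust both for your setup.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_local_request(prompt, model="llama3.2"):
    """Build (but do not send) an HTTP request to a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_local_inference(prompt, model="llama3.2"):
    """Send the prompt to the local server and return the generated text."""
    request = build_local_request(prompt, model)
    with urllib.request.urlopen(request) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["response"]
```

Calling `run_local_inference("What is inference?")` would return the model's reply as a string, provided the local server is running. A cloud service works in a similar way, except that the request travels to the provider's servers instead of staying on your device.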

Related Terms

  • LLM — the models that run inference for language tasks
  • Training Data — what the model learned from before inference
  • Token — the unit of text models generate during inference
  • Foundation Model — large models deployed for inference via APIs
