Text-to-Speech (TTS)

Simple Definition

Text-to-speech (TTS) is an AI technology that converts written text into spoken audio. You provide text, and the system produces a realistic-sounding voice reading it aloud.

Modern AI TTS sounds far more natural than the robotic computer voices of the past. Tools like ElevenLabs can produce speech that listeners often struggle to tell apart from a recording of a real human.

How It Works

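Modern TTS systems use deep learning to model how human speech sounds, including the rhythm, intonation, emphasis, and natural variation in real voices. They’re trained on recordings of human speech paired with the matching text, and learn to predict the audio for new text. Many systems do this in two stages: one model predicts acoustic features such as a spectrogram from the text, and a second model (a vocoder) turns those features into a waveform.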
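
Rhythm and emphasis are normally inferred by the model, but some services (Google Cloud TTS and Amazon Polly among them) also accept SSML, the W3C Speech Synthesis Markup Language, so that pauses and speaking rate can be set by hand. The following is a minimal sketch of building such markup in Python. The function name build_ssml and its defaults are hypothetical, and which tags and attribute values a given voice honors varies by provider.

```python
from xml.sax.saxutils import escape


def build_ssml(sentences, pause_ms=400, rate="medium"):
    """Wrap sentences in SSML with a fixed pause between them.

    pause_ms and rate are illustrative defaults, not recommendations.
    """
    pause = f'<break time="{pause_ms}ms"/>'
    # Escape &, <, and > so the text cannot break the XML markup.
    body = pause.join(f"<s>{escape(s)}</s>" for s in sentences)
    return f'<speak><prosody rate="{rate}">{body}</prosody></speak>'


ssml = build_ssml(["Welcome back.", "Today's topic is R&D."])
print(ssml)
```

The resulting string would be sent to the service in place of plain text, usually with a flag or field that marks the input as SSML.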

Some systems can also clone voices — given a sample of a specific person’s voice, they can generate new speech in that voice.

Leading TTS Tools

  • ElevenLabs — widely regarded as a leader in voice quality and realism
  • OpenAI TTS — fast, high-quality, available via API
  • Google Cloud TTS — extensive language support
  • Amazon Polly — AWS-native TTS service
  • Play.ht — focused on voice cloning and podcast audio
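
As a rough illustration of what "available via API" looks like in practice, the sketch below builds an HTTP request for OpenAI's speech endpoint using only the Python standard library. The endpoint path, the tts-1 model name, and the alloy voice reflect OpenAI's public documentation at the time of writing and may change, so treat them as assumptions. The helper names build_speech_request and synthesize_to_file are hypothetical.

```python
import json
import urllib.request

# Assumed endpoint; check the provider's current API reference before use.
SPEECH_URL = "https://api.openai.com/v1/audio/speech"


def build_speech_request(text, api_key, model="tts-1", voice="alloy"):
    """Build (but do not send) a POST request asking for spoken audio."""
    payload = {"model": model, "voice": voice, "input": text}
    return urllib.request.Request(
        SPEECH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize_to_file(text, api_key, path="speech.mp3"):
    """Send the request and write the returned audio bytes to disk.

    Assumes the service responds with MP3 audio by default.
    """
    request = build_speech_request(text, api_key)
    with urllib.request.urlopen(request) as response, open(path, "wb") as out:
        out.write(response.read())
    return path
```

Only synthesize_to_file touches the network, and it needs a valid API key. Other providers follow the same broad pattern (text in, audio bytes out) but differ in authentication, parameter names, and audio formats.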

Use Cases

  • Video voiceovers — narrate videos without recording equipment
  • Podcast content — generate audio from written scripts
  • Accessibility — read content aloud for visually impaired users
  • E-learning — narrate courses and educational content
  • Audiobooks — produce audio versions of written content
  • Customer service — power voice bots and IVR systems
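
Long-form use cases such as audiobooks and course narration tend to run into per-request input limits, so scripts are commonly split into smaller pieces before synthesis. The sketch below shows one naive way to do that at sentence boundaries. The function name chunk_text and the max_chars default are hypothetical; the real limit depends on the provider.

```python
import re


def chunk_text(text, max_chars=1000):
    """Split text into chunks that aim to stay within max_chars.

    Chunks break only at sentence ends. A single sentence longer than
    max_chars is kept whole rather than cut mid-word, so callers should
    still check chunk lengths.
    """
    # Naive sentence split: break after ., !, or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow, so start a new chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be synthesized separately and the audio files joined in order. Splitting at sentence ends keeps pauses and intonation from being cut mid-phrase.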

Important Considerations

Voice cloning raises ethical and legal questions around consent and misuse, such as impersonation and fraud. Responsible use means cloning only voices whose owners have given you permission.
