Speech-to-Text (STT)

Simple Definition

Speech-to-text (STT), also called automatic speech recognition (ASR) or transcription, converts spoken words into written text. You speak or provide an audio file, and the system produces a text transcript.

It’s how Siri, Alexa, and Google voice search turn spoken requests into text, and it powers meeting transcription tools like Otter.ai.

How It Works

Modern STT systems use deep neural networks to convert an audio signal into a sequence of words. They’re trained on large collections of audio paired with transcripts, from which they learn how speech sounds correspond to text and how context makes some word sequences more likely than others.
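
To make the "audio in, words out" step more concrete, here is a minimal sketch of one common decoding approach, CTC greedy decoding. This is an illustrative assumption, not a description of any specific tool in this entry: some systems work differently (Whisper, for example, uses an encoder-decoder model). The idea is that the model emits one label per short audio frame, and the decoder merges repeated labels and drops a special "blank" label. The frame labels and the BLANK symbol below are made up for illustration.

```python
from itertools import groupby

BLANK = "_"  # hypothetical symbol for the CTC "blank" (no new character) label


def ctc_greedy_decode(frame_labels):
    """Collapse per-frame labels into text: merge repeats, then drop blanks."""
    # groupby merges runs of identical consecutive labels into one label.
    collapsed = [label for label, _ in groupby(frame_labels)]
    # Blanks separate genuine double letters (the "ll" in "hello") and are removed last.
    return "".join(label for label in collapsed if label != BLANK)


# Made-up per-frame predictions for a short clip of someone saying "hello".
frames = list("hh_e_ll_l_oo")
print(ctc_greedy_decode(frames))  # prints "hello"
```

A real system would choose each frame's label from the model's predicted probabilities, and often adds a language model on top to use context.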

More advanced systems also handle:

Multiple speakers (diarization)
Background noise
Different accents and languages
Real-time transcription
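
As a rough sketch of how diarization output can be combined with a transcript, the example below labels each word with the speaker whose turn contains the word's midpoint. The data shapes (word tuples with timestamps, speaker turns) and the function names are assumptions chosen for illustration; real tools each expose their own formats.

```python
def assign_speakers(words, turns):
    """Label each word with the speaker whose turn contains the word's midpoint."""
    labeled = []
    for text, start, end in words:
        midpoint = (start + end) / 2
        speaker = "unknown"  # used when no speaker turn covers the word
        for name, turn_start, turn_end in turns:
            if turn_start <= midpoint < turn_end:
                speaker = name
                break
        labeled.append((speaker, text))
    return labeled


def format_transcript(labeled):
    """Group consecutive words from the same speaker into one line each."""
    lines = []
    for speaker, text in labeled:
        if lines and lines[-1][0] == speaker:
            lines[-1][1].append(text)
        else:
            lines.append((speaker, [text]))
    return [f"{speaker}: {' '.join(tokens)}" for speaker, tokens in lines]


# Hypothetical STT output: (word, start_seconds, end_seconds)
words = [("Hi", 0.0, 0.3), ("there", 0.3, 0.7), ("Hello", 1.0, 1.4), ("Ana", 1.4, 1.8)]
# Hypothetical diarization output: (speaker, start_seconds, end_seconds)
turns = [("Speaker 1", 0.0, 0.9), ("Speaker 2", 0.9, 2.0)]

for line in format_transcript(assign_speakers(words, turns)):
    print(line)
# prints:
# Speaker 1: Hi there
# Speaker 2: Hello Ana
```

Matching on the midpoint is a simple design choice; it avoids assigning a word to two speakers when it straddles a turn boundary.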

Leading STT Tools

Whisper (OpenAI) — open-source, highly accurate, supports many languages
Otter.ai — meeting transcription and collaboration
Fireflies.ai — meeting notes and action items
Deepgram — fast, API-first, real-time capable
Google Speech-to-Text — enterprise scale, multilingual
Rev — high-accuracy transcription service
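
When comparing tools on accuracy, a commonly reported metric is word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the tool's output into a reference transcript, divided by the number of words in the reference. This entry does not say how any of the tools above measure accuracy, so treat the following as a generic sketch; the function name and the lowercase-and-split normalization are assumptions.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
print(f"{wer:.3f}")  # prints "0.167"
```

Lower is better, and a WER of 0 means the output matches the reference word for word. Real evaluations usually normalize punctuation and number formats before comparing.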

Use Cases

Meeting transcription — automatic notes from Zoom, Teams, or Google Meet
Podcast and video transcripts — make audio content searchable and accessible
Dictation — write by speaking instead of typing
Voice assistants — Siri, Alexa, Google Assistant
Accessibility — live captions that help deaf and hard-of-hearing users follow conversations
Customer call analysis — transcribe and analyze support calls
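
For the podcast, video, and accessibility use cases, transcripts are often delivered as caption files. The sketch below turns timestamped segments into the SubRip (SRT) caption format, assuming the STT tool returns segments as (start_seconds, end_seconds, text). That input shape and the function names are hypothetical.

```python
def srt_timestamp(seconds):
    """Format seconds as the HH:MM:SS,mmm timestamp used in SRT files."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(segments):
    """Build SRT text from (start_seconds, end_seconds, text) segments."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        # Each caption: a counter, a time range, then the caption text.
        blocks.append(f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}")
    # Captions are separated by a blank line.
    return "\n\n".join(blocks) + "\n"


# Made-up segments standing in for real STT output.
segments = [
    (0.0, 2.5, "Welcome to the show."),
    (2.5, 6.0, "Today we are talking about speech-to-text."),
]
print(to_srt(segments))
```

The resulting text can be saved with a .srt extension and loaded by most video players as subtitles.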

Related Terms

Text-to-Speech — the reverse: converting text into spoken audio
Multimodal AI — AI systems that handle audio alongside text
Natural Language Processing — processes the transcribed text

Last updated: May 28, 2026
