Speech-to-Text (STT)
Simple Definition
Speech-to-text (STT) AI — also called automatic speech recognition (ASR) or transcription — converts spoken words into written text. You speak or provide an audio file, and the AI produces a text transcript.
It’s the technology behind Siri, Alexa, Google voice search, and meeting transcription tools like Otter.ai.
How It Works
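Modern STT systems use deep neural networks to map audio to text. The audio is typically split into short frames and converted into spectral features, which the model turns into a sequence of characters or word pieces. These models are trained on large amounts of audio paired with transcripts, from which they learn the sounds of a language and the context that makes one word more likely than another.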
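To make the "audio to words" step concrete, here is a minimal sketch of greedy CTC decoding, one common way (not the only one) to turn per-frame model outputs into text. The glossary entry does not say which decoding method any given tool uses, so treat this as an illustration. The alphabet, the blank symbol, and the hand-written probabilities are all assumptions; a real system would get these probabilities from a trained neural network.

```python
from itertools import groupby

BLANK = "_"  # hypothetical symbol standing in for the CTC "blank" (no character) class


def greedy_ctc_decode(frame_probs, alphabet):
    """Collapse per-frame symbol probabilities into a string.

    frame_probs: list of frames, each a list of probabilities aligned with `alphabet`.
    """
    # 1. Pick the most likely symbol for every audio frame.
    best = [alphabet[max(range(len(frame)), key=frame.__getitem__)] for frame in frame_probs]
    # 2. Merge consecutive repeats of the same symbol.
    collapsed = [symbol for symbol, _ in groupby(best)]
    # 3. Drop the blank symbol.
    return "".join(symbol for symbol in collapsed if symbol != BLANK)


# Toy input: five frames over a three-symbol alphabet (made-up numbers).
alphabet = [BLANK, "h", "i"]
frames = [
    [0.10, 0.80, 0.10],  # "h"
    [0.20, 0.70, 0.10],  # "h" again (same sound spans two frames)
    [0.90, 0.05, 0.05],  # blank
    [0.10, 0.10, 0.80],  # "i"
    [0.10, 0.20, 0.70],  # "i" again
]
print(greedy_ctc_decode(frames, alphabet))  # hi
```

The blank symbol is what lets this scheme output doubled letters: a blank between two identical symbols keeps them from being merged.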
More advanced systems also handle:
- Multiple speakers (diarization)
- Background noise
- Different accents and languages
- Real-time transcription
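As a rough illustration of how diarization output can be combined with a transcript, the sketch below assigns a speaker to each word by checking which speaker turn contains the word's midpoint. The data shapes (word timestamps and speaker turns as tuples) and the function names are hypothetical; actual tools expose this information in their own formats.

```python
def label_words(words, turns):
    """Attach a speaker label to each word using the word's midpoint time.

    words: list of (text, start_sec, end_sec)
    turns: list of (speaker, start_sec, end_sec), as a diarizer might produce
    """
    labeled = []
    for text, start, end in words:
        mid = (start + end) / 2
        # First turn whose time range contains the midpoint, else "unknown".
        speaker = next((s for s, t0, t1 in turns if t0 <= mid < t1), "unknown")
        labeled.append((speaker, text))
    return labeled


def to_transcript(labeled):
    """Group consecutive words from the same speaker into 'Speaker: text' lines."""
    lines = []
    for speaker, text in labeled:
        if lines and lines[-1][0] == speaker:
            lines[-1][1].append(text)
        else:
            lines.append((speaker, [text]))
    return [f"{speaker}: {' '.join(tokens)}" for speaker, tokens in lines]


# Made-up example data.
words = [("hello", 0.0, 0.4), ("there", 0.5, 0.9), ("hi", 1.2, 1.5)]
turns = [("A", 0.0, 1.0), ("B", 1.0, 2.0)]
print("\n".join(to_transcript(label_words(words, turns))))
```

Using the midpoint is a simple heuristic; it can mislabel words that straddle a speaker change or fall where speakers overlap.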
Leading STT Tools
- Whisper (OpenAI) — open-source, highly accurate, supports many languages
- Otter.ai — meeting transcription and collaboration
- Fireflies.ai — meeting notes and action items
- Deepgram — fast, API-first, real-time capable
- Google Speech-to-Text — enterprise scale, multilingual
- Rev — high-accuracy transcription service
Use Cases
- Meeting transcription — automatic notes from Zoom, Teams, or Google Meet
- Podcast and video transcripts — make audio content searchable and accessible
- Dictation — write by speaking instead of typing
- Voice assistants — Siri, Alexa, Google Assistant
- Accessibility — enable hearing-impaired users to follow conversations
- Customer call analysis — transcribe and analyze support calls
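For the podcast and video case, a transcript with timestamps can be turned into captions. The sketch below formats timestamped segments as SubRip (SRT) subtitles. The segment tuples are an assumed input shape, not the output format of any particular tool listed above.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as the SRT timestamp HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments):
    """Build SRT text from (start_sec, end_sec, text) segments."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text.strip()}"
        )
    # SRT cues are separated by a blank line.
    return "\n\n".join(blocks) + "\n"


# Made-up example segments.
segments = [
    (0.0, 2.5, "Welcome to the show."),
    (2.5, 5.0, " Today: speech recognition."),
]
print(to_srt(segments))
```

The same segment list could also feed a search index, which is what makes the audio content searchable by its text.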
Related Terms
- Text-to-Speech — the reverse: converting text into spoken audio
- Multimodal AI — AI systems that handle audio alongside text
- Natural Language Processing — processes the transcribed text