Whisper is OpenAI's open-source speech recognition model. It transcribes audio to text in 99 languages with near-human accuracy, handles accents and background noise better than earlier open-source models, and runs for free on your own hardware. It also powers many of the transcription features built into other AI tools you may already use.
Whisper is a speech recognition model — it listens to audio and produces accurate text. Released by OpenAI in September 2022 as an open-source model, it marked a step change in what free, accessible speech-to-text could do. Before Whisper, high-quality transcription required expensive proprietary services. Whisper made near-professional quality transcription available to anyone with a computer.
It is not a product like ChatGPT. There is no Whisper app or website. It is a model — a trained neural network — that developers and technical users can run directly, or that powers transcription inside other products you may already use.
Where you already use Whisper without knowing it: Otter.ai, Descript, Adobe Podcast, many podcast transcription services, and OpenAI's own speech-to-text API all use Whisper or are built on it. It is the engine underneath much of modern AI transcription.
Whisper comes in five sizes trading accuracy for speed: Tiny (fastest, lowest accuracy), Base, Small, Medium (recommended for most use cases), and Large-v3 (highest accuracy, what the API uses). For most professional transcription, medium locally or large via the API gives the best results.
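If you run Whisper locally (Option 2 below), the open-source openai-whisper package can list every checkpoint by name. A quick sketch, assuming the package is installed; the printed names reflect recent releases:

```python
import whisper

# List every checkpoint name the package can download.
# The .en variants are English-only and slightly more accurate on English audio.
print(whisper.available_models())
# e.g. ['tiny.en', 'tiny', 'base.en', 'base', 'small.en', 'small',
#       'medium.en', 'medium', 'large-v1', 'large-v2', 'large-v3', ...]
```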
Whisper transcribes — it does not understand. It will not summarise or take action on transcripts. It is designed for batch file processing, not live streaming. For real-time transcription, Otter.ai or Google Live Transcribe are better suited. It also does not identify who is speaking — no speaker diarisation built in.
Option 1 — OpenAI API (easiest, $0.006/min): Send an audio file to OpenAI's transcription endpoint, receive a transcript. No setup, no GPU. Uses large-v3 automatically. A minimal call is sketched after this list.
Option 2 — Run locally (free, needs Python): Install via pip, run one command with your audio file. Runs on CPU (slow) or GPU (fast). Best for high volumes where API cost adds up. See the second sketch after this list.
Option 3 — GUI apps (no coding): MacWhisper (Mac), Whisper Desktop (Windows), Buzz (cross-platform) wrap Whisper with a UI. Drop in an audio file, get a transcript.
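A minimal sketch of Option 1, using the official openai Python package (v1 SDK). It assumes OPENAI_API_KEY is set in your environment; the file name is illustrative:

```python
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment

# Send one audio file (under the 25 MB limit) and get plain text back
with open("meeting.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",  # OpenAI's hosted Whisper endpoint
        file=audio_file,
    )

print(transcript.text)
```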
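And a sketch of Option 2 with the open-source openai-whisper package (pip install -U openai-whisper; it also expects ffmpeg on your PATH):

```python
import whisper

# Downloads the checkpoint on first run; pass device="cuda" to use a GPU
model = whisper.load_model("medium")

# ffmpeg converts the input format behind the scenes
result = model.transcribe("meeting.mp3")

print(result["text"])              # the full transcript
for seg in result["segments"]:     # per-segment timestamps
    print(f"{seg['start']:6.1f}s  {seg['text'].strip()}")
```

The same package installs a whisper command-line tool, so whisper meeting.mp3 --model medium does the equivalent from the shell.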
Whisper is a sequence-to-sequence transformer trained on 680,000 hours of multilingual audio from the internet. The encoder processes a mel spectrogram of the audio; the decoder generates text autoregressively. The same architecture handles transcription, translation, and language detection through special tokens — not separate model variants. Published in "Robust Speech Recognition via Large-Scale Weak Supervision" (Radford et al., OpenAI, 2022, arXiv:2212.04356).
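That task switching is visible in the open-source package: the same checkpoint transcribes, translates, or detects language depending on the task token the decoder is prompted with. A sketch, with an illustrative French audio file:

```python
import whisper

model = whisper.load_model("small")

# One checkpoint, two tasks, steered by a special token in the decoder prompt
native  = model.transcribe("interview_fr.mp3", task="transcribe")  # French in, French out
english = model.transcribe("interview_fr.mp3", task="translate")   # French in, English out

print(native["language"])  # language detection comes for free, e.g. "fr"
print(english["text"])
```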
Whisper large-v3 (November 2023) is the current best-performing version, with roughly a 10-20% reduction in word error rate compared with large-v2. The OpenAI API uses large-v3 by default. On the standard English benchmark LibriSpeech test-clean, large-v3 achieves a word error rate of approximately 2.7%.
The OpenAI API accepts files up to 25MB. For longer files: split audio into chunks using pydub, transcribe each chunk, concatenate results. Split at silence points rather than mid-word; pydub's built-in silence helpers (detect_silence, split_on_silence) take care of finding them, as in the sketch below.
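A sketch of that chunking loop, combining pydub with the API option above. The silence thresholds are illustrative starting points, not tuned values:

```python
from openai import OpenAI
from pydub import AudioSegment
from pydub.silence import split_on_silence

client = OpenAI()
audio = AudioSegment.from_file("long_interview.mp3")

# Cut at pauses so no chunk ends mid-word
chunks = split_on_silence(
    audio,
    min_silence_len=700,             # a pause of at least 0.7 s counts as silence
    silence_thresh=audio.dBFS - 16,  # relative to the file's average loudness
    keep_silence=300,                # keep 0.3 s of padding so words aren't clipped
)

parts = []
for i, chunk in enumerate(chunks):
    path = f"chunk_{i:03d}.mp3"
    chunk.export(path, format="mp3")  # each exported chunk must stay under 25 MB
    with open(path, "rb") as f:
        parts.append(
            client.audio.transcriptions.create(model="whisper-1", file=f).text
        )

transcript = " ".join(parts)
```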
Whisper does not separate speakers. The workaround: combine Whisper with pyannote.audio (open-source diarisation) and align the outputs. Commercial services like Otter.ai add their own diarisation layer on top of Whisper.
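A rough sketch of that combination. It assumes pyannote's pretrained speaker-diarization pipeline (a gated Hugging Face model that requires an access token; the token string here is a placeholder) and uses a naive longest-overlap rule to label Whisper's timestamped segments; production alignment is usually more careful:

```python
import whisper
from pyannote.audio import Pipeline

# Who spoke when (pyannote)
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="YOUR_HF_TOKEN",  # placeholder: your Hugging Face token
)
diarization = pipeline("meeting.wav")

# What was said, with timestamps (Whisper)
model = whisper.load_model("medium")
result = model.transcribe("meeting.wav")

# Naive alignment: give each Whisper segment the speaker whose
# diarised turns overlap it for the longest total time
for seg in result["segments"]:
    overlap = {}
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        shared = min(turn.end, seg["end"]) - max(turn.start, seg["start"])
        if shared > 0:
            overlap[speaker] = overlap.get(speaker, 0.0) + shared
    who = max(overlap, key=overlap.get) if overlap else "UNKNOWN"
    print(f"[{who}] {seg['text'].strip()}")
```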
Source note: Technical specifications from the Whisper research paper and OpenAI API documentation. Pricing from openai.com/pricing, April 2026.