Voice & Audio

OpenAI Whisper

Whisper is OpenAI's open-source speech recognition model. It transcribes audio to text in 99 languages with near-human accuracy, handles accents and background noise more robustly than earlier open-source models, and runs free on your own hardware. It also powers many of the transcription features built into other AI tools you may already use.

What Whisper is

Whisper is a speech recognition model — it listens to audio and produces accurate text. Released by OpenAI in September 2022 as an open-source model, it marked a step change in what free, accessible speech-to-text could do. Before Whisper, high-quality transcription required expensive proprietary services. Whisper made near-professional quality transcription available to anyone with a computer.

It is not a product like ChatGPT. There is no Whisper app or website. It is a model — a trained neural network — that developers and technical users can run directly, or that powers transcription inside other products you may already use.

Where you already use Whisper without knowing it: Otter.ai, Descript, Adobe Podcast, many podcast transcription services, and OpenAI's own speech-to-text API all use Whisper or are based on it. It is the engine underneath most modern AI transcription.

What it does well

  • Multi-language transcription — 99 languages supported, language detected automatically
  • Accent and dialect handling — significantly more robust than earlier models on non-standard accents
  • Noise tolerance — handles moderate background noise and variable recording quality
  • Timestamps — word-level and segment-level timestamps for subtitles and captions (see the example after this list)
  • Translation — translates audio directly to English text in one step
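
For a concrete look at the timestamp output, here is a minimal sketch using the open-source openai-whisper package; the model size and file name are placeholder choices:

```python
import whisper

# Placeholder model size; any of tiny/base/small/medium/large works here
model = whisper.load_model("small")

# word_timestamps=True additionally attaches per-word timings to each segment
result = model.transcribe("clip.mp3", word_timestamps=True)

for seg in result["segments"]:
    print(f'{seg["start"]:7.2f} -> {seg["end"]:7.2f}  {seg["text"].strip()}')
```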

Five model sizes

Whisper comes in five sizes trading accuracy for speed: Tiny (fastest, lowest accuracy), Base, Small, Medium (recommended for most use cases), and Large-v3 (highest accuracy, what the API uses). For most professional transcription, medium locally or large via the API gives the best results.

What it does not do

Whisper transcribes — it does not understand. It will not summarise or take action on transcripts. It is designed for batch file processing, not live streaming. For real-time transcription, Otter.ai or Google Live Transcribe are better suited. It also does not identify who is speaking — no speaker diarisation built in.

Three ways to use Whisper

Option 1 — OpenAI API (easiest, $0.006/min): Send an audio file to OpenAI's transcription endpoint, receive a transcript. No setup, no GPU. Uses large-v3 automatically.
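
A minimal sketch of this option, assuming the openai Python SDK (v1-style client) is installed and OPENAI_API_KEY is set in the environment; the file name is a placeholder:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# One request per file; the API rejects files over 25MB (see below)
with open("meeting.mp3", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",  # OpenAI's hosted Whisper model
        file=f,
    )

print(transcript.text)
```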

Option 2 — Run locally (free, needs Python): Install via pip, run one command with your audio file. CPU (slow) or GPU (fast). Best for high volumes where API cost adds up.
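
A sketch of the local route, assuming the open-source openai-whisper package (pip install -U openai-whisper, plus ffmpeg on the system); the file name is a placeholder:

```python
import whisper

# "medium" is the recommended balance from above; swap in "tiny",
# "base", "small", or "large" to trade accuracy against speed
model = whisper.load_model("medium")

result = model.transcribe("interview.mp3")
print(result["text"])
```

The same install also provides a command-line tool, so the one-command equivalent is: whisper interview.mp3 --model medium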

Option 3 — GUI apps (no coding): MacWhisper (Mac), Whisper Desktop (Windows), Buzz (cross-platform) wrap Whisper with a UI. Drop in an audio file, get a transcript.

Prompts for working with Whisper outputs

Clean up a raw transcript
Here is a raw transcript from an audio recording. Clean it up: fix obvious transcription errors, add paragraph breaks at natural topic shifts, remove filler words (um, uh, you know), and format it for reading. Keep the meaning and wording intact otherwise: [paste transcript]

Summarise a meeting transcript
Here is the transcript of a [meeting / call / interview]. Produce: (1) a 3-sentence summary, (2) key decisions made, (3) action items with owner names if mentioned, (4) any unresolved questions. Transcript: [paste]

Generate subtitles from a Whisper transcript
I have a Whisper transcript with timestamps. Convert it to SRT subtitle format. Each subtitle should be 1-2 lines maximum, 42 characters per line maximum, timed to match the timestamps: [paste transcript]

Extract quotes from an interview
Here is a transcript of an interview with [person/role]. Extract the 5 most quotable statements — clear, self-contained, usable without extra context. Format each as a direct quote with a note on the topic: [paste transcript]

Create a blog post from a podcast transcript
Here is the transcript of a podcast episode about [topic]. Write a 600-800 word blog post based on the key ideas, structured with introduction, 3-4 main sections, and conclusion. Use quotes from the transcript where they strengthen points: [paste transcript]

Use Whisper API with Python
Show me a complete Python script that: (1) takes an audio file path as input, (2) sends it to the OpenAI Whisper API, (3) handles files over 25MB by splitting them, (4) saves the transcript as a .txt file. Use the openai Python library.

Run Whisper locally — command reference
I want to transcribe audio files using Whisper locally on [Mac/Windows/Linux]. Walk me through: (1) installing Whisper and dependencies, (2) transcribing a single file with the medium model, (3) transcribing a whole folder, (4) outputting SRT subtitle files.

Identify speakers in a multi-person transcript
This Whisper transcript has multiple speakers but no speaker labels. Based on context clues — speaking styles, name references, topic expertise — identify who is most likely speaking each section. Label [Speaker A], [Speaker B], etc., and note identifying clues: [paste transcript]

Architecture

Whisper is a sequence-to-sequence transformer trained on 680,000 hours of multilingual audio from the internet. The encoder processes a mel spectrogram of the audio; the decoder generates text autoregressively. The same architecture handles transcription, translation, and language detection through special tokens — not separate model variants. Published in "Robust Speech Recognition via Large-Scale Weak Supervision" (Radford et al., OpenAI, 2022, arXiv:2212.04356).
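
The open-source package exposes this pipeline directly. The following sketch is adapted from the usage example in the Whisper repository; the file name is a placeholder:

```python
import whisper

model = whisper.load_model("base")

# Encoder input: a log-mel spectrogram of (up to) 30 seconds of audio
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# The same model handles language detection and decoding,
# steered by special tokens rather than separate variants
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")

result = whisper.decode(model, mel, whisper.DecodingOptions())
print(result.text)
```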

Model versions

Whisper large-v3 (November 2023) is the current best-performing version with approximately 10-20% WER reduction versus v2. The OpenAI API uses large-v3 by default. On standard English benchmarks (LibriSpeech clean), large-v3 achieves approximately 2.7% word error rate.

The 25MB API limit and workarounds

The OpenAI API accepts files up to 25MB. For longer files: split the audio into chunks with pydub, transcribe each chunk, and concatenate the results. Split at silence points rather than mid-word; pydub's own silence utilities (pydub.silence.detect_silence) can find suitable cut points, and pyannote.audio's voice activity detection can help with noisier recordings.
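
A sketch of that workflow, assuming pydub and the openai SDK are installed; the chunk length, silence thresholds, and file names are illustrative choices rather than fixed requirements:

```python
from openai import OpenAI
from pydub import AudioSegment
from pydub.silence import detect_silence

client = OpenAI()
CHUNK_MS = 10 * 60 * 1000  # ~10-minute chunks keep MP3 exports well under 25MB

def transcribe_long(path: str) -> str:
    audio = AudioSegment.from_file(path)
    texts, start = [], 0
    while start < len(audio):
        end = min(start + CHUNK_MS, len(audio))
        # Look for silence in the last 5 seconds of the chunk so the
        # cut lands between words rather than through them
        if end < len(audio):
            window_start = max(end - 5000, start)
            silences = detect_silence(audio[window_start:end],
                                      min_silence_len=300, silence_thresh=-40)
            if silences and window_start + silences[-1][0] > start:
                end = window_start + silences[-1][0]
        audio[start:end].export("chunk.mp3", format="mp3")
        with open("chunk.mp3", "rb") as f:
            part = client.audio.transcriptions.create(model="whisper-1", file=f)
        texts.append(part.text)
        start = end
    return " ".join(texts)

print(transcribe_long("lecture.mp3"))
```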

Speaker diarisation

Whisper does not separate speakers. The workaround: combine Whisper with pyannote.audio (open-source diarisation) and align the outputs. Commercial services like Otter.ai add their own diarisation layer on top of Whisper.
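
A sketch of that combination, assuming pyannote.audio 3.x and a Hugging Face token with access to its gated diarisation model. Matching each Whisper segment to the speaker active at its midpoint is one simple alignment strategy, not the only one:

```python
import whisper
from pyannote.audio import Pipeline

diarizer = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="hf_...",  # your Hugging Face token (gated model)
)
model = whisper.load_model("medium")

audio_path = "meeting.wav"
diarization = diarizer(audio_path)          # who spoke when
transcript = model.transcribe(audio_path)   # what was said, with timestamps

def speaker_at(t: float) -> str:
    # Return the speaker whose turn contains time t, if any
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        if turn.start <= t <= turn.end:
            return speaker
    return "UNKNOWN"

for seg in transcript["segments"]:
    midpoint = (seg["start"] + seg["end"]) / 2
    print(f'[{speaker_at(midpoint)}] {seg["text"].strip()}')
```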

Official resources

Source note: Technical specifications from the Whisper research paper and OpenAI API documentation. Pricing from openai.com/pricing, April 2026.