Whisper and Audio
Transformers hear sounds
Audio as Sequences
Raw audio is just a sequence of numbers—pressure measurements sampled many times per second. A typical recording uses 16,000 samples per second. That means 30 seconds of audio contains 480,000 numbers.
Processing half a million timesteps directly with attention would be computationally prohibitive. Remember, attention scales quadratically: every position attends to every other position. At 480,000 positions, that's roughly 230 billion attention scores per layer.
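The arithmetic is easy to check. A minimal back-of-the-envelope sketch:

```python
# Back-of-the-envelope cost of attending directly over raw audio samples.
sample_rate = 16_000            # samples per second
clip_seconds = 30
n_samples = sample_rate * clip_seconds
print(n_samples)                # 480000 positions

# Self-attention scores every position against every other position,
# so the work grows quadratically with sequence length.
attention_pairs = n_samples ** 2
print(f"{attention_pairs:.3g}")  # 2.3e+11, i.e. ~230 billion per layer
```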
We need a more compact representation. The solution comes from signal processing: spectrograms.
Spectrograms
A spectrogram transforms raw audio into something much more useful. Instead of representing audio as pressure values over time, a spectrogram shows which frequencies are present at each moment.
From Waveform to Spectrogram
[Interactive figure: a waveform (time domain, pressure over time) side by side with its spectrogram (frequency domain, frequency × time × energy). Toggle different frequency components to see how they combine in the waveform and separate in the spectrogram. The spectrogram reveals which frequencies are present at each moment—essential for speech recognition.]
Think of it this way: a piano chord contains multiple notes played simultaneously. The raw waveform is a complex squiggle—the sum of all those frequencies. A spectrogram separates those frequencies out, showing you which notes are playing and how loud each one is.
The spectrogram is a 2D representation:
- X-axis: Time (divided into short windows)
- Y-axis: Frequency (from low to high)
- Brightness: Energy at that frequency and time
Suddenly, audio becomes an "image." And we already know how to process images with transformers—we treat them as sequences.
For Whisper, the input audio is converted to an 80-channel log-mel spectrogram. A 30-second clip becomes roughly 3,000 time frames, each with 80 frequency values. That's 3,000 positions instead of 480,000—manageable for attention.
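The frame count follows from Whisper's spectrogram settings. A minimal sketch, using Whisper's published hop length of 160 samples (10 ms at 16 kHz); the exact count also depends on STFT padding, which is ignored here:

```python
# Dimensions of the log-mel spectrogram Whisper consumes.
sample_rate = 16_000
clip_seconds = 30
hop_length = 160        # samples between adjacent analysis windows (10 ms)
n_mels = 80             # mel-frequency channels

n_samples = sample_rate * clip_seconds
n_frames = n_samples // hop_length
print(n_frames, n_mels)  # 3000 time frames, each with 80 frequency values
```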
Whisper's Architecture
Whisper uses the original encoder-decoder transformer architecture—the same structure introduced in "Attention Is All You Need."
[Interactive figure: Whisper's encoder-decoder architecture. Hover over each component to learn more. The encoder processes audio features while the decoder generates text token by token, using cross-attention to align speech with transcription.]
Here's how it works:
The Encoder processes the spectrogram:
- Takes the 80-channel spectrogram as input
- Uses two convolutional layers to downsample and create initial features
- Passes through transformer encoder blocks with self-attention
- Outputs a sequence of audio representations
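As a sanity check on the downsampling step, here is a sketch using the standard 1-D convolution output-length formula. The kernel, stride, and padding values follow Whisper's published configuration (kernel 3, padding 1; the first conv has stride 1, the second stride 2), so the transformer blocks actually attend over about 1,500 positions:

```python
# Sketch of the encoder front end's effect on sequence length.
def conv1d_out_len(n, kernel=3, stride=1, padding=1):
    """Standard 1-D convolution output-length formula."""
    return (n + 2 * padding - kernel) // stride + 1

n_frames = 3000                                       # spectrogram frames, 30 s clip
after_conv1 = conv1d_out_len(n_frames, stride=1)      # stride 1 keeps length: 3000
after_conv2 = conv1d_out_len(after_conv1, stride=2)   # stride 2 halves it: 1500
print(after_conv2)  # self-attention then runs over 1500 positions, not 480,000
```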
The Decoder generates text autoregressively:
- Starts with special tokens indicating the task (transcription, translation, etc.)
- Uses masked self-attention (can only see previous tokens)
- Uses cross-attention to look at the encoder's audio representations
- Predicts the next text token, then feeds it back as input
The key is the cross-attention. When generating each word, the decoder can look at any part of the encoded audio. Generating the word "hello" might attend to the audio frames containing /h/, /e/, /l/, /l/, /o/. The model learns these alignments automatically.
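The mechanism itself is just scaled dot-product attention with queries from the decoder and keys/values from the encoder. A minimal NumPy sketch with illustrative shapes (tiny feature dimension, random values, not Whisper's real weights):

```python
import numpy as np

# Cross-attention sketch: decoder token states query the encoder's audio frames.
rng = np.random.default_rng(0)
d = 8                                 # feature dimension (tiny, for illustration)
audio = rng.normal(size=(1500, d))    # encoder output: one vector per audio frame
queries = rng.normal(size=(5, d))     # decoder states for 5 text tokens

def cross_attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])          # (tokens, frames)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over audio frames
    return weights @ V, weights                      # weighted mix of audio vectors

out, w = cross_attention(queries, audio, audio)
print(out.shape)        # (5, 8): one audio-informed vector per text token
print(w.sum(axis=-1))   # each token's weights over the frames sum to 1
```

Each row of `w` is a distribution over audio frames: where that token "looks" in the recording.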
Notice the pattern: it's the same attention mechanism we've seen for text-to-text translation. The only difference is that one side speaks audio and the other speaks text.
What Whisper Can Do
Whisper is trained on 680,000 hours of multilingual audio with transcriptions. This massive dataset enables remarkable capabilities:
Speech Recognition: Convert spoken audio to text in the same language. This is the primary use case—Whisper achieves near-human accuracy on many benchmarks.
Translation: Transcribe audio in one language directly to text in another. You speak Spanish, Whisper outputs English text. The encoder-decoder architecture makes this natural—just change what the decoder is trained to output.
Language Detection: Identify which of 99 languages is being spoken. The model can classify the input language before transcription.
Timestamp Generation: Predict when each word was spoken. Whisper can output word-level timestamps, useful for subtitles and alignment.
All of these tasks use the same model with different prompting. Special tokens at the start of decoding tell Whisper what task to perform:
- <|transcribe|> for same-language transcription
- <|translate|> for translation to English
- <|en|>, <|es|>, etc. for language specification
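Putting these together, the decoder's prompt is just a short sequence of special tokens. The token strings below match Whisper's vocabulary (including the <|startoftranscript|> token that opens every prompt); the helper function itself is a hypothetical sketch, not Whisper's API:

```python
# Sketch of assembling Whisper's decoder prompt from special tokens.
def build_prompt(language: str, task: str) -> list:
    assert task in ("transcribe", "translate")
    return ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]

# Spanish speech in, English text out: declare the language, then the task.
print(build_prompt("es", "translate"))
# ['<|startoftranscript|>', '<|es|>', '<|translate|>']
```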
Why Transformers Work for Audio
Transformers excel at audio for the same reasons they excel at text: long-range dependencies and learned attention patterns.
Attention on Audio
[Interactive figure: a spectrogram overlaid with cross-attention weights, paired with the transcription. Click a word—when generating "Hello", the model attends to the audio frames containing that word's sounds. Cross-attention learns to align output words with their corresponding audio frames; notice how each word attends to a different region of the spectrogram.]
Consider transcribing "I went to the bank." Whether "bank" refers to a financial institution or a riverbank might depend on context from the beginning of the sentence—or even earlier. RNNs struggle with this; transformers handle it naturally.
Audio also benefits from parallel processing. RNNs must process audio sequentially, one frame at a time. Transformers process all spectrogram frames simultaneously, using attention to find relationships. This is not only faster on modern hardware—it also lets the model see the full context at every layer.
The attention patterns learned for audio are fascinating:
- Some heads attend to nearby frames—modeling local acoustic features
- Some heads attend to periodic patterns—capturing rhythm and prosody
- Some heads attend to semantically similar sounds—grouping phonemes together
The model discovers what matters, rather than having it hardcoded.
Key Takeaways
- Raw audio at 16kHz has 480,000 samples per 30 seconds—far too many for direct attention
- Spectrograms compress audio into a 2D representation: time × frequency × energy
- Whisper uses an encoder-decoder architecture: encoder processes audio, decoder generates text
- Cross-attention lets the decoder look at any part of the audio when generating each word
- A single model handles transcription, translation, language detection, and timestamping
- Transformers capture long-range dependencies in audio far more effectively than RNNs
- Attention patterns on spectrograms reveal local, periodic, and semantic groupings