Whisper: Transformer for Speech Recognition

Whisper is an automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web. Training on such a large and diverse dataset improves robustness to accents, background noise, and technical language. It also enables transcription in multiple languages, as well as translation from those languages into English. OpenAI has open-sourced the models and inference code to serve as a foundation for building useful applications and for further research on robust speech processing.

Model Architecture

Whisper uses an encoder-decoder Transformer architecture (Vaswani et al., 2017) [2]. All audio is re-sampled to 16,000 Hz, and an 80-channel log-magnitude Mel spectrogram representation is computed on 25-millisecond windows with a stride of 10 milliseconds.
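To make the front-end numbers concrete, the short sketch below converts the window and stride from milliseconds to samples and estimates the spectrogram shape for a clip. The 10-second clip length and the choice to emit one frame per stride (padding the signal so no partial frame is dropped) are assumptions made for illustration; they are not stated above.

```python
from typing import Tuple

SAMPLE_RATE = 16_000   # Hz, from the text
WINDOW_MS = 25         # analysis window length, from the text
STRIDE_MS = 10         # hop between consecutive windows, from the text
N_MELS = 80            # number of Mel channels, from the text


def ms_to_samples(ms: int, sample_rate: int = SAMPLE_RATE) -> int:
    """Convert a duration in milliseconds to a whole number of samples."""
    return sample_rate * ms // 1000


def spectrogram_shape(n_samples: int) -> Tuple[int, int]:
    """Return (n_mels, n_frames) for a clip of n_samples samples.

    Assumption: one frame is emitted per stride, i.e. the signal is
    padded so that the final partial frame is kept.
    """
    hop = ms_to_samples(STRIDE_MS)
    n_frames = -(-n_samples // hop)  # ceiling division
    return N_MELS, n_frames


window = ms_to_samples(WINDOW_MS)             # samples per window
hop = ms_to_samples(STRIDE_MS)                # samples per stride
shape = spectrogram_shape(10 * SAMPLE_RATE)   # hypothetical 10-second clip
print(window, hop, shape)  # prints: 400 160 (80, 1000)
```

Because the 25 ms window is longer than the 10 ms stride, consecutive windows overlap by 240 samples.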

For feature normalization, the authors globally scale the input to lie between -1 and 1, with approximately zero mean across the pre-training dataset.
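One possible way to realize this kind of scaling is sketched below: take the base-10 logarithm of the Mel power values, limit the dynamic range below the peak, then shift and rescale. The function name and the constants (a floor of 1e-10, a range of 8, an offset and scale of 4) are illustrative assumptions; the text above only states the target range and the approximately zero mean.

```python
import math
from typing import List


def normalize_log_mel(
    power: List[List[float]],
    dynamic_range: float = 8.0,  # assumed: log10 units kept below the peak
    offset: float = 4.0,         # assumed shift
    scale: float = 4.0,          # assumed divisor
) -> List[List[float]]:
    """Map Mel power values to a roughly [-1, 1] range (hypothetical helper)."""
    floor = 1e-10  # avoids log10(0)
    log_spec = [[math.log10(max(v, floor)) for v in row] for row in power]
    peak = max(max(row) for row in log_spec)
    # Clip everything more than `dynamic_range` below the loudest bin.
    clipped = [[max(v, peak - dynamic_range) for v in row] for row in log_spec]
    return [[(v + offset) / scale for v in row] for row in clipped]


# Toy 2x2 "spectrogram" whose loudest bin has power 1.0 (log10 = 0).
features = normalize_log_mel([[1.0, 1e-3], [1e-12, 1e-6]])
print(features)
```

With these constants the output spans at most 2 units, and it lands exactly in [-1, 1] only when the peak log power is 0, which is why the result is best described as approximately scaled.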

Encoder

The encoder processes the input representation with a small stem consisting of two convolution layers with a filter width of 3 and the GELU activation function (Hendrycks & Gimpel, 2016), where the second convolution layer has a stride of two. Sinusoidal position embeddings are then added to the output of the stem, after which the encoder Transformer blocks are applied. The Transformer uses pre-activation residual blocks (Child et al., 2019), and a final layer normalization is applied to the encoder output.
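The effect of the stem on sequence length, and the shape of the position table added afterwards, can be sketched in a few lines. The padding of 1 on each convolution and the interleaved sine/cosine layout (following the formulation in Vaswani et al.) are assumptions for illustration; the text above specifies only the filter width, the stride, and that the embeddings are sinusoidal.

```python
import math
from typing import List


def conv1d_out_len(n_in: int, kernel: int = 3, stride: int = 1, padding: int = 1) -> int:
    """Standard output-length formula for a 1-D convolution (no dilation)."""
    return (n_in + 2 * padding - kernel) // stride + 1


def stem_out_len(n_frames: int) -> int:
    """Length after two width-3 convolutions, the second with stride 2."""
    after_first = conv1d_out_len(n_frames, stride=1)
    return conv1d_out_len(after_first, stride=2)


def sinusoidal_embeddings(length: int, width: int, max_timescale: float = 10_000.0) -> List[List[float]]:
    """Fixed position table: even columns hold sines, odd columns cosines."""
    if width % 2 != 0:
        raise ValueError("width must be even")
    table = []
    for pos in range(length):
        row = []
        for i in range(width // 2):
            angle = pos / (max_timescale ** (2 * i / width))
            row.extend((math.sin(angle), math.cos(angle)))
        table.append(row)
    return table


n_positions = stem_out_len(1000)             # hypothetical 1000 input frames
table = sinusoidal_embeddings(n_positions, 8)  # width 8 chosen for brevity
print(n_positions, len(table), len(table[0]))  # prints: 500 500 8
```

Under these assumptions the strided second convolution halves the number of time steps, so the Transformer blocks attend over half as many positions as there are spectrogram frames.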

Decoder

The decoder uses learned position embeddings and tied input-output token representations (Press & Wolf, 2017). The encoder and decoder have the same width and the same number of Transformer blocks. The figure of 12 blocks each corresponds to the small model; other model sizes use fewer or more blocks.
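Tied input-output token representations mean that a single embedding table serves two roles: looking up input token vectors, and scoring the decoder's hidden state against every token to produce output logits. The toy class below is a minimal sketch of that idea; the class name, the three-token vocabulary, and the width of 2 are hypothetical.

```python
from typing import List


class TiedEmbedding:
    """One table shared by the input lookup and the output projection."""

    def __init__(self, table: List[List[float]]) -> None:
        self.table = table  # shape: vocab_size x width

    def embed(self, token_id: int) -> List[float]:
        """Input side: return the row for a token."""
        return self.table[token_id]

    def logits(self, hidden: List[float]) -> List[float]:
        """Output side: dot the hidden state with every row of the same table."""
        return [sum(h * w for h, w in zip(hidden, row)) for row in self.table]


emb = TiedEmbedding([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(emb.embed(2))              # prints: [1.0, 1.0]
print(emb.logits([2.0, -1.0]))   # prints: [2.0, -1.0, 1.0]
```

Sharing the table removes a separate vocabulary-sized output matrix, which is a sizeable saving when the vocabulary is large relative to the model width.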

Figure 1 summarizes the model architecture.

Whisper Architecture

Figure 1: Whisper is a Transformer-based encoder-decoder model. The encoder blocks are based on the original implementation of Vaswani et al. The configuration described above has 12 blocks in each of the encoder and decoder.

Tokenization

Whisper uses the same byte-level BPE text tokenizer as GPT-2 (Sennrich et al., 2015; Radford et al., 2019) for the English-only models. For the multilingual models, the vocabulary is refit (keeping the same size) to avoid excessive fragmentation in other languages, since the GPT-2 BPE vocabulary is English-only.
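The fragmentation problem is easiest to see at the byte level, before any BPE merges are applied. The sketch below counts UTF-8 bytes for a few sample strings; it is not a BPE implementation, and the sample strings and helper name are chosen purely for illustration. A vocabulary whose merges were learned on English text has few merges that cover the multi-byte sequences of other scripts, so such text stays close to this worst case.

```python
from typing import List


def byte_tokens(text: str) -> List[int]:
    """Base units of a byte-level tokenizer: the UTF-8 bytes of the text."""
    return list(text.encode("utf-8"))


samples = [
    "hello",        # ASCII: one byte per character
    "h\u00e9llo",   # 'e' with acute accent (U+00E9) takes two bytes
    "こんにちは",    # hiragana: three bytes per character
]

for text in samples:
    print(len(text), "characters ->", len(byte_tokens(text)), "bytes")
# prints:
# 5 characters -> 5 bytes
# 5 characters -> 6 bytes
# 5 characters -> 15 bytes
```

Refitting the merges on multilingual text lets frequent byte sequences in other languages collapse into single tokens, while keeping the vocabulary size unchanged.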

References

[1] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever. Robust Speech Recognition via Large-Scale Weak Supervision. arXiv:2212.04356, 2022.
[2] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin. Attention Is All You Need. arXiv:1706.03762, 2017.