Sushant Kumar

GPT-2: A Deep Dive

GPTs (Generative Pre-trained Transformers) have become a household name, with ChatGPT seeping into our daily lives. They have been used to generate text, translate languages, answer questions, and even write code. In this blog post, we will take a deep dive into GPT-2 (Radford et al., 2019), a large language model developed by OpenAI, and explore its tokenization, architecture, and training.

Let's get started! This post is broken down into the following sections:

  1. Tokenization
  2. Architecture
  3. Training

1. Tokenization: A BPE Tokenizer

Like OpenAI's other LLMs, GPT-2 uses a byte pair encoding (BPE) tokenizer, which is a type of subword tokenization. The tokenizer splits text into subword units, each of which is assigned an index from a fixed vocabulary. Because GPT-2's BPE operates on bytes, a word that is not in the vocabulary as a whole can always be split into smaller pieces that are, so the model never has to fall back on an "unknown" token.
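
To make the merge idea concrete, here is a minimal sketch of BPE on a toy corpus. It is not GPT-2's actual tokenizer: it works on characters rather than bytes, skips the pre-tokenization step, and learns only four merges, and the function names (`train_bpe`, `encode_word`, and so on) are made up for illustration.

```python
from collections import Counter


def get_pair_counts(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a whitespace-split toy corpus."""
    words = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(words)
        if not counts:
            break
        best = max(counts, key=counts.get)  # most frequent adjacent pair
        words = merge_pair(words, best)
        merges.append(best)
    return merges


def encode_word(word, merges):
    """Apply the learned merges, in training order, to a single word."""
    symbols = {tuple(word): 1}
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(next(iter(symbols)))


merges = train_bpe("low low low lower lowest newer newer wider", num_merges=4)
print(merges)
print(encode_word("lowly", merges))  # ['low', 'l', 'y']
```

The word "lowly" never appears in the toy corpus, yet it still encodes cleanly: the learned "low" piece is reused and the rest falls back to single characters. GPT-2 does the same thing at a much larger scale, with bytes as the fallback.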

The vocabulary size of GPT-2 is 50,257 tokens: 256 base byte tokens, 50,000 learned merges (which cover common words and subwords), and one special end-of-text token.

This tokenizer is available in OpenAI's 'tiktoken' package.

GPT-2 Tokenizer
import tiktoken
# Load the GPT-2 tokenizer
tokenizer = tiktoken.encoding_for_model('gpt-2')
# Tokenize a text sequence
text = "GPT2 is a large language model developed by OpenAI."
tokens = tokenizer.encode(text)
print(f"Tokens: {tokens}")
# Output: Tokens: [38, 11571, 17, 318, 257, 1588, 3303, 2746, 4166, 416, 4946, 20185, 13]
vocab_size = tokenizer.n_vocab
print(f"Vocabulary Size: {vocab_size}")
# Output: Vocabulary Size: 50257

You can play around with the tokenizer to see how different inputs are split. Just remember that the token IDs are tied to this vocabulary: a model trained with a different tokenizer will map the same IDs to different text.

2. Architecture: Transformer-Based Model

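GPT-2 is a multi-layer, decoder-only transformer model, based on the original Transformer proposed by Vaswani et al. (2017). It consists of a stack of identical decoder layers, each of which has two sub-layers. The first sub-layer is a masked multi-head self-attention mechanism, in which each position can attend only to itself and earlier positions, and the second sub-layer is a position-wise feed-forward network. Since there is no encoder, the cross-attention sub-layer of the original Transformer decoder is dropped.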
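
The masking in the first sub-layer is easiest to see in code. Below is a minimal sketch of single-head causal self-attention using only the standard library. It is a simplification, not GPT-2's implementation: the learned query/key/value projections, the multiple heads, layer normalization, residual connections, and the feed-forward sub-layer are all left out, and the input vectors are used directly as queries, keys, and values (an assumption made purely for illustration).

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_self_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask.

    Each argument is a list of per-position vectors. Position t only
    attends to positions 0..t, never to later ones.
    """
    d_k = len(keys[0])
    seq_len = len(queries)
    outputs, all_weights = [], []
    for t, q in enumerate(queries):
        # Causal mask: score only positions up to and including t.
        scores = [
            sum(qi * ki for qi, ki in zip(q, keys[s])) / math.sqrt(d_k)
            for s in range(t + 1)
        ]
        weights = softmax(scores)
        # Output is the attention-weighted sum of the visible value vectors.
        out = [
            sum(w * values[s][j] for s, w in enumerate(weights))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
        # Pad with zeros so every row has seq_len entries (masked positions).
        all_weights.append(weights + [0.0] * (seq_len - t - 1))
    return outputs, all_weights


# Toy input: three positions, two-dimensional vectors.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_self_attention(x, x, x)
for row in weights:
    print([round(w, 3) for w in row])
```

The printed attention matrix is lower-triangular: the first position can only look at itself, while the last position spreads its attention over the whole prefix. This is what lets the model be trained on every position of a sequence at once without peeking at future tokens.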

3. Training: Autoregressive Language Modeling

GPT-2 was trained using unsupervised learning on a large corpus of text scraped from web pages (the WebText dataset). The model was trained to predict the next token in a sequence of text, given all the previous tokens. This is known as the autoregressive language modeling objective.
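
In practice, "predict the next token" means minimizing the average negative log-likelihood of each token given its prefix. The sketch below computes that loss for a toy four-token vocabulary. The logits are hand-written stand-ins for a model's output, and `next_token_loss` is a hypothetical helper, not part of any library.

```python
import math


def log_softmax(logits):
    """Log-probabilities from raw scores, computed stably."""
    peak = max(logits)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_total for x in logits]


def next_token_loss(logits_per_position, token_ids):
    """Mean negative log-likelihood of each token given the tokens before it.

    logits_per_position[t] holds the scores over the vocabulary after the
    model has read token_ids[: t + 1]; its target is token_ids[t + 1].
    The last position has no target and is ignored.
    """
    targets = token_ids[1:]
    predictions = logits_per_position[: len(targets)]
    losses = [
        -log_softmax(logits)[target]
        for logits, target in zip(predictions, targets)
    ]
    return sum(losses) / len(losses)


token_ids = [0, 2, 1, 3]  # a toy sequence over a vocabulary of 4 tokens

# A model that knows nothing: equal scores for every token at every position.
uniform_logits = [[0.0] * 4 for _ in token_ids]
uniform_loss = next_token_loss(uniform_logits, token_ids)

# A model that strongly favors the correct next token at each position.
confident_logits = [
    [5.0 if v == nxt else 0.0 for v in range(4)] for nxt in token_ids[1:]
] + [[0.0] * 4]
confident_loss = next_token_loss(confident_logits, token_ids)

print(f"uniform: {uniform_loss:.4f}, confident: {confident_loss:.4f}")
```

With uniform scores the loss equals log(4), the cost of guessing among four tokens; as the model puts more probability on the correct next token, the loss falls toward zero. Note the shift by one: the inputs are the tokens up to position t and the target is the token at position t + 1.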


  1. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin (2017). Attention Is All You Need. arXiv:1706.03762