Sushant Kumar

BERT: Bidirectional Encoder Representations from Transformers

BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained encoder-only transformer model developed by Google. It has been designed to understand the context of words in a sentence, and it has been shown to achieve state-of-the-art performance on a wide range of natural language processing (NLP) tasks.

BERT Architecture

BERT is a multi-layer bidirectional transformer encoder. The model architecture consists of a stack of identical layers, each with multiple self-attention heads and feed-forward neural networks. The input to BERT is a sequence of tokens, which are first embedded into vectors and then processed by the transformer layers. The final output of the model is a sequence of vectors, one for each input token.

BERT Architecture

Figure 1: BERT Architecture is a multi-layer bidirectional transformer encoder based on the original implementation of Vaswani et al. BERTBASE has 12 layers while BERTLARGE has 24 layers.


  1. Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805
  2. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin. Attention is All You Need. arXiv:1706.03762