Sushant Kumar

Grouped Query Attention

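Grouped Query Attention (GQA) is an attention mechanism that reduces the inference cost of transformer models while keeping quality close to standard multi-head attention. It divides the query heads of the self-attention layer into groups, and all query heads in a group share a single key head and value head. With one group, GQA reduces to Multi-Query Attention, where every query head shares the same key and value head; with as many groups as query heads, it is ordinary Multi-Head Attention. Because fewer key and value heads are stored, the key-value cache used during autoregressive decoding is smaller and less memory bandwidth is spent loading it at each step, which matters most for long sequences and large batches.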


Figure 1: Comparison of Grouped Query Attention, Multi-Query Attention, and Multi-Head Attention. Image Source: arXiv:2305.13245
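The grouping compared in the figure can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the formulation in arXiv:2305.13245 in which each group of query heads shares one key/value head. The function name `grouped_query_attention`, the `(heads, seq_len, head_dim)` layout, and the toy sizes are illustrative choices rather than anything from the paper, and batching, causal masking, and the learned projections are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def grouped_query_attention(q, k, v):
    """Unmasked attention where groups of query heads share K/V heads.

    q:    (n_q_heads,  seq_len, head_dim)
    k, v: (n_kv_heads, seq_len, head_dim), n_kv_heads divides n_q_heads
    """
    n_q_heads, _, head_dim = q.shape
    n_kv_heads = k.shape[0]
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be a multiple of n_kv_heads")
    group_size = n_q_heads // n_kv_heads

    # Repeat each K/V head so that `group_size` consecutive query heads
    # attend over the same keys and values.
    k_rep = np.repeat(k, group_size, axis=0)
    v_rep = np.repeat(v, group_size, axis=0)

    scores = q @ k_rep.transpose(0, 2, 1) / np.sqrt(head_dim)
    weights = softmax(scores, axis=-1)
    return weights @ v_rep  # (n_q_heads, seq_len, head_dim)

rng = np.random.default_rng(0)
n_q_heads, seq_len, head_dim = 8, 5, 16
q = rng.standard_normal((n_q_heads, seq_len, head_dim))

# 8 K/V heads = multi-head, 2 = grouped-query, 1 = multi-query.
for n_kv_heads in (8, 2, 1):
    k = rng.standard_normal((n_kv_heads, seq_len, head_dim))
    v = rng.standard_normal((n_kv_heads, seq_len, head_dim))
    out = grouped_query_attention(q, k, v)
    print(n_kv_heads, out.shape, k.size + v.size)
```

The output shape is the same in all three settings; only the number of stored key and value elements (the last printed column) changes, shrinking in proportion to the number of key/value heads.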


  1. Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, Sumit Sanghai. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. arXiv:2305.13245
  2. Noam Shazeer. Fast Transformer Decoding: One Write-Head is All You Need. arXiv:1911.02150
  3. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin. Attention Is All You Need. arXiv:1706.03762