Sushant Kumar

Grouped Query Attention

Grouped Query Attention (GQA) is an attention variant that interpolates between standard multi-head attention (MHA) and multi-query attention (MQA). Instead of giving every query head its own key and value head (as in MHA), or sharing a single key/value head across all query heads (as in MQA), GQA divides the query heads into groups and lets each group share one key/value head. This shrinks the key/value cache that must be kept around during autoregressive decoding, so inference runs nearly as fast as with MQA while quality stays close to MHA. The GQA paper also shows that existing multi-head checkpoints can be uptrained into GQA models with only a small amount of additional compute.
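To make the grouping concrete, here is a minimal sketch in PyTorch. It is not the implementation from the paper; the function name, argument layout, and head counts are illustrative, and it omits masking, dropout, and KV caching.

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(x, w_q, w_k, w_v, w_o, n_q_heads, n_kv_heads):
    """Minimal grouped-query attention over a batch of sequences.

    x:        (batch, seq_len, d_model) input activations
    w_q:      (d_model, n_q_heads * d_head) query projection
    w_k, w_v: (d_model, n_kv_heads * d_head) shared key/value projections
    w_o:      (n_q_heads * d_head, d_model) output projection
    n_q_heads must be divisible by n_kv_heads; each group of
    n_q_heads // n_kv_heads query heads attends to one K/V head.
    """
    batch, seq_len, d_model = x.shape
    d_head = w_q.shape[1] // n_q_heads
    group_size = n_q_heads // n_kv_heads

    # Project and split into heads: q has more heads than k/v.
    q = (x @ w_q).view(batch, seq_len, n_q_heads, d_head).transpose(1, 2)
    k = (x @ w_k).view(batch, seq_len, n_kv_heads, d_head).transpose(1, 2)
    v = (x @ w_v).view(batch, seq_len, n_kv_heads, d_head).transpose(1, 2)

    # Repeat each K/V head so every query head in its group shares it.
    k = k.repeat_interleave(group_size, dim=1)  # (batch, n_q_heads, seq, d_head)
    v = v.repeat_interleave(group_size, dim=1)

    # Standard scaled dot-product attention per query head.
    scores = q @ k.transpose(-2, -1) / d_head ** 0.5
    attn = F.softmax(scores, dim=-1)
    out = attn @ v                              # (batch, n_q_heads, seq, d_head)

    out = out.transpose(1, 2).reshape(batch, seq_len, n_q_heads * d_head)
    return out @ w_o
```

Setting `n_kv_heads = n_q_heads` recovers multi-head attention, and `n_kv_heads = 1` recovers multi-query attention; values in between (for example 8 query heads with 2 key/value heads) give the grouped variant shown in Figure 1.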


Figure 1: Comparison of Grouped Query Attention, Multi-Query Attention, and Multi-Head Attention. Image Source: arXiv:2305.13245

References

  1. Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, Sumit Sanghai. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. arXiv:2305.13245
  2. Noam Shazeer. Fast Transformer Decoding: One Write-Head is All You Need. arXiv:1911.02150
  3. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin. Attention is All You Need. arXiv:1706.03762