Gemma: Google's Family of Open LLMs
Gemma is a family of open models based on Google's Gemini models. Gemma 2B and Gemma 7B were trained on 3T and 6T tokens respectively, consisting primarily of English data from web documents, mathematics, and code.
We are dividing this post into the following sections:
- Model Architecture
- Training
1. Model Architecture
Context length: 8192 (8k) tokens
Gemma follows the standard Transformer decoder architecture introduced by Vaswani et al., 2017 [2]. The architecture incorporates several improvements proposed after the original paper (short illustrative sketches of each follow the list):
- Multi-Query Attention: The 7B model uses multi-head attention, while the 2B checkpoints use multi-query attention (with num_kv_heads = 1), based on ablations showing that multi-query attention (MQA) works well at small scales (Shazeer, 2019) [3].
- RoPE Embeddings: Rather than absolute positional embeddings, Gemma uses Rotary Positional Embeddings (RoPE), introduced by Su et al., 2021 [4].
- RMSNorm: The input of each transformer sub-layer, both the attention layer and the feedforward layer, is normalized with RMSNorm (Zhang and Sennrich, 2019) [5] to stabilize training.
- GeGLU: The feedforward layer of the transformer uses GeGLU (Shazeer, 2020) [7] as the activation function, originally inspired by the Gated Linear Unit (Dauphin et al., 2016) [6].
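To make the multi-query idea concrete, here is a minimal PyTorch sketch, not Gemma's actual implementation: all query heads attend against a single shared key/value head, and the tensor sizes are illustrative only.

```python
import torch

def multi_query_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Causal attention in which every query head shares one key/value head.
    q: (batch, num_heads, seq, head_dim); k, v: (batch, 1, seq, head_dim)."""
    scale = q.shape[-1] ** -0.5
    # The size-1 head dimension of k broadcasts across all query heads.
    scores = (q * scale) @ k.transpose(-2, -1)            # (batch, num_heads, seq, seq)
    seq_len = q.shape[-2]
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    scores = scores.masked_fill(~causal, float("-inf"))   # mask out future positions
    return scores.softmax(dim=-1) @ v                     # v broadcasts the same way

# Illustrative sizes only: 8 query heads, head size 256, sequence length 16.
q = torch.randn(1, 8, 16, 256)
k = torch.randn(1, 1, 16, 256)  # a single shared key head
v = torch.randn(1, 1, 16, 256)  # a single shared value head
print(multi_query_attention(q, k, v).shape)  # torch.Size([1, 8, 16, 256])
```

Because the key/value head is stored only once, the KV cache kept around during decoding shrinks by roughly the number of query heads, which is the main payoff of MQA.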
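Next, a compact sketch of one common RoPE formulation (the half-split variant), again not taken from the Gemma codebase: each channel pair of a query or key vector is rotated by an angle that grows with the token's position, so relative position information survives the attention dot product.

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate each (x1, x2) channel pair of x by a position-dependent angle.
    x has shape (batch, seq_len, num_heads, head_dim)."""
    _, seq_len, _, head_dim = x.shape
    half = head_dim // 2
    # Per-pair rotation frequencies and per-position angles.
    inv_freq = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * inv_freq  # (seq, half)
    cos = angles.cos()[None, :, None, :]   # broadcast over batch and heads
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Illustrative shapes: queries for 8 heads of size 256 over a 16-token sequence.
q = torch.randn(1, 16, 8, 256)
print(apply_rope(q).shape)  # torch.Size([1, 16, 8, 256]); position is now encoded in q itself
```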
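Finally, the normalization and feedforward pieces are small enough to show together. The modules below are a hedged sketch in PyTorch; the hidden size of 16384 is purely illustrative and is not claimed to match how Gemma splits its feedforward width.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """RMSNorm: rescale by the inverse root-mean-square of the activations;
    unlike LayerNorm there is no mean subtraction and no bias term."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight

class GeGLUFeedForward(nn.Module):
    """GeGLU feedforward: a GELU-activated gate multiplies a linear "up"
    projection before projecting back down to the model width."""
    def __init__(self, d_model: int, hidden_dim: int):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, hidden_dim, bias=False)
        self.up_proj = nn.Linear(d_model, hidden_dim, bias=False)
        self.down_proj = nn.Linear(hidden_dim, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(F.gelu(self.gate_proj(x)) * self.up_proj(x))

# Pre-norm usage as in the bullet above; sizes are illustrative, not Gemma's.
x = torch.randn(1, 16, 2048)                       # (batch, seq, d_model)
hidden = RMSNorm(2048)(x)                          # normalize the sub-layer input
out = x + GeGLUFeedForward(2048, 16384)(hidden)    # residual connection
print(out.shape)                                   # torch.Size([1, 16, 2048])
```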
2. Training
As noted above, Gemma 2B and Gemma 7B were trained on 3T and 6T tokens respectively of primarily English data from web documents, mathematics, and code.
SentencePiece tokenizer
The text is tokenized using a subset of Gemini's SentencePiece tokenizer (Kudo and Richardson, 2018) [8] for compatibility; the resulting vocabulary has roughly 256k entries (see the table below).
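If you want to poke at the tokenizer yourself, one convenient route (an assumption on my part, not something the Gemma report prescribes) is the Hugging Face transformers wrapper; the google/gemma-2b checkpoint id below requires accepting the model license on the Hub.

```python
from transformers import AutoTokenizer

# Assumes access to the "google/gemma-2b" checkpoint on the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")

ids = tokenizer("Gemma uses a 256k-entry SentencePiece vocabulary.")["input_ids"]
print(len(tokenizer))                          # vocabulary size
print(tokenizer.convert_ids_to_tokens(ids))    # the actual subword pieces
```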
Key hyperparameters of the two models [1]:

| Parameters | Gemma 2B | Gemma 7B |
|---|---|---|
| d_model | 2048 | 3072 |
| Layers | 18 | 28 |
| Feedforward hidden dims | 32768 | 49152 |
| Num heads | 8 | 16 |
| Num KV heads | 1 | 16 |
| Head size | 256 | 256 |
| Vocab size | 256128 | 256128 |
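One thing the table makes clear is how large a slice of the smaller model the 256k vocabulary takes up. A quick back-of-the-envelope check, assuming the d_model values above:

```python
# Rough size of the token embedding table alone, given the table's d_model values.
vocab_size = 256128
for name, d_model in [("Gemma 2B", 2048), ("Gemma 7B", 3072)]:
    embedding_params = vocab_size * d_model
    print(f"{name}: {embedding_params:,} embedding parameters")
# Gemma 2B: 524,550,144 embedding parameters
# Gemma 7B: 786,825,216 embedding parameters
```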
References
1. Gemma Team. Gemma: Open Models Based on Gemini Research and Technology. arXiv:2403.08295
2. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin. Attention Is All You Need. arXiv:1706.03762
3. Noam Shazeer. Fast Transformer Decoding: One Write-Head is All You Need. arXiv:1911.02150
4. Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, Yunfeng Liu. RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv:2104.09864
5. Biao Zhang, Rico Sennrich. Root Mean Square Layer Normalization. arXiv:1910.07467
6. Yann N. Dauphin, Angela Fan, Michael Auli, David Grangier. Language Modeling with Gated Convolutional Networks. arXiv:1612.08083
7. Noam Shazeer. GLU Variants Improve Transformer. arXiv:2002.05202
8. Taku Kudo, John Richardson. SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. arXiv:1808.06226