Sushant Kumar

Gemma: Google's family of Open LLMs

Gemma is a family of open models based on Google's Gemini models. Gemma 2B and Gemma 7B were trained on 3T and 6T tokens, respectively, of primarily English data from web documents, mathematics, and code.

This post is divided into the following sections:

  1. Model Architecture
  2. Training

1. Model Architecture

Context length: 8192 (~8k) tokens

Gemma follows the standard Transformer decoder architecture introduced by Vaswani et al., 2017[2]. The architecture incorporates several improvements proposed after the original paper, such as the following (a minimal illustrative sketch of each appears after the list):

  1. Multi-Query Attention: The 7B model uses multi-head attention, while the 2B checkpoints use multi-query attention (with num_kv_heads = 1), based on ablations showing that multi-query attention (MQA) works well at small scales (Shazeer, 2019)[3].

  2. RoPE Embeddings: Rather than absolute positional embeddings, Gemma uses Rotary Position Embeddings (RoPE), introduced by Su et al., 2021[4].

  3. RMSNorm: The input of each transformer sub-layer (the attention layer and the feedforward layer) is normalized with RMSNorm (Zhang and Sennrich, 2019)[5] to stabilize training.

  4. GeGLU: The feedforward layer of the transformer uses GeGLU (Shazeer, 2020)[7] as the activation function, originally inspired by the Gated Linear Unit (Dauphin et al., 2016)[6].
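
To make the multi-head vs. multi-query distinction concrete, here is a minimal PyTorch sketch of an attention block where the number of key/value heads is a parameter: num_kv_heads = 1 gives multi-query attention (the 2B setting), while num_kv_heads = num_heads recovers standard multi-head attention (the 7B setting). This is an illustrative sketch, not Gemma's actual implementation; the layer names and the use of F.scaled_dot_product_attention are my own choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Attention(nn.Module):
    """Sketch: num_kv_heads=1 -> multi-query (2B-style), num_kv_heads=num_heads -> multi-head (7B-style)."""
    def __init__(self, d_model: int, num_heads: int, num_kv_heads: int, head_size: int):
        super().__init__()
        self.num_heads, self.num_kv_heads, self.head_size = num_heads, num_kv_heads, head_size
        self.q_proj = nn.Linear(d_model, num_heads * head_size, bias=False)
        self.k_proj = nn.Linear(d_model, num_kv_heads * head_size, bias=False)
        self.v_proj = nn.Linear(d_model, num_kv_heads * head_size, bias=False)
        self.o_proj = nn.Linear(num_heads * head_size, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.num_heads, self.head_size).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.num_kv_heads, self.head_size).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.num_kv_heads, self.head_size).transpose(1, 2)
        # Share the key/value head(s) across all query heads (the MQA idea).
        if self.num_kv_heads != self.num_heads:
            k = k.repeat_interleave(self.num_heads // self.num_kv_heads, dim=1)
            v = v.repeat_interleave(self.num_heads // self.num_kv_heads, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # causal decoder mask
        return self.o_proj(out.transpose(1, 2).reshape(b, t, -1))

# Usage with Gemma 2B-like settings from Table 1 below (MQA: a single KV head).
attn = Attention(d_model=2048, num_heads=8, num_kv_heads=1, head_size=256)
y = attn(torch.randn(1, 16, 2048))
```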
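
For RoPE, the sketch below uses the "half-split" formulation common in open implementations: channel i is paired with channel i + head_size/2 and the pair is rotated by a position-dependent angle. It shows the idea rather than Gemma's exact code; in a real model this rotation is applied to the query and key vectors inside the attention layer.

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embeddings to x of shape (..., seq_len, head_size), head_size even."""
    seq_len, head_size = x.shape[-2], x.shape[-1]
    half = head_size // 2
    # One frequency per channel pair: theta_i = base^(-i/half)
    inv_freq = base ** (-torch.arange(0, half, dtype=torch.float32) / half)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * inv_freq[None, :]  # (seq_len, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]       # channel i is paired with channel i + half
    return torch.cat([x1 * cos - x2 * sin,      # 2D rotation of each pair by its angle
                      x1 * sin + x2 * cos], dim=-1)
```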
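
A minimal RMSNorm sketch following the formulation in the RMSNorm paper: scale each vector by the reciprocal root-mean-square of its features (no mean subtraction), then apply a learned per-channel gain. Gemma's exact parameterization may differ in details such as weight initialization.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learned per-channel gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight
```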
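
And a sketch of a GeGLU feedforward block: the hidden activation is the element-wise product of a GELU-gated projection and a linear "up" projection, followed by a projection back to d_model. The gate_proj/up_proj/down_proj names are illustrative, not Gemma's.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeGLUFeedForward(nn.Module):
    """Feedforward with GeGLU activation: GELU(x W_gate) * (x W_up), then W_down."""
    def __init__(self, d_model: int, hidden_dim: int):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, hidden_dim, bias=False)
        self.up_proj = nn.Linear(d_model, hidden_dim, bias=False)
        self.down_proj = nn.Linear(hidden_dim, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(F.gelu(self.gate_proj(x)) * self.up_proj(x))
```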

2. Training

As noted above, Gemma 2B and Gemma 7B were trained on 3T and 6T tokens, respectively, of primarily English data from web documents, mathematics, and code.

SentencePiece tokenizer

The text is tokenized with a subset of Gemini's SentencePiece tokenizer (Kudo and Richardson, 2018)[8] for compatibility.
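
To poke at the tokenizer yourself, here is a minimal sketch using the Hugging Face transformers library and the google/gemma-2b checkpoint (the repository is gated, so this assumes you have accepted the license and authenticated with the Hub):

```python
from transformers import AutoTokenizer

# Assumes access to the gated google/gemma-2b repository on the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")

print(tokenizer.vocab_size)   # vocabulary on the order of 256k entries (see Table 1)
print(tokenizer.tokenize("Gemma is a family of open models."))
print(tokenizer.encode("Gemma is a family of open models."))
```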

| Parameters              | Gemma 2B | Gemma 7B |
|-------------------------|----------|----------|
| d_model                 | 2048     | 3072     |
| Layers                  | 18       | 28       |
| Feedforward hidden dims | 32768    | 49152    |
| Num heads               | 8        | 16       |
| Num KV heads            | 1        | 16       |
| Head size               | 256      | 256      |
| Vocab size              | 256128   | 256128   |

Table 1: Key model parameters for Gemma 2B and Gemma 7B
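
For convenience, the same numbers as plain Python dicts (the key names are just for this post, not an official config format):

```python
# Table 1 values, transcribed for programmatic reference.
GEMMA_2B = dict(d_model=2048, layers=18, ffn_hidden=32768,
                num_heads=8, num_kv_heads=1, head_size=256, vocab_size=256128)

GEMMA_7B = dict(d_model=3072, layers=28, ffn_hidden=49152,
                num_heads=16, num_kv_heads=16, head_size=256, vocab_size=256128)
```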

References

  1. Gemma Team. Gemma: Open Models Based on Gemini Research and Technology. arXiv:2403.08295
  2. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin. Attention is All You Need. arXiv:1706.03762
  3. Noam Shazeer. Fast Transformer Decoding: One Write-Head is All You Need. arXiv:1911.02150
  4. Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, Yunfeng Liu. RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv:2104.09864
  5. Biao Zhang, Rico Sennrich. Root Mean Square Layer Normalization. arXiv:1910.07467
  6. Yann N. Dauphin, Angela Fan, Michael Auli, David Grangier. Language Modeling with Gated Convolutional Networks. arXiv:1612.08083
  7. Noam Shazeer. GLU Variants Improve Transformer. arXiv:2002.05202
  8. Taku Kudo, John Richardson. SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing. arXiv:1808.06226