Gemma: Google's Family of Open LLMs
Gemma is a family of open models based on Google's Gemini models. Gemma 2B and Gemma 7B were trained on 3T and 6T tokens respectively, consisting primarily of English data from web documents, mathematics, and code.
We are dividing this post into the following sections:
- Model Architecture
- Training
1. Model Architecture
Context length: 8192 (8k) tokens
Gemma follows the standard Transformer decoder architecture introduced by Vaswani et al., 2017 [2]. The architecture incorporates several improvements proposed after the original paper (short illustrative sketches of each follow the list):
- Multi-Query Attention: The 7B model uses multi-head attention, while the 2B checkpoints use multi-query attention (with num_kv_heads = 1), based on ablations showing that multi-query attention (MQA) works well at small scales (Shazeer, 2019) [3].
- RoPE Embeddings: Rather than absolute positional embeddings, Gemma uses Rotary Positional Embeddings (RoPE), introduced by Su et al., 2021 [4].
- RMSNorm: The input of each transformer sub-layer, both the attention layer and the feedforward layer, is normalized with RMSNorm (Zhang and Sennrich, 2019) [5] to stabilize training.
- GeGLU: The feedforward layer of the transformer uses GeGLU (Shazeer, 2020) [7] as the activation function, originally inspired by the Gated Linear Unit (Dauphin et al., 2016) [6].
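To make the multi-query idea concrete, here is a minimal PyTorch sketch, not Gemma's actual implementation: all query heads attend against a single shared key/value head, and the tensor sizes are illustrative only.

```python
import torch

def multi_query_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Causal attention in which every query head shares one key/value head.
    q: (batch, num_heads, seq, head_dim); k, v: (batch, 1, seq, head_dim)."""
    scale = q.shape[-1] ** -0.5
    # The size-1 head dimension of k broadcasts across all query heads.
    scores = (q * scale) @ k.transpose(-2, -1)            # (batch, num_heads, seq, seq)
    seq_len = q.shape[-2]
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    scores = scores.masked_fill(~causal, float("-inf"))   # mask out future positions
    return scores.softmax(dim=-1) @ v                     # v broadcasts the same way

# Illustrative sizes only: 8 query heads, head size 256, sequence length 16.
q = torch.randn(1, 8, 16, 256)
k = torch.randn(1, 1, 16, 256)  # a single shared key head
v = torch.randn(1, 1, 16, 256)  # a single shared value head
print(multi_query_attention(q, k, v).shape)  # torch.Size([1, 8, 16, 256])
```

Because the key/value head is stored only once, the KV cache kept around during decoding shrinks by roughly the number of query heads, which is the main payoff of MQA.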
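Next, a compact sketch of one common RoPE formulation (the half-split variant), again not taken from the Gemma codebase: each channel pair of a query or key vector is rotated by an angle that grows with the token's position, so relative position information survives the attention dot product.

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate each (x1, x2) channel pair of x by a position-dependent angle.
    x has shape (batch, seq_len, num_heads, head_dim)."""
    _, seq_len, _, head_dim = x.shape
    half = head_dim // 2
    # Per-pair rotation frequencies and per-position angles.
    inv_freq = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * inv_freq  # (seq, half)
    cos = angles.cos()[None, :, None, :]   # broadcast over batch and heads
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Illustrative shapes: queries for 8 heads of size 256 over a 16-token sequence.
q = torch.randn(1, 16, 8, 256)
print(apply_rope(q).shape)  # torch.Size([1, 16, 8, 256]); position is now encoded in q itself
```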
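Finally, the normalization and feedforward pieces are small enough to show together. The modules below are a hedged sketch in PyTorch; the hidden size of 16384 is purely illustrative and is not claimed to match how Gemma splits its feedforward width.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """RMSNorm: rescale by the inverse root-mean-square of the activations;
    unlike LayerNorm there is no mean subtraction and no bias term."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight

class GeGLUFeedForward(nn.Module):
    """GeGLU feedforward: a GELU-activated gate multiplies a linear "up"
    projection before projecting back down to the model width."""
    def __init__(self, d_model: int, hidden_dim: int):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, hidden_dim, bias=False)
        self.up_proj = nn.Linear(d_model, hidden_dim, bias=False)
        self.down_proj = nn.Linear(hidden_dim, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(F.gelu(self.gate_proj(x)) * self.up_proj(x))

# Pre-norm usage as in the bullet above; sizes are illustrative, not Gemma's.
x = torch.randn(1, 16, 2048)                       # (batch, seq, d_model)
hidden = RMSNorm(2048)(x)                          # normalize the sub-layer input
out = x + GeGLUFeedForward(2048, 16384)(hidden)    # residual connection
print(out.shape)                                   # torch.Size([1, 16, 2048])
```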
2. Training
As noted above, Gemma 2B and Gemma 7B were trained on 3T and 6T tokens respectively of primarily English data from web documents, mathematics, and code.
SentencePiece tokenizer
The text is tokenized using a subset of Gemini's SentencePiece tokenizer (Kudo and Richardson, 2018) [8] for compatibility; the resulting vocabulary has roughly 256k entries (see the table below).
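If you want to poke at the tokenizer yourself, one convenient route (an assumption on my part, not something the Gemma report prescribes) is the Hugging Face transformers wrapper; the google/gemma-2b checkpoint id below requires accepting the model license on the Hub.

```python
from transformers import AutoTokenizer

# Assumes access to the "google/gemma-2b" checkpoint on the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")

ids = tokenizer("Gemma uses a 256k-entry SentencePiece vocabulary.")["input_ids"]
print(len(tokenizer))                          # vocabulary size
print(tokenizer.convert_ids_to_tokens(ids))    # the actual subword pieces
```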
Key hyperparameters of the two models [1]:

| Parameters | Gemma 2B | Gemma 7B |
|---|---|---|
| d_model | 2048 | 3072 |
| Layers | 18 | 28 |
| Feedforward hidden dims | 32768 | 49152 |
| Num heads | 8 | 16 |
| Num KV heads | 1 | 16 |
| Head size | 256 | 256 |
| Vocab size | 256128 | 256128 |
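One thing the table makes clear is how large a slice of the smaller model the 256k vocabulary takes up. A quick back-of-the-envelope check, assuming the d_model values above:

```python
# Rough size of the token embedding table alone, given the table's d_model values.
vocab_size = 256128
for name, d_model in [("Gemma 2B", 2048), ("Gemma 7B", 3072)]:
    embedding_params = vocab_size * d_model
    print(f"{name}: {embedding_params:,} embedding parameters")
# Gemma 2B: 524,550,144 embedding parameters
# Gemma 7B: 786,825,216 embedding parameters
```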
References
1. Gemma Team. Gemma: Open Models Based on Gemini Research and Technology. arXiv:2403.08295
2. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin. Attention Is All You Need. arXiv:1706.03762
3. Noam Shazeer. Fast Transformer Decoding: One Write-Head is All You Need. arXiv:1911.02150
4. Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, Yunfeng Liu. RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv:2104.09864
5. Biao Zhang, Rico Sennrich. Root Mean Square Layer Normalization. arXiv:1910.07467
6. Yann N. Dauphin, Angela Fan, Michael Auli, David Grangier. Language Modeling with Gated Convolutional Networks. arXiv:1612.08083
7. Noam Shazeer. GLU Variants Improve Transformer. arXiv:2002.05202
8. Taku Kudo, John Richardson. SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. arXiv:1808.06226