Llama 3: SOTA open-weights LLM
Llama 3 by Meta AI is trained on more than 15T tokens of data, a training corpus roughly seven times larger than the one used for Llama 2. Meta claims it is the best openly available LLM to date, outperforming competing open models. It is being released in three sizes by parameter count: 8B, 70B, and 400B.
The 8B and 70B variants have already been released and their weights are available for download; the 400B variant is still in training.
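For reference, here is a minimal sketch of loading the released 8B weights with the Hugging Face transformers library. The checkpoint id `meta-llama/Meta-Llama-3-8B` and the license-acceptance step are assumptions about how the weights are distributed, not details from this post.

```python
# Minimal sketch: load the Llama 3 8B weights via Hugging Face transformers.
# Assumes you have accepted Meta's license and that the
# "meta-llama/Meta-Llama-3-8B" checkpoint id is accessible from your account.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B"  # assumed checkpoint id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Quick smoke test: generate a short continuation.
inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```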
Training data
The 15T-token training corpus was collected from publicly available sources. It contains four times more code data than Llama 2's corpus, and over 5% of it consists of high-quality non-English data.
Model Architecture
Llama 3 uses a relatively standard decoder-only transformer architecture [2]. It adopts a tokenizer with a vocabulary of 128K tokens that encodes text much more efficiently, which leads to substantially improved model performance. To improve inference efficiency, Meta adopted grouped-query attention (GQA) [1] in both the 8B and 70B models, as sketched below.
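GQA lets several query heads share a single key/value head, which shrinks the key/value cache at inference time. The following is a minimal sketch of the idea; the head counts and tensor shapes are illustrative and not the actual Llama 3 configuration.

```python
# Minimal sketch of grouped-query attention (GQA): query heads are split into
# groups that each share one key/value head, reducing the KV cache size.
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v):
    """q: (batch, n_q_heads, seq, head_dim); k, v: (batch, n_kv_heads, seq, head_dim)."""
    n_q_heads, n_kv_heads = q.shape[1], k.shape[1]
    group_size = n_q_heads // n_kv_heads          # query heads per shared KV head
    # Repeat each KV head so it lines up with its group of query heads.
    k = k.repeat_interleave(group_size, dim=1)
    v = v.repeat_interleave(group_size, dim=1)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v

# Toy shapes: 8 query heads sharing 2 KV heads (group size 4).
q = torch.randn(1, 8, 16, 64)
k = torch.randn(1, 2, 16, 64)
v = torch.randn(1, 2, 16, 64)
out = grouped_query_attention(q, k, v)   # -> (1, 8, 16, 64)
```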
The models were trained on sequences of 8,192 tokens, using an attention mask to ensure self-attention does not cross document boundaries.
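One simple way to implement that constraint is a block-diagonal causal mask built from per-token document ids. The sketch below is only illustrative of the technique, not Meta's training code.

```python
# Minimal sketch: a causal mask that also blocks attention across document
# boundaries when multiple documents are packed into one training sequence.
import torch

def document_causal_mask(doc_ids: torch.Tensor) -> torch.Tensor:
    """doc_ids: (seq,) integer document id per token.
    Returns a (seq, seq) bool mask where True means 'may attend'."""
    seq = len(doc_ids)
    causal = torch.tril(torch.ones(seq, seq, dtype=torch.bool))        # no attending to the future
    same_doc = doc_ids.unsqueeze(0) == doc_ids.unsqueeze(1)            # no attending across documents
    return causal & same_doc

# Two documents packed into one sequence of 6 tokens.
doc_ids = torch.tensor([0, 0, 0, 1, 1, 1])
print(document_causal_mask(doc_ids).int())
```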
| Model configuration | Llama 2 | Llama 3 |
| --- | --- | --- |
| Context length | 4,096 | 8,192 |
| Vocabulary size | 32,000 | 128,000 |
Coming up
Meta AI has announced plans to make Llama 3 multilingual and multimodal, and to extend its context window over time. So stay tuned for more updates.
References
- [1] Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, Sumit Sanghai. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. arXiv:2305.13245
- [2] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin. Attention Is All You Need. arXiv:1706.03762