Sushant Kumar

Mixtral of Experts

Mistral AI has made waves in the AI community by open-sourcing state-of-the-art LLMs, in contrast to research labs such as OpenAI and Anthropic, which keep their models closed-source. Mixtral of Experts (Mixtral, in short) is one such model that has been making headlines: it outperforms Llama 2 70B and GPT-3.5 on most benchmarks while being released under the highly permissive Apache License 2.0.

At the time of writing this blog post, Mistral AI has released two variants of Mixtral over a span of roughly four months:

  1. Mixtral 8x7B - Released on 8th December, 2023
  2. Mixtral 8x22B - Released on 10th April, 2024

Mixtral 8x7B

Mixtral 8x7B has 46.7B total parameters but uses only 12.9B parameters per token. It therefore processes input and generates output at the same speed and for the same cost as a 12.9B model.
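To see where these numbers come from, here is a back-of-envelope parameter count in Python. The additional architecture values used below (model dimension 4096, FFN hidden dimension 14336, 32 layers, 32 query heads and 8 key/value heads of dimension 128) are taken from the Mixtral paper; treat this as an approximate sketch rather than an exact accounting (layer norms are ignored).

```python
# Rough parameter count for Mixtral 8x7B (architecture values from the Mixtral paper).
dim, hidden_dim, n_layers = 4096, 14336, 32
n_heads, n_kv_heads, head_dim = 32, 8, 128
vocab_size, n_experts, top_k = 32000, 8, 2

# One SwiGLU expert has three projection matrices (w1, w2, w3).
expert_params = 3 * dim * hidden_dim                                    # ~176M per expert

# Grouped-query attention: full-size q/o projections, smaller k/v projections.
attn_params = 2 * dim * (n_heads * head_dim) + 2 * dim * (n_kv_heads * head_dim)

router_params = dim * n_experts                                         # gating layer, negligible
embed_params = 2 * vocab_size * dim                                     # input embedding + output head

total = n_layers * (n_experts * expert_params + attn_params + router_params) + embed_params
active = n_layers * (top_k * expert_params + attn_params + router_params) + embed_params

print(f"total  ≈ {total / 1e9:.1f}B parameters")   # ≈ 46.7B
print(f"active ≈ {active / 1e9:.1f}B parameters")  # ≈ 12.9B
```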

PyTorch Implementation

The PyTorch implementation of Mixtral is available on GitHub, and it is highly recommended to check it out to better understand the model.

Now, let's dive deep into how Mixtral of Experts works and how it differs from the traditional Transformer architecture.

SMoE: Sparse Mixture of Experts


Mixtral is based on a Sparse Mixture of Experts (SMoE) architecture. Sparse activation means that, for each token, the neural network uses only a small subset of its parameters. In Mixtral's case, as you will see, this is 2 of the 8 experts (the feed-forward networks in each transformer block).

In the SMoE architecture, the self-attention mechanism within each transformer block remains unchanged. What changes is the feed-forward network: it is replaced by a set of sparsely activated feed-forward networks, known as experts.

Inside each transformer block of Mixtral, the feed-forward block picks from a set of 8 distinct groups of parameters (experts). At every layer, for every token, a router network chooses 2 of the 8 experts to process the token and combines their outputs additively, weighted by the router's gating values. A minimal sketch of this routing is shown below.
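To make the routing concrete, here is a minimal PyTorch sketch of a top-2 sparse MoE feed-forward block. The class and variable names are illustrative, not the official mistral-inference code; the experts are plain SwiGLU MLPs and the routing simply sums the chosen experts' outputs with softmax weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUExpert(nn.Module):
    """One expert: a standard SwiGLU feed-forward network."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

class SparseMoEBlock(nn.Module):
    """Replaces the dense FFN: each token is routed to top_k of n_experts experts."""
    def __init__(self, dim: int, hidden_dim: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(SwiGLUExpert(dim, hidden_dim) for _ in range(n_experts))
        self.gate = nn.Linear(dim, n_experts, bias=False)  # router
        self.top_k = top_k

    def forward(self, x):                       # x: (n_tokens, dim)
        logits = self.gate(x)                   # (n_tokens, n_experts)
        weights, idx = torch.topk(logits, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)    # normalise over the 2 chosen experts
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            token_pos, slot = torch.where(idx == e)   # which tokens chose expert e
            if token_pos.numel() == 0:
                continue
            # weighted sum of the selected experts' outputs
            out[token_pos] += weights[token_pos, slot].unsqueeze(-1) * expert(x[token_pos])
        return out
```

Since each token is routed independently, a batch of shape (batch, seq_len, dim) is typically flattened to (batch * seq_len, dim) before routing and reshaped afterwards.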

Training

Mixtral is trained on multilingual data using a context size of 32k tokens.

One key thing to note is that Mistral AI hasn't released details about the training data. This could be to avoid unnecessary issues around data privacy and security.

Model Architecture

A few salient features of Mixtral 8x7B (collected into a small config sketch after this list) are:

  • Context Length: 32,768 tokens
  • Vocab Size: 32,000
  • Number of Experts: 8
  • Top K Experts: 2
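Pulling these hyperparameters into a small config and wiring them into the MoE sketch above looks roughly like this. The MixtralConfig dataclass is a hypothetical convenience, not the official mistral-inference ModelArgs, and the extra fields are the architecture values reported in the Mixtral paper.

```python
from dataclasses import dataclass

@dataclass
class MixtralConfig:
    # Values listed above
    context_length: int = 32768
    vocab_size: int = 32000
    num_experts: int = 8
    top_k_experts: int = 2
    # Additional values as reported in the Mixtral paper
    dim: int = 4096
    hidden_dim: int = 14336
    n_layers: int = 32

cfg = MixtralConfig()
# One sparse MoE feed-forward block per transformer layer (see the sketch above)
moe_block = SparseMoEBlock(cfg.dim, cfg.hidden_dim, cfg.num_experts, cfg.top_k_experts)
```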

References

  1. Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed. Mixtral of Experts. arXiv:2401.04088
