MoE: Mixture of Experts

Here, we take an in-depth look at the Mixture of Experts (MoE) architecture, which has become a popular way to scale large language models (LLMs). GPT-4 is reported, though not officially confirmed, to use an MoE architecture, which is said to contribute to its improved performance over earlier GPT models. Let's understand what MoE is and how it can enhance the performance of LLMs.

"Mixtral", the open-source LLM from Mistral AI, is also based on the MoE architecture.

Building blocks of MoE: Expert Networks & Gating Network

MoE is a general architectural concept in neural networks where a model consists of multiple expert networks (or sub-networks) and a gating network. The gating network decides which expert(s) should be activated for a given input, allowing the model to specialize different experts for different types of data or tasks.
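To make the two building blocks concrete, here is a minimal sketch in plain Python. It assumes each expert is a single linear map and that the gating network is a linear layer followed by a softmax; the names `moe_forward`, `expert_weights`, and `gate_weights`, the sizes, and the random weights are all hypothetical and chosen only for illustration. Note that this version evaluates every expert and blends the results, so it shows the gating idea but not yet the compute savings of activating only a few experts.

```python
import math
import random


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, x):
    """Matrix-vector product: one output value per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def make_matrix(rows, cols, rng):
    """Random weights; a trained model would learn these instead."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
DIM, NUM_EXPERTS = 4, 3  # illustrative sizes, not from the article

# Assumption: each expert is one linear map, and so is the gate.
expert_weights = [make_matrix(DIM, DIM, rng) for _ in range(NUM_EXPERTS)]
gate_weights = make_matrix(NUM_EXPERTS, DIM, rng)


def moe_forward(x):
    """Blend the expert outputs using the gate's probabilities."""
    gate_probs = softmax(linear(gate_weights, x))  # one weight per expert
    expert_outputs = [linear(w, x) for w in expert_weights]
    output = [
        sum(p * out[i] for p, out in zip(gate_probs, expert_outputs))
        for i in range(DIM)
    ]
    return output, gate_probs


x = [0.5, -1.0, 0.25, 2.0]
y, probs = moe_forward(x)
print("gate probabilities:", probs)
print("output:", y)
```

The gate probabilities depend on the input `x`, which is what lets different experts dominate for different kinds of input.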

What are Sparse Models?

In a dense model, all the model parameters are used to make a prediction for every token or input. This can be computationally expensive and inefficient, especially for large models with millions or billions of parameters.

Unlike dense models, sparse models use the idea of conditional computation: they use only a small subset of the model parameters for each token. Which subset of parameters (or experts) is activated is decided by the gating network based on the input.
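Conditional computation can be sketched as top-k gating: keep only the k experts with the highest gate scores, renormalize their weights, and never call the rest. The snippet below is a toy illustration under stated assumptions. The helper names `top_k_gate` and `sparse_moe` are hypothetical, the experts are simple stand-in functions rather than neural networks, and the gate scores are passed in by hand instead of being produced by a learned gating network.

```python
import math


def top_k_gate(scores, k):
    """Keep the k highest-scoring experts and softmax over just those."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    m = max(scores[i] for i in top)
    exps = {i: math.exp(scores[i] - m) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def sparse_moe(x, experts, gate_scores, k):
    """Run only the selected experts and combine their outputs."""
    weights = top_k_gate(gate_scores, k)
    output = [0.0] * len(x)
    for i, w in weights.items():
        expert_out = experts[i](x)  # unselected experts are never called
        output = [o + w * e for o, e in zip(output, expert_out)]
    return output, weights


# Toy stand-in experts (assumption: real experts would be sub-networks).
experts = [
    lambda x: [2.0 * v for v in x],
    lambda x: [v + 1.0 for v in x],
    lambda x: [-v for v in x],
    lambda x: [v * v for v in x],
]

# Hand-picked gate scores; experts 1 and 3 tie for the highest score.
gate_scores = [0.1, 2.0, -1.0, 2.0]

y, weights = sparse_moe([1.0, 3.0], experts, gate_scores, k=2)
print(weights)  # {1: 0.5, 3: 0.5}
print(y)        # [1.5, 6.5]
```

With k=2 out of 4 experts, only half of the expert parameters take part in this forward pass, while the total parameter count of the model is unchanged.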

Figure: MoE architecture with a router

Switch Transformers: Bringing MoE to Transformers

As discussed, Mixture of Experts (MoE) is a general architectural concept in which only a subset of a neural network's parameters is activated for each input. The Transformer is a popular architecture for large language models such as GPT-3 and BERT, and augmenting it with the MoE concept led to the development of the Switch Transformer.

In a Switch Transformer, the experts are usually individual feed-forward neural networks within the Transformer layers, and the gating mechanism is a sparse routing algorithm, called the "switch" mechanism, that sends each token to a single expert. Hence the name "Switch Transformer".
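The routing described above can be sketched as follows: a router scores each token, the token goes to the single highest-probability expert, and that expert's output is scaled by the router probability. This is a simplified illustration in plain Python, not the reference implementation. The class names `FeedForwardExpert` and `SwitchLayer`, the layer sizes, and the random weights are hypothetical, and details such as expert capacity limits and the auxiliary load-balancing loss are left out.

```python
import math
import random


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]


class FeedForwardExpert:
    """A two-layer feed-forward network with a ReLU in between."""

    def __init__(self, d_model, d_hidden, rng):
        self.w_in = random_matrix(d_hidden, d_model, rng)
        self.w_out = random_matrix(d_model, d_hidden, rng)

    def __call__(self, x):
        hidden = [max(0.0, h) for h in matvec(self.w_in, x)]
        return matvec(self.w_out, hidden)


class SwitchLayer:
    """Top-1 routing: each token is processed by exactly one expert."""

    def __init__(self, d_model, d_hidden, num_experts, seed=0):
        rng = random.Random(seed)
        self.router = random_matrix(num_experts, d_model, rng)
        self.experts = [
            FeedForwardExpert(d_model, d_hidden, rng) for _ in range(num_experts)
        ]

    def route(self, token):
        probs = softmax(matvec(self.router, token))
        best = max(range(len(probs)), key=probs.__getitem__)
        return best, probs[best]

    def __call__(self, tokens):
        outputs, assignments = [], []
        for token in tokens:
            idx, prob = self.route(token)
            expert_out = self.experts[idx](token)  # only one expert runs
            outputs.append([prob * v for v in expert_out])
            assignments.append(idx)
        return outputs, assignments


layer = SwitchLayer(d_model=4, d_hidden=8, num_experts=3)
tokens = [
    [1.0, 0.0, -1.0, 0.5],
    [0.2, 0.9, 0.1, -0.3],
    [-0.7, 0.4, 0.0, 1.2],
]
outputs, assignments = layer(tokens)
print("expert chosen per token:", assignments)
```

Scaling the expert output by the router probability is what keeps the router trainable in a real model: the routing decision itself is a hard argmax, but the probability that multiplies the output carries gradient back to the router weights.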

So, two key components are needed to implement the MoE architecture:

Router or Gating Network
Experts or Sparse MoE Layers

Let's go in-depth and understand how these components work together in the MoE architecture.

References

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, Jeff Dean. Outrageously Large Neural Networks: The Sparsely-Gated Mixture of Experts Layer. arXiv:1701.06538
William Fedus, Barret Zoph, Noam Shazeer. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. arXiv:2101.03961
Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, Zhifeng Chen. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. arXiv:2006.16668