Sushant Kumar

Training LLMs on GPUs

Training large language models (LLMs) is a computationally intensive endeavor, requiring specialized hardware that can handle massive amounts of data and complex calculations efficiently. Graphics Processing Units (GPUs) have become the gold standard for this purpose, thanks to their architecture, which is well-suited for parallel processing. In this post, we'll explore what makes GPUs so effective for LLM training, the key metrics for assessing their performance, and the commonly used GPUs in AI today.

What Are GPUs?

A Graphics Processing Unit (GPU) is a specialised processor initially designed for rendering graphics, mainly for video games and visual effects. Unlike a Central Processing Unit (CPU), which is a general-purpose processor optimised for handling a wide range of sequential tasks with a small number of powerful cores, GPUs contain thousands of smaller cores that excel at parallel computation. This architecture makes GPUs perfect for workloads where numerous calculations need to happen simultaneously.

[Image: Parallel vs. sequential processing]

GPUs were originally built for rendering graphics precisely because image processing demands this kind of throughput: producing a frame means running the same calculations across a large matrix of millions of pixels at once.

That same property is what makes GPUs so effective for AI, and especially for training large language models (LLMs). LLMs consist of billions of parameters—the weights and biases the model learns over time—and training them requires repeating calculations across enormous datasets. The thousands of cores in a GPU can process these operations in parallel, making training feasible in a reasonable amount of time. Hardware originally designed to compute millions of pixel values simultaneously is now indispensable for exactly this kind of large-scale, parallel computation.

Understanding FLOPs: Measuring Computational Power

A crucial metric for understanding computational power is FLOP/s (floating-point operations per second), often written FLOPS. It represents the number of calculations involving real numbers that a processor can perform each second. A related quantity, the total number of floating-point operations (FLOPs), measures how much work a training run requires overall. Both matter in deep learning, particularly for matrix multiplication, which lies at the heart of most AI computation.

For example, when training an LLM, each layer of the model applies transformations to data, which requires extensive computation. The total number of floating-point operations depends on the number of parameters in the model and the size of the training dataset. A GPU's FLOP/s rating therefore indicates how quickly it can work through that volume of computation: the higher the throughput, the better suited the GPU is for deep learning tasks.

In the context of training LLMs, the total FLOP count scales with the product of the parameter count and the dataset size: a higher number of parameters combined with a larger dataset means a rapidly growing number of floating-point operations required to train the model. For instance, if an LLM has 10 billion parameters and you feed it 1 billion words of text, the total number of operations needed during training is vast. GPUs, with their architecture designed for such workloads, help ensure that these calculations are performed efficiently and in a reasonable amount of time.
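As a rough illustration (using the widely cited approximation of about 6 floating-point operations per parameter per training token, and loosely treating the 1 billion words as roughly 1 billion tokens), the arithmetic looks like this:

```python
# Rough training-FLOP estimate; the ~6 FLOPs/param/token figure is a
# common approximation, not an exact count.
params = 10e9   # 10 billion parameters
tokens = 1e9    # ~1 billion training tokens

total_flops = 6 * params * tokens
print(f"{total_flops:.1e} total floating-point operations")  # ~6.0e+19
```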

Let's address the elephant in the room: "compute optimal" language models, often discussed in terms of the "Chinchilla scaling laws," named after the DeepMind model that established the current understanding of how parameter count and dataset size should be balanced. A compute-optimal language model maintains a specific relationship between its number of parameters P and its number of training tokens D, following the approximation D ≈ 20P. This is optimal only in a narrow sense: in a regime where using 1,000 GPUs for one hour costs the same as using one GPU for 1,000 hours, the relationship tells you how to maximize performance while minimizing the GPU-hour cost of training.
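As a minimal sketch (not the original papers' exact methodology), combining D ≈ 20P with the common C ≈ 6PD approximation for total training FLOPs lets you back out a compute-optimal model size from a FLOP budget. The 1e23 budget below is purely illustrative:

```python
import math

def compute_optimal(compute_budget_flops: float) -> tuple[float, float]:
    """Return (parameters, training tokens) for a given FLOP budget,
    using C = 6*P*D and the Chinchilla heuristic D = 20*P, so C = 120*P^2."""
    params = math.sqrt(compute_budget_flops / 120)
    tokens = 20 * params
    return params, tokens

p, d = compute_optimal(1e23)  # illustrative budget of 1e23 FLOPs
print(f"~{p / 1e9:.0f}B parameters trained on ~{d / 1e9:.0f}B tokens")  # ~29B / ~577B
```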

While FLOP/s is a crucial metric, the actual bottleneck in many training scenarios extends beyond pure computational power. A significant limitation comes from data movement: how quickly values can be read from and written to the GPU's own memory (its memory bandwidth), and how quickly data can be transferred between system memory and the GPU in the first place. This is where VRAM becomes a critical factor in the training pipeline. Even with impressive FLOP/s capabilities, a GPU's performance may be constrained by these memory limits, making efficient memory management and optimisation just as important as raw computational power.

The Role of VRAM in GPU Performance

VRAM (Video Random Access Memory) is the dedicated memory on a GPU. It stores model parameters, intermediate calculations, and input data during training. The larger the VRAM, the more data and parameters the GPU can store, which is crucial for training LLMs with billions of parameters. When VRAM is insufficient, data has to be offloaded to slower system memory, which drastically reduces training efficiency. Hence, high VRAM is essential for maintaining speed during LLM training.

What Are the Most Common GPUs Used for Training LLMs?

In the world of LLM training, not all GPUs are created equal. Here are some of the most commonly used GPUs today for training large AI models, along with their respective features:

1. NVIDIA H100 (80GB)

The H100 is NVIDIA's flagship Hopper-generation GPU for AI and deep learning tasks. It features 80GB of High-Bandwidth Memory (HBM) and is designed for even the most computationally demanding models, offering extremely high Tensor Core throughput that makes it suitable for models with hundreds of billions of parameters.

2. NVIDIA A100 (80GB, 40GB)

The A100 GPU is another favorite among AI researchers and practitioners. Available in 80GB and 40GB VRAM variants, it provides high-performance support for training LLMs and handles data-intensive tasks such as distributed training across multiple GPUs. It delivers strong Tensor Core throughput in both FP16 (half precision) and BFLOAT16 (brain float) formats, offering a balanced trade-off between precision and computational speed.

3. NVIDIA A40

The A40 is another versatile GPU for AI workloads, suitable for both inference and training tasks. With 48GB of GDDR6 memory, it does not reach the performance of an A100, but it offers a good balance of power and cost, making it an option for more budget-conscious training setups.

4. NVIDIA T4

The T4 GPU is widely used in cloud computing environments for both training and inference. With 16GB of memory, it falls well short of the H100 and A100, but it still provides respectable half-precision throughput. It is a more economical choice for lighter workloads or smaller models.

Precision Types and Their Importance

Floating point numbers are fundamental to GPU computations and consist of three main components: the sign bit (positive or negative), the exponent (determining the number's range), and the mantissa (determining its precision). The standard float32 format uses 32 bits in total: 1 sign bit, 8 exponent bits, and 23 mantissa bits. Float16 reduces this to 1 sign bit, 5 exponent bits, and 10 mantissa bits—but the smaller exponent narrows the representable range, which can cause overflows and underflows during training. Brain float (bfloat16) offers an elegant solution by keeping float32's exponent size while shrinking the mantissa, striking an effective balance between numerical range and memory efficiency. These format choices significantly influence both computational speed and numerical accuracy in LLM training.

| Format | Sign Bit | Exponent Bits | Mantissa Bits | Primary Use |
| --- | --- | --- | --- | --- |
| FP32 (Single Precision) | 1 | 8 | 23 | High-precision computations |
| TF32 (Tensor Float 32) | 1 | 8 | 10 | Tensor Core training on Ampere and newer GPUs |
| FP16 (Half Precision) | 1 | 5 | 10 | Faster computation with reduced memory |
| BFLOAT16 (Brain Float) | 1 | 8 | 7 | Balance between range and memory efficiency |

[Image: FP16 vs. BF16 bit layouts]

  • FP32 (Single Precision): Common for applications where high precision is necessary.
  • TF32 (Tensor Float 32): A custom NVIDIA format that combines FP32's range with reduced precision, using 8 exponent bits and 10 mantissa bits. It offers better numerical stability than FP16 while maintaining good performance, making it ideal for AI training on Ampere architecture GPUs.
  • FP16 (Half Precision): Used to speed up computations without a significant loss in accuracy.
  • BFLOAT16 (Brain Floating Point 16): A format designed to retain float32's numerical range while consuming half the memory, particularly useful in AI training. Unlike FP16, bfloat16 has a much greater dynamic range, making it far more resilient to overflows during training.
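A minimal sketch, assuming PyTorch is available, that prints the range and precision of these formats (TF32 is a Tensor Core execution mode rather than a storable data type, so it is not listed):

```python
import torch

# Compare dynamic range (max) and precision (eps) of the storable formats.
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):>14}  bits={info.bits:2d}  max={info.max:.2e}  eps={info.eps:.1e}")
```

The output shows bfloat16 sharing float32's range (max ≈ 3.4e38) while float16 tops out near 6.5e4 with finer precision, which is why fp16 training typically needs tricks like loss scaling that bfloat16 can often skip.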

What Is MFU?

Model FLOP Utilization (MFU) measures how efficiently a GPU is used during training: it is the ratio of the FLOP/s the model actually achieves (estimated from its parameter count and token throughput) to the hardware's theoretical peak FLOP/s. A high MFU indicates the GPU is running close to its full potential, with little time lost to memory stalls, communication, or idle waiting. Keeping MFU high is important to minimize training time and maximize the cost-effectiveness of powerful hardware like the H100 or A100.

Computing costs for transformers are typically listed in GPU-hours or FLOP-seconds.

  • GPT architecture models achieve roughly 150 TFLOP/s/A100 with standard attention and 180 TFLOP/s/A100 with Flash Attention. This is in line with other highly optimized libraries at scale; for example, Megatron-DS reports between 137 and 163 TFLOP/s/A100.
  • As a general rule of thumb, you should always be able to achieve approximately 120 TFLOP/s/A100. If you are seeing below 115 TFLOP/s/A100 there is probably something wrong with your model or hardware configuration.
  • With a high-quality interconnect such as InfiniBand, you can achieve linear or near-linear scaling across the data parallel dimension (i.e. increasing the data parallel degree should increase overall throughput nearly linearly). Testing of the GPT-NeoX library on Oak Ridge National Lab's Summit supercomputer demonstrated this behaviour—note that those runs used V100s, while most of the numerical examples in this post are for A100s.
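Putting the definition of MFU together with the throughput figures above, here is a rough sketch of the calculation. It uses the common ~6 FLOPs/parameter/token approximation, an illustrative 6.7B-parameter model and token throughput, and the A100's dense BF16 Tensor Core peak of roughly 312 TFLOP/s:

```python
def mfu(n_params: float, tokens_per_second: float, peak_flops_per_second: float) -> float:
    """Fraction of the hardware's theoretical peak FLOP/s that the model achieves."""
    achieved = 6 * n_params * tokens_per_second  # ~6 FLOPs per param per token
    return achieved / peak_flops_per_second

# Illustrative: 6.7B parameters at 3,800 tokens/s on one A100 (~312 TFLOP/s peak)
print(f"MFU: {mfu(6.7e9, 3_800, 312e12):.1%}")  # ~49%, i.e. roughly 153 TFLOP/s/A100
```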

Model Memory

Most transformers are trained in mixed precision, either fp16 + fp32 or bf16 + fp32. This cuts down on the amount of memory required to train the models, and also the amount of memory required to run inference. We can cast language models from fp32 to fp16 or even int8 without suffering a substantial performance hit. These numbers refer to the size in bits that a single parameter requires. Since there are 8 bits in a byte, we divide this number by 8 to see how many bytes each parameter requires:

  • In int8: memory_model = (1 byte/param) × (No. of params)
  • In fp16: memory_model = (2 bytes/param) × (No. of params)
  • In fp32: memory_model = (4 bytes/param) × (No. of params)
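A minimal sketch of these formulas; the bytes-per-parameter values are just the rules of thumb above, and the 7B parameter count is illustrative:

```python
BYTES_PER_PARAM = {"int8": 1, "fp16": 2, "bf16": 2, "fp32": 4}

def model_memory_gb(n_params: float, precision: str) -> float:
    """Approximate memory, in GB, needed just to hold the model weights."""
    return BYTES_PER_PARAM[precision] * n_params / 1e9

for precision in ("int8", "fp16", "fp32"):
    print(f"7B params in {precision}: ~{model_memory_gb(7e9, precision):.0f} GB")  # 7 / 14 / 28 GB
```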

Total Training Memory

Adding to this the memory needed for optimiser states, gradients, and activations, a good heuristic answer for "will this model fit for training" is:

Total Memory_Training = Model Memory + Optimiser Memory + Activation Memory + Gradient Memory
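As a hedged sketch of that heuristic for mixed-precision training with an Adam-style optimiser: the per-parameter byte counts below are common rules of thumb, not exact figures (2 bytes each for fp16/bf16 weights and gradients, roughly 12 bytes of fp32 optimiser state for the master copy plus two moments), and activation memory is left as an input because it depends heavily on batch size, sequence length, and checkpointing:

```python
def training_memory_gb(n_params: float, activation_gb: float = 0.0) -> float:
    """Rough training-memory estimate (GB) for mixed-precision Adam training."""
    model_bytes = 2 * n_params       # fp16/bf16 weights
    gradient_bytes = 2 * n_params    # fp16/bf16 gradients
    optimizer_bytes = 12 * n_params  # fp32 master weights + momentum + variance
    return (model_bytes + gradient_bytes + optimizer_bytes) / 1e9 + activation_gb

print(f"~{training_memory_gb(7e9):.0f} GB + activations for a 7B-parameter model")  # ~112 GB
```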

Inference VRAM Memory

In addition to the memory needed to store the model weights, there is also a small amount of additional overhead during the actual forward pass. In our experience this overhead is ≤ 20% and is typically irrelevant to determining the largest model that will fit on your GPU.

In total, a good heuristic answer for "will this model fit for inference" is:

Total Memory_inference = 1.2 × Model Memory
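As a quick worked example, a 7-billion-parameter model served in fp16 needs about 2 bytes/param × 7B ≈ 14 GB for its weights, so the heuristic suggests budgeting roughly 1.2 × 14 GB ≈ 17 GB of VRAM.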

Conclusion

GPUs are the backbone of modern AI and machine learning, particularly for training large language models. Their ability to handle vast amounts of parallel computation through thousands of cores makes them uniquely suited for the task. Metrics like FLOP/s help measure their computational power, while VRAM ensures that large models and datasets can fit into memory. With cutting-edge GPUs like the H100, A100, and others, the landscape of LLM training is moving forward rapidly, enabling breakthroughs in what these models can achieve.

Understanding how to choose the right GPU, optimize its utilization, and balance precision types is key to unlocking the full potential of large language models and staying competitive in the fast-evolving AI landscape.

Looking to grow your business by leveraging AI?

Let's discuss how we can transform your business operations, enhance customer experiences, and drive growth by leveraging AI.

Book a free consultation