Understanding Memory and Throughput in LLM Training: A Practical Example

Introduction

Large Language Models (LLMs) like GPT-3 and BERT are at the forefront of AI advancements, powering applications from natural language understanding to generative text. These models, however, bring significant challenges in terms of memory usage and throughput. To simplify these complex concepts, we'll use a practical example with a Multi-Layer Perceptron (MLP), a foundational neural network architecture. This blog will guide you through calculating memory requirements and throughput, setting the stage for understanding these metrics in more complex models like LLMs.

Understanding the MLP Model

An MLP consists of an input layer, multiple hidden layers, and an output layer, with each neuron fully connected to neurons in the subsequent layer. This foundational architecture, though simpler than LLMs, helps illustrate the core principles of managing computational resources in neural networks.

Example MLP Configuration:

  • Input Size: 10,000 neurons

  • Hidden Layers: 10 layers, each with 10,000 neurons

  • Output Layer: 10,000 neurons

  • Data Type: float32 (4 bytes per float)
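
To make this configuration concrete, here is a minimal PyTorch sketch of the architecture. The `build_mlp` helper and the ReLU activation are illustrative assumptions; the example above does not specify an activation function:

```python
import torch.nn as nn

# Sizes from the example configuration above.
INPUT_SIZE = 10_000
HIDDEN_SIZE = 10_000
NUM_HIDDEN_LAYERS = 10
OUTPUT_SIZE = 10_000

def build_mlp() -> nn.Sequential:
    # ReLU is an assumed activation; the example does not specify one.
    layers = []
    in_features = INPUT_SIZE
    for _ in range(NUM_HIDDEN_LAYERS):
        layers += [nn.Linear(in_features, HIDDEN_SIZE), nn.ReLU()]
        in_features = HIDDEN_SIZE
    layers.append(nn.Linear(in_features, OUTPUT_SIZE))
    return nn.Sequential(*layers)

# Note: instantiating this allocates ~1.1 billion float32 parameters (~4.4 GB).
model = build_mlp()
print(f"{sum(p.numel() for p in model.parameters()):,} parameters")
```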

Calculating Memory Requirements

1. Model Parameter Memory

Each layer's memory requirements for storing weights and biases are calculated as follows:

Weights per Layer:

  • The number of weights is the product of the number of inputs and outputs per layer: n × m.

  • For 10,000 input neurons and 10,000 output neurons, the weights per layer are 10^4 × 10^4 = 10^8.

  • Memory required: 10^8 weights × 4 bytes/weight = 4 × 10^8 bytes = 400 MB per layer.

Biases per Layer:

  • Each neuron has an associated bias: 10^4 biases × 4 bytes/bias = 40 KB per layer.

Total Parameter Memory:

  • For 10 layers, the total memory for weights is 10 × 400 MB = 4 GB. (Strictly, counting the input-to-hidden, hidden-to-hidden, and hidden-to-output connections gives 11 weight matrices, about 4.4 GB; we keep the round count of 10 to simplify the arithmetic.) Biases add a negligible 10 × 40 KB = 400 KB.
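
The same arithmetic in a few lines of plain Python (using decimal MB/GB, matching the text):

```python
BYTES_PER_FLOAT32 = 4
n = m = 10_000      # inputs and outputs per layer
num_layers = 10

weight_bytes_per_layer = n * m * BYTES_PER_FLOAT32   # 4e8 bytes
bias_bytes_per_layer = m * BYTES_PER_FLOAT32         # 4e4 bytes

print(weight_bytes_per_layer / 1e6)                  # 400.0 MB per layer
print(bias_bytes_per_layer / 1e3)                    # 40.0 KB per layer
print(num_layers * weight_bytes_per_layer / 1e9)     # 4.0 GB for the weights
```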

2. Activation Memory

Activation memory stores intermediate outputs (activations) during the forward pass, which are needed for backpropagation.

Activation Memory per Layer per Sample:

  • For each sample, the activations require 10^4 × 4 bytes = 40 KB per layer.

Total Activation Memory per Sample:

  • Counting the outputs of all 10 hidden layers plus the output layer, there are 11 sets of activations to store.

  • Total activation memory per sample: 11 × 40 KB = 440 KB.
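
The per-sample activation footprint, as a quick check:

```python
act_bytes_per_layer = 10_000 * 4   # 10^4 float32 activations = 40 KB
num_activation_sets = 11           # 10 hidden layers plus the output layer
act_bytes_per_sample = num_activation_sets * act_bytes_per_layer
print(act_bytes_per_sample / 1e3)  # 440.0 KB per sample
```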

Batch Size Calculation: To find the maximum batch size B that fits within a given memory limit, say 8 GB:

4 GB (parameters) + B × 440 KB (activations) ≤ 8 GB

B × 440 KB ≤ 4 GB

B ≤ 4 GB / 440 KB ≈ 9,090

Thus, the maximum batch size is roughly 9,090 samples.
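
You can solve the same inequality in a couple of lines (the 8 GB budget is the figure assumed above):

```python
param_bytes = 4e9                  # 4 GB of parameters
act_bytes_per_sample = 440e3       # 440 KB per sample, from the previous step
memory_budget = 8e9                # assumed 8 GB limit

max_batch = int((memory_budget - param_bytes) // act_bytes_per_sample)
print(max_batch)                   # 9090
```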

Calculating Throughput

Computational Cost per Layer: For a fully connected layer with n inputs and m outputs, the number of floating-point operations (FLOPs) required includes:

  • Forward Pass: 2nm FLOPs (multiplications and additions)

  • Backward Pass: Approximately 4nm FLOPs (due to additional gradient computations)

For our example with 10 layers, each with 10,000 inputs and outputs:

  • FLOPs per Layer: 2 × 10^8 FLOPs for the forward pass, 4 × 10^8 FLOPs for the backward pass.

  • Total FLOPs per Sample per Layer: 6 × 10^8 FLOPs.

Total FLOPs per Sample:

  • For 10 layers: 10 × 6 × 10^8 FLOPs = 6 × 10^9 FLOPs per sample.

Throughput Calculation: Given a hardware setup capable of a peak 100 TFLOPS (100 trillion FLOPs per second):

Throughput = (100 × 10^12 FLOPs/s) / (6 × 10^9 FLOPs/sample) ≈ 16,667 samples/s

This is an upper bound: it assumes every FLOP of peak capacity is put to work, which real training workloads never achieve.
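
Putting the FLOPs arithmetic and the throughput bound together in a sketch (100 TFLOPS is the assumed peak figure from above):

```python
n = m = 10_000                    # inputs and outputs per layer
num_layers = 10

flops_per_sample = num_layers * (2 * n * m + 4 * n * m)  # 6e9 FLOPs
peak_flops_per_s = 100e12                                # assumed 100 TFLOPS peak

print(peak_flops_per_s / flops_per_sample)  # ~16666.7 samples/s, an upper bound
```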

Connecting MLPs to LLMs

While MLPs are much simpler than LLMs, the principles of memory management and throughput calculation are directly applicable to larger models. LLMs, which can contain billions of parameters, face similar challenges but on a larger scale. Techniques such as model pruning, quantization, and the use of specialized hardware are essential for managing these complex models effectively. Understanding these basics with MLPs provides a foundation for tackling the intricacies of LLMs.

Conclusion

This guide illustrates the foundational concepts of memory and throughput in neural networks using an MLP model as a simplified example. As you move toward more complex architectures like LLMs, these core principles remain essential for optimizing performance and resource usage. Whether you are a beginner or an advanced practitioner, understanding these fundamentals will help you navigate the complexities of modern AI models and ensure efficient deployment in real-world applications.
