Modern deep learning models have grown exponentially in size and complexity. GPT-4 is reported to have over a trillion parameters, and even “smaller” models like LLaMA-70B require substantial computational resources. Training or fine-tuning such models on a single GPU is often impossible, not just because of time constraints but because the model itself may not fit in the memory of a single device. This is where distributed training becomes essential.

Why Do We Need Distributed Training?

The Memory Wall Problem

A modern GPU like the NVIDIA A100 has 80GB of memory. Sounds like a lot? Let’s do some math:

  • A model with 7 billion parameters in FP32 requires: $7B \times 4 \text{ bytes} = 28\text{GB}$
  • But during training, we also need:
    • Gradients: another 28GB
    • Optimizer states (Adam keeps two extra values per parameter, the first and second moments): 56GB more
    • Activations for backpropagation: varies, but often substantial

A 7B parameter model can easily require 150GB+ during training, far exceeding what a single GPU can handle.
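
To make the arithmetic concrete, here is a quick back-of-the-envelope sketch in Python. Activations are excluded because they depend on batch size, sequence length, and architecture; the numbers and function name are illustrative only.

```python
# Rough training-memory estimate for FP32 weights + gradients + Adam states.
def training_memory_gb(n_params: float, bytes_per_param: int = 4) -> dict:
    weights = n_params * bytes_per_param        # one FP32 value per parameter
    grads = n_params * bytes_per_param          # one gradient per parameter
    optimizer = 2 * n_params * bytes_per_param  # Adam: first + second moments
    to_gb = lambda b: b / 1e9
    return {
        "weights_gb": to_gb(weights),
        "grads_gb": to_gb(grads),
        "optimizer_gb": to_gb(optimizer),
        "total_gb": to_gb(weights + grads + optimizer),
    }

print(training_memory_gb(7e9))
# {'weights_gb': 28.0, 'grads_gb': 28.0, 'optimizer_gb': 56.0, 'total_gb': 112.0}
```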

The Time Constraint

Even if a model fits in memory, training on a single GPU can take prohibitively long. Consider:

  • Training GPT-3 on a single V100 GPU would take approximately 355 years
  • With distributed training across thousands of GPUs, this was reduced to weeks

Types of Parallelism

Distributed training employs different parallelism strategies:

| Strategy | What's Parallelized | When to Use |
| --- | --- | --- |
| Data Parallelism | Training data across model replicas | Large datasets; model fits on a single GPU |
| Model/Tensor Parallelism | Individual layers (weight matrices) split across devices | Very large layers (e.g., attention in transformers) |
| Pipeline Parallelism | Model stages across devices | Deep models with many sequential layers |
| Hybrid (3D Parallelism) | Combination of the above | Extremely large models (100B+ parameters) |

This post dives deep into three fundamental approaches: DDP, Pipeline Parallelism, and FSDP.


Distributed Data Parallel (DDP)

DDP is the most straightforward and commonly used approach for distributed training. The core idea is simple: replicate the entire model on each GPU, split the data batch across GPUs, and synchronize gradients.

How DDP Works

Click the Play button to see a visualization of how DDP works.

DDP visualization
  1. Model Replication: Each GPU gets a complete copy of the model with identical initial weights
  2. Data Sharding: The training batch is split equally among all GPUs
  3. Forward Pass: Each GPU processes its data shard independently
  4. Backward Pass: Each GPU computes gradients for its local data
  5. Gradient Synchronization: All GPUs synchronize their gradients using AllReduce
  6. Weight Update: Each GPU applies the synchronized gradients to update its local model
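
In PyTorch, all six steps reduce to a few lines around an ordinary training loop. The sketch below assumes a single node launched with `torchrun --nproc_per_node=N train.py`; the toy model and dataset stand in for a real workload.

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def main():
    # torchrun sets RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR, MASTER_PORT.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy model and dataset as stand-ins for a real workload.
    model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10)).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])                 # step 1: replicate the model

    dataset = TensorDataset(torch.randn(4096, 128), torch.randint(0, 10, (4096,)))
    sampler = DistributedSampler(dataset)                       # step 2: shard the data
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for epoch in range(3):
        sampler.set_epoch(epoch)                                # reshuffle shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            loss = nn.functional.cross_entropy(model(x), y)     # steps 3-4: local forward/backward
            loss.backward()                                     # step 5: AllReduce overlaps with this
            optimizer.step()                                    # step 6: identical update on every rank
            optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```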

The magic happens in the AllReduce operation, which efficiently computes the average of gradients across all GPUs and distributes the result back to each GPU.

AllReduce: The Heart of DDP

AllReduce is a collective communication operation that:

  1. Takes input tensors from all processes
  2. Applies a reduction operation (typically sum or average)
  3. Distributes the result to all processes
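
DDP performs this synchronization for you, bucketing gradients and overlapping communication with the backward pass. Conceptually, though, it is equivalent to the hand-rolled averaging sketch below (shown for illustration only; you would not call this yourself alongside DDP).

```python
import torch.distributed as dist

def average_gradients(model):
    # Unoptimized version of what DDP does after backward:
    # sum each gradient across ranks, then divide by the number of ranks.
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad.div_(world_size)
```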

Pros of DDP

| Advantage | Description |
| --- | --- |
| Simple Implementation | Minimal code changes required; wrap the model in DistributedDataParallel |
| Linear Scaling | Near-linear speedup with more GPUs when computation dominates communication |
| No Model Changes | Works with any model architecture without modification |
| Fault Tolerance | Easy to checkpoint and resume training |
| Overlapping Communication | Gradient synchronization overlaps with the backward computation |

Cons of DDP

| Disadvantage | Description |
| --- | --- |
| Memory Redundancy | Full model replicated on each GPU |
| Model Size Limit | Model must fit entirely in a single GPU's memory |
| Communication Overhead | AllReduce volume scales with model size |
| Synchronization Barrier | All GPUs must wait for the slowest one (stragglers) |

When to Use DDP

DDP is ideal when:

  • Your model fits comfortably in a single GPU
  • You want simple, robust distributed training
  • You’re scaling across multiple machines with fast interconnects

Pipeline Parallelism

When models are too large to fit on a single GPU, we need to partition them across devices. Pipeline Parallelism splits the model into stages, where each stage runs on a different GPU.

How Pipeline Parallelism Works

Click the Play button.

Pipeline parallelism
  1. Model Partitioning: Split model into N sequential stages
  2. Stage Assignment: Each GPU handles one or more stages
  3. Micro-batching: Split input batch into smaller micro-batches
  4. Pipeline Execution: Process micro-batches in a pipelined fashion
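
The four steps above can be sketched by hand. The version below assumes two GPUs and an arbitrary two-stage split of a toy model; it processes micro-batches strictly one after another, so it does not yet hide the idle time discussed next.

```python
# Hand-rolled two-stage pipeline sketch (illustrative; real implementations such as
# GPipe or 1F1B interleave micro-batches across stages to overlap their work).
import torch
import torch.nn as nn

# Steps 1-2: partition a sequential model across two GPUs.
stage0 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
stage1 = nn.Sequential(nn.Linear(4096, 1024)).to("cuda:1")

def forward_pipeline(batch, num_microbatches=4):
    outputs = []
    # Step 3: split the batch into micro-batches.
    for micro in batch.chunk(num_microbatches):
        h = stage0(micro.to("cuda:0"))   # stage 0 runs on GPU 0
        h = h.to("cuda:1")               # send activations to the next stage
        outputs.append(stage1(h))        # stage 1 runs on GPU 1
    return torch.cat(outputs)            # step 4 (without overlap in this sketch)

out = forward_pipeline(torch.randn(32, 1024))
```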

The Bubble Problem

Naive pipeline parallelism has a significant issue: pipeline bubbles. Notice the Device Utilization Timeline in the animation above as training progresses. It shows “bubbles” where GPUs sit idle waiting for data from earlier stages, wasting compute.

Reducing Bubbles with Micro-batching

The key optimization is to use many micro-batches:

$$\text{Bubble Fraction} = \frac{p - 1}{m}$$

Where $p$ is the number of pipeline stages and $m$ is the number of micro-batches; the fraction measures idle (bubble) time relative to useful compute time. For example, with $p = 4$ stages and $m = 16$ micro-batches the overhead is $3/16 \approx 19\%$, while with $m = 64$ it drops below 5%. With enough micro-batches, the bubble overhead becomes negligible.

Pipeline Schedules

Different scheduling strategies minimize bubbles:

| Schedule | Description | Memory | Bubble Ratio |
| --- | --- | --- | --- |
| GPipe | All forwards, then all backwards | High (stores activations for all micro-batches) | $(p-1)/m$ |
| 1F1B | Alternates forward and backward passes | Lower | $(p-1)/m$ |
| Interleaved 1F1B | Each GPU holds multiple virtual stages | Lowest | $(p-1)/(m \cdot v)$ |

Here $v$ is the number of virtual stages per GPU in the interleaved schedule.

Pros of Pipeline Parallelism

| Advantage | Description |
| --- | --- |
| Scales Model Size | Train models larger than a single GPU's memory |
| Lower Communication | Only activations are transferred between stages |
| Works with Sequential Models | Natural fit for stacks of transformer layers |
| Memory Efficient | Each GPU holds only a subset of the model |

Cons of Pipeline Parallelism

| Disadvantage | Description |
| --- | --- |
| Pipeline Bubbles | Idle time reduces GPU utilization |
| Complex Implementation | Requires careful model partitioning |
| Load Balancing | Stages must have similar compute cost |
| Increased Latency | A forward pass must traverse all stages |
| Gradient Staleness | Some schedules apply delayed gradient updates |

When to Use Pipeline Parallelism

Pipeline parallelism shines when:

  • Model doesn’t fit on a single GPU but isn’t excessively large
  • Model has clear sequential structure (e.g., transformer blocks)
  • You have limited inter-GPU bandwidth
  • You plan to combine it with data parallelism for better scaling

Fully Sharded Data Parallel (FSDP)

FSDP represents a paradigm shift in distributed training. Instead of replicating the entire model on each GPU (like DDP), FSDP shards the model parameters, gradients, and optimizer states across all GPUs.

How FSDP Works

FSDP visualization

FSDP follows a gather-compute-scatter pattern:

  1. Sharding: Model parameters are partitioned across all GPUs
  2. AllGather: Before forward pass, gather full parameters for current layer
  3. Forward Compute: Execute layer with full parameters
  4. Discard: After forward, discard non-local parameters to save memory
  5. Repeat for backward pass: AllGather → Compute gradients → ReduceScatter
  6. ReduceScatter: Distribute and reduce gradients back to shards
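
A minimal FSDP sketch using PyTorch's `FullyShardedDataParallel` is shown below. The toy model stands in for a large transformer, and the size-based auto-wrap policy simply shards any submodule above a parameter-count threshold; a real configuration would typically wrap by transformer block.

```python
import os
import functools
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Toy stand-in for a large model.
model = nn.Sequential(*[nn.Linear(1024, 1024) for _ in range(8)]).cuda(local_rank)

# Step 1: shard every submodule above a parameter-count threshold.
wrap_policy = functools.partial(size_based_auto_wrap_policy, min_num_params=100_000)
model = FSDP(model, auto_wrap_policy=wrap_policy)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Each forward/backward performs the AllGather -> compute -> discard -> ReduceScatter
# cycle described above; the training loop itself looks exactly like DDP's.
x = torch.randn(32, 1024).cuda(local_rank)
loss = model(x).sum()
loss.backward()
optimizer.step()
```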

Memory Savings

The memory savings with FSDP are dramatic:

| Component | DDP Memory | FSDP Memory |
| --- | --- | --- |
| Parameters | $\Phi$ per GPU | $\Phi / N$ per GPU |
| Gradients | $\Phi$ per GPU | $\Phi / N$ per GPU |
| Optimizer States (Adam) | $2\Phi$ per GPU | $2\Phi / N$ per GPU |
| Total | $4\Phi$ per GPU | $4\Phi / N$ per GPU |

Where $\Phi$ is the memory taken by the model parameters (28GB for a 7B model in FP32) and $N$ is the number of GPUs.

For a 7B model on 8 GPUs:

  • DDP: 28GB × 4 = 112GB per GPU (doesn’t fit on 80GB A100!)
  • FSDP: 112GB / 8 = 14GB per GPU ✓

Sharding Strategies

FSDP offers flexible sharding strategies:

| Strategy | What's Sharded | Memory | Communication |
| --- | --- | --- | --- |
| FULL_SHARD | Parameters, gradients, optimizer states | Minimum | Maximum |
| SHARD_GRAD_OP | Gradients, optimizer states | Medium | Medium |
| NO_SHARD | Nothing (equivalent to DDP) | Maximum | Minimum |
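
Picking a strategy is a one-line change when wrapping the model; for example, assuming an already-constructed `model`:

```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy

sharded_model = FSDP(
    model,                                              # an already-built nn.Module
    sharding_strategy=ShardingStrategy.SHARD_GRAD_OP,   # shard gradients + optimizer state,
)                                                       # keep a full parameter copy per GPU
```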

Pros of FSDP

| Advantage | Description |
| --- | --- |
| Massive Memory Savings | Linear reduction in memory with GPU count |
| Train Huge Models | Enables training models that don't fit on a single GPU |
| Flexible Sharding | Choose the tradeoff between memory and communication |
| Native PyTorch | Well integrated into the PyTorch ecosystem |
| Mixed Precision | Works seamlessly with AMP/BF16 |
| Activation Checkpointing | Combines well with gradient checkpointing |

Cons of FSDP

| Disadvantage | Description |
| --- | --- |
| Communication Overhead | More collective operations than DDP |
| Complexity | More configuration options to tune |
| Debugging Difficulty | Harder to inspect distributed, sharded state |
| Checkpoint Complexity | Saving/loading requires special handling |
| Latency | AllGather adds latency before each layer's computation |

When to Use FSDP

FSDP is the right choice when:

  • Model doesn’t fit on single GPU even with mixed precision
  • You need to train models with billions of parameters
  • You have fast GPU interconnects (NVLink, InfiniBand)
  • Memory is the primary bottleneck

Comparison Summary

| Aspect | DDP | Pipeline Parallel | FSDP |
| --- | --- | --- | --- |
| Memory per GPU | Full model | Model / #stages | Model / #GPUs |
| Communication | AllReduce | Point-to-point | AllGather + ReduceScatter |
| Complexity | Low | Medium | Medium-High |
| Model Size Limit | Single GPU | Total GPU memory | Total GPU memory |
| GPU Utilization | High | Medium (bubbles) | High |
| Best For | Small-medium models | Sequential models | Large models |

Choosing the Right Strategy

graph LR;
    A[Model] --> B{Fit on single GPU?}
    B --> |YES| G[Use DDP]
    B --> |NO| C{Do you have fast interconnect?}
    C --> |YES| E[Use FSDP]
    C --> |NO| F[Use Pipeline Parallel]

Practical Recommendations

  1. Start with DDP if your model fits on a single GPU. DDP is the simplest and most efficient.

  2. Switch to FSDP when memory becomes the bottleneck. Start with SHARD_GRAD_OP for less communication overhead.

  3. Add Pipeline Parallelism for very deep models, especially when combined with FSDP for each pipeline stage.

  4. Use 3D Parallelism (Data + Pipeline + Tensor) for even bigger models (100B+ parameters).

  5. Profile and measure: Use tools like PyTorch Profiler to identify bottlenecks.
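
For example, a rough sketch of profiling one training step with `torch.profiler`; here `model` and `batch` are placeholders for your distributed model and one batch of data.

```python
import torch
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    loss = model(batch).sum()   # one forward pass
    loss.backward()             # one backward pass (includes collective communication)

# Sort by GPU time to see whether compute kernels or communication collectives dominate.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```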

Conclusion

Distributed training is essential for working with state-of-the-art neural network models. Understanding these three fundamental approaches gives you the tools to train models of any size.

  • DDP for simplicity and efficiency with smaller models
  • Pipeline Parallelism for scaling deep sequential models
  • FSDP for massive models that exceed single-GPU memory

Further Reading