Slides: Distributed Training for ML
Explore distributed training techniques through this interactive presentation. Navigate through the slides using arrow keys or the navigation controls.
Topics Covered
- Back to Basics: Understanding neural network fundamentals
- Why Distributed Training: Memory constraints and scaling challenges
- DDP (Distributed Data Parallel): Replicating the model across GPUs
- Pipeline Parallelism: Splitting models across devices
- FSDP (Fully Sharded Data Parallel): Advanced sharding techniques
Slides
Use the arrow keys (← →) or click the navigation arrows to move between slides. Some slides include animations that you can step through using the animation controls at the bottom.
The visualizations and MNIST example images in the “Back to Basics” section are adapted from the educational content by 3Blue1Brown.
Key Takeaways
Distributed Data Parallel (DDP)
- Best for: Models that fit in single GPU memory
- How it works: Full model replica on each GPU (see the sketch below)
- Trade-off: High memory usage but simple implementation
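Below is a minimal DDP training sketch in PyTorch. It assumes a multi-GPU machine with the NCCL backend and a launch via torchrun; the tiny model, random data, and hyperparameters are placeholders rather than the exact code behind the slides.

```python
# Minimal DDP sketch: one process per GPU, each holding a full model replica.
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK for every process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Full replica of the model on this process's GPU.
    model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10)).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    for _ in range(10):
        # Placeholder data; a real loader would use DistributedSampler
        # so each rank sees a different shard of the dataset.
        x = torch.randn(32, 784, device=local_rank)
        y = torch.randint(0, 10, (32,), device=local_rank)

        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()   # DDP all-reduces (averages) gradients across ranks here
        optimizer.step()  # every replica applies the same averaged update

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Launched with, for example, `torchrun --nproc_per_node=4 train_ddp.py`, each process drives one GPU, and the replicas stay in sync because gradients are averaged during `backward()`.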
Pipeline Parallelism
- Best for: Very deep sequential models
- How it works: Different layers on different GPUs (see the sketch below)
- Trade-off: Requires careful micro-batch sizing to minimize pipeline bubbles (idle time)
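The sketch below shows the idea with two pipeline stages on a single machine with two GPUs (cuda:0 and cuda:1). It is a toy under stated assumptions: the layer split, micro-batch count, and data are made up, and it uses a naive schedule rather than a full 1F1B pipeline.

```python
# Minimal pipeline-parallel sketch: layers split across two GPUs,
# with micro-batching to reduce the time each stage sits idle.
import torch
import torch.nn as nn

# Stage 0 lives on GPU 0, stage 1 on GPU 1.
stage0 = nn.Sequential(nn.Linear(784, 512), nn.ReLU()).to("cuda:0")
stage1 = nn.Linear(512, 10).to("cuda:1")

optimizer = torch.optim.SGD(list(stage0.parameters()) + list(stage1.parameters()), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

# Placeholder batch; a real run would pull this from a data loader.
batch = torch.randn(64, 784)
targets = torch.randint(0, 10, (64,))

optimizer.zero_grad()
# Split the batch into micro-batches so GPU 1 can start on the first
# micro-batch while GPU 0 moves on to the next one.
for x, y in zip(batch.chunk(4), targets.chunk(4)):
    h = stage0(x.to("cuda:0"))        # first half of the model on GPU 0
    out = stage1(h.to("cuda:1"))      # activations cross devices to GPU 1
    loss = loss_fn(out, y.to("cuda:1"))
    loss.backward()                   # gradients flow back through both stages
optimizer.step()
```

Fewer, larger micro-batches mean bigger pipeline bubbles; more, smaller ones reduce idle time but add per-step overhead, which is the batch-sizing trade-off noted above.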
FSDP (Fully Sharded Data Parallel)
- Best for: Very large models (100B+ parameters)
- How it works: Shards model parameters, gradients, and optimizer states across GPUs (see the sketch below)
- Trade-off: More complex but enables training of massive models
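A minimal FSDP sketch follows, assuming a reasonably recent PyTorch (1.12 or newer) with the NCCL backend and a torchrun launch; the model is a small stand-in for the large networks FSDP is aimed at.

```python
# Minimal FSDP sketch: parameters, gradients, and optimizer states are
# sharded across ranks instead of replicated as in DDP.
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy model; in practice this is where a multi-billion-parameter network goes.
    model = nn.Sequential(nn.Linear(784, 4096), nn.ReLU(), nn.Linear(4096, 10)).cuda(local_rank)

    # Wrapping in FSDP shards parameters, gradients, and optimizer state across
    # ranks. Here the whole model is a single FSDP unit; real setups usually pass
    # an auto_wrap_policy so layers are sharded (and gathered) one at a time.
    model = FSDP(model)

    # The optimizer only ever sees this rank's parameter shards.
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    x = torch.randn(32, 784, device=local_rank)
    y = torch.randint(0, 10, (32,), device=local_rank)

    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()    # gradients are reduce-scattered so each rank keeps only its shard
    optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

The per-GPU memory saving is what makes the extra communication worthwhile once a model no longer fits on a single device.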
For a detailed written guide on these techniques, check out my Distributed Training blog post.

Discussion