Explore distributed training techniques through this interactive presentation. Navigate through the slides using arrow keys or the navigation controls.

Topics Covered

  • Back to Basics: Understanding neural network fundamentals
  • Why Distributed Training: Memory constraints and scaling challenges
  • DDP (Distributed Data Parallel): Replicating the model across GPUs
  • Pipeline Parallelism: Splitting models across devices
  • FSDP (Fully Sharded Data Parallel): Advanced sharding techniques

Slides

Use the arrow keys (← →) or click the navigation arrows to move between slides. Some slides include animations that you can step through using the animation controls at the bottom.

Distributed training presentation
The MNIST visualizations and images in the “Back to Basics” section are from 3Blue1Brown's educational content.

Key Takeaways

Distributed Data Parallel (DDP)

  • Best for: Models that fit in a single GPU's memory
  • How it works: Full model replica on each GPU; gradients are averaged across replicas after each backward pass (see the sketch below)
  • Trade-off: High per-GPU memory usage, but simple to implement
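
To make this concrete, here is a minimal sketch of a DDP training loop, assuming PyTorch and its `torch.nn.parallel.DistributedDataParallel` wrapper; the toy model, data, and hyperparameters are placeholders, not taken from the slides.

```python
# Minimal DDP sketch: one process per GPU, each holding a full model replica;
# gradients are averaged across processes during backward().
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # Launched via `torchrun --nproc_per_node=<num_gpus> this_script.py`,
    # which sets RANK, LOCAL_RANK, and WORLD_SIZE for us.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Full model replica on this GPU (placeholder MNIST-sized MLP).
    model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10)).cuda(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])

    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=1e-2)
    loss_fn = nn.CrossEntropyLoss()

    # Each rank trains on its own slice of the data (normally via DistributedSampler);
    # random tensors stand in for a real dataloader here.
    for _ in range(10):
        inputs = torch.randn(32, 784, device=local_rank)
        targets = torch.randint(0, 10, (32,), device=local_rank)

        optimizer.zero_grad()
        loss = loss_fn(ddp_model(inputs), targets)
        loss.backward()   # gradients are all-reduced (averaged) across ranks here
        optimizer.step()  # every replica applies the same averaged update

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```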

Pipeline Parallelism

  • Best for: Very deep sequential models
  • How it works: Different layers (stages) on different GPUs, with micro-batches flowing through the stages (see the sketch below)
  • Trade-off: Requires careful micro-batch sizing to minimize idle time (pipeline bubbles)
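
Below is a toy illustration of the idea, assuming PyTorch and two CUDA devices. It only shows the layer split and micro-batching; a real pipeline schedule (e.g. GPipe or 1F1B) interleaves micro-batches so that both stages stay busy.

```python
# Toy two-stage pipeline: the first half of the model lives on cuda:0 and the
# second half on cuda:1. The batch is split into micro-batches that flow
# through the stages one after another.
import torch
import torch.nn as nn

class TwoStagePipeline(nn.Module):
    def __init__(self, hidden=1024, num_classes=10):
        super().__init__()
        self.stage0 = nn.Sequential(nn.Linear(784, hidden), nn.ReLU()).to("cuda:0")
        self.stage1 = nn.Sequential(nn.Linear(hidden, num_classes)).to("cuda:1")

    def forward(self, x, num_microbatches=4):
        outputs = []
        # Naive schedule: each micro-batch passes through stage0, is copied to
        # the second GPU, then passes through stage1. An optimized schedule
        # would overlap stage0(micro i+1) with stage1(micro i).
        for micro in x.chunk(num_microbatches):
            hidden = self.stage0(micro.to("cuda:0"))
            outputs.append(self.stage1(hidden.to("cuda:1")))
        return torch.cat(outputs)

model = TwoStagePipeline()
loss = nn.CrossEntropyLoss()(model(torch.randn(64, 784)),
                             torch.randint(0, 10, (64,), device="cuda:1"))
loss.backward()  # autograd routes gradients back across both devices
```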

FSDP (Fully Sharded Data Parallel)

  • Best for: Very large models (100B+ parameters)
  • How it works: Shards model parameters, gradients, and optimizer states across GPUs, gathering full parameters only when a layer needs them (see the sketch below)
  • Trade-off: More setup complexity and communication overhead, but enables training of massive models
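
Here is a minimal sketch assuming PyTorch's `torch.distributed.fsdp.FullyShardedDataParallel`; real configurations usually add an auto-wrap policy, mixed precision, and activation checkpointing on top of this.

```python
# Minimal FSDP sketch: parameters, gradients, and optimizer states are sharded
# across ranks; full parameters are gathered only for the layers currently
# computing, then freed again.
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    # Launched via `torchrun --nproc_per_node=<num_gpus> this_script.py`.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model; sizes are illustrative only.
    model = nn.Sequential(
        nn.Linear(4096, 4096), nn.ReLU(),
        nn.Linear(4096, 4096), nn.ReLU(),
        nn.Linear(4096, 10),
    ).cuda(local_rank)

    # Wrapping shards the flattened parameters across all ranks; each rank now
    # holds only 1/world_size of the weights, gradients, and optimizer state.
    sharded_model = FSDP(model)

    optimizer = torch.optim.AdamW(sharded_model.parameters(), lr=1e-4)

    inputs = torch.randn(8, 4096, device=local_rank)
    targets = torch.randint(0, 10, (8,), device=local_rank)

    loss = nn.CrossEntropyLoss()(sharded_model(inputs), targets)
    loss.backward()   # reduce-scatter leaves each rank with its gradient shard
    optimizer.step()  # each rank updates only its own parameter shard

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```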

For a detailed written guide on these techniques, check out my Distributed Training blog post.