FSDP: Fully Sharded Data Parallelism
Visualization: the training data is split into three batches (Batch 0, Batch 1, Batch 2), and the model has three hidden layers with weight matrices W1 (Input -> Hidden L1), W2 (Hidden L1 -> Hidden L2), and W3 (Hidden L2 -> Hidden L3). Each of the three GPUs owns one of these weights and trains on its own batch:

GPU 0 owns W1 (Input -> L1) and processes Batch 0.
GPU 1 owns W2 (L1 -> L2) and processes Batch 1.
GPU 2 owns W3 (L2 -> L3) and processes Batch 2.

Each GPU panel shows slots for W1, W2, and W3 together with a memory-usage bar.
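In FSDP, each rank permanently stores only its own parameter shard; the full weights of a layer are all-gathered just before they are needed in the forward and backward pass and freed afterwards, and gradients are reduce-scattered back to the shard owners. As a rough sketch of the setup shown above (module names, sizes, and the training loop are illustrative assumptions, not taken from the demo), the PyTorch snippet below builds a three-hidden-layer model, wraps it with torch.distributed.fsdp.FullyShardedDataParallel, and uses a DistributedSampler so that ranks 0, 1, and 2 each receive their own batch.

```python
# Hypothetical sketch of the demo setup: 3 ranks, a 3-hidden-layer MLP,
# parameters sharded with PyTorch FSDP, one data shard per rank.
# Launch with: torchrun --nproc_per_node=3 fsdp_demo.py
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


class ThreeHiddenLayerMLP(nn.Module):
    """Input -> W1 -> Hidden L1 -> W2 -> Hidden L2 -> W3 -> Hidden L3."""

    def __init__(self, d_in=32, d_hidden=64, d_out=10):
        super().__init__()
        self.w1 = nn.Linear(d_in, d_hidden)      # W1: Input -> Hidden L1
        self.w2 = nn.Linear(d_hidden, d_hidden)  # W2: Hidden L1 -> Hidden L2
        self.w3 = nn.Linear(d_hidden, d_out)     # W3: Hidden L2 -> Hidden L3

    def forward(self, x):
        x = torch.relu(self.w1(x))
        x = torch.relu(self.w2(x))
        return self.w3(x)


def main():
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each rank builds the full module once; FSDP then shards its parameters,
    # so afterwards every rank stores only its slice of W1/W2/W3 and
    # all-gathers the full weights just-in-time for forward/backward.
    model = FSDP(ThreeHiddenLayerMLP().cuda())

    # Synthetic dataset; DistributedSampler gives each rank a disjoint shard,
    # playing the role of Batch 0 / Batch 1 / Batch 2 in the demo.
    data = TensorDataset(torch.randn(96, 32), torch.randint(0, 10, (96,)))
    sampler = DistributedSampler(data)
    loader = DataLoader(data, batch_size=32, sampler=sampler)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = nn.CrossEntropyLoss()

    for x, y in loader:
        x, y = x.cuda(), y.cuda()
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()   # gradients are reduce-scattered back to shard owners
        optimizer.step()  # each rank updates only the shard it owns

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Note that FSDP as configured here flat-shards the wrapped unit evenly across all ranks (each GPU holds roughly a third of every weight) rather than assigning a whole layer to a single GPU as the demo simplifies; the memory-saving idea, owning only a shard and gathering full weights on demand, is the same.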