DDP Benchmark

Eight models, eight distributed modes, 70 training runs.
Two mismatched GPUs training faster than the fast one alone.

Hardware: RTX 5060 Ti (sm_120, 15 GB) + GTX 1060 (sm_61, 6 GB). A 2.5x compute gap. No pre-built libtorch covers both architectures — compiled from source with fdl libtorch build.


ResNet-20 on CIFAR-10

200 epochs, every DDP mode surpasses solo.
Published reference: 91.25% accuracy (He et al. 2015, Table 6).

This is the FlowBuilder Graph version — the same ResNet-20 architecture expressed as a computation graph. One Ddp::setup() call turns it into multi-GPU training. The non-graph ResNet-20 (manual Module impl) produces identical results through Ddp::builder(), confirming that both DDP entry points converge correctly.

Mode Eval vs Paper Wall time Speedup GPU0 GPU1
solo-0 (5060 Ti)91.66%+0.41pp3127s1.0x100%
solo-1 (1060)91.69%+0.44pp3702s0.8x100%
nccl-sync92.38%+1.13pp2605s1.2x99%100%
nccl-cadence92.42%+1.17pp2650s1.2x100%100%
nccl-async92.44%+1.19pp2697s1.2x100%100%
cpu-sync92.53%+1.28pp4418s0.7x99%100%
cpu-cadence92.04%+0.79pp2670s1.2x100%100%
cpu-async92.43%+1.18pp2614s1.2x100%100%
sync (Graph DDP)91.43%+0.18pp6376s0.5x100%100%

Eval = test set accuracy (held-out 10K samples). GPU% = compute utilization. Speedup relative to solo-0. All 9 modes beat the published 91.25%. Sync (Graph DDP) pays full barrier cost with a 2.5x speed gap, hence the lower speedup.


Same result, manual Module

ResNet-20 without the Graph builder. Same architecture,
same convergence — confirming both DDP entry points work.

Mode Eval vs Paper Wall time Speedup
solo-0 (5060 Ti)91.62%+0.37pp2994s1.0x
nccl-sync92.19%+0.94pp2491s1.2x
nccl-cadence92.00%+0.75pp2574s1.2x
nccl-async92.10%+0.85pp2522s1.2x
cpu-sync92.44%+1.19pp4356s0.7x
cpu-cadence91.75%+0.50pp2467s1.2x
cpu-async91.92%+0.67pp2534s1.2x

Ddp::builder() path. No Graph required — any Module works.


CPU averaging: fixed

v0.3.0 shipped with CPU averaging broken. v0.4.0 fixes it.
One stream synchronization bug, three policies restored.

Policyv0.3.0v0.4.0
CPU Syncbroken92.53%
CPU Cadencebroken92.04%
CPU Asyncbroken92.43%

ResNet-20 Graph on CIFAR-10, 200 epochs. Both backends are now production-ready.


El Che: three policies

Different tradeoffs for different hardware.
All three converge. The choice is throughput vs gradient freshness.

Sync

AllReduce every batch. Maximum gradient freshness. The fast GPU waits at every barrier. Best convergence, worst throughput on mixed hardware.

ResNet-20 Graph: 92.38%, 2605s (NCCL) / 92.53%, 4418s (CPU)

Cadence

Slow GPU anchors the sync interval. Fast GPU processes more batches between syncs, contributing proportionally to its speed. Auto-tunes to keep AllReduce overhead below 10%.

ResNet-20 Graph: 92.42%, 2650s (NCCL) / 92.04%, 2670s (CPU)

Async

Workers run independently, averaging when ready. The fast GPU ranges ahead, creating useful parameter diversity. Fastest wall time, convergence smooths out over epochs.

ResNet-20 Graph: 92.44%, 2697s (NCCL) / 92.43%, 2614s (CPU)


Confirmation: 6 more models

Small models with 5 epochs. DDP converges correctly across the board,
though short runs penalize El Che calibration time.

Model Published solo-0 Best DDP Mode Fastest DDP Mode
LeNet-5 (MNIST)~99%98.57%98.96%cpu-async7.0scpu-async
MLP (MNIST)~97%97.00%97.89%cpu-sync3.3scpu-async
Logistic (MNIST)~92%92.37%92.51%nccl-cadence3.1scpu-async
Conv AE (MNIST)0.00070.0006sync5.3scpu-cadence
GPT-nano (Shakespeare)~1.5–2.02.47432.4316cpu-cadence7.0scpu-async
Char-RNN (Shakespeare)~1.51.57071.5672solo-121.5scpu-cadence

Eval is test-set accuracy (classification models) or validation loss (language models, lower is better). 5 epochs is too short for DDP to fully demonstrate its advantage — gradient averaging slows early convergence and El Che needs time to calibrate. These results confirm correctness, not peak performance.


Methodology

Reproducible. Ship it with fdl ddp-bench.

Nine modes: solo-0, solo-1, sync (Graph DDP), and six builder modes (nccl/cpu × sync/cadence/async). Every combination of backend and El Che policy.

Eval protocol: Solo modes evaluate against the held-out test set every epoch. DDP builder modes run a final evaluation after training completes, loading the averaged parameters into a fresh model. Both paths use the same eval function on the same test data.

GPU utilization: Sampled at 100ms intervals via NVML. GPU0/GPU1 percentages are compute utilization (not memory, not load). Idle time is cumulative seconds with <5% utilization.

Reproducibility: Seed 42, deterministic data loading, identical hyperparameters across all modes. Every run produces a timeline (JSON/CSV/HTML) with GPU traces, sync events, and idle analysis.

fdl ddp-bench full-sweep          # all models, all modes
fdl ddp-bench --model resnet-graph --mode nccl-async --epochs 200
fdl ddp-bench report              # generate markdown report

The full detailed report with per-epoch trajectories, idle analysis, and El Che calibration data is available in docs/ddp-benchmark.md.