Eight models, nine modes, 70 training runs.
Two mismatched GPUs training faster than the fast one alone.
Hardware: RTX 5060 Ti (sm_120, 15 GB) + GTX 1060 (sm_61, 6 GB).
A 2.5x compute gap. No pre-built libtorch covers both architectures,
so it was compiled from source via `fdl libtorch build`.
At 200 epochs, every builder DDP mode surpasses both solo runs.
Published reference: 91.25% accuracy
(He et al. 2015, Table 6).
This is the FlowBuilder Graph version — the same ResNet-20 architecture
expressed as a computation graph. One Ddp::setup() call turns it into
multi-GPU training. The non-graph ResNet-20 (manual Module impl)
produces identical results through Ddp::builder(), confirming that
both DDP entry points converge correctly.
| Mode | Eval | vs Paper | Wall time | Speedup | GPU0 | GPU1 |
|---|---|---|---|---|---|---|
| solo-0 (5060 Ti) | 91.66% | +0.41pp | 3127s | 1.0x | 100% | — |
| solo-1 (1060) | 91.69% | +0.44pp | 3702s | 0.8x | — | 100% |
| nccl-sync | 92.38% | +1.13pp | 2605s | 1.2x | 99% | 100% |
| nccl-cadence | 92.42% | +1.17pp | 2650s | 1.2x | 100% | 100% |
| nccl-async | 92.44% | +1.19pp | 2697s | 1.2x | 100% | 100% |
| cpu-sync | 92.53% | +1.28pp | 4418s | 0.7x | 99% | 100% |
| cpu-cadence | 92.04% | +0.79pp | 2670s | 1.2x | 100% | 100% |
| cpu-async | 92.43% | +1.18pp | 2614s | 1.2x | 100% | 100% |
| sync (Graph DDP) | 91.43% | +0.18pp | 6376s | 0.5x | 100% | 100% |
Eval = test-set accuracy (held-out 10K samples). GPU% = compute utilization. Speedup is relative to solo-0. All nine modes beat the published 91.25%. Sync (Graph DDP) pays the full barrier cost across a 2.5x speed gap, hence the 0.5x speedup.
ResNet-20 without the Graph builder. Same architecture,
same convergence — confirming both DDP entry points work.
| Mode | Eval | vs Paper | Wall time | Speedup |
|---|---|---|---|---|
| solo-0 (5060 Ti) | 91.62% | +0.37pp | 2994s | 1.0x |
| nccl-sync | 92.19% | +0.94pp | 2491s | 1.2x |
| nccl-cadence | 92.00% | +0.75pp | 2574s | 1.2x |
| nccl-async | 92.10% | +0.85pp | 2522s | 1.2x |
| cpu-sync | 92.44% | +1.19pp | 4356s | 0.7x |
| cpu-cadence | 91.75% | +0.50pp | 2467s | 1.2x |
| cpu-async | 91.92% | +0.67pp | 2534s | 1.2x |
Ddp::builder() path. No Graph required — any Module works.
v0.3.0 shipped with CPU averaging broken. v0.4.0 fixes it.
One stream synchronization bug, three policies restored.
| Policy | v0.3.0 | v0.4.0 |
|---|---|---|
| CPU Sync | broken | 92.53% |
| CPU Cadence | broken | 92.04% |
| CPU Async | broken | 92.43% |
ResNet-20 Graph on CIFAR-10, 200 epochs. Both backends are now production-ready.
Different tradeoffs for different hardware.
All three converge. The choice is throughput vs gradient freshness.
Sync: AllReduce every batch. Maximum gradient freshness, but the fast GPU waits at every barrier. Best convergence, worst throughput on mixed hardware.
ResNet-20 Graph: 92.38%, 2605s (NCCL) / 92.53%, 4418s (CPU)
Cadence: The slow GPU anchors the sync interval; the fast GPU processes more batches between syncs, contributing in proportion to its speed. Auto-tunes to keep AllReduce overhead below 10%.
ResNet-20 Graph: 92.42%, 2650s (NCCL) / 92.04%, 2670s (CPU)
Async: Workers run independently, averaging when ready. The fast GPU runs ahead, creating useful parameter diversity. Fastest wall time; convergence smooths out over epochs.
ResNet-20 Graph: 92.44%, 2697s (NCCL) / 92.43%, 2614s (CPU)
Small models, 5 epochs each. DDP converges correctly across the board,
though such short runs leave El Che little time to calibrate.
| Model | Published | solo-0 | Best DDP | Mode | Fastest DDP | Mode |
|---|---|---|---|---|---|---|
| LeNet-5 (MNIST) | ~99% | 98.57% | 98.96% | cpu-async | 7.0s | cpu-async |
| MLP (MNIST) | ~97% | 97.00% | 97.89% | cpu-sync | 3.3s | cpu-async |
| Logistic (MNIST) | ~92% | 92.37% | 92.51% | nccl-cadence | 3.1s | cpu-async |
| Conv AE (MNIST) | — | 0.0007 | 0.0006 | sync | 5.3s | cpu-cadence |
| GPT-nano (Shakespeare) | ~1.5–2.0 | 2.4743 | 2.4316 | cpu-cadence | 7.0s | cpu-async |
| Char-RNN (Shakespeare) | ~1.5 | 1.5707 | 1.5672 | solo-1 | 21.5s | cpu-cadence |
Eval is test-set accuracy (classification models) or validation loss (language models, lower is better). 5 epochs is too short for DDP to fully demonstrate its advantage — gradient averaging slows early convergence and El Che needs time to calibrate. These results confirm correctness, not peak performance.
Reproducible. Everything here ships as `fdl ddp-bench`.
Nine modes: solo-0, solo-1, sync (Graph DDP), and six builder modes (nccl/cpu × sync/cadence/async). Every combination of backend and El Che policy.
Eval protocol: Solo modes evaluate against the held-out test set every epoch. DDP builder modes run a final evaluation after training completes, loading the averaged parameters into a fresh model. Both paths use the same eval function on the same test data.
GPU utilization: Sampled at 100ms intervals via NVML. GPU0/GPU1 percentages are compute utilization (not memory, not load). Idle time is cumulative seconds with <5% utilization.
Reproducibility: Seed 42, deterministic data loading, identical hyperparameters across all modes. Every run produces a timeline (JSON/CSV/HTML) with GPU traces, sync events, and idle analysis.
```sh
fdl ddp-bench full-sweep                                           # all models, all modes
fdl ddp-bench --model resnet-graph --mode nccl-async --epochs 200
fdl ddp-bench report                                               # generate markdown report
```
The full detailed report with per-epoch trajectories, idle analysis, and El Che calibration data is available in docs/ddp-benchmark.md.