Eight models, nine modes, 70 training runs.
Two mismatched GPUs training faster than the fast one alone.
Hardware: RTX 5060 Ti (sm_120, 15 GB) + GTX 1060 (sm_61, 6 GB).
A 2.5x compute gap. No pre-built libtorch covers both architectures,
so it was compiled from source via `fdl libtorch build`.
At 200 epochs, every builder DDP mode surpasses both solo runs.
Published reference: 91.25% accuracy
(He et al. 2015, Table 6).
This is the FlowBuilder Graph version — the same ResNet-20 architecture
expressed as a computation graph. One Ddp::setup() call turns it into
multi-GPU training. The non-graph ResNet-20 (manual Module impl)
produces identical results through Ddp::builder(), confirming that
both DDP entry points converge correctly.
| Mode | Eval | vs Paper | Wall time | Speedup | GPU0 | GPU1 |
|---|---|---|---|---|---|---|
| solo-0 (5060 Ti) | 91.66% | +0.41pp | 3127s | 1.0x | 100% | — |
| solo-1 (1060) | 91.69% | +0.44pp | 3702s | 0.8x | — | 100% |
| nccl-sync | 92.38% | +1.13pp | 2605s | 1.2x | 99% | 100% |
| nccl-cadence | 92.42% | +1.17pp | 2650s | 1.2x | 100% | 100% |
| nccl-async | 92.44% | +1.19pp | 2697s | 1.2x | 100% | 100% |
| cpu-sync | 92.53% | +1.28pp | 4418s | 0.7x | 99% | 100% |
| cpu-cadence | 92.04% | +0.79pp | 2670s | 1.2x | 100% | 100% |
| cpu-async | 92.43% | +1.18pp | 2614s | 1.2x | 100% | 100% |
| sync (Graph DDP) | 91.43% | +0.18pp | 6376s | 0.5x | 100% | 100% |
Eval = test-set accuracy (held-out 10K samples). GPU% = compute utilization. Speedup is relative to solo-0. All nine modes beat the published 91.25%. Sync (Graph DDP) pays the full barrier cost across a 2.5x speed gap, hence the 0.5x speedup.
ResNet-20 without the Graph builder. Same architecture,
same convergence — confirming both DDP entry points work.
| Mode | Eval | vs Paper | Wall time | Speedup |
|---|---|---|---|---|
| solo-0 (5060 Ti) | 91.62% | +0.37pp | 2994s | 1.0x |
| nccl-sync | 92.19% | +0.94pp | 2491s | 1.2x |
| nccl-cadence | 92.00% | +0.75pp | 2574s | 1.2x |
| nccl-async | 92.10% | +0.85pp | 2522s | 1.2x |
| cpu-sync | 92.44% | +1.19pp | 4356s | 0.7x |
| cpu-cadence | 91.75% | +0.50pp | 2467s | 1.2x |
| cpu-async | 91.92% | +0.67pp | 2534s | 1.2x |
Ddp::builder() path. No Graph required — any Module works.
v0.3.0 shipped with CPU averaging broken. v0.4.0 fixes it.
One stream synchronization bug, three policies restored.
| Policy | v0.3.0 | v0.4.0 |
|---|---|---|
| CPU Sync | broken | 92.53% |
| CPU Cadence | broken | 92.04% |
| CPU Async | broken | 92.43% |
ResNet-20 Graph on CIFAR-10, 200 epochs. Both backends are now production-ready.
Different tradeoffs for different hardware.
All three converge. The choice is throughput vs gradient freshness.
Sync: AllReduce every batch. Maximum gradient freshness, but the fast GPU waits at every barrier. Best convergence, worst throughput on mixed hardware.
ResNet-20 Graph: 92.38%, 2605s (NCCL) / 92.53%, 4418s (CPU)
Cadence: The slow GPU anchors the sync interval; the fast GPU processes more batches between syncs, contributing in proportion to its speed. Auto-tunes to keep AllReduce overhead below 10%.
ResNet-20 Graph: 92.42%, 2650s (NCCL) / 92.04%, 2670s (CPU)
Async: Workers run independently, averaging when ready. The fast GPU runs ahead, creating useful parameter diversity. Fastest wall time; convergence smooths out over epochs.
ResNet-20 Graph: 92.44%, 2697s (NCCL) / 92.43%, 2614s (CPU)
Small models, 5 epochs each. DDP converges correctly across the board,
though such short runs leave El Che little time to calibrate.
| Model | Published | solo-0 | Best DDP | Mode | Fastest DDP | Mode |
|---|---|---|---|---|---|---|
| LeNet-5 (MNIST) | ~99% | 98.57% | 98.96% | cpu-async | 7.0s | cpu-async |
| MLP (MNIST) | ~97% | 97.00% | 97.89% | cpu-sync | 3.3s | cpu-async |
| Logistic (MNIST) | ~92% | 92.37% | 92.51% | nccl-cadence | 3.1s | cpu-async |
| Conv AE (MNIST) | — | 0.0007 | 0.0006 | sync | 5.3s | cpu-cadence |
| GPT-nano (Shakespeare) | ~1.5–2.0 | 2.4743 | 2.4316 | cpu-cadence | 7.0s | cpu-async |
| Char-RNN (Shakespeare) | ~1.5 | 1.5707 | 1.5672 | solo-1 | 21.5s | cpu-cadence |
Eval is test-set accuracy (classification models) or validation loss (language models, lower is better). 5 epochs is too short for DDP to fully demonstrate its advantage — gradient averaging slows early convergence and El Che needs time to calibrate. These results confirm correctness, not peak performance.
Reproducible. Everything here ships as `fdl ddp-bench`.
Nine modes: solo-0, solo-1, sync (Graph DDP), and six builder modes (nccl/cpu × sync/cadence/async). Every combination of backend and El Che policy.
Eval protocol: Solo modes evaluate against the held-out test set every epoch. DDP builder modes run a final evaluation after training completes, loading the averaged parameters into a fresh model. Both paths use the same eval function on the same test data.
GPU utilization: Sampled at 100ms intervals via NVML. GPU0/GPU1 percentages are compute utilization (not memory, not load). Idle time is cumulative seconds with <5% utilization.
Reproducibility: Seed 42, deterministic data loading, identical hyperparameters across all modes. Every run produces a timeline (JSON/CSV/HTML) with GPU traces, sync events, and idle analysis.
```sh
fdl ddp-bench full-sweep                                           # all models, all modes
fdl ddp-bench --model resnet-graph --mode nccl-async --epochs 200
fdl ddp-bench report                                               # generate markdown report
```
The full detailed report with per-epoch trajectories, idle analysis, and El Che calibration data is available in docs/ddp-benchmark.md.