70 runs, 8 models, zero failures

v0.4.0: CPU averaging works, every DDP mode beats solo, and we have the numbers to prove it

Six days ago, v0.3.0 shipped with a confession: the CPU averaging backend was broken. Three policies, all failing to converge. We shipped anyway because the NCCL path worked and an honest label beats a delayed release.

That bug is fixed. And we can prove it.

The fix

One line. comm_stream.synchronize() before reading GPU parameters for CPU averaging snapshots. The snapshot_params() call was racing with load_averaged() – a non-blocking CUDA stream copy was still writing to GPU memory when the next snapshot tried to read it. Classic race condition, subtle because it only corrupted parameters gradually over hundreds of averaging rounds.

We suspected this in v0.3.0 but couldn’t confirm it was the only bug. Now we can: 70 training runs across 8 models and 8 DDP modes, every single CPU averaging configuration converges correctly.

The proof

We built ddp-bench, a dedicated benchmark suite that runs every model against every DDP mode and validates convergence against published results. Not toy models – ResNet-20 on CIFAR-10, GPT-nano on Shakespeare, char-RNN, LeNet, convolutional autoencoders. Real architectures with known convergence curves.

The hardware: an RTX 5060 Ti (sm_120, 15GB) and a GTX 1060 (sm_61, 6GB). A 2.5x compute gap between GPUs. No pre-built libtorch covers both architectures – we compile from source with fdl libtorch build. This is the kind of setup traditional DDP frameworks choke on.

The modes:

That’s 9 configurations per model, 200 epochs on ResNet-20 and ResNet-Graph, 5 epochs on the smaller models.

ResNet-20: every DDP mode beats solo

This is the headline result. On CIFAR-10 with 200 epochs, every DDP mode exceeds the solo-0 baseline (91.62% accuracy on the RTX 5060 Ti):

Mode Eval vs Published Wall time Speedup
solo-0 (5060 Ti) 91.62% +0.37pp 2994s 1.0x
solo-1 (1060) 91.87% +0.62pp 3680s 0.8x
nccl-sync 92.19% +0.94pp 2491s 1.2x
nccl-cadence 92.00% +0.75pp 2574s 1.2x
nccl-async 92.10% +0.85pp 2522s 1.2x
cpu-sync 92.44% +1.19pp 4356s 0.7x
cpu-cadence 91.75% +0.50pp 2467s 1.2x
cpu-async 91.92% +0.67pp 2534s 1.2x

Published reference: 91.25% (He et al. 2015, Table 6).

Every mode beats the paper. Every mode beats solo-0. The fastest DDP modes (cpu-cadence, nccl-async) finish in ~42 minutes vs solo-0’s ~50 minutes. Two mismatched GPUs, training faster than the fast GPU alone, with better convergence.

cpu-sync is the outlier: best eval (92.44%) but slowest, because every-batch CPU round-trips are expensive. The El Che policies (cadence, async) let the fast GPU range ahead between sync points, which is where the speedup comes from.

CPU averaging: from broken to production

In v0.3.0, all three CPU averaging policies failed to converge. Parameters degraded over hundreds of averaging rounds until the model collapsed. We shipped it labeled as a known bug.

In v0.4.0, all three work:

Mode v0.3.0 v0.4.0
CPU Sync broken 92.44%
CPU Cadence broken 91.75%
CPU Async broken 91.92%

The CPU path now matches NCCL convergence across all three policies and all eight benchmark models. Both backends are production-ready.

Why this was hard

flodl doesn’t use PyTorch’s c10d distributed backend. We call NCCL directly through FFI – raw ncclAllReduce, ncclCommInitRank, manual stream synchronization. This gives us control (El Che’s per-rank cadence wouldn’t be possible through c10d’s collective API), but it means we own every synchronization bug.

The NCCL init pattern alone took days to get right. ncclCommInitRank called from worker threads corrupts the CUDA context on heterogeneous GPUs. The fix: always init on the main thread, then ncclCommSplit() to create per-rank communicators. Not documented anywhere. We found it by reading NCCL source and correlating CUBLAS failures with init order.

The CPU averaging path is a three-phase state machine (Idle/Collecting/Computing) running on a coordinator thread. GPU workers send parameter snapshots via channels, the coordinator averages on CPU, then distributes back. Non-blocking at every stage – the coordinator keeps processing control messages while averaging runs. Getting the stream synchronization right between snapshot reads, NCCL transfers, and CPU copies required instrumenting every stage with checksums until we found the exact copy that was racing.

Building libtorch from source for sm_61 + sm_120 (Pascal + Blackwell) takes 4 hours. No pre-built binary covers both. fdl libtorch build automates this, but when something goes wrong in DDP, you’re debugging through three layers: your Rust code, the C++ FFI shim, and the libtorch internals. There’s no Python stacktrace to fall back on.

El Che: the mismatched GPU problem

Traditional sync DDP makes the fast GPU wait for the slow one. With a 2.5x speed gap, you train at 1060 speed on 5060 Ti hardware.

El Che (Elastic Checkpoint) solves this with three policies:

Sync: every-batch AllReduce. Maximum gradient freshness, but the fast GPU idles during every barrier. Best convergence, worst throughput on mixed hardware.

Cadence: the slow GPU anchors the sync interval. The fast GPU processes more batches between syncs, contributing proportionally to its speed. Auto-tunes the anchor to keep AllReduce overhead below 10%.

Async: workers run independently, averaging when ready. The fast GPU ranges ahead, creating useful parameter diversity between averaging events. Fastest wall time, slightly noisier convergence that smooths out over epochs.

On ResNet-20, all three converge to within 0.5pp of each other. The choice is a throughput/freshness tradeoff, not a convergence risk.

The benchmark suite

ddp-bench is included in the repository. Run it yourself:

fdl ddp-bench full-sweep          # all models, all modes
fdl ddp-bench --model resnet --mode nccl-async
fdl ddp-bench report              # generate markdown report

Every run produces a timeline (JSON, CSV, HTML) with GPU utilization traces, sync event markers, idle gap analysis, and per-epoch convergence data. The full report has the complete results for all 70 runs, and the detailed data is on GitHub with reproduction instructions.

Eight models, each chosen to stress a different aspect:

Model Tests Published
Logistic regression Linear baseline MNIST ~92%
2-layer MLP Dispatch overhead MNIST ~97%
LeNet-5 Conv + pooling MNIST ~99%
Conv autoencoder Reconstruction (MSE) MNIST
Char-RNN RNN sequence modeling Shakespeare, loss ~1.5
GPT-nano Transformer + attention Shakespeare, loss ~1.5-2.0
ResNet-20 Deep residual, BN, augmentation CIFAR-10 91.25%
ResNet-20 (Graph) Same via FlowBuilder CIFAR-10 91.25%

What’s next

This release validates DDP on local hardware. The next step is cloud: multi-node training across machines, where network latency replaces PCIe as the bottleneck. El Che’s cadence policies were designed with this in mind – the same anchor/range-ahead principle applies whether the slow device is a weaker GPU or a node behind a network hop.

The full DDP benchmark report is available on the website with per-model tables, convergence trajectories, GPU idle analysis, and El Che calibration data. The raw report on GitHub includes reproduction instructions.

GitHub | Documentation | DDP Benchmark | DDP Report (GitHub) | Changelog