The number that matters isn't speed
Seven models, ten rounds, one finding: Rust doesn't just run faster, it runs the same way every time
Update: these v0.1.3 results have been superseded by the v0.2.2 benchmarks – 10 models, fused RNN kernels, and improved statistical methodology.
When I published the first benchmark a week ago, the headline number was a 19% speedup over PyTorch. A real model, a real workload, same GPU. People noticed. But one number bugged me: the epoch standard deviation. flodl was 0.10s. PyTorch was 0.85s. Same model. Same CUDA kernels.
That wasn’t a measurement artifact. It was the actual finding.
The suite
The first benchmark was a single model — a fair criticism is that any one workload might favor one framework by accident. So I built a proper suite. Seven models spanning the architectures that matter:
| Model | What it tests | Params |
|---|---|---|
| mlp | Raw matmul throughput (5-layer, 1024→2048) | 16.8M |
| convnet | Conv2d + BatchNorm pipeline | 4.0M |
| gru_seq | 50-step unrolled GRU (sequential dispatch) | 1.3M |
| residual_tower | 12 residual blocks via skip connections | 14.7M |
| gated_routing | 8-expert MoE with learned soft routing | 2.6M |
| iterative_refine | 8-step refinement loop | 3.2M |
| feedback_fixed | 10-step feedback loop (smaller model) | 0.8M |
Tier 1 (mlp, convnet, gru_seq) uses standard nn modules — Linear, Conv2d,
GRUCell. Both sides write the same code. Tier 2 uses flodl’s graph builder
on the Rust side and equivalent manual forward() on the Python side. Same
architecture, same parameters, same optimizer.
Methodology
I’m claiming publication-grade numbers. That means publication-grade methodology:
- 10 interleaved rounds. Odd rounds run Rust first, even rounds run Python first. Thermal drift and background noise distribute equally.
- GPU clocks at 3090 MHz. The RTX 5060 Ti ran at its default applications clock of 3090 MHz. On non-WSL systems the suite auto-locks clocks; this run was on WSL2, where the suite doesn't lock clocks, so the tight σ across 10 rounds serves as the stability check instead.
- Best-of-3 per round. Each round runs 4 complete passes (the first is warmup). The best of the 3 measured passes is reported, filtering out process-level noise.
- 20 measured epochs with 3 warmup epochs per pass. 50 batches per epoch. Enough data points that the median is meaningful.
- 15-second GPU warmup burst before any measurement.
- Clean system. Rebooted, non-essential services stopped, display off.
- Docker isolation. Both frameworks in the same container with identical CUDA toolkit and system libraries.
The statistical model: for each model, report the median of 10 best-run medians (one per round) and their standard deviation. Ten independent samples is enough for the σ values to mean something.
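To make the protocol concrete, here is a minimal sketch of the measurement loop in Rust. Everything in it is illustrative, not the suite's actual code: the run_epoch closures and the measure_pass / measure_round / run_model names are placeholders, but the structure (warmup pass, best-of-3, interleaved rounds, median and σ over the 10 round medians) mirrors the protocol above.

```rust
// Sketch of the measurement protocol. Illustrative only -- not the suite's code.

fn median(xs: &[f64]) -> f64 {
    let mut v = xs.to_vec();
    v.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let n = v.len();
    if n % 2 == 1 { v[n / 2] } else { 0.5 * (v[n / 2 - 1] + v[n / 2]) }
}

/// One pass: 3 warmup epochs, then 20 measured epochs; returns the median epoch time.
fn measure_pass(run_epoch: &dyn Fn() -> f64) -> f64 {
    for _ in 0..3 { run_epoch(); }                        // warmup epochs, discarded
    let epochs: Vec<f64> = (0..20).map(|_| run_epoch()).collect();
    median(&epochs)
}

/// One round: 4 passes, the first is warmup; keep the best of the 3 measured passes.
fn measure_round(run_epoch: &dyn Fn() -> f64) -> f64 {
    measure_pass(run_epoch);                              // warmup pass, discarded
    (0..3).map(|_| measure_pass(run_epoch)).fold(f64::INFINITY, f64::min)
}

/// Full protocol for one model: 10 rounds, alternating which framework runs first,
/// then report the median and standard deviation of the 10 per-round results.
/// (The Python side's rounds would be collected symmetrically.)
fn run_model(rust_epoch: &dyn Fn() -> f64, py_epoch: &dyn Fn() -> f64) -> (f64, f64) {
    let mut rust_rounds = Vec::with_capacity(10);
    for round in 1..=10 {
        if round % 2 == 1 {                               // odd rounds: Rust first
            rust_rounds.push(measure_round(rust_epoch));
            measure_round(py_epoch);
        } else {                                          // even rounds: Python first
            measure_round(py_epoch);
            rust_rounds.push(measure_round(rust_epoch));
        }
    }
    let mean = rust_rounds.iter().sum::<f64>() / rust_rounds.len() as f64;
    let var = rust_rounds.iter().map(|x| (x - mean).powi(2)).sum::<f64>()
        / rust_rounds.len() as f64;
    (median(&rust_rounds), var.sqrt())
}
```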
Results
| Model | PyTorch | flodl | Delta | PyTorch σ (ms) | flodl σ (ms) |
|---|---|---|---|---|---|
| mlp | 271.0 ms | 188.5 ms | -30% | ±10.1 | ±2.9 |
| convnet | 1189.4 ms | 1190.5 ms | +0% | ±2.7 | ±1.0 |
| gru_seq | 1015.3 ms | 949.7 ms | -6% | ±222.4 | ±10.8 |
| residual_tower | 371.3 ms | 278.6 ms | -25% | ±25.9 | ±3.6 |
| gated_routing | 222.6 ms | 196.9 ms | -12% | ±13.8 | ±2.6 |
| iterative_refine | 208.7 ms | 186.7 ms | -11% | ±27.2 | ±5.6 |
| feedback_fixed | 250.2 ms | 207.2 ms | -17% | ±27.3 | ±8.7 |
flodl wins 6 of 7. The speed numbers are good. But look at the σ column.
The number that matters
Every single model, without exception, shows tighter variance on flodl. Not by a little — by 3x to 20x:
- gru_seq: PyTorch ±222ms on a 1-second epoch. That’s a 22% swing between runs. flodl: ±10.8ms — a 1% swing. Same model, same GPU, same CUDA kernels.
- residual_tower: PyTorch ±25.9ms. flodl ±3.6ms. Seven times more predictable.
- convnet: The one model where speed is identical (+0%). Even here, flodl’s σ is 2.7x tighter. The consistency advantage persists even when the speed advantage doesn’t.
This isn’t a tuning win. You can’t profile your way to lower variance in PyTorch. The variance comes from:
- Garbage collector pauses. Python’s GC runs at unpredictable intervals. When it fires during a training epoch, that epoch is slower. The GC doesn’t know or care that you’re in a tight loop.
- Interpreter overhead jitter. Python’s bytecode interpreter has variable dispatch cost depending on instruction cache state, branch prediction history, and what the OS scheduler decided to do between ops.
- Deferred deallocation. Tensor memory freed by Python’s reference counting + cyclic GC arrives at the CUDA allocator in bursts, causing pressure spikes that fragment the allocation pool.
Rust has none of these. Tensor memory is freed by Drop at scope exit —
deterministically, on the same thread, at the exact point where the value is no
longer needed. There is no GC to pause. There is no interpreter between you and
the FFI call. The memory behavior is the same on every run because the
execution model is the same on every run.
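Here is what that looks like in code, as a minimal sketch with placeholder types rather than flodl's real internals (GpuBuffer, free_device, forward, backward, and apply are all stand-ins):

```rust
// Sketch: deallocation timing is part of the program, not a collector's decision.

struct GpuBuffer {
    ptr: *mut u8,
    bytes: usize,
}

impl Drop for GpuBuffer {
    fn drop(&mut self) {
        // Return the allocation to the device allocator immediately. In a real
        // framework this would hand the block back to cudaFree or a pooled allocator.
        unsafe { free_device(self.ptr, self.bytes) };
    }
}

fn train_step(batch: &GpuBuffer) {
    let activations = forward(batch);    // allocates intermediate GpuBuffers
    let grads = backward(&activations);  // more temporaries
    apply(&grads);
}   // <- grads and activations are dropped here, in reverse declaration order,
    //    on this thread, at exactly this point in every single step

// Stubs so the sketch stands alone.
unsafe fn free_device(_ptr: *mut u8, _bytes: usize) {}
fn forward(_b: &GpuBuffer) -> GpuBuffer { GpuBuffer { ptr: std::ptr::null_mut(), bytes: 0 } }
fn backward(_a: &GpuBuffer) -> GpuBuffer { GpuBuffer { ptr: std::ptr::null_mut(), bytes: 0 } }
fn apply(_g: &GpuBuffer) {}
```

The stubs are beside the point; the timing is the point. Memory comes back at the closing brace, on the training thread, every time.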
The convnet proof
The convnet result at +0% is not a disappointment. It’s a proof.
Convolution is compute-bound. Both frameworks spend >99% of their time inside cuDNN kernels. The framework overhead — dispatch, memory management, Python interpretation — is invisible because the GPU is doing all the work.
This proves that flodl and PyTorch dispatch identical CUDA kernels. The speed advantage on other models comes entirely from what happens between kernel launches. When the gap between ops is large enough for framework overhead to matter (MLP with many small matmuls, GRU with 50 sequential steps), Rust’s zero-overhead dispatch pays off. When the ops are large enough to hide the overhead (convnet), both frameworks converge to the same number.
This is the cleanest possible evidence that the benchmark is measuring framework overhead, not CUDA kernel differences.
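A back-of-envelope model makes the regime split concrete. The numbers below are purely illustrative, not measurements from the suite; the only claim is the shape of the arithmetic, where each kernel launch costs some fixed framework overhead on top of its GPU time.

```rust
// Illustrative numbers only -- not measurements. Model an epoch as n launches,
// each taking kernel_ms of GPU time plus overhead_ms of framework cost per launch.

fn epoch_ms(launches: f64, kernel_ms: f64, overhead_ms: f64) -> f64 {
    launches * (kernel_ms + overhead_ms)
}

fn main() {
    // Compute-bound case (convnet-like): a few long kernels per step.
    // A 0.02 ms difference in per-launch overhead vanishes next to 1 ms kernels.
    let py = epoch_ms(1_000.0, 1.0, 0.03);
    let rs = epoch_ms(1_000.0, 1.0, 0.01);
    println!("compute-bound:  {:.0} ms vs {:.0} ms", py, rs);   // ~1030 vs ~1010

    // Dispatch-bound case (GRU-like): thousands of tiny kernels per step.
    // The same per-launch overhead now accounts for most of the gap.
    let py = epoch_ms(20_000.0, 0.02, 0.03);
    let rs = epoch_ms(20_000.0, 0.02, 0.01);
    println!("dispatch-bound: {:.0} ms vs {:.0} ms", py, rs);   // ~1000 vs ~600
}
```

When kernels are long, per-launch overhead disappears into the noise; when they are short and numerous, it becomes most of the gap, even though the kernels themselves are identical.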
What changed since v0.1.1
The first benchmark (19% on FBRL letter model) ran on flodl v0.1.1. Since then:
- Fused Adam/AdamW — single multi-tensor CUDA kernel instead of 4N per-parameter launches (sketched below)
- Foreach operations — 7 batched tensor ops (zero, norm, scale, lerp, sqrt, add, mul) that replace N individual kernel launches
- Fused gradient clipping — 2 kernels total instead of 2N
- CUDA Graphs — capture/replay kernel sequences for static-shape models
- Automatic mixed precision — autocast + GradScaler for fp16/bf16
- Channels-last memory — NHWC layout for Conv2d on Tensor Core GPUs
- Pre-computed graph routing — Vec-indexed dispatch with cached buffers
All automatic. Users get them without changing their training code.
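As a rough picture of what the first two items change (sketched with placeholder functions, not the flodl API, and with the optimizer math deliberately simplified), compare the launch counts:

```rust
// Schematic only: what fusing an optimizer step changes about kernel launches.

struct Tensor;                                            // stand-in for a device tensor
fn launch_lerp(_dst: &mut Tensor, _src: &Tensor) {}       // one small CUDA launch
fn launch_sqrt(_dst: &mut Tensor) {}                      // one small CUDA launch
fn launch_axpy(_dst: &mut Tensor, _src: &Tensor) {}       // one small CUDA launch
fn launch_fused_adam(_p: &mut [Tensor], _g: &[Tensor],
                     _m: &mut [Tensor], _v: &mut [Tensor]) {} // ONE launch, all tensors

/// Unfused: roughly 4 launches per parameter tensor, so ~4N launches per step.
/// Each launch pays dispatch latency before the GPU does any useful work.
fn step_unfused(p: &mut [Tensor], g: &[Tensor], m: &mut [Tensor], v: &mut [Tensor]) {
    for i in 0..p.len() {
        launch_lerp(&mut m[i], &g[i]);   // first-moment update
        launch_lerp(&mut v[i], &g[i]);   // second-moment update (simplified)
        launch_sqrt(&mut v[i]);          // denominator (simplified)
        launch_axpy(&mut p[i], &m[i]);   // apply the step
    }
}

/// Fused: one multi-tensor kernel walks every parameter in a single launch.
fn step_fused(p: &mut [Tensor], g: &[Tensor], m: &mut [Tensor], v: &mut [Tensor]) {
    launch_fused_adam(p, g, m, v);
}
```

Fused gradient clipping is the same idea applied to the clipping pass: two launches total instead of two per parameter tensor.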
The benchmark suite intentionally doesn’t enable CUDA Graphs, mixed precision, or channels-last — those would give flodl an unfair advantage over standard PyTorch usage. The numbers above are apples-to-apples: same dtype (fp32), same memory layout (contiguous), same training loop structure.
Why this matters for distributed training
In single-GPU training, variance is an annoyance. In distributed training, it’s a scaling bottleneck.
Synchronous data-parallel training has an all-reduce barrier at every step. Every worker computes gradients independently, then they synchronize. The step takes as long as the slowest worker.
With PyTorch’s ±25ms variance (residual_tower), your effective step time includes the tail of the distribution, not the median. With 8 workers, the probability of at least one worker hitting a slow epoch is high. With 64 workers, it’s near-certain.
flodl’s ±3.6ms on the same model means the synchronization barrier is tight. Every worker finishes within a narrow window. Your effective throughput scales closer to the theoretical linear scaling.
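A quick way to see how the tail compounds, using an assumed per-worker slow-step probability rather than measured data: with a synchronous barrier, a step is slow whenever any one of the N workers is slow.

```rust
// Why per-worker variance compounds under a synchronous all-reduce barrier.
// If a single worker has probability p of landing in the slow tail on a given
// step, the barrier waits for the slowest worker, so
//     P(slow step) = 1 - (1 - p)^N
// The 5% figure below is an assumption for illustration, not a measurement.

fn p_slow_step(p_slow_worker: f64, workers: u32) -> f64 {
    1.0 - (1.0 - p_slow_worker).powi(workers as i32)
}

fn main() {
    let p = 0.05; // suppose 5% of a single worker's steps land in the slow tail
    for n in [1, 8, 64] {
        println!("{:>2} workers: {:.0}% of steps hit the tail", n, 100.0 * p_slow_step(p, n));
    }
    //  1 worker:  ~5%
    //  8 workers: ~34%
    // 64 workers: ~96%
}
```

Shrinking the per-worker tail is therefore worth more at scale than the single-GPU numbers suggest.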
And the deployment story compounds this: a flodl training image doesn’t need Python, pip, or the PyTorch distribution. Less to pull, less to cache, faster cold starts on spot instances.
Reproduce it
The entire suite runs in Docker. No local Rust or Python needed:
```bash
git clone https://github.com/fab2s/floDl.git
cd floDl

# Quick single-round
make bench

# Publication run (10 rounds, locked clocks, 15s warmup)
make bench-publish
```
Per-round JSON with full per-epoch timings is saved in benchmarks/rounds/.
The merged report goes to benchmarks/report.txt. If you have a different
GPU, run it and see — the absolute numbers will differ but the relative
story should hold.
The full methodology documents the protocol, environment, statistical model, and all the optimizations in detail. Every number is reproducible. Every asymmetry is accounted for.
What’s next
This benchmark measures standard fp32 training without CUDA Graphs or mixed precision — the optimizations that would further widen the gap. Future benchmark updates will add:
- Mixed precision (autocast + GradScaler) comparison
- CUDA Graphs on eligible models
- Channels-last Conv2d
- The FBRL letter model re-run on the RTX 5060 Ti with all optimizations
The suite is designed to grow. Each benchmark is a Rust module with a matching Python implementation. Adding a new model takes an afternoon.
But the core finding won’t change. The variance advantage is structural. It comes from the language, not the optimizer. And that’s the number that matters.