Ten models later, the answer hasn't changed
v0.2.2 benchmarks: fused RNN, honest variance, and still zero regressions against PyTorch
The first benchmark measured seven models on flodl v0.1.3. That was two weeks and a lot of optimization ago. This update adds three models (transformer, lstm_seq, conv_autoenc), lands fused RNN kernels with C++-side parameter caching, and switches the variance metric to something more honest.
Results (v0.2.2 vs PyTorch 2.10.0)
| Model | PyTorch | flodl | Delta | Py σ | Rs σ |
|---|---|---|---|---|---|
| transformer | 3183.0 ms | 2199.8 ms | -31% | +-0.8 | +-1.0 |
| mlp | 291.1 ms | 207.0 ms | -29% | +-4.0 | +-1.3 |
| residual_tower | 406.9 ms | 309.7 ms | -24% | +-6.0 | +-3.3 |
| feedback_fixed | 275.3 ms | 231.3 ms | -16% | +-10.0 | +-6.0 |
| gated_routing | 248.0 ms | 217.3 ms | -12% | +-9.7 | +-2.8 |
| iterative_refine | 230.7 ms | 206.0 ms | -11% | +-2.2 | +-3.1 |
| gru_seq | 1105.1 ms | 1057.5 ms | -4% | +-16.7 | +-25.4 |
| conv_autoenc | 398.2 ms | 395.3 ms | -1% | +-1.1 | +-3.7 |
| lstm_seq | 692.3 ms | 692.3 ms | 0% | +-23.3 | +-15.2 |
| convnet | 1298.0 ms | 1298.2 ms | 0% | +-0.3 | +-0.1 |
Eight wins, two ties, zero regressions. Same story as v0.1.3, but now with the transformer headline at -31% and three more architectures confirming the pattern.
What’s new
Transformer at -31%
The biggest model in the suite (~25M params, 4-layer encoder with MultiheadAttention + FFN + LayerNorm + residual, cross-entropy loss) shows the largest speed advantage. This is dispatch-bound: hundreds of small ops per forward pass, each going through Python -> TorchScript -> C++ in PyTorch, versus direct FFI in flodl. The per-op overhead compounds across attention heads, feed-forward layers, and residual connections.
Both frameworks produce identical attention weights – same cuDNN kernels, same numerics. The 31% gap is pure framework overhead.
Fused RNN kernels
LSTM and GRU now call cuDNN’s fused sequence kernels (at::lstm() /
at::gru()) – a single kernel for the entire sequence across all layers,
replacing per-timestep cell dispatch. On top of that, flodl caches the
packed parameter tensors on the C++ side behind an opaque handle
(RnnParams). After the first forward call, subsequent calls pass a
single pointer to the pre-built parameter vector, eliminating per-forward
parameter collection, FFI array marshalling, and std::vector
reconstruction.
The result: lstm_seq matches PyTorch exactly (692.3ms vs 692.3ms), and gru_seq edges ahead by 4%. Both are compute-bound – cuDNN does the real work – so the tie confirms parity in the underlying kernel dispatch.
Two new ties prove the architecture
convnet at 0% was already a proof that both frameworks dispatch identical CUDA kernels. lstm_seq at 0% extends that proof to fused RNN. When the GPU dominates, flodl converges to the same number. The speed advantage appears precisely where framework overhead has room to matter.
Honest variance: why we switched to MAD
The v0.1.3 benchmarks reported σ as standard deviation. That was correct but misleading. Here’s what I found when I dug into the raw per-round data.
The outlier problem
Standard deviation treats every data point equally. In GPU benchmarking, that’s a problem. Python’s garbage collector fires at unpredictable intervals, creating occasional 50-170% timing spikes:
- gated_routing rounds 2 and 8: PyTorch spiked from ~248ms to ~400ms. Two GC pauses in 10 rounds. Standard deviation: +-64.8ms. But 8 of 10 rounds were within +-7ms.
- feedback_fixed round 6: PyTorch spiked from ~275ms to 447ms. One GC pause. Standard deviation: +-54.9ms. The other 9 rounds: +-10ms.
Rust has the same problem, differently shaped: rarer but sometimes larger spikes from CUDA scheduling stalls. gru_seq round 2 spiked from ~1050ms to 1447ms (+390ms). Standard deviation: +-124.7ms. The other 9 rounds: +-25ms.
Scaled MAD
Scaled MAD (Median Absolute Deviation x 1.4826) is σ-equivalent for normally distributed data but treats these spikes as what they are: interference from outside the benchmark, not real framework variance.
| metric | Py total σ | Rs total σ | who looks better? |
|---|---|---|---|
| stddev | 186 | 210 | PyTorch |
| MAD | 74 | 62 | flodl |
With stddev, flodl looks noisier (because of that one gru_seq spike). With MAD, flodl looks tighter (because its steady-state variance is actually lower on most models). Neither framing is more “favorable” – MAD is simply more accurate about what each framework does when the OS isn’t interfering.
The full per-round JSON data is in benchmarks/rounds/. Anyone can
compute both metrics and inspect the outliers directly.
Where flodl is tighter, and where it isn’t
With MAD, the variance picture is honest and model-dependent:
- Dispatch-bound models: flodl is 1.5-3.5x tighter (mlp, gated_routing,
convnet, residual_tower, feedback_fixed, lstm_seq). Rust’s deterministic
Dropand zero-GC execution model eliminates host-side jitter. - Fused-kernel workloads: both frameworks show similar variance (gru_seq, conv_autoenc, transformer, iterative_refine). cuDNN dominates execution time; the variance that remains is inherent to GPU scheduling.
This is more nuanced than the “3-20x tighter on every model” from v0.1.3. It’s also more true.
The deployment angle
One number that doesn’t appear in the timing table: Docker image size.
| Image | Size |
|---|---|
| PyTorch benchmark | 38.45 GB |
| flodl benchmark | 26.86 GB |
That’s 30% smaller. No Python, no pip, no PyTorch distribution – just the Rust binary and libtorch.
On spot instances with cold starts, image pull time is real wall-clock cost. On clusters with shared storage, it’s 12 GB less per node. For distributed training where you’re spinning up dozens of workers, the deployment story compounds with the per-epoch speed advantage.
What changed since v0.1.3
| Optimization | Impact |
|---|---|
Fused RNN (at::lstm() / at::gru()) |
Single cuDNN kernel for full sequence |
RNN param caching (C++ RnnParams handle) |
Zero per-forward FFI overhead |
| flatten_parameters | Eliminates cuDNN contiguous-weight warning |
| PyTorch parity (v0.2.2) | 30+ modules, 15 losses, 7 optimizers, 769 tests |
| Scaled MAD variance | Honest σ resistant to GC/scheduling outliers |
| PyTorch 2.10.0+cu128 | Updated baseline (was 2.6.0+cu126) |
Reproduce
git clone https://github.com/fab2s/floDl.git
cd floDl
# Quick single-round
make bench
# Publication run (10 rounds, locked clocks, 15s warmup)
make bench-publish
Same Docker setup, same methodology. The full benchmark report documents the protocol, environment, and statistical model in detail.
Where this goes
The benchmark suite doesn’t use CUDA Graphs, mixed precision, or channels-last – features that would further widen the gap. Those are fair game for a future comparison.
But the core result is structural. Ten models, two framework versions, and the answer hasn’t changed: flodl matches or beats PyTorch on every architecture. The speed advantage comes from what Rust eliminates between GPU kernels. The ties come from workloads where there’s nothing left to eliminate.
Zero regressions. That’s the number.