Blog

Notes from building a deep learning framework in Rust.

HuggingFace, both ways

v0.5.3: bit-exact round-trip export across 30 head cells (BERT, RoBERTa, DistilBERT, ALBERT, XLM-RoBERTa, DeBERTa-v2/v3 with all four task heads), masked-LM heads everywhere, and a universal Trainer for transparent fine-tuning on CPU, single GPU, or heterogeneous multi-GPU

v0.5.2 made HuggingFace checkpoints load in Rust. v0.5.3 closes the loop: every supported family-and-head combination now round-trips back to the HF ecosystem too, with bit-exact verification at the safetensors layer. Six families, four task heads each, a 30-cell parity matrix, all PASS, all max abs diff = 0. Plus...

HuggingFace, in Rust

v0.5.2: AutoModel for BERT, RoBERTa, DistilBERT with three task heads each, parity-tested against the PyTorch reference, and fdl add flodl-hf to scaffold a playground in one command

BERT in flodl is now from_pretrained("bert-base-uncased")?. So is RoBERTa. So is DistilBERT. So is any sentiment, NER, or SQuAD fine-tune sitting on top of those three families. The new sibling crate flodl-hf ships in 0.5.2 with PyTorch-verified numerical parity (max_abs_diff under 1e-5 on nine pinned checkpoints), three task heads per...
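To make the shape of that API concrete, here is a minimal sketch of the kind of call the post describes. AutoModel and from_pretrained are the names mentioned above; the crate path, error handling, and everything else are assumptions for illustration, not flodl-hf's verified surface.

```rust
// Hypothetical sketch only: the AutoModel type and from_pretrained call come
// from the post above, but the crate layout and error type are assumptions.
use flodl_hf::AutoModel;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Fetches (or reads from the local cache) the config plus safetensors
    // weights, then builds the matching flodl module graph.
    let _bert = AutoModel::from_pretrained("bert-base-uncased")?;

    // The same entry point covers the other supported families.
    let _roberta = AutoModel::from_pretrained("roberta-base")?;
    Ok(())
}
```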

One struct, one manifest, one overlay

v0.5.0: #[derive(FdlArgs)] for any Rust binary, a consolidated fdl.yml, and explainable per-environment config with origin annotations

fdl has shipped standalone since 0.3.0. It has been on crates.io since 0.4.0. What 0.5.0 delivers is the next layer up: the CLI becomes something you can program against from your own Rust binaries, with a manifest that consolidates into one clean shape, and configuration you can actually see before...
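A rough sketch of what the derive named in the subtitle might look like inside a training binary. FdlArgs is the name from the post; the field set, the crate path, and the parse() entry point are guesses, not the published API.

```rust
// Hypothetical sketch: FdlArgs is the derive named in the post; the field
// names and the parse() constructor are assumptions for illustration.
use fdl::FdlArgs;

#[derive(FdlArgs, Debug)]
struct TrainArgs {
    // Each value would be resolved from CLI flags, the consolidated fdl.yml,
    // or a per-environment overlay, with its origin recorded so the config
    // can be explained before a run starts.
    epochs: usize,
    lr: f64,
    device: String,
}

fn main() {
    let args = TrainArgs::parse();
    println!("{args:?}");
}
```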

70 runs, 8 models, zero failures

v0.4.0: CPU averaging works, every DDP mode beats solo, and we have the numbers to prove it

Six days ago, v0.3.0 shipped with a confession: the CPU averaging backend was broken. Three policies, all failing to converge. We shipped anyway because the NCCL path worked and an honest label beats a delayed release.

Two GPUs, one bug, and a CLI

v0.3.0: heterogeneous multi-GPU that works, a CPU path that doesn't (yet), and a standalone tool for libtorch

I have two GPUs: an RTX 5060 Ti (16GB, Blackwell) and a GTX 1060 (6GB, Pascal). The 5060 Ti is 2.8x faster. Every DDP framework I’ve tried makes the fast GPU wait for the slow one. You end up with 1060-speed training on 5060 Ti hardware.

Everything you use in PyTorch, now in Rust

30+ modules, 15 losses, 7 optimizers, 100+ tensor ops, 769 tests — flodl reaches PyTorch parity

I’ve been tracking a list. Every time I hit a PyTorch op that didn’t exist in flodl, it went on the list. Every time someone asked “does it have X?”, it went on the list. The list got long.

Ten models later, the answer hasn't changed

v0.2.2 benchmarks: fused RNN, honest variance, and still zero regressions against PyTorch

The first benchmark measured seven models on flodl v0.1.3. That was two weeks and a lot of optimization ago. This update adds three models (transformer, lstm_seq, conv_autoenc), lands fused RNN kernels with C++-side parameter caching, and switches the variance metric to something more honest.

The number that matters isn't speed

Seven models, ten rounds, one finding: Rust doesn't just run faster, it runs the same way every time

Update: these v0.1.3 results have been superseded by the v0.2.2 benchmarks – 10 models, fused RNN kernels, and improved statistical methodology.

impl Drop for Tensor

How Rust replaced five layers of GPU memory management

I’m not a systems programmer. I spent twenty years writing PHP and JavaScript. But I had a deep learning project that needed a custom framework, and Python’s overhead was killing the architectures I wanted to explore. So I learned Rust and built one.
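The pattern the title points at, sketched generically. This is not flodl's actual implementation: DevicePtr and free_device are invented stand-ins for whatever allocation layer sits underneath.

```rust
// Generic RAII sketch, not flodl's code: DevicePtr and free_device are
// invented stand-ins for the underlying CUDA allocation layer.
type DevicePtr = *mut u8;

unsafe fn free_device(_ptr: DevicePtr, _bytes: usize) {
    // Placeholder for the real device-memory release (e.g. cudaFree).
}

struct Tensor {
    ptr: DevicePtr, // raw device allocation
    bytes: usize,   // size of the allocation
}

impl Drop for Tensor {
    fn drop(&mut self) {
        // Runs deterministically when the tensor leaves scope, so device
        // memory comes back without a garbage collector, a pooling layer to
        // tune, or a manual free call on every exit path.
        unsafe { free_device(self.ptr, self.bytes) };
    }
}
```

Ownership then decides exactly when that drop fires, which is the trade the post explores.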