v0.5.3: bit-exact round-trip export across 30 head cells (BERT, RoBERTa, DistilBERT, ALBERT, XLM-RoBERTa, DeBERTa-v2/v3 with all four task heads), masked-LM heads everywhere, and a universal Trainer for transparent fine-tuning on CPU, single GPU, or heterogeneous multi-GPU
v0.5.2 made HuggingFace checkpoints load in Rust. v0.5.3 closes the loop: every supported family-and-head combination now round-trips back to the HF ecosystem too, with bit-exact verification at the safetensors layer. Six families, five heads each (four task heads plus masked-LM), 30 cells of parity matrix, all PASS, all max abs diff = 0. Plus...
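flodl's actual verification code isn't shown here, but the two checks the announcement names — max abs diff and bit-exactness — can be sketched in a few lines of plain Rust. `max_abs_diff` and `bit_exact` are illustrative names for this sketch, not flodl-hf APIs:

```rust
/// Maximum absolute elementwise difference between two f32 tensors,
/// flattened to slices. This is the usual "close enough" parity metric.
fn max_abs_diff(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "tensor shapes must match");
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y).abs())
        .fold(0.0f32, f32::max)
}

/// Bit-exact check at the representation level, as you'd compare two
/// safetensors payloads after export and re-import. Stricter than a
/// diff of 0.0: it distinguishes 0.0 from -0.0, for example.
fn bit_exact(a: &[f32], b: &[f32]) -> bool {
    a.len() == b.len() && a.iter().zip(b).all(|(x, y)| x.to_bits() == y.to_bits())
}

fn main() {
    let original = [0.25f32, -1.5, 3.0];
    let round_tripped = [0.25f32, -1.5, 3.0];
    println!("max abs diff = {}", max_abs_diff(&original, &round_tripped)); // 0
    println!("bit exact = {}", bit_exact(&original, &round_tripped)); // true
}
```

Checking bits rather than values is why "max abs diff = 0" and "bit-exact at the safetensors layer" are two separate claims: the byte-level check subsumes the numeric one.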
v0.5.2: AutoModel for BERT, RoBERTa, DistilBERT with three task heads each, parity-tested against the PyTorch reference, and fdl add flodl-hf to scaffold a playground in one command
BERT in flodl is now from_pretrained("bert-base-uncased")?. So is RoBERTa. So is DistilBERT. So is any sentiment, NER, or SQuAD fine-tune sitting on top of those three families. The new sibling crate flodl-hf ships in 0.5.2 with PyTorch-verified numerical parity (max_abs_diff under 1e-5 on nine pinned checkpoints), three task heads per...
v0.5.0: #[derive(FdlArgs)] for any Rust binary, a consolidated fdl.yml, and explainable per-environment config with origin annotations
fdl has shipped standalone since 0.3.0. It has been on crates.io since 0.4.0. What 0.5.0 delivers is the next layer up: the CLI becomes something you can program against from your own Rust binaries, with a manifest that consolidates into one clean shape and configuration you can actually see before...
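The teaser mentions per-environment config with origin annotations. The core idea such a feature implies — every resolved value remembers which layer supplied it — can be sketched in plain Rust; all names below (`Origin`, `Resolved`, `resolve`, the precedence order) are assumptions for illustration, not fdl's actual API:

```rust
// Hypothetical sketch: each config field can come from a CLI flag, an
// environment variable, or a manifest default, and the "origin" records
// which one won. This is what makes config explainable before a run.
#[derive(Debug, PartialEq)]
enum Origin {
    CliFlag,
    EnvVar,
    ManifestDefault,
}

/// One resolved config value plus where it came from.
struct Resolved {
    value: String,
    origin: Origin,
}

/// Assumed precedence for this sketch: CLI flag > env var > manifest default.
fn resolve(cli: Option<String>, env: Option<String>, default: String) -> Resolved {
    if let Some(v) = cli {
        Resolved { value: v, origin: Origin::CliFlag }
    } else if let Some(v) = env {
        Resolved { value: v, origin: Origin::EnvVar }
    } else {
        Resolved { value: default, origin: Origin::ManifestDefault }
    }
}

fn main() {
    // No flag given, env var set: the env var wins and says so.
    let r = resolve(None, Some("cuda:0".into()), "cpu".into());
    println!("device = {} (from {:?})", r.value, r.origin);
}
```

A derive macro over a struct of such fields is one natural way to generate this resolution logic per field, which is presumably the role a `#[derive(FdlArgs)]`-style annotation plays.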
v0.4.0: CPU averaging works, every DDP mode beats solo, and we have the numbers to prove it
Six days ago, v0.3.0 shipped with a confession: the CPU averaging
backend was broken. Three policies, all failing to converge. We shipped
anyway because the NCCL path worked and an honest label beats a delayed
release.
v0.3.0: heterogeneous multi-GPU that works, a CPU path that doesn't (yet), and a standalone tool for libtorch
I have two GPUs: an RTX 5060 Ti (16GB, Blackwell) and a GTX 1060 (6GB,
Pascal). The 5060 Ti is 2.8x faster. Every DDP framework I’ve tried makes
the fast GPU wait for the slow one. You end up with 1060-speed training on
5060 Ti hardware.
I’ve been tracking a list. Every time I hit a PyTorch op that didn’t exist
in flodl, it went on the list. Every time someone asked “does it have X?”,
it went on the list. The list got long.
v0.2.2 benchmarks: fused RNN, honest variance, and still zero regressions against PyTorch
The first benchmark measured seven models on
flodl v0.1.3. That was two weeks and a lot of optimization ago. This update
adds three models (transformer, lstm_seq, conv_autoenc), lands fused RNN
kernels with C++-side parameter caching, and switches the variance metric
to something more honest.
How Rust replaced five layers of GPU memory management
I’m not a systems programmer. I spent twenty years writing PHP and JavaScript.
But I had a deep learning project that needed a custom framework, and Python’s
overhead was killing the architectures I wanted to explore. So I learned Rust
and built one.