A Rust deep learning framework on libtorch.
Fluent graph API. Heterogeneous multi-GPU DDP. Up to 31% faster than PyTorch.
One curl to try it.
```sh
$ curl -sL https://flodl.dev/fdl -o fdl && chmod +x fdl
$ ./fdl setup                 # detect GPUs, download libtorch, configure build
$ ./fdl init my-project
$ cd my-project && ./fdl run
```
Three ways to land. Pick the one that fits how you work.
One curl, zero Rust on your host. fdl auto-detects your GPU, downloads the matching libtorch, and scaffolds a Docker-ready project.
Drop your PyTorch script into your AI coding assistant. In Claude Code, run /port. In any other tool, point the agent at the skill guide. The skill reads the source, classifies each block (model, data, training, DDP), maps to flodl, and writes a complete Rust project.
Every PyTorch module, loss, optimizer, scheduler, and training pattern mapped to its flodl equivalent. For the hands-on reader.
The training loop is the one you already know:
```python
model.train()
for epoch in range(num_epochs):
    for batch in loader:
        optimizer.zero_grad()
        pred = model(batch[0])
        loss = criterion(pred, batch[1])
        loss.backward()
        optimizer.step()
```
```rust
model.train();
for epoch in 0..num_epochs {
    for batch in loader.epoch(epoch) {
        let batch = batch?;
        optimizer.zero_grad();
        let pred = model.forward(&batch[0].into())?;
        let loss = cross_entropy_loss(&pred, &batch[1].into())?;
        loss.backward()?;
        optimizer.step()?;
    }
}
```
fdl still helps. The CLI is its own published crate. Outside a flodl checkout it's a first-class libtorch manager for tch-rs and PyTorch C++ users: the installer PyTorch itself never shipped. One binary, zero native dependencies, and it works before anything else is set up.
Reads nvidia-smi, picks cu126 / cu128 / CPU automatically, installs side-by-side, activates one with a single command.
```sh
fdl libtorch download --cuda 12.8
fdl libtorch list
fdl libtorch activate precompiled/cu128
```
GTX 1060 + RTX 5060 Ti in one host? No single pre-built variant covers both. fdl libtorch build compiles PyTorch from source with the exact archs you need, Docker-isolated by default.
```sh
fdl libtorch build --archs "6.1;12.0"
fdl libtorch build --native --jobs 8
```
Point LIBTORCH or CMAKE_PREFIX_PATH at the active variant and build. No more scraping URLs from the PyTorch get-started page.
```sh
export LIBTORCH=$FLODL_HOME/libtorch/precompiled/cu128
cargo add tch
cargo build
```
Deep-merge fdl.ci.yml or fdl.local.yml on top of the base manifest with --env. fdl config show prints the resolved result with per-layer origin annotations, so you see exactly which file and which line contributed each field before running a long job.
```sh
$ fdl --env ci config show          # resolved manifest, base + overlay
docker: dev                         # fdl.yml:2
ddp:
  policy: cadence                   # fdl.yml:8
  backend: cpu                      # fdl.ci.yml:3 (override)
  divergence_threshold: 0.05        # fdl.yml:10
training:
  epochs: 1                         # fdl.ci.yml:6 (override)
  seed: 42                          # fdl.yml:13
  batch_size: 32                    # fdl.yml:14
commands:
  quick:                            # fdl.yml:18 (preset)
    training: { epochs: 1 }
    options: { model: linear, batches: 100 }
```
Three equivalent ways to apply an overlay: fdl --env ci test,
FDL_ENV=ci fdl test, or just fdl ci test when the name
doesn't collide with a command. Explicit selectors fail loudly on missing
files; the first-arg convention falls through so existing commands are never
shadowed.
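The three forms side by side, using the same overlay and command named above:

```sh
# All three resolve fdl.ci.yml on top of fdl.yml before running `test`.
fdl --env ci test      # explicit flag: fails loudly if fdl.ci.yml is missing
FDL_ENV=ci fdl test    # environment variable: same strict behavior
fdl ci test            # first-arg shorthand: applies only while "ci" is not a command name
```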
The fluent graph builder reads as data flow.
from / through / also / split / merge / loop_body / tag / using
compose into residuals, parallel heads, loops, and cross-connections.
```rust
use flodl::*;

let model = FlowBuilder::from(Linear::new(4, 32)?)
    .through(GELU::new())            // activation
    .through(LayerNorm::new(32)?)    // normalization
    .also(Linear::new(32, 32)?)      // residual: output = input + Linear(input)
    .through(Linear::new(32, 1)?)    // output projection
    .build()?;

// The same model composes into multi-head, gated, or looped shapes without
// writing a forward() method. Every .tag() makes a point observable.
```
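The split, merge, loop_body, and using verbs are not shown above. A rough sketch of a two-head split follows; the exact signatures (per-branch closures, a merge combinator) are assumptions for illustration, not taken from the flodl API.

```rust
// Hypothetical sketch: the .split / .merge signatures below are assumptions
// used to illustrate the verb list above, not confirmed flodl API.
let two_head = FlowBuilder::from(Linear::new(4, 32)?)
    .through(GELU::new())
    .tag("trunk")                                    // observable point after the shared trunk
    .split(
        |head| head.through(Linear::new(32, 1)?),    // regression head
        |head| head.through(Linear::new(32, 4)?),    // classification head
    )
    .merge(Merge::Concat)                            // assumed combinator for joining heads
    .build()?;
```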
10 models, 10 interleaved rounds, locked GPU clocks. flodl wins 8 of 10, ties 2, zero regressions.
| Model | PyTorch | flodl | Delta |
|---|---|---|---|
| transformer | 3183.0 ms | 2199.8 ms | -31% |
| mlp | 291.1 ms | 207.0 ms | -29% |
| residual_tower | 406.9 ms | 309.7 ms | -24% |
| feedback_fixed | 275.3 ms | 231.3 ms | -16% |
| convnet | 1298.0 ms | 1298.2 ms | 0% |
RTX 5060 Ti, clocks locked at 3090 MHz, flodl v0.3.0 vs PyTorch 2.10.0. The convnet tie proves both frameworks dispatch identical CUDA kernels; the speed gap is pure framework overhead.
A mixed consumer cluster reaches better CIFAR-10 accuracy than He et al. (2015), in less wall time than its fastest GPU running alone. Same training loop. One knob.
| | He et al. 2015 | Single RTX 5060 Ti | Mixed 5060 Ti + 1060 6GB |
|---|---|---|---|
| CIFAR-10 eval | 91.25% | 91.66% | 92.42% |
| Wall time | - | 3,127 s | 2,650 s |
ResNet-20 on CIFAR-10, Graph builder, nccl-cadence policy.
Published baseline: He et al. 2015, Table 6.
The slower GPU contributes useful work instead of slowing the cluster down.
ElChe cadence auto-detects per-GPU speed. The slow device anchors synchronization; the fast one ranges ahead and fills what would be idle time. Three policies × two backends, A/B-testable in one line.
```rust
Ddp::setup(&model, &builder, |p| Adam::new(p, 0.001))?;

for batch in model.epoch(0) {
    let batch = batch?;
    let loss = model.forward_batch(&batch)?.mse(&target)?;
    loss.backward()?;
    model.step()?;   // AllReduce + sync + optimizer + zero_grad
}
```
```rust
let ddp = Ddp::builder(model_factory, optim_factory, train_fn)
    .dataset(dataset)
    .batch_size(32)
    .num_epochs(10)
    .policy(ApplyPolicy::Cadence)     // Sync | Cadence | Async
    .backend(AverageBackend::Nccl)    // Nccl | Cpu
    .run()?;

let state = ddp.join()?;              // averaged params + buffers on CPU
```
A live web dashboard with loss curves, GPU and VRAM tracking, per-rank throughput, and ETA. Zero external dependencies: call monitor.serve(3000) and open a browser.
Live dashboard from an earlier FBRL letter model run (v0.1.1, GTX 1060), 19% faster with the training monitor running.
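A minimal usage sketch; only serve(3000) is documented above, so the Monitor constructor name is an assumption.

```rust
// Sketch only: `Monitor::new()` is an assumed constructor; `serve(3000)` is
// the documented call. The dashboard then streams loss curves, GPU/VRAM,
// per-rank throughput, and ETA while the usual training loop runs.
let monitor = Monitor::new();
monitor.serve(3000)?;   // open http://localhost:3000 in a browser
```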
A Rust-native framework for researchers who care about what happens between the GPU kernels.
Tensor memory freed by Drop the instant it leaves scope. No GC, no finalizers, no VRAM budget heuristics. Five phases of PyTorch memory management replaced by impl Drop for Tensor.
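A small illustration of that ownership model, reusing the forward and mse calls shown earlier (x stands in for any input batch):

```rust
// Intermediate tensors live exactly as long as their bindings.
{
    let hidden = model.forward(&x.into())?;   // VRAM allocated here
    let loss = hidden.mse(&target)?;
    loss.backward()?;
}   // `hidden` and `loss` are dropped here; their VRAM is returned at once,
    // with no GC pass and no finalizer queue.
```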
Links libtorch's stable C API, not a specific CUDA toolkit version. Pascal through Blackwell work out of the box, no version pinning. If nvidia-smi runs, flodl can train on it.
Linear, Conv1d/2d/3d, LayerNorm, BatchNorm, RMSNorm, GroupNorm, LSTM, GRU, MultiheadAttention, 15 losses, 7 optimizers, 8 schedulers. Same semantics, same training loop.
fdl CLI for hardware detection and libtorch management. Dockerfiles and Makefile scaffolded. Architecture-validated checkpoints. Live dashboard. Timeline profiler. No glue code required.
Same libtorch kernels, no Python interpreter tax. 8 wins out of 10 on a locked-clock interleaved benchmark vs PyTorch 2.10, zero regressions. The convnet tie proves the gain is framework overhead removed, not a kernel change.
Mix any CUDA GPUs, identical training loop for 1 or N. ElChe cadence lets the fast card range ahead while the slow one anchors sync. A 5060 Ti + 1060 cluster reaches 92.42% on CIFAR-10 ResNet-20 in less wall time than the 5060 Ti alone.
We started this because training in the dark stopped being normal. Loss numbers trickling through a Python interpreter, GPUs stalling behind a wall we couldn't see through. One day, we walked out.