A Rust deep learning framework on libtorch.
Fluent graph API. Heterogeneous multi-GPU DDP. Up to 31% faster than PyTorch.
One curl to try it.
```sh
$ curl -sL https://flodl.dev/fdl -o fdl && chmod +x fdl
$ ./fdl setup                 # detect GPUs, download libtorch, configure build
$ ./fdl init my-project
$ cd my-project && ./fdl run
```
Three ways to land. Pick the one that fits how you work.
One curl, zero Rust on your host. fdl auto-detects your GPU, downloads the matching libtorch, and scaffolds a Docker-ready project.
Drop your PyTorch script into your AI coding assistant. In Claude Code, run /port. In any other tool, point the agent at the skill guide. The skill reads the source, classifies each block (model, data, training, DDP), maps to flodl, and writes a complete Rust project.
Every PyTorch module, loss, optimizer, scheduler, and training pattern mapped to its flodl equivalent. For the hands-on reader.
The training loop is the one you already know:
```python
model.train()
for epoch in range(num_epochs):
    for batch in loader:
        optimizer.zero_grad()
        pred = model(batch[0])
        loss = criterion(pred, batch[1])
        loss.backward()
        optimizer.step()
```
```rust
model.train();
for epoch in 0..num_epochs {
    for batch in loader.epoch(epoch) {
        let batch = batch?;
        optimizer.zero_grad();
        let pred = model.forward(&batch[0].into())?;
        let loss = cross_entropy_loss(&pred, &batch[1].into())?;
        loss.backward()?;
        optimizer.step()?;
    }
}
```
fdl still helps. The CLI is its own published crate. Outside a flodl checkout it's a first-class libtorch manager for tch-rs and PyTorch C++ users: the installer PyTorch itself never shipped. One binary, zero native dependencies, and it works before anything else is set up.
Reads nvidia-smi, picks cu126 / cu128 / CPU automatically, installs side-by-side, activates one with a single command.
```sh
fdl libtorch download --cuda 12.8
fdl libtorch list
fdl libtorch activate precompiled/cu128
```
GTX 1060 + RTX 5060 Ti in one host? No single pre-built variant covers both. fdl libtorch build compiles PyTorch from source with the exact archs you need, Docker-isolated by default.
```sh
fdl libtorch build --archs "6.1;12.0"
fdl libtorch build --native --jobs 8
```
Point LIBTORCH or CMAKE_PREFIX_PATH at the active variant and build. No more scraping URLs from the PyTorch get-started page.
```sh
export LIBTORCH=$FLODL_HOME/libtorch/precompiled/cu128
cargo add tch
cargo build
```
Deep-merge fdl.ci.yml or fdl.local.yml on top of the base manifest with --env. fdl config show prints the resolved result with per-layer origin annotations, so you see exactly which file and which line contributed each field before running a long job.
```sh
$ fdl --env ci config show          # resolved manifest, base + overlay
docker: dev                         # fdl.yml:2
ddp:
  policy: cadence                   # fdl.yml:8
  backend: cpu                      # fdl.ci.yml:3 (override)
  divergence_threshold: 0.05        # fdl.yml:10
training:
  epochs: 1                         # fdl.ci.yml:6 (override)
  seed: 42                          # fdl.yml:13
  batch_size: 32                    # fdl.yml:14
commands:
  quick:                            # fdl.yml:18 (preset)
    training: { epochs: 1 }
    options: { model: linear, batches: 100 }
```
Three equivalent ways to apply an overlay: fdl --env ci test,
FDL_ENV=ci fdl test, or just fdl ci test when the name
doesn't collide with a command. Explicit selectors fail loudly on missing
files; the first-arg convention falls through so existing commands are never
shadowed.
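The three forms side by side, using the same overlay and command named above:

```sh
# All three resolve fdl.ci.yml on top of fdl.yml before running `test`.
fdl --env ci test      # explicit flag: fails loudly if fdl.ci.yml is missing
FDL_ENV=ci fdl test    # environment variable: same strict behavior
fdl ci test            # first-arg shorthand: applies only while "ci" is not a command name
```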
The fluent graph builder reads as data flow.
from / through / also / split / merge / loop_body / tag / using
compose into residuals, parallel heads, loops, and cross-connections.
```rust
use flodl::*;

let model = FlowBuilder::from(Linear::new(4, 32)?)
    .through(GELU::new())            // activation
    .through(LayerNorm::new(32)?)    // normalization
    .also(Linear::new(32, 32)?)      // residual: output = input + Linear(input)
    .through(Linear::new(32, 1)?)    // output projection
    .build()?;

// The same model composes into multi-head, gated, or looped shapes without
// writing a forward() method. Every .tag() makes a point observable.
```
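The split, merge, loop_body, and using verbs are not shown above. A rough sketch of a two-head split follows; the exact signatures (per-branch closures, a merge combinator) are assumptions for illustration, not taken from the flodl API.

```rust
// Hypothetical sketch: the .split / .merge signatures below are assumptions
// used to illustrate the verb list above, not confirmed flodl API.
let two_head = FlowBuilder::from(Linear::new(4, 32)?)
    .through(GELU::new())
    .tag("trunk")                                    // observable point after the shared trunk
    .split(
        |head| head.through(Linear::new(32, 1)?),    // regression head
        |head| head.through(Linear::new(32, 4)?),    // classification head
    )
    .merge(Merge::Concat)                            // assumed combinator for joining heads
    .build()?;
```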
10 models, 10 interleaved rounds, locked GPU clocks. flodl wins 8 of 10, ties 2, zero regressions.
| Model | PyTorch | flodl | Delta |
|---|---|---|---|
| transformer | 3183.0 ms | 2199.8 ms | -31% |
| mlp | 291.1 ms | 207.0 ms | -29% |
| residual_tower | 406.9 ms | 309.7 ms | -24% |
| feedback_fixed | 275.3 ms | 231.3 ms | -16% |
| convnet | 1298.0 ms | 1298.2 ms | 0% |
RTX 5060 Ti, clocks locked at 3090 MHz, flodl v0.3.0 vs PyTorch 2.10.0. The convnet tie proves both frameworks dispatch identical CUDA kernels; the speed gap is pure framework overhead.
A mixed consumer cluster reaches better CIFAR-10 accuracy than He et al. (2015), in less wall time than its fastest GPU running alone. Same training loop. One knob.
| | He et al. 2015 | Single RTX 5060 Ti | Mixed 5060 Ti + 1060 6GB |
|---|---|---|---|
| CIFAR-10 eval | 91.25% | 91.66% | 92.42% |
| Wall time | - | 3,127 s | 2,650 s |
ResNet-20 on CIFAR-10, Graph builder, nccl-cadence policy.
Published baseline: He et al. 2015, Table 6.
The slower GPU contributes useful work instead of slowing the cluster down.
ElChe cadence auto-detects per-GPU speed. The slow device anchors synchronization; the fast one ranges ahead and fills what would be idle time. Three policies × two backends, A/B-testable in one line.
```rust
Ddp::setup(&model, &builder, |p| Adam::new(p, 0.001))?;

for batch in model.epoch(0) {
    let batch = batch?;
    let loss = model.forward_batch(&batch)?.mse(&target)?;
    loss.backward()?;
    model.step()?;   // AllReduce + sync + optimizer + zero_grad
}
```
```rust
let ddp = Ddp::builder(model_factory, optim_factory, train_fn)
    .dataset(dataset)
    .batch_size(32)
    .num_epochs(10)
    .policy(ApplyPolicy::Cadence)     // Sync | Cadence | Async
    .backend(AverageBackend::Nccl)    // Nccl | Cpu
    .run()?;

let state = ddp.join()?;              // averaged params + buffers on CPU
```
A live web dashboard with loss curves, GPU and VRAM tracking, per-rank throughput, and ETA. Zero external dependencies: call monitor.serve(3000) and open a browser.
Live dashboard from an earlier FBRL letter model run (v0.1.1, GTX 1060), 19% faster with the training monitor running.
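A minimal usage sketch; only serve(3000) is documented above, so the Monitor constructor name is an assumption.

```rust
// Sketch only: `Monitor::new()` is an assumed constructor; `serve(3000)` is
// the documented call. The dashboard then streams loss curves, GPU/VRAM,
// per-rank throughput, and ETA while the usual training loop runs.
let monitor = Monitor::new();
monitor.serve(3000)?;   // open http://localhost:3000 in a browser
```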
A Rust-native framework for researchers who care about what happens between the GPU kernels.
Tensor memory freed by Drop the instant it leaves scope. No GC, no finalizers, no VRAM budget heuristics. Five phases of PyTorch memory management replaced by impl Drop for Tensor.
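A small illustration of that ownership model, reusing the forward and mse calls shown earlier (x stands in for any input batch):

```rust
// Intermediate tensors live exactly as long as their bindings.
{
    let hidden = model.forward(&x.into())?;   // VRAM allocated here
    let loss = hidden.mse(&target)?;
    loss.backward()?;
}   // `hidden` and `loss` are dropped here; their VRAM is returned at once,
    // with no GC pass and no finalizer queue.
```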
Links libtorch's stable C API, not a specific CUDA toolkit version. Pascal through Blackwell work out of the box, no version pinning. If nvidia-smi runs, flodl can train on it.
Linear, Conv1d/2d/3d, LayerNorm, BatchNorm, RMSNorm, GroupNorm, LSTM, GRU, MultiheadAttention, 15 losses, 7 optimizers, 8 schedulers. Same semantics, same training loop.
fdl CLI for hardware detection and libtorch management. Dockerfiles and Makefile scaffolded. Architecture-validated checkpoints. Live dashboard. Timeline profiler. No glue code required.
Same libtorch kernels, no Python interpreter tax. 8 wins out of 10 on a locked-clock interleaved benchmark vs PyTorch 2.10, zero regressions. The convnet tie proves the gain is framework overhead removed, not a kernel change.
Mix any CUDA GPUs, identical training loop for 1 or N. ElChe cadence lets the fast card range ahead while the slow one anchors sync. A 5060 Ti + 1060 cluster reaches 92.42% on CIFAR-10 ResNet-20 in less wall time than the 5060 Ti alone.
We started this because training in the dark stopped being normal. Loss numbers trickling through a Python interpreter, GPUs stalling behind a wall we couldn't see through. One day, we walked out.