PyTorch → flodl Migration Guide

A side-by-side reference for PyTorch users learning flodl.

Want the fast path? flodl ships with an AI porting skill that reads your PyTorch script and generates a complete flodl project. Run /port my_model.py in Claude Code, or see the Porting Guide. You can also run fdl api-ref to get the full, up-to-date API surface in your terminal.

Imports

In PyTorch, import torch gives you almost everything. Rust uses explicit imports — flodl re-exports its full API from the crate root for convenience:

use flodl::*;  // brings in Tensor, Variable, all nn modules, graph builder, etc.

Or import selectively by module:

use flodl::tensor::{Tensor, TensorOptions, DType, Device};
use flodl::autograd::{Variable, no_grad};
use flodl::nn::{Linear, Adam, mse_loss, Module, Parameter};
use flodl::graph::{FlowBuilder, Graph, MergeOp};

| If you’re doing… | You need |
|---|---|
| Defining a model with the graph builder | nn, graph |
| Writing a training loop | nn, autograd, tensor |
| Creating raw tensors | tensor |
| Everything (main training script) | use flodl::* |

How it maps to PyTorch:

| PyTorch | flodl | What’s in it |
|---|---|---|
| torch.* | flodl::tensor | Creation (zeros, rand, arange…), math ops, shape ops |
| torch.autograd | flodl::autograd | Variable, no_grad, gradient tracking |
| torch.nn | flodl::nn | Modules (Linear, Conv2d…), activations, losses |
| torch.optim | flodl::nn | Optimizers (Adam, SGD, AdamW), LR schedulers |
| (no equivalent) | flodl::graph | Fluent computation graph builder |

Core Concepts

| PyTorch | flodl | Notes |
|---|---|---|
| torch.Tensor | Tensor | Immutable, Drop-based VRAM cleanup, Send+Sync |
| torch.autograd | Variable | Wraps Tensor, tracks gradients via Rc<RefCell> |
| torch.nn.Module | Module trait | forward(&self, &Variable) -> Result<Variable> + parameters() |
| model.train() | module.train() | Called on individual modules |
| model.eval() | module.eval() | Disables dropout, freezes BatchNorm stats |
| with torch.no_grad(): | no_grad(\|\| { ... }) | RAII guard disables gradient tracking |

Error handling: flodl returns Result<T> instead of panicking. Use ? to propagate errors:

let y = x.matmul(&w)?.add(&b)?.relu()?;

Reproducibility

# PyTorch
torch.manual_seed(42)
torch.cuda.manual_seed_all(42)
// flodl
flodl::manual_seed(42);
flodl::cuda_manual_seed_all(42);  // no-op without CUDA feature

manual_seed controls all libtorch random operations: Tensor::rand, Tensor::randn, dropout masks, weight initialization. Call it before model creation for full reproducibility.
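
A minimal ordering sketch (seed first, then build, so weight initialization draws from the seeded generator):

flodl::manual_seed(42);
flodl::cuda_manual_seed_all(42);            // no-op without the CUDA feature

let layer = Linear::new(784, 128)?;         // seeded init: identical weights every run
let noise = Tensor::randn(&[2, 3], TensorOptions::default())?;  // also deterministic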

For CPU-side randomness (data shuffling, augmentation), use Rng:

use flodl::Rng;

let mut rng = Rng::seed(42);       // deterministic
rng.shuffle(&mut indices);          // Fisher-Yates shuffle
let val = rng.f32();               // uniform [0, 1)
let coin = rng.bernoulli(0.5);    // true ~50%

Tensor Creation

# PyTorch
x = torch.zeros(2, 3)
x = torch.ones(2, 3)
x = torch.rand(2, 3)
x = torch.randn(2, 3)
x = torch.full((2, 3), 7.0)
x = torch.eye(4)
x = torch.arange(0, 10, 2)
x = torch.tensor([1.0, 2.0, 3.0])
x = torch.tensor([0, 1, 2], dtype=torch.int64)
x = torch.linspace(0, 1, 10)
// flodl
let opts = TensorOptions::default();  // Float32, CPU
let x = Tensor::zeros(&[2, 3], opts)?;
let x = Tensor::ones(&[2, 3], opts)?;
let x = Tensor::rand(&[2, 3], opts)?;
let x = Tensor::randn(&[2, 3], opts)?;
let x = Tensor::full(&[2, 3], 7.0, opts)?;
let x = Tensor::eye(4, opts)?;
let x = Tensor::arange(0.0, 10.0, 2.0, opts)?;
let x = Tensor::from_f32(&[1.0, 2.0, 3.0], &[3], Device::CPU)?;
let x = Tensor::from_i64(&[0, 1, 2], &[3], Device::CPU)?;
let x = Tensor::linspace(0.0, 1.0, 10, opts)?;

Creation helpers:

// Like-tensors (same dtype/device as source)
let y = Tensor::zeros_like(&x)?;
let y = Tensor::ones_like(&x)?;

// Stacking
let y = Tensor::stack(&[&a, &b, &c], 0)?;

Tensor Operations

Arithmetic

# PyTorch
c = a + b           # element-wise add
c = a - b           # element-wise sub
c = a * b           # element-wise mul
c = a / b           # element-wise div
c = a @ b           # matrix multiply
c = x * 2.0         # scalar multiply
c = x + 1.0         # scalar add
c = x / 3.0         # scalar divide
c = -x              # negation
// flodl
let c = a.add(&b)?;
let c = a.sub(&b)?;
let c = a.mul(&b)?;
let c = a.div(&b)?;
let c = a.matmul(&b)?;
let c = x.mul_scalar(2.0)?;
let c = x.add_scalar(1.0)?;
let c = x.div_scalar(3.0)?;
let c = x.neg()?;

Math Functions

# PyTorch
y = torch.exp(x)
y = torch.log(x)
y = torch.sqrt(x)
y = torch.abs(x)
y = torch.pow(x, 2.0)
y = torch.clamp(x, -1.0, 1.0)
y = torch.clamp(x, min=0.0)
y = torch.clamp(x, max=1.0)
y = torch.sin(x)
y = torch.cos(x)
y = torch.tan(x)
y = torch.asin(x)
y = torch.acos(x)
y = torch.atan(x)
y = torch.atan2(y, x)
y = torch.sign(x)
y = torch.floor(x)
y = torch.ceil(x)
y = torch.round(x)
y = torch.trunc(x)
y = torch.frac(x)
y = torch.reciprocal(x)
y = torch.log1p(x)
y = torch.expm1(x)
y = torch.log2(x)
y = torch.log10(x)
y = torch.erf(x)
y = torch.erfc(x)
y = torch.fmod(x, 3.0)
y = torch.remainder(x, 3.0)
y = torch.lerp(start, end, weight)
y = torch.addmm(bias, mat1, mat2)
y = torch.addcmul(input, t1, t2, value=1.0)
y = torch.addcdiv(input, t1, t2, value=1.0)
y = torch.isclose(x, y, rtol=1e-5, atol=1e-8)
y = torch.maximum(a, b)
y = torch.minimum(a, b)
y = x.masked_fill(mask, 0.0)
y = F.normalize(x, p=2, dim=1)
y = F.cosine_similarity(a, b, dim=1)
// flodl
let y = x.exp()?;
let y = x.log()?;
let y = x.sqrt()?;
let y = x.abs()?;
let y = x.pow_scalar(2.0)?;
let y = x.clamp(-1.0, 1.0)?;
let y = x.clamp_min(0.0)?;
let y = x.clamp_max(1.0)?;
let y = x.sin()?;
let y = x.cos()?;
let y = x.tan()?;
let y = x.asin()?;
let y = x.acos()?;
let y = x.atan()?;
let y = y_var.atan2(&x_var)?;            // Variable method
let y = x.sign()?;
let y = x.floor()?;
let y = x.ceil()?;
let y = x.round()?;
let y = x.trunc()?;
let y = x.frac()?;
let y = x.reciprocal()?;
let y = x.log1p()?;                      // ln(1+x), stable for small x
let y = x.expm1()?;                      // exp(x)-1, stable for small x
let y = x.log2()?;
let y = x.log10()?;
let y = x.erf()?;
let y = x.erfc()?;
let y = x.fmod(3.0)?;                    // C-style remainder
let y = x.remainder(3.0)?;               // Python-style modulo
let y = start.lerp(&end, 0.5)?;          // linear interpolation
let y = bias.addmm(&mat1, &mat2, 1.0, 1.0)?;  // beta*self + alpha*(mat1 @ mat2)
let y = inp.addcmul(&t1, &t2, 1.0)?;    // self + value * t1 * t2
let y = inp.addcdiv(&t1, &t2, 1.0)?;    // self + value * t1 / t2
let y = x.isclose(&y, 1e-5, 1e-8)?;
let y = a.maximum(&b)?;
let y = a.minimum(&b)?;
let y = x.masked_fill(&mask, 0.0)?;      // Variable method
let y = x.normalize(2.0, 1)?;            // Lp-normalize along dim
let y = a.cosine_similarity(&b, 1, 1e-8)?;

Activations and Element-wise Ops

# PyTorch
y = torch.relu(x)
y = torch.sigmoid(x)
y = torch.tanh(x)
y = F.gelu(x)
y = F.silu(x)
y = F.leaky_relu(x, 0.01)
y = F.elu(x, alpha=1.0)
y = F.softplus(x, beta=1.0)
y = F.mish(x)
y = torch.selu(x)
y = F.hardswish(x)
y = F.hardsigmoid(x)
y = torch.softmax(x, dim=1)
y = F.log_softmax(x, dim=1)
// flodl
let y = x.relu()?;
let y = x.sigmoid()?;
let y = x.tanh()?;
let y = x.gelu()?;
let y = x.silu()?;
let y = x.leaky_relu(0.01)?;
let y = x.elu(1.0)?;
let y = x.softplus(1.0, 20.0)?;
let y = x.mish()?;
let y = x.selu()?;
let y = x.hardswish()?;
let y = x.hardsigmoid()?;
let y = x.softmax(1)?;
let y = x.log_softmax(1)?;

Reductions

# PyTorch
s = x.sum()
s = x.sum(dim=1, keepdim=True)
m = x.mean()
m = x.mean(dim=1, keepdim=True)
v = x.var()
v = x.std()
v = x.max(dim=1, keepdim=True).values
v = x.min(dim=1, keepdim=True).values
idx = x.argmax(dim=1)
n = x.norm()
p = x.prod()
p = x.prod(dim=1, keepdim=True)
c = x.cumsum(dim=0)
l = torch.logsumexp(x, dim=1, keepdim=True)
// flodl
let s = x.sum()?;
let s = x.sum_dim(1, true)?;
let m = x.mean()?;
let m = x.mean_dim(1, true)?;
let v = x.var()?;
let v = x.std()?;
let v = x.max_dim(1, true)?;
let v = x.min_dim(1, true)?;
let idx = x.argmax(1, false)?;
let n = x.norm()?;
let p = x.prod()?;
let p = x.prod_dim(1, true)?;
let c = x.cumsum(0)?;
let l = x.logsumexp(1, true)?;

Shape Operations

# PyTorch
y = x.reshape(2, 3)
y = x.view(2, 3)           # same as reshape
y = x.squeeze(0)
y = x.unsqueeze(0)
y = x.flatten(1)
y = x.permute(0, 2, 1)
y = x.transpose(0, 1)
y = x.expand(4, 3)
y = x.contiguous()
y = x.movedim(0, 2)
y = x.flip([0, 1])
y = x.roll(2, dims=0)
y = x.diagonal(0, 0, 1)
y = x.tile((2, 3))
y = x.triu(0)
y = x.tril(0)
y = x.split(2, dim=0)
y = x.unbind(dim=0)
grids = torch.meshgrid(x, y, indexing='ij')
// flodl
let y = x.reshape(&[2, 3])?;
// no separate view — reshape handles it
let y = x.squeeze(0)?;
let y = x.unsqueeze(0)?;
let y = x.flatten(1, -1)?;
let y = x.permute(&[0, 2, 1])?;
let y = x.transpose(0, 1)?;
let y = x.expand(&[4, 3])?;
let y = x.contiguous()?;
let y = x.movedim(0, 2)?;
let y = x.flip(&[0, 1])?;
let y = x.roll(2, 0)?;
let y = x.diagonal(0, 0, 1)?;
let y = x.tile(&[2, 3])?;
let y = x.triu(0)?;
let y = x.tril(0)?;
let parts = x.split(2, 0)?;
let slices = x.unbind(0)?;
let grids = Tensor::meshgrid(&[&x, &y])?;

Indexing and Slicing

# PyTorch
y = x[0]                       # select first along dim 0
y = x[:, 1:3]                  # narrow: dim=1, start=1, length=2
y = x.index_select(0, indices) # gather rows
y = torch.cat([a, b], dim=0)
y = torch.stack([a, b], dim=0)
y = x.chunk(3, dim=1)
y = x.repeat(2, 3)
// flodl
let y = x.select(0, 0)?;
let y = x.narrow(1, 1, 2)?;
let y = x.index_select(0, &indices)?;
let y = a.cat(&b, 0)?;
let y = Tensor::cat_many(&[&a, &b, &c], 0)?;  // concatenate many tensors
let y = Tensor::stack(&[&a, &b], 0)?;
let chunks = x.chunk(3, 1)?;
let y = x.repeat(&[2, 3])?;
let y = x.pad(&[1, 1], 0.0)?;                  // constant-value pad
let y = x.pad_mode(&[1, 1], 1, 0.0)?;          // 0=constant, 1=reflect, 2=replicate, 3=circular

// Batch iteration (split along dim 0)
for batch in data.batches(32)? {
    let x = Variable::new(batch, false);
    // ...
}

Comparisons and Conditionals

# PyTorch
mask = x > threshold
mask = x >= 0
y = torch.where(mask, a, b)
// flodl
let mask = x.gt(&threshold)?;
let mask = x.ge_scalar(0.0)?;
let y = Tensor::where_cond(&mask, &a, &b)?;

// Scalar comparisons
let mask = x.gt_scalar(0.0)?;
let mask = x.lt_scalar(1.0)?;
let mask = x.ge_scalar(0.0)?;
let mask = x.le_scalar(1.0)?;

Dtype Casting

# PyTorch
y = x.float()       # → float32
y = x.double()      # → float64
y = x.half()        # → float16
y = x.to(torch.bfloat16)
// flodl
let y = x.to_dtype(DType::Float32)?;
let y = x.to_dtype(DType::Float64)?;
let y = x.to_dtype(DType::Float16)?;
let y = x.to_dtype(DType::BFloat16)?;

Data Access

# PyTorch
val = loss.item()          # scalar → float
data = x.numpy()           # → numpy array
data = x.tolist()          # → Python list
// flodl
let val = loss.item()?;           // scalar → f64
let data = x.to_f32_vec()?;      // → Vec<f32>
let data = x.to_f64_vec()?;      // → Vec<f64>
let data = x.to_i64_vec()?;      // → Vec<i64>

Tensor Metadata

# PyTorch
x.shape          # torch.Size([2, 3])
x.ndim           # 2
x.numel()        # 6
x.dtype          # torch.float32
x.device         # device(type='cpu')
// flodl
x.shape()        // Vec<i64>: [2, 3]
x.ndim()         // usize: 2
x.numel()        // i64: 6
x.dtype()        // DType::Float32
x.device()       // Device::CPU

Autograd

# PyTorch
x = torch.tensor([1.0, 2.0], requires_grad=True)
y = (x ** 2).sum()
y.backward()
print(x.grad)  # tensor([2.0, 4.0])
// flodl
let xt = Tensor::from_f32(&[1.0, 2.0], &[2], Device::CPU)?;
let x = Variable::new(xt, true);  // requires_grad = true
let y = x.pow_scalar(2.0)?.sum()?;
y.backward()?;
println!("{:?}", x.grad());  // Some(tensor([2.0, 4.0]))

Key Differences

| Aspect | PyTorch | flodl |
|---|---|---|
| Gradient access | x.grad (attribute) | x.grad() returns Option<Tensor> |
| Clear gradients | x.grad.zero_() | x.zero_grad() |
| Detach | x.detach() | x.detach() — returns new leaf Variable |
| No-grad block | with torch.no_grad(): | no_grad(\|\| { ... }) or let _g = NoGradGuard::new(); |
| Check grad enabled | torch.is_grad_enabled() | is_grad_enabled() |
| Leaf check | x.is_leaf | x.is_leaf() |

Differentiable ops on Variable:

Arithmetic: Add, Sub, Mul, Div, Matmul, MulScalar, AddScalar, DivScalar, Neg

Activations: ReLU, Sigmoid, Tanh, GELU, SiLU, LeakyReLU, ELU, Softplus, Mish, SELU, Hardswish, Hardsigmoid, PReLU, Softmax, LogSoftmax

Math: Exp, Log, Sqrt, Abs, Pow, Sin, Cos, Sign, Floor, Ceil, Round, Reciprocal, Clamp, ClampMin, ClampMax, Log1p, Expm1, Log2, Log10, Atan2, Maximum, Minimum, MaskedFill, Normalize, CosineSimilarity, Triu, Tril

Reductions: Sum, SumDim, Mean, MeanDim, Var, Std, VarDim, StdDim, Min, Max, MinDim, MaxDim, Prod, ProdDim, Cumsum, Logsumexp

Shape: Reshape, Transpose, Permute, Squeeze, Unsqueeze, UnsqueezeMany, Flatten, Expand, Select, Narrow, Cat, CatMany, Stack, Chunk, Repeat, Pad, IndexSelect, Gather, TopK, Sort

NN: Conv1d, ConvTranspose1d, Conv2d, ConvTranspose2d, Conv3d, ConvTranspose3d, MaxPool2d, AvgPool2d, MaxPool1d, AvgPool1d, AdaptiveAvgPool2d, AdaptiveMaxPool2d, InstanceNorm, GroupNorm, LayerNorm, GridSample, PixelShuffle, PixelUnshuffle, Bilinear, EmbeddingBag, Im2col, Col2im

Neural Network Layers

# PyTorch
layer = nn.Linear(784, 128)
layer = nn.Linear(784, 128, bias=False)
layer = nn.Conv1d(3, 16, kernel_size=5, stride=2, padding=2)
layer = nn.Conv2d(3, 64, kernel_size=3, padding=1)
layer = nn.Conv3d(1, 32, kernel_size=3, padding=1)
layer = nn.ConvTranspose1d(16, 3, kernel_size=5)
layer = nn.ConvTranspose2d(64, 3, kernel_size=4, stride=2, padding=1)
layer = nn.ConvTranspose3d(32, 1, kernel_size=3)
layer = nn.MaxPool2d(kernel_size=2)
layer = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
layer = nn.AvgPool2d(kernel_size=2)
layer = nn.MaxPool1d(kernel_size=2)
layer = nn.AvgPool1d(kernel_size=2)
layer = nn.AdaptiveMaxPool2d((7, 7))
layer = nn.AdaptiveAvgPool2d((1, 1))
layer = nn.PixelShuffle(2)
layer = nn.PixelUnshuffle(2)
layer = nn.Upsample(size=(64, 64), mode='bilinear')
layer = nn.Unfold(kernel_size=3)
layer = nn.Fold(output_size=(28, 28), kernel_size=3)
layer = nn.LayerNorm(128)
layer = nn.RMSNorm(128)
layer = nn.GroupNorm(4, 16)
layer = nn.BatchNorm1d(128)
layer = nn.BatchNorm2d(64)
layer = nn.InstanceNorm2d(64, affine=True)
layer = nn.Dropout(p=0.5)
layer = nn.Dropout2d(p=0.1)
layer = nn.AlphaDropout(p=0.1)
layer = nn.ZeroPad2d(1)
layer = nn.ReflectionPad2d(1)
layer = nn.Embedding(1000, 128)
layer = nn.EmbeddingBag(1000, 128)
cell = nn.GRUCell(128, 256)
cell = nn.LSTMCell(128, 256)
layer = nn.GRU(128, 256, num_layers=2)
layer = nn.LSTM(128, 256, num_layers=2)
layer = nn.MultiheadAttention(512, 8)
layer = nn.Bilinear(128, 64, 32)
// flodl
let layer = Linear::new(784, 128)?;
let layer = Linear::no_bias(784, 128)?;
let layer = Conv1d::configure(3, 16, 5).with_stride(2).with_padding(2).done()?;
let layer = Conv2d::new(3, 64, 3)?;                                          // defaults: stride=1, padding=0
let layer = Conv2d::configure(3, 64, 3).with_padding(1).with_stride(2).done()?;    // fluent builder
let layer = Conv2d::build(3, 64, 3, true, [1,1], [1,1], [1,1], 1, Device::CPU)?;   // full control
let layer = Conv3d::configure(1, 32, [3,3,3]).with_padding([1,1,1]).done()?;
let layer = ConvTranspose1d::new(16, 3, 5)?;
let layer = ConvTranspose2d::new(64, 3, 4)?;
let layer = ConvTranspose3d::new(32, 1, [3,3,3])?;
let pool = MaxPool2d::new(2);                                                // kernel=2, stride=2 (defaults to kernel)
let pool = MaxPool2d::with_stride(3, 2).padding(1);                          // kernel=3, stride=2, padding=1
let pool = AvgPool2d::new(2);
let pool = MaxPool1d::new(2);
let pool = AvgPool1d::new(2);
let pool = AdaptiveMaxPool2d::new(7, 7);
let pool = AdaptiveAvgPool2d::new([1, 1]);                                   // global avg pool (ResNet head)
let layer = PixelShuffle::new(2);
let layer = PixelUnshuffle::new(2);
let layer = Upsample::new(&[64, 64], 1);                                    // mode: 0=nearest, 1=bilinear
let layer = Unfold::new([3,3], [1,1], [0,0], [1,1]);
let layer = Fold::new([28,28], [3,3], [1,1], [0,0], [1,1]);
let layer = LayerNorm::new(128)?;
let layer = RMSNorm::new(128)?;
let layer = GroupNorm::new(4, 16)?;
let layer = BatchNorm::new(128)?;                                            // for [B, features] after Linear
let layer = BatchNorm2d::new(64)?;                                           // for [B, C, H, W] after Conv2d
let layer = InstanceNorm::new(64, true)?;                                    // affine=true
let layer = Dropout::new(0.5);
let layer = Dropout2d::new(0.1);
let layer = AlphaDropout::new(0.1);                                          // for SELU networks
let layer = ZeroPad2d::new(1);
let layer = ReflectionPad2d::new(1);
let layer = Embedding::new(1000, 128)?;
let layer = EmbeddingBag::new(1000, 128)?;
let cell = GRUCell::new(128, 256)?;
let cell = LSTMCell::new(128, 256)?;
let layer = GRU::new(128, 256, 2)?;                                          // 2-layer GRU
let layer = LSTM::new(128, 256, 2)?;                                         // 2-layer LSTM
let layer = MultiheadAttention::new(512, 8)?;
let layer = Bilinear::new(128, 64, 32, true)?;

// On a specific device (all modules have on_device() variants):
let layer = Linear::on_device(784, 128, Device::CUDA(0))?;
let layer = Conv2d::configure(3, 64, 3).with_padding(1).on_device(Device::CUDA(0)).done()?;
let layer = LayerNorm::on_device(128, Device::CUDA(0))?;
let layer = RMSNorm::on_device(128, Device::CUDA(0))?;
let layer = GroupNorm::on_device(4, 16, Device::CUDA(0))?;
let layer = BatchNorm::on_device(128, Device::CUDA(0))?;
let layer = BatchNorm2d::on_device(64, Device::CUDA(0))?;
let layer = InstanceNorm::on_device(64, true, Device::CUDA(0))?;
let layer = Embedding::on_device(1000, 128, Device::CUDA(0))?;
let cell = GRUCell::on_device(128, 256, Device::CUDA(0))?;
let cell = LSTMCell::on_device(128, 256, Device::CUDA(0))?;
let layer = GRU::on_device(128, 256, 2, false, Device::CUDA(0))?;
let layer = LSTM::on_device(128, 256, 2, false, Device::CUDA(0))?;
let layer = MultiheadAttention::on_device(512, 8, Device::CUDA(0))?;

Activations (as Modules)

# PyTorch
nn.ReLU()
nn.Sigmoid()
nn.Tanh()
nn.GELU()
nn.SiLU()
nn.LeakyReLU(0.01)
nn.ELU(alpha=1.0)
nn.Softplus(beta=1.0)
nn.Mish()
nn.SELU()
nn.Hardswish()
nn.Hardsigmoid()
nn.PReLU(num_parameters=1)
nn.Softmax(dim=-1)
nn.LogSoftmax(dim=-1)
nn.Flatten(start_dim=1)
nn.Identity()
// flodl — unit structs (no allocation, bare-name usage)
ReLU
Sigmoid
Tanh
GELU                       // erf form (default) — GELU::tanh() for the tanh approximation
SiLU
Mish
SELU
Hardswish
Hardsigmoid
Identity

// Parameterized at construction
LeakyReLU::new(0.01)
ELU::new(1.0)
Softplus::new(1.0, 20.0)         // beta, threshold
Softmax::new(-1)                  // dim
LogSoftmax::new(-1)
Flatten::new(1, -1)               // start_dim, end_dim

// Learnable parameters
let prelu = PReLU::new(1, Device::CPU)?;

Preprocessing Modules

# PyTorch
blur = torchvision.transforms.GaussianBlur(kernel_size=7, sigma=1.5)
# or functional:
y = torchvision.transforms.functional.gaussian_blur(x, kernel_size=7, sigma=1.5)
// flodl — as a Module (for use in FlowBuilder graphs)
let blur = GaussianBlur::new(1.5);  // kernel size auto-computed from sigma
let y = blur.forward(&x)?;

// flodl — as a free function
let y = gaussian_blur_2d(&x, 1.5)?;  // input must be [B, C, H, W]

GaussianBlur is stateless (no parameters). Kernel size is 2 * ceil(3 * sigma) + 1, matching OpenCV’s default (sigma=1.5 yields an 11×11 kernel). It runs under NoGradGuard, so no autograd graph is built.
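
Because GaussianBlur implements Module, it can also sit at the front of a builder chain as a fixed preprocessing stage. A minimal sketch, assuming the FlowBuilder and Conv2d APIs covered below and an images Variable wrapping a [B, 3, H, W] tensor:

let model = FlowBuilder::from(GaussianBlur::new(1.5))
    .through(Conv2d::configure(3, 16, 3).with_padding(1).done()?)
    .through(ReLU)
    .build()?;

let y = model.forward(&images)?;   // blur runs grad-free; the conv stack stays trainable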

Composite Modules

In PyTorch, nn.Module.__init__ auto-discovers child modules assigned to self. In flodl, composite modules implement the Module trait and declare children via sub_modules() — enabling recursive device placement, training mode toggling, and parameter collection.

# PyTorch — children auto-discovered
class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        return self.fc2(F.relu(self.fc1(x)))

model.to(device)           # walks children automatically
model.parameters()         # collects from all children
model.train()              # propagates to children
// flodl — children declared explicitly
struct MLP {
    fc1: Linear,
    fc2: Linear,
}

impl Module for MLP {
    fn forward(&self, input: &Variable) -> Result<Variable> {
        let h = self.fc1.forward(input)?.relu()?;
        self.fc2.forward(&h)
    }

    fn parameters(&self) -> Vec<Parameter> {
        nn::collect_parameters(&[&self.fc1 as &dyn Module, &self.fc2])
    }

    fn name(&self) -> &str { "MLP" }
}

Or skip manual structs entirely — use the graph builder (see below).

| Aspect | PyTorch | flodl |
|---|---|---|
| Child discovery | Implicit (self.x = ...) | Explicit (sub_modules()) |
| Parameter collection | Automatic | collect_parameters() walks tree |
| Device move | model.to(device) | module.move_to_device(Device::CUDA(0)) |
| Training mode | model.train() / model.eval() | module.train() / module.eval() |

Loss Functions

# PyTorch
loss = F.mse_loss(pred, target)
loss = F.cross_entropy(logits, labels)
loss = F.nll_loss(log_probs, labels)
loss = F.binary_cross_entropy(probs, target)
loss = F.binary_cross_entropy_with_logits(pred, target)
loss = F.l1_loss(pred, target)
loss = F.smooth_l1_loss(pred, target, beta=1.0)
loss = F.kl_div(log_probs, targets, reduction='batchmean')
loss = F.ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss = F.poisson_nll_loss(pred, target, log_input=True)
loss = F.triplet_margin_loss(anchor, positive, negative, margin=1.0)
loss = F.cosine_embedding_loss(x1, x2, labels, margin=0.0)
loss = F.hinge_embedding_loss(input, labels, margin=1.0)
loss = F.margin_ranking_loss(x1, x2, labels, margin=0.0)
# focal_loss — not in PyTorch, popular in object detection
// flodl — free functions, return Variable (differentiable)
let loss = mse_loss(&pred, &target)?;
let loss = cross_entropy_loss(&logits, &labels)?;   // labels: [B] indices or [B,C] one-hot
let loss = nll_loss(&log_probs, &labels)?;           // after log_softmax
let loss = bce_loss(&probs, &target)?;               // from probabilities
let loss = bce_with_logits_loss(&logits, &target)?;  // numerically stable
let loss = l1_loss(&pred, &target)?;
let loss = smooth_l1_loss(&pred, &target, 1.0)?;
let loss = kl_div_loss(&log_probs, &targets)?;
let loss = ctc_loss(&log_probs, &targets, &input_lengths, &target_lengths, 0)?;
let loss = poisson_nll_loss(&pred, &target, true)?;
let loss = focal_loss(&logits, &target, 0.25, 2.0)?;  // alpha, gamma — class imbalance
let loss = triplet_margin_loss(&anchor, &positive, &negative, 1.0)?;
let loss = cosine_embedding_loss(&x1, &x2, &labels, 0.0)?;
let loss = hinge_embedding_loss(&input, &labels, 1.0)?;
let loss = margin_ranking_loss(&x1, &x2, &labels, 0.0)?;

Optimizers

# PyTorch
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
opt = torch.optim.Adam(model.parameters(), lr=0.001)
opt = torch.optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)
opt = torch.optim.RMSprop(model.parameters(), lr=0.01)
opt = torch.optim.Adagrad(model.parameters(), lr=0.01)
opt = torch.optim.RAdam(model.parameters(), lr=0.001)
opt = torch.optim.NAdam(model.parameters(), lr=0.001)

opt.zero_grad()
loss.backward()
opt.step()
// flodl — optimizers own a clone of the param list
let mut opt = SGD::new(&params, 0.01, 0.9);     // lr, momentum
let mut opt = Adam::new(&params, 0.001);         // lr
let mut opt = AdamW::new(&params, 0.001, 0.01);  // lr, weight_decay
let mut opt = RMSprop::new(&params, 0.01);       // lr (alpha=0.99, eps=1e-8)
let mut opt = Adagrad::new(&params, 0.01);       // lr
let mut opt = RAdam::new(&params, 0.001);        // rectified Adam — auto warmup
let mut opt = NAdam::new(&params, 0.001);        // Nesterov-accelerated Adam

opt.zero_grad();
loss.backward()?;
opt.step()?;  // returns Result<()>

Parameter groups

# PyTorch
opt = torch.optim.Adam([
    {"params": encoder.parameters(), "lr": 1e-5},
    {"params": decoder.parameters(), "lr": 1e-3},
])
// flodl — builder API
let mut opt = Adam::with_groups()
    .group(&encoder_params, 1e-5)
    .group(&decoder_params, 1e-3)
    .build();

opt.set_group_lr(0, 1e-6);  // adjust one group
opt.set_lr(1e-4);           // adjust all groups

Freezing parameters

# PyTorch
for param in model.encoder.parameters():
    param.requires_grad = False
// flodl
for param in &encoder_params {
    param.freeze()?;
}
// Later: param.unfreeze()?;
// Check: param.is_frozen()

Learning Rate Scheduling

# PyTorch
scheduler = torch.optim.lr_scheduler.StepLR(opt, step_size=30, gamma=0.1)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=100, eta_min=1e-6)
scheduler = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.95)
scheduler = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[30, 60, 90], gamma=0.1)
scheduler = torch.optim.lr_scheduler.OneCycleLR(opt, max_lr=0.01, total_steps=1000)
scheduler = torch.optim.lr_scheduler.CyclicLR(opt, base_lr=1e-4, max_lr=1e-2, step_size_up=500)
scheduler.step()
// flodl — schedulers produce an lr, you apply it
let sched = StepDecay::new(0.001, 30, 0.1);
let sched = CosineScheduler::new(0.001, 1e-6, 100);
let sched = ExponentialLR::new(0.001, 0.95);
let sched = MultiStepLR::new(0.001, &[30, 60, 90], 0.1);
let sched = OneCycleLR::new(0.01, 1000);                  // 30% warmup
let sched = CyclicLR::new(1e-4, 1e-2, 500);               // symmetric triangle
let lr = sched.lr(step);
opt.set_lr(lr);

// Composable warmup:
let sched = WarmupScheduler::new(CosineScheduler::new(0.001, 1e-6, 100), 0.001, 10);

// Plateau (reduce on metric stall):
let mut sched = PlateauScheduler::new(0.001, 10, 0.1, 1e-6);
let lr = sched.observe(val_loss);
opt.set_lr(lr);

Key difference: PyTorch schedulers wrap optimizers. flodl schedulers are pure functions — you call .lr(step) or .observe(metric) and set the lr yourself.
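
Putting that together, a per-epoch sketch with the warmup-wrapped cosine schedule from above (the epoch body is elided):

let sched = WarmupScheduler::new(CosineScheduler::new(0.001, 1e-6, 100), 0.001, 10);

for epoch in 0..100 {
    opt.set_lr(sched.lr(epoch));   // the scheduler computes the lr, you apply it
    // ... forward / backward / opt.step()? ...
}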

Gradient Clipping

# PyTorch
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=0.5)
// flodl
let total_norm = clip_grad_norm(&params, 1.0)?;
let max_val = clip_grad_value(&params, 0.5)?;

Saving and Loading

# PyTorch
torch.save(model.state_dict(), "model.pt")
model.load_state_dict(torch.load("model.pt"))
// flodl — one-call checkpoint (saves params + buffers + structural hash)
model.save_checkpoint("model.fdl")?;
let report = model.load_checkpoint("model.fdl")?;
// report.loaded, report.skipped, report.missing

// Or with any io::Write / io::Read for custom I/O:
let named = model.named_parameters();
let buffers = model.named_buffers();
let hash = Some(model.structural_hash());
save_checkpoint(&mut writer, &named, &buffers, hash)?;
let report = load_checkpoint(&mut reader, &named, &buffers, hash)?;

Full training resume (model + optimizer)

Optimizers implement the Stateful trait for save/load:

// Save
model.save_checkpoint("model.fdl")?;
let mut f = File::create("optimizer.fdl")?;
optimizer.save_state(&mut f)?;

// Load
let report = model.load_checkpoint("model.fdl")?;
let mut f = File::open("optimizer.fdl")?;
optimizer.load_state(&mut f)?;

Migrating checkpoints across versions

When parameter naming changes between flodl versions (e.g., tag renames from the graph tree release), use migrate_checkpoint_file() to remap an old checkpoint to match your current model:

use flodl::nn::{checkpoint_version, migrate_checkpoint_file};

// Check if migration is needed
if checkpoint_version("model.fdl")? < 2 {
    let report = migrate_checkpoint_file(
        "model.fdl",          // old checkpoint (v1)
        "model_v2.fdl",       // migrated output (v2)
        &model.named_parameters(),
        &model.named_buffers(),
    )?;
    println!("{}", report);
    // unchanged (1):
    //   shared/weight
    // remapped (2):
    //   linear_0/weight -> encoder/weight
    //   linear_0/bias -> encoder/bias
}

// Load the migrated checkpoint normally
model.load_checkpoint("model_v2.fdl")?;

The migration matches entries by exact name first, then by shape+dtype in positional order. MigrateReport::is_complete() returns true when nothing was dropped or missing. Migration only works within the same model architecture; if the architecture changed, retrain.
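
A defensive sketch around the same call: check MigrateReport::is_complete() before trusting the migrated file (the report implements Display, as the println above suggests):

let report = migrate_checkpoint_file(
    "model.fdl", "model_v2.fdl",
    &model.named_parameters(), &model.named_buffers(),
)?;
if !report.is_complete() {
    // Something was dropped or is missing; inspect before loading.
    eprintln!("incomplete migration:\n{}", report);
}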

Device Placement

# PyTorch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = MyModel().to(device)
x = x.to(device)
// flodl
let device = if cuda_available() { Device::CUDA(0) } else { Device::CPU };

// Move model parameters
module.move_to_device(device);

// Move tensors
let x = x.to_device(device)?;
let x = x.to_device_of(&weights)?;  // match another tensor's device

// Move variables
let x = x.to_device(device)?;

// Create directly on device
let opts = TensorOptions { dtype: DType::Float32, device: Device::CUDA(0) };
let x = Tensor::zeros(&[2, 3], opts)?;

| Aspect | PyTorch | flodl |
|---|---|---|
| Device check | torch.cuda.is_available() | cuda_available() |
| Device count | torch.cuda.device_count() | cuda_device_count() |
| Model move | model.to(device) | module.move_to_device(device) |
| Tensor move | x.to(device) | x.to_device(device)? |
| cuDNN benchmark | torch.backends.cudnn.benchmark = True | set_cudnn_benchmark(true) |

Weight Initialization

# PyTorch — in-place mutation
nn.init.xavier_uniform_(layer.weight)
nn.init.xavier_normal_(layer.weight)
nn.init.kaiming_uniform_(layer.weight, a=math.sqrt(5))
nn.init.kaiming_normal_(layer.weight)
nn.init.uniform_(layer.weight, -0.1, 0.1)
nn.init.normal_(layer.weight, 0.0, 0.01)
nn.init.orthogonal_(layer.weight, gain=1.0)
nn.init.trunc_normal_(layer.weight, std=0.02)
// flodl — returns a new Tensor, then set_data() to apply
let w = xavier_uniform(&[out, inp], inp, out, device)?;
let w = xavier_normal(&[out, inp], inp, out, device)?;
let w = kaiming_uniform(&[out, inp], inp, 0.0, device)?;   // a=0.0 for ReLU
let w = kaiming_normal(&[out, inp], inp, 0.0, device)?;
let w = uniform(&[out, inp], -0.1, 0.1, device)?;
let w = normal(&[out, inp], 0.0, 0.01, device)?;
let w = orthogonal(&[out, inp], 1.0, device)?;             // 2D only
let w = trunc_normal(&[out, inp], 0.0, 0.02, -2.0, 2.0, device)?;  // mean, std, a, b

layer.parameters()[0].set_data(&w);

Mixed Precision Training

# PyTorch
scaler = torch.cuda.amp.GradScaler()

with torch.cuda.amp.autocast():
    output = model(input)
    loss = criterion(output, target)

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
// flodl
let mut scaler = GradScaler::new();

// Cast model to float16
cast_parameters(&params, DType::Float16);

// Forward + backward with scaled loss
let output = model.forward(&input)?;
let loss = mse_loss(&output, &target)?;
let scaled = scaler.scale(&loss)?;
scaled.backward()?;

// Step with automatic unscaling + inf/nan checking
let stepped = scaler.step(&params, &mut || opt.step())?;
scaler.update();

Training Loop Pattern

# PyTorch
model.train()
for epoch in range(num_epochs):
    for inputs, targets in dataloader:
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
    scheduler.step()
    print(f"Epoch {epoch}: loss={loss.item():.4f}")
// flodl
model.train();
for epoch in 0..num_epochs {
    // your data loading here
    opt.zero_grad();
    let output = model.forward(&input)?;
    let loss = cross_entropy_loss(&output, &target)?;
    loss.backward()?;
    clip_grad_norm(&params, 1.0)?;
    opt.step()?;

    opt.set_lr(sched.lr(epoch));
    println!("Epoch {}: loss={:.4}", epoch, loss.item()?);
}

// Inference
model.eval();
let pred = no_grad(|| model.forward(&test_input))?;

Error Handling

PyTorch raises Python exceptions. flodl returns Result<T, TensorError>. Both Result and TensorError are re-exported from flodl::*. Use Rust’s ? operator for clean propagation:

fn train_step(model: &Graph, input: &Variable, target: &Variable,
              optimizer: &mut Adam) -> Result<f64> {
    optimizer.zero_grad();
    let output = model.forward(input)?;
    let loss = mse_loss(&output, target)?;
    loss.backward()?;
    optimizer.step()?;
    loss.item()
}

Memory Management

| Aspect | PyTorch | flodl |
|---|---|---|
| Model memory | Python GC + reference counting | Rust Drop trait — deterministic deallocation |
| GPU memory | GC-delayed; torch.cuda.empty_cache() | Freed immediately when last reference drops |
| Gradient graph | Freed after .backward() | backward() also calls detach_() — grad_fn chain freed synchronously |
| No-grad inference | with torch.no_grad(): | no_grad(\|\| { ... }) or NoGradGuard::new() |
| Handle diagnostics | N/A | live_tensor_count(), rss_kb() |

No manual memory management needed. Rust’s ownership system handles it.
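
A sketch of what deterministic cleanup looks like in practice, using the live_tensor_count() diagnostic from the table above:

let before = live_tensor_count();
{
    let x = Tensor::zeros(&[1024, 1024], TensorOptions::default())?;
    let _y = x.matmul(&x)?;        // two live tensor handles in this scope
}                                  // both dropped here: memory freed immediately, no GC delay
assert_eq!(live_tensor_count(), before);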

Graph Builder (flodl-specific)

flodl’s unique feature: a fluent API for building computation graphs declaratively. No PyTorch equivalent — this replaces manual nn.Module subclassing.

Sequential

# PyTorch
model = nn.Sequential(
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Linear(128, 10),
)
// flodl
let model = FlowBuilder::from(Linear::new(784, 128)?)
    .through(ReLU)
    .through(Linear::new(128, 10)?)
    .build()?;

Residual Connections

# PyTorch — manual
class ResBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.ReLU(), nn.Linear(128, 128))
    def forward(self, x):
        return x + self.net(x)
// flodl — one line
let model = FlowBuilder::from(Linear::new(128, 128)?)
    .also(ReLU)    // skip connection: output = input + ReLU(input)
    .build()?;

Residual with Projection Skip (ResNet downsample)

When the skip path needs its own transform (e.g. a 1×1 conv + BN to match channel/stride changes in ResNet’s downsample blocks), use also_with. It generalizes also by taking an explicit skip path alongside the main path: output = skip(x) + main(x).

# PyTorch — ResNet BasicBlock with downsample
class BasicBlock(nn.Module):
    def __init__(self, c_in, c_out, stride):
        super().__init__()
        self.main = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, stride, 1, bias=False), nn.BatchNorm2d(c_out), nn.ReLU(),
            nn.Conv2d(c_out, c_out, 3, 1, 1, bias=False), nn.BatchNorm2d(c_out),
        )
        self.downsample = (
            nn.Sequential(nn.Conv2d(c_in, c_out, 1, stride, bias=False), nn.BatchNorm2d(c_out))
            if stride != 1 or c_in != c_out else nn.Identity()
        )
    def forward(self, x):
        return F.relu(self.downsample(x) + self.main(x))
// flodl — same block, one builder chain
FlowBuilder::from(prev)
    .also_with(
        downsample_1x1_bn,         // skip branch (Identity if no projection needed)
        conv_bn_relu_conv_bn,      // main branch
    )
    .through(ReLU)
    .build()?;
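
For concreteness, the placeholder branches can themselves be built graphs. A hypothetical 16→32, stride-2 block, assuming also_with accepts any Module (including a built graph) for each branch:

let skip = FlowBuilder::from(Conv2d::configure(16, 32, 1).with_stride(2).done()?)
    .through(BatchNorm2d::new(32)?)
    .build()?;

let main = FlowBuilder::from(Conv2d::configure(16, 32, 3).with_stride(2).with_padding(1).done()?)
    .through(BatchNorm2d::new(32)?)
    .through(ReLU)
    .through(Conv2d::configure(32, 32, 3).with_padding(1).done()?)
    .through(BatchNorm2d::new(32)?)
    .build()?;

let block = FlowBuilder::from(Identity)
    .also_with(skip, main)     // output = skip(x) + main(x)
    .through(ReLU)
    .build()?;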

See ddp-bench/src/models/resnet_graph.rs for a full ResNet-20 built with also_with.

Parallel Branches

# PyTorch — manual fork/merge
class ParallelModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.branch1 = nn.Linear(128, 64)
        self.branch2 = nn.Linear(128, 64)
    def forward(self, x):
        return self.branch1(x) + self.branch2(x)
// flodl
let model = FlowBuilder::from(Linear::new(128, 128)?)
    .split(modules![Linear::new(128, 64)?, Linear::new(128, 64)?])
    .merge(MergeOp::Add)
    .build()?;

Loops (Recurrent)

let model = FlowBuilder::from(init_module)
    .loop_body(step_module)
    .for_n(10)              // fixed 10 iterations
    .through(head_module)
    .build()?;

// Or with learned halting instead of a fixed count:
let model = FlowBuilder::from(init_module)
    .loop_body(step_module)
    .until_cond(ThresholdHalt::new(0.95), 20)  // max 20 iters
    .through(head_module)
    .build()?;

Routing (Mixture of Experts)

// Hard routing — one expert per input
let model = FlowBuilder::from(encoder)
    .switch(ArgmaxSelector::new(128, 3)?, modules![expert1, expert2, expert3])
    .build()?;

// Soft routing — weighted mixture
let model = FlowBuilder::from(encoder)
    .gate(SoftmaxRouter::new(128, 3)?, modules![expert1, expert2, expert3])
    .build()?;

Tags and Cross-References

// Tag intermediate outputs for later use
let model = FlowBuilder::from(encoder)
    .tag("encoded")
    .through(decoder)
    .input(&["encoded"])       // declare named input
    .using(&["encoded"])       // wire tagged value as named input
    .build()?;

Observation and Profiling

let mut model = FlowBuilder::from(Linear::new(4, 8)?)
    .through(ReLU)
    .build()?;

// After each batch
model.end_step();

// After each epoch
model.end_epoch();

// Query metrics
let trend = model.trend("loss");
println!("mean: {}", trend.mean());

// Profiling
model.enable_profiling();
// ... run forward passes ...
if let Some(profile) = model.profile() {
    for t in &profile.nodes {
        println!("{}: {}", t.id, format_duration(t.duration.as_secs_f64()));
    }
}

// Visualization
model.dot();                         // GraphViz DOT string
model.svg(Some("model.svg"))?;      // render to SVG

The graph implements Module, so it works with optimizers, checkpointing, and everything else.
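
For example, a built graph drops straight into the training and checkpoint flows from earlier sections. A sketch reusing only APIs shown above (input and target are assumed prepared Variables):

let model = FlowBuilder::from(Linear::new(784, 128)?)
    .through(ReLU)
    .through(Linear::new(128, 10)?)
    .build()?;

let params = model.parameters();              // via the Module trait
let mut opt = Adam::new(&params, 0.001);

opt.zero_grad();
let output = model.forward(&input)?;
let loss = cross_entropy_loss(&output, &target)?;
loss.backward()?;
opt.step()?;

model.save_checkpoint("graph.fdl")?;          // checkpointing works unchanged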

Graph Tree (Hierarchical Composition)

PyTorch uses nn.Module nesting and named_modules() for hierarchical access. flodl’s graph tree provides label-path addressing for the same patterns — freeze by path, per-subgraph optimizer groups, subgraph checkpoint loading, and cross-boundary observation.

Labeling subgraphs

# PyTorch — child modules are auto-discovered
class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = Encoder()
        self.decoder = Decoder()
// flodl — label graphs for tree features
let encoder = FlowBuilder::from(scan_module)
    .through(read_module)
    .label("encoder")
    .build()?;

let model = FlowBuilder::from(encoder)  // child "encoder" registered
    .through(decoder)
    .build()?;

Selective freeze/thaw by path

# PyTorch
for param in model.encoder.parameters():
    param.requires_grad = False
# Thaw a sub-part:
for param in model.encoder.scan.parameters():
    param.requires_grad = True
// flodl — declarative, by label path
model.freeze("encoder")?;
model.thaw("encoder.scan")?;
assert!(model.is_frozen("encoder.read")?);  // read stays frozen

Per-subgraph optimizer groups

# PyTorch
optimizer = torch.optim.Adam([
    {'params': model.encoder.scan.parameters(), 'lr': 1e-4},
    {'params': model.meta.parameters(), 'lr': 1e-3},
])
// flodl
let mut optimizer = Adam::with_groups()
    .group(&model.parameters_at("encoder.scan")?, 0.0001)
    .group(&model.parameters_at("meta")?, 0.001)
    .build();

Subgraph checkpoint loading

# PyTorch — load weights into a submodule
state = torch.load("encoder_v1.pt")
model.encoder.load_state_dict(state)
// flodl — loads using the child's own namespace and hash validation
let report = model.load_subgraph_checkpoint("encoder", "encoder_v1.fdl.gz")?;

Cross-boundary observation

# PyTorch — manual: register hooks or store intermediates in forward()
// flodl — read tags and metrics across graph boundaries
model.forward(&input)?;
let hidden = model.tagged_at("encoder.hidden")?;  // Option<Variable>

// Record and track metrics in children
model.record_at("encoder.loss", loss_value)?;
model.flush(&[]);  // flushes entire tree automatically

let trend = model.trend_at("encoder.loss")?;

Training mode propagation

# PyTorch
model.encoder.eval()  # BatchNorm uses running stats
// flodl — by label path
model.set_training_at("encoder", false)?;

See Graph Tree tutorial for the full API reference.

Training Monitor (replaces TensorBoard)

PyTorch researchers typically use TensorBoard, Weights & Biases, or MLflow for training visibility. In flodl, the training monitor is built in — no external process, no pip install, no separate UI.

# PyTorch + TensorBoard
from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter("runs/experiment_1")

for epoch in range(num_epochs):
    # ... training ...
    writer.add_scalar("loss", loss.item(), epoch)
    writer.add_scalar("lr", scheduler.get_last_lr()[0], epoch)

# Then: tensorboard --logdir runs/
// flodl — built-in monitor with live dashboard
use flodl::monitor::Monitor;

let mut monitor = Monitor::new(num_epochs);
monitor.serve(3000)?;                        // live dashboard at http://localhost:3000
monitor.watch(&model);                       // graph SVG in dashboard
monitor.save_html("training_report.html");   // archive at finish

for epoch in 0..num_epochs {
    let t = std::time::Instant::now();
    // ... training ...

    monitor.log(epoch, t.elapsed(), &[("loss", loss_val), ("lr", lr)]);
    // epoch  42/100  loss=0.0023  lr=0.001  [1.2s  ETA 1m 10s]  VRAM: 2.1/6.0 GB (82%)
}

monitor.finish_with(&model);  // profiled SVG + archive saved

| Feature | TensorBoard | flodl Monitor |
|---|---|---|
| Setup | pip install tensorboard + SummaryWriter + tensorboard --logdir | monitor.serve(3000) |
| Terminal output | None (web only) | One-line per epoch with ETA |
| Resource tracking | Manual (no GPU metrics built in) | CPU/RAM/GPU/VRAM automatic |
| Live charts | Yes (web) | Yes (SSE, no polling) |
| Architecture viz | add_graph() (limited) | monitor.watch(&model) — full DOT/SVG with profiling heat map |
| Offline archive | Log files (need TensorBoard to view) | Self-contained HTML |
| Dependencies | protobuf, gRPC, webpack frontend | Zero — 16KB inline HTML/JS |

GPU Memory Queries

# PyTorch
torch.cuda.memory_allocated()     # bytes allocated by tensors
torch.cuda.memory_reserved()      # bytes reserved by caching allocator
torch.cuda.max_memory_allocated()  # peak allocated
// flodl — hardware-level via cudaMemGetInfo
let (used, total) = cuda_memory_info()?;   // (bytes_used, bytes_total)
let util = cuda_utilization();              // Option<u32> — GPU % via NVML

// Allocator-level queries
let active = cuda_active_bytes()?;             // bytes backing live tensors
let peak = cuda_peak_active_bytes()?;          // max since last reset
let peak_reserved = cuda_peak_reserved_bytes()?;  // max allocator reservation

// Release cached blocks and reset peak tracking (e.g., between profiling phases)
cuda_empty_cache();
cuda_reset_peak_stats();

| PyTorch | flodl | What it reports |
|---|---|---|
| torch.cuda.mem_get_info() | cuda_memory_info()? | (used, total) bytes via cudaMemGetInfo |
| torch.cuda.memory_allocated() | cuda_active_bytes()? | Bytes currently backing live tensors |
| torch.cuda.memory_reserved() | cuda_allocated_bytes()? | Bytes reserved by caching allocator (includes spill) |
| torch.cuda.max_memory_allocated() | cuda_peak_active_bytes()? | Peak allocated since last reset |
| torch.cuda.max_memory_reserved() | cuda_peak_reserved_bytes()? | Peak reserved since last reset |
| torch.cuda.reset_peak_memory_stats() | cuda_reset_peak_stats() | Reset peak counters |
| torch.cuda.empty_cache() | cuda_empty_cache() | Release unused cached blocks |
| (no built-in) | cuda_utilization() | GPU compute % via NVML |

The monitor samples these automatically on every log() call — you don’t need to query them manually during training.

Multi-GPU Training (DDP)

PyTorch’s DDP requires multi-process coordination, environment variables, and a launcher. flodl keeps everything in a single process.

Setup comparison

# PyTorch: requires torchrun, process groups, env vars
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data.distributed import DistributedSampler

dist.init_process_group("nccl")
rank = dist.get_rank()
model = DDP(MyModel().to(rank), device_ids=[rank])
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, sampler=sampler, batch_size=32)

# Launch: torchrun --nproc_per_node=2 train.py
// flodl (Graph DDP): one line, no process groups, no launcher
Trainer::setup(&model, &builder, |p| Adam::new(p, 0.001))?;

// Or (DDP Builder): works with any Module
let state = Trainer::builder(model_factory, optim_factory, train_fn)
    .dataset(dataset)
    .batch_size(32)
    .num_epochs(10)
    .run()?
    .join()?;

Concept mapping

| PyTorch | flodl | Notes |
|---|---|---|
| dist.init_process_group("nccl") | Automatic | NCCL init handled internally |
| DistributedDataParallel(model) | Trainer::setup() or Trainer::builder() | Single process, multi-thread |
| DistributedSampler | Automatic | DataLoader handles partitioning |
| torchrun --nproc_per_node=N | Not needed | Single-process model |
| model.to(rank) | model_factory(device) | Per-device model in closure |
| Equal batch per GPU only | ElChe cadence | Heterogeneous GPU support |
| NCCL or Gloo | AverageBackend::Nccl or Cpu | A/B testable backends |
| No built-in A/B testing | ApplyPolicy x AverageBackend | 6 combinations, swap with one line |

Key differences

See the DDP Reference for a full walkthrough of the key differences and complete API documentation.

Quick Reference Table

| PyTorch | flodl | Notes |
|---|---|---|
| torch.zeros(2,3) | Tensor::zeros(&[2,3], opts)? | Shape as slice |
| x + y | x.add(&y)? | Returns Result |
| x.requires_grad_(True) | Variable::new(x, true) | Set at creation |
| loss.backward() | loss.backward()? | Same pattern |
| optimizer.zero_grad() | opt.zero_grad() | Same pattern |
| optimizer.step() | opt.step()? | Returns Result |
| nn.Linear(in, out) | Linear::new(in, out)? | Returns Result |
| F.relu(x) | x.relu()? | Method on Variable |
| F.mse_loss(a, b) | mse_loss(&a, &b)? | Free function |
| model.to(device) | module.move_to_device(device) | |
| with torch.no_grad(): | no_grad(\|\| { }) or NoGradGuard::new() | Closure or RAII guard |
| nn.Sequential(...) | FlowBuilder::from(...).through(...).build()? | Fluent builder |
| model.train() | module.train() | |
| model.eval() | module.eval() | |
| torch.save(...) / torch.load(...) | model.save_checkpoint("m.fdl")? / model.load_checkpoint("m.fdl")? | Named .fdl format with LoadReport + structural hash validation |
| (no built-in) | migrate_checkpoint_file(src, dst, &params, &bufs)? | Remap parameter names across versions by shape+dtype matching |
| (no built-in) | checkpoint_version(path)? | Peek at checkpoint version (1=0.1.x, 2=0.2.0+) |
| param.requires_grad = False | param.freeze()? | Also: unfreeze(), is_frozen() |
| Adam([{"params":..., "lr":...}]) | Adam::with_groups().group(&p, lr).build() | Per-group LR |
| torch.cuda.memory_reserved() | cuda_allocated_bytes()? | Bytes reserved by caching allocator |
| x.pin_memory() | x.pin_memory()? | Page-locked CPU memory for async transfers |
| x.is_pinned() | x.is_pinned() | Check if tensor is in pinned memory |
| x.to(device, non_blocking=True) | x.to_device_async(device)? | Non-blocking transfer (pair with pin_memory) |
| x.to(memory_format=channels_last) | x.to_channels_last()? | NHWC layout for Conv2d (8-35% on Tensor Cores) |
| x.is_contiguous(channels_last) | x.is_channels_last() | Check memory format |
| torch.cuda.amp.autocast() | autocast(DType::Float16, \|\| { }) | Automatic mixed precision dispatch |
| torch.cuda.amp.GradScaler() | GradScaler::new() | Dynamic loss scaling for AMP |
| torch.cuda.CUDAGraph() | CudaGraph::new()? | CUDA graph capture/replay |
| torch.cuda.graph(g) | cuda_graph_capture(warmup, pool, \|\| { }) | Convenience capture helper |
| SummaryWriter + TensorBoard | Monitor::new(n).serve(3000)? | Built-in live dashboard |

See also