Tutorial 4: Training
This tutorial puts everything together: loss functions, optimizers, gradient clipping, mixed precision training, and the training loop. It builds on Tutorial 3: Modules.
Loss Functions
All loss functions are free functions returning a scalar Variable ready
for backward().
Regression Losses
use flodl::*;
let loss = mse_loss(&pred, &target)?; // Mean Squared Error
let loss = l1_loss(&pred, &target)?; // Mean Absolute Error
let loss = smooth_l1_loss(&pred, &target, 1.0)?; // Huber loss (beta=1.0)
let loss = poisson_nll_loss(&pred, &target, true)?; // Poisson NLL (log_input=true)
Classification Losses
// Cross-Entropy from raw logits.
// pred: [batch, classes] logits.
// target: [batch] class indices (Int64) or [batch, classes] one-hot/soft labels.
let loss = cross_entropy_loss(&logits, &target)?;
// Negative Log Likelihood — use after log_softmax
let loss = nll_loss(&log_probs, &target)?;
// Binary Cross-Entropy (from probabilities, after sigmoid)
let loss = bce_loss(&probs, &target)?;
// Binary Cross-Entropy with logits (numerically stable — preferred)
let loss = bce_with_logits_loss(&logits, &target)?;
// Focal Loss — down-weights easy examples for class imbalance
let loss = focal_loss(&logits, &target, 0.25, 2.0)?; // alpha, gamma
// KL Divergence
let loss = kl_div_loss(&log_pred, &target)?;
// CTC Loss — for sequence-to-sequence without alignment (speech, OCR)
let loss = ctc_loss(&log_probs, &targets, &input_lengths, &target_lengths, 0)?;
Metric Learning Losses
// Triplet margin loss — push negatives away from anchor-positive pairs
let loss = triplet_margin_loss(&anchor, &positive, &negative, 1.0)?;
// Cosine embedding loss — similar pairs close, dissimilar far
let loss = cosine_embedding_loss(&x1, &x2, &labels, 0.5)?;
// Hinge embedding loss — for binary tasks with {-1, +1} labels
let loss = hinge_embedding_loss(&input, &labels, 1.0)?;
// Margin ranking loss — x1 should be ranked higher than x2
let loss = margin_ranking_loss(&x1, &x2, &labels, 0.0)?;
Optimizers
All optimizers implement the Optimizer trait:
pub trait Optimizer {
fn step(&mut self) -> Result<()>;
fn zero_grad(&self);
fn set_lr(&mut self, lr: f64);
fn set_group_lr(&mut self, group: usize, lr: f64); // per-group LR
}
SGD
let optimizer = SGD::new(¶ms, 0.01, 0.9); // lr, momentum (0.0 for vanilla SGD)
Adam / AdamW
let optimizer = Adam::new(¶ms, 0.001); // default betas (0.9, 0.999), eps=1e-8
let optimizer = AdamW::new(¶ms, 0.001, 0.01); // decoupled weight decay
RMSprop
Adaptive learning rate with exponential moving average of squared gradients:
let optimizer = RMSprop::new(¶ms, 0.01); // default alpha=0.99, eps=1e-8
Adagrad
Accumulates all past squared gradients — works well for sparse features:
let optimizer = Adagrad::new(¶ms, 0.01);
RAdam / NAdam
Rectified Adam (variance-aware warmup) and Nesterov-accelerated Adam:
let optimizer = RAdam::new(¶ms, 0.001); // auto-warmup via variance rectification
let optimizer = NAdam::new(¶ms, 0.001); // Nesterov momentum with Adam
Fused CUDA Optimizers
On CUDA, both Adam and AdamW automatically use _fused_adamw_ – a
single multi-tensor kernel that updates all parameters, gradients, and
moment buffers in one launch. A naive implementation would require 4N
separate kernels (one each for momentum update, variance update, bias
correction, and parameter update, per parameter). The fused path reduces
this to a single kernel launch for all parameters in a group.
This is completely automatic. Adam::new() and AdamW::new() use the
fused path whenever parameters live on CUDA. No API changes are needed.
The fused kernel also exposes grad_scale and found_inf tensor
parameters internally, which GradScaler uses for integrated mixed
precision training (see Mixed Precision Training
below).
Gradient Clipping
Prevent exploding gradients by clipping after backward and before the optimizer step:
// Scale gradients so total L2 norm <= max_norm
clip_grad_norm(¶ms, 1.0)?;
// Clamp each gradient element to [-max_val, max_val]
clip_grad_value(¶ms, 0.5)?;
Under the hood, clip_grad_norm uses _foreach_norm + _foreach_mul_
internally – two kernels total regardless of the number of parameters,
instead of 2N kernels with a naive per-parameter approach. This is
particularly beneficial on CUDA where kernel launch overhead dominates
for small per-parameter operations.
Device Placement
By default, all tensors and parameters live on CPU. To train on CUDA, use
move_to_device on the graph.
Moving the model
let model = build_model()?;
if flodl::cuda_available() {
model.move_to_device(Device::CUDA(0));
}
// Create optimizer AFTER move_to_device.
let params = model.parameters();
let optimizer = Adam::new(¶ms, 0.001);
Trainer: write a step, get the loop
flodl’s Trainer is the universal training entry, working on any Module
(Graph or otherwise), CPU, single GPU, or multi-GPU, all with the same
code. You describe one training step as a closure (forward + loss);
Trainer::builder owns the rest: the loop, the backward pass, the
optimizer step, the gradient sync, and the device replication.
// Step closure: takes the replica's model and one batch, returns the
// loss Variable. The framework calls backward + optimizer step + sync.
fn train_step(model: &dyn Module, batch: &[Tensor]) -> Result<Variable> {
let input = Variable::new(batch[0].clone(), false);
let target = Variable::new(batch[1].to_dtype(DType::Int64)?, false);
let pred = model.forward(&input)?;
cross_entropy_loss(&pred, &target)
}
// Three closures: model factory (per-device build), optimizer factory,
// step. Then run().
let handle = Trainer::builder(
|dev| build_model_on(dev),
|params| Adam::new(params, 0.001),
train_step,
)
.dataset(dataset)
.batch_size(32)
.num_epochs(10)
.run()?;
let state = handle.join()?; // averaged params + buffers, ready for inference
This is the highest-level entry: framework owns the loop, the data
dispatch, the gradient sync (NCCL), and the optimizer. The
ddp-bench suite
is the canonical reference for this pattern across MLP, LeNet, ResNet,
GPT-nano, char-RNN, and conv-AE models, each wired through the same
train_step closure.
For policy choice (Sync / Cadence / Async), backend choice (NCCL / CPU averaging), and El Che heterogeneous-GPU cadence, see Multi-GPU Training and the DDP Reference.
Decomposing Trainer: keep your loop, share the setup
When you want explicit control of the training loop (multi-stage losses,
per-step observation hooks, conditional backward, custom gradient
clipping placement), Trainer::setup exposes the setup tier of the
framework-managed mode without taking over the loop. It runs the same
hardware detection, GPU replication, optimizer creation, and
training-mode toggle as Trainer::builder; the loop is yours.
let model = build_model()?;
// One call: pick best device, replicate across GPUs if available, set
// per-replica optimizer, enable training mode. Auto-tuned El Che
// cadence on heterogeneous multi-GPU.
Trainer::setup(
&model,
|dev| build_model_on(dev),
|p| Adam::new(p, 0.001),
)?;
// Loop is yours. Same code on CPU, single GPU, multi-GPU.
for (input_t, target_t) in &batches {
let input = Variable::new(input_t.clone(), false);
let target = Variable::new(target_t.clone(), false);
let loss = mse_loss(&model.forward(&input)?, &target)?;
loss.backward()?;
model.step()?; // AllReduce + buffer sync + optimizer + zero_grad
}
Task-head wrappers (e.g. flodl-hf’s BertForSequenceClassification)
use the matching Trainer::setup_head entry. See
HuggingFace Integration for a fine-tune walkthrough.
The Training Loop
When you can’t or don’t want to use Trainer (non-Graph custom code,
single-device prototype, or you’re learning the mechanics), the manual
pattern is: forward -> loss -> zero_grad -> backward -> clip -> step.
The same six steps run inside Trainer::setup’s model.step(); the
train_step closure of Trainer::builder is the forward + loss portion.
model.train();
for (input_t, target_t) in &batches {
let input = Variable::new(input_t.clone(), true);
let target = Variable::new(target_t.clone(), false);
// 1. Forward
let pred = model.forward(&input)?;
// 2. Loss
let loss = mse_loss(&pred, &target)?;
// 3. Zero gradients
optimizer.zero_grad();
// 4. Backward
loss.backward()?;
// 5. Clip gradients
clip_grad_norm(¶ms, 1.0)?;
// 6. Update parameters
optimizer.step()?;
}
Observing Training
Tag the nodes you want to monitor when building the graph:
let model = FlowBuilder::from(Linear::new(2, 16)?)
.through(GELU)
.through(Linear::new(16, 2)?).tag("output")
.build()?;
Collect and Flush
For epoch-level metrics, collect scalar values during the batch loop and flush at epoch boundaries:
for epoch in 0..num_epochs {
for (input, target) in &batches {
let pred = model.forward(&Variable::new(input.clone(), true))?;
let loss = mse_loss(&pred, &Variable::new(target.clone(), false))?;
optimizer.zero_grad();
loss.backward()?;
optimizer.step()?;
model.collect(&["output"])?; // from graph tag
model.record_scalar("loss", loss.item()?); // external metric
}
model.flush(&["output", "loss"]); // batch mean -> epoch history
model.end_epoch();
}
collect appends the scalar value of each tagged node to a batch buffer.
record pushes raw f64 values into the same buffer. flush computes
the mean, stores it in epoch history, and clears the buffer.
Stateful Graphs — end_step
Call end_step() after each training step. It severs autograd references
held by the graph and increments the step counter (used by schedulers and
observation). It detaches:
- Forward-reference state buffers (recurrent state carried between calls)
- Tagged outputs (Variables captured by
tag()for observation) - Module internal state (e.g., recurrent hidden state in custom modules)
Warning: Forgetting
end_step()causes linear memory growth — the autograd graph accumulates across batches without bound. If you see steadily rising RAM during training, a missingend_step()is the most likely cause.
model.train();
for (input_t, target_t) in &batches {
let input = Variable::new(input_t.clone(), true);
let target = Variable::new(target_t.clone(), false);
let pred = model.forward(&input)?;
let loss = mse_loss(&pred, &target)?;
optimizer.zero_grad();
loss.backward()?;
clip_grad_norm(¶ms, 1.0)?;
optimizer.step()?;
model.end_step(); // break gradient chains + increment step counter
}
When is it needed? For any graph with forward references (using("x")
before tag("x")) it is mandatory. For graphs that use tag() for
observation, it prevents tagged output Variables from holding stale
autograd graph references between batches. Even for simple graphs, it is
good practice — it keeps the step counter accurate and costs nothing.
The lower-level detach_state() is available if you need to break gradient
chains without incrementing the step counter.
Parameter Groups
All optimizers support per-group learning rates via a builder API:
let mut opt = Adam::with_groups()
.group(&scan_params, 1e-3) // group 0: high LR
.group(&read_params, 1e-5) // group 1: low LR
.build();
// Adjust one group
opt.set_group_lr(1, 1e-4);
// Adjust all groups at once
opt.set_lr(1e-3);
Adam::new(¶ms, lr) still works for single-group usage. SGD and
AdamW have the same builder pattern (SGD::with_groups(momentum),
AdamW::with_groups(weight_decay)).
Parameter Freezing
Freeze parameters to disable gradient tracking — useful for transfer learning:
for param in &encoder_params {
param.freeze()?; // no gradients will accumulate
}
// Later, unfreeze for fine-tuning:
for param in &encoder_params {
param.unfreeze()?;
}
// Check status:
if param.is_frozen() { /* ... */ }
Frozen parameters are automatically skipped by optimizers (they produce
no gradient). Freezing works through Rc<RefCell> — a freeze is visible
everywhere the parameter is referenced.
Checkpoints
Save and restore model parameters using named checkpoints:
// Save — includes all parameters, buffers, and structural hash
model.save_checkpoint("/tmp/model.fdl")?;
// Load — validates architecture, returns a report
let report = model.load_checkpoint("/tmp/model.fdl")?;
For custom destinations (network, in-memory buffer), the lower-level API is available:
let named = model.named_parameters();
let buffers = model.named_buffers();
let hash = Some(model.structural_hash());
save_checkpoint(&mut writer, &named, &buffers, hash)?;
let report = load_checkpoint(&mut reader, &named, &buffers, hash)?;
Partial loading (transfer learning)
Named checkpoints match by qualified name, so you can load a subset of parameters from a different model using Graph::named_parameters():
// Save with qualified names
model.save_checkpoint("/tmp/model.fdl")?;
// Load into a different model — matches by name
// Use the lower-level API with None hash to skip architecture validation
let new_named = new_model.named_parameters();
let new_buffers = new_model.named_buffers();
let report = load_checkpoint_file("/tmp/model.fdl", &new_named, &new_buffers, None)?;
println!("loaded: {:?}", report.loaded); // matched and loaded
println!("skipped: {:?}", report.skipped); // in checkpoint, not in model
println!("missing: {:?}", report.missing); // in model, not in checkpoint
// Freeze transferred params, train the rest
for (name, param) in &new_named {
if report.loaded.contains(name) {
param.freeze()?;
}
}
Named parameters use the tag name as prefix when available ("encoder/weight"),
otherwise the node ID ("linear_1/weight"). Shape mismatches on matched names
are errors.
LR Scheduling
Schedulers compute learning rates without owning the optimizer — they are pure calculators:
let scheduler = CosineScheduler::new(0.001, 1e-6, 100); // base_lr, min_lr, total_steps
for step in 0..100 {
let lr = scheduler.lr(step);
optimizer.set_lr(lr);
}
Compose with warmup:
let inner = CosineScheduler::new(0.001, 1e-6, 100);
let scheduler = WarmupScheduler::new(inner, 0.001, 10); // inner, target_lr, warmup_steps
Mixed Precision Training
Mixed precision training runs eligible operations (matmul, convolutions,
linear layers) in a reduced-precision dtype (typically Float16 or
BFloat16) while keeping numerically sensitive operations (losses, norms,
softmax) in full Float32. On GPUs with Tensor Cores (RTX 30xx, RTX 40xx,
RTX 50xx), this can deliver up to 3x speedup with minimal accuracy impact.
Autocast
The AutocastGuard RAII guard enables automatic dtype dispatch for the
duration of its lifetime. The autocast() closure helper provides a
convenient scoped interface:
use flodl::*;
// RAII guard style
let _amp = AutocastGuard::new(DType::Float16);
let output = model.forward(&input)?; // matmul dispatches to fp16
let loss = mse_loss(&output, &target)?; // stays fp32
drop(_amp);
// Closure style (preferred)
let loss = autocast(DType::Float16, || {
let output = model.forward(&input)?;
mse_loss(&output, &target)
})?;
// Query whether autocast is active
if is_autocast_enabled() {
// inside an autocast region
}
GradScaler
Half-precision gradients can underflow to zero. GradScaler solves this
by scaling the loss before backward (inflating gradient magnitudes), then
unscaling gradients before the optimizer step. It dynamically adjusts the
scale factor – growing it when gradients stay finite, backing off when
inf/nan is detected.
let mut scaler = GradScaler::new();
// Initial scale: 65536, growth: 2x, backoff: 0.5x, interval: 2000 steps
The step method handles unscaling, inf/nan checking, and the optimizer
step in a single call. It returns true if the step was taken, or false
if it was skipped due to non-finite gradients:
let stepped = scaler.step(¶ms, &mut || optimizer.step())?;
scaler.update(); // adjust scale factor -- call after every step()
Complete Mixed Precision Loop
let mut scaler = GradScaler::new();
model.train();
for (x, y) in &batches {
let input = Variable::new(x.clone(), false);
let target = Variable::new(y.clone(), false);
// Forward under autocast -- eligible ops run in fp16
let loss = autocast(DType::Float16, || {
let pred = model.forward(&input)?;
mse_loss(&pred, &target)
})?;
// Scale loss and backward
let scaled = scaler.scale(&loss)?;
optimizer.zero_grad();
scaled.backward()?;
// Unscale gradients, check for inf/nan, clip, and step
clip_grad_norm(¶ms, 1.0)?;
let stepped = scaler.step(¶ms, &mut || optimizer.step())?;
scaler.update();
}
Manual Dtype Conversion
For cases where you need explicit control over parameter dtypes rather than relying on autocast:
// Cast all parameters to fp16
cast_parameters(¶ms, DType::Float16);
// Cast back to fp32
cast_parameters(¶ms, DType::Float32);
Parameters already at the target dtype are skipped (no-op).
Eval Mode
Switch to eval mode for inference:
model.eval();
no_grad(|| {
let output = model.forward(&input)?;
// No graph built, no gradient tracking overhead.
Ok(output)
})?;
Training Housekeeping
The graph tracks step and epoch counts for schedulers and observation.
end_step() should be called after every training step (it detaches state
and increments the counter — see above). end_epoch() closes out the epoch:
model.end_step(); // detach state + increment step counter (call every batch)
model.end_epoch(); // increment epoch counter, reset step count
Reproducibility
For deterministic training, seed both libtorch and CPU-side RNG at the start:
use flodl::*;
fn main() -> Result<()> {
// Seed libtorch: controls rand, randn, dropout, weight init
manual_seed(42);
// CPU-side RNG: controls data shuffling, augmentation
let mut rng = Rng::seed(42);
// Build model AFTER seeding — weight initialization uses the seed
let model = FlowBuilder::from(Linear::new(2, 16)?)
.through(GELU)
.through(Linear::new(16, 2)?)
.build()?;
// Shuffle training data deterministically
let mut indices: Vec<usize> = (0..dataset_len).collect();
rng.shuffle(&mut indices);
// ...
Ok(())
}
manual_seed covers all libtorch operations (tensor creation, dropout masks,
weight initialization). Rng covers everything else (data loading order,
augmentation parameters). Together they give full reproducibility.
Complete Example
use flodl::*;
fn main() -> Result<()> {
manual_seed(42);
// Build model.
let model = FlowBuilder::from(Linear::new(2, 16)?)
.through(GELU)
.through(LayerNorm::new(16)?)
.also(Linear::new(16, 16)?)
.through(Linear::new(16, 2)?)
.build()?;
// Set up training.
let params = model.parameters();
let mut optimizer = Adam::new(¶ms, 0.01);
model.train();
// Training loop (simplified — no data loader yet).
let input_t = Tensor::randn(&[20, 2], TensorOptions::default())?;
let target_t = Tensor::randn(&[20, 2], TensorOptions::default())?;
for epoch in 0..50 {
let input = Variable::new(input_t.clone(), true);
let target = Variable::new(target_t.clone(), false);
let pred = model.forward(&input)?;
let loss = mse_loss(&pred, &target)?;
optimizer.zero_grad();
loss.backward()?;
clip_grad_norm(¶ms, 1.0)?;
optimizer.step()?;
if epoch % 10 == 0 {
println!("epoch {} loss={:.6}", epoch, loss.item()?);
}
}
// Eval.
model.eval();
let test_input = Tensor::from_f32(&[0.5, 0.3], &[1, 2], Device::CPU)?;
let pred = no_grad(|| {
model.forward(&Variable::new(test_input, false))
})?;
println!("pred: {:?}", pred.data().to_f32_vec()?);
Ok(())
}