Troubleshooting
Real error messages and how to fix them.
Build Errors
error: could not find native static library 'torch_cpu'
libtorch is not found. The build system needs LIBTORCH_PATH set to the
libtorch directory.
Fix (Docker — recommended): All builds should run in the Docker container.
fdl build # CPU
fdl cuda-build # CUDA
Fix (host): Use the download script to install libtorch and set up paths:
curl -sL https://raw.githubusercontent.com/flodl-labs/flodl/main/download-libtorch.sh | sh
This downloads libtorch to ~/.local/lib/libtorch and prints the exports to
add to your shell profile. Or set the variables manually:
export LIBTORCH_PATH=/path/to/libtorch
export LD_LIBRARY_PATH=$LIBTORCH_PATH/lib
export LIBRARY_PATH=$LIBTORCH_PATH/lib
error[E0433]: could not find 'flodl' in the list of imported crates
Cargo can’t resolve the flodl dependency.
Fix: Check your Cargo.toml. If using a git dependency:
[dependencies]
flodl = { git = "https://github.com/flodl-labs/flodl.git" }
Make sure you have network access inside the container. If behind a proxy, configure git:
git config --global http.proxy http://proxy:port
undefined reference to 'cudaGetDeviceCount'
CUDA libraries aren’t linked. This happens on CUDA builds when the linker
drops CUDA libs with --as-needed.
Fix: Make sure your main.rs calls the CUDA link anchor:
fn main() -> flodl::Result<()> {
flodl_sys::flodl_force_cuda_link();
// ... rest of your code
}
And build with the cuda feature:
cargo build --features cuda
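If the same crate also builds without CUDA, the anchor call can be gated behind the feature so CPU-only builds still compile. A minimal sketch; the only assumption is that your feature is named cuda, as in the command above:
fn main() -> flodl::Result<()> {
    // Only pull in the CUDA link anchor when the cuda feature is enabled.
    #[cfg(feature = "cuda")]
    flodl_sys::flodl_force_cuda_link();
    // ... rest of your code
    Ok(())
}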
error: linker 'cc' not found
Missing C/C++ toolchain. The Docker images include this, but if building on host:
Fix:
# Ubuntu/Debian
sudo apt-get install gcc g++ pkg-config
# macOS
xcode-select --install
Docker Errors
permission denied while trying to connect to the Docker daemon
Fix: Add your user to the docker group:
sudo usermod -aG docker $USER
# Log out and back in, then retry
could not select device driver "nvidia" / nvidia-container-cli: initialization error
NVIDIA Container Toolkit is not installed or the driver is missing.
Fix:
# Install NVIDIA Container Toolkit
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Verify:
docker run --rm --gpus all nvidia/cuda:12.8.0-base-ubuntu24.04 nvidia-smi
service "cuda" depends on ... which is not healthy
The GPU is not available to Docker. If you only need CPU development, this is not a blocker.
Fix: Use the CPU targets instead:
fdl build # not cuda-build
fdl test # not cuda-test
Runtime Errors
TensorError: shape mismatch in matmul: [2, 3] x [4, 5]
Matrix dimensions don’t align. Inner dimensions must match for matmul.
Fix: Check your layer dimensions. For Linear::new(in, out), the input
tensor’s last dimension must equal in:
// Input is [batch, 10] — so first linear must accept 10
let model = FlowBuilder::from(Linear::new(10, 32)?) // [batch, 10] -> [batch, 32]
.through(Linear::new(32, 1)?) // [batch, 32] -> [batch, 1]
.build()?;
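A quick way to catch this early is to push a dummy batch through the freshly built model. A minimal sketch, reusing the Tensor and Variable calls shown elsewhere on this page; whether forward() takes a Variable in your version is an assumption:
// Dummy batch whose last dimension matches the first layer's `in` (10 here).
let dummy = Tensor::zeros(&[1, 10], opts)?;
let _out = model.forward(&Variable::new(dummy, false))?;
println!("forward pass ok for input shape [1, 10]");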
TensorError: shape mismatch: cannot add [4, 8] and [4, 16]
Shapes don’t broadcast. Common with also() (residual connections) — input
and output must have the same shape.
Fix: Make sure the residual branch preserves dimensions:
// Wrong: also() adds input (dim 8) to Linear output (dim 16)
FlowBuilder::from(Linear::new(4, 8)?)
.also(Linear::new(8, 16)?) // 8 != 16, shape mismatch
// Right: preserve dimension through the residual
FlowBuilder::from(Linear::new(4, 8)?)
.also(Linear::new(8, 8)?) // 8 == 8, ok
cannot borrow 'optimizer' as mutable, as it is not declared as mutable
Rust borrow checker: you need mut for things that change state.
Fix:
let mut optimizer = Adam::new(&params, 0.001); // add mut
thread 'main' panicked at 'called Result::unwrap() on an Err value'
An operation failed and you used .unwrap() instead of ?.
Fix: Use ? to propagate errors, and make sure your function returns Result<T>:
fn main() -> flodl::Result<()> {
let t = Tensor::zeros(&[2, 3], opts)?; // ? not unwrap()
Ok(())
}
Checkpoint Errors
checkpoint error: parameter count mismatch (expected 6, got 4)
The model architecture changed since the checkpoint was saved.
Fix: Ensure the model definition is identical when loading:
// Save and load must use the same architecture
let model = FlowBuilder::from(Linear::new(1, 32)?)
.through(GELU)
.through(Linear::new(32, 1)?)
.build()?;
model.save_checkpoint("model.fdl")?;
// Later — rebuild the SAME architecture before loading
let model = FlowBuilder::from(Linear::new(1, 32)?)
.through(GELU)
.through(Linear::new(32, 1)?)
.build()?;
let report = model.load_checkpoint("model.fdl")?;
checkpoint error: invalid magic bytes
The file is not a valid .fdl checkpoint (wrong format, corrupted, or a
PyTorch .pt file).
Fix: floDl uses its own .fdl format; PyTorch checkpoints cannot be loaded
directly. Either retrain in floDl, or export the weights from PyTorch as numpy
arrays and reconstruct the tensors on the Rust side.
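A rough sketch of that export path: dump each PyTorch tensor to a raw little-endian f32 file, then read it back on the Rust side. Reading the bytes below is plain std; how you turn the Vec<f32> into a floDl Tensor depends on the constructors your version exposes, so that part stays a comment:
use std::fs;

// Read raw f32 values exported from PyTorch, e.g. with
//   tensor.detach().cpu().numpy().astype("float32").tofile("layer0.weight.bin")
fn load_raw_f32(path: &str) -> std::io::Result<Vec<f32>> {
    let bytes = fs::read(path)?;
    Ok(bytes
        .chunks_exact(4)
        .map(|b| f32::from_le_bytes([b[0], b[1], b[2], b[3]]))
        .collect())
}

// Rebuilding the Tensor from the Vec<f32> (and copying it into your model's
// parameters) depends on your floDl version's tensor constructors --
// check the Tensor API rather than assuming a particular function name.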
Visualization
dot: command not found / SVG rendering fails
Graphviz is not installed.
Fix (Docker): The floDl Docker images include graphviz. Use fdl shell.
Fix (host):
# Ubuntu/Debian
sudo apt-get install graphviz
# macOS
brew install graphviz
model.svg() returns empty or minimal output
The graph has no nodes. Make sure you’ve built the graph with FlowBuilder:
let model = FlowBuilder::from(Linear::new(4, 8)?)
.through(GELU)
.build()?;
// Now model.dot() and model.svg() produce output
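If you want to inspect the render, you can write it to disk with std. This assumes svg() returns the markup as a String wrapped in a Result; adjust to whatever your version actually returns:
let svg = model.svg()?;              // assumption: returns the SVG markup
std::fs::write("model.svg", &svg)?;  // open in a browser to inspect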
Training Issues
Loss is NaN
Common causes:
- Learning rate too high — gradients explode
- Log of zero or negative — log(0) = -inf poisons everything
- Division by zero in normalization
Fix:
// Lower the learning rate
let mut optimizer = Adam::new(&params, 1e-4); // try 10x smaller
// Add gradient clipping
clip_grad_norm(&params, 1.0)?;
// Check for NaN in loss
let loss_val = loss.item()?;
if loss_val.is_nan() {
eprintln!("NaN detected at epoch {}", epoch);
break;
}
Loss not decreasing
- Learning rate too low — try 10x higher
- Model too small — add capacity
- Data issue — verify inputs and targets are correct
Debugging checklist:
// Print loss every epoch
println!("epoch {}: loss={:.6}", epoch, loss.item()?);
// Verify gradients exist
for p in &params {
if let Some(g) = p.var().grad() {
let norm = g.norm()?.item()?;
println!("grad norm: {:.6}", norm);
}
}
// Try a known-working config first (sine wave example)
// cargo run --example sine_wave
Loss oscillates wildly
Fix: Reduce learning rate and/or add gradient clipping:
let mut optimizer = Adam::new(&params, 1e-4);
// ...
clip_grad_norm(&params, 0.5)?;
Out of VRAM
The CUDA allocator ran out of GPU memory. Common when batch sizes are too large or inference runs without disabling gradients.
Diagnose first — check how much VRAM you actually have and how it’s used:
if let Ok((used, total)) = cuda_memory_info() {
let active = cuda_active_bytes().unwrap_or(0);
let reserved = cuda_allocated_bytes().unwrap_or(0);
println!("VRAM: {:.0}/{:.0} MB (active: {:.0} MB, reserved: {:.0} MB)",
used as f64 / 1e6, total as f64 / 1e6,
active as f64 / 1e6, reserved as f64 / 1e6);
}
If reserved is much larger than active, the allocator is holding freed
blocks. Call cuda_empty_cache() to release them before checking again.
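A small before/after check using the same helpers, assuming cuda_empty_cache() takes no arguments and does not return a Result (adjust if yours differs):
let before = cuda_allocated_bytes().unwrap_or(0);
cuda_empty_cache();                                  // release cached-but-free blocks
let after = cuda_allocated_bytes().unwrap_or(0);
println!("reserved: {:.0} MB -> {:.0} MB", before as f64 / 1e6, after as f64 / 1e6);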
Fix:
- Reduce batch size — the single biggest lever for VRAM usage
- Use no_grad for inference — backward graphs consume significant memory (see the combined sketch after this list): let pred = no_grad(|| model.forward(&input))?;
- Use mixed precision — Float16/BFloat16 halves parameter memory: cast_parameters(&params, DType::Float16); let scaler = GradScaler::new();
- Use smaller model dimensions — reduce hidden sizes, fewer layers
- Detach state between steps — for recurrent models, call model.detach_state() to break gradient chains across time steps
- Check for tensor leaks — if VRAM grows linearly across epochs, tensors are being retained. Use live_tensor_count() to track: println!("live tensors: {}", live_tensor_count());
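The combined sketch referenced above: run evaluation under no_grad so no backward graph is retained, and watch the live tensor count to confirm nothing leaks across epochs. eval_batches is a placeholder for your own evaluation data:
for input in &eval_batches {                       // placeholder: your evaluation inputs
    let pred = no_grad(|| model.forward(input))?;  // no backward graph is recorded
    // ... compute metrics from pred ...
}
println!("live tensors after eval: {}", live_tensor_count());  // should stay flat across epochs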
Common Rust Compiler Messages
value used here after move
You used a variable after it was moved into something else.
Fix: Clone before the move, or restructure to use references:
let t = Tensor::zeros(&[2, 3], opts)?;
let input = Variable::new(t.clone(), true); // clone t before moving into Variable
// t is still usable
expected &Tensor, found Tensor
A function expects a reference but you passed an owned value.
Fix: Add &:
let result = a.add(&b)?; // &b, not b
the trait 'Module' is not implemented for ...
You’re passing something that doesn’t implement Module to a graph builder method.
Fix: Check the type. All built-in layers implement Module. Custom types
need impl Module for YourType { ... }.
Multi-GPU / DDP
For DDP-specific troubleshooting (NCCL init failure, parameter mismatch, CUDA context corruption, NCCL deadlock, OOM on smaller GPU, CPU averaging timeout), see the dedicated DDP Reference – Troubleshooting section.
Common quick fixes:
- NCCL init fails: Check nvidia-smi topo -m. Try AverageBackend::Cpu.
- CUBLAS_STATUS_EXECUTION_FAILED after NCCL: Use NcclComms::new() + split() on the main thread, not init_rank() from worker threads.
- Training hangs: A worker died mid-collective. DdpHandle auto-aborts via NcclAbortHandle.
- OOM on one GPU: Use the Cadence policy (El Che assigns fewer batches to the smaller GPU).
Getting Help
- Examples: cargo run --example quickstart and cargo run --example sine_wave
- Tutorials: Start with Rust for PyTorch Users
- Multi-GPU: DDP Reference and Multi-GPU Tutorial
- Migration guide: PyTorch -> floDl
- Issues: github.com/flodl-labs/flodl/issues