Building a graph builder that doesn't feel like a graph builder

How a consuming self pattern and forward references turned architecture description into data flow

There are two ways to define a neural network in PyTorch.

The first is nn.Sequential:

model = nn.Sequential(
    nn.Linear(2, 16),
    nn.GELU(),
    nn.Linear(16, 2),
)

Clean, readable, instant. But try adding a residual connection. Or a loop. Or a branch that routes to different experts based on the input. You can’t. Sequential is a list. Data goes in one end and comes out the other.

The second is a custom nn.Module:

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(2, 16)
        self.decoder = nn.Linear(16, 2)
        self.residual = nn.Linear(16, 16)

    def forward(self, x):
        h = F.gelu(self.encoder(x))
        h = h + self.residual(h)
        return self.decoder(h)

Now you can express anything. But the architecture lives in imperative Python. The framework can’t see it. You can’t automatically visualize the graph, profile individual nodes, or collect metrics at intermediate points without instrumenting forward() by hand. And every new architecture means a new class with new wiring.

There’s a gap between “too simple” and “too manual.” floDl’s FlowBuilder tries to live in that gap.

Data flow as description

The design constraint was: the builder chain should read like a description of how data flows through the model. Not a set of construction commands. You should be able to look at the chain and see the architecture.

let model = FlowBuilder::from(Linear::new(2, 16)?)
    .through(GELU)
    .through(Linear::new(16, 2)?)
    .build()?;

That’s the Sequential equivalent. Data flows from Linear through GELU through Linear. Same concepts, same readability. But the builder doesn’t stop here.

Residual connections: the first thing Sequential can’t do

let model = FlowBuilder::from(Linear::new(2, 16)?)
    .through(GELU)
    .also(Linear::new(16, 16)?)
    .through(Linear::new(16, 2)?)
    .build()?;

also means: run this module on the current tensor, then add the result to the input. One word. That’s a residual connection. In PyTorch, this requires either a custom forward() or a helper class. Here it’s part of the vocabulary.

Read it aloud: “data flows through GELU, also through a Linear (with a residual add), then through the output projection.” The architecture is the description.

Parallel branches

let model = FlowBuilder::from(encoder)
    .split(modules![head_a, head_b, head_c])
    .merge(MergeOp::Mean)
    .through(output)
    .build()?;

split fans the stream to multiple modules. merge combines their outputs. This is multi-head attention, or an ensemble, or any architecture where the same input needs to go through parallel paths.

Named references

Here’s where it gets interesting. Tags let you name points in the flow and reference them later:

let model = FlowBuilder::from(encoder).tag("encoded")
    .through(transform)
    .through(cross_attention).using(&["encoded"])
    .through(output)
    .build()?;

tag("encoded") names the encoder’s output. using(&["encoded"]) makes it available to the cross-attention module as a named input. The data doesn’t need to flow linearly anymore. You can reach back to any tagged point.

This is how you build skip connections, cross-attention, U-Net architectures — any pattern where a downstream module needs to see an earlier activation.

Loops

let model = FlowBuilder::from(initial_state)
    .loop_body(refinement_block).for_n(3)
    .through(output_head)
    .build()?;

loop_body takes a module and repeats it, feeding each iteration’s output as the next iteration’s input. for_n(3) runs it three times. The gradient flows through all iterations — full backpropagation through time.

There’s also while_cond(halt_fn, max) for adaptive computation: keep iterating until a learned halt condition is met, up to a maximum.

Routing

// Soft routing: all experts run, outputs weighted by router
.gate(router, modules![expert_a, expert_b])

// Hard routing: only selected branch executes
.switch(selector, modules![light_path, heavy_path])

gate is mixture-of-experts: the router produces weights, all experts execute, outputs are combined. switch is conditional computation: the selector picks one branch, only that branch runs.

Both support .using() to give the router access to earlier context.

Side branches

.fork(classifier_head).tag("class_logits")

fork runs a module on the current stream but doesn’t change the main flow. The output is captured by the tag. The stream continues as if the fork wasn’t there. This is for observation points, auxiliary classifiers, or anything that reads the stream without modifying it.

The consuming self pattern

Every builder method consumes self and returns a new builder. This isn’t an accident. It means you can’t do this:

let a = FlowBuilder::from(encoder);
let b = a.through(decoder_1);  // a is consumed
let c = a.through(decoder_2);  // compile error: a was moved

The compiler prevents you from branching from intermediate states. The builder chain is a linear narrative: each step follows the previous one. There’s exactly one path through the description, and the type system enforces it.

This is what makes the builder read as data flow rather than as a bag of mutations. You’re writing a description, not modifying state.

Forward references: the trick that enables recurrence

Everything so far has been about referring to the past — modules that need earlier activations. But some architectures need to refer to their own output from the previous forward pass. Recurrent networks. Attention with memory. Anything where state persists across calls.

The builder handles this with a simple rule: if you using() a tag before it exists, it becomes a forward reference.

.through(StateAdd).using(&["memory"]).tag("memory")

On the first forward pass, "memory" doesn’t exist yet — the reference returns a zero tensor. On subsequent passes, it returns the value from the previous call. The graph owns the state buffer. The module doesn’t need to know it’s recurrent.

One line. That’s an RNN-style feedback loop. No hidden state threading, no manual detach() between steps, no state containers. The graph manages it.

Graphs are modules

A Graph built by the builder implements Module — the same trait as Linear or GELU. This means a graph is just another building block:

let ffn = FlowBuilder::from(Linear::new(16, 32)?)
    .through(GELU)
    .through(Linear::new(32, 16)?)
    .build()?;

let model = FlowBuilder::from(encoder)
    .through(ffn)             // graph nested inside graph
    .through(output_head)
    .build()?;

Parameters, buffers, named references — everything composes. Build sub-graphs as functions, nest them arbitrarily. The same API works at every scale.

What you get for free

Because the architecture is declared rather than buried in imperative forward() code, the framework can inspect it. Without any user code:

None of this is possible when the architecture lives in a Python forward() method. It’s only possible because the builder knows the graph structure at build time.

A real architecture

This isn’t a framework that only handles toy examples. FBRL uses the builder for a foveal attention model — a recurrent architecture that learns where to look in an image. Here’s a simplified view of the actual builder chain:

FlowBuilder::from(Identity).tag("image")
    .input(&["case"])
    .through(h0_init)
    .loop_body(scan_step).for_n(n_scan).using(&["image"])
    .loop_body(read_step).for_n(n_read).using(&["image"]).tag("latent")
    .fork(letter_head)
    .fork(case_head)
    .through(decoder).using(&["latent", "case"])
    .build()?

Two recurrent loop phases (scan and read), each accessing the original image through a named reference. Fork branches for classification heads. A decoder that pulls from the latent bottleneck and an auxiliary input. Nine loss components drive attention guidance, reconstruction, classification, and fixation diversity.

This architecture trains at ~40 seconds per epoch on a GTX 1060. Restructuring it — from separated phases to a unified graph for transfer to word-level recognition — meant changing the builder chain. Not rewriting forward(). Not a new class. The restructured version trained faster.

That’s the point. The builder makes architecture changes cheap. You describe the new flow, and the training infrastructure — checkpoints, visualization, observation, profiling — follows automatically.

The gap

Sequential is beautiful for what it handles. Custom modules are necessary for what it can’t. The FlowBuilder tries to push the boundary of what’s declarative: residuals, branches, loops, routing, recurrence, and composition, all as vocabulary in a fluent chain.

The architecture is the description. The description is the architecture. Everything else follows.


flodl is open source: GitHub | crates.io | docs | showcase example

The graph builder tutorial: step-by-step guide