Working Draft • Oct 2025

Recursive Transformer Modules (RTMs)

A unifying architecture for adaptive depth, parameter sharing, and auditable reasoning that generalizes RINS, HRM, and TRM, built for practical, sovereign deployments.


Abstract

Modern language models spend the same fixed depth on every token, regardless of difficulty. Recursive Transformer Modules (RTMs) invert this: a single shared reasoning block is applied recursively to a persistent latent workspace, and a learned halting policy stops, at the token or sequence level, once sufficient confidence is reached. The result is selective depth, better accuracy-per-FLOP, and a step trace that is easy to audit. RTMs unify the inference-time recursion of RINS and the compact recursive solvers of HRM/TRM into a single practical, deployable architecture.

Motivation: Selective Reasoning Beats Fixed Depth

In math/logic/code workloads, difficulty varies dramatically across inputs. Fixed-depth transformers overspend on easy cases and underspend on hard ones. Compute-on-demand is the arbitrage: share parameters and spend depth only where it pays.

"Recursively applying an early portion of your network to refine its output… improves performance significantly." RINS (2025) source
"HRM… executes through two interdependent recurrent modules (high-level planning, low-level computation)… With only 27M parameters…" HRM (2025) source
"TRM… a single tiny network with only 2 layers… achieves significantly higher generalization than HRM." TRM (2025) source

RTM: The Architecture

Objects: input \(x\); latent workspace \(H_t \in \mathbb{R}^{B\times L\times d}\); halting head \(g_\theta\); output head \(r_\theta\).

Shared Transformer Block \(f_\theta\) (self-attn + MLP + residuals) repeatedly updates a latent workspace:

\[ \begin{aligned} H_{t+1} &= f_\theta(H_t;\, x) \\ p_{\text{halt}}(t) &= g_\theta(H_{t+1}) \\ y_{t+1} &= r_\theta(H_{t+1}) \end{aligned} \]

Stop when \(\displaystyle\sum_{t=1}^{T} p_{\text{halt}}(t) \ge 1\)   (per-token or per-sequence)
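
For concreteness, with illustrative (not prescribed) values: if the controller emits \(p_{\text{halt}} = 0.2,\ 0.35,\ 0.5\) over the first three steps, the cumulative sum reaches \(0.2 + 0.35 + 0.5 = 1.05 \ge 1\) at \(t = 3\), so recursion stops after three applications of \(f_\theta\) and \(y_3 = r_\theta(H_3)\) is emitted. An input the controller is confident about after one step halts almost immediately, while a hard instance keeps accumulating depth up to the cap \(T_{\max}\).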

Key Properties

  • Parameter sharing → fixed memory, unbounded test-time depth.
  • Adaptive depth → FLOPs scale with instance difficulty.
  • Workspace → explicit scratchpad; loggable and auditable.

Training Signals

  • Ponder/halting loss to minimize steps without losing accuracy (see the sketch after this list).
  • Curriculum on the step cap \(T_{\max}\); entropy regularization on halting to avoid collapse.
  • Stability: RMSNorm, RoPE, QK-norm, EMA across steps.
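
A minimal sketch of how these signals could combine into a single objective, assuming per-step task losses and halting probabilities have already been collected during the recursion; the helper name rtm_training_loss and the weights lambda_ponder and lambda_ent are illustrative choices, not part of the proposal.

import torch

def rtm_training_loss(step_losses, p_halt, lambda_ponder=0.01, lambda_ent=0.001):
    """Combine task, ponder, and entropy terms (illustrative weighting).

    step_losses: (T,) tensor of task losses after each recursive step
    p_halt:      (T,) tensor of halting probabilities emitted at each step
    """
    # Task signal: the loss at the final step; weighting every step
    # (deep supervision) is an alternative.
    task = step_losses[-1]

    # Ponder/halting penalty from the recipe, lambda * sum_t (1 - p_halt(t)),
    # which rewards the controller for halting in fewer steps.
    ponder = lambda_ponder * (1.0 - p_halt).sum()

    # Entropy bonus on the per-step halting decisions to avoid collapse onto
    # always-halt or never-halt policies.
    eps = 1e-8
    entropy = -(p_halt * (p_halt + eps).log()
                + (1 - p_halt) * (1 - p_halt + eps).log()).mean()

    return task + ponder - lambda_ent * entropy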

Relationship to HRM, TRM, and RINS

HRM → RTM

RTM recovers HRM’s two-timescale modules with a single shared block plus a scheduler (the halting policy), without hardwiring the hierarchy.

TRM → RTM

TRM shows that hierarchy isn't required. RTM subsumes it by letting the step count \(T\) and the workspace size adapt per input, while keeping the block tiny if desired.

RINS → RTM

RINS supplies the recipe for inference-time recursion; RTM bakes it into the architecture with a halting policy and a persistent workspace → trainable, auditable compute selection.

Relevance to Aleph Alpha

RTM plugs in as an energy-aware reasoning layer atop tokenizer-free HAT stacks: gate depth on demand, keep parameter counts small, and expose audit hooks (stop signals, step traces).

"We have presented the Hierarchical Autoregressive Transformer (HAT), a tokenizer-free approach… a promising avenue towards more robust and adaptable language models." Aleph Alpha (2025) arXiv · blog
"A collaboration… to push the boundaries of explainable and interpretable Generative AI… prioritizing safety and transparency." Lab 1141 (2024) site

Training & Inference Recipe

  1. Backbone. Compact transformer block (\(d=512\)–1024, 4–8 heads) with parameter tying across steps.
  2. Workspace. Initialize \(H_0\) from encodings; optionally seed \(y_0\) (schema for math/code).
  3. Controller. MLP on pooled \(H_t\) (sequence-level) or a per-token head; halting penalty \(\lambda\sum_t(1-p_{\text{halt}}(t))\) (see the sketch after this list).
  4. Curriculum. Train with capped steps; increase \(T_{\max}\); maintain minimum-compute floors.
  5. Deployment knobs. Cap \(T\); bucket by predicted steps; log \(\|H_{t+1}-H_t\|\) and \(p_{\text{halt}}\).
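
A sketch of the controller from step 3: a small MLP over the mean-pooled workspace emitting one sequence-level halting probability. The class name HaltingController and the hidden width are assumptions; a per-token variant would skip the pooling and emit one probability per position.

import torch
from torch import nn

class HaltingController(nn.Module):
    """Sequence-level halting head g_theta: an MLP on the mean-pooled workspace H_t."""

    def __init__(self, d_model: int, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_model, hidden),
            nn.GELU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, H: torch.Tensor) -> torch.Tensor:
        # H: (B, L, d) latent workspace; pool over the sequence dimension.
        pooled = H.mean(dim=1)                                # (B, d)
        return torch.sigmoid(self.mlp(pooled)).squeeze(-1)    # (B,) halting probability in (0, 1)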

Mechanics (pseudo-API)

// one step of the shared recurrence
H = f_theta(H, x);      // update the latent workspace
p_halt = g_theta(H);    // halting probability for this step
y = r_theta(H);         // current answer readout

// repeat the step above; stop when cumulative halting ≥ 1
// (per token or per sequence), or when the step cap T_max is reached
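
The same recurrence written out as a runnable sketch in PyTorch, not a reference implementation: \(f_\theta\) is stood in for by a single shared nn.TransformerEncoderLayer, the halting head is reduced to one linear layer, the input x is assumed to already be an encoded workspace of shape (B, L, d) so conditioning on x happens only through \(H_0 = x\), and the loop applies the step cap and logs \(p_{\text{halt}}\) and \(\|H_{t+1}-H_t\|\) as called for by the deployment knobs.

import torch
from torch import nn

class RTM(nn.Module):
    """Minimal recursive transformer module: one shared block applied until halting."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, t_max: int = 16):
        super().__init__()
        # f_theta: a single shared block (self-attn + MLP + residuals), reused at every step.
        self.f = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model,
            batch_first=True, norm_first=True,
        )
        # g_theta: sequence-level halting head on the mean-pooled workspace.
        self.g = nn.Linear(d_model, 1)
        # r_theta: readout head (a plain linear map here; task-specific in practice).
        self.r = nn.Linear(d_model, d_model)
        self.t_max = t_max  # deployment knob: hard cap on recursive depth

    def forward(self, x: torch.Tensor):
        H = x                                           # H_0 initialized from the input encodings
        cum_halt = torch.zeros(x.size(0), device=x.device)
        trace = []                                      # auditable step trace
        for t in range(self.t_max):
            H_next = self.f(H)                          # H_{t+1} = f_theta(H_t; x), x folded into H_0 here
            p_halt = torch.sigmoid(self.g(H_next.mean(dim=1))).squeeze(-1)  # (B,)
            cum_halt = cum_halt + p_halt
            trace.append({
                "step": t + 1,
                "p_halt": p_halt.detach(),
                "delta_H": (H_next - H).norm(dim=(1, 2)).detach(),
            })
            H = H_next
            if bool((cum_halt >= 1.0).all()):           # stop once cumulative halting mass reaches 1
                break
        return self.r(H), trace                         # y_T = r_theta(H_T), plus the step trace

# Usage (illustrative): a batch of 2 encoded sequences, length 32, width 512.
# y, trace = RTM()(torch.randn(2, 32, 512)); print(len(trace))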

Evaluation Plan (Compute-Aligned)

  • Datasets: GSM8K, ARC-Challenge/ARC-AGI-1/2, Sudoku-Extreme, MBPP/HumanEval.
  • Fairness: Match baselines on parameters and FLOPs; report accuracy-per-FLOP & joules/answer.
  • Ablations: token vs sequence halting; workspace size; step cap \(T\); full vs partial sharing.
  • Telemetry: depth histograms; halt-confidence trends; no-regret curves (accuracy @ fixed compute); see the bookkeeping sketch after this list.
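
A small bookkeeping sketch for the compute-matched reporting above. It assumes each evaluation record carries (correct, steps); because the block's parameters are shared, FLOPs per answer scale as steps times a fixed per-step cost, so a single profiled constant flops_per_step (an assumption, measured per model) is enough to derive accuracy-per-FLOP and the depth histogram.

from collections import Counter

def compute_aligned_report(records, flops_per_step: float):
    """records: iterable of (correct: bool, steps: int) pairs from evaluation runs."""
    records = list(records)
    n = len(records)
    accuracy = sum(c for c, _ in records) / n
    mean_flops = sum(s for _, s in records) * flops_per_step / n   # average FLOPs per answer
    depth_histogram = Counter(s for _, s in records)               # depth-histogram telemetry
    return {
        "accuracy": accuracy,
        "accuracy_per_flop": accuracy / mean_flops,
        "depth_histogram": dict(sorted(depth_histogram.items())),
    }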

Expected Contributions

  • A deployable recursive transformer with adaptive depth and stable training.
  • Evidence of superior accuracy-per-FLOP versus fixed-depth transformers, and parity or better versus HRM/TRM at lower inference cost.
  • Built-in interpretability: halting signals and workspace deltas as auditable step traces.

Risks & Mitigations

  • Over-thinking / latency tails: cap \(T\), early-exit priors.
  • Training instability: curriculum on \(T_{\max}\), halting-loss annealing, EMA across steps.
  • Controller collapse: entropy regularization; minimum-compute floors.

Why This Wins

RTMs buy accuracy with targeted inference instead of universal scaling. That means lower capex (fewer parameters) and lower opex (compute-on-demand) without quality loss, which is ideal for on-prem, sovereign settings and price-sensitive workloads.

References

  1. Alabdulmohsin & Zhai. Recursive INference Scaling (RINS), 2025. arXiv:2502.07503.
    “Recursively applying an early portion of your network to refine its output… improves performance significantly.”
  2. Wang et al. Hierarchical Reasoning Model (HRM), 2025. arXiv:2506.21734.
    “Two interdependent recurrent modules… high-level planning and low-level computation… With only 27M parameters…”
  3. Jolicoeur-Martineau. Less is More: Recursive Reasoning with Tiny Networks (TRM), 2025. arXiv:2510.04871.
    “A single tiny network with only 2 layers… achieves significantly higher generalization than HRM.”
  4. Aleph Alpha Research. T-Free / HAT (Tokenizer-free hierarchical transformers), 2025. arXiv:2501.10322 · Blog.
    “HAT… a tokenizer-free approach… a promising avenue towards more robust and adaptable language models.”
  5. Aleph Alpha × TU Darmstadt. Lab 1141 (Explainable, auditable GenAI), 2024. Site.
    “A collaboration… to push the boundaries of explainable and interpretable Generative AI… prioritizing safety and transparency.”