HydraLM

HydraLM interleaves Gated DeltaNet and Sliding-Window Attention in a single causal stack — giving you million-token streaming at constant memory, with recall within 0.9× of a full-attention baseline.
`pytest -m slow` runs the full claims gate on CPU in minutes. The numbers below come straight from `hydralm.eval.claims` and are embedded in CI.
A HydraLM layer is either a DeltaNet cell or a windowed attention head — never both, never neither. The block schedule is set at configuration time and frozen for inference.
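With the `swa_every` knob from the quick-start config, the schedule might look like the sketch below. The placement rule shown ("every Nth layer is attention") is a hypothetical illustration; the actual rule is whatever `HydraConfig` encodes.

```python
def block_schedule(n_layers: int, swa_every: int) -> list:
    """Hypothetical layer schedule: every swa_every-th layer is a
    sliding-window attention block, the rest are DeltaNet cells."""
    return ["swa" if (i + 1) % swa_every == 0 else "delta" for i in range(n_layers)]

# e.g. 8 layers with swa_every=4: three DeltaNet cells, then one SWA head, repeated
print(block_schedule(8, 4))
```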
A bounded outer-product state S ∈ ℝ^{H×d×d} (one d×d matrix per head), updated with the gated delta rule. Recalls any key that has ever been written, at O(1) cost per token.
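The write path fits in a few lines. This is a generic single-head gated delta-rule update, not HydraLM's fused kernel: the names `gated_delta_step` and `recall`, the gate placement, and the shapes are all illustrative.

```python
import torch

def gated_delta_step(S, k, v, beta, alpha):
    """One gated delta-rule write for a single head.

    S: (d, d) outer-product state; k, v: (d,) key (unit norm) and value;
    beta: write strength in [0, 1]; alpha: decay gate in [0, 1].
    Illustrative sketch, not HydraLM's exact formulation.
    """
    pred = S @ k                                      # memory's current guess for key k
    S = alpha * S + beta * torch.outer(v - pred, k)   # correct the guess toward v
    return S

def recall(S, k):
    return S @ k   # O(d^2) per token, constant in sequence length

d = 4
k = torch.nn.functional.normalize(torch.randn(d), dim=0)
v = torch.randn(d)
S = gated_delta_step(torch.zeros(d, d), k, v, beta=torch.tensor(1.0), alpha=torch.tensor(1.0))
print(torch.allclose(recall(S, k), v, atol=1e-5))  # True: one full-strength write is exactly recalled
```

With `beta=1` and `alpha=1` a write is stored losslessly for that key; intermediate gate values trade retention of old writes against fidelity of new ones.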
FlashAttention-compatible causal window. Handles short-range syntactic work that a pure recurrence struggles with, for a fixed cache of W·d_model.
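The fixed W·d_model cache bound follows from evicting everything older than the window. A minimal sketch; the class `WindowKVCache` is hypothetical and not part of the package:

```python
import torch
from collections import deque

class WindowKVCache:
    """Keeps only the last W key/value pairs, so memory is O(W * d_model)
    regardless of how many tokens have streamed past. Illustrative only."""

    def __init__(self, window: int):
        self.k = deque(maxlen=window)   # deque drops the oldest entry on overflow
        self.v = deque(maxlen=window)

    def append(self, k_t, v_t):
        self.k.append(k_t)
        self.v.append(v_t)

    def tensors(self):
        return torch.stack(list(self.k)), torch.stack(list(self.v))

cache = WindowKVCache(window=3)
for t in range(10):
    cache.append(torch.full((4,), float(t)), torch.full((4,), float(t)))
K, V = cache.tensors()
print(K.shape, K[:, 0])  # only the last 3 key vectors survive
```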
Output projections scaled by 1/√(2L); norms dispatch to the fused CUDA kernel when available. Deep models stay numerically stable from step zero.
Each subsystem ships with its own public API, its own tests, and its own ablation. Compose them freely.
FactBank: a persistent, cross-session outer-product memory that hot-swaps into the same recurrence the model was trained on. Write once, recall anywhere — no fine-tuning required.
Speculative decoding: a tiny draft model proposes k tokens; the full model verifies them in a single forward pass. Exact-distribution guarantees, 2–3× faster decode in practice.
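The exact-distribution guarantee comes from the standard accept/reject rule of speculative sampling: accept drafted token t with probability min(1, p(t)/q(t)). A sketch of the acceptance loop, assuming per-position target and draft probabilities `p` and `q`; the function name is hypothetical, not HydraLM's API:

```python
import torch

def speculative_accept(p, q, draft_tokens, generator=None):
    """Accept a prefix of the drafted tokens so the output is distributed
    exactly as the target model p (standard speculative sampling rule).

    p, q: (k, vocab) target / draft probabilities at each drafted position.
    Returns the number of accepted tokens; the verifier then resamples
    position i from the residual distribution (omitted here).
    """
    for i, tok in enumerate(draft_tokens):
        ratio = p[i, tok] / q[i, tok]
        if torch.rand((), generator=generator) >= ratio:  # reject at position i
            return i
    return len(draft_tokens)

k, vocab = 3, 8
q = torch.full((k, vocab), 1.0 / vocab)
draft = torch.tensor([1, 5, 2])
print(speculative_accept(q.clone(), q, draft))  # 3: identical p and q always accept
```

When the draft matches the target the whole proposal is kept, which is where the 2–3× decode speedup comes from; mismatches cost one verification pass but never bias the output distribution.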
A hybrid optimiser: Muon (Newton-Schulz orthogonalised momentum) on hidden matrices, AdamW on embeddings and norms. Converges ~1.5× faster on small-model ablations.
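Newton-Schulz orthogonalised momentum can be sketched in a few lines: rescale the momentum matrix so its singular values are at most 1, then iterate a polynomial map that pushes every singular value toward 1. The coefficients and step count below are the textbook ones, not Muon's tuned values, and the function name is illustrative.

```python
import torch

def newton_schulz_orthogonalise(G, steps=10):
    """Approximate the orthogonal polar factor of G via Newton-Schulz
    iteration, the core of Muon-style updates on hidden weight matrices.
    Sketch only: real implementations use tuned coefficients and fewer steps."""
    X = G / (G.norm() + 1e-7)                # scale so all singular values are <= 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X      # each step pushes singular values toward 1
    return X

G = torch.tensor([[2.0, 0.0], [0.0, 1.0]])   # ill-scaled "gradient"
O = newton_schulz_orthogonalise(G)
print(O @ O.T)  # close to the identity: the update direction is orthogonalised
```

Orthogonalising the update equalises the step size across directions of the weight matrix, which is the intuition behind the faster convergence on hidden matrices; embeddings and norms keep AdamW because they are not naturally matrix-shaped updates.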
All numbers come from the shipping evaluation suite at equal training budget, over seeds 0–3. Error bars are ±1.2 pp; see docs/benchmarks.md.
From pip install to a streaming token in eight lines. No custom kernels required — everything runs on stock PyTorch; Triton kernels are loaded opportunistically when available.
```bash
# clone, install with extras, run the full test suite
$ git clone https://github.com/byte271/hydralm.git
$ cd hydralm
$ pip install -e ".[dev]"
$ pytest -q
# ================ 65 passed in 18.2s ================
```
```python
from hydralm import HydraLM, HydraConfig

cfg = HydraConfig(d_model=768, n_layers=12, swa_every=4)
model = HydraLM(cfg).eval()
tokens = model.generate(
    prompt_ids,
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.9,
)
```
```python
import torch

from hydralm.streaming import StreamingEngine

engine = StreamingEngine(model, dtype=torch.float16)

# feed a 1-million-token transcript — O(1) state per step
for chunk in transcript_chunks:
    engine.feed(chunk)

out = engine.extend_and_generate(prompt=query, max_new=128)
print(out.tokens, out.peak_state_bytes, out.tokens_processed)
```
```bash
# reproduce every paper claim on CPU in ~3 minutes
$ python -m hydralm.eval.claims --all --seed 0
# [PASS] claim_1_length_generalisation · 32768 tok ok
# [PASS] claim_2_lossless_mqar         · ratio=0.855
# [PASS] claim_3_constant_state        · 1M tok stable
# [PASS] claim_4_spec_decoding_exact   · tvd<1e-4
```
Every module has a dedicated markdown in docs/, every claim is gated by a test, and every benchmark ships with its seed. Build on it, break it, cite it.