v0.3.0 Retrieval-Augmented Attention is here

Language modeling that scales to millions of tokens.

HydraLM is a hybrid sub-quadratic language model that combines Gated DeltaNet, Sliding-Window Attention, and chunk-sparse Retrieval Attention — giving you constant-memory inference and precise long-range information extraction without the quadratic price tag of full attention.

Million-token context
77 tests passing (65 core + 12 new)
9 formal claims verified
New 0.3.0 capabilities
Architecture

A hybrid stack, scheduled by a single config.

Up to three complementary mixers interleave inside a standard pre-norm residual backbone. Most layers are O(N) DeltaNet; a small fraction run exact softmax attention to restore recall. The full design lives in architecture.md.
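As a sketch of that backbone, here is a minimal pre-norm residual block. The class name `HybridBlock` and its constructor are illustrative, not the package's actual API; the point is only the norm → mixer → residual wiring that all three mixers share:

```python
import torch
import torch.nn as nn

class HybridBlock(nn.Module):
    """Pre-norm residual block around any sequence mixer.

    `mixer` stands in for DeltaNet, Sliding-Window Attention, or
    Retrieval Attention; the block only supplies the shared wiring.
    """
    def __init__(self, d_model: int, mixer: nn.Module):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.mixer = mixer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Pre-norm: normalize, mix, then add back onto the residual stream.
        return x + self.mixer(self.norm(x))
```

Stacking twenty such blocks with different mixers per layer index gives the hybrid schedule described below.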

Layer schedule

Linear by default, exact where it matters.

The default 20-layer HydraLM places one Sliding-Window Attention layer every four positions and one Retrieval layer every three positions. The remaining 10 layers are Gated DeltaNet — the O(N) associative-memory path that handles the bulk of compute.

See the full complexity analysis →
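One way to reproduce the default 20-layer schedule, assuming the stated cadences and that SWA takes precedence where the two coincide (under that assumption the counts work out to 5 SWA, 5 Retrieval, 10 DeltaNet, matching the text). The function `layer_schedule` and its arguments are hypothetical, not HydraLM's API:

```python
def layer_schedule(n_layers=20, swa_every=4, retrieval_every=3):
    """Assign a mixer to each layer position (1-indexed cadence).

    Assumption: SWA wins when both cadences land on the same layer;
    every unclaimed layer falls back to Gated DeltaNet.
    """
    schedule = []
    for i in range(1, n_layers + 1):
        if i % swa_every == 0:
            schedule.append("swa")
        elif i % retrieval_every == 0:
            schedule.append("retrieval")
        else:
            schedule.append("deltanet")
    return schedule
```

With the defaults this yields ten DeltaNet layers, consistent with the paragraph above.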

DeltaNet · Sliding Window · Retrieval
One config, three mixers

Flip one field. Everything else keeps its meaning.

The default HydraConfig() preserves the 0.2.0 behaviour exactly. Setting retrieval_every or mtp_depth to a positive value is the single source of truth — the schedule, state shapes, and forward return contract update accordingly.

See every HydraConfig field →
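A hedged sketch of that contract: apart from `retrieval_every` and `mtp_depth`, which the text names, every field and default below is an assumption, not the real `HydraConfig`:

```python
from dataclasses import dataclass

@dataclass
class HydraConfigSketch:
    """Illustrative subset of a HydraConfig-style config object."""
    n_layers: int = 20          # assumed field name
    swa_every: int = 4          # assumed field name
    retrieval_every: int = 0    # 0 = off; positive enables Retrieval layers
    mtp_depth: int = 0          # 0 = off; positive enables multi-token prediction

    @property
    def uses_retrieval(self) -> bool:
        # A positive value is the single switch: schedule, state shapes,
        # and the forward return contract would all key off it.
        return self.retrieval_every > 0
```

With all defaults, `uses_retrieval` is false, mirroring how `HydraConfig()` preserves the 0.2.0 behaviour.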

Benchmarks

Measured, reproducible, gated.

Nine formal claims back every behavioural guarantee. Every number on this page is reproducible from scripts/reproduce_claims.py against the current main branch.

Linear complexity: step time scales linearly with sequence length (backs C1)
Lossless MQAR recall: dedicated SWA layer (backs C2)
Token streaming with constant RAM: streaming prefill (backs C4)
Multi-fact QA @ 16k: with Retrieval Attention on
| Configuration                   | Overall | 0–25% | 25–50% | 50–75% | 75–100% |
|---------------------------------|---------|-------|--------|--------|---------|
| HydraLM 4:1 baseline            | 0.71    | 0.48  | 0.55   | 0.74   | 0.96    |
| + Retrieval Attention (0.3.0)   | 0.93    | 0.89  | 0.91   | 0.94   | 0.98    |
| + MTP depth 2 (0.3.0)           | 0.94    | 0.90  | 0.92   | 0.94   | 0.99    |

retrieval_qa, 16k tokens, 32 inserted facts, 8 trailing queries — reproduce with scripts/long_context_qa.py.
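The constant-RAM streaming behaviour can be illustrated with a toy chunked prefill loop. The rank update below is deliberately simplified and is not HydraLM's actual delta rule; the point is that the carried state stays a fixed `(d, d)` matrix no matter how long the stream grows:

```python
import torch

def streaming_prefill(tokens: torch.Tensor, chunk_size: int = 512) -> torch.Tensor:
    """Toy constant-memory prefill over a (T, d) token-feature stream.

    Peak memory depends on chunk_size and d, never on total length T:
    only one chunk and one (d, d) state matrix are live at a time.
    """
    d = tokens.shape[1]
    state = torch.zeros(d, d)
    for start in range(0, tokens.shape[0], chunk_size):
        chunk = tokens[start:start + chunk_size]      # (<=chunk_size, d)
        # Illustrative associative-memory update: absorb the chunk as a
        # sum of outer products; the state shape never changes.
        state = state + chunk.T @ chunk / tokens.shape[0]
    return state
```

Doubling the stream length doubles wall-clock time but leaves the resident state untouched, which is the shape of the C4 claim.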

Quickstart

Runs anywhere PyTorch runs.

No CUDA kernels to build, no custom compilers. HydraLM ships as a pure-PyTorch package with a small hand-written chunkwise delta-rule loop and exposes every layer as a drop-in nn.Module.

Package README →
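For intuition about the chunkwise delta-rule loop mentioned above, here is a naive per-token reference of delta-rule semantics. The package computes the same recurrence blockwise for speed; the function name, shapes, and write-then-read ordering here are assumptions for illustration only:

```python
import torch

def delta_rule_naive(q, k, v, beta):
    """Per-token delta rule: S_t = S_{t-1} + beta_t (v_t - S_{t-1} k_t) k_t^T.

    q, k: (T, d) queries/keys; v: (T, dv) values; beta: (T,) write strengths.
    Returns per-token reads o_t = S_t q_t, shape (T, dv).
    """
    T, d = k.shape
    S = torch.zeros(v.shape[1], d)
    outs = []
    for t in range(T):
        err = v[t] - S @ k[t]                      # prediction error at k_t
        S = S + beta[t] * torch.outer(err, k[t])   # rank-1 corrective write
        outs.append(S @ q[t])                      # read with the query
    return torch.stack(outs)
```

The error-driven write is what lets the O(N) path overwrite stale associations instead of only accumulating them.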

install
# Install the package in editable mode
cd research/hydralm
pip install -e .

# Run the full test suite (65 + 12 new)
pytest -q

# Reproduce the claim suite
python scripts/reproduce_claims.py

Train on a laptop. Serve a million tokens.

HydraLM ships test-covered, cleanly documented, and claims-backed — built to be studied, forked, and extended.