Language modeling that scales to millions of tokens.
HydraLM is a hybrid sub-quadratic language model that combines Gated DeltaNet, Sliding-Window Attention, and chunk-sparse Retrieval Attention — giving you constant-memory inference and precise long-range information extraction without the quadratic price tag of full attention.
Three capabilities built for natural long-range extraction.
Each is strictly opt-in — flip one field on HydraConfig and the backbone adopts it without breaking any 0.2.0 semantics.
Chunk-sparse top-k retrieval.
Each chunk mean-pools its keys into a summary; every query chunk scores the summaries of all past chunks, picks the top_k, and runs exact softmax attention over just that subset.
Sub-quadratic cost, exact within each retrieved chunk.
O(N · k · C)
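As a concrete sketch, the retrieval step above can be written in a few lines of PyTorch. Everything here is illustrative: the function name, chunk size, scoring rule, and causal-masking details are assumptions, not the package's actual implementation.

```python
import torch
import torch.nn.functional as F

def chunk_sparse_retrieval(q, k, v, chunk=16, top_k=2):
    """Illustrative chunk-sparse top-k retrieval attention (names assumed).

    q, k, v: (T, d) with T divisible by `chunk`. Each query chunk scores the
    mean-pooled key summary of every previous chunk, keeps the top_k, and
    runs exact softmax attention over those chunks plus itself (causally)."""
    T, d = q.shape
    n = T // chunk
    kc = k.view(n, chunk, d)           # keys grouped by chunk
    vc = v.view(n, chunk, d)
    summaries = kc.mean(dim=1)         # (n, d): one summary per chunk
    out = torch.zeros_like(q)
    for i in range(n):
        qi = q[i * chunk:(i + 1) * chunk]                # (chunk, d)
        if i > 0:
            # score past-chunk summaries with this chunk's mean query
            scores = summaries[:i] @ qi.mean(dim=0)      # (i,)
            sel = scores.topk(min(top_k, i)).indices
            keys = torch.cat([kc[sel].reshape(-1, d), kc[i]])
            vals = torch.cat([vc[sel].reshape(-1, d), vc[i]])
        else:
            keys, vals = kc[0], vc[0]
        att = (qi @ keys.T) / d ** 0.5
        # causal mask applies only within the current chunk (last `chunk` keys)
        mask = torch.triu(torch.ones(chunk, chunk, dtype=torch.bool), 1)
        att[:, -chunk:] = att[:, -chunk:].masked_fill(mask, float("-inf"))
        out[i * chunk:(i + 1) * chunk] = F.softmax(att, dim=-1) @ vals
    return out
```

This is where the O(N · k · C) bound comes from: each of the N/C query chunks attends to at most top_k retrieved chunks of C keys, plus its own chunk.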
Unbounded context, bounded RAM.
A three-tier KV stream wrapper — exact window, learned compressed pool, FIFO tombstone — that gives any softmax attention layer the illusion of an unbounded cache at constant memory cost.
O(W + N_s)
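A minimal sketch of the tiered-cache idea, with an assumed interface and mean-pooling standing in for the learned compressor:

```python
import torch

class TieredKVCache:
    """Sketch of a three-tier KV stream (assumed interface, not the real API).

    Tier 1 keeps the last `window` exact KV pairs. On eviction, blocks of
    `block` pairs are compressed into single summary pairs (mean-pooled here
    as a stand-in for the learned compressor). The summary pool is itself
    FIFO-capped at `pool` entries, so memory stays O(window + pool)."""

    def __init__(self, window=8, block=4, pool=4):
        self.window, self.block, self.pool = window, block, pool
        self.exact_k, self.exact_v = [], []
        self.summ_k, self.summ_v = [], []

    def append(self, k, v):
        self.exact_k.append(k)
        self.exact_v.append(v)
        while len(self.exact_k) > self.window:
            # compress the oldest `block` exact entries into one summary pair
            bk = torch.stack(self.exact_k[:self.block]).mean(0)
            bv = torch.stack(self.exact_v[:self.block]).mean(0)
            del self.exact_k[:self.block], self.exact_v[:self.block]
            self.summ_k.append(bk)
            self.summ_v.append(bv)
            if len(self.summ_k) > self.pool:   # FIFO tombstone: drop oldest
                self.summ_k.pop(0)
                self.summ_v.pop(0)

    def kv(self):
        """Return the full (summaries + exact window) key/value tensors."""
        return (torch.stack(self.summ_k + self.exact_k),
                torch.stack(self.summ_v + self.exact_v))
```

However many tokens stream through `append`, `kv()` never returns more than `pool + window` pairs, which is the constant-memory illusion the wrapper provides.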
Denser training. Free drafting.
A DeepSeek-V3-style auxiliary head that predicts the next k tokens in parallel, sharpens the planning horizon during training, and doubles as a zero-extra-parameter draft model for speculative decoding.
< 1% params
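The shape of such a head can be sketched as follows; the per-depth projection and shared output matrix are assumptions in the spirit of DeepSeek-V3's multi-token prediction, not the package's actual module:

```python
import torch
import torch.nn as nn

class MTPHead(nn.Module):
    """Illustrative multi-token-prediction head (structure assumed).

    One small projection per look-ahead depth predicts token t+1+j from the
    hidden state at position t, through a shared output matrix. At inference
    the same logits can seed a speculative-decoding draft, so no separate
    draft model is needed."""

    def __init__(self, d_model, vocab, depth=2):
        super().__init__()
        self.proj = nn.ModuleList(
            nn.Linear(d_model, d_model, bias=False) for _ in range(depth))
        self.lm_head = nn.Linear(d_model, vocab, bias=False)  # shared head

    def forward(self, h):                      # h: (B, T, d_model)
        # logits[j][:, t] predicts token t+1+j
        return [self.lm_head(p(h)) for p in self.proj]
```

Because the extra parameters are only `depth` square projections on top of a shared output matrix, the head stays a small fraction of total model size.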
A hybrid stack, scheduled by a single config.
Up to three complementary mixers interleave inside a standard pre-norm residual backbone. Most layers are O(N) DeltaNet; a small fraction run exact softmax attention to restore recall. The full design lives in architecture.md.
Linear by default, exact where it matters.
The default 20-layer HydraLM places one Sliding-Window Attention layer every four positions and one Retrieval layer every three positions. The remaining 10 layers are Gated DeltaNet — the O(N) associative-memory path that handles the bulk of compute.
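Under the assumption that sliding-window attention takes precedence when a layer position is divisible by both intervals, the schedule above can be reconstructed in a few lines (a sketch, not the package's scheduler):

```python
def layer_schedule(n_layers=20, swa_every=4, retrieval_every=3):
    """Hypothetical reconstruction of the mixer schedule. Precedence is
    assumed: sliding-window attention wins on clashes, retrieval is next,
    Gated DeltaNet is the default."""
    sched = []
    for i in range(1, n_layers + 1):       # 1-indexed layer positions
        if i % swa_every == 0:
            sched.append("swa")
        elif i % retrieval_every == 0:
            sched.append("retrieval")
        else:
            sched.append("deltanet")
    return sched
```

With that clash rule the 20-layer default works out to 5 sliding-window layers, 5 retrieval layers, and the 10 Gated DeltaNet layers quoted above.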
Flip one field. Keep every other semantic.
The default HydraConfig() preserves the 0.2.0 behaviour exactly. Setting retrieval_every or mtp_depth to a positive value is the single source of truth — the schedule, state shapes, and forward return contract update accordingly.
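A hypothetical before/after using the field names above; the import path and constructor signature are assumptions, so treat this as a config fragment rather than verified usage:

```python
# Assumed import path and constructor keywords -- illustration only.
from hydralm import HydraConfig, HydraLM

cfg_default = HydraConfig()                 # exact 0.2.0 behaviour
cfg_long = HydraConfig(retrieval_every=3,   # enable chunk-sparse retrieval
                       mtp_depth=2)         # enable the multi-token head

model = HydraLM(cfg_long)                   # schedule and state shapes follow
```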
Measured, reproducible, gated.
Nine formal claims back every behavioural guarantee. Every number on this page is reproducible from scripts/reproduce_claims.py against the current main branch.
retrieval_qa, 16k tokens, 32 inserted facts, 8 trailing queries — reproduce with scripts/long_context_qa.py.
Runs anywhere PyTorch runs.
No CUDA kernels to build, no custom compilers. HydraLM ships as a pure-PyTorch package with a small hand-written chunkwise delta-rule loop and exposes every layer as a drop-in nn.Module.
```shell
# Install the package in editable mode
cd research/hydralm
pip install -e .

# Run the full test suite (65 + 12 new)
pytest -q

# Reproduce the claim suite
python scripts/reproduce_claims.py
```
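For intuition about the DeltaNet path, the delta-rule recurrence that the package's chunkwise loop computes can be unrolled one token at a time. This is a didactic sketch, not the chunkwise implementation, and the function name is made up:

```python
import torch

def delta_rule_scan(q, k, v, beta):
    """Step-by-step delta-rule recurrence over a sequence.

    State S is a (d_v, d_k) fast-weight matrix updated by the delta rule
        S_t = S_{t-1} + beta_t * (v_t - S_{t-1} k_t) k_t^T
    i.e. the value stored under key k_t is corrected toward v_t, and each
    output reads the state with the query: o_t = S_t q_t."""
    T, d_k = k.shape
    d_v = v.shape[1]
    S = torch.zeros(d_v, d_k)
    out = torch.empty(T, d_v)
    for t in range(T):
        err = v[t] - S @ k[t]                  # prediction error for this key
        S = S + beta[t] * torch.outer(err, k[t])
        out[t] = S @ q[t]
    return out
```

With orthonormal keys and beta = 1, the state stores each value exactly: querying with k[t] at step t returns v[t], which is the associative-memory behaviour the backbone relies on. The production loop processes the same recurrence in chunks to stay matmul-bound.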
Everything you need. In one place.
Design rationale, formal claims, complexity proofs, training recipes and deployment recipes — every link opens the source Markdown directly.
hydralm package — HydraConfig, HydraLM, generate, speculative_generate, and more.
docs/api.md
Read
Guarantees
Formal claims (C1–C9)
The nine claims that gate every release, each backed by a named script and test — plus the provisional 0.3.0 C10 contract for multi-fact long-context QA.
docs/claims.md
Read
Train on a laptop. Serve a million tokens.
HydraLM ships test-covered, cleanly documented, and claims-backed — built to be studied, forked, and extended.