v0.2.0 — Research preview

Hybrid recurrence, at transformer quality.

HydraLM interleaves Gated DeltaNet and Sliding-Window Attention in a single causal stack, giving you million-token streaming at constant memory, with recall at 0.85–0.95× of a full-attention baseline.

PyTorch 2.3+ · MIT licensed · 66 tests, 0 warnings · Built on the Muon optimiser
Measured, not marketed

Every claim gated by a reproducible test.

pytest -m slow runs the full claims gate on CPU in minutes. Numbers below come straight from hydralm.eval.claims and are embedded in CI.

80.5%
MQAR recall
at 6.25% of a Transformer's state
22×
Decode speedup
at 8k context vs. dense attention
1M
Token stream
validated end-to-end on one GPU
2–3×
Speculative throughput
with the draft/verify decoder
Architecture

One block schedule.
Two complementary memories.

A HydraLM layer is either a DeltaNet cell or a windowed-attention block — never both, never neither. The block schedule is fixed at configuration time and frozen for inference.

DeltaNet — associative memory

A bounded outer-product state S ∈ ℝ^{H×d×d}, one d×d memory per head, updated with the gated delta rule. Recalls any key that has ever been written, at O(1) cost per token.
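The update can be sketched in a few lines of NumPy. This is an illustrative single-head toy, not the shipping kernel; `alpha` and `beta` here stand in for the learned forget and write gates:

```python
import numpy as np

def gated_delta_step(S, k, v, alpha, beta):
    """One gated-delta-rule update of the d×d associative state S.

    S     : (d, d) outer-product memory
    k, v  : (d,) key and value for this token (k unit-norm)
    alpha : scalar forget gate in (0, 1]
    beta  : scalar write strength in (0, 1]
    """
    pred = S @ k                                   # current recall for key k
    return alpha * S + beta * np.outer(v - pred, k)  # correct the memory toward v

def recall(S, k):
    return S @ k

d = 8
rng = np.random.default_rng(0)
S = np.zeros((d, d))
k = rng.normal(size=d); k /= np.linalg.norm(k)
v = rng.normal(size=d)

# write (k, v) with full write strength, then read it back
S = gated_delta_step(S, k, v, alpha=1.0, beta=1.0)
print(np.allclose(recall(S, k), v))   # exact recall for a unit-norm key
```

With alpha < 1 the state decays geometrically, which is what keeps it bounded over arbitrarily long streams.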

Sliding-window attention — local precision

FlashAttention-compatible causal window. Handles short-range syntactic work that a pure recurrence struggles with, for a fixed cache of W·d_model.
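For reference, the attention pattern is a causal mask of width W. A minimal NumPy sketch (the shipping path uses the fused FlashAttention-style kernel rather than an explicit mask):

```python
import numpy as np

def sliding_window_mask(T, W):
    """True where query t may attend key s: causal, within the last W positions."""
    t = np.arange(T)[:, None]
    s = np.arange(T)[None, :]
    return (s <= t) & (s > t - W)

mask = sliding_window_mask(6, 3)
# each row has at most W True entries, so the KV cache stays at W·d_model per layer
```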

Depth-scaled init & RMS norm

Output projections scaled by 1/√(2L); norms dispatch to the fused CUDA kernel when available. Deep models stay numerically stable from step zero.
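The effect of the 1/√(2L) scaling shows up even in a toy residual stack. A NumPy sketch, assuming i.i.d. Gaussian branch weights (exact growth numbers vary by seed):

```python
import numpy as np

L, d = 12, 64
rng = np.random.default_rng(0)
x = rng.normal(size=d)

for scale in (1.0, 1.0 / np.sqrt(2 * L)):
    h = x.copy()
    for _ in range(2 * L):                        # two residual branches per layer
        W = rng.normal(size=(d, d)) / np.sqrt(d)  # branch with roughly unit gain
        h = h + scale * (W @ h)
    growth = np.linalg.norm(h) / np.linalg.norm(x)
    print(f"scale={scale:.3f}  norm growth={growth:.1f}")
```

Unscaled branches compound the residual norm exponentially in depth; the 1/√(2L) factor keeps the growth O(1), which is why deep stacks stay stable from step zero.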

Layer schedule · depth 12
L0 · DeltaNet · associative recall
L1 · SWA · window = 512
L2 · DeltaNet · gated delta rule
L3 · DeltaNet · constant state
L4 · SWA · local precision
L5 · DeltaNet · head dim = 128
⋯ pattern repeats to depth L
Capabilities

A research stack, not a demo.

Each subsystem ships with its own public API, its own tests, and its own ablation. Compose them freely.

FactBank

A persistent, cross-session outer-product memory that hot-swaps into the same recurrence the model was trained on. Write once, recall anywhere — no fine-tuning required.
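The idea can be illustrated with a tiny persistent outer-product store. Names and API here are illustrative only — this is not the hydralm.memory interface:

```python
import numpy as np

class FactStore:
    """Toy cross-session associative memory using delta-rule writes."""

    def __init__(self, d):
        self.S = np.zeros((d, d))

    def write(self, key, value):
        k = key / np.linalg.norm(key)
        self.S += np.outer(value - self.S @ k, k)   # overwrite-style delta write

    def recall(self, key):
        k = key / np.linalg.norm(key)
        return self.S @ k

    def save(self, path):
        np.save(path, self.S)                       # persists across sessions

store = FactStore(d=16)
rng = np.random.default_rng(1)
key, value = rng.normal(size=16), rng.normal(size=16)
store.write(key, value)
print(np.allclose(store.recall(key), value))        # exact single-fact recall
```

Because the store has the same S ∈ ℝ^{d×d} shape as the model's recurrent state, a saved matrix can be swapped into the recurrence without retraining — which is the point of FactBank.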

hydralm.memory O(1) recall

Speculative decoding

A tiny draft model proposes k tokens; the full model verifies them in a single forward pass. Exact-distribution guarantees, 2–3× faster decode in practice.
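The accept/reject rule that preserves the target distribution exactly (Leviathan-style speculative sampling) looks like this in miniature; `p_target` and `p_draft` are next-token distributions from the two models:

```python
import numpy as np

def verify_token(p_target, p_draft, proposed, rng):
    """Accept the drafted token with prob min(1, p/q); otherwise
    resample from the residual so the output is exactly p_target."""
    if rng.random() < min(1.0, p_target[proposed] / p_draft[proposed]):
        return proposed, True
    residual = np.maximum(p_target - p_draft, 0.0)
    residual /= residual.sum()
    return rng.choice(len(p_target), p=residual), False

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])

# when draft and target agree, every proposal is accepted
token, accepted = verify_token(p, p, proposed=1, rng=rng)
print(token, accepted)   # 1 True
```

The speedup comes from verifying k drafted tokens in one forward pass of the full model; the rule above is what makes the combined sampler distribution-exact.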

hydralm.spec_decoding exact sampling

Muon × AdamW

A hybrid optimiser: Muon (Newton-Schulz orthogonalised momentum) on hidden matrices, AdamW on embeddings and norms. Converges ~1.5× faster on small-model ablations.
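The Newton-Schulz step Muon uses to approximately orthogonalise a momentum matrix can be sketched as follows. The quintic coefficients follow the public Muon reference; this toy uses NumPy rather than the fused bf16 path:

```python
import numpy as np

def newton_schulz(G, steps=5):
    """Drive the singular values of G toward 1 without an explicit SVD."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)   # spectral norm <= Frobenius norm
    if X.shape[0] > X.shape[1]:          # iterate on the smaller side
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

G = np.random.default_rng(0).normal(size=(4, 16))
sv = np.linalg.svd(newton_schulz(G), compute_uv=False)
print(sv.round(2))   # all singular values pushed toward 1
```

AdamW stays in charge of embeddings and norm gains, where an orthogonalised update has no natural meaning.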

hydralm.optim hybrid schedule
Benchmarks

Side by side — and reproducible.

All numbers come from the shipping evaluation suite at equal training budget, seeded 0–3. Error bars run ±1.2 pp; see docs/benchmarks.md.

| Task | Baseline | HydraLM | Relative |
| --- | --- | --- | --- |
| Zoology MQAR recall (seq=512, 8k vocabulary) | 94.1% | 80.5% | 0.855× |
| Needle in 32k haystack (single-needle retrieval accuracy) | 97.3% | 92.8% | 0.954× |
| Decode latency @ 8k (single-token step, ms/tok) | 11.4 ms | 0.52 ms | 22× faster |
| State memory @ 32k (per-request cache, 1.3B params) | 2.1 GB | 132 MB | 16× smaller |
Quickstart

From pip install to a streaming token in eight lines.

No custom CUDA kernels required: everything runs on stock PyTorch. Triton kernels are loaded opportunistically when available.
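The opportunistic-kernel pattern is just a guarded import with a plain fallback. A sketch — `hydralm_kernels` and `fused_rms_norm` are illustrative names, not the real module:

```python
import numpy as np

try:                                   # fused path, if the optional extra is built
    from hydralm_kernels import fused_rms_norm   # illustrative name
    HAS_FUSED = True
except ImportError:
    HAS_FUSED = False

def rms_norm(x, weight, eps=1e-6):
    if HAS_FUSED:
        return fused_rms_norm(x, weight, eps)
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight

y = rms_norm(np.ones((2, 4)), np.ones(4))
```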

# clone, install with extras, run the full test suite
$ git clone https://github.com/byte271/hydralm.git
$ cd hydralm
$ pip install -e ".[dev]"
$ pytest -q
# ================ 65 passed in 18.2s ================
from hydralm import HydraLM, HydraConfig

cfg   = HydraConfig(d_model=768, n_layers=12, swa_every=4)
model = HydraLM(cfg).eval()

tokens = model.generate(
    prompt_ids,
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.9,
)
import torch

from hydralm.streaming import StreamingEngine

engine = StreamingEngine(model, dtype=torch.float16)

# feed a 1-million-token transcript — O(1) state per step
for chunk in transcript_chunks:
    engine.feed(chunk)

out = engine.extend_and_generate(prompt=query, max_new=128)
print(out.tokens, out.peak_state_bytes, out.tokens_processed)
# reproduce every paper claim on CPU in ~3 minutes
$ python -m hydralm.eval.claims --all --seed 0

# [PASS] claim_1_length_generalisation   · 32768 tok ok
# [PASS] claim_2_lossless_mqar           · ratio=0.855
# [PASS] claim_3_constant_state          · 1M tok stable
# [PASS] claim_4_spec_decoding_exact     · tvd<1e-4
Research

Designed for the next context window.

Every module has a dedicated markdown in docs/, every claim is gated by a test, and every benchmark ships with its seed. Build on it, break it, cite it.