HydraLM

HydraLM interleaves Gated DeltaNet and Sliding-Window Attention in a single causal stack — giving you million-token streaming at constant memory, with recall within 0.9× of a full-attention baseline.
`pytest -m slow` runs the full claims gate on CPU in minutes. The numbers below come straight from `hydralm.eval.claims` and are embedded in CI.
A HydraLM layer is either a DeltaNet cell or a windowed attention head — never both, never neither. The block schedule is set at configuration time and frozen for inference.
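With the `swa_every` knob from the quick-start config, the schedule might look like the sketch below. The placement rule shown ("every Nth layer is attention") is a hypothetical illustration; the actual rule is whatever `HydraConfig` encodes.

```python
def block_schedule(n_layers: int, swa_every: int) -> list:
    """Hypothetical layer schedule: every swa_every-th layer is a
    sliding-window attention block, the rest are DeltaNet cells."""
    return ["swa" if (i + 1) % swa_every == 0 else "delta" for i in range(n_layers)]

# e.g. 8 layers with swa_every=4: three DeltaNet cells, then one SWA head, repeated
print(block_schedule(8, 4))
```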
A bounded outer-product state S ∈ ℝ^{H×d×d} (one d×d matrix per head), updated with the gated delta rule. Recalls any key that has ever been written, at O(1) cost per token.
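The write path fits in a few lines. This is a generic single-head gated delta-rule update, not HydraLM's fused kernel: the names `gated_delta_step` and `recall`, the gate placement, and the shapes are all illustrative.

```python
import torch

def gated_delta_step(S, k, v, beta, alpha):
    """One gated delta-rule write for a single head.

    S: (d, d) outer-product state; k, v: (d,) key (unit norm) and value;
    beta: write strength in [0, 1]; alpha: decay gate in [0, 1].
    Illustrative sketch, not HydraLM's exact formulation.
    """
    pred = S @ k                                      # memory's current guess for key k
    S = alpha * S + beta * torch.outer(v - pred, k)   # correct the guess toward v
    return S

def recall(S, k):
    return S @ k   # O(d^2) per token, constant in sequence length

d = 4
k = torch.nn.functional.normalize(torch.randn(d), dim=0)
v = torch.randn(d)
S = gated_delta_step(torch.zeros(d, d), k, v, beta=torch.tensor(1.0), alpha=torch.tensor(1.0))
print(torch.allclose(recall(S, k), v, atol=1e-5))  # True: one full-strength write is exactly recalled
```

With `beta=1` and `alpha=1` a write is stored losslessly for that key; intermediate gate values trade retention of old writes against fidelity of new ones.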
FlashAttention-compatible causal window. Handles short-range syntactic work that a pure recurrence struggles with, for a fixed cache of W·d_model.
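The fixed W·d_model cache bound follows from evicting everything older than the window. A minimal sketch; the class `WindowKVCache` is hypothetical and not part of the package:

```python
import torch
from collections import deque

class WindowKVCache:
    """Keeps only the last W key/value pairs, so memory is O(W * d_model)
    regardless of how many tokens have streamed past. Illustrative only."""

    def __init__(self, window: int):
        self.k = deque(maxlen=window)   # deque drops the oldest entry on overflow
        self.v = deque(maxlen=window)

    def append(self, k_t, v_t):
        self.k.append(k_t)
        self.v.append(v_t)

    def tensors(self):
        return torch.stack(list(self.k)), torch.stack(list(self.v))

cache = WindowKVCache(window=3)
for t in range(10):
    cache.append(torch.full((4,), float(t)), torch.full((4,), float(t)))
K, V = cache.tensors()
print(K.shape, K[:, 0])  # only the last 3 key vectors survive
```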
Output projections scaled by 1/√(2L); norms dispatch to the fused CUDA kernel when available. Deep models stay numerically stable from step zero.
Each subsystem ships with its own public API, its own tests, and its own ablation. Compose them freely.
FactBank: a persistent, cross-session outer-product memory that hot-swaps into the same recurrence the model was trained on. Write once, recall anywhere — no fine-tuning required.
Speculative decoding: a tiny draft model proposes k tokens; the full model verifies them in a single forward pass. Exact-distribution guarantees, 2–3× faster decode in practice.
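The exact-distribution guarantee comes from the standard accept/reject rule of speculative sampling: accept drafted token t with probability min(1, p(t)/q(t)). A sketch of the acceptance loop, assuming per-position target and draft probabilities `p` and `q`; the function name is hypothetical, not HydraLM's API:

```python
import torch

def speculative_accept(p, q, draft_tokens, generator=None):
    """Accept a prefix of the drafted tokens so the output is distributed
    exactly as the target model p (standard speculative sampling rule).

    p, q: (k, vocab) target / draft probabilities at each drafted position.
    Returns the number of accepted tokens; the verifier then resamples
    position i from the residual distribution (omitted here).
    """
    for i, tok in enumerate(draft_tokens):
        ratio = p[i, tok] / q[i, tok]
        if torch.rand((), generator=generator) >= ratio:  # reject at position i
            return i
    return len(draft_tokens)

k, vocab = 3, 8
q = torch.full((k, vocab), 1.0 / vocab)
draft = torch.tensor([1, 5, 2])
print(speculative_accept(q.clone(), q, draft))  # 3: identical p and q always accept
```

When the draft matches the target the whole proposal is kept, which is where the 2–3× decode speedup comes from; mismatches cost one verification pass but never bias the output distribution.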
A hybrid optimiser: Muon (Newton-Schulz orthogonalised momentum) on hidden matrices, AdamW on embeddings and norms. Converges ~1.5× faster on small-model ablations.
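Newton-Schulz orthogonalised momentum can be sketched in a few lines: rescale the momentum matrix so its singular values are at most 1, then iterate a polynomial map that pushes every singular value toward 1. The coefficients and step count below are the textbook ones, not Muon's tuned values, and the function name is illustrative.

```python
import torch

def newton_schulz_orthogonalise(G, steps=10):
    """Approximate the orthogonal polar factor of G via Newton-Schulz
    iteration, the core of Muon-style updates on hidden weight matrices.
    Sketch only: real implementations use tuned coefficients and fewer steps."""
    X = G / (G.norm() + 1e-7)                # scale so all singular values are <= 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X      # each step pushes singular values toward 1
    return X

G = torch.tensor([[2.0, 0.0], [0.0, 1.0]])   # ill-scaled "gradient"
O = newton_schulz_orthogonalise(G)
print(O @ O.T)  # close to the identity: the update direction is orthogonalised
```

Orthogonalising the update equalises the step size across directions of the weight matrix, which is the intuition behind the faster convergence on hidden matrices; embeddings and norms keep AdamW because they are not naturally matrix-shaped updates.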
All numbers come from the shipping evaluation suite at equal training budget, over seeds 0–3. Error bars are ±1.2 pp; see docs/benchmarks.md.
From pip install to a streaming token in eight lines. No custom kernels required — everything runs on stock PyTorch; Triton kernels are loaded opportunistically when available.
```bash
# clone, install with extras, run the full test suite
$ git clone https://github.com/byte271/hydralm.git
$ cd hydralm
$ pip install -e ".[dev]"
$ pytest -q
# ================ 65 passed in 18.2s ================
```
```python
from hydralm import HydraLM, HydraConfig

cfg = HydraConfig(d_model=768, n_layers=12, swa_every=4)
model = HydraLM(cfg).eval()
tokens = model.generate(
    prompt_ids,
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.9,
)
```
```python
import torch

from hydralm.streaming import StreamingEngine

engine = StreamingEngine(model, dtype=torch.float16)

# feed a 1-million-token transcript — O(1) state per step
for chunk in transcript_chunks:
    engine.feed(chunk)

out = engine.extend_and_generate(prompt=query, max_new=128)
print(out.tokens, out.peak_state_bytes, out.tokens_processed)
```
```bash
# reproduce every paper claim on CPU in ~3 minutes
$ python -m hydralm.eval.claims --all --seed 0
# [PASS] claim_1_length_generalisation · 32768 tok ok
# [PASS] claim_2_lossless_mqar         · ratio=0.855
# [PASS] claim_3_constant_state        · 1M tok stable
# [PASS] claim_4_spec_decoding_exact   · tvd<1e-4
```
Every module has a dedicated markdown in docs/, every claim is gated by a test, and every benchmark ships with its seed. Build on it, break it, cite it.