A frozen transformer learned that wombats produce cube-shaped droppings and still knows it after a cold reload [R]
A transformer with a separate, isolated memory buffer. Backbone frozen. 300 gradient steps on the memory weights only:
| Query | Predicted completion | p(answer) |
|---|---|---|
| "wombats produce cube-shaped" | droppings | 0.9997 |
| "kaniva gets hot in" | su (summer) | 0.9998 |
| "Lions Club president eats" | V (Vegemite) | 0.9990 |
Save, kill process, cold reload, query again. Same result. 20 unrelated facts encoded jointly: 20/20 correct, median p = 0.997. Two subjects encoded simultaneously with cross-contamination < 0.03.
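None of the code below is from the repo; it's a minimal numpy sketch of the encode → save → cold-reload loop, with a plain linear associative memory standing in for the memory buffer and fixed random vectors standing in for the frozen backbone's context and answer embeddings:

```python
import numpy as np
import pickle

rng = np.random.default_rng(0)
d = 64

facts = {"wombats produce cube-shaped": "droppings",
         "kaniva gets hot in": "summer"}
keys = {q: rng.standard_normal(d) for q in facts}           # stand-in "backbone" context embeddings
vals = {a: rng.standard_normal(d) for a in facts.values()}  # stand-in answer embeddings

M = np.zeros((d, d))   # the only trainable weights (memory); "backbone" stays frozen
lr = 0.01
for _ in range(300):   # 300 gradient steps on the memory weights only
    for q, a in facts.items():
        err = M @ keys[q] - vals[a]        # readout error for this fact
        M -= lr * np.outer(err, keys[q])   # gradient of 0.5*||err||^2 w.r.t. M

blob = pickle.dumps(M)    # save, kill process ...
M2 = pickle.loads(blob)   # ... cold reload

def answer(q):
    out = M2 @ keys[q]    # decode by nearest stored answer embedding
    return min(vals, key=lambda a: np.linalg.norm(out - vals[a]))

print(answer("wombats produce cube-shaped"))  # droppings
```

Decoding by nearest embedding (rather than a softmax over a vocab) keeps the sketch self-contained; the point is only that the learned M survives serialization.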
The mechanism: BDH (Kosowski et al., arXiv:2509.26507) computes a co-activation outer product at every token step and then discards it. This work accumulates it instead, with a learned content-addressing projection so the address reflects the full causal context, not just token identity.
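The write rule can be sketched as an accumulated outer product with a learned addressing projection in front. Everything here (`W_addr`, the `tanh`, the `decay` knob) is an illustrative assumption, not the paper's exact parameterization:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_mem = 32, 48

# Hypothetical learned content-addressing projection: maps the full causal
# hidden state (not just the token embedding) to a memory address.
W_addr = rng.standard_normal((d_mem, d_model)) / np.sqrt(d_model)

M = np.zeros((d_mem, d_model))   # accumulated fast-weight memory

def write(h_context, h_value, M, decay=1.0):
    """Accumulate the co-activation outer product instead of discarding it."""
    addr = np.tanh(W_addr @ h_context)        # assumed addressing nonlinearity
    return decay * M + np.outer(addr, h_value)

def read(h_query, M):
    addr = np.tanh(W_addr @ h_query)
    return M.T @ addr                         # content retrieved at that address

h = rng.standard_normal(d_model)   # context hidden state
v = rng.standard_normal(d_model)   # value to store
M = write(h, v, M)
r = read(h, M)
# After one write, the readout at the same address points along the stored value.
```

With a single write, `read` returns the stored value scaled by the squared address norm, so retrieval is exact up to scale; interference only appears once multiple facts share address overlap.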
Sits in the fast weight programmer tradition (Schmidhuber 1991). Closest concurrent work: FwPKM (arXiv:2601.00671) and In-Place TTT (arXiv:2604.06169), which independently converge on a similar write rule.
Limitations: 15M params, 250M tokens, single consumer GPU, single seed. Encoding takes 300 steps, not one-shot. Capacity beyond 20 facts untested.
You can run this yourself. Instructions are all in the README.
Code: https://github.com/fleeb83/bdh-fast-weights (Apache 2.0)