A frozen transformer learned that wombats produce cube-shaped droppings and still knows it after a cold reload [R]
A transformer with a separate, isolated memory buffer. Backbone frozen. 300 gradient steps on the memory weights only:
| Query | Predicted completion | p(answer) |
|---|---|---|
| "wombats produce cube-shaped" | droppings | 0.9997 |
| "kaniva gets hot in" | su (summer) | 0.9998 |
| "Lions Club president eats" | V (Vegemite) | 0.9990 |
Save, kill process, cold reload, query again. Same result. 20 unrelated facts encoded jointly: 20/20 correct, median p = 0.997. Two subjects encoded simultaneously with cross-contamination < 0.03.
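None of the code below is from the repo; it's a minimal numpy sketch of the encode → save → cold-reload loop, with a plain linear associative memory standing in for the memory buffer and fixed random vectors standing in for the frozen backbone's context and answer embeddings:

```python
import numpy as np
import pickle

rng = np.random.default_rng(0)
d = 64

facts = {"wombats produce cube-shaped": "droppings",
         "kaniva gets hot in": "summer"}
keys = {q: rng.standard_normal(d) for q in facts}           # stand-in "backbone" context embeddings
vals = {a: rng.standard_normal(d) for a in facts.values()}  # stand-in answer embeddings

M = np.zeros((d, d))   # the only trainable weights (memory); "backbone" stays frozen
lr = 0.01
for _ in range(300):   # 300 gradient steps on the memory weights only
    for q, a in facts.items():
        err = M @ keys[q] - vals[a]        # readout error for this fact
        M -= lr * np.outer(err, keys[q])   # gradient of 0.5*||err||^2 w.r.t. M

blob = pickle.dumps(M)    # save, kill process ...
M2 = pickle.loads(blob)   # ... cold reload

def answer(q):
    out = M2 @ keys[q]    # decode by nearest stored answer embedding
    return min(vals, key=lambda a: np.linalg.norm(out - vals[a]))

print(answer("wombats produce cube-shaped"))  # droppings
```

Decoding by nearest embedding (rather than a softmax over a vocab) keeps the sketch self-contained; the point is only that the learned M survives serialization.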
The mechanism: BDH (Kosowski et al., arXiv:2509.26507) computes a co-activation outer product at every token step and then discards it. This work accumulates it instead, with a learned content-addressing projection so the address reflects the full causal context, not just token identity.
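The write rule can be sketched as an accumulated outer product with a learned addressing projection in front. Everything here (`W_addr`, the `tanh`, the `decay` knob) is an illustrative assumption, not the paper's exact parameterization:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_mem = 32, 48

# Hypothetical learned content-addressing projection: maps the full causal
# hidden state (not just the token embedding) to a memory address.
W_addr = rng.standard_normal((d_mem, d_model)) / np.sqrt(d_model)

M = np.zeros((d_mem, d_model))   # accumulated fast-weight memory

def write(h_context, h_value, M, decay=1.0):
    """Accumulate the co-activation outer product instead of discarding it."""
    addr = np.tanh(W_addr @ h_context)        # assumed addressing nonlinearity
    return decay * M + np.outer(addr, h_value)

def read(h_query, M):
    addr = np.tanh(W_addr @ h_query)
    return M.T @ addr                         # content retrieved at that address

h = rng.standard_normal(d_model)   # context hidden state
v = rng.standard_normal(d_model)   # value to store
M = write(h, v, M)
r = read(h, M)
# After one write, the readout at the same address points along the stored value.
```

With a single write, `read` returns the stored value scaled by the squared address norm, so retrieval is exact up to scale; interference only appears once multiple facts share address overlap.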
Sits in the fast weight programmer tradition (Schmidhuber 1991). Closest concurrent work: FwPKM (arXiv:2601.00671) and In-Place TTT (arXiv:2604.06169), which independently converge on a similar write rule.
Limitations: 15M params, 250M tokens, single consumer GPU, single seed. Encoding takes 300 steps, not one-shot. Capacity beyond 20 facts untested.
You can run this yourself. Instructions are all in the README.
Code: https://github.com/fleeb83/bdh-fast-weights (Apache 2.0)