DeepSeek Engram: Conditional Memory via Scalable Lookup [pdf] (github.com)
This direction is clearly not unique to DeepSeek. Similar ideas have shown up before, like Gemma's per-layer embeddings, RWKV's deepembed, and ByteDance's UltraMem. From an algorithmic perspective, hash-based n-gram lookup is far from ideal: the same fact phrased differently, or in another language, ends up mapping to different keys, so generalization is weak and duplication and noise are hard to avoid. UltraMem tries to address this with learnable mappings, but that also brings more parameters and a harder-to-tune structure.

What DeepSeek seems to be doing is making a system-level trade-off. Instead of chasing a cleaner algorithm, they simplify the computation and move it before inference, using raw input tokens for the lookup and running the whole step in CPU memory. The cost is lower algorithmic efficiency; the benefit is zero GPU memory usage, very simple logic, and a preprocessing path that can be fully offloaded to CPUs.

Once this runs in CPU memory, the optimization target changes. Parameter efficiency and per-query optimality matter less: even if the hash mapping is noisy or redundant, the cost is low and it doesn't touch scarce GPU memory. At the system level, that trade-off can make sense.
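To make that concrete, here's a minimal, hedged sketch of what a hash-based n-gram lookup into a CPU-resident table can look like. None of this is DeepSeek's actual implementation; the table size, n-gram order, and hash function are arbitrary stand-ins for illustration:

    import numpy as np

    # Made-up sizes for illustration; a real table would be far larger
    # and would still live in host RAM rather than GPU memory.
    TABLE_SLOTS = 1 << 16   # number of hash buckets
    EMBED_DIM = 64          # width of each stored memory vector
    NGRAM_ORDER = 3         # look up trigrams of raw input token ids

    # The table is ordinary CPU memory, so it costs no GPU memory at all.
    memory_table = np.zeros((TABLE_SLOTS, EMBED_DIM), dtype=np.float16)

    def ngram_key(tokens):
        # FNV-1a-style hash over raw token ids. Because the key is a
        # surface-level n-gram, the same fact phrased differently (or in
        # another language) lands in different buckets -- the weak
        # generalization mentioned above.
        h = 1469598103934665603
        for t in tokens:
            h = ((h ^ int(t)) * 1099511628211) % (1 << 64)
        return h % TABLE_SLOTS

    def lookup_memories(token_ids):
        # Preprocessing step: gather one memory vector per position.
        # This runs entirely on CPU before (or overlapped with) the GPU
        # forward pass; the result is then handed to the model.
        out = np.zeros((len(token_ids), EMBED_DIM), dtype=np.float16)
        for i in range(NGRAM_ORDER - 1, len(token_ids)):
            key = ngram_key(token_ids[i - NGRAM_ORDER + 1 : i + 1])
            out[i] = memory_table[key]
        return out

    # Example: look up memory vectors for a short token sequence.
    vecs = lookup_memories([101, 7592, 2088, 102])
    print(vecs.shape)   # (4, 64)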
Unlike most LLM improvements, which modify the architecture, the optimizer, or some other part of the model itself, this paper describes a technique that adds an external lookup table to the forward pass, with the lookup running in parallel with part of the compute. It's a really interesting idea with a lot of cool engineering work behind it, but it looks too convoluted without improvements that would justify the complexity.
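For a rough picture of what "in parallel with part of the compute" could mean, here's a hedged sketch where the CPU-side lookup is launched on a worker thread and only joined at the point where its result is consumed. Every function here is a trivial placeholder, not the paper's API:

    from concurrent.futures import ThreadPoolExecutor
    import numpy as np

    EMBED_DIM = 64  # made-up width for the looked-up vectors

    def lookup_memories(token_ids):
        # Placeholder for the CPU-side table lookup; just returns zeros
        # of the right shape here.
        return np.zeros((len(token_ids), EMBED_DIM), dtype=np.float32)

    def run_early_layers(hidden):
        # Placeholder for the layers that don't need the lookup result.
        return hidden * 2.0

    def run_late_layers(hidden, memories):
        # Placeholder for the layer(s) that consume the looked-up vectors.
        return hidden + memories

    def forward(token_ids, hidden):
        with ThreadPoolExecutor(max_workers=1) as pool:
            # Kick off the table lookup on a CPU worker thread...
            fut = pool.submit(lookup_memories, token_ids)
            # ...while the accelerator runs the part of the forward pass
            # that doesn't depend on it.
            hidden = run_early_layers(hidden)
            # Block only at the point where the result is consumed.
            memories = fut.result()
            return run_late_layers(hidden, memories)

    # Example: 8 tokens, hidden states of shape (8, EMBED_DIM).
    out = forward(list(range(8)), np.ones((8, EMBED_DIM), dtype=np.float32))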