StreamingLLM: tiny tweak to KV LRU improves long conversations (news.mit.edu)
There was a really interesting post a while ago about adjusting the softmax function to allow attention heads to not make a choice (https://www.evanmiller.org/attention-is-off-by-one.html). It seems like that might remove the need for these attention sinks entirely. I keep meaning to go in and test this, but boy, time gets away from you...
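For anyone who hasn't read it: the tweak is literally just an extra 1 in the softmax denominator, so a head can put its mass on "nothing" instead of being forced to pick a token. A minimal sketch (the name softmax_one is mine, not from the article):

```python
import torch

def softmax_one(x: torch.Tensor, dim: int = -1) -> torch.Tensor:
    # "Softmax off by one": exp(x_i) / (1 + sum_j exp(x_j)).
    # Shift by max(x, 0) for numerical stability; the extra unit in the
    # denominator then becomes exp(-shift).
    shift = x.max(dim=dim, keepdim=True).values.clamp(min=0)
    ex = torch.exp(x - shift)
    return ex / (torch.exp(-shift) + ex.sum(dim=dim, keepdim=True))

scores = torch.tensor([[-4.0, -5.0, -6.0]])  # a head with nothing worth attending to
print(torch.softmax(scores, dim=-1))  # forced to sum to 1
print(softmax_one(scores))            # sums to well under 1: the head "abstains"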
Feel free to mess with it; his tweak to softmax was actually supported by PyTorch before the article was written, just off by default. Maybe it needs to be more widely used, though; after all, good ideas are often independently discovered multiple times. Details are in this tweet https://twitter.com/SamuelMullr/status/1683582347793530884 or, if you don't like Twitter, the option is add_zero_attn on PyTorch's MultiheadAttention.
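For reference, that flag is a one-liner to turn on: it appends an all-zero key/value slot, which gives roughly the same escape hatch as the +1 in the denominator. A rough usage sketch (the sizes are arbitrary):

```python
import torch
import torch.nn as nn

# add_zero_attn appends a zero vector to the key and value sequences,
# so every query has a "null" slot it can attend to instead of real tokens.
mha = nn.MultiheadAttention(embed_dim=64, num_heads=4,
                            add_zero_attn=True, batch_first=True)

x = torch.randn(2, 10, 64)   # (batch, seq, embed)
out, weights = mha(x, x, x)  # self-attention
print(out.shape)             # torch.Size([2, 10, 64])
print(weights.shape)         # torch.Size([2, 10, 11]) -- 11 keys: 10 tokens + the zero slot
```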
Interesting! HN discussion of it here: https://news.ycombinator.com/item?id=36851494
Previous discussion, on a link to the implementation: https://news.ycombinator.com/item?id=37740932
This seems to work only because large GPTs have redundant, under-complex attention heads. See this issue in BertViz about attention in Llama: https://github.com/jessevig/bertviz/issues/128
I wonder if it could make sense to have breakaway bots: at 10k tokens a new one launches with the first 2k tokens, the last 1k, and a table of contents, so that when you go back to something you're handed off to a model where that material is more strongly reinforced, or something like that (roughly the sketch below). Sort of like mixture of experts, except each one is only an expert on individual snippets of a long conversational thread.
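Something like this, maybe? A toy sketch of that handoff; the token budgets, the word-count token proxy, and the make_toc helper (some summarizer call) are all placeholders I made up, not anything from the paper:

```python
def build_handoff_context(messages, make_toc,
                          head_tokens=2000, tail_tokens=1000, limit=10000):
    """Once the thread passes `limit` tokens, assemble the context for a
    fresh "breakaway" model: the opening of the thread, a table of contents
    over the elided middle, and the most recent turns verbatim."""
    def count(msgs):
        return sum(len(m.split()) for m in msgs)  # crude token proxy

    if count(messages) < limit:
        return messages  # no handoff needed yet

    head, tail, middle = [], [], list(messages)
    while middle and count(head) < head_tokens:
        head.append(middle.pop(0))       # keep the first ~2k tokens
    while middle and count(tail) < tail_tokens:
        tail.insert(0, middle.pop())     # keep the last ~1k tokens

    toc = make_toc(middle)  # e.g. an LLM call that titles each dropped chunk
    return head + ["[Table of contents of elided discussion]\n" + toc] + tail
```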
Here they simply used different models for different turns and apparently it gave more "engaging" results:
You’re right: A lot of the conversation can be condensed, especially if there are enough cues for the AI to arrive in the same “neuronal neighborhood” as the previous conversation.