StreamingLLM: tiny tweak to KV LRU improves long conversations (news.mit.edu)
There was a really interesting post a while ago about adjusting the softmax function to allow attention heads to not make a choice (https://www.evanmiller.org/attention-is-off-by-one.html). It seems like that might remove the need for these attention sinks entirely. I keep meaning to go in and test this, but boy, time gets away from you...
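For anyone who hasn't read it: the tweak is literally just an extra 1 in the softmax denominator, so a head can put its mass on "nothing" instead of being forced to pick a token. A minimal sketch (the name softmax_one is mine, not from the article):

```python
import torch

def softmax_one(x: torch.Tensor, dim: int = -1) -> torch.Tensor:
    # "Softmax off by one": exp(x_i) / (1 + sum_j exp(x_j)).
    # Shift by max(x, 0) for numerical stability; the extra unit in the
    # denominator then becomes exp(-shift).
    shift = x.max(dim=dim, keepdim=True).values.clamp(min=0)
    ex = torch.exp(x - shift)
    return ex / (torch.exp(-shift) + ex.sum(dim=dim, keepdim=True))

scores = torch.tensor([[-4.0, -5.0, -6.0]])  # a head with nothing worth attending to
print(torch.softmax(scores, dim=-1))  # forced to sum to 1
print(softmax_one(scores))            # sums to well under 1: the head "abstains"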
Feel free to mess with it; his tweak to softmax was actually supported by PyTorch before the article was written, just off by default. Maybe it needs to be more widely used, though; after all, good ideas are often independently discovered multiple times. Details are in this tweet https://twitter.com/SamuelMullr/status/1683582347793530884 or, if you don't like Twitter, the option is add_zero_attn on PyTorch's MultiheadAttention.
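For reference, that flag is a one-liner to turn on: it appends an all-zero key/value slot, which gives roughly the same escape hatch as the +1 in the denominator. A rough usage sketch (the sizes are arbitrary):

```python
import torch
import torch.nn as nn

# add_zero_attn appends a zero vector to the key and value sequences,
# so every query has a "null" slot it can attend to instead of real tokens.
mha = nn.MultiheadAttention(embed_dim=64, num_heads=4,
                            add_zero_attn=True, batch_first=True)

x = torch.randn(2, 10, 64)   # (batch, seq, embed)
out, weights = mha(x, x, x)  # self-attention
print(out.shape)             # torch.Size([2, 10, 64])
print(weights.shape)         # torch.Size([2, 10, 11]) -- 11 keys: 10 tokens + the zero slot
```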
Interesting! HN discussion of it here: https://news.ycombinator.com/item?id=36851494
Previous discussion, on a link to the implementation: https://news.ycombinator.com/item?id=37740932
This seems to work only because large GPTs have redundant, under-complex attention heads. See this issue in BertViz about attention in Llama: https://github.com/jessevig/bertviz/issues/128
I wonder if it could make sense to have breakaway bots: at 10k tokens a new one launches with the first 2k tokens, the last 1k, and a table of contents, so that when you go back to something you're handed off to a model where that material is more strongly reinforced, or something like that (roughly the sketch below). Sort of like mixture of experts, except each one is only an expert on individual snippets of a long conversational thread.
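Something like this, maybe? A toy sketch of that handoff; the token budgets, the word-count token proxy, and the make_toc helper (some summarizer call) are all placeholders I made up, not anything from the paper:

```python
def build_handoff_context(messages, make_toc,
                          head_tokens=2000, tail_tokens=1000, limit=10000):
    """Once the thread passes `limit` tokens, assemble the context for a
    fresh "breakaway" model: the opening of the thread, a table of contents
    over the elided middle, and the most recent turns verbatim."""
    def count(msgs):
        return sum(len(m.split()) for m in msgs)  # crude token proxy

    if count(messages) < limit:
        return messages  # no handoff needed yet

    head, tail, middle = [], [], list(messages)
    while middle and count(head) < head_tokens:
        head.append(middle.pop(0))       # keep the first ~2k tokens
    while middle and count(tail) < tail_tokens:
        tail.insert(0, middle.pop())     # keep the last ~1k tokens

    toc = make_toc(middle)  # e.g. an LLM call that titles each dropped chunk
    return head + ["[Table of contents of elided discussion]\n" + toc] + tail
```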
Here they simply used different models for different turns and apparently it gave more "engaging" results:
You’re right: A lot of the conversation can be condensed, especially if there are enough cues for the AI to arrive in the same “neuronal neighborhood” as the previous conversation.