StreamingLLM: tiny tweak to KV LRU improves long conversations

news.mit.edu

91 points by lucasluitjes 2 years ago · 8 comments
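
For context, the article's "tiny tweak" is to the KV-cache eviction policy: instead of a plain sliding window / LRU, StreamingLLM always keeps the first few tokens as "attention sinks" alongside the recent window. A minimal sketch of that policy (the sizes are illustrative, and the cache is modeled as a simple per-token list):

    def evict(kv_cache, num_sinks=4, window=1020):
        # StreamingLLM-style eviction: pin the first `num_sinks` entries
        # (the attention sinks) and keep a sliding window of the most
        # recent tokens. Dropping the sinks, as plain windowing/LRU
        # would, is what degrades long conversations.
        if len(kv_cache) <= num_sinks + window:
            return kv_cache  # still fits; nothing to evict
        return kv_cache[:num_sinks] + kv_cache[-window:]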

TrueDuality 2 years ago

There was a really interesting post a while ago about adjusting the softmax function to allow attention heads to not make a choice (https://www.evanmiller.org/attention-is-off-by-one.html). It seems like that might remove the need for these attention sinks entirely. I keep meaning to go in and perform tests on this but boy time gets away from you...
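
For reference, the adjustment in that post adds 1 to the softmax denominator, which acts like an always-present zero logit: a head can assign (near-)zero total attention instead of being forced to put a full unit of weight somewhere. A minimal sketch (the function name and the stability shift are ours):

    import torch

    def softmax_one(x: torch.Tensor, dim: int = -1) -> torch.Tensor:
        # exp(x_i) / (1 + sum_j exp(x_j)): the extra 1 in the denominator
        # lets all weights shrink toward zero when no key is relevant.
        m = x.max(dim=dim, keepdim=True).values.clamp(min=0)  # numerical stability
        e = torch.exp(x - m)
        return e / (torch.exp(-m) + e.sum(dim=dim, keepdim=True))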

popinman322 2 years ago

Previous discussion, on a link to the implementation: https://news.ycombinator.com/item?id=37740932

Translationaut 2 years ago

This seems to work only because large GPTs have redundant, under-complex attention heads. See this issue in BertViz about attention in Llama: https://github.com/jessevig/bertviz/issues/128

gremlinsinc 2 years ago

I wonder if it could make sense to have break-away bots: at 10k tokens, a new one launches with the first 2k tokens, the last 1k, and a table of contents, so that when you go back to something you're handed off to a model where that data is more strongly reinforced, or something like that. Sort of like mixture of experts, but each one is only an expert on individual snippets of a long conversational thread.
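
A rough sketch of that hand-off (token counts taken from the comment; the Segment type and its one-line summaries are hypothetical):

    from dataclasses import dataclass

    @dataclass
    class Segment:
        tokens: list[str]  # tokens for one span of the conversation
        summary: str       # one-line "table of contents" entry

    def build_handoff(history: list[Segment], head=2000, tail=1000, limit=10000):
        # Once the running total passes `limit`, assemble the context for
        # a fresh model: the first `head` tokens, a table of contents for
        # the middle, and the last `tail` tokens.
        all_tokens = [t for seg in history for t in seg.tokens]
        if len(all_tokens) <= limit:
            return all_tokens  # still fits; no hand-off needed
        toc = "\n".join("- " + seg.summary for seg in history)
        return all_tokens[:head] + ["[table of contents]\n" + toc] + all_tokens[-tail:]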

  • kgeist 2 years ago

    Here they simply used different models for different turns, and apparently it gave more "engaging" results:

    https://arxiv.org/abs/2401.02994

  • joshspankit 2 years ago

    You’re right: A lot of the conversation can be condensed, especially if there are enough cues for the AI to arrive in the same “neuronal neighborhood” as the previous conversation.
