Ask HN: Do LLMs need a context window?

1 point by kleene_op 2 years ago · 5 comments

Excuse me for this potentially dumb question, but...

Why don't we train LLMs on user inputs at each step, instead of keeping the model static and feeding in the whole damn history every time?

I think I may have a clue as to why, actually: is it because this would force us to duplicate the model for every user (since their weights would diverge), and companies like OpenAI deem it too costly?

If so, will the rising affordability of local models downloaded by individuals enable a switch to the continuous-training approach soon enough, or am I forgetting something?

minimaxir 2 years ago

You can't feasibly train an LLM in real time. Training and inference are two different things.

OpenAI has a separate service for fine-tuning ChatGPT, and it is not speedy, and that's likely with shenanigans under the hood.
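
As a rough illustration (a toy PyTorch sketch of my own, not anything resembling OpenAI's stack): serving a response is a single forward pass, while even one training step adds a backward pass and an optimizer update on top of it.

    import torch
    import torch.nn as nn

    # Toy illustration (not OpenAI's setup): inference is just a forward pass;
    # a single training step also needs gradients and an optimizer update.
    model = nn.Linear(512, 512)
    x = torch.randn(4, 512)

    # Inference: no gradients, no optimizer state to keep around.
    with torch.no_grad():
        y = model(x)

    # One training step: forward, backward, weight update.
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    loss = model(x).pow(2).mean()
    loss.backward()
    opt.step()
    opt.zero_grad()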

  • kleene_op (OP) 2 years ago

    Even if the user input at each step is just normal conversational sentences? By that I mean very short, not megabytes of text.

    • sp332 2 years ago

      Yes, it would take several times as much memory (e.g. the AdamW optimizer uses 8 bytes per parameter). Plus, your latest sentences would just get mixed into the model, and it wouldn't have a concept of the current conversation.
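
      A back-of-the-envelope sketch of that gap (my assumptions: fp16 weights for inference; fp32 weights, gradients, and AdamW moment estimates for training; real setups vary):

          # Rough per-parameter memory for serving vs. training (assumptions above).
          def inference_gb(n_params, bytes_per_weight=2):  # fp16 weights
              return n_params * bytes_per_weight / 1e9

          def training_gb(n_params):
              weights = 4 * n_params    # fp32 master weights
              grads = 4 * n_params      # fp32 gradients
              adam = 8 * n_params       # two fp32 moments: the "8 bytes per parameter"
              return (weights + grads + adam) / 1e9

          n = 7_000_000_000             # hypothetical 7B-parameter model
          print(f"inference: ~{inference_gb(n):.0f} GB")  # ~14 GB
          print(f"training:  ~{training_gb(n):.0f} GB")   # ~112 GB, before activations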

      There are very different approaches that might behave more like what you want, maybe RWKV.

      • kleene_op (OP) 2 years ago

        Thanks for your suggestion about RWKV. I'll read up on that.

        • westurner 2 years ago

          https://github.com/BlinkDL/RWKV-LM#rwkv-discord-httpsdiscord... lists a number of implementations of various versions of RWKV.

          https://github.com/BlinkDL/RWKV-LM#rwkv-parallelizable-rnn-w... :

          > RWKV: Parallelizable RNN with Transformer-level LLM Performance (pronounced as "RwaKuv", from 4 major params: R W K V)

          > RWKV is an RNN with Transformer-level LLM performance, which can also be directly trained like a GPT transformer (parallelizable). And it's 100% attention-free. You only need the hidden state at position t to compute the state at position t+1. You can use the "GPT" mode to quickly compute the hidden state for the "RNN" mode.

          > So it's combining the best of RNN and transformer - great performance, fast inference, saves VRAM, fast training, "infinite" ctx_len, and free sentence embedding (using the final hidden state).

          > Our latest version is RWKV-6,
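
          To make the quoted "you only need the hidden state at position t" point concrete, here is a tiny NumPy toy RNN of my own (not RWKV's actual architecture): generation carries a fixed-size state forward instead of re-reading the whole history at each step.

              import numpy as np

              # Toy RNN cell (not RWKV itself): each step consumes one token plus
              # the previous hidden state, so the full history is never re-fed.
              rng = np.random.default_rng(0)
              vocab, d = 16, 8
              Wx = rng.normal(size=(d, vocab)) * 0.1   # input -> hidden
              Wh = rng.normal(size=(d, d)) * 0.1       # hidden -> hidden
              Wo = rng.normal(size=(vocab, d)) * 0.1   # hidden -> logits

              def step(token_id, h):
                  x = np.zeros(vocab)
                  x[token_id] = 1.0
                  h = np.tanh(Wx @ x + Wh @ h)         # new state from old state only
                  return Wo @ h, h                     # logits for next token, new state

              h = np.zeros(d)
              for t in [3, 1, 4, 1, 5]:                # consume a "prompt" token by token
                  logits, h = step(t, h)
              for _ in range(5):                       # keep generating from the state alone
                  t = int(np.argmax(logits))
                  logits, h = step(t, h)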
