Llama 32K Context Released by Together AI

together.ai

84 points by averylamp 2 years ago · 10 comments

behnamoh 2 years ago

IIRC, there was a paper which showed GPT models pay most attention to the beginning and end of context window, and much less attention to what's in the middle. In that regard, they behave like human brains. But I'm wondering if these efforts to increase context window actually make the models pay almost uniform attention to all the prompt?

  • saliagato 2 years ago

    You are correct. The paper is called "Lost in the Middle" [0], and it is probably one of the worst drawbacks of this technology. It makes a lot of use cases biased (think of law).

    [0] https://arxiv.org/pdf/2307.03172.pdf
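
    A rough sketch of the paper's probe setup (not their actual code): hide one fact at varying depths in a long context and measure how often the model recovers it. `generate` here is a placeholder for whatever completion API you use.

        import random

        def make_prompt(needle, filler, depth):
            # Insert the needle at a fractional depth (0.0 = start, 1.0 = end).
            docs = list(filler)
            docs.insert(int(depth * len(docs)), needle)
            return "\n".join(docs) + "\n\nQuestion: What is the secret code?\nAnswer:"

        def accuracy_by_depth(generate, filler, trials=20):
            # "Lost in the middle" predicts a U-shaped curve:
            # high accuracy at depths 0.0 and 1.0, a dip around 0.5.
            results = {}
            for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
                hits = 0
                for _ in range(trials):
                    code = str(random.randint(10000, 99999))
                    needle = "The secret code is %s." % code
                    hits += code in generate(make_prompt(needle, filler, depth))
                results[depth] = hits / trials
            return results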

    • trsohmers 2 years ago

      The research is slightly misleading... the models they experimented with all had an original pretrained context length significantly shorter than the fine-tuned context length they tested, e.g. MPT-30B-Instruct, which was pretrained at a 2k sequence length and then fine-tuned to 8k. A real test of whether current self-attention has this issue would be to natively train a model at the extended sequence length.

    • behnamoh 2 years ago

      > It makes a lot of use cases biased (think of law).

      Yes, it's unfortunate. I wonder if GPT-4 with a 32k context window is in a sense "smarter" than GPT-4 with 8k.

  • lamuswawir 2 years ago

    I have seen this too, but attention is far better for GPT-4: it follows a system prompt (say, for outputting JSON) uniformly, whereas GPT-3.5 follows the same system prompt for only two successive outputs. You have to repeat that prompt every single time. So I think increasing the context window may not make a model follow the system prompt uniformly.
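
    A quick sketch of that adherence test, assuming the OpenAI Python client (the model names and the five-turn cap are just examples): ask for JSON under a single system prompt and count how many successive turns stay valid.

        import json
        from openai import OpenAI

        client = OpenAI()

        def json_adherence(model, turns=5):
            # One system prompt up front; it is not repeated on later turns.
            messages = [{"role": "system",
                         "content": 'Reply only with a JSON object {"answer": ...}.'}]
            ok = 0
            for i in range(turns):
                messages.append({"role": "user", "content": "What is %d + %d?" % (i, i)})
                reply = client.chat.completions.create(
                    model=model, messages=messages).choices[0].message.content or ""
                messages.append({"role": "assistant", "content": reply})
                try:
                    json.loads(reply)
                    ok += 1
                except json.JSONDecodeError:
                    break  # adherence lost on this turn; stop counting
            return ok  # e.g. compare json_adherence("gpt-4") vs json_adherence("gpt-3.5-turbo")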

  • npsomaratna 2 years ago

    My understanding is that with NTK-aware RoPE scaling, the model does pay uniform attention. With older methods, not as much.
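
    For reference, a minimal sketch of the NTK-aware trick, assuming the standard RoPE formulation (scale is the context extension factor, e.g. 8 for 4k -> 32k): instead of squeezing positions, as linear interpolation does, it rescales the frequency base, so high-frequency local components are barely touched while low-frequency global ones stretch.

        import numpy as np

        def rope_inv_freq(dim, base=10000.0, scale=1.0):
            # NTK-aware scaling: grow the base instead of shrinking positions.
            # The dim/(dim-2) exponent comes from the original NTK-aware write-up.
            base = base * scale ** (dim / (dim - 2))
            return 1.0 / base ** (np.arange(0, dim, 2) / dim)

        def apply_rope(x, positions, inv_freq):
            # x: (seq_len, dim) queries or keys; rotate channel pairs.
            angles = np.outer(positions, inv_freq)    # (seq_len, dim/2)
            cos, sin = np.cos(angles), np.sin(angles)
            x1, x2 = x[:, 0::2], x[:, 1::2]
            out = np.empty_like(x)
            out[:, 0::2] = x1 * cos - x2 * sin
            out[:, 1::2] = x1 * sin + x2 * cos
            return out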

m3kw9 2 years ago

The red flag is when they don't compare it to GPT-3.5.

  • yumraj 2 years ago

    It's a 7B model; it's not supposed to, nor is it going to, compete with GPT-3.5.

  • avereveard 2 years ago

    The article's original title does a better job of conveying the current state as early exploration.
