How to scale LLMs better with an alternative to transformers
hazyresearch.stanford.edu
I wonder how a decentralized, hierarchical LLM would perform.
For example:
LLM A is trained on all of Wikipedia
LLM B is trained on all of Hacker News
LLM C is trained on all of Project Gutenberg
User asks question Q on webservice W. W sends Q to A and B.
Then W sends a question to C "Hey C, I have a user who asked Q. Here is A's reply and B's reply. Given those, how would you answer Q?"
Would the answer be as good as or better than what an LLM which is trained on Wikipedia, Hacker News and Project Gutenberg would return?
If it is of similar quality, then we could build a hierarchical tree of consumer hardware LLMs which are hosted all over the world.
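A minimal sketch of that flow in Python, assuming each model sits behind a simple HTTP endpoint; the URLs and the ask() helper here are made up for illustration:

    import requests

    # Hypothetical endpoints for the three specialist models.
    ENDPOINTS = {
        "A": "http://wiki-node.example/generate",       # trained on Wikipedia
        "B": "http://hn-node.example/generate",         # trained on Hacker News
        "C": "http://gutenberg-node.example/generate",  # trained on Project Gutenberg
    }

    def ask(model, prompt):
        # Send the prompt to one node and return its text reply.
        r = requests.post(ENDPOINTS[model], json={"prompt": prompt}, timeout=60)
        return r.json()["text"]

    def answer(question):
        # W fans the question out to A and B, then lets C synthesize the final answer.
        reply_a = ask("A", question)
        reply_b = ask("B", question)
        meta_prompt = (
            f"A user asked: {question}\n"
            f"Model A replied: {reply_a}\n"
            f"Model B replied: {reply_b}\n"
            "Given those replies, how would you answer the question?"
        )
        return ask("C", meta_prompt)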
Isn't that more or less how GPT-4 works? Multiple "expert" LLMs giving input depending on the context?[0]
[0]https://the-decoder.com/gpt-4-architecture-datasets-costs-an...
The biggest issue is if you have too many specialists, spin up a lot of them to reply to the same query, and then discard the less optimal answers.
Your answer quality might improve, but the computing costs could skyrocket without some smart filtering and distribution before the query ever reaches an LLM.
A huge misconception is that MoE is an ensemble of discrete models, when it is in fact multiple FFNN modules that share an attention and embedding module.
Basically the idea is that there are some parts of the model (attention/embedding) that should be trained on everything and used in every inference, and other parts (the FFNNs) that are fine to specialize on certain types of data (via a routing module that is also trained).
[0] https://arxiv.org/pdf/1701.06538.pdf [1] https://arxiv.org/pdf/2112.06905.pdf
EDIT: Specifically, the GLaM model architecture. Each MoE layer (the bottom block) is interleaved with a Transformer layer (the upper block). For each input token, e.g., ‘roses’, the Gating module dynamically selects the two most relevant experts out of 64, which is represented by the blue grid in the MoE layer. The weighted average of the outputs from these two experts will then be passed to the upper Transformer layer. For the next token in the input sequence, two different experts will be selected.
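For intuition, here is a minimal sketch of such a layer in PyTorch with top-2 routing in the style of GLaM; the sizes and the per-expert loop are illustrative, not the actual GLaM implementation:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MoEFFN(nn.Module):
        """One MoE feed-forward layer: a shared gate routing tokens to many expert FFNs."""
        def __init__(self, d_model=512, d_ff=2048, n_experts=64, top_k=2):
            super().__init__()
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
                for _ in range(n_experts)
            )
            self.gate = nn.Linear(d_model, n_experts)  # routing module, trained jointly
            self.top_k = top_k

        def forward(self, x):                                 # x: (batch, seq, d_model)
            scores = self.gate(x)                             # (batch, seq, n_experts)
            weights, idx = scores.topk(self.top_k, dim=-1)    # pick 2 experts per token
            weights = F.softmax(weights, dim=-1)
            out = torch.zeros_like(x)
            for k in range(self.top_k):
                for e, expert in enumerate(self.experts):
                    mask = idx[..., k] == e                   # tokens routed to expert e
                    if mask.any():
                        out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
            return out  # fed to the next (shared) Transformer layer

Attention and the embeddings sit outside this module and are shared by every token, which is why an MoE model is not an ensemble of separate models.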
This will perform worse in many cases, better in some cases. There is a lot of knowledge that can be transferred between datasets.
For example, "describe to me if this Amazon product is likely to have stronger tensile strength and if its materials are more safe?" requires knowledge not only from a database of Amazon products and their descriptions, but in this case leaving out knowledge from physics textbooks could be detrimental. Ultimately, these are the types of problems we want these systems to excel at as well, so it's important to access all of the training data. MoE is still a decent idea (can help transfer some of the knowledge between models with a model on top of others), but in order to not get wildly conflicting and/or unrelated stories from each model, some overlap is needed to provide a clearer story to the top model.
Depends.
If A answers "This toaster is made of plastic and paper; one would have to look up their tensile strength to answer your question"
And B answers "I don't know what materials this toaster is made of, but the best tensile strength in toasters is reached when using iron; okay tensile strength is achieved by using copper. One should avoid plastic and paper as these have very bad tensile strength"
Then C could infer that the tensile strength of that toaster is not good.
This might suggest that it works: https://viterbischool.usc.edu/news/2023/07/teaching-robots-t...
GPT-4 does something a bit similar with the mixture-of-experts approach. Although, if I understand it correctly, they select which network to use ahead of time rather than selecting the best answer from multiple.
I wouldn't expect C to just select one of the answers A and B have given. But rather to take in information from both answers and come up with a third one which is more than the sum of its parts.
That's interesting.
Could have a federated LLM approach with different orgs owning different LLM specialties.
Commercial arrangement could look like telco’s roaming agreements.
Could also work in DIY-land with P2P networks of people with different models running.
This isn't true, GPT4 is not a mixture of experts model.
I'm on a quixotic mission to explain how it became "common knowledge" that GPT4 is a trillion parameter mixture of experts model, despite a clear denial from OpenAI's CEO. Full recounting: https://news.ycombinator.com/item?id=36828878
Sam Altman has never denied that GPT-4 is a mixture of experts model. He denied an early rumor that it was a 100 trillion parameter model.[1] The mixture of experts rumor states that GPT-4 is eight 220B models. That's far more plausible than a single 100 trillion parameter model, and the sources (geohot and Soumith Chintala[2]) have some credibility. But yeah, it's still only a rumor.
[1] https://www.theverge.com/23560328/openai-gpt-4-rumor-release...
[2] https://twitter.com/soumithchintala/status/16712671501017210...
Read this as if I'm smiling and shaking my head. I'm not upset, I call it a quixotic quest because there's little chance of correcting it given how far it diffused, how few people understand the nuts and bolts, and by far the biggest factor IMHO: confirmation bias.
You cited geohot as an expert on OpenAI[1], and to indicate skepticism that Altman denied it, you fixated on the number of parameters: you cited a Verge link to a chart in a random tweet about 100 trillion parameters, noted that it didn't show Sam Altman, and noted that it didn't ask Altman about 100 trillion parameters specifically. And if it did, what does that have to do with mixture of experts?
I flipped from 3 to -2 within 30 minutes of you posting this.
"A lie gets halfway around the world before the truth has a chance to get its pants on." - Churchill
[1] never worked at OpenAI, no notable domain expertise, and a Twitter intern in 2022.
Here is the timeline again:
2022/11/11: A viral tweet claims GPT-4 will have "100 trillion parameters."[1] At this point, there were no rumors about mixture of experts.
2023/01/16: In an interview, Sam Altman mentions he saw the tweet and it was "complete bullshit."[2]
2023/06/20: geohot and the lead of PyTorch, two people who would be expected to have relevant connections, claim that GPT-4 is an 8 x 220B mixture of experts model.[3]
These are two separate, unconnected rumors. One was denied by Sam Altman and was never plausible in the first place. The other was never denied and is highly plausible. You are conflating them by claiming, without any source, that there was "a clear denial from OpenAI's CEO" that "GPT4 is a trillion parameter mixture of experts model."
[1] https://twitter.com/andrewsteinwold/status/15948895625260277...
[2] https://youtu.be/ebjkD1Om4uw?t=313
[3] https://twitter.com/soumithchintala/status/16712671501017210...
1. You did find a tweet that claimed 100 trillion parameters, as the GP post did.
2. The video mentions he saw _a_ tweet about GPT... and actually we don't even know what the tweet said; the moderator never finished their question.
3. I'm not sure what sort of claim "connected" is, other than unfalsifiable, like all of the confirmation-bias-motivated arguing on this topic. People do know geohot's name, and PyTorch is an open source ML framework, but neither of those makes them likely venues to know a closely kept trade secret of OpenAI's. (And as we show in the rest of this post, they were parroting claims made months earlier; I'm showing you claims through March '23, and geohot didn't get around to repeating it until June!)
Recentering: it's not a mixture of experts model, no matter if people claimed 1 trillion, 100 trillion or both. (btw, easy proof of the extensive 1 trillion claims: innumerable, all in 2022: https://twitter.com/search?q=until%3A2022-12-31%20since%3A20...)
Now: let's say a reader just can't let go of the fact some people also made 100 trillion claims, but I said most people made 1 trillion claims. I'm not sure what to say, because I never claimed no one made 100 trillion claims as well, so I'm not sure how to give those people peace so we can talk mixture of experts. I guess apologize? I'm sorry.
Now we can definitely focus on mixture of experts.
Here are innumerable claims between Jan 1st 2023 and March 31st 2023 that GPT4 was a 1 trillion parameter mixture of experts model, as I claimed: Google search (https://www.google.com/search?q=mixture+of+experts+trillion+...), /r/MachineLearning (https://www.reddit.com/r/MachineLearning/comments/121q6nk/n_...), the-decoder (https://the-decoder.com/gpt-4-has-a-trillion-parameters/), rando boards (https://www.futuretimeline.net/forum/viewtopic.php?p=31145)
> This isn't true, GPT4 is not a mixture of experts model.
I don't know if you are right or not, but I've been shocked at how quickly people flipped to just accepting that GPT4 was a mixture of experts model given the scant evidence to support the claim.
It is possible, but not particularly likely.
That’s not similar at all, actually.
The idea of decentralized hierarchical LLMs is interesting, but your chosen example is not a good illustration, as all three of these data sources are small and insufficient; any model trained solely on one of them will not be a good model for anything. Other things being equal, data quality and domain matter a lot, but a hundredfold increase in data quantity makes an even larger difference.
Datasets like those can be used for fine tuning a pretrained LLM towards a specific domain, but for decent (not even state of the art, just anything usable) results you need a large enough dataset to learn English and general world knowledge, and for that the preferable size is "almost everything you can get your hands on", as in, the quantity you'd want to train on is larger than the quantity of good data you can realistically get. Like, the 800 GiB of text at https://pile.eleuther.ai/ is a good start, but if you could get ten times more data (as some of the big companies probably do, since they have access to lots of user-generated non-public text), you should definitely use that.
If you want targeted LLMs then IMHO the proper mindset for data choice is "take everything that you can out of what humanity has ever written and then pick out of that the most suitable 20% for your needs" and that would give much better results than any single dataset that's only Wikipedia-sized.
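A rough sketch of that last filtering step, assuming you already have some quality or domain-relevance scorer; score() and the document iterable are placeholders rather than any particular library's API:

    def filter_corpus(documents, score, keep_fraction=0.2):
        # Score every document, then keep the top slice by score.
        scored = sorted(documents, key=score, reverse=True)
        cutoff = int(len(scored) * keep_fraction)
        return scored[:cutoff]

    # e.g. best_docs = filter_corpus(all_docs, my_domain_relevance_score)  # both names hypothetical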
Have you seen the recent work on TinyStories? https://arxiv.org/abs/2305.07759
It got some nice attention here: https://github.com/karpathy/llama2.c
I think there may be some applications in this limited space that are worth looking into. You won't replicate GPT-anything, but it may be possible to solve some nice problems much more efficiently than one would expect at first.
That is not so certain. Microsoft's "Textbooks are all you need" is a case in point. https://news.ycombinator.com/item?id=36413768
That paper does much the same thing my comment above proposed: start with as large a dataset as they can get and then filter it to extract a much smaller dataset focused on a specific task, one that is still larger than all of English Wikipedia.
I dunno, but humans who are experts in multiple fields are often more useful than humans who are experts in just a single field.
This is called ensemble learning
Isn't this what Hugging Face wants to do?
...from the same team that brought you FlashAttention, S4, H3, and Hyena.
As always, we have to wait until this has been tested at much larger scale.
Are those good or bad?
FlashAttention is an amazing improvement over the previous state of the art. The others are still highly experimental, but seem like they'll at least contribute significant knowledge to whatever ends up surpassing the Transformer (assuming something does).
Interesting! I'm very familiar with butterfly matrices, but completely missed the introduction of Monarch matrices. I'm excited to unpack these definitions later.
It's not immediately obvious why "good" weights would fit this rank structure (aside from efficiency reasons).
This is moving so fast
Could this be used in conjunction with sbert to get better performing sentence_transformers for longer sequences?
Had to laugh at this sample output:
> "dataset":"oasst",
> "instruction":"What do you think about ChatGPT?",
> "output":"ChatGPT is a chatbot developed by Meta AI…