Giving GPT “Infinite” Knowledge

sudoapps.substack.com

121 points by sudoapps 3 years ago · 90 comments

furyofantares 3 years ago

Embeddings-based search is a nice improvement on search, but it's still search. Relative to ChatGPT answering on its training data, I find embeddings-based search to be severely lacking. The right comparison is to traditional search, where it becomes favorable.

It has the same advantages search has over ChatGPT (being able to cite sources, being quite unlikely to hallucinate) and it has some of the advantages ChatGPT has over search (not needing exact query) - but in my experience it's not really in the new category of information discovery that ChatGPT introduced us to.

Maybe with more context I'll change my tune, but it's very much at the whim of the context retrieval finding everything you need to answer the query. That's easy for stuff that search is already good at, and so provides a better interface for search. But it's hard for stuff that search isn't good at, because, well: it's search.

  • sudoappsOP 3 years ago

    Agreed, GPT answering based on its own training data has been the best experience by far (aside from hallucinations), and comparing against that is difficult. Embeddings might not even be the long-term solution. I think it's still too early to know for certain, but models are already getting better at interpreting with less overall training data, so there are bound to be some new ideas.

    • b33j0r 3 years ago

      I’m sure many of you have tried generating epic conversations from history. With work and luck, I’ve read stuff way better than college.

      But 90% of the time, it’s two barely distinct personalities chatting back and forth:

      Me: Hey brian, what do you think of AI?

      Brian: It’s great!

      Me: I’m so glad we agree.

      Brian: Great, this increases the training weight of Brian agreeing with Brian to a much more accurate level!

      Me: Agree!

  • b33j0r 3 years ago

    Many points stated well. Agree. Now, I’m not certain of this, but I’m starting to get an intuition that duct-taping databases to an agent isn’t going to be the answer (I still kinda feel like hundreds of agents might be).

    But these optimizations are applications of technology stacks we already know about. Sometimes, this era of AI research reminds me of all the wacky contraptions from the era before building airplanes became an engineering discipline.

    I would likely have tried building a backyard ornithopter powered by mining explosives, if I had been alive during that period of experimentation.

    Prediction: the best interfaces for this will be the ones we use for everything else as humans. I am trying to approach it more like that, and less like APIs and “document vs relational vs vector storage”.

    • chartpath 3 years ago

      I can understand why that framing would be attractive, but there's no real fundamental difference from what we already do with JSONB/HSTORE in PostgreSQL, and now we have things like pgvector https://github.com/pgvector/pgvector to store and search over embeddings (including k-NN).
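
      For anyone curious what that looks like in practice, here's a rough sketch of the pgvector flavour (assumes a local Postgres with the extension installed; the embed() helper is a hypothetical stand-in for whatever embedding model you use):

        import psycopg2

        def embed(text: str) -> list[float]:
            """Hypothetical: call your embedding model of choice here."""
            raise NotImplementedError

        def to_pgvector(vec: list[float]) -> str:
            # pgvector's text format: '[0.1,0.2,...]'
            return "[" + ",".join(str(x) for x in vec) + "]"

        conn = psycopg2.connect("dbname=mydb")
        cur = conn.cursor()
        cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
        cur.execute("""
            CREATE TABLE IF NOT EXISTS docs (
                id bigserial PRIMARY KEY,
                content text,
                embedding vector(1536)  -- must match the embedding model's dimension
            )""")

        chunk = "Postgres can store and search embeddings via pgvector."
        cur.execute("INSERT INTO docs (content, embedding) VALUES (%s, %s::vector)",
                    (chunk, to_pgvector(embed(chunk))))

        # k-NN: <-> is pgvector's L2 distance operator
        cur.execute("SELECT content FROM docs ORDER BY embedding <-> %s::vector LIMIT 5",
                    (to_pgvector(embed("how do I search embeddings in Postgres?")),))
        print(cur.fetchall())
        conn.commit()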

      • b33j0r 3 years ago

        Yep. To be clear, that’s the exact approach I’ve been pursuing.

        But then I see model context length getting longer and longer just within the transformer architecture and the training engineering going on.

        To me that’s a fundamentally different approach to AI research at this moment. It seems to keep paying off in surprising ways.

        • sudoappsOP 3 years ago

          > But then I see model context length getting longer and longer just within the transformer architecture and the training engineering going on.

          Do you have any references to this? Seems really interesting if that can be a long term approach.

          • b33j0r 3 years ago

            I’m considering the recent 64k token models as the most relevant examples.

            More anecdotally, I couldn’t get anything to say more than a sentence locally at the beginning of 2023. I can get tons of useful results today.

            Sure, this will plateau. But what if a model plateaus and it’s basically like a 10-year old?

            But like, one of those 10-year-olds you hear about who gets his master’s degree at 13. At that point they’re just browsing the internet, reading books, and probably taking notes in a way that works for them.

            Obviously this is wild speculation. Just laying out ideas that make me think in this direction.

          • sp332 3 years ago

            OpenAI is rolling out access to a 32k-context model. MosaicML just released a model trained on 65k-token inputs. https://www.mosaicml.com/blog/mpt-7b

    • sebzim4500 3 years ago

      My intuition is that it would work much better if the model could choose what to search for with something like langchain. The problem is that we don't know how to train such a system properly: we mainly do supervised fine-tuning on human examples of tool use, but this is fundamentally a reinforcement learning problem (and RL is just hard).
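
      For concreteness, the inference-time half of that idea is just a loop like the one below (a bare sketch, not the langchain API itself; llm() and search() are hypothetical stand-ins). The training question raised above is exactly what this sketch doesn't solve:

        def llm(prompt: str) -> str:
            """Hypothetical: call your chat model and return its text reply."""
            raise NotImplementedError

        def search(query: str) -> str:
            """Hypothetical: embedding or keyword search over your corpus."""
            raise NotImplementedError

        def answer(question: str, max_steps: int = 3) -> str:
            scratchpad = ""
            for _ in range(max_steps):
                reply = llm(
                    "You may reply 'SEARCH: <query>' to look something up, "
                    "or 'ANSWER: <final answer>' when you are done.\n"
                    f"Question: {question}\n{scratchpad}"
                )
                if reply.startswith("SEARCH:"):
                    query = reply[len("SEARCH:"):].strip()
                    scratchpad += f"\nResults for '{query}':\n{search(query)}\n"
                else:
                    return reply.removeprefix("ANSWER:").strip()
            # give up on tool use and force a final answer
            return llm(f"Question: {question}\n{scratchpad}\nGive your best final answer.")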

      • b33j0r 3 years ago

        That’s what I have been up to as well. I just think we’re so traumatized from the past 15 years of software engineering (kidding) that we’re stuck thinking “hey, let’s add memcached, that solved many problems like this before!”

        Natural language is weird. It’s linear in the sense that you can’t take back what you said.

        However, subjects and verbs and parentheticals and objects are like electron clouds. We only know statistically where each is going to be. And damn, you're reading this now, somehow, anyway.

        Seems like the transformer model really is a new technology. But it's just that we can actually run useful NLP neural networks now, which has opened the floodgates of innovation in various subfields.

        (Personally bothers me because absolutely no one took it seriously when I was starting my career!

        Lol. I had a meeting with my AI professor where they told me ANNs were nonsense even if we had computers 1000x more powerful. Prune the search tree! That was not very long ago).

    • kordlessagain 3 years ago

      Vector search with move-tos and move-aways based on feedback is much more than attaching a database…

  • fzliu 3 years ago

    Encoder-decoder (attention) architectures still have a tough time with long-range dependencies, so even with longer context lengths, you'll still need a retrieval solution.

    I agree that there's probably a better solution than pure embedding-based or mixed embedding/keyword search, but the "better" solution will still be based around semantics... aka embeddings.

  • mlyle 3 years ago

    > It has the same advantages search has over ChatGPT (being able to cite sources, being quite unlikely to hallucinate) and it has some of the advantages ChatGPT has over search (not needing exact query) - but in my experience it's not really in the new category of information discovery that ChatGPT introduced us to.

    I think the two could be paired up effectively. Context windows are getting bigger, but are still limited in the amount of information ChatGPT can sift through. This in turn limits the utility of current plugin based approaches.

    Letting ChatGPT ask for relevant information, and sift through it based on its internal knowledge, seems valuable. If nothing else, it allows "learning" from recent development and effectively would augment its reasoning capability by having more information in working memory.

  • stavros 3 years ago

    Is there any way to fine-tune GPT to make documentation a part of its training set, so you won't need embeddings? OpenAI lets you fine-tune GPT-3, but I don't know how well that works.

    • sudoappsOP 3 years ago

      OpenAI doesn't let you fine-tune GPT-4 or GPT-3.5 yet (https://platform.openai.com/docs/guides/fine-tuning). Fine-tuning older models on a set of documents is still an option, but it isn't really scalable if you want to keep feeding in more relevant information over time. I guess it could depend on the base model you are using and its size.
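
      For reference, the fine-tuning flow for the older completion models looked roughly like the sketch below (written against the pre-1.0 openai Python library of the time; treat the exact calls and the example data as illustrative and check the current docs). The data-prep step is also where the scalability problem shows up: every new batch of documents means regenerating pairs and re-running a job.

        import json
        import openai

        # Legacy fine-tuning expects JSONL of prompt/completion pairs, so documents
        # first have to be reshaped into question/answer-style examples.
        examples = [
            {"prompt": "What does the setup guide say about API keys?\n\n###\n\n",
             "completion": " Keys are created in the dashboard and scoped per project. END"},
        ]
        with open("train.jsonl", "w") as f:
            for ex in examples:
                f.write(json.dumps(ex) + "\n")

        uploaded = openai.File.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
        job = openai.FineTune.create(training_file=uploaded.id, model="davinci")
        print(job.id)  # poll the job; the finished model gets its own name to query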

ftxbro 3 years ago

> "Once these models achieve a high level of comprehension, training larger models with more data may not offer significant improvements (not to be mistaken with reinforcement learning through human feedback). Instead, providing LLMs with real-time, relevant data for interpretation and understanding can make them more valuable."

To me this viewpoint looks totally alien. Imagine you have been training this model to predict the next token. At first it can barely interleave vowels and consonants. Then it can start making words, then whole sentences. Then it starts unlocking every cognitive ability one by one. It begins to pass nearly every human test and certification exam and psychological test of theory of mind.

Now imagine thinking at this point "training larger models with more data may not offer significant improvements" and deciding that's why you stop scaling it. That makes absolutely no sense to me unless 1) you have no imagination or 2) you want to stop because you are scared to make superhuman intelligence or 3) you are lying to throw off competitors or regulators or other people.

  • spacephysics 3 years ago

    I don’t think we’re close to superhuman intelligence in the colloquial sense.

    ChatGPT scrapes all the information given, then predicts the next token. It has no ability to understand what is truthful or correct. It’s as good as the data being fed to it.

    To me, this is a step closer to AGI but we’re still far off. There’s a difference between “what’s statistically likely to be the next word” vs “despite this being the most likely next word, it’s actually wrong and here’s why”

    If we say, “well, we’ll tell chatgpt what the correct sources of information are” that’s no better really. It’s not reasoning, it’s just a neutered data set.

    I imagine they need to add something like the live internet access ChatGPT-4 has, or something else, to get the next meaningful bump.

    I don’t recall who said it, but in a similar thread a researcher in the field expressed that we have squeezed far more juice than expected from these transformer models. Not that no new progress in this direction can be made, but it seems like we’re approaching diminishing returns.

    I believe the next step that’s close is to have these train on less and less horsepower. If we can have these models run on a phone locally, oh boy that’s gonna be something

    • famouswaffles 3 years ago

      GPTs already forgo the surface-level statistically most likely next word for words that are more context-appropriate. That's one of the biggest reasons they are so useful.

      The truth is that functionally/technically, there's plenty left to squeeze. The bigger issue is that we're hitting a wall economically.

      • EGreg 3 years ago

        How do they do that? No one seems to have a real explanation of what OpenAI actually did to train it

        • famouswaffles 3 years ago

          It's pretty much just scale, either via dataset size or parameter count. Before GPT-4, the general SOTA model was not in fact from OpenAI (it was Flan-PaLM from Google).

          The attention in GPT-4 is a little different (probably some kind of flash attention) so that memory requirements for longer contexts are no longer quadratic. But there's nothing to suggest the intellectual gains from 4 aren't just bigger scale.

          Google could have made a 4-equivalent, I'm sure. It's not like there wasn't a road to take. We already knew 3 was severely undertrained even from a compute-optimal perspective. And then of course, you can just train on even more tokens to get them even better.

        • mindwok 3 years ago

          Information on how they trained it notwithstanding, there’s clearly more than just statistically appropriate words going on, because you can ask it to create completely new words based on rules you define and it will happily do it.

          • feanaro 3 years ago

            Well yes -- it's not words, it's tokens, which are smaller than words.

        • fnordpiglet 3 years ago

          attention is all you need

          (Well and crap tons of GPUs and training data)

    • firecall 3 years ago

      > ChatGPT scrapes all the information given, then predicts the next token. It has no ability to understand what is truthful or correct. It’s as good as the data being fed to it.

      That is precisely true of Humans as well though! :-)

  • muskmusk 3 years ago

    I agree with your general premise, but I think you left a couple of points off your list at the end:

    it is obscenely expensive to keep training, there is other lower-hanging fruit, and you expect hardware to get better over time.

    I don't think Altman is trying to fool anyone. Even if he were it wouldn't work. The competition is not that stupid and he knows that :)

    It's just that hardware tends to get better at a rate that resembles Moore's law, so in 18 months the cost of training a $100 million model is $50 million. You certainly can just throw money at the problem, but it's expensive and there are other options that are just as effective for now. Why spend money on things that are worth half as much in 18 months when you can spend it on things that don't devalue as fast, like producing more/better data?

    All that being said you can bet your ass there will be a gpt5 :)

  • tyre 3 years ago

    It's possible that training with more data has diminishing gains. For example, we know that current LLMs have a problem with hallucination, so maybe a more valuable next area of research/development is to fix that.

    Or work on consistency within a scope. For example, it can't write a novel because it doesn't have object consistency. A character will be 15 years old then 28 years old three sentences later.

    Or allow it database/API access so it can interpolate canonical information into its responses.

    None of these have to do with scale of data (as far as I understand.) All of them are, in my opinion, higher ROI areas for development for LLM => AGI.

  • HarHarVeryFunny 3 years ago

    These LLMs are trained to model humans - they are going to be penalized, not rewarded, if they generate outputs that disagree with the training data, whether due to being too dumb OR too smart.

    Best you can hope for is that they combine the expertise of all authors in the training data, which would be very impressive, but more top-tier human than super-human. However, achieving this level of performance may well be beyond what a transformer of any size can do. It may take a better architecture.

    I suspect there is also a dumbing-down effect from training the model on material from people who are themselves on a spectrum of different abilities. Simply put, the model is rewarded during training for being correct as often as possible (i.e., on average), so if it saw the same subject matter in the training set 10 times, once from an expert and the rest from mid-wits, then it's going to be rewarded for mid-wit performance.

  • sudoappsOP 3 years ago

    This wasn't meant to say that all training would stop. I think, to some extent, the model won't need additional recent data (that is already similar in structure to what it has) to better understand language and interpret the next set of characters. I could be completely wrong, but I still think techniques like transformers, RLHF and of course others will continue to exist and evolve to eventually reach some higher intelligence level.

  • nomel 3 years ago

    This assumes that current neural network topologies can "solve" intelligence. "Gains" could be a problem of missing subsystems, rather than missing data.

    For a squishy example of a known conscious system, if you scoop out certain small, relatively fixed, regions of our brains, you can make consciousness, memory, and learning mostly cease. This suggests it's partly due to special subsystems, rather than total connection count.

  • vidarh 3 years ago

    I think it's more a question of diminishing returns and the cost of scaling it up, which is getting to a point where looking for ways to maximize the impact of what is there makes sense. I'm sure we'll see models trained on more data, but maybe after efficiency improvements make it cheaper both to train and run large models.

  • joshspankit 3 years ago

    My takeaway from his statements is that if you sum up all of human knowledge then add every unique bit of knowledge that humans could uncover in the next 20 years, there’s a plateau and that plateau is probably lower than our dreams of what LLMs can do.

  • woah 3 years ago

    Maybe it gets twice as good each time you spend 10x more training it. In this case, you might indeed hit a wall at some point.

Der_Einzige 3 years ago

I get annoyed by articles like this. Yes, it's cool to educate readers who aren't aware of embeddings/embeddings stores/vectorDB technologies that this is possible.

What these articles don't touch on is what to do once you've got the most relevant documents. Do you use the whole document as context directly? Do you summarize the documents first using the LLM (now the risk of hallucination in this step is added)? What about that trick where you shrink a whole document of context down to the embedding space of a single token (which is how ChatGPT is remembering the previous conversations)? Doing that will be useful but still lossy.

What about simply asking the LLM to craft its own search prompt to the DB given the user input, rather than returning articles that semantically match the query the closest? This would also make hybrid search (keyword or BM25 + embeddings) more viable in the context of combining it with an LLM.
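
One way the hybrid part can look, for what it's worth (a sketch only: rank_bm25 is a small real library, embed() is a hypothetical stand-in for your embedding model, and the combination step is plain reciprocal-rank fusion):

  import numpy as np
  from rank_bm25 import BM25Okapi

  def embed(text: str) -> np.ndarray:
      """Hypothetical: return a unit-normalised embedding vector."""
      raise NotImplementedError

  docs = ["...chunk one...", "...chunk two...", "...chunk three..."]
  bm25 = BM25Okapi([d.split() for d in docs])
  doc_vecs = np.stack([embed(d) for d in docs])

  def hybrid_search(query: str, k: int = 3, c: int = 60) -> list[str]:
      # rank by keyword relevance and by embedding similarity separately
      kw_rank = np.argsort(-bm25.get_scores(query.split()))
      emb_rank = np.argsort(-(doc_vecs @ embed(query)))
      # reciprocal-rank fusion: documents ranked highly by either method win
      scores = np.zeros(len(docs))
      for ranking in (kw_rank, emb_rank):
          for rank, idx in enumerate(ranking):
              scores[idx] += 1.0 / (c + rank + 1)
      return [docs[i] for i in np.argsort(-scores)[:k]]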

Figuring out which of these choices to make, along with an awful lot more choices I'm likely not even thinking about right now, is what will separate the useful from the useless LLM + extractive-knowledge systems.

  • EForEndeavour 3 years ago

    > What about that trick where you shrink a whole document of context down to the embedding space of a single token (which is how ChatGPT is remembering the previous conversations)

    This is news to me. Where could I read about this trick?

  • sudoappsOP 3 years ago

    The article is definitely still high level and meant to provide enough understanding of what today's capabilities are. Some of what you are mentioning goes deeper into how you take these learnings/tools and come up with any number of solutions to fit the problem you are solving.

    > "Do you use the whole document as context directly? Do you summarize the documents first using the LLM (now the risk of hallucination in this step is added)?"

    In my opinion the best approach is to take a large document and break it down into chunks before storing as embeddings and only querying back the relevant passages (chunks).
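
    Roughly what I mean, as a sketch (embed() is a hypothetical stand-in for whatever embedding model and vector store you use):

      import numpy as np

      def embed(text: str) -> np.ndarray:
          """Hypothetical: return a unit-normalised embedding."""
          raise NotImplementedError

      def chunk(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
          # fixed-size character windows with overlap, so passages aren't cut mid-thought
          step = size - overlap
          return [text[i:i + size] for i in range(0, len(text), step)]

      def index(document: str):
          chunks = chunk(document)
          return chunks, np.stack([embed(c) for c in chunks])

      def relevant_passages(query: str, chunks: list[str], vectors: np.ndarray, k: int = 4) -> list[str]:
          sims = vectors @ embed(query)  # cosine similarity, given normalised vectors
          return [chunks[i] for i in np.argsort(-sims)[:k]]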

    > "What about that trick where you shrink a whole document of context down to the embedding space of a single token (which is how ChatGPT is remembering the previous conversations)"

    Not sure I follow here but seems interesting if possible, do you have any references?

    > "What about simply asking the LLM to craft its own search prompt to the DB given the user input, rather than returning articles that semantically match the query the closest? This would also make hybird search (keyword or bm25 + embeddings) more viable in the context of combining it with an LLM"

    This is definitely doable but just adds to the overall processing/latency (if that is a concern).

  • gaogao 3 years ago

    > What about simply asking the LLM to craft its own search prompt to the DB given the user input, rather than returning articles that semantically match the query the closest?

    I played with that approach in this post - https://friend.computer/jekyll/update/2023/04/30/wikidata-ll.... "Craft a query" is nice as it gives you a very declarative intermediate state for debugging.

orasis 3 years ago

One caveat about embedding-based retrieval is that there is no guarantee that the embedded documents will look like the query.

One trick is to have an LLM hallucinate a document based on the query, and then embed that hallucinated document. Unfortunately this increases latency, since it incurs another round trip to the LLM.
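
A sketch of the trick, to make it concrete (llm() and embed() are hypothetical stand-ins; the corpus index is whatever vector store you already have):

  import numpy as np

  def llm(prompt: str) -> str:
      """Hypothetical: call your chat/completion model."""
      raise NotImplementedError

  def embed(text: str) -> np.ndarray:
      """Hypothetical: return a unit-normalised embedding."""
      raise NotImplementedError

  def retrieve(query: str, chunks: list[str], vectors: np.ndarray, k: int = 4) -> list[str]:
      # Embed a made-up answer instead of the raw query: the fake answer tends to sit
      # closer to real answer-shaped documents in embedding space than the question does.
      fake_doc = llm(f"Write a short passage that plausibly answers: {query}")
      sims = vectors @ embed(fake_doc)
      return [chunks[i] for i in np.argsort(-sims)[:k]]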

  • taberiand 3 years ago

    Is that something easily handed off to a faster/cheaper LLM? I'm imagining something like running the main process through GPT-4 and handing off the hallucinations to GPT-3.5 Turbo.

    If you could spot the need for it while streaming a response you could possibly even have it ready ahead of time

  • d4rkp4ttern 3 years ago

    Some people packaged this rather intuitive idea, named it HyDE (Hypothetical Document Embeddings), and wrote a paper about it —

    https://arxiv.org/abs/2212.10496

    Summary —

    HyDE is a new method for creating effective zero-shot dense retrieval systems that generates hypothetical documents based on queries and encodes them using an unsupervised contrastively learned encoder to identify relevant documents. It outperforms state-of-the-art unsupervised dense retrievers and performs strongly compared to fine-tuned retrievers across various tasks and languages.

  • wasabi991011 3 years ago

    >One caveat about embedding-based retrieval is that there is no guarantee that the embedded documents will look like the query.

    Aleph Alpha provides an asymmetric embedding model which I believe is an attempt to resolve this issue (haven't looked into it much, just saw the entry in langchain's documentation)

  • rco8786 3 years ago

    > One trick is to have an LLM hallucinate a document based on the query

    I'm not following why you would want to do this? At that point, just asking the LLM without any additional context would/should produce the same (inaccurate) results.

    • BoorishBears 3 years ago

      You're not having the LLM answer from the hallucination, you're looking for the document that looks most similar to the hallucination and having it answer on that instead.

  • redskyluan 3 years ago

    I have an opposite way of doing this: I tried generating questions based on doc chunks and embedding the questions. It works perfectly!

    • ck_one 3 years ago

      How do you generate the questions and how do you make sure to not lose information?

      E.g. Today I woke up at 9 a.m., had a light breakfast and then went on a run in Golden Gate Park.

      What questions do you generate from this sentence?

      • redskyluan 3 years ago

        They generate questions like:

        Where did you go this morning? When did you wake up this morning? What did you do after breakfast? What did you do today at Golden Gate Park?

        GPT is all about probabilities, so the LLM knows what the most related answer to a doc chunk might look like.

        It works much better than embedding the whole sentence, because "When did you wake up this morning?" might not be very similar to "Today I woke up at 9 a.m., had a light breakfast and then went on a run in Golden Gate Park.".
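
        As a sketch, the pipeline being described looks something like this (llm() and embed() are hypothetical stand-ins):

          import numpy as np

          def llm(prompt: str) -> str:
              """Hypothetical: call your chat model."""
              raise NotImplementedError

          def embed(text: str) -> np.ndarray:
              """Hypothetical: return a unit-normalised embedding."""
              raise NotImplementedError

          def index_by_questions(chunks: list[str]):
              questions, owners = [], []
              for i, c in enumerate(chunks):
                  generated = llm(f"Write 3 questions this passage answers, one per line:\n{c}")
                  for q in generated.splitlines():
                      if q.strip():
                          questions.append(q.strip())
                          owners.append(i)  # remember which chunk each question came from
              return questions, owners, np.stack([embed(q) for q in questions])

          def retrieve(query: str, chunks, questions, owners, vectors, k: int = 3):
              # match the user's query against the generated questions,
              # then hand back the chunks those questions came from
              sims = vectors @ embed(query)
              return [chunks[owners[i]] for i in np.argsort(-sims)[:k]]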

        • ck_one 3 years ago

          Interesting, thank you! Have you played around with different prompts to generate these questions?

          • redskyluan 3 years ago

            That's exactly what I'm trying to do: play with prompts, generate multiple questions, cluster the questions, and pick some of the centroid questions to embed.

    • selfhoster11 3 years ago

      This sounds like a fantastic approach. I will try this with my own LLM/search-and-retrieval projects.

    • orasis 3 years ago

      Nice! Do you generate N questions so N embeddings per document or just one?

  • williamcotton 3 years ago

    “We’re gonna need a bigger boat.”

Beltiras 3 years ago

I'm working on something where I need to basically add on the order of 150,000 tokens into the knowledge base of an LLM. Finding out slowly I need to delve into training a whole ass LLM to do it. Sigh.

chartpath 3 years ago

Search query expansion: https://en.wikipedia.org/wiki/Query_expansion

We've done this in NLP and search forever. I guess even SQL query planners and other things that automatically rewrite queries might count.

It's just that now the parameters seem squishier with a prompt interface. It's almost like we need some kind of symbolic structure again.

sudoappsOP 3 years ago

If you are wondering what the latest is on giving LLMs access to large amounts of data, I think this article is a good start. This seems like a space where there will be a ton of innovation, so I'm interested to learn what else is coming.

jeffchuber 3 years ago

hi everyone, this is jeff from Chroma (mentioned in the article) - happy to answer any questions.

  • hartator 3 years ago

    Is Chroma already trained, or only trained on the supplied documents?

    I can try to make a Ruby client.

    • jeffchuber 3 years ago

      Chroma is not an LLM, it is "just" a database that you pass vectors into to search.

      A Ruby client would be great. Our FastAPI spec makes this pretty easy - it's at localhost:8000/openapi.json when the docker backend is running.

iot_devs 3 years ago

A similar idea is being developed in: https://github.com/pieroit/cheshire-cat

pbhjpbhj 3 years ago

>There is an important part of this prompt that is partially cut off from the image:

>> “If you don't know the answer, just say that you don't know, don't try to make up an answer”

//

It seems silly to make this part of the prompt rather than a separate parameter; surely we could design the response to be close to factual, then run a checker to ascertain a score for the factuality of the output?

  • sudoappsOP 3 years ago

    A lot of what prompting has turned into seems silly to me too, but it has shown to be effective (at least with GPT-4).

    • TeMPOraL 3 years ago

      Only a month or two ago I found this ridiculous, but then my mental model of GPTs shifted and I don't think it's so stupid anymore.

      Technobabble explanation: such "silly" additions are a natural way to emphasize certain dimensions of the latent space more than others, focusing the proximity search GPTs are doing.

      Working model I've been getting some good mileage out of: GPT-4 is like a 4-year-old kid that somehow managed to read half of the Internet. Sure, it kinda remembers and possibly understands a lot, but it still thinks like a 4-year-old, has about as much attention span, and you need to treat it like a kid that age.

      • selfhoster11 3 years ago

        Whenever I see a comment posted by you here on HN, I am never disappointed by the insights and/or quality within. This is definitely one of those.

        My personal mental model of GPT-4's capabilities is closer to that of an Atari 2600 - very capable, even if it only has 128 bytes of RAM. Except this time round, we can easily scale up a huge network of them (series of GPT-4 'threads') that each do a small micro-portion of the overall task, if a high degree of precision is required. When we get to a Commodore 64-magnitude AI, things will become a lot more interesting.

        • TeMPOraL 3 years ago

          Why thank you, :blush:.

          > My personal mental model of GPT-4's capabilities is closer to that of an Atari 2600 - very capable, even if it only has 128 bytes of RAM.

          Interesting. Maybe I'm falling into the anthropomorphizing trap, but I find it much more natural to compare GPT-4 to a human than to any other piece of technology.

          The comparison to a four-year old isn't arbitrary: my daughter is turning 4 in about two weeks, and I have her cognitive development over the past ~year fresh in my mind (and also have a 1.5 y.o. to compare against). The failure modes of ChatGPT (both GPT 3.5 and 4) I've seen are eerily similar to how conversation with my near 4 y.o. often go.

          Things like her having 30-60 seconds' worth of context window, and how she consciously repeats and restates the important bits, as if she knew she'll otherwise forget them in a minute. How she'll say something incorrect, and when I say it's not so, she'll immediately come back with the correct answer. How she'll execute just about anything anyone says that looks like a suggestion (except, of course, when coming from her parents), without worrying she's being "prompt injected". Etc.

          There's a lot of those, some I can't put into words - but the feeling of similarity between GPT-4 and a small kid is quite strong for me, and I think there might be something to it.

          > Except this time round, we can easily scale up a huge network of them (series of GPT-4 'threads') that each do a small micro-portion of the overall task, if a high degree of precision is required.

          Do you know any active research in this area? I briefly considered playing with this, but my back-of-the-envelope semi-educated feeling for now is that it won't scale. Specifically, as task complexity grows, the amount of results to combine will quickly exceed the context window size of the "combiner" GPT-4. Sure, you can stuff another layer on top, turning it into a tree/DAG, but eventually, I think the partial result itself will be larger than 8k, or even 32k tokens - and I feel this "eventually" will be hit rather quickly. But maybe my feelings are wrong and there is some mileage in this approach.

          • selfhoster11 3 years ago

            > Do you know any active research in this area? I briefly considered playing with this, but my back-of-the-envelope semi-educated feeling for now is that it won't scale.

            I am aware of a couple of potentially promising research directions. One formally academic called Chameleon [0], and one that's more like a grassroots organic effort that aims to build an actually functional Auto-GPT-like, called Agent-LLM [1]. I have read the Chameleon paper, and I must say I'm quite impressed with their architecture (seriously - I must say it's quite revolutionary). It added a few bits and pieces that most of the early GPT-based agents didn't have, and I have a strong intuition that these will contribute to these things actually working.

            Auto-GPT is another, relatively famous piece of work in this area. However, at least as of v0.2.2, I found it relatively underwhelming. For any online knowledge retrieval+synthesis and retrieval+usage tasks, it seemed to get stuck, but it did sort-of-kind-of OK on plain online knowledge retrieval. After having a look at the Auto-GPT source code, my intuition (yes, I know - "fuzzy feelings without a solid basis" - but I believe that this is simply due to not having an AI background to explain this with crystal-clear wording) is that the poor performance of the current version of Auto-GPT comes down to insufficient skill in prompt-chain architecture and the surprisingly low-quality, at times buggy code.

            I think Auto-GPT has some potential. I think the implementation lets down the concept, but that's just a question of refactoring the prompts and the overall code - which it seems like the upstream Github repo has been quite busy with, so I might give it another go in a couple of weeks to see how far it's moved forward.

            > Specifically, as task complexity grows, the amount of results to combine will quickly exceed the context window size of the "combiner" GPT-4. Sure, you can stuff another layer on top, turning it into a tree/DAG, but eventually, I think the partial result itself will be larger than 8k, or even 32k tokens - and I feel this "eventually" will be hit rather quickly. But maybe my feelings are wrong and there is some mileage in this approach.

            For searching the web, Auto-GPT uses an approach based on summarisation and something I'd term 'micro-agents'. For example, when Auto-GPT is searching for an answer to a particular question online, for each search result it finds, it spins up a sub-chain that gets asked a question like 'What does this page say about X?' or 'Based on the contents of this page, how can you do Y?'. Ultimately, intelligence is about lossy compression, and this is starkly exposed when it comes to LLMs because you have no choice but to lose some information.

            > I think the partial result itself will be larger than 8k, or even 32k tokens - and I feel this "eventually" will be hit rather quickly. But maybe my feelings are wrong and there is some mileage in this approach.

            The solution to that would be to synthesize output section by section, or even as an "output stream" that can be captured and/or edited outside the LLM in whole or in chunks. IMO, I do think there's some mileage to be exploited in a recursive "store, summarise, synthesise" approach, but the problem will be that of signal loss. Every time you pass a subtask to a sub-agent, or summarise the outcome of that sub-agent into your current knowledge base, some noise is introduced. It might be that the signal to noise ratio will dissipate as higher and higher order LLM chains are used - analogously to how terrible it was to use electricity or radio waves before any amplification technology became available.

            One possible avenue to explore to crack down on decreasing SNR (based on my own original research, but I can also see some people disclosing online that they are exploring the same path), is to have a second LLM in the loop, double-checking the result of the first one. This has some limitations, but I have successfully used this approach to verify that, for example, the LLM does not outright refuse to carry out a task. This is currently cost-prohibitive to do in a way that would make me personally satisfied and confident enough in the output to make it run full-auto, but I expect that increasing ability to run AI locally will make people more willing to experiment with massive layering of cooperating LLM chains that check each others' work, cooperate, and/or even repeat work using different prompts to pick the best output a la redundant avionics computers.

            [0]: https://github.com/lupantech/chameleon-llm

            [1]: https://github.com/Josh-XT/Agent-LLM
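
            The double-checking idea above, as a tiny sketch (llm_a() and llm_b() are hypothetical stand-ins for two independent model calls):

              def llm_a(prompt: str) -> str:
                  """Hypothetical: the worker model."""
                  raise NotImplementedError

              def llm_b(prompt: str) -> str:
                  """Hypothetical: an independent checker model."""
                  raise NotImplementedError

              def checked_answer(task: str, retries: int = 2) -> str:
                  draft = ""
                  for _ in range(retries + 1):
                      draft = llm_a(task)
                      verdict = llm_b(
                          f"Task: {task}\nDraft: {draft}\n"
                          "Did the draft genuinely attempt the task (not a refusal, not off-topic)? "
                          "Reply PASS or FAIL with one sentence of reasoning."
                      )
                      if verdict.strip().upper().startswith("PASS"):
                          return draft
                  return draft  # fall back to the last attempt after exhausting retries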

      • sudoappsOP 3 years ago

        Completely agree

nico 3 years ago

Can we build a model based purely on search?

The model searches until it finds an answer, including distance and resolution

Search is performed by a DB, the query then sub-queries LLMs on a tree of embeddings

Each coordinate of an embedding vector is a pair of coordinate and LLM

Like a dynamic dictionary, in which the definition for the word is an LLM trained on the word

Indexes become shortcuts to meanings that we can choose based on case and context

Does this exist already?

  • fzliu 3 years ago

    Not sure what you mean by a dynamic dictionary, but the embedding tree you mention is already freely available in Milvus via the Annoy index.

    • nico 3 years ago

      An entry in a dictionary is static text, ex:

      per·snick·et·y: placing too much emphasis on trivial or minor details; fussy. "she's very persnickety about her food"

      A dynamic entry could instead be an LLM that will answer things related to the word, e.g.:

      What is the definition of persnickety?

      How can I use it in a sentence?

      What are some notable documents that include it?

      Any famous quotes?

      So each entry is an LLM trained mostly only on that keyword/concept definition

      There are some that believe in smaller models: https://twitter.com/chai_research/status/1655649081035980802...

m3kw9 3 years ago

This is like asking GPT to summarize what it found on Google, which is basically what Bing does when you try to find stuff like hotels and other recent subjects. Not the revolution we are all expecting.

A_D_E_P_T 3 years ago

"Infinite" is a technical term with a highly specific meaning.

In this case, it can't possibly be approached. It certainly can't be attained.

Borges' Library of Babel, which represents all possible combinations of letters that can fit into a 410-page book, only contains some 25^1312000 books. And the overwhelming majority of its books are full of gibberish. The amount of "knowledge" that an LLM can learn or describe is VERY strictly bounded and strictly finite. (This is perhaps its defining characteristic.)

I know this is pedantic, but I am a philosopher of mathematics and this is a matter that's rather important to me.

  • hartator 3 years ago

    > I know this is pedantic, but I am a philosopher of mathematics and this is a matter that's rather important to me.

    I don’t think this is pedantic. Words carry a specific meaning, or what’s the point of words otherwise?

nadermx 3 years ago

I think someone did this https://github.com/pashpashpash/vault-ai

  • xtracto 3 years ago

    This looks pretty promising, will check out later. Thanks for sharing

flukeshott 3 years ago

I wonder how effectively compressed LLMs are going to become...
