Can Gemini 1.5 read all the Harry Potter books at once?
ML 101: Do not evaluate on the training data.
Yes of course it can, because they fit in the context window. But this is an awful test of the model's capabilities because it was certainly trained on these books and websites talking about the books and the HP universe.
Given that it is pretrained on the material, it would be interesting to do a differential test on in-context reinforcement. What is the recall % before reading the books and after?
I know, for instance, that GPT-4 does much better with the Python manual when we quote relevant context, even though it was trained on the Python manual. This suggests recall from pretraining is less than perfect.
Likewise, in the Harry Potter case I expect a significant difference between its background knowledge and the context enhanced trial. But I don't have intuition about the effect size we should expect! That makes it a fun experiment.
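For what it's worth, a minimal sketch of that differential test might look like the snippet below. Assumptions: `ask_model(prompt) -> str` is a hypothetical wrapper around whatever LLM API you use, and `questions` is a hand-made list of (question, expected answer) pairs about plot details; the string-match scoring is deliberately crude and an LLM judge would be fairer.

```python
# Rough sketch of a with/without-context recall comparison.
# `ask_model` and `questions` are hypothetical, see the note above.

def recall_score(ask_model, questions, context=""):
    """Fraction of questions answered correctly, optionally with the books prepended."""
    correct = 0
    for question, expected in questions:
        prompt = f"{context}\n\nQuestion: {question}\nAnswer briefly." if context else question
        answer = ask_model(prompt)
        correct += expected.lower() in answer.lower()  # crude string match; a judge model would be fairer
    return correct / len(questions)

# baseline = recall_score(ask_model, questions)                           # background knowledge only
# enhanced = recall_score(ask_model, questions, context=books_full_text)  # books in the context window
# print(f"recall without books: {baseline:.0%}, with books: {enhanced:.0%}")
```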
Not so fast. If you were evaluating the model on its ability to predict the next word in a Harry Potter book, you'd be right, because it's already seen the entire book, but that's not what's happening here.
The linked X post shows that the user asked the model to generate a graph of the characters, which was presumably a novel question. This is a legitimate test of the model's ability to understand and answer questions about the training data. Repeating the books in the prompt for emphasis makes sense, since the model probably didn't memorize all the relevant details.
The training data may not be HP itself. It may be millions of pages summarising/discussing/dissecting HP, which already contain the relationships spelled out better than in the book itself.
That's true, but the model still analyzed all that disparate information and produced a very detailed graph of the relevant relationships. If anyone can show that the graph itself was in the training data, then I would agree that it's not a good test.
> disparate information
I wouldn't call it disparate when there's about a dozen wikis each spelling it out like this: https://harrypotter.fandom.com/wiki/Severus_Snape
I'll eat my hat if multiple graphs almost exactly like this one weren't in the training data. This is like fandom 101.
The frustrating thing about all this speculation is that we don't know what was in the training data, but I think we should know that to have any meaningful discussion about it.
We should. However in this case, isn't it a bit of a stretch to assume they didn't put just about everything in the training data?
It would have been fairly trivial to AB test this where the other side is to ask the same question but without all the books in-window.
It's a novel question and impressive that Gemini was able to solve it but the tweet's author is claiming that this is because of the large context window and not because of all the Harry Potter related training data that was available to it.
The generic models definitely know a lot about Harry Potter without any additional context.
Probably 80% of my questions to ChatGPT were about Harry Potter plot and character details as my kid was reading the books. It was extremely knowledgeable about all the minutiae, probably thanks to all the online discussion more than the books themselves. It was actually the first LLM killer app for me.
That's a good point. I would describe this as a test of Gemini's ability to re-read something it's already familiar with, not a valid test of its ability to read a large new corpus.
It could have been trained on this exact picture created by a fan and uploaded to some forum. Ultimately it is impossible to know unless testing with brand new material.
I have the same problem with benchmarks that use real world tests (like SAT/LSAT/GRE or whatever else). The model got a good score, sure, but how many thousands of variations of this exact test was it trained on? How many questions did it encounter that were similar or the same?
You could modify the source material (change a name or character relationship) and see if it correctly reports the modification in the graph.
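A minimal sketch of that perturbation check, with a hypothetical `ask_model` wrapper and an arbitrary example rename; the point is only that the generated graph should reflect the edited text rather than the model's prior knowledge.

```python
# Sketch: rename a character throughout the source text, then check whether the
# relationship graph follows the edit or falls back on training data.
# `ask_model(prompt) -> str` is a hypothetical LLM wrapper; the names are just examples.

def perturb(text, original="Sirius Black", replacement="Sirius Greengrass"):
    return text.replace(original, replacement)

def graph_follows_context(ask_model, books_text):
    modified = perturb(books_text)
    graph = ask_model(
        modified + "\n\nList the character relationships in this text as 'A -- relation -- B' edges."
    )
    # A context-faithful answer uses the renamed character and never the original name.
    return "Sirius Greengrass" in graph and "Sirius Black" not in graph
```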
It seems from the replies that he tried it without the context too and didn't get answers that were as detailed. I'd really like to see the actual difference, but yeah, it would be so much more interesting to use books which aren't summarised and discussed all over the internet.
I got some interesting results by feeding Claude 3 a very sparse primer for a conlang I wrote when I was 18.
There is zero chance this is anywhere in the model's dataset, and we were able to perform basic translation to and from English.
How much of that character map is already in its training data and how much of it is actually read from the input prompt?
I’m always suspicious of these kinds of tests. It needs to be run with an unpublished book, not one of the most popular series in the 21st century.
This. Harry Potter is a terrible example; even weak models know the rough story of Harry Potter unprompted.
If you want a real test, go test it on some Japanese light novel or some Harry Potter fanfiction, and see if the model actually understands the plot details.
For reference, Opus/GPT-4 know the rough story of moderately popular light novels/mangas without any context given. They however do not precisely understand the fine-grained details of the story, like which character will win in a fight.
Japanese light novels were almost certainly in the training set, either in their original Japanese, an English translation in that Books2 pile, or in a fan translation that happened to get scraped.
That's expected, and why the model can reproduce the basic details.
But those Japanese light novels don't have millions of forum discussions and essays written about them. So it shows how well the model can recall sparse data in its training dataset, rather than recalling a dataset that basically shows up 100,000 times in different forms.
Not sure about all of the Harry Potter books, but I gave it my entire data export from ChatGPT and it handled it very well. I was able to search through it and pick up past conversations again. It was good.
I'm curious if that would work for me. How many megabytes was the export?
By way of reference, mine is currently around 7 MB.
Mine was 8-9 MB; there's an HTML file in the zip that contains them all.
What I’m not sure about is if 1.5 is truncating it.
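Here's a quick way to gauge the export in tokens rather than megabytes, to see whether it plausibly fits in a ~1M-token window. This assumes the export zip contains a conversations.json next to that HTML file, and uses the rough ~4-characters-per-token rule of thumb rather than a real tokenizer; the file path is just an example.

```python
# Rough size check on a ChatGPT data export (assumptions noted above).
import json
import zipfile

with zipfile.ZipFile("chatgpt-export.zip") as z:              # example path
    conversations = json.loads(z.read("conversations.json"))  # assumed to exist in the export

total_chars = len(json.dumps(conversations))
print(f"{len(conversations)} conversations, roughly {total_chars / 4:,.0f} tokens")
```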
You could do the needle recall test, detailed in the following article, and used in the Gemini technical report:
https://vladbogo.substack.com/p/gemini-15-unlocking-multimod...
https://arxiv.org/abs/2403.05530
In the needle test, we generate "needle" queries at each point in the context. We then graph the model's recall for each needle.
In this case we might:
1. Generate needles: iterate through your conversations one by one and generate multiple-choice questions for each conversation. You might need to break conversations down into chunks.
2. Test the haystack: given the full context, run random batches of needle queries. Run multiple batches to give good coverage of the context.
3. Visualize recall: graph the conversation # against the needle recall score for that conversation.
Let's assume Gemini is truncating your context but has perfect recall over the non-truncated portion. Then your needle graph will be ~100% for every conversation that made it into context, and then fall off a cliff at the exact point of truncation.
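A rough sketch of steps 1-3, under a couple of assumptions: `make_questions(conversation)` is a hypothetical helper that turns one conversation into a few question/answer needles, and `ask_model(prompt)` is a hypothetical call that has the full export preloaded as context, takes a batch of numbered questions, and returns one answer per line. The answer parsing is deliberately naive.

```python
# Sketch of the needle-recall procedure for a ChatGPT export.
# `make_questions` and `ask_model` are hypothetical helpers, see the note above.
import random
from collections import defaultdict

def needle_recall(ask_model, make_questions, conversations, batch_size=20):
    # 1. Generate needles: a few Q/A pairs per conversation, tagged with its index.
    needles = [(i, q, a) for i, conv in enumerate(conversations)
               for q, a in make_questions(conv)]
    random.shuffle(needles)

    # 2. Test the haystack: each call carries one random batch of needle questions,
    #    so the full context is loaded once per batch rather than once per needle.
    hits, totals = defaultdict(int), defaultdict(int)
    for start in range(0, len(needles), batch_size):
        batch = needles[start:start + batch_size]
        prompt = "\n".join(f"{n + 1}. {q}" for n, (_, q, _) in enumerate(batch))
        answers = ask_model(prompt).splitlines()          # naive: assumes one answer per line
        for (conv_idx, _, expected), answer in zip(batch, answers):
            hits[conv_idx] += expected.lower() in answer.lower()
            totals[conv_idx] += 1

    # 3. Visualize recall: conversation index vs. recall score. A cliff in this
    #    curve marks the point where the context is being truncated.
    return [hits[i] / totals[i] if totals[i] else None
            for i in range(len(conversations))]
```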
My main concern with this approach is the cost, as you have to load the entire context for each batch of needles. It's likely that testing all needles at the same time would skew the results or exceed the allowed context.
I don't know how the authors deal with this issue, and I don't know if they have published code for needle testing. But if you're interested in working on this, I'd like to collaborate. We can look at the existing solutions, and if necessary we can build a needle testing fixture for working with GPT exports. I'd also be interested in supporting more broad needle testing use cases, like books, API docs, academic papers, etc.
So, according to Gemini pricing, the call would cost approx. $11. Now, hopefully all goes to plan, the input is correct, and the result is what you wished for. If not, how many $11 calls do you need? Sure, pricing will go down, but my observation is that people just ignore the cost of context. When it's purely about the tech that's fine, but not when it's about efficiency.
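For what it's worth, the arithmetic behind that ~$11 figure, assuming a long-context input price of roughly $7 per million tokens (an assumption; check current pricing, it changes):

```python
# Back-of-the-envelope cost per call; the price per million input tokens is an
# assumed figure for the long-context tier and may be out of date.
input_tokens = 1_600_000      # ~5.7 Harry Potter books, per the original tweet
usd_per_million = 7.00        # assumed long-context input price
print(f"~${input_tokens / 1_000_000 * usd_per_million:.2f} per call")  # ~$11.20
```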
If you're a business wanting to process highly technical training material into shorter handbooks, paying $11/each is practically free.
> All the books have ~1M words (1.6M tokens). Gemini fits about 5.7 books out of 7. I used it to generate a graph of the characters and it CRUSHED it.
An LLM could read all of the books with Infini-attention (2024-04): https://news.ycombinator.com/item?id=40001626#40020560
There might be enough Harry Potter related content in its training set that it's not really "reading" the books in its context.
OK, so my next question is: what can you do with a model loaded with Harry Potter context? Answer Harry Potter trivia at a superhuman level? Write the next Harry Potter adventure?
Having used GPTs to do creative writing I can report that they are good for solving the tyranny of the blank page, but then you have to read and edit hundreds of pages of dank AI prose, which never quite aligns with your creative vision, to harvest a few nuggets of creativity. Does it end up saving any time?
Think about all the soulless PowerPoint presentations this could replace. They were never written with a creative vision in mind, but you need a ton of context for accurate information.
I can’t see how this map would be useful to anyone. While it gets some of the relationships right, it has a bunch of unneeded detail and focuses on areas not crucial to the stories.
At a surface level, LLMs wow, but when you dig into the details there are often still huge gaps in output quality for many tasks.
It would be more impressive (and cleaner, btw) if it was fed with fan-fiction books and not the original books. Then we can see what it can make out of the context and what it "borrows" from the training data.
Why fan-fiction? Well, fan-fictions are not famous enough to be included in any training corpus, I believe. But Harry Potter fan-fictions are numerous enough to test the context limit. There are also similarities to and distinctions from the originals, which require correct recall to tell them apart. That would be a good test, wouldn't it?
Why are fanfictions not famous enough to be included? There are huge archives of them online, which make for great sources of information. Archive of Our Own, for example, lists over 12 million works on their site.
It’s cheap to gather, unlikely to have any recourse, and has a huge range of quality.
Shouldn't the title be rephrased to not be clickbait?
I refuse to even read it because clickbait makes me sad, but something like "Gemini 1.5 can read all the HP books at once" would be a more appropriate title for this forum, IMO.
FWIW, I actually think this is pretty cool.
People created a map of all the Star Wars characters manually years ago. Being able to see all the characters mapped out from a story you’re interested in is pretty fun and helpful.
How can I trust a result like this without reading it myself to verify?
Answer: No (but almost)