While I use the llama3.1:8b LLM in this article, I'm fairly sure that larger models like gpt-4o have the same type of limitation as I describe. The problem will just be less frequent, since larger models are trained on more data and therefore recognize more names and tokens.
In my ongoing effort to understand key figures in the Maduro regime in Venezuela, I have collected over 500 articles from trusted investigative sources. This content is not available in any Large Language Model (LLM) that I can run on my MacBook. The text is currently 10MB in size, too large to send in its entirety to my Ollama server along with my questions. As a result, I need to selectively identify the most relevant parts and send those.
I want to be able to ask questions like “How is XYZ helping the Maduro regime?” and receive meaningful answers. The key challenge here is that I want to ask questions about names or transliterations of names of people and organizations not famous enough to be included in standard LLMs.
Let's see how llama3.1:8b responds to my query. First, without any additional context (the real name is replaced with XYZ):
$ ollama run llama3.1:8b
>>> How is XYZ helping the Maduro regime?
I couldn't find any information on a person named XYZ being involved with
the Maduro regime in a significant way. It's possible that this is a private
individual or not a public figure, and I don't have enough context to provide
an accurate answer.
This is expected, since it would be unusual for such a small model to contain information about him, so I need to provide it with content from my 10MB corpus.
Retrieval-Augmented Generation (RAG) works by splitting the 10MB text into chunks and generating a vector for each chunk using an LLM like llama3.1:8b. A similar vector is created for the question, and the N closest chunks (those with the minimum distance between the vectors) are selected. This method works well for finding relevant chunks for questions like “How should I verify the business purpose of a customer?” because all the words in the query are likely included in the LLM’s training data. This means the model has a good chance of generating similar vectors for the relevant chunks.
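The nearest-neighbour step can be sketched as follows. This is a minimal illustration, not the author's actual code: the toy 3-dimensional vectors stand in for real embeddings, which in practice would come from the Ollama embeddings endpoint and have a fixed, much larger dimension.

```python
import math

def cosine_distance(a, b):
    # 1 - cosine similarity; smaller means "closer" chunks
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

def top_n_chunks(question_vec, chunk_vecs, n):
    # chunk_vecs: list of (chunk_text, vector) pairs;
    # return the n chunk texts nearest to the question vector
    ranked = sorted(chunk_vecs, key=lambda cv: cosine_distance(question_vec, cv[1]))
    return [text for text, _ in ranked[:n]]

# Toy 3-dimensional vectors; a real setup would use e.g. 4096-number
# embeddings generated by the LLM.
chunks = [
    ("chunk about Maduro", [0.9, 0.1, 0.0]),
    ("chunk about banking", [0.1, 0.9, 0.0]),
    ("chunk about aviation", [0.0, 0.2, 0.9]),
]
print(top_n_chunks([0.8, 0.2, 0.1], chunks, 2))
# → ['chunk about Maduro', 'chunk about banking']
```

The same distance function ranks every chunk against the question; a vector database just does this lookup efficiently at scale.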
I want to pause here for a second and reiterate that I'm talking about two different uses of the same LLM: one is to generate some text in response to my question, the other is to generate numeric vectors for data. I could have used two different LLMs for the two cases, but I'm using llama3.1:8b for both.
However, this approach does not work as well for names, particularly with smaller LLMs that can run on my hardware. If a name is not included in the model's training data, it will not produce similar vectors. This is because no matter the length of the content being converted into a vector, the vector always has the same fixed length (e.g., 4096 numbers). This means there is no place to account for unknown tokens or words.
For my project, I created one chunk for each paragraph, resulting in around 20,000 chunks. These chunks were embedded using llama3.1:8b and stored in a vector database.
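The chunking itself is simple; a minimal sketch of one-chunk-per-paragraph splitting (assuming paragraphs are separated by blank lines, which the author does not specify) could look like this:

```python
def paragraph_chunks(text):
    # One chunk per paragraph: split on blank lines, drop empty pieces
    return [p.strip() for p in text.split("\n\n") if p.strip()]

corpus = "First paragraph.\n\nSecond paragraph.\n\n\nThird."
print(paragraph_chunks(corpus))
# → ['First paragraph.', 'Second paragraph.', 'Third.']
```

Each resulting chunk would then be embedded with llama3.1:8b and stored, together with its text, in the vector database.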
Including the 50 closest chunks as context to my question doesn’t improve the output:
I couldn't find any information on how XYZ might be helping the Maduro regime
based on the provided text. The text appears to discuss various scandals,
individuals, and events related to corruption in Venezuela, but it does not
mention XYZ or his involvement with the Maduro regime. If you could provide
more context or information about who XYZ is and what he might be doing, I'd
be happy to try and help further.
The "nearest" chunk, according to the generated vectors:
For his part, Ramírez Carreño was the Venezuelan ambassador to the UN,
and Merentes, president of the Central Bank until December of last year
when President Nicolás Maduro made changes to his Cabinet
Not a single one of the 50 retrieved chunks mentions the individual I'm asking about. And that is why RAG struggles with people and organizations. It found chunks related to Maduro, but none about the specific individual. Without the chunks that mention him (there are 58 of them in total), the LLM has no material to work with and cannot provide a good answer.
However, there is a relatively simple fix: incorporate Named Entity Recognition (NER) into the retrieval process. In addition to retrieving the N closest chunks from the vector database, I can extract entities from the question and search for them using a conventional search engine like Lucene. Indexing the chunks in Lucene alongside the vector embeddings, at least for my 10 megabytes of text, is straightforward.
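The retrieval side of this hybrid can be sketched in a few lines. The naive substring match below merely stands in for a real Lucene query, and the merge-and-dedupe step shows how the two result lists are combined into one context; none of this is the author's actual implementation:

```python
def keyword_hits(query_terms, chunks):
    # Naive keyword search standing in for Lucene: a chunk matches
    # if it contains any query term (case-insensitive)
    hits = []
    for chunk in chunks:
        lowered = chunk.lower()
        if any(term.lower() in lowered for term in query_terms):
            hits.append(chunk)
    return hits

def hybrid_context(vector_hits, keyword_hit_list, limit_each=25):
    # Take the top chunks from each retriever, keyword hits first,
    # dropping duplicates while preserving order
    seen, combined = set(), []
    for chunk in keyword_hit_list[:limit_each] + vector_hits[:limit_each]:
        if chunk not in seen:
            seen.add(chunk)
            combined.append(chunk)
    return combined

docs = ["XYZ chaired Corpovex", "Central Bank leadership changed"]
print(keyword_hits(["XYZ"], docs))
# → ['XYZ chaired Corpovex']
```

A production setup would replace `keyword_hits` with a Lucene index query, but the merge logic stays the same.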
There are many NER tools available for extracting entities from text. I have used spaCy for much of this work, but here I use a specialized method I've developed that can recognize alternative spellings and compensate for other data-quality problems. This should make it feasible to ask similar questions based on automatically generated captions in movies and audio.
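To show the shape of the entity-extraction step without pulling in a model, here is a deliberately crude stand-in that treats runs of capitalized words as candidate names. Real NER (spaCy, or the author's custom recognizer) is far more robust; this only illustrates what gets fed into the keyword search:

```python
import re

def naive_entities(question):
    # Very rough stand-in for real NER: collect runs of capitalized
    # words as candidate entity names, skipping the sentence-initial word
    words = question.split()
    pattern = re.compile(r"^[A-Z][\w'-]*$")
    entities, current = [], []
    for i, w in enumerate(words):
        token = w.strip("?,.!\"")
        if i > 0 and pattern.match(token):
            current.append(token)
        else:
            if current:
                entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities

print(naive_entities("How is Diosdado Cabello helping the Maduro regime?"))
# → ['Diosdado Cabello', 'Maduro']
```

The extracted names become the query terms for the Lucene side of the retrieval.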
When I include the top 25 chunks from the query "XYZ" along with the top 25 chunks from the vector database as context, the answer looks quite different:
According to the text, XYZ is helping the Maduro regime by:
* Managing the purchase of Chinese Y8 aircraft for the Air Force at a price
that was not publicly understood
* Leading corporations, ministries, and embassies without impediments due
to his proximity to the hierarchy
* Serving as president of the Venezuelan Export and Import Corporation
(Veximca), which was initially responsible for supplying weapons and war
equipment for Venezuela, and later for all procurement for the Venezuelan
state. Veximca was later replaced by Corpovex, also chaired by XYZ.
* Managing the Venezuelan Foreign Trade Corporation (Corpovex), which is at
the heart of Nicolás Maduro's government's business operations and is
responsible for deciding what to buy, at what price, and from whom.
XYZ's close relationships with key officials, including Diosdado
Cabello, Nicolás Maduro, and the Rodríguez Gómez brothers, also give him
influence and power within the regime.
The simple reason is that the context now includes 25 paragraphs that mention the entities I'm asking about.
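For completeness, the final step of stitching the combined chunks and the question into one prompt for the model might look like this minimal sketch (the prompt wording is my own illustration, not the author's):

```python
def build_prompt(question, chunks):
    # Concatenate the retrieved chunks as context ahead of the question
    context = "\n\n".join(chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

prompt = build_prompt("How is XYZ helping the Maduro regime?",
                      ["XYZ chaired Corpovex.", "Corpovex handles state procurement."])
print(prompt)
```

The assembled prompt is then sent to the Ollama server, which must be given a model and context window large enough to hold the 50 chunks.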
The combination of search and vectors significantly improves the utility of llama3.1:8b running on my machine for my kinds of questions. I hope it can help others too!