Getting started with LLMs and structured data
TL;DR: If you are using RAG across structured data, consider the role of metadata. Not all the fields in your data are likely to carry semantic meaning, and embedding the ones that do not can lead to bad embeddings. Add those fields as metadata and leverage hybrid search to improve search quality. (SelfQueryRetriever from Langchain helps simplify this.)
Recently, I have heard a lot of chatter about using LLMs on top of structured data. At Neum AI, we have dealt with our fair share of this and wanted to write a short blog outlining some of our thinking and best practices, and to show developers who are new to the space the basics of getting started. Most of the blog focuses on the intersection of structured data and retrieval augmented generation (RAG), as that is the area where we have seen the most questions. I do acknowledge that there are other places where LLMs intersect with structured data, like using LLMs to help write queries, but I am sure at this point we have all seen at least one or two services that do that.
As part of the blog, I included a small code sample showing how to build a simple app that ingests CSV files and, using Langchain and Chroma, lets you query them for information.
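To make the metadata idea concrete before getting into the app itself, here is a minimal sketch in plain Python (standard library only, no Langchain or Chroma) of splitting each CSV row into text worth embedding and fields better kept as metadata. The sample data, the column names, and the `split_row` and `load_records` helpers are all hypothetical and made up for illustration; which fields count as "semantic" is an assumption you would make per dataset.

```python
import csv
import io

# Hypothetical sample data: we assume only "name" and "description"
# carry semantic meaning; "id", "price" and "category" are treated as metadata.
SAMPLE_CSV = """id,name,price,category,description
1,Trail Runner,89.99,shoes,Lightweight shoe for rocky trails
2,City Backpack,59.50,bags,Water-resistant pack with a laptop sleeve
"""


def split_row(row, text_fields, metadata_fields):
    """Return (text_to_embed, metadata) for a single CSV row."""
    # Only the semantic fields go into the text that would be embedded.
    text = "\n".join(f"{name}: {row[name]}" for name in text_fields)
    # Everything else is kept alongside the text so it can be filtered on.
    metadata = {name: row[name] for name in metadata_fields}
    return text, metadata


def load_records(csv_text, text_fields, metadata_fields):
    """Parse CSV text and split every row into (text, metadata) pairs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [split_row(row, text_fields, metadata_fields) for row in reader]


records = load_records(
    SAMPLE_CSV,
    text_fields=["name", "description"],
    metadata_fields=["id", "price", "category"],
)

for text, metadata in records:
    print(text)
    print(metadata)
```

Each pair lines up with the content and metadata of a document in a vector store: the text is what you would embed, while the metadata stays available for filtering at query time. Note that `csv.DictReader` returns every value as a string, so numeric fields such as `price` would need converting before you could run range filters on them.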