RAG vs Traditional ML


David LiCause

I’ve been diving into RAG systems recently, and it got me thinking about how building RAG systems is different from traditional ML.

One of the big challenges of traditional ML is that it’s NOT what-you-see-is-what-you-get.

If you’re a data scientist or ML engineer, you write the code to train a model, and then you also write the code to evaluate the quality of the model you just built.

This is a huge area of risk for companies that can’t afford to hire senior data science or ML talent. It’s quite easy to evaluate an ML model incorrectly and think that it’s performing significantly better (or worse) than it actually is.

This is different from other areas of engineering, where it’s much easier to measure correctness. For example, checking whether a frontend UI button opens the correct page when you click on it, or validating the data schema returned by a backend API service.

RAG systems are different from traditional ML in this way. Most RAG use cases involve conversational UIs and are more natural to sanity-check. A non-technical or semi-technical stakeholder can play around with a RAG system and get a decent heuristic for how well the system is working.

In addition, the metrics used to evaluate RAG systems at scale, like LLM-as-a-judge or a human giving a thumbs up or down on a response, feel much more like product metrics than statistical metrics. It’s easier to interpret “20% of users are giving a thumbs down” than to reason about precision, recall, and F1 score.
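To make the contrast concrete, here’s a small sketch (with made-up numbers) of the same idea: a thumbs-down rate is a single, directly interpretable product metric, while precision/recall/F1 require a labeled ground truth and a bit of statistical literacy to read. The function names and data are illustrative, not from any particular library.

```python
def thumbs_down_rate(feedback):
    """Product-style metric: share of responses rated 'down' by users."""
    return feedback.count("down") / len(feedback)

def precision_recall_f1(y_true, y_pred):
    """Statistical-style metrics over binary labels (1 = relevant/correct)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical user ratings on five responses
feedback = ["up", "up", "down", "up", "down"]
print(f"{thumbs_down_rate(feedback):.0%} of users gave a thumbs down")
```

The first number is something any stakeholder can act on immediately; the second set only means something once you understand the labeling scheme behind it.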

Creating evals for RAG apps is not easy; they present a lot of novel challenges compared to traditional ML. But I think it’s easier to get a baseline heuristic of how well the RAG product is working, and once you have eval metrics set up, they seem generally easier to interpret.
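A minimal sketch of what such an eval loop can look like. The `judge` function here is a stand-in for a real LLM-as-a-judge call (hypothetical; in practice you’d prompt an LLM through your provider’s client). It’s stubbed with a toy keyword check so the sketch runs offline.

```python
def judge(question: str, answer: str) -> bool:
    """Stub judge: in practice this would prompt an LLM to grade the answer.
    Here, a toy rule stands in so the example is self-contained."""
    return "paris" in answer.lower()

# Hypothetical eval set of (question, RAG response) pairs
eval_set = [
    ("What is the capital of France?", "The capital of France is Paris."),
    ("What is the capital of France?", "I don't know."),
]

passed = sum(judge(q, a) for q, a in eval_set)
print(f"{passed}/{len(eval_set)} responses passed the judge")
```

The output reads like a product metric ("1 of 2 responses passed"), which is exactly why these evals are easier for non-specialists to interpret than a table of statistical scores.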

I think this is very good news for startups and emerging AI teams at larger companies. If business stakeholders other than trained data scientists & ML engineers can more easily validate the quality of a RAG app, it means businesses can be more bullish in building and deploying these products.

It also means there’s less risk for businesses to put young, inexperienced, but highly motivated software developers on these projects to build RAG apps, which will hopefully expand the opportunities for people wanting to break into AI.