## Introduction

Building with LLMs is fundamentally different from classic software development. You can’t predict the output of your system from your input, because in between the input and output sits an opaque component: a large language model (LLM). You don’t know how it will react to your input. That’s why I truly believe that in order to improve your AI apps you have to learn from your mistakes through experiments. You have to look at your inputs and outputs, see where your system goes astray and think about ways to fix it.

In this blog post I want to show you how I systematically improved an LLM-based app using error analysis. Most of my experience comes from working on an LLM-based chatbot for medical guidelines (the paper is out soon, hopefully) and my work on https://wilhelmai.mariusvach.com - an LLM app that gives you answers to radiology-related questions grounded in authoritative and trustworthy sources. Also, the works by [Hamel Husain](https://hamel.dev) and [Jason Liu](https://jxnl.co/) have been a huge inspiration for this whole approach.

### Wilhelm.ai

To understand this post better, it helps if I show you how Wilhelm.ai works and what it does, since I refer to it for examples often. This is a schema of the app:

Wilhelm.ai answers the user query by using content from radiopaedia.org (Radiopaedia is a huge open radiology encyclopaedia). The system does the following:

1. An LLM translates the user query into one or more search terms.
2. The system searches for the search terms on radiopaedia.org.
3. The best search results get picked.
4. The content of the best search results is provided to the LLM as context for answering the user query.

So it’s basically a retrieval-augmented generation (RAG) app, but with the twist of using the search functionality of a website as its retrieval mechanism.
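To make that flow concrete, here is a minimal Python sketch of the four steps above. Everything in it is a placeholder - the `llm` and `search` objects, the prompts, the naive top-3 picking - not the actual Wilhelm.ai implementation.

```python
# Minimal sketch of a Wilhelm.ai-style pipeline (illustrative only).
# `llm` is assumed to expose a simple `complete(prompt) -> str` method,
# `search` is assumed to wrap the website's search endpoint.

def answer_query(user_query: str, llm, search) -> str:
    # 1. An LLM turns the user query into one or more search terms
    search_terms = llm.complete(
        f"Turn this radiology question into 1-3 search terms for radiopaedia.org, "
        f"one per line:\n{user_query}"
    ).splitlines()

    # 2. Search radiopaedia.org for each term
    results = []
    for term in search_terms:
        results.extend(search(term))

    # 3. Pick the most promising articles (here: naively take the top hits)
    best = results[:3]

    # 4. Answer the query grounded in the retrieved article content
    context = "\n\n".join(article["content"] for article in best)
    return llm.complete(
        f"Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {user_query}"
    )
```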
But now let’s get into the actual process of error analysis.

## Process

When I say “systematically improve” I mean implementing a cyclical process for evaluating and improving your LLM system that looks like this:

1. We **evaluate** the errors of our system.
2. We make a hypothesis on how to improve our outputs. We apply those **changes**.
3. We test the hypothesis by running an **experiment**.
4. ….aaaaand we start at 1) again.

This looks an awful lot like the [“scientific method”](https://en.wikipedia.org/wiki/Scientific_method). Just as science is grounded in empirical observation, so is improving LLM systems. You cannot trust anything. You have to test everything. In my experience, the real magic lies in looking at the output and analysing why errors occur. This unlocks very obvious ways to improve your LLM system. And that’s the focus of this post.

As a side note: in the beginning the most important thing is to get started with the process. It doesn’t have to be perfect. Your experiments don’t have to be perfect. Your evaluation technique doesn’t have to be perfect. Just start. In the beginning the problems of your app will be so obvious that you don’t need fine-grained instruments to detect them. Later, when the problems become more subtle, it’s more important to have nailed all of the components of the process, so you have the instruments to detect the little signal that’s somewhere in the noise. You can refine the process later. Don’t just iterate on your app, but also on the way you evaluate it.

Okay, now let’s go through the process together (with examples from Wilhelm.ai).

## Get Your Test Queries

First, we need some data to run our experiments with. We need some test inputs. Here you have two options (that are not mutually exclusive):

1. You can use **real-world user queries**.
2. Or you can generate **synthetic queries**.

If you have real user queries, you can just pick a subset of them and use it as your test query set. If you don’t have real user queries, or not enough, you can prompt an LLM to generate test queries for you. The prompt doesn’t have to be complicated. I found that few-shot prompting works best for generating example queries: paste some existing queries into your favorite LLM and let it generate 50 more queries similar to the ones you pasted in.

I think of test queries like unit test cases: I have different test queries to test different things in my AI app. With Wilhelm.ai I struggled a lot with comparison-style queries like “How to differentiate x from y?”. So I wrote a bunch of example queries in this style myself and let ChatGPT generate more in the same style.

Another thing: ChatGPT-generated queries tend to be very clean. They don’t contain typos or bad punctuation. To simulate real user queries, I pasted a bunch of queries into ChatGPT and prompted it to “rough them up” (“What are the imaging findings of primary CNS lymphoma vs. glioblastoma?” => “cns lymphoma vs gbm”). This way I tested that my app doesn’t just handle perfectly written queries but is somewhat robust against different query forms.

## Experiment

This part is actually very simple. You take your test queries and run them through your system, while saving the outputs and any additional information you need to evaluate your system (the trace). In my concrete case that meant I took my .txt file of test queries and ran them through the main function of my app. I saved not only the user input and the final answer but also the search terms the LLM generated from the user query, as well as the Radiopaedia articles that were picked as the most fitting results.

Again, don’t complicate it here. In the beginning you don’t need complex tooling or anything. I have my test queries in a .txt file with one query per row. I save the results of my experiments to a .json file. This system doesn’t scale well, but in the beginning [do things that don’t scale](https://www.paulgraham.com/ds.html). Improve later.
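To give you an idea of how little tooling this needs, here is a rough sketch of such a runner. The `answer_query` import and the exact trace fields are assumptions based on the description above, not my actual code.

```python
import json

# Hypothetical import: the app's main function, assumed here to return the
# intermediate steps (search terms, picked articles) alongside the answer.
from app import answer_query


def run_experiment(queries_path: str = "test_queries.txt",
                   out_path: str = "experiment_01.json") -> None:
    # One test query per line in a plain .txt file
    with open(queries_path) as f:
        queries = [line.strip() for line in f if line.strip()]

    traces = []
    for query in queries:
        result = answer_query(query)
        # Save everything needed to judge the output later, not just the answer
        traces.append({
            "query": query,
            "search_terms": result["search_terms"],
            "picked_articles": result["picked_articles"],
            "answer": result["answer"],
        })

    with open(out_path, "w") as f:
        json.dump(traces, f, indent=2)
```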
## Evaluate

Here comes the fun part. You have your results from your experiment. Now it’s time to look at the data, analyze the errors and make a plan on how to eliminate those errors.

But how do I look at my data? Glad you asked. While a spreadsheet might suffice here, I advise you to build a custom data viewer that’s tailor-made to your specific use case. Every dataset looks different and every app has different intermediate data that leads to the output. It’s important to lower the barrier to looking at your data as much as possible. Otherwise you won’t do it often enough. You want to have all the information you need to evaluate your app on one screen. For Wilhelm.ai I need to look not only at the user query and the output but also at the search term the LLM generated, as well as the results coming back from the search on radiopaedia.org and which articles were picked.

I built a very simple web app for my use case using FastHTML ([https://fastht.ml](https://fastht.ml)). It’s what I know best and what I’m fastest with. But please use whatever you want. There are a lot of great Python options nowadays for building these kinds of internal tools, like [Streamlit](https://streamlit.io/), [Gradio](https://www.gradio.app/) or [Shiny for Python](https://shiny.posit.co/py/).

Below you see a screenshot of my data explorer app. It has 130 lines of code (including markup and imports). [Here is a gist](https://gist.github.com/vacmar01/db3164a15a184955a7a50486706ed58e) with the complete code.

Here is what you should display in this app:

* user input
* app output
* free-form notes
* plus any relevant intermediate information that leads to the app output

It’s just for your personal internal use. It doesn’t have to be pretty or generalizable. It just has to be more effective than working with Excel (and that’s not a high bar). The main thing is that you have a way of capturing free-form comments on your data (I tend to comment only on things I consider “errors”).

It’s important that the *evaluations are free form*. A checklist or predefined form won’t cut it here. **In the beginning you don’t know what your dimensions for evaluating your results are. That’s the task of error analysis.**

An error is every answer that doesn’t satisfy the user query - and that may be subjective. That’s why it’s so important that you work with domain experts on this. You need people that are potential users of your app. And these people have to judge your output. That’s the key point of working with LLMs: whether the output is correct or not cannot be reasoned about theoretically or formally, and the definition of correct is much more vague than with normal computer programs.

Here are some example notes I took in one iteration on Wilhelm.ai:

*“The best search result would be in position 2 here. The retrieved article about powassan virus doesn't answer the user query.”*

*“The LLM doesn't use the context for answering the query. This is a big nono. It should have said "I don't know", since the retrieved article is bad (the LLM realized it on its own). Position 2 would be better.”*

*“The search term is too specific and thus the correct result is at position 5 and not position 1. If it had used "patellar height" and not "patellar height measurement" as the search term, the correct article would be at the top.”*

I save all notes in a SQLite database ([https://github.com/AnswerDotAI/fastlite](https://github.com/AnswerDotAI/fastlite) makes this very convenient).

Once you’ve gone through your test set and annotated the errors, the challenge is to synthesize these individual observations into actionable patterns. This is where analyzing your collected notes comes in...

### Analyzing your notes

When I’m done going through my test queries and answers one by one and commenting on the errors I see, I export all of my notes and feed them into an LLM (I wrote about that before: [https://blog.mariusvach.com/posts/i-love-r1](https://blog.mariusvach.com/posts/i-love-r1)). I then task the LLM with identifying common patterns. Of course, it’s still your job as the developer to assess the LLM output critically. It may miss nuances or misinterpret the meaning of some notes. My favorite LLMs for this task are either **Deepseek R1** or **Gemini 2.5 Pro**, but you have to experiment a little for yourself here.

Here is the exact prompt I used for my last run on this in Gemini 2.5 Pro: *“\