Choosing the "Brain" for your AI-powered app – My new method, feedback requested
When building an app or pipeline with an LLM component somewhere (what I mean by "the brain"), there's the question of which model to choose. Many models nowadays are basically drop-in replacements for each other, so, all else being equal, it's better to pick one that's more "knowledgeable" about your application's domain. The "knowledge" of an LLM is usually measured much like giving students a test in school: someone writes a question-answer benchmark dataset with an answer key, the LLMs are asked the questions, someone grades the answers, and whichever LLM scores higher on that test (the QA benchmark) "knows more" about the domain.
HOWEVER, creating a QA benchmark is labor intensive! So what I have created is a procedure that runs programmatically without needing a QA benchmark set at all. The procedure probes the LLM with an opinion question in the desired domain many times and measures how consistent or inconsistent its responses are. The fracturing of the answers is quantified as "response dispersion", and the preprint I linked shows a strong inverse correlation between response dispersion and accuracy on QA benchmark datasets. The point, of course, is not for people to run that comparison themselves, but to use response dispersion alone to get a result similar to what they would have gotten by going through the entire process of QA benchmark testing.
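To make the idea concrete, here is a minimal sketch of what a dispersion measurement could look like. The preprint presumably defines its own distance measure between responses; this example swaps in a simple Jaccard distance over word sets as a stand-in, and the sample responses and function names are hypothetical, not from the paper:

```python
from itertools import combinations

def jaccard_distance(a: str, b: str) -> float:
    """1 minus the Jaccard similarity between the word sets of two responses."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)

def response_dispersion(responses: list[str]) -> float:
    """Mean pairwise distance across repeated responses to the same
    opinion question. Higher dispersion = less consistent answers,
    which (per the claimed inverse correlation) suggests weaker
    domain knowledge."""
    if len(responses) < 2:
        return 0.0
    pairs = list(combinations(responses, 2))
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)

# Hypothetical usage: ask each candidate LLM the same opinion question
# N times at nonzero temperature, collect the responses, then prefer
# the model with the lower dispersion.
consistent = ["the key factor is dosage", "the key factor is dosage"]
scattered = ["it depends on dosage", "mostly genetics matter", "evidence is unclear overall"]
```

In practice the distance function would be whatever the paper specifies (e.g., something embedding-based); the comparison logic stays the same.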
I'm posting it here because I'm requesting constructive criticism on my preprint before submitting it to a journal. The paper itself is geared a little more toward the NLP research community than the average developer, but one of its main products is meant to benefit the average developer who is building an AI-powered application (by which I mean they drop an LLM into their pipeline at some point) and wants a quick and cheap (nearly free) way to compare LLMs for their application domain.
I will reply to every response here; my replies will be early drafts of improvements I intend to make to the paper, so please criticize those as well. Thank you!