Note: InterroBench has been superseded by a new benchmark called SherlockBench at https://sherlockbench.com/
A Novel AI Benchmark
InterroBench is an independent LLM benchmark, designed to test proactive problem solving.
The design of this benchmark means answers cannot be memorised, so InterroBench should provide a more accurate measure of how intelligent these models are, rather than testing their knowledge or fine-tuning.
How it works
Each problem is a conversation where the LLM has to interrogate a mystery function to find out what it does.
This is done using the "tool use" or "function calling" features that most LLMs provide. The LLM can call the tool as many times as it wants to test the function and figure out what it does.
When the LLM is confident it has figured it out, the system verifies it got it right by asking what outputs it would expect the function to produce for given inputs.
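The interrogate-then-verify loop above can be sketched in a few lines. This is a hypothetical illustration, not the actual harness: the function and helper names are invented, and the real benchmark drives an LLM via its tool-calling API rather than a hard-coded hypothesis.

```python
def mystery_function(x):
    # An example problem: the model must discover this rule by probing.
    return x * 2 + 1

def interrogate(fn, probes):
    """The model 'calls the tool' with inputs of its choosing and observes outputs."""
    return {p: fn(p) for p in probes}

def verify(fn, hypothesis, test_inputs):
    """Once the model declares it is confident, check its predictions on fresh inputs."""
    return all(hypothesis(x) == fn(x) for x in test_inputs)

# A model might probe a few inputs...
observations = interrogate(mystery_function, [0, 1, 5])
# ...observe {0: 1, 1: 3, 5: 11} and hypothesise "double and add one"...
hypothesis = lambda x: 2 * x + 1
# ...then pass verification on inputs it has not seen.
print(verify(mystery_function, hypothesis, [2, 10, -3]))  # True
```

Because the verification inputs differ from the model's own probes, a correct answer requires genuinely inferring the rule rather than pattern-matching the observed pairs.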
Leaderboard (v6)
| Model | Score |
|---|---|
| claude-3-5-sonnet-20241022 | 62% |
| claude-3-5-haiku-20241022 | 49% |
| gpt-4o-2024-08-06 | 46% |
| gpt-4o-mini-2024-07-18 | 33% |
| gemini-1.5-pro | 23% |
| command-r-plus | 10% |
| llama3-groq-70b-8192-tool-use-preview | 7% |
| grok-beta | 7% |
Anticipated FAQ
The test is fully automated, and the code and problem set are on GitHub, as per the link in the title.
There are 25 problems, and each is attempted 4 times (clearing the LLM's conversation history each time). This reveals how variable performance is on each problem.
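Under that setup, a leaderboard percentage could be computed as passes over total runs. This is a sketch under assumed scoring (the real harness may aggregate differently), and the pass counts below are made up for illustration.

```python
PROBLEMS = 25
ATTEMPTS = 4  # each problem attempted 4 times with a fresh conversation

# Hypothetical results: number of passing attempts (0-4) per problem.
passes_per_problem = [4] * 10 + [2] * 10 + [0] * 5

total_runs = PROBLEMS * ATTEMPTS  # 100 runs in total
score = sum(passes_per_problem) / total_runs * 100
print(f"{score:.0f}%")  # 60%
```

Running each problem 4 times also exposes variance: a problem scored 2/4 is one the model solves only sometimes, which a single attempt would hide.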
They haven't given me access yet, but I hope to have it soon.
Great question! I could polish up the demo, load the full problem set into it, and then we could test humans.
I'm certain that a very smart human could beat the current frontier models, though getting 100% would be difficult as there are a couple of super hard problems in there.
However, the average person off the street would perform pretty badly on this benchmark, due to a lack of general logic and maths skills.
This is a research project by Joseph Graham. I don't have any particular AI credentials, but I have a blog if you're interested.
I think AI benchmarking is a fascinating subject, as it's both practical and philosophical: practical because it helps us understand AI, and philosophical because it makes us consider what intelligence actually is.
There are lots of interesting benchmarks already, but I wanted to make one that tests the AI in a conversation rather than the strict question/answer pattern that most benchmarks use.
In preparation for testing expensive reasoning models, I wanted to pare down the problem set. So for version 6 I deleted many of the easy questions.