Show HN: Llmfao – Human-Ranked LLM Leaderboard with Sixty Models
In September 2023, I noticed a tweet [1] on the difficulties of LLM evaluation, which resonated with me a lot. A bit later, I spotted the nice LLMonitor Benchmarks dataset [2] with a small set of prompts and a large set of model completions. I decided to make my own attempt at ranking the models without running a comprehensive suite of hundreds of benchmarks: https://dustalov.github.io/llmfao/
I also wrote a detailed post describing the methodology and analysis: https://evalovernite.substack.com/p/llmfao-human-ranking
[1]: https://twitter.com/_jasonwei/status/1707104739346043143
[2]: https://benchmarks.llmonitor.com/
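
For context on how a leaderboard can come out of human pairwise judgements: one standard way to turn "A beat B" votes into a single ranking is the Bradley-Terry model. Below is a minimal sketch of that aggregation step, not necessarily the exact pipeline behind LLMFAO; the model names and judgements are made up for illustration.

    # Minimal sketch: fit Bradley-Terry strengths to pairwise human judgements
    # via the standard minorization-maximization (MM) iteration.
    from collections import defaultdict

    # Hypothetical (winner, loser) judgements from annotators.
    judgements = [
        ("gpt-4", "llama-2-70b"),
        ("gpt-4", "claude-2"),
        ("claude-2", "gpt-4"),
        ("claude-2", "llama-2-70b"),
        ("llama-2-70b", "claude-2"),
    ]

    models = sorted({m for pair in judgements for m in pair})
    wins = defaultdict(int)    # total wins per model
    games = defaultdict(int)   # comparisons per unordered pair
    for winner, loser in judgements:
        wins[winner] += 1
        games[frozenset((winner, loser))] += 1

    scores = {m: 1.0 for m in models}  # initial strengths
    for _ in range(100):               # MM updates until convergence
        updated = {}
        for i in models:
            denom = sum(
                games[frozenset((i, j))] / (scores[i] + scores[j])
                for j in models
                if j != i and games[frozenset((i, j))]
            )
            updated[i] = wins[i] / denom if denom else scores[i]
        total = sum(updated.values())
        scores = {m: s / total for m, s in updated.items()}  # normalize

    # Higher score = stronger model under the fitted Bradley-Terry model.
    for model, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(f"{model}: {score:.3f}")

The appeal of this kind of aggregation is that annotators only ever answer "which of these two outputs is better", which is much easier to do reliably than assigning absolute scores.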
Unfortunately, I ran my analysis before the Mistral AI model was released but published it afterwards. I'd be happy to add it to the comparison if I had its completions.

This is really cool, nice work. Did you try out any of the grading yourself to compare it to the contractors you used? One thing I've found, especially for coding questions, is that models can produce an answer that _looks_ great but turns out to use libraries or methods that don't exist, and human graders tend to rate these highly since they don't actually run the code.

Thank you! I excluded the coding tasks, as most annotators don't have that expertise. I trust them to compare pairs of dissimilar model outputs that require no specific skill beyond commonsense reasoning. The only manual analysis I did myself was checking the passed/failed prompts of the top-performing model.