LLMFAO: Large Language Model Feedback Analysis and Optimization
This is a minimalistic large language model (LLM) leaderboard built from human and machine feedback on pairwise comparisons of model responses to a carefully selected set of 13 prompts, covering 59 different models.
When you see the outputs of two different systems for the same query, you can determine the better one with only a short instruction. I decided to give pairwise comparisons a try on the data kindly provided by the llmonitor.com team.
I asked carefully chosen crowd annotators to evaluate every pair and determine the winner; if both models performed similarly well or poorly, the pair was marked as a tie. Five different annotators evaluated every pair according to the instruction; there were 124 annotators in total. I also asked GPT-3.5 Turbo Instruct and GPT-4 to do the same using a shorter evaluation prompt, but I subjectively found human performance to be superior.
A more detailed description of this study is available at https://evalovernite.substack.com/p/llmfao-human-ranking.
The datasets and code are available at https://huggingface.co/datasets/dustalov/llmfao and https://github.com/dustalov/llmfao under open-source licenses.
The pairwise comparisons are transformed into scores using the Evalica library.
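As a rough illustration, here is a minimal sketch of such a transformation, assuming Evalica's documented `bradley_terry(xs, ys, winners)` entry point and `Winner` enum; the model names and outcomes below are purely hypothetical and are not taken from the LLMFAO data.

```python
from evalica import bradley_terry, Winner  # assumed public API, as documented by Evalica

# Hypothetical toy comparisons: each position i is a pair (xs[i], ys[i]) with an outcome.
xs = ["model-a", "model-b", "model-a"]
ys = ["model-b", "model-c", "model-c"]
winners = [Winner.X, Winner.Y, Winner.Draw]

# Aggregate the pairwise outcomes into per-model scores.
result = bradley_terry(xs, ys, winners)
print(result.scores.sort_values(ascending=False))
```

The same pattern applies to the actual comparison data: higher scores indicate models that win their pairwise comparisons more often.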