Ask HN: LLM and Human Coding Benchmarks?
I think LLM-only benchmarks don't give me the full picture of how well a given model will perform in my daily coding tasks.
I rarely face a problem where all the requirements are already laid out and only the implementation is missing.
Are there any LLM coding benchmarks that keep a human in the loop? That would be more helpful for me. With a large enough pool of human participants, you could average across them so that individual human skill isn't the main differentiator. I haven't heard of any such benchmark; having a human in the loop admittedly muddies the water, imo.

That being said, I have been collecting all of my own sessions to build such a dataset for optimizing my agent instructions; see the sketch below for the rough shape.
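To illustrate what I mean by "collecting sessions": I append each one as a JSON line for later analysis. A minimal Python sketch, assuming each session is a flat record (the file name, field names, and log_session helper are my own placeholders, not any standard):

    import json
    import time
    from pathlib import Path

    LOG_PATH = Path("sessions.jsonl")  # hypothetical log location

    def log_session(model, prompt, transcript, human_turns, accepted):
        """Append one coding session as a single JSON line."""
        record = {
            "ts": time.time(),           # when the session ended
            "model": model,              # which LLM served the session
            "prompt": prompt,            # the initial task description
            "transcript": transcript,    # full human/agent exchange
            "human_turns": human_turns,  # how often I had to step in
            "accepted": accepted,        # did the final diff ship?
        }
        with LOG_PATH.open("a") as f:
            f.write(json.dumps(record) + "\n")

One record per session keeps the format append-only and easy to filter later when comparing models.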