We might be overestimating coding agent performance on SWE-Bench
Hey everyone! We recently came across an ICLR submission highlighting dataset contamination issues with SWE-Bench. After filtering out the contaminated instances, the authors saw the performance of SWE-Agent + GPT-4 drop significantly, from 12.47% to 3.97%.
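To make the contamination point concrete, here's a minimal sketch (in Python) of what recomputing a resolve rate on a filtered subset might look like. The cutoff date, the date-based filtering criterion, and the resolved_ids set are all illustrative assumptions for this sketch, not the paper's actual method:

    from datasets import load_dataset

    # Sketch: drop instances that may be contaminated, then recompute
    # the resolve rate on what remains. Date-based filtering is an
    # illustrative proxy for contamination, not the paper's method.
    ds = load_dataset("princeton-nlp/SWE-bench", split="test")

    CUTOFF = "2023-04-30"  # assumed model training cutoff, illustrative only
    resolved_ids = {"django__django-12345"}  # placeholder: IDs your agent solved

    clean = [ex for ex in ds if ex["created_at"] > CUTOFF]
    if clean:
        rate = sum(ex["instance_id"] in resolved_ids for ex in clean) / len(clean)
        print(f"Resolve rate on the post-cutoff subset: {rate:.2%}")
    else:
        print("No instances survive the cutoff; pick an earlier date.")

The point of the sketch: a headline resolve rate can shift dramatically once you restrict scoring to instances the model plausibly never saw during training.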
This led us to think more deeply about SWE-Bench as an evaluation tool. We've put together a blog post that reviews this paper and other relevant research, and shares our thoughts on additional gaps in SWE-Bench.
Blog: https://www.cgft.io/blog/swe-bench-evals
Paper: https://openreview.net/forum?id=pwIGnH2LHJ
Would love your thoughts as well! This post isn’t meant to criticize SWE-Bench; it’s still the best dataset out there for evaluating coding agents. Instead, we hope this discussion can spark ideas on how to make it even better!