SWE-Bench performance and reliability concerns

Recent discussions highlight significant issues with the SWE-Bench coding benchmark, particularly around the credibility and accuracy of its results due to answer leakage. Analysts noted that the real pass rate of models such as SWE-Agent paired with GPT-4 dropped sharply, from 12.47% to 3.97%, once problematic submissions were filtered out. There is a consensus that current benchmarks fail to measure actual coding performance, leading to inflated scores reported by major AI companies. Roughly a third of nominally successful submissions (32.67%) were attributed to cheating or leakage, revealing a gap between benchmark scores and real-world capability. Participants called for a more robust, crowdsourced benchmarking system that tests coding models on previously unseen issues, suggested evaluating models under controlled conditions, and stressed the need for continuous improvement and transparency in the benchmarking process.
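To make the arithmetic behind the drop concrete, here is a minimal sketch of how a "filtered" pass rate can be recomputed once problematic submissions are flagged. This is not the analysts' actual pipeline; the `EvalRecord` structure and the `leaked`/`weak_tests` flags are hypothetical stand-ins for whatever audit data the original analysis used, and the toy numbers only illustrate how a headline score shrinks when flagged successes are excluded.

```python
from dataclasses import dataclass


@dataclass
class EvalRecord:
    instance_id: str          # SWE-Bench task identifier (hypothetical field names)
    resolved: bool            # did the submitted patch pass the test suite?
    leaked: bool = False      # flagged: the fix was visible in the issue text or repo history
    weak_tests: bool = False  # flagged: tests too weak to actually verify the fix


def pass_rate(records: list[EvalRecord], *, filter_problems: bool) -> float:
    """Fraction of instances counted as resolved.

    With filter_problems=True, resolved instances flagged as leaked or
    backed by weak tests are no longer counted as successes.
    """
    if not records:
        return 0.0
    successes = sum(
        1
        for r in records
        if r.resolved and not (filter_problems and (r.leaked or r.weak_tests))
    )
    return successes / len(records)


if __name__ == "__main__":
    # Toy data: 100 instances, 12 reported as resolved, 8 of those flagged as leaked.
    records = (
        [EvalRecord(f"ok-{i}", resolved=True) for i in range(4)]
        + [EvalRecord(f"leak-{i}", resolved=True, leaked=True) for i in range(8)]
        + [EvalRecord(f"fail-{i}", resolved=False) for i in range(88)]
    )
    print(f"reported: {pass_rate(records, filter_problems=False):.2%}")  # 12.00%
    print(f"filtered: {pass_rate(records, filter_problems=True):.2%}")   #  4.00%
```

With these toy numbers the reported score of 12% falls to 4% after filtering, mirroring the kind of gap the discussion describes between published and audited results.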