Benchmarking Long-Horizon Coding Agents
AI coding agents look impressive on current coding benchmarks. But those benchmarks often optimize and test for the wrong thing.
This new research introduces SWE-EVO, a benchmark for long-horizon software evolution.
Up to 80% of software engineering effort involves maintaining and evolving legacy codebases rather than building from scratch. Current benchmarks miss this entirely. SWE-EVO reveals the gap between solving isolated issues and performing real software evolution.
Instead of single-issue fixes, agents must interpret release notes and implement comprehensive changes that span an average of 21 files, validated against test suites averaging 874 tests per instance.
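To make the setup concrete, here is a minimal, hypothetical sketch of what a SWE-EVO-style task instance and its test-suite validation could look like. The field names, the EvolutionTask class, and the validate helper are illustrative assumptions, not the paper's actual schema or harness.

# Hypothetical sketch: a long-horizon evolution task is a repo snapshot plus
# release notes, and a candidate patch is scored by running the test suite.
import subprocess
from dataclasses import dataclass, field

@dataclass
class EvolutionTask:
    repo: str                      # e.g. "scikit-learn/scikit-learn"
    base_commit: str               # codebase state before the target release
    release_notes: str             # the specification the agent must interpret
    test_files: list[str] = field(default_factory=list)  # ~874 tests on average

def validate(task: EvolutionTask, workdir: str, patch_path: str) -> bool:
    """Apply the agent's patch to the checked-out repo and run the suite."""
    subprocess.run(["git", "-C", workdir, "checkout", task.base_commit], check=True)
    subprocess.run(["git", "-C", workdir, "apply", patch_path], check=True)
    result = subprocess.run(
        ["python", "-m", "pytest", "-q", *task.test_files],
        cwd=workdir,
    )
    # An instance counts as resolved only if the full suite passes.
    return result.returncode == 0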
GPT-5 with OpenHands achieves 65% on SWE-Bench Verified but only 21% on SWE-EVO.
The authors find that current agents struggle with sustained, multi-file reasoning.
The benchmark is constructed from release notes of seven mature open-source Python projects, including scikit-learn, pydantic, and dask. Each task requires implementing changes that would normally span multiple pull requests. Gold patches average 610 lines edited across 21 files and 51 functions.
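The sketch below shows one plausible way such an instance could be mined from a project's history: pair the release notes of version N+1 with the code state at version N, and treat the diff between the two release tags as the gold patch. This is an illustrative assumption about the construction, not the authors' actual pipeline; the function and dictionary keys are hypothetical.

# Illustrative sketch: build a single evolution instance from two release tags.
import subprocess

def build_instance(repo_dir: str, old_tag: str, new_tag: str, notes: str) -> dict:
    # Gold patch: everything that changed between the two releases.
    gold_patch = subprocess.run(
        ["git", "-C", repo_dir, "diff", old_tag, new_tag],
        capture_output=True, text=True, check=True,
    ).stdout

    # Rough size statistics, analogous to the ~610 lines / 21 files reported.
    stat = subprocess.run(
        ["git", "-C", repo_dir, "diff", "--shortstat", old_tag, new_tag],
        capture_output=True, text=True, check=True,
    ).stdout.strip()

    return {
        "base_commit": old_tag,    # starting point handed to the agent
        "release_notes": notes,    # specification of the target release
        "gold_patch": gold_patch,  # reference solution, hidden from the agent
        "diff_stat": stat,
    }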
Results across 11 models reveal consistent patterns. Larger models outperform smaller variants: GPT-5 resolves 21% of tasks, versus 10% for GPT-5-mini and 4% for GPT-5-nano. The ranking mirrors SWE-Bench performance, validating SWE-EVO as a meaningful benchmark.
Failure analysis shows distinct patterns by model capability. The strongest models fail primarily on instruction following, misinterpreting nuanced release notes. Weaker models struggle with tool use and syntax errors. This indicates that SWE-EVO's difficulty stems from semantic reasoning, not interface competence.
Paper:
Learn to build effective AI agents in my academy:
