SWE-bench Verified

0-100% of issues resolved

what it measures

Whether a model can resolve real GitHub issues in real Python repos: read the codebase, write a patch, and pass the hidden tests. The Verified subset is a human-checked slice of 500 tasks, each confirmed to be clearly specified and solvable.
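To make the score concrete, here is a minimal sketch of how a "percent resolved" number could be computed from per-task test outcomes. It assumes the usual SWE-bench rule: a task counts as resolved only if the tests that failed before the fix now pass (FAIL_TO_PASS) and the tests that passed before still pass (PASS_TO_PASS). The `TaskResult` class, the function names, and the instance IDs are hypothetical and only for illustration; this is not the official harness.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Outcome of running the hidden tests against one model-written patch.

    Hypothetical container for illustration; not part of the official harness.
    """
    instance_id: str
    fail_to_pass: dict[str, bool]  # test name -> passed after the patch?
    pass_to_pass: dict[str, bool]  # test name -> still passing after the patch?


def is_resolved(result: TaskResult) -> bool:
    # A task needs at least one target test, every target test must now pass,
    # and no previously passing test may have regressed.
    return (
        bool(result.fail_to_pass)
        and all(result.fail_to_pass.values())
        and all(result.pass_to_pass.values())
    )


def resolved_rate(results: list[TaskResult]) -> float:
    """Return the share of resolved tasks as a percentage in [0, 100]."""
    if not results:
        return 0.0
    resolved = sum(is_resolved(r) for r in results)
    return 100.0 * resolved / len(results)


# Made-up example data: one clean fix, one fix that breaks an existing test.
example = [
    TaskResult("demo__repo-1", {"test_bug": True}, {"test_old": True}),
    TaskResult("demo__repo-2", {"test_bug": True}, {"test_old": False}),
]
print(f"{resolved_rate(example):.1f}% resolved")
```

Note that the second task scores zero even though the patch fixes the reported bug: a fix that breaks existing behavior does not count, which is part of why the number tracks real maintenance work.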

why it matters

It is the closest public benchmark to what people actually pay models to do all day: fix bugs in code the model did not write.

the take
The first number I look at for coding. It maps closest to real work, and the gap between the top labs here is still real, not a rounding error.

Source: https://www.swebench.com
