SWE-Bench: The $500B Benchmark


This is part 1 of a multi-part deep dive into the state of LLM/AI-based coding assistants. Part 2 covers Terminal-Bench.

Explore the data: Browse all 731 SWE-Bench Pro instances in the SWE-Bench Pro Explorer.

Why should you care? I’ll give two answers:

If you don’t use LLM coding assistants, consider the economics: approximately 92% of US GDP growth in 2025 can be attributed to capital expenditures for LLM hardware, and the most prominent professional use case of LLMs is coding assistants. Claude Code is perhaps the fastest product ever to reach $1B in ARR.

If you do use LLM coding assistants, understanding how frontier models are evaluated on coding tasks will help you interpret benchmark claims and decide which models to use in which situations.

I will start this series with a brief deep-dive into the most influential benchmark for coding assistants: SWE-Bench.

What is and isn’t SWE-Bench?

SWE-Bench (SoftWare Engineering-Bench) is a benchmark for evaluating the accuracy of LLMs on complex software engineering tasks in real-world codebases.

To know what SWE-Bench is, it helps to know what SWE-Bench isn’t. Before the coding assistant hype cycle started, there were many LLM benchmarks that focused on short self-contained tasks. For example, one task from OpenAI’s HumanEval benchmark prompted the LLM to implement this function:

from typing import List


def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """
    Check if in given list of numbers, are any two numbers closer to each other
    than the given threshold.

    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """

The LLM is given one or more independent attempts to generate a solution, unit tests are run against each attempt, and if the tests pass, the LLM is considered to have solved the task.
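
For illustration, here is one way a model might complete that function, along with the kind of checks the harness then runs. Both the solution and the asserts below are my own sketch, not the official HumanEval reference solution or test suite:

from typing import List


def has_close_elements(numbers: List[float], threshold: float) -> bool:
    # Compare every pair of numbers; return True as soon as two of them
    # are closer together than the threshold.
    for i in range(len(numbers)):
        for j in range(i + 1, len(numbers)):
            if abs(numbers[i] - numbers[j]) < threshold:
                return True
    return False


# Harness-style checks: the attempt only counts as a pass if asserts like these succeed.
assert has_close_elements([1.0, 2.0, 3.0], 0.5) is False
assert has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3) is True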

The defining characteristic of these early benchmarks was that they consisted of self-contained problems. The solution did not require any context about a broader codebase.

SWE-Bench originated in 2023 as the first large-scale benchmark to go beyond self-contained problems. What SWE-Bench tests is the ability of an LLM to resolve real-world GitHub issues in real-world codebases.

The original SWE-Bench and most later variants (more on the history later) were created by scraping GitHub issues in popular repositories.

A Real SWE-Bench Problem

Let’s look at an example SWE-Bench problem from scikit-learn.

A SWE-Bench problem consists of the following primary parts:

  • Issue text: The title and body of a real GitHub issue.
  • Codebase: The entire codebase for the repository at the base commit (before the issue was fixed).
  • Fail-To-Pass Tests: A set of tests related to the issue that fail in the original codebase, and must be passed for the issue to be resolved. These tests check that code changes resolved the issue.
  • Pass-To-Pass Tests: A set of tests unrelated to the issue that pass in the original codebase, and must be passed after the issue is resolved. These tests check that code changes did not break functionality unrelated to the issue.
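
Concretely, each of these parts is packaged into a single benchmark record. The sketch below is loosely modeled on the public SWE-Bench dataset schema; the instance ID, test names, and exact field names are illustrative rather than authoritative.

# Rough sketch of one SWE-Bench instance record (illustrative, not the exact schema).
instance = {
    "instance_id": "scikit-learn__scikit-learn-12345",  # hypothetical ID
    "repo": "scikit-learn/scikit-learn",
    "base_commit": "abc1234",  # commit before the issue was fixed
    "problem_statement": "Issue title and body go here ...",
    "FAIL_TO_PASS": ["sklearn/tests/test_example.py::test_new_behavior"],       # hypothetical test IDs
    "PASS_TO_PASS": ["sklearn/tests/test_example.py::test_existing_behavior"],  # hypothetical test IDs
    "patch": "...",       # the gold solution diff (hidden from the assistant)
    "test_patch": "...",  # diff that adds the Fail-To-Pass tests (also hidden)
}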

Typically, SWE-Bench tests are run in containers. The entire repo at the base commit is cloned, and the coding assistant is prompted with the issue text and given full access to the repo at the base commit. The coding assistant runs and generates a potential solution, which is verified against both the Fail-To-Pass and Pass-To-Pass tests. If all tests pass, the problem is considered solved. Importantly, the coding assistant should NOT have access to the Fail-To-Pass tests, to avoid hacking a solution that narrowly satisfies the tests.
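
As a minimal sketch (not the official SWE-Bench harness, which builds per-instance container images and environments), the verification step boils down to applying the assistant's patch and running both test sets, assuming the repo is already cloned and its dependencies installed:

import subprocess


def evaluate(instance: dict, model_patch: str, repo_dir: str = "/workspace/repo") -> bool:
    # Reset the repo to the base commit (assumes the repo is already cloned).
    subprocess.run(["git", "checkout", instance["base_commit"]], cwd=repo_dir, check=True)
    # Apply the diff produced by the coding assistant.
    subprocess.run(["git", "apply", "-"], cwd=repo_dir, input=model_patch, text=True, check=True)
    # Apply the instance's test_patch so the Fail-To-Pass tests exist in the working tree.
    subprocess.run(["git", "apply", "-"], cwd=repo_dir, input=instance["test_patch"], text=True, check=True)
    # Run Fail-To-Pass and Pass-To-Pass tests; the instance counts as solved
    # only if every test passes.
    tests = instance["FAIL_TO_PASS"] + instance["PASS_TO_PASS"]
    result = subprocess.run(["python", "-m", "pytest", *tests], cwd=repo_dir)
    return result.returncode == 0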

The SWE-Bench scores you see frontier labs report are the percentage of SWE-Bench problems solved by the LLM.

The History

SWE-Bench started as a single benchmark in late 2023, but is now a name for a family of benchmarks. A lot of confusion in understanding coding assistant capabilities comes from confusion about SWE-Bench variants. The timeline below illustrates the history and evolution of SWE-Bench.

Where we Stand

Most frontier labs now report accuracy on SWE-Bench Verified. A few things to emphasize about this when you look at benchmark figures:

  • SWE-Bench Verified consists of public GitHub issues from public repos (with public PRs that solved the issue!). There is a very good chance that significant chunks of SWE-Bench Verified are in the pre-training corpus of most LLMs (even accidentally).
  • SWE-Bench Verified is limited to Python problems with relatively constrained solutions (median problem requires less than 10 lines of code to be changed).
  • All of the SWE-Bench repositories are public, open-source repositories with very high code quality and documentation.

Some people see a chart like the one below and think “80% of software engineering is solved!”.

A more grounded interpretation is “80% of small, constrained issues in a typical large open-source Python repository can be solved”. No small feat by any means, but still a far cry from an automated SWE system, and even this interpretation is called into question by the SWE-Bench Verified problems likely being in the pre-training corpus of frontier models.

[Chart: SWE-Bench Verified accuracy (%) for frontier models (Opus 4.5, Sonnet 4.5, Opus 4.1, Gemini 3 Pro, GPT-5.1-Codex-Max, GPT-5.1), according to data from Anthropic.]

SWE-Bench Pro is certainly a more reliable estimate of real software engineering capabilities than previous variants.

It consists of issues where the solution requires more code changes, in larger codebases with multiple languages, and where the codebase is either GPL or fully private (so labs should not be training on it). Here is where we stand with regard to that benchmark as of late 2025.

[Chart: SWE-Bench Pro accuracy (%) for frontier models (GPT-5.2 Codex, GPT-5.2 Thinking, Claude Opus 4.5, GPT-5.1 Thinking, Gemini 3 Pro; OpenAI models evaluated at maximum reasoning effort), according to data from OpenAI.]

Where are we Going?

Sometimes I forget how fast things are moving. gpt-4o was released in May of 2024 and scores about 5% on SWE-Bench Pro. claude-4-sonnet was released in May of 2025 and scores about 18% on SWE-Bench Pro.

gpt-5.2 and claude-4.5-opus were released in November and December of 2025, respectively. Both of these models now comfortably exceed 50% on SWE-Bench Pro.

I’m really not sure where we are going, but we are definitely going there quickly.

Explore the data: Browse all 731 SWE-Bench Pro instances in the SWE-Bench Pro Explorer.


Next: Terminal-Bench: Agents in Open-Ended Environments