Scatter chart comparing DeepSWE score against average cost per task for frontier and open-weight coding models.

The Benchmarks Were Lying. We Already Knew.

There is a quiet frustration that runs through every forward-deployed engagement: the model that topped a leaderboard last week cannot navigate the client's codebase this week. We have learned to discount the numbers, because the numbers have not matched the work for some time. A new benchmark — DeepSWE, from Data Curve — finally puts evidence behind the instinct. It measures something close to what we actually do, and the results say what we have been telling clients for a year: the gap between state-of-the-art and open-weight models is far larger than the marketing admits.

Why the old benchmarks failed

The industry standard, SWE-bench Pro, was meant to be the realistic test. It is not. Three problems compound.

First, contamination. Its tasks are adapted from existing commits and pull requests, so the corresponding implementation, tests and discussion are already online. Models do not always solve these problems; sometimes they retrieve them. Data Curve's audit found the container even ships the repository's full .git history, and a meaningful share of "cheated" passes involved the agent running git log or git show to read the merged fix and paste it in. (DeepSWE blog)

Second, grading. SWE-bench Pro leans on an inherited test suite to verify results, and it is unreliable: Data Curve's audit measured 8.5% false positives and 24% false negatives, with an independent analyser disagreeing with the SWE-bench Pro verifier on roughly 32% of trials. A benchmark that misgrades nearly a third of its pass/fail decisions is not measuring capability with any confidence. (DeepSWE blog)
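The three figures fit together if each is read as a share of all graded trials, since 8.5% plus 24% comes to roughly the 32% disagreement rate. A minimal sketch of that bookkeeping follows. It assumes (our inference, not something the audit states) that the independent analyser's verdict is treated as ground truth, and the 200 trial records are invented purely to reproduce the quoted percentages.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    verifier_pass: bool   # verdict from the benchmark's own test suite
    reference_pass: bool  # verdict from the independent analyser (assumed ground truth)


def grading_rates(trials):
    """Return false-positive, false-negative and disagreement shares of all trials."""
    total = len(trials)
    if total == 0:
        raise ValueError("need at least one trial")
    false_pos = sum(t.verifier_pass and not t.reference_pass for t in trials)
    false_neg = sum(t.reference_pass and not t.verifier_pass for t in trials)
    return {
        "false_positive": false_pos / total,
        "false_negative": false_neg / total,
        "disagreement": (false_pos + false_neg) / total,
    }


# Hypothetical sample, sized so the shares match the figures quoted above.
trials = (
    [Trial(True, False)] * 17     # verifier passed a wrong solution
    + [Trial(False, True)] * 48   # verifier failed a correct solution
    + [Trial(True, True)] * 80    # both agree: pass
    + [Trial(False, False)] * 55  # both agree: fail
)
rates = grading_rates(trials)
print({name: round(value, 3) for name, value in rates.items()})
```

Under this reading, every disagreement is either a false positive or a false negative, which is why the two error rates sum to the disagreement rate.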

Third, and most telling, the prompts. The SWE-bench Pro template explicitly tells the model that the test files are already handled and that it should not modify the testing logic or any of the tests. No engineer briefs an agent that way. Data Curve found that this one instruction suppresses self-testing: on SWE-bench Pro, models wrote their own tests on between 3% and 28% of runs; on DeepSWE, where the prompt says nothing about tests, the same models did so on 24–85% of runs. The benchmark was, in effect, instructing models to skip verification. (DeepSWE blog)

What DeepSWE measures instead

DeepSWE was built to remove each of those failures. Its 113 tasks span 91 active open-source repositories, against SWE-bench Pro's 11, across five languages: TypeScript, Go, Python, JavaScript and Rust. Every task is written from scratch and never merged upstream, so no public solution exists to lift, and the task container ships a shallow clone with only the base commit, so there is no gold hash to find. The prompts are roughly half the length of SWE-bench Pro's, yet the reference solutions require about 5.5x as much code (a mean of 668 lines added versus 120) and edit more files. Most importantly, the verifiers are hand-written to test observable behaviour rather than implementation details, so any correct implementation passes. (DeepSWE)

The effect on grading quality is stark: DeepSWE's verifiers produced false positives 0.3% of the time and false negatives 1.1%, against 8.5% and 24% for SWE-bench Pro.

The results

The headline numbers reorder the field and widen it. Pass@1, all models run on the same minimal mini-swe-agent harness for consistency:

| Model | DeepSWE Pass@1 | SWE-bench Pro (reported) |
| --- | --- | --- |
| GPT-5.5 | 70% | 59% |
| Claude Opus 4.8 | 58% | - |
| GPT-5.4 | 56% | 58% |
| Claude Opus 4.7 | 54% | 64% |
| Claude Sonnet 4.6 | 32% | 54% |
| Gemini 3.5 Flash | 28% | - |
| Kimi K2.6 (open weight) | 24% | - |
| GLM 5.1 (open weight) | 18% | - |
| DeepSeek V4 Pro (open weight) | 8% | - |

Source: DeepSWE leaderboard, live figures.

Two things matter here. The first is the cliff. On DeepSWE the spread from worst to best is 62 points (8% to 70%); on SWE-bench Pro the same models cluster within roughly 30 points. There is finally a useful range to reason about, rather than a saturated band where adjacent models overlap inside their own error bars.

The second is the position of the open-weight models. Not one of them reaches half the score of the previous generation of frontier models. On aggregated marketing dashboards these same models appear within a few points of the frontier; in a realistic agentic test they collapse to single or low-double digits. Note the direction of the SWE-bench Pro column above, too: Opus 4.7 scores higher there (64%) than on DeepSWE (54%) — a direct artefact of contamination and a verifier that rewarded the wrong things.
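The "half the score" claim can be checked directly against the DeepSWE column. The sketch below assumes that "previous generation of frontier models" means GPT-5.4 and Claude Opus 4.7 (the models superseded by GPT-5.5 and Opus 4.8 in the table); that grouping is our reading rather than a category the leaderboard defines.

```python
# DeepSWE Pass@1 percentages copied from the results table.
deepswe_pass_at_1 = {
    "GPT-5.4": 56,
    "Claude Opus 4.7": 54,
    "Kimi K2.6": 24,
    "GLM 5.1": 18,
    "DeepSeek V4 Pro": 8,
}

# Assumed groupings, for illustration only.
previous_frontier = ["GPT-5.4", "Claude Opus 4.7"]
open_weight = ["Kimi K2.6", "GLM 5.1", "DeepSeek V4 Pro"]

# Use the weaker of the two previous-generation scores as the bar,
# which is the most generous comparison for the open-weight models.
half_of_previous = min(deepswe_pass_at_1[m] for m in previous_frontier) / 2

below_half = {m: deepswe_pass_at_1[m] < half_of_previous for m in open_weight}
print(half_of_previous, below_half)
```

Even against the more lenient bar of 27 points, the best open-weight score of 24% falls short.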

This matches the experience we have in client work. Weaker models run in loops, fail to locate the right file, and produce code that does not compile. They can edit a single file to make something prettier; they cannot navigate a real codebase and finish the job.

The cost dimension the leaderboards omit

Because DeepSWE produces trustworthy results, the secondary numbers — cost, tokens, wall-clock time — become meaningful, and they belong in a procurement conversation. Median cost per task on the live leaderboard:

  • GPT-5.5 — 70% at $6.61 per task, 47k output tokens, ~21 minutes.

  • Claude Opus 4.7 — 54% at $18.19 per task, the most expensive configuration measured.

  • Claude Opus 4.8 — 58% at $12.58, cheaper than 4.7 for a higher score: a genuine improvement in the cost curve.

  • Gemini 3.5 Flash — nominally the cheap, fast option, yet it burned 189k output tokens (four times GPT-5.5's), cost $7.42, and scored 28%.

The lesson for a portfolio company watching unit economics on an agentic workflow: "cheap model" and "cheap to run" are not the same thing, and across the board higher token burn, longer runtime and higher cost did not correlate with a higher pass rate. Spending more bought nothing reliable. (DeepSWE results)
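One way to make the distinction concrete is to divide cost per task by pass rate, giving a rough cost per solved task. This is our own derived figure, not a metric DeepSWE publishes. It assumes failed attempts cost about as much as successful ones and treats the leaderboard's per-task cost as if it were an average, so read it as a sketch rather than a precise number.

```python
# (pass rate, cost per task in USD), copied from the figures quoted above.
runs = {
    "GPT-5.5": (0.70, 6.61),
    "Claude Opus 4.8": (0.58, 12.58),
    "Claude Opus 4.7": (0.54, 18.19),
    "Gemini 3.5 Flash": (0.28, 7.42),
}


def cost_per_solved_task(pass_rate: float, cost_per_task: float) -> float:
    """Expected spend to obtain one passing result, under the assumptions above."""
    if not 0 < pass_rate <= 1:
        raise ValueError("pass_rate must be in (0, 1]")
    return cost_per_task / pass_rate


cost_per_solve = {
    model: cost_per_solved_task(rate, cost) for model, (rate, cost) in runs.items()
}

# Cheapest route to a solved task first.
ranking = sorted(cost_per_solve, key=cost_per_solve.get)
for model in ranking:
    print(f"{model}: ${cost_per_solve[model]:.2f} per solved task")
```

On these inputs the nominally cheap Gemini 3.5 Flash works out at about $26.50 per solved task, against roughly $9.44 for GPT-5.5.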

What this means for the work we do

We are not interested in benchmarks as sport. We care because our clients — particularly PE-backed businesses building agentic workflows into their cost base — are making model selection decisions on the strength of marketing that overstates open-weight parity by a wide margin. Choosing a model that scores well on a contaminated, mis-graded leaderboard and badly in your actual repository is a direct route to a stalled programme and a sceptical board.

The practical takeaways we carry into engagements:

  • Build your own corpus. Every time an agent fails on real work, record the model, the prompt, the tooling and the commit hash. That failure log becomes your private benchmark, and it will tell you more about model fit than any public leaderboard.

  • Prompt the way you brief an engineer. Describe the problem and roughly what the solution should look like; trust the agent to find the path. Fifteen-step scaffolds suppress capability and flatter weaker models.

  • Cost is a capability metric. Token burn and wall-clock time belong in the selection criteria alongside accuracy, especially at portfolio scale where the workflow runs thousands of times.

  • Treat parity claims with suspicion. Where a model sits within a few points of the frontier on a marketing chart, assume the test is the reason, not the model.
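The failure log from the first takeaway does not need infrastructure; an append-only JSON Lines file is enough to start. The sketch below is one possible shape, not a prescribed format: the `AgentFailure` fields beyond model, prompt, tooling and commit hash, the file name and the example values are all hypothetical.

```python
import json
import tempfile
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from pathlib import Path


@dataclass
class AgentFailure:
    model: str
    prompt: str
    tooling: str
    commit_hash: str
    notes: str = ""  # hypothetical extra field: what went wrong, in a sentence
    recorded_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def record_failure(log_path: Path, failure: AgentFailure) -> None:
    """Append one failure as a single JSON line."""
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(asdict(failure)) + "\n")


def load_failures(log_path: Path) -> list[dict]:
    """Read the whole log back as a list of dicts."""
    with log_path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Illustrative usage with made-up values, written to a temporary directory.
log_file = Path(tempfile.mkdtemp()) / "agent_failures.jsonl"
record_failure(
    log_file,
    AgentFailure(
        model="example-model",
        prompt="Add retry logic to the payment client",
        tooling="mini-swe-agent",
        commit_hash="0123abc",
        notes="looped on file search, never opened the right module",
    ),
)
failures = load_failures(log_file)
print(len(failures), failures[0]["model"])
```

Because each record carries the commit hash, any entry can later be replayed against a new model from the same starting point, which is what turns the log into a private benchmark.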

DeepSWE is not perfect, and Data Curve says so plainly: it runs every model through a minimal harness few engineers actually use, under-weights bug localisation and refactoring, covers only five languages, and draws solely from repositories with 500+ GitHub stars, so results may not generalise to long-tail or proprietary codebases. (Limitations) But it is the first coding benchmark in some time whose results we recognise. For once, the data matches the lived experience. That is rarer than it should be, and worth saying plainly.

*Reference: Huang, Lee, Tng and Ge, DeepSWE: Measuring frontier coding agents on original, long-horizon engineering tasks, Data Curve, 2026. Leaderboard figures cited as live at time of writing.*