
OpenAI Says Benchmark Used to Measure AI Coding Skill Is ‘Contaminated’—Here’s Why

by admin
February 24, 2026


In brief

  • OpenAI argues that SWE-bench Verified no longer reflects real coding ability because the benchmark is allegedly contaminated.
  • It is now pushing SWE-bench Pro as a tougher replacement.
  • Top models' scores plunged from ~70% on SWE-bench Verified to ~23% on the newer benchmark.

The number that every major AI lab has been using to claim coding supremacy was just declared meaningless.

OpenAI published a post this week announcing that SWE-bench Verified, the go-to benchmark for measuring AI coding capabilities, is so riddled with flawed tests and training data leakage that it no longer tells you anything useful about whether a model can actually write software.


The benchmark works like this: Give an AI a real GitHub issue from a popular open-source Python project, ask it to fix the bug without seeing the tests, and check if its patch makes the failing tests pass without breaking anything else.
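As a rough illustration of the pass/fail rule described above, here is a minimal Python sketch. It is not the benchmark's actual harness: the `TaskResult` structure and its field names are hypothetical, chosen only to mirror the two checks in the description (the previously failing tests must now pass, and nothing that passed before may break).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Test outcomes after applying a model's patch (hypothetical structure)."""
    # Tests that failed before the fix; True means the test passes now.
    fail_to_pass: dict[str, bool]
    # Tests that passed before the fix; True means the test still passes.
    pass_to_pass: dict[str, bool]


def is_resolved(result: TaskResult) -> bool:
    """A task counts as solved only if every target test now passes
    and no previously passing test was broken."""
    if not result.fail_to_pass:
        # With no failing tests to fix, there is nothing to verify.
        return False
    fixed = all(result.fail_to_pass.values())
    unbroken = all(result.pass_to_pass.values())
    return fixed and unbroken


def resolve_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks solved, i.e. the headline score labs report."""
    if not results:
        return 0.0
    return sum(is_resolved(r) for r in results) / len(results)


if __name__ == "__main__":
    runs = [
        TaskResult({"test_bug": True}, {"test_other": True}),   # solved
        TaskResult({"test_bug": True}, {"test_other": False}),  # broke something
        TaskResult({"test_bug": False}, {"test_other": True}),  # bug not fixed
    ]
    print(f"{resolve_rate(runs):.1%}")
```

Because the score depends entirely on the hidden tests, a test that demands an unstated function name or an unrelated feature can mark a correct fix as a failure.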

OpenAI created SWE-bench Verified in August 2024 as a cleaner version of the original 2023 benchmark, recruiting 93 software engineers to filter out tasks that were impossible or poorly designed.

The cleanup worked well enough that every major lab started citing scores on it as proof of progress. When Anthropic launched Claude Opus 4 in May 2025, Decrypt reported that the model scored 72.5% on SWE-bench Verified, beating GPT-4.1’s 54.6% and Gemini 2.5 Pro’s 63.2%. It was the coding benchmark that mattered.

Since then, AI labs from America to China have cited their SWE-bench Verified scores to claim the crown for best coding model.

Image: Minimax

Now OpenAI says that race was partly a mirage. According to the report, the team audited 138 tasks that GPT-5.2 consistently failed across 64 independent runs, and had six engineers review each one. It ultimately concluded that 59.4% of those tasks are broken.

About 35.5% have tests so narrowly written that they require a specific function name never mentioned in the problem description. Another 18.8% check for features that weren’t part of the original problem at all, gathered from unrelated pull requests.
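To make those percentages concrete, the short calculation below converts them into approximate task counts. It assumes, and the article does not state this outright, that each percentage is a share of the 138 audited tasks; the counts are rounded estimates, not figures from OpenAI's report.

```python
# Assumption: every percentage is a share of the 138 audited tasks.
AUDITED_TASKS = 138

reported_shares = {
    "broken overall": 0.594,
    "overly narrow tests": 0.355,
    "tests for features outside the issue": 0.188,
}


def implied_count(share: float, total: int = AUDITED_TASKS) -> int:
    """Nearest whole number of tasks implied by a reported share."""
    return round(share * total)


for label, share in reported_shares.items():
    count = implied_count(share)
    print(f"{label}: ~{count} of {AUDITED_TASKS} ({count / AUDITED_TASKS:.1%})")
# broken overall: ~82 of 138 (59.4%)
# overly narrow tests: ~49 of 138 (35.5%)
# tests for features outside the issue: ~26 of 138 (18.8%)
```

Under that assumption, the reported figures correspond to roughly 82 broken tasks, of which about 49 have overly narrow tests and about 26 test for unrelated features.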

The contamination problem roughly works like this: SWE-bench pulls its problems from open-source repositories that most AI companies crawl when building training sets. OpenAI tested whether GPT-5.2, Claude Opus 4.5, and Gemini 3 Flash Preview had seen the benchmark’s solutions during training. All three had.

Given only a task ID and a brief hint, each model could reproduce the exact code fix from memory, including variable names and inline comments that appear nowhere in the problem description. In one case, GPT-5.2’s chain-of-thought logs showed it reasoning that a specific parameter must have been “added around Django 4.1,” a detail found only in Django’s release notes, not the task description. The model was answering a question whose answer it had already seen.
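One simple way to flag this kind of verbatim recall is to compare a model's output with the reference fix line by line. The sketch below is an illustration only, not OpenAI's methodology: the function names and the 0.9 threshold are hypothetical, and it relies on the standard library's `difflib.SequenceMatcher`.

```python
import difflib


def _normalized_lines(patch: str) -> list[str]:
    """Strip whitespace and drop blank lines so formatting noise is ignored."""
    return [line.strip() for line in patch.splitlines() if line.strip()]


def patch_similarity(generated: str, reference: str) -> float:
    """Line-level similarity in [0, 1]; 1.0 means identical after normalization."""
    gen_lines = _normalized_lines(generated)
    ref_lines = _normalized_lines(reference)
    if not gen_lines or not ref_lines:
        return 0.0
    return difflib.SequenceMatcher(None, gen_lines, ref_lines).ratio()


def looks_memorized(generated: str, reference: str, threshold: float = 0.9) -> bool:
    """Flag near-verbatim reproductions; the threshold is an arbitrary assumption."""
    return patch_similarity(generated, reference) >= threshold


if __name__ == "__main__":
    reference = "def f(x):\n    # handle None\n    return x or 0\n"
    verbatim = "def f(x):\n    # handle None\n    return x or 0\n"
    independent = "def g(y):\n    return 0 if y is None else y\n"
    print(looks_memorized(verbatim, reference))     # True
    print(looks_memorized(independent, reference))  # False
```

A high score is only a signal, since short or obvious fixes can match by chance. Matching variable names and inline comments that never appear in the task description, as in the cases described above, is much stronger evidence.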

OpenAI now recommends SWE-bench Pro, a newer benchmark from Scale AI that uses more diverse codebases and licenses that reduce training data exposure. The performance drop is jarring: models that cleared 70% on the old Verified benchmark score around 23% on SWE-bench Pro’s public split, and even less on its private tasks.

On the current public SWE-bench Verified leaderboard, OpenAI is far from the benchmark’s podium. Retiring a benchmark where you’re losing and endorsing one where everyone starts at 23% resets the scoreboard at a convenient moment and makes the competitors’ claims less impressive.

The timing matters because the much-anticipated next version of DeepSeek, a free, open-source model, is rumored to match or come extremely close to American AI models, especially on agentic and coding tasks. That model could be days away from release, and SWE-bench Verified could be a key metric for measuring its quality.

OpenAI said it’s building privately authored evaluations that won’t be released before testing, pointing to its GDPval project, where domain experts write original tasks graded by trained human reviewers.

The benchmark problem is not new, and is not unique to coding. AI labs have cycled through multiple evaluations, each useful until models were trained on them or until the tasks proved too narrow.

But what makes this case notable is that OpenAI hyped SWE-bench Verified, promoted it across model releases, and is now publicly documenting how thoroughly it has failed, including by showing its own model cheating on it.


© Copyright 2025 All Rights Reserved.
