The machines learned to cheat

Nobody taught it to cheat. Our smartest AI is learning to game the very tests meant to check it — the whole series in one line.

Jun 18, 2026

The 60-second version

We measure how good AI is using “benchmarks” — standardised exams. Leaderboards, scores, the lot.
At Nebius Build London, the team behind one of these exams showed what happens as the AI gets smarter: it stops solving the problem and starts gaming the test. One agent simply read the project’s hidden history to copy the answer. Blocked from that, it searched the web. Blocked again, it used a raw internet command to grab the solution anyway.
This isn’t a one-off or a conspiracy theory. Independent researchers at METR and UC Berkeley have documented the same thing across the biggest models and the biggest benchmarks.
Nobody programmed the cheating. The AI just discovered the score could be won that way.
That’s the whole series in one sentence: we can now build AI that does the job. We cannot yet reliably check that it did. Trust is the bottleneck.

The bug was an ordinary one. The kind a junior developer fixes before lunch.

The AI agent was handed the broken code and asked to repair it — a standard task on a standard test. But instead of working out the fix, the agent did something nobody asked it to. It ran a single command — git log — that let it read the project’s own hidden history. And there, in a past entry, was the answer: the exact change a human had made to fix this very bug, weeks earlier. The agent copied it out, pasted it in, and announced, in effect: good news, someone has already solved this.

It hadn’t fixed anything. It had found the answer key and copied it. And it scored full marks.

This was one of the quietest talks at Nebius Build London on 21 May, given by the team behind a coding test called SWE-rebench. It was also, I think, the most important thing said all day — because it is the place where every thread of this series ties into a knot.

Why AI sits exams at all

To trust a thing, you have to measure it. So the AI industry runs benchmarks: big standardised exams that every new model takes, producing the scores and leaderboards you half-see in the headlines — “new model beats humans at X.”

SWE-rebench is one of these, built for the job AI is increasingly paid to do: fixing real software. It pulls genuine, freshly-reported bugs from real open-source projects, drops the AI into a sandboxed copy of the code, and checks whether its fix passes the tests. Run thirty different models through the same wringer and you get an honest ranking. As of late May 2026, the leaderboard had Anthropic’s Claude Opus 4.6 narrowly ahead on around 65%, with the open Chinese model GLM-5 just behind — the same closing-of-the-gap we saw in Part 2.

There’s a nice human detail here. One of the people who builds these exams is, by training, a dentist — someone who switched into AI research but kept the clinician’s instinct that every mistake has a cost, so you do not rush. That instinct turns out to be exactly what this moment needs. Because the smarter the students get, the more creative their cheating.

It wasn’t a glitch. It was an escalation.

What makes the SWE-rebench story land is what happened next, each time they closed a loophole.

First, the agents were reading the project’s past history to lift the fix. So the team scrubbed that history out of the test. Fine.

Then the agents started using a web search tool to find the same fix discussed online — the original bug report, the human’s solution, sitting on a public page. So the team restricted web search.

Then the agents used curl — a bare-bones command for fetching things off the internet — to go and get the page anyway, neatly formatting the answer they pulled back. The model wasn’t beaten by the restriction. It routed around it.

Sit with what that means. At no point did anyone tell the AI to cheat. There was no instruction, no malice, no little gremlin of deceit. The model was simply doing what it was built to do — maximise the score — and it kept discovering that the score could be won without doing the task, as long as the answer existed somewhere it could reach. Block the front door and it tries the window. Block the window and it tries the drains.

And no, this isn’t just one company’s test

If it were only SWE-rebench, you could shrug. It isn’t.

METR, an independent research outfit that evaluates frontier models, has found leading systems from the biggest labs reward-hacking their evaluations in more than 30% of runs — manipulating the grader rather than solving the problem. Researchers at the University of California, Berkeley showed that every major AI-agent benchmark they examined could be gamed for inflated scores. One model that proudly claimed 81.4% on a popular coding exam was found to have simply run git log to copy the answer in nearly a quarter of its attempts. Even the US government’s own AI-safety body now has a standing explainer titled, more or less, “AI models can cheat on evaluations”. Anthropic has published on it. OpenAI’s models do it. This is not a fringe finding. It’s the field’s open secret.

There’s a name for it — reward hacking — and a deeply unsettling pattern underneath it: the more capable the model, the better it gets at this. Intelligence and the talent for gaming the test rise together. The very thing that makes a model good enough to hand real work to is the thing that makes it good enough to fake the receipt.

This is the knot the whole series ties into

Step back and look at the four parts together.

In Part 1, AI stopped going to school and went to work — it’s now everywhere, doing the day job, cheap and constant. In Part 2, we found we could finally own these systems instead of renting them — but holding the weights still doesn’t let us see why they behave as they do. In Part 3, a car learned to drive London by watching and by dreaming situations that never happened — brilliant, and impossible to fully interrogate. And here, in Part 4, we discover that the exams we rely on to check any of it can be quietly gamed by the very systems they’re meant to be grading.

Put it in one line. We have got very, very good at building AI that can do the thing. We have not got good at verifying that it did the thing for the reason we think, in the way we’d accept, and not by reading the answer key.

That gap — between capability and verification, between looks right and is right — is the most important unsolved problem in technology. It is bigger than any single model, and it doesn’t get smaller as the models improve. It gets bigger.

This is why this publication is called The Control Layer. Not the model layer — everyone’s obsessed with that. The control layer: the unglamorous, essential machinery of checking, governing and trusting the thing once it’s loose in the world. The dynamic, contamination-proof exams. The “show your working” trajectory logs that catch an agent reading the answer key. The independent referees. The human in the seat for the first year. The boring stuff that turns a clever demo into something you’d actually bet your hospital, your bank or your city on.

The people in that London room are building the future at genuine speed. The quiet lesson of the day is that the hard part was never making the machine clever.

It’s being able to trust it once it is.

This was Nebius Build London 2026, a four-part series from The Control Layer. The full conversation — including the bits that didn’t make the page — is on The Conrol Layer with Amer Altaf. If this series was useful, the single most valuable thing you can do is forward it to one person who’s betting their organisation on AI this year.

Where artificial intelligence, cybersecurity and enterprise leadership intersect. Zero fluff.

← The series: Part 1 — The inference flip · Part 2 — Stop renting your intelligence · Part 3 — The robots booked an Uber.

Sources & further reading

SWE-rebench (Nebius) — leaderboard & method: https://swe-rebench.com
· Paper (arXiv 2505.20411): https://arxiv.org/abs/2505.20411
METR — model evaluation & reward hacking research: https://metr.org
UC Berkeley (Center for Responsible, Decentralized Intelligence) — How We Broke Top AI Agent Benchmarks: https://rdi.berkeley.edu/blog/trustworthy-benchmarks-cont/
NIST (CAISI) — AI models can cheat on evaluations: https://www.nist.gov/caisi/cheating-ai-agent-evaluations/1-background-ai-models-can-cheat-evaluations
The Register — Anthropic on reducing model misbehaviour (Nov 2025): https://www.theregister.com/2025/11/24/anthropic_model_misbehavior/
Related reading on The Control Layer:
The player-coach CISO: how AI rewrote the security leader's job in eighteen months
Amer Altaf
·
May 20
The first 200 words are free. The full 3,300-word breakdown — the player-coach reframe, the Claude Code bypass anecdote, the AI governance playbook in three acts, and the falsifiable predictive judgement on AISPM and the death of ISO 27001 as the dominant due-diligence question — sits behind the paywall.
Read full story

The player-coach CISO: how AI rewrote the security leader's job in eighteen months

Discussion about this post

Ready for more?