Code Health Is Necessary, Verification Is Sufficient

Research shows AI fails more often in unhealthy code, but even healthy code has a non-zero break rate. Why you need both code health and change-level verification.

The premise

A growing body of research is quantifying something practitioners already felt: AI coding tools don't perform equally across all codebases. The quality of the code AI touches determines whether AI accelerates delivery or accelerates defects.

The most cited finding comes from Code for Machines, Not Just Humans (2026), a large-scale study of 5,000 real programs across six different LLMs. The result: a 60% higher defect risk when AI-generated changes are applied to structurally unhealthy code. LLMs consistently performed better in clean, well-structured modules and consistently worse in tangled, complex ones.

That study only included code scoring 7 or above on a 10-point health scale. It never touched the truly unhealthy code found in most legacy systems, the modules scoring 4, 3, or 1. Given the non-linear degradation patterns observed across code health research, the real-world increase in AI breakage for those modules is likely far steeper than 60%.

The numbers keep coming

The defect risk finding isn't isolated. Other recent studies paint a consistent picture:

41% more defects from AI adoption, with no measurable increase in throughput. Teams that adopted AI coding tools shipped more bugs without shipping more value. The volume of code went up. The quality went down. The net effect on delivery was negative.

Developers estimated AI saved them 20% of their time. In reality, they took 19% longer than a control group working without AI. The perception gap is the part that stings. Engineers genuinely believed they were faster. They weren't. The time went somewhere. Likely into debugging code they didn't fully understand, re-reading AI output to build mental models they didn't have, and fixing subtle issues that slipped past initial review.

Initial AI velocity gains are fully cancelled out after two months, driven by a massive increase in code complexity. The first few weeks feel like a productivity miracle. By month three, the accumulated complexity has eaten every hour saved. The codebase got bigger, but nobody got smarter about it.

Code health is necessary

These findings make a strong case for investing in code health as infrastructure. If AI performs dramatically better on clean code, then keeping your codebase healthy isn't just a long-term quality play. It's a prerequisite for getting value from your AI tools today.

Organizations that let their codebases degrade while scaling AI adoption are running into a wall. The AI generates code fast, but the code it generates in unhealthy modules creates more problems than it solves. Technical debt has always had a cost. Now it has a multiplier.

This argument is sound. Clean up the code, and AI works better. Invest in modularity, reduce coupling, break dependency cycles, and the AI error rate drops. The research supports it.

But code health alone isn't enough

Here's the part that gets less attention: even in the healthiest code the study measured, AI still introduced defects. The break rate at Code Health 9+ is lower, but it isn't zero. The researchers themselves flagged this: "AI break rate is never zero."

This matters because it means code health is a necessary condition for safe AI adoption, but not a sufficient one. You can have a perfectly healthy codebase and still ship AI-generated changes that quietly remove error handling, widen auth boundaries, or break contracts between modules. The code quality reduces the probability of failure. It doesn't eliminate it.

And in practice, nobody has a perfectly healthy codebase everywhere. Most organizations have a mix: some modules are clean, some carry years of accumulated complexity. AI works across all of them. The risk isn't theoretical.

The verification gap

Code health tells you about the state of the codebase before a change. It doesn't tell you about the change itself.

Knowing that a module scores 9 on a health scale doesn't tell you whether the specific diff an AI just generated removed a critical null check. Knowing your dependency graph is clean doesn't tell you whether the new import an agent added crosses a module boundary in a way that creates a circular dependency. Knowing your test coverage is high doesn't tell you whether the AI-generated tests actually test the right things.
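To make that concrete, here is a minimal sketch of what inspecting the change itself can mean. It scans only the removed lines of a unified diff for anything that resembles a guard clause, error signal, or auth check. The pattern list and function names are illustrative assumptions, not any real tool's API, and a production verifier would reason over parsed code and project context rather than regexes.

```python
import re
import sys

# Hypothetical heuristic: flag removed lines that look like guard clauses,
# error signaling, or access checks. Illustrative patterns only.
GUARD_PATTERNS = [
    r"\bis\s+not\s+None\b",                           # removed None checks
    r"\bif\s+not\s+\w+",                              # removed early-exit guards
    r"\braise\s+\w*Error",                            # removed error signaling
    r"\b(authorize|has_permission|check_access)\b",   # removed auth checks
]

def flag_risky_removals(unified_diff: str) -> list[str]:
    """Return removed lines from a unified diff that match a guard pattern."""
    findings = []
    for line in unified_diff.splitlines():
        # '-' marks a removed line; skip the '---' file header.
        if line.startswith("-") and not line.startswith("---"):
            removed = line[1:]
            if any(re.search(p, removed) for p in GUARD_PATTERNS):
                findings.append(removed.strip())
    return findings

if __name__ == "__main__":
    diff_text = sys.stdin.read()  # e.g. piped from `git diff`
    for hit in flag_risky_removals(diff_text):
        print(f"review removed guard: {hit}")
```

Piping the output of `git diff` through a script like this surfaces the removed null check that no repository-level health score will ever see.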

The gap is between repository-level health and change-level verification. They operate at different granularities and answer different questions:

Code health answers: Is this codebase in a state where AI can work effectively?

Change verification answers: Did this specific change introduce risk, and where should a human look?

Teams that invest only in code health get a better starting position but still ship blind on individual changes. Teams that verify changes without caring about code health are fighting uphill — their verification layer catches more issues simply because there are more issues to catch.

Both layers, not one

The teams that will do best with AI-assisted development are the ones that treat code health and change verification as complementary layers, not alternatives.

Code health sets the floor. It determines the baseline probability that AI will produce good output. Higher health means fewer problems to catch, faster reviews, and more reliable automation.

Change verification is the safety net. It catches what falls through regardless of the baseline. It operates on the actual diff, with the actual context, at the moment the decision to merge is being made.

The research is clear that the floor matters. A lot. But floors don't catch you when you fall. That's what safety nets are for.

The cost of skipping either

Skip code health, and your AI tools fight uphill on every change. More breakage, more noise, more false confidence. The verification layer works harder and catches more, but the volume of problems eventually overwhelms any review process.

Skip verification, and you're betting on probability. AI performs well in healthy code most of the time. But "most of the time" is not a policy. It's a hope. One missed auth boundary change, one silently swallowed error in a payment flow, and the cost exceeds everything the AI saved you.

The economics only work when both layers are in place. Code health reduces the volume of problems. Verification ensures the remaining ones don't ship.
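A rough back-of-the-envelope calculation shows how the two layers compose. The 1.6x multiplier reflects the 60% higher defect risk cited above; the baseline break rate and the verification catch rate below are illustrative assumptions, not measured figures.

```python
# Illustrative numbers only. The 1.6x multiplier reflects the ~60% higher
# defect risk reported for unhealthy code; the other two figures are assumptions.
baseline_break_rate = 0.05      # assumed: share of AI changes that break healthy code
unhealthy_multiplier = 1.6      # from the cited study: 60% higher defect risk
verification_catch_rate = 0.9   # assumed: share of breaking changes verification catches

scenarios = {
    "healthy code, no verification":   baseline_break_rate,
    "unhealthy code, no verification": baseline_break_rate * unhealthy_multiplier,
    "unhealthy code, verification":    baseline_break_rate * unhealthy_multiplier * (1 - verification_catch_rate),
    "healthy code, verification":      baseline_break_rate * (1 - verification_catch_rate),
}

for name, risk in scenarios.items():
    print(f"{name:35s} -> {risk:.1%} of changes ship broken")

# Code health shrinks the problem; verification shrinks what is left of it.
# Only the combination drives the shipped-defect rate toward zero.
```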

What this means in practice

If you're investing in AI coding tools, audit your codebase health first. Know where the healthy modules are and where the debt lives. Direct AI toward the healthy zones where it performs reliably, and be cautious about letting it work unsupervised in the rest.
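Even a crude audit is a usable starting point. The sketch below is only a triage proxy, assuming that large, branch-dense Python files are where the debt tends to live; real code health tooling measures far more than this (coupling, cohesion, change frequency), and the thresholds here are arbitrary.

```python
from pathlib import Path

# Rough proxy for "where does the debt live": files that are both large and
# dense with branching tend to be the modules where AI changes break most.
# This is a triage heuristic, not a real code health metric.
BRANCH_KEYWORDS = ("if ", "elif ", "for ", "while ", "except ", "and ", "or ")

def branch_density(path: Path) -> tuple[int, float]:
    lines = path.read_text(encoding="utf-8", errors="ignore").splitlines()
    code_lines = [line for line in lines if line.strip() and not line.strip().startswith("#")]
    branches = sum(line.count(kw) for line in code_lines for kw in BRANCH_KEYWORDS)
    return len(code_lines), branches / max(len(code_lines), 1)

def audit(repo_root: str) -> None:
    for path in sorted(Path(repo_root).rglob("*.py")):
        loc, density = branch_density(path)
        if loc > 400 and density > 0.25:  # arbitrary thresholds, tune per repo
            print(f"likely hotspot: {path} ({loc} lines, branch density {density:.2f})")

if __name__ == "__main__":
    audit(".")
```

Running something like this over a repository gives a first map of where to let AI work freely and where to keep a human close.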

Then verify at the point of change. Every diff, every merge. Not with another AI agreeing that it looks fine, but with structured analysis that separates facts from inferences and tells you what it couldn't verify.
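What such a structured report might contain is easier to see as data than as prose. The schema below is purely illustrative, not any particular tool's format: each finding pairs an observable fact from the diff with the inference drawn from it, and the report admits what the analysis could not confirm.

```python
from dataclasses import dataclass, field

# Illustrative report schema only. The point is that every finding separates
# what was observed from what was inferred, and the report lists what it
# could not verify instead of staying silent about it.
@dataclass
class Finding:
    location: str    # file and line the finding refers to
    fact: str        # directly observable in the diff
    inference: str   # reviewer-relevant interpretation of that fact
    severity: str    # e.g. "low", "medium", "high"

@dataclass
class VerificationReport:
    change_id: str
    findings: list[Finding] = field(default_factory=list)
    unverified: list[str] = field(default_factory=list)  # what the analysis could not confirm

report = VerificationReport(
    change_id="example-change",
    findings=[
        Finding(
            location="billing/charge.py:42",
            fact="the `except PaymentError` handler was removed in this diff",
            inference="payment failures may now propagate as unhandled exceptions",
            severity="high",
        )
    ],
    unverified=["whether any caller already handles PaymentError upstream"],
)
```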

Code health is the foundation. Verification is the last line of defense. The research says you need both. The production incidents from skipping either will prove it.
