Gaps, Not Bugs: How AI Code Fails

AI code doesn't fail on what's wrong. It fails on what's missing. Three production failure patterns that pass every test.

A different failure mode

AI-generated code doesn't fail the way human code does. Human bugs tend to be visible: a typo, a wrong variable, a logic error that breaks a test. AI failures are subtler. The code compiles, the tests pass, the PR looks clean. The problem isn't in what the code does. It's in what it doesn't do.

One architect put it this way: "AI doesn't write bugs. It writes gaps."

Three patterns that pass every test

These are real failure patterns from production systems. All three pass unit tests. All three pass integration tests. All three break under production load.

Silent error swallowing. AI rewrites an error handler into a try/catch that logs a warning and returns a default value. The code looks cleaner after the rewrite. Tests pass because the error is "handled." In production, the system continues in a corrupt state and nobody is alerted until downstream data is already wrong.
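
Here is a minimal sketch of the pattern in TypeScript; the inventory service, URL, and names are hypothetical. The shape is what matters: the catch block converts a failure into a plausible-looking value, so nothing upstream ever learns the lookup failed.

```typescript
// Hypothetical inventory lookup; the endpoint and types are illustrative.
interface StockLevel {
  sku: string;
  quantity: number;
}

// The "cleaner" rewrite: the error is caught, a warning is logged,
// and a default is returned. Every caller keeps running.
async function getStockLevel(sku: string): Promise<StockLevel> {
  try {
    const response = await fetch(`https://inventory.example.com/stock/${sku}`);
    if (!response.ok) {
      throw new Error(`inventory service returned ${response.status}`);
    }
    return (await response.json()) as StockLevel;
  } catch (err) {
    // Tests pass: the error is "handled." In production, callers now
    // treat a failed lookup as "zero stock" and make decisions on it.
    console.warn(`stock lookup failed for ${sku}, using default`, err);
    return { sku, quantity: 0 };
  }
}
```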

Missing concurrency handling. AI writes the happy path. The code works perfectly in sequential tests. Under concurrent production traffic, two requests modify the same resource simultaneously. No locking, no optimistic concurrency, no conflict detection. The race condition surfaces as intermittent data corruption that takes weeks to diagnose.
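
A sketch of that happy path, with hypothetical data-access types standing in for whatever ORM or query layer is actually in use:

```typescript
// Illustrative records and data-access interface, not a real library's API.
interface EventRecord {
  id: string;
  seatsAvailable: number;
  attendees: string[];
}

interface Db {
  getEvent(id: string): Promise<EventRecord>;
  updateEvent(id: string, patch: Partial<EventRecord>): Promise<void>;
}

// The happy path: read, modify, write. Sequential tests pass every time.
async function reserveSeat(db: Db, eventId: string, userId: string): Promise<void> {
  const event = await db.getEvent(eventId); // two concurrent requests can both read seatsAvailable = 1
  if (event.seatsAvailable <= 0) {
    throw new Error("sold out");
  }
  // No lock, no version check: both requests write seatsAvailable = 0,
  // and one seat ends up with two confirmed attendees.
  await db.updateEvent(eventId, {
    seatsAvailable: event.seatsAvailable - 1,
    attendees: [...event.attendees, userId],
  });
}
```

One way to close the gap is optimistic concurrency: carry a version number on the record, write with a compare-and-set (e.g. UPDATE ... WHERE id = ? AND version = ?), and treat zero affected rows as a conflict to retry or reject instead of a silent lost update.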

Dropped idempotency. Retry logic reprocesses the same operation without a deduplication key. Works fine in staging with controlled request volumes. Under retry storms in production, the same payment processes multiple times.
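
A sketch of the gap and one way to close it, again with hypothetical names; the idempotencyKey option mirrors what many payment APIs accept, but the processor call here is a stub.

```typescript
interface PaymentRequest {
  orderId: string;
  amountCents: number;
}

// Stand-in for the real payment processor call; it only logs.
async function chargeCard(
  req: PaymentRequest,
  opts?: { idempotencyKey?: string }
): Promise<void> {
  console.log(`charging ${req.amountCents} for order ${req.orderId}`, opts ?? {});
}

// The gap: retries with no deduplication key. A timeout can mean the charge
// already succeeded, and each retry re-submits it as a brand-new operation.
async function chargeWithRetry(req: PaymentRequest, attempts = 3): Promise<void> {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      await chargeCard(req);
      return;
    } catch {
      // swallow and retry; under a retry storm the same payment runs repeatedly
    }
  }
  throw new Error(`payment failed for order ${req.orderId}`);
}

// A gap-closing sketch: derive a stable key from the operation itself so the
// processor (or a dedup store) can reject repeats of the same charge.
async function chargeOnce(req: PaymentRequest): Promise<void> {
  await chargeCard(req, { idempotencyKey: `charge:${req.orderId}` });
}
```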

Why tests don't catch gaps

Tests verify expected behavior against known cases. They validate what you asked for, not what you forgot to define.

AI-generated tests have an additional problem: they're often generated by the same model that wrote the code. The tests validate the implementation, not the requirements. An AI can generate tests that pass by silently swallowing assertion failures, testing only the happy path, or asserting on the wrong values.
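
To make that concrete, here is a hypothetical test in the style an AI might generate for the error-swallowing sketch above (it reuses that getStockLevel function). It exercises only the happy path, and the try/catch around the assertion means the test can no longer fail.

```typescript
import { test } from "node:test";
import { strict as assert } from "node:assert";

// Assumes the hypothetical getStockLevel from the earlier sketch.
test("getStockLevel returns a stock level", async () => {
  try {
    const stock = await getStockLevel("sku-123");
    assert.equal(typeof stock.quantity, "number"); // also true for the fallback { quantity: 0 }
  } catch {
    // swallowed: the test passes no matter what the code does
  }
});
```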

Test coverage creates confidence, but not completeness. High coverage numbers tell you that the code paths you thought of are exercised. They tell you nothing about the code paths nobody thought of.

Checking for absence

Most tools check for correctness: is this code right? The harder question is completeness: is something missing?

When reviewing a change, you need to know three things: what factually changed, what might be risky about those changes, and what can't be verified from the diff alone. The third category is where gaps hide.

A verification process that only reports what it found wrong will never catch what's absent. The teams handling this well have tooling that explicitly lists what couldn't be verified, not just what was. That's the difference between a review that says "looks good" and a review that says "here's what I checked, here's what I flagged, and here's what I couldn't tell you about."
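
One way to make that concrete is to give the review output an explicit "couldn't verify" section. A minimal sketch, with illustrative field names rather than any real tool's schema:

```typescript
// A review report that treats "couldn't verify" as a first-class output.
interface ReviewReport {
  checked: string[];      // claims the review actually verified
  flagged: string[];      // changes that look risky
  unverifiable: string[]; // questions the diff alone cannot answer
}

const exampleReport: ReviewReport = {
  checked: ["retry wrapper covered by unit tests", "types match the payment schema"],
  flagged: ["error handler now logs a warning and returns a default"],
  unverifiable: ["behavior under concurrent writes", "idempotency during retry storms"],
};

console.log(JSON.stringify(exampleReport, null, 2));
```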
