AI-Generated Code Without Architecture Review Creates Hidden Debt

GitHub Copilot passed 1.8 million paid subscribers in 2024, according to Microsoft's earnings reports. Amazon CodeWhisperer (now part of Amazon Q Developer), Cursor, Sourcegraph Cody: the list of AI coding assistants keeps growing. Developers adopt them because they work. Autocompletions turn boilerplate into a non-issue. Prototypes materialize in hours instead of days.

But there is a problem nobody talks about at standup: these tools generate code, not architecture. They produce syntactically correct functions that pass unit tests while silently violating the design principles your team agreed on six months ago. The result is a new category of technical debt that does not show up in sprint retrospectives until it is too late.

The Gap Between Correct Code and Sound Architecture

A coding assistant works from a limited context. It typically sees the current file, a few imports, and whatever else the editor retrieves for it. It does not see your C4 diagrams, your ADRs, or the decision your lead architect made about event-driven communication between services. It optimizes for local correctness: does this function compile, and does it handle the obvious edge cases?

Architecture operates on a different level entirely. It answers questions like: which service owns this data? What is the contract between these two modules? Where do we draw the boundary between synchronous and asynchronous processing?

When a developer accepts an AI suggestion that creates a direct database call in a service that was supposed to communicate through an event bus, the code works. Tests pass. The PR looks clean. But the architectural boundary is breached, and nobody notices because the review focused on logic, not on structure.

GitClear analyzed 153 million changed lines of code written between 2020 and 2023 and reported that code churn (lines rewritten within two weeks of being added) rose by 39% as AI assistants such as Copilot spread. The finding is a correlation, not proof of cause, but it is not a sign of bad developers either. It is what you would expect from code that gets written fast, merged fast, and corrected fast because it did not fit the larger picture.

What Goes Wrong in Practice

Duplicate Business Logic Across Services

This is the most common failure mode. Two teams working on the same Salesforce org use Copilot to generate Apex classes for lead conversion. Team A gets a synchronous trigger-based implementation. Team B gets an asynchronous Queueable approach. Both work in isolation. Both pass code review. Neither team checks whether a shared service for lead conversion already exists, or should exist.

Six months later, a business rule changes: converted leads now need a compliance flag. The change goes into Team A's trigger. Team B's Queueable still runs the old logic. Customer data diverges. The root cause takes three days to find because nobody mapped which components own which business rules.
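The fix is structural, not clever: one component owns the rule, and both execution paths call it. A minimal sketch of that shape, written in Python rather than Apex for brevity. The function names and the "status" and "compliance_flag" fields are hypothetical.

```python
def apply_conversion_rules(lead: dict) -> dict:
    """Single owner of the lead-conversion business rules."""
    converted = dict(lead)                 # do not mutate the caller's record
    converted["status"] = "Converted"
    converted["compliance_flag"] = True    # the new rule lives in one place
    return converted


def convert_on_trigger(lead: dict) -> dict:
    """Synchronous path (Team A's trigger in the example)."""
    return apply_conversion_rules(lead)


def convert_in_queue(leads: list[dict]) -> list[dict]:
    """Asynchronous batch path (Team B's Queueable in the example)."""
    return [apply_conversion_rules(lead) for lead in leads]
```

With this layout, the compliance change touches one function and both paths pick it up.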

Invisible Dependency Creep

AI assistants pull from training data that includes millions of open-source repositories. When Copilot suggests an HTTP client implementation, it might default to a library your project does not use. The developer installs it because the suggestion looks clean. The dependency gets added to package.json or pom.xml without discussion.

Snyk's 2024 State of Open Source Security report found that the average JavaScript project carries 49 direct dependencies and 298 transitive ones. Each AI-suggested addition compounds this. Semgrep can flag unauthorized imports in CI, but only if someone configures the rules first. Most teams do not.

In one Salesforce integration project we reviewed, an AI-generated REST callout used a custom HTTP wrapper instead of the org's established CalloutService class. The wrapper lacked the retry logic and circuit breaker that the shared service provided. When the external API started throttling, only the AI-generated callout failed. The incident took eight hours to diagnose because the team assumed all callouts went through the shared service.
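To make concrete what the wrapper was missing, here is a rough sketch of a shared callout service with retries and a very simple circuit breaker, in Python rather than Apex. It is not the org's actual CalloutService: the send callable, max_retries, failure_threshold, and CircuitOpenError are all assumptions made for illustration, and a production version would also add backoff delays and a timed reset of the breaker.

```python
class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is skipped."""


class CalloutService:
    def __init__(self, send, max_retries=2, failure_threshold=3):
        self._send = send                        # callable that performs the request
        self._max_retries = max_retries
        self._failure_threshold = failure_threshold
        self._consecutive_failures = 0

    def call(self, request):
        # Breaker: after too many failed calls in a row, stop hitting the API.
        if self._consecutive_failures >= self._failure_threshold:
            raise CircuitOpenError("circuit open; not calling downstream")

        last_error = None
        for _ in range(self._max_retries + 1):   # first attempt plus retries
            try:
                response = self._send(request)
            except ConnectionError as exc:
                last_error = exc
                continue
            self._consecutive_failures = 0       # success closes the breaker
            return response

        # Every attempt failed: count it toward the breaker and re-raise.
        self._consecutive_failures += 1
        raise last_error
```

A callout that bypasses this class gets none of that behavior, which is why it was the only one to fail under throttling.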

Security Gaps That Pass Code Review

AI-generated code treats security as optional context. A Copilot suggestion for an API endpoint might skip input validation because the training data included plenty of quick prototypes without it. A generated Apex controller might be declared "without sharing", which skips the running user's record-level sharing rules, because the AI optimized for simplicity, not for the org's sharing model.

SonarQube's 2024 analysis of AI-assisted codebases found that AI-generated code contained 36% more code smells than human-written code in the same repositories. Security hotspots, specifically, were 1.5x more likely to appear in AI-suggested blocks.

The problem is not that these issues are unfixable. The problem is that they are invisible in a standard code review. A reviewer checking logic and test coverage will approve a function that works correctly but exposes data to users who should not see it. Architecture review catches this because it asks a different question: does this component respect the security boundaries we defined?

Why Standard Code Review Is Not Enough

Most code review processes check three things: does the code work, is it readable, does it have tests. These are necessary but insufficient when AI generates the code.

AI-generated code passes readability checks because it is trained on well-formatted code. It passes functional checks because it is optimized for correctness within its context window. It passes test checks because it can generate tests for its own output. What it cannot do is validate its own output against system-level constraints that exist outside its context.

An architecture review adds a fourth dimension: does this code fit the system? It checks API contracts, data ownership, security boundaries, and integration patterns. Without it, each merged PR is a small bet that the AI happened to guess the right architectural pattern. Over hundreds of PRs, those bets compound into structural drift that no amount of refactoring sprints can easily reverse.

Building Architecture Review Into the AI Workflow

The solution is not to ban AI coding assistants. They deliver real productivity gains. The solution is to make architecture review a first-class part of the workflow, specifically because AI makes it easy to produce code that looks right but fits wrong.

Tag AI-generated PRs for design review. GitHub and GitLab both support labels, and CI jobs or approval rules can route review based on them. Any PR where more than 30% of changed lines come from AI suggestions gets a design-review label and requires sign-off from someone who owns the architecture. Measuring that share takes some setup: Copilot's usage metrics report suggestion acceptance in aggregate, not per PR, so the per-PR figure usually has to come from editor telemetry or the author's own declaration.
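The labeling rule itself is a few lines once the counts exist. A sketch, assuming some upstream step supplies ai_lines and changed_lines for the PR (both hypothetical inputs; how you collect them depends on your tooling):

```python
def needs_design_review(ai_lines: int, changed_lines: int,
                        threshold: float = 0.30) -> bool:
    """True when more than `threshold` of the changed lines came from AI."""
    if changed_lines <= 0:
        return False                      # empty diff: nothing to route
    return ai_lines / changed_lines > threshold
```

A CI job could call this and apply the design-review label when it returns True.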

Encode architectural constraints in static analysis. SonarQube custom rules can enforce layering (no direct DB access from controller classes). Semgrep can flag imports from unauthorized packages. ArchUnit (Java) and ts-arch (TypeScript) can verify dependency directions at build time. These tools turn architectural decisions from documents into automated gates.
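To show how small such a gate can be, here is a sketch of an import check built on Python's standard ast module. It is not a replacement for the tools above. The banned set and the idea of running it only over controller modules are assumptions for illustration.

```python
import ast


def forbidden_imports(source: str, banned: set[str]) -> list[str]:
    """Return imported module names whose top-level package is banned."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]   # relative imports have no module name
        else:
            continue
        for name in names:
            if name.split(".")[0] in banned:
                found.append(name)
    return sorted(found)
```

Run over every file in a controllers package with banned set to your database drivers, a non-empty result fails the build.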

Maintain a service ownership map. When AI generates code, it does not know that the PaymentService is owned by the billing team and should not be modified by the marketing team's PR. A CODEOWNERS file in your repository is the minimum. For Salesforce orgs, a metadata spreadsheet mapping Apex classes to business domains catches cross-boundary changes before they merge.
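The same ownership map can drive an automated check. A sketch, assuming a hypothetical list of path patterns and team names. It uses Python's fnmatch, whose glob rules are close to, but not identical to, CODEOWNERS syntax.

```python
from fnmatch import fnmatch

# Hypothetical ownership map: (path pattern, owning team).
OWNERS = [
    ("services/payment/*", "billing-team"),
    ("services/campaign/*", "marketing-team"),
]


def owner_of(path, owners=OWNERS):
    """Return the owning team, or None if no pattern matches."""
    match = None
    for pattern, team in owners:
        if fnmatch(path, pattern):
            match = team                  # last matching pattern wins
    return match


def cross_boundary_changes(changed_paths, author_team, owners=OWNERS):
    """Paths in a PR that are owned by a team other than the author's."""
    return [p for p in changed_paths
            if owner_of(p, owners) not in (None, author_team)]
```

Any path this returns is a cue to pull the owning team into the review.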

Run dependency audits on every build. Snyk, Dependabot, or Socket can flag new dependencies automatically. The rule is simple: if a dependency was not in the project before this PR, it needs explicit approval. This catches the silent library additions that AI suggestions introduce.
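The approval rule reduces to a set difference between two manifests. A sketch for package.json, assuming the CI job can read the file's contents on both the base branch and the PR branch:

```python
import json


def new_dependencies(base_manifest: str, pr_manifest: str) -> list[str]:
    """Names declared in the PR's package.json but not in the base branch's."""
    def names(text: str) -> set[str]:
        data = json.loads(text)
        return (set(data.get("dependencies", {}))
                | set(data.get("devDependencies", {})))

    return sorted(names(pr_manifest) - names(base_manifest))
```

This only sees direct dependencies. Transitive additions still need a lockfile-aware tool.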

Review AI output against ADRs. If your team has Architecture Decision Records, use them as a review checklist. When the ADR says "all inter-service communication uses the event bus," and a PR introduces a synchronous REST call between services, that is a rejection, regardless of how clean the code looks.

The Real Cost of Skipping This

Stripe's 2018 Developer Coefficient report estimated that developers spend about 42% of their time on maintenance and technical debt. AI coding assistants were supposed to reduce this. Without architecture review, they risk increasing it by making it faster to produce code that needs to be rewritten.

The math is straightforward. An architecture review adds 15 to 30 minutes per PR. Diagnosing and fixing an architectural violation that reached production takes days. In regulated industries (finance, healthcare, automotive), it can trigger compliance reviews that take weeks.
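A back-of-the-envelope version of that math, as a sketch. The only inputs are the review time per PR and the hours lost to one production incident. Treating "days" as two eight-hour working days is our assumption, and the model ignores reviews that miss a violation.

```python
def break_even_prs(review_minutes: float, incident_hours: float) -> float:
    """Number of reviewed PRs whose total review time equals one incident."""
    return (incident_hours * 60) / review_minutes


slow_review = break_even_prs(review_minutes=30, incident_hours=16)
fast_review = break_even_prs(review_minutes=15, incident_hours=16)
print(slow_review, fast_review)
```

Under those assumptions, review pays for itself if it prevents one production violation in every 32 to 64 PRs.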

AI-generated code is not the problem. Unreviewed AI-generated code merging into systems without architectural guardrails is the problem. The fix is not more process. It is the right process at the right point: a design review gate that asks one question before every merge. Does this code fit the system we are building, or just the function it was asked to write?

Teams that answer this question consistently will get both the speed of AI-assisted development and the structural integrity that keeps systems maintainable. Teams that skip it will ship faster for six months, then spend the next two years untangling the result.
