AI-assisted code review without losing rigor

What AI catches well (style, simple bugs, security patterns); what it misses (intent, architecture, tradeoffs). The division of labour that captures both speed and rigour.

AI-assisted code review went from novelty to table-stakes between 2023 and 2025. Most teams now have at least one tool — GitHub Copilot's PR review, Stride's review assistant, Diamond, Coderabbit, or a homegrown wrapper around Claude/GPT — running on their PRs. The question is no longer whether to use AI in review; it's how to use it without losing the rigour that human review provides.

The teams getting this right have a clear division of labour: AI catches the mechanical issues at scale; humans focus on intent, architecture, and the judgement calls AI doesn't make well. The teams getting it wrong either over-trust the AI (auto-approving on AI green-light) or ignore it (so why is it running?).

What AI catches well

Modern LLM-based review tools catch several defect classes consistently:

  • Style and convention violations that escape the linter. The linter catches "missing semicolon"; the AI catches "this getter doesn't follow the pattern used in the rest of the codebase."
  • Missing tests for obviously testable behaviour. New function with no test; new conditional branch with no test for the alternate path; new API endpoint with no contract test. AI catches the obvious gap; humans assess whether the test is meaningful.
  • Simple bugs in well-known patterns. Off-by-one errors in array indexing, a missing null check on an optional value, a missing await in an async function, errors swallowed or left unhandled in a try/catch. The 2024+ models catch these at a high rate.
  • Security anti-patterns from the OWASP top 10. SQL injection vectors, missing input validation, hardcoded credentials, weak crypto usage. AI runs essentially a SAST scan with better signal-to-noise.
  • Performance regressions in obvious patterns. N+1 queries, unbounded loops on large inputs, missing index hints in raw SQL, missing memoisation in render-heavy components.
  • Documentation gaps. Public APIs without JSDoc, complex functions without explanatory comments, README references to nonexistent files.

For these defect classes, AI-assisted review catches a meaningful share of issues that would otherwise reach human review (and often pass it). The signal is high; the noise is manageable.
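
To make the "simple bugs in well-known patterns" category concrete, here is a minimal sketch of two such defects next to their fixes. The function names and data shapes are hypothetical, invented for illustration rather than taken from any real codebase.

```python
from typing import Optional


def last_n_buggy(items: list[int], n: int) -> list[int]:
    # Off-by-one: the slice starts one element too late,
    # so only n - 1 items come back.
    return items[len(items) - n + 1:]


def last_n_fixed(items: list[int], n: int) -> list[int]:
    # Start exactly n elements from the end; max() guards against n > len(items).
    return items[max(len(items) - n, 0):]


def display_name_buggy(user: Optional[dict]) -> str:
    # Missing null check: raises TypeError when user is None.
    return user["name"].strip()


def display_name_fixed(user: Optional[dict]) -> str:
    # Handle both a missing user and a missing name before touching the value.
    if user is None or user.get("name") is None:
        return "anonymous"
    return user["name"].strip()
```

Both bugs are local to a few lines and recognisable without any product context, which is why this class of defect suits automated review.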

What AI misses

The defect classes that humans still need to catch:

  • Intent mismatches. The PR does what it claims, but what it claims is the wrong thing. AI evaluates the change against the diff; humans evaluate it against the broader product context.
  • Architectural drift. The change introduces a pattern inconsistent with the codebase's direction, or makes a future refactor harder. AI doesn't have the long-horizon view that surfaces these.
  • Tradeoff calls. "This is faster but uses more memory; which constraint matters here?" "This is simpler but less extensible; how much extensibility do we need?" These are judgement calls AI flags but can't make.
  • Subtle business-logic bugs. "This pricing calculation is correct for retail customers but wrong for wholesale." AI doesn't know the business rules unless you've explicitly encoded them.
  • Cross-cutting concerns the diff doesn't expose. A change in one file that subtly breaks an assumption in another file the diff doesn't touch. AI's context window doesn't span the whole codebase.
  • Tests that pass but don't verify the right thing. AI checks tests exist; humans check tests are meaningful.

The pattern is that AI handles "did this change miss something?" and humans handle "is this change the right thing?". Both questions matter; neither can be skipped.
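
The last item in the list, tests that pass but don't verify the right thing, is easiest to see in code. The sketch below uses a hypothetical `apply_discount` function with a deliberate bug, plus two hypothetical checks written as plain functions so the example runs without a test framework.

```python
import math


def apply_discount(price: float, percent: float) -> float:
    # Hypothetical function under review. Deliberate bug: it adds the
    # percentage instead of subtracting it.
    return price * (1 + percent / 100)


def weak_check() -> bool:
    # A test like this exists and passes, so a "does a test exist?" check
    # is satisfied, but it only verifies the return type.
    return isinstance(apply_discount(100.0, 10.0), float)


def meaningful_check() -> bool:
    # Verifies the behaviour the function is supposed to have:
    # 10% off 100 should be 90.
    return math.isclose(apply_discount(100.0, 10.0), 90.0)
```

Here `weak_check()` returns `True` while `meaningful_check()` returns `False`. Judging that the second check is the one that matters requires knowing what the function is for, which is the human reviewer's half of the work.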

The division of labour that works

The pattern that hits both speed and rigour:

Step 1: AI runs first

The AI tool runs on every PR within minutes of opening. Its comments are tagged distinctly (often via a bot account or a clear prefix like [AI Review]). The author reads the AI comments first and addresses the obvious gaps before any human is involved.

This is the highest-leverage step. The author's first response to "missing test for the null-input case" is a 5-minute fix, not a back-and-forth with a reviewer who would have raised the same concern hours later.

Step 2: Human reviewer engages

The human reviewer reads the (now AI-improved) PR with the explicit framing: "AI has handled style, missing tests, obvious security patterns. My job is intent, architecture, and judgement."

This framing is critical. Without it, the human reviewer spends time on issues the AI already caught (wasted effort) or skips issues the AI missed because "AI already reviewed it" (lost rigour). The explicit division keeps both halves engaged on the parts where they add unique value.

Step 3: AI re-runs on changes

When the author pushes new commits, the AI re-runs and the human re-reviews. The AI catches regressions in the new commits; the human evaluates whether the author's response addresses the substantive concerns.

Step 4: Approve and merge

The merge is conditional on: AI comments addressed or explicitly accepted-with-rationale; human approval granted; tests passing. The AI doesn't auto-approve; it provides input that the human factors into their approval.
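
The merge condition in Step 4 can be written down as a small predicate. This is a sketch under stated assumptions: the `CommentState` values, the `AiComment` type, and the `can_merge` function are hypothetical names, not the API of any particular review tool.

```python
from dataclasses import dataclass
from enum import Enum


class CommentState(Enum):
    OPEN = "open"
    ADDRESSED = "addressed"
    ACCEPTED_WITH_RATIONALE = "accepted_with_rationale"


@dataclass
class AiComment:
    state: CommentState
    rationale: str = ""


def is_resolved(comment: AiComment) -> bool:
    # Addressed comments are resolved; accepted ones need a non-empty rationale.
    if comment.state is CommentState.ADDRESSED:
        return True
    if comment.state is CommentState.ACCEPTED_WITH_RATIONALE:
        return bool(comment.rationale.strip())
    return False


def can_merge(ai_comments: list[AiComment], human_approved: bool, tests_passing: bool) -> bool:
    # All three conditions must hold. A clean AI review (no comments)
    # never substitutes for human approval.
    return human_approved and tests_passing and all(is_resolved(c) for c in ai_comments)
```

Note that `can_merge([], human_approved=False, tests_passing=True)` is `False`: an empty AI comment list is an input to the decision, not an approval.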

The auto-approval temptation

The most common AI-assisted review failure mode is auto-approving PRs that pass AI review. The reasoning is "the AI didn't find anything, so the PR is fine." This is wrong for the reasons above: AI misses entire categories of defect that humans catch. Auto-approval keeps coverage of the categories AI is good at (which were never humans' main contribution) and gives up coverage of the categories AI is bad at (which were).

The pragmatic compromise: AI-only review is acceptable for genuinely trivial changes (typo fixes, documentation updates, dependency bumps with no functional changes) where the categories AI misses don't apply. For anything that actually changes behaviour, humans stay in the loop.

A useful test: would the team be comfortable with this PR going straight to production with no human review? If yes, AI-only is fine. If no, humans review.
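
One way to make the "genuinely trivial" carve-out mechanical is a conservative allow-list, sketched below. The suffix list and the function name `ai_only_review_allowed` are assumptions for illustration. The sketch covers only documentation-only changes and deliberately leaves dependency bumps to a human, since "no functional changes" cannot be inferred from file names alone.

```python
from pathlib import PurePosixPath

# Assumed allow-list: only prose documentation formats qualify.
DOC_SUFFIXES = {".md", ".rst"}


def ai_only_review_allowed(changed_files: list[str]) -> bool:
    # An empty change set is treated as "not eligible" rather than trivially safe.
    if not changed_files:
        return False
    # Every changed file must be documentation; one code file means humans review.
    return all(PurePosixPath(path).suffix.lower() in DOC_SUFFIXES for path in changed_files)
```

For example, `ai_only_review_allowed(["README.md", "docs/guide.rst"])` is `True`, while adding `"src/app.py"` to the list makes it `False`.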

Tooling configuration

The AI tools work better with explicit configuration:

  • Codebase conventions in the prompt. Most tools accept a "house style" prompt that tells the AI your team's specific conventions — naming, error-handling patterns, preferred libraries. Without this, AI suggestions drift toward generic best-practice that doesn't match your codebase.
  • Codebase context awareness. Tools that can read related files (not just the diff) catch cross-cutting concerns the diff-only mode misses. The cost is more LLM tokens per review; the benefit is better signal.
  • Severity tuning. Tools usually let you tune what's "blocking" vs "informational." Healthy default: AI comments are informational by default; only specific categories (security, schema migrations, breaking API changes) block merge.
  • Quiet mode for trivial PRs. The AI should be configurable to skip very small or doc-only PRs entirely. Otherwise you get 20 "looks good" comments per week and learn to filter them out.
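
The severity-tuning and quiet-mode settings above might look like the following sketch. The `ReviewConfig` fields, category names, and the five-line threshold are hypothetical; real tools expose their own configuration schemas, so treat this as the shape of the policy rather than a format any tool accepts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewConfig:
    # Only these categories block merge; everything else is informational.
    blocking_categories: frozenset[str] = frozenset(
        {"security", "schema_migration", "breaking_api_change"}
    )
    # Quiet mode: skip PRs smaller than this many changed lines (assumed value).
    min_changed_lines: int = 5
    # Free-text house-style prompt passed to the tool (conventions, preferred libraries).
    house_style: str = ""


def severity(config: ReviewConfig, category: str) -> str:
    # Informational by default; blocking only for explicitly listed categories.
    return "blocking" if category in config.blocking_categories else "informational"


def should_run(config: ReviewConfig, changed_lines: int, doc_only: bool) -> bool:
    # Skip doc-only and very small PRs so the tool stays quiet where it adds little.
    return not doc_only and changed_lines >= config.min_changed_lines
```

With the defaults, `severity(ReviewConfig(), "style")` is `"informational"` and `should_run(ReviewConfig(), changed_lines=2, doc_only=False)` is `False`.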

Cost and ROI

AI-assisted review costs LLM tokens. For a team merging 200 PRs/month with an average context of 50KB per review, that's roughly $50-150/month in API costs depending on model choice. The ROI is straightforward: if AI saves 30 minutes per PR in human review time (often realistic when AI catches the mechanical issues upfront), the saved engineering time vastly outweighs the cost.
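
The arithmetic behind that claim is short enough to write out. The PR volume, minutes saved, and API cost range come from the paragraph above; the hourly engineering cost of $75 is an assumed figure for illustration only and should be replaced with your own.

```python
def monthly_roi(
    prs_per_month: int,
    minutes_saved_per_pr: float,
    hourly_cost_usd: float,
    api_cost_usd: float,
) -> tuple[float, float, float]:
    # Returns (hours saved, value of saved time in USD, value-to-cost ratio).
    hours_saved = prs_per_month * minutes_saved_per_pr / 60
    value_saved_usd = hours_saved * hourly_cost_usd
    return hours_saved, value_saved_usd, value_saved_usd / api_cost_usd


# 200 PRs/month, 30 minutes saved each, assumed $75/hour, upper-bound $150 API cost.
hours, value, ratio = monthly_roi(200, 30, 75.0, 150.0)
print(hours, value, ratio)  # 100.0 7500.0 50.0
```

Even at the top of the cost range the saved time is worth about 50 times the API spend under these assumptions; at $50/month the ratio is 150.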

The cost worth measuring is not the dollar cost; it's the cognitive cost of noisy AI comments. An AI tool that comments on every PR with mostly-irrelevant suggestions trains the team to ignore it. Tune aggressively for signal.

Common patterns and anti-patterns

Patterns that work:

  • AI runs on every PR within minutes; comments tagged clearly
  • AI's role explicit in the team's review-flow documentation
  • Per-codebase configuration that captures team-specific conventions
  • Quarterly review of AI-flagged-but-dismissed comments to find tuning opportunities
  • AI-assisted mob review (the AI's output is one of the inputs to the human discussion)

Anti-patterns to watch for:

  • AI auto-approval on green review (drops rigour)
  • AI used to replace human review entirely on non-trivial changes
  • AI comments treated as authoritative without judgement (over-trust)
  • AI comments ignored without engagement (defeats the purpose)
  • AI tools that comment on every PR with no signal-to-noise tuning (trains the team to filter them out)
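
The quarterly review of AI-flagged-but-dismissed comments can start from a simple per-category dismissal rate. The sketch below assumes you can export each AI comment as a `(category, was_dismissed)` pair; that export format, the function names, and the 50% threshold are all hypothetical.

```python
from collections import Counter


def dismissal_rates(comments: list[tuple[str, bool]]) -> dict[str, float]:
    # Fraction of AI comments in each category that reviewers dismissed.
    total: Counter = Counter()
    dismissed: Counter = Counter()
    for category, was_dismissed in comments:
        total[category] += 1
        if was_dismissed:
            dismissed[category] += 1
    return {category: dismissed[category] / total[category] for category in total}


def tuning_candidates(comments: list[tuple[str, bool]], threshold: float = 0.5) -> list[str]:
    # Categories dismissed at or above the threshold are candidates for
    # tighter prompts, lower severity, or being switched off.
    rates = dismissal_rates(comments)
    return sorted(category for category, rate in rates.items() if rate >= threshold)
```

A category that is dismissed most of the time is noise the team has already learned to ignore, so it is the first place to tune.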

The reviewer's evolving role

The clearest 2024-2026 trend in code review: the senior reviewer's job is shifting from "find defects" to "validate intent, architecture, and tradeoffs." The mechanical work is increasingly delegable; the judgement work is increasingly the human's exclusive domain.

This is good for senior engineers — the judgement work is the higher-leverage part, and freeing them from style-nit duty lets them focus where their experience matters most. It's also instructive for junior engineers: the mechanical patterns are increasingly learnable from AI feedback (which is patient and infinitely available), letting human review focus on the architectural judgement that's harder to internalise from a tool.

The team that explicitly designs for this shift — AI for mechanical, humans for judgement, with both engaging on every PR — captures both speed and rigour. The team that defaults to either AI-only or human-only loses one of the two.

For the rest of the review process around AI assistance, see Async code review and Review checklists. For the high-stakes-change exception case, see Mob review. For the measurement framework that tells you whether AI assistance is actually moving the needle, see Service-level objectives for code review.

Frequently asked questions

What does AI catch well in code review?
Mechanical issues at scale: style and convention violations the linter misses, missing tests for obviously testable behaviour, simple bugs in well-known patterns (off-by-one, missing null check, missing await), OWASP-top-10 security anti-patterns, performance regressions in known patterns (N+1 queries, unbounded loops), and documentation gaps on public APIs.
What does AI miss in code review?
Intent mismatches (the change does what it claims but the claim is wrong), architectural drift (inconsistency with codebase direction), tradeoff judgement calls, subtle business-logic bugs (AI doesn't know your business rules), cross-cutting concerns the diff doesn't expose, and tests that pass but don't verify the right thing.
Should AI auto-approve PRs that pass AI review?
No. AI misses entire defect categories that humans catch. Auto-approving on an AI green light keeps coverage of the defect classes AI is good at (mechanical issues) and gives up coverage of the ones it is bad at (intent, architecture, tradeoffs). The exception: genuinely trivial changes (typo fixes, doc updates) where AI-only review is safe.
How much does AI code review cost?
For a team merging 200 PRs/month at ~50KB context per review, roughly $50-150/month in LLM API costs. The ROI is typically dramatic when AI saves 30 minutes per PR in human review time. The hidden cost worth managing is noise: an AI tool that comments on every PR with mostly-irrelevant suggestions trains the team to ignore it.