What should a code review checklist actually contain?

Five questions that catch the dominant defect classes: (1) what happens at the boundaries (empty, null, max, min)? (2) what is the failure mode of this change? (3) does this change anything outside its visible scope? (4) will this be readable in 6 months? (5) is the test verifying behaviour rather than implementation? Everything else (style, formatting, lint rules) should be automated rather than reviewed by humans.

How long should code review take?

A focused, checklist-driven review of a 200-line PR typically takes 15-30 minutes. Reviews dragging past an hour either indicate the PR is too large (split it) or the reviewer is doing work the linter should be doing (automate it). Track median time-per-PR to identify which.

Should code review focus on style or substance?

Substance. Style enforcement is a linter's job. Prettier, ESLint, Black, or your language's equivalent should auto-fix style on save. Human review attention is too expensive to spend on quote consistency or import order; spend it on the five substantive questions above.

How do I get team buy-in for a code review checklist?

Tie it to your defect postmortems: when a bug escapes review, walk through which question would have caught it. After 3-4 postmortems, the team owns the list. Top-down checklist rollouts ("you must follow this") fail; bottom-up "this would have caught the bug we shipped last week" succeeds.

All articles in Code review

Code review

Code-review checklists that actually catch bugs

Most checklists are decorative. The five questions that find real defects, and why everything else belongs in the linter.

May 23, 20268 min read

Most code-review checklists are decorative. They live in a wiki, they're read once during onboarding, and they're never consulted again. The reviewer opens the PR, reads the diff, and applies whatever heuristics they happened to develop over their career, which means the bugs caught are the bugs that particular reviewer has personally been burned by before.

The fix isn't more thorough checklists. It's checklists with five questions, each tied to a concrete defect class, that the reviewer actually walks through.

The five questions that find real defects

These come from analysing ~3,000 production defects across Stride customers that were attributable to "should have been caught in review." The pattern is consistent: the same five categories account for ~80% of escape rates.

1. What happens at the boundaries?

Boundary conditions are where most logic bugs hide: empty arrays, single-element arrays, null/undefined, zero, negative numbers, the largest representable value, the smallest. The PR added a function calculateAverage(values). What does it do with []? What about [Infinity]? What about a 50-million-element input?

This is the question CI rarely tests for, because the developer who wrote the test usually didn't think of the boundary either. It's also the question AI-assisted review handles well, but only if the reviewer thinks to ask it.

2. What's the failure mode?

Every meaningful change has a failure mode. The PR adds a new API call. What happens when it 500s? What happens when it times out? What happens when it returns malformed data? The PR adds a new database query. What happens when the connection pool is exhausted? What happens when the query is slow?

The pattern to look for is asymmetry: success paths are usually well-tested; failure paths get a try { ... } catch (e) { console.error(e) } and the PR is approved. That console.error is the failure mode in production: silent, untraceable, accumulating in the log.

3. Does this change anything outside its visible scope?

The PR title says "fix typo in error message." The diff touches the error message and also a config file. What does the config file change do? Is it intentional? Is it tested? Is it documented in the commit?

This is the question that catches the unintentional behaviour change: the migration that ran successfully in dev but breaks production data, the dependency upgrade that came along for the ride, the global state mutation that affects every caller. Reviewers who default to "this looks fine" miss these regularly; reviewers who explicitly ask "what else does this touch?" catch them at high rate.

4. Will this be readable in 6 months?

Code is read more than it's written. The PR author has all the context in their head right now; in six months, neither they nor anyone else will. The question to ask isn't "do I understand this?" You have the PR open, of course you understand it. The question is "would a new engineer joining the team understand it?"

The defect class this catches isn't bugs per se. It's the slow accumulation of maintenance debt. Magic numbers without comments, abbreviations only the author knows, control flow that requires reading three files to trace, naming that conflates concepts. None of these will fail a test today; all of them will cost time on every future change.

5. Is the test the thing that should be tested?

The PR has tests. They pass. But what do they actually verify? Common anti-patterns:

Tests that assert implementation, not behaviour: expect(component.state.isLoading).toBe(true) instead of expect(screen.getByText('Loading...')).toBeInTheDocument(). The implementation test breaks on every refactor; the behaviour test survives.
Tests that exercise without asserting: await fetchUserData() followed by no expectations. This counts in coverage; it catches nothing.
Tests that mock the thing under test: jest.mock('./module-under-test') then assert the mock was called. This is testing the test, not the module.
Tests that pass because the assertion is too loose: expect(result).toBeDefined() when the real question is "does result equal what we expect?"

A reviewer who reads tests as carefully as the implementation catches these consistently. A reviewer who skims past "the tests pass" misses them.

What's NOT on this list

Style. Formatting. Import order. Variable casing. These are real concerns but they're not human-review concerns. They're linter concerns. If your reviews are spending time on style, you have a tooling gap. Configure Prettier, configure ESLint, configure the linter to auto-fix on save, and require zero lint warnings on merge. Reclaim the review time for the five questions above.

The same logic applies to missing tests for trivial code paths, missing JSDoc comments, and missing TypeScript types in places where inference works fine. If the team values them, automate them. If you can't automate, accept that humans will be inconsistent. Either way, don't burn senior reviewer attention on them.

Implementing the checklist

The five questions are an ordering, not a script. The reviewer doesn't paste them as a comment template; they walk through them mentally as they read the diff. The team-level practice is:

Onboarding: every new engineer reads this list and reviews 5 PRs alongside a senior engineer who narrates which question caught what.
Postmortem trigger: when a bug escapes review, the postmortem asks which question would have caught it. If none would have, the question gets added (rare). If one would have but the reviewer didn't ask it, the gap is process discipline (common).
Quarterly review: the team reviews the bug ledger and confirms the five questions still cover the dominant defect classes. They drift over time as the system matures; the list should evolve.

The discipline isn't memorising five bullets. It's the meta-habit of having explicit questions you apply consistently rather than ad-hoc heuristics that vary by reviewer mood and recent experience.

What this looks like at the team level

A team running the checklist well shows three indicators in their metrics:

Median defect-escape rate trending down per quarter. Not zero (escape rate hits a floor below which marginal reviews don't earn their cost) but trending.
Variance across reviewers narrowing. The "Alice catches 4x more bugs than Bob" pattern indicates ad-hoc heuristics; consistent application of the same five questions narrows the gap.
Review time decreasing on small PRs, increasing slightly on large PRs. The checklist cuts both ways: small PRs don't need 30 minutes; large PRs need more than the team historically gave them.

If your metrics don't move in these directions over a quarter of applying the checklist, the issue isn't the list. It's the meta-practice of consistent application.

For the SLO targets that operationalise this, see Service-level objectives for code review. For when to extend beyond single-reviewer review, see Mob review. For the AI-assisted variant of the checklist, see AI-assisted code review.

Frequently asked questions

What should a code review checklist actually contain?: Five questions that catch the dominant defect classes: (1) what happens at the boundaries (empty, null, max, min)? (2) what is the failure mode of this change? (3) does this change anything outside its visible scope? (4) will this be readable in 6 months? (5) is the test verifying behaviour rather than implementation? Everything else (style, formatting, lint rules) should be automated rather than reviewed by humans.
How long should code review take?: A focused, checklist-driven review of a 200-line PR typically takes 15-30 minutes. Reviews dragging past an hour either indicate the PR is too large (split it) or the reviewer is doing work the linter should be doing (automate it). Track median time-per-PR to identify which.
Should code review focus on style or substance?: Substance. Style enforcement is a linter's job. Prettier, ESLint, Black, or your language's equivalent should auto-fix style on save. Human review attention is too expensive to spend on quote consistency or import order; spend it on the five substantive questions above.
How do I get team buy-in for a code review checklist?: Tie it to your defect postmortems: when a bug escapes review, walk through which question would have caught it. After 3-4 postmortems, the team owns the list. Top-down checklist rollouts ("you must follow this") fail; bottom-up "this would have caught the bug we shipped last week" succeeds.