Service-level objectives for code review

Three numbers that make review effectiveness measurable: time-to-first-touch, time-to-merge, defect escape rate. Targets, measurement, and what to do when they slip.

May 23, 2026 · 10 min read

Most teams have no SLOs for code review. They have norms ("we try to review within a day"), they have complaints ("reviews take forever"), and they have a vague sense that things could be better, but no measurement, no target, and no way to detect whether changes to the process are actually working.

Service-level objectives for code review fix this. Three numbers, measured consistently, evaluated quarterly. The teams that adopt them see cycle time drop measurably within 4-8 weeks; the teams that don't keep complaining without improving.

The three SLOs that matter

SLO 1: Time-to-first-touch

Definition: elapsed time from PR open to the first substantive reviewer engagement (an approval, a request for changes, or a review comment with real content), reported as the median (p50) plus p95 and p99. Excludes auto-comments, lint bot output, and "I'll look at this later" placeholders.

Target: p50 ≤ 2 hours during working windows. p95 ≤ 8 hours. p99 ≤ 24 hours.

Why this metric: time-to-first-touch is the single number most strongly correlated with cycle time and developer satisfaction. A PR that sits unread is the most expensive PR: the author has context-switched away, the work is blocked, and the cost of re-engaging compounds as time passes.

The working-windows qualifier matters: a team spanning SF + Berlin + Singapore has roughly 18 hours per day of combined working-time coverage. The SLO clock runs only within the reviewers' working windows, not against a 24/7 calendar. For a PR whose reviewers are all in SF, opening at 11pm SF time can legitimately produce a 10-hour wall-clock time-to-first-touch (until SF morning); a PR that opens at 11am SF time and sits for 10 hours has missed the SLO.
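The working-window rule amounts to a clock that only runs during working hours. A minimal sketch, assuming a single reviewer site, a 09:00-17:00 Monday-to-Friday window, and naive datetimes already expressed in that site's local time (the function name and defaults are illustrative, not taken from any tool):

```python
from datetime import datetime, time, timedelta


def working_hours_between(opened, touched, day_start=time(9), day_end=time(17)):
    """Hours between two local datetimes that fall inside Mon-Fri working windows."""
    if touched <= opened:
        return 0.0
    total = timedelta()
    day = opened.date()
    while day <= touched.date():
        if day.weekday() < 5:  # Monday=0 .. Friday=4; weekends are skipped
            window_start = datetime.combine(day, day_start)
            window_end = datetime.combine(day, day_end)
            # Clip the PR's waiting interval to this day's working window.
            overlap_start = max(opened, window_start)
            overlap_end = min(touched, window_end)
            if overlap_end > overlap_start:
                total += overlap_end - overlap_start
        day += timedelta(days=1)
    return total.total_seconds() / 3600
```

Under these assumptions, a PR opened at 23:00 on a Monday and first touched at 09:00 on Tuesday accrues zero working hours, while one opened at 11:00 and touched at 21:00 the same day accrues six. A multi-site team would need one window per site and timezone-aware timestamps, which this sketch leaves out.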

SLO 2: Time-to-merge

Definition: elapsed time from PR open to merge, excluding any merge-queue wait if your team uses a merge queue, reported as p50 and p95. Captures the full review-and-iterate cycle.

Target: p50 ≤ 8 hours for PRs under 200 LOC. p50 ≤ 24 hours for PRs 200-500 LOC. p95 ≤ 48 hours regardless of size.

Why this metric: time-to-merge is what the team actually feels. Time-to-first-touch tells you whether someone engaged; time-to-merge tells you whether the engagement converged. A team with 2-hour first-touch but 5-day merge has a different problem (round-trip noise, comment density, re-review delay) than a team with 24-hour first-touch and 30-hour merge (single-touch reviews but slow first engagement).

The PR-size split matters because a 50-LOC fix and a 500-LOC feature should not be held to the same SLO. The split prevents the SLO from being either too loose (averaging in large PRs that are legitimately slow) or too tight (forcing rushed reviews of complex changes).
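To see why the split matters, compare a pooled median with per-bucket medians checked against the targets above. This is a rough sketch: the bucket names, the tuple layout, and the decision to treat exactly 500 LOC as the top of the middle bucket are assumptions made for illustration.

```python
from statistics import median

# p50 time-to-merge targets in hours, taken from the targets stated above.
P50_TARGET_HOURS = {"under_200": 8, "200_to_500": 24}


def size_bucket(loc_changed):
    """Map a PR's changed-line count to a size bucket (boundaries are an assumption)."""
    if loc_changed < 200:
        return "under_200"
    if loc_changed <= 500:
        return "200_to_500"
    return "over_500"


def p50_by_bucket(prs):
    """prs: iterable of (loc_changed, hours_to_merge) tuples."""
    grouped = {}
    for loc, hours in prs:
        grouped.setdefault(size_bucket(loc), []).append(hours)
    return {bucket: median(values) for bucket, values in grouped.items()}


def missed_targets(prs):
    """Return the buckets whose p50 exceeds its target, with the observed p50."""
    p50s = p50_by_bucket(prs)
    return {
        bucket: p50s[bucket]
        for bucket, target in P50_TARGET_HOURS.items()
        if bucket in p50s and p50s[bucket] > target
    }
```

For the sample `[(50, 4), (120, 6), (180, 30), (300, 20), (450, 40), (800, 70)]`, the pooled median is 25 hours, which says little on its own. Split by bucket, small PRs sit at 6 hours (inside target) and mid-sized PRs at 30 hours (a miss).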

SLO 3: Defect escape rate

Definition: defects discovered in production that the team agrees should have been caught in review, divided by the number of non-trivial PRs (>50 LOC) merged in the measurement window.

Target: ≤2% for non-trivial PRs (>50 LOC). Whether a defect "should have been caught" is a judgement call. Make that call in a blameless postmortem and accept that some defects could have been caught but weren't.

Why this metric: time-to-first-touch and time-to-merge are speed measures; defect escape is the quality measure. Without it, "improve review SLOs" turns into "approve faster," which trades quality for speed. Tracking both keeps the speed/quality tradeoff visible.

The denominator matters: dividing by all PRs (including trivial fixes that don't represent meaningful review work) understates the real rate. Stride's recommendation is to filter to PRs >50 LOC, which is a rough proxy for "had enough change to plausibly hide a defect."
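The effect of the denominator is easy to check with a few lines of arithmetic. The sketch below assumes each PR record carries a changed-line count and a flag set during the postmortem; the field names `loc` and `escaped` are hypothetical, and restricting the numerator to the same filtered set is also an assumption.

```python
def defect_escape_rate(prs, min_loc=50):
    """Escaped defects per merged PR, counting only PRs with more than min_loc lines changed.

    prs: list of dicts with 'loc' (int) and 'escaped' (bool, set in the postmortem).
    """
    eligible = [pr for pr in prs if pr["loc"] > min_loc]
    if not eligible:
        return 0.0
    escaped = sum(1 for pr in eligible if pr["escaped"])
    return escaped / len(eligible)


def naive_escape_rate(prs):
    """Same numerator idea, but divided by every merged PR, trivial ones included."""
    if not prs:
        return 0.0
    return sum(1 for pr in prs if pr["escaped"]) / len(prs)
```

With 100 merged PRs, of which 60 are at or below the threshold and 2 of the remaining 40 produced an escaped defect, the naive rate is 2% while the filtered rate is 5%. The same team looks on target or well off it depending only on the denominator.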

How to measure

The data lives in your version-control system (GitHub, GitLab, Bitbucket) and your incident tracker. The pipeline:

Export PR data weekly: opened time, first-comment-or-approval time, merge time, LOC changed, author, reviewer(s).
Tag PRs by category: trivial (typo, docs, config), small (under 200 LOC), medium (200-500), large (over 500). Use a script over the diff stats, not human labels.
Compute the SLOs: p50 / p95 / p99 per category; defect escape rate over a rolling 4-week window.
Track trends: chart the SLOs weekly, but treat week-to-week variation as noise. The trend over 4-8 weeks is the signal for day-to-day decisions; the quarter-over-quarter trend is what the quarterly evaluation should judge.
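The tagging and percentile steps can be sketched in plain Python. Everything named here is an assumption for illustration: the suffix list standing in for "docs and config", the choice of the nearest-rank percentile method, and the decision to count exactly 500 LOC as medium.

```python
import math

# Hypothetical rule: a PR touching only these file types counts as trivial.
TRIVIAL_SUFFIXES = (".md", ".rst", ".txt", ".yml", ".yaml", ".toml")


def categorise(loc_changed, changed_files):
    """Tag a PR from its diff stats rather than from human labels."""
    if changed_files and all(f.endswith(TRIVIAL_SUFFIXES) for f in changed_files):
        return "trivial"
    if loc_changed < 200:
        return "small"
    if loc_changed <= 500:
        return "medium"
    return "large"


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def slo_summary(rows):
    """rows: iterable of (category, hours) pairs; returns p50/p95/p99 per category."""
    by_category = {}
    for category, hours in rows:
        by_category.setdefault(category, []).append(hours)
    return {
        category: {f"p{p}": percentile(values, p) for p in (50, 95, 99)}
        for category, values in by_category.items()
    }
```

Nearest-rank is used because it always returns an observed value, which is easier to explain on a dashboard than an interpolated one. With small weekly samples the p95 and p99 will often coincide with the maximum.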

Tooling varies. GitHub's REST API, a Python script, and a CSV feeding a dashboard are sufficient; managed tools (LinearB, Code Climate Velocity, Stride's review analytics) automate the pipeline at the cost of vendor dependency.
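For the do-it-yourself route, the core of the script is turning API payloads into a first-touch duration. The sketch below works on already-fetched JSON from GitHub's REST API: the pull request object (`GET /repos/{owner}/{repo}/pulls/{pull_number}`) and its review list (`GET /repos/{owner}/{repo}/pulls/{pull_number}/reviews`). The function names are illustrative, fetching and pagination are left out, and the result is wall-clock hours with no working-window adjustment.

```python
from datetime import datetime

# Review states that count as engagement in this sketch; PENDING and DISMISSED are ignored.
ENGAGED_STATES = {"APPROVED", "CHANGES_REQUESTED", "COMMENTED"}


def parse_timestamp(value):
    """Parse a GitHub-style UTC timestamp such as '2026-05-18T09:00:00Z'."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def first_touch_hours(pr, reviews):
    """Wall-clock hours from PR creation to the first human review, or None if untouched."""
    opened = parse_timestamp(pr["created_at"])
    author = pr["user"]["login"]
    touches = [
        parse_timestamp(review["submitted_at"])
        for review in reviews
        if review.get("submitted_at")
        and review["state"] in ENGAGED_STATES
        and review["user"]["type"] != "Bot"   # drop lint bots and other automation
        and review["user"]["login"] != author  # the author replying is not a review
    ]
    if not touches:
        return None
    return (min(touches) - opened).total_seconds() / 3600
```

Filtering on account type removes bot output, but it cannot tell a substantive comment from an "I'll look at this later" placeholder. That judgement still needs a heuristic or a manual pass.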

Setting the right targets

The targets above are reasonable defaults for distributed teams shipping daily. They're not universal. Adjust based on:

Team size: smaller teams (under 6) have less reviewer redundancy; the 2-hour first-touch is harder. Adjust to 4 hours.
Timezone spread: teams with no overlap (e.g. Bangalore + San Francisco only) cannot hit a 2-hour median; the absence of overlap is the binding constraint. Make peace with 8-12 hour first-touch and focus on time-to-merge.
Regulatory environment: teams in regulated industries (finance, healthcare, government) often have mandatory review steps that extend time-to-merge for reasons unrelated to the team's discipline. Adjust accordingly.
PR-size cap: a team that caps PRs at 200 LOC can hold tighter time-to-merge targets than one that allows 1000-LOC PRs.

The discipline is to set the targets explicitly, agree them as a team, and revisit quarterly.
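One way to make the targets explicit is to keep them in code next to the measurement script. The sketch below encodes only the two first-touch adjustments that come with numbers above; the function name is hypothetical, and picking 12 hours (the upper end of the 8-12 hour range) for teams with no timezone overlap is an assumption a team would settle for itself.

```python
DEFAULT_FIRST_TOUCH_P50_HOURS = 2.0


def first_touch_p50_target(team_size, has_timezone_overlap):
    """Pick a first-touch p50 target (hours) from the adjustments described above."""
    if not has_timezone_overlap:
        # No shared working hours: the text suggests accepting 8-12 hours.
        # The upper end is used here as a conservative assumption.
        return 12.0
    if team_size < 6:
        # Small teams have less reviewer redundancy.
        return 4.0
    return DEFAULT_FIRST_TOUCH_P50_HOURS
```

The regulatory and PR-size-cap adjustments are left out because the text gives no numbers for them. Those belong in the team's quarterly discussion rather than in a default.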

What changes when the SLOs are missed

A missed SLO is information, not blame. The interesting question is what to do about it.

Time-to-first-touch missed: usually a reviewer-routing or notification problem. Common fixes: add a second auto-assigned reviewer, route notifications to a team channel rather than personal Slack, set up a "PR queue" dashboard the team checks at the start of each day.

Time-to-merge missed but first-touch hit: usually a round-trip-count or PR-size problem. Common fixes: enforce a PR-size cap, train reviewers on batch-commenting (vs drip-feed), add async-review templates to compress back-and-forth.

Defect escape rate exceeded: usually a review-rigour problem. Common fixes: introduce or refresh the checklist; ensure the mob review trigger list catches the right PR categories; consider whether AI-assisted review would catch the dominant defect class.

The fix is structural, not exhortation. "Try to review faster" doesn't move the SLO; changing the reviewer-routing rule does.

The leading indicators

The three SLOs are lagging. They measure what's already happened. Three leading indicators predict trend changes:

PR queue depth: number of PRs open and awaiting first review. Trending up = time-to-first-touch will rise.
Round trips per PR: median number of review-comment / author-push cycles per PR. Trending up = time-to-merge will rise.
PR size distribution: share of PRs >500 LOC. Trending up = time-to-merge will rise and defect-escape will rise.

A team that watches the leading indicators weekly catches drift before it shows up in the lagging SLOs. A team that watches only the lagging SLOs spends every quarter playing catch-up.
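The three indicators are cheap to compute from the same weekly export. In this sketch the input shapes and the "rising" rule (mean of the latest four weeks against the mean of the four before) are assumptions chosen for illustration, not a prescribed method.

```python
from statistics import median


def leading_indicators(open_awaiting_first_review, round_trips, loc_per_pr):
    """Summarise one week's leading indicators.

    open_awaiting_first_review: count of open PRs with no review yet
    round_trips: review/push cycles for each PR merged this week
    loc_per_pr: lines changed for each PR merged this week
    """
    large = sum(1 for loc in loc_per_pr if loc > 500)
    return {
        "queue_depth": open_awaiting_first_review,
        "median_round_trips": median(round_trips),
        "share_over_500_loc": large / len(loc_per_pr),
    }


def rising(weekly_values, weeks=4):
    """True if the mean of the latest `weeks` values exceeds the mean of the preceding `weeks`."""
    if len(weekly_values) < 2 * weeks:
        return False  # not enough history to compare two windows
    recent = weekly_values[-weeks:]
    previous = weekly_values[-2 * weeks:-weeks]
    return sum(recent) / weeks > sum(previous) / weeks
```

Comparing two multi-week windows rather than consecutive weeks keeps the check in line with the advice to ignore week-to-week noise.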

Cultural anti-patterns

The SLOs work only when applied with a few cultural commitments, each of which guards against a common anti-pattern:

Blameless interpretation. The SLO is the team's measure, not an individual scorecard. "Reviewer X always misses the SLO" is not a useful framing; "the routing rule sends 80% of the work to one reviewer" is.
Speed and quality are not opposed. A team that hits speed SLOs by approving without review will eventually blow the defect-escape SLO. A team that hits the defect-escape SLO by mob-reviewing everything will blow speed. The discipline is balancing all three.
The SLOs are negotiable. As the team learns its actual capacity, the targets should evolve. The first set is a starting point; the third or fourth quarterly revision is where they stabilise.

A worked example

Team of 8 engineers, spanning Berlin + New York. Initial baseline (measured over 4 weeks before introducing SLOs):

Time-to-first-touch p50: 6 hours, p95: 22 hours
Time-to-merge p50: 28 hours, p95: 96 hours
Defect escape rate (PRs >50 LOC): 4%

After 6 weeks of focused intervention (added second reviewer auto-assignment, introduced PR-size cap, refreshed the checklist):

Time-to-first-touch p50: 2.5 hours, p95: 9 hours
Time-to-merge p50: 14 hours, p95: 52 hours
Defect escape rate: 2.5%

After 12 weeks (added AI-assisted review for routine PRs, established the mob trigger list):

Time-to-first-touch p50: 1.5 hours, p95: 6 hours
Time-to-merge p50: 8 hours, p95: 36 hours
Defect escape rate: 1.5%

The numbers compound. The first improvement comes fast; subsequent improvements come from progressively deeper changes. A team that doesn't measure can't tell which changes worked, or whether anything improved at all.
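The size of each step is easier to see as a relative reduction against the baseline. The figures below are the p50 and defect-escape values from the worked example; the dictionary keys and helper name are illustrative.

```python
BASELINE = {"first_touch_p50_h": 6, "merge_p50_h": 28, "defect_escape_pct": 4}
WEEK_6 = {"first_touch_p50_h": 2.5, "merge_p50_h": 14, "defect_escape_pct": 2.5}
WEEK_12 = {"first_touch_p50_h": 1.5, "merge_p50_h": 8, "defect_escape_pct": 1.5}


def reduction_pct(before, after):
    """Relative reduction from `before` to `after`, as a percentage rounded to one decimal."""
    return round((before - after) / before * 100, 1)


def reductions(snapshot):
    """Reduction of every metric in `snapshot` relative to the baseline."""
    return {name: reduction_pct(BASELINE[name], value) for name, value in snapshot.items()}


for label, snapshot in (("week 6", WEEK_6), ("week 12", WEEK_12)):
    print(label, reductions(snapshot))
```

Against the baseline, week 6 shows reductions of 58.3%, 50.0%, and 37.5% for first-touch, merge, and defect escape respectively; week 12 shows 75.0%, 71.4%, and 62.5%. Most of the first-touch gain arrives in the first six weeks, while merge time and defect escape keep improving at a similar pace in the second six.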

For the practices that hit the SLOs, see Async code review and Review checklists. For when the SLO target should be relaxed (mob review for high-stakes changes), see Mob review. For the AI-assisted lever that often moves all three SLOs at once, see AI-assisted code review.

Frequently asked questions

What SLOs should I set for code review?

Three metrics: time-to-first-touch (target p50 ≤2 hours, p95 ≤8 hours), time-to-merge (target p50 ≤8 hours for PRs under 200 LOC), defect escape rate (target ≤2% for PRs >50 LOC). Together they balance speed and quality: speed alone trades rigour for throughput; quality alone produces slow review.

How do I measure code review effectiveness?

Track the three SLOs weekly via your version-control system's API (GitHub, GitLab). Watch trends over 4-8 weeks, not week-to-week variation. Pair with leading indicators (PR queue depth, round trips per PR, PR size distribution) that predict where the trends are heading.

What does a "good" defect escape rate look like?

≤2% for non-trivial PRs (>50 LOC changed) is achievable for teams running disciplined checklists and AI-assisted review. Rates above 5% usually indicate review-rigour problems; rates below 1% often mean the team is over-reviewing (mob-reviewing too many PRs, or the checklist is too long) and trading speed for marginal quality.

How long does it take to improve review SLOs?

Teams committing to structural changes (auto-assignment, PR size cap, comment templates) see measurable improvement in 4-6 weeks. Further compounding improvements (AI assistance, checklist refinement, mob review for high-stakes) come over 12-16 weeks. The metric trajectory is more important than absolute values for the first 2-3 months.