Regression strategy that scales past 10,000 tests
At 10k+ tests, "run everything" stops being a strategy. The 4-tier approach (smoke / affected / full nightly / pre-release) keeps iteration fast without sacrificing coverage.
Past 10,000 tests, "run everything" stops being a strategy. The full suite takes hours; CI capacity is a constraint; flaky tests compound; and the marginal value of running test #9,997 to catch a regression that's already caught by 6 other tests is essentially zero. The teams that don't address this either accept multi-hour CI loops (which kill iteration speed) or quietly stop running large parts of the suite (which kills coverage).
The healthier path: tier tests by risk and impact, and run by tier on a per-change basis. The full suite still runs — but rarely, and never as the iteration-loop blocker.
The four tiers
A well-tiered regression strategy looks like this:
Tier 1: Smoke (runs on every PR, in under 5 minutes)
The smallest set of tests that verifies the application works end-to-end at all. Login, key transaction, primary read path, primary write path. If any of these break, nothing else matters; the application is broken.
A healthy smoke suite is 50-100 tests. It runs on every PR as a pre-merge gate. Authors get feedback within 5 minutes that they haven't broken anything fundamental.
Tier 2: Affected (runs on every PR, scoped to impacted code paths)
Tests that exercise the code paths the PR touched, derived via test-impact analysis (TIA). For a typical 200-line PR, this is usually 200-2,000 tests of the total 10K+ suite — enough to catch direct regressions, fast enough to run within 10-15 minutes.
TIA tooling matters here. Microsoft's Test Impact Analysis, Bazel's incremental test selection, and language-specific tools (Nx for TypeScript monorepos, pytest-testmon for Python) all support some form of it. The investment in TIA keeps paying back for any suite that takes longer than 30 minutes to run.
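To make the selection step concrete, here is a minimal sketch of file-level test-impact analysis. It assumes a coverage map recorded on an earlier full run; the map, the file paths, and the function names are hypothetical, and real tools such as pytest-testmon or Bazel track dependencies at a finer grain than this.

```python
from collections import defaultdict

# Hypothetical coverage map: test name -> source files it exercised on the last full run.
COVERAGE = {
    "test_login": {"auth/session.py", "auth/password.py"},
    "test_checkout": {"cart/checkout.py", "payments/charge.py"},
    "test_refund": {"payments/charge.py", "payments/refund.py"},
    "test_dashboard": {"ui/dashboard.py"},
}


def build_reverse_index(coverage):
    """Invert test -> files into file -> tests."""
    index = defaultdict(set)
    for test, files in coverage.items():
        for path in files:
            index[path].add(test)
    return index


def select_affected(changed_files, coverage):
    """Return (tests to run, changed files that no test is known to cover)."""
    index = build_reverse_index(coverage)
    selected, unmapped = set(), []
    for path in changed_files:
        if path in index:
            selected |= index[path]
        else:
            unmapped.append(path)
    return sorted(selected), unmapped


tests, unmapped = select_affected(["payments/charge.py", "config/flags.yaml"], COVERAGE)
print(tests)     # ['test_checkout', 'test_refund']
print(unmapped)  # ['config/flags.yaml']
```

The unmapped list is worth surfacing in CI: a changed file that appears in no test's coverage (configuration, generated code) is exactly the kind of change the dependency graph cannot see, and a cautious pipeline might fall back to a wider run when it is non-empty.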
Tier 3: Full regression (runs nightly on main, takes 1-3 hours)
The complete test suite, run end-to-end on a clean environment. This catches regressions that TIA missed because the dependency wasn't captured in the graph (configuration changes, generated code, infrastructure quirks). Failures get triaged the next morning; the team fixes forward or reverts.
Full regression doesn't gate any individual PR. It gates the next release window — if main has been red overnight, the morning standup includes "investigate the red main".
Tier 4: Pre-release (runs before each release, takes 2-6 hours)
The full suite plus longer-running tests skipped from nightly: load tests, soak tests, full browser-matrix tests, security scans, accessibility audits. Anything too expensive to run every night but important to verify before customers see it.
Pre-release runs gate the release itself. The team's release runbook includes "all pre-release tests green" as a hard requirement.
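The four tiers can be written down as data, which makes the "what runs when, and what does it block" question easy to answer in CI scripts. The sketch below is one possible encoding, not a prescribed format: the event names and field names are assumptions, and the time budgets are the upper bounds quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    trigger: str         # CI event that starts the tier (hypothetical event names)
    budget_minutes: int  # upper bound on runtime, taken from the figures above
    gates: str           # what a red run blocks


TIERS = [
    Tier("smoke", "pull_request", 5, "merge"),
    Tier("affected", "pull_request", 15, "merge"),
    Tier("full", "nightly", 180, "release window"),
    Tier("pre_release", "release", 360, "release"),
]


def tiers_for(event):
    """Names of the tiers that should run for a given CI event."""
    return [tier.name for tier in TIERS if tier.trigger == event]


def blocks_merge(tier_name):
    """True only for tiers that act as a pre-merge gate."""
    return any(t.name == tier_name and t.gates == "merge" for t in TIERS)


print(tiers_for("pull_request"))  # ['smoke', 'affected']
print(tiers_for("nightly"))       # ['full']
print(blocks_merge("full"))       # False
```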
Why this tiering works
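The math is straightforward. 10K tests at an average of 30 seconds each is about 83 hours of serial runtime (10,000 × 30 s = 300,000 s), or roughly 5 hours of wall-clock time if CI spreads the suite over about 17 parallel workers. At that level of parallelism:
- Run everything on every PR: 5 hours per PR
- Run smoke only on every PR: 5 minutes per PR
- Run smoke + affected on every PR: ~15 minutes per PR (typical TIA selectivity)
- Run full nightly: 5 hours (closer to the 1-3 hour tier-3 target with more workers), but blocks nothing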
- Run pre-release: 5-8 hours, runs ~weekly
The full suite's runtime has not changed; it has moved off the PR's critical path. The latency the author experiences is dramatically different: a 15-minute iteration loop, not a 5-hour one. Throughput improves correspondingly.
What goes in which tier?
The hardest part of regression tiering isn't the structure; it's deciding which test goes where. The framework:
Smoke criteria
A test belongs in smoke if its failure means the application is broken at a level no customer should experience. The login flow. The primary checkout. The main dashboard render. If you're not sure whether a test belongs in smoke, it probably doesn't — smoke should be small enough that every team member knows what's in it.
Affected criteria (automated)
TIA determines this dynamically. The team doesn't manually assign tests; the tooling does. Quality depends on the dependency graph's accuracy.
Full regression criteria
The nightly run executes the complete suite, and every test that is not in smoke lives here by default: tests that prove edge cases, exercise less-common code paths, verify data-migration correctness, or hit third-party integrations under mock.
Pre-release criteria
Tests that are too slow, too expensive, or too disruptive to run frequently. Browser-matrix tests (Chrome/Safari/Firefox/Edge × Win/macOS/Linux). Load tests that hit 10K req/s. Soak tests that run for hours. Tests requiring expensive cloud resources (large GPUs, datacenter-scale clusters).
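The criteria above can be expressed as a small decision function. The sketch below is illustrative only: the metadata fields and the 10-minute slowness threshold are assumptions a team would replace with its own, and the affected tier is deliberately absent because TIA assigns it per change rather than per test.

```python
from dataclasses import dataclass

SLOW_THRESHOLD_SECONDS = 600  # hypothetical cut-off for "too slow to run nightly"


@dataclass
class CaseInfo:
    name: str
    critical_path: bool = False          # failure means the app is broken for every customer
    runtime_seconds: float = 1.0
    needs_expensive_infra: bool = False  # browser matrix, load rig, large GPUs


def assign_tier(case):
    """Static home tier for a test; 'affected' is computed per PR, not assigned here."""
    if case.needs_expensive_infra or case.runtime_seconds > SLOW_THRESHOLD_SECONDS:
        return "pre_release"
    if case.critical_path:
        return "smoke"
    return "full"  # the default for everything else


cases = [
    CaseInfo("test_login", critical_path=True),
    CaseInfo("test_csv_export_edge_cases"),
    CaseInfo("test_soak_8h", runtime_seconds=8 * 3600),
    CaseInfo("test_safari_matrix", needs_expensive_infra=True),
]
print({c.name: assign_tier(c) for c in cases})
```

Checking the expensive and slow conditions first is a design choice: a critical-path test that takes hours cannot serve as a 5-minute gate, so it is better to notice that and write a faster smoke variant than to let it into the smoke set.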
Migration from "run everything" to tiered
The pragmatic migration path takes 4-6 weeks:
Week 1: define the smoke set. Pick ~50 tests covering the absolute fundamentals. Run on every PR. Get the team comfortable that "smoke passing" is sufficient for PR confidence.
Week 2-3: set up TIA tooling. Configure the dependency graph. Validate on historical PRs (does TIA catch the regressions the team actually saw last quarter?).
Week 4: shift the default from "run everything" to "smoke + affected" on PRs. Run full nightly. Communicate the change explicitly so the team isn't surprised when CI suddenly finishes in 15 minutes.
Week 5-6: tune. Watch which regressions slip past smoke + affected and get caught by nightly. Add tests to the smoke or affected tier if a pattern emerges; otherwise accept that the nightly catches the long tail.
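The validation step in weeks 2-3 (and the tuning in weeks 5-6) comes down to one number: of the regressions the team actually saw, what fraction would TIA have selected the failing test for? A minimal sketch, assuming a hypothetical history of (changed files, failing test) pairs and a stand-in selection function:

```python
# Hypothetical coverage map standing in for the real TIA dependency graph.
COVERAGE = {
    "test_login": {"auth/session.py"},
    "test_checkout": {"cart/checkout.py", "payments/charge.py"},
    "test_refund": {"payments/charge.py", "payments/refund.py"},
}


def select(changed_files):
    """Stand-in for the TIA tool: tests whose covered files overlap the change."""
    changed = set(changed_files)
    return {test for test, files in COVERAGE.items() if files & changed}


def catch_rate(history, select_fn):
    """Fraction of past regressions whose failing test TIA would have run."""
    if not history:
        raise ValueError("need at least one historical regression")
    missed = [(files, test) for files, test in history if test not in select_fn(files)]
    return 1 - len(missed) / len(history), missed


# Hypothetical regressions from last quarter: (files changed in the PR, test that failed).
HISTORY = [
    (["payments/charge.py"], "test_refund"),
    (["auth/session.py"], "test_login"),
    (["config/flags.yaml"], "test_checkout"),  # config change not captured in the graph
    (["cart/checkout.py"], "test_checkout"),
]

rate, missed = catch_rate(HISTORY, select)
print(rate)    # 0.75
print(missed)  # [(['config/flags.yaml'], 'test_checkout')]
```

The missed list is as useful as the rate itself: it shows which kinds of change the graph does not see, which is where nightly has to carry the load.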
The change is mostly cultural. Engineers used to "CI passes = ready to ship" need to absorb that "smoke + affected passes = safe to merge; nightly + pre-release passes = safe to release". Both are necessary; neither is sufficient alone.
Flaky tests in a tiered strategy
Flaky tests are the wild card. A flake in the smoke tier blocks every PR; a flake in nightly is annoying but absorbable. The pragmatic policy:
- Smoke tier: zero tolerance for flakes. A flaky smoke test is fixed or quarantined within a day; if it is not, the team ends up disabling the smoke gate instead. The smoke set has to be reliable enough to trust as a blocker.
- Affected tier: low tolerance. A flaky test here only runs on the PRs whose changes select it (on the order of 10% of PRs, though this varies by test); persistent flakes get quarantined within a week.
- Full regression: moderate tolerance. A flake in nightly produces a noisy report; the team triages within the sprint.
- Pre-release: case-by-case. Some pre-release tests (load tests especially) are inherently noisy; the policy is to capture trends rather than enforce per-run determinism.
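The per-tier policy above can be sketched as a detection step plus a deadline lookup. This is an illustration under stated assumptions: a test counts as flaky when it both passed and failed on the same commit, "within the sprint" is read as 14 days, and the run records and tier mapping are hypothetical.

```python
from collections import defaultdict

QUARANTINE_DEADLINE_DAYS = {
    "smoke": 1,
    "affected": 7,
    "full": 14,           # assumption: "within the sprint" read as a two-week sprint
    "pre_release": None,  # tracked as a trend, no per-run deadline
}


def find_flaky(runs):
    """runs: iterable of (test, commit, passed). Flaky = mixed outcomes on one commit."""
    outcomes = defaultdict(set)
    for test, commit, passed in runs:
        outcomes[(test, commit)].add(passed)
    return sorted({test for (test, _), seen in outcomes.items() if len(seen) > 1})


def quarantine_plan(runs, tier_of):
    """Map each flaky test to the number of days allowed before quarantine."""
    return {test: QUARANTINE_DEADLINE_DAYS[tier_of[test]] for test in find_flaky(runs)}


RUNS = [
    ("test_login", "abc123", True),
    ("test_login", "abc123", False),   # same commit, different outcome: flaky
    ("test_search", "abc123", False),
    ("test_search", "def456", True),   # different commits: could be a real fix
    ("test_export", "def456", True),
]
TIER_OF = {"test_login": "smoke", "test_search": "full", "test_export": "full"}

print(quarantine_plan(RUNS, TIER_OF))  # {'test_login': 1}
```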
Tooling
The category leaders for tiered regression:
- Test-impact analysis: Microsoft Test Impact Analysis, Bazel (with proper rule configuration), Nx (TypeScript monorepo), pytest-testmon (Python), Jest with --changedSince (less sophisticated but free).
- Parallel execution: GitHub Actions matrix, CircleCI parallelism, Buildkite agents, Jenkins parallel pipelines. Most CI systems support this; what matters is configuring it well (avoiding shared state, partitioning evenly).
- Test selection by failure history: TestImpact extensions (Java/.NET), Launchable (cloud service), proprietary tools at Google/Facebook scale. The pattern: prioritise tests historically most likely to fail.
- Flaky-test detection: Datadog CI Visibility, Buildkite Test Analytics, CircleCI Test Insights. The output: lists of historically-flaky tests for quarantine triage.
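"Partitioning evenly" in the parallel-execution item usually means balancing shards by historical duration rather than by test count. One common heuristic is greedy longest-first assignment; the sketch below assumes a hypothetical map of test names to recorded durations and is not how any particular CI system does it.

```python
import heapq


def partition(durations, shards):
    """Greedy longest-processing-time-first split of tests across shards."""
    if shards < 1:
        raise ValueError("shards must be at least 1")
    heap = [(0.0, i) for i in range(shards)]  # (current load in seconds, shard index)
    heapq.heapify(heap)
    buckets = [[] for _ in range(shards)]
    # Place the longest tests first; ties are broken by name for determinism.
    for test, seconds in sorted(durations.items(), key=lambda kv: (-kv[1], kv[0])):
        load, i = heapq.heappop(heap)  # least-loaded shard so far
        buckets[i].append(test)
        heapq.heappush(heap, (load + seconds, i))
    return buckets


# Hypothetical durations in seconds from previous runs.
DURATIONS = {"a": 90, "b": 60, "c": 45, "d": 30, "e": 30, "f": 15}

buckets = partition(DURATIONS, 2)
loads = [sum(DURATIONS[t] for t in bucket) for bucket in buckets]
print(buckets)  # [['a', 'd', 'f'], ['b', 'c', 'e']]
print(loads)    # [135, 135]
```

Splitting the same six tests by count alone (three per shard in listed order) would give loads of 195 and 75 seconds, so the slowest shard, which determines wall-clock time, would run about 45% longer.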
Common pitfalls
- Skipping the migration: jumping straight to tiered without team buy-in produces resentment. Communicate explicitly.
- Over-aggressive smoke set: pushing too many tests into smoke makes the gate slow and trains the team to skip it. Keep smoke small.
- Ignoring nightly failures: if nightly is "always red", the team learns to ignore it. Triage nightly daily; fix or revert within the sprint.
- TIA without verification: trusting TIA without measuring its catch rate produces false confidence. At least weekly, compare the full regression results against what TIA selected to confirm it isn't missing systematic regressions.
Related reading
For the test-case structure that supports tiering, see Test-case design. For tracking which tests cover which requirements, see Traceability matrix. For what to do with the defects regression surfaces, see Defect triage.
Frequently asked questions
- When does "run everything" stop scaling?
- Around the 30-60 minute mark for the full test suite — at that point, PR cycle time becomes dominated by CI rather than code review or implementation. The 4-tier approach (smoke, affected, full nightly, pre-release) keeps iteration fast while preserving full coverage at slower cadences.
- What is test impact analysis?
- Test impact analysis (TIA) identifies the subset of tests relevant to a code change by analysing the dependency graph between source files and test files. CI runs only the affected tests instead of the entire suite. TIA can cut CI time by 5-20x for large suites with disciplined dependency tracking.
- How big should the smoke test suite be?
- 50-100 tests covering the absolute fundamentals — login, primary read path, primary write path, key transactions. Should complete in under 5 minutes. Small enough that every team member knows what's in it; reliable enough to trust as a pre-merge gate.
- How do I migrate from "run everything" to tiered?
- 4-6 week migration: Week 1 define smoke set, Week 2-3 set up TIA tooling, Week 4 shift default from "run everything" to "smoke + affected" on PRs, Week 5-6 tune. Communicate the change explicitly so the team isn't surprised when CI suddenly finishes in 15 minutes. The hardest part is cultural, not technical.