Regression strategy that scales past 10,000 tests

At 10k+ tests, "run everything" stops being a strategy. The 4-tier approach (smoke / affected / full nightly / pre-release) keeps iteration fast without sacrificing coverage.

Past 10,000 tests, "run everything" stops being a strategy. The full suite takes hours; CI capacity is a constraint; flaky tests compound; and the marginal value of running test #9,997 to catch a regression that's already caught by 6 other tests is essentially zero. The teams that don't address this either accept multi-hour CI loops (which kill iteration speed) or quietly stop running large parts of the suite (which kills coverage).

The healthier path: tier tests by risk and impact, and run by tier on a per-change basis. The full suite still runs — but rarely, and never as the iteration-loop blocker.

The four tiers

A well-tiered regression strategy looks like this:

Tier 1: Smoke (runs on every PR, in under 5 minutes)

The smallest set of tests that verifies the application works end-to-end at all. Login, key transaction, primary read path, primary write path. If any of these break, nothing else matters; the application is broken.

A healthy smoke suite is 50-100 tests. It runs on every PR as a pre-merge gate. Authors get feedback within 5 minutes that they haven't broken anything fundamental.

Tier 2: Affected (runs on every PR, scoped to impacted code paths)

Tests that exercise the code paths the PR touched, derived via test-impact analysis (TIA). For a typical 200-line PR, this is usually 200-2,000 tests of the total 10K+ suite — enough to catch direct regressions, fast enough to run within 10-15 minutes.

TIA tooling matters here. Microsoft's Test Impact Analysis, Bazel's incremental test selection, and language-specific tools (nx for TypeScript monorepos, pytest-testmon for Python) all support some form of it. The investment in TIA pays back for any suite that takes longer than 30 minutes to run.
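To make the idea concrete, here is a minimal sketch of dependency-based test selection. It is not how any of the tools above work internally; the `test_deps` mapping, the file names, and the conservative fallback for unknown files are all assumptions made for illustration.

```python
from collections import defaultdict


def build_reverse_index(test_deps: dict[str, set[str]]) -> dict[str, set[str]]:
    """Invert a test -> source-files map into source-file -> tests."""
    index: dict[str, set[str]] = defaultdict(set)
    for test, sources in test_deps.items():
        for source in sources:
            index[source].add(test)
    return index


def select_affected(changed_files, test_deps, all_tests) -> set[str]:
    """Return the tests whose recorded dependencies include a changed file.

    Assumption: a changed file that appears nowhere in the graph (config,
    generated code, CI scripts) is treated as "unknown impact" and triggers
    the whole suite rather than silently selecting nothing.
    """
    index = build_reverse_index(test_deps)
    selected: set[str] = set()
    for path in changed_files:
        if path not in index:
            return set(all_tests)  # conservative fallback
        selected |= index[path]
    return selected


# Hypothetical dependency graph, e.g. recorded from coverage data.
deps = {
    "test_login": {"auth.py", "db.py"},
    "test_checkout": {"cart.py", "db.py"},
    "test_profile": {"profile.py"},
}
print(select_affected(["auth.py"], deps, deps.keys()))
```

The fallback is a deliberate trade-off: it costs a full run on changes the graph cannot explain, but it avoids the failure mode where an untracked dependency lets a regression through the PR gate.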

Tier 3: Full regression (runs nightly on main, takes 1-3 hours)

The complete test suite, run end-to-end on a clean environment. This catches regressions that TIA missed because the dependency wasn't captured in the graph (configuration changes, generated code, infrastructure quirks). Failures get triaged the next morning; the team fixes forward or reverts.

Full regression doesn't gate any individual PR. It gates the next release window — if main has been red overnight, the morning standup includes "investigate the red main".

Tier 4: Pre-release (runs before each release, takes 2-6 hours)

The full suite plus longer-running tests skipped from nightly: load tests, soak tests, full browser-matrix tests, security scans, accessibility audits. Anything too expensive to run every night but important to verify before customers see it.

Pre-release runs gate the release itself. The team's release runbook includes "all pre-release tests green" as a hard requirement.
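The four tiers boil down to a mapping from CI trigger to the tiers that run and what, if anything, they block. A minimal sketch follows; the trigger names and the `Plan` structure are hypothetical, not taken from any particular CI system.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Tier(Enum):
    SMOKE = "smoke"
    AFFECTED = "affected"
    FULL = "full"
    PRE_RELEASE = "pre_release"


@dataclass(frozen=True)
class Plan:
    tiers: tuple[Tier, ...]
    blocks: Optional[str]  # what a red run blocks directly; None = no individual PR


# Hypothetical trigger names; adapt to the events your CI system emits.
PLANS = {
    "pull_request": Plan((Tier.SMOKE, Tier.AFFECTED), blocks="merge"),
    "nightly": Plan((Tier.FULL,), blocks=None),
    "release": Plan((Tier.FULL, Tier.PRE_RELEASE), blocks="release"),
}


def plan_for(trigger: str) -> Plan:
    """Look up which tiers run for a given CI trigger."""
    try:
        return PLANS[trigger]
    except KeyError:
        raise ValueError(f"unknown trigger: {trigger!r}") from None


print([tier.value for tier in plan_for("pull_request").tiers])
```

Keeping this mapping in one place makes it easy to answer "what ran on this change, and what did it gate?" when a regression slips through.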

Why this tiering works

The math is straightforward. 10K tests at an average of 30 seconds each is about 83 hours of serial test time. Assuming the suite is sharded across roughly 40 parallel workers, a full run takes about 2 hours of wall-clock time:

  • Run everything on every PR: ~2 hours per PR
  • Run smoke only on every PR: 5 minutes per PR
  • Run smoke + affected on every PR: ~15 minutes per PR (typical TIA selectivity)
  • Run full nightly: ~2 hours, but blocks nothing
  • Run pre-release: 2-6 hours, runs ~weekly

The full suite still runs, so coverage is preserved; what changes is who waits for it. The PR author waits about 15 minutes per iteration, not 2 hours. Throughput improves correspondingly.
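Wall-clock figures like these depend on how many workers the suite is spread across, so it helps to make that explicit. The helper below is a back-of-the-envelope sketch that assumes perfectly even sharding and no setup overhead; the worker counts are illustrative.

```python
def wall_clock_hours(n_tests: int, avg_seconds: float, workers: int = 1) -> float:
    """Ideal wall-clock time in hours: total test time divided evenly across workers."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return n_tests * avg_seconds / workers / 3600


# 10,000 tests at 30 seconds each, for a few illustrative worker counts.
for workers in (1, 17, 40):
    hours = wall_clock_hours(10_000, 30, workers)
    print(f"{workers:>3} workers: {hours:.1f} h")
```

On a single worker the suite needs roughly 83 hours, which is why any multi-hour figure for a full run already implies substantial parallelism.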

What goes in which tier?

The hardest part of regression tiering isn't the structure; it's deciding which test goes where. The framework:

Smoke criteria

A test belongs in smoke if its failure means the application is broken at a level no customer should experience. The login flow. The primary checkout. The main dashboard render. If you're not sure whether a test belongs in smoke, it probably doesn't — smoke should be small enough that every team member knows what's in it.

Affected criteria (automated)

TIA determines this dynamically. The team doesn't manually assign tests; the tooling does. Quality depends on the dependency graph's accuracy.

Full regression criteria

Everything not in smoke goes here by default: tests that prove edge cases, exercise less-common code paths, verify data-migration correctness, or hit third-party integrations under mock.

Pre-release criteria

Tests that are too slow, too expensive, or too disruptive to run frequently. Browser-matrix tests (Chrome/Safari/Firefox/Edge × Win/macOS/Linux). Load tests that hit 10K req/s. Soak tests that run for hours. Tests requiring expensive cloud resources (large GPUs, datacenter-scale clusters).

Migration from "run everything" to tiered

The pragmatic migration path takes 4-6 weeks:

Week 1: define the smoke set. Pick ~50 tests covering the absolute fundamentals. Run on every PR. Get the team comfortable that "smoke passing" is sufficient for PR confidence.

Week 2-3: set up TIA tooling. Configure the dependency graph. Validate on historical PRs (does TIA catch the regressions the team actually saw last quarter?).

Week 4: shift the default from "run everything" to "smoke + affected" on PRs. Run full nightly. Communicate the change explicitly so the team isn't surprised when CI suddenly finishes in 15 minutes.

Week 5-6: tune. Watch which regressions slip past smoke + affected and get caught by nightly. If a pattern emerges, add a test to smoke or fix the gap in the TIA dependency graph; otherwise accept that the nightly catches the long tail.

The change is mostly cultural. Engineers used to "CI passes = ready to ship" need to absorb that "smoke + affected passes = safe to merge; nightly + pre-release passes = safe to release". Both are necessary; neither is sufficient alone.

Flaky tests in a tiered strategy

Flaky tests are the wild card. A flake in the smoke tier blocks every PR; a flake in nightly is annoying but absorbable. The pragmatic policy:

  • Smoke tier: zero tolerance for flakes. A flake in smoke gets the test quarantined within a day or the smoke gate disabled. The smoke set has to be reliable enough to trust as a blocker.
  • Affected tier: low tolerance. A flaky test here hits only the PRs whose changes select it (roughly 10% of PRs, though this varies); persistent flakes get quarantined within a week.
  • Full regression: moderate tolerance. A flake in nightly produces a noisy report; the team triages within the sprint.
  • Pre-release: case-by-case. Some pre-release tests (load tests especially) are inherently noisy; the policy is to capture trends rather than enforce per-run determinism.
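Enforcing a policy like this requires a working definition of "flaky". One common definition is a test that both passed and failed on the same commit. The sketch below uses it; the `Run` record is a hypothetical shape for CI history, and the 14-day figure assumes a two-week sprint.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple, Optional


class Run(NamedTuple):
    test: str
    commit: str
    passed: bool


# Quarantine deadlines in days, mirroring the policy above.
# Assumption: "within the sprint" means a two-week sprint.
QUARANTINE_DAYS: dict[str, Optional[int]] = {
    "smoke": 1,
    "affected": 7,
    "full": 14,
    "pre_release": None,  # case-by-case: track trends instead
}


def find_flaky(runs: Iterable[Run]) -> set[str]:
    """Return tests that both passed and failed on at least one commit."""
    outcomes: dict[tuple[str, str], set[bool]] = defaultdict(set)
    for run in runs:
        outcomes[(run.test, run.commit)].add(run.passed)
    return {test for (test, _commit), seen in outcomes.items() if len(seen) == 2}


history = [
    Run("test_login", "abc123", True),
    Run("test_login", "abc123", True),
    Run("test_checkout", "abc123", False),
    Run("test_checkout", "abc123", True),   # same commit, different outcome
    Run("test_search", "abc123", False),
    Run("test_search", "def456", True),     # different commits: a fix, not a flake
]
print(find_flaky(history))
```

Comparing outcomes per commit rather than across commits keeps genuine regressions and their fixes out of the quarantine list.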

Tooling

The category leaders for tiered regression:

  • Test-impact analysis: Microsoft Test Impact Analysis, Bazel (with proper rule configuration), nx (TypeScript monorepo), pytest-testmon (Python), Jest with --changedSince (less sophisticated but free).
  • Parallel execution: GitHub Actions matrix, CircleCI parallelism, Buildkite agents, Jenkins parallel pipelines. Most CI systems support this; what matters is configuring it well (avoiding shared state, partitioning evenly).
  • Test selection by failure history: TestImpact extensions (Java/.NET), Launchable (cloud service), proprietary tools at Google/Facebook scale. The pattern: prioritise tests historically most likely to fail.
  • Flaky-test detection: Datadog CI Visibility, Buildkite Test Analytics, CircleCI Test Insights. The output: lists of historically-flaky tests for quarantine triage.
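"Partitioning evenly" in the parallel-execution bullet usually means balancing shards by historical test duration rather than by test count. A minimal sketch using the greedy longest-processing-time heuristic follows; the durations are made up, and real CI systems may use different splitting strategies.

```python
import heapq


def partition_by_duration(durations: dict[str, float], shards: int) -> list[list[str]]:
    """Greedy split: assign the longest remaining test to the least-loaded shard."""
    if shards < 1:
        raise ValueError("shards must be at least 1")
    # Min-heap of (total seconds assigned so far, shard index).
    heap = [(0.0, i) for i in range(shards)]
    heapq.heapify(heap)
    buckets: list[list[str]] = [[] for _ in range(shards)]
    for test, seconds in sorted(durations.items(), key=lambda kv: kv[1], reverse=True):
        load, idx = heapq.heappop(heap)
        buckets[idx].append(test)
        heapq.heappush(heap, (load + seconds, idx))
    return buckets


# Hypothetical per-test durations in seconds from previous runs.
timings = {"a": 60, "b": 50, "c": 30, "d": 20, "e": 10}
for shard in partition_by_duration(timings, shards=2):
    print(shard, sum(timings[t] for t in shard))
```

The heuristic is not optimal, but it is cheap and typically lands close to an even split once there are many more tests than shards.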

Common pitfalls

  • Skipping the migration: jumping straight to tiered without team buy-in produces resentment. Communicate explicitly.
  • Over-aggressive smoke set: pushing too many tests into smoke makes the gate slow and trains the team to skip it. Keep smoke small.
  • Ignoring nightly failures: if nightly is "always red", the team learns to ignore it. Triage nightly daily; fix or revert within the sprint.
  • TIA without verification: trusting TIA without measuring its catch rate produces false confidence. Review the nightly full-regression failures weekly to check that TIA isn't missing systematic regressions.
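One way to measure TIA's catch rate is to take each regression found by the full suite, identify the change that caused it, and check whether TIA had selected the failing test for that change. The sketch below assumes the culprit change is already known (for example from a bisect); the record shapes and identifiers are hypothetical.

```python
def tia_catch_rate(
    regressions: list[tuple[str, str]],
    selected_by_change: dict[str, set[str]],
) -> float:
    """Fraction of regressions whose failing test TIA selected for the culprit change.

    regressions: (change_id, failing_test) pairs found by the full suite.
    selected_by_change: change_id -> tests TIA chose to run on that change.
    """
    if not regressions:
        raise ValueError("no regressions to measure")
    caught = sum(
        1
        for change, test in regressions
        if test in selected_by_change.get(change, set())
    )
    return caught / len(regressions)


# Hypothetical data for one review period.
found = [
    ("pr-1", "test_a"),
    ("pr-1", "test_b"),
    ("pr-2", "test_c"),
    ("pr-3", "test_d"),
]
selected = {"pr-1": {"test_a"}, "pr-2": {"test_c", "test_x"}}
print(f"catch rate: {tia_catch_rate(found, selected):.0%}")
```

A falling catch rate, or misses that cluster around one kind of change such as configuration or generated code, points at a gap in the dependency graph rather than at bad luck.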

For the test-case structure that supports tiering, see Test-case design. For tracking which tests cover which requirements, see Traceability matrix. For what to do with the defects regression surfaces, see Defect triage.

Frequently asked questions

When does "run everything" stop scaling?

Around the 30-60 minute mark for the full test suite. At that point, PR cycle time becomes dominated by CI rather than code review or implementation. The 4-tier approach (smoke, affected, full nightly, pre-release) keeps iteration fast while preserving full coverage at slower cadences.

What is test impact analysis?

Test impact analysis (TIA) identifies the subset of tests relevant to a code change by analysing the dependency graph between source files and test files. CI runs only the affected tests instead of the entire suite. TIA can cut CI time by 5-20x for large suites with disciplined dependency tracking.

How big should the smoke test suite be?

50-100 tests covering the absolute fundamentals: login, primary read path, primary write path, key transactions. It should complete in under 5 minutes. Small enough that every team member knows what's in it; reliable enough to trust as a pre-merge gate.

How do I migrate from "run everything" to tiered?

Plan a 4-6 week migration: Week 1 define the smoke set, Week 2-3 set up TIA tooling, Week 4 shift the default from "run everything" to "smoke + affected" on PRs, Week 5-6 tune. Communicate the change explicitly so the team isn't surprised when CI suddenly finishes in 15 minutes. The hardest part is cultural, not technical.