
Are AI-generated test cases worth shipping?

Yes, with a sharp caveat — when they're tied to AC and reviewed by a human. Five categories where AI test generation is great, five anti-patterns to catch.

Stride Team · Engineering · 9 min read

Short answer: Yes, with a sharp caveat. AI-generated test cases are worth shipping when they're tied to specific acceptance criteria and reviewed by a human before they enter the regression suite. AI-generated tests that aren't reviewed, or that aren't tied to AC, become flaky noise that erodes trust in the suite.

Below is the honest map: which AI tests are great, which are garbage, and how to tell the difference before they ship.

What AI is genuinely good at

Five categories of test where AI consistently produces shippable output:

1. AC-derived behavioural tests

When given a clear AC line ("the list endpoint returns 200 with a sorted array when there are ≥1 items"), the AI produces clean, focused tests. Each assertion maps to the AC line. Easy to review, easy to maintain.

The pattern: AC line → 1-3 tests. If the AI generates 7 tests for one AC line, something's wrong with the AC (probably ambiguous).
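A minimal sketch of that mapping, assuming a hypothetical listItems() handler that returns { status, body } (the name and shape are illustrative, not from any real codebase):

const { listItems } = require('./items'); // hypothetical module under test

// AC: "the list endpoint returns 200 with a sorted array when there are ≥1 items"
test('AC: returns 200 when the list has at least one item', async () => {
  const res = await listItems({ seed: ['b', 'a'] });
  expect(res.status).toBe(200);
});

test('AC: body is sorted ascending', async () => {
  const res = await listItems({ seed: ['b', 'a', 'c'] });
  expect(res.body).toEqual(['a', 'b', 'c']);
});

Two focused tests for one AC line, each reviewable against the AC text in seconds.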

2. Negative-path coverage

Humans write happy-path tests. AI writes happy-path AND:

  • Empty input
  • Maximum input
  • Malformed input
  • Permission denied
  • Rate-limited
  • Concurrent modification

This is the single biggest defect-reduction lever. ~70% of post-release bugs are in these categories; AI-generated tests catch them before ship.
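As a sketch, here's what that negative-path spread looks like for a hypothetical createStory() function (the function and its error messages are assumptions, not a real API):

const { createStory } = require('./stories'); // hypothetical module under test

test('rejects empty input', async () => {
  await expect(createStory({})).rejects.toThrow(/title is required/i);
});

test('rejects input over the maximum length', async () => {
  await expect(createStory({ title: 'x'.repeat(10001) })).rejects.toThrow(/too long/i);
});

test('rejects malformed input', async () => {
  await expect(createStory(null)).rejects.toThrow();
});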

3. Schema-aware integration tests

When AI has access to your data model, integration tests reflect actual relationships:

  • "Test that deleting a project cascades to its stories" — AI generates this if the schema says cascade-delete
  • "Test that orphaned comments are cleaned up" — AI flags this if there's a foreign-key constraint without cascade

These are the integration tests humans skip because the schema isn't loaded in their head at test-writing time.
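A sketch of the cascade test, assuming a hypothetical db test helper (your ORM's API will differ):

const { db } = require('./testDb'); // hypothetical test database helper

test('deleting a project cascades to its stories', async () => {
  const project = await db.projects.create({ name: 'P1' });
  await db.stories.create({ projectId: project.id, title: 'S1' });

  await db.projects.delete(project.id);

  // The schema declares cascade-delete, so no stories should survive
  const orphans = await db.stories.findAll({ projectId: project.id });
  expect(orphans).toHaveLength(0);
});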

4. Test-data generation

Faker-style test data, but smarter. AI generates valid-but-edge-case data:

  • Strings with Unicode + emoji
  • Dates near DST boundaries
  • Numbers near floating-point precision limits
  • IDs near integer overflow

The kind of data that breaks in production precisely because humans tested with "John Smith" and 2024-01-15.
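A sketch of the kind of fixture AI produces; the values are illustrative:

const edgeCases = {
  unicodeName: 'Zoë 🚀 名前',                          // Unicode + emoji
  dstBoundary: new Date('2024-03-10T02:30:00-05:00'), // inside a US DST gap
  floatEdge: 0.1 + 0.2,                               // 0.30000000000000004
  nearOverflow: 2147483646,                           // one below INT32_MAX
};

test('edge-case string survives a JSON round trip', () => {
  const restored = JSON.parse(JSON.stringify(edgeCases.unicodeName));
  expect(restored).toBe(edgeCases.unicodeName);
});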

5. Regression test scaffolds

When a defect is fixed, AI generates a regression test from the bug report plus the diff. The test verifies the bug doesn't return. Faster than a human authoring it from scratch, and more reliable than trusting an engineer under deadline pressure not to skip it.
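A sketch of such a scaffold, built around a made-up bug (BUG-318: signup allowed duplicate emails); the bug ID, module, and error message are all hypothetical:

const { signupUser } = require('./signup'); // hypothetical module under test

// The bug ID in the test name preserves provenance back to the report
test('BUG-318: second signup with the same email is rejected', async () => {
  await signupUser({ email: 'dup@example.com' });
  await expect(signupUser({ email: 'dup@example.com' })).rejects.toThrow(/already registered/i);
});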

What AI is bad at

Five anti-patterns to watch:

1. Generic tests that could apply to anything

The most common failure. AI generates:

test('it works', () => {
  const result = doThing();
  expect(result).toBeTruthy();
});

That's a green test that asserts nothing. Coverage looks great; protection is zero.

How to catch: Read every AI-generated test. If you can't articulate "this test would fail if X broke," delete it.
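For contrast, here's what a reviewed rewrite might look like, assuming (hypothetically) that doThing's contract is to sort items by priority, highest first:

// This assertion fails if the sorting logic breaks; 'it works' never would
test('doThing sorts items by priority, highest first', () => {
  const result = doThing([{ priority: 1 }, { priority: 5 }, { priority: 3 }]);
  expect(result.map((item) => item.priority)).toEqual([5, 3, 1]);
});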

2. Tests that test the implementation, not the behaviour

test('it calls the userService.create method', () => {
  const spy = jest.spyOn(userService, 'create');
  signupUser({ email: 'x' });
  expect(spy).toHaveBeenCalled();
});

This test fails when the implementation refactors, even if the user-visible behaviour is correct. Brittle.

How to catch: Tests that mock heavily and assert on mocks (rather than asserting on user-visible outputs) are implementation tests in disguise.
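For contrast, a behavioural version of the same test; getUserByEmail is a hypothetical lookup standing in for whatever user-visible output your system exposes:

test('signup creates a retrievable user', async () => {
  await signupUser({ email: 'x@example.com' });
  const user = await getUserByEmail('x@example.com'); // hypothetical lookup
  expect(user.email).toBe('x@example.com'); // survives refactors of the internals
});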

3. Hallucinated APIs

AI tests sometimes invoke methods that don't exist (expect(user).toHaveValidEmail() — a matcher the AI imagined). The test runner catches the failure, but the engineer has to triage why.

How to catch: Run AI-generated tests immediately after generation. Anything that fails on the first run is suspect — either the matcher is hallucinated or the assertion is wrong.

4. Flaky setup

AI sometimes generates tests with timing-sensitive setup (sleeps, polling) that pass on a fast machine and fail on a slow CI runner. The result is a flaky test the team eventually ignores.

How to catch: Watch for setTimeout, setInterval, wait() in AI-generated tests. Most should be deterministic; if the AI used a timer, ask why.
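Where a timer is genuinely involved, Jest's fake timers make the test deterministic. A sketch, assuming a hypothetical debounce helper:

test('debounced save fires once after the delay', () => {
  jest.useFakeTimers();
  const save = jest.fn();
  const debouncedSave = debounce(save, 500); // hypothetical debounce helper

  debouncedSave();
  debouncedSave();
  jest.advanceTimersByTime(500); // no real waiting, no CI-speed sensitivity

  expect(save).toHaveBeenCalledTimes(1);
  jest.useRealTimers();
});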

5. Over-fitted to current implementation

AI generates a test that captures the EXACT current output of a function (expect(result).toEqual({...50-field-snapshot...})). Test passes today; fails the moment a field is added.

How to catch: Snapshot-style tests with hardcoded large objects are usually over-fitted. Refactor to assert on the specific properties that matter.
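A sketch of the refactor, assuming a hypothetical buildReport() whose AC only specifies status and item count:

test('report carries the fields the AC specifies', () => {
  const result = buildReport(); // hypothetical function under test
  // toMatchObject ignores extra fields, so adding a field won't break this
  expect(result).toMatchObject({
    status: 'complete',
    itemCount: 3,
  });
});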

The workflow that works

Four steps:

1. Generate from AC

Don't ask AI for "tests for this feature." Ask for "tests for AC line 3 of STORY-42." Each AI test should map to a specific AC; without that anchor, the tests drift toward the generic.

2. Review immediately

Before the test enters the regression suite, a human reads each one and asks:

  • Does it test behaviour or implementation?
  • Would it actually fail if the underlying logic broke?
  • Is the setup deterministic?

Tests that fail this review get edited or deleted. Don't merge unreviewed AI tests; you're committing future maintenance burden.

3. Run immediately

Generate, review, run. If the test fails on the first run, fix it now (while context is fresh) rather than letting it land as flaky.

4. Track failure provenance

When a test fails in CI, the link back to its source AC matters. "Test X failed → AC line 3 of STORY-42 failed → the user-visible behaviour Y broke" is debuggable. "Test X failed" with no provenance is a mystery the team learns to ignore.
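Without tooling, a naming convention gets most of the way there. One low-tech sketch, reusing the hypothetical listItems() handler from earlier:

// Encode the story + AC line in the describe block, so a CI failure
// names its source AC directly in the output
describe('STORY-42 / AC-3: list is sorted for non-empty input', () => {
  test('returns items sorted ascending', async () => {
    const res = await listItems({ seed: ['b', 'a'] });
    expect(res.body).toEqual(['a', 'b']);
  });
});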

Stride does this automatically: every AI-generated test carries a link to its source AC. When the test fails, the failure UI shows which AC line is at risk.

Should you ship AI tests AT ALL?

Three patterns where AI test generation is genuinely worth it:

1. High-AC-volume teams. If your stories have 5-10 AC lines each and you ship 20+ stories per sprint, you're authoring 100-200 test cases per sprint. AI reduces that authoring time by 75% without reducing test quality (when reviewed). Big win.

2. Regression-heavy products. Products with a long defect history benefit most — AI generates regression tests faster than humans write them.

3. New testing initiatives. Teams adding test coverage to a previously undertested codebase get the biggest absolute lift. Authoring 500 missing tests from scratch is a year of work; generating and reviewing them with AI is a quarter.

Patterns where AI test generation is NOT worth it:

1. Test-first / TDD-heavy teams. TDD's value is in the design pressure of writing the test before the code. Generating tests after the fact misses the design value. AI can suggest tests; don't let it replace the TDD discipline.

2. Highly-regulated industries. FDA, FAA, medical device, automotive — environments where test documentation is part of compliance. AI-generated tests need careful attribution + provenance; some regulators don't accept "AI did it" as authorship.

3. Tiny codebases. Under 5K LOC, the AI overhead (generate, review, edit) costs more than the saved authoring time. Manual tests are fine.

How Stride does this

In the Verify module:

  • Each AC line gets test case skeletons at story-creation time
  • Skeletons follow the team's existing test format (Gherkin / Jest / pytest / etc.)
  • QA reviews each before they enter the suite (one-click approve / edit / reject)
  • Approved tests link bidirectionally to the AC — when AC changes, affected tests flag for re-review
  • Test runs (from CI) feed back into the graph; failures link to their source AC

The traceability is the graph: AC → Test Case → Test Run → Defect. When something breaks, you can answer "what behaviour broke?" in one click.


The honest summary: AI-generated test cases are worth shipping when they're (a) tied to specific AC, (b) reviewed by a human, and (c) tracked back to source. If any of the three is missing, you're committing future maintenance burden disguised as productivity. With all three, you cut authoring time 75% without losing quality.
