
How AI writes acceptance criteria (and where it fails)

The honest map of where AI is dramatically better than humans at writing acceptance criteria — and the five places it confidently writes garbage. Plus the prompts that work.

Stride Team · Engineering · 10 min read

Every PM has watched an engineer mistake an ambiguous user story for a clear one. Twenty minutes into the dev cycle, the team is debating whether "the system handles invalid input" means returning an error message or just not crashing. By the end of the sprint, you've shipped both interpretations on different code paths.

Acceptance criteria are supposed to prevent this. They mostly don't. Real-world AC tends to drift toward two failure modes: too vague to constrain the build, or so detailed it becomes a re-spec of the story itself.

AI can write AC. We ship that feature. What's surprised us, after watching ~5,000 stories pass through Stride's prompt, is where the AI is dramatically better than humans and where it confidently writes garbage. This post is the honest map.

What good acceptance criteria look like

Before the AI question: what are we even aiming at?

Good AC has three properties:

  1. Testable. A reasonable QA engineer (human or AI) could execute the criterion and return pass/fail. "The system performs well" fails. "The list endpoint responds in under 200 ms at p95 with 10K rows" passes.

  2. Bounded. Lists what the story does AND what it doesn't do. Implicit scope is where teams collide later. "Adds CSV export. Does not export attachments." Both halves matter.

  3. Behavioral. Describes observable behavior, not implementation. "Stores the audit log in Postgres" is implementation. "The audit log persists across page reload" is behavior. The story should constrain behavior; the architecture decision is its own artifact.

The Gherkin format (Given/When/Then) helps with the first two. It doesn't automatically deliver the third — plenty of teams write Gherkin that leaks implementation.

Where AI is great at AC

Five places the AI consistently outperforms humans.

1. Surface area coverage

Humans write AC for the happy path. AI writes AC for the happy path plus nine variations the human forgot:

  • Empty state
  • Single-item state
  • Pagination boundaries
  • Permission denied
  • Rate-limited
  • Concurrent modification
  • Network error mid-action
  • Browser-back navigation
  • Accessibility (keyboard-only, screen reader announcement)

Given a story like "Users can bulk-archive issues," the AI generates AC for all 10 cases. Humans, on a good day, write 3. On most days, they write 1 and the rest emerge as bugs in QA.
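
To make that concrete, here's a sketch of one of the non-happy-path criteria for the bulk-archive story (illustrative wording and role names, not Stride's literal output):

Scenario: Viewer-role user cannot bulk-archive issues
  Given I am viewing the project board as a user with the Viewer role
  When I select 3 issues using the checkbox column
  Then the "Archive" action is disabled
  And a tooltip explains that archiving requires Editor permissions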

This single property is why teams that adopt AI-AC report 30-50% fewer defects per story in the first quarter. Not because the AC is more clever — because the AC exists for cases humans skip.

2. Gherkin syntax discipline

Most teams that "use Gherkin" use it loosely. Mixed verb tenses, ambiguous subjects, embedded conjunctions ("the user sees X and the system also stores Y"). The AI doesn't do any of this. Output is mechanically consistent:

Feature: Bulk archive issues
 
Scenario: User archives 5 issues from the board
  Given I am viewing the project board with at least 5 issues
  When I select 5 issues using the checkbox column
  And I click the "Archive" action
  Then the selected issues disappear from the board
  And an undo toast appears for 8 seconds
  And the archived issues are visible under Filters → Archived

This isn't necessarily deeper than what a human would write — but it's consistent. Consistency compounds when AI test-case generation reads the AC later.

3. Anti-requirements

Humans almost never write what's not in scope. AI, when prompted to, does. We ship a system prompt that explicitly asks for "out of scope" lines, and the model fills them reliably:

Out of scope:

  • Archiving issues in bulk via the API (this story is UI-only)
  • Restoring archived issues (separate story)
  • Audit log of who archived what (separate story)

That out-of-scope list saves days of re-spec when devs hit the boundary mid-implementation.

4. Edge cases that come from the data model

Given access to the project graph (which Stride has by design), the AI surfaces edge cases that depend on relationships humans don't track in their head. Example: "Users can change a story's assignee." The AI notices:

  • Story has open comments → mentions persist after reassignment?
  • Story has time entries → time entries reassign to new owner?
  • Story is in a sprint → sprint capacity recomputes?
  • New assignee is on leave → warning shown?

A human PM might catch one or two of these. The AI catches all four every time, because it's reading the actual schema relationships, not its training set's idea of a generic "issue assignment" workflow.
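
To ground that, here's a rough sketch of what graph-grounded context can look like once it's serialized into the generation prompt. The shape, field names, and IDs are illustrative, not Stride's actual schema:

// Illustrative slice of the project graph, serialized next to the story
// body before the model writes AC. Everything here is hypothetical.
interface StoryGraphContext {
  story: { id: string; title: string };
  relations: {
    comments: { count: number; hasMentions: boolean };
    timeEntries: { count: number };
    sprint: { id: string; capacityTracked: boolean } | null;
    newAssignee: { id: string; onLeave: boolean } | null;
  };
}

const context: StoryGraphContext = {
  story: { id: "STORY-000", title: "Users can change a story's assignee" },
  relations: {
    comments: { count: 4, hasMentions: true },
    timeEntries: { count: 2 },
    sprint: { id: "SPRINT-00", capacityTracked: true },
    newAssignee: { id: "USER-00", onLeave: true },
  },
};

// The prompt can then require at least one criterion per populated
// relation, which is how the four edge cases above get surfaced.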

5. Translation from PRD to story to AC

The biggest unlock: skip the manual decomposition step. Drop a 4-page PRD on the AI. Get back:

  • 12-15 stories with titles and descriptions
  • AC for each
  • Test cases linked to each AC
  • Estimated story points (calibrated to your team's historical velocity)
  • Dependencies between stories

The human PM's job becomes editing, not authoring. Editing 12 generated stories takes 30-60 minutes. Authoring them takes a half-day. The compounding savings over a quarter are substantial.
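
If you're reproducing this outside Stride, the bundle is easiest to handle as structured output. A minimal shape, assuming you ask the model for JSON; the field names are ours for illustration, not a Stride API:

// Illustrative schema for a PRD -> stories -> AC decomposition response.
interface GeneratedStory {
  title: string;
  description: string;
  acceptanceCriteria: string[];                    // Gherkin scenarios
  testCases: { criterionIndex: number; plan: string }[];
  storyPoints: number;                             // calibrated to velocity
  dependsOn: string[];                             // existing story IDs only
}

type DecompositionResult = GeneratedStory[];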

Where AI breaks (5 anti-patterns)

This is the part that matters. AI confidently writes garbage in five recognizable patterns. Watch for these.

1. Generic AC that could fit any story

The single most common failure. The AI fills the space with verbs and nouns that are technically related to the story but don't constrain anything.

The story: "Add Slack notifications for sprint completion."

The bad AI output:

Given a sprint is closed
When the closure happens
Then a notification is sent to Slack

That's three lines of nothing. Which channel? Which message? What if Slack is offline? What about already-completed sprints? What does the user see in-product when the notification fires?

Fix: Always pair the AI's first draft with a "what could go wrong" follow-up prompt. The AI is much better at finding holes in its own draft than at writing the draft in one shot.

2. AC that re-specs the implementation

The AI sometimes writes AC like a JIRA ticket comment from a junior engineer: implementation choices disguised as behavior.

The bad AI output:

Given the API receives a POST to /api/sprints/:id/close
When the request is authenticated
Then a row is inserted into the sprint_events table with type=closure

This is the story you want to ship later, after the architecture has been decided. The AC for the user-facing story shouldn't constrain the implementation. Reject and rewrite.

Fix: System prompt should explicitly say "describe behavior visible to the user, not implementation choices." Models follow this constraint well when told.
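
For contrast, a behavior-level rewrite of the same sprint-closure criterion might read like this (the screen names are placeholders):

Given a sprint with at least one issue
When I close the sprint from the board
Then the sprint moves to the Completed list
And the closure appears in the project activity feed with my name and a timestamp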

3. False-positive completeness

The AI writes 8 AC and confidently asserts the story is fully specified. Sometimes it is. Often it isn't — there's a 9th case that requires domain knowledge the AI doesn't have.

Example: A story about exporting reports. The AI writes AC for CSV, Excel, PDF, JSON, scheduled email export. It misses: what if the report has 10M rows? (The user expects pagination or chunking; the AI defaults to "just generate it.")

Fix: Always read AI-written AC with the question "what does this story not say about scale or limits?" Volume edge cases are the single most common gap.

4. Hallucinated dependencies

When the AI has access to your project graph, it sometimes invents relationships. "This story depends on STORY-451" — but STORY-451 is about an unrelated feature.

This is rare in well-prompted systems (the AI has to ground the dependency in real graph data) but happens often in naive integrations that just feed the AI a list of story titles.

Fix: Validate dependency claims against the actual story graph before accepting. Stride does this server-side; teams using ChatGPT directly to write AC don't, and pay for it.
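
A minimal sketch of that server-side check, assuming you already hold the set of real story IDs; the function name and IDs are hypothetical, not Stride's internals:

// Keep only dependencies that exist in the project graph; surface the
// rest for PM review instead of silently accepting them.
function validateDependencies(
  claimed: string[],
  existingStoryIds: Set<string>,
): { accepted: string[]; rejected: string[] } {
  return {
    accepted: claimed.filter((id) => existingStoryIds.has(id)),
    rejected: claimed.filter((id) => !existingStoryIds.has(id)),
  };
}

// Anything in `rejected` is either a hallucination or a story the graph
// doesn't know about yet; either way, a human should look at it.
const { rejected } = validateDependencies(
  ["STORY-451", "STORY-812"],
  new Set(["STORY-812", "STORY-813"]),
);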

5. Wrong granularity on edge cases

The AI sometimes treats trivial cases as critical. "What if the user's display name contains a single quote?" gets the same prominence as "What if the user is offline?" — when one is a 5-minute fix and the other is a strategic call about whether to support offline.

Fix: Edit pass. Group edge cases by tier (must-handle / should-handle / nice-to-have) and let the team decide which tier each falls into. The AI can suggest cases; humans triage.

Five prompts that work (and five that don't)

We A/B-tested prompt structures against ~2,000 stories. The patterns:

Prompts that work

  1. "Write AC for this story, with at least one criterion for empty state, error state, and permission denied." Forces minimum surface area.

  2. "Write AC in Gherkin (Given/When/Then). Use 'I' as the subject for user actions. Use present tense throughout." Locks syntax.

  3. "After writing the AC, list 3 cases your AC does not cover that a senior QA engineer would think to test." AI is much better at self-critique than self-completion.

  4. "For each AC, identify if it describes user-visible behavior or system implementation. Rewrite any implementation lines to be behavior-only." Catches the implementation-leak anti-pattern.

  5. "Reference these existing stories when relevant: [linked story IDs]. Do not invent new dependencies." Constrains hallucination.

Prompts that don't work

  1. "Write good AC for this story." Goodness is undefined. Output is generic.

  2. "Write comprehensive AC covering all cases." The model interprets "all" as "everything I can think of," which over-specifies.

  3. "Write AC like a senior PM would." Persona prompting works for marketing copy. For technical AC, it produces flowery prose without sharper logic.

  4. "Make sure the AC is testable." Too abstract. Better: "For each AC, write a one-sentence test plan."

  5. "Cover edge cases." Same problem as "comprehensive" — undefined target.

How Stride structures the prompt

Specifics, for teams using Stride or thinking about it. The system prompt that drives our acceptance-criteria generation:

  • Reads the story body + the project's data model graph (existing stories, sprints, integrations).
  • Asks for AC in Gherkin format with a Feature: header.
  • Requires minimum 5 AC including: happy path, empty state, error state, permission denied, and one behavior-vs-implementation check.
  • Generates a separate "Out of scope" list under each story.
  • Surfaces 3 follow-up cases the AC doesn't cover, marked as suggestions for the PM to triage.
  • Cross-references real story IDs only — no hallucinated dependencies.
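
Stitched together, a system prompt with those constraints reads roughly like this (a paraphrase for illustration, not Stride's production prompt):

You are writing acceptance criteria for the story below. The story body and
a slice of the project graph (related stories, sprints, integrations) are
attached as context.

Rules:
- Write AC in Gherkin with a Feature: header. Use "I" as the subject for
  user actions. Present tense throughout.
- Write at least 5 criteria, including: happy path, empty state, error
  state, and permission denied.
- Describe behavior visible to the user, not implementation choices.
- Add an "Out of scope" list under the story.
- List 3 follow-up cases the AC does not cover, marked as suggestions for
  the PM to triage.
- Reference only story IDs present in the attached context. Do not invent
  dependencies.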

The PM's editing pass usually drops 2-3 AC and tightens the wording on the rest. Total time per story: 4-7 minutes from prompt to merge.

Teams using AI-AC ship 31% fewer defects per story in the first 90 days. The lift is entirely from edge cases the AI surfaces that humans skip.

Internal data, n=5,200 stories · Stride telemetry, Q1 2026

What this means for you

If you're considering AI-AC, here's the honest take:

  • It's not magic. The AI doesn't replace the PM. It eliminates the bottom 70% of effort so the PM can spend their time on the top 30%.
  • It works best with graph context. Generic ChatGPT prompts produce generic AC. AI that reads your real story graph produces AC tied to real relationships.
  • You will edit every output. That's not a bug. That's the workflow. The 4-7 minute edit beats the 30-minute author.
  • It pays off in QA, not PM. The first month feels like "this is fine but didn't save me much." The third month, your defect rate has dropped meaningfully. That's the actual win.

For teams shipping software inside Stride, AI-AC is on by default in the Plan module. For teams stitching ChatGPT into their existing tracker, the prompts above are a starting point — but you'll lose ~40% of the value from the missing graph context.

Watch AI-AC generate a sprint's worth of stories in real time, with the project-graph context that makes the difference.

See the Plan module

If you want the procurement-stage comparison of how Stride does this vs. Jira's add-on approach, see Stride vs Jira. For the connected-graph thesis that makes Stride's AC distinct, the connected delivery graph post breaks down what we mean by "one graph, one prompt."

The point isn't that AI is going to write your AC for you. It's that the boring 70% of AC writing is over, and the PMs who use the time savings to deeply understand the remaining 30% are the ones whose teams ship cleaner software.
