
Can AI write Gherkin? (yes — here's how)


Stride Team · Engineering · 8 min read

Short answer: Yes — AI writes Gherkin well, often better than humans for surface area coverage. But the AI fails predictably in five places, and those failures look identical to passing tests if you don't know what to watch for.

Below is the map: where AI wins at Gherkin, where it confidently writes garbage, and the prompts that work.

What good Gherkin looks like

Before the AI question: what are we even aiming at?

Feature: Bulk archive issues
 
Scenario: User archives 5 issues from the board
  Given I am viewing the project board with at least 5 issues
  When I select 5 issues using the checkbox column
  And I click the "Archive" action
  Then the selected issues disappear from the board
  And an undo toast appears for 8 seconds
  And the archived issues are visible under Filters → Archived

Three properties:

  1. Behavioural. Describes what the user does and observes, not what the system implements (no mentions of databases, services, or APIs).

  2. Bounded. Each scenario tests one path; combinations get separate scenarios.

  3. Unambiguous. Read it out loud — does it have one possible interpretation? If "the issues disappear" could mean "from this view" or "from all views" or "permanently," the AC is broken before testing.

Most teams use Gherkin loosely. The AI doesn't — it writes mechanically consistent output. That's the win.

Where AI is great at Gherkin (5 wins)

1. Surface-area coverage

Humans write Gherkin for the happy path. AI writes Gherkin for the happy path plus nine variations the human forgot:

  • Empty state
  • Single-item state
  • Pagination boundaries
  • Permission denied
  • Rate-limited
  • Concurrent modification
  • Network error mid-action
  • Browser-back navigation
  • Accessibility (keyboard-only, screen reader announcement)

For "Users can bulk-archive issues" the AI generates 9-12 scenarios. Humans, on a good day, write 3. The 6-9 it adds are the bugs you'd otherwise discover in QA.
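As a sketch, here is what the empty-state variation of the bulk-archive story above might look like (hypothetical wording, not generated output):

Scenario: User opens the board with no issues to archive
  Given I am viewing a project board with no issues
  When I open the bulk actions menu
  Then the "Archive" action is disabled
  And a hint explains there is nothing to select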

2. Syntax discipline

Humans produce mixed verb tenses, ambiguous subjects, and conjunction spaghetti ("the user sees X and the system also stores Y"). AI does none of this. Output is consistent, test runners parse it cleanly, and you spend less time debugging malformed scenarios.
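To make the contrast concrete, here is how the conjunction-spaghetti line might be untangled into single-assertion steps (a hypothetical sketch):

# Instead of: Then the user sees X and the system also stores Y
Scenario: User archives an issue
  When I archive the issue
  Then I see the "Archived" confirmation toast
  And the issue appears under Filters → Archived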

3. Anti-requirements (out-of-scope explicitness)

Humans almost never write what's NOT in scope. AI, when prompted to, does:

Out of scope for this story:
- Bulk archive via the API (UI-only story)
- Restoring archived issues (separate story)
- Audit log of who archived what (separate story)

That out-of-scope list saves days of re-spec when devs hit the boundary mid-implementation.

4. Schema-aware edge cases

When the AI has access to the project graph (as Stride does), it automatically surfaces edge cases that depend on relationships humans don't track mentally. "Users can change a story's assignee" → AI notices:

  • Story has open comments → do mentions persist?
  • Story has time entries → do they reassign?
  • Story is in a sprint → does sprint capacity recompute?
  • New assignee is on leave → warning shown?

Humans catch 1-2. AI catches all 4 every time, because it's reading the actual schema.
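For instance, the on-leave case might come back as a scenario like this (names and wording are hypothetical):

Scenario: New assignee is on leave
  Given the story "Checkout flow" is assigned to me
  And the user "Dana" is marked as on leave this week
  When I reassign the story to "Dana"
  Then a warning tells me the new assignee is on leave
  And I can confirm the reassignment or choose someone else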

5. PRD-to-Gherkin pipeline

Drop a 4-page PRD on the AI. Get back 12-15 user stories, each with Gherkin AC, test case skeletons, and inferred dependencies. The PM's job becomes editing 12 generated stories (30-60 min) instead of authoring 12 stories (a half day).

Where AI breaks (5 anti-patterns)

This is the part that matters. AI writes Gherkin garbage in five recognisable patterns.

1. Generic Gherkin that could fit any story

Most common failure. The AI fills space with verbs and nouns that are technically related to the story but constrain nothing.

Story: "Add Slack notifications for sprint completion."

Bad AI output:

Scenario: Sprint closes successfully
  Given a sprint is closed
  When the closure happens
  Then a notification is sent to Slack

Three lines of nothing. Which channel? Which message? What if Slack is offline? What does the user see in-product?

Fix: Always pair the AI's first draft with a "what could go wrong" follow-up. The AI is much better at finding holes in its own draft than at writing the draft in one shot.
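For comparison, a tightened version of the Slack scenario might look like this (channel name, counts, and timing are illustrative assumptions, not a spec):

Scenario: Sprint closure posts a summary to the configured channel
  Given the project's Slack integration posts to "#sprint-updates"
  And the sprint "Sprint 14" has 8 completed and 2 carried-over stories
  When I close "Sprint 14"
  Then a message appears in "#sprint-updates" within 1 minute
  And the message names the sprint and both story counts

Scenario: Slack is unreachable when the sprint closes
  Given the Slack integration is failing
  When I close "Sprint 14"
  Then the sprint still closes
  And a banner in the sprint view reports the notification failed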

2. Implementation-leak in disguise

The AI sometimes writes Gherkin like a junior engineer's ticket comment: implementation choices disguised as behaviour.

Bad output:

Scenario: Sprint close persists to event log
  Given the API receives a POST to /api/sprints/:id/close
  When the request is authenticated
  Then a row is inserted into sprint_events table with type=closure

This is the story you ship later, after architecture. The AC for the user-facing story shouldn't constrain implementation.

Fix: System prompt should explicitly say "describe behaviour visible to the user, not implementation choices." Models follow this constraint well when told.
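Rewritten as user-visible behaviour, the same intent might read (a hypothetical sketch):

Scenario: Closing a sprint records it in history
  Given I am a project admin viewing the active sprint
  When I close the sprint
  Then the sprint appears in the project's sprint history
  And the history entry shows who closed it and when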

3. False-positive completeness

AI writes 8 scenarios and confidently asserts the story is fully specified. Sometimes it is. Often it isn't — a 9th case requires domain knowledge the AI doesn't have.

Example: Report export story. AI writes scenarios for CSV, Excel, PDF, JSON, scheduled email. Misses: what if the report has 10M rows? (Pagination? Chunking? Async with email-link?)

Fix: Always read AI-Gherkin with "what does this story not say about scale or limits?" Volume edge cases are the single most common gap.
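One way the missing volume case might be specified, assuming the team picks the async-with-email-link resolution mentioned above (limits and wording are illustrative):

Scenario: Exporting a report above the synchronous row limit
  Given my report contains more rows than the synchronous export limit
  When I request a CSV export
  Then I see a message that the export is running in the background
  And I receive an email with a download link when it completes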

4. Hallucinated dependencies

When AI has graph access, it sometimes invents relationships. "This story depends on STORY-451" — but STORY-451 is about unrelated functionality.

This failure is rare in well-prompted systems (the model has to ground its claims in real graph data) but common in naive integrations that feed the AI a bare list of story titles.

Fix: Validate dependency claims against actual story graph before accepting. Stride does this server-side; teams using ChatGPT directly to write Gherkin don't, and pay for it.

5. Wrong granularity on edge cases

AI sometimes treats trivial cases as critical. "What if the display name has a single quote?" gets the same prominence as "What if the user is offline?" — one is a 5-minute fix, the other is a strategic call about supporting offline.

Fix: Edit pass. Group cases by tier (must-handle / should-handle / nice-to-have); let the team triage.
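Gherkin tags are one way to make that triage visible in the artefact itself, since test runners can filter by tag (tier names and wording are illustrative):

@must-handle
Scenario: User goes offline mid-archive
  Given I have selected 5 issues
  When my connection drops before the archive completes
  Then the board shows which issues were not archived

@nice-to-have
Scenario: Display name contains a single quote
  Given a user named "O'Brien" is on the project
  When I view the board
  Then the name renders correctly everywhere it appears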

Prompts that work

We A/B-tested prompt structures against ~2,000 stories. Patterns that work:

  1. "Write Gherkin AC for this story, with at least one scenario for empty state, error state, and permission denied." Forces minimum surface area.

  2. "Write Gherkin (Given/When/Then). Use 'I' as the subject for user actions. Use present tense throughout." Locks syntax.

  3. "After writing the AC, list 3 cases your scenarios don't cover that a senior QA engineer would think to test." AI is much better at self-critique than self-completion.

  4. "For each scenario, identify if it describes user-visible behaviour or system implementation. Rewrite any implementation lines to be behaviour-only." Catches the implementation-leak anti-pattern.

  5. "Reference these existing stories when relevant: [linked story IDs]. Do not invent new dependencies." Constrains hallucination.

Prompts that don't

  1. "Write good Gherkin for this story." (Goodness undefined.)
  2. "Write comprehensive scenarios covering all cases." (Model interprets "all" as "everything I can think of" — over-specifies.)
  3. "Write Gherkin like a senior PM would." (Persona prompting works for marketing copy; for technical AC it produces flowery prose without sharper logic.)
  4. "Make sure the AC is testable." (Too abstract. Better: "For each scenario, write a one-sentence test plan.")
  5. "Cover edge cases." (Same problem as "comprehensive" — undefined target.)

How Stride does it

The system prompt that drives our Gherkin generation:

  • Reads the story body + project's data model graph (existing stories, sprints, integrations)
  • Asks for Gherkin format with a Feature: header
  • Requires minimum 5 scenarios including: happy path, empty state, error state, permission denied, and one behaviour-vs-implementation check
  • Generates a separate "Out of scope" list under each story
  • Surfaces 3 follow-up cases the AC doesn't cover, marked as suggestions for the PM
  • Cross-references real story IDs only — no hallucinated dependencies

PM edit pass usually drops 2-3 scenarios and tightens wording on the rest. Total time per story: 4-7 minutes from prompt to merge.

Teams using AI Gherkin generation ship 31% fewer defects per story in the first 90 days. The lift is entirely from edge cases the AI surfaces that humans skip.

Internal data, n=5,200 stories · Stride telemetry, Q1 2026


The point isn't that AI writes your Gherkin for you. It's that the boring 70% is over, and the PMs who use the saved time to deeply understand the remaining 30% are the ones whose teams ship cleaner software.
