Can AI write Gherkin? (yes — here's how)
Short answer: Yes — AI writes Gherkin well, often better than humans for surface area coverage. But the AI fails predictably in five places, and those failures look identical to passing tests if you don't know what to watch for.
Below is the map: where AI wins at Gherkin, where it confidently writes garbage, and the prompts that work.
What good Gherkin looks like
Before the AI question: what are we even aiming at?
```gherkin
Feature: Bulk archive issues

  Scenario: User archives 5 issues from the board
    Given I am viewing the project board with at least 5 issues
    When I select 5 issues using the checkbox column
    And I click the "Archive" action
    Then the selected issues disappear from the board
    And an undo toast appears for 8 seconds
    And the archived issues are visible under Filters → Archived
```

Three properties:
- Behavioural. Describes what the user does and observes, not what the system implements (no mentions of databases, services, or APIs).
- Bounded. Each scenario tests one path; combinations get separate scenarios.
- Unambiguous. Read it out loud: does it have one possible interpretation? If "the issues disappear" could mean "from this view" or "from all views" or "permanently," the AC is broken before testing.
Most teams use Gherkin loosely. The AI doesn't — it writes mechanically consistent output. That's the win.
Where AI is great at Gherkin (5 wins)
1. Surface-area coverage
Humans write Gherkin for the happy path. AI writes Gherkin for the happy path plus nine variations the human forgot:
- Empty state
- Single-item state
- Pagination boundaries
- Permission denied
- Rate-limited
- Concurrent modification
- Network error mid-action
- Browser-back navigation
- Accessibility (keyboard-only, screen reader announcement)
For "Users can bulk-archive issues" the AI generates 9-12 scenarios. Humans, on a good day, write 3. The 6-9 it adds are the bugs you'd otherwise discover in QA.
2. Syntax discipline
Humans mix verb tenses, leave subjects ambiguous, and write conjunction-spaghetti ("the user sees X and the system also stores Y"). The AI does none of this: output is consistent, test runners parse it cleanly, and you spend less time debugging malformed scenarios.
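A rough before/after sketch of that discipline, with invented wording for illustration:

```gherkin
# Hand-written: mixed tense, vague subject, two assertions welded into one step
Scenario: Archive worked
  Given the user selected some issues
  When archive was clicked
  Then the user sees a toast and the system also stores an audit record

# Machine-consistent: one subject, present tense, one observable assertion per step
Scenario: User archives selected issues
  Given I have selected 3 issues on the board
  When I click the "Archive" action
  Then the selected issues disappear from the board
  And an undo toast appears
```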
3. Anti-requirements (out-of-scope explicitness)
Humans almost never write what's NOT in scope. AI, when prompted to, does:
Out of scope for this story:
- Bulk archive via the API (UI-only story)
- Restoring archived issues (separate story)
- Audit log of who archived what (separate story)
That out-of-scope list saves days of re-spec when devs hit the boundary mid-implementation.
4. Schema-aware edge cases
When the AI has access to the project graph (as Stride does), it automatically surfaces edge cases that depend on relationships humans don't track mentally. "Users can change a story's assignee" → the AI notices:
- Story has open comments → do mentions persist?
- Story has time entries → do they reassign?
- Story is in a sprint → does sprint capacity recompute?
- New assignee is on leave → warning shown?
Humans catch 1-2. AI catches all 4 every time, because it's reading the actual schema.
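As an illustration, the last of those cases might come back as something like this; the names and warning copy are invented for the sketch.

```gherkin
Scenario: Reassigning a story to a user who is on leave shows a warning
  Given the story "Checkout polish" is assigned to Dana
  And Priya is marked as on leave for the rest of the sprint
  When I change the story's assignee to Priya
  Then the assignee is updated to Priya
  And I see the warning "Priya is on leave until the end of the sprint"
```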
5. PRD-to-Gherkin pipeline
Drop a 4-page PRD on the AI. Get back 12-15 user stories, each with Gherkin AC, test case skeletons, and inferred dependencies. The PM's job becomes editing 12 generated stories (30-60 min) instead of authoring 12 stories (a half day).
Where AI breaks (5 anti-patterns)
This is the part that matters. AI writes Gherkin garbage in five recognizable patterns.
1. Generic Gherkin that could fit any story
Most common failure. AI fills space with verbs and nouns technically related to the story but not constraining anything.
Story: "Add Slack notifications for sprint completion."
Bad AI output:
```gherkin
Scenario: Sprint closes successfully
  Given a sprint is closed
  When the closure happens
  Then a notification is sent to Slack
```

Three lines of nothing. Which channel? Which message? What if Slack is offline? What does the user see in-product?
Fix: Always pair the AI's first draft with a "what could go wrong" follow-up. The AI is much better at finding holes in its own draft than at writing the draft in one shot.
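For contrast, a sketch of what a constrained version of the Slack story could look like after that follow-up pass. The channel name, message contents, and failure handling are assumptions made for illustration, not requirements from the original story.

```gherkin
Scenario: Sprint completion posts a summary to the configured Slack channel
  Given the project is connected to the Slack channel "#delivery-updates"
  And the sprint "Sprint 14" has 12 completed and 2 carried-over issues
  When I close the sprint
  Then a message is posted to "#delivery-updates" within 1 minute
  And the message includes the sprint name and the completed and carried-over counts

Scenario: Slack is unreachable when the sprint closes
  Given the project is connected to Slack
  And Slack is unreachable
  When I close the sprint
  Then the sprint still closes successfully
  And I see an in-product notice that the Slack notification could not be delivered
```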
2. Implementation-leak in disguise
The AI sometimes writes Gherkin like a junior engineer's ticket comment: implementation choices disguised as behaviour.
Bad output:
```gherkin
Scenario: Sprint close persists to event log
  Given the API receives a POST to /api/sprints/:id/close
  When the request is authenticated
  Then a row is inserted into sprint_events table with type=closure
```

This is the story you ship later, after architecture. The AC for the user-facing story shouldn't constrain implementation.
Fix: System prompt should explicitly say "describe behaviour visible to the user, not implementation choices." Models follow this constraint well when told.
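A behaviour-only rewrite of the same scenario might read like this; the exact status and history wording is an assumption.

```gherkin
Scenario: Closing a sprint is recorded in its history
  Given I am a project admin viewing the sprint "Sprint 14"
  When I close the sprint
  Then the sprint's status shows "Closed"
  And the sprint's history shows a closure entry with my name and the time
```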
3. False-positive completeness
AI writes 8 scenarios and confidently asserts the story is fully specified. Sometimes it is. Often it isn't — a 9th case requires domain knowledge the AI doesn't have.
Example: Report export story. AI writes scenarios for CSV, Excel, PDF, JSON, scheduled email. Misses: what if the report has 10M rows? (Pagination? Chunking? Async with email-link?)
Fix: Always read AI-generated Gherkin asking "what does this story not say about scale or limits?" Volume edge cases are the single most common gap.
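Written out, the missing scale case could look roughly like this; the row threshold and the async email-link handling are assumptions about one way the product might behave.

```gherkin
Scenario: Exporting a report too large for a direct download
  Given the report "All-time issue activity" has more than 10 million rows
  When I export it as CSV
  Then I see a notice that the export will be prepared in the background
  And I receive an email with a download link when the export is ready
```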
4. Hallucinated dependencies
When AI has graph access, it sometimes invents relationships. "This story depends on STORY-451" — but STORY-451 is about unrelated functionality.
This is rare in well-prompted systems (where the model has to ground in real graph data) but common in naive integrations that feed the AI a bare list of story titles.
Fix: Validate dependency claims against the actual story graph before accepting them. Stride does this server-side; teams using ChatGPT directly to write Gherkin don't, and pay for it.
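The check itself is testable behaviour. A sketch of how it might be specified, with placeholder story IDs:

```gherkin
Scenario: AI-suggested dependency on a story outside the project graph is rejected
  Given the project graph contains stories STORY-101 through STORY-120
  When the AI drafts AC that claims a dependency on STORY-451
  Then the dependency is not attached to the story
  And the claim is flagged for the PM to review
```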
5. Wrong granularity on edge cases
AI sometimes treats trivial cases as critical. "What if the display name has a single quote?" gets the same prominence as "What if the user is offline?" — one is a 5-minute fix, the other is a strategic call about supporting offline.
Fix: Edit pass. Group cases by tier (must-handle / should-handle / nice-to-have); let the team triage.
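One lightweight way to carry that triage into the spec itself is Gherkin tags. The tag names below are just one possible convention, reusing the two examples above.

```gherkin
@must-handle
Scenario: User goes offline before the archive completes
  Given I have selected 3 issues on the board
  And my network connection drops
  When I click the "Archive" action
  Then I see an error explaining that the action did not complete
  And the issues remain on the board

@nice-to-have
Scenario: Display name containing a single quote renders correctly
  Given a user whose display name is "D'Angelo" is assigned to an issue on the board
  When I view the board
  Then the assignee's display name renders exactly as "D'Angelo"
```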
Prompts that work
We A/B-tested prompt structures against ~2,000 stories. Patterns that work:
- "Write Gherkin AC for this story, with at least one scenario for empty state, error state, and permission denied." Forces minimum surface area.
- "Write Gherkin (Given/When/Then). Use 'I' as the subject for user actions. Use present tense throughout." Locks syntax.
- "After writing the AC, list 3 cases your scenarios don't cover that a senior QA engineer would think to test." AI is much better at self-critique than self-completion.
- "For each scenario, identify if it describes user-visible behaviour or system implementation. Rewrite any implementation lines to be behaviour-only." Catches the implementation-leak anti-pattern.
- "Reference these existing stories when relevant: [linked story IDs]. Do not invent new dependencies." Constrains hallucination.
Prompts that don't
- "Write good Gherkin for this story." (Goodness undefined.)
- "Write comprehensive scenarios covering all cases." (Model interprets "all" as "everything I can think of" — over-specifies.)
- "Write Gherkin like a senior PM would." (Persona prompting works for marketing copy; for technical AC it produces flowery prose without sharper logic.)
- "Make sure the AC is testable." (Too abstract. Better: "For each scenario, write a one-sentence test plan.")
- "Cover edge cases." (Same problem as "comprehensive" — undefined target.)
How Stride does it
The system prompt that drives our Gherkin generation:
- Reads the story body + project's data model graph (existing stories, sprints, integrations)
- Asks for Gherkin format with a Feature: header
- Requires a minimum of 5 scenarios including: happy path, empty state, error state, permission denied, and one behaviour-vs-implementation check
- Generates a separate "Out of scope" list under each story
- Surfaces 3 follow-up cases the AC doesn't cover, marked as suggestions for the PM
- Cross-references real story IDs only — no hallucinated dependencies
PM edit pass usually drops 2-3 scenarios and tightens wording on the rest. Total time per story: 4-7 minutes from prompt to merge.
Teams using AI Gherkin generation ship 31% fewer defects per story in the first 90 days. The lift is entirely from edge cases the AI surfaces that humans skip.
Gherkin scenarios + test case skeletons generated from your stories' AC at create time, with traceability that maintains itself.
Read next
- How AI writes acceptance criteria (and where it fails) — companion post, broader scope on AI-AC.
- Stride vs Jira — procurement-stage view including the Jira+AI AC gap.
- The Gherkin glossary entry and acceptance-criteria glossary entry cover the underlying concepts.
The point isn't that AI writes your Gherkin for you. It's that the boring 70% is over, and the PMs who use the saved time to deeply understand the remaining 30% are the ones whose teams ship cleaner software.