Test-case design that doesn't go stale

Behaviour-anchored Gherkin survives refactors that break step-anchored UI tests. The 5 components every good case has, and structural moves that age well.

May 23, 2026 · 9 min read

Most test cases go stale. They're written for a specific UI, they reference specific elements, they capture a specific user flow, and the third refactor that touches that flow breaks every test case downstream. The team spends a sprint updating tests instead of writing new ones, then accepts that the test suite is a maintenance tax.

The fix is structural: write test cases that describe behaviour, not implementation. A behaviour-anchored test case stays accurate across UI changes because the behaviour doesn't change when the buttons move.

Behaviour-anchored vs step-anchored

The two test-case styles, side by side:

Step-anchored (fragile):

Click the "Login" button in the header
Enter user@example.com in the "Email" field
Enter password123 in the "Password" field
Click the blue "Sign In" button
Wait for the dashboard to load
Verify the "Welcome" heading appears

Behaviour-anchored (resilient):

Given a registered user with valid credentials
When the user submits the login form with their email and password
Then they are redirected to the dashboard and see a personalised welcome message

The step-anchored case breaks the moment someone moves the Login button, renames it to "Sign in", or changes its colour. The behaviour-anchored case survives all three, because it names none of them. The underlying behaviour ("authenticated users see their dashboard") is what the team committed to, not the position of the button.

The 5 components of a good test case

Every well-formed test case has these five elements. Skip any one and the case becomes harder to maintain.

1. A clear title

The title should describe what's being tested in one sentence: "Authenticated users see their dashboard after login", not "Login test 7" or "TC-2384". The title is what shows up in the test report; if it's opaque, the report is opaque.

2. Preconditions

What state must exist before the test runs? "A registered user with valid credentials" is a precondition; "the database is empty" is a precondition; "the feature flag is enabled" is a precondition. Hand-waving here produces tests that pass in one environment and fail in another.

3. Action

What does the test do? Be specific about the action without binding to specific UI elements. "Submits the login form", not "clicks the button at coordinate X,Y".

4. Expected outcome

What should happen? Be specific about the observable behaviour. "Sees a personalised welcome message", not "the page renders". Vague expectations produce tests that "pass" but verify nothing.

5. Why this matters

The often-omitted fifth component: a one-sentence rationale linking the test to a real user value or an acceptance criterion (AC). "Verifies AC-3 of US-1284: users must see authentication confirmation before being shown sensitive data." When the test breaks two years later, this is what tells the maintainer whether to fix it or delete it.
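One way to keep all five components honest is to give them a shape that tooling can check. The following is a minimal sketch, not a prescribed schema: the TestCase interface, its field names, and the missingComponents helper are all hypothetical names chosen for illustration.

```typescript
// Hypothetical record shape for a test case; the field names are illustrative only.
interface TestCase {
  title: string;
  preconditions: string[];
  action: string;
  expectedOutcome: string;
  rationale: string;
}

// Returns the names of the components that are missing or blank.
function missingComponents(tc: Partial<TestCase>): string[] {
  const missing: string[] = [];
  if (!tc.title?.trim()) missing.push("title");
  if (!tc.preconditions || tc.preconditions.length === 0) missing.push("preconditions");
  if (!tc.action?.trim()) missing.push("action");
  if (!tc.expectedOutcome?.trim()) missing.push("expectedOutcome");
  if (!tc.rationale?.trim()) missing.push("rationale");
  return missing;
}

// The login example from this article, expressed with all five components.
const loginCase: TestCase = {
  title: "Authenticated users see their dashboard after login",
  preconditions: ["A registered user with valid credentials"],
  action: "The user submits the login form with their email and password",
  expectedOutcome:
    "They are redirected to the dashboard and see a personalised welcome message",
  rationale:
    "Verifies AC-3 of US-1284: users must see authentication confirmation before being shown sensitive data.",
};
```

A check like this could run in review tooling or a pre-commit hook, so that a case titled "Login test 7" with no rationale is flagged before it reaches the suite.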

Test cases that age well: the structural moves

Decouple from selectors

Test cases shouldn't hardcode CSS selectors, XPath expressions, or exact button text. They should describe the user-visible behaviour. Modern test tooling supports role-based queries that match by accessibility role and accessible name rather than markup: Playwright and Testing Library provide getByRole, and Cypress gains the equivalent findByRole through the Cypress Testing Library add-on. A query such as getByRole('button', { name: /sign in/i }) is far more resilient than cy.get('.login-btn'), because it keeps working when class names and DOM structure change.
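To make the difference concrete without pulling in a browser, here is a toy model of the two lookup styles. It is a sketch under stated assumptions: UiElement, findByClass, and findByRole are hypothetical stand-ins written for this example, not the Playwright or Testing Library API.

```typescript
// Toy element model; NOT a real testing-library API.
interface UiElement {
  role: string; // accessibility role, e.g. "button"
  accessibleName: string; // what assistive technology announces
  className: string; // styling hook, free to change in any refactor
}

// Markup-anchored lookup: depends on an exact class string.
function findByClass(elements: UiElement[], className: string): UiElement | undefined {
  return elements.find((el) => el.className === className);
}

// Role-anchored lookup: depends on role plus a tolerant name pattern.
function findByRole(
  elements: UiElement[],
  role: string,
  namePattern: RegExp,
): UiElement | undefined {
  return elements.find((el) => el.role === role && namePattern.test(el.accessibleName));
}

// The same button before and after a styling refactor that also changes the label's case.
const beforeRefactor: UiElement[] = [
  { role: "button", accessibleName: "Sign In", className: "login-btn" },
];
const afterRefactor: UiElement[] = [
  { role: "button", accessibleName: "Sign in", className: "btn btn-primary" },
];
```

Against afterRefactor, findByClass(afterRefactor, "login-btn") finds nothing, while findByRole(afterRefactor, "button", /sign in/i) still finds the button. The role-based query only breaks when the button stops being a button or stops being named "sign in", and both are behaviour changes worth a failing test.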

Decouple from data

A test that asserts expect(user.id).toBe(42) is fragile. Any seed-data change breaks it. A test that asserts expect(user).toMatchObject({ email: 'test@example.com' }) is resilient. The principle: assert on the data the test cares about, not on the surrounding state.
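The partial-match idea can be sketched in a few lines. The matchesSubset helper below is a simplified, shallow stand-in written for this example (Jest's toMatchObject also recurses into nested objects), and the two user records are invented seed data.

```typescript
// Shallow partial match: only the keys listed in `expected` are compared;
// any extra keys on `actual` are ignored.
function matchesSubset(
  actual: Record<string, unknown>,
  expected: Record<string, unknown>,
): boolean {
  return Object.keys(expected).every((key) => actual[key] === expected[key]);
}

// The same logical user as produced by two different seed scripts (invented data).
const userFromOldSeed = { id: 42, email: "test@example.com", createdAt: "2024-01-01" };
const userFromNewSeed = { id: 1007, email: "test@example.com", createdAt: "2025-06-30" };

// The resilient assertion holds for both seeds.
const resilientOld = matchesSubset(userFromOldSeed, { email: "test@example.com" });
const resilientNew = matchesSubset(userFromNewSeed, { email: "test@example.com" });

// The fragile assertion (id must be 42) only holds for the old seed.
const fragileOld = userFromOldSeed.id === 42;
const fragileNew = userFromNewSeed.id === 42;
```

Both resilient checks are true, while the fragile check flips from true to false as soon as the seed script changes, even though nothing about the feature under test is different.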

Decouple from timing

A test that uses cy.wait(2000) is a flake waiting to happen: the pause is too short on a slow CI run and wasted time on a fast one. A test that uses cy.findByRole('heading', { name: /welcome/i }) (a Cypress Testing Library query that retries until the element appears or the command times out) is resilient. The principle: wait for observable conditions, not arbitrary durations.
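Under the hood, condition-based waiting is usually a polling loop with a deadline. The sketch below illustrates that idea in plain TypeScript; waitFor, WaitOptions, and simulatedPage are hypothetical names for this example and are not how Cypress or Playwright actually implement their retries.

```typescript
interface WaitOptions {
  timeoutMs: number; // give up after this long
  intervalMs: number; // how often to re-check the condition
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Resolves as soon as the condition holds; rejects once the deadline passes.
async function waitFor(
  condition: () => boolean,
  { timeoutMs, intervalMs }: WaitOptions,
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (true) {
    if (condition()) return;
    if (Date.now() >= deadline) {
      throw new Error(`condition not met within ${timeoutMs} ms`);
    }
    await sleep(intervalMs);
  }
}

// Simulated page: the heading shows up after an unpredictable delay.
const simulatedPage = { heading: "" };
setTimeout(() => {
  simulatedPage.heading = "Welcome";
}, 30);

// The test proceeds the moment the heading is observable, however long that takes
// (up to the timeout), instead of always sleeping for a fixed duration.
const headingAppeared = waitFor(() => /welcome/i.test(simulatedPage.heading), {
  timeoutMs: 1000,
  intervalMs: 10,
});
```

The timeout is an upper bound rather than a fixed cost: a fast environment pays a few milliseconds, a slow one gets the full budget, and a genuine failure produces an error that names the unmet condition.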

Single behaviour per case

A test that asserts five different things is hard to debug when it fails. Which assertion broke? A test that asserts one behaviour with clear setup is easy to diagnose: the failing title tells you what regressed. The instinct to "save time" by combining tests is almost always wrong. Debug time dominates write time across the test's lifetime.

What about UI-flow tests?

Some tests genuinely need to verify a UI flow: multi-step user journeys, payment checkouts, complex form workflows. The structural advice is to use the Page Object Model (POM) or a similar abstraction: write step methods (loginPage.submit(email, password)) once, then reference them in tests (Then loginPage.submit(...) succeeds). When the underlying selectors change, you update the page object, not 200 tests.
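A page object can be sketched as follows. Everything here is an assumption made for illustration: the Driver interface is a hypothetical thin wrapper (a real suite would back it with Playwright, Cypress, or similar), the selectors are invented, and FakeDriver is an in-memory stand-in so the example runs without a browser.

```typescript
// Hypothetical driver interface; a real suite would wrap a browser-automation tool.
interface Driver {
  fill(selector: string, value: string): void;
  click(selector: string): void;
  currentPath(): string;
}

class LoginPage {
  // Selectors live in exactly one place; a UI refactor only touches this map.
  private readonly selectors = {
    email: "[name=email]",
    password: "[name=password]",
    submit: "button[type=submit]",
  };
  private readonly driver: Driver;

  constructor(driver: Driver) {
    this.driver = driver;
  }

  // Behaviour-level step method: tests call this and never see a selector.
  submit(email: string, password: string): void {
    this.driver.fill(this.selectors.email, email);
    this.driver.fill(this.selectors.password, password);
    this.driver.click(this.selectors.submit);
  }
}

// In-memory stand-in for a browser, so the sketch runs on its own.
class FakeDriver implements Driver {
  private fields: Record<string, string> = {};
  private path = "/login";

  fill(selector: string, value: string): void {
    this.fields[selector] = value;
  }

  click(selector: string): void {
    const credentialsOk =
      this.fields["[name=email]"] === "user@example.com" &&
      this.fields["[name=password]"] === "password123";
    if (selector === "button[type=submit]" && credentialsOk) {
      this.path = "/dashboard";
    }
  }

  currentPath(): string {
    return this.path;
  }
}
```

A test then reads as behaviour: create a LoginPage, call submit with valid credentials, and assert that the current path is the dashboard. If the email field's selector changes, only the selectors map inside LoginPage needs an edit.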

What about edge cases?

Edge cases are the highest-value tests because they catch the bugs nobody thinks to test for manually. The discipline is to systematically generate them:

For numerical inputs: 0, 1, -1, MIN_INT, MAX_INT, NaN, Infinity, decimals
For string inputs: empty, single char, max length, special characters, Unicode, SQL-injection patterns
For collections: empty, single item, max size, with nulls
For dates: today, far future, far past, DST transitions, leap year, year-2038
For user states: anonymous, authenticated, banned, expired session, pending verification

The Stride AI test generation feature surfaces these systematically; manual writing can absolutely cover them, but the discipline of explicitly enumerating each category is what separates teams that catch edge bugs from teams that ship them.
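Enumerating the categories by hand can be as simple as a few generator functions. The sketch below covers only the numeric and string categories from the list above; the function names are hypothetical, MIN_INT and MAX_INT are assumed to mean the 32-bit signed bounds, and clampQuantity is an invented function under test.

```typescript
// Numeric edge values; MIN_INT / MAX_INT are assumed to be the 32-bit signed bounds.
function numericEdgeCases(): number[] {
  return [0, 1, -1, -2147483648, 2147483647, NaN, Infinity, -Infinity, 0.1, -0.5];
}

// String edge values for a field with the given maximum length.
function stringEdgeCases(maxLength: number): string[] {
  return [
    "", // empty
    "a", // single character
    "x".repeat(maxLength), // exactly at the limit
    "<>&\"'\\", // special characters
    "héllo wörld 日本語 🙂", // non-ASCII text
    "' OR '1'='1", // SQL-injection pattern
  ];
}

// Invented function under test: coerce any number to a whole quantity in [1, 99].
function clampQuantity(input: number): number {
  if (!Number.isFinite(input)) return 1;
  return Math.min(99, Math.max(1, Math.round(input)));
}

// Run every numeric edge value through the function and collect contract violations.
function quantityViolations(): number[] {
  return numericEdgeCases().filter((value) => {
    const result = clampQuantity(value);
    return !(Number.isInteger(result) && result >= 1 && result <= 99);
  });
}
```

The value of the generator is less the specific numbers than the fact that NaN, negative zero after rounding, and the integer bounds are all exercised every time, rather than only when someone remembers them.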

Maintenance discipline

A test case isn't done when it's written; it's done when it's still passing six months later for the right reasons. Healthy maintenance practices:

Every PR that touches a feature reviews the test cases for that feature. Stale tests get updated; tests that no longer reflect intent get deleted.
Failed tests get triaged within a day. Flaky tests get quarantined immediately, not "I'll look at it later".
Test names get refactored when the behaviour they verify gets renamed. A test called "test_old_user_flow" is a confusing test, even if it still passes.

The team that treats test cases as living artefacts maintains a healthy suite indefinitely. The team that treats them as write-once-forget-forever ends up with a 50K-test suite where 30% are flaky, 20% are stale, and 5% test something nobody understands anymore.

For the maintenance system that surrounds the tests, see Traceability matrix without spreadsheet hell. For deciding which tests to run when, see Regression strategy that scales past 10K tests. For what scripted tests can't catch, see Exploratory testing alongside automation.

Frequently asked questions

Why do most test cases go stale?

They're anchored to specific UI elements: clicking specific buttons, finding specific text, navigating specific routes. When the UI refactors, the tests break even though the underlying behaviour didn't change. Behaviour-anchored test cases (Given/When/Then format) describe what the test verifies without binding to how the UI implements it, so they survive refactors.

What's the difference between Given/When/Then and traditional test steps?

Traditional steps describe actions: "click here, type this, verify that text appears". Given/When/Then describes a state and a transition: "Given a registered user, When they submit credentials, Then they're redirected to the dashboard". The behaviour version survives any UI change that doesn't alter the underlying contract.

How long should a test case be?

A behaviour-anchored case is usually 5-15 lines. Cases longer than that are usually combining multiple behaviours and should be split. Single-behaviour-per-case is the discipline that makes debugging easy when tests fail.

Should I write tests in Gherkin even if I'm not using a BDD framework?

The Gherkin format (Given/When/Then) is useful even without Cucumber or similar BDD tooling. It forces the writer to think in terms of preconditions, actions, and expected outcomes: the three components that survive refactors. You don't need executable Gherkin to benefit from the discipline of writing test cases this way.