
Story sizing without flame wars

Fibonacci vs t-shirt, when to estimate, when to stop, and how AI helps without taking over the room.


There are two ways to lose 45 minutes of every sprint planning meeting:

  1. Debate the difference between a 3 and a 5 for the seventh time this quarter.
  2. Skip estimation entirely and hope it works out.

Both are common. Both are bad. This article is about the middle path — story sizing that converges fast, stays consistent, and doesn't become its own meeting.

Why estimate at all

Two reasons. The first is the planning input — you can't fit stories into capacity if you don't know how big they are. The second, less obvious one, is that estimation forces breakdown. A team that can't agree on whether something is a 3 or an 8 has usually surfaced that the story isn't actually scoped — there are hidden alternatives in the problem space.

If you skip estimation, you skip the conversation. Mid-sprint you discover the hidden alternatives and the team is upset that "we never talked about this."

The two scales that work

Fibonacci (1, 2, 3, 5, 8, 13). The classic. Each step is non-linearly bigger than the last, which matches the reality that you're more uncertain about bigger stories. Most teams settle here.

T-shirt sizing (XS, S, M, L, XL, XXL). Easier to introduce on teams new to estimation. Maps cleanly to Fibonacci internally (XS=1, S=2, M=3, L=5, XL=8, XXL=13). The lack of numbers reduces the false-precision argument.
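If you record t-shirt sizes but plan capacity in points, the internal mapping above is just a lookup table. A minimal sketch (the function and the sample sprint are illustrative, not from any particular tool):

```python
# T-shirt sizes mapped to their Fibonacci equivalents, per the mapping above.
TSHIRT_TO_POINTS = {"XS": 1, "S": 2, "M": 3, "L": 5, "XL": 8, "XXL": 13}

def to_points(size: str) -> int:
    """Convert a t-shirt size to Fibonacci points; reject unknown sizes."""
    try:
        return TSHIRT_TO_POINTS[size.upper()]
    except KeyError:
        raise ValueError(f"Unknown t-shirt size: {size!r}")

# A sprint's worth of t-shirt-sized stories still sums to a point total:
sprint = ["S", "M", "M", "L", "XS"]
total = sum(to_points(s) for s in sprint)  # 2 + 3 + 3 + 5 + 1 = 14
```

Keeping the mapping in one place means the team argues about sizes, not about conversion.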

Either works. Both fail if the team uses linear scales (1-10, hours, anything that suggests precision the team doesn't actually have).

The conversation that converges fast

The trap is treating estimation like a vote. "Everyone show me your point estimate on three." Different numbers come up. People argue.

Instead: have one person who knows the work propose a number. The team's job is to challenge it. "Why 5 and not 3?" forces the proposer to articulate the complexity. If the challenge surfaces something the proposer didn't know, the number updates. If the challenge is "I just feel like 3," the proposer's number stands.

This converges in 60-90 seconds per story for a calibrated team. Voting takes 3-5 minutes per story.

What good estimates capture

Three things, in order of importance:

  1. Uncertainty. A 5 isn't just "more than a 3 in time." It's "more time AND more variance." A 1 is something the team is sure about; a 13 is something they're not.

  2. Complexity. Cyclomatic complexity, number of moving parts, integration surfaces touched. A simple but tedious task (rename a thing in 200 files) is a 1 or 2. A subtle but small task (fix a race condition) might be a 5 because the rabbit hole has uncertain depth.

  3. Effort. The actual hours. Last, not first. Time estimates are notoriously bad in software; points are an attempt to avoid the false precision of "this will take 6 hours."

Anti-patterns

Re-estimating during the sprint. A story scoped as 3 is taking longer than expected. The temptation is to bump it to 5 mid-sprint. Don't — the original estimate is data about your team's estimation accuracy, and changing it pollutes that data. Note the actual size at retrospective; the next similar story gets the corrected estimate.
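The retro step can stay lightweight: keep the original estimate untouched and record the size the story turned out to be, then flag the big misses. A sketch under assumed data and field names (nothing here comes from a real tool):

```python
# Each record keeps the original estimate plus the size assigned at
# retrospective. The story data below is hypothetical.
stories = [
    {"title": "rename config keys", "estimated": 2, "actual": 2},
    {"title": "fix race in job queue", "estimated": 3, "actual": 8},
    {"title": "new billing webhook", "estimated": 5, "actual": 5},
]

FIB = [1, 2, 3, 5, 8, 13]

def steps_off(estimated: int, actual: int) -> int:
    """How many Fibonacci steps the actual size landed from the estimate."""
    return abs(FIB.index(estimated) - FIB.index(actual))

# Stories two or more steps off are the ones whose next similar story
# should get the corrected estimate.
misses = [s["title"] for s in stories
          if steps_off(s["estimated"], s["actual"]) >= 2]
```

The threshold of two steps is arbitrary; the point is that the original numbers stay intact as calibration data.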

Estimating in hours. "This is a 2-hour task." Three weeks later, when the task took 8 hours, the team feels lied to. Switch to points (or t-shirt) — the team's velocity tracks actual completion, which is what matters.

Average-of-the-team voting. Some teams take the mean of everyone's estimate. This systematically biases toward the loudest estimator and away from the engineer who actually understands the work. Use the "propose, challenge, defend" method instead.

Compulsive splitting. Every story split into 1-point pieces feels precise but creates planning noise (50 cards to track instead of 12). Split when stories exceed 8 points; resist splitting smaller stories just for granularity.

Comparing across teams. "Team A averages 50 points per sprint, Team B averages 30. Team A is more productive." False. Point scales are team-specific by design. The moment you compare across teams, teams inflate their points to look better. See the velocity glossary for why this is the canonical anti-pattern.

How AI helps without taking over

The model is good at:

  • Surfacing similar past stories ("we sized something similar as a 5 in Sprint 12")
  • Pointing out hidden complexity ("this story touches the auth service, which has historically been bumpy")
  • Generating a starting estimate to react to (lowers the cognitive cost of "who goes first")

The model is bad at:

  • Knowing the team's calibration (the 5 in your team's history is what calibrates this one)
  • Knowing what's changed since the similar past story (the auth service got refactored, the comparable story is no longer comparable)
  • Picking the right scale (some stories should be a 3 because we want them small; the model doesn't know your intent)

The healthy pattern: AI proposes; the team challenges. Same shape as the human "propose, challenge, defend" approach — just with the model as the first proposer when nobody on the team has loaded context yet.
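A toy sketch of the "AI proposes" half: surface the most similar past story and offer its points as the opening number for the team to challenge. Real tools would use embeddings and the team's actual history; the word-overlap similarity and sample data here are stand-ins for illustration:

```python
def similarity(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word sets -- a crude stand-in for embeddings."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

# Hypothetical sized history: (story title, points).
history = [
    ("Add rate limiting to auth service", 5),
    ("Rename feature flags in config", 2),
    ("Add retry logic to auth token refresh", 5),
]

def propose(new_story: str) -> tuple[str, int]:
    """Return the most similar past story and its points as a starting proposal."""
    return max(history, key=lambda h: similarity(new_story, h[0]))

reference, points = propose("Add audit logging to auth service")
```

The output is only an opening bid; the team's challenge — "the auth service got refactored since then" — is still what sets the number.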
