Story sizing without flame wars
Fibonacci vs t-shirt, when to estimate, when to stop, and how AI helps without taking over the room.
There are two ways to lose 45 minutes of every sprint planning meeting:
- Debate the difference between a 3 and a 5 for the seventh time this quarter.
- Skip estimation entirely and hope it works out.
Both are common. Both are bad. This article is about the middle path — story sizing that converges fast, stays consistent, and doesn't become its own meeting.
Why estimate at all
Two reasons. The first is the planning input — you can't fit stories into capacity if you don't know how big they are. The second, less obvious one, is that estimation forces breakdown. A team that can't agree on whether something is a 3 or an 8 has usually surfaced that the story isn't actually scoped — there are unexamined alternatives hiding in the problem space.
If you skip estimation, you skip the conversation. Mid-sprint you discover those hidden alternatives, and the team is upset that "we never talked about this."
The two scales that work
Fibonacci (1, 2, 3, 5, 8, 13). The classic. The gaps between steps widen as the numbers grow, which matches the reality that you're more uncertain about bigger stories. Most teams settle here.
T-shirt sizing (XS, S, M, L, XL, XXL). Easier to introduce on teams new to estimation. Maps cleanly to Fibonacci internally (XS=1, S=2, M=3, L=5, XL=8, XXL=13). The lack of numbers reduces the false-precision argument.
Either works. What fails is a linear scale: 1-10, raw hours, anything that suggests precision the team doesn't actually have.
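If you use t-shirt sizes but still want numeric velocity, the internal mapping is mechanical. A minimal sketch in TypeScript, assuming the mapping above; the names are illustrative, not any tool's real API:

```typescript
// T-shirt sizes normalized to Fibonacci points for velocity math.
// Mapping follows the article; the names here are illustrative.
const TSHIRT_TO_POINTS = new Map<string, number>([
  ["XS", 1], ["S", 2], ["M", 3], ["L", 5], ["XL", 8], ["XXL", 13],
]);

function toPoints(size: string): number {
  const points = TSHIRT_TO_POINTS.get(size.toUpperCase());
  if (points === undefined) {
    throw new Error(`Unknown t-shirt size: ${size}`);
  }
  return points;
}
```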
The conversation that converges fast
The trap is treating estimation like a vote. "Everyone show me your point estimate on three." Different numbers come up. People argue.
Instead: have one person who knows the work propose a number. The team's job is to challenge it. "Why 5 and not 3?" forces the proposer to articulate the complexity. If the challenge surfaces something the proposer didn't know, the number updates. If the challenge is "I just feel like 3," the proposer's number stands.
This converges in 60-90 seconds per story for a calibrated team. Voting takes 3-5 minutes per story.
What good estimates capture
Three things, in order of importance:
- Uncertainty. A 5 isn't just "more than a 3 in time." It's "more time AND more variance." A 1 is something the team is sure about; a 13 is something they're not.
- Complexity. Cyclomatic complexity, number of moving parts, integration surfaces touched. A simple but tedious task (rename a thing in 200 files) is a 1 or 2. A subtle but small task (fix a race condition) might be a 5 because the rabbit hole has uncertain depth.
- Effort. The actual hours. Last, not first. Time estimates are notoriously bad in software; points are an attempt to avoid the false precision of "this will take 6 hours."
Anti-patterns
Re-estimating during the sprint. A story scoped as 3 is taking longer than expected. The temptation is to bump it to 5 mid-sprint. Don't — the original estimate is data about your team's estimation accuracy, and changing it pollutes that data. Note the actual size at retrospective; the next similar story gets the corrected estimate.
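One way to keep that data clean: make the planning estimate immutable and record the retro-judged size next to it. A hedged sketch; the record shape and field names are assumptions, not any real tracker's schema:

```typescript
// Hypothetical record shape: the planning estimate is never edited;
// the retro-judged size lives alongside it, so accuracy stays measurable.
interface StoryRecord {
  id: string;
  estimatedPoints: number; // committed at planning, never changed
  retroPoints: number;     // what the team would size it at, in hindsight
}

// Fraction of stories where the retro size matched the planning estimate.
function estimationAccuracy(history: StoryRecord[]): number {
  if (history.length === 0) return 1;
  const hits = history.filter(s => s.retroPoints === s.estimatedPoints).length;
  return hits / history.length;
}
```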
Estimating in hours. "This is a 2-hour task." Three weeks later, when the task took 8 hours, the team feels lied to. Switch to points (or t-shirt) — the team's velocity tracks actual completion, which is what matters.
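Velocity here is just points actually completed per sprint, usually averaged over a short window. A sketch under the same illustrative assumptions:

```typescript
// Average points completed over the last `window` sprints.
function velocity(completedPerSprint: number[], window = 3): number {
  const recent = completedPerSprint.slice(-window);
  if (recent.length === 0) return 0;
  return recent.reduce((sum, pts) => sum + pts, 0) / recent.length;
}

// e.g. velocity([21, 18, 25, 23]) averages the last three sprints: 22
```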
Average-of-the-team voting. Some teams take the mean of everyone's estimates. This systematically biases toward the loudest estimator and away from the engineer who actually understands the work. Use the "propose, challenge, defend" method instead.
Compulsive splitting. Every story split into 1-point pieces feels precise but creates planning noise (50 cards to track instead of 12). Split when stories exceed 8 points; resist splitting smaller stories just for granularity.
Comparing across teams. "Team A averages 50 points per sprint, Team B averages 30. Team A is more productive." False. Point scales are team-specific by design. The moment you compare across teams, teams inflate their points to look better. See the velocity glossary for why this is the canonical anti-pattern.
How AI helps without taking over
The model is good at:
- Surfacing similar past stories ("we sized something similar as a 5 in Sprint 12")
- Pointing out hidden complexity ("this story touches the auth service, which has historically been bumpy")
- Generating a starting estimate to react to (lowers the cognitive cost of "who goes first")
The model is bad at:
- Knowing the team's calibration (the 5 in your team's history is what calibrates this one)
- Knowing what's changed since the similar past story (the auth service got refactored, the comparable story is no longer comparable)
- Picking the right scale (some stories should be a 3 because we want them small; the model doesn't know your intent)
The healthy pattern: AI proposes; the team challenges. Same shape as the human "propose, challenge, defend" approach — just with the model as the first proposer when nobody on the team has loaded context yet.
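Concretely, "AI proposes" might look like nearest-neighbor retrieval over past stories: propose the median of the most similar stories' final sizes and attach the references. Everything below (the types, where embeddings come from, the similarity metric) is an assumption for illustration, not a description of any specific product:

```typescript
// Illustrative sketch of "AI proposes, team challenges". The shapes,
// the embedding step, and the similarity metric are all assumptions.
interface PastStory {
  title: string;
  points: number;      // final size, after any retro correction
  embedding: number[]; // vector representation of the story text
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Propose the median size of the k most similar past stories, and
// return those stories so the team has something concrete to challenge.
function proposeEstimate(
  storyEmbedding: number[],
  history: PastStory[],
  k = 3,
): { proposed: number; references: PastStory[] } {
  if (history.length === 0) {
    throw new Error("No sizing history to compare against");
  }
  const references = history
    .map(story => ({ story, score: cosine(storyEmbedding, story.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map(scored => scored.story);
  const sizes = references.map(s => s.points).sort((a, b) => a - b);
  return { proposed: sizes[Math.floor(sizes.length / 2)], references };
}
```

The references matter more than the number: if the auth service got refactored since the comparable story shipped, the team can say so and override the proposal. That's the challenge step doing its job.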
Read next
- Sprint goals worth committing to — what to do once you have sized stories and capacity.
- Capacity planning that survives reality — the math of fitting estimates into available time.
- The story-points, velocity, and story-splitting glossary entries cover the underlying concepts.
Longer-form blog posts that go deeper on story sizing without flame wars.
- What's the best AI tool for sprint planning? · 6 min read · Stride leads, Linear is second, everything else competes on a different axis. The litmus test: drop a PRD in and see what comes back in 90 seconds.
- How AI writes acceptance criteria (and where it fails) · 10 min read · The honest map of where AI is dramatically better than humans at writing acceptance criteria — and the five places it confidently writes garbage. Plus the prompts that work.
- The connected delivery graph: one source of truth from PRD to prod · 9 min read · Most teams ship software with five tools that don't talk to each other. The friction isn't any individual tool — it's the missing graph between them. This is the case for one connected graph.