Sprint Estimation Reality 2026 — Methodology

500-respondent practitioner survey (Prolific Academic panel + organic top-up) using 8 validated calibration tasks. Anonymised opt-in usage data from the public sprint-capacity-calculator. Pre-registered hypotheses, Wilson 95% CIs, Benjamini–Hochberg FDR correction.

Research questions

The Stride 2026 sprint-estimation study addresses one primary question and four secondary questions.

The primary question is the calibration gap. Across 43 years of published estimation research, software professionals are systematically overconfident about their own task-duration predictions: people report 80% confidence and turn out to be right closer to 60% of the time on hard tasks (Lichtenstein, Fischhoff & Phillips 1982; replicated extensively). Yet almost no engineering organisation measures this gap on its own teams. The Stride 2026 study measures it directly on 500 senior practitioners and tests whether the gap responds to deliberate calibration training, AI-aided estimation, or the choice of estimation scale (story points vs. time-based).

The secondary questions test four hypotheses developed from the existing literature: (1) whether story-point and time-based estimates show the same systematic bias on the same task set; (2) whether AI-aided estimation tools change confidence without changing calibration; (3) whether sprint length predicts estimation accuracy; (4) whether estimation-training ROI varies by team size.

Hypotheses (pre-registered)

These are the five hypotheses we register before fielding closes. Each is paired with a specific operationalisation, an exclusion criterion, and a falsification condition.

H1 — Calibration gap

Operationalisation. Self-reported confidence is measured by survey item §6.2 ("How confident are you, in percentage terms, that your team will complete its next sprint commitment by the sprint end date?"). Measured calibration is computed from the 8 validated estimation-calibration tasks in §7: respondents predict completion times for 8 short software tasks with explicit confidence intervals (e.g., "I am 80% confident this task will take between X and Y minutes"). Brier score across the 8 tasks is the calibration measurement.

Prediction. Mean reported confidence will exceed mean measured calibration (1 − Brier score) by a factor of ≥2× across the 500-person sample. Test: paired t-test on confidence vs. measured calibration; Cohen's d + 95% bootstrap CI.

Falsification. Mean reported confidence ≤ mean measured calibration, OR the multiplier is under 1.3× after controlling for role + company size. Either condition alone would falsify H1.

H2 — Story points don't solve it

Operationalisation. Within the 500-person sample, respondents are stratum-balanced on their team's primary estimation method (story points, time-based hours, t-shirt sizing, three-point PERT, "none"). For story-point and time-based subsamples, the §7 calibration tasks are repeated in the respondent's native scale (the same task asked as both "give a point estimate" and "give an hour estimate" to comparable respondents). Signed error = (predicted − measured) / measured, expressed as a signed percentage.
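
The signed-error formula above can be sketched as follows. This is a minimal illustration, assuming predicted and measured durations are already in the same unit; the function names and sample values are hypothetical and not part of the study's pipeline.

```python
def signed_error(predicted: float, measured: float) -> float:
    """Signed error as defined above: (predicted - measured) / measured."""
    if measured <= 0:
        raise ValueError("measured duration must be positive")
    return (predicted - measured) / measured


def mean_signed_error_pct(pairs) -> float:
    """Mean signed error over (predicted, measured) pairs, as a signed percentage."""
    errors = [signed_error(p, m) for p, m in pairs]
    return 100.0 * sum(errors) / len(errors)


# Hypothetical (predicted, measured) durations in hours, for illustration only.
example_pairs = [(4.0, 5.0), (6.0, 8.0), (10.0, 10.0)]
example_mean = mean_signed_error_pct(example_pairs)
```

Under this definition a negative value means the prediction was shorter than the measured duration, and a positive value means it was longer.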

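Prediction. Mean signed error is statistically indistinguishable between story-point and time-based estimates (two-sample t-test, 95% bootstrap CI on the difference in means; Wilson intervals apply to proportions, not to mean differences). Both means will be negative under the definition above, i.e. predicted time falls short of measured time (optimistic bias), at p < 0.05.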

Falsification. Either method shows a signed-error magnitude ≥1.5× the other at p < 0.05.

H3 — AI changes confidence, not calibration

Operationalisation. AI-aided cohort = respondents who self-report using an AI tool for estimation (Copilot, ChatGPT, Stride, etc.) ≥1× per sprint in §3.4. The control cohort uses no AI for estimation. Both cohorts complete the §6.2 confidence question and the §7 calibration tasks. AI-cohort confidence and calibration are compared to control-cohort confidence and calibration.

Prediction. AI cohort reports higher confidence than control cohort (Cohen's d > 0.3 on the confidence scale) AND shows statistically indistinguishable calibration (the 95% CI on the difference in calibration includes zero). Test: ANOVA on confidence × AI-usage; ANOVA on calibration × AI-usage; effect sizes + bootstrap CIs.

Falsification. AI cohort confidence is statistically indistinguishable from control OR calibration differs significantly. Either condition alone falsifies the "confidence without calibration" hypothesis.

H4 (null) — Sprint length is independent of estimation accuracy

Operationalisation. Sprint length is a stratum (1-week, 2-week, 3-week, 4-week) reported at §2.1. Estimation accuracy is the §7 calibration score. Linear regression with sprint length as predictor and team size + tenure as covariates.

Prediction. The sprint-length coefficient will be statistically indistinguishable from zero after BH-correction across the planned hypothesis family. The null is the finding — sprint length is a planning choice, not a calibration intervention.

Falsification. Significant non-zero coefficient on sprint length (p < 0.05 after BH-correction).

H5 (exploratory) — Training ROI is size-dependent

Reported with explicit exploratory framing. We will compare calibration-training ROI (calibration improvement per training hour) between teams ≤50 engineers and teams 5,000+, but this is hypothesis-generating, not hypothesis-testing. We will not report H5 in the headline findings; it goes in an explicit "Exploratory" section of Volume 1.

Multiple-comparison correction

The four planned hypotheses (H1–H4) form one family. The Benjamini–Hochberg FDR is controlled at q = 0.05 across the family. H5 is explicitly outside the planned family and is not corrected (and is reported with the exploratory caveat that this implies).
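
For readers unfamiliar with the procedure, here is a minimal sketch of Benjamini–Hochberg at q = 0.05 over a four-hypothesis family. The p-values are made up for illustration, and the function is a plain-Python stand-in, not the study's analysis code.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return one boolean per p-value: True where BH rejects at FDR level q."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= (k / m) * q.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            max_rank = rank
    # Reject every hypothesis ranked at or below that k.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected


# Hypothetical p-values for H1..H4, for illustration only.
labels = ["H1", "H2", "H3", "H4"]
example_p = [0.001, 0.04, 0.02, 0.30]
decisions = dict(zip(labels, benjamini_hochberg(example_p)))
```

In this made-up example the thresholds are 0.0125, 0.025, 0.0375 and 0.05, so the two smallest p-values (H1 and H3) are rejected and the other two are not.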

Survey design

The instrument is ~48 substantive items + 4 screening + 3 attention checks + 8 firmographics. Median completion target 12 minutes; pilot will verify.

Screening

S1–S4: role (engineer / EM / staff+ / director / VP — others screened out); tenure ≥5 years software-delivery work; current employment (no full-time students); English proficiency (the instrument is English-only for Volume 1).

Section 1 — Team + context (8 items)

Team size (current), team size (1 year ago), company size, industry, regulated vs. unregulated, remote vs. hybrid vs. co-located, AI-tool adoption stage, tenure on current team.

Section 2 — Sprint structure (5 items)

Sprint length (1/2/3/4 weeks / no sprints), planning frequency, retrospective frequency, on-call rotation involvement, planning poker / t-shirt / three-point / none.

Section 3 — Estimation tools + AI usage (6 items)

Primary estimation method (story points / hours / t-shirt / three-point / no estimation), secondary method, AI tool usage frequency for estimation specifically, AI tool usage frequency overall, perceived effect of AI on estimation, perceived effect of AI on team velocity.

Section 4 — Sprint accuracy self-report (5 items)

Last-sprint completion rate (% of committed work completed), 6-sprint trailing average completion rate, frequency of carry-over stories, frequency of late-add stories, sprint-stress self-rating.

Section 5 — Estimation confidence (8 items)

Confidence in last sprint, confidence in upcoming sprint, confidence interval reporting (do you express ranges?), historical accuracy tracking, calibration training history, post-mortem cadence.

Section 6 — Calibration self-rating + accuracy (4 items)

Self-rated calibration accuracy (0-100 scale), comparison to peers, comparison to industry, willingness to participate in follow-up validation tasks.

Section 7 — Calibration tasks (8 validated items)

These tasks are the key measurement. Respondents are presented with 8 short, well-bounded software tasks (e.g., "implement a CSV-to-JSON conversion utility in your primary language with three test cases"). For each, they provide:

  • A point estimate of completion time (in minutes).
  • An 80% confidence interval (lower bound, upper bound).
  • A 95% confidence interval (lower bound, upper bound).

The reference distribution for each task comes from a prior pilot (n=40) that measured actual completion times across a diverse software-engineering population. Volume 1 will apply a Bayesian update to this reference distribution if the primary-study sample yields divergent completion times.

Brier score is computed across the 8 tasks per respondent. The 80% and 95% intervals provide calibration measurements at two confidence levels (replicating the Lichtenstein et al. 1982 calibration-curve methodology).
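
One way to picture the per-respondent scoring is the sketch below. It rests on assumptions this page does not confirm: each interval is treated as a probability forecast that a single reference completion time falls inside it, and the Brier score is the mean squared gap between the stated confidence and the hit/miss outcome. All names and numbers are hypothetical.

```python
def interval_hits(intervals, reference_times):
    """1 where the reference time falls inside the stated interval, else 0."""
    return [
        1 if lower <= t <= upper else 0
        for (lower, upper), t in zip(intervals, reference_times)
    ]


def hit_rate(hits):
    """Share of intervals that contained the reference time."""
    return sum(hits) / len(hits)


def brier_score(stated_confidence, hits):
    """Mean squared difference between stated confidence and outcome (0 or 1)."""
    return sum((stated_confidence - h) ** 2 for h in hits) / len(hits)


# Hypothetical reference times (minutes) and one respondent's 80% intervals.
reference = [30, 45, 20, 60, 25, 90, 40, 15]
intervals_80 = [(20, 40), (30, 50), (10, 15), (40, 70),
                (10, 20), (60, 100), (20, 35), (10, 20)]

hits_80 = interval_hits(intervals_80, reference)
rate_80 = hit_rate(hits_80)           # observed coverage of the 80% intervals
brier_80 = brier_score(0.80, hits_80)
```

A perfectly calibrated respondent would see roughly 80% of their 80% intervals contain the reference time; here 5 of 8 do, so the observed coverage falls short of the stated confidence.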

Section 8 — Process + culture (4 items)

Process satisfaction, retrospective effectiveness, decision-latitude (Karasek-style autonomy item), workload sustainability.

Geographic region, gender (optional, for stratification balance), opt-in for follow-up study, dataset-release consent (CC-BY-4.0 with anonymised individual responses).

Attention checks

Item #14 ("Please select 'Strongly agree' for this item to confirm you are reading carefully"), item #29 ("Which of the following did you NOT do in the last sprint? — Pluto orbit calculations / Code review / Retro / Estimation poker"), item #42 (a paragraph-reading comprehension item where respondents must select the option that matches the paragraph's claim).

Respondents failing ≥2 of 3 checks are screened out and replaced.

Recruitment

Prolific Academic panel arm

Target n = 400 completes via Prolific Academic. Effective CPI ~$3 USD per complete (Prolific's standard rate × incentive). Pre-screen on Prolific's "Tech industry workers" + custom screener for ≥5-year software-delivery experience.

Organic top-up arm

Target n = 100 completes via:

  • Stride newsletter (~14k subscribers; expected ~0.5% conversion = 70 completes)
  • LinkedIn organic from research team
  • Industry communities (Lobste.rs, r/ExperiencedDevs, eng-management groups)
  • Industry partner referrals

Organic responses carry a separate stratum flag in the dataset so analyses can run with/without them. We expect organic respondents to skew slightly more senior + tech-adopted than the Prolific arm; the stratum flag enables direct adjustment.

Pilot wave

N = 50 panel respondents + 8 think-aloud sessions ($75 honorarium, 30 minutes each) before the main field opens. Iteration window 14 days; any §7 calibration task flagged confusing by ≥3 of 8 think-alouds gets rewording.

Tool-usage arm (sprint-capacity-calculator)

The Stride sprint-capacity-calculator is a free, no-signup, client-side tool. It collects input/output pairs (sprint inputs → calculated capacity → user's actual post-sprint accuracy rating) only when the user explicitly opts in via the "share your accuracy rating" button.

Privacy + anonymisation

  • All telemetry is opt-in. No user is captured without an explicit click on "share accuracy rating."
  • Captured fields: sprint inputs (team size, sprint length, etc.), calculated capacity, post-sprint accuracy rating (Likert 1–5). No personally-identifying fields.
  • Aggregation enforces k-anonymity at k ≥ 10. No cell with fewer than 10 distinct users is published.
  • Where a single user has multiple sprints, only the median accuracy across their sprints contributes (preventing power users from dominating).
  • A 14-day pre-publish window allows users to view + delete their contributed data.
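
The aggregation rules in the list above (one median per user, then suppression of cells under k = 10) can be sketched as follows. The record layout and the pseudonymous `user_id` token are assumptions made for illustration; the page does not describe how distinct users are counted.

```python
from collections import defaultdict
from statistics import median


def aggregate_cells(records, k=10):
    """records: iterable of (user_id, cell_key, rating) tuples.

    Each user contributes only the median of their ratings within a cell,
    and cells with fewer than k distinct users are dropped.
    """
    ratings_by_user = defaultdict(list)
    for user_id, cell, rating in records:
        ratings_by_user[(cell, user_id)].append(rating)

    user_medians_by_cell = defaultdict(list)
    for (cell, _user), ratings in ratings_by_user.items():
        user_medians_by_cell[cell].append(median(ratings))

    return {
        cell: {"users": len(values), "median_rating": median(values)}
        for cell, values in user_medians_by_cell.items()
        if len(values) >= k  # k-anonymity threshold
    }


# Hypothetical data: ten users in the "2-week" cell (one with three sprints),
# nine users in the "4-week" cell.
records = [(f"u{i}", "2-week", 4) for i in range(9)]
records += [("u9", "2-week", r) for r in (1, 5, 5)]
records += [(f"v{i}", "4-week", 3) for i in range(9)]
published = aggregate_cells(records)
```

In this example the "4-week" cell has only nine distinct users and is suppressed, while the three sprints from user `u9` count as a single contribution to the "2-week" cell.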

Cohort analysis

The tool-usage arm answers a different question than the survey arm: does using the calibrated capacity recommendation improve the user's measured sprint accuracy over time? We compute median accuracy as a function of cumulative-uses-of-the-tool, controlling for team size + sprint length. Volume 1 reports the median trajectory with 95% bootstrap CIs.
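
A percentile bootstrap is one common way to put a 95% CI on a median. The sketch below shows it for a single usage bucket, without the team-size and sprint-length controls mentioned above. The ratings, seed, and function name are hypothetical, and the actual pipeline may use a different bootstrap variant.

```python
import random
from statistics import median


def bootstrap_median_ci(values, iterations=10_000, alpha=0.05, seed=2026):
    """Percentile-bootstrap confidence interval for the median of `values`."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(values)
    # Resample with replacement and record the median of each resample.
    medians = sorted(
        median(rng.choices(values, k=n)) for _ in range(iterations)
    )
    lower = medians[int((alpha / 2) * iterations)]
    upper = medians[int((1 - alpha / 2) * iterations) - 1]
    return lower, upper


# Hypothetical per-user median accuracy ratings (Likert 1-5) in one bucket.
bucket_ratings = [3, 4, 4, 5, 2, 4, 3, 5, 4, 4, 3, 4, 5, 4, 2, 4, 3, 4, 4, 5]
ci_low, ci_high = bootstrap_median_ci(bucket_ratings)
```

Repeating this for each cumulative-use bucket would give one interval per point on the trajectory.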

Tool-usage and survey arms are compared stratum-vs-stratum and are never joined at the individual user level.

Statistical methods

Effect sizes, not just p-values

  • Cohen's h for proportion comparisons.
  • Cliff's delta for ordinal Likert items.
  • Cohen's d for continuous magnitudes (with Hedges' g correction for small samples).
  • Wilson 95% CIs on every quoted percentage.
  • Bootstrap 95% CIs (10,000 iterations) on continuous magnitudes.
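
The measures listed above follow standard textbook formulas. The sketch below writes four of them out in plain Python for reference. The function names are illustrative, and a production analysis would more likely rely on an established statistics package.

```python
from math import asin, sqrt
from statistics import mean, variance


def wilson_ci(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


def cohens_h(p1, p2):
    """Cohen's h: difference of arcsine-transformed proportions."""
    return 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))


def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all cross-group pairs."""
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))


def hedges_g(xs, ys):
    """Cohen's d with the usual small-sample (Hedges) correction factor."""
    n1, n2 = len(xs), len(ys)
    pooled_var = ((n1 - 1) * variance(xs) + (n2 - 1) * variance(ys)) / (n1 + n2 - 2)
    d = (mean(xs) - mean(ys)) / sqrt(pooled_var)
    correction = 1 - 3 / (4 * (n1 + n2) - 9)
    return d * correction
```

For example, `wilson_ci(50, 100)` gives an interval of roughly 0.40 to 0.60, symmetric around 0.5 because the observed proportion is exactly one half.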

Multiple-comparison correction

The planned hypothesis family (H1–H4) is corrected with Benjamini–Hochberg FDR at q = 0.05. Exploratory cross-tabs not in the planned family are reported with the explicit "exploratory" framing and are not BH-corrected (and may inflate Type I error commensurately).

Post-stratification weighting

Sample weights are computed against Stack Overflow Developer Survey 2024 base rates on the role × company size × region cross-tab, restricted to the senior-practitioner subset (≥5 years tenure). Unweighted findings are reported alongside weighted findings. If weighting moves any single headline number by more than 3 percentage points, that number is flagged as a sensitivity concern.
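
In its simplest form, a post-stratification weight is the population share of a stratum divided by its sample share. The sketch below illustrates that idea with a single made-up role dimension. The shares are placeholders, not Stack Overflow figures, and the real recipe uses the full role × company size × region cross-tab.

```python
def poststratification_weights(sample_counts, population_shares):
    """Weight per stratum = population share / sample share."""
    total = sum(sample_counts.values())
    weights = {}
    for stratum, count in sample_counts.items():
        if count == 0:
            raise ValueError(f"empty stratum cannot be weighted: {stratum}")
        weights[stratum] = population_shares[stratum] / (count / total)
    return weights


def needs_sensitivity_flag(unweighted_pct, weighted_pct, threshold_points=3.0):
    """True when weighting shifts a headline number by more than the threshold."""
    return abs(weighted_pct - unweighted_pct) > threshold_points


# Hypothetical strata and shares, for illustration only.
sample_counts = {"engineer": 300, "EM": 150, "director+": 50}
population_shares = {"engineer": 0.7, "EM": 0.2, "director+": 0.1}
weights = poststratification_weights(sample_counts, population_shares)
```

Over-represented strata receive weights below 1 and under-represented strata receive weights above 1, so the weighted sample size still sums to the original n.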

Sensitivity analysis

Before publishing the Volume 1 manuscript, we run three sensitivity analyses on each headline finding:

  1. With and without the organic stratum (Prolific-only vs. combined).
  2. With and without attention-check survivors who took ≤6 minutes (the speed-cap stratum).
  3. With three alternative weighting schemes (no weights, Stack Overflow weights, JetBrains weights).

Any headline number that flips direction under any sensitivity condition is reported in the body with the flip flagged inline, not as a footnote.

Reproducibility

The Volume 0 landscape figures are reproducible today. A Jupyter notebook at apps/platform/research/2026/reproducibility/sprint-estimation/landscape-charts.ipynb on GitHub loads the four published-data CSVs in the same folder and re-renders every Volume 0 figure (Cone of Uncertainty, planning fallacy meta-analysis, calibration curve, literature timeline) from the public source numbers. Each CSV row carries a citation_url to its primary source so a peer-reviewer can trace every plotted point back to its origin in one click.

The Volume 1 analysis pipeline will ship as a separate, deeper notebook at the same path on the publish date, licensed CC-BY-4.0. It will be reproducible from a fixed seed, with package versions locked in a requirements.txt (Python) or an renv lockfile (R), whichever the primary statistical reviewer prefers.

Dataset publication (Volume 1)

When Volume 1 lands, the survey dataset publishes under CC-BY-4.0 as a single ZIP bundle plus a Zenodo DOI.

Bundle contents:

  • responses.csv — one row per respondent. Anonymised; no PII. Stratum flag + post-stratification weight per row.
  • calibration-tasks.csv — per-respondent, per-task predictions + measured Brier scores.
  • cross-tabs/ — pre-computed cross-tabs for the planned hypothesis family.
  • tool-usage-aggregates.csv — k-anonymised tool-usage summary tables.
  • data-dictionary.json — JSON-Schema-compatible data dictionary describing every field.
  • weighting-recipe.md — exact reproduction of the post-stratification weighting (Stack Overflow 2024 base rates by stratum).

The integrity hash (SHA-256) of the published bundle appears on the report page next to the download button so reproducibility-conscious readers can verify their copy matches.
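
A reader could check a downloaded copy against the published hash with a few lines of standard-library Python. This is a minimal sketch. The bundle filename in the usage comment is hypothetical, since the actual name is not given on this page.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Hex SHA-256 digest of a file, read in chunks to keep memory use flat."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def bundle_matches(path, published_hash):
    """Compare a local file's digest with the hash shown on the report page."""
    return sha256_of_file(path) == published_hash.strip().lower()


# Usage (hypothetical filename; paste the hash from the report page):
# bundle_matches("sprint-estimation-2026.zip", "<hash from the report page>")
```

Normalising whitespace and case before comparing avoids false mismatches when the hash is copied from a web page.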

Vendor-neutrality posture

Stride is a software-delivery platform; we have commercial interest in sprint planning being a tractable problem. We publish this study with the following disciplines:

  • No tool names appear in survey items without alphabetical ordering. Stride is named only as the publisher in §6 and §9 (where respondents are asked about their estimation tool stack).
  • Comparison tables in Volume 1 cite published study results, not Stride's product claims.
  • The editorial owner has final cut on any number that appears in the report. No marketing or GTM voice in the body.
  • The dataset publishes whether the findings flatter Stride or not. A null result on H4 (no relationship between sprint length and accuracy) is published; a finding that AI-aided estimation hurts calibration is published; a finding that Stride's own customers are no better calibrated than non-Stride teams is published.

The vendor-neutrality posture is self-interested as much as it is principled. A study that buries an inconvenient finding dies in one news cycle. A study that publishes it is the one that gets cited five years later.