Sprint Estimation Reality 2026

Forty-three years of estimation research, synthesised. Plus the pre-registered design for a 500-person study of AI-era sprint calibration.

May 18, 2026

Contents

1. Why publish another estimation report
2. The historical baseline 1981–present
3. What modern surveys show
3a. The four landmark studies, side by side
4. The measurement gap nobody fills
5. What Volume 1 will measure
6. Methodology summary
7. Limitations and what to expect
8. FAQ
9. Related work
10. Cite this volume
11. Participate in Volume 1

Key findings

Software estimates have been ~30% optimistic on average across 200+ published time-prediction studies (Halkjelsvik & Jørgensen 2012). The bias has not narrowed over decades.
Boehm's Cone of Uncertainty (1981) predicts 4× variance at project inception, narrowing to 1× at delivery; modern Agile teams broadly replicate the shape at sprint scale.
Story-point and time-based estimates show indistinguishable systematic bias on the same task set (Jørgensen 2014 review of 30 years of estimation research).
People who say they are 80% confident are typically right closer to 66% of the time (Lichtenstein, Fischhoff & Phillips 1982, replicated extensively); software estimators sit on this calibration curve, not above it.
AI-aided estimation may increase reported confidence without changing measured calibration; this is pre-registered as H3 in the Volume 1 study and explicitly NOT a claim yet.

Methodology

500-respondent practitioner survey (Prolific Academic panel + organic top-up) using 8 validated calibration tasks. Anonymised opt-in usage data from the public sprint-capacity-calculator. Pre-registered hypotheses, Wilson 95% CIs, Benjamini–Hochberg FDR correction.

Dataset: Sprint Estimation Reality 2026: Estimation Variance Data · License: CC BY 4.0: attribute "Stride 2026 Sprint Estimation Reality" with link to source URL. · Download raw data

Read the full methodology

Why publish another estimation report

The honest version of an estimation report is harder than the marketing version. The marketing version goes: "Story points are broken / planning poker is broken / AI fixes estimation / here's our framework / call our reps." The honest version notices that 43 years of estimation research, by serious authors with serious methodologies, broadly converges on the same handful of findings, and almost none of that literature gets cited in modern Agile or DevOps thought leadership.

We chose to publish Volume 0 before Volume 1 for three reasons.

First, the literature is mature and rarely surfaced. Boehm's 1981 Cone of Uncertainty is foundational and broadly replicated. Halkjelsvik & Jørgensen's 2012 meta-analysis of 200+ time-prediction studies put the average software overrun at about 30%. Jørgensen's 30-year program of research is the most comprehensive body of empirical work on the topic. Synthesising what's already known is the prerequisite for adding to it.

Second, pre-registering hypotheses before fielding is what separates a research report from advocacy dressed up with citations. The hypotheses we register today are the ones we will test when our 500-respondent primary study fields, not the ones we invent after looking at the data.

Third, the 2026 question is genuinely new even if 1981's framework still applies. Does AI-aided estimation improve calibration (accuracy), or only confidence? If AI raises confidence without raising calibration, sprint commitments will become more confident and more wrong. That is exactly the dynamic that produces the late nights, missed launches, and engineer burnout we synthesise in our companion report on burnout and process debt.

Read the pre-registered methodology

The historical baseline 1981–present

The most-cited single source on software estimation is also the oldest. Boehm's 1981 book Software Engineering Economics introduced the Cone of Uncertainty as a model of estimation variance over the project lifecycle. The Cone is descriptive (what variance looks like in practice across observed projects), not prescriptive. At project inception, estimates have roughly 4× variance: a project budgeted at 100 person-weeks could legitimately turn out to be anywhere from 25 to 400. The cone narrows as decisions get locked in, reaching 1× (the actual outcome) at delivery.

Figure 1. The Cone of Uncertainty (Boehm 1981). Estimation variance multiplier vs project phase. Bounds = published low and high multipliers at each phase, relative to the 1× baseline.

Source: Boehm 1981 Table 22-1 (revisited Boehm & Turner 2003).

Chart description (text)

A funnel-shaped chart. At project inception, the variance band spans from 0.25× to 4.0× (a 16-fold range: an estimate of 100 person-weeks could legitimately turn out to be 25 to 400 person-weeks). The band narrows progressively: at requirements lockdown to 0.5× to 2.0×; at design specification to 0.67× to 1.5×; at interface plan to 0.8× to 1.25×; at detailed design specification to 0.9× to 1.1×; and at delivery the band collapses to 1.0× (the actual outcome). The phases shown follow Boehm 1981 Table 22-1 with labels for Inception, Requirements, Design, Interface, Detailed design, Delivered. A dashed centerline marks the 1× (actual) baseline.
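The band arithmetic in the chart description can be reproduced in a few lines. This is a minimal sketch: the multipliers are copied from the description above, while the names CONE, plausible_range, and spread are illustrative and not part of Boehm's model or any Stride tooling.

```python
# Low and high multipliers per phase, as listed in the chart description.
CONE = {
    "Inception": (0.25, 4.0),
    "Requirements": (0.5, 2.0),
    "Design": (0.67, 1.5),
    "Interface": (0.8, 1.25),
    "Detailed design": (0.9, 1.1),
    "Delivered": (1.0, 1.0),
}


def plausible_range(estimate, phase):
    """Return the (low, high) outcome band for an estimate made at `phase`."""
    low, high = CONE[phase]
    return estimate * low, estimate * high


def spread(phase):
    """Ratio of the high bound to the low bound at `phase`."""
    low, high = CONE[phase]
    return high / low


if __name__ == "__main__":
    print(plausible_range(100, "Inception"))  # (25.0, 400.0)
    print(spread("Inception"))                # 16.0
```

An estimate of 100 person-weeks at inception therefore maps to the 25 to 400 band quoted in the text, and the 16-fold spread is simply 4.0 divided by 0.25.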

≈30%

Mean software effort overrun across 200+ published time-prediction studies. Source: Halkjelsvik & Jørgensen 2012 meta-analysis.

Figure 2. Distribution of software estimation overruns (Halkjelsvik & Jørgensen 2012). Across 200+ published studies, the distribution of mean effort overrun centres at roughly +30% with a positive-skew tail. The minority of studies showing underruns lie in the left tail; the long right tail is the population of projects that ship at 2-3× the original estimate.

Source: Halkjelsvik & Jørgensen 2012, doi.org/10.1037/a0027275. Distribution shape illustrative of the published meta-analysis.

Chart description (text)

A histogram with eleven buckets ranging from -30% overrun to +100% overrun. The distribution is positive-skewed: only a small number of studies fall in the -30% to 0% range (i.e. completed faster than estimated), the mode is at +30%, and the right tail extends through +60%, +80%, and +100% buckets. A dashed orange vertical line at +30% marks the published mean overrun across the meta-analysis. Source: Halkjelsvik & Jørgensen 2012 meta-analysis of 200+ time-prediction studies. The exact bucket counts shown are illustrative of the published distribution shape; the original meta-analysis reports the central tendency at approximately 30% but does not publish per-bucket counts.

Boehm's contemporaries (and the next 25 years of estimation research) broadly replicated the pattern. Magne Jørgensen's 30-year program of research (1995 through 2014) is the most comprehensive empirical body of work on software estimation accuracy in the field. The synthesis: expert judgment is roughly as accurate as formal estimation models (COCOMO, function points, etc.) for most software effort estimation; both are systematically optimistic; neither has been clearly improved by any one technique in 30 years.

Halkjelsvik & Jørgensen's 2012 paper "From origami to software development: A review of studies on judgment-based predictions of performance time" is the most-cited single meta-analysis. Across 200+ time-prediction studies (software, manufacturing, daily-life tasks, origami) the central finding is consistent: people are systematically optimistic about how long their tasks will take, and the bias does not narrow with practice on the same task.

Steve McConnell's 2006 book Software Estimation: Demystifying the Black Art is the closest the practitioner literature has to a standard reference. McConnell catalogues two-dozen estimation techniques (PERT three-point, planning poker, t-shirt sizing, group estimation, fuzzy logic, neural networks…) and argues that the choice of technique matters less than the discipline applied to it: calibration training, structured uncertainty bands, and explicit recording of historical accuracy.

The Standish CHAOS reports, the most-quoted commercial source on software project outcomes, have been heavily critiqued. Eveleens & Verhoef's 2010 paper reanalyses the Standish methodology and shows that the reports' headline "failure rate" is overstated due to definitional choices that systematically classify successful projects as failed. We cite Standish in our reading list with this critique attached; we do not cite Standish in the body without it.

What modern surveys show

The modern survey literature on estimation is thinner than the academic literature, because most engineering surveys focus on adoption and tools rather than calibration accuracy.

The 2024 Atlassian State of Teams report surveys ~5,000 knowledge workers and finds that ~70% of respondents say their teams routinely miss sprint commitments. The report does not measure calibration accuracy (the gap between confidence in an estimate and the empirical accuracy of that estimate), but it provides a useful proxy: a team whose commitments were well calibrated would miss them only about as often as its stated confidence implies, not routinely.

The 2024 JetBrains State of Developer Ecosystem reports that ~64% of developers use story points for estimation and ~17% use hour-based estimates, with the rest using t-shirt sizes, three-point estimation, or no estimation at all. The survey doesn't measure the accuracy of those estimates, but it documents which estimation methods the population uses, which is important context for the Volume 1 primary study's stratification.

The Stack Overflow Developer Survey 2024 (n≈65,000) doesn't have an estimation-accuracy section, but reports that ~74% of professional developers identify "deadlines that don't reflect estimates" as a leading source of stress at work, consistent with the planning-fallacy literature's prediction that estimates produced under the bias produce deadlines that are systematically too tight.

Figure 3. The calibration curve (Lichtenstein, Fischhoff & Phillips 1982). When people say they are X% confident, they are right at the rate shown on the y-axis. The diagonal is well-calibrated; the observed curve sits systematically below.

Source: Lichtenstein, Fischhoff & Phillips 1982, doi.org/10.1017/CBO9780511809477.023.

Chart description (text)

A two-axis chart. The x-axis is self-reported confidence from 50 percent to 100 percent. The y-axis is measured accuracy from 50 percent to 100 percent. The dashed green diagonal represents perfect calibration (you say 80 percent confident, you are right 80 percent of the time). The solid red curve plots the empirically observed relationship across dozens of studies summarised in Lichtenstein, Fischhoff and Phillips 1982: at 50 percent confidence people are right 49 percent of the time (essentially calibrated); at 70 percent confidence right 60 percent of the time (10 percentage point gap); at 80 percent confidence right 66 percent of the time (14 point gap); at 90 percent right 74 percent (16 point gap); at 95 percent right 78 percent (17 point gap); at 99 percent right 81 percent (18 point gap, extreme overconfidence on the rightmost band). Software estimators sit on this curve; the field has not been separately tabulated but subsequent software-specific calibration studies show comparable gaps.
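The gaps quoted in the description are plain subtraction, and values between the plotted points can be approximated. The sketch below uses the six points listed above; the linear interpolation between them is our assumption for illustration, not something the 1982 chapter prescribes, and the function names are hypothetical.

```python
# (stated confidence %, observed accuracy %) pairs from the chart description.
CURVE = [(50, 49), (70, 60), (80, 66), (90, 74), (95, 78), (99, 81)]


def observed_accuracy(stated):
    """Observed accuracy (%) for a stated confidence (%), interpolated linearly."""
    if not CURVE[0][0] <= stated <= CURVE[-1][0]:
        raise ValueError("stated confidence must be between 50 and 99")
    for (x0, y0), (x1, y1) in zip(CURVE, CURVE[1:]):
        if x0 <= stated <= x1:
            return y0 + (y1 - y0) * (stated - x0) / (x1 - x0)


def overconfidence_gap(stated):
    """Percentage points by which stated confidence exceeds observed accuracy."""
    return stated - observed_accuracy(stated)


if __name__ == "__main__":
    print(overconfidence_gap(80))  # 14.0, the 14-point gap quoted above
    print(observed_accuracy(85))   # interpolated between the 80% and 90% points
```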

The four landmark studies, side by side

The table below collapses the historical baseline into one frame. Read across the rows: each study answered a different question, with a different population, on a different time horizon. Read down the "key limitation" column: each is honest about what its design cannot conclude. Volume 1 picks a fifth position (calibration tasks on a primary survey of 500 senior software-delivery practitioners, with pre-registered hypotheses) to fill the gap none of these four cover at the modern AI-era practitioner scale.

Side-by-side comparison of 4 landmark studies cited in this report.
Study	Year	Sample	Method	Headline finding	Key limitation
Boehm 1981 (revisited 2002)	1981	TRW projects (review)	Theoretical model from observed TRW projects	Cone of Uncertainty: 4× variance at inception → 1× at acceptance test. Phase-by-phase narrowing of estimate-to-actual ratios across observed projects.	Pre-Agile era; assumes waterfall phase model. Modern Agile reintroduces uncertainty per-sprint.
Halkjelsvik & Jørgensen 2012	2012	200+ studies meta-analysis	Systematic review of time-prediction studies	Software estimates ~30% optimistic on average across the meta-analysis; no clear improvement over decades.	Mostly Western European / Scandinavian sample skew in the underlying studies.
Jørgensen 2014	2014	Multi-decade synthesis	Forecasting research review (30 years of estimation work)	Expert judgment ≈ formal models for software estimation effort. The choice of technique matters less than the calibration discipline applied to it.	Field still maturing on AI-aided estimation; the review predates modern LLM coding assistants.
Eveleens & Verhoef 2010 (CHAOS critique)	2010	Methodological critique	Reanalysis of Standish CHAOS reports	Standish "failure rate" is overstated; methodology has serious flaws around how successful projects are classified.	Critique itself is contested by Standish. Read both sides before citing Standish data.

The measurement gap nobody fills

Across the historical literature and the modern surveys, one fact stands out. Almost no engineering organisation measures its own estimation calibration. Teams measure velocity (story points completed per sprint), Atlassian and JetBrains measure tool adoption, Stack Overflow measures stress; almost no team measures "when our team said we were 80% confident in this sprint commitment, were we right 80% of the time?"
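The question in quotes is cheap to answer from a team's own sprint history. The following is a minimal sketch, assuming each past commitment was logged with a stated confidence and a met/missed flag; the record format and the function name calibration_table are hypothetical, not the Volume 1 instrument.

```python
from collections import defaultdict


def calibration_table(records):
    """Group (stated_confidence, met) records by confidence and compare to hit rate.

    stated_confidence is a probability in [0, 1]; met is True when the
    commitment was delivered. Confidence is bucketed to one decimal place.
    """
    buckets = defaultdict(lambda: [0, 0])  # confidence -> [hits, total]
    for confidence, met in records:
        key = round(confidence, 1)
        buckets[key][0] += int(met)
        buckets[key][1] += 1
    return {
        conf: {"n": total, "hit_rate": hits / total, "gap": conf - hits / total}
        for conf, (hits, total) in sorted(buckets.items())
    }


if __name__ == "__main__":
    history = [(0.8, True)] * 6 + [(0.8, False)] * 4
    history += [(0.6, True)] * 3 + [(0.6, False)] * 2
    for conf, row in calibration_table(history).items():
        print(conf, row)
```

A positive gap means the team was more confident than its track record supports; in the sample history, commitments made at 80% confidence were met 60% of the time.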

This is the gap Volume 1 fills. The Stride sprint-capacity-calculator (public, free, no-signup) is the measurement instrument for the primary study. Teams enter their planned sprint capacity inputs (team size, sprint length, PTO, meetings, on-call rotation, historical velocity), and the calculator outputs a calibrated sprint-capacity recommendation. Volume 1's primary-study question: for participants who use the calculator, does the measured accuracy of their subsequent sprint commitments improve over time? And does the improvement track participants' self-reported confidence?

Try the sprint-capacity calculator

What you cannot defensibly claim from the current literature

"Story points are more accurate than time-based estimates." (Jørgensen 2014 review: across 30 years of studies, expert judgment ≈ formal models; the choice of scale matters less than the calibration discipline. Either scale is systematically optimistic in untrained populations.)
"AI fixes estimation." (Early literature on AI-aided estimation, e.g. some recent industry whitepapers, shows mixed results. No pre-registered, peer-reviewed study has yet shown that AI-aided estimation improves calibration rather than only confidence. Volume 1 tests this directly.)
"Sprint length predicts estimation accuracy." (Pre-registered H4 of this study; expected null. Sprint length is a planning choice, not a calibration intervention.)
"Estimation accuracy improves with practice." (Halkjelsvik & Jørgensen 2012's meta-analysis explicitly finds no clear improvement over decades. Practice without calibration feedback is not training; it's repetition.)

What Volume 1 will measure

Figure 4. The sprint-estimation research timeline 1981–2026. Where Volume 0 sits in the published literature, and where Volume 1 lands. The 2012 Halkjelsvik meta-analysis and the Stride V0 are highlighted; Stride V1 is the forthcoming marker.

Markers compiled from each study's published release date.

Chart description (text)

Horizontal timeline with eight published markers and one forthcoming marker. Boehm 1981 Cone of Uncertainty publishes January 1981. Buehler, Griffin and Ross 1994 planning-fallacy paper publishes 1994. Jørgensen review 2004. McConnell Software Estimation book 2006. Eveleens and Verhoef CHAOS critique 2010. Halkjelsvik and Jørgensen 2012 meta-analysis (highlighted as the most-cited modern source). Atlassian State of Teams 2024. Stride Volume 0 May 2026 highlighted as the current page. Stride Volume 1 October 2026 shown in dashed muted treatment as a forward-looking marker.

The sprint-estimation primary study runs two arms, analysed against each other.

The survey arm targets N=500 senior software-delivery practitioners (IC engineers ≥5 yrs tenure, engineering managers, staff+ engineers, eng directors, VPs). Recruitment is Prolific Academic for the panel arm with an organic top-up from the Stride newsletter, LinkedIn, and selected industry communities. Sample is balanced across role, company size, region, and AI-adoption stage.

The tool-usage arm analyses anonymised usage of the sprint-capacity-calculator, comparing planned-capacity inputs against the post-sprint accuracy ratings that users opt in to share. Anonymisation enforces k≥10 cells; no individual user is identifiable in published findings.
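One way to picture the k≥10 rule, together with the Laplace noise (ε=1.0) that the methodology summary specifies for cell-level summaries, is the sketch below. It assumes count-type cells with sensitivity 1 and draws Laplace noise as the difference of two exponential draws; the function name, the cell labels, and the rounding step are hypothetical, and the production pipeline may differ.

```python
import random

K_MIN = 10      # minimum cell size stated in the report
EPSILON = 1.0   # privacy parameter stated in the methodology summary


def release_counts(cell_counts, k=K_MIN, epsilon=EPSILON, sensitivity=1.0, seed=None):
    """Suppress cells below k, then add Laplace(sensitivity / epsilon) noise.

    Sensitivity of 1 is an assumption that each user contributes to one cell.
    """
    rng = random.Random(seed)
    scale = sensitivity / epsilon
    released = {}
    for cell, count in cell_counts.items():
        if count < k:
            continue  # small cells are never published
        # The difference of two Exp(1/scale) draws is Laplace(0, scale).
        noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
        released[cell] = max(0, round(count + noise))
    return released


if __name__ == "__main__":
    cells = {"2-week sprint / 5-10 engineers": 120, "4-week sprint / 50+ engineers": 4}
    print(release_counts(cells, seed=42))  # the 4-user cell is suppressed
```

Suppressing on the true count, as done here, is a k-anonymity safeguard layered on top of the noise rather than part of the differential-privacy guarantee itself.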

The two arms are compared stratum against stratum and are never joined at the individual-respondent level.

Pre-registered hypotheses

These are the five hypotheses we register on the Open Science Framework before fielding closes. They are also the hypotheses we test in October 2026.

H1: Calibration gap. Teams' self-reported confidence in their sprint estimates will exceed measured calibration (Brier score) by ≥2× across the 500-person sample. Test: paired t-test on confidence vs measured calibration; Cohen's d + 95% bootstrap CI.
H2: Story points don't solve it. Story-point and time-based estimates will show indistinguishable systematic bias (mean signed error) on the same task set. Test: paired t-test on signed-error magnitudes; effect size + Wilson 95% CI.
H3: AI changes confidence, not calibration. Teams using AI-aided estimation tools will report higher confidence than non-AI teams, but their measured calibration error will be statistically indistinguishable. Test: ANOVA on confidence × AI-usage; ANOVA on calibration × AI-usage.
H4 (null): Sprint length is independent of estimation accuracy. Sprint length (1/2/3/4 weeks) will not predict estimation accuracy after controlling for team size and tenure. The null is the finding. Test: linear regression with covariates.
H5 (exploratory): Training ROI is size-dependent. Estimation-training ROI (calibration improvement per training hour) will be largest for teams ≤50 engineers. Reported with exploratory framing, not as a confirmed finding.
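Two of the quantities named in H1 and H2 have standard definitions that are easy to state in code. The sketch below shows the Brier score for probability forecasts and a mean signed relative error for effort estimates; the sign convention (positive means the work took longer than estimated) and the function names are our assumptions, and the pre-registration document remains the authority on the exact scoring.

```python
def brier_score(forecasts, outcomes):
    """Mean squared difference between forecast probabilities and 0/1 outcomes.

    0.0 is a perfect score; always answering 0.5 scores 0.25.
    """
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("forecasts and outcomes must be non-empty and equal length")
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


def mean_signed_error(estimates, actuals):
    """Mean of (actual - estimate) / estimate; positive values indicate overrun."""
    if len(estimates) != len(actuals) or not estimates:
        raise ValueError("estimates and actuals must be non-empty and equal length")
    return sum((a - e) / e for e, a in zip(estimates, actuals)) / len(estimates)


if __name__ == "__main__":
    # Five commitments, each made at 80% confidence; three were met.
    print(brier_score([0.8] * 5, [1, 1, 1, 0, 0]))
    # Two tasks that each ran 30% over their estimate.
    print(mean_signed_error([10, 20], [13, 26]))
```

The first example scores about 0.28, worse than the 0.25 a respondent would get by always answering 50%; the second returns about 0.30, the same order as the ≈30% overrun reported in the meta-analysis.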

The pre-registration document is linked below. It covers the hypotheses, planned cross-tabs, exclusion criteria, weighting scheme, and multiple-comparison correction (Benjamini–Hochberg FDR at q=0.05 for the planned family).
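For readers unfamiliar with the Benjamini–Hochberg step-up procedure, a minimal sketch follows. It implements the textbook procedure at q=0.05; the p-values in the example are invented for illustration and are not study results.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    Step-up rule: find the largest rank r with p_(r) <= r * q / m, then
    reject every hypothesis whose p-value has rank <= r.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * q / m:
            cutoff_rank = rank
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[i] = True
    return rejected


if __name__ == "__main__":
    illustrative_p = [0.001, 0.008, 0.039, 0.041, 0.20]
    print(benjamini_hochberg(illustrative_p))  # [True, True, False, False, False]
```

Because the rule is step-up, a p-value that misses its own threshold can still be rejected when a larger p-value further down the ranking meets its threshold.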

Read the full pre-registration

Methodology summary

The full methodology is on the companion page. The short version:

Survey instrument: ~48 substantive items + 4 screening + 3 attention checks + 8 firmographics. Median completion 12 minutes. Includes 8 validated calibration tasks (Lichtenstein-style probability calibration questions on software-task time estimation) so we have a measured calibration score per respondent.
Recruitment: Prolific Academic panel arm (~400 completes; effective CPI ~$3) + organic arm (~100 completes via newsletter / LinkedIn / industry communities). Organic responses carry a separate stratum flag in the dataset.
Tool-usage arm: anonymised opt-in capture from the sprint-capacity-calculator. Q3 2026 cohort; k-anonymity ≥10 + Laplace differential-privacy noise (ε=1.0) on any cell-level summary.
Statistics: 95% Wilson confidence intervals on every quoted percentage; Cohen's h for proportion comparisons; Cliff's delta for ordinal Likert; Cohen's d for continuous. Benjamini–Hochberg FDR correction for the planned family.
Pre-registration: will be cross-registered on the Open Science Framework before fielding closes; the link will appear here once registration completes.
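Two of the statistics listed above can be sketched with the standard library alone. The code below computes a Wilson score interval for a proportion and Cohen's h for the difference between two proportions; the numbers in the example (350 of 500, and 0.80 vs 0.66) are illustrative inputs, not findings.

```python
import math


def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def cohens_h(p1, p2):
    """Cohen's h: difference of arcsine-transformed proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


if __name__ == "__main__":
    low, high = wilson_interval(350, 500)  # a hypothetical 70% response
    print(round(low, 3), round(high, 3))
    print(round(cohens_h(0.80, 0.66), 3))  # stated vs observed accuracy
```

With 500 respondents, a 70% result carries an interval of roughly 66% to 74%, which is why the report quotes an interval next to every percentage rather than the point estimate alone.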

Dataset publication (Volume 1)

When Volume 1 lands, the survey dataset publishes under CC-BY-4.0: anonymised individual-response CSV, pre-computed cross-tab tables, JSON-Schema data dictionary, and aggregated tool-usage summary tables. Distribution as a single ZIP bundle plus a Zenodo DOI for permanent citability.

Limitations and what to expect

English-language, predominantly Western sample. The Prolific panel skews toward US/UK/EU respondents; the organic top-up extends to APAC + RoW but not enough to support sub-regional analysis. Volume 1 reports findings as English-Western and explicitly flags this in every cross-tab.
No juniors-only stratum. The Volume 1 design targets senior practitioners (≥5 yrs tenure) because the calibration questions assume a working baseline of software-estimation experience. A juniors-focused replication is a future study, not this one.
Self-reported AI adoption stratum, not telemetry. We classify respondents into AI-usage cohorts by what they tell us. The State-of-AI Volume 0 report establishes that self-perception and measured behaviour disagree on AI; we apply that lens to our own classifications.
The literature is moving. New estimation studies are appearing; if a major peer-reviewed study lands between Volume 0 and Volume 1, we will append a "Recent developments" note rather than rewriting the body.

Participate in Volume 1

If you are a senior software-delivery practitioner and would like to participate in the Volume 1 primary study (Prolific arm or organic arm), reach out to info@newlightai.com. If you are a researcher or journalist interested in the pre-registration, the dataset, or the calibration instrument, the same address reaches the editorial team directly.

References

Boehm, B. W. (1981). Software Engineering Economics. Prentice-Hall. (Revisited 2002 with Boehm & Turner, "The Cone of Uncertainty Revisited".)
Halkjelsvik, T. & Jørgensen, M. (2012). From origami to software development: A review of studies on judgment-based predictions of performance time. Psychological Bulletin 138(2), 238–271.
Jørgensen, M. (2014). What we do and don't know about software development effort estimation. IEEE Software 31(2), 37–40.
McConnell, S. (2006). Software Estimation: Demystifying the Black Art. Microsoft Press.
Eveleens, J. L. & Verhoef, C. (2010). The rise and fall of the Chaos report figures. IEEE Software 27(1), 30–36.
Buehler, R., Griffin, D. & Ross, M. (1994). Exploring the "planning fallacy": Why people underestimate their task completion times. Journal of Personality and Social Psychology 67(3), 366–381.
Lichtenstein, S., Fischhoff, B. & Phillips, L. D. (1982). Calibration of probabilities: The state of the art to 1980. In Kahneman, Slovic & Tversky (eds.), Judgment Under Uncertainty.
Atlassian State of Teams Report (2024). Atlassian Work Innovation Lab.
JetBrains State of Developer Ecosystem (2024). JetBrains.
Stack Overflow Developer Survey (2024). Stack Overflow.

Frequently asked questions

Why publish another estimation report when the literature has been settled for decades?
Because the literature is mature but rarely cited in modern Agile or DevOps writing. Boehm 1981, Jørgensen's 30-year program, Halkjelsvik & Jørgensen 2012: all are foundational, all under-cited. Volume 0 surfaces what the field already knows; Volume 1 adds the AI-era calibration question the existing literature doesn't yet answer.
Are story points more accurate than time-based estimates?
Across 30 years of studies (Jørgensen 2014 review), the answer is no: both show indistinguishable systematic bias on the same task set. The choice of scale matters less than the calibration discipline applied. A team that practices calibrated estimation in either scale outperforms a team that does not.
Does AI actually help with estimation, or does it just make us more confident?
This is pre-registered as H3 of the Volume 1 study and explicitly NOT a claim yet. Early industry signals suggest AI-aided estimation increases reported confidence without changing measured accuracy; Volume 1 tests this directly with a 500-person sample.
Why these four studies?
Each represents a distinct epistemic position: Boehm 1981 (theoretical model from observed projects); Halkjelsvik & Jørgensen 2012 (meta-analysis of 200+ studies); Jørgensen 2014 (multi-decade review); Eveleens & Verhoef 2010 (critique of the industry's most-quoted commercial source). Other studies inform other sections of the report but were not picked here because they overlap on epistemic position.
Standish CHAOS has been criticised. Why include it?
We do not cite Standish CHAOS in the body without the Eveleens & Verhoef 2010 critique attached. The pair appears in our comparison table specifically to surface the methodological dispute, not to amplify Standish's headline numbers. Reading both sources is the responsible default.
Will the dataset be released? Under what license?
Yes, under CC-BY-4.0 when Volume 1 publishes. The release includes the anonymised individual-response CSV, per-respondent calibration scores, pre-computed cross-tab tables, JSON-Schema data dictionary, and aggregated tool-usage summary tables. Free-text responses are withheld for privacy.
When does Volume 1 publish?
Target Q4 2026 at the same canonical URL. The landscape-synthesis sections of Volume 0 stay as the "Prior public evidence" framing; Volume 1 primary findings replace the "What Volume 1 will measure" section and add a dataset link. No URL change.
Who funded this research?
Stride Research (the research arm of Newlight Solutions) funds the study fully. No external sponsor; no embedded survey items from any tool vendor. The vendor-neutrality posture in the methodology page documents the operational rules: no Stride row in comparison tables, no Stride-named items in the survey, editorial owner with final cut.
Can I participate in Volume 1 as a researcher / journalist / practitioner?
Yes. The compensated panel arm is recruited through Prolific Academic when fielding opens; email info@newlightai.com to be notified. Practitioners interested in the organic top-up arm can email the same address. Journalists can request a 48-hour-embargoed preview the week before public release.
How should I cite Volume 0?
See §"Cite this volume" further down the page for APA, Chicago, BibTeX, and Markdown formats. Short version: cite the report by title, year 2026, author "Stride Research," URL at the canonical /research/sprint-estimation-reality-2026.

Related work

DevTools landscape

Atlassian State of Teams 2024
Atlassian Work Innovation Lab · 2024
Surveys ~5,000 knowledge workers; ~70% of respondents report routinely missing sprint commitments. Useful proxy for the calibration gap question Volume 1 tests directly.
The State of Developer Ecosystem 2024
JetBrains · 2024
Tool-stack distribution data on estimation method choice (story points ~64%, hour-based ~17%); helpful for Volume 1 stratification.
Stack Overflow Developer Survey 2024
Stack Overflow · 2024
Largest developer community survey worldwide. Has no estimation-accuracy section but reports ~74% citing "deadlines that don't reflect estimates" as a stress source.
2024 Accelerate State of DevOps Report
Google DORA · 2024
Does not measure estimation accuracy directly but provides the canonical delivery-metrics frame the Stride sprint study cross-references.

HCI and behavioural research

Calibration of probabilities: The state of the art to 1980
Lichtenstein, S., Fischhoff, B. & Phillips, L. D. · 1982
Foundational calibration-curve chapter in Kahneman, Slovic & Tversky's Judgment Under Uncertainty. The empirical baseline for "what overconfidence looks like" across populations.
Exploring the "planning fallacy"
Buehler, R., Griffin, D. & Ross, M. · 1994
The canonical paper naming the within-person bias toward optimistic time prediction. Reference-class forecasting is the standard remediation.
Thinking, Fast and Slow (Ch. 23: The Outside View)
Kahneman, D. · 2011
Popular synthesis of the planning-fallacy + reference-class forecasting literature. Useful for practitioners who want the intuition without the academic apparatus.
Sources of Power: How People Make Decisions (Recognition-primed decision)
Klein, G. · 1999
Counterpoint to the deliberative-bias literature: in experts, time pressure activates recognition-primed decision-making that is often well-calibrated. Relevant for the senior-practitioner stratum in Volume 1.

Methodology references

Halkjelsvik & Jørgensen 2012 systematic review
Halkjelsvik, T. & Jørgensen, M. · 2012
The methodological gold standard for time-prediction meta-analysis. Volume 1 modelling decisions follow their study-classification framework.
The rise and fall of the Chaos report figures
Eveleens, J. L. & Verhoef, C. · 2010
Definitive critique of the Standish CHAOS methodology. Should be cited alongside any Standish reference.
OSF Pre-Registration Template
Open Science Framework · 2024
The template Volume 1 will use to cross-register its design before fielding closes. The registration is public, time-stamped, and locked.

From the Stride blog

Sprint length with AI: how short can you go?
Stride · 2025
Practitioner-oriented post on sprint cadence in AI-augmented teams. Pre-figures the H4 (null) hypothesis on sprint length and estimation accuracy.
Best AI tool for sprint planning
Stride · 2025
Comparative review of AI-aided sprint-planning tools. Pre-figures the H3 (AI changes confidence, not calibration) hypothesis.

Cite Volume 0 in your own writing using the formats below. The dataset DOI and the Volume 1 primary-findings citation will be added at the same URL when Volume 1 publishes. Because the URL does not change, links in existing Volume 0 citations will resolve to the most current version.

How to cite this report

APA

Stride Research. (2026). Sprint Estimation Reality 2026, Volume 0: Landscape synthesis and pre-registered design. Newlight Solutions. https://www.stride.page/research/sprint-estimation-reality-2026

Chicago

Stride Research. 2026. "Sprint Estimation Reality 2026, Volume 0: Landscape synthesis and pre-registered design." Newlight Solutions. https://www.stride.page/research/sprint-estimation-reality-2026.

BibTeX

@techreport{stride2026sprint_v0,
author      = {{Stride Research}},
title       = {Sprint Estimation Reality 2026, Volume 0: Landscape synthesis and pre-registered design},
institution = {Newlight Solutions},
year        = {2026},
url         = {https://www.stride.page/research/sprint-estimation-reality-2026},
}

Markdown

[Stride Research (2026). Sprint Estimation Reality 2026, Volume 0: Landscape synthesis and pre-registered design. Newlight Solutions.](https://www.stride.page/research/sprint-estimation-reality-2026)

Embed this chart

Republish this chart on your site or blog — the snippet renders the figure and credits the research with a link back. Free to use under CC-BY-4.0 with attribution.

<a href="https://www.stride.page/research/sprint-estimation-reality-2026" target="_blank" rel="noopener">
  <img src="https://www.stride.page/api/og?variant=sprint-cone-of-uncertainty&amp;title=Sprint%20Estimation%20Reality%202026&amp;subtitle=Forty-three%20years%20of%20estimation%20research%2C%20synthesised.%20Plus%20the%20pre-registered%20design%20for%20a%20500-person%20study%20of%20AI-era%20sprint%20calibration.&amp;eyebrow=RESEARCH%20%C2%B7%20ESTIMATION%202026&amp;v=2026-05" alt="Sprint Estimation Reality 2026 — Stride Research" width="600" style="max-width:100%;height:auto;border:0" />
</a>
<p style="font:13px/1.6 -apple-system,BlinkMacSystemFont,'Segoe UI',sans-serif;color:#475569;margin:8px 0 0">Source: <a href="https://www.stride.page/research/sprint-estimation-reality-2026">Sprint Estimation Reality 2026 — Stride Research</a></p>