State of AI Software Delivery 2026 — Methodology

Volume 0 synthesises public studies (DORA 2024 Accelerate, METR 2025 RCT, GitHub Octoverse 2024, Stack Overflow Dev Survey 2024, McKinsey State of AI 2024, Anthropic Economic Index, Peng et al. 2023). Volume 1 (July 2026) adds a pre-registered survey of N≥1,500 senior practitioners (Prolific Academic, 95% Wilson CIs, post-stratification weights) + telemetry from ~14,000 Stride stories.

Research questions

The Stride 2026 study addresses one primary question and four secondary questions.

The primary question is the believer-vs-measurer gap. Existing public surveys (Stack Overflow Developer Survey, McKinsey State of AI, DORA Accelerate Report) consistently report high self-rated productivity gains from AI in software delivery. Existing controlled experiments (Peng et al. 2023, METR 2025) report measured effects ranging from +55% to −19%, with the same population sometimes diverging from its own self-assessment by ≥30 percentage points (METR 2025). No published study has measured the self-vs-measured productivity gap on the same teams, at population scale, across the full range of software-delivery tasks. The Stride 2026 study is designed to do that.

The secondary questions test four further hypotheses developed from the existing literature: (1) whether teams that measure AI's impact report higher tool satisfaction than teams that don't, (2) whether cognitive load varies monotonically with AI adoption depth, (3) whether burnout differs between heavy-AI and opted-out cohorts, and (4) how ROI per dollar of AI tool spend varies by team size.

Hypotheses (pre-registered)

These are the five hypotheses we register before fielding closes. Each is paired with a specific operationalisation, an exclusion criterion, and a falsification condition.

H1 — Self-vs-measured gap

Operationalisation. Self-reported productivity gain is measured by survey item §5.3 ("Over the last six months, how much faster or slower has your work been with AI involved?" — 7-point Likert anchored at "≥50% slower" / "no measurable change" / "≥50% faster"). Telemetry-observed productivity gain is measured by the median ratio of (time-from-story-created to story-marked-done) between AI-assisted and AI-disallowed stories within the same workspace, restricted to workspaces with ≥30 stories of each type in the window.

Prediction. Self-report > telemetry-observed, by a factor of ≥1.5×, on comparable cuts (matched by role + company size).

Falsification. Telemetry-observed gain ≥ self-report, OR the gap is under 1.2× in at least 4 of 5 segment cuts. Either condition alone falsifies H1 directionally.

Statistical test. Survey self-report distribution: weighted point estimate + 95% Wilson CI. Telemetry-observed distribution: median + bootstrap CI (10,000 iterations). Comparison: ratio of medians with bootstrap CI on the ratio.
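
The ratio-of-medians comparison can be sketched with the standard library alone. This is a minimal illustration, not the published pipeline: the function name `bootstrap_ratio_of_medians`, the choice of a percentile bootstrap with independent resampling of each group, and the toy cycle times are all assumptions.

```python
import random
import statistics

def bootstrap_ratio_of_medians(assisted, disallowed, iterations=10_000, seed=2026):
    """Percentile-bootstrap 95% CI for median(assisted) / median(disallowed)."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    point = statistics.median(assisted) / statistics.median(disallowed)
    ratios = []
    for _ in range(iterations):
        # Resample each group independently, with replacement.
        a = rng.choices(assisted, k=len(assisted))
        d = rng.choices(disallowed, k=len(disallowed))
        ratios.append(statistics.median(a) / statistics.median(d))
    ratios.sort()
    lo = ratios[int(0.025 * iterations)]
    hi = ratios[int(0.975 * iterations) - 1]
    return point, lo, hi

# Hypothetical cycle times in hours (story created -> story marked done).
# Real workspaces would need >= 30 stories of each type to qualify.
ai_assisted = [20, 22, 25, 27, 30, 31, 35, 40, 42, 50]
ai_disallowed = [28, 30, 33, 36, 40, 41, 45, 52, 60, 70]
point, lo, hi = bootstrap_ratio_of_medians(ai_assisted, ai_disallowed, iterations=2_000)
```

A ratio below 1 means AI-assisted stories closed faster than AI-disallowed ones in that workspace.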

H2 — Measurement-as-correlate

Operationalisation. "Measurement practice" coded from survey items §4.1–§4.6: a team is classified as "measuring" if it reports at least one pre/post comparison, A/B condition, or instrumented metric attributed to AI in the last six months. Tool satisfaction measured by §3.7 (NPS for primary AI tool). Controls: company size (§1.1), monthly per-developer AI spend band (§8.2).

Prediction. Measuring teams report higher NPS than non-measuring teams; the effect persists after controlling for size and spend.

Falsification. No NPS difference between groups, OR the difference vanishes after controlling for size and spend (suggesting measurement is purely a size/spend proxy).

Statistical test. Ordinary least squares regression of NPS on measurement-practice indicator + size + spend, with cluster-robust standard errors on company size band. Effect size: Cohen's d on adjusted NPS difference.

H3 — Cognitive-load U-curve

Operationalisation. Cognitive load measured by NASA-TLX raw 5-item short form (mental demand, physical demand, temporal demand, performance, effort), unit-weighted composite. AI adoption depth coded by §3.1 ("What percentage of your daily work involves AI tools?") into four ordered bands: 0%, 1–25%, 26–75%, 76–100%.

Prediction. Composite cognitive load is not monotonically decreasing with adoption-depth band. We predict a U-curve, with lowest cognitive load in the 1–25% or 26–75% band and a rise back up in the 76–100% band ("context-switching tax" hypothesis).

Falsification. Monotonic relationship in either direction (linear regression β statistically distinguishable from zero with no significant quadratic term), OR a flat relationship (no statistically distinguishable difference between bands).

Statistical test. Polynomial regression of composite cognitive load on adoption-depth band as ordinal predictor, with significance test on the quadratic term. Effect size: Cliff's delta between the predicted-minimum band and the 76–100% band.

H4 — Burnout null

Operationalisation. Burnout measured by the Maslach Burnout Inventory — Human Services Survey, 5-item short form (emotional exhaustion subscale), Mind Garden licensed. Heavy-AI cohort: §3.1 = 76–100%. Opted-out cohort: §2.1 indicates the respondent's team has actively decided not to adopt AI tooling. Controls: company size, role tenure.

Prediction. No significant difference in burnout composite between heavy-AI and opted-out cohorts, after controlling for size and tenure. The null is the finding. Both "AI causes burnout" and "AI cures burnout" are commercially attractive claims; the existing literature does not support either, and we predict our data won't either.

Falsification — direction A. Heavy-AI cohort significantly more burnt out than opted-out, with Cohen's d ≥ 0.2 after controls.

Falsification — direction B. Heavy-AI cohort significantly less burnt out, same effect-size threshold.

Either direction is publishable as the finding; the null is also publishable. The point of pre-registration is to commit to publishing whatever the data shows.

Statistical test. OLS regression of burnout composite on cohort indicator + size + tenure. Effect size: Cohen's d on adjusted means.

H5 (exploratory) — ROI by team size

Operationalisation. ROI per dollar coded from survey items §8.3–§8.6 (perceived productivity gain × estimated hours saved × hourly cost ÷ per-developer AI spend). Team size from §1.1.
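
As a worked example of the formula, the sketch below computes ROI per dollar for one respondent. The units are assumptions: perceived gain as a fraction, hours saved and AI spend per developer per month, and hourly cost in dollars. How items §8.3–§8.6 map onto these inputs is not specified here, so the parameter names are hypothetical.

```python
def roi_per_dollar(perceived_gain, hours_saved, hourly_cost, ai_spend):
    """Perceived gain x hours saved x hourly cost, divided by per-developer AI spend."""
    if ai_spend <= 0:
        raise ValueError("AI spend must be positive to compute ROI per dollar")
    return perceived_gain * hours_saved * hourly_cost / ai_spend

# Hypothetical respondent: 20% perceived gain, 10 hours saved per month,
# $75 hourly cost, $30 per-developer monthly AI spend.
example_roi = roi_per_dollar(0.20, 10, 75.0, 30.0)  # 0.20 * 10 * 75 / 30
```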

Prediction. Higher ROI per dollar in ≤50-person companies than in 5,000+ companies, mediated by adoption-speed differences.

Status. Exploratory. Reported as exploratory in Volume 1 regardless of result. Not in the Benjamini–Hochberg correction family.

Survey instrument

Total ~62 substantive items + 4 screening + 3 attention checks + 8 firmographics. Target median completion 14 minutes (capped at 18 in pilot; cut back to 14 before main fielding if pilot completion runs long).

Section structure

  • Screening (4 items). Role must be one of: IC engineer, engineering manager, staff+ engineer, product manager, eng director/VP, CTO/founder. Team size 3–500. Ships software ≥ monthly. Not a full-time student.
  • Firmographics (8 items). Company size, industry vertical, region, primary tech stack, deployment cadence band, age of team, AI adoption stage, role tenure.
  • AI adoption breadth (10 items). Which workflows use AI (code generation, code review, planning, acceptance criteria, tests, docs, design, ops), with a "none" sentinel and an alphabetical tool-name ordering for the stated-tools follow-up.
  • AI adoption depth (8 items). Frequency, percentage of workflow, mandatory-vs-optional, paid-vs-free, primary tools.
  • Measurement and evaluation (12 items). Whether the team measures AI's impact, what is measured, how often, whether a control or pre/post condition is used. Hero section. Includes one attention check at item #14.
  • Productivity outcomes (8 items). Self-reported effort change, cycle-time perception, defect-rate perception. 7-point Likert plus magnitude estimate in percent.
  • Cognitive load and burnout (6 items). NASA-TLX raw 5-item + Maslach Burnout Inventory HSS 5-item. Validated scales; license footnotes apply.
  • Trust, safety, governance (8 items). Code-leak concerns, eval/red-team practice, policy existence, vendor-lock-in concerns. Includes second attention check at item #32.
  • ROI and spend (6 items). Tool spend per developer per month, perceived ROI, gut-vs-measured, planned spend next year.
  • Free response (2 items). "Biggest unexpected win." "Biggest unexpected friction." Qualitative coding feeds Volume 1 narrative.

Validated scales

  • Maslach Burnout Inventory — Human Services Survey, 5-item short form. Licensed via Mind Garden at approximately $3 per response over the burnout subset. License agreement archived; cite Maslach & Jackson 1981 in Volume 1.
  • NASA-TLX raw 5-item. Public-domain NASA instrument; cite Hart & Staveland 1988.
  • Net Promoter Score (NPS) for primary AI tool. Industry-standard wording: "On a scale of 0–10, how likely are you to recommend [tool] to a colleague?"
  • System Usability Scale (SUS) item 10 adapted, 5-point. "Overall, how satisfied are you with [tool]?" — single-item satisfaction anchored to SUS Q10 wording.

Quality controls

  • 3 attention checks at items #14, #32, #54. One instructional ("Select 'somewhat agree' to confirm you're reading"), one semantic plausibility ("Have you ever stopped time?"), one logical-consistency (cross-validates a forced single-select against a re-asked question 20 items later).
  • 2 honeypot items. CSS-hidden fields bots fill in; humans never see them. Any non-empty response on these flags the response for exclusion.
  • Cloudflare Turnstile at entry. Invisible challenge; manual challenge only on suspicious sessions.
  • Vendor-side IP and device-fingerprint dedup at the Prolific platform level.
  • Time gates. Drop respondents with completion time under 4 minutes (likely straight-line speeders) or over 60 minutes (likely paused / contaminated by interruption).
  • Straight-line detection on Likert grids ≥6 items. Any respondent with zero variance on such a grid is flagged for review.
  • Free-response gibberish filter. Responses with fewer than three unique tokens, or responses that pattern-match common LLM-generated text, are flagged for manual review.
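
The automated checks above could be expressed as one flagging pass per response. This is a sketch under assumptions: the record layout (`minutes`, `honeypots`, `likert_grids`, `free_text`) is hypothetical, tokens are split on whitespace, and the attention checks and LLM-text pattern matching are left out.

```python
import statistics

MIN_MINUTES, MAX_MINUTES = 4, 60  # time gates from the list above

def quality_flags(resp):
    """Return the list of quality-control flags raised by one response."""
    flags = []
    if resp["minutes"] < MIN_MINUTES or resp["minutes"] > MAX_MINUTES:
        flags.append("time_gate")
    if any(resp["honeypots"]):  # any non-empty hidden field
        flags.append("honeypot")
    for grid in resp["likert_grids"]:
        # Zero variance on a grid of six or more items is a straight line.
        if len(grid) >= 6 and statistics.pvariance(grid) == 0:
            flags.append("straight_line")
            break
    if len(set(resp["free_text"].lower().split())) < 3:
        flags.append("gibberish")
    return flags

clean = {"minutes": 13, "honeypots": ["", ""],
         "likert_grids": [[1, 2, 3, 4, 5, 6]],
         "free_text": "Faster code review cycles"}
suspect = {"minutes": 3, "honeypots": ["x", ""],
           "likert_grids": [[4, 4, 4, 4, 4, 4]],
           "free_text": "ok ok"}
```

Time-gate and honeypot flags lead to exclusion, while straight-line and gibberish flags only queue the response for manual review, so a real pipeline would treat the two groups differently.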

Soft quotas

Region: NA 40% / EU 30% / APAC 20% / RoW 10%. Role: IC 40% / EM 25% / PM 15% / Staff+ 15% / Exec 5%. Company size: ≤50 25% / 51–500 35% / 501–5000 25% / 5000+ 15%.

Quotas are soft (Prolific continues fielding once a quota fills) but inform recruitment routing.

Recruitment

Panel arm (~1,200 completes)

Prolific Academic. Effective CPI ~$4 (panel rate plus 33% platform fee). Hard screen on role, team size, ships-software-monthly. Soft quotas per above. Estimated panel cost $5,000 inclusive of pilot wave.

Organic arm (~300 completes)

Four channels:

  • Stride newsletter (~14,000 subscribers). One dedicated send plus one reminder. Expected response ~210 completes, assuming a 1.5% conversion through screening (14,000 × 0.015 = 210).
  • LinkedIn organic. Founder + team waves over two weeks. Expected ~30 completes.
  • OSS-maintainer GitHub Action. Opt-in script maintainers run on their repo to receive a survey invite. A novel channel that we will document in the methodology section of Volume 1 with a "did it work" assessment. Expected ~30 completes.
  • Industry communities. Rands Leadership Slack, DevRel Collective, MLOps Community, Software Engineering Daily Slack — coordinated with community owners. Expected ~30 completes.

Organic responses carry a separate stratum flag in the dataset (source = panel | newsletter | linkedin | github_action | community). Primary analyses use the panel arm; organic is a robustness check via sensitivity analysis (every headline number is recomputed with organic excluded and reported in the methodology appendix). The two arms are not pooled in any reported point estimate without an explicit "weighted across arms" note and the per-arm point estimates appearing alongside.

Incentive

$20 Amazon/Tango gift card via raffle (1-in-25 chance), disclosed at consent. Raffle structure intentionally avoids the bias of guaranteed cash (which selects for low-income panelists in tech surveys).

Sample targets per segment

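We aim for each major cut to produce a sub-sample ≥100 with 95% CI half-width ≤±10pp. The one exception is the exec/VP cut, whose ≥80 target gives a half-width of roughly ±11pp at p = 0.5.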

  • By role: IC eng ≥400 · EM ≥250 · staff+/principal ≥200 · PM ≥150 · exec/VP ≥80.
  • By company size: ≤50 ≥300 · 51–500 ≥450 · 501–5000 ≥350 · 5000+ ≥200.
  • By region: NA ≥500 · EU ≥400 · APAC ≥250 · RoW ≥150.
  • By AI adoption stage: heavy daily ≥500 · selective ≥500 · exploratory ≥350 · opted out ≥150.

The opted-out target is non-negotiable. Claims about the experience of teams that have decided not to adopt AI are the hardest in the existing literature to defend; we will not run a study that drowns out their voice.

Pilot wave

Pilot N=75, fielded through Prolific on the final v1 instrument, in Q2 2026 week 4 per the project plan. Median completion time, screener kill rate, attention-check fail rate, distribution of free-text length, and any answer-pattern that looks like satisficing are all reviewed within 48 hours of close.

Eight think-aloud sessions follow the pilot. Thirty minutes each, $75 honorarium, recruited via the Stride newsletter. Composition: two IC engineers, two engineering managers, two PMs, two staff+ engineers. Sessions recorded with consent; analysed by the editorial owner together with the statistical reviewer.

Iteration rules. Any item flagged by ≥3 of 8 think-aloud participants as confusing gets reworded. Any item with under 50% variance (in the pilot data) gets dropped or split. Hard cap of 14 days from pilot launch to instrument lock; no instrument changes after the lock.

If pilot completion rate is under 70% or median time exceeds 22 minutes, the kill criterion fires: we pause and iterate for up to five additional days, then re-pilot. If the second pilot fails the same gates, we descope sections (in order of cuttability: §7 Trust/safety, §9 Free response, §6 Cognitive load/burnout) until the gates pass.

Telemetry extraction

Telemetry analyses run against an anonymised warehouse copy (stride_warehouse_anon) of the production Stride database. The copy is generated once, on a fixed timestamp, with the following transformations applied before any query runs:

  • User IDs replaced with random opaque identifiers; mapping not retained.
  • Workspace IDs hashed (SHA-256 with a project-specific salt).
  • Free-text fields (story descriptions, AC bodies, comment text) dropped entirely.
  • Personally-identifying fields (email, name, phone) dropped.
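
A minimal sketch of these four transformations follows. The field names, the `anonymise_rows` helper, and the choice to prepend the salt before hashing are assumptions for illustration; the real extraction script may differ (for example, by using an HMAC).

```python
import hashlib
import secrets

# Hypothetical column names for the PII and free-text fields listed above.
DROPPED_FIELDS = {"email", "name", "phone", "description", "ac_body", "comment_text"}

def hash_workspace_id(workspace_id: str, salt: bytes) -> str:
    """Salted SHA-256 of a workspace ID (salt-prefix construction, an assumption)."""
    return hashlib.sha256(salt + workspace_id.encode("utf-8")).hexdigest()

def anonymise_rows(rows, salt):
    user_map = {}  # lives only for this call; the mapping is never written out
    out = []
    for row in rows:
        clean = {k: v for k, v in row.items() if k not in DROPPED_FIELDS}
        uid = clean.pop("user_id")
        if uid not in user_map:
            user_map[uid] = secrets.token_hex(16)  # random opaque identifier
        clean["user_ref"] = user_map[uid]
        clean["workspace_ref"] = hash_workspace_id(clean.pop("workspace_id"), salt)
        out.append(clean)
    return out
```

Because the user mapping is discarded when the function returns, the same user receives a different opaque identifier on any later run, which is the point of generating the copy once on a fixed timestamp.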

Every aggregation enforces k-anonymity at k≥10. No cell with fewer than 10 distinct workspaces is published; suppressed cells are reported as <10 workspaces — not disclosed. The pre-aggregation k-anon check is a step in the published extraction script, not an after-the-fact filter.
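
The ordering matters: the workspace count is checked before any statistic is computed for a cell. A sketch of that step, assuming hypothetical row fields `cell`, `workspace_ref`, and `cycle_hours`:

```python
from collections import defaultdict
import statistics

K = 10  # minimum number of distinct workspaces per published cell

def aggregate_with_k_anon(rows, k=K):
    """Median per cell, with cells under k distinct workspaces suppressed."""
    workspaces = defaultdict(set)
    values = defaultdict(list)
    for row in rows:
        workspaces[row["cell"]].add(row["workspace_ref"])
        values[row["cell"]].append(row["cycle_hours"])
    result = {}
    for cell, cell_values in values.items():
        if len(workspaces[cell]) < k:
            result[cell] = None  # suppressed: no statistic is ever computed
        else:
            result[cell] = statistics.median(cell_values)
    return result
```

Note that the threshold counts distinct workspaces, not rows: a cell with thousands of stories from nine workspaces is still suppressed.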

Every "average" or "median" reported at the cell level has Laplace differential-privacy noise added with ε=1.0. The noise scale is documented in the methodology appendix. Trend lines and ratios are reported with the noise floor visible.
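
For reference, the Laplace mechanism draws noise with scale equal to sensitivity divided by ε. The sketch below assumes a `sensitivity` parameter that this page does not specify; in practice it depends on how per-workspace contributions are bounded, and that bound would need to be fixed before extraction.

```python
import random

def laplace_noise(scale, rng):
    """Laplace(0, scale) sample: the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def dp_release(value, sensitivity, epsilon=1.0, rng=None):
    """Add Laplace noise with scale sensitivity / epsilon to a cell-level statistic."""
    if rng is None:
        rng = random.Random()
    scale = sensitivity / epsilon  # assumption: sensitivity is supplied by the caller
    return value + laplace_noise(scale, rng)

# Hypothetical cell: median cycle time of 36.0 hours, assumed sensitivity of 2.0 hours.
noisy_median = dp_release(36.0, sensitivity=2.0, epsilon=1.0, rng=random.Random(2026))
```

With ε=1.0 the expected absolute noise equals the sensitivity itself, which is why the noise floor needs to be shown next to trend lines and ratios.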

Privacy review

External privacy counsel (separate firm from Stride's general counsel; engaged at week 1 of the research sprint) signs off on the extraction schema and the aggregation method before extraction runs. The sign-off email is archived and a redacted version is published with the dataset.

Workspace-admin opt-out

Workspace administrators receive a one-time notification 14 days before publication offering opt-out for their workspace. Opted-out workspaces are excluded from the telemetry analysis; the opt-out tally is reported in the methodology appendix. Existing customer ToS already covers aggregated, anonymised usage analytics; the 14-day notice is a courtesy that goes beyond contractual obligation, on the principle that customers should always be able to opt out of publication of their data, even when it is anonymised.

No individual joining

Telemetry data and survey responses are never joined at the individual level. Comparisons between the two sources are stratum-vs-stratum (e.g., "self-reported planning time" vs "telemetry-observed planning time" within the same role + team-size cut). The decision not to allow individual joins is per the consent design; respondents who opt in to a follow-up interview at end-of-survey contribute their email to a separate list that is never linked to their survey responses or their workspace.

Statistical methods

Confidence intervals

95% Wilson score interval on every proportion reported in Volume 1. Wilson over normal-approximation because Wilson is robust on small denominators (which is where most of the interesting segment cuts will live).
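
A compact implementation of the Wilson score interval, as a sketch; the function name and the clamping of the bounds to [0, 1] are our choices for illustration, and weighting is not handled here.

```python
import math

Z_95 = 1.959963984540054  # two-sided 95% normal quantile

def wilson_interval(successes, n, z=Z_95):
    """Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)

# Worst case for a sub-sample of 100: p = 0.5 gives roughly 0.404 to 0.596.
lo_100, hi_100 = wilson_interval(50, 100)
# Small denominator with zero successes: the interval stays inside [0, 1].
lo_0, hi_0 = wilson_interval(0, 20)
```

Unlike the normal approximation, the interval for 0 of 20 has a non-zero upper bound (about 0.16) rather than collapsing to a single point.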

Bootstrap 95% CIs (10,000 iterations) on every continuous magnitude. Bootstrap over parametric CIs because we have no prior reason to assume normality on the magnitude items.

Effect sizes

Effect sizes appear on every comparative claim. Cohen's h for proportion comparisons. Cliff's delta for ordinal Likert comparisons (no normality assumption). Cohen's d for continuous comparisons.
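
The three effect sizes have short textbook definitions, sketched below without weighting or covariate adjustment (both of which the Volume 1 pipeline would need). Cohen's d here uses the pooled sample standard deviation.

```python
import math
import statistics

def cohens_h(p1, p2):
    """Difference of arcsine-transformed proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

def cliffs_delta(xs, ys):
    """P(x > y) - P(x < y) over all cross-group pairs; ranges from -1 to 1."""
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))

def cohens_d(xs, ys):
    """Mean difference divided by the pooled sample standard deviation."""
    nx, ny = len(xs), len(ys)
    pooled_var = ((nx - 1) * statistics.variance(xs)
                  + (ny - 1) * statistics.variance(ys)) / (nx + ny - 2)
    return (statistics.mean(xs) - statistics.mean(ys)) / math.sqrt(pooled_var)
```

Cliff's delta compares every cross-group pair, so it is quadratic in sample size; that is acceptable at N≈1,500 but worth knowing before running it inside a bootstrap loop.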

We do not quote p-values in the body of Volume 1. P-values appear in the methodology appendix for completeness. Effect sizes describe the size of the difference; p-values describe its compatibility with the null. The body privileges the former.

Multiple-comparison correction

Benjamini–Hochberg FDR correction at q=0.05 on the planned cross-tab family (H1, H2, H3, H4 — H5 exploratory is excluded from the family by design). Family-wise error correction (Bonferroni) is rejected as overly conservative for an exploratory-ish predictive study. The BH choice is documented and applied uniformly.
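
The Benjamini–Hochberg step-up procedure is short enough to show in full. The p-values in the example are invented for illustration and are not results.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return one reject/keep flag per hypothesis under BH at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_rank = 0
    for rank, i in enumerate(order, start=1):
        # Step-up rule: find the largest rank whose p-value clears rank * q / m.
        if p_values[i] <= rank * q / m:
            max_rank = rank
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[i] = True
    return rejected

# Hypothetical p-values for the H1-H4 family (made up for this example).
family = {"H1": 0.001, "H2": 0.030, "H3": 0.035, "H4": 0.200}
decisions = dict(zip(family, benjamini_hochberg(list(family.values()))))
```

In this example H2 is rejected even though 0.030 exceeds its own threshold of 0.025, because H3 clears the rank-3 threshold of 0.0375; that step-up behaviour is what makes BH less conservative than Bonferroni.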

Exploratory cuts — including H5 and any post-hoc findings discovered while exploring the data — are labelled exploratory in the body and reported with uncorrected p-values plus effect sizes. The labelling is explicit; the reader knows when they are reading a confirmed finding versus a hypothesis-generating one.

Weighting

Post-stratification weights are computed on the role × company size × region cell against base rates from the Stack Overflow Developer Survey 2024 (the largest publicly-documented base-rate source for the population we are studying). Cells with fewer than five respondents are pooled with their nearest neighbour before weighting to avoid extreme weights.
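
In its simplest form, a cell's post-stratification weight is its population share divided by its sample share. The sketch below uses a single role dimension and invented population shares; the real cells are role × company size × region, and the nearest-neighbour pooling rule is not implemented here, only guarded against.

```python
from collections import Counter

def poststrat_weights(sample_cells, population_shares, min_cell=5):
    """Weight per cell = population share / sample share."""
    counts = Counter(sample_cells)
    sparse = sorted(cell for cell, k in counts.items() if k < min_cell)
    if sparse:
        # Sparse cells must be pooled with a neighbour before weighting.
        raise ValueError(f"pool these cells before weighting: {sparse}")
    n = len(sample_cells)
    return {cell: population_shares[cell] / (counts[cell] / n) for cell in counts}

# Hypothetical sample of 100 respondents and made-up population shares.
cells = ["IC"] * 60 + ["EM"] * 40
weights = poststrat_weights(cells, {"IC": 0.75, "EM": 0.25})
```

Here ICs are under-represented (60% of the sample against 75% of the population), so each IC response is weighted up to 1.25 and each EM response down to 0.625; the weights sum back to the sample size.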

Both weighted and unweighted versions of every headline number appear in the methodology appendix. The body of Volume 1 quotes the weighted version; the appendix shows the delta to unweighted as a sanity check.

Telemetry numbers are not weighted to match survey strata. Telemetry describes Stride's user base, not the universe. Every time a survey number sits next to a telemetry number in the body, the boundary is stated explicitly.

Sensitivity analyses

Every headline number is re-computed under three sensitivity conditions:

  1. Organic respondents excluded.
  2. Attention-check failures excluded.
  3. Completions under 4 minutes excluded.

The sensitivity appendix shows all three deltas. Any headline number that flips direction under any sensitivity condition is reported in the body with the flip flagged inline, not as a footnote.
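
A sketch of the recomputation loop, assuming hypothetical row fields (`source`, `attention_passed`, `minutes`, `gain`) and treating a "flip" as a change of sign in a signed headline number:

```python
import statistics

# Each condition is a keep-filter applied before re-running the estimator.
CONDITIONS = {
    "organic_excluded": lambda r: r["source"] == "panel",
    "attention_failures_excluded": lambda r: r["attention_passed"],
    "under_4_minutes_excluded": lambda r: r["minutes"] >= 4,
}

def sensitivity_table(rows, estimator, conditions=CONDITIONS):
    """Recompute a headline number under each condition and flag sign flips."""
    baseline = estimator(rows)
    table = {}
    for name, keep in conditions.items():
        value = estimator([r for r in rows if keep(r)])
        table[name] = {
            "value": value,
            "delta": value - baseline,
            "flipped": (value > 0) != (baseline > 0),
        }
    return baseline, table

def mean_gain(rows):
    return statistics.mean(r["gain"] for r in rows)

# Toy data: one organic respondent drives the positive headline number.
toy_rows = [
    {"source": "panel", "attention_passed": True, "minutes": 12, "gain": -2},
    {"source": "panel", "attention_passed": True, "minutes": 15, "gain": -1},
    {"source": "newsletter", "attention_passed": True, "minutes": 10, "gain": 9},
    {"source": "panel", "attention_passed": False, "minutes": 3, "gain": 0},
]
baseline, table = sensitivity_table(toy_rows, mean_gain)
```

In the toy data the baseline is positive but turns negative once the organic respondent is excluded, which is the kind of result that would be flagged inline in the body.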

Reproducibility

The Volume 0 landscape figures are reproducible today. A Jupyter notebook at apps/platform/research/2026/reproducibility/landscape-charts.ipynb on GitHub loads the four published-data CSVs in the same folder and re-renders every Volume 0 figure (effect-size range, perception gap, adoption rates, literature timeline) from the public source numbers. Each CSV row carries a citation_url to its primary source so a peer-reviewer can trace every plotted point back to its origin in one click.

The Volume 1 analysis pipeline will ship as a separate, deeper notebook at the same path on the publish date, licensed CC-BY-4.0. It is reproducible from a fixed seed, with package versions locked in a requirements.txt (Python) or an renv lockfile (R), whichever the primary statistical reviewer prefers. The statistical reviewer pair-runs the analysis end-to-end from the published notebook before publication; any discrepancy between the reviewer's numbers and the manuscript's numbers blocks publication.

Dataset publication (Volume 1)

When Volume 1 lands, the survey dataset publishes under CC-BY-4.0 as a single ZIP bundle plus a Zenodo DOI.

Released: anonymised respondent-level survey CSV (one row per respondent, all closed-ended responses, weighted-indicator column) plus pre-computed cross-tab tables as a second CSV. JSON-Schema data dictionary describing every field. Aggregated telemetry summary tables (cell-level numbers with k-anon + DP applied; no row-level data).

Withheld: free-text responses (de-anonymisation risk), raw individual telemetry rows (privacy plus customer-contract).

The integrity hash (SHA-256) of the published bundle appears on the report page next to the download button so reproducibility-conscious readers can verify their copy matches.
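
Readers could check their copy with a few lines of Python. The helper names below are illustrative; the published hash would be copied from the report page.

```python
import hashlib

def sha256_of_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large bundles need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def bundle_matches(path, published_hex):
    """Compare a local file's hash with the hex digest shown on the report page."""
    return sha256_of_file(path) == published_hex.strip().lower()
```

Calling `bundle_matches` with the local ZIP path and the published digest returns `True` only when the downloaded bundle is byte-for-byte identical to the published one.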

Vendor-neutrality posture

The Stride 2026 study is published by Stride. The vendor-neutrality posture is not a denial of that fact; it is a set of explicit operational rules designed to keep the data honest regardless.

  • No survey item names "Stride" as an option in any tool-name list.
  • Tool-name questions list the top 12 named tools alphabetically with "Other (specify)" as the catch-all. The alphabetical ordering is intentional — randomising would defeat the cross-survey comparability we want with Stack Overflow's tool list.
  • No language ("AI-first", "AI-native", "delivery platform") in the survey instrument that primes respondents toward Stride positioning.
  • Comparison tables in Volume 1 cite category averages, not Stride numbers. Stride is one row among many; its row is presented in the same form as every other row.
  • The mid-body call-to-action in Volume 1 is the only place Stride is mentioned as a product. Its placement is disclosed in the methodology appendix with the rationale.
  • The editorial owner has final cut on findings, not GTM. If a finding makes Stride look bad on dimension X, it ships as the finding. (This is also self-interested: a study that buries inconvenient findings dies in one news cycle.)

Changes from this pre-registration

Any deviation from this document during execution will be reported in Volume 1 under a "Deviations from pre-registration" heading. Common deviations we will report if they occur include: a survey item reworded after pilot, a sensitivity analysis added in response to a reviewer comment, a sample-size shortfall that triggers a contingency analysis. We will not silently change any aspect of the design or the analysis plan.

Open Science Framework registration

This document will be cross-registered on the Open Science Framework before fielding closes, with a time-stamp predating that closure. The OSF link is added to the citation block on this page once registration completes. Until then, this page is the canonical pre-registration draft, and revisions are tracked in the repository commit history at the canonical URL apps/platform/src/content/research/state-of-ai-software-delivery-2026.methodology.mdx so the audit trail is public.

References

References for this methodology page (validated scales, statistical methods, prior studies) are footnoted on the main report page at /research/state-of-ai-software-delivery-2026. The full bibliography for Volume 1 will publish alongside Volume 1.