How should I cite Volume 0?

See §"Cite this volume" further down the page for APA, Chicago, BibTeX, and Markdown formats. The short version: cite the report by title, year 2026, author "Newlight Solutions," URL at the canonical /research/state-of-ai-software-delivery-2026. The Volume 0 framing is stable; Volume 1 will get its own citation block at publish.

State of AI Software Delivery 2026

What the existing public studies (DORA, METR, Octoverse, Stack Overflow) tell us, and what the Stride 2026 study will measure next.

May 17, 2026

Contents

1. Why two volumes
2. Adoption is no longer the question
3. Productivity: the contested ground
3a. The four landmark studies, side by side
4. Measurement: the gap nobody fills
5. What Volume 1 will measure
6. Methodology summary
7. Limitations + what to expect
8. FAQ
9. Related work
10. Cite this volume
11. Participate in Volume 1

Key findings

Adoption is settled. 76% of professional developers use or plan to use AI in their daily work (Stack Overflow Developer Survey 2024, n≈65,000). AI use is roughly universal among elite-DORA-quartile teams and broadly adopted across all maturity bands (DORA 2024 Accelerate Report, n≈39,000).
Productivity findings disagree in direction, not just in size. Microsoft/GitHub's controlled experiment found developers completed a standardised task ~55% faster with Copilot (Peng et al. 2023, n=95). METR's 2025 randomised study of experienced open-source contributors on real-world tasks found them ~19% slower, while perceiving themselves as ~20% faster (METR 2025, n=16 devs, 246 tasks).
No published study has cross-referenced self-reported productivity gain with telemetry-measured cycle time on the same teams at scale. Across DORA, Octoverse, McKinsey, and Stack Overflow surveys, fewer than 15% of teams report measuring AI's impact with control conditions or pre/post comparisons.
Where there is consensus: AI moves the median, but the variance is bigger than the move. The "AI helps me / AI hurts me" split is not noise. It reflects real task-, tool-, and tenure-dependent effects that adoption-rate metrics flatten away.
What Volume 1 of this study (July 2026) measures: the believer-vs-measurer gap on the same teams, with pre-registered hypotheses, validated cognitive-load + burnout scales, and Stride product telemetry as the unobtrusive measurement anchor.

Methodology

Volume 0 synthesises public studies (DORA 2024 Accelerate, METR 2025 RCT, GitHub Octoverse 2024, Stack Overflow Dev Survey 2024, McKinsey State of AI 2024, Anthropic Economic Index, Peng et al. 2023). Volume 1 (July 2026) adds a pre-registered survey of N≥1,500 senior practitioners (Prolific Academic, 95% Wilson CIs, post-stratification weights) + a Stride product-telemetry analysis.

Dataset: State of AI Software Delivery 2026: Cited Studies Table · License: CC BY 4.0 (attribute "Stride 2026 Volume 0" with a link to the source URL).

Read the full methodology

Why two volumes

The honest version of an annual AI-in-software-delivery report is harder than the marketing version. The marketing version has a chart of adoption growing left-to-right, a productivity claim in the 30–50% range, and a CTA to buy more AI. The honest version starts by reading every public study that already exists, notices that the productivity claims span from 55% faster to 19% slower depending on what you measure, and asks why.

We chose to publish in two volumes for three reasons.

First, a real survey takes time. The Stride 2026 primary study is fielding through Prolific Academic in Q2 2026 with a pre-registered design, external statistical review, and a 14-day workspace-admin opt-out window before any telemetry numbers can be cited. That cycle does not collapse into a weekend.

Second, pre-registering hypotheses before looking at data is what keeps this a study and not a marketing claim with a methodology section stapled on. The hypotheses we register today are the ones we will test in July, not the ones we invent post-hoc to match whatever the data shows.

Third, the literature already says something. It is just not what either side of the AI debate wants it to say. Volume 0 is the honest survey of what the existing public evidence supports, and where it deliberately does not commit. The 2026 primary study fills a specific gap in that evidence, not the entirety of it.

Read the pre-registered methodology

Adoption is no longer the question

The most-published number in AI-in-software-delivery research is the adoption rate. It is also the least interesting.

The 2024 Stack Overflow Developer Survey put adoption at 76% of professional developers currently using or planning to use AI tools in their daily work, drawn from a sample of roughly 65,000 respondents. GitHub's Octoverse 2024 reported AI-assisted commit activity rising across every major language ecosystem and a tripling in .ipynb and ML-adjacent repository creation year-over-year. McKinsey's 2024 State of AI survey put workplace AI use at 65% of organisations (up from 33% the year before) with software engineering as one of the three highest-adoption functions alongside marketing/sales and product/service development.

The DORA 2024 Accelerate Report is the most measured of the lot. Across roughly 39,000 respondents, it found that AI tool adoption among elite-quartile delivery teams was effectively universal, with high-performance teams adopting AI roughly 1.4× more often than low-performance teams. The Anthropic Economic Index (released early 2025 from anonymised Claude usage patterns) puts "computer and mathematical occupations" (which include software engineering) as the single largest occupational category in the Claude usage distribution, accounting for roughly 37% of total interactions.

76%

of professional developers use or plan to use AI in their daily work. Source: Stack Overflow Developer Survey 2024, n≈65,000.

Figure 1. Adoption rates differ by what you ask, and who you ask. Three published surveys, three measurement boundaries. The headline percentages are not directly comparable.

Sources: Stack Overflow Developer Survey 2024; McKinsey State of AI 2024; DORA Accelerate 2024.

Chart description (text)

Horizontal bar chart of three published adoption metrics. Stack Overflow Developer Survey 2024 reports 76% of professional developers using or planning to use AI tools in their daily work, drawn from a sample of roughly 65,000 individual developers. McKinsey 2024 State of AI reports 65% of organisations using AI in at least one business function, drawn from approximately 1,491 survey responses at the organisational level, a different unit of analysis. DORA Accelerate 2024 reports effectively universal AI adoption (illustrated as a striped bar covering the top 25% of the axis) among the elite-quartile of delivery teams in their sample of roughly 39,000 respondents. GitHub Octoverse 2024 reports a rising direction across every major language ecosystem but does not surface a comparable headline percentage and is omitted from this chart for that reason.

These numbers vary by ±20pp depending on how you define "use AI" (any code completion? a paid Copilot seat? an agentic workflow?) and which population you sample (Stack Overflow respondents skew toward early adopters; DORA respondents skew toward larger orgs). Take an honest range. Three-quarters of professional software work now touches AI in some form. The exact percentage is no longer the interesting question.

The interesting question is what happens after adoption. That is where the literature splits open.

Productivity: the contested ground

The two most-cited studies in this area reach opposite conclusions on the same question: one reports a ~55% speedup, the other a ~19% slowdown.

The Microsoft/GitHub Copilot RCT (Peng et al., 2023) ran a controlled experiment in which 95 developers were given a standardised JavaScript HTTP server-implementation task. Half had access to Copilot; half did not. The Copilot group completed the task roughly 55% faster on average than the control group. The study has been cited extensively by tool vendors as evidence of large AI productivity gains, and the underlying experimental rigor is real. What is also real, but less often cited, is how narrow the design is: a single benchmark task, a population of relatively junior developers, and completion time as the sole outcome.

The METR 2025 randomised study (Model Evaluation & Threat Research) is the most cited study from the other direction. METR recruited 16 experienced open-source contributors and randomised 246 of their real-world repository tasks to either AI-assisted or AI-disallowed conditions. The AI-assisted tasks took ~19% longer to complete. In the same study, the developers' own perception was that AI had made them about 20% faster: a 39-percentage-point gap between measured and perceived productivity, in the same population, on the same tasks. Before the study ran, ML experts asked to predict the result expected a 38% speedup; economists expected 39%.

+55% to −19%

Range of published RCT productivity effects for AI coding tools, depending on task, population, and time horizon. Source: Peng et al. 2023 (Microsoft/GitHub) vs METR 2025.

Figure 2. Effect-size range: Peng 2023 vs METR 2025. Two landmark RCT-style studies, non-overlapping confidence intervals. The contested ground in one frame.

Sources: Peng et al. 2023 (arxiv.org/abs/2302.06590); METR 2025 (metr.org).

Chart description (text)

Floating-bar chart showing the productivity effects of AI coding assistants reported by two landmark studies. Peng et al. 2023 measured a +55.8% completion-time speedup with 90% confidence interval from +30.6% to +80.8% on a standardised JavaScript HTTP-server task with 95 developers. METR 2025 measured a 19% slowdown (plotted as −19%) with 95% confidence interval from approximately −38% to +1% on real-world open-source contributions across 16 experienced developers and 246 tasks. The two confidence intervals do not overlap. The report's reading is that task type (standardised greenfield versus context-rich legacy) and developer tenure, not study quality, drive the apparent contradiction in the literature.
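The non-overlap claim in the chart description can be checked with a few lines of arithmetic. The sketch below uses only the interval bounds quoted above; the `Estimate` type and `intervals_overlap` helper are illustrative names, not part of either study. Note that the two intervals are quoted at different confidence levels (90% and 95%), so this is a rough visual comparison rather than a formal test.

```python
from typing import NamedTuple


class Estimate(NamedTuple):
    label: str
    point: float  # percent speedup; negative means slowdown
    low: float    # lower bound of the reported interval
    high: float   # upper bound of the reported interval


def intervals_overlap(a: Estimate, b: Estimate) -> bool:
    """Two closed intervals overlap when each starts before the other ends."""
    return a.low <= b.high and b.low <= a.high


# Figures as quoted in the chart description above.
peng = Estimate("Peng et al. 2023", 55.8, 30.6, 80.8)
metr = Estimate("METR 2025", -19.0, -38.0, 1.0)

overlap = intervals_overlap(peng, metr)
nearest_gap = peng.low - metr.high  # distance between the closest bounds

print(f"intervals overlap: {overlap}")
print(f"closest bounds are {nearest_gap:.1f} points apart")
```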

These two studies do not contradict each other in the way it first looks. They are measuring different things, with different populations, on different time horizons.

Peng's task was standardised and self-contained: exactly the kind of well-bounded problem at which current LLMs excel. METR's tasks were embedded in mature codebases the developers already knew well: exactly the kind of context-dependent work where the cost of explaining context to AI exceeds the benefit of the AI's suggestions. Peng measured junior-to-mid developers on a fresh problem; METR measured senior developers on their own code. The honest synthesis is: AI is unambiguously fast at greenfield, well-specified, standalone work, and the literature does not yet settle the question on context-rich legacy work.

The DORA 2024 report adds organisational-level evidence to this picture. AI adoption among elite-quartile teams correlates with the elite-quartile DORA metrics, but correlation is not causation, and DORA cannot easily attribute the gain (or harm) to AI specifically. Among lower-quartile teams, AI adoption is sometimes correlated with worse change-failure-rate trajectories, which DORA's authors interpret cautiously as "AI may amplify whatever delivery culture already exists rather than substitute for it."

There is one finding the studies do agree on. The variance is bigger than the move. Whether you take Peng's +55%, METR's −19%, or DORA's organisationally-mediated mixed signal, the spread of individual outcomes within each study is roughly 2× the size of the headline number. Some developers in the Peng study were 90% faster with Copilot; some were 10% faster. Some METR developers were 35% slower; some 2% slower. The "AI helps / AI hurts" split is real, but it is task-, tool-, and tenure-dependent, not a coin flip.

Figure 3. The 39-point perception–measurement gap (METR 2025). Four positions on the same productivity question. Self-perception lands 39 points above the measured result, in the same population, on the same tasks.

Source: METR 2025, Figure 1 (metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study).

Chart description (text)

Dot plot on a horizontal expectation-versus-reality axis. ML experts predicted a +38% speedup. Economists predicted a +39% speedup. Developers themselves perceived a +20% in-task speedup. The actual measured result was a 19% slowdown (plotted as −19%). An arrow between developer self-perception (+20%) and the measured result (−19%) annotates the 39 percentage-point gap. The gap is the single largest known finding on metacognitive accuracy in AI-coding settings.

The four landmark studies, side by side

The table below collapses the contested ground into one frame. Read across the rows: each study answers a different question, with a different population, on a different time horizon. Read down the "key limitation" column: each study is honest about what its design cannot conclude. The Volume 1 design picks a fifth position (survey + telemetry on the same teams, with pre-registered hypotheses) to fill the gap none of these four cover.

Side-by-side comparison of 4 landmark studies cited in this report.
Study	Sample	Method	Headline finding	Key limitation	Source
Peng et al. 2023	n=95	RCT: standardised JavaScript HTTP-server task	~55% mean speedup for the Copilot group vs control on the benchmark task.	Single self-contained task; cohort skewed junior-to-mid; completion time as the sole outcome.	Link
METR 2025	n=16 developers · 246 tasks	RCT: real-world tasks in the developers’ own open-source repositories	~19% slower with AI on real tasks; self-perception ran ~20% faster (39pp gap).	Small n; specialised population of experienced OSS maintainers on their own code.	Link
DORA 2024 Accelerate Report	n≈39,000 respondents	Observational survey: cross-quartile delivery-performance analysis	AI adoption ~1.4× more common in elite-quartile teams; cautious correlation, not causation.	Self-report; cannot attribute organisational outcomes specifically to AI.	Link
Stack Overflow Developer Survey 2024	n≈65,000 respondents	Convenience survey: global developer community	76% use or plan to use AI in daily work; 84% report perceived productivity improvement.	Convenience sample skews toward Stack Overflow’s early-adopter community; perception, not measurement.	Link

Measurement: the gap nobody fills

If you read the methodology sections of every public survey, one fact stands out. The vast majority of "AI productivity" claims are self-reported, often by people who have never instrumented their own workflow to measure it.

The 2024 Stack Overflow Developer Survey asks developers whether AI tools have improved their productivity. 84% say yes. The same survey does not ask whether the respondent has ever measured their own cycle time, defect rate, or task completion before vs after adoption. The implication is that "yes, AI has improved my productivity" is, for most respondents, an introspective judgement made under the same conditions that produce the METR 39-percentage-point perception–measurement gap.

Across the DORA, McKinsey, Stack Overflow, and Octoverse surveys, fewer than 15% of teams report measuring AI's impact with control conditions, pre/post comparisons, or other unobtrusive instruments. DORA itself is the closest thing the industry has to a measurement standard, but its four key metrics (deployment frequency, lead time, change failure rate, mean time to recovery) describe organisational outcomes, not AI-attributable contributions to those outcomes. A team that gets faster with AI on the input side and slower with AI's defects on the output side may net the same DORA score with a different underlying mechanism.

This is the single most consequential gap in the existing literature. We can tell you, with high confidence drawn from large samples, what developers feel about AI's effect on their productivity. We cannot tell you, from any single published study, what AI does to a development team's measured throughput when the same team measures itself before and after, with statistical rigor, on real work, at scale.

This is the gap Volume 1 of this study addresses.

What Volume 1 will measure

Figure 4. The AI-in-software-delivery research timeline. Where Volume 0 sits in the published literature, and where Volume 1 lands. Dashed marker is forthcoming.

Markers compiled from each study's published release date.

Chart description (text)

Horizontal timeline from early 2023 to late 2026 with seven published markers and one future marker. Peng et al. publishes February 2023. McKinsey State of AI publishes May 2024. Stack Overflow Developer Survey 2024 publishes June 2024. Octoverse and DORA Accelerate 2024 both publish October 2024. Anthropic Economic Index publishes February 2025. METR experienced-developer study publishes July 2025. Stride 2026 Volume 1 is scheduled for July 2026 and is shown in a dashed muted treatment as a forward-looking marker, not a published result.

The Stride 2026 primary study is two arms, cross-referenced.

The survey arm targets N≥1,500 senior software-delivery practitioners (IC engineers, engineering managers, staff+ engineers, product managers, engineering directors and VPs). Recruitment runs through Prolific Academic for the panel arm, with an organic top-up from the Stride newsletter, LinkedIn, an opt-in GitHub Action for OSS maintainers, and selected industry communities. The sample is balanced across role (IC 40 / EM 25 / PM 15 / Staff+ 15 / Exec 5), company size (≤50, 51–500, 501–5000, 5000+), region (NA / EU / APAC / RoW), and AI adoption stage (heavy daily / selective / exploratory / opted out). The opted-out cohort is non-negotiable: without it, the "no" voice is the hardest finding to claim.
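Quota balancing rarely lands exactly on target, which is why the methodology also names post-stratification weights. A minimal sketch of the idea, on one dimension only: each stratum's weight is its population share divided by its share of the achieved sample. The sample counts below are hypothetical, and the role quota above is reused as a stand-in for population shares purely for illustration (the report says the real weights use Stack Overflow Developer Survey 2024 base rates across role × company size × region, which are not reproduced here).

```python
from collections import Counter


def poststrat_weights(sample_strata, population_share):
    """Return one weight per stratum: population share / sample share."""
    counts = Counter(sample_strata)
    n = len(sample_strata)
    return {
        stratum: population_share[stratum] / (count / n)
        for stratum, count in counts.items()
    }


# Hypothetical achieved sample of 200 respondents that over-represents ICs.
sample = ["IC"] * 100 + ["EM"] * 50 + ["PM"] * 20 + ["Staff+"] * 20 + ["Exec"] * 10

# Stand-in population shares (assumption: borrowed from the role quota above).
target = {"IC": 0.40, "EM": 0.25, "PM": 0.15, "Staff+": 0.15, "Exec": 0.05}

weights = poststrat_weights(sample, target)
weighted_n = sum(weights[role] for role in sample)

for role, w in weights.items():
    print(f"{role}: weight {w:.2f}")
print(f"weighted sample size: {weighted_n:.1f}")
```

Over-represented strata get weights below 1 and under-represented strata get weights above 1, so the weighted sample size stays equal to the raw sample size.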

The telemetry arm analyses ~14,000 stories from the Stride product database covering Q3 2025 through Q1 2026. Every aggregation enforces k-anonymity at k≥10 (no cell published with fewer than 10 distinct workspaces) plus differential-privacy noise (Laplace, ε=1.0) on any cell-level "average" we report. The schema is published, workspace admins get an opt-out window 14 days before publication, and third-party privacy counsel signs off on the extraction before it runs.
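To make the two privacy controls concrete, here is a minimal sketch of how a single cell average could be released under them. It is not the Stride extraction code. It assumes one value per workspace, values clamped to a known range `[lower, upper]`, the workspace count treated as public, and noise calibrated to a sensitivity of `(upper - lower) / k`; the report does not specify any of these, and `release_cell_mean` is a hypothetical name.

```python
import random
from statistics import mean

K_MIN = 10     # minimum distinct workspaces per published cell (from the text)
EPSILON = 1.0  # privacy parameter for each released cell (from the text)


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def release_cell_mean(values_by_workspace, lower, upper, rng):
    """Return a noised mean for one cell, or None if the cell is suppressed.

    values_by_workspace maps workspace id -> that workspace's value.
    """
    k = len(values_by_workspace)
    if k < K_MIN:
        return None  # k-anonymity: too few distinct workspaces to publish
    # Clamping bounds how far one workspace can move the mean (assumption).
    clamped = [min(max(v, lower), upper) for v in values_by_workspace.values()]
    sensitivity = (upper - lower) / k
    return mean(clamped) + laplace_noise(sensitivity / EPSILON, rng)


rng = random.Random(0)
small_cell = {f"ws{i}": 3.0 for i in range(9)}             # 9 workspaces
large_cell = {f"ws{i}": 3.0 + 0.1 * i for i in range(12)}  # 12 workspaces

print(release_cell_mean(small_cell, 0.0, 10.0, rng))  # suppressed cell
print(release_cell_mean(large_cell, 0.0, 10.0, rng))  # noised mean
```

The suppression check runs before any noise is drawn, so a cell with fewer than 10 workspaces never produces a number at all.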

The two arms are compared stratum-to-stratum and are never joined at the individual respondent or workspace level.

Pre-registered hypotheses

These are the five hypotheses we register on OSF before fielding closes. They are also the hypotheses we test in July. The pre-registration is locked, time-stamped, public; if we find something more interesting than these hypotheses in the data, we will report it explicitly as exploratory rather than dressing it up as confirmatory.

H1: Self-vs-measured gap. Self-reported productivity gain from AI will be higher than telemetry-observed productivity gain, on comparable task cuts, by a factor of at least 1.5. (This is the Peng-vs-METR question, asked on a representative sample.)
H2: Measurement-as-correlate. Teams that measure AI's impact will report higher tool satisfaction (NPS) than teams that don't, controlling for company size and AI tool spend. (Measurement may be a marker of organisational maturity that itself produces satisfaction, independent of any AI effect.)
H3: Cognitive-load U-curve. Composite cognitive load (NASA-TLX 5-item short form) will not decrease monotonically with AI adoption depth. We predict a U-curve, with lowest cognitive load in the middle-adoption cohort and a rise back up in heavy-adoption (the "more AI = more context-switching" effect).
H4: Burnout null. Burnout (Maslach Burnout Inventory short form) will not differ significantly between heavy-AI and opted-out cohorts after controlling for company size and role tenure. The null is the finding. Both "AI causes burnout" and "AI cures burnout" are commercially attractive claims; the literature does not yet support either, and our prediction is that ours won't either.
H5 (exploratory): ROI by team size. ROI per dollar of AI tool spend will be higher for teams ≤50 than teams 5,000+, mediated by adoption-speed differences. Reported with exploratory framing, not as a confirmed finding.

The hypotheses, the planned cross-tabs, the exclusion criteria, the weighting scheme, and the multiple-comparison correction method (Benjamini–Hochberg FDR at q=0.05 for the planned family) are all in the pre-registration document linked below.
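For readers unfamiliar with the correction named above, the Benjamini–Hochberg step-up procedure can be sketched in a few lines. The p-values below are made up for illustration and are not study results; the function name is ours.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the hypothesis is rejected at FDR q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k whose sorted p-value is <= (k / m) * q.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected


# Illustrative p-values for a family of five planned cross-tabs.
p_vals = [0.001, 0.025, 0.029, 0.035, 0.20]
decisions = benjamini_hochberg(p_vals, q=0.05)
for p, reject in zip(p_vals, decisions):
    print(f"p={p:.3f} -> {'reject' if reject else 'keep'}")
```

The second p-value (0.025) misses its own threshold of 0.02 but is still rejected, because a larger p-value further down the ranking passes. That step-up behaviour is what distinguishes the procedure from a per-test cutoff.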

Read the full pre-registration

Methodology summary

The full methodology is on the companion page. The short version:

Survey instrument: ~62 substantive items + 4 screening + 3 attention checks + 8 firmographics. Median completion 14 minutes. Validated scales used: Maslach Burnout Inventory short form (Mind Garden licensed), NASA-TLX raw 5-item, SUS-anchored single item for tool satisfaction, NPS. Three attention checks at items #14, #32, #54; two honeypot questions invisible to humans; Cloudflare Turnstile at entry; vendor-side IP and device-fingerprint dedup.
Recruitment: Prolific Academic panel arm (~1,200 completes; effective CPI ~$4) + organic arm (~300 completes via newsletter / LinkedIn / OSS-maintainer GitHub Action / industry communities). Organic responses carry a separate stratum flag in the dataset so analyses can run with/without them.
Pilot wave: N=75 panel respondents + 8 think-aloud sessions ($75 honorarium, 30 min each) before the main field opens. Iteration window 14 days; any item flagged confusing by ≥3 of 8 think-alouds gets reworded.
Telemetry: ~14,000 Stride stories Q3 2025–Q1 2026. k-anonymity ≥10 + Laplace DP noise (ε=1.0). Schema published; workspace-admin opt-out window 14 days pre-publish.
Statistics: 95% Wilson confidence intervals on every quoted percentage; Cohen's h for proportion comparisons; Cliff's delta for ordinal Likert; Cohen's d for continuous. Multiple-comparison correction (Benjamini–Hochberg) for the planned cross-tab family. Post-stratification weights on role × company size × region against Stack Overflow Developer Survey 2024 base rates.
Pre-registration: posted to OSF (Open Science Framework) before fielding closes. Locked, time-stamped, public.
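The Wilson interval named under Statistics is simple enough to compute by hand. A minimal sketch follows; the count of 1,140 out of 1,500 is a hypothetical example chosen to give 76%, not a survey result.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959964):
    """Wilson score interval for a binomial proportion (default z is ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)
    )
    return centre - half_width, centre + half_width


# Hypothetical: 1,140 of 1,500 respondents answer "yes" (76%).
low, high = wilson_interval(1140, 1500)
print(f"76.0% (95% Wilson CI {low:.1%} to {high:.1%})")
```

Unlike the simpler normal-approximation interval, the Wilson interval stays inside 0 to 1 and behaves sensibly for small cells, which matters once results are cut by role, company size, and region.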

Dataset publication (July 2026)

When the primary findings drop, the survey dataset publishes alongside under CC-BY-4.0: anonymised individual-response CSV with weighted-indicator column, pre-computed cross-tab tables, JSON-Schema data dictionary, and aggregated telemetry summary tables. Distribution as a single ZIP bundle plus a Zenodo DOI for permanent citability. Raw individual telemetry rows are not released (privacy + customer-contract reasons); the aggregated tables are.

Limitations and what to expect

Volume 0 limitations are real and easy to state.

This is a synthesis, not a primary study. Every number on this page is downstream of another team's research. We have not yet measured anything ourselves on this report. The framing is "what the literature already supports," and where the literature is contested, we say so.
The cited studies have their own limitations. Peng et al. 2023 was a single-task experiment with relatively junior developers. METR 2025 had n=16. DORA 2024 is observational and cannot establish causation. Stack Overflow's Developer Survey is a convenience sample of one community. We have tried not to over-claim any single one of them.
The literature is moving. New RCTs on AI productivity are appearing roughly quarterly. Findings here are accurate to public studies as of the publish date; if a major study lands between Volume 0 and Volume 1, we will append a "Recent developments" note rather than rewriting the body.

Volume 1 will be honest in different ways.

Every quoted percentage will carry a 95% Wilson confidence interval. Comparative claims will carry effect sizes (Cohen's h, Cliff's delta, or Cohen's d), not just p-values.
Every cross-tab in the planned family will run through Benjamini–Hochberg correction. Exploratory cuts will be labelled exploratory.
Sensitivity analyses (organic excluded, attention-check failures excluded, under-4-minute completions excluded) appear in the methodology appendix for every headline number.
If the pre-registered hypotheses turn out to be wrong, we will say so. If a sensitivity analysis flips a headline finding, we will pause publication and document the flip as the finding.
The telemetry findings will report ranges across all workspaces in the ~14,000-story sample; no single workspace will be identifiable. The Stride product itself will not be named or ranked in any comparison table.
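The three effect sizes named in the list above each suit a different data type. The sketch below gives textbook definitions using only the standard library; the function names and the sample inputs are illustrative, not drawn from the study.

```python
import math
from statistics import mean, variance


def cohens_h(p1: float, p2: float) -> float:
    """Difference between two proportions on the arcsine-transformed scale."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


def cliffs_delta(xs, ys) -> float:
    """P(x > y) minus P(x < y) over all cross-group pairs (ordinal data)."""
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))


def cohens_d(xs, ys) -> float:
    """Standardised mean difference using the pooled sample variance."""
    nx, ny = len(xs), len(ys)
    pooled_var = ((nx - 1) * variance(xs) + (ny - 1) * variance(ys)) / (nx + ny - 2)
    return (mean(xs) - mean(ys)) / math.sqrt(pooled_var)


# Illustrative inputs only.
print(f"Cohen's h:     {cohens_h(0.76, 0.65):.3f}")
print(f"Cliff's delta: {cliffs_delta([4, 5, 5], [1, 2, 3]):.2f}")
print(f"Cohen's d:     {cohens_d([2, 4, 6], [1, 3, 5]):.2f}")
```

Cohen's h applies to the quoted percentages, Cliff's delta to Likert items where only the ordering of responses is meaningful, and Cohen's d to continuous measures.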

The vendor-neutrality posture is self-interested as much as it is principled. A study that buries an inconvenient finding dies in one news cycle. A study that publishes it is the one that gets cited five years later. The latter is the strategic move.

Frequently asked questions

Why is Stride publishing AI productivity research given the obvious conflict of interest?
Because a vendor study that buries inconvenient findings dies in one news cycle and never gets cited again. The vendor-neutrality posture in the methodology page is a set of operational rules: no Stride-named survey item, no Stride row in comparison tables, and an editorial owner, not the go-to-market (GTM) team, with final cut. The trade-off is honest: a fairly run study is a long-term citation asset, which is itself in our interest.
Is "Volume 0" just a marketing teaser for Volume 1?
No. Volume 0 is a real landscape synthesis: every numeric claim is attributed to a public study, and the pre-registration document binds the Volume 1 design. The framing avoids the alternative we considered and rejected: writing fabricated survey findings now because the real survey has not yet been fielded. Volume 0 stands on the public literature; Volume 1 adds the Stride survey + telemetry on top of the same URL.
Why these four landmark studies and not others?
Each represents a distinct epistemic position: Peng 2023 is the strongest controlled-experiment evidence of large speedups; METR 2025 is the strongest controlled-experiment evidence against; DORA 2024 is the largest observational survey with a measurement frame; Stack Overflow 2024 is the largest self-reported survey. McKinsey, Octoverse, JetBrains, and Anthropic’s Economic Index inform other sections of the report but were not picked here because they overlap on epistemic position.
The METR finding has n=16. Isn't that too small to take seriously?
It is small, but the within-subject design randomises 246 real tasks across the same 16 developers working on their own code, which is the right design for the question. The headline number is the most cited contrarian datapoint in the field for a reason: not because n=16 is large, but because the perception–measurement gap (39pp) replicates a well-established pattern in productivity-self-report research that predates AI tooling.
The Peng 2023 study is from early 2023. Isn't the result outdated?
For the specific question Peng asked (does Copilot speed up a fresh, well-bounded task for a junior-to-mid cohort?), the direction of the answer is unlikely to have reversed, given that LLM coding assistants have only improved since. What has changed is the breadth of work AI is asked to do, which Peng never measured. The Volume 1 instrument extends the question to context-rich and legacy work explicitly because that is where the literature has its largest measurement gap.
What does "pre-registration" actually mean here?
It means the hypotheses, the planned cross-tabs, the exclusion criteria, the weighting plan, and the statistical-correction method are all written down and time-stamped before any data is collected. The pre-registration is cross-posted on the Open Science Framework so the time-stamp is independently verifiable. Volume 1 is then answerable to the pre-registered design. If a finding falsifies a hypothesis, it ships as the finding.
When does Volume 1 publish, and what changes at this URL?
Target publish is July 2026 at the same canonical URL. The landscape-synthesis sections of Volume 0 stay as the "Prior public evidence" framing; the Volume 1 primary findings (survey + telemetry) replace the "What Volume 1 will measure" section and add a dataset link. No URL change; inbound citations to Volume 0 continue to resolve.
Can I participate in Volume 1 as a researcher, journalist, or practitioner?
Yes. The compensated panel arm is recruited through Prolific Academic when fielding opens; email info@newlightai.com to be notified. Maintainers of public open-source repositories who want to invite their contributors via the opt-in GitHub Action can email the same address. Journalists can request a 48-hour-embargoed preview the week before public release.
Will the dataset be released? Under what license?
Yes, under CC-BY-4.0 when Volume 1 publishes. The release includes the anonymised individual-response CSV with a weighted-indicator column, pre-computed cross-tab tables, a JSON-Schema data dictionary, and aggregated telemetry summary tables. Free-text responses and raw individual telemetry rows are withheld for privacy and customer-contract reasons.
How should I cite Volume 0?
See the "How to cite this report" section further down the page for APA, Chicago, BibTeX, and Markdown formats. The short version: cite the report by title, year 2026, author "Newlight Solutions," and the canonical URL ending in /research/state-of-ai-software-delivery-2026. The Volume 0 framing is stable; Volume 1 will get its own citation block at publish.

Related work

DevTools landscape

Stack Overflow Developer Survey 2024
Stack Overflow · 2024
Largest developer community survey worldwide; the canonical AI-adoption percentage in the field comes from here.
GitHub Octoverse 2024
GitHub · 2024
Activity-level rather than survey-level evidence of AI adoption; useful as a sanity check against self-report.
2024 Accelerate State of DevOps Report
Google DORA · 2024
The closest the industry has to a measurement standard; cautious on causation. Worth reading the limitations chapter in full.
The State of Developer Ecosystem 2024
JetBrains · 2024
Tool-stack distribution data that helps cross-validate the named-tool questions in the Volume 1 instrument.

HCI and behavioural research

Development of NASA-TLX (Task Load Index)
Hart, S. G. & Staveland, L. E. · 1988
The validated cognitive-load instrument used in §6 of the Volume 1 survey. Public-domain NASA work.
The Measurement of Experienced Burnout
Maslach, C. & Jackson, S. E. · 1981
Foundational paper for the Maslach Burnout Inventory; the 5-item HSS short form is licensed via Mind Garden for the Volume 1 survey.
SUS: A Quick and Dirty Usability Scale
Brooke, J. · 1996
The single-item tool-satisfaction anchor in §3 of the Volume 1 survey is adapted from SUS Q10.
Statistical Power Analysis for the Behavioral Sciences (2nd ed.)
Cohen, J. · 1988
The methodological frame behind "effect sizes over p-values." Cohen's h, d, and the conventions for "small / medium / large" originate here.
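
As an illustration of the effect-size framing in the Cohen entry, here is a minimal Python sketch of Cohen's h for two proportions, with the conventional 0.2 / 0.5 / 0.8 cut-offs for "small / medium / large". The function names `cohens_h` and `effect_label` are hypothetical, and the example proportions are made up for illustration; nothing here is taken from the Volume 1 analysis code.

```python
import math


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: difference between two proportions on the arcsine scale."""
    for p in (p1, p2):
        if not 0.0 <= p <= 1.0:
            raise ValueError("proportions must lie in [0, 1]")
    phi1 = 2.0 * math.asin(math.sqrt(p1))
    phi2 = 2.0 * math.asin(math.sqrt(p2))
    return phi1 - phi2


def effect_label(h: float) -> str:
    """Map |h| onto Cohen's conventional size bands (assumed cut-offs)."""
    size = abs(h)
    if size < 0.2:
        return "below small"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"


if __name__ == "__main__":
    # Illustrative proportions only, not findings from any study.
    h = cohens_h(0.70, 0.40)
    print(f"h = {h:.3f} ({effect_label(h)})")
```

The arcsine transform makes a given h mean roughly the same thing near 0.5 as near the extremes, which a raw difference in percentage points does not.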

Methodology references

Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing
Benjamini, Y. & Hochberg, Y. · 1995
The multiple-comparison correction method applied to the Volume 1 planned cross-tab family at q=0.05.
Probable Inference, the Law of Succession, and Statistical Inference
Wilson, E. B. · 1927
Original derivation of the Wilson score interval. Robust on small denominators, used on every proportion in Volume 1.
OSF Pre-registration template
Center for Open Science · 2024
The template used for the Volume 1 pre-registration. The OSF time-stamp, which predates data collection, is what makes the pre-registration auditable.
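
The two statistical methods named in the references above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Volume 1 analysis code: the function names `wilson_interval` and `benjamini_hochberg` are hypothetical, and the default `z = 1.96` (a 95% interval) is an assumption, since the confidence level is not stated here. Only `q = 0.05` comes from the text.

```python
import math
from typing import List, Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z = 1.96 (95%) is an assumed default, not a documented study setting.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie in [0, n]")
    p = successes / n
    z2 = z * z
    denom = 1.0 + z2 / n
    centre = (p + z2 / (2.0 * n)) / denom
    half = z * math.sqrt(p * (1.0 - p) / n + z2 / (4.0 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def benjamini_hochberg(p_values: List[float], q: float = 0.05) -> List[bool]:
    """Benjamini-Hochberg step-up procedure.

    Returns one flag per input p-value (in input order): True if the
    corresponding hypothesis is rejected at false discovery rate q.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= k * q / m.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k_max = rank
    # Reject every hypothesis ranked at or below k_max.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= k_max:
            rejected[idx] = True
    return rejected


if __name__ == "__main__":
    # Illustrative inputs only, not study data.
    print(wilson_interval(5, 10))
    print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.2]))
```

The Wilson interval stays inside [0, 1] and behaves sensibly when the count is 0 or n, which is why it suits small denominators. The Benjamini-Hochberg loop is step-up: a p-value that misses its own threshold is still rejected if a larger-ranked p-value meets its threshold.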

From the Stride blog

The ROI of AI in software delivery
Stride Team · 2026
Closest blog analogue to Volume 0: hedged thesis, numbered findings, telemetry-flavoured anecdotes. Prefigures the voice we apply to the survey.
How AI writes acceptance criteria (and where it fails)
Stride Team · 2026
Concrete win/loss map across AI workflows; the kind of segment-level finding Volume 1 will replicate with statistical rigour.
The connected delivery graph
Stride Team · 2026
Thesis post on why AI-assisted delivery needs an integrated artefact graph rather than five disconnected tools. Context for the Volume 1 telemetry design.

Cite Volume 0 in your own writing using the formats below. The dataset DOI and the Volume 1 primary-findings citation will be added at the same URL when Volume 1 publishes. Because the URL does not change, citations made to Volume 0 today will resolve to the most current version of the page once Volume 1 is live.

How to cite this report

APA

Stride Research. (2026). State of AI Software Delivery 2026, Volume 0: Landscape synthesis and pre-registered design. Newlight Solutions. https://www.stride.page/research/state-of-ai-software-delivery-2026

Chicago

Stride Research. 2026. "State of AI Software Delivery 2026, Volume 0: Landscape synthesis and pre-registered design." Newlight Solutions. https://www.stride.page/research/state-of-ai-software-delivery-2026.

BibTeX

@techreport{stride2026soaisd_v0,
author      = {{Stride Research, Newlight Solutions}},
title       = {State of AI Software Delivery 2026 --- Volume 0: Landscape synthesis and pre-registered design},
institution = {Newlight Solutions},
year        = {2026},
url         = {https://www.stride.page/research/state-of-ai-software-delivery-2026},
}

Markdown

[Stride Research (2026). State of AI Software Delivery 2026, Volume 0: Landscape synthesis and pre-registered design. Newlight Solutions.](https://www.stride.page/research/state-of-ai-software-delivery-2026)

How to cite this report

APA

Newlight Solutions. (2026). State of AI Software Delivery 2026, Volume 0: Landscape synthesis + pre-registered study design [Report]. Stride. https://www.stride.page/research/state-of-ai-software-delivery-2026

Chicago

Newlight Solutions. 2026. "State of AI Software Delivery 2026, Volume 0." Stride. https://www.stride.page/research/state-of-ai-software-delivery-2026.

BibTeX

@techreport{stride2026_v0,
title       = {State of AI Software Delivery 2026 — Volume 0:
               Landscape synthesis + pre-registered study design},
author      = {{Newlight Solutions}},
institution = {Stride},
year        = {2026},
url         = {https://www.stride.page/research/state-of-ai-software-delivery-2026},
note        = {Volume 0 of an annual study; Volume 1 with primary
               findings publishes July 2026 at the same URL.}
}

Markdown

[Stride 2026 State of AI Software Delivery, Volume 0](https://www.stride.page/research/state-of-ai-software-delivery-2026)

Participate in Volume 1

Volume 1 of this study fields through Q2 2026. There are two ways to participate: the panel arm, which is compensated, and the organic arm, which enters you in a raffle with a 1-in-25 chance of winning a $20 gift card. The recruitment page opens four weeks before fielding closes; email info@newlightai.com to be notified when it does. Maintainers of public open-source repositories who want to invite their contributors via the opt-in GitHub Action can email the same address; the Action and consent flow ship in week 3 of the research sprint.

If you want to be cited (researchers, journalists, vendors, practitioners with something useful to say about a finding here), that's the right address too. The vendor-neutrality policy means we won't quote a vendor's own report verbatim, but we will quote a practitioner observing one. The two are not the same.

The honest version of an annual report is harder than the marketing version. It is also the one that gets read in 2027.

References

Citations are inline above; this section consolidates the bibliography. All sources are publicly accessible at the time of writing; URLs are checked monthly during the Volume 1 production window and any 404s get archive.org snapshot replacements.

[1] Stack Overflow Developer Survey 2024. Stack Overflow, June 2024. AI section reports 76% of professional developers currently use or plan to use AI tools; 84% of users report AI has improved their productivity. n≈65,000.
[2] GitHub Octoverse 2024. GitHub, October 2024. AI-assisted commit activity rising across major language ecosystems; tripling in .ipynb and ML-adjacent repository creation year-over-year.
[3] The state of AI in early 2024: Gen AI adoption spikes and starts to generate value. McKinsey & Company, May 2024. Workplace AI use at 65% of organisations; software engineering one of the three highest-adoption functions.
[4] 2024 Accelerate State of DevOps Report. Google DORA, October 2024. n≈39,000. AI adoption distribution across DORA performance quartiles; cautionary note on AI as amplifier of existing delivery culture.
[5] The Anthropic Economic Index. Anthropic, February 2025. Distribution of Claude usage by occupational category; computer and mathematical occupations the single largest category (~37% of total).
[6] The Impact of AI on Developer Productivity: Evidence from GitHub Copilot. Peng, S., Kalliamvakou, E., Cihon, P., Demirer, M., 2023. Controlled experiment, n=95, JavaScript HTTP server task, ~55% mean speedup for Copilot group.
[7] Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity. METR (Model Evaluation & Threat Research), 2025. RCT with 16 experienced contributors, 246 tasks, AI-assisted condition ~19% slower; developer self-perception ~20% faster.
[8] Measurement-practice synthesis (this paper). Across references [1]–[4], the share of respondent teams reporting controlled or pre/post measurement of AI's productivity impact is consistently in the single-digit-to-low-teens percentage range. The Stride 2026 study replicates this question with a directly comparable single-select item to update the figure.

Embed this chart

Republish this chart on your site or blog — the snippet renders the figure and credits the research with a link back. Free to use under CC-BY-4.0 with attribution.

State of AI Software Delivery 2026 — Stride Research

<a href="https://www.stride.page/research/state-of-ai-software-delivery-2026" target="_blank" rel="noopener">
  <img src="https://www.stride.page/api/og?variant=research-effect-range&amp;title=State%20of%20AI%20Software%20Delivery%202026&amp;subtitle=What%20the%20existing%20public%20studies%20(DORA%2C%20METR%2C%20Octoverse%2C%20Stack%20Overflow)%20tell%20us%2C%20and%20what%20the%20Stride%202026%20study%20will%20measure%20next.&amp;eyebrow=RESEARCH%202026&amp;v=2026-05" alt="State of AI Software Delivery 2026 — Stride Research" width="600" style="max-width:100%;height:auto;border:0" />
</a>
<p style="font:13px/1.6 -apple-system,BlinkMacSystemFont,'Segoe UI',sans-serif;color:#475569;margin:8px 0 0">Source: <a href="https://www.stride.page/research/state-of-ai-software-delivery-2026">State of AI Software Delivery 2026 — Stride Research</a></p>