Methodology

DORA Metrics in Practice 2026 · Methodology

800-respondent engineering-leader survey using DORA's published item wording verbatim, segmented by team size × industry × AI-adoption-depth × practice-maturity. Hierarchical regression with practice covariates; Wilson 95% CIs; Benjamini–Hochberg FDR correction.

Research questions

The Stride 2026 DORA segmented refresh addresses one primary question and four secondary questions.

The primary question is the causation question DORA's annual cadence doesn't interrogate in real time. The DORA 2024 report observes that AI adoption is ~1.4× higher in elite-quartile teams than low-quartile teams. The popular reading: AI causes elite performance. DORA's published reading: correlation only. The Stride 2026 segmented refresh tests directly which causal story the data supports: (1) AI raises performance, (2) performance enables AI, or (3) selection effects (elite teams adopt new tools faster).

The secondary questions test four further hypotheses developed from the DORA literature: (1) whether change failure rate is higher in heavy-AI cohorts in non-elite quartiles; (2) whether deployment-frequency gains concentrate in smaller teams; (3) whether Time to Restore is practice-driven rather than AI-driven; and (4) whether engineering-practice maturity + AI-adoption depth together explain DORA quartile membership variance.

Hypotheses (pre-registered)

These are the five hypotheses we register before fielding closes. Each is paired with a specific operationalisation, an exclusion criterion, and a falsification condition.

H1: Quartile correlation weakens with controls

Operationalisation. Respondents self-classify against DORA's published quartile thresholds for all four metrics (deployment frequency, lead time, time to restore, change failure rate). AI adoption is measured in §3 of the instrument on a 5-point scale (none / exploratory / selective / regular / pervasive). Engineering-practice maturity is measured by a 12-item composite (§4) covering trunk-based development, CI/CD maturity, code-review discipline, test automation, on-call rotation maturity, post-incident review, and runbook quality, among other practices.

Prediction. Without controls: AI adoption correlates with elite-quartile membership at r ≥ 0.35 (replicating DORA 2024 directionally). With company-size + practice-maturity controls in a hierarchical regression: AI's effect size will weaken by ≥30% (Cohen's f² drop from baseline to controlled).

Falsification. Either the unadjusted correlation is below 0.20 (replication failure), or the controlled effect weakens by under 15% (causation story 1 survives).
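The H1 decision rule can be sketched as a small calculation. This is a minimal illustration, assuming the hierarchical regression has already produced R² values for each step. The function names (cohens_f2, h1_verdict), the example R² figures, and the "indeterminate" label for a weakening between 15% and 30% (a band the text above does not define) are assumptions for illustration, not the study's analysis pipeline.

```python
def cohens_f2(r2_full: float, r2_reduced: float) -> float:
    """Cohen's f² for the predictors added between two nested models."""
    if not 0.0 <= r2_reduced <= r2_full < 1.0:
        raise ValueError("expected 0 <= r2_reduced <= r2_full < 1")
    return (r2_full - r2_reduced) / (1.0 - r2_full)


def h1_verdict(r_unadjusted: float, f2_baseline: float, f2_controlled: float) -> str:
    """Apply the H1 prediction and falsification thresholds."""
    if f2_baseline <= 0.0:
        raise ValueError("baseline effect size must be positive")
    if r_unadjusted < 0.20:
        return "falsified: replication failure"
    weakening = 1.0 - f2_controlled / f2_baseline
    if weakening < 0.15:
        return "falsified: AI effect survives controls"
    if r_unadjusted >= 0.35 and weakening >= 0.30:
        return "supported"
    return "indeterminate"  # band not defined in the source (assumption)


# Hypothetical numbers, for illustration only.
baseline = cohens_f2(r2_full=0.15, r2_reduced=0.0)     # AI alone
controlled = cohens_f2(r2_full=0.46, r2_reduced=0.40)  # AI added after controls
print(h1_verdict(r_unadjusted=0.39, f2_baseline=baseline, f2_controlled=controlled))
```

With these made-up inputs the AI effect size shrinks by roughly 37% once the controls enter, so the sketch reports the prediction as supported.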

H2: Heavy AI raises CFR in non-elite quartiles

Operationalisation. Change failure rate is the §2.4 self-classification (0–5% elite / 5–10% high / 10–15% medium / ≥15% low). Heavy-AI cohort = respondents in "pervasive" or "regular" AI-usage strata. Selective cohort = "selective" or "exploratory."

Prediction. Within the lower three quartiles (high, medium, low combined), heavy-AI cohort CFR will be statistically higher than selective-AI cohort CFR (Cohen's h ≥ 0.2, p < 0.05 after BH-correction). Within the elite quartile, no significant difference will be observed.

Falsification. CFR is statistically indistinguishable in the lower three quartiles, OR significantly higher in the elite quartile.
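For readers unfamiliar with Cohen's h, the sketch below shows the calculation. It assumes the quantity compared between cohorts is a proportion, for example the share of respondents who place themselves in the ≥15% CFR band; that choice and the example proportions are assumptions for illustration, since the text above does not fix them.

```python
import math


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: difference between arcsine-transformed proportions."""
    for p in (p1, p2):
        if not 0.0 <= p <= 1.0:
            raise ValueError("proportions must lie in [0, 1]")
    return 2.0 * math.asin(math.sqrt(p1)) - 2.0 * math.asin(math.sqrt(p2))


# Hypothetical shares of each cohort in the >=15% CFR band.
heavy_ai_share = 0.30
selective_ai_share = 0.20
h = cohens_h(heavy_ai_share, selective_ai_share)
print(f"Cohen's h = {h:.3f}; meets the 0.2 threshold: {h >= 0.2}")
```

A ten-point gap at these base rates gives h of about 0.23, just past the 0.2 threshold. The same gap near 50% would give a slightly smaller h, which is why the threshold is stated on the transformed scale.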

H3: Deployment frequency improvements concentrate in small teams

Operationalisation. Respondents report their current team size, their team's deployment frequency 12 months ago (recall question, §2.5), and their current deployment frequency (§2.1). Improvement = current minus prior, normalised by the published quartile threshold.

Prediction. Year-over-year deployment-frequency improvement will be statistically larger for teams ≤50 (≥30% improvement) than for teams 5,000+ (≤10% improvement). Test: difference-in-differences regression.

Falsification. Improvement magnitudes are statistically indistinguishable across team sizes.

H4 (null): Time to Restore is practice-driven, not AI-driven

Operationalisation. Time to Restore self-classification (§2.3) vs. AI-adoption stratum, controlling for the 12-item practice-maturity composite.

Prediction. AI's coefficient on Time to Restore will be statistically indistinguishable from zero after practice-maturity controls (95% CI on the coefficient overlaps zero). The null is the finding. Time to Restore is set by CI/CD investment, on-call runbook discipline, and feature-flag infrastructure, not by AI tool adoption.

Falsification. Significant non-zero coefficient on AI after practice-maturity controls.

H5 (exploratory): Two-factor explanation

Reported with explicit exploratory framing. A two-factor model (engineering-practice composite × AI-adoption depth) will explain ≥80% of DORA quartile membership variance in stepwise hierarchical regression, with practice-maturity alone explaining ≥60%. This is hypothesis-generating, not hypothesis-testing, and will be reported in an explicit "Exploratory" section of Volume 1.

Multiple-comparison correction

The four planned hypotheses (H1–H4) form one family. The false discovery rate across the family is controlled with the Benjamini–Hochberg procedure at q = 0.05. H5 is explicitly outside the planned family.
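The Benjamini–Hochberg step-up procedure for a four-hypothesis family can be sketched as follows. The function name and the p-values are hypothetical and serve only to show the mechanics; they are not study results.

```python
def benjamini_hochberg(p_values: dict[str, float], q: float = 0.05) -> dict[str, bool]:
    """Return, per hypothesis, whether it is rejected under BH at FDR level q."""
    m = len(p_values)
    ranked = sorted(p_values.items(), key=lambda kv: kv[1])
    # Find the largest rank k whose p-value satisfies p_(k) <= (k / m) * q.
    cutoff = 0
    for k, (_, p) in enumerate(ranked, start=1):
        if p <= k / m * q:
            cutoff = k
    # Every hypothesis ranked at or below the cutoff is rejected.
    rejected = {name for name, _ in ranked[:cutoff]}
    return {name: name in rejected for name in p_values}


# Made-up p-values for the planned family.
family = {"H1": 0.004, "H2": 0.030, "H3": 0.041, "H4": 0.200}
print(benjamini_hochberg(family))
```

In this made-up family only H1 is rejected: H2 and H3 fall below 0.05 on their own but exceed their rank-scaled thresholds (0.025 and 0.0375), which is the point of correcting across the family.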

Survey design

The instrument is ~52 substantive items + 4 screening + 3 attention checks + 8 firmographics. Median completion target 13 minutes.

Screening

S1–S4: role (engineering manager / staff+ / director / VP, others screened out); tenure ≥3 years engineering-leadership work; current employment.

Section 1: Team + organisation context (8 items)

Team size (current), team size (12 months ago), company size, industry (with regulated-vs-unregulated flag), region.

Section 2: DORA quartile self-classification (12 items, DORA wording verbatim)

S2.1–S2.4: the four DORA quartile-classification questions (deployment frequency, lead time, time to restore, change failure rate) using DORA's published item wording verbatim. S2.5–S2.8: prior-year (12-month-ago) self-classification on the same four metrics. S2.9: confidence in the self-classifications. S2.10–S2.12: telemetry-availability questions (does your team measure these metrics with telemetry rather than self-report?).

Section 3: AI adoption (6 items)

AI tool adoption depth (5-point scale: none / exploratory / selective / regular / pervasive). Time-since-AI-introduction. Specific tools in use (alphabetically listed; Stride is named only as the publisher). Mandatory-vs-voluntary adoption.

Section 4: Engineering practice maturity (12 items, composite scored)

Trunk-based development. CI/CD pipeline maturity. Code-review discipline. Test-automation coverage. On-call rotation maturity (toil tracking, rotation fairness, escalation discipline). Post-incident review cadence. Runbook quality. Feature-flag infrastructure. Documentation discipline. Observability investment. SRE practices. Architecture review cadence.

Section 5: Industry + regulatory context (4 items)

Regulatory burden self-classification. Compliance-frame age (does the team work in established regulatory frames or emerging ones?). Industry-specific tooling adoption.

Section 6: Outcomes (6 items)

Self-reported delivery velocity (qualitative). Self-reported quality (qualitative). NPS-style satisfaction with current delivery process. Burnout self-rating (single-item proxy; not validated MBI, see /research/engineering-burnout-and-process-debt-2026 for the validated study).

Section 7: Firmographics + consent (4 items)

Geographic region. Gender (optional). Opt-in for follow-up study. Dataset-release consent (CC-BY-4.0 with anonymised individual responses).

Attention checks

Item #14 ("Please select 'Strongly agree' for this item to confirm careful reading"), item #29 (a list-question where one option is obviously off-topic), item #42 (a paragraph-reading comprehension item).

Respondents failing ≥2 of 3 checks are screened out and replaced.

Recruitment

Prolific Academic panel arm

Target n = 650 completes via Prolific Academic. Effective CPI ~$3.50 USD per complete (higher than the sprint-estimation study because eng-leader screening narrows the eligible pool).

Organic top-up arm

Target n = 150 completes via:

Stride newsletter
LinkedIn organic from research team
Engineering-leader communities (Rands Leadership Slack, Lead Dev, EM-adjacent newsletters)
Industry-partner referrals

Organic responses carry a separate stratum flag in the dataset.

Pilot wave

N = 60 panel respondents + 8 think-aloud sessions ($75 honorarium, 30 minutes each) before the main field opens. Iteration window 14 days; any DORA-wording item flagged confusing by ≥3 of 8 think-alouds gets rewording (with the change documented as a deviation from DORA's verbatim wording).

Statistical methods

Effect sizes, not just p-values

Cohen's h for proportion comparisons.
Cohen's d + Hedges' g for continuous magnitudes.
Cohen's f² for hierarchical regression effect-size comparisons.
Wilson 95% CIs on every quoted percentage.
Bootstrap 95% CIs (10,000 iterations) on continuous magnitudes.
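Two of the interval methods listed above can be sketched with the standard library alone. This is an illustrative sketch, not the study's code: the function names are assumptions, and the bootstrap shown is the simple percentile variant applied to a mean, one of several bootstrap interval constructions.

```python
import math
import random
import statistics


def wilson_ci(successes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (95% by default)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def bootstrap_mean_ci(values: list[float], iterations: int = 10_000,
                      seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap 95% CI for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(values)
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(iterations)
    )
    return means[int(0.025 * iterations)], means[int(0.975 * iterations) - 1]


# Hypothetical inputs: 40 of 100 respondents, and ten continuous magnitudes.
print(wilson_ci(40, 100))
print(bootstrap_mean_ci([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

The Wilson interval stays inside [0, 1] and behaves sensibly for small cells, which matters once the sample is cut by team size, industry, AI depth, and practice maturity.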

Practice-maturity composite scoring

The 12-item practice-maturity composite (§4) is scored as the unit-weighted sum of standardised item scores (z-score of each item, summed). Cronbach's α on the composite is pre-registered as the reliability check; α < 0.7 in the pilot would trigger item revision.
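The scoring rule and reliability check described above can be sketched as follows. The data layout (one list per item, one entry per respondent), the function names, and the three-item toy data are assumptions for illustration; the real composite has 12 items.

```python
import statistics


def zscores(column: list[float]) -> list[float]:
    """Standardise one item's responses (sample SD; fails if the item has no variance)."""
    mean = statistics.fmean(column)
    sd = statistics.stdev(column)
    return [(x - mean) / sd for x in column]


def composite_scores(items: list[list[float]]) -> list[float]:
    """Unit-weighted composite: sum of z-scored items, one score per respondent."""
    standardised = [zscores(col) for col in items]
    return [sum(vals) for vals in zip(*standardised)]


def cronbach_alpha(items: list[list[float]]) -> float:
    """Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of totals)."""
    k = len(items)
    item_variance = sum(statistics.variance(col) for col in items)
    totals = [sum(vals) for vals in zip(*items)]
    return k / (k - 1) * (1.0 - item_variance / statistics.variance(totals))


# Toy data: 3 items x 5 respondents (hypothetical).
toy_items = [[1, 2, 3, 4, 5], [2, 2, 3, 5, 5], [1, 3, 3, 4, 4]]
print(composite_scores(toy_items))
print(cronbach_alpha(toy_items) >= 0.7)  # pilot check against the 0.7 threshold
```

Because every item is standardised before summing, no single item dominates the composite through its scale range alone; that is what "unit-weighted" means here.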

Multiple-comparison correction

The planned hypothesis family (H1–H4) is corrected with Benjamini–Hochberg FDR at q = 0.05. Exploratory cross-tabs are reported with explicit "exploratory" framing.

Sensitivity analysis

Before publishing Volume 1, we run three sensitivity analyses on each headline finding:

With and without the organic stratum.
With and without attention-check survivors who took ≤6 minutes.
With three alternative practice-maturity scorings (unit-weighted, factor-loading-weighted, latent-class).

Any headline number that flips direction under any sensitivity condition is flagged inline.
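The flagging rule can be sketched as a sign comparison. This assumes each headline finding is summarised as a signed estimate (an effect size or coefficient) and that "flips direction" means the estimate changes sign; the condition names and numbers are hypothetical.

```python
def direction_flips(headline: float, sensitivity: dict[str, float]) -> list[str]:
    """Names of sensitivity conditions whose estimate has the opposite sign to the headline."""
    return [name for name, estimate in sensitivity.items() if estimate * headline < 0]


# Hypothetical estimates for one headline finding.
flags = direction_flips(
    headline=0.23,
    sensitivity={"without_organic": 0.19, "without_speeders": -0.02, "latent_class": 0.21},
)
print(flags)
```

An estimate of exactly zero is not treated as a flip in this sketch; whether it should be is a judgment the text above leaves open.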

Reproducibility

The Volume 0 landscape figures are reproducible today. Every figure's underlying numbers publish with this report as a single CC-BY-4.0 CSV (download below), with one row per source study, each carrying a source_citation_url so a peer reviewer can trace every plotted point. A Jupyter notebook (landscape-charts.ipynb) loads that data and re-renders every Volume 0 figure (quartile distribution 2018–2024, AI adoption by quartile, metric thresholds card, literature timeline); it publishes alongside the Volume 1 reproducibility bundle.

The Volume 1 analysis pipeline will ship as a separate notebook on the publish date, also under CC-BY-4.0.

Dataset

DORA Metrics in Practice 2026 — Volume 0 source dataset

License: CC-BY-4.0

CSV

Dataset publication (Volume 1)

When Volume 1 lands, the survey dataset publishes under CC-BY-4.0 as a single ZIP bundle + Zenodo DOI.

Bundle contents:

responses.csv: one row per respondent, anonymised, no PII.
quartile-classification.csv: per-respondent quartile classifications for both current and 12-month-prior periods.
practice-maturity-scores.csv: per-respondent composite score + per-item raw scores.
cross-tabs/: pre-computed cross-tabs (industry × quartile, team-size × quartile, AI-depth × quartile, regulated × quartile, practice-maturity × quartile).
data-dictionary.json: JSON-Schema-compatible data dictionary.

The integrity hash (SHA-256) of the bundle appears on the report page so reproducibility-conscious readers can verify their copy.
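A reader could check a downloaded bundle against the published hash along these lines. This is a minimal sketch using Python's standard hashlib module; the function names are assumptions, and the bundle file name and published hash must be taken from the report page.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hex SHA-256 digest of a file, read in chunks so large bundles fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def bundle_matches(path: str, published_hex: str) -> bool:
    """True if the local file's digest equals the published one (case-insensitive)."""
    return sha256_of_file(path) == published_hex.strip().lower()
```

Calling bundle_matches with the local path of the ZIP and the hash string copied from the report page returns True only when the copy is byte-identical to the published bundle.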

Vendor-neutrality posture

Stride is a software-delivery platform. We publish this study with the same disciplines as the rest of the 2026 research series:

DORA wording verbatim for all four quartile-classification questions. Any deviation is documented.
Stride is not named in survey items except in §3 (where respondents are asked about their AI-tool stack, alphabetically listed).
Comparison tables in Volume 1 cite published study results, not Stride product claims.
Editorial owner has final cut, not GTM. No marketing voice in the body.
The dataset publishes whether the findings flatter Stride or not. A null result on H4 ships. A finding that AI raises CFR in non-elite cohorts ships. A finding that Stride's own customers cluster in the medium quartile ships.

A study that buries inconvenient findings dies in one news cycle. A study that publishes them is the one that gets cited five years later.
