DORA Metrics in Practice 2026 — Methodology
800-respondent engineering-leader survey using DORA's published item wording verbatim, segmented by team size × industry × AI-adoption-depth × practice-maturity. Hierarchical regression with practice covariates; Wilson 95% CIs; Benjamini–Hochberg FDR correction.
Research questions
The Stride 2026 DORA segmented refresh addresses one primary question and four secondary questions.
The primary question is the causation question DORA's annual cadence doesn't interrogate in real time. The DORA 2024 report observes that AI adoption is ~1.4× higher in elite-quartile teams than low-quartile teams. The popular reading: AI causes elite performance. DORA's published reading: correlation only. The Stride 2026 segmented refresh tests directly which causal story the data supports: (1) AI raises performance, (2) performance enables AI, or (3) selection effects (elite teams adopt new tools faster).
The secondary questions test four further hypotheses developed from the DORA literature: (1) whether change failure rate is higher in heavy-AI cohorts in non-elite quartiles; (2) whether deployment-frequency gains concentrate in smaller teams; (3) whether Time to Restore is practice-driven rather than AI-driven; and (4) whether engineering-practice maturity + AI-adoption depth together explain DORA quartile membership variance.
Hypotheses (pre-registered)
These are the five hypotheses we register before fielding closes. Each is paired with a specific operationalisation, an exclusion criterion, and a falsification condition.
H1 — Quartile correlation weakens with controls
Operationalisation. Respondents self-classify against DORA's published quartile thresholds for all four metrics (deployment frequency, lead time, time to restore, change failure rate). AI adoption is measured in §3 of the instrument on a 5-point scale (none / exploratory / selective / regular / pervasive). Engineering-practice maturity is measured by a 12-item composite (§4) whose items include trunk-based development, CI/CD maturity, code-review discipline, test automation, on-call rotation maturity, post-incident review, and runbook quality.
Prediction. Without controls: AI adoption correlates with elite-quartile membership at r ≥ 0.35 (replicating DORA 2024 directionally). With company-size + practice-maturity controls in a hierarchical regression: AI's effect size will weaken by ≥30% (Cohen's f² drop from baseline to controlled).
Falsification. Either the unadjusted correlation is below 0.20 (replication failure), or the controlled effect weakens by under 15% (causation story 1 survives).
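The H1 decision rule can be sketched in a few lines. The text does not say exactly how the f² drop is computed, so this sketch assumes the baseline f² comes from an AI-only model and the controlled f² is the local effect of adding AI adoption to a model that already holds the company-size and practice-maturity controls. The function names, the "inconclusive" label for the zone between the two thresholds, and the R² inputs are all illustrative and not part of the pre-registration.

```python
def cohens_f2(r2_full: float, r2_reduced: float = 0.0) -> float:
    """Cohen's f^2 for the predictors added between a reduced and a full model."""
    if not 0.0 <= r2_reduced <= r2_full < 1.0:
        raise ValueError("need 0 <= r2_reduced <= r2_full < 1")
    return (r2_full - r2_reduced) / (1.0 - r2_full)


def h1_verdict(r_unadjusted, r2_ai_only, r2_controls, r2_controls_plus_ai):
    """Apply the H1 prediction and falsification thresholds (assumed reading)."""
    f2_base = cohens_f2(r2_ai_only)                         # AI alone, no controls
    f2_ctrl = cohens_f2(r2_controls_plus_ai, r2_controls)   # AI added on top of controls
    if f2_base == 0.0:
        raise ValueError("baseline effect is zero; relative drop is undefined")
    drop = 1.0 - f2_ctrl / f2_base                          # relative weakening of AI's effect
    if r_unadjusted < 0.20 or drop < 0.15:
        verdict = "falsified"
    elif r_unadjusted >= 0.35 and drop >= 0.30:
        verdict = "supported"
    else:
        verdict = "inconclusive"                            # hypothetical label, not in the source
    return drop, verdict


# Made-up inputs, for illustration only.
drop, verdict = h1_verdict(r_unadjusted=0.40, r2_ai_only=0.16,
                           r2_controls=0.30, r2_controls_plus_ai=0.35)
print(f"f2 drop = {drop:.1%}, H1 {verdict}")
```

The gap between the 15% and 30% thresholds is left as its own outcome here because the text defines neither support nor falsification in that range.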
H2 — Heavy AI raises CFR in non-elite quartiles
Operationalisation. Change failure rate is the §2.4 self-classification (0–5% elite / 5–10% high / 10–15% medium / ≥15% low). Heavy-AI cohort = respondents in "pervasive" or "regular" AI-usage strata. Selective cohort = "selective" or "exploratory."
Prediction. Within the lower three quartiles (high, medium, low combined), heavy-AI cohort CFR will be statistically higher than selective-AI cohort CFR (Cohen's h ≥ 0.2, p < 0.05 after BH-correction). Within the elite quartile, no significant difference will be observed.
Falsification. CFR is statistically indistinguishable in the lower three quartiles, OR significantly higher in the elite quartile.
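For reference, Cohen's h is the difference between two arcsine-transformed proportions. The sketch below assumes the H2 comparison is run on the share of each cohort that falls into a given CFR band (for example, the ≥15% band); that framing and the example proportions are assumptions for illustration, not figures from the study.

```python
import math


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h effect size for the difference between two proportions."""
    for p in (p1, p2):
        if not 0.0 <= p <= 1.0:
            raise ValueError("proportions must lie in [0, 1]")
    return 2.0 * math.asin(math.sqrt(p1)) - 2.0 * math.asin(math.sqrt(p2))


# Hypothetical shares of respondents in the >=15% CFR band.
heavy_ai_share = 0.30
selective_ai_share = 0.20
h = cohens_h(heavy_ai_share, selective_ai_share)
print(f"Cohen's h = {h:.3f}; meets the H2 threshold: {h >= 0.2}")
```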
H3 — Deployment frequency improvements concentrate in small teams
Operationalisation. Respondents report current team size, current deployment frequency (§2.1), and, as a recall question, team deployment frequency 12 months ago (§2.5). Improvement = current minus prior, normalised by the published quartile threshold.
Prediction. Year-over-year deployment-frequency improvement will be statistically larger for teams of ≤50 (≥30% improvement) than for teams of 5,000+ (≤10% improvement). Test: difference-in-differences regression.
Falsification. Improvement magnitudes are statistically indistinguishable across team sizes.
H4 (null) — Time to Restore is practice-driven, not AI-driven
Operationalisation. Time to Restore self-classification (§2.3) vs. AI-adoption stratum, controlling for the 12-item practice-maturity composite.
Prediction. AI's coefficient on Time to Restore will be statistically indistinguishable from zero after practice-maturity controls (the 95% CI on the coefficient includes zero). The null is the finding: Time to Restore is set by CI/CD investment, on-call runbook discipline, and feature-flag infrastructure, not by AI tool adoption.
Falsification. Significant non-zero coefficient on AI after practice-maturity controls.
H5 (exploratory) — Two-factor explanation
Reported with explicit exploratory framing. Prediction: a two-factor model (engineering-practice composite × AI-adoption depth) will explain ≥80% of DORA quartile membership variance in stepwise hierarchical regression, with practice maturity alone explaining ≥60%. This is hypothesis-generating, not hypothesis-testing, and it will be reported in an explicit "Exploratory" section of Volume 1.
Multiple-comparison correction
The four planned hypotheses (H1–H4) form one family. The Benjamini–Hochberg FDR is controlled at q = 0.05 across the family. H5 is explicitly outside the planned family.
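As a reminder of how the Benjamini–Hochberg step-up procedure works on a small family, here is a minimal sketch. The p-values are hypothetical placeholders, and the function name is illustrative; an analysis pipeline would more likely call an established statistics library.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the hypothesis is rejected at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p-value
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        # Step-up rule: find the largest rank k with p_(k) <= (k / m) * q.
        if p_values[idx] <= rank / m * q:
            k_max = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= k_max:
            rejected[idx] = True
    return rejected


# Hypothetical p-values for the planned family, for illustration only.
family = {"H1": 0.001, "H2": 0.030, "H3": 0.035, "H4": 0.200}
decisions = benjamini_hochberg(list(family.values()), q=0.05)
for name, reject in zip(family, decisions):
    print(name, "rejected" if reject else "not rejected")
```

Because the procedure is step-up, a p-value that misses its own rank threshold can still be rejected when a larger p-value further down the ordering meets its threshold.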
Survey design
The instrument is ~52 substantive items + 4 screening + 3 attention checks + 8 firmographics. Median completion target 13 minutes.
Screening
S1–S4: role (engineering manager / staff+ / director / VP — others screened out); tenure ≥3 years engineering-leadership work; current employment.
Section 1 — Team + organisation context (8 items)
Team size (current), team size (12 months ago), company size, industry (with regulated-vs-unregulated flag), region.
Section 2 — DORA quartile self-classification (12 items, DORA wording verbatim)
§2.1–§2.4: the four DORA quartile-classification questions (deployment frequency, lead time, time to restore, change failure rate) using DORA's published item wording verbatim. §2.5–§2.8: prior-year (12-month-ago) self-classification on the same four metrics. §2.9: confidence in the self-classifications. §2.10–§2.12: telemetry-availability questions (does your team measure these metrics with telemetry rather than self-report?).
Section 3 — AI adoption (6 items)
AI tool adoption depth (5-point scale: none / exploratory / selective / regular / pervasive). Time-since-AI-introduction. Specific tools in use (alphabetically listed; Stride is named only as the publisher). Mandatory-vs-voluntary adoption.
Section 4 — Engineering practice maturity (12 items, composite scored)
Trunk-based development. CI/CD pipeline maturity. Code-review discipline. Test-automation coverage. On-call rotation maturity (toil tracking, rotation fairness, escalation discipline). Post-incident review cadence. Runbook quality. Feature-flag infrastructure. Documentation discipline. Observability investment. SRE practices. Architecture review cadence.
Section 5 — Industry + regulatory context (4 items)
Regulatory burden self-classification. Compliance-frame age (does the team work in established regulatory frames or emerging ones?). Industry-specific tooling adoption.
Section 6 — Outcomes (6 items)
Self-reported delivery velocity (qualitative). Self-reported quality (qualitative). NPS-style satisfaction with current delivery process. Burnout self-rating (single-item proxy, not the validated MBI; see /research/engineering-burnout-and-process-debt-2026 for the validated study).
Section 7 — Firmographics + consent (4 items)
Geographic region. Gender (optional). Opt-in for follow-up study. Dataset-release consent (CC-BY-4.0 with anonymised individual responses).
Attention checks
Item #14 ("Please select 'Strongly agree' for this item to confirm careful reading"), item #29 (a list-question where one option is obviously off-topic), item #42 (a paragraph-reading comprehension item).
Respondents failing ≥2 of 3 checks are screened out and replaced.
Recruitment
Prolific Academic panel arm
Target n = 650 completes via Prolific Academic. Effective CPI ~$3.50 USD per complete (higher than the sprint-estimation study because eng-leader screening narrows the eligible pool).
Organic top-up arm
Target n = 150 completes via:
- Stride newsletter
- LinkedIn organic from research team
- Engineering-leader communities (Rands Leadership Slack, Lead Dev, EM-adjacent newsletters)
- Industry-partner referrals
Organic responses carry a separate stratum flag in the dataset.
Pilot wave
N = 60 panel respondents + 8 think-aloud sessions ($75 honorarium, 30 minutes each) before the main field opens. Iteration window 14 days; any DORA-wording item flagged as confusing in ≥3 of the 8 think-aloud sessions is reworded (with the change documented as a deviation from DORA's verbatim wording).
Statistical methods
Effect sizes, not just p-values
- Cohen's h for proportion comparisons.
- Cohen's d + Hedges' g for continuous magnitudes.
- Cohen's f² for hierarchical regression effect-size comparisons.
- Wilson 95% CIs on every quoted percentage.
- Bootstrap 95% CIs (10,000 iterations) on continuous magnitudes.
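The two interval types in the list above can be sketched with the standard library alone. The Wilson formula is standard; the bootstrap variant shown is the simple percentile method, which is an assumption because the text does not say which bootstrap interval is used. Function names and sample values are illustrative.

```python
import math
import random
import statistics


def wilson_ci(successes: int, n: int, z: float = 1.959964):
    """Wilson score interval for a proportion (z = 1.96 gives ~95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def bootstrap_ci(values, stat=statistics.fmean, iterations=10_000, seed=0):
    """Percentile bootstrap 95% CI for a statistic (assumed method)."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(values)
    estimates = sorted(stat(rng.choices(values, k=n)) for _ in range(iterations))
    lower = estimates[int(0.025 * iterations)]
    upper = estimates[int(0.975 * iterations) - 1]
    return lower, upper


# Hypothetical inputs: 120 of 800 respondents, and eight made-up magnitudes.
print(wilson_ci(120, 800))
print(bootstrap_ci([2, 4, 4, 5, 7, 9, 10, 12]))
```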
Practice-maturity composite scoring
The 12-item practice-maturity composite (§4) is scored as the unit-weighted sum of standardised item scores (z-score of each item, summed). Cronbach's α on the composite is pre-registered as the reliability check; α < 0.7 in the pilot would trigger item revision.
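A minimal sketch of the scoring and the reliability check follows. It assumes items are stored column-wise (one list per item), uses sample standard deviations for the z-scores, and computes Cronbach's α on the raw item scores; all three are assumptions, as are the function names and the toy data (three items instead of twelve).

```python
import statistics


def zscores(column):
    """Standardise one item; raises ZeroDivisionError if the item has no variance."""
    mu = statistics.fmean(column)
    sd = statistics.stdev(column)
    return [(x - mu) / sd for x in column]


def composite_scores(items):
    """Unit-weighted composite: sum of z-scored items for each respondent."""
    standardised = [zscores(col) for col in items]
    n_respondents = len(items[0])
    return [sum(col[i] for col in standardised) for i in range(n_respondents)]


def cronbach_alpha(items):
    """Cronbach's alpha from item variances and the variance of the raw total score."""
    k = len(items)
    item_variance_sum = sum(statistics.variance(col) for col in items)
    totals = [sum(col[i] for col in items) for i in range(len(items[0]))]
    return k / (k - 1) * (1.0 - item_variance_sum / statistics.variance(totals))


# Toy data: 3 items x 6 respondents on a 1-5 scale (illustration only).
toy_items = [
    [1, 2, 3, 4, 5, 3],
    [2, 2, 3, 5, 4, 3],
    [1, 3, 2, 4, 5, 4],
]
alpha = cronbach_alpha(toy_items)
print(composite_scores(toy_items))
print(f"alpha = {alpha:.2f}; item revision triggered: {alpha < 0.7}")
```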
Multiple-comparison correction
The planned hypothesis family (H1–H4) is corrected with Benjamini–Hochberg FDR at q = 0.05. Exploratory cross-tabs are reported with explicit "exploratory" framing.
Sensitivity analysis
Before publishing Volume 1, we run three sensitivity analyses on each headline finding:
- With and without the organic stratum.
- With and without respondents who passed the attention checks but completed the survey in ≤6 minutes.
- With three alternative practice-maturity scorings (unit-weighted, factor-loading-weighted, latent-class).
Any headline number that flips direction under any sensitivity condition is flagged inline.
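The flagging rule amounts to a sign comparison between the headline estimate and each sensitivity re-estimate. The sketch below is one simple reading of "flips direction"; the condition names and estimates are hypothetical.

```python
def direction_flips(headline: float, variants: dict) -> list:
    """Return the sensitivity conditions whose estimate has the opposite sign to the headline."""
    return [name for name, estimate in variants.items() if estimate * headline < 0]


# Hypothetical effect estimates under each sensitivity condition.
flagged = direction_flips(
    headline=0.12,
    variants={
        "without_organic_stratum": 0.10,
        "without_fast_completers": -0.03,
        "factor_loading_weighted": 0.08,
        "latent_class": 0.11,
    },
)
print(flagged)
```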
Reproducibility
The Volume 0 landscape figures are reproducible today. A Jupyter notebook at apps/platform/research/2026/reproducibility/dora/landscape-charts.ipynb loads the four published-data CSVs and re-renders every Volume 0 figure (quartile distribution 2018–2024, AI adoption by quartile, metric thresholds card, literature timeline). Each CSV row carries a source_citation_url so a peer-reviewer can trace every plotted point.
The Volume 1 analysis pipeline will ship as a separate notebook at the same path on the publish date. CC-BY-4.0.
Dataset publication (Volume 1)
When Volume 1 lands, the survey dataset publishes under CC-BY-4.0 as a single ZIP bundle + Zenodo DOI.
Bundle contents:
- responses.csv: one row per respondent, anonymised, no PII.
- quartile-classification.csv: per-respondent quartile classifications for both current and 12-month-prior periods.
- practice-maturity-scores.csv: per-respondent composite score + per-item raw scores.
- cross-tabs/: pre-computed cross-tabs (industry × quartile, team-size × quartile, AI-depth × quartile, regulated × quartile, practice-maturity × quartile).
- data-dictionary.json: JSON-Schema-compatible data dictionary.
The integrity hash (SHA-256) of the bundle appears on the report page so reproducibility-conscious readers can verify their copy.
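Readers who want to check their copy could do so along these lines. The bundle filename and the published hash are not stated in the text, so both appear here only as placeholders in comments.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: str, published_hex: str) -> bool:
    """Compare the local digest with the hash copied from the report page."""
    return sha256_of_file(path) == published_hex.strip().lower()


# Usage (placeholders; substitute the real bundle path and the hash from the report page):
# matches_published_hash("path/to/bundle.zip", "<hash from the report page>")
```

Reading in chunks keeps memory use flat regardless of bundle size.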
Vendor-neutrality posture
Stride is a software-delivery platform. We publish this study with the same disciplines as the rest of the 2026 research series:
- DORA wording verbatim for all four quartile-classification questions. Any deviation is documented.
- Stride is not named in survey items except in §3 (where respondents are asked about their AI-tool stack, alphabetically listed).
- Comparison tables in Volume 1 cite published study results, not Stride product claims.
- Editorial owner has final cut, not GTM. No marketing voice in the body.
- The dataset publishes whether the findings flatter Stride or not. A null result on H4 ships. A finding that AI raises CFR in non-elite cohorts ships. A finding that Stride's own customers cluster in the medium quartile ships.
A study that buries inconvenient findings dies in one news cycle. A study that publishes them is the one that gets cited five years later.