Experiment Validity

Experiment validity is whether an A/B test result is both trustworthy (internal) and generalisable (external). Here's how to spot the threats that quietly inflate winners.
Whether an A/B test result is both trustworthy in its own right and likely to hold up beyond the conditions it was measured under.
Experiment validity is the bar a test result has to clear before you act on it. It splits into two questions. Internal validity asks whether the lift you measured is real: no tracking bugs, no sample-ratio mismatch, no confounding campaign running in parallel. External validity asks whether that real lift will repeat: for other traffic sources, other devices, other seasons, other countries, and after the novelty of the change has faded.
A test can be internally valid and externally useless (a Black Friday winner that flops in March), or externally plausible but internally broken (a clean directional story sitting on top of a logging bug). You need both before a result earns a rollout.
Internal validity is about the inside of the test. Did the randomiser split traffic 50/50 as expected? Did both variants fire the same events the same way? Did a deploy, a price change, or a paid push hit one variant harder than the other during the window? A "no" to either of the first two questions, or a "yes" to the third, means the lift you're reading is partly caused by something other than the change you intended to measure.
External validity is about everything outside the test. A 7% checkout lift measured over two weeks on desktop visitors from Google may not survive contact with Meta-sourced mobile shoppers in Q4. The further the rollout audience drifts from the test audience, the more the original result is a guess rather than a measurement, and that is where the statistical analysis you ran stops protecting you.
Adjusted Lift = Observed Lift × (1 − Validity Discount)
Observed Lift (measured relative change): The raw % change in your primary metric between control and variant.
Validity Discount (threat haircut): A 0–1 adjustment reflecting known internal and external threats (SRM, short window, narrow audience, novelty).
Adjusted Lift (decision-grade lift): What you should plug into the business case for rollout.
An apparel Shopify store runs a new product-detail-page layout for 14 days. Observed lift on add-to-cart is +9.2%. The test ran only on desktop traffic during a Klaviyo flow refresh, so external validity is shaky — apply a 35% discount.
Observed Lift: 9.2%
Validity Discount: 0.35
→ Adjusted Lift ≈ 6.0%
Still worth rolling out, but the business case should be built on ~6%, not 9.2%. If finance signed off on the higher number, you'd miss the forecast even when the test 'won'.
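The arithmetic in the example can be reproduced with a few lines of Python. This is a minimal sketch: the function name `adjusted_lift` is illustrative, and the discount is still a judgment call that you supply, not something the code derives.

```python
def adjusted_lift(observed_lift: float, validity_discount: float) -> float:
    """Scale an observed lift down by a validity discount between 0 and 1."""
    if not 0.0 <= validity_discount <= 1.0:
        raise ValueError("validity_discount must be between 0 and 1")
    return observed_lift * (1.0 - validity_discount)


# Figures from the apparel store example: +9.2% observed, 35% discount.
print(round(adjusted_lift(9.2, 0.35), 2))  # 5.98
```

The lift goes in and comes out in percentage points, so 5.98 is the "~6%" the business case should use.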
The discount isn't a formula you look up; it's an honest haircut you take based on the threats you can name. The table below lists the threats that actually show up in DTC experimentation programmes, ordered by severity if missed, with how often each one bites in practice.
Common validity threats in DTC experimentation, by frequency and severity
| Threat | Type | How often it bites | Severity if missed |
|---|---|---|---|
| Sample-ratio mismatch (SRM) | Internal | 1 in 6 tests | Critical — invalidates result |
| Tracking / event-firing bug | Internal | 1 in 8 tests | Critical |
| Novelty or primacy effect | External | 1 in 4 tests | High — inflates 1–2 week lifts |
| Concurrent campaign or price change | Internal | 1 in 5 tests | High |
| Audience too narrow (1 device, 1 channel) | External | 1 in 3 tests | Medium — limits rollout |
| Seasonality / promo window | External | 1 in 4 tests | Medium |
| Peeking and early stopping | Internal | 1 in 3 tests | Medium — false positives |
| Underpowered (too few conversions) | Internal | 1 in 2 tests | Medium — noisy lift |
Notice the split: the most severe internal threats (SRM, tracking bugs) are catastrophic but the rarest on the list, and detectable with the right checks. External threats are subtler: the test is technically fine, you just measured the wrong slice of reality. Both belong in the test readout, not in the post-mortem six weeks later when the rollout underperforms.
Frequently asked questions
What is the difference between internal and external validity?
Internal validity asks whether the lift inside your test is real and caused by your change. External validity asks whether that lift will repeat for other audiences, devices, channels, and time periods. A result needs both before it earns a rollout.
Isn't statistical significance enough to trust a result?
Statistical significance only tells you the difference between variants is unlikely to be random noise. It doesn't tell you whether your randomisation worked, whether tracking fired correctly, or whether the result generalises. Validity is the wider check that statistical analysis sits inside.
What is sample-ratio mismatch (SRM)?
SRM is when your intended 50/50 split delivers a ratio that differs by more than chance allows, say 48/52 across tens of thousands of visitors. It is a sign the randomiser, redirect, or tracking is broken. Any lift measured on top of an unbalanced split is suspect, because the two groups were never comparable in the first place.
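One common way to check for SRM is a chi-square goodness-of-fit test on the visitor counts. The sketch below uses only the standard library. The function name, the visitor counts, and the 0.001 alert threshold are assumptions for illustration, not values from this article.

```python
import math


def srm_p_value(control_visitors: int, variant_visitors: int,
                expected_control_share: float = 0.5) -> float:
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for a two-arm split."""
    total = control_visitors + variant_visitors
    expected_control = total * expected_control_share
    expected_variant = total - expected_control
    chi2 = ((control_visitors - expected_control) ** 2 / expected_control
            + (variant_visitors - expected_variant) ** 2 / expected_variant)
    # With 1 degree of freedom the chi-square tail probability reduces to erfc.
    return math.erfc(math.sqrt(chi2 / 2.0))


SRM_ALPHA = 0.001  # assumed alert threshold, pick your own

# The same 48/52 proportions at two hypothetical traffic volumes.
print(srm_p_value(48, 52) < SRM_ALPHA)          # False
print(srm_p_value(48_000, 52_000) < SRM_ALPHA)  # True
```

The same 48/52 ratio is unremarkable on 100 visitors and a clear alarm on 100,000, which is why the check should be run on counts, not percentages.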
How long should a test run?
At least one full business cycle: typically two weekly cycles for DTC, so 14 days minimum, and longer if your buying cycle is monthly or seasonal. Shorter windows over-index on whoever happened to visit that week and on the novelty of the change.
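The 14-day floor can be combined with a rough sample-size estimate so the window is long enough on both counts. The sketch below uses the textbook normal-approximation formula for a two-sided two-proportion test. The function names, the 5% significance level, the 80% power target, and the traffic figures are all assumptions for illustration.

```python
import math
from statistics import NormalDist


def visitors_per_variant(baseline_rate: float, relative_lift: float,
                         alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate visitors needed per arm to detect the given relative lift."""
    p1 = baseline_rate
    p2 = baseline_rate * (1.0 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1.0 - p1) + p2 * (1.0 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def planned_duration_days(baseline_rate: float, relative_lift: float,
                          daily_visitors_per_variant: int, min_days: int = 14) -> int:
    """Days to run, rounded up to whole weeks and never below min_days."""
    needed = visitors_per_variant(baseline_rate, relative_lift)
    days = math.ceil(needed / daily_visitors_per_variant)
    weeks = max(math.ceil(days / 7), math.ceil(min_days / 7))
    return weeks * 7


# Hypothetical store: 3% baseline conversion, hoping to detect a 10% relative lift,
# with 2,000 visitors per variant per day.
print(planned_duration_days(0.03, 0.10, 2_000))  # 28
```

Rounding up to whole weeks keeps every weekday equally represented in both variants, in line with the weekly-cycle advice above.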
Does a win in one segment carry over to segments that weren't in the test?
No. A win measured on desktop search traffic may not hold for paid social mobile users with very different intent and AOV. Either include those segments in the test or treat the rollout to them as a new hypothesis, not a confirmed result.
What is the novelty effect, and how do I detect it?
Returning visitors react to anything new, sometimes positively and sometimes negatively, and that reaction fades within 1–3 weeks. Detect it by splitting your test readout into new vs returning visitors, or by extending the window and checking whether the lift decays over time.
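The "does the lift decay" check can be sketched by computing the relative lift separately for each week of the test. The counts below are hypothetical, and `lift_is_decaying` is a deliberately crude heuristic that ignores sampling noise, so treat a positive result as a prompt to look closer, not as proof of a novelty effect.

```python
from typing import List, Tuple

# (control_visitors, control_conversions, variant_visitors, variant_conversions)
WeekCounts = Tuple[int, int, int, int]


def weekly_relative_lifts(weeks: List[WeekCounts]) -> List[float]:
    """Relative lift of variant over control, computed separately for each week."""
    lifts = []
    for c_vis, c_conv, v_vis, v_conv in weeks:
        control_rate = c_conv / c_vis
        variant_rate = v_conv / v_vis
        lifts.append(variant_rate / control_rate - 1.0)
    return lifts


def lift_is_decaying(lifts: List[float]) -> bool:
    """Crude heuristic: every week's lift is lower than the week before."""
    return len(lifts) >= 2 and all(b < a for a, b in zip(lifts, lifts[1:]))


# Hypothetical counts for a three-week test.
weeks = [
    (10_000, 300, 10_000, 345),  # week 1
    (10_000, 300, 10_000, 324),  # week 2
    (10_000, 300, 10_000, 309),  # week 3
]
lifts = weekly_relative_lifts(weeks)
print([round(x, 2) for x in lifts])  # [0.15, 0.08, 0.03]
print(lift_is_decaying(lifts))       # True
```

The same two functions work for the new vs returning split if you pass one tuple per visitor type instead of one per week and compare the lifts directly.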
Can a test with a validity problem be fixed after the fact?
Rarely. If SRM is present, tracking was broken, or a confounding campaign ran on one variant, the cleanest move is to fix the cause and re-run. Statistical adjustments after the fact are easy to bend, through motivated reasoning, toward whichever answer you wanted.
What should a test readout include?
Include an SRM check, a tracking sanity check (events per visitor in both variants), the segments tested, the segments NOT tested, and any known confounds during the window. State the validity discount you're applying and why before quoting the adjusted lift.
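The tracking sanity check can be as simple as comparing events per visitor across the two variants for an event the change is not expected to move, such as page views. In this sketch the function name, the counts, and the 5% tolerance are assumptions for illustration.

```python
def events_per_visitor_gap(control_events: int, control_visitors: int,
                           variant_events: int, variant_visitors: int) -> float:
    """Relative gap in events per visitor, variant versus control."""
    control_epv = control_events / control_visitors
    variant_epv = variant_events / variant_visitors
    return variant_epv / control_epv - 1.0


MAX_GAP = 0.05  # assumed tolerance, tune it to your own tracking noise

# Hypothetical page-view counts: the variant fires 50% more events per visitor,
# which points at a double-firing tag rather than a behaviour change.
gap = events_per_visitor_gap(41_000, 10_000, 61_500, 10_000)
print(round(gap, 2), abs(gap) > MAX_GAP)  # 0.5 True
```

A gap this size on a metric the change should not touch is a reason to audit the tags before reading any lift.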
Does a stricter significance threshold fix validity problems?
No. Tightening significance reduces false positives from random noise but does nothing about bugs, SRM, novelty, or audience mismatch. You can be 99.9% confident in a result that's 100% wrong because the tracking was broken.
How should validity feed into the rollout decision?
Treat the validity-adjusted lift as the number you forecast against, not the raw observed lift. If the adjusted lift still clears the business case, roll out. If it only clears once you ignore the threats, you're optimising for the readout, not the P&L.