Experiment Validity

Metricuno

May 17, 2026

Experiment validity decides whether an A/B test result is real and repeatable. Learn the internal and external threats that quietly inflate winners.

Quick answer

Experiment validity is whether an A/B test result is both trustworthy (internal) and generalisable (external). Here's how to spot the threats that quietly inflate winners.

Definition

Experimentation

Experiment Validity

Whether an A/B test result is both trustworthy in its own right and likely to hold up beyond the conditions it was measured under.

Experiment validity is the bar a test result has to clear before you act on it. It splits into two questions. Internal validity asks whether the lift you measured is real: no tracking bugs, no sample-ratio mismatch, no confounding campaign running in parallel. External validity asks whether that real lift will repeat: for other traffic sources, other devices, other seasons, other countries, and after the novelty of the change has faded.

A test can be internally valid and externally useless (a Black Friday winner that flops in March), or externally plausible but internally broken (a clean directional story sitting on top of a logging bug). You need both before a result earns a rollout.

Also known as

Test validity

Result validity

Internal validity is about the inside of the test. Did the randomiser split traffic 50/50 as expected? Did both variants fire the same events the same way? Did a deploy, a price change, or a paid push hit one variant harder than the other during the window? If the split was off, the events differed, or something external hit one variant harder, the lift you're reading is partly an artifact of something other than the change you intended to measure.
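The first of those questions can be checked mechanically. The sketch below is a minimal chi-square goodness-of-fit test for the traffic split, using only the Python standard library. The function name `srm_check`, the 0.001 alert threshold, and the visitor counts are illustrative assumptions, not values from any particular testing tool.

```python
import math


def srm_check(control_n, variant_n, expected_control_share=0.5, alpha=0.001):
    """Chi-square test (1 degree of freedom) for sample-ratio mismatch.

    Returns (chi2, p_value, flagged). `alpha` is an assumed alert
    threshold; pick one that suits your own programme.
    """
    total = control_n + variant_n
    expected_control = total * expected_control_share
    expected_variant = total - expected_control

    chi2 = ((control_n - expected_control) ** 2 / expected_control
            + (variant_n - expected_variant) ** 2 / expected_variant)

    # For 1 degree of freedom, the chi-square survival function
    # reduces to erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value, p_value < alpha


# Hypothetical counts: the same 48/52 ratio at two sample sizes.
print(srm_check(4800, 5200))  # chi2 = 16.0, flagged
print(srm_check(480, 520))    # chi2 = 1.6, not flagged
```

The same 48/52 ratio is flagged at 10,000 visitors but not at 1,000, which is why the check looks at counts rather than percentages alone.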

External validity is about everything outside the test. A 7% checkout lift measured over two weeks on desktop visitors from Google may not survive contact with Meta-sourced mobile shoppers in Q4. The further the rollout audience drifts from the test audience, the more the original result is a guess rather than a measurement, and that is where the statistical analysis you ran stops protecting you.

Formula

Adjusted Lift = Observed Lift × (1 − Validity Discount)

Variables

Observed Lift (measured relative change): The raw % change in your primary metric between control and variant.

Validity Discount (threat haircut): A 0–1 adjustment reflecting known internal and external threats (SRM, short window, narrow audience, novelty).

Adjusted Lift (decision-grade lift): What you should plug into the business case for rollout.

Worked example

An apparel Shopify store runs a new product-detail-page layout for 14 days. Observed lift on add-to-cart is +9.2%. The test ran only on desktop traffic during a Klaviyo flow refresh, so external validity is shaky — apply a 35% discount.

Observed Lift: 9.2%

Validity Discount: 0.35

→ Adjusted Lift = 9.2% × (1 − 0.35) = 5.98% ≈ 6.0%

Still worth rolling out, but the business case should be built on ~6%, not 9.2%. If finance signed off on the higher number, you'd miss the forecast even when the test 'won'.
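For readouts that repeat this arithmetic, the formula can be wrapped in a small helper. This is only a sketch: the function name `adjusted_lift` is hypothetical, and lifts are assumed to be passed as fractions (0.092 for 9.2%).

```python
def adjusted_lift(observed_lift: float, validity_discount: float) -> float:
    """Adjusted Lift = Observed Lift × (1 − Validity Discount)."""
    # The discount is defined as a 0–1 adjustment, so reject anything else.
    if not 0.0 <= validity_discount <= 1.0:
        raise ValueError("validity_discount must be between 0 and 1")
    return observed_lift * (1.0 - validity_discount)


# The worked example: 9.2% observed lift with a 35% discount.
lift = adjusted_lift(0.092, 0.35)
print(f"{lift:.1%}")  # 6.0%
```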

The discount isn't a value you look up; it's an honest haircut you take based on the threats you can name. The table below lists the threats that actually show up in DTC experimentation programmes, with how often each one bites and how severe it is if missed.

Benchmark

Common validity threats in DTC experimentation, by frequency and severity

Threat	Type	How often it bites	Severity if missed
Sample-ratio mismatch (SRM)	Internal	1 in 6 tests	Critical — invalidates result
Tracking / event-firing bug	Internal	1 in 8 tests	Critical
Novelty or primacy effect	External	1 in 4 tests	High — inflates 1–2 week lifts
Concurrent campaign or price change	Internal	1 in 5 tests	High
Audience too narrow (1 device, 1 channel)	External	1 in 3 tests	Medium — limits rollout
Seasonality / promo window	External	1 in 4 tests	Medium
Peeking and early stopping	Internal	1 in 3 tests	Medium — false positives
Underpowered (too few conversions)	Internal	1 in 2 tests	Medium — noisy lift

Notice the split: the critical internal threats (SRM and tracking bugs) are the least frequent in the table but catastrophic, and they are detectable with the right checks. External threats are subtler: the test is technically fine, you just measured the wrong slice of reality. Both belong in the test readout, not in the post-mortem six weeks later when the rollout underperforms.
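One readout check for the novelty threat is to see whether the lift decays week over week. The sketch below computes relative lift per week and flags a decay. The weekly counts are made up for illustration, and the function names and the 0.5 decay ratio are assumptions rather than an established rule.

```python
def relative_lift(control_conv, control_n, variant_conv, variant_n):
    """Relative change in conversion rate, variant vs control."""
    control_rate = control_conv / control_n
    variant_rate = variant_conv / variant_n
    return (variant_rate - control_rate) / control_rate


def lift_decays(weekly_lifts, decay_ratio=0.5):
    """True if a positive first-week lift has shrunk below
    decay_ratio times its size by the final week."""
    first, last = weekly_lifts[0], weekly_lifts[-1]
    return first > 0 and last < first * decay_ratio


# Hypothetical weekly counts:
# (control conversions, control visitors, variant conversions, variant visitors)
weeks = [
    (400, 10_000, 460, 10_000),
    (400, 10_000, 430, 10_000),
    (400, 10_000, 412, 10_000),
]

lifts = [relative_lift(*week) for week in weeks]
print([f"{lift:.1%}" for lift in lifts])
print("possible novelty effect:", lift_decays(lifts))
```

A decaying lift is a prompt to extend the window or split the readout by new vs returning visitors, not proof of novelty on its own, since weekly lifts on small samples are noisy.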

Frequently asked questions

What's the difference between internal and external validity?

Internal validity asks whether the lift inside your test is real and caused by your change. External validity asks whether that lift will repeat for other audiences, devices, channels, and time periods. A result needs both before it earns a rollout.

How is experiment validity different from statistical significance?

Statistical significance only tells you the difference between variants is unlikely to be random noise. It doesn't tell you whether your randomisation worked, whether tracking fired correctly, or whether the result generalises. Validity is the wider check that statistical analysis sits inside.

What is sample-ratio mismatch and why does it kill validity?

SRM is when your intended 50/50 split delivers a materially different ratio, say 48/52, by more than chance alone would explain. It is a sign the randomiser, redirect, or tracking is broken. Any lift measured on top of an unbalanced split is suspect, because the two groups were never comparable in the first place.

How long should a test run to be externally valid?

At least one full business cycle: typically two weekly cycles for DTC, so 14 days minimum, and longer if your buying cycle is monthly or seasonal. Shorter windows over-index on whoever happened to visit that week and on the novelty of the change.

Does a winning A/B test always generalise to my whole site?

No. A win measured on desktop search traffic may not hold for paid social mobile users with very different intent and AOV. Either include those segments in the test or treat the rollout to them as a new hypothesis, not a confirmed result.

What's the novelty effect and how do I detect it?

Returning visitors react to anything new, sometimes positively and sometimes negatively, and that reaction fades within 1–3 weeks. Detect it by splitting your test readout into new vs returning visitors, or by extending the window and checking whether the lift decays over time.

Can I fix a test with low internal validity after the fact?

Rarely. If SRM is present, tracking was broken, or a confounding campaign ran on one variant, the cleanest move is to fix the cause and re-run. Statistical adjustments after the fact invite motivated reasoning toward whichever answer you wanted.

How do I report validity in a test readout?

Include an SRM check, a tracking sanity check (events per visitor in both variants), the segments tested, the segments NOT tested, and any known confounds during the window. State the validity discount you're applying and why before quoting the adjusted lift.

Is a higher significance threshold (e.g. 99%) a substitute for validity?

No. Tightening significance reduces false positives from random noise but does nothing about bugs, SRM, novelty, or audience mismatch. You can be 99.9% confident in a result that's 100% wrong because the tracking was broken.

How does validity affect rollout decisions?

Treat the validity-adjusted lift as the number you forecast against, not the raw observed lift. If the adjusted lift still clears the business case, roll out. If it only clears once you ignore the threats, you're optimising for the readout, not the P&L.

Test ideas before you ship them

Run unlimited A/B tests, attach hypotheses to outcomes, and build a searchable archive of what works and what doesn't.

Launch your first experiment