False Positives

A false positive is an A/B test that declares a winner when nothing actually changed. Here's why the 5% threshold lies to you once you run more than one test — and how to fix it.
A false positive is an A/B test that declares a winner when there is no real underlying effect — the type I error.
In statistical analysis, a false positive happens when the data crosses your significance threshold by chance alone. The variant didn't actually beat control; random noise just lined up that way during the test window. With the conventional 5% significance level, you accept this risk on every test you run — meaning roughly one in twenty true-null tests will look like a winner.
The trap is that false positives compound. Run ten parallel tests on the same store and the probability that at least one is a fake winner climbs above 40%. That's why multiple-comparison corrections, pre-registered hypotheses, and replication exist.
Every A/B test starts with a null hypothesis: the variant has no effect on conversion rate. The significance threshold (alpha, usually 0.05) is the probability you're willing to accept of rejecting that null hypothesis when it's actually true. Cross the threshold and you call a winner, but among tests where the variant truly does nothing, 5% will cross it anyway.
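To make the decision rule concrete, here is a minimal sketch of how a p-value gets compared against alpha. It assumes a pooled two-proportion z-test, which is one common choice and not necessarily what any given platform runs; the function name and the visitor and conversion counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the null hypothesis that both arms share one conversion rate."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Under the null, both arms are draws from the same pooled rate.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 1.0  # zero or total conversion in both arms: no evidence either way
    z = (rate_b - rate_a) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))


ALPHA = 0.05  # per-test significance threshold

# Hypothetical counts: control converts 300 of 10,000, variant 345 of 10,000.
p = two_proportion_p_value(conv_a=300, n_a=10_000, conv_b=345, n_b=10_000)
print(f"p = {p:.4f}, call a winner: {p < ALPHA}")
```

The only thing alpha does here is draw the line in the final comparison. It says nothing about whether a result that crosses the line is real.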
That 5% sounds tolerable for a single test. It stops being tolerable the moment you run a roadmap. Shipping a fake winner means you change the site based on nothing, and you can't tell the difference from a real winner until the next quarter's revenue numbers come in flat.
FWER = 1 - (1 - alpha)^k
FWER: family-wise error rate, the probability of at least one false positive across the whole test family.
alpha: per-test significance threshold, typically 0.05.
k: number of independent tests or comparisons run in the same family, e.g. a multi-variant test or a quarterly roadmap.
An apparel store runs 10 independent A/B tests in a quarter, each at alpha = 0.05.
With alpha = 0.05 and k = 10: FWER = 1 - (1 - 0.05)^10 ≈ 0.401 (40.1%)
Even if none of the ten tests has a real effect, there's a 40% chance at least one will look significant. Without correction, roughly two of every five quarterly roadmaps will ship a fake winner.
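The arithmetic is easy to check yourself. A minimal sketch, assuming the tests are independent and every null hypothesis is true (the function names are illustrative):

```python
def family_wise_error_rate(alpha, k):
    """Probability of at least one false positive across k independent true-null tests."""
    return 1 - (1 - alpha) ** k


def expected_false_winners(alpha, k):
    """Average number of false positives across k true-null tests."""
    return alpha * k


for k in (1, 3, 5, 10, 20, 50):
    fwer = family_wise_error_rate(0.05, k)
    fakes = expected_false_winners(0.05, k)
    print(f"{k:>2} tests: FWER = {fwer:.1%}, expected false winners = {fakes:.2f}")
```

If your tests share traffic or metrics they are not fully independent, so treat the output as a guide rather than an exact figure.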
The fix isn't to stop testing — it's to adjust how you read significance. Bonferroni correction divides alpha by the number of comparisons (0.05 / 10 = 0.005 per test). Benjamini-Hochberg controls the false discovery rate instead, which is less punishing on test velocity. Both reduce false positives at the cost of needing larger samples to detect real effects.
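Both corrections fit in a few lines. The sketch below uses ten made-up p-values purely for illustration; in practice you would more likely reach for a statistics library than hand-roll this.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject only p-values at or below alpha divided by the number of comparisons."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Step-up procedure that controls the false discovery rate at level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank r (1-based) whose p-value is <= (r / m) * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    # Everything ranked at or below the cutoff is declared significant.
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values from a ten-test roadmap.
p_values = [0.001, 0.004, 0.012, 0.019, 0.030, 0.045, 0.210, 0.350, 0.600, 0.900]

print("Uncorrected winners:       ", sum(p <= 0.05 for p in p_values))
print("Bonferroni winners:        ", sum(bonferroni(p_values)))
print("Benjamini-Hochberg winners:", sum(benjamini_hochberg(p_values)))
```

On these numbers Bonferroni keeps fewer winners than Benjamini-Hochberg, which is the velocity trade-off in miniature.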
How the family-wise false-positive rate grows with the number of tests (alpha = 0.05, no correction)
| Tests run in family | Probability of ≥1 false positive | Expected false winners | Risk level |
|---|---|---|---|
| 1 | 5.0% | 0.05 | Baseline |
| 3 | 14.3% | 0.15 | Low |
| 5 | 22.6% | 0.25 | Moderate |
| 10 | 40.1% | 0.50 | High |
| 20 | 64.2% | 1.00 | Severe |
| 50 | 92.3% | 2.50 | Critical |
Two operational habits cut false-positive risk more than any correction formula. First, never call a test before it hits its pre-calculated sample size: peeking and stopping early can more than double your real alpha. Second, replicate suspicious wins on a fresh audience before rolling out site-wide; a real effect survives replication, a fluke does not.
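You can see the peeking penalty in a quick simulation. This is a simplified sketch, not a model of any particular platform: it assumes every day contributes the same amount of data, there is no real effect, and the test statistic is a plain z-score checked against 1.96 after each day.

```python
import random
from math import sqrt

Z_CRIT = 1.96  # two-sided critical value for alpha = 0.05


def simulate_null_test(looks, rng):
    """Run one no-effect test; return (significant at any look, significant at final look)."""
    cumulative = 0.0
    hit_any = False
    z = 0.0
    for look in range(1, looks + 1):
        cumulative += rng.gauss(0.0, 1.0)  # one day of pure noise, standardised
        z = cumulative / sqrt(look)        # z-score on all data collected so far
        if abs(z) > Z_CRIT:
            hit_any = True                 # a peeker would have stopped here
    return hit_any, abs(z) > Z_CRIT


def false_positive_rates(looks=14, trials=20_000, seed=0):
    """Compare 'stop at first significant look' against 'read once at the end'."""
    rng = random.Random(seed)
    peeking = fixed = 0
    for _ in range(trials):
        any_hit, final_hit = simulate_null_test(looks, rng)
        peeking += any_hit
        fixed += final_hit
    return peeking / trials, fixed / trials


peeking_rate, fixed_rate = false_positive_rates()
print(f"Read once at day 14: {fixed_rate:.1%} false positives")
print(f"Peek daily and stop: {peeking_rate:.1%} false positives")
```

The single read at the end should land near the nominal 5%, while the daily peeker ends up well above it, even though both are looking at exactly the same noise.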
False positives in A/B testing — FAQ
A false positive (type I error) declares a winner when no real effect exists. A false negative (type II error) misses a real effect because the test lacked statistical power. Alpha controls the first; sample size and effect size control the second.
The 0.05 threshold is a convention from R.A. Fisher's work in the 1920s, not a law of nature. Many teams use 0.10 for low-risk UI tests and 0.01 for high-stakes checkout changes. The right alpha depends on the cost of shipping a fake winner versus the cost of slower iteration.
A/A tests can help you detect false-positive problems, indirectly. They split traffic between two identical experiences. If your platform flags them as 'significant' more than ~5% of the time, your significance engine is leaking, usually from peeking, sample-ratio mismatch, or broken randomisation.
Bonferroni is stricter and best for small test families where any single false positive is costly. Benjamini-Hochberg controls the proportion of false discoveries and is better suited to high-velocity roadmaps where you'd rather catch most real winners and tolerate a few fakes.
Bayesian methods do not eliminate false positives. They replace p-values with probability-of-being-best estimates, but the underlying noise is the same. You can still declare a winner that doesn't replicate. Bayesian frameworks just make the trade-offs more explicit.
Checking results repeatedly and stopping the moment significance appears effectively runs many tests instead of one. Empirically, peeking daily on a two-week test can push the true false-positive rate from 5% to 15-25% depending on traffic shape.
Multivariate tests (MVT) carry extra false-positive risk because every combination is a comparison. A 2×2×2 MVT has 8 variants and 28 pairwise comparisons. Without correction, and treating those comparisons as independent, the family-wise false-positive rate exceeds 75%. Most teams treat MVT outputs as exploratory, then confirm winners with a clean A/B test.
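The MVT numbers follow from basic counting. A quick sketch, with one caveat: pairwise comparisons share data across variants, so plugging them into the independent-tests formula gives an approximation rather than an exact rate.

```python
from math import comb

ALPHA = 0.05
factors = 3            # a 2x2x2 test has three factors...
levels_per_factor = 2  # ...with two levels each

variants = levels_per_factor ** factors          # every combination is a variant
pairwise_comparisons = comb(variants, 2)         # every unordered pair of variants
fwer = 1 - (1 - ALPHA) ** pairwise_comparisons   # independence is an approximation here

print(f"{variants} variants, {pairwise_comparisons} pairwise comparisons")
print(f"Approximate family-wise false-positive rate: {fwer:.1%}")
```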
Statistically, a p-value just below 0.05 is not meaningfully different from one just above it: the underlying evidence is nearly identical. Treat anything in the 0.04-0.06 band as inconclusive rather than decisive. The hard threshold is a decision rule, not a signal of certainty.
Sample size doesn't change your false-positive rate — alpha does. Larger samples reduce false negatives by giving you the power to detect small real effects. A common misconception is that 'more traffic' makes results more trustworthy in the type I sense; it doesn't.
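To see how the two knobs separate, here is a rough sample-size sketch using the standard normal-approximation formula for comparing two proportions. The 3% baseline, 3.6% target, and 80% power are hypothetical inputs, and real planning tools may use slightly different formulas.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(base_rate, variant_rate, alpha=0.05, power=0.80):
    """Approximate visitors needed per arm for a two-sided two-proportion test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # set by the false-positive budget
    z_power = NormalDist().inv_cdf(power)          # set by the false-negative budget
    variance = base_rate * (1 - base_rate) + variant_rate * (1 - variant_rate)
    effect = variant_rate - base_rate
    return ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


# Hypothetical store: 3% baseline conversion, hoping to detect a lift to 3.6%.
standard = sample_size_per_arm(0.03, 0.036, alpha=0.05)
bonferroni_adjusted = sample_size_per_arm(0.03, 0.036, alpha=0.005)

print(f"alpha = 0.05:  {standard:,} visitors per arm")
print(f"alpha = 0.005: {bonferroni_adjusted:,} visitors per arm")
```

Alpha goes in as an input and stays whatever you set it to. Sample size is what you pay to keep power up, and a stricter alpha makes that bill larger.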
The experimentation engine uses sequential testing math that's valid under peeking, flags sample-ratio mismatch automatically, and applies false-discovery-rate correction when you run multiple tests in the same family. You see honest significance, not an inflated version of it.