False Positives

Metricuno
May 17, 2026
Quick answer

A false positive is an A/B test that declares a winner when nothing actually changed. Here's why the 5% threshold lies to you once you run more than one test — and how to fix it.

Definition
Statistical Analysis

False Positives

A false positive is an A/B test that declares a winner when there is no real underlying effect — the type I error.

In statistical analysis, a false positive happens when the data crosses your significance threshold by chance alone. The variant didn't actually beat control; random noise just lined up that way during the test window. With the conventional 5% significance level, you accept this risk on every test you run — meaning roughly one in twenty true-null tests will look like a winner.

The trap is that false positives compound. Run ten parallel tests on the same store and, even if none of the variants does anything, the probability that at least one looks like a winner climbs above 40%. That's why multiple-comparison corrections, pre-registered hypotheses, and replication exist.

Also known as
Type I error
Alpha error
False discovery

Every A/B test starts with a null hypothesis: the variant has no effect on conversion rate. The significance threshold (alpha, usually 0.05) is the probability you're willing to accept of rejecting that null hypothesis when it's actually true. Cross the threshold and you call a winner, but when the variant truly has no effect, you will still cross it 5% of the time.

That 5% sounds tolerable for a single test. It stops being tolerable the moment you run a roadmap. Shipping a fake winner means you change the site based on nothing, and you can't tell the difference from a real winner until the next quarter's revenue numbers come in flat.

Formula

FWER = 1 - (1 - alpha)^k

Variables

FWER: Family-wise error rate. Probability of at least one false positive across the whole test family.

alpha: Per-test significance threshold. Typically 0.05.

k: Number of independent tests or comparisons. Tests run in the same family, e.g. a multi-variant test or a quarterly roadmap.

Worked example

An apparel store runs 10 independent A/B tests in a quarter, each at alpha = 0.05.

alpha: 0.05

k: 10

FWER ≈ 0.401 (40.1%)

Even if none of the ten tests has a real effect, there's a 40% chance at least one will look significant. Without correction, roughly two of every five quarterly roadmaps will ship a fake winner.
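The worked example can be reproduced in a few lines. This is a minimal sketch of the formula above; the function names are illustrative and not part of any particular testing tool, and the calculation assumes the tests are independent and that none has a real effect.

```python
def family_wise_error_rate(alpha: float, k: int) -> float:
    """Probability of at least one false positive across k independent null tests."""
    return 1 - (1 - alpha) ** k


def expected_false_winners(alpha: float, k: int) -> float:
    """Average number of null tests that cross the threshold by chance."""
    return alpha * k


alpha, k = 0.05, 10
print(f"FWER = {family_wise_error_rate(alpha, k):.3f}")                    # FWER = 0.401
print(f"Expected false winners = {expected_false_winners(alpha, k):.2f}")  # Expected false winners = 0.50
```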

The fix isn't to stop testing — it's to adjust how you read significance. Bonferroni correction divides alpha by the number of comparisons (0.05 / 10 = 0.005 per test). Benjamini-Hochberg controls the false discovery rate instead, which is less punishing on test velocity. Both reduce false positives at the cost of needing larger samples to detect real effects.
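To make the difference between the two corrections concrete, here is a small sketch of both procedures in plain Python. The p-values are made up for illustration, and the function names are hypothetical; a production setup would more likely rely on a statistics library than on hand-rolled code.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject only p-values at or below alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Step-up procedure that controls the false discovery rate at alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank whose p-value sits under its own threshold.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below that cutoff.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected


# Hypothetical p-values from ten tests in one quarter.
p_values = [0.001, 0.008, 0.012, 0.03, 0.04, 0.2, 0.35, 0.5, 0.7, 0.9]
print("Uncorrected winners:", sum(p <= 0.05 for p in p_values))      # 5
print("Bonferroni winners:", sum(bonferroni(p_values)))              # 1
print("Benjamini-Hochberg winners:", sum(benjamini_hochberg(p_values)))  # 3
```

On this made-up data, Bonferroni keeps one winner and Benjamini-Hochberg keeps three, which illustrates why the latter costs less test velocity.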

Benchmark

How the family-wise false-positive rate grows with the number of tests (alpha = 0.05, no correction)

Tests run in family | Probability of ≥1 false positive | Expected false winners | Risk level
1 | 5.0% | 0.05 | Baseline
3 | 14.3% | 0.15 | Low
5 | 22.6% | 0.25 | Moderate
10 | 40.1% | 0.50 | High
20 | 64.2% | 1.00 | Severe
50 | 92.3% | 2.50 | Critical
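The probabilities in this benchmark can be sanity-checked by simulation rather than taken on faith. The sketch below assumes every test is a true null, so each p-value is drawn uniformly between 0 and 1, and counts how often a family of k tests contains at least one value under alpha. The function name, run count, and seed are arbitrary choices for illustration.

```python
import random


def simulated_fwer(k, alpha=0.05, families=20_000, seed=0):
    """Share of simulated test families with at least one chance 'winner'."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(families):
        # Under a true null, a p-value is uniform on [0, 1).
        if any(rng.random() < alpha for _ in range(k)):
            hits += 1
    return hits / families


for k in (1, 3, 5, 10, 20, 50):
    formula = 1 - (1 - 0.05) ** k
    print(f"k={k:>2}  formula={formula:.3f}  simulated={simulated_fwer(k):.3f}")
```

The simulated column should land close to the formula column, with small differences from sampling noise.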

Two operational habits cut false-positive risk more than any correction formula. First, never call a test before it hits its pre-calculated sample size; peeking and stopping at the first significant reading can push your real alpha to several times the nominal 5%. Second, replicate suspicious wins on a fresh audience before rolling out site-wide; a real effect survives replication, a fluke does not.
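The cost of peeking can be illustrated with a simplified simulation. This sketch assumes an A/A-style test with no real effect, equal traffic every day, and a z-statistic that behaves like a normalized random walk; real traffic is messier, so treat the output as indicative only. The function name, number of looks, and run count are illustrative assumptions.

```python
import math
import random

Z_CRIT = 1.96  # two-sided critical value for alpha = 0.05


def false_positive_rates(looks=14, runs=20_000, seed=0):
    """Return (rate when stopping at any significant look, rate at final look only)."""
    rng = random.Random(seed)
    peeking_hits = 0
    fixed_hits = 0
    for _ in range(runs):
        total = 0.0
        crossed = False
        z = 0.0
        for day in range(1, looks + 1):
            total += rng.gauss(0.0, 1.0)   # one day's worth of pure noise
            z = total / math.sqrt(day)     # z-statistic after `day` days
            if abs(z) > Z_CRIT:
                crossed = True             # a daily peeker would have stopped here
        peeking_hits += crossed
        fixed_hits += abs(z) > Z_CRIT      # only the final, planned look counts
    return peeking_hits / runs, fixed_hits / runs


peeking, fixed = false_positive_rates()
print(f"Daily peeking: {peeking:.1%}   Fixed horizon: {fixed:.1%}")
```

Under these assumptions the fixed-horizon rate stays near 5%, while the daily-peeking rate comes out several times higher.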

Frequently asked

False positives in A/B testing — FAQ

A false positive (type I error) declares a winner when no real effect exists. A false negative (type II error) misses a real effect because the test lacked statistical power. Alpha controls the first; sample size and effect size control the second.

The 0.05 threshold is a convention that traces back to R.A. Fisher's work in the 1920s, not a law of nature. Many teams use 0.10 for low-risk UI tests and 0.01 for high-stakes checkout changes. The right alpha depends on the cost of shipping a fake winner versus the cost of slower iteration.

A/A tests can help detect a false-positive problem, indirectly. They split traffic between two identical experiences. If your platform flags them as 'significant' more than ~5% of the time, your significance engine is leaking, usually from peeking, sample-ratio mismatch, or broken randomisation.

Bonferroni is stricter and best for small test families where any single false positive is costly. Benjamini-Hochberg controls the proportion of false discoveries and is better suited to high-velocity roadmaps where you'd rather catch most real winners and tolerate a few fakes.

Bayesian methods do not eliminate false positives. They replace p-values with probability-of-being-best estimates, but the underlying noise is the same. You can still declare a winner that doesn't replicate. Bayesian frameworks just make the trade-offs more explicit.

Checking results repeatedly and stopping the moment significance appears effectively runs many tests instead of one. Empirically, peeking daily on a two-week test can push the true false-positive rate from 5% to 15-25% depending on traffic shape.

Multivariate tests (MVT) raise false-positive risk because every combination is a comparison. A 2×2×2 MVT has 8 variants and 28 pairwise comparisons. Without correction, and treating those comparisons as independent, the family-wise false-positive rate exceeds 75%. Most teams treat MVT outputs as exploratory, then confirm winners with a clean A/B.
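The arithmetic behind the 2×2×2 figure is short enough to check directly. This sketch treats the pairwise comparisons as independent, which is only an approximation because they share variants, so the result is a rough upper-end estimate rather than an exact rate.

```python
import math
from itertools import product

levels_per_factor = (2, 2, 2)  # three page elements, two versions each
variants = list(product(*(range(n) for n in levels_per_factor)))
pairwise = math.comb(len(variants), 2)  # every variant compared against every other
fwer = 1 - (1 - 0.05) ** pairwise

print(len(variants), pairwise, round(fwer, 3))  # 8 28 0.762
```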

Statistically, a p-value just below 0.05 is not meaningfully different from one just above it; the underlying evidence is nearly identical. Treat anything in the 0.04-0.06 band as inconclusive rather than decisive. The hard threshold is a decision rule, not a signal of certainty.

Sample size doesn't change your false-positive rate — alpha does. Larger samples reduce false negatives by giving you the power to detect small real effects. A common misconception is that 'more traffic' makes results more trustworthy in the type I sense; it doesn't.
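A quick simulation can illustrate that point. The sketch below runs repeated A/A comparisons, where both arms share the same true conversion rate, at two sample sizes and applies a standard two-proportion z-test. The 10% base rate, the sample sizes, and the function names are assumptions chosen for illustration.

```python
import math
import random


def two_sided_p_value(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test with a pooled standard error."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation at all, so nothing to detect
    z = (conv_b / n_b - conv_a / n_a) / se
    return math.erfc(abs(z) / math.sqrt(2))


def false_positive_rate(n_per_arm, base_rate=0.10, runs=2_000, alpha=0.05, seed=0):
    """Share of A/A tests flagged as significant at the given sample size."""
    rng = random.Random(seed)

    def conversions():
        return sum(rng.random() < base_rate for _ in range(n_per_arm))

    hits = 0
    for _ in range(runs):
        p = two_sided_p_value(conversions(), n_per_arm, conversions(), n_per_arm)
        hits += p < alpha
    return hits / runs


for n in (200, 1000):
    print(f"n per arm = {n}: false-positive rate = {false_positive_rate(n):.3f}")
```

Both sample sizes should produce a rate in the neighbourhood of 5%, since that rate is set by alpha rather than by traffic.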

The experimentation engine uses sequential testing math that's valid under peeking, flags sample-ratio mismatch automatically, and applies false-discovery-rate correction when you run multiple tests in the same family. You see honest significance, not an inflated version of it.
