Statistical Significance

Metricuno
May 17, 2026
Statistical Significance — What statistical significance means in A/B testing, how p-values work, and the thresholds CRO teams use to call a winner without fooling themselves.
Quick answer

Statistical significance tells you whether an A/B test result is larger than random noise alone would plausibly produce. Here's how to read p-values, pick a threshold, and avoid the common traps.

Definition
Statistical Analysis

Statistical Significance

A result is statistically significant when the observed lift between variants would be unlikely to appear if there were no real difference. It is usually declared when the p-value drops below 0.05.

Statistical significance is the formal answer to a simple question: if there were truly no difference between your control and variant, how often would random visitor behaviour alone produce a gap this big or bigger? That probability is the p-value. A small p-value means the observed lift is unlikely under the null hypothesis of no effect, and you reject that null.

The 5% threshold (p < 0.05) is convention, not law. It corresponds to a 95% confidence level: across many tests in which the variant truly makes no difference, roughly 1 in 20 will still cross this threshold and register as a false positive. Stricter tests use 1% or 0.1%; faster-moving experimentation programmes sometimes accept 10%.

Also known as
p-value threshold
statistical confidence
test significance

In an A/B test on your Shopify checkout, significance answers whether the 4.1% conversion rate on Variant B is genuinely better than the 3.6% on Control, or just a fluke from the specific visitors who happened to land in each bucket. Without that check, you'd ship every variant that looked good on day three.

Significance is one half of statistical analysis for experimentation; the other half is statistical power — your ability to detect a real effect when one exists. A test can be 'not significant' simply because it didn't run long enough, not because the variant failed.
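To make the power half concrete, here is a minimal sketch that estimates the power of a two-tailed test using the normal approximation. The function name `approx_power` is hypothetical, and the 12,000 visitors per bucket is an assumed sample size paired with the 3.6% and 4.1% rates quoted above.

```python
from statistics import NormalDist


def approx_power(p_a, p_b, n_per_arm, alpha=0.05):
    """Approximate power of a two-tailed two-proportion z-test.

    Uses the normal approximation with an unpooled standard error;
    treat the result as a rough planning figure, not an exact value.
    """
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    # Standard error of the difference if the true rates are p_a and p_b
    se = (p_a * (1 - p_a) / n_per_arm + p_b * (1 - p_b) / n_per_arm) ** 0.5
    shift = abs(p_b - p_a) / se
    # Probability that the z-statistic lands beyond either critical value
    return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)


# Assumed inputs: 3.6% vs 4.1% true rates, 12,000 visitors per bucket
power = approx_power(0.036, 0.041, 12_000)
print(f"power ≈ {power:.2f}")  # roughly 0.52
```

Under these assumptions the test would detect a true 0.5-point difference only about half the time, which is why a 'not significant' result on its own says little about whether the variant failed.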

Formula

z = (p_b - p_a) / sqrt( p_pool * (1 - p_pool) * (1/n_a + 1/n_b) )

Variables

p_a

Control conversion rate

Conversions in the control bucket divided by visitors in the control bucket.

p_b

Variant conversion rate

Conversions in the variant bucket divided by visitors in the variant bucket.

n_a

Control sample size

Visitors assigned to the control.

n_b

Variant sample size

Visitors assigned to the variant.

p_pool

Pooled conversion rate

Combined conversions across both buckets divided by combined visitors — (conv_a + conv_b) / (n_a + n_b).

z

Z-score

Standardised distance between the two rates. Convert to a p-value via the normal distribution; |z| > 1.96 corresponds to p < 0.05 (two-tailed).
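The formula and variables above can be sketched in a few lines of Python. The function name `two_proportion_z_test` is hypothetical, and the p-value uses the standard normal tail via `math.erfc`. The counts in the usage line are the ones from this page's worked example.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-tailed p-value)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate: combined conversions over combined visitors
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("pooled rate is 0 or 1; z is undefined")
    z = (p_b - p_a) / se
    # Two-tailed p-value: P(|Z| >= |z|) for a standard normal Z
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


z, p_value = two_proportion_z_test(432, 12_000, 492, 12_000)
print(f"z = {z:.2f}, p = {p_value:.3f}")  # z = 2.01, p = 0.044
```

The normal approximation behind this test is reasonable when each bucket has plenty of conversions; with very small counts an exact test is usually a safer choice.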

Worked example

An apparel store runs a checkout test: a single-page checkout (Variant B) against the existing three-step flow (Control A). After two weeks the data is in.

Control visitors (n_a): 12000

Control conversions: 432

Variant visitors (n_b): 12000

Variant conversions: 492

Control rate (p_a): 3.60%

Variant rate (p_b): 4.10%

Pooled rate (p_pool): 3.85%

z ≈ 2.01, p ≈ 0.044

p < 0.05, so the 0.5-percentage-point lift clears the conventional significance bar. You'd reject the null of no difference and ship the single-page checkout, but only if the test also met its pre-registered minimum detectable effect and runtime. Otherwise you risk an underpowered false positive.
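For readers who want to check the arithmetic, the worked example can be written out step by step. This uses only the figures given above, with Φ denoting the standard normal cumulative distribution function.

```latex
\begin{align*}
p_a &= \frac{432}{12000} = 0.0360, \qquad p_b = \frac{492}{12000} = 0.0410 \\
p_{\text{pool}} &= \frac{432 + 492}{12000 + 12000} = 0.0385 \\
\text{SE} &= \sqrt{0.0385 \times 0.9615 \times \left(\tfrac{1}{12000} + \tfrac{1}{12000}\right)} \approx 0.002484 \\
z &= \frac{0.0410 - 0.0360}{0.002484} \approx 2.01 \\
p &= 2\,\bigl(1 - \Phi(2.01)\bigr) \approx 0.044
\end{align*}
```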

Which threshold you pick is a business decision, not a maths one. Tighter alpha cuts false positives but demands more traffic; looser alpha ships winners faster but lets more noise through. Map the choice to the cost of being wrong: a checkout test that touches every order deserves 99% confidence, a hero-image test does not.

Benchmark

Typical significance thresholds and what they imply for sample size

| Confidence level | Alpha (p-value) | Z-critical (two-tailed) | Relative sample size | When to use |
| --- | --- | --- | --- | --- |
| 90% | 0.10 | 1.64 | 0.7× | Low-risk surface tests (copy, imagery, badges) |
| 95% | 0.05 | 1.96 | 1.0× (baseline) | Default for most CRO programmes |
| 99% | 0.01 | 2.58 | 1.7× | Checkout, pricing, payment flow changes |
| 99.9% | 0.001 | 3.29 | 2.7× | Platform-wide rollouts, regulatory-sensitive UI |
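As a rough illustration of how the threshold drives traffic requirements, the sketch below applies the standard normal-approximation sample-size formula for two proportions. The function name `sample_size_per_arm`, the 80% power default, and the 10% relative lift are assumptions made for illustration. The exact multipliers depend on the power you assume, so they need not match the rounded figures in the table.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(p_base, relative_lift, alpha=0.05, power=0.80):
    """Visitors needed per variant (two-tailed test, normal approximation)."""
    p_var = p_base * (1 + relative_lift)
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)  # critical value for the threshold
    z_beta = norm.inv_cdf(power)           # quantile for the desired power
    variance = p_base * (1 - p_base) + p_var * (1 - p_var)
    n = (z_alpha + z_beta) ** 2 * variance / (p_var - p_base) ** 2
    return math.ceil(n)


# Assumed scenario: 3.6% baseline rate, 10% relative lift, 80% power
sizes = {
    alpha: sample_size_per_arm(0.036, 0.10, alpha=alpha)
    for alpha in (0.10, 0.05, 0.01, 0.001)
}
for alpha, n in sizes.items():
    print(f"alpha={alpha}: {n} per arm ({n / sizes[0.05]:.2f}x the 95% baseline)")
```

Whatever power you assume, the direction is the same: each step towards a stricter alpha raises the traffic the test needs.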

The biggest practical trap is peeking — checking significance daily and calling the test the moment p dips below 0.05. Because the p-value wanders as data accumulates, repeated looks inflate your real false-positive rate well above the stated alpha. Either fix the runtime in advance, or use sequential testing methods designed for continuous monitoring.
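The inflation from peeking is easy to see in a small simulation. The sketch below runs A/A tests (both buckets share the same true rate, so every 'win' is a false positive) and compares stopping at the first look where |z| > 1.96 against checking only once at the end. All parameters (400 trials, 10 looks, 200 visitors per bucket per look, a 10% conversion rate) are arbitrary assumptions chosen to keep the run fast.

```python
import math
import random


def z_stat(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z-statistic (0.0 if the standard error is zero)."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return 0.0 if se == 0 else (conv_b / n_b - conv_a / n_a) / se


def simulate(trials=400, looks=10, per_look=200, rate=0.10, z_crit=1.96, seed=7):
    """Return (false-positive rate with peeking, rate with one final look)."""
    rng = random.Random(seed)
    peek_hits = fixed_hits = 0
    for _ in range(trials):
        conv_a = conv_b = n = 0
        crossed = False
        for _ in range(looks):
            # Both buckets draw from the same true rate: there is no real effect
            conv_a += sum(rng.random() < rate for _ in range(per_look))
            conv_b += sum(rng.random() < rate for _ in range(per_look))
            n += per_look
            z = z_stat(conv_a, n, conv_b, n)
            if abs(z) > z_crit:
                crossed = True  # a peeker would have stopped and shipped here
        peek_hits += crossed
        fixed_hits += abs(z) > z_crit  # z now holds the final-look statistic
    return peek_hits / trials, fixed_hits / trials


peek_rate, fixed_rate = simulate()
print(f"peeking: {peek_rate:.1%}, fixed horizon: {fixed_rate:.1%}")
```

The fixed-horizon rate should land near the stated 5%, while the peeking rate comes out several times higher. Sequential methods exist to correct for exactly this gap.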

Frequently asked

Statistical significance FAQ

A result significant at p < 0.05 means that if the control and variant were truly identical, you'd see a difference at least this large in no more than about 5% of repeated experiments by chance alone. It is not the probability that the variant 'works'; that is a Bayesian quantity computed differently.

95% confidence is not always required. It is the default convention, but match the threshold to the stakes. Use 99% for checkout, pricing, or anything touching revenue at scale; 90% is defensible for low-risk surface tweaks where shipping fast matters more than catching every false positive.

Statistical significance says the lift is unlikely to be noise. Practical significance says the lift is big enough to matter. A 0.2% improvement on 10 million visitors can be statistically significant and operationally pointless — always pair the p-value with the absolute effect size and a revenue impact estimate.

You cannot safely stop a standard fixed-horizon test the moment it reaches significance. Peeking and stopping early inflates your false-positive rate because the p-value fluctuates as data arrives. Either commit to a pre-calculated runtime, or switch to a sequential or Bayesian framework designed for continuous monitoring.

Significance (alpha) controls false positives — wrongly declaring a winner. Power (1 - beta) controls false negatives — missing a real winner. Both are pillars of statistical analysis: you set alpha when you pick a confidence level, and you achieve power by running enough sample to detect your minimum detectable effect.

When a 'significant' winner fails to hold up after launch, there are three common causes: you peeked and stopped on a noisy spike, the test was underpowered so the 'win' was within normal variance, or a novelty effect faded. Pre-register your sample size and minimum runtime (usually at least one full business cycle) to avoid this.

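Small stores can run A/B tests, but the maths is unforgiving on low traffic. At a 2% baseline conversion rate, detecting a 15% relative lift at 95% confidence takes roughly 37,000 visitors per variant (assuming 80% power), so a store doing 20,000 sessions a month would need several months per test rather than a few weeks. At that traffic level, focus on bigger, bolder variants or qualitative methods.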

Default to two-tailed. A one-tailed test only looks for an effect in one direction, so it cannot flag a variant that hurts conversion, and variants do hurt in practice. Two-tailed is the conservative, defensible choice and is what most testing tools report by default.

For checkout changes, push to p < 0.01 (99% confidence). Checkout touches every order, so a false positive that lowers conversion by even 1% is expensive. The extra sample size is worth the protection.

Frequentist significance reports the p-value: the chance of seeing data at least this extreme if the null were true. Bayesian testing reports probability to beat control, e.g. 'Variant B has a 96% probability of being better.' The Bayesian framing is easier to communicate to stakeholders and is generally regarded as more tolerant of continuous monitoring, which is why many modern CRO platforms default to it.
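To show what 'probability to beat control' looks like in practice, here is a minimal Monte Carlo sketch. It assumes a uniform Beta(1, 1) prior on each conversion rate, which is an illustrative choice and not necessarily what any particular platform uses. The function name `prob_b_beats_a` is hypothetical, and the counts are those from the worked example on this page.

```python
import random


def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20_000, seed=1):
    """Estimate P(rate_B > rate_A) from Beta posteriors under a Beta(1, 1) prior."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for each bucket: Beta(1 + conversions, 1 + non-conversions)
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws


prob = prob_b_beats_a(432, 12_000, 492, 12_000)
print(f"P(B beats A) ≈ {prob:.3f}")
```

With this much data the prior has little influence, and the result sits close to one minus the one-sided p-value from the z-test. The two framings agree on the evidence here and differ mainly in how it is expressed.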
