Statistical Significance

Statistical significance tells you whether an A/B test result is bigger than random noise alone would plausibly produce. Here's how to read p-values, pick a threshold, and avoid the common traps.
A result is statistically significant when the observed lift between variants would be unlikely if there were no real difference, usually declared when the p-value drops below 0.05.
Statistical significance is the formal answer to a simple question: if there were truly no difference between your control and variant, how often would random visitor behaviour alone produce a gap this big or bigger? That probability is the p-value. A small p-value means the observed lift is unlikely under the null hypothesis of no effect, and you reject that null.
The 5% threshold (p < 0.05) is convention, not law. It corresponds to a 95% confidence level: when a variant truly has no effect, roughly 1 test in 20 will still cross this threshold and register as a false positive. Stricter tests use 1% or 0.1%; faster-moving experimentation programmes sometimes accept 10%.
In an A/B test on your Shopify checkout, significance answers whether the 4.1% conversion rate on Variant B is genuinely better than the 3.6% on Control, or just a fluke from the specific visitors who happened to land in each bucket. Without that check, you'd ship every variant that looked good on day three.
Significance is one half of statistical analysis for experimentation; the other half is statistical power — your ability to detect a real effect when one exists. A test can be 'not significant' simply because it didn't run long enough, not because the variant failed.
z = (p_b - p_a) / sqrt( p_pool * (1 - p_pool) * (1/n_a + 1/n_b) )
| Symbol | Name | Definition |
|---|---|---|
| p_a | Control conversion rate | Conversions in the control bucket divided by visitors in the control bucket. |
| p_b | Variant conversion rate | Conversions in the variant bucket divided by visitors in the variant bucket. |
| n_a | Control sample size | Visitors assigned to the control. |
| n_b | Variant sample size | Visitors assigned to the variant. |
| p_pool | Pooled conversion rate | Combined conversions across both buckets divided by combined visitors: (conv_a + conv_b) / (n_a + n_b). |
| z | Z-score | Standardised distance between the two rates. Convert to a p-value via the normal distribution; an absolute value of z above 1.96 corresponds to p < 0.05 (two-tailed). |
An apparel store runs a checkout test: a single-page checkout (Variant B) against the existing three-step flow (Control A). After two weeks the data is in.
Control visitors (n_a): 12000
Control conversions: 432
Variant visitors (n_b): 12000
Variant conversions: 492
Control rate (p_a): 3.60%
Variant rate (p_b): 4.10%
Pooled rate (p_pool): 3.85%
→ z ≈ 2.01, p ≈ 0.044
p < 0.05, so the 0.5-point lift clears the conventional significance bar. You'd reject the null of no difference and ship the single-page checkout, but only if the test also ran to its pre-registered sample size (set by its minimum detectable effect) and runtime. A result that narrowly clears the bar in an underpowered or early-stopped test is more likely to be a false positive or an overstated lift.
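The formula and the worked example can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the function name `two_proportion_z_test` is illustrative, not part of any testing tool.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-tailed p-value) for a pooled two-proportion z-test."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    # Standard error of the difference under the null of equal rates.
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-tailed: a gap this large in either direction counts as extreme.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Apparel-store checkout test from the worked example.
z, p_value = two_proportion_z_test(432, 12000, 492, 12000)
print(f"z = {z:.2f}, p = {p_value:.3f}")  # z = 2.01, p = 0.044
```

The normal approximation behind this test is reasonable at these sample sizes; with only a handful of conversions per bucket an exact test would be a safer choice.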
Which threshold you pick is a business decision, not a maths one. Tighter alpha cuts false positives but demands more traffic; looser alpha ships winners faster but lets more noise through. Map the choice to the cost of being wrong: a checkout test that touches every order deserves 99% confidence, a hero-image test does not.
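Typical significance thresholds and what they imply for sample size (multipliers scale with the square of z-critical; with an 80% power target they narrow to roughly 0.8×, 1.5× and 2.2×)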
| Confidence level | Alpha (p-value) | Z-critical (two-tailed) | Relative sample size | When to use |
|---|---|---|---|---|
| 90% | 0.10 | 1.64 | 0.7× | Low-risk surface tests (copy, imagery, badges) |
| 95% | 0.05 | 1.96 | 1.0× (baseline) | Default for most CRO programmes |
| 99% | 0.01 | 2.58 | 1.7× | Checkout, pricing, payment flow changes |
| 99.9% | 0.001 | 3.29 | 2.8× | Platform-wide rollouts, regulatory-sensitive UI |
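The z-critical column can be recomputed from the normal distribution, and the sample-size multipliers depend on whether a power target is included. The sketch below shows both conventions. The 80% power figure is an assumption, since the table does not state one, and the helper names are illustrative.

```python
from statistics import NormalDist

norm = NormalDist()


def z_critical(confidence):
    """Two-tailed critical z for a confidence level such as 0.95."""
    alpha = 1 - confidence
    return norm.inv_cdf(1 - alpha / 2)


def relative_sample_size(confidence, power=None, baseline=0.95):
    """Sample size relative to the baseline confidence level.

    With power=None the ratio is simply the squared z-critical ratio.
    With a power target, the z-score for that power is added to both sides.
    """
    z_beta = norm.inv_cdf(power) if power is not None else 0.0
    ratio = (z_critical(confidence) + z_beta) / (z_critical(baseline) + z_beta)
    return ratio ** 2


for level in (0.90, 0.95, 0.99, 0.999):
    print(
        f"{level:.1%}  z = {z_critical(level):.2f}  "
        f"no power term: {relative_sample_size(level):.1f}x  "
        f"80% power: {relative_sample_size(level, power=0.80):.1f}x"
    )
```

Either way the direction is the same: tightening alpha costs traffic, and the cost grows faster than the confidence level does.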
The biggest practical trap is peeking — checking significance daily and calling the test the moment p dips below 0.05. Because the p-value wanders as data accumulates, repeated looks inflate your real false-positive rate well above the stated alpha. Either fix the runtime in advance, or use sequential testing methods designed for continuous monitoring.
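The effect of peeking can be seen in a small simulation of A/A tests, where both buckets share the same true conversion rate and every "significant" result is therefore a false positive. This is a rough sketch: the traffic figures, the 10% conversion rate, and the number of runs are arbitrary assumptions chosen to keep it fast.

```python
import random
from math import sqrt
from statistics import NormalDist

Z_CRIT = NormalDist().inv_cdf(0.975)  # two-tailed, alpha = 0.05


def is_significant(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z-test at alpha = 0.05 (two-tailed)."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return False
    return abs((conv_b / n_b - conv_a / n_a) / se) > Z_CRIT


def simulate_aa_tests(runs=400, days=10, daily_visitors=200, rate=0.10, seed=1):
    """Return (fixed-horizon, daily-peeking) false-positive rates."""
    rng = random.Random(seed)
    fixed_hits = peeking_hits = 0
    for _ in range(runs):
        conv_a = conv_b = n = 0
        significant = ever_significant = False
        for _day in range(days):
            conv_a += sum(rng.random() < rate for _ in range(daily_visitors))
            conv_b += sum(rng.random() < rate for _ in range(daily_visitors))
            n += daily_visitors
            significant = is_significant(conv_a, n, conv_b, n)
            ever_significant = ever_significant or significant
        fixed_hits += significant         # judged once, on the final day
        peeking_hits += ever_significant  # "winner" called on any day it crossed
    return fixed_hits / runs, peeking_hits / runs


fixed_rate, peeking_rate = simulate_aa_tests()
print(f"fixed horizon: {fixed_rate:.1%}, daily peeking: {peeking_rate:.1%}")
```

The peeking rate can never be lower than the fixed-horizon rate, because the final-day check is one of the looks. With ten looks it typically lands well above the nominal 5%.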
Statistical significance FAQ
Significance at the 95% level means that if the control and variant were truly identical, you'd see a difference at least this large in no more than about 5% of repeated experiments by chance alone. It is not the probability that the variant 'works'; that's a Bayesian quantity computed differently.
95% is the default convention, not a requirement; match the threshold to the stakes. Use 99% for checkout, pricing, or anything touching revenue at scale; 90% is defensible for low-risk surface tweaks where shipping fast matters more than catching every false positive.
Statistical significance says the lift is unlikely to be noise. Practical significance says the lift is big enough to matter. A 0.2% improvement on 10 million visitors can be statistically significant and operationally pointless — always pair the p-value with the absolute effect size and a revenue impact estimate.
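One way to put a size on the lift, rather than only a p-value, is a confidence interval for the absolute difference in conversion rates. The sketch below applies a simple Wald interval to the apparel-store figures from the worked example (432 of 12,000 versus 492 of 12,000). The choice of interval and the function name are assumptions for illustration.

```python
from math import sqrt
from statistics import NormalDist


def lift_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Wald interval for the absolute difference in conversion rates (B - A)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Unpooled standard error: each bucket keeps its own variance.
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = p_b - p_a
    return diff - z * se, diff + z * se


low, high = lift_confidence_interval(432, 12000, 492, 12000)
print(f"absolute lift: {low:.2%} to {high:.2%}")
```

For this data the interval runs from just above zero to almost a full percentage point, so the result is significant but the plausible size of the lift is still wide. Multiplying both ends by order volume and average order value gives a revenue range to weigh against the cost of the change.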
You can't safely stop a standard fixed-horizon test the moment it reaches significance. Peeking and stopping early inflates your false-positive rate because the p-value fluctuates as data arrives. Either commit to a pre-calculated runtime, or switch to a sequential or Bayesian framework that's valid under continuous monitoring.
Significance (alpha) controls false positives — wrongly declaring a winner. Power (1 - beta) controls false negatives — missing a real winner. Both are pillars of statistical analysis: you set alpha when you pick a confidence level, and you achieve power by running enough sample to detect your minimum detectable effect.
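Power can be approximated with the same normal model used for the z-test. The sketch below asks how often a test with 12,000 visitors per bucket would reach significance if the true rates really were 3.6% and 4.1%. Treating the observed rates as the true ones is an assumption made only for illustration, and `approximate_power` is a hypothetical helper.

```python
from math import sqrt
from statistics import NormalDist


def approximate_power(p_a, p_b, n_per_arm, alpha=0.05):
    """Approximate power of a two-tailed two-proportion z-test."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    se = sqrt(p_a * (1 - p_a) / n_per_arm + p_b * (1 - p_b) / n_per_arm)
    shift = abs(p_b - p_a) / se  # expected z-score when the lift is real
    # Chance the z-score lands beyond either critical value.
    return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)


power = approximate_power(0.036, 0.041, 12000)
print(f"power = {power:.0%}")
```

Under these assumptions power comes out at roughly 52%, so a real lift of this size would be missed almost half the time at this sample size.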
When a 'winning' variant stops performing after launch, there are three common causes: you peeked and stopped on a noisy spike, the test was underpowered so the 'win' was within normal variance, or a novelty effect faded. Pre-register your sample size and minimum runtime (usually at least one full business cycle) to avoid this.
Small stores can run A/B tests, but the maths is unforgiving on low traffic. A store doing 20,000 sessions a month at a 2% conversion rate needs roughly 37,000 visitors per variant, or about 16 weeks for a two-variant test, to detect a 15% relative lift at 95% confidence and 80% power. If you can't run that long, focus on bigger, bolder variants or qualitative methods.
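A rough sample-size estimate for this scenario can be sketched with the standard normal-approximation formula for two proportions. The 80% power target and the even two-way traffic split are assumptions, and `visitors_per_variant` is an illustrative name.

```python
from math import ceil
from statistics import NormalDist


def visitors_per_variant(baseline, relative_lift, alpha=0.05, power=0.80):
    """Visitors needed in each bucket to detect the given relative lift."""
    norm = NormalDist()
    p_a = baseline
    p_b = baseline * (1 + relative_lift)
    z_alpha = norm.inv_cdf(1 - alpha / 2)  # two-tailed significance
    z_beta = norm.inv_cdf(power)
    variance = p_a * (1 - p_a) + p_b * (1 - p_b)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p_b - p_a) ** 2)


n = visitors_per_variant(baseline=0.02, relative_lift=0.15)
weekly_sessions = 20_000 * 12 / 52  # 20,000 sessions a month
weeks = 2 * n / weekly_sessions     # control plus one variant
print(f"{n} visitors per variant, about {weeks:.0f} weeks")
```

Under these assumptions the estimate is in the region of 37,000 visitors per variant. Doubling the detectable lift cuts the requirement to roughly a quarter, which is why bolder variants suit low-traffic stores.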
Default to two-tailed. A one-tailed test assumes the variant can only help, never hurt — which is almost never true in practice. Two-tailed is the conservative, defensible choice and is what most testing tools report by default.
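The difference between the two is easy to see numerically. A minimal sketch, using the z-score of about 2.01 from the worked example; `p_values_from_z` is an illustrative helper.

```python
from statistics import NormalDist


def p_values_from_z(z):
    """Return (one-tailed, two-tailed) p-values for a z-score."""
    norm = NormalDist()
    one_tailed = 1 - norm.cdf(z)             # tests "variant is better" only
    two_tailed = 2 * (1 - norm.cdf(abs(z)))  # tests "variant is different"
    return one_tailed, two_tailed


one_tailed, two_tailed = p_values_from_z(2.01)
print(f"one-tailed p = {one_tailed:.3f}, two-tailed p = {two_tailed:.3f}")
```

For a positive z the one-tailed p-value is half the two-tailed one, so a one-tailed test makes the same data look more convincing without any extra evidence.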
For checkout changes, push to p < 0.01 (99% confidence). Checkout touches every order, so a false positive that lowers conversion by even 1% is expensive. The extra sample size is worth the protection.
Frequentist significance reports the p-value: the chance of seeing data at least this extreme under the null. Bayesian testing reports probability to beat control, e.g. 'Variant B has a 96% probability of being better.' The Bayesian framing is easier to communicate to stakeholders and tolerates continuous monitoring, which is why many modern CRO platforms default to it.
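A probability to beat control can be approximated by sampling from a Beta posterior for each bucket. This is a minimal sketch, assuming uniform Beta(1, 1) priors and the apparel-store figures from the worked example. It is not how any particular CRO platform computes the number.

```python
import random


def probability_to_beat_control(conv_a, n_a, conv_b, n_b, draws=20_000, seed=7):
    """Monte Carlo estimate of P(rate_b > rate_a) under Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for a conversion rate: Beta(1 + conversions, 1 + non-conversions).
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws


prob = probability_to_beat_control(432, 12000, 492, 12000)
print(f"P(B beats A) = {prob:.1%}")
```

With a flat prior and this much data the result sits close to one minus the one-tailed p-value, which is a useful sanity check when comparing the two framings.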