Statistical Analysis

Metricuno
May 17, 2026
Statistical Analysis — The statistical foundations of A/B testing — significance, confidence, sample size, power, and validity — explained for online stores running real experiments.
Quick answer

A practical framework for the statistics behind A/B testing: significance, confidence intervals, sample size, power, and the validity threats that quietly invalidate winners.

Definition
Experimentation

Statistical Analysis (in Experimentation)

The math that decides whether an A/B test result is a real effect or random noise — built on significance, confidence, power, and sample size.

Statistical analysis is the layer of experimentation that translates raw conversion counts into a defensible decision. It answers one question: given the variance you'd expect by chance, is the difference between your variants large enough, and observed across enough visitors, to believe the change caused it?

The framework rests on four moving parts — significance, confidence, sample size, and power — plus a set of validity checks that protect those numbers from being silently corrupted by how the test was run. Get the math right and you ship real winners. Get it wrong and you ship noise, then wonder why revenue didn't move.

Also known as
Inferential statistics for A/B testing
Experiment statistics

Most teams running A/B tests on Shopify or WooCommerce learn the vocabulary — p-value, confidence, significance — without ever building intuition for what the numbers protect against. The result is a portfolio of declared winners that don't replicate, and a roadmap that drifts away from real revenue impact.

This page is the framework: the four pillars you actually need (significance, confidence intervals, sample size, power), the validity threats that quietly invalidate them, and the decision rules that turn a dashboard reading into a ship/kill call you can defend to your CFO.

The four pillars: significance, confidence, sample size, power

Statistical significance is the headline metric, and the p-value is the number behind it: the probability of seeing a lift at least as extreme as yours if the variant had zero true effect. Many testing tools surface its complement, 1 - p, as "confidence" or "significance". A p-value of 0.04 means: if nothing changed, you'd see a result this extreme about 4% of the time by chance. It does not mean there is a 96% chance the variant is better.
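As a rough illustration of where a p-value comes from, the sketch below runs a pooled two-proportion z-test on conversion counts. The function name and the visitor and conversion counts are hypothetical, and this is one common textbook approach rather than the method any particular testing tool is known to use.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate: the best estimate of the shared rate if the variant had no effect.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Probability of a |z| at least this large when the null hypothesis is true.
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical counts: 600 of 20,000 visitors convert on control, 690 of 20,000 on the variant.
p = two_proportion_p_value(600, 20_000, 690, 20_000)
print(f"p-value: {p:.4f}")
```

The normal approximation used here assumes each variant has a reasonably large number of conversions; with very small counts an exact test would be a safer choice.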

Confidence intervals are the underused twin. Instead of a single yes/no, they give you a range: "the true lift is between +1.2% and +6.8% with 95% confidence." That range tells you whether the win is decisively positive, marginal, or possibly negative, which is information a p-value alone hides.

Sample size and power analysis then tell you how many visitors you need to detect a given lift reliably. Under-powered tests are one of the most common sources of false negatives in DTC experimentation.
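To make the confidence-interval idea concrete, here is a minimal sketch of a Wald interval for the absolute difference in conversion rate, using hypothetical counts. Converting the bounds to a relative lift by dividing by the control rate is a rough approximation, because it ignores uncertainty in the control rate itself.

```python
from math import sqrt
from statistics import NormalDist


def lift_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Wald interval for the absolute difference in conversion rate (B minus A)."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Unpooled standard error: each variant contributes its own variance.
    std_err = sqrt(rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = rate_b - rate_a
    return diff - z * std_err, diff + z * std_err


# Hypothetical counts: 600/20,000 on control, 690/20,000 on the variant.
low, high = lift_confidence_interval(600, 20_000, 690, 20_000)
control_rate = 600 / 20_000
print(f"absolute difference: {low:+.4f} to {high:+.4f}")
# Approximate relative lift: bounds divided by the observed control rate.
print(f"relative lift (approx.): {low / control_rate:+.1%} to {high / control_rate:+.1%}")
```

An interval whose lower bound sits just above zero is significant but tells you the real lift could be close to nothing, which is exactly the nuance a lone p-value hides.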

Validity: why correct math still ships the wrong winners

Experiment validity is the layer the math assumes but doesn't enforce. Sample ratio mismatch, contamination between variants, novelty effects on returning customers, peeking at results before the planned sample is reached — each one silently breaks the assumptions a 95% confidence number depends on. A test can be statistically significant and still be wrong because the design was compromised.

False positives compound the problem. Run ten independent tests of changes that truly do nothing at 95% confidence and there is about a 40% chance that at least one of them comes back as a false positive (1 - 0.95^10 ≈ 0.40). Run them with peeking, checking the dashboard daily and stopping the moment significance appears, and the false-positive rate of each individual test can climb past 30%. Sequential testing methods exist specifically to let you look early without paying that penalty; classical fixed-horizon tests do not.
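The effect of peeking can be checked with a small simulation. The sketch below models an A/A test (no true effect) as a running z-statistic that is inspected after every equal-sized batch of traffic, and stops at the first look where it crosses the 95% threshold. It is a simplified normal-approximation model with hypothetical parameters, not a replay of real store data.

```python
import random
from math import sqrt
from statistics import NormalDist


def false_positive_rate(looks, simulations=4000, alpha=0.05, seed=7):
    """Share of simulated A/A tests declared significant at any of `looks` peeks."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    false_winners = 0
    for _ in range(simulations):
        running_sum = 0.0
        for look in range(1, looks + 1):
            # Under the null, each equal-sized batch adds one standard-normal increment.
            running_sum += rng.gauss(0, 1)
            z = running_sum / sqrt(look)
            if abs(z) > z_crit:
                # Stop the moment significance appears, as a peeking team would.
                false_winners += 1
                break
    return false_winners / simulations


for looks in (1, 14, 30):
    print(f"{looks:>2} looks: {false_positive_rate(looks):.1%} false positives")
```

With a single look the rate should land near the nominal 5%; each extra look gives chance another opportunity to cross the line, so the rate rises as the number of looks grows.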

The peeking trap

If your testing tool lets you call a winner whenever the p-value crosses 0.05, you are not running at 95% confidence — you are running at whatever the cumulative error rate becomes after dozens of looks. Either fix the sample size in advance, or use a sequential or Bayesian method designed for continuous monitoring.

Frequentist vs Bayesian: choosing a decision framework

Classical (frequentist) testing asks: "assuming no effect, how surprising is this data?" Bayesian testing inverts the question: "given this data, what's the probability the variant beats control by at least X%?" For most operators, the Bayesian framing maps more cleanly onto the actual business decision — you want to know the chance you're shipping a loser, not the chance of seeing this data under a hypothetical null.

Neither framework is statistically superior; they answer different questions. Frequentist methods give you airtight guarantees if you commit to a sample size and don't peek. Bayesian methods give you a probability of being right that updates continuously, at the cost of needing reasonable priors. Pick one and run it consistently — mixing frameworks mid-test, or changing rules after seeing data, is how teams talk themselves into bad ships.
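For a feel of what the Bayesian question looks like in practice, the sketch below estimates the probability that the variant beats control by sampling from Beta posteriors. The flat Beta(1, 1) prior, the function name, and the counts are all assumptions made for illustration; a real program would choose its prior deliberately.

```python
import random


def prob_variant_beats_control(conv_a, n_a, conv_b, n_b,
                               min_lift=0.0, draws=20_000, seed=11):
    """Monte Carlo estimate of P(rate_B > rate_A * (1 + min_lift))."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Beta(1 + conversions, 1 + non-conversions) is the posterior under a flat prior.
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        if rate_b > rate_a * (1 + min_lift):
            wins += 1
    return wins / draws


# Hypothetical counts: 600/20,000 on control, 690/20,000 on the variant.
p_any = prob_variant_beats_control(600, 20_000, 690, 20_000)
p_5 = prob_variant_beats_control(600, 20_000, 690, 20_000, min_lift=0.05)
print(f"P(variant beats control):          {p_any:.1%}")
print(f"P(variant beats control by >= 5%): {p_5:.1%}")
```

The `min_lift` parameter corresponds to the "by at least X%" part of the Bayesian question: raising it asks for a bigger win and lowers the resulting probability.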

Chart

Visitors per variant needed to detect a lift (95% confidence, 80% power, 3% baseline conversion)

[Chart: visitors required per variant (y-axis, 0 to 250,000) plotted against minimum detectable relative lift (x-axis: +5%, +10%, +15%, +20%, +30%, +50%).]
Frequently asked

Statistical analysis FAQ

Significance gives you a yes/no on whether the effect is distinguishable from zero. A confidence interval gives you the plausible range of the true effect — for example, +2% to +9%. The interval is strictly more informative, because a barely-significant test with a range of +0.1% to +8% means something very different operationally than +4% to +6%.

To calculate sample size you need four inputs: your baseline conversion rate, the minimum lift you want to detect (MDE), your significance threshold (usually 5%), and your power (usually 80%). Plug those into a sample size calculator. On a 3% baseline conversion store hunting a +10% relative lift, expect to need upwards of 50,000 visitors per variant. Most underpowered tests happen because teams skip this step.
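The four inputs can be plugged into the standard normal-approximation formula for comparing two proportions, sketched below. The function name is illustrative, and different calculators use slightly different formulas, so expect figures in the same range rather than an exact match with any given tool.

```python
from math import ceil
from statistics import NormalDist


def visitors_per_variant(baseline, relative_lift, alpha=0.05, power=0.80):
    """Visitors needed per variant for a two-sided test on conversion rate."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_power = NormalDist().inv_cdf(power)          # desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


# 3% baseline, 95% confidence, 80% power, across a range of relative lifts.
for lift in (0.05, 0.10, 0.15, 0.20, 0.30, 0.50):
    print(f"+{lift:.0%} lift: {visitors_per_variant(0.03, lift):,} visitors per variant")
```

Because the required sample scales with the inverse square of the lift, halving the lift you want to detect roughly quadruples the traffic you need.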

Stopping a test the moment it reaches significance is not valid under classical statistics: that's peeking, and it inflates your false-positive rate well above the 5% your tool reports. You can stop early if you've explicitly designed a sequential test (e.g. mSPRT, group sequential boundaries) or if you're running a Bayesian framework with appropriate decision rules. Otherwise, commit to your pre-calculated sample size.

Power is the probability your test will correctly detect a real effect of a given size. 80% is convention: it means a 20% chance of missing a true winner of that size. Lower power means under-powered tests where real wins look inconclusive. Higher power costs more traffic (moving from 80% to 90% raises the required sample by roughly a third), so most teams settle at 80%.
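Power can also be computed in the other direction: given the traffic you actually have, how likely is the test to detect the lift you care about? The sketch below uses the same normal approximation as a standard sample size formula and ignores the negligible chance of a significant result in the wrong direction. The function name and traffic figures are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def achieved_power(baseline, relative_lift, n_per_variant, alpha=0.05):
    """Approximate power of a two-sided test at a given sample size."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    std_err = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_variant)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    # Chance the observed z-statistic clears the threshold when the lift is real.
    return NormalDist().cdf(abs(p2 - p1) / std_err - z_alpha)


# 3% baseline, +10% relative lift, at two hypothetical traffic levels.
for n in (10_000, 53_208):
    print(f"{n:>6,} visitors per variant: power = {achieved_power(0.03, 0.10, n):.0%}")
```

A test run at low power will usually come back inconclusive even when the variant is genuinely better, which is why an inconclusive result from a small test says little about the idea itself.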

Bayesian methods map more naturally to business questions ("probability variant B is best") and tolerate continuous monitoring. Frequentist methods are the industry default, well-understood, and rigorous if you don't peek. For most online stores running fewer than 20 tests a year, either is fine — what matters more is running it consistently and respecting the rules of whichever you pick.

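If you split traffic 50/50 but observe 52/48 across tens of thousands of visitors, that's a sample ratio mismatch (SRM), and it's a strong signal something is broken: bot filtering, redirect bugs, cookie issues, or a tracking gap that's silently dropping users. The same 52/48 split on a few hundred visitors is within normal chance variation, so check the imbalance with a chi-square test rather than by eye. SRM invalidates the test even if the p-value looks great. Always check it before reading results.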
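Whether a 52/48 split is alarming depends on how many visitors sit behind it. A minimal sketch of the usual check is a chi-square goodness-of-fit test on the visitor counts; with two groups the statistic has one degree of freedom, so its p-value can be derived from the normal distribution. The function name and counts are hypothetical, and the 0.001 alert threshold is a commonly used convention assumed here, not a rule from the text.

```python
from math import sqrt
from statistics import NormalDist


def srm_p_value(visitors_a, visitors_b, expected_share_a=0.5):
    """Chi-square goodness-of-fit p-value for the observed traffic split."""
    total = visitors_a + visitors_b
    expected_a = total * expected_share_a
    expected_b = total - expected_a
    chi2 = ((visitors_a - expected_a) ** 2 / expected_a
            + (visitors_b - expected_b) ** 2 / expected_b)
    # With 1 degree of freedom, chi2 is the square of a standard normal variable.
    return 2 * (1 - NormalDist().cdf(sqrt(chi2)))


SRM_THRESHOLD = 0.001  # assumed alert level; stricter than 0.05 to avoid false alarms

for a, b in ((52, 48), (5_200, 4_800)):
    p = srm_p_value(a, b)
    verdict = "likely SRM" if p < SRM_THRESHOLD else "no evidence of SRM"
    print(f"{a:,} vs {b:,}: p = {p:.5f} ({verdict})")
```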

At 95% confidence, roughly 1 in 20 truly-null tests will show a false winner. If you run 40 tests a year and half are flat, expect about one false positive (20 × 5%) even if you do everything right. The fix isn't tighter thresholds; it's holdout validation, replication of large wins, and accepting that a single significant test is evidence, not proof.

Conversion rate is binomial, not normal, but the normal approximation works fine once you have a few hundred conversions per variant. Where you do need to be careful is revenue-per-visitor or AOV tests — those have heavy-tailed distributions where one €800 order can swing the mean. Use a non-parametric or bootstrap approach for revenue metrics.
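For heavy-tailed revenue metrics, a percentile bootstrap is one way to get an interval without assuming normality. The sketch below uses made-up revenue-per-visitor data in which a single €800 order sits in the control group; the data, the function name, and the number of resamples are assumptions chosen to keep the example small.

```python
import random


def bootstrap_diff_ci(revenue_a, revenue_b, resamples=500, confidence=0.95, seed=3):
    """Percentile bootstrap interval for mean revenue per visitor (B minus A)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(resamples):
        # Resample visitors with replacement within each variant.
        sample_a = rng.choices(revenue_a, k=len(revenue_a))
        sample_b = rng.choices(revenue_b, k=len(revenue_b))
        diffs.append(sum(sample_b) / len(sample_b) - sum(sample_a) / len(sample_a))
    diffs.sort()
    tail = (1 - confidence) / 2
    return diffs[int(tail * resamples)], diffs[int((1 - tail) * resamples) - 1]


# Hypothetical data, one entry per visitor (0.0 means no purchase).
revenue_a = [0.0] * 970 + [40.0] * 29 + [800.0]  # includes one outlier order
revenue_b = [0.0] * 965 + [40.0] * 35

low, high = bootstrap_diff_ci(revenue_a, revenue_b)
observed = sum(revenue_b) / len(revenue_b) - sum(revenue_a) / len(revenue_a)
print(f"observed difference: {observed:+.2f} per visitor")
print(f"95% bootstrap interval: {low:+.2f} to {high:+.2f}")
```

In this made-up data the variant has more orders but a lower mean, purely because of the one large order in control. The wide interval spanning zero reflects that fragility, and production use would typically call for more resamples than the 500 used here.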

Run a test for at least one full business cycle, usually two weeks, even if you hit sample size sooner. That covers weekly seasonality (weekend vs weekday buyers behave differently) and gives returning visitors time to come back and convert within their assigned variant. If your sample size calculation says you need six weeks, run six weeks; cutting it short is the most common way teams ship noise.

Statistical analysis is the decision layer of a broader experimentation program — it tells you what to ship, but not what to test. Hypothesis quality, prioritisation, and test velocity sit upstream; statistics keeps you honest at the finish line. A team with great statistics and bad hypotheses learns slowly but reliably; the reverse learns fast and wrong.
