Sample Size

Sample size is the number of visitors an A/B test needs to detect a given effect with statistical confidence. Compute it before the test starts — or you're running without an exit criterion.
The number of visitors per variant an A/B test needs to detect a given effect at a chosen power and significance level.
Sample size is the count of users (or sessions) each variant in an experiment needs before the result can be trusted. It's a function of four inputs: your baseline conversion rate, the minimum effect you care about detecting, your significance threshold (typically 5%), and your statistical power (typically 80%).
The number is computed before the test launches, not discovered after. Without a planned sample size, a test has no exit criterion — you end up either stopping early on noise or running indefinitely and burning traffic on inconclusive variants.
Most teams under-power their tests. A checkout experiment running on 4,000 weekly visitors can't reliably detect a 5% lift on a 2% baseline: by the formula below, that takes roughly 300,000 visitors per variant. Skip the calculation and you'll either ship a losing change or scrap a winning one because the numbers wobbled.
Sample size sits inside the broader statistical analysis layer of experimentation. It determines test duration (sample size ÷ daily traffic per variant), which in turn determines how many tests your store can run per quarter. On low-traffic pages, the calculation often reveals that the experiment isn't viable — better to know that on day zero than week six.
n = 2 * ( (z_alpha + z_beta)^2 * p * (1 - p) ) / (MDE * p)^2
n: Sample size per variant. Visitors needed in each variant (control and treatment).
p: Baseline conversion rate. Current conversion rate of the control, as a decimal (e.g. 0.025 for 2.5%).
MDE: Minimum detectable effect. Smallest relative lift you want to detect, as a decimal (e.g. 0.10 for a 10% relative lift).
z_alpha: Significance z-score. 1.96 for a two-sided test at 95% confidence (α = 0.05).
z_beta: Power z-score. 0.84 for 80% statistical power.
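The formula translates almost line for line into code. The sketch below is one possible implementation, not a reference calculator: the function name and defaults are illustrative, and it derives the z-scores from the standard library instead of hard-coding 1.96 and 0.84, so results differ from hand calculations by a fraction of a percent.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(p, mde, alpha=0.05, power=0.80):
    """Visitors per variant: n = 2 (z_alpha + z_beta)^2 p (1 - p) / (MDE * p)^2.

    p     -- baseline conversion rate as a decimal (0.025 for 2.5%)
    mde   -- relative minimum detectable effect as a decimal (0.10 for 10%)
    alpha -- two-sided significance level
    power -- desired statistical power
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 at alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 at 80% power
    absolute_effect = mde * p                      # relative lift as a rate difference
    n = 2 * (z_alpha + z_beta) ** 2 * p * (1 - p) / absolute_effect ** 2
    return ceil(n)


n = sample_size_per_variant(p=0.025, mde=0.10)
print(f"{n:,} visitors per variant")
```

For a 2.5% baseline and a 10% relative lift this lands at roughly 61,000 visitors per variant.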
A Shopify apparel store wants to test a new product-page hero. Current PDP-to-add-to-cart rate is 4%. The team wants to detect a 10% relative lift (i.e. a move from 4.00% to 4.40%) at 95% confidence and 80% power.
Baseline conversion rate (p): 0.04
Minimum detectable effect (MDE): 0.10 (relative)
z_alpha (two-sided, 95%): 1.96
z_beta (80% power): 0.84
→ ≈ 37,600 visitors per variant (≈ 75,300 total)
On 2,500 PDP visitors per day split evenly between the two variants, this test needs about 30 days to reach the planned sample size. Anything shorter risks calling a result on noise.
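Turning a planned sample size into a run time is simple division. The helper below is a minimal sketch that assumes traffic is split evenly across variants; the function name and the example inputs are illustrative, not taken from any specific tool.

```python
from math import ceil


def days_to_sample_size(n_per_variant, daily_visitors, variants=2):
    """Days until every variant reaches n_per_variant, assuming an even split."""
    daily_per_variant = daily_visitors / variants
    return ceil(n_per_variant / daily_per_variant)


# Illustrative inputs: 20,000 visitors per variant on 4,000 visitors per day.
print(days_to_sample_size(n_per_variant=20_000, daily_visitors=4_000))
```

Pass `variants=3` for an A/B/C test: the same daily traffic is then shared three ways, so the duration grows accordingly.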
Two things drive the number more than anything else: the baseline rate and the MDE. Halving the MDE quadruples the sample size. That's why ambitious tests on small lifts ("can we squeeze 2% more out of checkout?") need enormous traffic — and why most stores under €3M revenue should focus on bigger swings.
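The quadrupling follows from the MDE sitting squared in the denominator of the formula. A quick check, with an illustrative helper that drops the constant 2 (z_alpha + z_beta)^2 because it cancels in any ratio:

```python
def relative_sample_size(p, mde):
    """Sample size up to a constant factor: p (1 - p) / (MDE * p)^2."""
    return p * (1 - p) / (mde * p) ** 2


# Same 4% baseline, MDE halved from 10% to 5%.
ratio = relative_sample_size(0.04, 0.05) / relative_sample_size(0.04, 0.10)
print(round(ratio, 2))  # 4.0
```

The same helper shows the general rule: cutting the MDE by a factor of k multiplies the required sample by k squared.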
Visitors per variant needed to detect a relative lift (95% confidence, 80% power)
| Baseline conversion rate | Detect 5% lift | Detect 10% lift | Detect 20% lift | Detect 50% lift |
|---|---|---|---|---|
| 1% (checkout step) | 620,000 | 155,000 | 39,000 | 6,300 |
| 2.5% (PDP → cart) | 245,000 | 61,000 | 15,400 | 2,500 |
| 5% (site-wide CVR) | 119,000 | 30,000 | 7,500 | 1,200 |
| 10% (email opt-in) | 57,000 | 14,000 | 3,600 | 580 |
| 25% (returning visitors) | 18,000 | 4,500 | 1,150 | 190 |
Read the table the practical way: find your baseline rate, then look across to the MDE you can credibly aim for. If the cell exceeds a month of traffic, the test isn't viable in its current form — either widen the MDE (test bolder changes), increase traffic (apply the test site-wide instead of one page), or move to a higher-converting funnel step.
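The same viability check can be scripted. This is a rough sketch under two assumptions that are not spelled out above: the whole test (all variants combined) has to fit into one month of traffic, and the candidate MDEs are the four table columns. The function names and the 60,000-visitor example are hypothetical.

```python
from math import ceil

Z_ALPHA, Z_BETA = 1.96, 0.84  # 95% confidence (two-sided), 80% power


def sample_size_per_variant(p, mde):
    """Visitors per variant for baseline p and relative lift mde."""
    return ceil(2 * (Z_ALPHA + Z_BETA) ** 2 * p * (1 - p) / (mde * p) ** 2)


def smallest_viable_mde(p, monthly_visitors, mdes=(0.05, 0.10, 0.20, 0.50), variants=2):
    """Smallest candidate MDE whose total sample fits in a month, or None."""
    for mde in sorted(mdes):
        if sample_size_per_variant(p, mde) * variants <= monthly_visitors:
            return mde
    return None


# A 2.5% baseline on 60,000 monthly visitors (made-up traffic figure).
print(smallest_viable_mde(p=0.025, monthly_visitors=60_000))  # 0.2
```

A return value of `None` means none of the candidate lifts is detectable within a month, which is the signal to pool traffic or move to a higher-converting step.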
Sample size FAQ
There's no universal number — it depends on your baseline conversion rate and the smallest lift you want to detect. For a typical Shopify store with a 2-3% site-wide conversion rate testing for a 10% relative lift, expect to need 50,000-80,000 visitors per variant at 95% confidence and 80% power.
Calculate the sample size before the test starts, because the calculation is your exit criterion. Without a planned sample size, you'll either peek at results and stop early on noise (inflating false positives well above 5%) or run forever and waste traffic on inconclusive variants. Pre-committing also forces an honest conversation about whether the test is even viable on your traffic.
Lower baselines need dramatically more visitors. A 1% baseline needs roughly 4× the sample size of a 4% baseline to detect the same relative lift, because rare events have higher relative variance. That's why deep-funnel tests (e.g. checkout completion) require more traffic than top-of-funnel tests (e.g. add-to-cart).
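With the relative MDE held fixed, the only part of the formula that changes with the baseline is (1 - p) / p. A small check of the 1% versus 4% comparison, using an illustrative helper name:

```python
def baseline_factor(p):
    """Baseline-dependent part of the sample size at a fixed relative MDE."""
    return (1 - p) / p


ratio = baseline_factor(0.01) / baseline_factor(0.04)
print(f"A 1% baseline needs about {ratio:.1f}x the sample of a 4% baseline")
```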
MDE is the smallest improvement your test can reliably catch given the sample size you plan to collect. Set it too small and the test runs forever; set it too large and the test will miss real but modest wins. A common starting point for online retail is a 10% relative lift on the primary metric.
Stopping a test the moment it looks significant is not safe without correction. Peeking at fixed-horizon tests and stopping when p < 0.05 inflates your real false-positive rate to roughly 20-30%. If you need early-stopping behaviour, switch to a sequential testing method (e.g. Bayesian bandits or always-valid p-values), but be aware that those have their own sample-size logic.
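A small simulation makes the inflation visible. The sketch below runs A/A tests, where both arms share the same true rate so every "significant" result is a false positive, and compares stopping at the first significant peek with checking once at the end. The batch size, number of peeks, and number of runs are arbitrary choices for illustration; the exact rates depend on them.

```python
import random
from math import sqrt


def z_stat(conv_a, conv_b, n):
    """Pooled two-proportion z statistic with n visitors in each arm."""
    pooled = (conv_a + conv_b) / (2 * n)
    if pooled in (0, 1):
        return 0.0
    se = sqrt(2 * pooled * (1 - pooled) / n)
    return (conv_b / n - conv_a / n) / se


def simulate_aa(p=0.05, peeks=10, batch=200, runs=500, seed=1):
    """Return (false-positive rate with peeking, rate with a single final look)."""
    rng = random.Random(seed)
    peek_hits = final_hits = 0
    for _ in range(runs):
        conv_a = conv_b = n = 0
        flagged = False
        for _ in range(peeks):
            conv_a += sum(rng.random() < p for _ in range(batch))
            conv_b += sum(rng.random() < p for _ in range(batch))
            n += batch
            significant = abs(z_stat(conv_a, conv_b, n)) > 1.96
            flagged = flagged or significant  # a peeker would have stopped here
        peek_hits += flagged
        final_hits += significant  # only the last look counts at a fixed horizon
    return peek_hits / runs, final_hits / runs


peek_rate, final_rate = simulate_aa()
print(f"with peeking: {peek_rate:.1%}, single final look: {final_rate:.1%}")
```

The single-look rate should hover near the nominal 5%, while the peeking rate typically comes out several times higher. More peeks push it higher still.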
Run the test for at least one full business cycle, typically 14 days, even if you hit the sample size sooner. Different days and traffic sources behave differently: running for only 4 high-traffic days oversamples one segment and gives you a winner that doesn't generalise. Sample size is a floor, not a ceiling.
If your traffic can't reach the required sample size, you have three options. Widen the MDE so you're only testing bold changes likely to move the needle 20%+. Test site-wide changes instead of single-page changes to pool traffic. Or test higher-converting events (add-to-cart, email signup) which need fewer visitors than rare events like checkout completion.
Calculator results are per variant. If a calculator returns 15,000, you need 15,000 in the control AND 15,000 in each treatment, so a standard A/B test needs 30,000 total visitors and an A/B/C test needs 45,000. Multi-variant tests scale linearly in traffic cost.
Relative MDE is the industry default and is what most calculators expect: "a 10% lift on a 4% baseline means 4.40%." Absolute MDE expresses the same change as "+0.40 percentage points." Both work; just be consistent and confirm which one your tool uses. Mixing them misstates the effect by a factor of 1/p (10-25× at baselines of 4-10%), and because sample size scales with the inverse square of the effect, the resulting sample size can be off by 100× or more.
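Converting between the two conventions is a single multiplication or division by the baseline. The helpers below are illustrative names, not part of any calculator's API:

```python
def relative_to_absolute(p, relative_mde):
    """Relative lift (0.10 = 10%) to an absolute difference in conversion rate."""
    return p * relative_mde


def absolute_to_relative(p, absolute_mde):
    """Absolute difference (0.004 = 0.4 points) to a relative lift."""
    return absolute_mde / p


p = 0.04
absolute = relative_to_absolute(p, 0.10)  # 10% relative lift on a 4% baseline
print(f"absolute MDE: {absolute:.4f}")
print(f"effect overstated {0.10 / absolute:.0f}x if 0.10 is read as absolute")
```

Running the conversion before pasting a number into a calculator is a cheap way to confirm which convention the tool expects.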
Power is your test's ability to detect a real effect when one exists. Standard is 80%, meaning a 20% chance of missing a true winner. Raising power to 90% increases sample size by roughly 35%; dropping to 70% reduces it by 20% but means you'll miss almost a third of real wins. 80% is the pragmatic default.
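The cost of changing power can be read straight off the (z_alpha + z_beta)^2 term, since everything else in the formula stays fixed. A minimal sketch, assuming a two-sided 5% significance level and using an illustrative function name:

```python
from statistics import NormalDist


def power_multiplier(power, reference_power=0.80, alpha=0.05):
    """Sample size at `power` relative to the sample size at `reference_power`."""
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2)
    return ((z_alpha + z(power)) / (z_alpha + z(reference_power))) ** 2


for power in (0.70, 0.80, 0.90):
    print(f"{power:.0%} power: {power_multiplier(power):.2f}x the 80% sample size")
```

The multipliers come out at about 0.79x for 70% power and 1.34x for 90% power.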