Sample Size

Metricuno
May 17, 2026
4 min read
Sample Size — What sample size means in A/B testing, the formula behind it, and visitor benchmarks for common conversion rates and minimum detectable effects.
Quick answer

Sample size is the number of visitors an A/B test needs to detect a given effect with statistical confidence. Compute it before the test starts — or you're running without an exit criterion.

Definition
Experimentation

Sample Size

The number of visitors per variant an A/B test needs to detect a given effect at a chosen power and significance level.

Sample size is the count of users (or sessions) each variant in an experiment needs before the result can be trusted. It's a function of four inputs: your baseline conversion rate, the minimum effect you care about detecting, your significance threshold (typically 5%), and your statistical power (typically 80%).

The number is computed before the test launches, not discovered after. Without a planned sample size, a test has no exit criterion — you end up either stopping early on noise or running indefinitely and burning traffic on inconclusive variants.

Also known as
Required sample size
Test sample size
n per variant

Most teams under-power their tests. A checkout experiment running on 4,000 weekly visitors can't reliably detect a 5% lift on a 2% baseline — it needs roughly 30,000 per variant. Skip the calculation and you'll either ship a losing change or scrap a winning one because the numbers wobbled.

Sample size sits inside the broader statistical analysis layer of experimentation. It determines test duration (sample size ÷ daily traffic per variant), which in turn determines how many tests your store can run per quarter. On low-traffic pages, the calculation often reveals that the experiment isn't viable — better to know that on day zero than week six.

Formula

n = 2 * ( (z_alpha + z_beta)^2 * p * (1 - p) ) / (MDE * p)^2

Variables

n

Sample size per variant

Visitors needed in each variant (control and treatment).

p

Baseline conversion rate

Current conversion rate of the control, as a decimal (e.g. 0.025 for 2.5%).

MDE

Minimum detectable effect

Smallest relative lift you want to detect, as a decimal (e.g. 0.10 for a 10% relative lift).

z_alpha

Significance z-score

1.96 for a two-sided test at 95% confidence (α = 0.05).

z_beta

Power z-score

0.84 for 80% statistical power.

Worked example

A Shopify apparel store wants to test a new product-page hero. Current PDP-to-add-to-cart rate is 4%. The team wants to detect a 10% relative lift (i.e. a move from 4.00% to 4.40%) at 95% confidence and 80% power.

Baseline conversion rate (p): 0.04

Minimum detectable effect (MDE): 0.10 (relative)

z_alpha (two-sided, 95%): 1.96

z_beta (80% power): 0.84

≈ 15,700 visitors per variant (≈ 31,400 total)

On 2,500 PDP visitors per day, this test needs about 13 days to reach the planned sample size. Anything shorter risks calling a result on noise.

Two things drive the number more than anything else: the baseline rate and the MDE. Halving the MDE quadruples the sample size. That's why ambitious tests on small lifts ("can we squeeze 2% more out of checkout?") need enormous traffic — and why most stores under €3M revenue should focus on bigger swings.

Benchmark

Visitors per variant needed to detect a relative lift (95% confidence, 80% power)

Baseline conversion rateDetect 5% liftDetect 10% liftDetect 20% liftDetect 50% lift
1% (checkout step)620,000155,00039,0006,300
2.5% (PDP → cart)245,00061,00015,4002,500
5% (site-wide CVR)119,00030,0007,5001,200
10% (email opt-in)57,00014,0003,600580
25% (returning visitors)18,0004,5001,150190

Read the table the practical way: find your baseline rate, then look across to the MDE you can credibly aim for. If the cell exceeds a month of traffic, the test isn't viable in its current form — either widen the MDE (test bolder changes), increase traffic (apply the test site-wide instead of one page), or move to a higher-converting funnel step.

Frequently asked

Sample size FAQ

There's no universal number — it depends on your baseline conversion rate and the smallest lift you want to detect. For a typical Shopify store with a 2-3% site-wide conversion rate testing for a 10% relative lift, expect to need 50,000-80,000 visitors per variant at 95% confidence and 80% power.

Because the calculation is your exit criterion. Without a planned sample size, you'll either peek at results and stop early on noise (inflating false positives well above 5%) or run forever and waste traffic on inconclusive variants. Pre-committing also forces an honest conversation about whether the test is even viable on your traffic.

Lower baselines need dramatically more visitors. A 1% baseline needs roughly 4× the sample size of a 4% baseline to detect the same relative lift, because rare events have higher relative variance. That's why deep-funnel tests (e.g. checkout completion) require more traffic than top-of-funnel tests (e.g. add-to-cart).

MDE is the smallest improvement your test can reliably catch given the sample size you plan to collect. Set it too small and the test runs forever; set it too large and the test will miss real but modest wins. A common starting point for online retail is a 10% relative lift on the primary metric.

Not without correction. Peeking at fixed-horizon tests and stopping when p < 0.05 inflates your real false-positive rate to roughly 20-30%. If you need early-stopping behaviour, switch to a sequential testing method (e.g. Bayesian bandits or always-valid p-values) — but those have their own sample-size logic.

At least one full business cycle, typically 14 days, even if you hit the sample size sooner. Different days and traffic sources behave differently — running for only 4 high-traffic days oversamples one segment and gives you a winner that doesn't generalise. Sample size is a floor, not a ceiling.

Three options. Widen the MDE so you're only testing bold changes likely to move the needle 20%+. Test site-wide changes instead of single-page changes to pool traffic. Or test higher-converting events (add-to-cart, email signup) which need fewer visitors than rare events like checkout completion.

Per variant. If a calculator returns 15,000, you need 15,000 in the control AND 15,000 in each treatment — so a standard A/B test needs 30,000 total visitors, and an A/B/C test needs 45,000. Multi-variant tests scale linearly in traffic cost.

Relative MDE is the industry default and is what most calculators expect — "a 10% lift on a 4% baseline means 4.40%." Absolute MDE expresses the same change as "+0.40 percentage points." Both work; just be consistent and confirm which one your tool uses, because mixing them produces sample sizes that are off by 10-25×.

Power is your test's ability to detect a real effect when one exists. Standard is 80%, meaning a 20% chance of missing a true winner. Raising power to 90% increases sample size by roughly 35%; dropping to 70% reduces it by 20% but means you'll miss almost a third of real wins. 80% is the pragmatic default.

Test ideas before you ship them

Run unlimited A/B tests, attach hypotheses to outcomes, and build a searchable archive of what works — and what doesn't.