Metricuno

May 17, 2026

4 min read

Sample Size — What sample size means in A/B testing, the formula behind it, and visitor benchmarks for common conversion rates and minimum detectable effects.

Quick answer

Sample size is the number of visitors an A/B test needs to detect a given effect with statistical confidence. Compute it before the test starts — or you're running without an exit criterion.

Definition

Experimentation

Sample Size

The number of visitors per variant an A/B test needs to detect a given effect at a chosen power and significance level.

Sample size is the count of users (or sessions) each variant in an experiment needs before the result can be trusted. It's a function of four inputs: your baseline conversion rate, the minimum effect you care about detecting, your significance threshold (typically 5%), and your statistical power (typically 80%).

The number is computed before the test launches, not discovered after. Without a planned sample size, a test has no exit criterion — you end up either stopping early on noise or running indefinitely and burning traffic on inconclusive variants.

Also known as

Required sample size

Test sample size

n per variant

Most teams under-power their tests. A checkout experiment running on 4,000 weekly visitors can't reliably detect a 5% lift on a 2% baseline: by the formula below, it needs roughly 307,000 visitors per variant at 95% confidence and 80% power. Skip the calculation and you'll either ship a losing change or scrap a winning one because the numbers wobbled.

Sample size sits inside the broader statistical analysis layer of experimentation. It determines test duration (sample size ÷ daily traffic per variant), which in turn determines how many tests your store can run per quarter. On low-traffic pages, the calculation often reveals that the experiment isn't viable — better to know that on day zero than week six.
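The duration arithmetic above can be sketched in a few lines of Python. The function name `plan_duration`, the even traffic split, the 90-day quarter, and the example traffic figures are all assumptions made for illustration; they do not come from any particular tool.

```python
import math


def plan_duration(n_per_variant: int, daily_visitors: int, variants: int = 2) -> dict:
    """Estimate how long a test must run to reach its planned sample size.

    Assumes traffic is split evenly across variants (an illustrative assumption).
    """
    if n_per_variant <= 0 or daily_visitors <= 0 or variants < 2:
        raise ValueError("need positive sample size, positive traffic, and at least 2 variants")
    daily_per_variant = daily_visitors / variants
    # Round up: a partial day still has to be run in full.
    days = math.ceil(n_per_variant / daily_per_variant)
    return {
        "days": days,
        "total_visitors": n_per_variant * variants,
        # Back-to-back tests that fit in a 90-day quarter (assumed quarter length).
        "tests_per_quarter": 90 // days,
    }


# Hypothetical page: 5,000 visitors per day, 30,000 visitors needed per variant.
plan = plan_duration(n_per_variant=30_000, daily_visitors=5_000)
print(plan)
```

With these hypothetical inputs the test needs 12 days and 60,000 visitors in total, which leaves room for roughly seven sequential tests per quarter on that page.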

Formula

n = 2 * ( (z_alpha + z_beta)^2 * p * (1 - p) ) / (MDE * p)^2

Variables

n

Sample size per variant

Visitors needed in each variant (control and treatment).

p

Baseline conversion rate

Current conversion rate of the control, as a decimal (e.g. 0.025 for 2.5%).

MDE

Minimum detectable effect

Smallest relative lift you want to detect, as a decimal (e.g. 0.10 for a 10% relative lift).

z_alpha

Significance z-score

1.96 for a two-sided test at 95% confidence (α = 0.05).

z_beta

Power z-score

0.84 for 80% statistical power.
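As a minimal sketch, the formula and variables above translate directly into Python. The function name `sample_size_per_variant` is hypothetical, and the defaults simply reuse the z-scores listed above.

```python
def sample_size_per_variant(p: float, mde: float,
                            z_alpha: float = 1.96, z_beta: float = 0.84) -> float:
    """n = 2 * (z_alpha + z_beta)^2 * p * (1 - p) / (MDE * p)^2"""
    if not 0 < p < 1:
        raise ValueError("p must be a rate strictly between 0 and 1")
    if mde <= 0:
        raise ValueError("mde must be a positive relative lift")
    absolute_effect = mde * p  # relative lift converted to an absolute difference
    return 2 * (z_alpha + z_beta) ** 2 * p * (1 - p) / absolute_effect ** 2


# 5% baseline, 10% relative lift, 95% confidence, 80% power.
n = sample_size_per_variant(p=0.05, mde=0.10)
print(round(n))
```

For a 5% baseline and a 10% relative lift this gives 29,792 visitors per variant, which is the 30,000 shown in the benchmark table after rounding.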

Worked example

A Shopify apparel store wants to test a new product-page hero. Current PDP-to-add-to-cart rate is 4%. The team wants to detect a 10% relative lift (i.e. a move from 4.00% to 4.40%) at 95% confidence and 80% power.

Baseline conversion rate (p): 0.04

Minimum detectable effect (MDE): 0.10 (relative)

z_alpha (two-sided, 95%): 1.96

z_beta (80% power): 0.84

→ n = 2 × (1.96 + 0.84)² × 0.04 × 0.96 ÷ (0.10 × 0.04)² ≈ 37,600 visitors per variant (≈ 75,300 total)

On 2,500 PDP visitors per day, split evenly across both variants, this test needs about 30 days to reach the planned sample size. Anything shorter risks calling a result on noise.

Two things drive the number more than anything else: the baseline rate and the MDE. Halving the MDE quadruples the sample size. That's why ambitious tests on small lifts ("can we squeeze 2% more out of checkout?") need enormous traffic — and why most stores under €3M revenue should focus on bigger swings.
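Rearranging the formula makes both dependencies explicit. This is only an algebraic restatement of the expression in the Formula section, using the same symbols:

```latex
n = \frac{2\,(z_\alpha + z_\beta)^2\, p\,(1-p)}{(\mathrm{MDE}\cdot p)^2}
  = \frac{2\,(z_\alpha + z_\beta)^2}{\mathrm{MDE}^2}\cdot\frac{1-p}{p}

\frac{n(\mathrm{MDE}/2)}{n(\mathrm{MDE})} = 4,
\qquad
\frac{n(p = 0.01)}{n(p = 0.04)} = \frac{0.99/0.01}{0.96/0.04} = \frac{99}{24} \approx 4.1
```

The sample size grows with the inverse square of the MDE and with (1 - p)/p, so small lifts and low baselines are the two expensive directions.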

Benchmark

Visitors per variant needed to detect a relative lift (95% confidence, 80% power)

Baseline conversion rate	Detect 5% lift	Detect 10% lift	Detect 20% lift	Detect 50% lift
1% (checkout step)	620,000	155,000	39,000	6,300
2.5% (PDP → cart)	245,000	61,000	15,400	2,500
5% (site-wide CVR)	119,000	30,000	7,500	1,200
10% (email opt-in)	57,000	14,000	3,600	580
25% (returning visitors)	18,000	4,500	1,150	190

Read the table the practical way: find your baseline rate, then look across to the MDE you can credibly aim for. If the cell exceeds a month of traffic, the test isn't viable in its current form — either widen the MDE (test bolder changes), increase traffic (apply the test site-wide instead of one page), or move to a higher-converting funnel step.
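The viability check described above can also be run in reverse: given the traffic you actually have, solve the formula for the smallest lift it can detect. The sketch below assumes the same formula as the Formula section; the function name `smallest_detectable_mde` and the store figures are hypothetical.

```python
import math


def smallest_detectable_mde(p: float, visitors_per_variant: float,
                            z_alpha: float = 1.96, z_beta: float = 0.84) -> float:
    """Invert n = 2 * (z_alpha + z_beta)^2 * p * (1 - p) / (MDE * p)^2 for MDE."""
    if not 0 < p < 1 or visitors_per_variant <= 0:
        raise ValueError("p must be in (0, 1) and visitors_per_variant must be positive")
    return math.sqrt(2 * (z_alpha + z_beta) ** 2 * (1 - p) / (p * visitors_per_variant))


# Hypothetical store: 2.5% baseline, 40,000 monthly visitors split across two variants.
mde = smallest_detectable_mde(p=0.025, visitors_per_variant=20_000)
print(f"smallest relative lift detectable in a month: {mde:.1%}")
```

For this hypothetical store the answer is about 17.5%, so a test aimed at a 10% lift would not be viable within a month, while one aimed at a 20% lift would be.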

Frequently asked

Sample size FAQ

Q: What's a good sample size for an A/B test?

There's no universal number: it depends on your baseline conversion rate and the smallest lift you want to detect. For a typical Shopify store with a 2-3% site-wide conversion rate testing for a 10% relative lift, expect to need 50,000-80,000 visitors per variant at 95% confidence and 80% power.

Q: Why do I need to calculate sample size before starting the test?

Because the calculation is your exit criterion. Without a planned sample size, you'll either peek at results and stop early on noise (inflating false positives well above 5%) or run forever and waste traffic on inconclusive variants. Pre-committing also forces an honest conversation about whether the test is even viable on your traffic.

Q: How does baseline conversion rate affect sample size?

Lower baselines need dramatically more visitors. A 1% baseline needs roughly 4× the sample size of a 4% baseline to detect the same relative lift, because rare events have higher relative variance. That's why deep-funnel tests (e.g. checkout completion) require more traffic than top-of-funnel tests (e.g. add-to-cart).

Q: What is the minimum detectable effect (MDE)?

MDE is the smallest improvement your test can reliably catch given the sample size you plan to collect. Set it too small and the test runs forever; set it too large and the test will miss real but modest wins. A common starting point for online retail is a 10% relative lift on the primary metric.

Q: Can I stop a test early if results look significant?

Not without correction. Peeking at fixed-horizon tests and stopping when p < 0.05 inflates your real false-positive rate to roughly 20-30%. If you need early-stopping behaviour, switch to a sequential testing method (e.g. Bayesian bandits or always-valid p-values), but those have their own sample-size logic.
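A small simulation illustrates the inflation. This is a sketch under stated assumptions, not a model of any specific testing tool: there is no true effect (an A/A test), data arrives in equal-sized batches, and the z-statistic at each look is modelled as a normalised random walk. The function name, the 14 daily looks, and the seed are illustrative choices.

```python
import random


def false_positive_rate(peeks: int, runs: int = 2000,
                        z_crit: float = 1.96, seed: int = 7) -> float:
    """Share of simulated A/A tests that cross |z| > z_crit at any of `peeks` looks."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(runs):
        total = 0.0
        for k in range(1, peeks + 1):
            total += rng.gauss(0.0, 1.0)  # one more equal-sized batch, no true effect
            z = total / k ** 0.5          # z-statistic using all data collected so far
            if abs(z) > z_crit:
                hits += 1                 # the experimenter stops and declares a winner
                break
    return hits / runs


single_look = false_positive_rate(peeks=1)
daily_peeks = false_positive_rate(peeks=14)
print(f"one look at the end: {single_look:.1%}")
print(f"peeking daily for 14 days: {daily_peeks:.1%}")
```

The single-look rate should land near the nominal 5%, while stopping at the first significant daily look pushes it several times higher, even though nothing real is being detected.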

Q: How long should an A/B test run once it hits the sample size?

At least one full business cycle, typically 14 days, even if you hit the sample size sooner. Different days and traffic sources behave differently: running for only 4 high-traffic days oversamples one segment and gives you a winner that doesn't generalise. Sample size is a floor, not a ceiling.

Q: What if my store doesn't have enough traffic for the calculated sample size?

Three options. Widen the MDE so you're only testing bold changes likely to move the needle 20%+. Test site-wide changes instead of single-page changes to pool traffic. Or test higher-converting events (add-to-cart, email signup) which need fewer visitors than rare events like checkout completion.

Q: Does sample size apply per variant or total?

Per variant. If a calculator returns 15,000, you need 15,000 in the control AND 15,000 in each treatment, so a standard A/B test needs 30,000 total visitors, and an A/B/C test needs 45,000. Multi-variant tests scale linearly in traffic cost.

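Q: Should I use absolute or relative MDE?

Relative MDE is the industry default and is what most calculators expect: "a 10% lift on a 4% baseline means 4.40%." Absolute MDE expresses the same change as "+0.40 percentage points." Both work; just be consistent and confirm which one your tool uses. Entering one where the tool expects the other changes the effect size by a factor of 1/baseline (25× at a 4% baseline, 10× at a 10% baseline), and because sample size scales with the inverse square of the effect, the result is off by the square of that factor.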
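To see how far a mix-up can throw the number, here is a small sketch that states the effect in absolute terms and then feeds it a relative value by mistake. The helper names are hypothetical; the formula is the one from the Formula section with MDE × p replaced by the absolute difference.

```python
def relative_to_absolute(p: float, relative_mde: float) -> float:
    """Convert a relative lift into an absolute difference (as a decimal)."""
    return p * relative_mde


def n_from_absolute(p: float, absolute_mde: float,
                    z_alpha: float = 1.96, z_beta: float = 0.84) -> float:
    """Sample size per variant when the effect is given as an absolute difference."""
    return 2 * (z_alpha + z_beta) ** 2 * p * (1 - p) / absolute_mde ** 2


p = 0.04
correct = n_from_absolute(p, relative_to_absolute(p, 0.10))  # 10% relative = +0.40 points
mixed_up = n_from_absolute(p, 0.10)                          # 0.10 misread as +10 points
print(round(correct), round(mixed_up), round(correct / mixed_up))
```

On a 4% baseline the correct figure is 37,632 per variant, while the mixed-up input returns about 60, a gap of 625× that would end the test long before it could detect anything.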

Q: How does statistical power affect the number?

Power is your test's ability to detect a real effect when one exists. Standard is 80%, meaning a 20% chance of missing a true winner. Raising power to 90% increases sample size by roughly 35%; dropping to 70% reduces it by 20% but means you'll miss almost a third of real wins. 80% is the pragmatic default.
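The cost of changing power can be checked with the standard library's `statistics.NormalDist`, which supplies the z-scores for any significance level and power. The function names below are hypothetical, and the sketch assumes a two-sided test with everything else held fixed, so only the (z_alpha + z_beta)² term changes.

```python
from statistics import NormalDist


def z_scores(alpha: float = 0.05, power: float = 0.80) -> tuple:
    """Return (z_alpha, z_beta) for a two-sided test at the given alpha and power."""
    std_normal = NormalDist()
    return std_normal.inv_cdf(1 - alpha / 2), std_normal.inv_cdf(power)


def relative_cost(power: float, reference_power: float = 0.80, alpha: float = 0.05) -> float:
    """Sample size at `power` divided by sample size at `reference_power`."""
    z_a, z_b = z_scores(alpha, power)
    ref_a, ref_b = z_scores(alpha, reference_power)
    return ((z_a + z_b) / (ref_a + ref_b)) ** 2


for power in (0.70, 0.80, 0.90):
    print(f"power {power:.0%}: x{relative_cost(power):.2f} the sample size at 80%")
```

The ratios come out at about 0.79 for 70% power and 1.34 for 90% power, in line with the rounded percentages quoted above.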
