Sample Size Calculator

Metricuno

May 17, 2026

5 min read

Sample Size Calculator — Calculate the exact users per variant your A/B test needs to detect a target lift with 80% power and 95% confidence. Free, instant, with worked examples.

Quick answer

A pre-test planning tool that tells you exactly how many visitors per variant you need to detect a target lift — before you waste two weeks on an underpowered test.

Definition

Experimentation

Sample Size Calculator

A tool that tells you how many users per A/B test variant you need to reliably detect a target lift at 80% power and 95% confidence.

A sample size calculator is the pre-test planning step nobody on your team should skip. You feed it your current conversion rate (the baseline), the smallest lift you'd care about (the minimum detectable effect), and your statistical thresholds. It returns the visitors per variant required for the test to actually distinguish a real winner from noise.

Skip this step and you end up in one of two failure modes: running for three weeks, declaring a 'flat' result, and never knowing if you missed a true 5% lift — or calling a winner at day four on 800 sessions, shipping it, and watching the revenue lift fail to materialise. The calculator removes that ambiguity before traffic starts flowing.

Also known as

A/B test sample size calculator

Test duration calculator

Power analysis calculator

Calculator

A/B Test Sample Size Calculator

Inputs

Baseline conversion rate

Your current conversion rate, before the test.

Minimum detectable effect (relative)

Smallest relative lift you want to detect. 10% means a 5% baseline lifting to 5.5%.

Significance level (α)

0.05 = 5% false-positive rate (standard).

Statistical power (1−β)

0.80 = 80% chance of detecting a true effect (standard).

Enter your current conversion rate (baseline) and the smallest lift you want to detect. The calculator returns required visitors per variant and total test population. Relative MDE means 'detect a 10% lift on top of baseline' (3.0% → 3.3%). Absolute MDE means 'detect a 10 percentage-point shift' (3.0% → 13.0%). For most checkout, PDP, and signup tests, use relative.

The widget above does the work, but understanding what it's computing protects you from misreading the output. Four inputs drive every sample size estimate — baseline rate, minimum detectable effect, significance level (alpha), and statistical power. Change any one and the required traffic moves, sometimes dramatically.
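To see how sensitive the estimate is to the two statistical thresholds, here is a minimal Python sketch. It assumes the standard two-tailed normal approximation, in which alpha and power enter only through the multiplier (Z_α/2 + Z_β)² that appears in the formula below. The function name `threshold_multiplier` is illustrative, not part of any calculator's API.

```python
from statistics import NormalDist


def threshold_multiplier(alpha: float, power: float) -> float:
    """Return (Z_alpha/2 + Z_beta)^2 for a two-tailed test."""
    normal = NormalDist()
    z_alpha = normal.inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = normal.inv_cdf(power)           # critical value for power
    return (z_alpha + z_beta) ** 2


# Compare stricter settings against the 95% confidence / 80% power default.
default = threshold_multiplier(0.05, 0.80)
for alpha, power in [(0.05, 0.80), (0.01, 0.80), (0.05, 0.90), (0.01, 0.90)]:
    ratio = threshold_multiplier(alpha, power) / default
    print(f"alpha={alpha:.2f} power={power:.2f} -> {ratio:.2f}x the default traffic")
```

Under these assumptions the printed ratios come out near 1.5x for 99% confidence alone, 1.3x for 90% power alone, and 1.9x for both together.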

The formula behind the number

Formula

n = ((Z_α/2 + Z_β)² × (p₁(1-p₁) + p₂(1-p₂))) / (p₂ - p₁)²

Variables

n: Sample size per variant. Visitors needed in each arm of the test.

p₁: Baseline conversion rate. Your current conversion rate (the control).

p₂: Target conversion rate. Baseline plus the minimum detectable effect.

Z_α/2: Critical value for significance. 1.96 for a two-tailed test at α = 0.05.

Z_β: Critical value for power. 0.84 for 80% statistical power.

Worked example

A Shopify apparel store wants to test a new product-page hero. Checkout conversion sits at 3.0%, and the team won't ship anything below a 10% relative lift (so p₂ = 3.3%). Standard thresholds: 95% confidence, 80% power, two-tailed.

Baseline (p₁): 3.0%

Target (p₂): 3.3%

Z_α/2: 1.96

Z_β: 0.84

→ ~53,100 visitors per variant (~106,300 total)

At 8,000 PDP visitors per week, that's a 13-week test. If you can't run that long, either accept a bigger MDE, which means you'll only detect bigger wins, or test higher up the funnel where volume is larger.

The square in the denominator is what makes small effects so expensive. Halving the MDE roughly quadruples the required sample. That's why a 5% relative lift takes four times the traffic of a 10% lift, not twice.
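The formula and the scaling rule can be checked with a short Python sketch. It is one plausible reading of the formula above, not the calculator's actual implementation: the function name `visitors_per_variant` is hypothetical, and it uses unrounded z-values from the standard library rather than 1.96 and 0.84.

```python
import math
from statistics import NormalDist


def visitors_per_variant(baseline: float, relative_mde: float,
                         alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-variant sample size from the two-proportion formula."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)       # target conversion rate
    normal = NormalDist()
    z_alpha = normal.inv_cdf(1 - alpha / 2)  # about 1.96 at alpha = 0.05
    z_beta = normal.inv_cdf(power)           # about 0.84 at 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


# Worked-example inputs: 3.0% baseline, 10% relative lift, then half that MDE.
n_10 = visitors_per_variant(0.03, 0.10)
n_05 = visitors_per_variant(0.03, 0.05)
print(f"10% relative lift: {n_10:,} per variant")
print(f" 5% relative lift: {n_05:,} per variant ({n_05 / n_10:.1f}x)")
```

Evaluated this way, the worked example's inputs land a little above 53,000 per variant, and halving the MDE multiplies that by about 3.9. Published figures can differ by a few percent depending on rounding and on which variance approximation a calculator uses.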

Typical sample sizes by baseline and MDE

Benchmark

Visitors required per variant at 80% power, 95% confidence, two-tailed

Baseline conversion rate	Detect 5% relative lift	Detect 10% relative lift	Detect 20% relative lift	Detect 50% relative lift
1.0% (cold traffic landing page)	620,000	156,000	39,500	6,800
2.0% (typical PDP)	308,000	77,200	19,500	3,400
3.0% (strong PDP / category page)	205,000	51,200	13,000	2,250
5.0% (newsletter signup form)	121,000	30,200	7,700	1,330
10% (add-to-cart on hot SKU)	57,000	14,300	3,640	630
25% (checkout completion)	19,200	4,800	1,220	210

Read the table from your row, not the column. If your PDP converts at 2% and you want to catch a 10% lift, you need 77,000 visitors per variant — about 154,000 total. At 10,000 PDP visitors per week, that's a 15-week test. Most teams won't wait, which is why the honest move is to set your MDE based on the traffic you actually have.
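Setting the MDE from the traffic you have means running the calculation backwards. The sketch below does that by bisection, under the same two-proportion formula and default thresholds as above. The function names and the six-week limit in the usage line are assumptions made for illustration.

```python
from statistics import NormalDist


def visitors_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Two-proportion sample size per variant (same formula as the article)."""
    p1, p2 = baseline, baseline * (1 + relative_mde)
    normal = NormalDist()
    z_sum = normal.inv_cdf(1 - alpha / 2) + normal.inv_cdf(power)
    return z_sum ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2


def smallest_detectable_lift(baseline, weekly_visitors, max_weeks, arms=2):
    """Bisect for the smallest relative lift the available traffic can detect."""
    budget_per_arm = weekly_visitors * max_weeks / arms
    low = 0.001
    high = min(5.0, 0.99 / baseline - 1)  # keep the target rate below 100%
    if visitors_per_variant(baseline, high) > budget_per_arm:
        raise ValueError("not enough traffic even for the largest lift considered")
    for _ in range(60):
        mid = (low + high) / 2
        if visitors_per_variant(baseline, mid) > budget_per_arm:
            low = mid    # this lift is too small to detect with the budget
        else:
            high = mid   # detectable, so try a smaller lift
    return high


# Hypothetical plan: 2% baseline, 10,000 PDP visitors per week, six weeks at most.
lift = smallest_detectable_lift(0.02, 10_000, 6)
print(f"Smallest detectable relative lift: {lift:.1%}")
```

For that hypothetical plan the answer is a relative lift of roughly 17%, which sits between the 10% and 20% columns of the 2.0% row.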

Where teams get the inputs wrong

The most common mistake is anchoring MDE to wishful thinking instead of historical data. Past winning tests on your store probably moved the needle 3-8% relative — not 25%. If you plan around a 25% MDE, you'll underpower every test that finds a real 5% win, then conclude 'CRO doesn't work here'.

The second mistake is mid-test peeking and early stopping. Sample size calculators assume you check the result once, at the end. Peek daily and call a winner the first time p < 0.05 and your true false-positive rate balloons past 25%. Either commit to the planned duration or use a sequential testing method built for repeated looks.
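A small simulation makes the peeking problem concrete. This is a sketch under stated assumptions, not a model of any real store: both arms share the same true conversion rate (an A/A test), daily conversion counts are drawn from a normal approximation to the binomial, and the traffic numbers, 14-day duration, and function name `simulate_peeking` are all hypothetical.

```python
import math
import random
from statistics import NormalDist


def simulate_peeking(n_sims=2000, days=14, daily_visitors=1000,
                     rate=0.03, alpha=0.05, seed=7):
    """Return (false-positive rate with daily peeks, rate with one final look)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    mean = daily_visitors * rate
    sd = math.sqrt(daily_visitors * rate * (1 - rate))
    peek_hits = 0
    final_hits = 0
    for _ in range(n_sims):
        conv_a = conv_b = 0.0
        visitors = 0
        ever_significant = False
        significant = False
        for _day in range(days):
            # Both arms have the same true rate, so any "winner" is a false positive.
            conv_a += max(0.0, rng.gauss(mean, sd))
            conv_b += max(0.0, rng.gauss(mean, sd))
            visitors += daily_visitors
            pooled = (conv_a + conv_b) / (2 * visitors)
            se = math.sqrt(2 * pooled * (1 - pooled) / visitors)
            if se == 0:
                continue
            z = (conv_b - conv_a) / visitors / se
            significant = abs(z) > z_crit
            ever_significant = ever_significant or significant
        peek_hits += ever_significant  # stopped at the first significant look
        final_hits += significant      # looked only once, on the last day
    return peek_hits / n_sims, final_hits / n_sims


peek_rate, final_rate = simulate_peeking()
print(f"False positives with daily peeking: {peek_rate:.1%}")
print(f"False positives with one final look: {final_rate:.1%}")
```

With a single look the rate stays close to the nominal 5%, while stopping at the first significant daily look pushes it to several times that. How far it climbs depends on how many looks you take.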

Third: forgetting that variants split the traffic. A 100,000-visitor weekly budget tested across three variants gives each arm 33,000 visitors, not 100,000. The calculator returns per-variant counts for a reason.
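The arithmetic is easy to script. This sketch takes the per-variant requirement as given (here the table's figure for a 3.0% baseline and a 10% relative lift) and shows how the time to fill each arm grows with the number of arms. The function name `weeks_to_fill` is illustrative.

```python
def weeks_to_fill(per_variant_needed: float, weekly_visitors: float, arms: int) -> float:
    """Weeks until every arm has collected its required sample."""
    per_arm_weekly = weekly_visitors / arms  # traffic is split evenly across arms
    return per_variant_needed / per_arm_weekly


# 100,000 visitors per week, 51,200 needed per variant.
for arms in (2, 3, 4):
    weeks = weeks_to_fill(51_200, 100_000, arms)
    print(f"{arms} arms: {100_000 / arms:,.0f} visitors per arm per week, {weeks:.1f} weeks")
```

Each extra arm leaves the per-variant requirement unchanged but shrinks the weekly traffic feeding it, so duration scales with the number of arms.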

If the answer is 'more traffic than you have'

Don't run the test. You have three honest options: (1) accept a larger MDE and only chase bigger wins, (2) test further up the funnel where volume is higher (PDP visitors > checkout starters), or (3) bundle several smaller changes into one bolder variant. Running an underpowered test and calling it 'flat' is the worst outcome — you've spent the calendar time and learned nothing.

Frequently asked

Sample size calculator FAQ

Why 80% power and 95% confidence as defaults?

They're the industry-standard thresholds: a 5% false-positive rate (alpha) and a 20% false-negative rate (beta). You can tighten either, but 99% confidence raises required traffic by roughly 50%, and 90% power adds about another 30% on top, so together they nearly double it. For most product and checkout tests, the 80/95 combination is the right cost/risk balance.

Should I use relative or absolute MDE?

Relative for nearly every e-commerce test. 'A 10% lift on top of my 3% conversion rate' is how the business thinks about wins. Absolute MDE ('detect a 10 percentage-point shift') only makes sense for high-baseline metrics like email open rates or completion rates above 50%.

One-tailed or two-tailed test?

Two-tailed, almost always. A one-tailed test assumes you only care if the variant wins, but a variant that loses by 8% is information you need (you'd keep the control with confidence). Two-tailed is the safer default and what most calculators use by default.

How do I pick the right MDE for my store?

Look at your last 10-20 test results. Take the median effect size of the winners and use that as your floor. If you've never run tests, start with 10% relative for PDP and category tests, 5% for checkout (where small lifts are huge in absolute revenue), and 15-20% for landing pages and ad creative.

What if I have multiple variants? Does the sample size change?

Yes, in two ways. Each arm still needs the full per-arm sample size from the calculator, so a three-arm test (control plus two variants) needs about 1.5x the total traffic of a two-arm A/B test. And running multiple comparisons inflates the false-positive rate; if you test 5 variants against control, apply a Bonferroni correction (divide alpha by the number of comparisons, here 0.05 / 5 = 0.01), which raises the per-arm sample size further.

Can I use this calculator for revenue per visitor or AOV tests?

Not directly. The formula above is for binary conversion (converted / didn't). Continuous metrics like revenue per visitor need a different calculation that accounts for the variance of the metric, and they typically require 5-10x the sample of a conversion test on the same traffic. Most platforms have a separate 'continuous' mode.

How does test duration relate to sample size?

Duration = total required sample size (all variants combined) ÷ daily traffic to the test area. The calculator tells you the 'how many'; your analytics tells you the 'how fast'. Always run for at least one full business cycle (typically 14 days) even if you hit the sample size faster, to absorb weekday/weekend behaviour differences.

What's the difference between a sample size and a significance calculator?

A sample size calculator runs before the test and tells you how many users you need. A significance calculator runs after and tells you whether the lift you observed is statistically real. Use them as a pair: plan with the first, validate with the second. Never use the significance calculator to decide when to stop a running test.

Does the calculator account for Shopify-specific noise like bot traffic or returning visitors?

No formula does; those are upstream data-quality issues. Filter out bots and internal traffic at the analytics layer before counting visitors, and decide upfront whether your unit of randomisation is the visitor or the session. Inconsistent counting between the planning and analysis stages is a common reason tests look underpowered after the fact.

Is 'statistical significance' the same as 'the variant is better'?

No. Significance means the observed difference is unlikely to be random noise. It says nothing about whether the lift is large enough to be worth shipping, whether it'll hold up across segments, or whether it'll persist over time. A statistically significant 1.5% lift on a checkout step might not survive a holiday traffic mix shift.

Test ideas before you ship them

Run unlimited A/B tests, attach hypotheses to outcomes, and build a searchable archive of what works and what doesn't.

Launch your first experiment