Sample Size Calculator

A pre-test planning tool that estimates how many visitors per variant you need to detect a target lift, before you waste two weeks on an underpowered test.
A tool that tells you how many users per A/B test variant you need to reliably detect a target lift at 80% power and 95% confidence.
A sample size calculator is the pre-test planning step nobody on your team should skip. You feed it your current conversion rate (the baseline), the smallest lift you'd care about (the minimum detectable effect), and your statistical thresholds. It returns the visitors per variant required for the test to actually distinguish a real winner from noise.
Skip this step and you end up in one of two failure modes: running for three weeks, declaring a 'flat' result, and never knowing if you missed a true 5% lift — or calling a winner at day four on 800 sessions, shipping it, and watching the revenue lift fail to materialise. The calculator removes that ambiguity before traffic starts flowing.
A/B Test Sample Size Calculator
Inputs:
Baseline conversion rate (%): your current conversion rate, before the test.
Minimum detectable effect, relative (%): the smallest relative lift you want to detect. 10% means a 5% baseline lifting to 5.5%.
Significance level (α): 0.05 = 5% false-positive rate (standard).
Statistical power (1−β): 0.80 = 80% chance of detecting a true effect (standard).
Outputs:
Visitors needed per variant
Total visitors (control + variant)
Enter your current conversion rate (baseline) and the smallest lift you want to detect. The calculator returns required visitors per variant and total test population. Relative MDE means 'detect a 10% lift on top of baseline' (3.0% → 3.3%). Absolute MDE means 'detect a 10 percentage-point shift' (3.0% → 13.0%). For most checkout, PDP, and signup tests, use relative.
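The difference between the two MDE conventions is easiest to see as arithmetic. The sketch below is illustrative only; the helper name `target_rate` is hypothetical and not part of the calculator.

```python
def target_rate(baseline, mde, relative=True):
    """Return the target conversion rate implied by a baseline and an MDE.

    relative=True  -> the MDE is a lift on top of the baseline (3.0% -> 3.3%)
    relative=False -> the MDE is a percentage-point shift (3.0% -> 13.0%)
    """
    return baseline * (1 + mde) if relative else baseline + mde


# Same "10%" input, very different targets:
print(f"{target_rate(0.03, 0.10, relative=True):.1%}")   # 3.3%
print(f"{target_rate(0.03, 0.10, relative=False):.1%}")  # 13.0%
```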
The widget above does the work, but understanding what it's computing protects you from misreading the output. Four inputs drive every sample size estimate — baseline rate, minimum detectable effect, significance level (alpha), and statistical power. Change any one and the required traffic moves, sometimes dramatically.
The formula behind the number
n = ((Z_α/2 + Z_β)² × (p₁(1-p₁) + p₂(1-p₂))) / (p₂ - p₁)²
n: Sample size per variant. Visitors needed in each arm of the test.
p₁: Baseline conversion rate. Your current conversion rate (the control).
p₂: Target conversion rate. Baseline plus the minimum detectable effect.
Z_α/2: Critical value for significance. 1.96 for a two-tailed test at α = 0.05.
Z_β: Critical value for power. 0.84 for 80% statistical power.
A Shopify apparel store wants to test a new product-page hero. Checkout conversion sits at 3.0%, and the team won't ship anything below a 10% relative lift (so p₂ = 3.3%). Standard thresholds: 95% confidence, 80% power, two-tailed.
Baseline (p₁): 3.0%
Target (p₂): 3.3%
Z_α/2: 1.96
Z_β: 0.84
→ ~53,100 visitors per variant (~106,300 total)
At 8,000 PDP visitors per week, that's about a 13-week test. If you can't run that long, either raise the MDE and accept that you'll only detect bigger wins, or test higher up the funnel where volume is larger.
The square in the denominator is what makes small effects so expensive. Halving the MDE roughly quadruples the required sample. That's why a 5% relative lift takes four times the traffic of a 10% lift, not twice.
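For readers who want to check a calculator's output by hand, here is a minimal Python sketch of the formula above. The function name `sample_size_per_variant` and its parameters are illustrative assumptions, not any particular tool's API. It takes unrounded critical values from the standard library instead of 1.96 and 0.84, so results can differ slightly from a hand calculation, and calculators that use a different variance approximation may differ by a few percent more.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Visitors per arm for a two-tailed test on a binary conversion metric."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)             # target rate
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 at alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 at 80% power
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance_sum / (p2 - p1) ** 2
    return math.ceil(n)


# Worked example from the text: 3.0% baseline, 10% relative lift.
n_10 = sample_size_per_variant(0.03, 0.10)
print("per variant:", n_10, "total:", 2 * n_10)

# Halving the MDE to 5% should cost roughly four times the traffic.
n_5 = sample_size_per_variant(0.03, 0.05)
print("ratio 5% vs 10% MDE:", round(n_5 / n_10, 2))
```

The ratio comes out slightly under 4 because the variance term in the numerator also shrinks a little as p₂ moves closer to p₁.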
Typical sample sizes by baseline and MDE
Approximate visitors required per variant at 80% power, 95% confidence, two-tailed
| Baseline conversion rate | Detect 5% relative lift | Detect 10% relative lift | Detect 20% relative lift | Detect 50% relative lift |
|---|---|---|---|---|
| 1.0% (cold traffic landing page) | 620,000 | 156,000 | 39,500 | 6,800 |
| 2.0% (typical PDP) | 308,000 | 77,200 | 19,500 | 3,400 |
| 3.0% (strong PDP / category page) | 205,000 | 51,200 | 13,000 | 2,250 |
| 5.0% (newsletter signup form) | 121,000 | 30,200 | 7,700 | 1,330 |
| 10% (add-to-cart on hot SKU) | 57,000 | 14,300 | 3,640 | 630 |
| 25% (checkout completion) | 19,200 | 4,800 | 1,220 | 210 |
Read the table from your row, not the column. If your PDP converts at 2% and you want to catch a 10% lift, you need 77,000 visitors per variant — about 154,000 total. At 10,000 PDP visitors per week, that's a 15-week test. Most teams won't wait, which is why the honest move is to set your MDE based on the traffic you actually have.
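Setting the MDE from available traffic means running the formula in reverse. The sketch below does that with a simple bisection search. The function names are hypothetical, the six-week limit is an assumed example value, and the search assumes a baseline of 50% or lower so the target rate stays below 100%.

```python
from statistics import NormalDist


def sample_size_per_variant(p1, relative_mde, alpha=0.05, power=0.80):
    """Per-arm sample size for a two-tailed test on a conversion rate."""
    p2 = p1 * (1 + relative_mde)
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return z ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2


def smallest_detectable_mde(p1, visitors_per_variant, alpha=0.05, power=0.80):
    """Smallest relative lift detectable with the given per-arm traffic."""
    lo, hi = 1e-4, 1.0  # search relative lifts between 0.01% and 100%
    for _ in range(60):
        mid = (lo + hi) / 2
        if sample_size_per_variant(p1, mid, alpha, power) > visitors_per_variant:
            lo = mid  # this lift needs more traffic than we have
        else:
            hi = mid  # this lift is detectable; try a smaller one
    return hi  # stays near 1.0 if even a 100% lift is out of reach


# 2% baseline, 10,000 PDP visitors per week, and an assumed six-week limit.
per_variant = 10_000 * 6 / 2
mde = smallest_detectable_mde(0.02, per_variant)
print(f"Smallest detectable relative lift: {mde:.1%}")
```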
Where teams get the inputs wrong
The most common mistake is anchoring MDE to wishful thinking instead of historical data. Past winning tests on your store probably moved the needle 3-8% relative, not 25%. If you plan around a 25% MDE, every test will be underpowered for a real 5% win, and you'll end up concluding 'CRO doesn't work here'.
The second mistake is mid-test peeking and early stopping. Sample size calculators assume you check the result once, at the end. Peek daily and call a winner the first time p < 0.05 and your true false-positive rate balloons past 25%. Either commit to the planned duration or use a sequential testing method built for repeated looks.
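The effect of peeking can be checked with a small simulation. The sketch below is an idealised model, not a reproduction of any testing platform: it simulates A/A tests (no true difference), assumes equal-sized daily batches and a normal approximation for the test statistic, and stops the first time the running z-score crosses the two-tailed threshold. The 20 looks and 5,000 simulations are arbitrary example values.

```python
import random
from statistics import NormalDist


def peeking_false_positive_rate(looks=20, simulations=5000, alpha=0.05, seed=1):
    """Share of A/A tests declared 'significant' at any of the daily looks."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    false_positives = 0
    for _ in range(simulations):
        running_sum = 0.0
        for k in range(1, looks + 1):
            # Each day's batch adds an independent standard-normal increment;
            # the cumulative z-score after k equal batches is sum / sqrt(k).
            running_sum += rng.gauss(0.0, 1.0)
            if abs(running_sum / k ** 0.5) > z_crit:
                false_positives += 1
                break  # the tester "calls a winner" and stops
    return false_positives / simulations


print("one look at the end:", peeking_false_positive_rate(looks=1))
print("peeking on 20 days: ", peeking_false_positive_rate(looks=20))
```

With a single look the rate stays near the nominal 5%; with daily looks it lands several times higher.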
Third: forgetting that variants split the traffic. A 100,000-visitor weekly budget tested across three variants gives each arm 33,000 visitors, not 100,000. The calculator returns per-variant counts for a reason.
If the answer is 'more traffic than you have'
Don't run the test. You have three honest options: (1) accept a larger MDE and only chase bigger wins, (2) test further up the funnel where volume is higher (PDP visitors > checkout starters), or (3) bundle several smaller changes into one bolder variant. Running an underpowered test and calling it 'flat' is the worst outcome — you've spent the calendar time and learned nothing.
Sample size calculator FAQ
80% power and 95% confidence are the industry-standard thresholds: a 5% false-positive rate (alpha) and a 20% false-negative rate (beta). You can tighten either, but 99% confidence raises required traffic by roughly 50%, 90% power adds about a third, and doing both nearly doubles it. For most product and checkout tests, the 80/95 combination is the right cost/risk balance.
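The cost of tighter thresholds depends only on the critical values, because the variance and effect-size terms cancel when baseline and MDE are held fixed. The sketch below computes that multiplier against the 80/95 default; `traffic_multiplier` is a hypothetical helper name.

```python
from statistics import NormalDist


def traffic_multiplier(alpha, power, base_alpha=0.05, base_power=0.80):
    """How much more traffic (alpha, power) needs than the baseline thresholds."""
    def z_sum_squared(a, p):
        nd = NormalDist()
        return (nd.inv_cdf(1 - a / 2) + nd.inv_cdf(p)) ** 2

    return z_sum_squared(alpha, power) / z_sum_squared(base_alpha, base_power)


print("99% confidence, 80% power:", round(traffic_multiplier(0.01, 0.80), 2))
print("95% confidence, 90% power:", round(traffic_multiplier(0.05, 0.90), 2))
print("99% confidence, 90% power:", round(traffic_multiplier(0.01, 0.90), 2))
```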
Use relative MDE for nearly every e-commerce test. 'A 10% lift on top of my 3% conversion rate' is how the business thinks about wins. Absolute MDE ('detect a 10 percentage-point shift') only makes sense for high-baseline metrics like email open rates or completion rates above 50%.
Use a two-tailed test, almost always. A one-tailed test only looks for a win, but a variant that loses by 8% is information you need (you'd keep the control with confidence). Two-tailed is the safer default and the one most calculators use.
To choose an MDE, look at your last 10-20 test results. Take the median effect size of the winners and use that as your floor. If you've never run tests, start with 10% relative for PDP and category tests, 5% for checkout (where small lifts are huge in absolute revenue), and 15-20% for landing pages and ad creative.
Adding variants raises the traffic requirement in two ways. Each arm still needs the full per-arm sample size from the calculator, so total traffic scales with the number of arms: a test with a control plus two variants needs about 1.5x the total traffic of an A/B test. And running multiple comparisons inflates the false-positive rate; if you test 5 variants against control, apply a Bonferroni correction (divide alpha by the number of comparisons), which raises the per-arm sample size further.
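Both effects can be combined in one planning step. The sketch below is one way to do it, assuming every variant is compared only against control and that a plain Bonferroni correction is acceptable (it is conservative). The function name `multi_variant_plan` is hypothetical.

```python
import math
from statistics import NormalDist


def multi_variant_plan(p1, relative_mde, num_variants, alpha=0.05, power=0.80):
    """Per-arm and total sample size with a Bonferroni-adjusted alpha."""
    adjusted_alpha = alpha / num_variants  # one comparison per variant vs control
    p2 = p1 * (1 + relative_mde)
    z = NormalDist().inv_cdf(1 - adjusted_alpha / 2) + NormalDist().inv_cdf(power)
    per_arm = math.ceil(z ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2)
    arms = num_variants + 1                # variants plus the control
    return per_arm, per_arm * arms


# 3% baseline, 10% relative lift, 5 variants against one control.
per_arm, total = multi_variant_plan(0.03, 0.10, num_variants=5)
print("per arm:", per_arm, "total:", total)
```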
The formula above doesn't apply directly to revenue metrics: it is for binary conversion (converted / didn't). Continuous metrics like revenue per visitor need a different calculation that accounts for the variance of the metric, and typically require 5-10x more samples than a conversion test on the same traffic. Most platforms have a separate 'continuous' mode.
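As a rough illustration of the continuous case, the sketch below uses the standard textbook formula for comparing two means, n = 2(Z_α/2 + Z_β)²σ² / δ², where σ is the metric's standard deviation and δ is the absolute difference to detect. This is not necessarily what any given platform's 'continuous' mode computes. It assumes equal variance in both arms and a normal approximation, which heavy-tailed revenue data can strain. The mean and standard deviation in the example are made-up values.

```python
import math
from statistics import NormalDist


def sample_size_continuous(mean, std_dev, relative_mde, alpha=0.05, power=0.80):
    """Per-arm sample size for detecting a relative lift in a continuous metric."""
    delta = mean * relative_mde  # absolute difference to detect
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return math.ceil(2 * z ** 2 * std_dev ** 2 / delta ** 2)


# Hypothetical revenue per visitor: mean 2.00, standard deviation 12.00.
print(sample_size_continuous(mean=2.00, std_dev=12.00, relative_mde=0.10))
```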
Duration = required sample size ÷ daily traffic to the test area. The calculator tells you the 'how many' — your analytics tells you the 'how fast'. Always run for at least one full business cycle (typically 14 days) even if you hit the sample size faster, to absorb weekday/weekend behaviour differences.
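The duration rule, including the one-business-cycle floor, can be sketched as below. The function name and the daily traffic figures are assumptions for illustration; the 14-day minimum is the default suggested above.

```python
import math


def planned_duration_days(n_per_variant, arms, daily_visitors, min_days=14):
    """Days needed to reach the sample size, never shorter than min_days."""
    days_for_sample = math.ceil(n_per_variant * arms / daily_visitors)
    return max(days_for_sample, min_days)


# Sample size reached in 7 days, but the test still runs the full 14.
print(planned_duration_days(13_000, arms=2, daily_visitors=4_000))  # 14

# Low-traffic page: the sample size, not the floor, sets the duration.
print(planned_duration_days(51_200, arms=2, daily_visitors=1_000))  # 103
```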
A sample size calculator runs before the test — it tells you how many users you need. A significance calculator runs after — it tells you whether the lift you observed is statistically real. Use them as a pair: plan with the first, validate with the second. Never use the significance calculator to decide when to stop a running test.
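For the 'validate' half of that pair, a common choice is a pooled two-proportion z-test, sketched below. Significance calculators may use other methods (a chi-squared test or a Bayesian model, for example), so treat this as one reasonable option and not as the definitive calculation. The visitor and conversion counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-tailed pooled z-test for a difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Run once, after the planned sample size has been reached.
z, p_value = two_proportion_z_test(conv_a=1500, n_a=50_000, conv_b=1650, n_b=50_000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```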
No sample size formula corrects for bots, internal traffic, or inconsistent visitor counting; those are upstream data-quality issues. Filter out bots and internal traffic at the analytics layer before counting visitors, and decide upfront whether your unit of randomisation is the visitor or the session. Inconsistent counting between the planning and analysis stages is a common reason tests look underpowered after the fact.
Statistical significance alone doesn't mean a result is worth shipping. Significance means the observed difference is unlikely to be random noise. It says nothing about whether the lift is large enough to be worth shipping, whether it'll hold up across segments, or whether it'll persist over time. A statistically significant 1.5% lift on a checkout step might not survive a holiday traffic mix shift.