How to use Power Analysis

Power analysis tells you whether your A/B test is big enough to detect a real effect. Here's how to size tests properly, pick the right power level, and rescue underpowered experiments.
Power Analysis
Power analysis is the calculation that tells you how likely your A/B test is to detect a real effect of a given size.
Power is the probability that a test correctly rejects the null hypothesis when the alternative is true — in plain terms, the chance you actually spot a winner when one exists. The industry convention is 80% power, meaning you accept a 20% risk of missing a real lift.
A power analysis ties together five quantities: sample size, baseline conversion rate, minimum detectable effect (MDE), significance threshold, and power. Fix the last four and it answers the question every experimenter eventually faces: how much traffic do I need before I can trust the result? Skip it, and you ship tests that are mathematically incapable of finding what you're looking for.
Most teams obsess over statistical significance (the p-value) and ignore power entirely. That's backwards. Significance tells you whether the result you observed is unlikely under the null; power tells you whether your test was ever capable of producing a significant result in the first place.
An underpowered test is the worst kind of failure: it returns flat or noisy results that look like 'no effect,' so you discard a variant that might have lifted revenue 5%. You don't get a warning. You just quietly leave money on the table and move on to the next test.
What power actually measures
Every A/B test has two ways to be wrong. A Type I error is a false positive — you declare a winner that isn't real. A Type II error is a false negative — a real winner exists, but your test missed it. Significance (alpha, usually 5%) controls Type I. Power (1 minus beta, usually 80%) controls Type II.
At 80% power, you'll catch 4 out of every 5 real winners of the size you specified. The other one slips through as a false negative. Raise power to 90% and you catch 9 out of 10 — but you'll need roughly 35% more sample size to get there.
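To make the two error rates concrete, here is a minimal simulation sketch. It assumes a 5% baseline, a true 10% relative lift, and 31,000 visitors per variant (roughly what a standard calculator suggests for 80% power at those settings). It uses a normal approximation to each arm's conversion rate rather than simulating individual visitors. The function name and all of the numbers are illustrative assumptions, not taken from any particular testing tool.

```python
import random
from math import sqrt
from statistics import NormalDist


def simulate_rejection_rate(p_control, p_variant, n, alpha=0.05, trials=4000, seed=1):
    """Share of simulated A/B tests in which a two-sided z-test rejects the null."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(trials):
        # Normal approximation to the observed conversion rate in each arm.
        rate_c = rng.gauss(p_control, sqrt(p_control * (1 - p_control) / n))
        rate_v = rng.gauss(p_variant, sqrt(p_variant * (1 - p_variant) / n))
        se = sqrt((rate_c * (1 - rate_c) + rate_v * (1 - rate_v)) / n)
        if abs(rate_v - rate_c) / se > z_crit:
            rejections += 1
    return rejections / trials


n = 31_000  # assumed visitors per variant

# No real difference: every rejection is a Type I error (expect about alpha).
false_positive_rate = simulate_rejection_rate(0.05, 0.05, n)

# Real 10% relative lift: every rejection is a correct detection (expect about 0.80).
detection_rate = simulate_rejection_rate(0.05, 0.055, n)

print(f"false positives: {false_positive_rate:.3f}")
print(f"real winners detected: {detection_rate:.3f}")
```

The share of real winners that go undetected (one minus the second figure) is the Type II error rate, which is what the power setting controls.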
Power is always defined relative to a specific effect size. A test powered at 80% to detect a 10% relative lift has only about 29% power to detect a 5% lift. That's why 'is my test powered?' is the wrong question. The right one is 'powered to detect what?'
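A quick way to check how power falls off for smaller effects is to size a test for one lift and then ask what power the same sample gives for another. The sketch below assumes a 5% baseline, a two-sided test at alpha = 0.05, and the usual normal approximation for two proportions. The helper names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

N = NormalDist()


def n_per_variant(baseline, rel_lift, alpha=0.05, power=0.80):
    """Visitors per variant needed to detect a relative lift (normal approximation)."""
    p1, p2 = baseline, baseline * (1 + rel_lift)
    z = N.inv_cdf(1 - alpha / 2) + N.inv_cdf(power)
    return z ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2


def achieved_power(baseline, rel_lift, n, alpha=0.05):
    """Power of a two-sided z-test with n visitors per variant for a given true lift."""
    p1, p2 = baseline, baseline * (1 + rel_lift)
    se = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n)
    return N.cdf(abs(p2 - p1) / se - N.inv_cdf(1 - alpha / 2))


n = n_per_variant(0.05, 0.10)              # sized for a 10% relative lift
power_at_10 = achieved_power(0.05, 0.10, n)
power_at_5 = achieved_power(0.05, 0.05, n)

print(f"{power_at_10:.2f}")  # 0.80 by construction
print(f"{power_at_5:.2f}")   # a little under 0.30 under these assumptions
```

The exact figure shifts slightly with the baseline and the formula your calculator uses, but the shape of the result holds: halve the true effect and most of the power disappears.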
The 80% myth
80% power is a convention, not a law. It's the default in most sample-size calculators because Jacob Cohen suggested it in 1988 as a reasonable balance. For high-stakes tests (checkout flow, pricing), bump to 90%. For exploratory tests on cheap traffic, 70% can be defensible.
The four levers that move sample size
Sample size per variant scales with four inputs. Baseline conversion rate: lower baselines need more traffic (a 1% checkout rate is harder to test than a 10% add-to-cart rate). MDE: the smaller the lift you want to detect, the more visitors you need, and the relationship is quadratic — halving the MDE quadruples the sample.
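The quadratic relationship drops out of the standard normal-approximation formula for comparing two conversion rates. This is one common textbook form, assuming a two-sided test and equal traffic per variant. Individual calculators may use slightly different variants of it.

```latex
n \;=\; \frac{\left(z_{1-\alpha/2} + z_{1-\beta}\right)^{2}\,
              \bigl[\,p_1(1-p_1) + p_2(1-p_2)\,\bigr]}
             {(p_2 - p_1)^{2}},
\qquad p_2 = p_1\,(1 + \mathrm{MDE})
```

Here n is visitors per variant, p_1 is the baseline conversion rate, p_2 is the rate you want to be able to detect, alpha is the significance threshold, and 1 - beta is power. The denominator equals (p_1 * MDE)^2, so halving the MDE multiplies n by roughly four, and a lower baseline shrinks the denominator faster than it shrinks the numerator.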
Significance and power are the other two. Tightening alpha from 5% to 1% increases the required sample by roughly 50%. Raising power from 80% to 95% adds about 65% more visitors. Most teams hold alpha and power constant and trade off MDE against runtime, the only lever they control week to week.
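The cost of tightening alpha or raising power can be checked directly, because required sample size is proportional to the squared sum of the two z-scores. The sketch below assumes a two-sided test and the normal approximation. The function name is made up for illustration.

```python
from statistics import NormalDist


def required_sample_factor(alpha, power):
    """The (z_{1-alpha/2} + z_{power})^2 term that required sample size scales with."""
    z = NormalDist().inv_cdf
    return (z(1 - alpha / 2) + z(power)) ** 2


reference = required_sample_factor(0.05, 0.80)  # alpha 5%, power 80%
scenarios = {
    "power 80% -> 90%": required_sample_factor(0.05, 0.90),
    "alpha 5% -> 1%": required_sample_factor(0.01, 0.80),
    "power 80% -> 95%": required_sample_factor(0.05, 0.95),
}
multipliers = {name: factor / reference for name, factor in scenarios.items()}

for name, multiplier in multipliers.items():
    print(f"{name}: x{multiplier:.2f}")
# prints x1.34, x1.49 and x1.66 respectively
```

These multipliers do not depend on baseline or MDE, which is why they are handy rules of thumb when negotiating test settings.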
Chart: sample size per variant by MDE (5% baseline conversion, 80% power, 95% significance)
Read the curve carefully. Going from a 20% MDE to a 10% MDE looks like a small change, but the sample requirement jumps from roughly 7,700 to roughly 30,500 per variant. If your store does 20,000 sessions a week, that's the difference between a one-week test and a test that ties up about three weeks of traffic on a single experiment.
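If you want to reproduce a curve like this yourself, the sketch below tabulates sample size and runtime for a few MDE values. It assumes a 5% baseline, 80% power, a two-sided alpha of 5%, two variants with an even split, and 20,000 sessions a week. The function name is hypothetical, and calculators that use a slightly different formula will return figures a few percent away from these.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Visitors per variant for a two-proportion z-test (normal approximation)."""
    z = NormalDist().inv_cdf
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z(1 - alpha / 2) + z(power)) ** 2 * variance / (p2 - p1) ** 2)


WEEKLY_SESSIONS = 20_000  # assumed traffic, split evenly across two variants

for mde in (0.20, 0.10, 0.05):
    n = sample_size_per_variant(0.05, mde)
    weeks = 2 * n / WEEKLY_SESSIONS
    print(f"MDE {mde:.0%}: {n:,} per variant, about {weeks:.1f} weeks")
```

Each halving of the MDE roughly quadruples both columns, which is the pattern to look for when reading the curve.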
Setting power in practice
Run the power analysis before the test starts, not after. Pick your baseline from the last 4-8 weeks of GA4 data for the exact page and audience you'll be testing on. Pick your MDE based on what would be commercially meaningful — not what your tool promises is achievable.
A useful heuristic: an apparel store with €60 AOV needs a lift of at least 4-5% on add-to-cart to move the needle on monthly revenue once you factor in test runtime. Below that, the test costs more in opportunity than it returns. Set MDE there and let the calculator tell you the runtime.
Typical power-analysis settings by traffic tier (Shopify stores, €1M-€15M revenue)
| Weekly sessions | Recommended MDE | Power | Significance | Typical runtime |
|---|---|---|---|---|
| < 10,000 | 15-20% | 80% | 90% | 3-4 weeks |
| 10,000-30,000 | 8-12% | 80% | 95% | 2-3 weeks |
| 30,000-75,000 | 5-8% | 80% | 95% | 2 weeks |
| 75,000-200,000 | 3-5% | 85% | 95% | 2 weeks |
| > 200,000 | 2-3% | 90% | 95% | 1-2 weeks |
Notice the pattern: smaller stores need bigger MDEs to stay within reasonable runtime, which means they should only test changes likely to produce large effects — new hero images, full PDP redesigns, free-shipping thresholds — not button colours. Higher-traffic stores can afford to chase smaller wins because the math is on their side.
Rescuing an underpowered test
If a test ends flat and you suspect it was underpowered, don't restart blindly. Compute the post-hoc MDE: given the sample you actually collected, what's the smallest lift you could have detected at 80% power? If that number is 12% and you were hoping for a 4% lift, the test never had a chance — the variant might genuinely be a winner.
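One way to compute a post-hoc MDE is to invert the sample-size formula numerically: search for the relative lift whose required sample equals what you actually collected. The sketch below does this by bisection, assuming a two-sided test at alpha = 0.05, the normal approximation, and a baseline of at most 50%. The function names and the 22,000-visitor example are illustrative assumptions.

```python
from statistics import NormalDist


def required_n(baseline, rel_lift, alpha=0.05, power=0.80):
    """Visitors per variant needed to detect a relative lift (normal approximation)."""
    z = NormalDist().inv_cdf
    p1, p2 = baseline, baseline * (1 + rel_lift)
    return (z(1 - alpha / 2) + z(power)) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2


def post_hoc_mde(baseline, n_per_variant, alpha=0.05, power=0.80):
    """Smallest relative lift detectable with the collected sample, found by bisection."""
    lo, hi = 1e-6, 1.0  # search between a near-zero lift and a 100% lift
    for _ in range(60):
        mid = (lo + hi) / 2
        if required_n(baseline, mid, alpha, power) > n_per_variant:
            lo = mid  # this lift needs more traffic than we had, so the MDE is larger
        else:
            hi = mid
    return hi


mde = post_hoc_mde(baseline=0.05, n_per_variant=22_000)
print(f"Detectable relative lift at 80% power: {mde:.1%}")  # about 12%
```

If the printed figure is several times the lift you were hoping for, the flat result says more about the test than about the variant.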
Three ways to add power without waiting forever: pick a metric closer to the change (add-to-cart instead of revenue per visitor on a PDP test), segment to the affected audience (mobile-only if the change is mobile-specific), or use variance-reduction techniques like CUPED if your platform supports it. Each can cut required sample by 20-50%.
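For readers who have not seen CUPED, the core adjustment is small: subtract from each user's in-experiment metric the part that their pre-experiment metric already predicts. The sketch below runs it on synthetic data, so the spend figures, the 0.6 relationship, and the function name are all made-up assumptions. In a real test the coefficient is usually estimated on data pooled across both variants.

```python
import random
from statistics import fmean, variance


def cuped_adjust(metric, pre_metric):
    """Return CUPED-adjusted values: y - theta * (x - mean(x))."""
    mean_x, mean_y = fmean(pre_metric), fmean(metric)
    cov_xy = sum((x - mean_x) * (y - mean_y)
                 for x, y in zip(pre_metric, metric)) / (len(metric) - 1)
    theta = cov_xy / variance(pre_metric)  # regression slope of y on x
    return [y - theta * (x - mean_x) for x, y in zip(pre_metric, metric)]


rng = random.Random(7)
pre = [rng.gauss(50, 10) for _ in range(5000)]     # synthetic pre-experiment spend
post = [0.6 * x + rng.gauss(20, 8) for x in pre]   # synthetic in-experiment spend

adjusted = cuped_adjust(post, pre)
reduction = 1 - variance(adjusted) / variance(post)

print(f"mean before: {fmean(post):.2f}, after: {fmean(adjusted):.2f}")
print(f"variance reduction: {reduction:.0%}")
```

The adjustment leaves the mean unchanged and removes variance in proportion to the squared correlation between the two periods. Because required sample scales with variance, that reduction translates more or less directly into a smaller required sample.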
Power belongs in your test plan
Before you queue an experiment, write down the baseline, MDE, power, significance, and resulting sample size. If the calculator says you need 6 weeks of runtime and you only have 2, kill the test now — don't run it and hope. That single discipline eliminates the majority of inconclusive results.
Power analysis FAQ
What's the difference between statistical significance and power?
Significance (alpha) is the probability of a false positive: calling a winner when there isn't one. Power (1 minus beta) is the probability of a true positive: detecting a real winner when it exists. You set both before the test, typically an alpha of 5% (usually quoted as 95% significance) and 80% power.
Why is 80% the standard power level?
It's a convention proposed by Jacob Cohen in 1988 as a reasonable trade-off between sample cost and detection rate. At 80% you accept missing 1 in 5 real winners. For high-stakes tests like checkout or pricing, bump to 90%; for cheap exploratory tests, 70% can be acceptable.
How do I calculate the sample size I need?
Plug four numbers into a sample-size calculator: baseline conversion rate (from your last 4-8 weeks of GA4 data), minimum detectable effect (the smallest lift worth caring about commercially), power (usually 80%), and significance (usually 95%, meaning an alpha of 5%). The output is the visitors per variant you need.
What is minimum detectable effect (MDE)?
MDE is the smallest relative lift your test is powered to detect. A test with a 5% MDE at 80% power catches a true 5% improvement about four times out of five, and larger improvements more often; smaller real effects will likely return as 'no significant difference' even when they're genuine. Set MDE based on commercial relevance, not technical optimism.
Can I calculate power after a test has finished?
Post-hoc power calculation has limited use, but post-hoc MDE is valuable: given the sample you collected, what lift could you have detected? If the answer is much larger than the effect you were hoping for, the test was underpowered and a flat result doesn't mean the variant lost.
How does baseline conversion rate affect power?
Lower baselines require more traffic. A test on a 1% checkout-completion metric needs roughly 10x the sample of a test on a 10% add-to-cart metric for the same relative MDE. That's why teams often test upstream metrics (CTRs, add-to-cart) rather than revenue per visitor directly.
What happens if I run an underpowered test?
You're likely to see a flat or noisy result regardless of whether the variant is genuinely better. You'll then conclude 'no effect' and ship the control, discarding a potential winner. Underpowered tests are worse than no test because they generate false confidence in negative results.
Does power analysis apply to Bayesian tests?
Bayesian tests don't use power in the frequentist sense, but they have analogues: expected loss thresholds and credible-interval widths play similar roles. You still need to plan sample size around the smallest effect you care about. The framework changes; the underlying constraint doesn't.
How does power analysis relate to statistical analysis?
Power analysis is one branch of statistical analysis focused specifically on test design. Sister topics include statistical significance (decision rule), confidence intervals (effect estimation), and multiple-testing corrections (when running many variants). Together they form the toolkit for trustworthy experimentation.
Can I increase power without more traffic?
Yes, in three ways. Choose a metric closer to the change being tested (higher baseline, less noise). Segment to the audience actually affected (don't dilute with irrelevant traffic). Use variance-reduction methods like CUPED, which leverage pre-experiment data to cut required sample 20-50%.