What Is A/B Testing?

A plain-English definition of A/B testing, the sample-size formula behind it, realistic win-rate benchmarks, and the questions store owners actually ask.
A/B testing splits live traffic between two variants of a page or element to measure which one performs better against a defined goal.
A/B testing (also called split testing) is the controlled experiment at the heart of conversion-rate optimisation. You take a page, element, or flow, change one meaningful thing, and serve the original (control, A) and the variant (B) to comparable slices of live traffic at the same time. A statistical test then tells you whether the difference in conversion rate, revenue per visitor, or add-to-cart rate is real or noise.
The point is not the test — it's the discipline. A/B testing replaces design-by-opinion with design-by-evidence, which is why mature CRO teams measure their output in tests shipped per month rather than redesigns shipped per quarter.
The mechanics are simple. A randomiser assigns each visitor to A or B on first exposure, sticks them in that bucket via a cookie, and logs whether they hit the goal event (purchase, add to cart, email signup). After enough traffic, you compare the conversion rates and ask whether the gap is statistically significant.
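The assign, stick, and log loop can be sketched in a few lines. This is a minimal illustration, not how any particular tool works: it swaps the random draw plus cookie for a hash of the visitor ID, which gives the same sticky 50/50 split without storing state. The names `assign_variant` and `conversion_rates`, the experiment key `pdp-test`, and the event log are all hypothetical.

```python
import hashlib

def assign_variant(visitor_id: str, experiment: str = "pdp-test") -> str:
    """Map a visitor to 'A' or 'B'; the same ID always lands in the same bucket."""
    key = f"{experiment}:{visitor_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"

def conversion_rates(events):
    """events: iterable of (visitor_id, converted) pairs -> conversion rate per bucket."""
    visitors = {"A": 0, "B": 0}
    conversions = {"A": 0, "B": 0}
    for visitor_id, converted in events:
        bucket = assign_variant(visitor_id)
        visitors[bucket] += 1
        conversions[bucket] += int(converted)
    return {
        b: (conversions[b] / visitors[b] if visitors[b] else 0.0)
        for b in visitors
    }

# Hypothetical goal-event log: (visitor ID, did they hit the goal event?)
events = [(f"visitor-{i}", i % 30 == 0) for i in range(10_000)]
rates = conversion_rates(events)
print(rates)
```

Hashing on the experiment name as well as the visitor ID means a visitor who lands in A for one test is not automatically in A for the next one.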
What changed in the last few years is the cost. Running an A/B test used to require a tag-manager rebuild, a developer sprint, and a six-figure annual licence. On Shopify or WooCommerce today, a single snippet handles assignment, tracking, and reporting — so the bottleneck is no longer tooling, it's hypothesis quality and traffic volume.
n = 16 * p * (1 - p) / MDE^2
n: sample size per variant. Visitors needed in each bucket (A and B) to detect the effect.
p: baseline conversion rate. Current conversion rate of the control, as a decimal (e.g. 0.03 for 3%).
MDE: minimum detectable effect. Smallest absolute lift you care about, as a decimal (e.g. 0.006 for a 0.6-point lift).
An apparel store on Shopify converts at 3.0% on its product detail page. The team wants to detect a lift of at least 0.6 percentage points (from 3.0% to 3.6%) at 80% power and 95% confidence.
Baseline conversion rate (p): 0.030
Minimum detectable effect (MDE): 0.006
→ n ≈ 12,933 visitors per variant
That is about 25,900 visitors across both variants. At ~25,000 weekly PDP sessions, the test reaches its required sample size in about a week. If weekly traffic were 5,000, the same test would need ~5 weeks, long enough that seasonality starts to threaten validity.
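The worked example is easy to reproduce. The sketch below computes the rule-of-thumb sample size and, for comparison, a version that derives the constant from the normal quantiles for 95% confidence and 80% power (2 × (1.96 + 0.84)² ≈ 15.7, which the formula rounds to 16). Both use the baseline variance p(1 − p) for each arm, which is a simplification. The function names are illustrative.

```python
from statistics import NormalDist

def sample_size_rule_of_thumb(p: float, mde: float) -> float:
    """n = 16 * p * (1 - p) / MDE^2, per variant."""
    return 16 * p * (1 - p) / mde ** 2

def sample_size_from_quantiles(p: float, mde: float,
                               alpha: float = 0.05, power: float = 0.80) -> float:
    """Same formula, with the constant derived from alpha and power."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided confidence
    z_beta = NormalDist().inv_cdf(power)
    return 2 * (z_alpha + z_beta) ** 2 * p * (1 - p) / mde ** 2

def weeks_to_run(n_per_variant: float, weekly_sessions: float, variants: int = 2) -> float:
    """Weeks needed to fill every bucket at a given weekly traffic level."""
    return n_per_variant * variants / weekly_sessions

n = sample_size_rule_of_thumb(p=0.030, mde=0.006)
print(round(n))                          # 12933, matching the example
print(round(weeks_to_run(n, 25_000), 1))
print(round(weeks_to_run(n, 5_000), 1))
```

The quantile-based version comes out slightly lower (around 12,700 per variant) because 16 is a rounded-up constant. For planning purposes the difference rarely changes the decision.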
That formula is the single most useful piece of math in CRO: it tells you before you launch whether a test is even feasible. Low-traffic stores trying to detect tiny lifts will wait months and learn nothing — better to bundle smaller changes into a bolder variant and test for a larger effect.
Typical A/B testing throughput and outcomes by store revenue tier
| Annual GMV | Tests / month | Win rate | Avg lift on winners | PDP test feasibility |
|---|---|---|---|---|
| €500k – €1M | 1 – 2 | 18% | +4 to +8% | Insufficient — bundle changes |
| €1M – €5M | 2 – 4 | 22% | +3 to +6% | ~8k weekly sessions |
| €5M – €15M | 4 – 8 | 25% | +2 to +5% | Comfortable on most pages |
| €15M+ | 8 – 15 | 27% | +1.5 to +3% | Can test deep-funnel and segment splits |
Two patterns stand out. Win rates climb slowly with scale because bigger programmes generate better hypotheses, not because they have better tools. And average lift shrinks as stores grow — the obvious wins are gone, so mature programmes are mining diminishing returns and have to test more to ship the same absolute revenue gain.
A/B testing FAQ
What is the difference between A/B testing and split testing?
Nothing: they're the same thing. 'Split testing' was the older direct-mail term and got carried into web experimentation. Some teams use 'split URL test' specifically to mean a test where A and B are served from different URLs rather than swapped client-side, but the statistics are identical.
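What is the difference between A/B testing and multivariate testing?
An A/B test compares two complete variants and tells you which wins. A multivariate test (MVT) varies several elements at once, say headline × hero image × CTA colour, and tells you which combination wins and which element matters most. Every combination needs its own full sample, so an MVT with N combinations needs roughly N times the per-variant traffic; two options for each of those three elements already makes eight combinations. That is why most stores under €5M GMV stick to A/B.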
Does A/B testing slow down my site?
It can if you load a heavy testing tool that blocks rendering. A lightweight snippet (under ~15 KB, loaded async) has no measurable impact in practice. The real risk is flicker, where the control shows for a split second before the variant paints. You avoid it by hiding the tested element until the assignment resolves.
How long should an A/B test run?
Until two conditions are both met: you've hit the required sample size from your power calculation, and you've covered at least one full business cycle (usually a week, sometimes two if weekday and weekend buyers behave differently). Stopping the moment significance flickers green is the most common reason teams ship false winners.
How much lift should I expect from a winning test?
On a product page or cart, a winning test typically lifts conversion by 2–8% relative. Bigger lifts (15%+) usually come from changing the offer, the price, or the page structure entirely, not from button-colour tweaks. Aim your MDE at the smallest lift that would justify the engineering cost of shipping the variant permanently.
What does statistical significance mean?
Statistical significance rests on the p-value: the probability of seeing a difference between A and B at least as large as the one you observed if there were really no difference. A 95% significance threshold means you're willing to accept a 5% false-positive rate. It does not tell you the size of the effect or whether it'll hold up next quarter, only that the gap is unlikely to be noise.
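As an illustration, the p-value for a difference in conversion rates can be computed with a pooled two-proportion z-test. That is one common choice, not necessarily what your testing tool uses, and the conversion counts below are hypothetical numbers chosen to resemble a 3.0% vs 3.6% result.

```python
import math
from statistics import NormalDist

def two_proportion_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value for the difference between two conversion rates."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no conversions (or all conversions): nothing to compare
    z = (rate_b - rate_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical results: 388 of 12,933 visitors convert on A, 466 of 12,933 on B
p_value = two_proportion_p_value(388, 12_933, 466, 12_933)
print(p_value < 0.05)  # True: significant at the 95% threshold
```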
Can I run A/B tests with GA4?
GA4 can measure an experiment if you push the variant assignment as a user property or event parameter, but it won't run the test for you: assignment, bucketing, and significance calculation need a separate tool. The common stack is GA4 for analytics plus a dedicated experimentation layer that writes the variant back into GA4.
How many variants can I test at once?
Technically unlimited (A/B/C/D/n), but each extra variant splits your traffic and inflates the multiple-comparison risk. Most teams stick to two, control and one challenger, and only add a third when they have a genuinely different second hypothesis worth running in parallel.
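The multiple-comparison risk is easy to put a number on. The sketch below treats each challenger-vs-control comparison as independent, which is an approximation because they share a control, and shows a Bonferroni-adjusted threshold as one common (conservative) correction. The function names are illustrative.

```python
def familywise_error_rate(comparisons: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across independent comparisons."""
    return 1 - (1 - alpha) ** comparisons

def bonferroni_alpha(comparisons: int, alpha: float = 0.05) -> float:
    """Per-comparison threshold that keeps the overall rate at or below alpha."""
    return alpha / comparisons

for challengers in (1, 2, 3):
    print(challengers,
          round(familywise_error_rate(challengers), 4),
          round(bonferroni_alpha(challengers), 4))
```

With three challengers against one control, the chance of at least one false winner is about 14% rather than 5%, unless you tighten the per-comparison threshold.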
What is the most common A/B testing mistake?
Calling tests too early. The conversion-rate gap between A and B fluctuates wildly in the first few hundred conversions and only stabilises near the calculated sample size. Teams that peek daily and stop the moment they see green ship 'winners' that revert to neutral within a month of going live.
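A small simulation makes the peeking problem concrete. The sketch below runs A/A tests (both buckets share the same true conversion rate, so every "winner" is a false positive) and compares how often a test is flagged when you check at several interim looks against checking once at the end. The traffic numbers, number of looks, and seed are arbitrary assumptions, and the significance check is a pooled two-proportion z-test.

```python
import math
import random
from statistics import NormalDist

def p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def simulate_aa(n_per_arm=2000, rate=0.05, looks=10, alpha=0.05, runs=200, seed=1):
    """Return (false-positive rate with peeking, false-positive rate at final look)."""
    rng = random.Random(seed)
    step = max(1, n_per_arm // looks)
    peek_hits = final_hits = 0
    for _ in range(runs):
        conv_a = conv_b = 0
        flagged_at_any_look = significant = False
        for i in range(1, n_per_arm + 1):
            conv_a += 1 if rng.random() < rate else 0
            conv_b += 1 if rng.random() < rate else 0
            if i % step == 0:  # an interim look, or the final one
                significant = p_value(conv_a, i, conv_b, i) < alpha
                flagged_at_any_look = flagged_at_any_look or significant
        peek_hits += flagged_at_any_look   # would have stopped at some look
        final_hits += significant          # result at the last look only
    return peek_hits / runs, final_hits / runs

peeking_rate, final_rate = simulate_aa()
print(f"stop at first green: {peeking_rate:.1%}, wait for full sample: {final_rate:.1%}")
```

Because the final look is also one of the peeks, the peeking rate can never be lower than the final-look rate; in practice it tends to be several times higher.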
Do I need a dedicated A/B testing tool?
For one or two tests a quarter, you can get by with a Shopify app or a feature flag in your codebase. Once you're running two or more tests a month, a dedicated experimentation layer pays for itself, primarily because it removes engineering time per test, which is the actual bottleneck on test velocity.