A/B Testing

Metricuno
May 17, 2026
A/B Testing — A practical A/B testing framework for online stores: how to design tests, hit significance faster, avoid common mistakes, and scale into a CRO program.
Quick answer

A complete framework for A/B testing on Shopify, Woo, and Magento stores — from hypothesis to significance to building a repeatable experimentation program.

Definition
Experimentation

A/B Testing

A/B testing is a controlled experiment that splits live traffic between two variants to measure which one moves a target metric.

A/B testing is the operational core of modern experimentation. You take a single page, element, or flow, build a challenger against the current version, randomly assign visitors to each, and compare conversion rates with enough statistical rigor to call a winner. Done well, it replaces opinion with evidence and protects revenue from confident-sounding redesigns that quietly tank checkout.
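The random assignment step is often implemented as deterministic hashing, so a returning visitor always sees the same variant. The sketch below is one common way to do that, not a description of any particular tool; the function name `assign_variant`, the experiment key, and the visitor ID format are all illustrative assumptions.

```python
import hashlib

def assign_variant(visitor_id: str, experiment: str,
                   variants=("control", "challenger")) -> str:
    """Map a visitor to a variant, stably and with roughly equal probability.

    Hashing the experiment name together with the visitor ID means the same
    visitor can land in different buckets for different experiments.
    """
    key = f"{experiment}:{visitor_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes of the hash as an integer, then pick a bucket.
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]

# Hypothetical usage: the IDs and experiment name are made up.
print(assign_variant("visitor-123", "pdp-free-shipping"))
```

Because the assignment depends only on the two strings, it needs no stored state and can be recomputed anywhere the visitor ID is available.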

The method scales from a single button-color test up to a continuous program running dozens of experiments per quarter across home, PDP, cart, and checkout. The framework on this page covers test design, execution, analysis, and the discipline needed to keep the program honest as it grows.

Also known as
split testing
online controlled experiment
bucket testing

Every A/B test answers one question: does this specific change cause a measurable lift in a specific metric, for a specific audience? If any of those three pieces is fuzzy, the result is fuzzy too — you ship a variant nobody can defend three months later when revenue moves the other way.

The framework below has three phases — design, execution, and program management. Each maps to a child topic you can drill into: the A/B testing process for the day-to-day mechanics, the A/B testing framework for hypothesis scoring, and A/B testing program management for the operating model around it.

Phase 1 — Designing a test that can win

A good test starts with a real drop-off, not a brainstorm. Look at your funnel for the largest leak you can actually influence — for most Shopify apparel stores that's PDP-to-cart or cart-to-checkout, not the homepage hero everyone wants to redesign. Quantify the leak before you write the hypothesis.
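Quantifying the leak can be as simple as computing the drop-off between consecutive funnel steps. The following is a minimal sketch with hypothetical monthly session counts; the step names and numbers are invented for illustration, and in practice they would come from your analytics export.

```python
def step_dropoff(funnel):
    """Return (from_step, to_step, drop_rate, lost_sessions) per transition.

    `funnel` is an ordered list of (step_name, sessions) tuples.
    """
    rows = []
    for (step_a, n_a), (step_b, n_b) in zip(funnel, funnel[1:]):
        lost = n_a - n_b
        drop_rate = lost / n_a if n_a else 0.0
        rows.append((step_a, step_b, drop_rate, lost))
    return rows

# Hypothetical monthly sessions per funnel step (not real data).
funnel = [
    ("home", 100_000),
    ("pdp", 40_000),
    ("cart", 1_600),
    ("checkout", 900),
    ("purchase", 600),
]

leaks = step_dropoff(funnel)
biggest = max(leaks, key=lambda row: row[2])  # rank by drop-off rate

for step_a, step_b, rate, lost in leaks:
    print(f"{step_a} -> {step_b}: {rate:.1%} drop ({lost:,} sessions lost)")
```

Ranking by rate is only a starting point: a step with a slightly lower drop-off rate but far more traffic, or one that is easier to change, can still be the better first test.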

Then write the hypothesis in causal form: "because visitors hesitate on shipping cost, surfacing the free-shipping threshold on the PDP will lift add-to-cart by 6%+." That sentence forces you to name the friction, the change, the metric, and the minimum detectable effect. Anything vaguer belongs back in the A/B testing roadmap, not in production. The A/B testing framework spoke covers scoring models (PIE, ICE, PXL) for prioritising the backlog.

Phase 2 — Running it without lying to yourself

Calculate sample size before you launch. A store doing 40,000 PDP sessions a month with a 4% add-to-cart rate needs roughly 38,000 to 40,000 visitors per variant to detect a 10% relative lift at 95% confidence and 80% power (the exact figure depends on the formula your calculator uses). With two variants that is close to two months of traffic for this store. If you can't reach the required sample in two to four weeks, the test is either too ambitious or the page gets too little traffic: change the variant, not the math.
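A sample size estimate of this kind can be reproduced with the standard normal-approximation formula for comparing two proportions. The sketch below assumes a two-sided test and equal traffic per variant; the function name `visitors_per_variant` is illustrative, and commercial calculators may use slightly different formulas, so expect small differences in the output.

```python
from math import ceil
from statistics import NormalDist

def visitors_per_variant(baseline, relative_lift, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant for a two-sided z-test.

    baseline       -- current conversion rate, e.g. 0.04 for 4%
    relative_lift  -- minimum detectable effect, e.g. 0.10 for +10% relative
    """
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 95% confidence -> ~1.96
    z_beta = NormalDist().inv_cdf(power)           # 80% power -> ~0.84
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

n_10 = visitors_per_variant(0.04, 0.10)
n_05 = visitors_per_variant(0.04, 0.05)
print(f"10% relative lift: {n_10:,} visitors per variant")
print(f" 5% relative lift: {n_05:,} visitors per variant")
```

For a 4% baseline and a 10% relative lift this approximation lands a little under 40,000 visitors per variant, and halving the detectable lift to 5% multiplies the requirement by roughly four.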

Once live, leave it alone. Cover at least one full business cycle (usually two weeks including both weekends), don't peek at p-values daily, and don't call a winner because the dashboard flickered green on day three. Most of the worst A/B testing mistakes are about stopping early, segmenting after the fact, or testing five things at once and pretending it was one.
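Once the pre-committed sample is in, the comparison itself is commonly done with a two-proportion z-test. The sketch below is one standard way to compute it, assuming a fixed-horizon test analysed once at the end; the function name and the visitor and conversion counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (relative_lift, z, two_sided_p_value) for variant B vs. A."""
    if conv_a == 0 or conv_a + conv_b == n_a + n_b:
        raise ValueError("rates of 0% or 100% cannot be compared this way")
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_b / p_a - 1, z, p_value

# Hypothetical final counts after the planned duration (not real data).
lift, z, p_value = two_proportion_z_test(
    conv_a=1_560, n_a=39_000,   # control: 4.0% add-to-cart
    conv_b=1_755, n_b=39_000,   # challenger: 4.5% add-to-cart
)
print(f"relative lift {lift:+.1%}, z = {z:.2f}, p = {p_value:.4f}")
```

The p-value from this calculation is only valid if it is read once, at the sample size fixed before launch.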

Peeking inflates false positives

Checking a test repeatedly and stopping the first time it crosses 95% significance does not give you a 5% false-positive rate. Depending on how often you look, the real rate is closer to 20-30%. Either pre-commit to a fixed sample size and duration, or use a sequential testing method (Bayesian, mSPRT) explicitly designed for continuous monitoring. Most fixed-horizon tools are not.
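The inflation is easy to see in a small simulation of A/A tests, where both arms share the same true conversion rate and every "win" is a false positive. This is an illustrative sketch only: the number of looks, visitors per look, conversion rate, and seed are arbitrary assumptions, and the exact rates will vary with them.

```python
import random
from math import sqrt
from statistics import NormalDist

def significant(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Two-sided two-proportion z-test; True if p < alpha."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    if pooled in (0.0, 1.0):
        return False
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z))) < alpha

def simulate_aa(n_tests=400, looks=10, per_look=200, rate=0.10, seed=7):
    """Return (false-positive rate with peeking, rate when read once at the end)."""
    rng = random.Random(seed)
    peeking_hits = fixed_hits = 0
    for _ in range(n_tests):
        conv_a = conv_b = n = 0
        crossed = False
        for _ in range(looks):
            # Both arms draw from the same true rate: no real difference exists.
            conv_a += sum(rng.random() < rate for _ in range(per_look))
            conv_b += sum(rng.random() < rate for _ in range(per_look))
            n += per_look
            last = significant(conv_a, n, conv_b, n)
            crossed = crossed or last
        peeking_hits += crossed   # stopped at the first "significant" look
        fixed_hits += last        # only the final look counts
    return peeking_hits / n_tests, fixed_hits / n_tests

peeking_rate, fixed_rate = simulate_aa()
print(f"stop at first significant look: {peeking_rate:.1%} false positives")
print(f"single look at the end:         {fixed_rate:.1%} false positives")
```

The single-look rate should sit near the nominal 5%, while the stop-at-first-crossing rate comes out several times higher, and it keeps climbing as the number of looks grows.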

Phase 3 — Scaling from tests to a program

One winning test is a story. Ten winning tests in a quarter is a program. The shift requires infrastructure: a backlog scored against impact and effort, a calendar so tests don't collide on the same page, a results log so you stop re-running questions you already answered, and a clear rule for when a non-significant test still ships (usually: never, unless it removes complexity).
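The calendar rule (no two tests on the same page at the same time) is simple enough to check automatically. The sketch below assumes a hypothetical `PlannedTest` record with a page label and inclusive start and end dates; the test names and dates are invented for illustration.

```python
from dataclasses import dataclass
from datetime import date
from itertools import combinations

@dataclass(frozen=True)
class PlannedTest:
    name: str
    page: str      # funnel step or page template, e.g. "pdp"
    start: date
    end: date      # inclusive

def collisions(tests):
    """Return pairs of test names that overlap in time on the same page."""
    clashes = []
    for a, b in combinations(tests, 2):
        same_page = a.page == b.page
        overlap = a.start <= b.end and b.start <= a.end
        if same_page and overlap:
            clashes.append((a.name, b.name))
    return clashes

# Hypothetical calendar entries.
calendar = [
    PlannedTest("free-shipping-badge", "pdp", date(2026, 6, 1), date(2026, 6, 21)),
    PlannedTest("sticky-add-to-cart", "pdp", date(2026, 6, 15), date(2026, 7, 5)),
    PlannedTest("express-pay-buttons", "checkout", date(2026, 6, 1), date(2026, 6, 21)),
]

print(collisions(calendar))
```

Here the two PDP tests are flagged because their date ranges overlap, while the checkout test running in the same weeks is not.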

Tooling matters here. The A/B testing tools market splits between client-side editors (fast to launch, but flicker-prone and liable to slow your Largest Contentful Paint) and server-side platforms (clean and fast, but more dev work). For stores in the €1M-€15M range, a lightweight snippet that handles both, and pulls hypotheses directly from your GA4 drop-off data, usually wins on time-to-test. See A/B testing examples for the patterns that consistently produce lifts on Shopify, Woo, and Magento.

Chart: Sample size required per variant by minimum detectable effect

[Axes: visitors per variant (0 to 350k) against minimum detectable effect as relative lift (5%, 10%, 15%, 20%, 30%). Series: baseline conversion 2%, 4%, and 8%.]

Frequently asked

A/B testing — frequently asked questions

What is A/B testing?

A/B testing splits your live traffic between the current version (control) and a changed version (variant), then compares a target metric — usually conversion rate or revenue per visitor — with enough volume to rule out random chance. It's the simplest form of online controlled experimentation and the foundation of every CRO program.

How long should an A/B test run?

At minimum two full weeks to cover weekday and weekend behaviour, and long enough to reach the pre-calculated sample size at 95% confidence and 80% power. For most stores under €15M revenue that's two to four weeks. Stopping earlier inflates false positives even if the dashboard shows significance.

How much traffic do I need for an A/B test?

It depends on your baseline conversion rate and the minimum detectable effect you care about. A store with 4% add-to-cart conversion needs around 38,000 visitors per variant to detect a 10% relative lift. Halve the effect you want to detect and the sample size roughly quadruples — see the chart above.

Can I run multiple A/B tests at the same time?

Yes, if they don't touch the same page or funnel step. Running a PDP test and a checkout test in parallel is fine; running two PDP tests at once creates interaction effects that contaminate both. The A/B testing program management spoke covers calendar discipline for concurrent tests.

What is the difference between A/B testing and multivariate testing?

A/B testing compares two (or a few) full variants of a page. Multivariate testing changes several elements independently and measures every combination — useful for fine-tuning a winner, but needs roughly N times the traffic where N is the number of combinations. Most stores under €15M revenue should stick to A/B.

Do I need a developer to run A/B tests?

For client-side tests (copy, layout, images, simple flow changes) no — a tag-based tool or zero-dev plugin handles it. For server-side tests, checkout-extensibility changes, or anything touching Shopify Markets logic, you'll want developer involvement to avoid flicker and to keep Core Web Vitals clean.

What are the most common A/B testing mistakes?

Stopping early when significance flickers green, running tests on too little traffic, testing trivial changes (button colour) instead of friction points, ignoring segment-level results, and shipping winners without documenting why. The A/B testing mistakes spoke covers each in depth with examples.

What should I test first?

The biggest measurable drop-off in your funnel that you can reach without a replatform. For most online stores that's the PDP-to-add-to-cart step or the cart-to-checkout step — both have high traffic and clear friction (shipping cost, trust, payment options). Hero banners and homepage carousels are usually a worse first bet.

How does A/B testing relate to experimentation?

A/B testing is one method inside the wider discipline of experimentation, which also covers holdout tests, geo experiments, MVT, bandits, and pre/post quasi-experiments. A/B is the workhorse — fastest to set up, easiest to interpret — and most programs run 80%+ of their tests as straight A/B.

What is a good win rate for A/B tests?

Industry data puts winning tests at roughly 15-25% of all tests run, with another 20-30% inconclusive and the rest flat or negative. If your win rate is above 40% you're probably calling tests early or testing only safe bets; below 10% suggests weak hypothesis prioritisation. The A/B testing framework spoke covers scoring methods that lift win rate.
