How to Use an A/B Testing Framework

A practical end-to-end A/B testing framework for online stores: how to write hypotheses, size tests, run them, analyse results, and decide what ships.
A/B Testing Framework
The end-to-end methodology for designing, running, analysing, and acting on controlled online experiments.
An A/B testing framework is the structured methodology a team uses to move from a business question to a shipped decision. It covers six linked steps: forming a hypothesis, designing the test, calculating sample size, running the test for a defined window, analysing the result, and deciding what to roll out.
What separates a framework from a checklist is the reasoning behind each step. A framework explains why you need a minimum detectable effect before you launch, why you stop on a date rather than a p-value, and why a winning variant still needs a holdout. It turns A/B testing from a habit into a repeatable system that produces trustworthy decisions.
Most stores already run A/B tests. Few run them inside a framework. The difference shows up six months later, when one team has a backlog of "inconclusive" results and the other has a shipped roadmap of confirmed wins.
This guide walks through the framework end-to-end, in the order you'd actually use it on a Shopify or WooCommerce store. Each step has a purpose, a deliverable, and a typical failure mode to watch for.
Step 1 — Hypothesis and prioritisation
A good hypothesis names a specific change, a specific audience, a specific metric, and a specific expected direction. "Make the PDP better" is not a hypothesis. "Adding a sticky add-to-cart on mobile PDPs will lift add-to-cart rate for new visitors by at least 5%" is.
The structure matters because it forces you to commit to a minimum detectable effect (MDE) before you see data. Without an MDE, every test risks running until you find noise you like. With one, the sample-size calculator has the input it needs and the decision rule writes itself.
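One way to make that commitment concrete is to record each hypothesis as a structured object before launch. The sketch below is illustrative only: the `Hypothesis` class and its field names are assumptions made for this example, not part of any particular testing tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """A pre-registered test hypothesis (hypothetical structure)."""
    change: str          # the specific change being made
    audience: str        # who is exposed to it
    metric: str          # the primary metric
    direction: str       # "increase" or "decrease"
    mde_relative: float  # minimum detectable effect, e.g. 0.05 for 5%

    def __post_init__(self):
        # Refuse to create a hypothesis that has no committed direction or MDE.
        if self.direction not in ("increase", "decrease"):
            raise ValueError("direction must be 'increase' or 'decrease'")
        if self.mde_relative <= 0:
            raise ValueError("commit to a positive MDE before launch")

    def statement(self) -> str:
        verb = "lift" if self.direction == "increase" else "reduce"
        return (f"{self.change} will {verb} {self.metric} "
                f"for {self.audience} by at least {self.mde_relative:.0%}")


h = Hypothesis(
    change="Adding a sticky add-to-cart on mobile PDPs",
    audience="new visitors",
    metric="add-to-cart rate",
    direction="increase",
    mde_relative=0.05,
)
print(h.statement())
```

Rejecting a missing or zero MDE at construction time means a test cannot enter the queue without the input the sample-size calculation needs.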
Prioritise hypotheses with a scoring model such as PIE (Potential, Importance, Ease) or ICE (Impact, Confidence, Ease). The exact scoring matters less than having a queue everyone agrees on. An apparel store with 80,000 monthly sessions can run roughly 2-3 tests per month per surface, so choose what earns the slot.
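A minimal sketch of a PIE-style ranking follows. It assumes each idea is rated 1-10 on Potential, Importance, and Ease and that the three ratings are simply averaged, which is one common convention rather than the only one. The backlog entries and their ratings are made up for illustration.

```python
def pie_score(potential: int, importance: int, ease: int) -> float:
    """Average of three 1-10 ratings; a higher score means test sooner."""
    for rating in (potential, importance, ease):
        if not 1 <= rating <= 10:
            raise ValueError("ratings must be between 1 and 10")
    return (potential + importance + ease) / 3


# Hypothetical backlog: (idea, potential, importance, ease)
backlog = [
    ("Sticky add-to-cart on mobile PDP", 8, 9, 7),
    ("Homepage hero copy", 4, 5, 9),
    ("Checkout step 2 field reduction", 9, 8, 4),
]

# Highest score first: this is the agreed queue.
ranked = sorted(backlog, key=lambda row: pie_score(*row[1:]), reverse=True)
for idea, *ratings in ranked:
    print(f"{pie_score(*ratings):.1f}  {idea}")
```

The value of the score is the shared ordering it produces, so the weighting can be changed as long as everyone uses the same one.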
The cold-start trap
Teams that start experimenting without a prior audit usually burn their first 6-8 tests on guesses. Pull at least 90 days of GA4 funnel data first and rank tests against your actual drop-off points — checkout step 2, mobile PDP, or filtered category views are almost always richer ground than the homepage.
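As a rough illustration of ranking drop-off points, the sketch below takes step-by-step visitor counts and sorts the transitions by how many visitors they lose. It assumes the counts have already been exported from your analytics tool; it does not call any GA4 API, and the numbers shown are invented.

```python
def rank_drop_offs(funnel):
    """Return (from_step, to_step, drop_off_rate), largest drop-off first.

    `funnel` is an ordered list of (step_name, visitors_reaching_step).
    """
    drops = []
    for (step, count), (next_step, next_count) in zip(funnel, funnel[1:]):
        if count <= 0:
            raise ValueError(f"no visitors at step {step!r}")
        drops.append((step, next_step, 1 - next_count / count))
    return sorted(drops, key=lambda d: d[2], reverse=True)


# Hypothetical 90-day counts, for illustration only
funnel = [
    ("category view", 40_000),
    ("product page", 18_000),
    ("add to cart", 2_700),
    ("checkout step 1", 1_500),
    ("checkout step 2", 900),
    ("purchase", 750),
]

for start, end, rate in rank_drop_offs(funnel):
    print(f"{start} -> {end}: {rate:.0%} drop-off")
```

Raw drop-off rate is only a starting point for the ranking; the volume of traffic at each step and the ease of changing it still matter when you fill the queue.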
Step 2 — Test design and sample size
Design covers three choices: the primary metric, the unit of randomisation (visitor vs session — almost always visitor), and the audience filter. The primary metric should be as close to revenue as your traffic allows. On a store with under 50k sessions/month, that usually means conversion rate or add-to-cart rate, not revenue per visitor, which is noisier.
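Visitor-level randomisation is often implemented by hashing a stable visitor identifier, so the same person sees the same variant in every session. The sketch below shows that idea under stated assumptions: `assign_variant` is a hypothetical helper, and it presumes you already have a persistent visitor ID (for example from a first-party cookie).

```python
import hashlib


def assign_variant(visitor_id: str, experiment: str,
                   variants=("control", "variant")) -> str:
    """Deterministically map a visitor to a variant.

    Hashing the visitor ID together with the experiment name keeps the
    assignment stable across sessions and independent between experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{visitor_id}".encode()).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]


print(assign_variant("visitor-123", "sticky-atc-mobile"))
```

Because the assignment depends only on the ID and the experiment name, no lookup table is needed and a returning visitor never switches variants mid-test.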
Sample size flows directly from your baseline conversion rate, your MDE, your significance level, and the statistical power you want (90% is a reasonable default; 80% is the floor). With the standard two-proportion formula, a 2.5% baseline conversion rate trying to detect a 10% relative lift at 90% power and a two-sided 5% significance level needs roughly 86,000 visitors per variant. Stores often discover at this point that their hypothesis isn't feasible, and that's the framework doing its job.
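The calculation can be sketched with the standard normal-approximation formula for comparing two proportions. The function below assumes a two-sided test at alpha = 0.05 and equal traffic per variant; both are assumptions of this example, and commercial calculators may use slightly different formulas.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline: float, relative_mde: float,
                            power: float = 0.90, alpha: float = 0.05) -> int:
    """Visitors needed per variant for a two-sided two-proportion z-test."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)           # rate if the MDE is achieved
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


n = sample_size_per_variant(baseline=0.025, relative_mde=0.10)
print(n)
```

Under these assumptions, a 2.5% baseline and a 10% relative MDE at 90% power come out at roughly 86,000 visitors per variant. The result is sensitive to the inputs: halving the MDE roughly quadruples the required sample.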
[Chart: sample size per variant by minimum detectable effect (2.5% baseline CR, 90% power)]
The chart explains why most small-effect tests never reach significance: the traffic isn't there. If you can't reasonably reach the required sample inside two full business cycles, redesign the test for a bigger swing — bundle changes into a stronger variant rather than testing pixel-level tweaks one at a time.
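The feasibility check described here is simple arithmetic, sketched below. The `max_weeks` ceiling stands in for "two full business cycles", whose length depends on your store, so it is left as a parameter; the traffic figures are hypothetical.

```python
from math import ceil


def weeks_to_reach_sample(required_per_variant: int, weekly_visitors: int,
                          n_variants: int = 2) -> int:
    """Whole weeks needed for every variant to reach its required sample."""
    if weekly_visitors <= 0:
        raise ValueError("weekly_visitors must be positive")
    return ceil(required_per_variant * n_variants / weekly_visitors)


def is_feasible(required_per_variant: int, weekly_visitors: int,
                max_weeks: int, n_variants: int = 2) -> bool:
    """True if the test fits inside the longest window you will accept."""
    weeks = weeks_to_reach_sample(required_per_variant, weekly_visitors,
                                  n_variants)
    return weeks <= max_weeks


# Hypothetical store: 10,000 eligible visitors per week, 8-week ceiling
print(weeks_to_reach_sample(15_000, 10_000))
print(is_feasible(15_000, 10_000, max_weeks=8))
print(is_feasible(60_000, 10_000, max_weeks=8))
```

When the check fails, the lever is the variant, not the calendar: a bolder change justifies a larger MDE, which lowers the required sample.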
Step 3 — Runtime and analysis
Run for whole weeks — minimum two, ideally three — even if you've hit sample size early. Buyer behaviour on Tuesday looks nothing like Saturday, and a partial-week stop bakes a weekday bias into your result. End on the same weekday you started.
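A small sketch of the scheduling rule: take the longer of the sample-size runtime and the two-week minimum, always in whole weeks, so the end date falls on the same weekday as the start. `planned_end_date` is a hypothetical helper and the start date is arbitrary.

```python
from datetime import date, timedelta


def planned_end_date(start: date, weeks_for_sample: int,
                     min_weeks: int = 2) -> date:
    """End date after whole weeks: the longer of the sample-size runtime
    and the minimum window, landing on the same weekday as the start."""
    weeks = max(weeks_for_sample, min_weeks)
    return start + timedelta(weeks=weeks)


start = date(2024, 3, 4)  # a Monday, chosen only for illustration
end = planned_end_date(start, weeks_for_sample=1)
print(end, end.weekday() == start.weekday())
```

Fixing the end date before launch is what lets the team stop on a date rather than on a p-value.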
Once the window closes, analyse in this order: sample ratio check (did traffic actually split 50/50?), primary metric, guardrail metrics (revenue per visitor, page-load time, return rate), then segments. Segment analysis is for hypothesis generation, not for declaring wins — slicing into a non-significant overall result until something "works" is the most common way teams lie to themselves.
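The sample ratio check that opens this sequence can be sketched as a chi-square goodness-of-fit test on the visitor counts. The example assumes a two-variant test with an intended 50/50 split and uses only the standard library; the alert threshold you apply to the p-value (many teams use something strict such as 0.001) is a convention, not something this guide prescribes.

```python
from math import erfc, sqrt


def srm_p_value(visitors_a: int, visitors_b: int,
                expected_share_a: float = 0.5) -> float:
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for the split."""
    total = visitors_a + visitors_b
    expected_a = total * expected_share_a
    expected_b = total - expected_a
    chi2 = ((visitors_a - expected_a) ** 2 / expected_a
            + (visitors_b - expected_b) ** 2 / expected_b)
    # For 1 degree of freedom the chi-square tail equals erfc(sqrt(chi2 / 2)).
    return erfc(sqrt(chi2 / 2))


print(srm_p_value(10_000, 10_050))  # small imbalance
print(srm_p_value(10_000, 10_600))  # large imbalance
```

A very small p-value suggests the split itself is broken (for example by a redirect or bot filter affecting one arm), in which case the primary metric should not be read at all until the cause is found.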
Typical A/B test outcomes by platform and test surface (DTC stores, 12-month rolling)
| Platform / surface | Win rate | Avg. lift on winners | Inconclusive rate | Median runtime |
|---|---|---|---|---|
| Shopify — PDP | 22% | +6.4% | 48% | 18 days |
| Shopify — Checkout | 31% | +4.1% | 35% | 21 days |
| WooCommerce — Category | 18% | +5.8% | 55% | 24 days |
| Magento — Cart | 26% | +7.2% | 40% | 20 days |
| All platforms — Homepage | 14% | +3.9% | 62% | 22 days |
Two numbers in the table deserve attention. First, checkout tests win more often than homepage tests because the audience is high-intent and the metric distance is short. Second, inconclusive rates above 50% usually signal underpowered designs — that's a framework problem, not a creative problem.
Step 4 — Decision, rollout, and learning capture
The decision rule is set before the test starts: ship the variant if the primary metric is significant at your chosen alpha, no guardrail metric has degraded materially, and the effect direction matches your hypothesis. Anything else is either a rerun, a rollback, or a learning.
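The rule can be written down as code before launch, which makes it hard to renegotiate afterwards. The sketch below assumes a conversion-rate primary metric analysed with a pooled two-proportion z-test and a single boolean summarising the guardrails; `decide` and its parameters are hypothetical names, and the counts are invented.

```python
from math import erfc, sqrt


def two_proportion_p_value(conv_a: int, n_a: int,
                           conv_b: int, n_b: int) -> float:
    """Two-sided p-value for a pooled two-proportion z-test.

    Assumes at least one conversion overall, so the standard error is nonzero.
    """
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / se
    return erfc(abs(z) / sqrt(2))


def decide(conv_a: int, n_a: int, conv_b: int, n_b: int,
           expect_increase: bool = True, guardrails_ok: bool = True,
           alpha: float = 0.05) -> str:
    """Apply the pre-registered rule: significance, direction, guardrails."""
    lift = conv_b / n_b - conv_a / n_a
    significant = two_proportion_p_value(conv_a, n_a, conv_b, n_b) < alpha
    right_direction = (lift > 0) == expect_increase
    if significant and right_direction and guardrails_ok:
        return "ship"
    # Choosing between rerun, rollback, or logging a learning is left to the team.
    return "do not ship"


print(decide(750, 30_000, 870, 30_000))                       # clear lift
print(decide(750, 30_000, 780, 30_000))                       # not significant
print(decide(750, 30_000, 870, 30_000, guardrails_ok=False))  # guardrail breach
```

All three conditions must hold at once, so a significant result with a degraded guardrail is treated the same as no result.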
After rollout, keep a 5-10% holdout for two to four weeks to confirm the lift survives in production. Novelty effects fade, and a variant that wins by 6% in a two-week test sometimes settles at 3-4% steady-state. Document every test — win, loss, or inconclusive — with the hypothesis, result, and the next test it implies. This log is the asset that compounds.
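A test log does not need special tooling. As one possible shape, the sketch below stores each test as a JSON line; the `ExperimentRecord` schema and the example values are assumptions for illustration, not a prescribed format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ExperimentRecord:
    """One record in the experiment log (hypothetical schema)."""
    name: str
    hypothesis: str
    outcome: str          # "win", "loss", or "inconclusive"
    observed_lift: float  # relative lift on the primary metric
    next_test: str        # the follow-up this result implies

    def __post_init__(self):
        if self.outcome not in ("win", "loss", "inconclusive"):
            raise ValueError("outcome must be win, loss, or inconclusive")


def to_json_line(record: ExperimentRecord) -> str:
    """Serialise a record as one JSON line, ready to append to a log file."""
    return json.dumps(asdict(record), sort_keys=True)


record = ExperimentRecord(
    name="sticky-atc-mobile",
    hypothesis="Sticky add-to-cart lifts mobile add-to-cart rate by 5%+",
    outcome="inconclusive",
    observed_lift=0.012,
    next_test="Bundle sticky add-to-cart with a shorter variant selector",
)
print(to_json_line(record))
```

Requiring a `next_test` field on every record, including losses and inconclusive results, is what keeps the log feeding the backlog.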
What good looks like after 12 months
A team running this framework on a €5M Shopify store typically lands around 25-30 tests per year, a 25% win rate, and 3-5% compounded conversion-rate improvement net of regression. The wins aren't dramatic individually — the system is what makes them add up.
A/B testing framework FAQ
A/B testing is the technique — split traffic, compare metrics. The framework is the surrounding methodology that decides what to test, how to size it, when to stop, and how to act on the result. Without the framework, the technique produces noise.
Roughly 20,000-30,000 monthly sessions on the surface you're testing is the practical floor for detecting 10-15% relative lifts in a reasonable window. Below that, you're better off running diagnostic research and shipping informed redesigns rather than under-powered tests.
Run each test until you hit your pre-calculated sample size and complete at least two full weekly cycles, whichever takes longer. Stopping early on a "significant" p-value is the single most common cause of false wins.
Either a frequentist or a Bayesian approach works if applied consistently. Frequentist is easier to defend to stakeholders because the rules are bright lines (alpha, power, MDE). Bayesian is easier to interpret in plain English ("95% probability variant B is better") but requires more discipline to prevent peeking.
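To show where a statement like "95% probability variant B is better" comes from, here is a minimal Bayesian sketch. It assumes a conversion-rate metric, independent uniform Beta(1, 1) priors on each variant, and a Monte Carlo estimate; the counts are invented, and real tools may use different priors or closed-form calculations.

```python
import random


def prob_b_beats_a(conv_a: int, n_a: int, conv_b: int, n_b: int,
                   draws: int = 20_000, seed: int = 0) -> float:
    """Monte Carlo estimate of P(rate_B > rate_A) under Beta(1, 1) priors."""
    rng = random.Random(seed)  # seeded so the estimate is reproducible
    wins = 0
    for _ in range(draws):
        # Posterior for each rate is Beta(1 + conversions, 1 + non-conversions).
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws


print(prob_b_beats_a(250, 10_000, 300, 10_000))
```

The number is easy to read, but it moves around early in a test just as a p-value does, so the pre-committed sample size and runtime still apply.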
Across mature programmes, 20-30% of tests produce a shippable winner, another 10-15% produce a loser worth knowing about, and the rest are inconclusive. If your win rate is above 50%, you're either stopping tests early on noisy results or only testing safe bets.
You can run multiple tests at the same time on independent surfaces: a PDP test and a checkout test can run in parallel without meaningful interaction. Avoid running two tests on the same template, and never include the same audience in two tests measuring the same primary metric.
As guardrail metrics, revenue per visitor, page-load time, bounce rate, and return rate cover most stores. The point is to catch variants that win on a narrow metric (e.g. add-to-cart) while harming a downstream one (e.g. refunds), which happens more often than teams expect.
The framework is platform-agnostic, but Shopify constrains the checkout surface — on standard plans you can't A/B test the checkout itself. Most Shopify experimentation lives on PDP, cart, and category pages, where lift potential is real and tooling is mature.
When a result is inconclusive, don't ship the variant, and don't conclude "no effect." Inconclusive usually means underpowered. Either the effect is smaller than your MDE (in which case it's not worth shipping anyway) or you need a stronger variant and a rerun. Log it as a learning either way.
On tooling, you need three capabilities: traffic splitting, statistical analysis, and result documentation. These can come from one platform or several stitched together. The framework matters more than the tool: teams with a clear methodology and a basic tool out-perform teams with the best tool and no methodology.