How to Use an A/B Testing Process

A step-by-step A/B testing process — from prioritization to documentation — with realistic cycle times, QA checkpoints, and the decision rules teams actually use.
A/B Testing Process
The standard operating workflow that takes an experiment from idea to documented decision — eight repeatable steps.
An A/B testing process is the operational checklist a CRO team runs every test through: prioritize, design, QA, launch, monitor, analyze, decide, document. It exists so experimentation produces compounding learning instead of one-off opinions, and so two specialists running parallel tests reach the same quality bar.
The steps are sequential, but the process runs as a cycle. A well-run program closes one test on Monday and opens the next on Tuesday using the previous result as input. On a Shopify or WooCommerce store doing €1M-€15M in revenue, the constraint is rarely ideas; it's getting from hypothesis to clean decision in under three weeks without breaking site speed or analytics.
Most teams already do five or six of these steps. The gap that hurts revenue isn't a missing step — it's an inconsistent step. Tests launch without QA, get called early because the dashboard looks green, or finish and never get documented so the same hypothesis comes back six months later.
This guide walks through the full process in four phases (prioritize and design, build and QA, launch and monitor, analyze and document), with the specific checks that distinguish a serious program from a hopeful one. It assumes you already understand what A/B testing is at a conceptual level; here we focus on the operating mechanics.
Phase 1: Prioritize and design
Prioritization is the step that decides whether the next quarter compounds or stalls. You pick from a backlog of hypotheses using a scoring framework — ICE, PIE, or PXL are the common ones — that forces you to weight impact, confidence and effort consistently rather than testing whatever the loudest stakeholder asked for last.
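As a rough illustration of consistent scoring, here is a minimal sketch of an ICE-style ranking. The 1-10 scales, the simple average, the tie-break on ease, and the example backlog entries are all assumptions for illustration; some teams multiply the inputs or use PIE or PXL criteria instead.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    name: str
    impact: int      # 1-10: expected effect on the primary metric (assumed scale)
    confidence: int  # 1-10: strength of the supporting evidence
    ease: int        # 1-10: inverse of effort (10 = trivial to build)

    def ice_score(self) -> float:
        # Simple average of the three inputs; other ICE variants multiply them.
        return (self.impact + self.confidence + self.ease) / 3


def rank_backlog(backlog: list[Hypothesis]) -> list[Hypothesis]:
    # Highest score first; ties are broken by ease, so cheaper tests go first.
    return sorted(backlog, key=lambda h: (h.ice_score(), h.ease), reverse=True)


# Hypothetical backlog entries, for illustration only.
backlog = [
    Hypothesis("Trust badges above add-to-cart (PDP, mobile)", 7, 8, 6),
    Hypothesis("New homepage hero copy", 4, 3, 9),
    Hypothesis("One-page checkout", 9, 5, 2),
]

for h in rank_backlog(backlog):
    print(f"{h.ice_score():.1f}  {h.name}")
```

The point of the sketch is less the formula than the habit: every hypothesis gets the same three numbers, so the ranking can be challenged input by input rather than by seniority.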
The hypothesis itself should be a single falsifiable sentence: "If we move the trust badges above the add-to-cart button on PDPs, mobile add-to-cart rate will increase, because session recordings show users scroll past the badges before deciding." If you can't write that sentence, you don't have a test yet — you have an opinion.
Design covers the variant spec, the primary metric, guardrail metrics, target audience, and the required sample size. A sample-size calculation up front tells you whether the test is even viable on your traffic — running a test that needs eight weeks on a page that gets a campaign refresh every three weeks is a waste before it starts.
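To make the viability check concrete, the sketch below uses the standard normal-approximation formula for comparing two conversion rates. The defaults (two-sided 5% significance, 80% power) and the example baseline and lift are assumptions, not values from this guide; your testing tool may use a different statistical model and give a different number.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline_rate: float, relative_lift: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Visitors needed in each arm to detect the given relative lift."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)


# Assumed example: 3% baseline conversion, 10% relative lift to detect.
n = sample_size_per_arm(baseline_rate=0.03, relative_lift=0.10)
print(n)
```

With these assumed inputs the requirement lands at roughly 53,000 visitors per arm, which is why small expected lifts on low-traffic pages often fail the viability check before any build work starts.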
Don't skip the pre-mortem
Before building anything, write down what you'd do if the variant wins by 8%, by 1%, and if it loses by 5%. Teams that don't agree on decision rules in advance spend the post-test meeting arguing about whether to ship, instead of shipping.
Phase 2: Build and QA
Build is where most teams underestimate effort. A "simple" headline swap touches the PDP template, the analytics layer, the consent banner, and — if you're on Shopify — possibly the theme's section schema. Budget the dev work honestly or self-serve with a visual editor, but don't pretend a one-hour estimate covers a four-hour reality.
QA is the step that separates a real program from theatre. Check the variant on Chrome, Safari, iOS Safari, and Android Chrome. Verify the event fires once per exposure, not on every page-view. Test it with an ad-blocker on. Confirm the variant doesn't break the checkout funnel for assigned users. Two hours of QA prevents two weeks of contaminated data.
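The "fires once per exposure" check can be automated against an export of your event log. The sketch below is one possible approach; the event name `experiment_exposure` and the fields `user_id` and `test_id` are hypothetical and should be replaced with whatever your analytics layer emits. It also assumes one exposure per user per test is the intended behaviour.

```python
from collections import Counter


def duplicate_exposures(events: list[dict]) -> dict[tuple[str, str], int]:
    """Return (user_id, test_id) pairs whose exposure event fired more than once."""
    counts = Counter(
        (e["user_id"], e["test_id"])
        for e in events
        if e["name"] == "experiment_exposure"  # hypothetical event name
    )
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical QA-session events: u1 was exposed twice, which is the bug to catch.
events = [
    {"name": "experiment_exposure", "user_id": "u1", "test_id": "pdp-badges"},
    {"name": "page_view", "user_id": "u1", "test_id": "pdp-badges"},
    {"name": "experiment_exposure", "user_id": "u1", "test_id": "pdp-badges"},
    {"name": "experiment_exposure", "user_id": "u2", "test_id": "pdp-badges"},
]

print(duplicate_exposures(events))
```

An empty result after clicking through several pages as an assigned user is the outcome you want before launch.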
Where time actually goes in a typical A/B test cycle
Notice that build and analyze together are more than half the cycle. Teams that want higher test velocity rarely get there by rushing QA — they get there by reducing dev dependency on the build step and by having a templated analysis workflow that doesn't reinvent the wheel for every test.
Phase 3: Launch and monitor
Launch should be a non-event if QA was thorough. Ramp to 50/50 immediately on most DTC traffic volumes — gradual ramps are a SaaS habit that makes little sense when you need two weeks of data and have one shot at peak-season traffic. Confirm the split is actually 50/50 in your analytics within the first 24 hours.
Monitoring is passive, not active. Look at the test once a day for sample-ratio mismatch, tracking errors, and catastrophic regressions on guardrails (page-load time, error rate, checkout completion). Do not look at the primary metric daily. Peeking at significance every morning is how teams convince themselves a noise-driven 8% lift is real and ship a flat variant.
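The daily sample-ratio mismatch check can be a few lines of code. This is a minimal sketch using a chi-square goodness-of-fit test with one degree of freedom; the alert threshold of 0.001 is a commonly used convention and an assumption here, not a rule from this guide.

```python
import math


def srm_p_value(control_users: int, variant_users: int,
                expected_control_share: float = 0.5) -> float:
    """P-value of a chi-square goodness-of-fit test on the traffic split."""
    total = control_users + variant_users
    expected_control = total * expected_control_share
    expected_variant = total - expected_control
    chi2 = ((control_users - expected_control) ** 2 / expected_control
            + (variant_users - expected_variant) ** 2 / expected_variant)
    # With 1 degree of freedom, the chi-square survival function
    # equals erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


# Assumed example counts: 5,000 users in control, 5,400 in the variant.
p = srm_p_value(5000, 5400)
if p < 0.001:  # assumed alert threshold
    print("Possible sample-ratio mismatch: check assignment and tracking")
```

Note that this check looks only at how many users landed in each arm, never at the primary metric, so running it daily does not count as peeking.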
Realistic A/B test cycle times by store traffic tier
| Monthly sessions | Min runtime | Typical runtime | Tests per quarter |
|---|---|---|---|
| 50k-150k | 3 weeks | 4-6 weeks | 2-3 |
| 150k-500k | 2 weeks | 2-3 weeks | 4-6 |
| 500k-1.5M | 1 week | 10-14 days | 8-12 |
| 1.5M+ | 1 week | 7-10 days | 12-18 |
These ranges assume you're testing meaningful changes on a primary conversion metric, not button colours on a thank-you page. If your traffic puts you in the top tier you can run more tests, but the discipline of the process doesn't change — what changes is how quickly the same eight steps cycle.
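To see where your own store falls in ranges like these, a back-of-envelope runtime estimate is enough. The sketch below assumes every session on the tested page enters the test, traffic is flat across the month, and the runtime is rounded up to whole weeks so each weekday is covered equally; the required sample per arm comes from whatever sample-size calculator you use.

```python
import math


def runtime_days(sample_per_arm: int, monthly_sessions: int,
                 arms: int = 2, min_days: int = 7) -> int:
    """Estimated runtime in days, rounded up to full weeks."""
    daily_sessions = monthly_sessions / 30  # assumes flat traffic
    days_for_sample = math.ceil(sample_per_arm * arms / daily_sessions)
    days = max(days_for_sample, min_days)   # never shorter than one week
    return math.ceil(days / 7) * 7          # cover whole weekly cycles


# Assumed inputs: 53,208 visitors per arm, 300k eligible sessions per month.
print(runtime_days(53_208, 300_000))
```

If only a fraction of sessions reach the tested page, pass that smaller number as `monthly_sessions`; otherwise the estimate will be optimistic.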
Phase 4: Analyze, decide, document
Analysis starts when the test reaches its pre-declared sample size, not when it hits significance. Pull the primary metric, segment by mobile vs desktop, new vs returning, and any audience cut you pre-registered. Check guardrails. If the headline result looks great but bounce rate spiked on mobile, you don't have a winner — you have a question.
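For the primary-metric pull, a confidence interval on the difference in conversion rates is often more useful than a bare significance flag. The sketch below uses a simple Wald interval, which is an assumption about method (your testing tool may use a different one), and the conversion counts are made up for illustration.

```python
import math
from statistics import NormalDist


def rate_difference_ci(conv_a: int, n_a: int, conv_b: int, n_b: int,
                       confidence: float = 0.95) -> tuple[float, float, float]:
    """Absolute difference in conversion rate (B minus A) with a Wald interval."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return diff, diff - z * se, diff + z * se


# Hypothetical pre-registered segments: (conversions A, users A, conversions B, users B).
segments = {
    "all":     (300, 10_000, 360, 10_000),
    "mobile":  (150, 6_000, 198, 6_000),
    "desktop": (150, 4_000, 162, 4_000),
}

for name, counts in segments.items():
    diff, low, high = rate_difference_ci(*counts)
    print(f"{name:8s} diff={diff:+.4f}  CI=({low:+.4f}, {high:+.4f})")
```

Segment intervals are wider than the overall one because each segment has fewer users, so treat a segment-level surprise as a prompt for a follow-up test rather than a verdict.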
The decision is one of four: ship, kill, iterate, or re-run. "Iterate" means the directional signal is real but the implementation wasn't, so the next test refines the same hypothesis. "Re-run" applies when something contaminated the test — a Black Friday email blast skewed the audience, a tracking bug only showed up halfway through.
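Decision rules are easier to honour when they are written down before launch in a form nobody can reinterpret afterwards. The sketch below is one hypothetical encoding of the four outcomes; the specific mapping (for example, treating a positive result with a guardrail failure as "iterate") is an assumption and should be replaced by whatever your team agrees in advance.

```python
def decide(ci_low: float, guardrails_ok: bool, contaminated: bool) -> str:
    """Map a finished test to ship / kill / iterate / re-run.

    ci_low is the lower bound of the confidence interval on the
    primary-metric difference (variant minus control).
    """
    if contaminated:
        return "re-run"   # the data can't be trusted, so there is no verdict
    if ci_low > 0:
        # The primary metric improved; ship only if nothing else broke.
        return "ship" if guardrails_ok else "iterate"
    return "kill"         # flat or negative: retire this version (assumed rule)


print(decide(ci_low=0.001, guardrails_ok=True, contaminated=False))   # prints ship
print(decide(ci_low=0.001, guardrails_ok=False, contaminated=False))  # prints iterate
```

The function is deliberately blunt. Any case it handles badly is a case worth arguing about before the test starts rather than in the post-test meeting.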
Documentation is the step everyone agrees matters and nobody does. Write a one-page summary with the hypothesis, screenshots of both variants, the primary metric result with confidence interval, the segment cuts, the decision, and the next test it triggers. Tag it with the page and element so future-you can search "PDP trust badges" and find every test that touched them.
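The one-page summary maps naturally onto a small structured record, which is what makes the archive searchable later. This is a minimal sketch; the field names, the tag-based search, and the example entries are hypothetical, and a spreadsheet or wiki template with the same columns would serve the same purpose.

```python
from dataclasses import dataclass, field


@dataclass
class ExperimentRecord:
    hypothesis: str
    primary_metric: str
    result_ci: tuple[float, float]  # interval on the primary-metric difference
    decision: str                   # ship, kill, iterate, or re-run
    next_test: str
    tags: set[str] = field(default_factory=set)  # page and element


def search(archive: list[ExperimentRecord], *terms: str) -> list[ExperimentRecord]:
    """Return records tagged with every search term (case-insensitive)."""
    wanted = {t.lower() for t in terms}
    return [r for r in archive if wanted <= {t.lower() for t in r.tags}]


# Made-up archive entries, for illustration only.
archive = [
    ExperimentRecord("Badges above add-to-cart lift mobile ATC", "add-to-cart rate",
                     (0.001, 0.011), "ship", "Badge copy variants",
                     {"PDP", "trust badges"}),
    ExperimentRecord("Shorter checkout form lifts completion", "checkout completion",
                     (-0.004, 0.006), "kill", "Address autocomplete",
                     {"checkout", "form"}),
]

for record in search(archive, "pdp", "trust badges"):
    print(record.decision, "-", record.hypothesis)
```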
The compounding asset
After two years, your test repository — not your current win rate — is the program's most valuable asset. New hires onboard from it, agencies stop pitching tests you already ran, and pattern-matching across 80 documented tests produces hypotheses no individual could generate.
Frequently asked questions
How long should an A/B test run?
Two to four weeks is the typical range for stores doing 150k-500k monthly sessions. The minimum is whatever your sample-size calculation says, and never less than one full business cycle (usually seven days) to absorb day-of-week effects. Running shorter risks a result that flips when you ship it.
Can you run multiple A/B tests at the same time?
Yes, if they don't touch the same page or user flow. Parallel tests on the PDP and the checkout are fine; two tests on the same PDP hero section are not, because you'll get interaction effects you can't cleanly attribute. Most DTC programs run 2-4 concurrent tests across different funnel stages.
What is the difference between A/B testing and an A/B testing process?
A/B testing is the statistical method: comparing two variants with random assignment. The A/B testing process is the operational workflow your team runs around that method: how hypotheses get prioritized, how tests get QA'd, how decisions get made and documented. Method without process produces noisy results.
How do you prioritize which tests to run?
Use a scoring framework (ICE, PIE, or PXL) applied consistently to every hypothesis. Weight impact by funnel stage: tests on high-traffic pages with high commercial weight (PDP, cart, checkout) usually outrank top-of-funnel content tests. Effort is the tiebreaker.
When should you stop a test early?
Only for guardrail violations (site speed regression, checkout breakage, tracking failure), never because the primary metric looks bad on day three. Calling a test early on the headline result is the most common way teams kill a variant on a false negative and waste the next month rebuilding.
What tools do you need to run an A/B testing process?
You need (1) a way to assign variants without flicker, (2) clean event tracking tied to the assignment, and (3) a statistics layer you trust. Whether that's one tool or a stack depends on your traffic and dev resources. Lightweight all-in-one tools beat fragmented stacks for stores under €15M revenue.
How many tests should you run per quarter?
For a 150k-500k session store with one CRO specialist, 4-6 tests per quarter is healthy. Below 2 means the process is broken; above 10 usually means tests are too small to move the business. Test velocity matters less than test learning rate, meaning what you know after each one.
What should you do when a test result is flat?
Flat is information. It tells you the lever you pulled doesn't move the metric you measured at the magnitude you tested. Document it, then decide: was the hypothesis wrong, the implementation weak, or the metric insensitive? Half of CRO progress comes from killing zombie hypotheses, not finding winners.
Should you run tests during peak season?
Don't launch new tests during peak weeks (Black Friday, post-Christmas sales, your category's seasonal spike). Traffic mix shifts, intent shifts, and your test audience stops representing your steady-state customer. Use those weeks to monitor, not learn.
Who owns each step of the process?
Typically: CRO specialist owns prioritize/design/analyze, dev or no-code editor owns build, analytics owner runs QA on tracking, CRO specialist plus head of e-commerce decide on ship/kill, and CRO specialist documents. On smaller teams one person wears all hats, but the process still needs to exist.