How to Use an A/B Testing Process

Metricuno

May 17, 2026


The A/B testing process explained step by step: prioritize, design, QA, launch, monitor, analyze, decide, document. With cycle-time benchmarks and pitfalls.

Quick answer

A step-by-step A/B testing process — from prioritization to documentation — with realistic cycle times, QA checkpoints, and the decision rules teams actually use.

Definition

Experimentation

A/B Testing Process

The standard operating workflow that takes an experiment from idea to documented decision — eight repeatable steps.

An A/B testing process is the operational checklist a CRO team runs every test through: prioritize, design, QA, launch, monitor, analyze, decide, document. It exists so experimentation produces compounding learning instead of one-off opinions, and so two specialists running parallel tests reach the same quality bar.

The steps run in sequence, but the process is a loop. A well-run program closes one test on Monday and opens the next on Tuesday, using the previous result as input. On a Shopify or WooCommerce store doing €1M-€15M in revenue, the constraint is rarely ideas; it's getting from hypothesis to clean decision in under three weeks without breaking site speed or analytics.

Also known as: experimentation workflow, CRO test process, A/B test SOP

Most teams already do five or six of these steps. The gap that hurts revenue isn't a missing step — it's an inconsistent step. Tests launch without QA, get called early because the dashboard looks green, or finish and never get documented so the same hypothesis comes back six months later.

This guide walks through the full process in four phases (prioritize and design, build and QA, launch and monitor, analyze and document), with the specific checks that distinguish a serious program from a hopeful one. It assumes you already understand what A/B testing is at a conceptual level; here we focus on the operating mechanics.

Phase 1: Prioritize and design

Prioritization is the step that decides whether the next quarter compounds or stalls. You pick from a backlog of hypotheses using a scoring framework — ICE, PIE, or PXL are the common ones — that forces you to weight impact, confidence and effort consistently rather than testing whatever the loudest stakeholder asked for last.
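As a minimal sketch of what "consistent" scoring can look like, the snippet below ranks a backlog with ICE. It assumes ICE is computed as the simple average of impact, confidence, and ease on a 1-10 scale (some teams multiply the three instead), and the hypothesis names and scores are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    name: str
    impact: int      # 1-10: expected effect on the primary metric
    confidence: int  # 1-10: strength of the supporting evidence
    ease: int        # 1-10: 10 = trivial to build, 1 = heavy dev effort

    @property
    def ice(self) -> float:
        # Assumption: ICE as a simple average; some teams multiply instead.
        return (self.impact + self.confidence + self.ease) / 3


def rank_backlog(backlog: list[Hypothesis]) -> list[Hypothesis]:
    """Return hypotheses sorted from highest to lowest ICE score."""
    return sorted(backlog, key=lambda h: h.ice, reverse=True)


# Illustrative backlog; every score here is made up.
backlog = [
    Hypothesis("Trust badges above add-to-cart (PDP, mobile)", 7, 8, 6),
    Hypothesis("Homepage hero video", 5, 3, 2),
    Hypothesis("Free-shipping threshold in cart drawer", 8, 6, 8),
]

for h in rank_backlog(backlog):
    print(f"{h.ice:4.1f}  {h.name}")
```

The value of the framework is less in the exact formula than in applying the same one to every hypothesis, so that scores stay comparable across the backlog.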

The hypothesis itself should be a single falsifiable sentence: "If we move the trust badges above the add-to-cart button on PDPs, mobile add-to-cart rate will increase, because session recordings show users scroll past the badges before deciding." If you can't write that sentence, you don't have a test yet — you have an opinion.

Design covers the variant spec, the primary metric, guardrail metrics, target audience, and the required sample size. A sample-size calculation up front tells you whether the test is even viable on your traffic — running a test that needs eight weeks on a page that gets a campaign refresh every three weeks is a waste before it starts.
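The viability check can be sketched with the standard normal-approximation formula for comparing two proportions. The defaults below (5% two-sided significance, 80% power) are conventional assumptions rather than values from this guide, and the function names are hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """Visitors needed per variant for a two-sided two-proportion z-test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def weeks_to_run(n_per_variant, weekly_visitors, variants=2):
    """Whole weeks needed, rounded up so every weekday is covered equally."""
    return math.ceil(n_per_variant * variants / weekly_visitors)


# Illustrative inputs: 3% baseline add-to-cart rate, hoping to detect a 10% relative lift,
# on a page that receives 20,000 eligible visitors per week.
n = sample_size_per_variant(baseline_rate=0.03, relative_lift=0.10)
print(n, "visitors per variant,", weeks_to_run(n, weekly_visitors=20_000), "weeks")
```

If the resulting runtime is longer than the window before the page changes again, the options are to test a bolder change (a larger expected lift), pick a higher-traffic page, or drop the test.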

Don't skip the pre-mortem

Before building anything, write down what you'd do if the variant wins by 8%, by 1%, and if it loses by 5%. Teams that don't agree on decision rules in advance spend the post-test meeting arguing about whether to ship, instead of shipping.

Phase 2: Build and QA

Build is where most teams underestimate effort. A "simple" headline swap touches the PDP template, the analytics layer, the consent banner, and — if you're on Shopify — possibly the theme's section schema. Budget the dev work honestly or self-serve with a visual editor, but don't pretend a one-hour estimate covers a four-hour reality.

QA is the step that separates a real program from theatre. Check the variant on Chrome, Safari, iOS Safari, and Android Chrome. Verify the event fires once per exposure, not on every page-view. Test it with an ad-blocker on. Confirm the variant doesn't break the checkout funnel for assigned users. Two hours of QA prevents two weeks of contaminated data.
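The tracking part of QA can be partly automated against an export of raw events. The sketch below assumes a hypothetical event format (dicts with `event`, `user_id`, `session_id`, and `variant` keys, and an exposure event named `experiment_exposure`) and assumes one exposure event is expected per session; adapt both to whatever your analytics layer actually emits.

```python
from collections import Counter, defaultdict

EXPOSURE = "experiment_exposure"  # hypothetical event name


def duplicate_exposures(events):
    """Return {(session_id, variant): count} for exposures logged more than once."""
    counts = Counter(
        (e["session_id"], e["variant"]) for e in events if e["event"] == EXPOSURE
    )
    return {key: n for key, n in counts.items() if n > 1}


def users_in_multiple_variants(events):
    """Return user_ids that were exposed to more than one variant."""
    seen = defaultdict(set)
    for e in events:
        if e["event"] == EXPOSURE:
            seen[e["user_id"]].add(e["variant"])
    return sorted(user for user, variants in seen.items() if len(variants) > 1)


# Invented sample export from a QA session.
events = [
    {"event": EXPOSURE, "user_id": "u1", "session_id": "s1", "variant": "A"},
    {"event": "page_view", "user_id": "u1", "session_id": "s1"},
    {"event": EXPOSURE, "user_id": "u1", "session_id": "s1", "variant": "A"},  # fired again
    {"event": EXPOSURE, "user_id": "u2", "session_id": "s2", "variant": "B"},
    {"event": EXPOSURE, "user_id": "u3", "session_id": "s3", "variant": "A"},
    {"event": EXPOSURE, "user_id": "u3", "session_id": "s4", "variant": "B"},  # reassigned
]

print(duplicate_exposures(events))
print(users_in_multiple_variants(events))
```

Both checks should come back empty before launch. A non-empty result usually points to an exposure event bound to page views, or to assignment that isn't sticky across sessions.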

Chart: Where time actually goes in a typical A/B test cycle

Notice that build and analyze together are more than half the cycle. Teams that want higher test velocity rarely get there by rushing QA — they get there by reducing dev dependency on the build step and by having a templated analysis workflow that doesn't reinvent the wheel for every test.

Phase 3: Launch and monitor

Launch should be a non-event if QA was thorough. Ramp to 50/50 immediately on most DTC traffic volumes — gradual ramps are a SaaS habit that makes little sense when you need two weeks of data and have one shot at peak-season traffic. Confirm the split is actually 50/50 in your analytics within the first 24 hours.

Monitoring is passive, not active. Look at the test once a day for sample-ratio mismatch, tracking errors, and catastrophic regressions on guardrails (page-load time, error rate, checkout completion). Do not look at the primary metric daily. Peeking at significance every morning is how teams convince themselves a noise-driven 8% lift is real and ship a flat variant.
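The sample-ratio-mismatch check is a chi-square goodness-of-fit test on the assignment counts. A minimal sketch using only the standard library follows; the 0.001 alert threshold is a commonly used convention assumed here, not a rule from this guide.

```python
import math


def srm_p_value(control_n, variant_n, expected_control_share=0.5):
    """P-value of a chi-square test (1 degree of freedom) on the observed split."""
    total = control_n + variant_n
    expected_control = total * expected_control_share
    expected_variant = total - expected_control
    chi2 = (
        (control_n - expected_control) ** 2 / expected_control
        + (variant_n - expected_variant) ** 2 / expected_variant
    )
    # For 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


def has_srm(control_n, variant_n, threshold=0.001):
    """True when the split deviates from 50/50 more than chance plausibly explains."""
    return srm_p_value(control_n, variant_n) < threshold


print(has_srm(10_000, 10_100))  # small imbalance
print(has_srm(10_000, 10_500))  # larger imbalance
```

A flagged mismatch is a reason to pause and debug assignment or tracking, since any result computed on a skewed split is suspect regardless of what the primary metric shows.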

Benchmark: Realistic A/B test cycle times by store traffic tier

Monthly sessions	Min runtime	Typical runtime	Tests per quarter
50k-150k	3 weeks	4-6 weeks	2-3
150k-500k	2 weeks	2-3 weeks	4-6
500k-1.5M	1 week	10-14 days	8-12
1.5M+	1 week	7-10 days	12-18

These ranges assume you're testing meaningful changes on a primary conversion metric, not button colours on a thank-you page. If your traffic puts you in the top tier you can run more tests, but the discipline of the process doesn't change — what changes is how quickly the same eight steps cycle.

Phase 4: Analyze, decide, document

Analysis starts when the test reaches its pre-declared sample size, not when it hits significance. Pull the primary metric, segment by mobile vs desktop, new vs returning, and any audience cut you pre-registered. Check guardrails. If the headline result looks great but bounce rate spiked on mobile, you don't have a winner — you have a question.
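For a conversion-rate primary metric, the headline comparison can be sketched as a two-proportion z-test with a confidence interval on the difference. This is one common frequentist approach, assumed here for illustration; your testing tool may use a different statistics engine. The counts in the example are invented.

```python
import math
from statistics import NormalDist


def compare_conversion(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Compare control (a) and variant (b) conversion rates.

    Returns the absolute difference (b minus a), its confidence interval,
    and a two-sided p-value. Assumes both arms have some conversions.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a

    # Pooled standard error for the hypothesis test (null: no difference).
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    p_value = 2 * (1 - NormalDist().cdf(abs(diff / se_pool)))

    # Unpooled standard error for the interval around the observed difference.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return {"diff": diff, "ci": (diff - z_crit * se, diff + z_crit * se), "p_value": p_value}


# Invented counts, split by a pre-registered segment.
segments = {
    "mobile": (300, 10_000, 400, 10_000),
    "desktop": (250, 6_000, 255, 6_000),
}
for name, counts in segments.items():
    print(name, compare_conversion(*counts))
```

Segment cuts multiply the number of comparisons, so treat them as context for the primary result unless they were pre-registered as decision criteria.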

The decision is one of four: ship, kill, iterate, or re-run. "Iterate" means the directional signal is real but the implementation fell short, so the next test refines the same hypothesis. "Re-run" applies when something contaminated the test, such as a Black Friday email blast that skewed the audience or a tracking bug that only showed up halfway through.
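Writing the decision rule down as code before launch is one way to keep the post-test meeting short. The sketch below is a deliberately simplified assumption about how the four outcomes might map to the evidence; real rules should come from your own pre-mortem, and a flat result may still deserve an iteration rather than a kill.

```python
from enum import Enum


class Decision(Enum):
    SHIP = "ship"
    KILL = "kill"
    ITERATE = "iterate"
    RERUN = "re-run"


def decide(ci_low: float, ci_high: float, guardrails_ok: bool, contaminated: bool) -> Decision:
    """Map a finished test to one of the four outcomes.

    ci_low and ci_high bound the lift on the primary metric (variant minus control).
    """
    if contaminated:
        # The data can't be trusted, so no other signal matters.
        return Decision.RERUN
    if ci_low > 0:
        # Clear win: ship unless a guardrail says the implementation hurt something.
        return Decision.SHIP if guardrails_ok else Decision.ITERATE
    # The interval includes zero or sits entirely below it: no evidence of a win.
    return Decision.KILL


print(decide(0.005, 0.015, guardrails_ok=True, contaminated=False).value)
```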

Documentation is the step everyone agrees matters and nobody does. Write a one-page summary with the hypothesis, screenshots of both variants, the primary metric result with confidence interval, the segment cuts, the decision, and the next test it triggers. Tag it with the page and element so future-you can search "PDP trust badges" and find every test that touched them.
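A searchable archive does not need special tooling to start. As a rough sketch, each summary can be stored as a structured record and filtered by page and element; the field names and the example entries below are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ExperimentRecord:
    hypothesis: str
    page: str                 # e.g. "PDP", "cart", "checkout"
    element: str              # e.g. "trust badges"
    primary_metric: str
    lift: float               # observed relative lift on the primary metric
    ci: tuple[float, float]   # confidence interval for that lift
    decision: str             # ship / kill / iterate / re-run
    next_test: str = ""
    segments: dict[str, float] = field(default_factory=dict)

    def matches(self, query: str) -> bool:
        """True when every word of the query appears in page, element, or hypothesis."""
        haystack = f"{self.page} {self.element} {self.hypothesis}".lower()
        return all(word in haystack for word in query.lower().split())


def search(archive: list[ExperimentRecord], query: str) -> list[ExperimentRecord]:
    return [record for record in archive if record.matches(query)]


# Invented entries for illustration only.
archive = [
    ExperimentRecord(
        hypothesis="Moving trust badges above add-to-cart raises mobile add-to-cart rate",
        page="PDP", element="trust badges", primary_metric="add_to_cart_rate",
        lift=0.04, ci=(0.01, 0.07), decision="ship",
        next_test="Badge copy variants", segments={"mobile": 0.06, "desktop": 0.01},
    ),
    ExperimentRecord(
        hypothesis="Progress bar in checkout reduces abandonment",
        page="checkout", element="progress bar", primary_metric="checkout_completion",
        lift=0.00, ci=(-0.02, 0.02), decision="kill",
    ),
]

for record in search(archive, "PDP trust badges"):
    print(record.decision, "|", record.hypothesis)
```

Screenshots and longer notes can live alongside each record; the point is that page and element are structured fields rather than free text buried in a slide deck.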

The compounding asset

After two years, your test repository — not your current win rate — is the program's most valuable asset. New hires onboard from it, agencies stop pitching tests you already ran, and pattern-matching across 80 documented tests produces hypotheses no individual could generate.

Frequently asked questions

How long should a single A/B test run on a Shopify store?

Two to four weeks is the typical range for stores doing 150k-500k monthly sessions. The minimum is whatever your sample-size calculation says, plus at least one full business cycle (usually seven days) to absorb day-of-week effects. Running shorter risks a result that flips when you ship it.

Can I run multiple A/B tests at the same time?

Yes, if they don't touch the same page or user flow. Parallel tests on the PDP and the checkout are fine; two tests on the same PDP hero section are not, because you'll get interaction effects you can't cleanly attribute. Most DTC programs run 2-4 concurrent tests across different funnel stages.

What's the difference between A/B testing and the A/B testing process?

A/B testing is the statistical method: comparing two variants with random assignment. The A/B testing process is the operational workflow your team runs around that method: how hypotheses get prioritized, how tests get QA'd, how decisions get made and documented. Method without process produces noisy results.

How do we prioritize which test to run first?

Use a scoring framework (ICE, PIE, or PXL) applied consistently to every hypothesis. Weight impact by funnel stage: tests on high-traffic pages with high commercial weight (PDP, cart, checkout) usually outrank top-of-funnel content tests. Effort is the tiebreaker.

When should we kill a test early?

Only for guardrail violations (site speed regression, checkout breakage, tracking failure), never because the primary metric looks bad on day three. Killing a test early on the headline result is the most common way teams produce false negatives and waste the next month rebuilding.

Do we need a dedicated tool, or can we use Google Optimize alternatives?

You need (1) a way to assign variants without flicker, (2) clean event tracking tied to the assignment, and (3) a statistics layer you trust. Whether that's one tool or a stack depends on your traffic and dev resources. Lightweight all-in-one tools beat fragmented stacks for stores under €15M revenue.

How many tests per quarter is realistic?

For a 150k-500k session store with one CRO specialist, 4-6 tests per quarter is healthy. Below 2 means the process is broken; above 10 usually means tests are too small to move the business. Test velocity matters less than test learning rate, meaning what you know after each one.

What if our test results are flat?

Flat is information. It tells you the lever you pulled doesn't move the metric you measured at the magnitude you tested. Document it, then decide: was the hypothesis wrong, the implementation weak, or the metric insensitive? Half of CRO progress comes from killing zombie hypotheses, not finding winners.

How do we handle seasonality in the testing process?

Don't launch new tests during peak weeks (Black Friday, post-Christmas sales, your category's seasonal spike). Traffic mix shifts, intent shifts, and your test audience stops representing your steady-state customer. Use those weeks to monitor, not learn.

Who owns each step of the A/B testing process?

Typically: CRO specialist owns prioritize/design/analyze, dev or no-code editor owns build, analytics owner runs QA on tracking, CRO specialist plus head of e-commerce decide on ship/kill, and CRO specialist documents. On smaller teams one person wears all hats, but the process still needs to exist.