
Experimentation Benchmarks

Metricuno

May 17, 2026


A/B test win rates, effect sizes, runtime medians, and sample-size norms from real CRO programs. Use them to set realistic experimentation targets.

Quick answer

Realistic benchmarks for A/B test win rates, effect sizes, and runtimes — so you can plan a CRO roadmap that ships winners instead of chasing fantasy lifts.

Definition

Experimentation

Experimentation Benchmarks

Reference figures — win rates, effect sizes, runtimes, and sample-size norms — that show what realistic A/B testing programs actually deliver.

Experimentation benchmarks are the typical performance ranges of CRO programs: how often a test wins, how big the average lift is, how long tests run, and how much traffic they need to reach significance. Across mature programs, win rates cluster at 15-25%, median lifts on winning variants land at 3-8% on the primary metric, and tests typically run 14-28 days.

These numbers matter because most experimentation roadmaps are planned with the wrong expectations. A team forecasting a 20% lift on every test will overspend on dev time and underdeliver on revenue. Benchmarks let you size the prize honestly before you ship a single variant.

Also known as

A/B testing benchmarks

CRO program benchmarks

test win rate benchmarks

The figures on this page are drawn from public CRO program data — Microsoft, Booking.com, and Airbnb post-mortems, plus aggregated tool-vendor reports from Optimizely, VWO, and Convert. They're directional, not gospel. Your industry, traffic volume, and program maturity all shift the numbers.

Use them two ways. First, as a sanity check on individual test results — a 40% reported lift on a button-color test almost always evaporates on replication. Second, as a planning input: if your program runs 24 tests a year at a 20% win rate, you ship roughly 5 winners. That's the unit of work to plan around, not the wishful one.
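The planning arithmetic above can be sketched in a few lines of Python. The function names are illustrative, and treating each test as an independent trial with the same win probability is a simplifying assumption, not something the benchmark data guarantees.

```python
import math


def expected_winners(tests_per_year: int, win_rate: float) -> float:
    """Expected number of winning tests in a year."""
    return tests_per_year * win_rate


def prob_at_least(k: int, tests_per_year: int, win_rate: float) -> float:
    """Chance of shipping at least k winners, assuming independent tests
    that all share the same win probability (a binomial model)."""
    return sum(
        math.comb(tests_per_year, i)
        * win_rate ** i
        * (1 - win_rate) ** (tests_per_year - i)
        for i in range(k, tests_per_year + 1)
    )


if __name__ == "__main__":
    # The 24-tests-at-20% program from the paragraph above.
    print(f"Expected winners: {expected_winners(24, 0.20):.1f}")
    print(f"Chance of 3 or more winners: {prob_at_least(3, 24, 0.20):.0%}")
```

The second function is a reminder that "roughly 5 winners" is an average: any single year can land above or below it.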

Benchmark

Experimentation program benchmarks by maturity level

Program maturity	Win rate	Median lift on winners	Avg runtime (days)	Tests / year
Beginner (year 1)	10-15%	5-12%	21-35	8-15
Established (years 2-3)	18-22%	3-6%	14-21	20-40
Mature (4+ years)	20-25%	2-4%	10-18	50-150
Best-in-class (Booking, Amazon-tier)	10-15%	0.5-2%	7-14	500+

Notice the counter-intuitive pattern at the bottom of the table. Best-in-class programs report lower win rates and smaller lifts than mid-maturity teams. That's not regression. These programs test bolder hypotheses on already-optimised pages, where the easy wins are gone, so each successful experiment is worth less in percentage terms but more in absolute revenue.

Chart

Win rate by program maturity

Aggregated from Optimizely, VWO, and Convert benchmark reports, 2021-2023.

Interpreting the numbers for your program

Win rate is the headline number, but it's also the most misleading one in isolation. A 30% win rate with 6 tests a year produces fewer winners than a 15% win rate with 40 tests a year. Velocity multiplies the benchmark — and velocity is what most stores leave on the table.

Effect size is the more honest planning number. If your average winner adds 4% to checkout conversion and your checkout does €200k a month, each shipped winner is worth around €8k/month in incremental revenue. Multiply by expected winners per year, subtract program cost, and you have the real ROI question — not the brochure version.
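As a rough sketch of that ROI arithmetic, the snippet below values one winner and then a year's worth of them. The function names, the 24-tests-at-20% program, and the €45k program cost are illustrative inputs. Counting every winner at its full 12-month run rate is a simplifying assumption that overstates year-one revenue, since winners ship throughout the year; lower `months_live` to model that.

```python
def monthly_value_per_winner(monthly_revenue: float, relative_lift: float) -> float:
    """Incremental monthly revenue from one shipped winner."""
    return monthly_revenue * relative_lift


def annual_net_return(
    tests_per_year: int,
    win_rate: float,
    value_per_winner_monthly: float,
    program_cost: float,
    months_live: float = 12.0,  # assumption: every winner earns for a full year
) -> float:
    """Revenue from the expected number of winners, minus program cost."""
    winners = tests_per_year * win_rate
    return winners * value_per_winner_monthly * months_live - program_cost


if __name__ == "__main__":
    # The €200k/month checkout with a 4% average winner, as in the text.
    per_winner = monthly_value_per_winner(200_000, 0.04)
    print(f"Value per winner: €{per_winner:,.0f}/month")

    # Hypothetical program: 24 tests a year, 20% win rate, €45k all-in cost.
    net = annual_net_return(24, 0.20, per_winner, program_cost=45_000)
    print(f"Net annual return at full run rate: €{net:,.0f}")
```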

Beware the inflated-lift trap

Tests stopped early — the moment a variant crosses 95% confidence — overstate their true lift by 20-50% on average. This is regression to the mean, and it's why benchmark medians sit at 3-8% even though individual reported lifts often look much higher. Always pre-commit to a fixed sample size or runtime.
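The effect is easy to reproduce in a small simulation. The sketch below is illustrative only: the 5% baseline, 5% true lift, batch size, and ten interim looks are assumed parameters, and the size of the overstatement it prints depends heavily on them, so it should not be read as reproducing the 20-50% figure.

```python
import math
import random


def z_score(conv_a: int, conv_b: int, n: int) -> float:
    """Pooled two-proportion z statistic for variant B vs control A,
    with n visitors in each arm."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = math.sqrt(pooled * (1 - pooled) * 2 / n)
    return 0.0 if se == 0 else (conv_b - conv_a) / n / se


def simulate(true_lift=0.05, baseline=0.05, batch=1000, peeks=10,
             runs=200, seed=1):
    """Return (early_stopped_lifts, fixed_horizon_lifts) for winning tests."""
    rng = random.Random(seed)
    p_b = baseline * (1 + true_lift)
    early, fixed = [], []
    for _ in range(runs):
        conv_a = conv_b = n = 0
        stopped = False
        significant = False
        for _ in range(peeks):
            conv_a += sum(rng.random() < baseline for _ in range(batch))
            conv_b += sum(rng.random() < p_b for _ in range(batch))
            n += batch
            significant = conv_a > 0 and z_score(conv_a, conv_b, n) > 1.96
            # Peeking rule: record the lift the first time the variant looks significant.
            if significant and not stopped:
                early.append(conv_b / conv_a - 1)
                stopped = True
        # Fixed-horizon rule: judge only once, at the planned sample size.
        if significant:
            fixed.append(conv_b / conv_a - 1)
    return early, fixed


def mean(values):
    return sum(values) / len(values) if values else float("nan")


if __name__ == "__main__":
    early, fixed = simulate()
    print("True lift: 5.0%")
    print(f"Peeking:       {len(early)} winners, mean reported lift {mean(early):.1%}")
    print(f"Fixed horizon: {len(fixed)} winners, mean reported lift {mean(fixed):.1%}")
```

Every test that wins at the fixed horizon also wins under the peeking rule, so peeking can only add winners, and the ones it adds are those whose early, noisy lift happened to look large.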

Applying benchmarks to your roadmap

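Start by computing your traffic ceiling. At a 3% baseline conversion rate, detecting a 10% relative lift at 80% power and 95% confidence takes roughly 53,000 visitors per variant under the standard two-proportion approximation. A page with 20,000 monthly visitors would need about five months to collect that for a two-variant test; in three weeks it can only reliably detect lifts of roughly 30%. Anything smaller needs either more traffic, a higher-funnel metric, or a longer runtime, which directly caps how many tests you can run a year.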
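The traffic ceiling can be estimated with the standard two-sided, two-proportion sample-size approximation. This is a minimal sketch: the function names are illustrative, a 30-day month and an even traffic split are assumed, and dedicated calculators may use slightly different formulas and return somewhat different numbers.

```python
import math
from statistics import NormalDist


def visitors_per_variant(baseline: float, relative_lift: float,
                         alpha: float = 0.05, power: float = 0.80) -> int:
    """Visitors needed in each arm to detect the given relative lift
    (normal approximation, two-sided test)."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def runtime_days(n_per_variant: int, monthly_visitors: float,
                 variants: int = 2) -> int:
    """Days needed to fill every arm, assuming a 30-day month
    and traffic split evenly across variants."""
    daily_visitors = monthly_visitors / 30
    return math.ceil(n_per_variant * variants / daily_visitors)


if __name__ == "__main__":
    # 3% baseline, 10% relative lift, 20,000 monthly visitors.
    n = visitors_per_variant(baseline=0.03, relative_lift=0.10)
    print(f"Visitors per variant: {n:,}")
    print(f"Runtime: {runtime_days(n, monthly_visitors=20_000)} days")
```

Running it with your own baseline and traffic shows quickly which lifts are detectable within the 2-6 week window and which ideas should be dropped or moved up the funnel.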

Then set a hypothesis bar. Mature programs reject roughly 60-70% of submitted ideas before they ever reach a test, because the expected lift wouldn't survive sample-size math. That filter is what protects velocity. Without it, you spend three weeks proving a copy tweak moved nothing.

Frequently asked

Experimentation benchmark questions

What's a good A/B test win rate?

For established programs, 18-22% is the realistic range. Beginners typically see 10-15% as they learn the basics of hypothesis quality and statistical rigor. Anything reported above 35% over a meaningful sample of tests usually indicates peeking, underpowered tests, or counting non-significant results as wins.

How long should an A/B test run?

Median runtimes are 14-21 days for established programs. The floor is two full business cycles (so usually 14 days minimum) to capture day-of-week effects. The ceiling is around 6 weeks, after which cookie deletion, audience drift, and external events start to muddy the data.

What sample size do I need per variant?

It depends on your baseline conversion rate and the minimum detectable effect you care about. As a rough anchor: detecting a 10% relative lift on a 3% baseline conversion rate requires roughly 53,000 visitors per variant at 80% power and 95% confidence (two-sided test, standard two-proportion approximation). Use a sample-size calculator with your real numbers.

Why do reported lifts shrink after the test ends?

Two reasons. First, regression to the mean: tests that just barely cross significance tend to revert. Second, novelty effects fade as the variant stops being new. Plan for a 20-30% haircut on reported lifts when you forecast annual revenue impact from a winning test.

Are these benchmarks realistic for a small Shopify store?

The win rates and effect sizes translate, but the runtime and tests-per-year figures don't. A store doing 15,000 sessions a month can run maybe 6-10 well-powered tests a year, not 40. Either test higher up the funnel where sample sizes are bigger, or batch changes into bigger redesign tests.

What's a realistic effect size to plan around?

3-8% relative lift on the primary metric for winning tests in established programs. Smaller for mature programs (often 1-3%), larger for early-stage programs that still have obvious UX issues to fix. Anything you forecast above 15% should be treated as a long-shot, not a base case.

How many tests should we run per year?

Aim for whatever your traffic supports, not an arbitrary target. Most established programs ship 20-40 tests a year. Below 10 tests a year, learning compounds too slowly. Above 50, you're usually testing things that aren't worth the sample-size cost.

Should I count inconclusive tests as losses?

Yes, for win-rate accounting. An inconclusive test is one that didn't beat control with enough confidence to ship, and that's not a win. Mature programs typically see 50-60% of tests come back inconclusive, 20-25% win, and 15-25% lose to control.

How do these benchmarks compare to email or ad testing?

Email subject-line tests typically show higher reported lifts (5-15%) because the metric is open rate on a smaller, more engaged sample. Ad creative tests on Meta or Google show similar win rates to on-site CRO but bigger effect sizes because creative variance is wider. They're not directly comparable to on-site experimentation.

What's the ROI of an experimentation program at these benchmarks?

Rough math: a store doing €5M in annual revenue, running 25 tests a year at a 20% win rate with 4% average lift on winners, generates roughly €100-150k in incremental annual revenue. Subtract tool and program cost (typically €30-60k all-in) and the program returns 2-4x in year one, with compounding wins after.


