Experimentation Benchmarks

Metricuno
May 17, 2026
A/B test win rates, effect sizes, runtime medians and sample-size norms from real CRO programs. Use them to set realistic experimentation targets.
Quick answer

Realistic benchmarks for A/B test win rates, effect sizes, and runtimes — so you can plan a CRO roadmap that ships winners instead of chasing fantasy lifts.

Definition
Experimentation Benchmarks

Reference figures — win rates, effect sizes, runtimes, and sample-size norms — that show what realistic A/B testing programs actually deliver.

Experimentation benchmarks are the typical performance ranges of CRO programs: how often a test wins, how big the average lift is, how long tests run, and how much traffic they need to reach significance. Across mature programs, win rates cluster at 15-25%, median lifts on winning variants land at 3-8% on the primary metric, and tests typically run 14-28 days.

These numbers matter because most experimentation roadmaps are planned with the wrong expectations. A team forecasting a 20% lift on every test will overspend on dev time and underdeliver on revenue. Benchmarks let you size the prize honestly before you ship a single variant.

Also known as
A/B testing benchmarks
CRO program benchmarks
test win rate benchmarks

The figures on this page are drawn from public CRO program data — Microsoft, Booking.com, and Airbnb post-mortems, plus aggregated tool-vendor reports from Optimizely, VWO, and Convert. They're directional, not gospel. Your industry, traffic volume, and program maturity all shift the numbers.

Use them two ways. First, as a sanity check on individual test results — a 40% reported lift on a button-color test almost always evaporates on replication. Second, as a planning input: if your program runs 24 tests a year at a 20% win rate, you ship roughly 5 winners. That's the unit of work to plan around, not the wishful one.

Benchmark

Experimentation program benchmarks by maturity level

| Program maturity | Win rate | Median lift on winners | Avg runtime (days) | Tests / year |
| --- | --- | --- | --- | --- |
| Beginner (year 1) | 10-15% | 5-12% | 21-35 | 8-15 |
| Established (years 2-3) | 18-22% | 3-6% | 14-21 | 20-40 |
| Mature (4+ years) | 20-25% | 2-4% | 10-18 | 50-150 |
| Best-in-class (Booking, Amazon-tier) | 10-15% | 0.5-2% | 7-14 | 500+ |

Notice the counter-intuitive pattern at the bottom of the table. Best-in-class programs report lower win rates and smaller lifts than mid-maturity teams. That's not a step backwards. These programs test bolder hypotheses on already-optimised pages, where the easy wins are gone, so each successful experiment is worth less in percentage terms but more in absolute revenue.

Chart: Win rate by program maturity (x-axis: program maturity, from Beginner through Established and Mature to Best-in-class; y-axis: win rate, 0% to 25%).
Aggregated from Optimizely, VWO, and Convert benchmark reports, 2021-2023.

Interpreting the numbers for your program

Win rate is the headline number, but it's also the most misleading one in isolation. A 30% win rate with 6 tests a year produces fewer winners than a 15% win rate with 40 tests a year. Velocity multiplies the benchmark — and velocity is what most stores leave on the table.
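The velocity point is easy to check with a few lines of arithmetic. The sketch below is only an illustration: `expected_winners` is a hypothetical helper name, and it treats the win rate as a simple long-run average.

```python
def expected_winners(tests_per_year: int, win_rate: float) -> float:
    """Expected number of winning tests per year (tests x win rate)."""
    return tests_per_year * win_rate


# The two programs compared above, plus the 24-test planning example.
slow_but_sharp = expected_winners(6, 0.30)
fast_but_modest = expected_winners(40, 0.15)
planning_case = expected_winners(24, 0.20)

print(f"6 tests at 30%:  {slow_but_sharp:.1f} winners")
print(f"40 tests at 15%: {fast_but_modest:.1f} winners")
print(f"24 tests at 20%: {planning_case:.1f} winners")
```

The higher-velocity program ends the year with roughly three times as many winners despite having half the win rate.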

Effect size is the more honest planning number. If your average winner adds 4% to checkout conversion and your checkout does €200k a month, each shipped winner is worth around €8k/month in incremental revenue. Multiply by expected winners per year, subtract program cost, and you have the real ROI question — not the brochure version.
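The effect-size arithmetic above can be sketched as follows. The function names, the `months_live` parameter and the program cost are assumptions made for illustration. The sketch also treats winners as simply additive, which real programs only approximate.

```python
def monthly_value_per_winner(monthly_revenue: float, relative_lift: float) -> float:
    """Incremental monthly revenue from one shipped winner on the affected step."""
    return monthly_revenue * relative_lift


def annual_net_return(winners_per_year: float, monthly_value: float,
                      program_cost: float, months_live: float = 12.0) -> float:
    """Gross incremental revenue minus program cost.

    months_live is an assumed average number of months each winner is live
    within the year; winners shipped late in the year earn for fewer months.
    """
    return winners_per_year * monthly_value * months_live - program_cost


# Figures from the text: a 4% lift on a checkout doing 200k a month.
value = monthly_value_per_winner(200_000, 0.04)
print(f"Value per winner: {value:,.0f} per month")

# Hypothetical inputs: 5 winners, live 6 months on average, 45k program cost.
print(f"Net return: {annual_net_return(5, value, 45_000, months_live=6):,.0f}")
```

The result is sensitive to how long each winner is actually live and to whether the reported lift holds after launch, so treat it as an upper bound for planning.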

Beware the inflated-lift trap

Tests stopped early, at the moment a variant first crosses 95% confidence, overstate their true lift by 20-50% on average. This is a selection effect: you stop exactly when random noise is most favourable, and the estimate regresses toward the true value as more data arrives. It's why benchmark medians sit at 3-8% even though individual reported lifts often look much higher. Always pre-commit to a fixed sample size or runtime.
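A small simulation makes the trap visible. This is a sketch under stated assumptions, not a reproduction of the figures above. The traffic, baseline rate, true lift and seed are hypothetical, and daily conversions are drawn from a normal approximation to the binomial to keep it fast. It peeks once a day, records the lift the first time the z statistic crosses 1.96, and compares that with what the same test shows at full runtime.

```python
import math
import random


def z_score(conv_a, conv_b, n):
    """Two-proportion z statistic for variant B vs control A (equal arm size n)."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = math.sqrt(pooled * (1 - pooled) * 2 / n)
    return (conv_b - conv_a) / n / se if se > 0 else 0.0


def daily_conversions(rng, visitors, rate):
    """Normal approximation to a binomial draw, clipped to a valid count."""
    sd = math.sqrt(visitors * rate * (1 - rate))
    return min(visitors, max(0, round(rng.gauss(visitors * rate, sd))))


def simulate(runs=2000, days=28, visitors_per_day=1000,
             base_rate=0.05, true_lift=0.03, seed=7):
    """Compare the lift read at an early stop with the lift at full runtime."""
    rng = random.Random(seed)
    lift_at_stop, lift_at_end = [], []
    for _ in range(runs):
        conv_a = conv_b = n = 0
        stopped_lift = None
        for _day in range(days):
            n += visitors_per_day
            conv_a += daily_conversions(rng, visitors_per_day, base_rate)
            conv_b += daily_conversions(rng, visitors_per_day,
                                        base_rate * (1 + true_lift))
            # Peek once a day; keep the lift from the first threshold crossing.
            if (stopped_lift is None and conv_a > 0
                    and z_score(conv_a, conv_b, n) > 1.96):
                stopped_lift = conv_b / conv_a - 1
        if stopped_lift is not None:
            lift_at_stop.append(stopped_lift)
            lift_at_end.append(conv_b / conv_a - 1)
    return {
        "true_lift": true_lift,
        "share_crossing_threshold": len(lift_at_stop) / runs,
        "mean_lift_at_stop": sum(lift_at_stop) / len(lift_at_stop),
        "mean_lift_at_full_runtime": sum(lift_at_end) / len(lift_at_end),
    }


result = simulate()
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

With these settings the lift recorded at the stopping point sits well above both the true lift and the full-runtime reading of the same tests. The size of the gap depends heavily on traffic, baseline rate and how often you peek.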

Applying benchmarks to your roadmap

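Start by computing your traffic ceiling. On a page with 20,000 monthly visitors and a 3% baseline conversion rate, a two-variant test needs roughly 53,000 visitors per variant to detect a 10% relative lift at 80% power and 95% confidence, which is about five months of traffic. In three weeks, the same page can only reliably detect lifts of around 30%. Anything smaller needs either more traffic, a higher-funnel metric, or a longer runtime, which directly caps how many tests you can run a year.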
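As a rough check on traffic-ceiling figures like these, here is a minimal sketch of the standard two-proportion sample-size approximation. The function names and defaults are assumptions for illustration: a two-sided test, 80% power, 95% confidence, and two variants with an even traffic split. The result is sensitive to all of them, so plug in your own numbers.

```python
import math
from statistics import NormalDist


def visitors_per_variant(baseline: float, relative_lift: float,
                         alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate visitors needed per variant for a two-sided two-proportion test."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def weeks_to_run(monthly_visitors: float, per_variant: int, variants: int = 2) -> float:
    """Weeks of traffic needed, assuming all page visitors enter the test."""
    weekly_visitors = monthly_visitors * 12 / 52
    return per_variant * variants / weekly_visitors


needed = visitors_per_variant(baseline=0.03, relative_lift=0.10)
print(f"Visitors per variant: {needed:,}")
print(f"Weeks at 20,000 visitors/month: {weeks_to_run(20_000, needed):.1f}")
```

Dividing 52 by the resulting runtime gives a rough upper bound on sequential tests per year for that page.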

Then set a hypothesis bar. Mature programs reject roughly 60-70% of submitted ideas before they ever reach a test, because the expected lift wouldn't survive sample-size math. That filter is what protects velocity. Without it, you spend three weeks proving a copy tweak moved nothing.

Frequently asked

Experimentation benchmark questions

For established programs, a realistic win rate is 18-22%. Beginners typically see 10-15% as they learn the basics of hypothesis quality and statistical rigor. Anything reported above 35% over a meaningful sample of tests usually indicates peeking, underpowered tests, or counting non-significant results as wins.

Median runtimes are 14-21 days for established programs. The floor is two full business cycles (so usually 14 days minimum) to capture day-of-week effects. The ceiling is around 6 weeks, after which cookie deletion, audience drift, and external events start to muddy the data.

The sample size you need depends on your baseline conversion rate and the minimum detectable effect you care about. As a rough anchor: detecting a 10% relative lift on a 3% baseline conversion rate requires roughly 53,000 visitors per variant at 80% power and 95% confidence (two-sided test). Use a sample-size calculator with your real numbers.

Winning tests tend to underperform their reported lift after launch for two reasons. First, regression to the mean: tests that just barely cross significance tend to revert. Second, novelty effects fade as the variant stops being new. Plan for a 20-30% haircut on reported lifts when you forecast annual revenue impact from a winning test.

The win rates and effect sizes translate to lower-traffic stores, but the runtime and tests-per-year figures don't. A store doing 15,000 sessions a month can run maybe 6-10 well-powered tests a year, not 40. Either test higher up the funnel where sample sizes are bigger, or batch changes into bigger redesign tests.

A realistic effect size is a 3-8% relative lift on the primary metric for winning tests in established programs. It is smaller for mature programs (often 1-3%) and larger for early-stage programs that still have obvious UX issues to fix. Treat any forecast above 15% as a long shot, not a base case.

Aim for whatever your traffic supports, not an arbitrary target. Most established programs in this revenue band ship 20-40 tests a year. Below 10 tests a year, learning compounds too slowly. Above 50, you're usually testing things that aren't worth the sample-size cost.

Inconclusive tests count as non-wins for win-rate accounting. An inconclusive test is one that didn't beat control with enough confidence to ship, and that's not a win. Mature programs typically see 50-60% of tests come back inconclusive, 20-25% win, and 15-25% lose to control.
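The accounting rule can be sketched in a few lines. The outcome labels, function names and the sample year of 20 tests are all made up for illustration. The point is only that every test, inconclusive ones included, stays in the denominator.

```python
from collections import Counter


def outcome_shares(outcomes):
    """Share of tests per outcome, with every test in the denominator."""
    counts = Counter(outcomes)
    total = len(outcomes)
    return {label: counts[label] / total
            for label in ("win", "inconclusive", "loss")}


def inflated_win_rate(outcomes):
    """Win rate when inconclusive tests are (wrongly) dropped from the denominator."""
    decided = [o for o in outcomes if o != "inconclusive"]
    return decided.count("win") / len(decided)


# Hypothetical year of 20 tests: 4 wins, 11 inconclusive, 5 losses.
year = ["win"] * 4 + ["inconclusive"] * 11 + ["loss"] * 5

print(outcome_shares(year))
print(f"Win rate if inconclusives are dropped: {inflated_win_rate(year):.0%}")
```

Counted properly, this hypothetical program has a 20% win rate. Dropping the 11 inconclusive tests would report 4 wins out of 9, about 44%, which is the kind of number that should raise suspicion.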

Email subject-line tests typically show higher reported lifts (5-15%) because the metric is open rate on a smaller, more engaged sample. Ad creative tests on Meta or Google show similar win rates to on-site CRO but bigger effect sizes because creative variance is wider. They're not directly comparable to on-site experimentation.

Rough math: a store doing €5M in annual revenue, running 25 tests a year at a 20% win rate with 4% average lift on winners, generates roughly €100-150k in incremental annual revenue. Subtract tool and program cost (typically €30-60k all-in) and the program returns 2-4x in year one, with compounding wins after.
