How to Avoid A/B Testing Mistakes

Most A/B test "wins" don't survive a second look. Here are the recurring mistakes that invalidate experiments — and the disciplines that protect your roadmap from false positives.
A/B Testing Mistakes
The recurring methodological errors — peeking, early stopping, underpowering, segment confounds — that produce false A/B test results.
A/B testing mistakes are the procedural and statistical errors that cause an experiment to report a win or a loss that wouldn't survive replication. Most aren't catastrophic on their own — a single peek, one underpowered cell, a forgotten holiday weekend — but they compound across a roadmap until the test programme is generating noise dressed up as insight.
The damage is rarely visible at the test level. It shows up months later when shipped winners don't move the quarterly number, when segment-level rollouts contradict the original read-out, or when your win rate looks suspiciously high for an industry where 1 in 7 tests typically wins. Avoiding these mistakes is what separates a programme that compounds from one that churns.
Bad tests cost more than bad ideas. A failed experiment teaches you something; an invalid experiment teaches you the wrong thing and gets shipped. The shipped version then becomes the new baseline that future tests are measured against, so the error propagates.
The mistakes below cluster into four families: statistical errors (how you count), design errors (how you set up the test), interpretation errors (what you conclude), and process errors (what happens around the test). Most teams have weak links in two or three of these — fix those first.
Statistical mistakes that inflate false positives
Peeking is the most common and most expensive mistake in A/B testing. You check the dashboard at day 3, see variant B is up 12% with p = 0.04, and call the test. The problem: classical significance tests assume you look exactly once, at a pre-declared sample size. Look five times during a test and your nominal 5% false-positive rate balloons toward 14%.
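The inflation from repeated looks can be checked with a small simulation. The sketch below is an illustration under simplifying assumptions: an A/A test with no true effect, equally spaced looks, and an unadjusted two-sided z-test at each look. The function name and its parameters are invented for this example.

```python
import random
from statistics import NormalDist


def peeking_false_positive_rate(looks, alpha=0.05, trials=20_000, seed=0):
    """Estimate how often an A/A test is declared significant at ANY of
    `looks` equally spaced interim checks, each using an unadjusted
    two-sided z-test at level `alpha`."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    false_positives = 0
    for _ in range(trials):
        running_sum = 0.0
        for k in range(1, looks + 1):
            # Each new batch of data adds one standard-normal increment,
            # so the z-statistic at look k is running_sum / sqrt(k).
            running_sum += rng.gauss(0.0, 1.0)
            if abs(running_sum) / k ** 0.5 > z_crit:
                false_positives += 1  # stopped early on a "win" that is pure noise
                break
    return false_positives / trials


for looks in (1, 5, 14):
    print(looks, "looks ->", round(peeking_false_positive_rate(looks), 3))
```

With a single look the estimate should sit near the nominal 5%; it climbs as the number of looks grows.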
Stopping early on a positive result is peeking with a decision attached. Many early winners are random noise that would regress toward zero, or even to a loss, given another two weeks. If you need to monitor in-flight, use a method designed for continuous looking, such as mSPRT-based always-valid p-values or a Bayesian approach with a pre-declared decision rule.
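As one illustration of a method built for continuous looking, here is a minimal mSPRT sketch for a stream of normally distributed observations. It assumes a known standard deviation `sigma` and a Normal(0, tau^2) mixing prior over the true effect; both are simplifications, and `tau` is a tuning choice rather than anything prescribed above. Conversion data would need a variance estimate plugged in.

```python
import math
import random


def always_valid_p_values(diffs, sigma, tau=1.0):
    """Running always-valid p-values from a mixture SPRT (mSPRT).

    diffs : observations assumed ~ Normal(theta, sigma^2), with
            H0: theta = 0 (for example treatment-minus-control differences).
    tau   : std. dev. of the Normal(0, tau^2) mixing prior (assumed).
    """
    p_values = []
    running_sum, p = 0.0, 1.0
    for n, d in enumerate(diffs, start=1):
        running_sum += d
        denom = sigma ** 2 + n * tau ** 2
        # Log of the mixture likelihood ratio after n observations.
        log_lr = (0.5 * math.log(sigma ** 2 / denom)
                  + running_sum ** 2 * tau ** 2 / (2 * sigma ** 2 * denom))
        # The p-value never increases, so checking after every
        # observation does not inflate the false-positive rate.
        p = min(p, math.exp(-log_lr))
        p_values.append(p)
    return p_values


# Hypothetical stream with a true effect of 0.5 standard deviations.
rng = random.Random(42)
stream = [rng.gauss(0.5, 1.0) for _ in range(300)]
print("final p-value:", always_valid_p_values(stream, sigma=1.0)[-1])
```

The price of safe monitoring is some loss of power compared with a fixed-horizon test at the same sample size, which is why pre-committing to a single look remains the simpler option when monitoring is not needed.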
Underpowering is the silent killer. A test on 8,000 sessions trying to detect a 5% relative lift on a 2.5% conversion rate has well under 20% power (in the single digits under a standard two-sided, two-proportion approximation), meaning you'll miss at least four out of five real wins. Teams then conclude "nothing works on our site" when the truth is they can't reliably see anything short of a very large lift.
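The power of a planned test can be estimated before launch. The sketch below uses a two-sided z-test with a normal approximation and assumes the 8,000 sessions are split evenly across two arms, which the example above does not state. Under those assumptions the result lands in the single digits, no higher than the figure quoted above; a one-sided test or a different split shifts the number, but it stays low.

```python
from statistics import NormalDist


def power_two_proportions(baseline, relative_lift, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two conversion rates."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    # Standard error of the difference in conversion rates.
    se = ((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_arm) ** 0.5
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    z_effect = (p2 - p1) / se
    # Probability the test statistic falls in either rejection region.
    return nd.cdf(z_effect - z_crit) + nd.cdf(-z_effect - z_crit)


# 8,000 sessions split evenly (assumed), 2.5% baseline, 5% relative lift.
print(round(power_two_proportions(0.025, 0.05, n_per_arm=4_000), 3))
```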
The peeking penalty is bigger than it feels
If you check a test daily for two weeks and call it whenever p < 0.05, your real false-positive rate is somewhere between 20% and 30%, not 5%. For a change with no real effect, that is a one-in-five to one-in-three chance of shipping noise as a "win". Pre-commit to a sample size and a stop date, or switch to a sequential method built for monitoring.
Design mistakes that confound the result
Sample ratio mismatch (SRM) is the canary in the experimentation coal mine. You set up a 50/50 split but the data shows 51.8/48.2 with millions of sessions — a vanishingly unlikely random outcome. SRM almost always points to a bug: bot filtering hitting one variant, a redirect breaking on mobile Safari, or a caching layer serving stale assignments.
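An SRM check is a chi-square goodness-of-fit test on the assignment counts. The sketch below is one way to write it with the standard library only; the 0.001 alert threshold is a commonly used convention and an assumption here, not a value taken from the text above.

```python
import math


def srm_p_value(control_n, variant_n, expected_control_share=0.5):
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for the
    observed split versus the intended assignment ratio."""
    total = control_n + variant_n
    expected_control = total * expected_control_share
    expected_variant = total - expected_control
    chi2 = ((control_n - expected_control) ** 2 / expected_control
            + (variant_n - expected_variant) ** 2 / expected_variant)
    # For 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


def has_srm(control_n, variant_n, threshold=0.001):
    """Flag the test when the split is too unlikely to be random."""
    return srm_p_value(control_n, variant_n) < threshold


# 51.8/48.2 on two million sessions (illustrative counts).
print(has_srm(1_036_000, 964_000))
# 50.03/49.97 on one million sessions: ordinary random variation.
print(has_srm(500_300, 499_700))
```

The first call flags a mismatch and the second does not, which is why the same percentage gap matters far more at high traffic than at low traffic.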
Testing during atypical windows — Black Friday, a viral TikTok, a Klaviyo flow change — confounds variant with traffic mix. The variant that "won" during a paid Meta push may simply be the one that converted the high-intent retargeting audience better, which tells you nothing about steady-state behaviour.
[Figure: False-positive rate when you peek and stop early, comparing a fixed-horizon test (no peeking) with peeking and stopping at p < 0.05.]
Running too many variants without correction is the other design trap. A 4-arm test (control + 3 variants) at α = 0.05 gives you three chances to find a false positive, so the family-wise error rate creeps toward 14%. Apply a Bonferroni or Holm correction, or pre-declare one primary comparison and treat the rest as exploratory.
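The 14% figure and the two corrections can be reproduced in a few lines. This sketch assumes independent comparisons for the family-wise error rate; the function names are illustrative.

```python
def family_wise_error_rate(alpha, comparisons):
    """Chance of at least one false positive across independent comparisons."""
    return 1 - (1 - alpha) ** comparisons


def bonferroni_reject(p_values, alpha=0.05):
    """Reject only p-values below alpha divided by the number of comparisons."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm_reject(p_values, alpha=0.05):
    """Holm step-down: test p-values from smallest to largest against
    alpha/m, alpha/(m-1), ..., and stop at the first failure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break
    return reject


print(round(family_wise_error_rate(0.05, 3), 4))  # three variants vs control
p_values = [0.01, 0.02, 0.03]  # hypothetical per-variant p-values
print(bonferroni_reject(p_values))
print(holm_reject(p_values))
```

On the hypothetical p-values, Bonferroni keeps only the first variant while Holm keeps all three, which illustrates why Holm is described as slightly more powerful at the same error guarantee.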
Interpretation mistakes after the test ends
Segment slicing post-hoc is one of the most seductive mistakes. You ran a flat test, it came back null, and someone starts cutting the data: mobile, returning visitors, paid traffic, US visitors. Eventually one slice shows p < 0.05 and gets shipped. With 20 segments, you'd expect one to look significant by chance — that's not insight, that's roulette.
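The roulette comparison can be made concrete with a quick simulation. The sketch assumes 20 independent segments and no true effect anywhere, so each segment's p-value is uniform on [0, 1]; real segments overlap and are correlated, which changes the exact figure but not the conclusion.

```python
import random


def share_with_a_significant_segment(segments=20, alpha=0.05,
                                     trials=20_000, seed=0):
    """Fraction of null tests in which at least one segment has p < alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Under the null, each segment's p-value is uniform on [0, 1].
        if any(rng.random() < alpha for _ in range(segments)):
            hits += 1
    return hits / trials


simulated = share_with_a_significant_segment()
analytic = 1 - (1 - 0.05) ** 20  # closed form for independent segments
print(round(simulated, 3), round(analytic, 3))
```

Both numbers come out near 64%: a null test sliced 20 ways shows at least one "significant" segment roughly two times out of three.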
Novelty effects bias short tests on returning-visitor-heavy stores. A new checkout layout might lift conversion 8% in week one as repeat customers click around in curiosity, then settle back to baseline by week three. Run any test that touches familiar surfaces for at least two full business cycles, and inspect the lift trajectory rather than just the endpoint.
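One simple way to inspect the lift trajectory is to fit a straight line to the daily lift and look at the sign of the slope. The sketch below uses ordinary least squares on two synthetic series that were made up for illustration. A real analysis would also account for day-of-week effects and noise in the daily estimates.

```python
def lift_trend_per_day(daily_lift):
    """Least-squares slope of daily relative lift against day index."""
    n = len(daily_lift)
    mean_x = (n - 1) / 2
    mean_y = sum(daily_lift) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(daily_lift))
    return sxy / sxx


# Synthetic examples (not real data): a lift that starts at 8% and decays,
# and a lift that holds steady at 3% across a 21-day test.
novelty_like = [0.08 * 0.85 ** day for day in range(21)]
stable_like = [0.03] * 21

print(lift_trend_per_day(novelty_like))  # negative slope: lift is fading
print(lift_trend_per_day(stable_like))   # slope near zero: lift is holding
```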
Typical sample-size shortfalls in DTC A/B testing (5% relative MDE, 80% power)
| Monthly sessions | Baseline CVR | Sessions needed per variant | Weeks to power |
|---|---|---|---|
| 50,000 | 1.8% | ~178,000 | 14-16 |
| 150,000 | 2.5% | ~125,000 | 7-8 |
| 400,000 | 3.2% | ~96,000 | 2-3 |
| 1,000,000 | 3.5% | ~87,000 | <1 |
The table above is why most stores below 150k monthly sessions should test bolder changes — pricing, hero proposition, full template swaps — rather than button colours. The detectable effect at your traffic level is the real ceiling on what you can learn, and that ceiling moves down as your baseline conversion rate climbs.
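The relationship between baseline rate, detectable effect, and required traffic can be sketched with the textbook two-proportion formula. The code assumes a two-sided test at α = 0.05, 80% power, and an even two-arm split. Under those assumptions it can return noticeably larger per-variant figures than the table, which may rest on different assumptions, so treat both as rough planning numbers. The function names are illustrative.

```python
import math
from statistics import NormalDist


def sessions_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Approximate sessions needed in each arm to detect a relative lift."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def weeks_to_power(n_per_variant, monthly_sessions, arms=2):
    """Weeks of traffic needed if every session enters the test."""
    weekly_sessions = monthly_sessions * 12 / 52
    return arms * n_per_variant / weekly_sessions


for mde in (0.05, 0.10, 0.20):
    n = sessions_per_variant(0.025, mde)
    print(f"MDE {mde:.0%}: {n:,} per variant, "
          f"{weeks_to_power(n, 150_000):.1f} weeks at 150k sessions/month")
```

Doubling the detectable effect cuts the required sample by roughly a factor of four, which is the arithmetic behind testing bolder changes on lower-traffic stores.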
Process mistakes around the test
Not pre-registering the hypothesis, primary metric, and stop rule is how teams end up rationalising whatever the data shows. Write the test brief before launch: what you're testing, why, the one metric you'll judge it on, the minimum detectable effect, and the date the test stops. Anything beyond that is exploratory.
Ignoring guardrail metrics ships winners that quietly hurt the business. A checkout test might lift completion rate 6% by stripping the upsell module while cutting AOV by 4%, which leaves revenue per visitor close to flat (1.06 × 0.96 ≈ 1.02), and a somewhat larger AOV drop would push it negative. Every primary-metric test should report at least three guardrails: revenue per visitor, AOV, and a downstream quality metric like 30-day return rate.
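A guardrail check can be as simple as comparing each metric's relative change against a pre-set tolerance. The sketch below is a hypothetical illustration: the metric names, the thresholds, and the convention that negative numbers mean "worse" are all assumptions (a metric where higher is worse, such as return rate, would have its sign flipped before being passed in).

```python
def revenue_per_visitor_change(cvr_change, aov_change):
    """Relative change in revenue per visitor implied by relative
    changes in conversion rate and average order value."""
    return (1 + cvr_change) * (1 + aov_change) - 1


def failed_guardrails(observed, max_regression):
    """Return the metrics whose relative change is worse than allowed.

    observed / max_regression: dicts of metric name -> relative change,
    where negative values mean the metric got worse.
    """
    return [name for name, change in observed.items()
            if change < max_regression[name]]


rpv = revenue_per_visitor_change(cvr_change=0.06, aov_change=-0.04)
observed = {"revenue_per_visitor": rpv, "aov": -0.04}
tolerances = {"revenue_per_visitor": -0.01, "aov": -0.02}  # hypothetical limits

print(round(rpv, 4))
print(failed_guardrails(observed, tolerances))
```

In this example the conversion win survives on revenue per visitor but trips the AOV guardrail, which is the kind of trade-off a primary-metric read-out alone would hide.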
The pre-test checklist that prevents 80% of these mistakes
Before launch: (1) state the hypothesis and primary metric in writing, (2) calculate sample size for your MDE and pre-commit to it, (3) set the stop date, (4) list your guardrail metrics (at least three, per the section above), (5) confirm SRM detection is on. Tests that pass this gate fail far less often in replication.
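The checklist above can be encoded as a small launch gate. This is a hypothetical sketch: the `ExperimentBrief` class, its field names, and the example values are invented for illustration, and the minimum number of guardrails is left as a parameter to set to whatever your checklist requires.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ExperimentBrief:
    hypothesis: str
    primary_metric: str
    relative_mde: float            # e.g. 0.05 for a 5% relative lift
    sample_size_per_variant: int   # pre-committed, from a power calculation
    stop_date: date
    guardrails: list = field(default_factory=list)
    srm_check_enabled: bool = False

    def launch_blockers(self, today, min_guardrails):
        """Return the checklist items that are still missing."""
        blockers = []
        if not self.hypothesis.strip():
            blockers.append("hypothesis not written down")
        if not self.primary_metric.strip():
            blockers.append("no primary metric")
        if self.relative_mde <= 0 or self.sample_size_per_variant <= 0:
            blockers.append("sample size not calculated for an MDE")
        if self.stop_date <= today:
            blockers.append("stop date is not in the future")
        if len(self.guardrails) < min_guardrails:
            blockers.append("too few guardrail metrics")
        if not self.srm_check_enabled:
            blockers.append("SRM detection is off")
        return blockers


brief = ExperimentBrief(
    hypothesis="A shorter checkout form raises completion rate",
    primary_metric="checkout_completion_rate",
    relative_mde=0.10,
    sample_size_per_variant=60_000,
    stop_date=date(2030, 1, 31),
    guardrails=["revenue_per_visitor", "aov"],
)
print(brief.launch_blockers(today=date(2030, 1, 10), min_guardrails=3))
```

The example brief is blocked on two items: it lists fewer guardrails than required and has not switched SRM detection on.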
Frequently asked questions
What is the most common A/B testing mistake?
Stopping tests early when you see a positive result. It feels like efficiency but inflates your false-positive rate from the nominal 5% to 20-30% depending on how often you check. Most teams' reported win rates are noticeably higher than the industry baseline of 12-15%; that gap is usually peeking, not skill.
How long should an A/B test run?
Long enough to (a) hit the pre-calculated sample size, (b) cover at least one full business cycle (typically two weeks to capture both weekdays and weekends), and (c) outlast any novelty effect on returning visitors. For most online stores that's a 14-21 day minimum, regardless of how fast significance appears.
Can I stop a test after three days if it has already reached significance?
Usually no. Three-day tests miss weekday/weekend mix, are vulnerable to a single traffic spike (paid push, influencer post, press hit) skewing the result, and almost always involve peeking. Let it run the full pre-declared window even if the dashboard is already green.
What is sample ratio mismatch (SRM)?
SRM is when your actual traffic split deviates from the assignment ratio more than chance would explain, e.g. 51.5/48.5 on a million sessions when you set 50/50. It's almost always a bug: redirect issues, bot filtering hitting one arm, or a caching layer serving stale assignments. An SRM-flagged test is uninterpretable; fix the cause and re-run rather than analysing the dirty data.
Should I analyse segments after a test ends?
Only as exploratory analysis to generate next-test hypotheses, never as the basis for shipping. With 10-20 segments, one will often show p < 0.05 by pure chance. If a segment looks interesting, design a follow-up test that pre-registers that segment as the primary audience.
How do I avoid running underpowered tests?
Calculate sample size before launch using your baseline conversion rate, your minimum detectable effect (typically 5-10% relative), 80% power, and α = 0.05. If the required sample exceeds 8 weeks of traffic, you have three options: test a bolder change (bigger MDE), pool variants, or accept that your store can't detect small effects and prioritise larger-impact ideas.
Do I need to correct for multiple comparisons?
Yes, in two situations: when you have more than two variants (3+ arm tests inflate family-wise error), and when you analyse multiple primary metrics. Use Bonferroni for simplicity, Holm for slightly more power, or pre-declare one primary metric and treat the rest as descriptive.
Which guardrail metrics should I track?
At minimum: revenue per visitor (catches AOV erosion), average order value (catches discount-driven false wins), and a quality metric like 30-day return rate or refund rate. For checkout tests add page-load time; for acquisition-page tests add bounce rate. Guardrails should auto-fail the test if they regress beyond a pre-set threshold.
How do I tell a novelty effect from a real lift?
A real lift is roughly stable across the test window; a novelty effect is large in week one and decays toward zero by week three as returning visitors habituate to the new design. Plot daily conversion lift over the test duration. If the trend slopes down, you're likely looking at novelty, not a winner.
How does the platform help prevent these mistakes?
The platform pre-calculates sample size and stop dates at test creation, runs SRM detection automatically, supports always-valid sequential analysis so monitoring doesn't penalise you, and auto-reports guardrail metrics alongside the primary KPI. The historical GA4 import means you can power your sample-size calculation against your real baseline conversion rate from day one rather than guessing.