Experiment Analysis

Metricuno
May 17, 2026
A four-phase framework for analysing A/B test results: segment cuts, revenue impact, and the decision logic that turns a finished test into shipped code.
Quick answer

A four-phase framework for the work between "test ended" and "decision made" — statistical interpretation, segment and device breakdowns, revenue attribution, and reporting.

Definition
Experimentation

Experiment Analysis

The structured post-test work of turning a finished A/B test into a confident ship, kill, or iterate decision.

Experiment analysis is everything that happens between the moment a test reaches its sample-size target and the moment a stakeholder commits to a decision. It covers statistical interpretation, segment and cohort breakdowns, device cuts, revenue attribution, and the written report that captures the verdict.

Done well, it protects you from two failure modes: shipping a flat test because the headline number looked green, and killing a winner because mobile dragged desktop into noise. The framework below walks through the four phases in the order an experienced tester runs them.

Also known as
Post-test analysis
Results analysis
Test readout

Most teams under-invest here. They spend three weeks designing a test, two weeks running it, and forty-five minutes reading the result before Slacking a screenshot of the dashboard. That ratio is backwards.

The decision a test informs is almost always worth more than the test itself. A bad call on a checkout variant compounds across every order for the next six months. Treat analysis as a first-class activity — not a five-minute wrap-up.

Phase 1: Statistical interpretation

Before you cut anything, confirm the headline. Did the test hit its pre-registered sample size? Is the primary metric significant at your declared alpha? Did the experiment run for full weekly cycles so weekday-versus-weekend mix is balanced across variants?
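As a rough illustration of those three headline checks, here is a minimal sketch using a pooled two-proportion z-test. The test choice, the function names, and the example counts are assumptions for illustration; your testing tool may use a different method, and the alpha and sample size should be whatever you pre-registered.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates (pooled variance)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided tail probability of a standard normal.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

def headline_checks(conv_a, n_a, conv_b, n_b, planned_n_per_arm, run_days, alpha=0.05):
    """Answer the three headline questions; all names here are hypothetical."""
    _, p_value = two_proportion_z_test(conv_a, n_a, conv_b, n_b)
    return {
        "reached_sample_size": min(n_a, n_b) >= planned_n_per_arm,
        "full_weekly_cycles": run_days > 0 and run_days % 7 == 0,
        "significant_at_alpha": p_value < alpha,
    }

# Invented example: 5.0% control vs 5.5% variant, 20,000 visitors per arm, 14 days.
checks = headline_checks(1000, 20_000, 1100, 20_000,
                         planned_n_per_arm=20_000, run_days=14)
print(checks)
```

If any of the three flags is false, the headline is not yet safe to slice.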

This phase is also where you sanity-check the plumbing: sample ratio mismatch (SRM), tracking gaps, and duplicate user assignment. Doing statistical interpretation first keeps you from spending an hour celebrating a 14% lift that turns out to be an SRM artifact.
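A sample ratio mismatch check is usually a chi-square goodness-of-fit test on the visitor counts per arm. The sketch below assumes a two-arm test with an intended 50/50 split and uses a 0.001 alert threshold, which is a common convention rather than anything this article prescribes. The function name and counts are hypothetical.

```python
import math

def srm_p_value(n_control, n_variant, expected_control_share=0.5):
    """Chi-square goodness-of-fit p-value for a two-arm traffic split."""
    total = n_control + n_variant
    expected_control = total * expected_control_share
    expected_variant = total - expected_control
    chi2 = ((n_control - expected_control) ** 2 / expected_control
            + (n_variant - expected_variant) ** 2 / expected_variant)
    # Two arms give 1 degree of freedom, where the chi-square tail
    # probability reduces to erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))

SRM_THRESHOLD = 0.001  # assumed alert level, stricter than the test's alpha

for control, variant in [(10_000, 10_100), (50_000, 48_500)]:
    p = srm_p_value(control, variant)
    status = "possible SRM, check assignment" if p < SRM_THRESHOLD else "split looks fine"
    print(control, variant, status)
```

A strict threshold is used because an SRM alert should mean "stop and debug", not "interesting".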

Phase 2: Segment, cohort, and device cuts

Once the headline holds up, slice it. Segment analysis splits the result by traffic source, new versus returning visitors, geography, and landing context. Cohort analysis groups visitors by signup or first-purchase week to see whether the lift is durable or front-loaded. Device analysis separates mobile, tablet, and desktop, and is often the single most decision-changing cut on a Shopify storefront.

The goal isn't to find a sub-segment that confirms what you wanted. It's to find heterogeneity worth acting on. A +3% blended lift hiding a +9% mobile win and a -2% desktop loss is a completely different ship decision than a uniform +3%.
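To see how a blended number can hide opposite moves, here is a small sketch. The visitor and conversion counts are invented so the per-device lifts land on +9% and -2%; nothing here comes from a real test.

```python
def relative_lift(control, variant):
    """Relative change in conversion rate, variant vs control."""
    cr_control = control["conversions"] / control["visitors"]
    cr_variant = variant["conversions"] / variant["visitors"]
    return cr_variant / cr_control - 1

# Hypothetical per-device results.
results = {
    "mobile": {
        "control": {"visitors": 10_000, "conversions": 200},
        "variant": {"visitors": 10_000, "conversions": 218},
    },
    "desktop": {
        "control": {"visitors": 10_000, "conversions": 250},
        "variant": {"visitors": 10_000, "conversions": 245},
    },
}

def pooled(arm):
    """Sum visitors and conversions for one arm across all segments."""
    return {
        "visitors": sum(seg[arm]["visitors"] for seg in results.values()),
        "conversions": sum(seg[arm]["conversions"] for seg in results.values()),
    }

lifts = {name: relative_lift(seg["control"], seg["variant"])
         for name, seg in results.items()}
lifts["blended"] = relative_lift(pooled("control"), pooled("variant"))

for name, lift in lifts.items():
    print(f"{name}: {lift:+.1%}")
```

The blended figure comes out just under +3%, even though no single device moved by that amount.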

Don't go fishing

Slicing a flat test into twelve segments until one shows p<0.05 is not analysis — it's a multiple-comparisons problem. Pre-register the two or three cuts that matter for your hypothesis, and treat anything else as exploratory signal for the next test, not a basis to ship.
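One way to keep pre-registered cuts honest is to correct their p-values for the number of comparisons. The sketch below uses the Holm step-down procedure, which is a choice made here for illustration; the article does not name a correction method. Segment names and p-values are invented.

```python
def holm_decisions(p_values, alpha=0.05):
    """Holm step-down procedure: map each cut name to True if it survives."""
    ordered = sorted(p_values.items(), key=lambda item: item[1])
    m = len(ordered)
    decisions = {}
    still_rejecting = True
    for rank, (name, p) in enumerate(ordered):
        # The smallest p-value faces alpha/m, the next alpha/(m-1), and so on.
        threshold = alpha / (m - rank)
        if still_rejecting and p <= threshold:
            decisions[name] = True
        else:
            # Once one cut fails, every larger p-value fails too.
            still_rejecting = False
            decisions[name] = False
    return decisions

# Three pre-registered cuts (hypothetical p-values).
pre_registered = {"mobile": 0.004, "new_visitors": 0.02, "desktop": 0.30}
print(holm_decisions(pre_registered))

# Twelve post-hoc slices where one happens to show p = 0.04.
fishing = {f"segment_{i}": 0.2 + 0.05 * i for i in range(11)}
fishing["lucky_slice"] = 0.04
print(any(holm_decisions(fishing).values()))
```

With twelve slices, the smallest p-value has to clear 0.05 / 12, so the lone p = 0.04 does not count as a finding.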

Phase 3: Revenue impact and the decision

Conversion lift is a proxy. The decision needs revenue impact — what does this variant do to revenue per visitor, average order value, and refund rate together? An apparel test that lifts checkout completion 5% but drops AOV 8% because it strips a cross-sell is a loser dressed as a winner.
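The apparel example can be worked through by treating net revenue per visitor as conversion rate times average order value times the share of revenue not refunded. That multiplicative model and the function below are simplifying assumptions for illustration, not a formula from this article.

```python
def net_rpv_change(conversion_lift, aov_change,
                   refund_rate_control=0.0, refund_rate_variant=0.0):
    """Relative change in net revenue per visitor.

    Assumes net RPV = conversion rate * AOV * (1 - refund rate), with the
    refund rate expressed as a share of revenue refunded.
    """
    kept_control = 1 - refund_rate_control
    kept_variant = 1 - refund_rate_variant
    return (1 + conversion_lift) * (1 + aov_change) * (kept_variant / kept_control) - 1

# Checkout completion +5%, AOV -8%, refunds unchanged.
change = net_rpv_change(conversion_lift=0.05, aov_change=-0.08)
print(f"{change:+.1%}")
```

Under these assumptions the "winning" variant loses about 3.4% of revenue per visitor, since 1.05 × 0.92 = 0.966.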

Project the annualised impact using your actual traffic, then write the verdict in one sentence: ship, kill, or iterate. Iterate is the underused option — most tests neither win cleanly nor lose cleanly, and the right next move is usually a sharper variant, not a binary call.
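A projection along these lines can be sketched as follows. It assumes the measured lift holds for a full year with no novelty decay or seasonality, which is optimistic, so it carries the confidence interval through rather than quoting a single number. The verdict rule is one hypothetical convention, and all figures are invented.

```python
def annualised_impact(annual_visitors, baseline_rpv, rpv_lift):
    """Extra revenue per year if the RPV lift persisted unchanged."""
    return annual_visitors * baseline_rpv * rpv_lift

def project(annual_visitors, baseline_rpv, lift_point, lift_ci):
    """Project the point estimate and both ends of the lift's confidence interval."""
    low, high = lift_ci
    return {
        "low": annualised_impact(annual_visitors, baseline_rpv, low),
        "point": annualised_impact(annual_visitors, baseline_rpv, lift_point),
        "high": annualised_impact(annual_visitors, baseline_rpv, high),
    }

def verdict(projection):
    """Hypothetical rule: ship or kill only when the whole interval agrees."""
    if projection["low"] > 0:
        return "ship"
    if projection["high"] < 0:
        return "kill"
    return "iterate"

# Invented store: 1.2M visitors a year, $2.50 RPV, +3% lift with a CI of +0.5% to +5.5%.
projection = project(1_200_000, 2.50, lift_point=0.03, lift_ci=(0.005, 0.055))
print(projection, verdict(projection))
```

An interval that straddles zero falls through to "iterate", which matches how most tests actually end.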

Chart

Same test, different story per segment (conversion lift %)

[Bar chart: conversion lift (axis from -2% to 10%) by segment. Segments shown: Blended (headline), Mobile, Desktop, New visitors, Returning, Paid social, Organic.]
Frequently asked

Experiment analysis FAQ

Budget roughly 10-20% of the test's run time. A two-week test deserves one to two days of analysis and write-up. Anything less and you're approving ship decisions on autopilot; anything more and you're probably re-running the analysis hoping for a different answer.

After. Confirm the primary metric on the full population first, then cut. Looking at segments first invites cherry-picking — you'll find a slice that lifted in any test if you try hard enough. Pre-register the two or three cuts that matter to your hypothesis.

Segment analysis slices users by an attribute they have at test time — device, traffic source, geography. Cohort analysis groups users by when they entered (signup week, first-purchase week) and tracks them forward, which is how you tell if a lift is durable or just a novelty bump.
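A simplified cohort cut might look like the sketch below: group users by the week they entered, then compare lift week by week. It only covers the grouping step; a fuller version would also follow each cohort's later purchases. The row format and function names are assumptions.

```python
from collections import defaultdict
from datetime import date, timedelta

def week_start(d):
    """Monday of the week containing date d."""
    return d - timedelta(days=d.weekday())

def lift_by_cohort(rows):
    """rows: (entry_date, arm, converted) tuples, arm is 'control' or 'variant'."""
    counts = defaultdict(lambda: {"control": [0, 0], "variant": [0, 0]})
    for entry_date, arm, converted in rows:
        cell = counts[week_start(entry_date)][arm]
        cell[0] += 1               # visitors
        cell[1] += int(converted)  # conversions
    lifts = {}
    for week, arms in sorted(counts.items()):
        (n_c, x_c), (n_v, x_v) = arms["control"], arms["variant"]
        if n_c == 0 or n_v == 0 or x_c == 0:
            continue  # lift is undefined without a control baseline
        lifts[week] = (x_v / n_v) / (x_c / n_c) - 1
    return lifts

def make_rows(entry_date, arm, visitors, conversions):
    """Build invented rows: the first `conversions` visitors convert."""
    return [(entry_date, arm, i < conversions) for i in range(visitors)]

rows = (
    make_rows(date(2026, 5, 6), "control", 100, 10)
    + make_rows(date(2026, 5, 6), "variant", 100, 13)
    + make_rows(date(2026, 5, 13), "control", 100, 10)
    + make_rows(date(2026, 5, 13), "variant", 100, 10)
)
for week, lift in lift_by_cohort(rows).items():
    print(week, f"{lift:+.0%}")
```

A lift that is large in the first cohort and gone in the second is the front-loaded, novelty-style pattern.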

Sometimes, but only if mobile was a pre-registered cut and the segment-level result is itself significant. A 'mobile-only ship' on a post-hoc slice is just a fancy way of cherry-picking. If you genuinely believe device matters, run a mobile-only follow-up to confirm.

Use revenue per visitor (RPV) as the bridge metric — it captures both moves in one number. Then decompose: how much of the RPV change came from conversion lift versus AOV shift? If they're working against each other, the variant probably needs another iteration before ship.
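Because RPV is conversion rate times AOV, the log of the RPV ratio splits exactly into a conversion part and an AOV part. The sketch below uses that log decomposition, which is one of several ways to attribute the change and is not prescribed by this article. The counts are invented.

```python
import math

def decompose_rpv(control, variant):
    """Split the log change in RPV into conversion and AOV contributions."""
    def rates(arm):
        conversion_rate = arm["orders"] / arm["visitors"]
        aov = arm["revenue"] / arm["orders"]
        return conversion_rate, aov

    cr_c, aov_c = rates(control)
    cr_v, aov_v = rates(variant)
    conversion_part = math.log(cr_v / cr_c)
    aov_part = math.log(aov_v / aov_c)
    return {
        "conversion": conversion_part,
        "aov": aov_part,
        "rpv": conversion_part + aov_part,  # equals log(RPV_variant / RPV_control)
    }

# Hypothetical arms: conversion up 5%, AOV down from 50 to 46.
control = {"visitors": 10_000, "orders": 400, "revenue": 20_000}
variant = {"visitors": 10_000, "orders": 420, "revenue": 19_320}

parts = decompose_rpv(control, variant)
working_against = parts["conversion"] * parts["aov"] < 0
print(parts, "opposing moves" if working_against else "same direction")
```

Opposite signs on the two parts are the signal that conversion and AOV are working against each other.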

Hypothesis, variant description, sample size and run dates, primary metric result with confidence interval, the two or three pre-registered segment cuts, revenue impact projection, and a one-sentence decision (ship / kill / iterate) with the reasoning. Anything else is optional.

Yes, unless you used a sequential testing method designed for early stopping. Peeking at fixed-horizon tests inflates false-positive rates dramatically — what looks significant on day four often regresses to flat by day fourteen.
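The inflation from peeking is easy to reproduce with a small A/A simulation. The sketch below is a toy model: each day adds standard normal noise to a running z-statistic and there is no true effect. The number of looks, the simulation count, and the seed are arbitrary choices made here.

```python
import random

def simulate_false_positive_rates(n_sims=2000, looks=14, z_crit=1.96, seed=7):
    """Compare false-positive rates: one final look vs a look every day."""
    rng = random.Random(seed)
    fixed_hits = 0
    peeking_hits = 0
    for _ in range(n_sims):
        running_sum = 0.0
        crossed_at_any_look = False
        z = 0.0
        for look in range(1, looks + 1):
            # Under the null, each day's standardised effect is pure noise.
            running_sum += rng.gauss(0.0, 1.0)
            z = running_sum / look ** 0.5
            if abs(z) > z_crit:
                crossed_at_any_look = True
        fixed_hits += abs(z) > z_crit        # only the final look counts
        peeking_hits += crossed_at_any_look  # stopping at the first "win"
    return fixed_hits / n_sims, peeking_hits / n_sims

fixed_rate, peeking_rate = simulate_false_positive_rates()
print(f"fixed horizon: {fixed_rate:.1%}, daily peeking: {peeking_rate:.1%}")
```

The fixed-horizon rate should sit near the nominal 5%, while the daily-peeking rate comes out several times higher.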

Trust the quantitative result for the ship decision, but treat the contradiction as a signal worth investigating. Often the qualitative finding was real but applied to a different segment than the test exposed. Run a follow-up that targets the segment your research described.

Always. Losing tests are roughly two-thirds of any honest experimentation program, and the patterns across losses are often more instructive than the wins. Teams that only celebrate winners end up with a graveyard of unanalysed losses and the same mistakes recurring.

We auto-run pre-registered segment, device, and cohort cuts the moment a test reaches its sample size, flag sample ratio mismatch, and project annualised revenue impact using your actual store traffic. The output is a one-page readout your team can ship from, not a dashboard you still have to interpret.
