How to use Experiment Reporting

Metricuno
May 17, 2026
How to report A/B test results so stakeholders trust them: readout structure, segment cuts, recommendations, and cadences that build experimentation maturity.
Quick answer

Experiment reporting is how test results become organisational decisions. This guide covers readout structure, segment cuts, recommendations, and the cadence that builds trust.

Definition
Experimentation

Experiment Reporting

Communicating A/B test results to stakeholders — variant visuals, lift numbers, segment cuts, and a clear recommendation.

Experiment reporting is the last mile of experiment analysis: turning a statistical result into a decision other people can act on. A complete readout shows what was tested and why, what happened to the primary metric, how the result holds up across segments and guardrails, and what you recommend doing next — ship, kill, iterate, or re-run.

Done consistently, reporting compounds. Each readout teaches the organisation a little more about what works on your store, which builds the appetite to test more. Done badly (buried decks, cherry-picked segments, p-values without context), it quietly costs experimentation its credibility, which is much harder to win back than to keep.

Also known as
test readout
experiment readout
results communication

Reporting sits downstream of experiment analysis but it is not the same job. Analysis answers "is this result real?" Reporting answers "what should we do, and why should you trust me?" The first is a statistics problem; the second is a communication problem.

The audience for a readout is almost never another analyst. It is the head of e-commerce, the brand team, the developer who built the variant, the founder who wants to know if the homepage test worked. Each of those people needs a different level of depth, but all of them need the same headline: what changed, by how much, and what you want to do about it.

Anatomy of a readout that gets believed

Open with the recommendation. Most readouts bury it on slide nine after a wall of methodology. Flip that: lead with "Ship variant B — +6.2% on add-to-cart, holds across mobile and desktop, no checkout-error regression." Anyone who needs more reads on; anyone who trusts you can leave the meeting.

Then show the variants side by side. Screenshots, not descriptions. Half the value of a readout is letting the brand team see exactly what shipped to customers — they will spot details (a missing trust badge, a font weight change) that the analyst missed, and that scrutiny is the point.

Only then come the numbers: primary metric with confidence interval, sample size, duration, and the guardrails you watched (revenue per visitor, return rate, page-load time). Confidence intervals beat raw p-values for non-statistical audiences — "+6.2% with a range of +2% to +10%" tells a merchandiser more than "p=0.03".
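As a rough illustration of the "lift with a range" headline, the sketch below computes a relative lift and an approximate 95% interval from raw counts. It assumes a simple two-arm test with independent visitors and uses a normal approximation on the log of the rate ratio, which is one common choice rather than the only valid one. The function names and the example counts are hypothetical.

```python
import math
from statistics import NormalDist


def relative_lift_ci(conv_a, n_a, conv_b, n_b, level=0.95):
    """Relative lift of variant B over control A, with an approximate interval.

    Uses a normal approximation on log(p_b / p_a) (delta method).
    Assumes both arms have at least one conversion.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    ratio = p_b / p_a
    # Standard error of the log rate ratio for two independent binomial samples.
    se_log = math.sqrt((1 - p_a) / conv_a + (1 - p_b) / conv_b)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    low = math.exp(math.log(ratio) - z * se_log)
    high = math.exp(math.log(ratio) + z * se_log)
    return ratio - 1, low - 1, high - 1


def headline(metric, lift, low, high):
    """Format the result the way a non-statistical reader can use it."""
    return f"{metric}: {lift:+.1%} (95% range {low:+.1%} to {high:+.1%})"


# Hypothetical counts: 20,000 visitors per arm.
lift, low, high = relative_lift_ci(conv_a=2000, n_a=20000, conv_b=2124, n_b=20000)
print(headline("Add-to-cart", lift, low, high))
```

The interval is reported on the same relative scale as the lift, so the headline and the range can be read together without converting units.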

The single most damaging reporting mistake

The mistake is reporting a winner on conversion rate (CVR) while ignoring revenue per visitor (RPV). A test can lift CVR by pushing customers toward bundles with a lower average order value (AOV) and still lose you money. Always report the revenue guardrail next to the primary metric, even when the primary metric is the one you optimised for.
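The arithmetic behind this is worth seeing once. Revenue per visitor is conversion rate multiplied by average order value, so a conversion lift can be cancelled out by a drop in order value. The figures below are invented for illustration, and the `ArmTotals` class is a hypothetical helper, not part of any analytics tool.

```python
from dataclasses import dataclass


@dataclass
class ArmTotals:
    visitors: int
    orders: int
    revenue: float  # same currency for both arms

    @property
    def cvr(self):
        """Conversion rate: orders per visitor."""
        return self.orders / self.visitors

    @property
    def aov(self):
        """Average order value: revenue per order."""
        return self.revenue / self.orders

    @property
    def rpv(self):
        """Revenue per visitor, equal to cvr * aov."""
        return self.revenue / self.visitors


def relative_change(control_value, variant_value):
    return variant_value / control_value - 1


# Invented numbers: the variant converts more visitors but at a lower order value.
control = ArmTotals(visitors=10_000, orders=300, revenue=24_000.0)  # AOV 80
variant = ArmTotals(visitors=10_000, orders=330, revenue=23_100.0)  # AOV 70

cvr_lift = relative_change(control.cvr, variant.cvr)  # positive: CVR went up
rpv_lift = relative_change(control.rpv, variant.rpv)  # negative: RPV went down

print(f"CVR change: {cvr_lift:+.2%}")
print(f"RPV change: {rpv_lift:+.2%}")
```

In this made-up case the variant "wins" on conversion rate by about 10% while revenue per visitor falls by roughly 3.75%, which is the pattern the guardrail exists to catch.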

Writing the recommendation

Four recommendations cover roughly 95% of tests: ship, kill, iterate, or re-run. "Ship" means roll the variant to 100%. "Kill" means the control stays and the hypothesis is closed. "Iterate" means the direction looks promising but the execution needs another swing. "Re-run" means the test was inconclusive due to traffic, seasonality, or a tracking issue.

Frequency of each verdict is itself a useful signal. Teams new to experimentation tend to over-ship (everything looks like a winner) or over-kill (anything not significant at p<0.05 gets binned). A healthy programme on a Shopify store of €5M revenue typically lands somewhere in the distribution below.

Chart: Typical distribution of experiment verdicts on a mature programme (x-axis: verdict, with bars for Ship, Kill, Iterate, and Re-run; y-axis: share of tests, 0% to 50%).

If your ship rate is north of 40%, you are probably calling tests early or reporting on cherry-picked segments. If it is under 10%, either you are testing too cautiously (tiny copy tweaks) or your tests are underpowered: the minimum detectable effect your traffic supports is larger than the lifts your changes can realistically produce.
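A quick way to check where a programme sits is to tally the verdicts from past readouts. The sketch below is a minimal example that assumes verdicts are stored as the four lowercase strings used here; the 40% and 10% thresholds come from the rule of thumb above, and the function names are hypothetical.

```python
from collections import Counter

VERDICTS = ("ship", "kill", "iterate", "re-run")


def verdict_shares(verdicts):
    """Share of concluded tests that ended in each verdict."""
    counts = Counter(verdicts)
    unknown = set(counts) - set(VERDICTS)
    if unknown:
        raise ValueError(f"unknown verdicts: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no concluded tests in the log")
    return {v: counts[v] / total for v in VERDICTS}


def ship_rate_flag(shares, high=0.40, low=0.10):
    """Return a warning string if the ship rate looks unhealthy, else None."""
    ship = shares["ship"]
    if ship > high:
        return "Ship rate above 40%: check for early stopping or cherry-picked segments."
    if ship < low:
        return "Ship rate under 10%: check test ambition and statistical power."
    return None


# Hypothetical log of ten concluded tests.
log = ["ship"] * 5 + ["kill"] * 3 + ["iterate"] * 2
shares = verdict_shares(log)
print(shares)
print(ship_rate_flag(shares))
```

A flag is a prompt to look at how tests are being called, not a verdict on the programme by itself.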

Segment cuts: useful insight vs. p-hacking

Segment analysis is where reporting credibility is won or lost. A real segment cut is one you pre-declared in the test plan — mobile vs desktop, new vs returning, paid vs organic. A retroactive segment cut ("it worked for visitors from Germany on Tuesdays") is p-hacking dressed up as insight, and stakeholders eventually notice.

Report the segments you committed to, even when they are uninteresting. Boring segment slides — "behaves the same on mobile as desktop" — are evidence that you are not fishing. They make the segments that do diverge land harder.
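One lightweight way to hold yourself to this is to compare the segments in a draft readout against the ones listed in the test plan. The sketch below assumes both are available as plain lists of segment names; the function and the example names are hypothetical.

```python
def split_segment_cuts(planned, reported):
    """Compare reported segment cuts against the pre-declared test plan."""
    planned_set = set(planned)
    reported_set = set(reported)
    return {
        # Pre-declared and present in the readout: report as normal.
        "report": [s for s in planned if s in reported_set],
        # Pre-declared but absent: add them, even if the result is boring.
        "missing": [s for s in planned if s not in reported_set],
        # Not in the plan: label as exploratory or leave out.
        "post_hoc": [s for s in reported if s not in planned_set],
    }


planned = ["mobile vs desktop", "new vs returning", "paid vs organic"]
reported = ["mobile vs desktop", "new vs returning", "visitors from Germany on Tuesdays"]

for bucket, segments in split_segment_cuts(planned, reported).items():
    print(f"{bucket}: {segments}")
```

Anything in the "post_hoc" bucket can still be mentioned, as long as it is clearly marked as a hypothesis for a future test rather than a finding.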

Benchmark

What to include in a standard readout, by audience

Artefact                        CRO team   Head of E-comm   Brand / dev
Variant screenshots             Optional   Yes              Required
Primary metric + CI             Required   Required         Optional
Pre-declared segments           Required   Required         Optional
Revenue per visitor             Required   Required         Optional
Guardrails (load time, errors)  Required   Yes              Yes
Sample size + duration          Required   Yes              Optional
Recommendation + next test      Required   Required         Yes

On most teams a single readout doc serves all three audiences — the analyst-facing detail goes in an appendix, and the top of the page is built for the head of e-commerce. One artefact, layered depth.

Cadence and the long game

Reporting is not a per-test activity; it is a weekly habit. The teams that get the most out of experimentation publish a short readout for every concluded test within 48 hours, plus a monthly roll-up that shows what was learned across the portfolio: which hypotheses keep winning, which keep losing, what the cumulative revenue impact has been.

The monthly roll-up matters more than any single readout. It is where you make the case for headcount, for traffic to the experimentation programme, and for the brand team to stop second-guessing winners. "We ran 11 tests this quarter: four winners, an estimated +€140k in annualised revenue, and three killed hypotheses we no longer need to revisit" is a different conversation from "we ran some tests."
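If readouts are logged in a consistent shape, the roll-up is mostly a filter and a sum. The sketch below assumes a simple in-memory log with one entry per concluded test; the `LogEntry` fields, the test names, and the figures are all hypothetical, and the revenue estimates are whatever each readout reported.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class LogEntry:
    name: str
    concluded: date
    verdict: str               # "ship", "kill", "iterate", or "re-run"
    primary_lift: float        # relative lift on the primary metric
    annual_revenue_eur: float  # estimated annualised impact from the readout
    learning: str              # one-line takeaway


def roll_up(entries, start, end):
    """Summarise every test concluded between start and end (inclusive)."""
    window = [e for e in entries if start <= e.concluded <= end]
    shipped = [e for e in window if e.verdict == "ship"]
    return {
        "concluded": len(window),
        "shipped": len(shipped),
        "killed": sum(1 for e in window if e.verdict == "kill"),
        # Only shipped variants contribute to the revenue estimate.
        "annual_revenue_eur": sum(e.annual_revenue_eur for e in shipped),
        "learnings": [f"{e.name}: {e.learning}" for e in window],
    }


def summary_line(r):
    return (
        f"{r['concluded']} tests concluded, {r['shipped']} shipped, "
        f"{r['killed']} killed, estimated +€{r['annual_revenue_eur']:,.0f} annualised revenue"
    )


# Invented entries for illustration only.
LOG = [
    LogEntry("Sticky add-to-cart", date(2026, 1, 20), "ship", 0.062, 40_000.0,
             "Persistent CTA helps on mobile"),
    LogEntry("Trust badge in cart", date(2026, 2, 11), "kill", 0.0, 0.0,
             "Trust badges did not move conversion"),
    LogEntry("Homepage hero video", date(2026, 4, 2), "iterate", 0.01, 0.0,
             "Direction promising, load time too high"),
]

report = roll_up(LOG, date(2026, 1, 1), date(2026, 3, 31))
print(summary_line(report))
for line in report["learnings"]:
    print("-", line)
```

Keeping the one-line learning next to each verdict means killed hypotheses show up in the roll-up alongside the winners.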

Report losers as loudly as winners

Killed tests are not failures — they are learning the organisation paid for. A readout that says "the trust-badge hypothesis did not move conversion, here is what we will try instead" raises the credibility of the next ship recommendation. Programmes that only publicise winners get treated like marketing campaigns, not research.

Frequently asked

Experiment reporting FAQ

Should results go out as a Slack post or as a full readout doc?

Both, sequenced. A Slack post within 48 hours with the verdict, primary metric, and a link to the full readout doc. The doc holds the segment cuts, guardrails, and methodology for anyone who wants to dig. Decks are for the monthly roll-up, not for individual tests.

Should I show stakeholders the p-value or the confidence interval?

Skip the p-value, show the confidence interval. "+6.2% with a 95% range of +2% to +10%" communicates uncertainty in a way decision-makers can actually use. Keep the p-value in the appendix for the analysts who want it.

How do I report an inconclusive test?

Honestly. Report the observed lift, the confidence interval (which will straddle zero), and the minimum detectable effect you would have needed. Then recommend re-run, iterate, or kill based on whether the direction is worth another swing, not on whether the p-value happened to cross 0.05.

What is the difference between experiment analysis and experiment reporting?

Experiment analysis is the statistical work: checking validity, computing lift, watching guardrails. Reporting is the communication work: turning that analysis into a recommendation and a story stakeholders can act on. The same person often does both, but they are different skills.

Which segments should every readout include?

Mobile vs desktop and new vs returning are non-negotiable for most online stores. Add traffic source (paid vs organic) if your paid mix is material, and operating system (iOS vs Android) if you sell into both customer bases. Anything else should be pre-declared in the test plan, not invented after the fact.

How quickly should results be reported after a test concludes?

Within 48 hours for the headline, within a week for the full write-up. Beyond that, context decays: people forget what was tested, decisions get made without the evidence, and the test effectively never happened. Speed of reporting is the strongest predictor of programme momentum.

What if the brand team objects to a winning variant?

Ship it anyway if the result is clean, but document the objection. Brand intuition is data too, and the variants the brand team flagged sometimes underperform once shipped to 100% (novelty effects fade). Track post-ship performance and revisit if it doesn't hold.

Should I ship a variant with a flat result?

Generally no: a flat result is a kill, not a ship. Shipping flat variants adds complexity to the codebase without earning revenue, and clutters future experiments. The exception is when the variant unblocks a strategic change (a redesign, a platform migration) and parity is enough.

How should I report a test that ran during a sale period?

Flag it prominently and discount the result. Sale traffic skews high-intent, mobile-heavy, and price-sensitive, so lifts measured during Black Friday rarely generalise. Either re-run post-peak or report the lift with an explicit "sale-period only" caveat in the recommendation.

How do I track results across the whole programme?

A simple portfolio log: test name, dates, verdict, primary metric lift, estimated annual revenue impact, and one-line learning. Update it after every readout and roll it up monthly. After a year it is the most persuasive document the experimentation programme owns.
