How to use Segment Analysis

Metricuno
May 17, 2026
How to run segment analysis on A/B tests without inflating false positives: when to slice, how to pre-register, and realistic per-segment lift ranges.
Quick answer

Segment analysis splits A/B test results by device, source, or cohort to surface wins a sitewide read hides — but only if you control for multiple comparisons.

Definition
Experimentation

Segment Analysis

Splitting A/B test results by traffic source, device, geography, or cohort to find effects that sitewide averages hide.

Segment analysis is the practice of breaking an experiment's results into subgroups — mobile vs desktop, paid vs organic, returning vs new, EU vs US — to see whether the treatment behaves differently across them. A neutral sitewide reading often masks a strong mobile win cancelled out by a desktop loss, or a paid-traffic regression hidden by organic stability.

The technique is powerful and dangerous in equal measure. Slicing the same dataset across ten dimensions inflates the false-positive rate, so disciplined teams pre-register the segments that matter and apply a correction before declaring per-segment wins.

Also known as
Subgroup analysis
Cohort cut
Heterogeneous treatment effect analysis

A sitewide A/B test gives you one number: the average effect across everyone who saw the variant. That number is the right place to start — but it's almost never the full story for a store with mixed traffic.

If 65% of your sessions are mobile and the variant lifts mobile conversion by 4% while flattening desktop, the sitewide read might come in at +2.5% — significant, but understating the real story. Flip the traffic mix and the same test reads as a tie. Segment analysis is how you stop letting traffic composition decide your roadmap.
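The traffic-mix arithmetic can be sketched in a few lines. This is a minimal illustration that assumes both devices share the same 2.5% baseline conversion rate (a hypothetical figure, chosen only to keep the arithmetic simple); the function name `sitewide_relative_lift` is also made up for this example.

```python
def sitewide_relative_lift(segments):
    """Blend per-segment lifts into one sitewide relative lift.

    segments: list of (traffic_share, control_conversion_rate, relative_lift).
    """
    control = sum(share * cr for share, cr, _ in segments)
    treatment = sum(share * cr * (1 + lift) for share, cr, lift in segments)
    return treatment / control - 1


# Hypothetical store: +4% on mobile, flat on desktop, equal baseline rates.
mobile_heavy = [(0.65, 0.025, 0.04), (0.35, 0.025, 0.0)]
desktop_heavy = [(0.35, 0.025, 0.04), (0.65, 0.025, 0.0)]

lift_mobile_heavy = sitewide_relative_lift(mobile_heavy)
lift_desktop_heavy = sitewide_relative_lift(desktop_heavy)
print(f"{lift_mobile_heavy:.1%}")   # 2.6%
print(f"{lift_desktop_heavy:.1%}")  # 1.4%
```

With unequal baselines the blend shifts again: if mobile converts at a lower rate than desktop, its lift contributes less to the sitewide number than its share of sessions suggests.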

When segment analysis actually earns its place

Not every test deserves a segment cut. Reach for it when the treatment plausibly affects subgroups differently — a checkout redesign that adds Apple Pay (mobile-skewed), a free-shipping threshold change (cart-value-skewed), or a hero swap that speaks to first-time visitors but not returning buyers.

Skip it for tests where the mechanism is segment-agnostic — a button colour tweak, a font change, a generic copy edit. Slicing those is fishing, and you'll catch a false positive eventually.

The cleanest discipline: before the test ships, write down the two or three segments where you'd expect different behaviour, and why. That's your pre-registered analysis plan. Any other segment cut later is exploratory and gets labelled as such in the readout.
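One way to make that plan concrete is to store it next to the test brief and let it decide how each cut is labelled in the readout. The sketch below is only an illustration: the class names, fields, and the example test are hypothetical, not part of any testing platform.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PlannedSegment:
    name: str
    rationale: str  # why this segment is expected to behave differently


@dataclass
class AnalysisPlan:
    test_name: str
    segments: list = field(default_factory=list)

    def label(self, segment_name):
        """Return how a segment cut should be labelled in the readout."""
        planned = {segment.name for segment in self.segments}
        return "confirmatory" if segment_name in planned else "exploratory"


# Hypothetical plan, written before the test ships.
plan = AnalysisPlan(
    test_name="checkout-apple-pay",
    segments=[
        PlannedSegment("mobile", "Apple Pay usage is mobile-skewed"),
        PlannedSegment("new visitors", "fewer saved payment details"),
    ],
)
print(plan.label("mobile"))       # confirmatory
print(plan.label("desktop / EU")) # exploratory
```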

The multiple-comparisons trap

If you test a null effect across 20 independent segments at p<0.05, you expect one false positive by pure chance, and the odds of seeing at least one are roughly 64%. Slicing your A/B test by device × country × source × new/returning is exactly this trap. Either pre-register the segments or apply a Bonferroni / Benjamini-Hochberg correction; don't fish through uncorrected cuts and then report whichever one clears the bar.
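The arithmetic behind the trap is short enough to check by hand. The sketch below assumes independent segments and a true null in every one of them; the function names are illustrative.

```python
def expected_false_positives(num_segments, alpha=0.05):
    """Expected number of chance 'wins' when every segment is truly null."""
    return num_segments * alpha


def familywise_error_rate(num_segments, alpha=0.05):
    """Probability of at least one chance 'win' across independent segments."""
    return 1 - (1 - alpha) ** num_segments


def bonferroni_threshold(num_segments, alpha=0.05):
    """Per-segment p-value threshold that keeps the family-wise rate near alpha."""
    return alpha / num_segments


for k in (3, 10, 20):
    print(
        k,
        round(expected_false_positives(k), 2),
        round(familywise_error_rate(k), 2),
        round(bonferroni_threshold(k), 4),
    )
```

Bonferroni is the bluntest correction: it divides the threshold by the number of cuts, which is why a short pre-registered list is so much cheaper than a long one.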

Segments that matter for an online store

Four segment dimensions cover most of the signal you'll find in a Shopify or WooCommerce test. Device is the highest-yield split because mobile and desktop UX are genuinely different products. Traffic source is next — paid visitors arrive with different intent than organic, and email traffic converts on a different curve again.

New vs returning sessions is the third high-value cut, especially for tests that touch trust signals, reviews, or the first-time-buyer offer. Geography matters when the variant changes shipping cost, currency, or language; it matters less for pure UX changes.
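A per-segment readout along any of these dimensions reduces to the same calculation repeated for each subgroup. The sketch below uses a pooled two-proportion z-test from the standard library; the counts are hypothetical and the function names are illustrative. The p-values it returns are uncorrected.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_c, n_c, conv_t, n_t):
    """Two-sided p-value for a pooled two-proportion z-test."""
    pooled = (conv_c + conv_t) / (n_c + n_t)
    se = sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_t))
    z = (conv_t / n_t - conv_c / n_c) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def segment_readout(counts):
    """counts: {segment: (control_conv, control_n, treat_conv, treat_n)}."""
    readout = {}
    for segment, (conv_c, n_c, conv_t, n_t) in counts.items():
        lift = (conv_t / n_t) / (conv_c / n_c) - 1
        readout[segment] = (lift, two_proportion_p_value(conv_c, n_c, conv_t, n_t))
    return readout


# Hypothetical counts for a device split.
results = segment_readout({
    "mobile": (400, 20_000, 460, 20_000),
    "desktop": (300, 10_000, 300, 10_000),
})
for segment, (lift, p_value) in results.items():
    print(f"{segment}: lift {lift:+.1%}, uncorrected p = {p_value:.3f}")
```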

Chart

Per-segment lift for a single checkout test — sitewide reads +1.8%

[Bar chart: lift in conversion rate (y-axis, -2% to 6%) by segment (x-axis): Mobile / paid, Mobile / organic, Desktop / paid, Desktop / organic, Returning / email, New / social.]

A +1.8% sitewide read hides a 6.8-point swing between desktop organic (-1.6%) and mobile paid (+5.2%). Shipping the variant to everyone leaves money on the table on mobile and quietly damages desktop. Targeted rollout — variant for mobile, control for desktop — is the actionable conclusion.

Sizing a segment before you trust it

A segment with 800 sessions and three conversions is noise dressed up as a number. Before reading a per-segment effect, check the segment's own sample size against the lift you're hoping to detect. The smaller the segment, the larger the lift needs to be to clear significance — which means small segments only surface big effects, and small effects in small segments are invisible.

A practical floor: a segment needs at least 200 conversions per variant to support a ±15% relative lift reading at 80% power. Below that, treat the number as directional only, not as a basis for shipping.
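As a rough cross-check on that floor, the textbook normal-approximation formula for comparing two proportions can be sketched as below. The function name and its defaults (two-sided alpha of 0.05, 80% power) are assumptions for illustration, not taken from any particular testing platform. Under those assumptions the formula asks for noticeably more traffic than 200 conversions per variant, so it is safest to treat the floor as a bare minimum for a directional read.

```python
from math import ceil
from statistics import NormalDist


def sessions_per_variant(base_cr, relative_lift, alpha=0.05, power=0.80):
    """Approximate sessions per variant for a two-sided two-proportion z-test."""
    p1 = base_cr
    p2 = base_cr * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


# Hypothetical segment: 2% baseline conversion, hoping to detect +15% relative lift.
needed = sessions_per_variant(0.02, 0.15)
print(needed, "sessions per variant, about", round(needed * 0.02), "control conversions")
```

Halving the detectable lift roughly quadruples the traffic required, which is why small segments can only ever surface large effects.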

Benchmark

Typical conversion rates and minimum useful segment size for a Shopify apparel store

| Segment | Typical CR | Share of sessions | Sessions needed per variant for ±15% MDE |
| --- | --- | --- | --- |
| Mobile / paid social | 1.4-2.2% | 35-45% | 12,000-18,000 |
| Mobile / organic search | 1.8-2.8% | 15-20% | 9,000-13,000 |
| Desktop / paid search | 2.8-4.0% | 10-15% | 6,500-9,000 |
| Desktop / organic | 3.2-4.5% | 8-12% | 5,500-7,500 |
| Returning / email | 5.0-8.0% | 8-15% | 3,500-5,000 |
| New / direct | 1.2-2.0% | 5-10% | 13,000-20,000 |

The right-most column is the discipline check. If a test runs for 14 days and accumulates 8,000 mobile-paid sessions per variant, you do not have a reliable mobile-paid read — you have a hint. Either extend the test, narrow the segment set to the two or three you pre-registered, or accept that some segments are exploratory and label them so in the readout to stakeholders.
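That discipline check is easy to automate. The sketch below copies the session ranges from the benchmark table as given; the three labels and the function names are hypothetical conventions for this example, and the day estimate assumes traffic keeps arriving at the same rate.

```python
from math import ceil

# (low, high) sessions per variant, copied from the benchmark table.
BENCHMARK_SESSIONS = {
    "mobile / paid social": (12_000, 18_000),
    "mobile / organic search": (9_000, 13_000),
    "desktop / paid search": (6_500, 9_000),
    "desktop / organic": (5_500, 7_500),
    "returning / email": (3_500, 5_000),
    "new / direct": (13_000, 20_000),
}


def read_quality(segment, sessions_per_variant):
    """Label a segment read against the benchmark range."""
    low, high = BENCHMARK_SESSIONS[segment]
    if sessions_per_variant < low:
        return "directional"
    if sessions_per_variant < high:
        return "borderline"
    return "meets benchmark"


def total_days_needed(segment, sessions_so_far, days_run):
    """Total test length needed to reach the low end of the benchmark range."""
    low, _ = BENCHMARK_SESSIONS[segment]
    return ceil(low * days_run / sessions_so_far)


print(read_quality("mobile / paid social", 8_000))           # directional
print(total_days_needed("mobile / paid social", 8_000, 14))  # 21
```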

Reporting segment results without misleading anyone

Segment analysis sits inside the broader practice of experiment analysis, and the same reporting hygiene applies. Lead with the sitewide effect — it's the question the test was powered to answer. Then show pre-registered segments as confirmatory analysis with corrected p-values, and finally any exploratory cuts clearly flagged as hypothesis-generating, not decision-grade.
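The "corrected p-values" step for pre-registered segments could look like the sketch below, a plain-Python version of the Benjamini-Hochberg procedure. The function name and the sample p-values are hypothetical; in practice a statistics library would usually handle this.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the segment survives BH at level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k whose p-value is at or below (k / m) * q.
    largest_passing_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * q:
            largest_passing_rank = rank

    # Everything ranked at or below that k is declared significant.
    significant = [False] * m
    for rank, index in enumerate(order, start=1):
        if rank <= largest_passing_rank:
            significant[index] = True
    return significant


# Hypothetical uncorrected p-values for four segment cuts.
segment_p_values = [0.004, 0.02, 0.03, 0.20]
bh_result = benjamini_hochberg(segment_p_values)
bonferroni_result = [p <= 0.05 / len(segment_p_values) for p in segment_p_values]
print(bh_result)          # [True, True, True, False]
print(bonferroni_result)  # [True, False, False, False]
```

Benjamini-Hochberg controls the share of false discoveries among the reported wins rather than the chance of any false win at all, which is why it keeps more segments than Bonferroni on the same inputs.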

When a segment cut suggests a strong effect that wasn't pre-registered, the correct next step is almost never to ship to that segment. It's to run a follow-up test scoped to that segment, with proper power. Treating exploratory findings as confirmatory is how teams end up shipping changes that quietly underperform in production.

Pre-register, then explore

Write your segments in the test brief before launch: which two or three, and why. Anything else you slice afterwards is fair game to look at — but report it as exploratory. This single habit eliminates roughly 80% of the false positives that creep into segment-driven roadmaps.

Frequently asked

Segment analysis: common questions

Two or three pre-registered segments per test is the practical ceiling. Each additional segment increases the chance of a false positive: at 10 independent segments tested at p<0.05, there is roughly a 40% chance of at least one chance finding even on a true null. If you must look at more, apply a Bonferroni or Benjamini-Hochberg correction to the segment p-values.

Around 200 conversions per variant within the segment is a sensible floor for detecting a ±15% relative lift at 80% power. Below that the segment can only surface very large effects, and noise dominates the read. Smaller segments are useful as directional signal, not as a basis for shipping.

When a treatment is segment-specific by design — a mobile-only nav change, a US-only shipping offer — yes, scope the test to that segment from the start and power it accordingly. Segment analysis on a sitewide test is for treatments that affect everyone but plausibly differently.

Experiment analysis is the full readout of a test — sitewide effect, statistical significance, guardrails, secondary metrics. Segment analysis is one technique inside it, focused on whether the treatment effect varies across subgroups. Every segment analysis lives inside a broader experiment analysis.

Yes, and it's often the most actionable cut for stores with a strong returning-customer base. Segments like 'has purchased in last 90 days' or 'high AOV cohort' usually need a Klaviyo or CDP join to your test data. The same pre-registration discipline applies — decide before, not after.

Treat it as a hypothesis, not a result. Either the segment really is responsive and the rest of the audience dilutes the effect, or you've found a chance finding. The honest next step is a follow-up test scoped to that segment, properly powered, before shipping anything.

For two pre-registered segments the correction is mild and many teams skip it, accepting a slightly elevated false-positive rate. Once you're looking at four or more segments, or mixing pre-registered with exploratory cuts, applying Benjamini-Hochberg is the safer default.

Yes, but be careful. If the sitewide test is underpowered and noisy, per-segment cuts are even noisier. A common failure mode is declaring a 'mobile win' on a test that was inconclusive overall, and then watching the win evaporate in production. Re-test with proper power before shipping.

Pin the source at session start using UTM parameters or the referrer, and freeze it for the duration of the experiment exposure — don't re-attribute mid-session. For paid traffic, exclude bot and click-fraud segments before analysis, since they distort low-converting segments badly.

Sometimes — if the segment cut was pre-registered, the per-segment effect is significant after correction, and your platform supports targeting that segment cleanly (e.g. by device or country). For exploratory wins, run a confirmatory test before splitting your experience by segment in production.
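The earlier advice on pinning traffic source at session start, then freezing it, can be sketched as follows. The set of paid `utm_medium` values, the list of search engines, and the class and function names are all assumptions for illustration; a real store would substitute its own tagging conventions.

```python
from urllib.parse import parse_qs, urlparse

PAID_MEDIUMS = {"cpc", "ppc", "paid", "paid_social"}  # assumed tagging convention
SEARCH_ENGINES = ("google.", "bing.", "duckduckgo.")  # assumed, not exhaustive


def classify_source(landing_url, referrer=""):
    """Classify a session from its landing-page UTM parameters, then its referrer."""
    params = parse_qs(urlparse(landing_url).query)
    medium = params.get("utm_medium", [""])[0].lower()
    if medium in PAID_MEDIUMS:
        return "paid"
    if medium == "email":
        return "email"
    host = urlparse(referrer).netloc.lower()
    if not host:
        return "direct"
    if any(engine in host for engine in SEARCH_ENGINES):
        return "organic"
    return "referral"


class SessionSources:
    """Stores the first source seen for each session and never overwrites it."""

    def __init__(self):
        self._pinned = {}

    def source_for(self, session_id, landing_url, referrer=""):
        # setdefault keeps the value stored on the first call for this session.
        return self._pinned.setdefault(
            session_id, classify_source(landing_url, referrer)
        )


sources = SessionSources()
first = sources.source_for("s1", "https://shop.example/?utm_medium=cpc&utm_source=google")
later = sources.source_for("s1", "https://shop.example/cart", "https://www.google.com/")
print(first, later)  # paid paid
```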
