Product Experiments

Metricuno

May 17, 2026

Product Experiments — Product experiments test new features, flows, and removed functionality — not just visuals. Learn runtimes, sample sizes, and how to design them.

Quick answer

Product experiments test functional changes — new features, new flows, removed features — rather than visual tweaks. Stakes are higher and runtimes are longer because you're measuring deeper behavior.

Definition

Experimentation

Product Experiments

A/B tests on functional product changes — features, flows, or removals — rather than purely visual or copy tweaks.

Product experiments are controlled tests where the variant changes how the product actually works: a new checkout step, a different recommendation engine, a removed upsell, a rebuilt mobile filter. Unlike a button-color test, they alter the behavior the user has to learn, adopt, or trust.

Because the change is deeper, the measurement is harder. Effects take longer to surface, novelty and learning curves distort early data, and the metrics that matter usually sit further down the funnel: retention, repeat purchase, and average order value (AOV), not just click-through. A product experiment is a commitment, not a quick win.

Also known as

Feature tests

Functional A/B tests

Product A/B tests

Product experiments sit inside the broader practice of feature experimentation, but they're the highest-stakes subset. A copy test runs for a week and affects one funnel step. A product experiment can run for a month and affect how the entire shopping experience feels.

The classic example on a Shopify store: replacing a multi-step checkout with a single-page checkout. The variant changes navigation, form behavior, payment-method order, and trust signals all at once. You're not testing a pixel — you're testing a product decision.

Formula

required_runtime_days = (sample_size_per_variant * 2) / daily_traffic

Variables

sample_size_per_variant

Sample size per variant

Visitors needed per arm to detect the minimum detectable effect (MDE) at 95% confidence and 80% power.

daily_traffic

Daily traffic into the test

Unique visitors per day eligible for the experiment.

required_runtime_days

Required runtime in days

Calendar days the test needs to run, before adding a buffer for weekly cycles.
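The sample_size_per_variant input usually comes from a power calculation. The article does not say which one it uses, so the sketch below assumes a common textbook approximation: a two-sided test on two proportions with unpooled variance. The function and parameter names are illustrative, not from any particular tool.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline, target, alpha=0.05, power=0.80):
    """Approximate visitors per arm for a two-sided two-proportion test.

    Assumption: normal approximation with unpooled variance. Other
    calculators (one-sided, pooled, sequential) return different numbers.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 95% confidence, two-sided
    z_beta = NormalDist().inv_cdf(power)           # 80% power
    variance = baseline * (1 - baseline) + target * (1 - target)
    effect = target - baseline
    return ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# The 2.4% -> 2.7% lift from the article's worked example.
n = sample_size_per_variant(0.024, 0.027)
print(n)
```

Under these assumptions the result is roughly 43,000 visitors per arm, which is higher than the 31,000 quoted in the worked example. Calculators disagree depending on whether the test is one- or two-sided and how variance is estimated, so record the method alongside the number.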

Worked example

A Shopify apparel store tests a new size-recommendation widget on product pages, targeting a lift from 2.4% to 2.7% conversion rate.

Sample size per variant: 31,000

Daily eligible traffic: 2,200

Variants (control + 1): 2

→ ~28 days

The raw calculation comes to just over four weeks (62,000 / 2,200 ≈ 28.2 days). Add a fifth week so the test ends on a complete weekly cycle, since weekend traffic skews younger and converts differently.

Notice the runtime: four to five weeks is normal for product experiments, not the 7-14 days a conversion rate optimization (CRO) team might quote for a headline test. The smaller the effect you need to detect, the longer the runtime, and product changes often move metrics by single-digit percentages, not 30% lifts.
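The runtime formula and the weekly-cycle buffer can be sketched in a few lines. The variants parameter generalizes the hard-coded 2 in the formula, and round_up_to_full_weeks is a hypothetical helper for the buffer step; neither name comes from the article.

```python
from math import ceil


def required_runtime_days(sample_size_per_variant, daily_traffic, variants=2):
    """Raw runtime: total visitors needed divided by eligible daily traffic."""
    if daily_traffic <= 0:
        raise ValueError("daily_traffic must be positive")
    return sample_size_per_variant * variants / daily_traffic


def round_up_to_full_weeks(days):
    """Buffer step (assumption): extend the run to the next complete week."""
    return ceil(days / 7) * 7


# Inputs from the worked example: 31,000 per variant, 2,200 visitors per day.
raw_days = required_runtime_days(31_000, 2_200)
planned_days = round_up_to_full_weeks(raw_days)
print(f"raw: {raw_days:.1f} days, planned: {planned_days} days")
```

With the worked example's inputs the raw figure is about 28.2 days, and rounding up to whole weeks gives 35 days, which matches the four-to-five-week range described above.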

Benchmark

Typical runtime and sample profile by experiment type

Experiment type	Typical MDE	Runtime (weeks)	Primary metric
Headline / copy test	10-20%	1-2	Click-through rate
Visual / layout test	5-10%	2-3	Add-to-cart rate
Checkout flow change	2-5%	3-5	Checkout completion
New feature rollout	3-8%	4-6	Conversion + AOV
Feature removal	1-3%	4-8	Revenue per visitor

The metric column matters as much as the runtime. A copy test optimizes a click — a product experiment has to defend revenue per visitor, because a feature can lift add-to-cart but tank AOV, or improve conversion but hurt 30-day repeat rate. Always pair the primary metric with a guardrail.

Frequently asked questions

What's the difference between a product experiment and a CRO test?

CRO tests are usually visual or copy changes on existing pages. They optimize what's already there. Product experiments change how a feature works, add new functionality, or remove existing functionality. The change is structural, not cosmetic.

How is this related to feature experimentation?

Feature experimentation is the umbrella practice that covers any test involving product functionality. Product experiments are the high-stakes subset where the change is significant enough to alter user behavior at a deep level: new flows, removed features, or major rebuilds.

How long should a product experiment run?

Plan for four to six weeks as a baseline. You need enough sample size to detect small effects, full weekly cycles to smooth out weekday/weekend variance, and a buffer for novelty effects to wash out in the first 7-10 days.

What's a novelty effect and why does it matter here?

Returning users react to anything new, sometimes positively and sometimes negatively, before adjusting. For a button color, this washes out in days. For a redesigned checkout, the adjustment period can be two weeks. Exclude at least the first week of data when reading the result.
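One way to apply that exclusion is to drop a burn-in window before analysis. This is a minimal sketch that assumes results are available as one row per day; the row shape and the after_burn_in name are hypothetical, and the counts are invented for illustration.

```python
from datetime import date, timedelta


def after_burn_in(daily_rows, start_date, burn_in_days=7):
    """Keep only rows dated on or after the end of the burn-in window."""
    cutoff = start_date + timedelta(days=burn_in_days)
    return [row for row in daily_rows if row["day"] >= cutoff]


start = date(2026, 5, 4)
# Illustrative rows only: one per day for a four-week run.
rows = [{"day": start + timedelta(days=i), "conversions": 50 + i} for i in range(28)]

kept = after_burn_in(rows, start)
print(len(kept), kept[0]["day"])
```

Dropping the first week shortens the usable sample, so the planned runtime should already include the burn-in period.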

Should I run product experiments on logged-in users only?

If retention or repeat purchase is the metric, yes: you need to follow users across sessions. For first-purchase impact, anonymous bucketing works fine, but use a persistent cookie or device fingerprint so a returning visitor stays in the same variant.

What guardrail metrics should I track?

At minimum: revenue per visitor, AOV, refund rate, and support tickets. A feature that lifts conversion but causes a 15% spike in support volume isn't a win. Page load time is also worth tracking, since new features often add weight.
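A guardrail check can be sketched as a comparison of each metric's relative change against a tolerance. Everything below is an assumption for illustration: the Guardrail type, the metric keys, the tolerance values, and the control and variant numbers are all invented, and the check compares point estimates only, without a significance test.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guardrail:
    metric: str
    higher_is_better: bool
    tolerance: float  # largest harmful relative change tolerated (0.05 = 5%)


def breached(guardrails, control, variant):
    """Return the metrics whose harmful relative change exceeds tolerance."""
    failures = []
    for g in guardrails:
        change = (variant[g.metric] - control[g.metric]) / control[g.metric]
        harm = -change if g.higher_is_better else change
        if harm > g.tolerance:
            failures.append(g.metric)
    return failures


# Hypothetical tolerances and made-up observations.
GUARDRAILS = [
    Guardrail("revenue_per_visitor", True, 0.02),
    Guardrail("aov", True, 0.02),
    Guardrail("refund_rate", False, 0.05),
    Guardrail("support_tickets_per_1k_orders", False, 0.05),
]
control = {"revenue_per_visitor": 2.10, "aov": 84.0,
           "refund_rate": 0.040, "support_tickets_per_1k_orders": 20.0}
variant = {"revenue_per_visitor": 2.16, "aov": 83.5,
           "refund_rate": 0.041, "support_tickets_per_1k_orders": 23.0}

failures = breached(GUARDRAILS, control, variant)
print(failures)
```

In this invented data the variant improves revenue per visitor but raises support tickets by 15%, so the support guardrail is the only one flagged.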

Can I test removing a feature?

Yes, and it's one of the most underused experiment types. Hold-out tests, where the control keeps the feature and the variant removes it, reveal which features are actually pulling weight. Many stores discover their upsell modal or quick-view drawer adds friction rather than revenue.

How do I avoid sample pollution across variants?

Use consistent assignment keyed on user ID or a persistent device cookie, not session ID. Exclude staff and bot traffic. If non-bucketed users can reach the feature (e.g. via shared URLs), it is safer to gate it behind a server-side flag than to swap it in client-side, where the page can flicker.
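Consistent assignment is commonly implemented by hashing a stable identifier together with an experiment key. The sketch below assumes that approach; assign_variant and the key format are hypothetical, not the API of any specific experimentation tool.

```python
import hashlib


def assign_variant(unit_id, experiment_key, variants=("control", "variant")):
    """Deterministically map a stable ID (user ID or cookie ID) to a variant.

    The experiment key salts the hash so the same user can land in
    different arms across different experiments.
    """
    payload = f"{experiment_key}:{unit_id}".encode("utf-8")
    digest = hashlib.sha256(payload).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]


# The same ID always gets the same answer, regardless of session.
first = assign_variant("user-1842", "single-page-checkout")
second = assign_variant("user-1842", "single-page-checkout")
print(first, second)
```

Because the function depends only on the ID and the experiment key, no assignment table is needed, and a returning visitor stays in the same arm as long as the ID persists.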

What if the test is inconclusive after the planned runtime?

Resist the urge to extend indefinitely, because that inflates false-positive rates. If the result is flat with a tight confidence interval, the feature genuinely doesn't move the metric. If the interval is still wide, you under-powered the test and should redesign with a larger MDE or a higher-traffic surface.
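The distinction between a flat result and an under-powered one can be sketched with a confidence interval on the difference in conversion rates. This assumes a simple Wald interval and a hypothetical smallest_effect threshold (the smallest absolute difference the team cares about). The function names and the example counts are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def diff_confidence_interval(conv_c, n_c, conv_v, n_v, confidence=0.95):
    """Wald interval for (variant rate - control rate)."""
    p_c, p_v = conv_c / n_c, conv_v / n_v
    se = sqrt(p_c * (1 - p_c) / n_c + p_v * (1 - p_v) / n_v)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    diff = p_v - p_c
    return diff - z * se, diff + z * se


def read_result(low, high, smallest_effect):
    """Classify an interval: significant, flat (tight), or underpowered (wide)."""
    if low > 0 or high < 0:
        return "significant"
    if -smallest_effect <= low and high <= smallest_effect:
        return "flat"
    return "underpowered"


# Made-up counts: identical 2.4% rates at two different sample sizes.
large = read_result(*diff_confidence_interval(744, 31_000, 744, 31_000), 0.003)
small = read_result(*diff_confidence_interval(72, 3_000, 72, 3_000), 0.003)
print(large, small)
```

The same observed rates read as flat at 31,000 visitors per arm and as under-powered at 3,000, because only the larger sample narrows the interval below the threshold.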

Do I need a developer to run product experiments?

For functional changes, usually yes, because someone has to build the variant. But the orchestration (assignment, tracking, analysis) can run through a no-code experimentation layer, which is how most Shopify and WooCommerce teams keep velocity high without queueing every test behind a sprint.
