
Experiment Prioritization

Metricuno

May 17, 2026

6 min read

Experiment Prioritization — How to prioritize your CRO backlog with ICE, PIE, and RICE. Score impact, confidence, and effort so every test you run is the highest-leverage bet.

Quick answer

Experiment prioritization decides what to test next. ICE, PIE, and RICE rank backlog items by expected impact, confidence, and effort so your team always ships the highest-leverage bet.

Definition

Experiment prioritization: the process of ranking a CRO test backlog so the highest-impact, highest-confidence, lowest-effort experiments ship first.

Experiment prioritization is how a CRO team decides what to test next when the backlog has 40 ideas and the calendar has room for four. Rather than picking the loudest opinion in the room, the team scores each candidate against a shared rubric — typically some combination of expected impact, confidence in the hypothesis, and effort to build — and sorts the list.

The three frameworks you'll meet in practice are ICE (Impact, Confidence, Ease), PIE (Potential, Importance, Ease), and RICE (Reach, Impact, Confidence, Effort). They all do the same job: turn subjective debate into a ranked queue that anyone can defend, and keep the team working on bets that actually move revenue rather than on the idea of whoever shouted last.

Also known as: test prioritization, CRO backlog scoring, hypothesis ranking

Most CRO programs don't fail because the team runs bad tests. They fail because the team runs the wrong tests in the wrong order — burning two weeks of traffic on a button-colour tweak while a broken mobile checkout quietly bleeds 8% of sessions.

Prioritization fixes that by forcing every backlog item through the same numeric filter. A score of 7.2 beats a score of 4.8 regardless of who proposed it, which is exactly the political cover a CRO lead needs when the head of brand wants to test their pet hero image.

The three frameworks you'll actually use

ICE is the fastest: score Impact, Confidence, and Ease from 1-10, average the three, sort descending. It's perfect for a small team running 2-4 tests a month on a single Shopify storefront where reach is roughly constant across every idea.
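The ICE procedure described above (score three inputs from 1-10, average, sort descending) can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation; the `Idea` class, the idea names, and every score below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Idea:
    name: str
    impact: int      # 1-10
    confidence: int  # 1-10
    ease: int        # 1-10


def ice_score(idea: Idea) -> float:
    """Average the three ICE inputs, rounded to one decimal place."""
    return round((idea.impact + idea.confidence + idea.ease) / 3, 1)


# Hypothetical backlog items with made-up scores.
backlog = [
    Idea("Fix mobile checkout error", impact=9, confidence=8, ease=5),
    Idea("Button colour tweak", impact=2, confidence=4, ease=9),
    Idea("PDP size-guide redesign", impact=6, confidence=6, ease=6),
]

# Sort descending so the highest-scoring idea is first in the queue.
ranked = sorted(backlog, key=ice_score, reverse=True)

for idea in ranked:
    print(f"{ice_score(idea):>4}  {idea.name}")
```

With these example numbers the checkout fix ranks first and the button-colour tweak last, even though the tweak is the easiest to build.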

PIE (Potential, Importance, Ease) is ICE's older cousin from WiderFunnel. The twist is that Importance weights the business value of the page being tested, so a checkout-page experiment outranks a blog-page experiment even when raw lift potential looks similar.

RICE adds an explicit Reach term, which matters once you're running experiments across multiple templates, locales, or Shopify Markets where one variant might touch 200k sessions and another only 12k.
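To make the difference concrete, here is a hedged sketch of PIE and RICE scoring. The article does not give formulas, so two assumptions are made: PIE is taken as the plain average of its three 1-10 inputs, and RICE uses the commonly cited form (Reach × Impact × Confidence) ÷ Effort, with Confidence expressed as a fraction and Effort in person-weeks. All input values are illustrative.

```python
def pie_score(potential: float, importance: float, ease: float) -> float:
    """Assumed convention: average of the three 1-10 inputs."""
    return (potential + importance + ease) / 3


def rice_score(reach: float, impact: float, confidence: float, effort: float) -> float:
    """Assumed convention: (reach * impact * confidence) / effort.

    reach: sessions exposed to the variant
    impact: relative impact multiplier (illustrative scale)
    confidence: fraction between 0 and 1
    effort: person-weeks, must be positive
    """
    if effort <= 0:
        raise ValueError("effort must be positive")
    return reach * impact * confidence / effort


# PIE: same Potential and Ease, but the checkout page is more important.
checkout_pie = pie_score(potential=6, importance=9, ease=5)
blog_pie = pie_score(potential=6, importance=3, ease=5)

# RICE: a modest idea on a 200k-session page vs a stronger idea on a 12k-session page.
homepage_rice = rice_score(reach=200_000, impact=1.0, confidence=0.5, effort=2)
pdp_rice = rice_score(reach=12_000, impact=2.0, confidence=0.8, effort=1)

print(f"PIE  checkout={checkout_pie:.2f}  blog={blog_pie:.2f}")
print(f"RICE homepage={homepage_rice:.0f}  pdp={pdp_rice:.0f}")
```

Under these assumed inputs, the homepage idea outranks the product-page idea on RICE despite lower impact and confidence, purely because of reach. That is the situation where ICE, which ignores reach, would rank the two the other way round.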

Scoring the inputs without lying to yourself

Impact is the lift you'd expect if the variant wins. Anchor it in real numbers from your analytics — drop-off rate at the step you're targeting, current conversion rate, AOV — not in vibes. A PDP test on an apparel store with a 3.1% baseline CR and a known 22% add-to-cart drop has a different impact ceiling than a thank-you-page upsell.

Confidence is where most teams cheat. Score it against the evidence you actually have: session recordings showing the friction, a quantitative funnel report showing the drop, prior tests in the same pattern. If the only evidence is "the agency suggested it," confidence is a 3, not an 8.

Effort is engineering days plus QA. Be honest about whether your dev team can ship it without a sprint review.

The confidence-inflation trap

Teams systematically over-score Confidence on ideas they like and under-score it on ideas they don't. Calibrate by going back six months and checking how many of your 8+ Confidence tests actually won. If the hit rate is under 40%, your scoring is biased and you need a second reviewer on every score before the test enters the queue.
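The calibration check above is simple arithmetic and can be sketched as follows. The record layout (`confidence`, `won`) and the sample test history are hypothetical; the 8+ threshold and the 40% cut-off come from the paragraph above.

```python
def high_confidence_hit_rate(tests, threshold=8):
    """Share of tests scored at or above `threshold` Confidence that won.

    Returns None when there are no high-confidence tests to judge.
    """
    high = [t for t in tests if t["confidence"] >= threshold]
    if not high:
        return None
    wins = sum(1 for t in high if t["won"])
    return wins / len(high)


# Hypothetical results from the last six months.
past_tests = [
    {"name": "Sticky add-to-cart", "confidence": 9, "won": True},
    {"name": "Hero image swap", "confidence": 8, "won": False},
    {"name": "Free-shipping banner", "confidence": 8, "won": False},
    {"name": "Trust badges", "confidence": 9, "won": False},
    {"name": "Checkout copy", "confidence": 8, "won": False},
    {"name": "Footer links", "confidence": 4, "won": True},
]

hit_rate = high_confidence_hit_rate(past_tests)
# Below 40% suggests inflated scores, so add a second reviewer.
needs_second_reviewer = hit_rate is not None and hit_rate < 0.40

print(f"hit rate: {hit_rate:.0%}, second reviewer needed: {needs_second_reviewer}")
```

In this made-up history only one of five high-confidence tests won, so the check flags the scoring as biased.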

Running the backlog as a system

Prioritization isn't a one-off ceremony — it's a weekly cadence. New ideas enter the backlog with a draft score, the CRO lead reviews scores in a 30-minute Monday session, and the top three move into the build queue. Anything sitting in the backlog for more than 90 days without progressing either gets rebuilt with new evidence or archived.
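The weekly cadence above can be expressed as a small triage function: pick the top three by score and flag anything older than 90 days. This is a sketch under assumptions. The dictionary fields, the item names, and the choice to keep stale items out of the build queue until they are re-scored are all illustrative, not part of the process as described.

```python
from datetime import date


def weekly_review(backlog, today, top_n=3, max_age_days=90):
    """Split the backlog into this week's build queue and stale items.

    Stale items (older than max_age_days) are held back for rebuild-or-archive
    rather than queued; that exclusion is an assumption of this sketch.
    """
    stale = [i for i in backlog if (today - i["added"]).days > max_age_days]
    fresh = [i for i in backlog if (today - i["added"]).days <= max_age_days]
    build_queue = sorted(fresh, key=lambda i: i["score"], reverse=True)[:top_n]
    return build_queue, stale


today = date(2026, 5, 18)  # a Monday review session

# Hypothetical backlog with draft scores and the date each idea was added.
backlog = [
    {"name": "Cart drawer upsell", "score": 7.2, "added": date(2026, 5, 4)},
    {"name": "Hero image test", "score": 4.8, "added": date(2026, 4, 20)},
    {"name": "Size-guide modal", "score": 6.5, "added": date(2026, 3, 1)},
    {"name": "Old nav redesign", "score": 8.1, "added": date(2026, 1, 5)},
    {"name": "Shipping estimate", "score": 5.9, "added": date(2026, 5, 11)},
]

build_queue, stale = weekly_review(backlog, today)
print("build:", [i["name"] for i in build_queue])
print("rebuild or archive:", [i["name"] for i in stale])
```

Note that the highest-scoring item here is also the stale one, which is the case the 90-day rule exists to catch: its score was set on evidence that is now over four months old.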

The output is a living queue, not a static spreadsheet. Pair this with a healthy experiment backlog process and impact estimation methodology so the inputs you're scoring against — baseline rates, segment sizes, prior win rates — are always fresh. When a test ships, the result feeds back into Confidence scoring for the next round of similar hypotheses.

[Chart: How ICE, PIE, and RICE rank the same five backlog ideas. Series: ICE, PIE, RICE.]

Frequently asked

Experiment prioritization FAQ

What's the difference between ICE and RICE scoring?

ICE scores Impact, Confidence, and Ease: three inputs, fast to apply. RICE adds a fourth term, Reach, and replaces Ease with Effort (in person-weeks). Use ICE when reach is roughly constant across ideas; use RICE when experiments touch wildly different audience sizes, like one variant on a global homepage versus one on a single product page.

Which prioritization framework is best for a small CRO team?

ICE, almost always. It takes 60 seconds per idea, fits in a single spreadsheet, and removes 80% of the prioritization arguments. Graduate to RICE only when you have enough test volume that reach genuinely varies between experiments, typically 6+ tests per month.

How do I score Confidence honestly?

Tie the score to evidence:

9-10: prior winning test in the same pattern on your own site.
7-8: strong qualitative plus quantitative signal (recordings plus a measured drop-off).
4-6: one source of evidence.
1-3: someone's opinion.

If you can't name the evidence, the score doesn't go above 3.
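The evidence rubric above maps naturally onto a lookup that caps a Confidence score by the evidence on hand. The sketch below is one hypothetical way to encode it; the function names and the three boolean evidence flags are assumptions made for illustration.

```python
def confidence_range(prior_win=False, qualitative=False, quantitative=False):
    """Return the (low, high) Confidence band the evidence supports."""
    if prior_win:
        return (9, 10)   # prior winning test in the same pattern
    if qualitative and quantitative:
        return (7, 8)    # recordings plus a measured drop-off
    if qualitative or quantitative:
        return (4, 6)    # one source of evidence
    return (1, 3)        # opinion only


def is_defensible(score, **evidence):
    """True when the score does not exceed what the evidence supports."""
    _, high = confidence_range(**evidence)
    return score <= high


# "The agency suggested it" with no supporting data: an 8 is not defensible.
print(is_defensible(8))                                        # False
print(is_defensible(3))                                        # True
print(is_defensible(8, qualitative=True, quantitative=True))   # True
```

The check only enforces a ceiling, since inflation is the failure mode being guarded against; a reviewer can still argue a score down within its band.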

Should designers and PMs score the same idea, or just the CRO lead?

Have two people score independently, then reconcile. The biggest score inflation comes from the person who proposed the idea also being the one who scores it. A second reviewer (usually the CRO lead, plus a dev for the Effort number) catches 90% of the bias before it hits the queue.

How big should my experiment backlog be?

Three to four times your monthly test capacity. If you ship 4 tests a month, keep 12-16 prioritized ideas ready. More than that and the backlog becomes stale; less and you'll run out of high-confidence options after a couple of losing tests.
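The sizing rule above is a one-line calculation; a short sketch makes the thresholds explicit. The function names and status labels are hypothetical.

```python
def backlog_target(tests_per_month):
    """Return the (min, max) number of prioritized ideas to keep ready."""
    return 3 * tests_per_month, 4 * tests_per_month


def backlog_status(ready_ideas, tests_per_month):
    """Classify the backlog against the three-to-four-times rule."""
    low, high = backlog_target(tests_per_month)
    if ready_ideas < low:
        return "too thin"
    if ready_ideas > high:
        return "going stale"
    return "healthy"


print(backlog_target(4))        # (12, 16)
print(backlog_status(9, 4))     # too thin
print(backlog_status(14, 4))    # healthy
print(backlog_status(40, 4))    # going stale
```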

What's PIE and when should I use it over ICE?

PIE (Potential, Importance, Ease) was developed by WiderFunnel. The differentiator is Importance, which weights pages by business value. Use PIE when your traffic is concentrated on a few high-value pages (checkout, cart, top PDPs) and you want the framework itself to push tests toward those pages.

How does impact estimation feed into prioritization?

Impact estimation is the calculation behind the Impact score. Instead of guessing 1-10, you compute the expected lift in revenue from baseline conversion rate, traffic volume, and a realistic improvement range. That number then gets normalised onto the 1-10 scale used by ICE or PIE.
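As a rough sketch of that calculation, the code below estimates monthly revenue lift as sessions × baseline conversion rate × relative lift × average order value, using the midpoint of the improvement range, then rescales the results linearly onto 1-10. The article does not specify either step, so the midpoint and the min-max scaling are assumptions, and all figures are invented for illustration.

```python
def expected_monthly_lift(sessions, baseline_cr, aov, lift_low, lift_high):
    """Expected extra revenue per month at the midpoint of the lift range."""
    mid_lift = (lift_low + lift_high) / 2
    return sessions * baseline_cr * mid_lift * aov


def to_impact_scores(lifts):
    """Linearly rescale revenue estimates onto a 1-10 Impact scale."""
    lo, hi = min(lifts.values()), max(lifts.values())
    if hi == lo:
        return {name: 5 for name in lifts}  # no spread: give everything a mid score
    return {name: round(1 + 9 * (v - lo) / (hi - lo)) for name, v in lifts.items()}


# Hypothetical ideas: traffic, baseline rate, order value, relative lift range.
lifts = {
    "PDP add-to-cart": expected_monthly_lift(100_000, 0.031, 80, 0.02, 0.06),
    "Thank-you upsell": expected_monthly_lift(3_100, 0.05, 20, 0.10, 0.30),
    "Checkout fix": expected_monthly_lift(20_000, 0.40, 80, 0.01, 0.03),
}

impact_scores = to_impact_scores(lifts)
for name, score in impact_scores.items():
    print(f"{score:>2}  {name}  (~{lifts[name]:,.0f}/month)")
```

Because the scaling is relative to the current backlog, Impact scores shift whenever a much larger or smaller idea enters the list. A fixed revenue band per score would avoid that at the cost of needing periodic recalibration.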

Can AI generate prioritized hypotheses automatically?

Yes, if it has access to your funnel data. Tools that ingest GA4 plus session-recording signals can surface the biggest drop-offs, propose hypotheses, and pre-score Impact and Reach using real numbers. You still want a human to score Confidence, since that requires judgement about your specific brand and audience.

How do I handle a stakeholder pushing a low-score idea?

Show the score, show the inputs, and ask what evidence would raise Confidence. Either they bring evidence and the score legitimately rises, or the idea stays where it is. The framework is the cover: it depersonalises the conversation so it's not your opinion versus theirs.

How often should I re-score the backlog?

Re-score Confidence and Impact every time a related test ships, since the result updates your prior. Re-score the whole backlog quarterly to retire stale ideas. Effort scores should be reviewed any time your stack changes; a Shopify theme migration, for example, can collapse a 5-day build into a 1-day one.


