Experiment Prioritization

Experiment prioritization decides what to test next. ICE, PIE, and RICE rank backlog items by expected impact, confidence, and effort so your team always ships the highest-leverage bet.
The process of ranking a CRO test backlog so the highest-impact, highest-confidence, lowest-effort experiments ship first.
Experiment prioritization is how a CRO team decides what to test next when the backlog has 40 ideas and the calendar has room for four. Rather than picking the loudest opinion in the room, the team scores each candidate against a shared rubric — typically some combination of expected impact, confidence in the hypothesis, and effort to build — and sorts the list.
The three frameworks you'll meet in practice are ICE (Impact, Confidence, Ease), PIE (Potential, Importance, Ease), and RICE (Reach, Impact, Confidence, Effort). They all do the same job: turn subjective debate into a ranked queue that anyone can defend, and keep the team working on bets that actually move revenue rather than on whoever shouted last.
Most CRO programs don't fail because the team runs bad tests. They fail because the team runs the wrong tests in the wrong order — burning two weeks of traffic on a button-colour tweak while a broken mobile checkout quietly bleeds 8% of sessions.
Prioritization fixes that by forcing every backlog item through the same numeric filter. A score of 7.2 beats a score of 4.8 regardless of who proposed it, which is exactly the political cover a CRO lead needs when the head of brand wants to test their pet hero image.
The three frameworks you'll actually use
ICE is the fastest: score Impact, Confidence, and Ease from 1-10, average the three, sort descending. It's perfect for a small team running 2-4 tests a month on a single Shopify storefront where reach is roughly constant across every idea.
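Here is a minimal Python sketch of the ICE step described above. It assumes integer scores on the 1-10 scale combined as a plain average (some teams multiply the three inputs instead). The `IceIdea` class, the `rank_ice` function, and the three backlog ideas are hypothetical, made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class IceIdea:
    name: str
    impact: int      # 1-10
    confidence: int  # 1-10
    ease: int        # 1-10

    def score(self) -> float:
        # ICE as described above: the plain average of the three inputs.
        for value in (self.impact, self.confidence, self.ease):
            if not 1 <= value <= 10:
                raise ValueError(f"{self.name}: scores must be 1-10, got {value}")
        return (self.impact + self.confidence + self.ease) / 3


def rank_ice(ideas: list[IceIdea]) -> list[tuple[str, float]]:
    # Sort descending so the highest-leverage bet sits at the top of the queue.
    scored = [(idea.name, round(idea.score(), 1)) for idea in ideas]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical backlog for illustration only.
backlog = [
    IceIdea("Fix mobile checkout bug", impact=9, confidence=8, ease=6),
    IceIdea("Hero image swap", impact=4, confidence=3, ease=9),
    IceIdea("Sticky add-to-cart on PDP", impact=7, confidence=6, ease=7),
]
for name, score in rank_ice(backlog):
    print(f"{score:>4}  {name}")
```

Because every idea is scored on the same 1-10 inputs, the scores are directly comparable. That comparability is what lets a 7.2 beat a 4.8 regardless of who proposed the idea.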
PIE (Potential, Importance, Ease) is ICE's older cousin from WiderFunnel. The twist is that Importance weights the business value of the page being tested, so a checkout-page experiment outranks a blog-page experiment even when raw lift potential looks similar.
RICE adds an explicit Reach term, which matters once you're running experiments across multiple templates, locales, or Shopify Markets, where one variant might touch 200k sessions and another only 12k. In Intercom's original formulation the score is Reach × Impact × Confidence ÷ Effort, so a bigger Effort estimate pushes an idea down the queue.
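The PIE and RICE arithmetic can be sketched the same way. This version makes three assumptions. PIE is a plain average of three 1-10 inputs. RICE follows Reach × Impact × Confidence ÷ Effort, with Confidence as a fraction between 0 and 1. Effort is measured in person-weeks. The function names and example inputs are hypothetical, apart from the 200k and 12k session counts taken from the text.

```python
def pie_score(potential: float, importance: float, ease: float) -> float:
    # PIE: plain average of three 1-10 inputs.
    # Importance carries the business value of the page under test.
    return (potential + importance + ease) / 3


def rice_score(reach: int, impact: float, confidence: float, effort_weeks: float) -> float:
    # RICE: Reach x Impact x Confidence, divided by Effort.
    # Unlike Ease, a larger Effort number pushes the score down.
    if effort_weeks <= 0:
        raise ValueError("effort_weeks must be positive")
    if not 0 <= confidence <= 1:
        raise ValueError("confidence is a fraction between 0 and 1")
    return reach * impact * confidence / effort_weeks


# PIE: same lift potential and ease, but checkout matters far more than the blog.
checkout = pie_score(potential=6, importance=10, ease=5)
blog = pie_score(potential=6, importance=3, ease=5)

# RICE: identical ideas except for how many sessions each variant touches.
global_homepage = rice_score(reach=200_000, impact=2, confidence=0.5, effort_weeks=2)
single_template = rice_score(reach=12_000, impact=2, confidence=0.5, effort_weeks=2)

print(f"PIE  checkout={checkout:.1f} blog={blog:.1f}")
print(f"RICE homepage={global_homepage:,.0f} template={single_template:,.0f}")
```

Because Reach is multiplied in, RICE scores do not land on a 1-10 scale. Compare them only against other RICE scores.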
Scoring the inputs without lying to yourself
Impact is the lift you'd expect if the variant wins. Anchor it in real numbers from your analytics (drop-off rate at the step you're targeting, current conversion rate, average order value) rather than in vibes. A product-page (PDP) test on an apparel store with a 3.1% baseline conversion rate and a known 22% add-to-cart drop has a different impact ceiling than a thank-you-page upsell.
Confidence is where most teams cheat. Score it against the evidence you actually have: session recordings showing the friction, a quant funnel showing the drop, prior tests in the same pattern. If the only evidence is "the agency suggested it," Confidence is a 3, not an 8.
Ease (Effort, in RICE) comes from engineering days plus QA time. Be honest about whether your dev team can ship the variant without a sprint review.
The confidence-inflation trap
Teams systematically over-score Confidence on ideas they like and under-score it on ideas they don't. Calibrate by going back six months and checking how many of your 8+ Confidence tests actually won. If the hit rate is under 40%, your scoring is biased and you need a second reviewer on every score before the test enters the queue.
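The calibration check above can be scripted against your test log. This is a minimal sketch that assumes each past test is recorded as a `(shipped_on, confidence, won)` tuple and approximates six months as 180 days. The function name, the record shape, and the history below are hypothetical.

```python
from datetime import date, timedelta
from typing import Iterable, Optional


def high_confidence_hit_rate(
    tests: Iterable[tuple[date, int, bool]],
    today: date,
    window_days: int = 180,  # roughly six months
    threshold: int = 8,      # "8+ Confidence"
) -> Optional[float]:
    cutoff = today - timedelta(days=window_days)
    # Keep only recent tests that were scored with high confidence.
    outcomes = [won for shipped_on, confidence, won in tests
                if shipped_on >= cutoff and confidence >= threshold]
    if not outcomes:
        return None  # nothing to calibrate against yet
    return sum(outcomes) / len(outcomes)


# Hypothetical test log.
history = [
    (date(2024, 1, 10), 9, False),
    (date(2024, 2, 3), 8, True),
    (date(2024, 3, 15), 8, False),
    (date(2024, 4, 2), 5, True),   # below the threshold, ignored
    (date(2024, 5, 20), 9, False),
]
rate = high_confidence_hit_rate(history, today=date(2024, 6, 1))
needs_second_reviewer = rate is not None and rate < 0.40
print(rate, needs_second_reviewer)
```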
Running the backlog as a system
Prioritization isn't a one-off ceremony — it's a weekly cadence. New ideas enter the backlog with a draft score, the CRO lead reviews scores in a 30-minute Monday session, and the top three move into the build queue. Anything sitting in the backlog for more than 90 days without progressing either gets rebuilt with new evidence or archived.
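One way to encode that Monday session is sketched below, under three assumptions. Each backlog item carries a current score and the date it last progressed (the `last_progressed` field is hypothetical). The top three fresh ideas go to the build queue. Anything idle for more than 90 days is set aside to be rebuilt with new evidence or archived. The function and the example items are also hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class BacklogItem:
    name: str
    score: float
    last_progressed: date  # hypothetical: last time the idea gained evidence or moved stage


def monday_triage(backlog: list[BacklogItem], today: date,
                  build_slots: int = 3, stale_after_days: int = 90):
    fresh: list[BacklogItem] = []
    stale: list[BacklogItem] = []
    # Pull stale ideas out first so they can't sneak into the build queue.
    for item in backlog:
        idle_days = (today - item.last_progressed).days
        (stale if idle_days > stale_after_days else fresh).append(item)
    ranked = sorted(fresh, key=lambda item: item.score, reverse=True)
    build_queue, waiting = ranked[:build_slots], ranked[build_slots:]
    return build_queue, waiting, stale  # stale: rebuild with new evidence or archive


backlog = [
    BacklogItem("Sticky add-to-cart", 7.2, date(2024, 5, 27)),
    BacklogItem("Hero image swap", 4.8, date(2024, 5, 20)),
    BacklogItem("Shipping-cost banner", 6.5, date(2024, 2, 1)),
    BacklogItem("Trust badges at checkout", 5.9, date(2024, 5, 30)),
    BacklogItem("Size-guide modal", 6.1, date(2024, 4, 15)),
]
build, waiting, stale = monday_triage(backlog, today=date(2024, 6, 3))
print([i.name for i in build], [i.name for i in waiting], [i.name for i in stale])
```

Filtering for staleness before ranking means a high but outdated score cannot take a build slot.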
The output is a living queue, not a static spreadsheet. Pair the weekly review with a healthy experiment backlog process and an impact estimation methodology, so the inputs you're scoring against (baseline rates, segment sizes, prior win rates) are always fresh. When a test ships, its result feeds back into Confidence scoring for the next round of similar hypotheses.
Figure: How ICE, PIE, and RICE rank the same five backlog ideas (panels: ICE, PIE, RICE)
Experiment prioritization FAQ
What's the difference between ICE and RICE?
ICE scores Impact, Confidence, and Ease: three inputs, fast to apply. RICE adds a fourth term, Reach, and replaces Ease with Effort (in person-weeks), so build cost now counts against an idea: more Effort means a lower score. Use ICE when reach is roughly constant across ideas. Use RICE when experiments touch wildly different audience sizes, like one variant on a global homepage versus one on a single product page.
Which framework should a small team start with?
ICE, almost always. It takes 60 seconds per idea, fits in a simple spreadsheet, and removes 80% of the prioritization arguments. Graduate to RICE only when you have enough test volume that reach genuinely varies between experiments, typically 6+ tests per month.
How do you score Confidence consistently?
Tie the score to evidence. 9-10: a prior winning test in the same pattern on your own site. 7-8: strong qualitative plus quantitative signal (recordings plus a measured drop-off). 4-6: one source of evidence. 1-3: someone's opinion. If you can't name the evidence, the score isn't above 4.
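The rubric above maps directly onto a lookup. This minimal sketch tags evidence with three hypothetical labels: `prior_win` for a winning test in the same pattern on your own site, `recordings` for qualitative signal, and `funnel_drop` for a measured drop-off. A proposed score is then clamped into the band its evidence allows. The function names are also hypothetical.

```python
def confidence_band(evidence: set[str]) -> range:
    # Strongest evidence first, following the 9-10 / 7-8 / 4-6 / 1-3 rubric.
    if "prior_win" in evidence:
        return range(9, 11)
    qualitative = "recordings" in evidence
    quantitative = "funnel_drop" in evidence
    if qualitative and quantitative:
        return range(7, 9)
    if qualitative or quantitative:
        return range(4, 7)
    return range(1, 4)  # opinion only


def clamp_confidence(proposed: int, evidence: set[str]) -> int:
    # Pull the proposed score into the band the evidence supports.
    band = confidence_band(evidence)
    return min(max(proposed, band.start), band.stop - 1)


print(clamp_confidence(8, set()))                          # agency suggestion only
print(clamp_confidence(8, {"recordings"}))                 # one source of evidence
print(clamp_confidence(8, {"recordings", "funnel_drop"}))  # qual plus quant
```

Clamping also lifts an under-scored idea to its band's floor. That is a design choice; a stricter version would only cap the score.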
How do you keep bias out of the scores?
Have two people score independently, then reconcile. The biggest source of score inflation is the person who proposed the idea also being the one who scores it. A second reviewer (usually the CRO lead, plus a dev for the Effort number) catches 90% of the bias before it hits the queue.
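Below is a sketch of the reconcile step, under assumptions the text doesn't spell out. The proposer and the CRO lead each score Impact and Confidence. The dev's Ease number is taken as-is. Close scores settle on the lower value, since inflation is the bigger risk. Gaps of more than two points go to a discussion. The function name and the two-point threshold are hypothetical.

```python
def reconcile(proposer: dict[str, int], reviewer: dict[str, int],
              dev_ease: int, max_gap: int = 2) -> tuple[dict[str, int], list[str]]:
    agreed = {"ease": dev_ease}  # the dev owns the Ease/Effort number
    disputed: list[str] = []
    for field in ("impact", "confidence"):
        a, b = proposer[field], reviewer[field]
        if abs(a - b) > max_gap:
            disputed.append(field)  # too far apart: talk it through before queueing
        else:
            agreed[field] = min(a, b)  # close enough: take the more conservative score
    return agreed, disputed


agreed, disputed = reconcile(
    proposer={"impact": 8, "confidence": 8},
    reviewer={"impact": 7, "confidence": 4},
    dev_ease=5,
)
print(agreed, disputed)
```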
How many ideas should the prioritized backlog hold?
Three to four times your monthly test capacity. If you ship 4 tests a month, keep 12-16 prioritized ideas ready. Any more and the backlog goes stale; any fewer and you'll run out of high-confidence options after a couple of losing tests.
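The sizing rule is simple arithmetic. Here it is as a sketch, with hypothetical function names and status labels:

```python
def target_backlog(monthly_capacity: int) -> range:
    # Three to four times monthly test capacity, inclusive.
    return range(3 * monthly_capacity, 4 * monthly_capacity + 1)


def backlog_health(prioritized_ideas: int, monthly_capacity: int) -> str:
    target = target_backlog(monthly_capacity)
    if prioritized_ideas < target.start:
        return "too thin"     # risk running dry after a few losing tests
    if prioritized_ideas > target.stop - 1:
        return "going stale"  # more ideas than you'll test before they age
    return "healthy"


for count in (10, 14, 20):
    print(count, backlog_health(count, monthly_capacity=4))
```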
What is the PIE framework?
PIE (Potential, Importance, Ease) was developed by WiderFunnel. The differentiator is Importance, which weights pages by business value. Use PIE when your traffic is concentrated on a few high-value pages (checkout, cart, top PDPs) and you want the framework itself to push tests toward those pages.
What is impact estimation?
Impact estimation is the calculation behind the Impact score. Instead of guessing a number from 1 to 10, you compute the expected revenue lift from baseline conversion rate, traffic volume, average order value, and a realistic improvement range. That number is then normalised onto the 1-10 scale used by ICE or PIE.
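Here is a minimal sketch of impact estimation with two assumptions. Monthly revenue lift = monthly sessions × baseline conversion rate × average order value × relative lift, using the midpoint of a realistic lift range. Normalisation is linear, so the largest estimate in the backlog scores 10. The normalisation rule, the function names, and the inputs (apart from the 3.1% baseline from the PDP example earlier) are hypothetical. A log scale is a reasonable alternative when estimates span orders of magnitude.

```python
def expected_monthly_lift(sessions: int, baseline_cr: float, aov: float,
                          lift_low: float, lift_high: float) -> float:
    # Revenue the page currently generates, times the midpoint of the lift range.
    baseline_revenue = sessions * baseline_cr * aov
    relative_lift = (lift_low + lift_high) / 2
    return baseline_revenue * relative_lift


def to_impact_score(value: float, backlog_max: float) -> int:
    # Linear normalisation onto 1-10: the biggest estimate in the backlog scores 10.
    if backlog_max <= 0:
        return 1
    return max(1, min(10, round(10 * value / backlog_max)))


# Hypothetical ideas: (monthly sessions, baseline CR, AOV, low lift, high lift).
estimates = {
    "PDP sticky add-to-cart": expected_monthly_lift(50_000, 0.031, 80.0, 0.02, 0.06),
    "Homepage banner": expected_monthly_lift(120_000, 0.020, 80.0, 0.005, 0.015),
}
top = max(estimates.values())
for idea, value in estimates.items():
    print(f"{idea}: ~{value:,.0f}/month -> Impact {to_impact_score(value, top)}")
```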
Can a tool prioritize the backlog for you?
Yes, if it has access to your funnel data. Tools that ingest GA4 plus session-recording signals can surface the biggest drop-offs, propose hypotheses, and pre-score Impact and Reach using real numbers. You still want a human to score Confidence, since that requires judgement about your specific brand and audience.
What if a stakeholder disagrees with a score?
Show the score, show the inputs, and ask what evidence would raise Confidence. Either they bring evidence and the score legitimately rises, or the idea stays where it is. The framework is the cover: it depersonalises the conversation, so it's not your opinion versus theirs.
How often should you re-score the backlog?
Re-score Confidence and Impact every time a related test ships, since the result updates your prior. Re-score the whole backlog quarterly to retire stale ideas. Review Effort scores any time your stack changes: a Shopify theme migration, for example, can collapse a 5-day build into a 1-day one.