How to use ICE Framework

Metricuno

May 17, 2026

How the ICE framework scores Impact, Confidence, and Ease to rank A/B test ideas, with worked examples, benchmarks, and when to use PIE or RICE instead.

Quick answer

The ICE framework ranks experiment ideas on Impact, Confidence, and Ease — a fast first-pass scoring method for CRO backlogs. Here's how to apply it without falling into its well-known traps.

Definition

Experimentation

ICE Framework

A scoring method that ranks experiment ideas by Impact, Confidence, and Ease — each rated 1-10 — to prioritise a CRO backlog.

The ICE framework is a lightweight experiment prioritisation method popularised by Sean Ellis. Each test idea gets three scores from 1 to 10: Impact (how much it could move the metric), Confidence (how sure you are it will work), and Ease (how cheap and fast it is to ship). The three numbers are averaged or multiplied to produce a single rank.

Its strength is speed — a team can score 40 ideas in an hour and walk out with a ranked backlog. Its weakness is the same: scores are subjective, and the method systematically favours easy wins over high-impact bets. Most CRO teams use ICE for first-pass triage and switch to a heavier framework like RICE or PXL for the final shortlist.

Also known as

ICE scoring

ICE prioritisation

Impact-Confidence-Ease

ICE earns its place because most stores have a 60-idea backlog and a two-test-per-week capacity. Without a scoring step, the loudest stakeholder picks what ships next — and that's how you end up retesting button colours in February while the checkout still has a guest-checkout bug from October.

If you've never formally prioritised your backlog, ICE is the right place to start. It takes 90 minutes, requires no historical data, and produces a defensible ranked list you can show your Head of E-commerce. Once you outgrow it — usually around month four — you graduate to a weighted framework that accounts for traffic volume and segment reach.

How the three scores work

Impact is your honest guess at how much this test moves the primary metric. A new hero image on the homepage might be a 4. Removing a forced account-creation step at checkout is probably a 9. Anchor the scale to your own history: a 10 is a test that, if it wins, would be the best test you ran all year.

Confidence is how sure you are the test will produce a positive result. It's driven by evidence — session recordings, heatmaps, GA4 funnel data, prior wins, qualitative survey responses. A hypothesis backed by three data sources scores 8-9. A hunch from a Slack thread scores 3.

Ease is the inverse of cost. How many developer hours, design rounds, and review cycles does shipping this take? A Shopify theme tweak you can deploy from the admin scores 9-10. A custom cart drawer that touches Liquid templates and Klaviyo events scores 3-4.
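To make the three scores concrete, here is a minimal sketch of how a single idea could be recorded. The `IceIdea` class and its field names are hypothetical, not part of any tool the article mentions. The Impact values follow the examples above; the Confidence and Ease values are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IceIdea:
    """One test idea with its three ICE scores (hypothetical structure)."""

    name: str
    impact: int      # how much the test could move the primary metric
    confidence: int  # how strong the supporting evidence is
    ease: int        # inverse of cost: higher means cheaper and faster to ship

    def __post_init__(self) -> None:
        # Reject anything outside the 1-10 scale the framework uses.
        for field_name in ("impact", "confidence", "ease"):
            value = getattr(self, field_name)
            if not 1 <= value <= 10:
                raise ValueError(f"{field_name} must be 1-10, got {value}")


# Illustrative scores only; calibrate against your own test history.
checkout = IceIdea("Remove forced account creation at checkout", 9, 8, 7)
hero = IceIdea("New homepage hero image", 4, 4, 9)
print(checkout)
print(hero)
```

Validating the range at creation time keeps a stray 0 or 11 from distorting the ranking later.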

Multiply or average — pick one and stick with it

Averaging, (I+C+E)/3, keeps scores in the 1-10 range and is forgiving of a single low score. Multiplying, I*C*E (range 1-1000), punishes weak scores harshly: a 9-9-2 idea scores 162, while a 6-6-6 idea scores 216, even though averaging ranks them the other way round (6.7 versus 6.0). Multiplication tends to surface more balanced bets; averaging surfaces ideas with one standout strength. Don't mix methods inside a single backlog, or the rankings become meaningless.
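The difference between the two methods is easy to check by hand, but a short sketch makes the rank flip visible. The function names here are illustrative; the two score triples are the ones used in the paragraph above.

```python
def ice_average(impact: int, confidence: int, ease: int) -> float:
    """Mean of the three scores; stays on the 1-10 scale."""
    return (impact + confidence + ease) / 3


def ice_product(impact: int, confidence: int, ease: int) -> int:
    """Product of the three scores; ranges from 1 to 1000."""
    return impact * confidence * ease


lopsided = (9, 9, 2)  # two strong scores, one weak
balanced = (6, 6, 6)  # no standout, no weakness

for label, scores in (("9-9-2", lopsided), ("6-6-6", balanced)):
    print(label, round(ice_average(*scores), 1), ice_product(*scores))
# 9-9-2 6.7 162
# 6-6-6 6.0 216
```

Averaging puts the lopsided idea first; multiplying puts the balanced idea first. That is why switching methods mid-backlog scrambles the ranking.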

When ICE is the right tool — and when it isn't

Use ICE when your backlog is unscored, your testing programme is under six months old, or you need to triage 50+ ideas down to a shortlist of 10 before the quarter starts. It's also the right call when stakeholders disagree on what to test next — three numbers per idea forces the disagreement into the open.

Skip ICE when your tests have wildly different traffic reach (a homepage test touches 100% of visitors; a returns-page test touches 3%). The framework has no reach term, so it silently overweights low-traffic surfaces. In that case, jump to RICE, which adds explicit reach. The same applies if you're testing across multiple markets — ICE can't see that your French store has a tenth of the traffic of your UK store.
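The sketch below shows why a reach term matters. It uses the commonly cited RICE form, Reach x Impact x Confidence / Effort. Two things are assumptions made for illustration: reusing 1-10 scores for Impact and Confidence, and the figure of 100,000 monthly visitors. The 100% and 3% exposure shares come from the homepage and returns-page example above.

```python
def rice_score(reach: float, impact: float, confidence: float, effort: float) -> float:
    """Reach x Impact x Confidence / Effort (commonly cited RICE form)."""
    if effort <= 0:
        raise ValueError("effort must be positive")
    return reach * impact * confidence / effort


# Hypothetical traffic: 100,000 monthly visitors in total.
homepage_reach = 100_000  # 100% of visitors see the homepage
returns_reach = 3_000     # 3% of visitors see the returns page

# Identical Impact, Confidence, and Effort, so ICE alone cannot separate them.
homepage_test = rice_score(homepage_reach, impact=6, confidence=6, effort=2)
returns_test = rice_score(returns_reach, impact=6, confidence=6, effort=2)

print(homepage_test, returns_test)
print(round(homepage_test / returns_test, 1))
```

With everything else equal, the two scores differ by exactly the ratio of their reach, which is the signal ICE has no way to express.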

Chart: Ranked backlog from a typical apparel-store ICE session

The pattern above is typical: checkout and PDP tests cluster at the top because both Impact and Confidence are usually high (you have the funnel data to back them up). Footer and homepage hero tests sink because Impact is low — they touch users who've already decided. This is the exact ranking insight ICE is designed to surface.

Benchmark scoring across common test types

Newer teams over-score Impact and under-score Ease — everything feels important and everything feels hard. The fix is to anchor against a reference table the team agrees on before scoring starts. Below is a starting point calibrated to a Shopify store doing €3-8M annual revenue.

Use it as a discussion prompt, not gospel. If your team consistently scores everything 2-3 points higher than this table, you have a calibration problem, not a backlog problem.

Benchmark

Typical ICE scores for common e-commerce test types

Test type	Impact	Confidence	Ease	Avg ICE
Checkout: remove guest-checkout friction	9	8	7	8.0
PDP: sticky add-to-cart on mobile	7	8	8	7.7
Cart: free-shipping progress bar	6	7	9	7.3
PLP: new filter UI	6	5	4	5.0
Homepage hero image swap	4	4	9	5.7
Trust badges below CTA	5	4	9	6.0
Footer redesign	2	5	6	4.3

Notice the homepage hero scores higher than the PLP filter despite lower Impact and Confidence — that's the Ease bias at work. A 9 on Ease drags the average up. If you're running ICE long enough to see this pattern emerge in your backlog, it's a strong signal to move to a weighted method as part of your broader experiment prioritisation process.
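As a rough check on the Ease bias, the sketch below re-ranks the benchmark table by plain average and prints an Impact-weighted average next to it. The weighting, (2*I + C + E)/4, is the double-Impact variant this article suggests in its FAQ; the variable and function names are illustrative.

```python
# (test type, Impact, Confidence, Ease) copied from the benchmark table.
BENCHMARK = [
    ("Checkout: remove guest-checkout friction", 9, 8, 7),
    ("PDP: sticky add-to-cart on mobile", 7, 8, 8),
    ("Cart: free-shipping progress bar", 6, 7, 9),
    ("PLP: new filter UI", 6, 5, 4),
    ("Homepage hero image swap", 4, 4, 9),
    ("Trust badges below CTA", 5, 4, 9),
    ("Footer redesign", 2, 5, 6),
]


def average(impact: int, confidence: int, ease: int) -> float:
    return (impact + confidence + ease) / 3


def impact_weighted(impact: int, confidence: int, ease: int) -> float:
    # Impact counts twice, so the divisor is 4 rather than 3.
    return (2 * impact + confidence + ease) / 4


by_average = sorted(BENCHMARK, key=lambda row: average(*row[1:]), reverse=True)

for name, i, c, e in by_average:
    print(f"{average(i, c, e):.1f}  {impact_weighted(i, c, e):.2f}  {name}")
```

On these numbers the hero swap beats the PLP filter on plain average (5.7 versus 5.0), while the Impact-weighted version draws them level at 5.25. The weighting narrows the Ease bias here but does not reverse it.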

Common failure modes and how to avoid them

The first failure mode is solo scoring. One person scoring 40 ideas in a spreadsheet produces a ranked list that reflects one person's biases. Run ICE as a group exercise — three to five people score independently, then discuss any idea where scores diverge by more than 2 points. The discussions are where the real prioritisation happens; the numbers are just the prompt.
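The "diverge by more than 2 points" rule can be checked mechanically once everyone has scored independently. This is a minimal sketch that assumes the spread is measured per dimension as highest score minus lowest; the scorer roles and the scores themselves are made up for illustration.

```python
DIMENSIONS = ("impact", "confidence", "ease")

# Independent scores per idea: scorer -> (Impact, Confidence, Ease).
scores_by_idea = {
    "Sticky add-to-cart on mobile": {
        "cro_lead": (7, 8, 8),
        "designer": (7, 7, 8),
        "developer": (6, 8, 5),
    },
    "New filter UI": {
        "cro_lead": (6, 5, 4),
        "designer": (6, 4, 4),
        "developer": (5, 5, 3),
    },
}


def divergent_dimensions(scores: dict, threshold: int = 2) -> list:
    """Return the dimensions where max - min exceeds the threshold."""
    flagged = []
    for index, dimension in enumerate(DIMENSIONS):
        values = [triple[index] for triple in scores.values()]
        if max(values) - min(values) > threshold:
            flagged.append(dimension)
    return flagged


for idea, scores in scores_by_idea.items():
    flagged = divergent_dimensions(scores)
    if flagged:
        print(f"Discuss '{idea}': scores diverge on {', '.join(flagged)}")
```

Here only the first idea is flagged, because the developer's Ease score sits 3 points below the others. That is the kind of gap the group discussion is meant to surface.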

The second is Confidence inflation. Teams confuse 'I really want this to work' with 'I have evidence this will work'. Require one citation for every Confidence score above 5: a recording timestamp, a GA4 segment, a heatmap region, or a Klaviyo flow stat. Without a citation, cap Confidence at 5. This single rule reshapes most backlogs within a quarter.
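The citation rule is simple enough to enforce in whatever sheet or script holds the backlog. A minimal sketch follows; the function name and the citation strings are hypothetical.

```python
def capped_confidence(confidence: int, citations: list[str], cap: int = 5) -> int:
    """Cap an uncited Confidence score; cited scores pass through unchanged."""
    if citations:
        return confidence
    return min(confidence, cap)


# A hunch with no evidence attached is pulled down to the cap.
print(capped_confidence(8, []))
# The same score backed by a citation keeps its value.
print(capped_confidence(8, ["GA4 checkout funnel segment"]))
# Scores already at or below the cap are unaffected.
print(capped_confidence(3, []))
```

Applying the cap before the average or product is computed means an inflated hunch cannot outrank an evidenced idea on Confidence alone.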

Re-score the top 10, not the whole backlog

Once you have a ranked list, don't re-run ICE on all 60 ideas every month. Re-score only the top 10 — they're the candidates for the next sprint, and they're the ones where new data (a fresh heatmap, last week's test result) actually changes the numbers. Bottom-half ideas almost never climb high enough to matter.

Frequently asked questions

Should I multiply or average the three ICE scores?

Average for a forgiving, balanced ranking; multiply when you want to harshly penalise ideas with one weak dimension. Most teams average because the 1-10 output is easier to reason about. Whichever you pick, apply it consistently across the whole backlog.

What's the difference between ICE and RICE?

RICE adds a Reach term (how many users will actually see the change) and divides by Effort instead of multiplying by Ease. RICE is more accurate when your tests vary in traffic exposure (homepage vs returns page), but takes longer to score. ICE is the faster first pass; RICE is the rigorous second pass.

How is ICE different from PIE?

PIE scores Potential, Importance, Ease. Potential is similar to Impact; Importance weights how much traffic the page gets. PIE is closer to RICE in spirit: it builds in a traffic signal. Use PIE when reach varies a lot but you don't want the full RICE overhead.

Who should score the ideas?

Three to five people: the CRO lead, a designer or UX researcher, a developer (for accurate Ease scores), and ideally someone from analytics or customer support (for Confidence grounding). Score independently first, then meet to discuss any idea where scores diverge by more than 2 points.

How often should we re-run ICE on the backlog?

Re-score the top 10 ideas every two-week sprint, and the full backlog once a quarter. Re-scoring everything every week is wasted effort: bottom-half ideas almost never climb enough to overtake the top tier in a single round.

Does ICE work for non-CRO experiments like email or paid ads?

Yes, the framework is metric-agnostic. Email teams score subject lines, send times, and segment splits the same way. The only adjustment is anchoring the Impact scale to the channel's own benchmarks: a 10 in email is a different absolute number than a 10 in on-site testing.

What's a good ICE score to greenlight a test?

There's no absolute threshold; it's relative. Test the top 10-20% of your backlog. If your highest-ranked idea scores 5.5 averaged, that's still the test you should run next. Worry about absolute scores only if your top idea scores below 4, which usually means you need more ideas, not better scoring.

How do I avoid the easy-idea bias in ICE?

Two fixes: weight Impact double in the average ((2*I + C + E)/4), or set a rule that at least one test in every two-week sprint must come from your top-Impact list regardless of Ease. The second approach forces the team to ship high-effort bets, not just low-hanging fruit.

Can I use ICE without any historical test data?

Yes, that's where ICE shines. Confidence scores will be lower across the board (3-5 range) because you don't have prior wins to anchor against, but the relative ranking still surfaces sensible priorities. After your first quarter of tests, Confidence calibration sharpens fast.

Where does ICE fit in a broader experimentation programme?

ICE is the first-pass triage layer inside a wider experiment prioritisation process. Idea capture feeds ICE; ICE produces a shortlist; the shortlist gets a deeper hypothesis review and either RICE or PXL scoring before tests ship. Treat ICE as the funnel's top, not its bottom.

Test ideas before you ship them

Run unlimited A/B tests, attach hypotheses to outcomes, and build a searchable archive of what works and what doesn't.