Experiment Backlogs

Q: What fields are required for every backlog entry?

At minimum: a unique slug, a structured hypothesis (if/then/because), the target surface, an evidence link (GA4 funnel, heatmap, support ticket), a score (ICE or RICE), and a status. Optional but useful: owner, primary metric, expected MDE, and the segment if you're targeting one.

Q: Should I use ICE or RICE to score my backlog?

Use ICE when your backlog is under ~30 items and most ideas affect similar traffic volumes. Switch to RICE once you're prioritising across surfaces with very different reach (PDP vs account page) — Reach prevents low-traffic pet projects from dominating the queue.

Q: How big should an experiment backlog be?

Aim for 30–60 active ideas in an established program. Below 15, you'll run out of tests before the next research cycle; above 120, the list becomes unreadable and ideas rot. The number that matters is throughput — tests shipped per month — not raw backlog count.

Q: Where should the backlog live — Notion, Airtable, or a dedicated tool?

Airtable and Notion are fine starting points because contributors already use them. Move to a dedicated experimentation tool once you need linked results, automated scoring, and reviewer workflows. The tool matters less than the discipline of one source of truth.

Q: How do I stop the backlog from filling up with junk ideas?

Require evidence on submission. Every entry needs at least one supporting data point — a GA4 drop-off, a session replay, a heuristic-review note, or a customer quote. "I think we should try…" without evidence gets rejected at intake, not buried in the queue.

Q: How does experiment prioritization relate to the backlog?

Prioritization is the scoring and ranking layer that operates on the backlog. The backlog is the inventory; prioritization is the policy that decides which item ships next. Without a backlog there's nothing to prioritise; without prioritization, the backlog is just a list.

Q: How often should we re-score the backlog?

Re-score the top 20 every two weeks before sprint planning, and do a full backlog grooming once a quarter. Reach and Confidence drift as traffic patterns and research evidence change, so a score from six months ago no longer reflects current conditions and should not drive ranking.

Q: What's a healthy win rate from a backlog?

20–25% of shipped tests should produce a clear winner in an established program; mature teams push 25–35%. If your win rate is below 15%, your backlog is too speculative — tighten intake to require stronger evidence. If it's above 40%, you're probably only testing safe bets and leaving upside on the table.
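The two warning thresholds in this answer can be turned into a quick check. The sketch below is illustrative only: the function name and the returned messages are hypothetical, and it assumes anything between 15% and 40% raises no warning.

```python
def diagnose_win_rate(wins: int, shipped: int) -> str:
    """Classify a program's win rate against the 15% and 40% warning thresholds."""
    if shipped <= 0:
        raise ValueError("shipped must be a positive number of tests")
    rate = wins / shipped
    if rate < 0.15:
        # Backlog is too speculative: require stronger evidence at intake.
        return "too speculative"
    if rate > 0.40:
        # Probably only safe bets are being tested.
        return "too safe"
    # Assumption: no warning is raised anywhere between the two thresholds.
    return "no warning"


print(diagnose_win_rate(wins=5, shipped=20))
```

Run it on a rolling window (for example the last two quarters of shipped tests) rather than on a single month, since a handful of tests gives a noisy rate.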

Q: Should losing tests stay in the backlog?

Move them to a separate "results archive" with the full readout, not back into the active backlog. Losers are valuable as future evidence ("we tried this in May with this variant, here's why it lost"), but they clog the active queue if you leave them mixed in.

Q: Who owns the backlog?

One named person, usually the CRO lead or growth manager, owns intake, scoring sign-off, and quarterly grooming. Anyone can submit; one person curates. Within a quarter, a backlog without a clear owner devolves into a shared document nobody trusts.

Metricuno

May 17, 2026

Experiment Backlogs — How to structure an experiment backlog that actually ships tests: required fields, ICE vs RICE scoring, healthy throughput benchmarks, and common failure modes.

Quick answer

An experiment backlog is the structured pipeline of test ideas waiting to ship. Done well, it's a CRO program's biggest compounding asset — done badly, it's a graveyard of half-formed hunches.

Definition

Experimentation

Experiment Backlog

A structured, scored pipeline of test ideas waiting to be run — each with a hypothesis, surface, and priority.

An experiment backlog is the central queue that holds every test idea your team has generated but not yet shipped. Each entry is more than a one-liner: it includes a slug, hypothesis, target surface (PDP, cart, checkout), supporting evidence, ICE or RICE score, and a status that tracks it from idea → designed → live → analysed.

A well-kept backlog is the single largest compounding asset of a CRO program. It turns scattered Slack messages and Hotjar replays into a ranked, reviewable list — so the next test you run is the highest-expected-value one, not just the loudest one in last Tuesday's meeting.

Also known as

test backlog

CRO backlog

experimentation pipeline

Most stores don't fail at experimentation because they lack ideas — they fail because ideas live in five different places. A backlog only earns its keep when every contributor (designer, growth marketer, agency lead, dev) submits into the same template with the same required fields.

The minimum viable schema is six columns: slug, hypothesis ("if we change X, then Y will improve, because Z"), surface, evidence link, score, and status. Anything thinner and you can't prioritise; anything fatter and contributors stop filling it in. Treat the backlog as a product, not a spreadsheet.
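One way to make the six-column schema concrete is a small typed record with an intake check. This is a minimal sketch, not a reference implementation: the class and field names are hypothetical, and the hypothesis check is a rough keyword test that will accept some malformed sentences.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    IDEA = "idea"
    DESIGNED = "designed"
    LIVE = "live"
    ANALYSED = "analysed"


@dataclass
class BacklogEntry:
    slug: str
    hypothesis: str      # "if we change X, then Y will improve, because Z"
    surface: str         # e.g. "PDP", "cart", "checkout"
    evidence_link: str
    score: float         # ICE or RICE
    status: Status = Status.IDEA

    def problems(self) -> list[str]:
        """Return intake problems; an empty list means the entry is acceptable."""
        issues = []
        for name in ("slug", "hypothesis", "surface", "evidence_link"):
            if not getattr(self, name).strip():
                issues.append(f"missing {name}")
        # Rough keyword check only: it looks for the three connecting words.
        text = self.hypothesis.lower()
        if not all(word in text for word in ("if", "then", "because")):
            issues.append("hypothesis is not in if/then/because form")
        return issues


entry = BacklogEntry(
    slug="header-copy",
    hypothesis="I think we should try a new header",
    surface="PDP",
    evidence_link="",
    score=10.0,
)
print(entry.problems())
```

Rejecting an entry when `problems()` is non-empty keeps weak ideas out at intake instead of leaving them to be found during grooming.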

Formula

RICE = (Reach × Impact × Confidence) / Effort

Variables

Reach

Number of visitors or sessions affected per month by the change.

Impact

Estimated effect on the primary metric per affected user, on a 0.25–3 scale (0.25 minimal, 3 massive).

Confidence

How sure you are about Reach and Impact, expressed as a percentage (50%, 80%, 100%).

Effort

Person-weeks required to design, build, QA, and ship the test.

Worked example

An apparel store scores a test that adds a sticky add-to-cart bar on mobile PDPs.

Reach (monthly mobile PDP sessions): 180,000

Impact (0.25–3 scale): 1.5

Confidence: 80%

Effort (person-weeks): 1.5

→ (180,000 × 1.5 × 0.80) / 1.5 = 144,000 RICE points

A score this high relative to a typical 20–40k baseline puts the test in the top decile of the backlog — it should run before any 'gut-feel' header copy tweak with no traffic behind it.
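The worked example above can be reproduced in a few lines. This is a sketch under the formula as stated; the function name is hypothetical, and Confidence is passed as a fraction (0.80) rather than a percentage.

```python
def rice_score(reach: float, impact: float, confidence: float, effort: float) -> float:
    """RICE = (Reach x Impact x Confidence) / Effort, with confidence as a fraction."""
    if effort <= 0:
        raise ValueError("effort must be greater than zero person-weeks")
    return reach * impact * confidence / effort


# Sticky add-to-cart bar on mobile PDPs, using the figures from the example.
score = rice_score(reach=180_000, impact=1.5, confidence=0.80, effort=1.5)
print(f"{score:,.0f}")  # 144,000
```

Because Impact and Effort are both 1.5 here, they cancel, and the score is simply Reach times Confidence. That is a coincidence of this example, not a property of the formula.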

RICE forces honesty about Reach. The most common backlog pathology is high-Impact, low-Reach ideas ("redesign the empty-cart state") crowding out boring, high-Reach winners ("clarify the shipping threshold above the fold"). For lighter prioritisation, ICE drops Reach and is faster to score — fine for backlogs under ~30 items, but it systematically over-weights pet projects on low-traffic surfaces.
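To see the pathology in numbers, the sketch below ranks the two example ideas both ways. All figures are hypothetical, and the `ice_like` scorer is a simplification that treats ICE as RICE without Reach; many teams instead score ICE as Impact × Confidence × Ease on 1 to 10 scales.

```python
# (slug, reach, impact, confidence, effort); every figure here is hypothetical
ideas = [
    ("empty-cart-redesign", 4_000, 3.0, 0.8, 2.0),
    ("shipping-threshold-copy", 150_000, 0.5, 0.8, 0.5),
]


def ice_like(idea):
    """Simplified ICE: same inputs as RICE, but Reach is ignored."""
    _, _, impact, confidence, effort = idea
    return impact * confidence / effort


def rice(idea):
    """RICE = (Reach x Impact x Confidence) / Effort."""
    _, reach, impact, confidence, effort = idea
    return reach * impact * confidence / effort


def rank(items, scorer):
    """Return slugs ordered from highest to lowest score."""
    return [item[0] for item in sorted(items, key=scorer, reverse=True)]


print("ICE-like order:", rank(ideas, ice_like))
print("RICE order:    ", rank(ideas, rice))
```

With these inputs the Reach-free score puts the empty-cart redesign first, while RICE puts the shipping-threshold change first, which is the reordering the paragraph describes.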

Benchmark

Healthy backlog metrics by program maturity

Maturity stage	Active backlog size	Tests shipped / month	Avg time idea → live	Win rate
Just starting (0–6 months)	15–25 ideas	1–2	5–7 weeks	15–20%
Established (6–18 months)	30–60 ideas	3–5	3–4 weeks	20–25%
Mature (18+ months)	60–120 ideas	6–10	2–3 weeks	25–35%
Stalled program (warning)	>150 ideas	<1	>8 weeks	<15%

Watch the stalled-program row. A backlog that swells past 150 entries with sub-monthly shipping isn't healthy — it's a hoarder's attic. Prune ruthlessly every quarter: archive anything scored more than two quarters ago without movement, and demote duplicates. The signal you want is throughput, not inventory.
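The quarterly pruning rule can be automated if each entry records when it was last scored and when its status last changed. The sketch below assumes hypothetical `last_scored` and `last_moved` date fields and treats two quarters as 182 days.

```python
from datetime import date, timedelta

TWO_QUARTERS = timedelta(days=182)  # assumption: two quarters is about 26 weeks


def split_stale(entries, today):
    """Return (keep, archive): archive entries neither scored nor moved in two quarters."""
    cutoff = today - TWO_QUARTERS
    keep, archive = [], []
    for entry in entries:
        stale = entry["last_scored"] < cutoff and entry["last_moved"] < cutoff
        (archive if stale else keep).append(entry)
    return keep, archive


backlog = [
    {"slug": "old-idea", "last_scored": date(2025, 9, 1), "last_moved": date(2025, 9, 1)},
    {"slug": "fresh-idea", "last_scored": date(2026, 3, 1), "last_moved": date(2026, 3, 1)},
]
keep, archive = split_stale(backlog, today=date(2026, 5, 17))
print([e["slug"] for e in archive])
```

Duplicate detection is left out on purpose: deciding that two ideas are the same test usually needs a human reviewer.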
