Experiment Backlogs

An experiment backlog is the structured pipeline of test ideas waiting to ship. Done well, it's a CRO program's biggest compounding asset — done badly, it's a graveyard of half-formed hunches.
Experiment Backlog
A structured, scored pipeline of test ideas waiting to be run — each with a hypothesis, surface, and priority.
An experiment backlog is the central queue that holds every test idea your team has generated but not yet shipped. Each entry is more than a one-liner: it includes a slug, hypothesis, target surface (PDP, cart, checkout), supporting evidence, ICE or RICE score, and a status that tracks it from idea → designed → live → analysed.
A well-kept backlog is the single largest compounding asset of a CRO program. It turns scattered Slack messages and Hotjar replays into a ranked, reviewable list — so the next test you run is the highest-expected-value one, not just the loudest one in last Tuesday's meeting.
Most stores don't fail at experimentation because they lack ideas — they fail because ideas live in five different places. A backlog only earns its keep when every contributor (designer, growth marketer, agency lead, dev) submits into the same template with the same required fields.
The minimum viable schema is six columns: slug, hypothesis ("if we change X, then Y will improve, because Z"), surface, evidence link, score, and status. Anything thinner and you can't prioritise; anything fatter and contributors stop filling it in. Treat the backlog as a product, not a spreadsheet.
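The six-column schema above can be sketched as a small record type. This is a minimal illustration, not a prescribed data model: the class name `BacklogEntry`, the field names, the `advance` helper, and the sample values are all assumptions made for the example. The status values follow the idea → designed → live → analysed flow described earlier.

```python
from dataclasses import dataclass

# Status flow taken from the text: idea -> designed -> live -> analysed.
STATUS_FLOW = ["idea", "designed", "live", "analysed"]


@dataclass
class BacklogEntry:
    """One row of the backlog (hypothetical field names)."""
    slug: str
    hypothesis: str      # "if we change X, then Y will improve, because Z"
    surface: str         # e.g. "PDP", "cart", "checkout"
    evidence_link: str   # pointer to the supporting data
    score: float         # ICE or RICE score
    status: str = "idea"

    def advance(self) -> None:
        """Move the entry to the next status in the flow."""
        position = STATUS_FLOW.index(self.status)
        if position == len(STATUS_FLOW) - 1:
            raise ValueError(f"{self.slug} is already at the final status")
        self.status = STATUS_FLOW[position + 1]


# Illustrative entry; the slug, hypothesis wording, and evidence reference are made up.
entry = BacklogEntry(
    slug="sticky-atc-mobile-pdp",
    hypothesis=(
        "If we add a sticky add-to-cart bar on mobile PDPs, then add-to-cart "
        "rate will improve, because the button stays visible while scrolling."
    ),
    surface="PDP",
    evidence_link="session-replay-note-12",
    score=144_000,
)
entry.advance()
print(entry.status)  # designed
```

Keeping the status as a fixed, ordered list makes it hard for an entry to skip a stage, which is one way to keep the template consistent across contributors.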
RICE = (Reach × Impact × Confidence) / Effort
Reach: Number of visitors or sessions affected per month by the change.
Impact: Estimated effect on the primary metric per affected user, on a 0.25–3 scale (0.25 minimal, 3 massive).
Confidence: How sure you are about Reach and Impact, expressed as a percentage (50%, 80%, 100%).
Effort: Person-weeks required to design, build, QA, and ship the test.
An apparel store scores a test that adds a sticky add-to-cart bar on mobile PDPs.
Reach (monthly mobile PDP sessions): 180,000
Impact (0.25–3 scale): 1.5
Confidence: 80%
Effort (person-weeks): 1.5
→ (180,000 × 1.5 × 0.8) / 1.5 = 144,000 RICE points
A score this high relative to a typical 20–40k baseline puts the test in the top decile of the backlog — it should run before any 'gut-feel' header copy tweak with no traffic behind it.
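The arithmetic in the worked example can be reproduced with a few lines of code. This is a minimal sketch of the RICE formula as stated above; the function name `rice_score` and the choice to pass Confidence as a fraction (0.8 for 80%) are assumptions for the example.

```python
def rice_score(reach: float, impact: float, confidence: float, effort: float) -> float:
    """RICE = (Reach x Impact x Confidence) / Effort.

    confidence is a fraction (0.8 for 80%); effort is in person-weeks.
    """
    if effort <= 0:
        raise ValueError("effort must be positive")
    return reach * impact * confidence / effort


# Sticky add-to-cart bar on mobile PDPs, using the figures from the example.
score = rice_score(reach=180_000, impact=1.5, confidence=0.8, effort=1.5)
print(round(score))  # 144000
```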
RICE forces honesty about Reach. The most common backlog pathology is high-Impact, low-Reach ideas ("redesign the empty-cart state") crowding out boring, high-Reach winners ("clarify the shipping threshold above the fold"). For lighter prioritisation, ICE drops Reach and is faster to score — fine for backlogs under ~30 items, but it systematically over-weights pet projects on low-traffic surfaces.
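To see how leaving Reach out changes the ranking, the sketch below scores two ideas both ways. The traffic and scoring numbers are hypothetical, and `reach_blind_score` is only a stand-in for ICE: ICE is usually scored as Impact, Confidence, and Ease on 1–10 scales, whereas here Ease is approximated as 1 / Effort so the two scores share the same inputs.

```python
# Hypothetical ideas: reach in sessions/month, impact on the 0.25-3 scale,
# confidence as a fraction, effort in person-weeks.
ideas = [
    {"slug": "redesign-empty-cart", "reach": 4_000, "impact": 3.0,
     "confidence": 0.8, "effort": 2.0},
    {"slug": "clarify-shipping-threshold", "reach": 150_000, "impact": 1.0,
     "confidence": 0.8, "effort": 1.0},
]


def rice_score(idea: dict) -> float:
    return idea["reach"] * idea["impact"] * idea["confidence"] / idea["effort"]


def reach_blind_score(idea: dict) -> float:
    # Stand-in for ICE (assumption): same inputs as RICE, minus Reach.
    return idea["impact"] * idea["confidence"] / idea["effort"]


def ranked(items: list, score_fn) -> list:
    """Slugs ordered from highest to lowest score."""
    return [i["slug"] for i in sorted(items, key=score_fn, reverse=True)]


print(ranked(ideas, rice_score))
# ['clarify-shipping-threshold', 'redesign-empty-cart']
print(ranked(ideas, reach_blind_score))
# ['redesign-empty-cart', 'clarify-shipping-threshold']
```

With Reach included, the high-traffic shipping-threshold idea ranks first; without it, the low-traffic redesign wins on Impact alone.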
Healthy backlog metrics by program maturity
| Maturity stage | Active backlog size | Tests shipped / month | Avg time idea → live | Win rate |
|---|---|---|---|---|
| Just starting (0–6 months) | 15–25 ideas | 1–2 | 5–7 weeks | 15–20% |
| Established (6–18 months) | 30–60 ideas | 3–5 | 3–4 weeks | 20–25% |
| Mature (18+ months) | 60–120 ideas | 6–10 | 2–3 weeks | 25–35% |
| Stalled program (warning) | >150 ideas | <1 | >8 weeks | <15% |
Watch the stalled-program row. A backlog that swells past 150 entries while shipping fewer than one test a month isn't healthy; it's a hoarder's attic. Prune ruthlessly every quarter: archive anything scored more than two quarters ago without movement, and demote duplicates. The signal you want is throughput, not inventory.
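The quarterly pruning rule and the stalled-program warning can be sketched as two small checks. The field names, the sample entries, and two interpretations are assumptions here: "two quarters" is treated as 182 days, and "without movement" as an entry still sitting at the idea status.

```python
from datetime import date, timedelta

TWO_QUARTERS = timedelta(days=182)  # assumption: two quarters is about 26 weeks


def stale_entries(entries: list, today: date) -> list:
    """Slugs still at 'idea' whose last score is older than two quarters."""
    cutoff = today - TWO_QUARTERS
    return [
        e["slug"]
        for e in entries
        if e["status"] == "idea" and e["last_scored"] < cutoff
    ]


def looks_stalled(active_ideas: int, shipped_per_month: float) -> bool:
    """Thresholds from the stalled-program row: >150 ideas, <1 test per month."""
    return active_ideas > 150 and shipped_per_month < 1


# Hypothetical backlog rows.
backlog = [
    {"slug": "idea-a", "status": "idea", "last_scored": date(2024, 1, 10)},
    {"slug": "idea-b", "status": "idea", "last_scored": date(2024, 9, 1)},
    {"slug": "idea-c", "status": "designed", "last_scored": date(2024, 1, 10)},
]

print(stale_entries(backlog, today=date(2024, 10, 1)))  # ['idea-a']
print(looks_stalled(active_ideas=160, shipped_per_month=0.5))  # True
```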
Experiment backlog FAQ
What should each backlog entry include?
At minimum: a unique slug, a structured hypothesis (if/then/because), the target surface, an evidence link (GA4 funnel, heatmap, support ticket), a score (ICE or RICE), and a status. Optional but useful: owner, primary metric, expected MDE (minimum detectable effect), and the segment if you're targeting one.
Should I score with ICE or RICE?
Use ICE when your backlog is under ~30 items and most ideas affect similar traffic volumes. Switch to RICE once you're prioritising across surfaces with very different reach (PDP vs account page). Reach prevents low-traffic pet projects from dominating the queue.
How many ideas should a backlog hold?
Aim for 30–60 active ideas in an established program. Below 15, you'll run out of tests before the next research cycle; above 120, the list becomes unreadable and ideas rot. The number that matters is throughput (tests shipped per month), not raw backlog count.
Which tool should I use to manage the backlog?
Airtable and Notion are fine starting points because contributors already use them. Move to a dedicated experimentation tool once you need linked results, automated scoring, and reviewer workflows. The tool matters less than the discipline of one source of truth.
How do I keep low-quality ideas out of the backlog?
Require evidence on submission. Every entry needs at least one supporting data point: a GA4 drop-off, a session replay, a heuristic-review note, or a customer quote. "I think we should try…" without evidence gets rejected at intake, not buried in the queue.
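The evidence-on-submission rule above can be sketched as an intake gate. The evidence type labels, the submission shape, and the function name `intake_decision` are assumptions for illustration; the rule itself (no supporting data point, no entry) comes from the text.

```python
# Evidence kinds named in the text, as hypothetical type labels.
ACCEPTED_EVIDENCE = {
    "ga4_dropoff",
    "session_replay",
    "heuristic_review",
    "customer_quote",
}


def intake_decision(submission: dict) -> str:
    """Accept a submission only if it carries at least one usable data point."""
    supported = [
        item
        for item in submission.get("evidence", [])
        if item.get("type") in ACCEPTED_EVIDENCE
        and item.get("reference", "").strip()
    ]
    return "accepted" if supported else "rejected: no supporting evidence"


# Hypothetical submissions.
hunch = {"slug": "new-header-copy", "evidence": []}
backed = {
    "slug": "clarify-shipping-threshold",
    "evidence": [{"type": "customer_quote", "reference": "support ticket note"}],
}

print(intake_decision(hunch))   # rejected: no supporting evidence
print(intake_decision(backed))  # accepted
```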
How does a backlog differ from prioritisation?
Prioritisation is the scoring and ranking layer that operates on the backlog. The backlog is the inventory; prioritisation is the policy that decides which item ships next. Without a backlog there's nothing to prioritise; without prioritisation, the backlog is just a list.
How often should I re-score the backlog?
Re-score the top 20 every two weeks before sprint planning, and do a full backlog grooming once a quarter. Reach and Confidence drift as traffic patterns and research evidence change, so a score from six months ago is essentially stale.
What win rate should I expect?
20–25% of shipped tests should produce a clear winner in an established program; mature teams push 25–35%. If your win rate is below 15%, your backlog is too speculative, so tighten intake to require stronger evidence. If it's above 40%, you're probably only testing safe bets and leaving upside on the table.
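The two warning thresholds above (below 15% and above 40%) can be turned into a simple check. This is a sketch only: the function name `diagnose_win_rate` and the message wording are assumptions, and rates between the two thresholds are simply reported as raising no warning.

```python
def diagnose_win_rate(wins: int, shipped: int) -> str:
    """Flag win rates outside the 15%-40% band described in the text."""
    if shipped <= 0:
        raise ValueError("shipped must be a positive number of tests")
    rate = wins / shipped
    if rate < 0.15:
        return "too speculative: tighten intake evidence"
    if rate > 0.40:
        return "probably too safe: upside left on the table"
    return "no warning"


print(diagnose_win_rate(wins=2, shipped=20))  # too speculative: tighten intake evidence
print(diagnose_win_rate(wins=5, shipped=20))  # no warning
print(diagnose_win_rate(wins=9, shipped=20))  # probably too safe: upside left on the table
```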
What should I do with tests that lost?
Move them to a separate "results archive" with the full readout, not back into active backlog. Losers are valuable as future evidence ("we tried this in May with this variant, here's why it lost") but they clog the active queue if you leave them mixed in.
Who should own the backlog?
One named person, usually the CRO lead or growth manager, owns intake, scoring sign-off, and quarterly grooming. Anyone can submit; one person curates. Backlogs without a clear owner devolve into shared documents nobody trusts within a quarter.