Experiment Culture

Experiment culture is the organisational layer that decides whether your testing program ships learnings or theatre. Here's how to define it, measure it, and spot the failure modes.
The shared norms that decide whether a team's experiments produce honest learnings or political theatre.
Experiment culture is the set of organisational behaviours that surround a testing program: how negative results are treated, whether the founder's pet idea can be overruled by a p-value, and how comfortable a designer is shipping a variant they don't personally love. Tooling moves fast; culture is the slow variable.
A healthy experiment culture treats every test as a question, not a bet. A losing variant is information, not embarrassment. A winning variant doesn't promote anyone — and doesn't condemn the person whose original design lost. This separation between outcome and identity is what lets teams run 40+ tests a year without exhausting themselves on internal politics.
Most stores don't fail at experimentation because the math is wrong. They fail because the org rewards confident opinions over uncertain answers. A flat test result gets framed as wasted sprint capacity instead of a saved roadmap quarter.
This is why experiment culture sits one layer above your experimentation strategy. The strategy decides what to test; the culture decides whether the team will tell you the truth about what the test said. Without the second, the first is decorative.
ECI = (Tests_shipped × Kill_rate × Hypothesis_diversity) / HiPPO_overrides
| Symbol | Name | Definition |
|---|---|---|
| ECI | Experiment Culture Index | A directional health score for a testing program's cultural maturity. |
| Tests_shipped | Tests shipped per quarter | Live A/B tests that reached a stop decision in the quarter. |
| Kill_rate | Kill rate | Share of tests stopped on flat or negative results without being relaunched as 'directional wins'. |
| Hypothesis_diversity | Hypothesis diversity | Share of tests sourced from outside the founder/CEO, scored 0-1. |
| HiPPO_overrides | HiPPO overrides | Count of decisions where a senior opinion overruled a statistically clear result. Floored at 1. |
A €4M Shopify apparel brand reviews its Q3 testing program.
Tests shipped: 12
Kill rate: 0.55
Hypothesis diversity: 0.7
HiPPO overrides: 2
→ ECI = (12 × 0.55 × 0.7) / 2 = 4.62 / 2 = 2.31
An ECI above 2.0 is healthy for a mid-market store. The team is shipping enough tests, accepting losers honestly, and pulling ideas from across the org — but two HiPPO overrides in a quarter is still a yellow flag worth a retro conversation.
Treat ECI as directional, not diagnostic. The point isn't the number — it's that the four inputs are the four variables that actually move. Improve kill rate without improving hypothesis diversity and you just get a quieter monoculture.
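The ECI formula and the Q3 worked example can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `experiment_culture_index` and the range checks on the two share inputs are assumptions added here, while the floor of 1 on overrides comes from the definition above.

```python
def experiment_culture_index(tests_shipped, kill_rate, hypothesis_diversity, hippo_overrides):
    """ECI = (Tests_shipped * Kill_rate * Hypothesis_diversity) / HiPPO_overrides."""
    # Assumption: both shares are expressed on a 0-1 scale.
    for name, value in (("kill_rate", kill_rate), ("hypothesis_diversity", hypothesis_diversity)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    if tests_shipped < 0 or hippo_overrides < 0:
        raise ValueError("counts must be non-negative")

    # Overrides are floored at 1 so a quarter with zero overrides never divides by zero.
    overrides = max(hippo_overrides, 1)
    return tests_shipped * kill_rate * hypothesis_diversity / overrides


# The Q3 apparel-brand example: 12 tests, 0.55 kill rate, 0.7 diversity, 2 overrides.
q3 = experiment_culture_index(12, 0.55, 0.7, 2)
print(round(q3, 2))  # 2.31
```

Because of the floor, a quarter with zero overrides scores the same as a quarter with one, which is one more reason to read the number as directional.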
Cultural health signals across testing-program maturity levels
| Signal | Nascent (yr 1) | Established (yr 2-3) | Mature (yr 4+) |
|---|---|---|---|
| Tests shipped per quarter | 2-4 | 8-15 | 20-40 |
| Win rate | 30-40% | 20-30% | 15-25% |
| Kill rate (flat results stopped honestly) | 20% | 45% | 60%+ |
| Hypotheses from outside leadership | <20% | 40-60% | 60-80% |
| HiPPO overrides per quarter | 3-5 | 1-2 | <1 |
| Avg time from idea to live test | 4-6 weeks | 2-3 weeks | 5-10 days |
| Retros on losing tests | Rare | Sometimes | Always |
Notice that win rate falls as maturity rises. Mature teams test bolder hypotheses on smaller surfaces, which means more losers — and that's the point. A 60% win rate usually means the team is only testing changes safe enough to ship without testing.
Frequently asked questions
What is the difference between experiment culture and experimentation strategy?
Experimentation strategy is the plan — which surfaces to test, what hypotheses to prioritise, how to allocate traffic. Experiment culture is the behaviour layer underneath it: whether the team will actually report a flat result honestly, and whether leadership will accept it. Strategy tells you what to do; culture decides whether the answer reaches the decision-maker intact.
What is a HiPPO, and why does it matter?
HiPPO stands for Highest-Paid Person's Opinion. It matters because the single biggest predictor of a failing experiment culture is how often a senior opinion overrules a clean statistical result. One override teaches the team that data is decorative. Three overrides in a quarter ends the program — people stop bringing hypotheses that might embarrass the boss.
Should test outcomes be tied to performance reviews?
No. The moment a losing variant becomes a career liability, your designers and PMs will only propose safe tests they're confident will win — and safe tests teach you nothing. Tie performance reviews to test volume, hypothesis quality, and learning velocity instead. The outcome of any individual test should be irrelevant to the person who designed it.
What is a healthy win rate?
15-25%. If your win rate is above 40%, you're almost certainly testing only changes you already believe will work — meaning the program is functioning as a confidence-bolstering ritual rather than a learning engine. A lower win rate paired with a higher kill rate is the signature of a team taking real swings.
How do you build an experiment culture in a founder-led company?
Start by separating the two roles the founder plays: idea-generator and decision-maker. The founder can still propose half the hypotheses; they just can't override results on the ones that lose. Publish test results in a shared channel before any meeting happens, so the data lands before the politics do.
What is the most common failure mode?
Renaming flat tests as 'directional wins' and shipping them anyway. This is the cultural equivalent of moving the goalposts mid-game. It quietly destroys the team's ability to trust their own results, because everyone learns that the bar for 'winning' is whatever leadership wanted that week.
How long does it take to build an experiment culture?
Six to twelve months for the basics, two to three years for it to feel native. The leading indicator is when someone outside the growth team — a customer-service lead, a warehouse manager — proposes a test hypothesis unprompted. That's the moment experimentation stops being a department and starts being a habit.
Does a small team need an experiment culture?
A four-person team needs it more than a forty-person team. With fewer people, a single HiPPO override poisons a higher share of the roadmap. Small teams should formalise one rule on day one: results are reviewed before opinions are shared, every time.
How do you measure experiment culture?
Track four numbers quarterly: tests shipped, kill rate, share of hypotheses from outside leadership, and HiPPO overrides. Review them in a 20-minute retro, not a dashboard. The conversation is the measurement — the numbers just give it somewhere to start.
What does a HiPPO override look like in practice?
Shipping a losing variant because a senior stakeholder liked it, then telling the team the test 'wasn't conclusive enough'. Once the team sees that data loses to preference, hypothesis quality collapses within a quarter and you're back to running on opinions in under six months.
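The four quarterly numbers (tests shipped, kill rate, share of hypotheses from outside leadership, and HiPPO overrides) can be tallied from a simple test log before the retro. The sketch below is illustrative only: the `ExperimentRecord` fields and the `quarterly_signals` helper are hypothetical names, and it assumes kill rate is measured against all tests that reached a stop decision, which the definitions here do not spell out.

```python
from dataclasses import dataclass


@dataclass
class ExperimentRecord:
    # Hypothetical log entry for one test that reached a stop decision.
    outcome: str            # "win", "flat", or "loss" (illustrative labels)
    stopped_honestly: bool  # flat/losing test stopped, not relaunched as a 'directional win'
    from_leadership: bool   # hypothesis came from the founder/CEO
    hippo_override: bool    # a senior opinion overruled a statistically clear result


def quarterly_signals(records):
    """Tally the four cultural health numbers for one quarter."""
    total = len(records)
    if total == 0:
        raise ValueError("no tests reached a stop decision this quarter")

    killed = sum(1 for r in records if r.outcome in ("flat", "loss") and r.stopped_honestly)
    outside = sum(1 for r in records if not r.from_leadership)
    overrides = sum(1 for r in records if r.hippo_override)

    return {
        "tests_shipped": total,
        # Assumption: denominator is every test that reached a stop decision.
        "kill_rate": killed / total,
        "hypothesis_diversity": outside / total,
        "hippo_overrides": overrides,
    }


log = [
    ExperimentRecord("win", False, True, False),
    ExperimentRecord("flat", True, False, False),
    ExperimentRecord("loss", True, False, False),
    ExperimentRecord("loss", False, False, True),
]
print(quarterly_signals(log))
```

The output is meant as a conversation starter for the 20-minute retro, not as a dashboard metric.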