Hypothesis Development

Metricuno
May 17, 2026
3 min read
Hypothesis Development — Learn how to turn research insights into testable A/B test hypotheses using a four-part structure: evidence, intervention, expected outcome, and metric.
Quick answer

Hypothesis development is the bridge between user research and A/B tests. A good hypothesis names the evidence, the change, the predicted outcome, and the metric that will confirm or refute it.

Definition
Experimentation

Hypothesis Development

The practice of turning research insights into testable A/B test hypotheses with a clear evidence-intervention-outcome-metric structure.

A written hypothesis is the artifact that connects user research to live experiments and back to documented learnings. Writing it forces you to name the evidence behind a proposed change, the change itself, the behavioural outcome you expect, and the metric that will support or refute it.

Without this structure, A/B tests become opinion contests: whoever shouts loudest picks the next variant. With it, every test is auditable months later — you can see why you ran it, what you predicted, and what actually happened. That paper trail is what compounds learning across an experimentation programme.

Also known as
hypothesis writing
test hypothesis design

A hypothesis is not a guess about what will win — it's a prediction tied to evidence. If you can't point at session recordings, funnel data, survey quotes, or heatmap patterns that prompted the idea, you're testing a hunch, not a hypothesis.

This step sits inside a broader experimentation strategy. Research surfaces friction, hypothesis development translates that friction into a falsifiable statement, and the test result either confirms the mental model or forces you to update it.

Formula

Because we saw [EVIDENCE], we expect [INTERVENTION] will [OUTCOME], measured by [METRIC].

Variables

EVIDENCE

Research signal

The qualitative or quantitative observation that prompted the idea — session recordings, funnel drop-off, support tickets, survey responses.

INTERVENTION

Proposed change

The specific design, copy, or flow change you'll ship as the variant.

OUTCOME

Expected behavioural shift

How user behaviour will change — directionally and on which step of the funnel.

METRIC

Primary success metric

The single number you'll judge the test on, with a guardrail metric named separately.
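The four-part formula can be treated as a small data structure, so an incomplete hypothesis is rejected before it reaches the test calendar. The following is a minimal sketch, not a prescribed implementation: the `Hypothesis` class, its field names, and the `missing_parts` and `statement` helpers are all hypothetical names introduced for illustration, and the guardrail is appended in parentheses because the formula itself has no slot for it.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """Illustrative container for the four parts plus a guardrail."""
    evidence: str       # research signal that prompted the idea
    intervention: str   # the specific change shipped as the variant
    outcome: str        # expected behavioural shift
    metric: str         # primary success metric
    guardrail: str      # guardrail metric, named separately

    def missing_parts(self) -> list[str]:
        """Return the names of any fields left blank."""
        return [name for name, value in vars(self).items() if not value.strip()]

    def statement(self) -> str:
        """Render the hypothesis using the formula above."""
        missing = self.missing_parts()
        if missing:
            raise ValueError(f"incomplete hypothesis, missing: {', '.join(missing)}")
        return (
            f"Because we saw {self.evidence}, we expect {self.intervention} "
            f"will {self.outcome}, measured by {self.metric} "
            f"(guardrail: {self.guardrail})."
        )


size_guide = Hypothesis(
    evidence="62% of mobile sessions scroll past the size guide",
    intervention="an inline size recommender above the add-to-cart button",
    outcome="raise the mobile add-to-cart rate",
    metric="mobile add-to-cart rate",
    guardrail="14-day returns rate",
)
print(size_guide.statement())
```

Making every field required is a deliberate choice in this sketch: a hypothesis with no evidence or no metric fails loudly instead of quietly entering the backlog.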

Worked example

Shopify apparel store — mobile product page

Evidence: 62% of mobile sessions scroll past the size guide without expanding it; support tickets cite sizing uncertainty as the top pre-purchase question.

Intervention: Replace the collapsed size guide with an inline size recommender above the add-to-cart button.

Outcome: Mobile visitors will add to cart at a higher rate because sizing friction is removed before the buying decision.

Metric: Mobile add-to-cart rate, with returns rate as a 14-day guardrail.

A complete, testable hypothesis with named evidence, change, prediction, and judgement criteria.

Because the evidence is specific (62% scroll-past rate) and the metric is bounded by a guardrail, the test can be scored objectively whichever way it lands, including a loss with a learning attached.

Not every hypothesis you write deserves test traffic. Score them before you queue them — strong evidence, reach, and clarity of metric should beat seniority of the person who pitched the idea every time.

Benchmark

Hypothesis quality scoring rubric (1 = weak, 3 = strong)

| Dimension | Weak (1) | Adequate (2) | Strong (3) |
| --- | --- | --- | --- |
| Evidence | Opinion or competitor copy | One data source | Two+ sources, one qualitative + one quantitative |
| Intervention specificity | Vague ("improve UX") | Named element changed | Pixel-level spec with variant copy |
| Outcome prediction | "Will perform better" | Direction named | Direction + funnel step + magnitude range |
| Primary metric | Revenue, undefined | Single metric named | Primary + guardrail + minimum detectable effect |
| Reach | <10% of sessions | 10-40% of sessions | >40% of sessions or high-AOV segment |
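One way to apply the rubric is to score each dimension from 1 to 3 and rank the backlog by the total. The rubric does not say how the five dimensions should be combined, so the unweighted sum below is an assumption, and the names `DIMENSIONS`, `score_hypothesis`, and `rank_backlog` are hypothetical. Treat it as a rough sketch.

```python
from __future__ import annotations

# The five rubric dimensions, each scored 1 (weak) to 3 (strong).
DIMENSIONS = (
    "evidence",
    "intervention_specificity",
    "outcome_prediction",
    "primary_metric",
    "reach",
)


def score_hypothesis(scores: dict[str, int]) -> int:
    """Sum the rubric scores (assumed equal weights); totals run from 5 to 15."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"unscored dimensions: {', '.join(missing)}")
    for dimension in DIMENSIONS:
        if scores[dimension] not in (1, 2, 3):
            raise ValueError(f"{dimension} must be scored 1, 2, or 3")
    return sum(scores[d] for d in DIMENSIONS)


def rank_backlog(backlog: dict[str, dict[str, int]]) -> list[tuple[str, int]]:
    """Return (name, total) pairs, strongest hypothesis first."""
    totals = [(name, score_hypothesis(s)) for name, s in backlog.items()]
    return sorted(totals, key=lambda pair: pair[1], reverse=True)


# Hypothetical backlog entries, scored by hand against the rubric.
backlog = {
    "sticky add-to-cart bar": dict(zip(DIMENSIONS, (1, 2, 1, 2, 3))),
    "inline size recommender": dict(zip(DIMENSIONS, (3, 2, 2, 3, 2))),
}
for name, total in rank_backlog(backlog):
    print(f"{total:>2}  {name}")
```

A team that values evidence above reach could swap the plain sum for a weighted one. The point of the sketch is only that the score, not the pitcher's seniority, sets the queue order.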

The most common failure mode is the hypothesis written backwards — someone has already built the variant and the "hypothesis" is reverse-engineered to justify it. Write the hypothesis first, get it reviewed, then ship the variant. The order matters.

Frequently asked

Hypothesis development FAQ

A test idea is a suggestion — "let's try a sticky add-to-cart bar." A hypothesis is that idea wrapped in evidence, an expected outcome, and a metric. Ideas live in a backlog; hypotheses go on the test calendar.

Most programmes running 2-4 tests per month carry 15-30 prioritised hypotheses at any time. Fewer than 10 means research isn't feeding the pipeline; more than 50 usually means nothing is being killed or shipped.

Predict a range for the expected lift, not a point. "+3-8% on mobile add-to-cart" is honest; "+5.2% conversion" is theatre. The lower end of the range should sit at or above your minimum detectable effect, so the test is powered to detect the smallest lift you are predicting.
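To see why the predicted range and the minimum detectable effect have to line up, it helps to estimate how much traffic each end of the range would need. The sketch below uses the standard normal-approximation sample-size formula for comparing two proportions. The 10% baseline rate, the 5% significance level, and the 80% power are assumptions chosen for illustration, and `sample_size_per_arm` is a hypothetical helper name.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate visitors needed per arm to detect a relative lift
    in a conversion rate with a two-sided two-proportion z-test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_power = NormalDist().inv_cdf(power)          # critical value for power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


# Assumed 10% baseline add-to-cart rate; predicted range of +3% to +8%.
low_end = sample_size_per_arm(0.10, 0.03)
high_end = sample_size_per_arm(0.10, 0.08)
print(f"per-arm traffic for +3%: {low_end:,}")
print(f"per-arm traffic for +8%: {high_end:,}")
```

The low end of the range needs far more traffic than the high end. If the page cannot supply that traffic in a reasonable test window, the honest move is to narrow the prediction or pick a higher-reach placement.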

Qualitative evidence is valid on its own when the pattern is unambiguous — say, six recorded sessions showing the same checkout error. Pair it with one quantitative signal when you can, but don't block on quant for clearly broken experiences.

The hypothesis should be written by whoever owns the test outcome, typically the CRO specialist or growth manager. Designers and developers contribute to the intervention spec, but the person on the hook for the metric writes and signs off the hypothesis.

Hypothesis development is the translation layer between your experimentation strategy and individual tests. The strategy defines what you're trying to learn (checkout friction, pricing sensitivity, etc.); hypothesis development turns each learning goal into one or more falsifiable, prioritised tests.

AI can draft hypotheses from drop-off data and session patterns, which is useful for filling the backlog. A human still needs to score the evidence, set the guardrail, and decide priority — those judgements depend on commercial context AI doesn't have.

Document failed hypotheses in the same place as winners. A failed hypothesis is a refuted belief about user behaviour, which is exactly the kind of learning that prevents the team re-running the same idea 18 months later.

Each test should have exactly one hypothesis for the primary metric. You can layer secondary hypotheses on the same test ("we also expect a lift in average order value"), but the test is won or lost on one number. Judging it on several metrics at once gives chance more ways to produce a "win", which undermines the significance threshold.
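The cost of judging one test on several metrics can be made concrete with a small calculation. Assuming the metrics are independent and each is checked at the same significance level (real metrics are usually correlated, so this is a simplification), the chance of at least one false positive grows quickly with the number of metrics. The helper name `family_wise_error_rate` is hypothetical.

```python
def family_wise_error_rate(alpha: float, metrics: int) -> float:
    """Chance of at least one false positive across independent metrics,
    each tested at significance level alpha."""
    return 1 - (1 - alpha) ** metrics


for k in (1, 3, 5):
    rate = family_wise_error_rate(0.05, k)
    print(f"{k} metric(s): {rate:.1%} chance of a spurious win")
```

With one primary metric the false-positive risk stays at the level you chose. With five metrics that could each decide the test, it is several times higher, which is why secondary hypotheses should inform the write-up without deciding the result.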

A hypothesis of two to four sentences fits the evidence-intervention-outcome-metric structure cleanly. If it runs past a paragraph, the intervention is probably two changes bundled, and you won't be able to tell which change drove the result when the test ends.
