Bayesian Testing

Metricuno
May 17, 2026
Bayesian testing explained: how posterior probability works, when to use it over frequentist A/B tests, priors, peeking, and decision thresholds.
Quick answer

Bayesian testing reports the probability that variant B beats A and updates as data arrives — no fixed sample size, no peeking penalty, but priors and loss thresholds matter.

Definition

Bayesian Testing (Experimentation): A testing framework that reports the posterior probability that variant B beats variant A, updated continuously as data arrives.

Bayesian testing is an alternative to the frequentist null-hypothesis approach most A/B tools default to. Instead of asking "is the observed lift unlikely under the assumption variants are identical?", it asks the more intuitive question stakeholders actually want answered: "given everything I've seen, what's the probability B is better than A, and by how much?"

The framework starts with a prior belief about each variant's conversion rate, then updates that belief into a posterior distribution as visitors and conversions accumulate. Because the posterior is updated with each new observation rather than re-tested at each look, it remains a valid summary of the evidence whenever you check it, so peeking at results doesn't inflate false-positive risk the way it does with classical p-values.

Also known as: Bayesian A/B testing, posterior probability testing

The appeal for e-commerce teams is mostly practical. You don't have to commit to a fixed sample size before launch, and a stakeholder check-in halfway through doesn't compromise the result. The output — "94% probability the new product card beats the control, expected lift +3.1%" — also lands better in a Monday meeting than "p = 0.043, 95% CI [0.2%, 6.0%]".

The trade-off is interpretation discipline. Bayesian tests need a prior (usually a weak Beta distribution for conversion rate) and a decision rule beyond raw probability — typically expected loss, which caps the downside risk of shipping a losing variant. Get the prior wrong or skip the loss threshold and you'll ship noise faster than a sloppy frequentist setup ever could.

Formula

P(B > A | data) = ∫∫ [θ_B > θ_A] · Beta(θ_A | α_A, β_A) · Beta(θ_B | α_B, β_B) dθ_A dθ_B

Variables

θ_A, θ_B (true conversion rates): The unknown underlying conversion rates for the control and variant.

α, β (Beta parameters): Posterior shape parameters: α = prior_α + conversions, β = prior_β + (visitors − conversions).

P(B > A | data) (posterior probability): The probability that variant B's true conversion rate exceeds A's, given the observed data.
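The update rule for α and β is plain addition. A minimal sketch in Python follows; the function names `beta_posterior` and `beta_mean` and the example counts are illustrative assumptions, not taken from any particular tool.

```python
def beta_posterior(prior_alpha, prior_beta, conversions, visitors):
    """Return the (alpha, beta) parameters of the Beta posterior for one arm."""
    if not 0 <= conversions <= visitors:
        raise ValueError("conversions must be between 0 and visitors")
    alpha = prior_alpha + conversions             # prior "successes" + observed conversions
    beta = prior_beta + (visitors - conversions)  # prior "failures" + observed non-conversions
    return alpha, beta


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Hypothetical arm: 50 conversions from 1,000 visitors, uniform Beta(1, 1) prior.
alpha, beta = beta_posterior(1, 1, conversions=50, visitors=1000)
print(alpha, beta)                        # 51 951
print(round(beta_mean(alpha, beta), 4))   # 0.0509
```

With a uniform prior the posterior mean sits very close to the raw conversion rate; the prior only adds one pseudo-conversion and one pseudo-non-conversion.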

Worked example

A Shopify apparel brand tests a new product-card layout. Control gets 412 conversions from 9,800 visitors; variant gets 461 from 9,750. Using a weak Beta(1, 1) prior for both arms, the posterior for A is Beta(413, 9389) and for B is Beta(462, 9290).

Control conversions / visitors: 412 / 9,800

Variant conversions / visitors: 461 / 9,750

Prior: Beta(1, 1) — uniform

P(B > A) ≈ 96.2%, expected lift ≈ +0.52pp, expected loss if you ship B ≈ 0.005pp

Both the high posterior probability and the very low expected loss clear typical decision thresholds (P > 95%, loss < 0.1pp), so shipping B is justified. The expected-loss check is what stops you from shipping a variant whose posterior probability looks favourable but whose posterior is still wide enough to hide a costly downside.

In practice you rarely compute the integral by hand. Most Bayesian testing tools instead simulate 10,000 or more draws from each posterior and count how often B's sample exceeds A's. The same simulation gives you the credible interval on the lift and the expected loss, which together form a complete decision package.
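The simulation described above can be sketched with Python's standard library alone. This is an illustrative version under stated assumptions, not how any specific platform implements it: the function name `simulate`, the draw count, and the fixed seed are all choices made for the example. The inputs are the worked example's counts with a Beta(1, 1) prior on each arm.

```python
import random


def simulate(a_post, b_post, draws=100_000, seed=42):
    """Monte Carlo summary of two Beta posteriors given as (alpha, beta) tuples."""
    rng = random.Random(seed)
    # Each draw is one plausible pair of true conversion rates; keep the difference B - A.
    diffs = sorted(
        rng.betavariate(*b_post) - rng.betavariate(*a_post) for _ in range(draws)
    )
    return {
        # Share of draws in which B's sampled rate exceeds A's.
        "prob_b_wins": sum(d > 0 for d in diffs) / draws,
        # Average absolute lift (as a fraction, so 0.005 means 0.5pp).
        "expected_lift": sum(diffs) / draws,
        # Expected loss of shipping B: shortfall averaged over all draws,
        # counting zero whenever B is actually the better arm.
        "expected_loss_b": sum(-d for d in diffs if d < 0) / draws,
        # 95% credible interval on the lift from the 2.5th and 97.5th percentiles.
        "ci_95": (diffs[int(0.025 * draws)], diffs[int(0.975 * draws)]),
    }


# Worked-example data: prior Beta(1, 1) plus observed conversions and non-conversions.
control = (1 + 412, 1 + 9800 - 412)
variant = (1 + 461, 1 + 9750 - 461)

result = simulate(control, variant)
for name, value in result.items():
    print(name, value)
```

Because the estimates come from random draws, they shift slightly with the seed and the number of draws; more draws tighten them at the cost of runtime.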

Benchmark

Bayesian vs frequentist testing: practical behaviour at typical e-commerce traffic

| Property | Frequentist (NHST) | Bayesian |
| --- | --- | --- |
| Output | p-value, confidence interval | Posterior probability, credible interval, expected loss |
| Sample size | Fixed up front; peeking inflates error | Flexible; can stop when threshold is met |
| Typical decision rule | p < 0.05 | P(B>A) > 95% AND expected loss < 0.1pp |
| Time to decision (5% MDE, 2% baseline CVR) | ~3-4 weeks | ~2-3 weeks (earlier stopping when effect is clear) |
| Risk if misused | False positives from peeking | Shipping noise when priors are too strong or loss is ignored |
| Stakeholder readability | Low: requires stats fluency | High: "96% chance B wins" |

Neither framework is universally better. Bayesian testing fits stores that run many shorter experiments and want flexible stopping; frequentist testing fits regulated reporting or environments where a fixed pre-registered design matters. Most modern experimentation platforms — including Metricuno — implement Bayesian by default and surface expected loss alongside probability so the decision rule is explicit.

Frequently asked questions

Bayesian testing is not more accurate than frequentist testing, just different. Both converge to the same conclusion with enough data. Bayesian is more flexible operationally (no fixed sample size, no peeking penalty) and easier to communicate, but it requires choosing a prior and a loss threshold. Frequentist tests are stricter about pre-commitment but more familiar to most analysts.

For conversion-rate tests, the standard choice of prior is a weak Beta(1, 1) (uniform) or Beta(2, 50) (anchored near a typical 2-4% baseline). Strong priors are rarely justified unless you have years of historical data on the exact metric and surface. When in doubt, use a weak prior and let the data speak.
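To see why a weak prior lets the data speak, compare the posterior mean under the two priors at a small and a large sample size. The sketch below is illustrative: the helper `mean_under_prior` and the 5% arm with 100 and 10,000 visitors are made-up inputs, not figures from a real test.

```python
def mean_under_prior(prior_alpha, prior_beta, conversions, visitors):
    """Posterior mean of the conversion rate for one arm under a Beta prior."""
    alpha = prior_alpha + conversions
    beta = prior_beta + (visitors - conversions)
    return alpha / (alpha + beta)


priors = {"Beta(1, 1)": (1, 1), "Beta(2, 50)": (2, 50)}

# Hypothetical arm converting at exactly 5%, observed at two sample sizes.
samples = [(5, 100), (500, 10_000)]

for conversions, visitors in samples:
    for label, (a, b) in priors.items():
        mean = mean_under_prior(a, b, conversions, visitors)
        print(f"{visitors:>6} visitors, {label:<11}: posterior mean {mean:.2%}")
```

At 100 visitors the two priors pull the estimate noticeably apart; at 10,000 visitors they give almost the same answer, because the observed counts dwarf the prior's pseudo-counts.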

You can look at the numbers without inflating false positives the way frequentist peeking does, but you still shouldn't stop the moment P(B>A) crosses 95% on day two. Set a minimum sample size and a maximum runtime up front, and require expected loss to be below your threshold — otherwise you'll ship early winners that regress.
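One way to encode such a stopping rule is a small function that is evaluated at every look. This is a sketch only: the name `ready_to_decide`, the return strings, and the default minimum sample size and maximum runtime are hypothetical, while the 95% and 0.1pp thresholds mirror the typical decision rule quoted in this article.

```python
def ready_to_decide(prob_b_wins, expected_loss, visitors_per_arm, days_running,
                    prob_threshold=0.95, loss_threshold=0.001,
                    min_visitors_per_arm=5_000, max_days=28):
    """Return 'ship B', 'stop: inconclusive', or 'keep running'.

    expected_loss and loss_threshold are fractions (0.001 == 0.1pp).
    min_visitors_per_arm and max_days are illustrative defaults.
    """
    enough_data = visitors_per_arm >= min_visitors_per_arm
    criteria_met = prob_b_wins >= prob_threshold and expected_loss < loss_threshold
    if enough_data and criteria_met:
        return "ship B"
    if days_running >= max_days:
        # Runtime budget exhausted without meeting both thresholds.
        return "stop: inconclusive"
    return "keep running"


# Day two: the probability looks great, but the sample is too small to trust.
print(ready_to_decide(0.97, 0.0001, visitors_per_arm=1_200, days_running=2))
# Two weeks in, with enough data and both thresholds met.
print(ready_to_decide(0.96, 0.00005, visitors_per_arm=9_750, days_running=14))
```

Checking the minimum sample before the thresholds is what blocks the day-two early winner, and the maximum runtime keeps an inconclusive test from running indefinitely.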

Expected loss is the probability-weighted average of how much conversion rate you'd give up if you shipped the wrong variant. A test can show P(B>A) = 92% but with high variance — expected loss catches that and blocks the ship. Most Bayesian frameworks require both a probability threshold AND a loss threshold before declaring a winner.
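Written in the same notation as the formula for P(B > A | data), the expected loss of shipping B can be expressed as follows. This is the commonly used definition, stated here as an assumption about what a given tool computes; individual platforms may scale or report it differently (for example, as a relative rather than absolute figure).

```latex
\mathbb{E}[\text{loss} \mid \text{ship } B]
  = \iint \max(\theta_A - \theta_B,\, 0)\;
    \mathrm{Beta}(\theta_A \mid \alpha_A, \beta_A)\;
    \mathrm{Beta}(\theta_B \mid \alpha_B, \beta_B)\;
    d\theta_A \, d\theta_B
```

The integrand is zero wherever B is truly better, so only the scenarios in which shipping B is a mistake contribute, each weighted by how likely it is and by how much conversion rate it would cost.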

Bayesian testing extends naturally to more than two variants: the tool computes P(variant_i > all others) for each arm and reports expected loss for shipping each one. Multi-armed bandits go a step further, dynamically reallocating traffic toward variants with higher posterior probability, which is useful when you care about earnings during the test, not just the final decision.
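The probability that each arm is the best can be estimated with the same draw-and-count approach. The sketch below is illustrative: `prob_best`, the three arms, and their counts are hypothetical, and a uniform Beta(1, 1) prior is assumed for every arm.

```python
import random


def prob_best(posteriors, draws=20_000, seed=7):
    """Estimate P(arm is best) for each Beta posterior given as (alpha, beta)."""
    rng = random.Random(seed)
    wins = [0] * len(posteriors)
    for _ in range(draws):
        # One plausible conversion rate per arm; credit the arm with the highest.
        samples = [rng.betavariate(a, b) for a, b in posteriors]
        wins[samples.index(max(samples))] += 1
    return [w / draws for w in wins]


# Hypothetical three-arm test: (conversions, visitors) per arm.
arms = {"control": (40, 1000), "variant_1": (48, 1000), "variant_2": (60, 1000)}
posteriors = [(1 + c, 1 + n - c) for c, n in arms.values()]

probs = prob_best(posteriors)
for name, p in zip(arms, probs):
    print(f"{name}: P(best) = {p:.3f}")
```

One common bandit scheme, Thompson sampling, amounts to routing each visitor to the arm that wins a single such draw, so over time traffic shares track these probabilities.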

Bayesian tests often need slightly less traffic in practice, because you can stop as soon as the decision criteria are met rather than waiting for a pre-calculated sample size. The savings are biggest when the true effect is large. For tiny effects (sub-1% lift) both frameworks need roughly the same number of visitors to get confident.

A 95% credible interval is the range that contains the true lift with 95% probability — the interpretation people incorrectly give to confidence intervals. A frequentist 95% confidence interval is a range produced by a procedure that, over many repeats, captures the truth 95% of the time. The Bayesian version is what stakeholders actually mean.

Avoid Bayesian testing when you need pre-registered fixed-sample designs (regulated reporting, scientific publication), when stakeholders are already fluent in p-values and switching adds friction, or when you have no reasonable basis for a prior and the decision is high-stakes enough that the prior choice would be contested.

Bayesian testing is one branch of statistical analysis applied to experimentation; the other major branch on the conversion side is frequentist NHST. The same Bayesian machinery also powers attribution modelling, conversion-rate forecasting, and segmentation, so investing in Bayesian fluency pays off beyond just A/B testing.

Metricuno supports Bayesian testing: every experiment reports posterior probability, credible interval on the lift, and expected loss by default. You can set your own probability and loss thresholds per test, and the platform refuses to declare a winner until both conditions are met and the minimum runtime has elapsed.
