Bayesian Testing

Bayesian testing reports the probability that variant B beats A and updates as data arrives — no fixed sample size, no peeking penalty, but priors and loss thresholds matter.
A testing framework that reports the posterior probability variant B beats variant A, updated continuously as data arrives.
Bayesian testing is an alternative to the frequentist null-hypothesis approach most A/B tools default to. Instead of asking "is the observed lift unlikely under the assumption variants are identical?", it asks the more intuitive question stakeholders actually want answered: "given everything I've seen, what's the probability B is better than A, and by how much?"
The framework starts with a prior belief about each variant's conversion rate, then updates that belief into a posterior distribution as visitors and conversions accumulate. Because each look updates the same posterior instead of running a fresh significance test, peeking at results doesn't inflate false-positive risk the way it does with classical p-values.
The appeal for e-commerce teams is mostly practical. You don't have to commit to a fixed sample size before launch, and a stakeholder check-in halfway through doesn't compromise the result. The output — "94% probability the new product card beats the control, expected lift +3.1%" — also lands better in a Monday meeting than "p = 0.043, 95% CI [0.2%, 6.0%]".
The trade-off is interpretation discipline. Bayesian tests need a prior (usually a weak Beta distribution for conversion rate) and a decision rule beyond raw probability — typically expected loss, which caps the downside risk of shipping a losing variant. Get the prior wrong or skip the loss threshold and you'll ship noise faster than a sloppy frequentist setup ever could.
P(B > A | data) = ∫∫ [θ_B > θ_A] · Beta(θ_A | α_A, β_A) · Beta(θ_B | α_B, β_B) dθ_A dθ_B
θ_A, θ_B (true conversion rates): The unknown underlying conversion rates for the control and variant.
α, β (Beta parameters): The posterior shape parameters, α = prior_α + conversions and β = prior_β + (visitors − conversions).
P(B > A | data) (posterior probability): The probability variant B's true conversion rate exceeds A's, given observed data.
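The update rule for α and β can be sketched in a few lines of Python. This is a minimal illustration that assumes a Beta prior and binomial conversion data; the function name `beta_posterior`, its argument names, and the example counts are hypothetical and not taken from any particular tool.

```python
def beta_posterior(prior_alpha, prior_beta, conversions, visitors):
    """Return the (alpha, beta) shape parameters of the Beta posterior.

    Assumes a Beta prior and binomial conversion data (conjugate update).
    """
    if not 0 <= conversions <= visitors:
        raise ValueError("conversions must be between 0 and visitors")
    alpha = prior_alpha + conversions              # prior alpha + observed conversions
    beta = prior_beta + (visitors - conversions)   # prior beta + observed non-conversions
    return alpha, beta


# Illustrative counts (hypothetical): 30 conversions from 1,000 visitors, uniform prior.
alpha, beta = beta_posterior(1, 1, conversions=30, visitors=1000)
posterior_mean = alpha / (alpha + beta)  # mean of a Beta(alpha, beta) distribution
```

With a weak prior the posterior mean sits very close to the raw conversion rate, which is the sense in which a weak prior lets the data dominate.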
A Shopify apparel brand tests a new product-card layout. Control gets 412 conversions from 9,800 visitors; variant gets 461 from 9,750. Using a weak prior Beta(1,1) for both arms, the posterior for A is Beta(413, 9389) and for B is Beta(462, 9290).
Control conversions / visitors: 412 / 9,800
Variant conversions / visitors: 461 / 9,750
Prior: Beta(1, 1) — uniform
→ P(B > A) ≈ 96%, expected lift ≈ +0.52pp, expected loss if you ship B ≈ 0.005pp
Both the high posterior probability and the very low expected loss clear typical decision thresholds (P > 95%, loss < 0.1pp), so shipping B is justified. The expected-loss check guards against shipping on probability alone while the posterior is still wide enough that a wrong call would be costly.
In practice you rarely compute the integral by hand. Most Bayesian testing tools draw 10,000 or more samples from each posterior and count how often B's sample exceeds A's. The same simulation yields the credible interval on the lift and the expected loss, which together form a complete decision package.
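The simulation described above can be sketched with nothing but Python's standard library. This is an illustrative sketch, not how any specific platform implements it; the function name `simulate`, the draw count, and the seed are assumptions. Lift and loss are expressed as absolute differences in conversion rate (multiply by 100 for percentage points).

```python
import random


def simulate(alpha_a, beta_a, alpha_b, beta_b, draws=50_000, seed=0):
    """Monte Carlo estimate of P(B > A), expected lift, and expected loss of shipping B."""
    rng = random.Random(seed)
    wins = 0
    lift_sum = 0.0
    loss_sum = 0.0
    for _ in range(draws):
        theta_a = rng.betavariate(alpha_a, beta_a)  # one draw from A's posterior
        theta_b = rng.betavariate(alpha_b, beta_b)  # one draw from B's posterior
        diff = theta_b - theta_a
        wins += diff > 0                 # count draws where B beats A
        lift_sum += diff                 # accumulate absolute lift
        loss_sum += max(-diff, 0.0)      # what you give up if B is actually worse
    return {
        "p_b_beats_a": wins / draws,
        "expected_lift": lift_sum / draws,
        "expected_loss_b": loss_sum / draws,
    }


# Posteriors for the worked example: uniform Beta(1, 1) prior plus the observed counts.
result = simulate(1 + 412, 1 + (9800 - 412), 1 + 461, 1 + (9750 - 461))
print(result)
```

Estimates vary slightly with the seed and the number of draws; increasing `draws` tightens them at the cost of runtime.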
Bayesian vs frequentist testing: practical behaviour at typical e-commerce traffic
| Property | Frequentist (NHST) | Bayesian |
|---|---|---|
| Output | p-value, confidence interval | Posterior probability, credible interval, expected loss |
| Sample size | Fixed up front; peeking inflates error | Flexible; can stop when threshold is met |
| Typical decision rule | p < 0.05 | P(B>A) > 95% AND expected loss < 0.1pp |
| Time to decision (5% MDE, 2% baseline CVR) | ~3-4 weeks | ~2-3 weeks (earlier stopping when effect is clear) |
| Risk if misused | False positives from peeking | Shipping noise when priors are too strong or loss is ignored |
| Stakeholder readability | Low — requires stats fluency | High — "96% chance B wins" |
Neither framework is universally better. Bayesian testing fits stores that run many shorter experiments and want flexible stopping; frequentist testing fits regulated reporting or environments where a fixed pre-registered design matters. Most modern experimentation platforms — including Metricuno — implement Bayesian by default and surface expected loss alongside probability so the decision rule is explicit.
Frequently asked questions
Not more accurate — different. Both converge to the same conclusion with enough data. Bayesian is more flexible operationally (no fixed sample size, no peeking penalty) and easier to communicate, but it requires choosing a prior and a loss threshold. Frequentist tests are stricter about pre-commitment but more familiar to most analysts.
For conversion-rate tests, a weak Beta(1, 1) (uniform) or Beta(2, 50) (anchored near a typical 2-4% baseline) is standard. Strong priors are rarely justified unless you have years of historical data on the exact metric and surface. When in doubt, use a weak prior and let the data speak.
You can look at the numbers without inflating false positives the way frequentist peeking does, but you still shouldn't stop the moment P(B>A) crosses 95% on day two. Set a minimum sample size and a maximum runtime up front, and require expected loss to be below your threshold — otherwise you'll ship early winners that regress.
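One way to encode that stopping discipline is a small guard function. The sketch below is an assumption about how such a rule might look, not a description of any platform's logic: the function name `decide`, the minimum sample size, and the maximum runtime are hypothetical placeholders, while the 95% probability and 0.1pp loss thresholds follow the typical values mentioned earlier. Expected loss is passed as an absolute conversion-rate difference, so 0.1pp is 0.001.

```python
def decide(p_b_beats_a, expected_loss, visitors_per_arm, days_running,
           min_visitors=5_000, max_days=28,
           prob_threshold=0.95, loss_threshold=0.001):
    """Return 'ship_b', 'stop_no_winner', or 'keep_running'."""
    enough_data = visitors_per_arm >= min_visitors
    confident = p_b_beats_a > prob_threshold
    low_risk = expected_loss < loss_threshold
    if enough_data and confident and low_risk:
        return "ship_b"            # all three conditions hold
    if days_running >= max_days:
        return "stop_no_winner"    # runtime cap reached without a clear winner
    return "keep_running"


# A day-two spike: probability looks great, but the sample is too small to act on.
early = decide(0.97, 0.0001, visitors_per_arm=800, days_running=2)
# Later in the test, with enough traffic and both thresholds cleared.
later = decide(0.962, 0.00005, visitors_per_arm=9_750, days_running=14)
```

Fixing `min_visitors` and `max_days` before launch is the point of the design: the thresholds alone cannot distinguish an early lucky streak from a real effect.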
Expected loss is the probability-weighted average of how much conversion rate you'd give up if you shipped the wrong variant. A test can show P(B>A) = 92% but with high variance — expected loss catches that and blocks the ship. Most Bayesian frameworks require both a probability threshold AND a loss threshold before declaring a winner.
It extends naturally: the tool computes P(variant_i > all others) for each arm and reports expected loss for shipping each one. Multi-armed bandits go a step further, dynamically reallocating traffic toward variants with higher posterior probability — useful when you care about earnings during the test, not just the final decision.
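For more than two arms, the same sampling idea applies: draw once from every posterior, note which arm came out on top, and record how far each arm fell short of the best. The following is a rough sketch under that assumption; the function name `prob_best_and_loss` and the three posteriors are hypothetical (they correspond to a uniform prior plus 40, 50, and 45 conversions from 1,000 visitors each).

```python
import random


def prob_best_and_loss(posteriors, draws=20_000, seed=0):
    """posteriors maps arm name -> (alpha, beta).

    Returns arm name -> (P(arm is best), expected loss of shipping that arm).
    """
    rng = random.Random(seed)
    names = list(posteriors)
    best_counts = {name: 0 for name in names}
    loss_sums = {name: 0.0 for name in names}
    for _ in range(draws):
        sample = {name: rng.betavariate(*posteriors[name]) for name in names}
        winner = max(sample, key=sample.get)   # arm with the highest sampled rate
        best_counts[winner] += 1
        top = sample[winner]
        for name in names:
            loss_sums[name] += top - sample[name]  # shortfall versus the best arm
    return {name: (best_counts[name] / draws, loss_sums[name] / draws)
            for name in names}


# Hypothetical posteriors for a three-arm test.
arms = {"control": (41, 961), "variant_1": (51, 951), "variant_2": (46, 956)}
summary = prob_best_and_loss(arms)
for name, (p_best, loss) in summary.items():
    print(f"{name}: P(best)={p_best:.3f}, expected loss={loss:.5f}")
```

A bandit would use the same per-arm probabilities to shift traffic during the test rather than only to pick a winner at the end.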
Often slightly less in practice, because you can stop as soon as the decision criteria are met rather than waiting for a pre-calculated sample size. The savings are biggest when the true effect is large. For tiny effects (sub-1% lift) both frameworks need roughly the same number of visitors to get confident.
A 95% credible interval is the range that contains the true lift with 95% probability — the interpretation people incorrectly give to confidence intervals. A frequentist 95% confidence interval is a range produced by a procedure that, over many repeats, captures the truth 95% of the time. The Bayesian version is what stakeholders actually mean.
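A credible interval on the lift can be read straight off the simulated differences. The sketch below computes an equal-tailed interval on the absolute lift (B's rate minus A's); the function name `lift_credible_interval`, the draw count, and the seed are assumptions, and other interval definitions (such as highest-density intervals) are equally valid choices.

```python
import random


def lift_credible_interval(alpha_a, beta_a, alpha_b, beta_b,
                           level=0.95, draws=50_000, seed=0):
    """Equal-tailed credible interval for theta_B - theta_A from posterior draws."""
    rng = random.Random(seed)
    diffs = sorted(
        rng.betavariate(alpha_b, beta_b) - rng.betavariate(alpha_a, beta_a)
        for _ in range(draws)
    )
    tail = (1 - level) / 2
    lower = diffs[int(tail * draws)]            # lower-tail quantile
    upper = diffs[int((1 - tail) * draws) - 1]  # upper-tail quantile
    return lower, upper


# Posteriors for the product-card example: Beta(1, 1) prior plus the observed counts.
low, high = lift_credible_interval(1 + 412, 1 + (9800 - 412),
                                   1 + 461, 1 + (9750 - 461))
print(f"95% credible interval on absolute lift: [{low:.4f}, {high:.4f}]")
```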
Avoid it when you need pre-registered fixed-sample designs (regulated reporting, scientific publication), when stakeholders are already fluent in p-values and switching adds friction, or when you have no reasonable basis for a prior and the decision is high-stakes enough that the prior choice would be contested.
It's one branch of statistical analysis applied to experimentation; the other major branch on the conversion side is frequentist NHST. The same Bayesian machinery also powers attribution modelling, conversion-rate forecasting, and segmentation — so investing in Bayesian fluency pays off beyond just A/B testing.
Yes — every experiment in Metricuno reports posterior probability, credible interval on the lift, and expected loss by default. You can set your own probability and loss thresholds per test, and the platform refuses to declare a winner until both conditions are met and the minimum runtime has elapsed.