Confidence Intervals

Metricuno
May 17, 2026
Confidence intervals show the range your true A/B test effect likely falls in. Learn the formula, typical widths, and how to read them in CRO.
Quick answer

A confidence interval shows the range your true A/B test lift likely sits inside — far more useful than a binary "significant or not" p-value verdict.

Definition
Statistical Analysis

Confidence Interval

A range that likely contains the true effect of a test, reported at a stated confidence level, usually 95%.

A confidence interval (CI) is the range of values your true A/B test effect is plausibly inside, given the data you collected. A 95% CI means that if you repeated the experiment many times, 95% of the intervals you'd construct would contain the real underlying lift.
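The repeated-experiment reading of "95%" can be checked with a small simulation. This is a minimal sketch, not a production tool: it assumes the normal-approximation formula given in the Formula section, and the true rate, sample size, and function names (`wald_ci`, `coverage`) are hypothetical choices for illustration.

```python
import math
import random

def wald_ci(conversions, n, z=1.96):
    """Normal-approximation interval for a conversion rate."""
    p_hat = conversions / n
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

def coverage(true_rate=0.04, n=2_000, experiments=1_000, seed=7):
    """Fraction of simulated experiments whose 95% CI contains true_rate."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(experiments):
        # Simulate n visitors, each converting with probability true_rate.
        conversions = sum(rng.random() < true_rate for _ in range(n))
        low, high = wald_ci(conversions, n)
        hits += low <= true_rate <= high
    return hits / experiments

print(f"coverage: {coverage():.3f}")  # expect a value close to 0.95
```

Any single interval either contains the true rate or it does not; the 95% describes how often the procedure succeeds across many runs.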

Unlike a p-value, which collapses a result into a yes/no significance verdict, a CI shows both the direction of the effect and how uncertain its magnitude is. A variant with a +4% lift and a CI of [+2%, +6%] calls for a very different decision than the same +4% lift with a CI of [-1%, +9%]: the first interval excludes zero, while the second still allows for a small loss, even though both show the same point estimate on the dashboard.

Also known as
CI
95% confidence interval

Most CRO platforms quietly default to a 95% confidence level, meaning the procedure produces an interval that captures the true effect 19 times out of 20. Tighten to 99% and the interval gets wider; loosen to 90% and it narrows, but the narrower interval misses the true effect more often (about 1 time in 10 instead of 1 in 20).
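To see how much the confidence level changes the width, the sketch below derives the z-score for each level from the standard normal distribution and applies it to one hypothetical test (a 4% observed rate on 10,000 visitors). The helper names `z_for` and `half_width` are made up for this example.

```python
import math
from statistics import NormalDist

def z_for(confidence):
    """Two-sided z-score for a confidence level, e.g. 0.95 -> about 1.96."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

def half_width(p_hat, n, confidence):
    """Half-width of the normal-approximation interval."""
    return z_for(confidence) * math.sqrt(p_hat * (1 - p_hat) / n)

for level in (0.90, 0.95, 0.99):
    hw = half_width(0.04, 10_000, level)
    print(f"{level:.0%}: z = {z_for(level):.3f}, half-width = ±{hw:.2%}")
```

The same data gives a wider interval at 99% than at 90%; only the stated confidence changes, not the evidence.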

For an online store, the practical question is rarely 'is this lift real?' but 'is the worst-case end of the interval still worth shipping?'. If the lower bound of a checkout test sits at +0.3%, you ship cautiously. If it sits at -2%, the variant might actually be hurting revenue even though the point estimate looks positive.

Formula

CI = p̂ ± z * sqrt( p̂ * (1 - p̂) / n )

Variables

p̂

Observed conversion rate

The conversion rate measured in the variant (e.g. 0.043 for 4.3%).

z

Z-score for the confidence level

1.96 for 95% CI, 2.576 for 99%, 1.645 for 90%.

n

Sample size

Number of visitors assigned to the variant.

Worked example

A Shopify apparel store runs a product-page test. The variant gets 12,000 visitors and 540 add-to-carts (4.50% conversion). What's the 95% confidence interval for the variant's true conversion rate?

p̂ (observed rate): 0.045

z (95% level): 1.96

n (visitors): 12000

95% CI = 4.50% ± 0.37% = [4.13%, 4.87%]

The true conversion rate of this variant is very likely between 4.13% and 4.87%. If the control sits at 4.10%, the lower bound of the variant barely clears it. Because the control's rate is itself an estimate with its own interval, this is a positive but fragile result worth re-running before rolling out.
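The arithmetic in this worked example can be reproduced in a few lines. The function name `wald_ci` is hypothetical; the inputs are the 540 add-to-carts and 12,000 visitors from the example.

```python
import math

def wald_ci(conversions, visitors, z=1.96):
    """Return the observed rate, the half-width, and the interval bounds."""
    p_hat = conversions / visitors
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / visitors)
    return p_hat, half_width, (p_hat - half_width, p_hat + half_width)

p_hat, half_width, (low, high) = wald_ci(540, 12_000)
print(f"{p_hat:.2%} ± {half_width:.2%} = [{low:.2%}, {high:.2%}]")
# prints: 4.50% ± 0.37% = [4.13%, 4.87%]
```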

For a given baseline rate, the width of the CI is driven almost entirely by sample size. Doubling traffic doesn't halve the interval; it shrinks it by a factor of the square root of 2, which makes it about 29% narrower. That's why low-traffic stores often see wide, indecisive intervals even after weeks of testing, and why running tests to a pre-committed sample size matters more than chasing 'significance' day to day.
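The square-root relationship also runs in reverse: rearranging the formula for n gives a rough sample size for a target half-width. The sketch below assumes the same normal-approximation formula; `visitors_needed` and the ±0.20 percentage point target are hypothetical.

```python
import math

def half_width(p, n, z=1.96):
    """Half-width of the normal-approximation interval."""
    return z * math.sqrt(p * (1 - p) / n)

def visitors_needed(p, target_half_width, z=1.96):
    """Smallest n whose half-width is at most target_half_width."""
    return math.ceil(z**2 * p * (1 - p) / target_half_width**2)

ratio = half_width(0.04, 20_000) / half_width(0.04, 10_000)
print(f"doubling traffic scales the half-width by {ratio:.3f}")  # 1/sqrt(2), about 0.707

# Visitors per variant for a ±0.20 percentage point interval on a 4% baseline.
print(visitors_needed(0.04, 0.002))
```

Because n grows with the inverse square of the target width, halving the interval takes four times the traffic.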

Benchmark

Typical 95% CI width for an A/B test on a 4% baseline conversion rate, by sample size per variant

Visitors per variant   Conversions   Observed rate   95% CI           CI half-width
1,000                  40            4.00%           [2.79%, 5.21%]   ±1.21%
5,000                  200           4.00%           [3.46%, 4.54%]   ±0.54%
10,000                 400           4.00%           [3.62%, 4.38%]   ±0.38%
25,000                 1,000         4.00%           [3.76%, 4.24%]   ±0.24%
50,000                 2,000         4.00%           [3.83%, 4.17%]   ±0.17%
100,000                4,000         4.00%           [3.88%, 4.12%]   ±0.12%

Read CIs against a minimum detectable effect, not against zero. If your store needs at least a +2% relative lift to be worth rolling out (engineering cost, risk of regression), then a variant whose CI is [+0.5%, +3.5%] is genuinely ambiguous — even if the point estimate is +2%. The interval tells you the decision isn't ready yet.
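One way to make this reading explicit is a three-way rule that compares both ends of the interval with the smallest lift worth shipping. This is only an illustrative sketch: `read_interval`, its labels, and the thresholds are hypothetical, and a real roll-out decision would also weigh cost and risk.

```python
def read_interval(lower, upper, min_lift):
    """Classify a lift CI against the smallest lift worth shipping."""
    if lower >= min_lift:
        return "ship"          # even the worst case clears the bar
    if upper < min_lift:
        return "drop"          # even the best case falls short
    return "keep testing"      # the bar sits inside the interval

print(read_interval(0.005, 0.035, min_lift=0.02))   # keep testing
print(read_interval(0.025, 0.060, min_lift=0.02))   # ship
print(read_interval(-0.010, 0.015, min_lift=0.02))  # drop
```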

Frequently asked

Confidence intervals: common questions

A p-value tells you how surprising the result would be if there were no real effect — a single number, usually compared to 0.05. A confidence interval shows the range of plausible true effects. Two tests can have the same p-value but very different CIs, and the CI is what tells you whether the magnitude is meaningful.

A 95% CI is commonly read as "there is a 95% probability that the true value lies inside this interval". That reading is technically incorrect under frequentist statistics. The 95% refers to the procedure: 95% of intervals built this way would contain the true value. For practical CRO decisions the looser interpretation is fine, but Bayesian credible intervals give you that probability statement directly.

95% is the default for a reason — it balances false positives against test duration. Drop to 90% only for low-stakes tests where speed matters more than precision. Go to 99% for changes that touch checkout, pricing, or anything where a wrong call costs real revenue.

When an interval is too wide to act on, the cause is almost always sample size. Wide intervals mean you don't have enough conversions yet: either let the test run longer, increase traffic allocation, or accept that low-volume pages can't produce tight intervals in a reasonable timeframe.

A test whose CI crosses zero can still be useful. Crossing zero just means you can't rule out 'no effect', but if the bulk of the interval is positive and your downside is small, it can still inform a directional bet. Just don't ship and claim a definitive win.

If a 95% CI for the difference between variant and control excludes zero, the result is statistically significant at p < 0.05. They're two views of the same underlying calculation, but the CI carries more information because it shows magnitude.
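The link between the two views can be sketched as follows. The code assumes a normal-approximation interval for the difference in rates with an unpooled standard error, and uses that same standard error for the p-value so the two agree exactly; a given platform may use a different method. The function name `diff_ci` and the counts are hypothetical.

```python
import math
from statistics import NormalDist

def diff_ci(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Wald CI and two-sided p-value for (rate_b - rate_a), unpooled SE."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = p_b - p_a
    p_value = 2 * (1 - NormalDist().cdf(abs(diff) / se))
    return diff, (diff - z_crit * se, diff + z_crit * se), p_value

# Hypothetical counts: control 480/12,000, variant 540/12,000.
diff, (low, high), p_value = diff_ci(480, 12_000, 540, 12_000)
print(f"diff = {diff:+.2%} points, 95% CI [{low:+.2%}, {high:+.2%}], p = {p_value:.3f}")
```

With these counts the interval for the difference just includes zero and the p-value sits just above 0.05, so both views report the same borderline result.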

Report both absolute and relative lift if you can. Absolute (e.g. +0.4 percentage points) is what your stats engine computes natively. Relative (e.g. +10%) is what stakeholders want to hear. Be explicit about which one a given interval refers to, because mixing them up is a classic reporting mistake.

Confidence intervals also work for revenue and AOV, but the math is different. Revenue and AOV are continuous and often skewed, so you need a t-distribution or bootstrap rather than the binomial formula used for conversion rate. Most experimentation platforms handle this automatically; just check which method yours uses.
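A percentile bootstrap is one simple way to build such an interval. The sketch below is a minimal illustration under stated assumptions: the order values are simulated from a log-normal distribution rather than taken from a real store, and `bootstrap_mean_ci` is a hypothetical helper, not any platform's method.

```python
import random
import statistics

def bootstrap_mean_ci(values, confidence=0.95, resamples=2_000, seed=11):
    """Percentile bootstrap CI for the mean of a skewed metric."""
    rng = random.Random(seed)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(resamples)
    )
    tail = (1 - confidence) / 2
    low = means[round(tail * resamples)]
    high = means[round((1 - tail) * resamples) - 1]
    return low, high

# Hypothetical order values: log-normal, so a few large orders skew the mean.
rng = random.Random(3)
orders = [rng.lognormvariate(4.0, 0.8) for _ in range(500)]
low, high = bootstrap_mean_ci(orders)
print(f"mean order value {statistics.fmean(orders):.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```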

Standard CIs assume you check the result once at a fixed sample size. If you peek daily and stop when the interval looks good, you inflate false positives. Use sequential or always-valid CIs (offered by most modern testing tools) if you want to monitor results continuously without breaking the math.
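The inflation from peeking shows up clearly in a simulated A/A test, where both arms share the same true rate and every "significant" result is a false positive. This is a rough sketch: the rate, traffic, number of days, and function names are hypothetical, and the interval is the unpooled normal approximation for a difference in rates.

```python
import math
import random

def excludes_zero(conv_a, conv_b, n, z=1.96):
    """True if the 95% Wald CI for the rate difference excludes zero."""
    p_a, p_b = conv_a / n, conv_b / n
    se = math.sqrt((p_a * (1 - p_a) + p_b * (1 - p_b)) / n)
    return se > 0 and abs(p_b - p_a) > z * se

def false_positive_rates(rate=0.10, daily=300, days=10, experiments=300, seed=5):
    """A/A tests: stopping at the first 'good' peek versus one final look."""
    rng = random.Random(seed)
    peeking = fixed = 0
    for _ in range(experiments):
        conv_a = conv_b = 0
        flagged = False
        for day in range(1, days + 1):
            conv_a += sum(rng.random() < rate for _ in range(daily))
            conv_b += sum(rng.random() < rate for _ in range(daily))
            significant = excludes_zero(conv_a, conv_b, day * daily)
            flagged = flagged or significant  # would have stopped on this day
        peeking += flagged
        fixed += significant  # result of the final day's check only
    return peeking / experiments, fixed / experiments

peek_rate, fixed_rate = false_positive_rates()
print(f"false positives with daily peeking: {peek_rate:.1%}, single look: {fixed_rate:.1%}")
```

The single-look rate should land near the nominal 5%, while the peeking rate comes out noticeably higher because each extra look is another chance to be fooled by noise.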

Confidence intervals are one of the core tools of inferential statistical analysis, alongside hypothesis tests, effect sizes, and power calculations. In CRO specifically, CIs are usually the most decision-relevant output because they translate directly into 'how much lift, with how much certainty', which is exactly what a roll-out decision needs.
