Statistical Significance Calculator

Plug in your A/B test's visitors and conversions per variant to get the p-value, confidence level, and the confidence interval around the observed lift.
A statistical significance calculator takes the raw outcome of a two-variant experiment — visitors and conversions for control and treatment — and tells you whether the observed difference in conversion rate is likely a real effect or noise. It runs a two-proportion z-test, returning a p-value, a confidence level (typically 90%, 95%, or 99%), and a confidence interval around the lift.
It is used after a test has ended, not during. Peeking at significance day by day inflates false-positive rates; the calculator's output is only valid when sample size and test duration were planned in advance.
A/B Test Significance Calculator
Inputs: control visitors, control conversions, variant visitors, variant conversions, and significance level (α). An α of 0.05 means a 5% false-positive rate (the standard choice).
Sample output: statistical significance p = 0.0290; z-score 2.184; relative lift 20.00%; conversion rates 2.55% (control) → 3.06% (variant).
Enter the totals you'd see at the end of a test — full visitor and conversion counts for each variant. The widget runs a two-proportion z-test and returns the p-value, observed lift, and the confidence interval around that lift.
The result you care about is the p-value. If it falls below your chosen threshold (typically 0.05 for 95% confidence), the difference between control and variant is unlikely to be chance alone. If it doesn't, the test is inconclusive — which is not the same as "the variant lost."
The confidence interval is the part most teams under-use. A 95% CI of [+1.2%, +9.4%] tells you the variant is probably better, but the true lift could be small. That range matters more than the point estimate when deciding whether to ship.
The math behind the calculator
z = (p_B - p_A) / sqrt( p_pool * (1 - p_pool) * (1/n_A + 1/n_B) )
p_A = control conversion rate: conversions in control divided by visitors in control.
p_B = variant conversion rate: conversions in variant divided by visitors in variant.
n_A = control sample size: total visitors who saw the control.
n_B = variant sample size: total visitors who saw the variant.
p_pool = pooled conversion rate: combined conversions divided by combined visitors across both variants.
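The formula translates directly into a few lines of Python. This is a minimal sketch rather than the calculator's actual code: the function name `two_proportion_z_test` is an illustrative choice, and the counts passed in are those of the apparel-store example in this section.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two_tailed_p) for a pooled two-proportion z-test."""
    p_a = conv_a / n_a                      # control conversion rate
    p_b = conv_b / n_b                      # variant conversion rate
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    # Standard error under the null hypothesis that both rates are equal.
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-tailed p-value from the standard normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


z, p = two_proportion_z_test(255, 8_500, 306, 8_500)
print(f"z = {z:.2f}, two-tailed p = {p:.3f}")
```

Only the Python standard library is needed; `statistics.NormalDist` supplies the normal CDF.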
A Shopify apparel store runs a new product-page layout against the current one for two weeks.
Control visitors (n_A): 8,500
Control conversions: 255 (3.00%)
Variant visitors (n_B): 8,500
Variant conversions: 306 (3.60%)
Pooled rate (p_pool): 3.30%
→ z ≈ 2.19, two-tailed p ≈ 0.029
The variant's 0.6 percentage-point lift (a 20% relative improvement) clears the 95% confidence bar. The 95% confidence interval on the lift is roughly [+0.06pp, +1.14pp], so the true effect is positive but could be modest — worth shipping, but don't forecast +20% in next quarter's plan.
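The interval quoted above can be reproduced with a standard Wald interval on the difference in proportions. This is a sketch under that assumption (the page does not say which interval method the calculator uses), and `lift_confidence_interval` is a hypothetical helper name. Note that the interval uses the unpooled standard error, unlike the z-test, which pools the two rates.

```python
from math import sqrt
from statistics import NormalDist


def lift_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Wald interval for the absolute lift p_B - p_A (as proportions)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Unpooled standard error: each variant contributes its own variance.
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    # Critical value, about 1.96 for a 95% interval.
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = p_b - p_a
    return diff - z_crit * se, diff + z_crit * se


low, high = lift_confidence_interval(255, 8_500, 306, 8_500)
print(f"95% CI: [{low * 100:+.2f}pp, {high * 100:+.2f}pp]")
# prints: 95% CI: [+0.06pp, +1.14pp]
```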
The p-value is derived from the z-score using the standard normal distribution. A two-tailed test (the default for most A/B testing) doubles the one-tailed p-value because you didn't pre-specify direction — the variant could have won or lost, and either matters.
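As a small illustration of that doubling, the sketch below converts a z-score into one-tailed and two-tailed p-values with Python's standard library. The z-score of 2.184 is the one shown in the calculator's sample output; the variable names are illustrative.

```python
from statistics import NormalDist

z = 2.184  # z-score from the calculator's sample output

# One-tailed: probability of a z-score at least this large in one direction.
one_tailed = 1 - NormalDist().cdf(abs(z))

# Two-tailed: a difference this extreme in either direction counts.
two_tailed = 2 * one_tailed

print(f"one-tailed p = {one_tailed:.4f}, two-tailed p = {two_tailed:.4f}")
```

The two-tailed value rounds to 0.0290, matching the p-value in the sample output.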
Typical inputs by store size
What sample size you typically need to detect realistic lifts at 95% confidence (two-sided) and 80% power
| Baseline conversion rate | Minimum detectable lift | Visitors per variant | Typical test duration |
|---|---|---|---|
| 1.5% (luxury / high AOV) | +20% relative | ~28,000 | 4-8 weeks |
| 2.5% (apparel) | +15% relative | ~29,000 | 3-5 weeks |
| 3.5% (beauty / consumables) | +10% relative | ~45,000 | 2-4 weeks |
| 5.0% (repeat-buy categories) | +10% relative | ~31,000 | 2-3 weeks |
| 8.0% (email-driven traffic) | +8% relative | ~29,000 | 1-2 weeks |
If you're below these volumes, you have two honest options: test bigger changes (whole-page redesigns rather than button colour), or accept longer test durations. The calculator will happily return a p-value on 200 visitors per variant — it just won't be meaningful.
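To check whether your own traffic is enough, you can estimate the required volume yourself. The sketch below uses the common normal-approximation formula for comparing two proportions with a two-sided test; it is one of several formulas in use, so calculators can disagree by a few thousand visitors. The function name `visitors_per_variant` and its parameters are illustrative assumptions.

```python
from math import ceil
from statistics import NormalDist


def visitors_per_variant(baseline, relative_lift, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant for a two-sided z-test."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)            # rate you hope to detect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


# Example: 2.5% baseline, hoping to detect a +15% relative lift.
n = visitors_per_variant(0.025, 0.15)
print(f"Visitors per variant: {n:,}")
```

Halving the detectable lift roughly quadruples the requirement, which is why small changes on low-traffic stores rarely reach significance.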
Reading the result without fooling yourself
A p-value of 0.04 does not mean "96% chance the variant is better." It means: if the variant and control were truly identical, you'd see a difference this large or larger 4% of the time by random chance. Subtle, but the difference matters when stakeholders ask how confident you are.
Also: significance is not the same as business impact. A test can be statistically significant and commercially worthless if the lift is +0.3% on a low-traffic page. Always pair the p-value with the confidence interval and the absolute revenue at stake.
Don't check significance every day
Peeking at p-values during a running test and stopping the moment you see p < 0.05 inflates your false-positive rate to roughly 20-30% over a typical test window. Decide your sample size up front, run to that number, then check significance once. If you must monitor, use sequential testing methods (mSPRT, Bayesian) that are designed for it.
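A quick simulation shows the inflation. The sketch below models A/A tests, where there is no real difference, under two simplifying assumptions: traffic is equal every day, and the normal approximation holds. Under those assumptions the z-score after k days behaves like the sum of k standard normal draws divided by sqrt(k). The function name, the 14-day window, and the simulation count are all illustrative.

```python
import random
from math import sqrt


def false_positive_rates(days=14, simulations=5000, z_crit=1.96, seed=1):
    """Return (rate when peeking daily, rate when checking once at the end)."""
    rng = random.Random(seed)
    peeking_hits = 0
    final_hits = 0
    for _ in range(simulations):
        cumulative = 0.0
        crossed = False
        for day in range(1, days + 1):
            cumulative += rng.gauss(0, 1)   # one day's worth of null evidence
            z = cumulative / sqrt(day)      # z-score as seen on this day
            if abs(z) > z_crit:
                crossed = True              # a daily peek would have stopped here
        peeking_hits += crossed
        final_hits += abs(z) > z_crit       # single check on the last day
    return peeking_hits / simulations, final_hits / simulations


peeking, final = false_positive_rates()
print(f"Peek daily: {peeking:.1%} false positives; check once: {final:.1%}")
```

The single end-of-test check should land near the nominal 5%, while stopping at the first daily crossing should produce a rate several times higher.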
Frequently asked questions
Most teams use 0.05 (95% confidence) as the default. Use 0.10 if you're testing low-risk changes and want faster decisions, or 0.01 for changes that touch checkout or pricing where being wrong is expensive.
Two-tailed is the safe default and the convention in most A/B testing platforms. One-tailed is only appropriate when you'd treat a negative result identically to a flat result — which is rarely true in practice, because a variant that hurts conversion is information you need.
Do not stop a test early just because it has reached significance. Early significance is usually noise that regresses as the sample grows, and you also haven't covered a full business cycle (weekday vs weekend traffic, paid vs organic mix). Run to your pre-planned sample size and for a minimum of two full weeks before deciding.
A sample size calculator is used before the test to determine how many visitors per variant you need to detect a given lift. A significance calculator is used after the test to evaluate whether the result you got is statistically meaningful. Both rely on the same underlying z-test math.
This tool does not work for revenue metrics such as revenue per visitor: it assumes a binary outcome (converted or not). Revenue per visitor has a continuous, usually skewed distribution and needs a t-test or bootstrap method. Use a dedicated revenue-test calculator for those metrics.
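For a sense of what the bootstrap approach looks like, here is a minimal percentile-bootstrap sketch for the difference in mean revenue per visitor. Everything in it is an assumption made for illustration: the function names, the resample count, and above all the data, which is synthetic (most visitors spend nothing, and buyers draw a log-normal order value).

```python
import random


def bootstrap_mean_diff_ci(control, variant, resamples=1000, confidence=0.95, seed=0):
    """Percentile bootstrap CI for mean(variant) - mean(control)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(resamples):
        # Resample each arm with replacement, keeping its original size.
        c = rng.choices(control, k=len(control))
        v = rng.choices(variant, k=len(variant))
        diffs.append(sum(v) / len(v) - sum(c) / len(c))
    diffs.sort()
    tail = (1 - confidence) / 2
    low = diffs[int(tail * resamples)]
    high = diffs[int((1 - tail) * resamples) - 1]
    return low, high


def simulate_revenue(n, conversion_rate, rng):
    """Synthetic per-visitor revenue: zero unless the visitor converts."""
    return [rng.lognormvariate(4.0, 0.8) if rng.random() < conversion_rate else 0.0
            for _ in range(n)]


data_rng = random.Random(42)
control = simulate_revenue(1000, 0.030, data_rng)
variant = simulate_revenue(1000, 0.036, data_rng)

low, high = bootstrap_mean_diff_ci(control, variant)
print(f"95% CI for revenue-per-visitor difference: [{low:.2f}, {high:.2f}]")
```

With real data, replace the two synthetic lists with one revenue value per visitor, including the zeros for visitors who did not buy.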
You can test more than two variants only by comparing them pairwise, and you should apply a Bonferroni or Holm correction to the p-value threshold to account for multiple comparisons. For Bonferroni, divide your alpha by the number of comparisons: four variants each compared against the control make four comparisons, so test against p < 0.0125 instead of 0.05.
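The two corrections can be sketched as follows. The function names and the four p-values are made up for illustration. Bonferroni compares every p-value against alpha divided by the number of comparisons; Holm sorts the p-values and relaxes the threshold step by step, so it never rejects fewer hypotheses than Bonferroni.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """True where a comparison is significant under Bonferroni."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm_reject(p_values, alpha=0.05):
    """True where a comparison is significant under Holm's step-down method."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # smallest p first
    reject = [False] * m
    for rank, idx in enumerate(order):
        # Threshold loosens from alpha/m to alpha/(m-1), ..., up to alpha.
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one comparison fails, all larger p-values fail too
    return reject


# Hypothetical p-values for four variants, each compared against control.
p_values = [0.010, 0.030, 0.015, 0.200]
print(bonferroni_reject(p_values))  # [True, False, False, False]
print(holm_reject(p_values))        # [True, False, True, False]
```

In this made-up case Holm recovers one extra significant variant (p = 0.015) that Bonferroni's fixed 0.0125 threshold rejects.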
Statistically, a p-value of 0.06 is not close enough: it means inconclusive at the 95% bar. Practically, look at the confidence interval and the business cost of being wrong. If the CI is mostly positive and the change is reversible, you may decide to ship; just document it as a judgement call, not a win.
Most testing tools (VWO, Optimizely, Convert) report significance natively. This calculator is useful when you're running tests in a custom setup, validating a tool's numbers, or working from raw GA4 or Shopify exports where you only have visitor and conversion counts.
The calculator returns a CI on the absolute lift in percentage points (e.g. [+0.1pp, +1.2pp]). If the interval crosses zero, the test is inconclusive. If it sits entirely above zero, the variant is better; the width tells you how precise the estimate is.
A test that fails to reach significance usually does so for one of three reasons: the true effect is smaller than you planned for, your baseline conversion rate is lower than estimated, or there is no real effect. Re-run a sample size calculation with the actual observed baseline; you may need 3-5x more visitors, or a bigger change to test.