Statistical Significance Calculator

Metricuno

May 17, 2026

5 min read

Free statistical significance calculator for A/B tests. Enter visitors and conversions per variant to get the p-value, confidence level, and lift interval.

Quick answer

Plug in your A/B test's visitors and conversions per variant to get the p-value, confidence level, and the confidence interval around the observed lift.

Definition

Experimentation

Statistical Significance Calculator

A tool that computes the p-value and confidence interval for an A/B test from each variant's visitors and conversions.

A statistical significance calculator takes the raw outcome of a two-variant experiment — visitors and conversions for control and treatment — and tells you whether the observed difference in conversion rate is likely a real effect or noise. It runs a two-proportion z-test, returning a p-value, a confidence level (typically 90%, 95%, or 99%), and a confidence interval around the lift.

It is used after a test has ended, not during. Peeking at significance day by day inflates false-positive rates; the calculator's output is only valid when sample size and test duration were planned in advance.

Also known as

A/B test significance calculator

p-value calculator

Conversion test calculator

Calculator

A/B Test Significance Calculator

Inputs: control visitors, control conversions, variant visitors, variant conversions, and significance level (α). An α of 0.05 means a 5% false-positive rate (the standard choice).

Example result: p = 0.0290 (significant), z-score 2.184, relative lift 20.00%, conversion rates 2.55% (control) → 3.06% (variant).

Enter the totals you'd see at the end of a test — full visitor and conversion counts for each variant. The widget runs a two-proportion z-test and returns the p-value, observed lift, and the confidence interval around that lift.

The result you care about is the p-value. If it falls below your chosen threshold (typically 0.05 for 95% confidence), the difference between control and variant is unlikely to be chance alone. If it doesn't, the test is inconclusive — which is not the same as "the variant lost."

The confidence interval is the part most teams under-use. A 95% CI of [+0.1pp, +1.2pp] on the absolute lift tells you the variant is probably better, but the true lift could be small. That range matters more than the point estimate when deciding whether to ship.

The math behind the calculator

Formula

z = (p_B - p_A) / sqrt( p_pool * (1 - p_pool) * (1/n_A + 1/n_B) )

Variables

p_A: Control conversion rate. Conversions in control divided by visitors in control.

p_B: Variant conversion rate. Conversions in variant divided by visitors in variant.

n_A: Control sample size. Total visitors who saw the control.

n_B: Variant sample size. Total visitors who saw the variant.

p_pool: Pooled conversion rate. Combined conversions divided by combined visitors across both variants.
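If it helps to see the formula as code, here is a minimal Python sketch of the pooled z-score. The function name and the example counts (200 and 250 conversions on 10,000 visitors each) are illustrative assumptions, not values taken from the calculator.

```python
import math

def two_proportion_z(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Pooled two-proportion z-score for control (A) vs variant (B)."""
    p_a = conv_a / n_a                        # control conversion rate
    p_b = conv_b / n_b                        # variant conversion rate
    p_pool = (conv_a + conv_b) / (n_a + n_b)  # pooled conversion rate
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Hypothetical counts, chosen only for illustration.
z = two_proportion_z(conv_a=200, n_a=10_000, conv_b=250, n_b=10_000)
print(f"z = {z:.2f}")
```

A positive z means the variant converted better than the control; the sign flips if you swap the two groups.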

Worked example

A Shopify apparel store runs a new product-page layout against the current one for two weeks.

Control visitors (n_A): 8,500

Control conversions: 255 (3.00%)

Variant visitors (n_B): 8,500

Variant conversions: 306 (3.60%)

Pooled rate (p_pool): 3.30%

→ z ≈ 2.19, two-tailed p ≈ 0.029

The variant's 0.6 percentage-point lift (a 20% relative improvement) clears the 95% confidence bar. The 95% confidence interval on the lift is roughly [+0.06pp, +1.14pp], so the true effect is positive but could be modest — worth shipping, but don't forecast +20% in next quarter's plan.
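The interval quoted above can be reproduced with a simple unpooled (Wald) interval on the difference in rates. This is a sketch under that assumption: the page does not say which interval method the calculator uses, and the function name is hypothetical.

```python
import math
from statistics import NormalDist

def lift_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Wald interval for the absolute lift p_B - p_A (assumed method)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Unpooled standard error: each variant contributes its own variance.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    diff = p_b - p_a
    return diff - z_crit * se, diff + z_crit * se

# Counts from the worked example: 255/8,500 vs 306/8,500.
low, high = lift_confidence_interval(255, 8_500, 306, 8_500)
print(f"[{low * 100:+.2f}pp, {high * 100:+.2f}pp]")  # prints [+0.06pp, +1.14pp]
```

Note that the interval uses each variant's own rate rather than the pooled rate from the z-test, because it describes the size of the difference rather than testing for no difference.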

The p-value is derived from the z-score using the standard normal distribution. A two-tailed test (the default for most A/B testing) doubles the one-tailed p-value because you didn't pre-specify direction — the variant could have won or lost, and either matters.
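To make the z-to-p conversion concrete, here is a small sketch using only Python's standard library. The function name is an assumption; z = 1.96 is used because it is the familiar cutoff for 95% confidence.

```python
import math

def p_values_from_z(z: float) -> tuple[float, float]:
    """Return (one-tailed, two-tailed) p-values for a z-score."""
    # Upper-tail area of the standard normal beyond |z|.
    one_tailed = 0.5 * math.erfc(abs(z) / math.sqrt(2))
    # Two-tailed: the same area counted in both directions.
    return one_tailed, 2 * one_tailed

one_tailed, two_tailed = p_values_from_z(1.96)
print(f"one-tailed: {one_tailed:.3f}, two-tailed: {two_tailed:.3f}")
# prints: one-tailed: 0.025, two-tailed: 0.050
```

The one-tailed value here is the tail in the direction of the observed effect, which is why it is always half of the two-tailed figure.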

Typical inputs by store size

Benchmark

What sample size you typically need to detect realistic lifts at 95% confidence, 80% power

Baseline conversion rate	Minimum detectable lift	Visitors per variant	Typical test duration
1.5% (luxury / high AOV)	+20% relative	~28,000	4-8 weeks
2.5% (apparel)	+15% relative	~29,000	3-5 weeks
3.5% (beauty / consumables)	+10% relative	~45,000	2-4 weeks
5.0% (repeat-buy categories)	+10% relative	~31,000	2-3 weeks
8.0% (email-driven traffic)	+8% relative	~29,000	1-2 weeks

If you're below these volumes, you have two honest options: test bigger changes (whole-page redesigns rather than button colour), or accept longer test durations. The calculator will happily return a p-value on 200 visitors per variant — it just won't be meaningful.
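To estimate the volume for your own baseline, one common normal-approximation formula for a two-sided test can be sketched as follows. The function name and the 4% / +15% inputs are hypothetical, and calculators differ in details such as pooled variance or continuity corrections, so treat the output as a ballpark rather than an exact match for any benchmark figure.

```python
import math
from statistics import NormalDist

def visitors_per_variant(baseline, relative_lift, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant (two-sided z-test)."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)         # rate you hope to detect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# Hypothetical store: 4% baseline, hoping to detect a +15% relative lift.
n = visitors_per_variant(baseline=0.04, relative_lift=0.15)
print(f"{n:,} visitors per variant")
```

Halving the lift you want to detect roughly quadruples the required sample, which is why small tweaks need so much more traffic than big changes.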

Reading the result without fooling yourself

A p-value of 0.04 does not mean "96% chance the variant is better." It means: if the variant and control were truly identical, you'd see a difference this large or larger 4% of the time by random chance. Subtle, but the difference matters when stakeholders ask how confident you are.

Also: significance is not the same as business impact. A test can be statistically significant and commercially worthless if the lift is +0.3% on a low-traffic page. Always pair the p-value with the confidence interval and the absolute revenue at stake.
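One way to put the confidence interval next to the revenue at stake is to convert both ends of the interval into money. The sketch below uses the [+0.06pp, +1.14pp] interval from the worked example; the function name, the monthly visitor count, and the average order value are hypothetical placeholders to swap for your own numbers.

```python
def monthly_revenue_range(ci_low_pp, ci_high_pp, monthly_visitors, average_order_value):
    """Convert a CI on absolute lift (percentage points) into extra monthly revenue."""
    low = ci_low_pp / 100 * monthly_visitors * average_order_value
    high = ci_high_pp / 100 * monthly_visitors * average_order_value
    return low, high

# Hypothetical traffic and order value, for illustration only.
low, high = monthly_revenue_range(0.06, 1.14, monthly_visitors=40_000, average_order_value=60.0)
print(f"{low:,.0f} to {high:,.0f} per month")  # prints 1,440 to 27,360 per month
```

A range that wide is a reminder that a significant result pins down the direction of the effect far better than its size.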

Don't check significance every day

Peeking at p-values during a running test and stopping the moment you see p < 0.05 inflates your false-positive rate to roughly 20-30% over a typical test window. Decide your sample size up front, run to that number, then check significance once. If you must monitor, use sequential testing methods (mSPRT, Bayesian) that are designed for it.
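You can see the inflation for yourself with a small simulation of A/A tests, where both arms share the same true conversion rate and any "significant" result is a false positive by construction. Everything in this sketch is an assumption chosen for illustration: a 5% conversion rate, 200 visitors per arm per day, 14 daily looks, and the function names.

```python
import math
import random

def p_value(conv_a, n_a, conv_b, n_b):
    """Two-tailed p-value from the pooled two-proportion z-test."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    if p_pool in (0.0, 1.0):
        return 1.0  # no variation yet, so nothing to test
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / se
    return math.erfc(abs(z) / math.sqrt(2))

def simulate(runs=500, days=14, daily_visitors=200, rate=0.05, alpha=0.05, seed=1):
    """Return (false-positive rate with daily peeking, rate with one final check)."""
    rng = random.Random(seed)
    peeking_hits = final_hits = 0
    for _ in range(runs):
        conv_a = conv_b = n = 0
        ever_significant = False
        for _ in range(days):
            # Both arms draw from the same true rate: there is no real effect.
            conv_a += sum(rng.random() < rate for _ in range(daily_visitors))
            conv_b += sum(rng.random() < rate for _ in range(daily_visitors))
            n += daily_visitors
            if p_value(conv_a, n, conv_b, n) < alpha:
                ever_significant = True  # a daily peeker would have stopped here
        peeking_hits += ever_significant
        final_hits += p_value(conv_a, n, conv_b, n) < alpha
    return peeking_hits / runs, final_hits / runs

peeking_rate, final_rate = simulate()
print(f"daily peeking: {peeking_rate:.1%}, single final check: {final_rate:.1%}")
```

The single final check should land near the nominal 5%, while the daily-peeking rate comes out several times higher. The exact figures depend on the seed and on how many looks you take.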

Frequently asked questions

What p-value should I use to call a test a winner?

Most teams use 0.05 (95% confidence) as the default. Use 0.10 if you're testing low-risk changes and want faster decisions, or 0.01 for changes that touch checkout or pricing where being wrong is expensive.

Should I use a one-tailed or two-tailed test?

Two-tailed is the safe default and the convention in most A/B testing platforms. One-tailed is only appropriate when you'd treat a negative result identically to a flat result, which is rarely true in practice, because a variant that hurts conversion is information you need.

My test hit 95% significance after 3 days. Can I ship it?

No. Early significance is usually noise that regresses as the sample grows, and you also haven't covered a full business cycle (weekday vs weekend traffic, paid vs organic mix). Run to your pre-planned sample size and for a minimum of two full weeks before deciding.

What's the difference between a significance calculator and a sample size calculator?

A sample size calculator is used before the test to determine how many visitors per variant you need to detect a given lift. A significance calculator is used after the test to evaluate whether the result you got is statistically meaningful. Both rely on the same underlying z-test math.

Does this calculator work for revenue per visitor, not just conversion rate?

No. This tool assumes a binary outcome (converted or not). Revenue per visitor has a continuous, usually skewed distribution and needs a t-test or bootstrap method. Use a dedicated revenue-test calculator for those metrics.

Can I use it for tests with more than two variants?

Only by comparing each variant against the control pairwise, and you should apply a Bonferroni or Holm correction to the p-value threshold to account for multiple comparisons. With Bonferroni, divide your alpha by the number of comparisons: with 4 variants each compared against the control, that is 4 comparisons, so test against p < 0.0125 instead of 0.05.

What if my p-value is 0.06? Is that close enough?

Statistically, no. 0.06 means inconclusive at the 95% bar. Practically, look at the confidence interval and the business cost of being wrong. If the CI is mostly positive and the change is reversible, you may decide to ship; just document it as a judgement call, not a win.

How does this work with Shopify or my A/B testing tool?

Most testing tools (VWO, Optimizely, Convert) report significance natively. This calculator is useful when you're running tests in a custom setup, validating a tool's numbers, or working from raw GA4 or Shopify exports where you only have visitor and conversion counts.

What confidence interval will I see, and how do I read it?

The calculator returns a CI on the absolute lift in percentage points (e.g. [+0.1pp, +1.2pp]). If the interval crosses zero, the test is inconclusive. If it sits entirely above zero, the variant is better; the width tells you how precise the estimate is.

Why does my test never reach significance even after weeks?

Usually one of three reasons: the true effect is smaller than you planned for, your baseline conversion rate is lower than estimated, or there is no real effect. Re-run a sample size calculation with the actual observed baseline; you may need 3-5x more visitors, or a bigger change to test.

Test ideas before you ship them

Run unlimited A/B tests, attach hypotheses to outcomes, and build a searchable archive of what works and what doesn't.

Launch your first experiment