P-Values

Metricuno
May 17, 2026
P-Values — What a p-value really means in CRO, how to calculate it, and the thresholds to use. Plain-English guide with a worked Shopify A/B test example.
Quick answer

A p-value tells you how surprising your test result would be if the variant changed nothing — not the probability your variant wins. Here's how to read it correctly.

Definition
Statistical Analysis

P-Value

The probability of seeing data at least as extreme as observed, assuming the null hypothesis (no real difference) is true.

A p-value is a conditional probability that answers one narrow question: if your variant truly had no effect, how often would random visitor noise alone produce a result this lopsided or more? A small p-value (say 0.02) means that kind of fluke is rare under the null hypothesis, so the data is hard to square with "no difference." A large p-value (0.40) means what you saw is easily explained by chance.

The number does not tell you the probability your variant wins, the size of the lift, or whether the test was worth running. It only tells you how compatible your data is with the assumption that nothing changed.

Also known as
significance value
observed significance level
p

The p-value is the single most misread number in conversion optimisation. Tools display it as a confidence score, dashboards turn it green at 0.05, and teams ship variants the moment it crosses the line — without checking the lift, the sample, or whether the test was peeked at daily.

Used correctly, a p-value is a decision aid inside a wider statistical analysis: one input alongside effect size, confidence interval, and business impact. Used badly, it becomes a green-light generator that ships noise as wins and erodes trust in your testing program.

Formula

p = P(|Z| >= |z_observed| | H0)

Variables

p (p-value): Probability of a result at least as extreme as observed, under the null.

z_observed (observed z-score): Standardised difference between variant and control conversion rates.

H0 (null hypothesis): The assumption that variant and control have identical true conversion rates.
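When the test statistic is approximately standard normal, the formula above has a closed form via the complementary error function. The following is a minimal Python sketch under that assumption (large samples); the function name `two_sided_p_value` is illustrative and not taken from any testing tool.

```python
import math

def two_sided_p_value(z_observed: float) -> float:
    """Return P(|Z| >= |z_observed|) for a standard normal Z under H0."""
    # For a standard normal Z, P(|Z| >= a) equals erfc(a / sqrt(2)).
    return math.erfc(abs(z_observed) / math.sqrt(2))

# The familiar 1.96 cutoff corresponds to a two-sided p-value of about 0.05.
print(round(two_sided_p_value(1.96), 3))
```

The sign of the z-score does not matter here, because a two-sided test counts extreme results in both directions.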

Worked example

A Shopify apparel store runs an A/B test on a new product page layout. Control: 12,000 visitors, 360 orders (3.00% CVR). Variant: 12,000 visitors, 420 orders (3.50% CVR). With a pooled conversion rate of 3.25%, the standard error of the difference is about 0.229 percentage points, so z ≈ 0.50 / 0.229 ≈ 2.18.

Control conversion rate: 3.00%

Variant conversion rate: 3.50%

Visitors per arm: 12,000

Observed z-score: 2.18

p ≈ 0.029 (two-sided)

If the new layout truly had no effect, you'd see a gap this large or larger about 3 times in 100. That's below the conventional 0.05 threshold, so you reject the null — but the 95% CI on lift still spans roughly +1% to +32% relative, so size your rollout decision around the lower bound, not the point estimate.
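The worked example can be recomputed from the raw counts. The sketch below assumes a pooled two-proportion z-test for the p-value and a simple Wald interval on the absolute difference, divided by the control rate, for the lift range. Both function names are hypothetical, and treating the control rate as fixed when converting to relative lift is a simplification, so the interval is approximate.

```python
import math

def two_proportion_test(orders_a, visitors_a, orders_b, visitors_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    rate_a = orders_a / visitors_a
    rate_b = orders_b / visitors_b
    pooled = (orders_a + orders_b) / (visitors_a + visitors_b)
    # Standard error of the difference, assuming both arms share one true rate.
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (rate_b - rate_a) / se_pooled
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p

def relative_lift_ci(orders_a, visitors_a, orders_b, visitors_b, z_crit=1.96):
    """Approximate 95% CI on relative lift (Wald interval / control rate)."""
    rate_a = orders_a / visitors_a
    rate_b = orders_b / visitors_b
    # Unpooled standard error, since the interval does not assume H0.
    se = math.sqrt(rate_a * (1 - rate_a) / visitors_a
                   + rate_b * (1 - rate_b) / visitors_b)
    diff = rate_b - rate_a
    return (diff - z_crit * se) / rate_a, (diff + z_crit * se) / rate_a

z, p = two_proportion_test(360, 12_000, 420, 12_000)
low, high = relative_lift_ci(360, 12_000, 420, 12_000)
print(f"z = {z:.2f}, p = {p:.3f}, lift CI = {low:+.1%} to {high:+.1%}")
```

Other interval methods (for example, one built on the log of the rate ratio) give slightly different bounds, which is one reason to quote the range as approximate.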

Conventional thresholds (0.05, 0.01, 0.10) are inherited from agricultural and medical research, not derived from any property of e-commerce traffic. Your threshold should reflect the cost of a false positive — shipping a losing checkout change is expensive; testing a button colour is not.

Benchmark

Typical p-value thresholds used in CRO and where each fits

| Alpha (α) | Confidence | Where it fits | False-positive rate |
| --- | --- | --- | --- |
| 0.10 | 90% | Low-risk surface tests (copy, imagery, above-the-fold tweaks) | 1 in 10 |
| 0.05 | 95% | Default for most product-page and PDP experiments | 1 in 20 |
| 0.01 | 99% | Checkout, pricing, and shipping logic changes | 1 in 100 |
| 0.001 | 99.9% | Platform-wide rollouts affecting all SKUs | 1 in 1,000 |

Two failure modes dominate. First, peeking: checking the p-value daily and stopping the test the moment it dips below 0.05 inflates your real false-positive rate well above the nominal threshold. Second, ignoring lift: a p-value of 0.04 on a 0.3% relative lift is statistically significant and commercially meaningless. Always pair the p-value with the confidence interval on lift before you call a winner.
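The peeking problem is easy to see in a simulation of A/A tests, where no real difference exists. The sketch below is illustrative only: it assumes each look adds one equal-sized batch of traffic and uses a normal approximation for the batch-level difference, rather than simulating individual visitors. The function name and its parameters are hypothetical.

```python
import math
import random

def peeking_false_positive_rate(looks=20, simulations=5000, z_crit=1.96, seed=0):
    """Simulate A/A tests (no true effect) and compare two stopping rules."""
    rng = random.Random(seed)
    stopped_early = 0   # tests that looked "significant" at any interim look
    fixed_horizon = 0   # tests that looked "significant" at the final look only
    for _ in range(simulations):
        running_sum = 0.0
        crossed = False
        z = 0.0
        for look in range(1, looks + 1):
            # Under H0, each batch's standardised difference is a N(0, 1) draw.
            running_sum += rng.gauss(0.0, 1.0)
            z = running_sum / math.sqrt(look)  # cumulative z-score so far
            if abs(z) >= z_crit:
                crossed = True
        stopped_early += crossed
        fixed_horizon += abs(z) >= z_crit
    return stopped_early / simulations, fixed_horizon / simulations

peek_rate, fixed_rate = peeking_false_positive_rate()
print(f"peeking: {peek_rate:.1%}, fixed horizon: {fixed_rate:.1%}")
```

With 20 looks, the stop-at-first-crossing rule flags far more A/A tests than the nominal 5%, while evaluating only at the planned end stays close to 5%.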

Frequently asked

P-values in A/B testing — FAQ

The p-value is not the probability that your variant wins. It is the probability of the data given the null, not the probability of the hypothesis given the data. To get a "chance of winning" number you need a Bayesian approach with a prior, which is what most modern testing tools use under the hood when they show such a figure.

Statistical significance is a binary decision (significant / not significant) made by comparing the p-value to a pre-set threshold (alpha). The p-value is the continuous number; significance is the verdict you make from it.

Stopping a test early is safe only if you used a sequential testing method (like always-valid p-values or group sequential designs) that accounts for repeated looks. Stopping a fixed-horizon test early on a low p-value can push your true false-positive rate to 20-30% with daily checks over a few weeks.

A two-sided p-value tests whether the variant differs from control in either direction. A one-sided p-value only tests the direction you hypothesised. One-sided halves the p-value but commits you to ignoring negative results — rarely the right trade in CRO.

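A high p-value on a lift that looks large usually comes down to sample size. With 2,000 visitors per arm, even a 10% relative lift can leave p at 0.20 or higher because random noise easily explains the gap; at a 3% baseline conversion rate (60 vs 66 orders) it is about 0.59. Run a sample-size calculator before launching the test, not after.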
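A sample-size calculation of the kind mentioned above can be sketched with the standard normal-approximation formula for comparing two proportions. The 80% power default, the function name `visitors_per_arm`, and the 3% baseline in the usage line are assumptions for illustration, not values from a specific calculator.

```python
import math
from statistics import NormalDist

def visitors_per_arm(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate visitors needed per arm for a two-sided two-proportion test."""
    variant_rate = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_power = NormalDist().inv_cdf(power)          # quantile for the target power
    variance_sum = (baseline_rate * (1 - baseline_rate)
                    + variant_rate * (1 - variant_rate))
    n = (z_alpha + z_power) ** 2 * variance_sum / (variant_rate - baseline_rate) ** 2
    return math.ceil(n)

# Hypothetical scenario: 3% baseline CVR, aiming to detect a 10% relative lift.
print(visitors_per_arm(0.03, 0.10))
```

Under these assumptions the requirement is roughly 53,000 visitors per arm, far more than 2,000, which is why small tests so often end with an inconclusive p-value.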

A p-value just under 0.05 and one just over it are not meaningfully different: the underlying evidence is nearly identical. The 0.05 cutoff is a convention, not a law of nature. Report the actual p-value, the effect size, and the confidence interval, and let stakeholders decide based on business risk.

Bayesian tests do not produce p-values. They output the probability that the variant beats control and the expected loss, which map more directly to business decisions. Many CRO teams prefer Bayesian methods for that reason, though frequentist p-values remain the default in most legacy tools.
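For comparison, a "probability that the variant beats control" figure can be sketched with a Beta-Binomial model and Monte Carlo draws. This assumes a uniform Beta(1, 1) prior on each conversion rate, which is one common choice and not necessarily what any given tool uses. The function name is hypothetical; the counts in the usage line are the ones from the worked example.

```python
import random

def probability_variant_beats_control(orders_a, visitors_a, orders_b, visitors_b,
                                      draws=20_000, seed=0):
    """Estimate P(variant rate > control rate) from Beta posteriors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # With a uniform prior, the posterior is Beta(1 + orders, 1 + non-orders).
        rate_a = rng.betavariate(1 + orders_a, 1 + visitors_a - orders_a)
        rate_b = rng.betavariate(1 + orders_b, 1 + visitors_b - orders_b)
        wins += rate_b > rate_a
    return wins / draws

# Control: 360 orders / 12,000 visitors. Variant: 420 orders / 12,000 visitors.
print(probability_variant_beats_control(360, 12_000, 420, 12_000))
```

Note that this number answers a different question from the p-value: it is a statement about the hypothesis given the data and a prior, not about the data given the null.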

P-values and confidence intervals come from the same underlying calculation. If a 95% CI on lift excludes zero, the two-sided p-value is below 0.05. The CI tells you the plausible range of the effect; the p-value compresses that into a single "is it zero?" answer.

Larger samples shrink the standard error, so the same observed lift produces a smaller p-value. This is why tiny lifts become "significant" in massive tests — the test detects a real but commercially trivial effect.
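The effect of sample size can be illustrated by holding the observed lift fixed and growing the traffic. The sketch below assumes equal-sized arms, a pooled two-proportion z-test, and observed rates that land exactly on a hypothetical 3% baseline with a 1% relative lift. The function name is illustrative.

```python
import math

def p_value_for_lift(baseline_rate, relative_lift, visitors_per_arm):
    """Two-sided p-value if the observed rates match the given lift exactly."""
    variant_rate = baseline_rate * (1 + relative_lift)
    pooled = (baseline_rate + variant_rate) / 2  # equal arms, so a simple mean
    se = math.sqrt(pooled * (1 - pooled) * 2 / visitors_per_arm)
    z = (variant_rate - baseline_rate) / se
    return math.erfc(abs(z) / math.sqrt(2))

# Same 1% relative lift, increasingly large tests.
for n in (10_000, 100_000, 1_000_000, 10_000_000):
    print(f"{n:>10,} visitors per arm: p = {p_value_for_lift(0.03, 0.01, n):.4f}")
```

The lift never changes, yet the p-value falls from near 1 toward zero as the arms grow, which is why the p-value should be read alongside the size of the lift.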

A stricter threshold is worth using for high-stakes tests. For checkout flow, pricing, or shipping changes where a false positive costs real revenue, a 0.01 threshold (99% confidence) is the standard upgrade. Reserve 0.05 for lower-stakes surface changes.
