Sequential Testing

Metricuno
May 17, 2026
Sequential testing lets you stop A/B tests early without inflating false positives. See the formula, boundaries, and when to use it on your store.
Quick answer

Sequential testing is a family of A/B test designs that lets you peek at results during a test and stop early when the winner is clear — without breaking your false-positive rate.

Definition

A/B test designs that let you check results during the test and stop early without inflating the false-positive rate.

Sequential testing is a family of statistical methods built for the way experimenters actually behave: they look at results before the planned end date. Classic fixed-horizon t-tests assume you analyse the data exactly once, at a pre-declared sample size. Every extra peek silently raises your false-positive rate: with three equally spaced looks at a test run at α = 0.05, the true Type I error is roughly 11%, not 5%.
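The inflation from peeking is easy to check by simulation. The sketch below is a minimal illustration, assuming equally sized batches of data, a two-sided z-test at 1.96, and no true effect (an A/A test); the function name and the simulation size are illustrative choices, not part of any platform's API.

```python
import random
from math import sqrt

def peeking_false_positive_rate(looks, sims=40_000, z_crit=1.96, seed=0):
    """Share of A/A tests (no true effect) declared significant at any look."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(sims):
        running_sum = 0.0
        for k in range(1, looks + 1):
            # Under the null, each equally sized batch adds one N(0, 1) increment.
            running_sum += rng.gauss(0.0, 1.0)
            z_k = running_sum / sqrt(k)  # z-statistic on all data through look k
            if abs(z_k) > z_crit:
                false_positives += 1
                break  # the experimenter "calls it" at the first significant peek
    return false_positives / sims

for looks in (1, 3, 5):
    print(looks, "look(s):", round(peeking_false_positive_rate(looks), 3))
```

With one look the rate should sit near 5%; with three and five looks it should land near 11% and 14%, up to simulation noise.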

Sequential designs (group-sequential boundaries, alpha-spending functions, and always-valid p-values from mixture sequential probability ratio tests, or mSPRT) account for that peeking up front. They let you stop the test as soon as the evidence is decisive, or when a futility boundary shows a win is no longer realistic, with the headline error rate preserved.

Also known as
Always-valid inference
Group sequential testing
Continuous monitoring

The problem sequential testing solves is mundane: every CRO specialist looks at the dashboard mid-test. If the variant is up 12% on day three with a p-value of 0.04, the temptation to call it is enormous — especially when paid traffic is burning behind the test.

With a fixed-horizon design, that early call is a coin flip dressed up as significance. Sequential methods replace the single hard cutoff with a stopping boundary that gets stricter the earlier you look. Cross the boundary, you ship. Don't cross it, you keep running. The expected sample size drops by 20-40% on real winners, and your declared 5% false-positive rate stays a 5% false-positive rate.

Formula

Stop for efficacy at look k if |Z(k)| > b(k), where b(k) is the alpha-spending boundary at look k (two-sided test)

Variables

Z(k)

Test statistic at look k

Standardised difference between variant and control (the estimated effect divided by its standard error), computed from all data collected through analysis point k.

b(k)

Stopping boundary at look k

Critical value from an alpha-spending function (e.g. O'Brien-Fleming or Pocock) that controls cumulative Type I error across all looks.

α

Total false-positive budget

Usually 0.05. The alpha-spending function decides how much of α is 'spent' at each interim look.

k

Look number

Index of the interim analysis (1, 2, ... K), pre-declared as fractions of maximum sample size.
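To make the idea of 'spending' α concrete, here is a minimal sketch of one widely used spending function, the Lan-DeMets O'Brien-Fleming-type function α(t) = 2(1 − Φ(z(α/2) / √t)), where t is the fraction of the maximum sample collected so far. It is an assumption that your platform uses this exact form; the critical values it implies are close to, but not identical to, the classic O'Brien-Fleming boundaries.

```python
from math import sqrt
from statistics import NormalDist

def obf_type_alpha_spent(info_fraction, alpha=0.05):
    """Cumulative two-sided alpha spent by information fraction t (O'Brien-Fleming type)."""
    if not 0.0 < info_fraction <= 1.0:
        raise ValueError("info_fraction must be in (0, 1]")
    std_normal = NormalDist()
    z_half_alpha = std_normal.inv_cdf(1.0 - alpha / 2.0)  # 1.96 for alpha = 0.05
    return 2.0 * (1.0 - std_normal.cdf(z_half_alpha / sqrt(info_fraction)))

# Five equally spaced looks: almost nothing is spent early, the full budget by the end.
for t in (0.2, 0.4, 0.6, 0.8, 1.0):
    print(f"t = {t:.1f}: cumulative alpha spent = {obf_type_alpha_spent(t):.5f}")
```

The amount spent at look k is the difference between consecutive cumulative values, which is why early boundaries under this shape are so strict.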

Worked example

An apparel Shopify store runs a PDP layout test at α = 0.05 (two-sided) with five planned, equally spaced looks using O'Brien-Fleming boundaries. The boundaries are tight early (b(1) ≈ 4.56) and relax near the planned end (b(5) ≈ 2.04).

Look 1 (20% of sample): Z = 2.1, below boundary 4.56, continue

Look 3 (60% of sample): Z = 2.4, below boundary 2.63, continue

Look 4 (80% of sample): Z = 3.1, above boundary 2.28, stop for efficacy

Test stops at look 4 with the new PDP declared winner

The store ships the variant roughly 20% faster than a fixed-horizon design would have allowed, and the 5% false-positive guarantee still holds across all five planned looks.
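The stopping rule in a plan like this can be sketched in a few lines. The snippet assumes the classic O'Brien-Fleming shape b(k) = b(K) · √(K / k) with equally spaced looks and takes the final-look value 2.04 from the example; the function names and the interim z-statistics are hypothetical, chosen only to show the mechanics.

```python
from math import sqrt

def obrien_fleming_boundaries(total_looks, final_critical_value):
    """Classic O'Brien-Fleming shape: b(k) = b(K) * sqrt(K / k), equally spaced looks."""
    return [final_critical_value * sqrt(total_looks / k)
            for k in range(1, total_looks + 1)]

def first_efficacy_stop(z_by_look, boundaries):
    """Return the first look (1-indexed) where |Z(k)| > b(k), or None if none crosses."""
    for look in sorted(z_by_look):
        if abs(z_by_look[look]) > boundaries[look - 1]:
            return look
    return None

boundaries = obrien_fleming_boundaries(total_looks=5, final_critical_value=2.04)
print([round(b, 2) for b in boundaries])  # [4.56, 3.23, 2.63, 2.28, 2.04]

# Hypothetical interim z-statistics keyed by look number (illustrative only).
hypothetical_z = {1: 1.2, 2: 1.9, 3: 2.4, 4: 2.9}
print(first_efficacy_stop(hypothetical_z, boundaries))  # 4
```

The same z-statistic can be a 'continue' at one look and a 'stop' at a later one, because the bar drops as more of the planned sample arrives.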

Different sequential frameworks make different tradeoffs. O'Brien-Fleming is conservative early and lenient late, which suits teams that want a result close to a fixed-horizon test but with optional early stopping. Pocock applies the same critical value at every look, so it spends more of its alpha early and is better at catching strong effects fast. Always-valid p-values (mSPRT) let you peek as often as you like, including continuously, at the cost of slightly larger required samples.
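The two group-sequential shapes can be compared by calibrating each one against simulated null data. This is a rough Monte Carlo sketch, assuming five equally spaced looks and a two-sided α of 0.05; production tools use numerical integration instead, and the function name here is made up for illustration.

```python
import random
from math import sqrt

def calibrate_constants(total_looks=5, alpha=0.05, sims=100_000, seed=1):
    """Monte Carlo estimate of the Pocock and O'Brien-Fleming constants under the null."""
    rng = random.Random(seed)
    max_abs_z = []       # Pocock: one constant boundary on |Z(k)|
    max_abs_scaled = []  # O'Brien-Fleming: constant boundary on |Z(k)| * sqrt(k / K)
    for _ in range(sims):
        running_sum, worst_z, worst_scaled = 0.0, 0.0, 0.0
        for k in range(1, total_looks + 1):
            running_sum += rng.gauss(0.0, 1.0)
            abs_z = abs(running_sum) / sqrt(k)
            worst_z = max(worst_z, abs_z)
            worst_scaled = max(worst_scaled, abs_z * sqrt(k / total_looks))
        max_abs_z.append(worst_z)
        max_abs_scaled.append(worst_scaled)
    cut = int((1.0 - alpha) * sims)  # index of the (1 - alpha) quantile
    return sorted(max_abs_z)[cut], sorted(max_abs_scaled)[cut]

pocock_c, obf_c = calibrate_constants()
pocock = [round(pocock_c, 2)] * 5
obf = [round(obf_c * sqrt(5 / k), 2) for k in range(1, 6)]
print("Pocock:         ", pocock)
print("O'Brien-Fleming:", obf)
```

The Pocock line should come out flat at about 2.41, while the O'Brien-Fleming line should start near 4.56 and end near 2.04. That gap at the final look is the price Pocock pays for easier early stops.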

Benchmark

Sequential testing approaches compared

Approach | Peek frequency | Avg sample vs fixed | Best for
Fixed-horizon (no peeking) | Once, at end | 100% (baseline) | Pre-planned tests with patient stakeholders
O'Brien-Fleming boundaries | 3-5 pre-declared looks | 85-95% | Confirmatory tests, near-fixed-horizon behaviour
Pocock boundaries | 3-5 pre-declared looks | 70-85% | Catching large effects early
Always-valid (mSPRT) | Continuous / any time | 110-130% | Dashboards where PMs check results daily
Bayesian with decision rule | Continuous | 60-90% | Teams with priors and explicit loss functions

Sequential testing sits inside the broader discipline of statistical analysis for experimentation, alongside power calculations, multiple-comparison corrections, and variance reduction techniques like CUPED. The methodology you pick should match how your team actually consumes results — if PMs refresh the dashboard hourly, an always-valid framework beats pretending they won't peek.

Frequently asked

Sequential testing FAQ

Why does peeking at a fixed-horizon test inflate false positives?

A fixed-horizon p-value assumes one comparison at one pre-declared sample size. Every extra look gives random noise another chance to cross the 0.05 threshold. With five unplanned, equally spaced peeks, your real false-positive rate is roughly 14% instead of 5%.

How is sequential testing different from Bayesian A/B testing?

Sequential testing is a frequentist approach: it controls long-run false-positive rates by adjusting critical values across looks. Bayesian A/B testing replaces p-values with posterior probabilities and uses decision rules like 'ship when P(B > A) > 95%'. Both allow continuous monitoring; they differ in how they define error and incorporate prior information.
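A decision rule such as 'ship when P(B > A) > 95%' can be estimated with a few lines of sampling. This sketch assumes conversion-rate data and independent uniform Beta(1, 1) priors on each arm, which is a modelling choice rather than anything the rule requires; the counts are hypothetical.

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=50_000, seed=2):
    """Monte Carlo estimate of P(rate_B > rate_A) under independent Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for each arm is Beta(1 + conversions, 1 + non-conversions).
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws

# Hypothetical counts: 300 of 10,000 convert on control, 360 of 10,000 on the variant.
p_b_better = prob_b_beats_a(300, 10_000, 360, 10_000)
print(round(p_b_better, 3), "ship" if p_b_better > 0.95 else "keep running")
```

Note that this posterior probability is not a p-value: checking it repeatedly and shipping at the first crossing does not by itself guarantee a 5% frequentist false-positive rate.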

When should I use O'Brien-Fleming versus Pocock boundaries?

Use O'Brien-Fleming when you want behaviour close to a fixed-horizon test and only want to stop early for very large effects. Use Pocock when catching strong winners quickly matters more: it applies the same critical value at every look, so early stops are easier but the final-look boundary is stricter.

Do I need to decide the number of looks in advance?

For group-sequential designs (O'Brien-Fleming, Pocock), yes: the boundaries depend on K, the planned number of analyses. For always-valid methods like mSPRT or Bayesian sequential tests, no: you can peek at any cadence, including continuously, without invalidating the inference.
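For intuition on how an always-valid p-value tolerates any peeking cadence, here is a simplified mSPRT sketch. It assumes a single stream of observations (for example, per-user differences) with known standard deviation sigma and a normal mixing distribution with standard deviation tau; real A/B implementations handle two arms and estimated variances, and the function name and parameters are illustrative.

```python
import random
from math import exp, log

def always_valid_p_values(observations, sigma=1.0, tau=1.0, theta0=0.0):
    """Running always-valid p-values for H0: mean == theta0 (known sigma, normal mixture)."""
    p_values = []
    p_running, centred_sum = 1.0, 0.0
    for n, x in enumerate(observations, start=1):
        centred_sum += x - theta0
        pooled = sigma**2 + n * tau**2
        # Log of the mixture likelihood ratio after n observations.
        log_lr = (0.5 * log(sigma**2 / pooled)
                  + (tau**2 * centred_sum**2) / (2 * sigma**2 * pooled))
        # The p-value is the running minimum of 1 / LR, so it never goes back up.
        p_running = min(p_running, exp(-log_lr))
        p_values.append(p_running)
    return p_values

rng = random.Random(3)
# Hypothetical stream with a real effect (mean 0.5, sigma 1), checked after every observation.
stream = [rng.gauss(0.5, 1.0) for _ in range(300)]
p_path = always_valid_p_values(stream)
first_hit = next((n for n, p in enumerate(p_path, start=1) if p < 0.05), None)
print("first observation with p < 0.05:", first_hit)
```

Because the p-value is defined at every observation and only ever decreases, stopping the first time it drops below α keeps the false-positive rate at or below α no matter how often you look.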

How much traffic does sequential testing save?

On true winners, expected sample size drops 20-40% versus a fixed-horizon design with the same power. On null tests (no real effect), futility boundaries also let you abandon early, freeing traffic for the next experiment. The catch: maximum sample size is usually 5-15% larger than the fixed-horizon equivalent.

Can I apply sequential testing to a test I have already been peeking at?

Not retroactively. The whole point is that the boundaries are calibrated to a pre-declared analysis plan. If you've been peeking at a fixed-horizon test, the honest fix is to pick an always-valid framework going forward and treat current p-values as descriptive, not inferential.

Does sequential testing work for low-traffic stores?

Yes, and it often helps more, because low-traffic tests take so long that the temptation to peek is overwhelming. Always-valid methods are particularly well suited because they don't require pre-declaring look counts, which is hard to plan when you're not sure how long the test will run.

What is a futility boundary?

A futility boundary is the mirror of the efficacy boundary: it lets you stop early when the data has effectively ruled out a meaningful winner. Combined with efficacy stopping, futility rules cut the average runtime of inconclusive tests dramatically, so you free up traffic instead of waiting out a flat experiment.
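One common way to put a number on futility is conditional power: the chance that the final analysis will cross its critical value given the interim data. This is only one possible approach (futility boundaries can also come from a beta-spending function), and the sketch assumes a one-sided view and that the effect observed so far continues unchanged; the 1.96 default and any stopping threshold (10% is often used) are assumptions.

```python
from math import sqrt
from statistics import NormalDist

def conditional_power(z_interim, info_fraction, final_critical_value=1.96):
    """P(final Z > critical value | interim data), assuming the current trend continues."""
    if not 0.0 < info_fraction < 1.0:
        raise ValueError("info_fraction must be strictly between 0 and 1")
    b_value = z_interim * sqrt(info_fraction)  # B(t) = Z(t) * sqrt(t)
    drift = b_value / info_fraction            # trend estimate of the final B-value
    expected_final = b_value + drift * (1.0 - info_fraction)
    remaining_sd = sqrt(1.0 - info_fraction)   # noise still to come before the end
    return 1.0 - NormalDist().cdf((final_critical_value - expected_final) / remaining_sd)

# Hypothetical interim results at 60% of the planned sample.
print(round(conditional_power(0.2, 0.6), 4))  # flat test: a win is very unlikely
print(round(conditional_power(3.0, 0.6), 4))  # strong trend: a win is very likely
```

In a sequential design, final_critical_value would be the last boundary b(K) rather than 1.96.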

Do I still need a power calculation?

Yes. Power calculations set the maximum sample size and the minimum detectable effect you're scoping the test for. Sequential boundaries decide when you can stop earlier than that maximum. Skipping the power step means you don't know what effect size the test can reliably detect.
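As a starting point for the power step, the fixed-horizon sample size for a conversion-rate test can be sketched with the standard normal-approximation formula for two proportions. The baseline rate and lift below are hypothetical, and a sequential design would then inflate this number slightly to get its maximum sample size.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_arm(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """Visitors per arm for a two-sided, two-proportion z-test (normal approximation)."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1.0 - alpha / 2.0)
    z_power = std_normal.inv_cdf(power)
    variant_rate = baseline_rate * (1.0 + relative_lift)
    variance_sum = (baseline_rate * (1.0 - baseline_rate)
                    + variant_rate * (1.0 - variant_rate))
    effect = variant_rate - baseline_rate
    return ceil((z_alpha + z_power) ** 2 * variance_sum / effect ** 2)

# Hypothetical store: 3% baseline conversion, aiming to detect a 10% relative lift.
print(sample_size_per_arm(0.03, 0.10))
```

For these inputs the answer is a little over 53,000 visitors per arm, which shows why small lifts on low baseline rates need long tests.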

How do I run a sequential test with more than two variants?

You combine sequential boundaries with a multiple-comparisons correction across variants (Bonferroni, Holm, or false-discovery-rate methods). Some platforms bake this in automatically; if yours doesn't, divide α by the number of variant-vs-control comparisons before computing your boundaries.
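The Bonferroni version of that adjustment is a one-line calculation. The sketch below assumes three variants each compared against one control and shows the per-comparison α along with the two-sided fixed-horizon critical value it implies; sequential boundaries would then be calibrated to the adjusted α instead of 0.05.

```python
from statistics import NormalDist

def bonferroni_alpha(total_alpha, n_comparisons):
    """Split the false-positive budget evenly across variant-vs-control comparisons."""
    if n_comparisons < 1:
        raise ValueError("n_comparisons must be at least 1")
    return total_alpha / n_comparisons

per_comparison_alpha = bonferroni_alpha(0.05, 3)
critical_z = NormalDist().inv_cdf(1.0 - per_comparison_alpha / 2.0)
print(round(per_comparison_alpha, 4), round(critical_z, 3))
```

With three comparisons the per-comparison α is about 0.0167, which raises the fixed-horizon critical value from 1.96 to roughly 2.39.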
