A/B Test Sample Size Calculator

Calculate the exact number of users per variation for your experiment — with sequential testing, measurement uncertainty, overdispersion, and variance reduction adjustments that other calculators ignore.

Other calculators show n under ideal conditions. We show n under real ones.


Why Sample Size Matters for A/B Tests

Running an A/B test without a proper sample size calculation is like navigating without a map. Too few users and you'll miss real effects (a Type II error), concluding "no difference" when one exists. Too many and you waste time, budget, and opportunity cost running an experiment longer than necessary.

The sample size determines two critical properties of your test: statistical power (the probability of detecting a real effect) and precision (how tight your confidence interval will be). An underpowered test might show a non-significant result for an effect that is real and economically meaningful. An overpowered test wastes resources you could have deployed on the next experiment.

In advertising, the stakes are concrete. Every day an experiment runs is a day you're splitting budget between control and variant. If the variant is worse, you're burning money. If it's better, you're leaving revenue on the table by not fully deploying it. The right sample size calculation finds the sweet spot: enough observations for a reliable conclusion, without a single extra day of opportunity cost.

Most practitioners use the standard fixed-horizon formula and call it done. But that formula makes assumptions that rarely hold in practice: no peeking at results, no user correlation, no measurement noise, no conversion delay. When those assumptions break — and they always do — the "right" sample size is wrong. That's why this calculator exists.

What Most Calculators Get Wrong

Every popular sample size calculator — Evan Miller, Statsig, Optimizely, GrowthBook — computes the same fixed-horizon Z-test formula. They give you an answer under ideal conditions: a single analysis at the end, perfectly measured outcomes, independent observations, and instant conversions. Here's what they ignore.

Peeking at results (sequential testing). If you plan to check your test before it reaches the planned sample size, the effective significance level inflates. Group sequential designs (O'Brien-Fleming, Pocock) control this by spending alpha across interim looks, but they require more samples. With 5 looks, the increase ranges from 3% (OBF) to 25% (Pocock). No competitor calculator shows this inflation.
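To see why peeking inflates the false positive rate, here is a quick Monte Carlo sketch (illustrative Python, not the calculator's engine): apply a naive z-test at five equally spaced interim looks under the null hypothesis and count how often any look crosses 1.96.

```python
import math
import random

def peeking_false_positive_rate(looks: int = 5, n_per_look: int = 200,
                                sims: int = 1000, z_crit: float = 1.96) -> float:
    """Fraction of null experiments declared significant when a naive
    z-test (no alpha spending) is applied at every interim look."""
    random.seed(42)
    hits = 0
    for _ in range(sims):
        total = 0.0
        for look in range(1, looks + 1):
            total += sum(random.gauss(0, 1) for _ in range(n_per_look))
            z = total / math.sqrt(look * n_per_look)
            if abs(z) > z_crit:   # a naive analyst stops and declares a winner
                hits += 1
                break
    return hits / sims

fpr = peeking_false_positive_rate()   # roughly 0.14, far above the nominal 0.05
```

With five unplanned looks, the effective false positive rate is roughly triple the nominal 5% — which is exactly the problem alpha spending functions exist to fix.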

Overdispersion. Users within the same geography, device, or time-of-day cluster are correlated, so the actual variance exceeds the binomial assumption. Even a small intra-cluster correlation (rho = 0.001) can inflate the required n by 10-30%. The Beta-Binomial model captures this with a single parameter, but standard calculators assume rho = 0.

Measurement uncertainty. In advertising, the outcome you observe is not the true outcome. Viewability measurement has ~15% noise, fraud detection adds ~5%, cross-device attribution adds ~10%. These create an irreducible variance floor that doesn't shrink with more data. If the systematic noise is large relative to the effect you're trying to detect, no sample size will be sufficient — the effect is undetectable. This calculator computes the exact threshold.

Conversion delay. When conversions happen days or weeks after exposure, a short observation window misses a fraction of them. A 7-day window with typical e-commerce delay captures about 88% of conversions, requiring ~14% more users to compensate. Standard calculators assume instant conversion.

Variance reduction opportunities. CUPED (Controlled-experiment Using Pre-Experiment Data) can reduce required sample size by 10-50% by using pre-experiment covariates to absorb noise. If you have this data, you're leaving power on the table. No competitor calculator previews the CUPED benefit.

The result: the "sample size" from a standard calculator is a lower bound that ignores real measurement conditions. Your actual experiment will be underpowered unless you account for these factors. That's exactly what this calculator does.

Methodology

The core formula is the two-proportion Z-test (Fleiss/Lachin formulation): pooled variance under H0 for the alpha term, unpooled variance under H1 for the beta term. This gives the standard fixed-horizon sample size n_fixed.
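As a sketch, the fixed-horizon formula can be written in a few lines of Python (the production engine is the Rust ns-inference library; the function name here is illustrative):

```python
import math
from statistics import NormalDist

def fixed_horizon_n(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-arm sample size for a two-sided two-proportion Z-test
    (Fleiss/Lachin: pooled variance under H0, unpooled under H1)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)    # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2                            # pooled rate (equal arms)
    pooled = math.sqrt(2 * p_bar * (1 - p_bar))
    unpooled = math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    return math.ceil(((z_alpha * pooled + z_beta * unpooled) / (p2 - p1)) ** 2)

# baseline 10%, +10% relative lift (10% -> 11%)
n = fixed_horizon_n(0.10, 0.11)   # 14751 per arm
```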

Sequential inflation is computed by constructing a group sequential design via alpha spending functions (O'Brien-Fleming or Pocock). The last-look critical value, relative to the fixed-horizon critical value, gives the sample-size inflation factor.

Overdispersion uses the variance inflation factor VIF = 1/(1 - rho) from the Beta-Binomial model, where rho is the intra-cluster correlation.

Measurement systematics add an irreducible variance floor v_sys = 2 * (sigma * p̄)² to the per-observation statistical variance. The inflation factor is delta² / (delta² - (z_alpha + z_beta)² * v_sys). When the denominator reaches zero, the effect is undetectable.

CUPED reduces variance by a factor of (1 - rho²), where rho² is the squared correlation between the pre-experiment covariate and the outcome metric.

Delay correction uses an exponential decay model: observed fraction = 1 - exp(-lambda * window). The inflation factor is 1 / observed fraction.

All adjustments are multiplicative on n_fixed: n_adjusted = n_fixed * inflation_sequential * inflation_overdispersion * inflation_systematics * reduction_CUPED * inflation_delay.
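Putting the pieces together, the multiplicative adjustment can be sketched as follows (parameter names and example values are illustrative, not the engine's API):

```python
import math

def adjusted_n(n_fixed: int, *, seq_inflation: float = 1.0, rho: float = 0.0,
               sys_inflation: float = 1.0, rho2_cuped: float = 0.0,
               observed_fraction: float = 1.0) -> int:
    """Compose the multiplicative adjustments on the fixed-horizon n."""
    vif = 1.0 / (1.0 - rho)            # Beta-Binomial overdispersion factor
    cuped = 1.0 - rho2_cuped           # CUPED variance reduction factor
    delay = 1.0 / observed_fraction    # conversion-lag inflation factor
    return math.ceil(n_fixed * seq_inflation * vif * sys_inflation * cuped * delay)

# naive 14,751 per arm; 5 OBF looks (+3%); rho = 0.001;
# CUPED rho^2 = 0.2; 88% of conversions observed in the window
n_real = adjusted_n(14751, seq_inflation=1.03, rho=0.001,
                    rho2_cuped=0.2, observed_fraction=0.88)
```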

The entire computation runs in your browser via WebAssembly, compiled from the same Rust ns-inference library that powers the full NextStat platform. No approximations, no server calls, no data leaves your device.

References:

  • Fleiss, Levin & Paik (2003). Statistical Methods for Rates and Proportions, 3rd ed. Wiley. — Two-proportion Z-test formula.
  • O'Brien & Fleming (1979). A multiple testing procedure for clinical trials. Biometrics, 35(3), 549–556. — Group sequential boundaries.
  • Pocock (1977). Group sequential methods in the design and analysis of clinical trials. Biometrika, 64(2), 191–199.
  • Deng, Xu, Kohavi & Walker (2013). Improving the sensitivity of online controlled experiments by utilizing pre-experiment data. WSDM '13. — CUPED.
  • Lan & DeMets (1983). Discrete sequential boundaries for clinical trials. Biometrika, 70(3), 659–663. — Alpha spending functions.

How to Choose Your Minimum Detectable Effect

The minimum detectable effect (MDE) is the smallest improvement worth detecting. Choosing it is a business decision, not a statistical one. Set it too small and you'll need millions of users; set it too large and you'll miss real improvements.

Start with revenue. Calculate the revenue impact of a 1% relative lift in your conversion rate. If your baseline is 2% on 100,000 monthly visitors with $50 average order value, a 1% relative lift (2% → 2.02%) means ~20 extra conversions/month = $1,000/month. Is that worth the experiment? If not, increase the MDE until the payoff justifies the cost.
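The arithmetic behind that example, spelled out:

```python
visitors, baseline, aov = 100_000, 0.02, 50   # monthly traffic, conversion rate, avg order value
relative_lift = 0.01                          # 1% relative lift: 2% -> 2.02%

extra_conversions = visitors * baseline * relative_lift   # ~20 extra conversions / month
extra_revenue = extra_conversions * aov                   # ~$1,000 / month
```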

Factor in experiment cost. Every day your experiment runs, you're splitting traffic. If the variant is better, you're leaving half the uplift on the table. If it's worse, you're burning budget on the losing side. The cost of running 30 extra days often exceeds the value of detecting a 1% lift vs. a 5% lift.

Rule of thumb for advertising: most campaign-level A/B tests use 5-20% relative MDE. Landing page tests with high traffic can go as low as 2-5%. Creative or bid-strategy tests with limited traffic typically use 10-30%. If your required sample size exceeds your available traffic within a reasonable window (2-4 weeks), increase the MDE.

Sample Size Reference Tables

Fixed-horizon sample sizes per arm for common baseline rates, at alpha = 0.05 and power = 0.80 (two-sided test). These are naive estimates — add sequential, overdispersion, and systematics corrections for real-world planning.

Baseline Rate | MDE 5% | MDE 10% | MDE 15% | MDE 20% | MDE 30% | MDE 50%
0.5% | 6,147,370 | 1,536,843 | 683,042 | 384,211 | 170,760 | 61,474
1% | 3,042,094 | 760,524 | 337,999 | 190,131 | 84,503 | 30,421
2% | 1,489,478 | 372,370 | 165,498 | 93,093 | 41,375 | 14,895
5% | 555,648 | 138,912 | 61,739 | 34,728 | 15,435 | 5,557
10% | 254,298 | 63,575 | 28,255 | 15,894 | 7,064 | 2,543
20% | 103,192 | 25,798 | 11,466 | 6,450 | 2,867 | 1,032

NextStat vs. Other Calculators

Feature | NextStat | Evan Miller | Statsig | Optimizely | GrowthBook
Fixed-horizon Z-test | Yes | Yes | Yes | Yes | Yes
Sequential testing inflation | Yes | No | No | No | No
Overdispersion correction | Yes | No | No | No | No
Measurement systematics | Yes | No | No | No | No
CUPED variance reduction | Yes | No | No | No | No
Delay correction | Yes | No | No | No | No
Power & MDE curves | Yes | Yes | No | No | Yes
Sensitivity breakdown | Yes | No | No | No | No
Naive vs Real comparison | Yes | No | No | No | No
100% client-side (WASM) | Yes | Yes | No | No | No
Open computation engine | Yes | No | No | No | Yes

Frequently Asked Questions

How is the sample size for an A/B test calculated?

For a two-proportion Z-test, sample size per arm is n = (z_alpha * sqrt(2 * p_bar * (1 - p_bar)) + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2)))² / (p2 - p1)², where p1 is the baseline rate, p2 is the expected treatment rate, p_bar is the pooled rate, z_alpha is the critical value for your significance level, and z_beta is the critical value for your desired power. This formula assumes a fixed-horizon test with no interim analysis.

What is sequential testing, and why does it require more samples?

Sequential testing (group sequential design) lets you check results at pre-planned interim looks without inflating the false positive rate. The trade-off is a larger maximum sample size — typically 3-25% more, depending on the number of looks and the spending function. O'Brien-Fleming boundaries add minimal inflation (~3% for 5 looks) while Pocock boundaries require more (~25% for 5 looks).

What is overdispersion, and why does it matter?

Overdispersion occurs when variance exceeds the binomial model prediction. Users within the same geography, device, or time cohort are correlated. Even a small intra-cluster correlation (rho = 0.001) can inflate the required sample size by 10-30%. The Beta-Binomial model captures this. Most calculators ignore it.

What are measurement systematics?

Systematic uncertainties that don't shrink with more data: viewability measurement error (~15%), fraud detection uncertainty (~5%), cross-device attribution noise (~10%). These create an irreducible variance floor. If the systematic noise is large relative to the effect, the effect becomes undetectable regardless of sample size.

How does CUPED reduce sample size?

CUPED (Controlled-experiment Using Pre-Experiment Data) reduces variance using pre-experiment covariate data. If the squared correlation (rho²) between the covariate and the outcome metric is 0.25, CUPED reduces the required sample size by 25%. This is free power — same data, more signal.

Why does this calculator recommend more samples than others?

Other calculators compute sample size under ideal conditions: no peeking, no overdispersion, no measurement noise, no delay. If you plan to peek (sequential), have correlated users (overdispersion), or noisy measurement (systematics), you need more samples. The "Naive vs Real" comparison shows exactly how much.

Does my data leave my browser?

No. All computation is 100% client-side via WebAssembly (WASM). The engine is written in Rust, compiled to a ~130KB WASM binary that runs in a Web Worker. Your inputs never leave your browser.

Should I use a one-sided or two-sided test?

Two-sided (default) detects both improvements and degradations. One-sided detects only improvements and requires fewer samples, but can't detect harm. Use two-sided for production experiments where degradation matters.

Should I choose O'Brien-Fleming or Pocock boundaries?

O'Brien-Fleming uses conservative early boundaries and permissive late ones — minimal sample size overhead (~3% for 5 looks) but early stopping is unlikely unless the effect is very large. Pocock uses equal boundaries at each look — easier early stopping but 20-30% more maximum sample size. For most advertising experiments, OBF is recommended.

How do delayed conversions affect sample size?

Conversions that happen days or weeks after exposure are missed by short observation windows. A 7-day window with lambda = 0.3 captures about 88% of conversions, requiring ~14% more users to compensate. This calculator models the exact inflation.
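A minimal sketch of the exponential lag model described in the methodology section (illustrative Python):

```python
import math

def delay_inflation(lam: float, window_days: float) -> tuple[float, float]:
    """Observed conversion fraction under the exponential lag model
    (observed = 1 - exp(-lambda * window)) and the resulting n inflation."""
    observed = 1.0 - math.exp(-lam * window_days)
    return observed, 1.0 / observed

# lambda = 0.3, 7-day window: ~88% of conversions observed, ~14% more users needed
observed, inflation = delay_inflation(lam=0.3, window_days=7)
```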

Need more than a calculator?

NextStat Ads connects to your Google Ads account and runs sequential A/B tests with real stopping rules, causal attribution, and metric forecasting.

Try NextStat Ads Free