A/B Test Sample Size Calculator

Calculate the exact number of users per variation for your experiment — with sequential testing, measurement uncertainty, overdispersion, and variance reduction adjustments that other calculators ignore.

Other calculators show n under ideal conditions. We show n under real ones.


Why Sample Size Matters for A/B Tests

Running an A/B test without a proper sample size calculation is like navigating without a map. Too few users and you'll miss real effects (a Type II error), concluding "no difference" when one exists. Too many and you waste time, budget, and opportunity cost running an experiment longer than necessary.

The sample size determines two critical properties of your test: statistical power (the probability of detecting a real effect) and precision (how tight your confidence interval will be). An underpowered test might show a non-significant result for an effect that is real and economically meaningful. An overpowered test wastes resources you could have deployed on the next experiment.

In advertising, the stakes are concrete. Every day an experiment runs is a day you're splitting budget between control and variant. If the variant is worse, you're burning money. If it's better, you're leaving revenue on the table by not fully deploying it. The right sample size calculation finds the sweet spot: enough observations for a reliable conclusion, without a single extra day of opportunity cost.

Most practitioners use the standard fixed-horizon formula and call it done. But that formula makes assumptions that rarely hold in practice: no peeking at results, no user correlation, no measurement noise, no conversion delay. When those assumptions break — and they always do — the "right" sample size is wrong. That's why this calculator exists.

What Most Calculators Get Wrong

Every popular sample size calculator — Evan Miller, Statsig, Optimizely, GrowthBook — computes the same fixed-horizon Z-test formula. They give you an answer under ideal conditions: a single analysis at the end, perfectly measured outcomes, independent observations, and instant conversions. Here's what they ignore.

Peeking at results (sequential testing). If you plan to check your test before it reaches the planned sample size, the effective significance level inflates. Group sequential designs (O'Brien-Fleming, Pocock) control this by spending alpha across interim looks, but they require more samples. With 5 looks, the increase ranges from 3% (OBF) to 25% (Pocock). No competitor calculator shows this inflation.
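To see why peeking inflates the false positive rate, here is a quick Monte Carlo sketch (illustrative Python, not the calculator's engine): apply a naive z-test at five equally spaced interim looks under the null hypothesis and count how often any look crosses 1.96.

```python
import math
import random

def peeking_false_positive_rate(looks: int = 5, n_per_look: int = 200,
                                sims: int = 1000, z_crit: float = 1.96) -> float:
    """Fraction of null experiments declared significant when a naive
    z-test (no alpha spending) is applied at every interim look."""
    random.seed(42)
    hits = 0
    for _ in range(sims):
        total = 0.0
        for look in range(1, looks + 1):
            total += sum(random.gauss(0, 1) for _ in range(n_per_look))
            z = total / math.sqrt(look * n_per_look)
            if abs(z) > z_crit:   # a naive analyst stops and declares a winner
                hits += 1
                break
    return hits / sims

fpr = peeking_false_positive_rate()   # roughly 0.14, far above the nominal 0.05
```

With five unplanned looks, the effective false positive rate is roughly triple the nominal 5% — which is exactly the problem alpha spending functions exist to fix.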

Overdispersion. Users within the same geography, device, or time-of-day cluster are correlated, so the actual variance exceeds the binomial assumption. Even a small intra-cluster correlation (rho = 0.001) can inflate the required n by 10-30%. The Beta-Binomial model captures this with a single parameter, but standard calculators assume rho = 0.

Measurement uncertainty. In advertising, the outcome you observe is not the true outcome. Viewability measurement has ~15% noise, fraud detection adds ~5%, cross-device attribution adds ~10%. These create an irreducible variance floor that doesn't shrink with more data. If the systematic noise is large relative to the effect you're trying to detect, no sample size will be sufficient — the effect is undetectable. This calculator computes the exact threshold.

Conversion delay. When conversions happen days or weeks after exposure, a short observation window misses a fraction of them. A 7-day window with typical e-commerce delay captures about 88% of conversions, requiring ~14% more users to compensate. Standard calculators assume instant conversion.

Variance reduction opportunities. CUPED (Controlled-experiment Using Pre-Experiment Data) can reduce required sample size by 10-50% by using pre-experiment covariates to absorb noise. If you have this data, you're leaving power on the table. No competitor calculator previews the CUPED benefit.

The result: the "sample size" from a standard calculator is a lower bound that ignores real measurement conditions. Your actual experiment will be underpowered unless you account for these factors. That's exactly what this calculator does.

Methodology

The core formula is the two-proportion Z-test (Fleiss/Lachin formulation): pooled variance under H0 for the alpha term, unpooled variance under H1 for the beta term. This gives the standard fixed-horizon sample size n_fixed.
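As a sketch, the fixed-horizon formula can be written in a few lines of Python (the production engine is the Rust ns-inference library; the function name here is illustrative):

```python
import math
from statistics import NormalDist

def fixed_horizon_n(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-arm sample size for a two-sided two-proportion Z-test
    (Fleiss/Lachin: pooled variance under H0, unpooled under H1)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)    # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2                            # pooled rate (equal arms)
    pooled = math.sqrt(2 * p_bar * (1 - p_bar))
    unpooled = math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    return math.ceil(((z_alpha * pooled + z_beta * unpooled) / (p2 - p1)) ** 2)

# baseline 10%, +10% relative lift (10% -> 11%)
n = fixed_horizon_n(0.10, 0.11)   # 14751 per arm
```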

Sequential inflation is computed by constructing a group sequential design via alpha spending functions (O'Brien-Fleming or Pocock). The last-look critical value, relative to the fixed-horizon critical value, gives the sample-size inflation factor.

Overdispersion uses the variance inflation factor VIF = 1/(1 - rho) from the Beta-Binomial model, where rho is the intra-cluster correlation.

Measurement systematics add an irreducible variance floor v_sys = 2 * (sigma * p̄)² to the per-observation statistical variance. The inflation factor is delta² / (delta² - (z_alpha + z_beta)² * v_sys). When the denominator reaches zero, the effect is undetectable.

CUPED reduces variance by a factor of (1 - rho²), where rho² is the squared correlation between the pre-experiment covariate and the outcome metric.

Delay correction uses an exponential decay model: observed fraction = 1 - exp(-lambda * window). The inflation factor is 1 / observed fraction.

All adjustments are multiplicative on n_fixed: n_adjusted = n_fixed * inflation_sequential * inflation_overdispersion * inflation_systematics * reduction_CUPED * inflation_delay.
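Putting the pieces together, the multiplicative adjustment can be sketched as follows (parameter names and example values are illustrative, not the engine's API):

```python
import math

def adjusted_n(n_fixed: int, *, seq_inflation: float = 1.0, rho: float = 0.0,
               sys_inflation: float = 1.0, rho2_cuped: float = 0.0,
               observed_fraction: float = 1.0) -> int:
    """Compose the multiplicative adjustments on the fixed-horizon n."""
    vif = 1.0 / (1.0 - rho)            # Beta-Binomial overdispersion factor
    cuped = 1.0 - rho2_cuped           # CUPED variance reduction factor
    delay = 1.0 / observed_fraction    # conversion-lag inflation factor
    return math.ceil(n_fixed * seq_inflation * vif * sys_inflation * cuped * delay)

# naive 14,751 per arm; 5 OBF looks (+3%); rho = 0.001;
# CUPED rho^2 = 0.2; 88% of conversions observed in the window
n_real = adjusted_n(14751, seq_inflation=1.03, rho=0.001,
                    rho2_cuped=0.2, observed_fraction=0.88)
```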

The entire computation runs in your browser via WebAssembly, compiled from the same Rust ns-inference library that powers the full NextStat platform. No approximations, no server calls, no data leaves your device.

References:

  • Fleiss, Levin & Paik (2003). Statistical Methods for Rates and Proportions, 3rd ed. Wiley. — Two-proportion Z-test formula.
  • O'Brien & Fleming (1979). A multiple testing procedure for clinical trials. Biometrics, 35(3), 549–556. — Group sequential boundaries.
  • Pocock (1977). Group sequential methods in the design and analysis of clinical trials. Biometrika, 64(2), 191–199.
  • Deng, Xu, Kohavi & Walker (2013). Improving the sensitivity of online controlled experiments by utilizing pre-experiment data. WSDM '13. — CUPED.
  • Lan & DeMets (1983). Discrete sequential boundaries for clinical trials. Biometrika, 70(3), 659–663. — Alpha spending functions.

How to Choose Your Minimum Detectable Effect

The minimum detectable effect (MDE) is the smallest improvement worth detecting. Choosing it is a business decision, not a statistical one. Set it too small and you'll need millions of users; set it too large and you'll miss real improvements.

Start with revenue. Calculate the revenue impact of a 1% relative lift in your conversion rate. If your baseline is 2% on 100,000 monthly visitors with $50 average order value, a 1% relative lift (2% → 2.02%) means ~20 extra conversions/month = $1,000/month. Is that worth the experiment? If not, increase the MDE until the payoff justifies the cost.
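The arithmetic behind that example, spelled out:

```python
visitors, baseline, aov = 100_000, 0.02, 50   # monthly traffic, conversion rate, avg order value
relative_lift = 0.01                          # 1% relative lift: 2% -> 2.02%

extra_conversions = visitors * baseline * relative_lift   # ~20 extra conversions / month
extra_revenue = extra_conversions * aov                   # ~$1,000 / month
```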

Factor in experiment cost. Every day your experiment runs, you're splitting traffic. If the variant is better, you're leaving half the uplift on the table. If it's worse, you're burning budget on the losing side. The cost of running 30 extra days often exceeds the value of detecting a 1% lift vs. a 5% lift.

Rule of thumb for advertising: most campaign-level A/B tests use 5-20% relative MDE. Landing page tests with high traffic can go as low as 2-5%. Creative or bid-strategy tests with limited traffic typically use 10-30%. If your required sample size exceeds your available traffic within a reasonable window (2-4 weeks), increase the MDE.

Sample Size Reference Tables

Fixed-horizon sample sizes per arm for common baseline rates, at alpha = 0.05 and power = 0.80 (two-sided test). These are naive estimates — add sequential, overdispersion, and systematics corrections for real-world planning.

Baseline Rate | MDE 5% | MDE 10% | MDE 15% | MDE 20% | MDE 30% | MDE 50%
0.5% | 6,147,370 | 1,536,843 | 683,042 | 384,211 | 170,760 | 61,474
1% | 3,042,094 | 760,524 | 337,999 | 190,131 | 84,503 | 30,421
2% | 1,489,478 | 372,370 | 165,498 | 93,093 | 41,375 | 14,895
5% | 555,648 | 138,912 | 61,739 | 34,728 | 15,435 | 5,557
10% | 254,298 | 63,575 | 28,255 | 15,894 | 7,064 | 2,543
20% | 103,192 | 25,798 | 11,466 | 6,450 | 2,867 | 1,032

NextStat vs. Other Calculators

Feature | NextStat | Evan Miller | Statsig | Optimizely | GrowthBook
Fixed-horizon Z-test | Yes | Yes | Yes | Yes | Yes
Sequential testing inflation | Yes | No | No | No | No
Overdispersion correction | Yes | No | No | No | No
Measurement systematics | Yes | No | No | No | No
CUPED variance reduction | Yes | No | No | No | No
Delay correction | Yes | No | No | No | No
Power & MDE curves | Yes | Yes | No | No | Yes
Sensitivity breakdown | Yes | No | No | No | No
Naive vs Real comparison | Yes | No | No | No | No
100% client-side (WASM) | Yes | Yes | No | No | No
Open computation engine | Yes | No | No | No | Yes

Frequently Asked Questions

How is the sample size for an A/B test calculated?

For a two-proportion Z-test, sample size per arm is n = (z_alpha * sqrt(2 * p_bar * (1 - p_bar)) + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2)))² / (p2 - p1)², where p1 is the baseline rate, p2 is the expected treatment rate, p_bar is the pooled rate, z_alpha is the critical value for your significance level, and z_beta is the critical value for your desired power. This formula assumes a fixed-horizon test with no interim analysis.

What is sequential testing, and why does it require more samples?

Sequential testing (group sequential design) lets you check results at pre-planned interim looks without inflating the false positive rate. The trade-off is a larger maximum sample size — typically 3-25% more, depending on the number of looks and the spending function. O'Brien-Fleming boundaries add minimal inflation (~3% for 5 looks) while Pocock boundaries require more (~25% for 5 looks).

What is overdispersion, and why does it matter?

Overdispersion occurs when variance exceeds the binomial model prediction. Users within the same geography, device, or time cohort are correlated. Even a small intra-cluster correlation (rho = 0.001) can inflate the required sample size by 10-30%. The Beta-Binomial model captures this. Most calculators ignore it.

What are measurement systematics?

Systematic uncertainties that don't shrink with more data: viewability measurement error (~15%), fraud detection uncertainty (~5%), cross-device attribution noise (~10%). These create an irreducible variance floor. If the systematic noise is large relative to the effect, the effect becomes undetectable regardless of sample size.

How does CUPED reduce sample size?

CUPED (Controlled-experiment Using Pre-Experiment Data) reduces variance using pre-experiment covariate data. If the squared correlation (rho²) between the covariate and the outcome metric is 0.25, CUPED reduces the required sample size by 25%. This is free power — same data, more signal.

Why does this calculator recommend more samples than others?

Other calculators compute sample size under ideal conditions: no peeking, no overdispersion, no measurement noise, no delay. If you plan to peek (sequential), have correlated users (overdispersion), or noisy measurement (systematics), you need more samples. The "Naive vs Real" comparison shows exactly how much.

Does my data leave my browser?

No. All computation is 100% client-side via WebAssembly (WASM). The engine is written in Rust, compiled to a ~130KB WASM binary that runs in a Web Worker. Your inputs never leave your browser.

Should I use a one-sided or two-sided test?

Two-sided (default) detects both improvements and degradations. One-sided detects only improvements and requires fewer samples, but can't detect harm. Use two-sided for production experiments where degradation matters.

Should I choose O'Brien-Fleming or Pocock boundaries?

O'Brien-Fleming uses conservative early boundaries and permissive late ones — minimal sample size overhead (~3% for 5 looks) but early stopping is unlikely unless the effect is very large. Pocock uses equal boundaries at each look — easier early stopping but 20-30% more maximum sample size. For most advertising experiments, OBF is recommended.

How do delayed conversions affect sample size?

Conversions that happen days or weeks after exposure are missed by short observation windows. A 7-day window with lambda = 0.3 captures about 88% of conversions, requiring ~14% more users to compensate. This calculator models the exact inflation.
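A minimal sketch of the exponential lag model described in the methodology section (illustrative Python):

```python
import math

def delay_inflation(lam: float, window_days: float) -> tuple[float, float]:
    """Observed conversion fraction under the exponential lag model
    (observed = 1 - exp(-lambda * window)) and the resulting n inflation."""
    observed = 1.0 - math.exp(-lam * window_days)
    return observed, 1.0 / observed

# lambda = 0.3, 7-day window: ~88% of conversions observed, ~14% more users needed
observed, inflation = delay_inflation(lam=0.3, window_days=7)
```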

Need more than a calculator?

NextStat Ads connects to your Google Ads account and runs sequential A/B tests with real stopping rules, causal attribution, and metric forecasting.

Try NextStat Ads Free