
Statistical Concepts

Understanding the key statistical concepts behind reliable A/B testing platforms

Why Statistics Matter in A/B Testing

A/B testing platforms rely on statistical methods to determine whether observed differences between variants are meaningful or just random chance. Understanding these concepts helps you make better decisions and avoid common pitfalls.

Reliable Decisions

Proper statistical methods ensure you can trust your experiment results and make data-driven decisions with confidence.

Avoiding Pitfalls

Understanding statistical concepts helps you avoid common mistakes that can lead to incorrect conclusions and costly decisions.

Efficient Testing

Advanced statistical techniques can help you run experiments more efficiently, requiring fewer users and less time.

p-value

The p-value is the probability of observing a difference as extreme as the one in your experiment, assuming there is no real difference between variants (the null hypothesis).

What it means

A p-value of 0.05 means there's a 5% chance you'd see a difference this large (or larger) if there were actually no difference between your variants.

Lower p-values (like 0.01) indicate stronger evidence against the null hypothesis, suggesting the observed difference is likely real.

Common threshold: p < 0.05 is typically considered "statistically significant," though this is an arbitrary convention.
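
As an illustration, here is a minimal sketch of how a p-value might be computed for a conversion-rate comparison using a two-proportion z-test (the counts and helper function here are hypothetical, not a specific platform's API):

```python
from scipy.stats import norm

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)              # pooled rate under the null
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_b - p_a) / se
    return 2 * norm.sf(abs(z))                            # two-sided tail probability

# e.g. 1,000 users per variant, 100 vs. 120 conversions
print(two_proportion_p_value(100, 1_000, 120, 1_000))     # ~0.15, not significant at 0.05
```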

Common misconceptions

Misconception: p = 0.05 means there's a 95% chance your result is correct.

Reality: The p-value only tells you how likely a difference this large would be if there were no true effect; it says nothing about the probability that your conclusion is correct.

Misconception: A non-significant p-value proves there's no difference.

Reality: It just means you don't have enough evidence to reject the null hypothesis.

Confidence Intervals

A confidence interval provides a range of plausible values for the true effect, giving you both the magnitude and uncertainty of your experiment results.

Understanding Confidence Intervals

A 95% confidence interval means that if you were to repeat your experiment many times, about 95% of the resulting intervals would contain the true effect.

[Diagram: a confidence interval showing the lower bound, point estimate, and upper bound]
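
A quick way to see this "repeated experiments" interpretation is a small simulation (a sketch with arbitrary numbers, not any platform's procedure): build a 95% interval for a known conversion rate many times and count how often it covers the truth.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
true_rate, n_users, n_experiments = 0.10, 2_000, 10_000
z = norm.ppf(0.975)                                       # ~1.96 for a 95% interval
covered = 0
for _ in range(n_experiments):
    p_hat = rng.binomial(n_users, true_rate) / n_users    # one simulated "experiment"
    se = np.sqrt(p_hat * (1 - p_hat) / n_users)
    if p_hat - z * se <= true_rate <= p_hat + z * se:
        covered += 1
print(covered / n_experiments)                            # close to 0.95
```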

What it tells you

  • The most likely range for the true effect
  • How precise your estimate is (narrower = more precise)
  • Whether the effect is statistically significant (if it doesn't include zero)

Better than just p-values

  • Shows the magnitude of the effect
  • Indicates the precision of your estimate
  • Helps assess practical significance, not just statistical significance
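
For example, here is a minimal sketch (not tied to any particular platform) of a 95% confidence interval for the difference in conversion rates between two variants, using a normal approximation and the same hypothetical counts as the p-value example above:

```python
from scipy.stats import norm

def diff_ci(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Normal-approximation CI for the difference in conversion rates (B - A)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = (p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b) ** 0.5
    z = norm.ppf(1 - (1 - confidence) / 2)                # ~1.96 for 95%
    diff = p_b - p_a
    return diff - z * se, diff + z * se

low, high = diff_ci(100, 1_000, 120, 1_000)
print(f"lift: {low:+.3f} to {high:+.3f}")                 # interval includes 0 -> not significant
```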

False Positives & Negatives

Understanding the two types of errors in hypothesis testing helps you balance the risks in your experimentation program.

False Positive (Type I Error)

Concluding there is an effect when there actually isn't one

Example scenario:

Your A/B test shows that variant B increases conversion by 15% with p=0.04, but in reality, there's no difference. You implement B and see no actual improvement.

Controlled by

Your significance level (α), typically set at 0.05, which means you accept a 5% chance of false positives.

Business impact

Wasted resources implementing changes that don't actually improve metrics, potential negative impacts on user experience.

Multiple testing problem

When running many tests (multiple metrics, segments, or experiments), your false positive rate increases dramatically. If you run 20 tests with α=0.05, you have a 64% chance of at least one false positive.

Solution: Use correction methods like Bonferroni, Benjamini-Hochberg, or sequential testing to control the overall false positive rate.

CUPED

CUPED (Controlled-experiment Using Pre-Experiment Data) is a variance reduction technique that can dramatically increase the sensitivity of your experiments.

How CUPED Works

CUPED uses historical data from before the experiment to reduce noise in your metrics, making it easier to detect true effects with smaller sample sizes.

Key concept:

CUPED reduces the variance in your experiment data by accounting for pre-experiment metrics. This leads to narrower confidence intervals, allowing you to detect true effects more reliably with the same sample size, or use a smaller sample size to achieve the same statistical power.

Key benefits

  • Reduces variance by 30-80%, depending on the metric
  • Allows you to detect smaller effects with the same sample size
  • Can shorten experiment duration substantially, in some cases by 50-80%
  • Works especially well for metrics with high user-to-user variability

How to implement

CUPED adjusts each user's experiment metric by subtracting a portion of their pre-experiment metric, reducing the noise while preserving the treatment effect:

Y_i^CUPED = Y_i − θ (X_i − μ_X)

Where Y_i is the user's metric during the experiment, X_i is their pre-experiment metric, μ_X is the mean of the pre-experiment metric, and θ is a coefficient determined by regression.
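
A minimal sketch of this adjustment, assuming per-user arrays y (in-experiment metric) and x (pre-experiment metric), and taking θ = cov(X, Y) / var(X), the variance-minimizing choice that the regression mentioned above would produce:

```python
import numpy as np

def cuped_adjust(y, x):
    """Adjust the in-experiment metric y using the pre-experiment covariate x."""
    theta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)        # regression slope of y on x
    return y - theta * (x - x.mean())

rng = np.random.default_rng(42)
x = rng.normal(10, 3, size=10_000)                        # pre-experiment metric
y = 0.8 * x + rng.normal(0, 1, size=10_000)               # correlated in-experiment metric
y_cuped = cuped_adjust(y, x)
print(np.var(y, ddof=1), np.var(y_cuped, ddof=1))         # variance drops sharply
```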

Heterogeneous Treatment Effects

Heterogeneous treatment effects occur when an experiment affects different user segments differently, which can lead to missed opportunities if you only look at average effects.

Understanding Heterogeneity

Based on research from Uber, Microsoft, and others (see arxiv.org/pdf/1610.03917)

For example, an experiment might show an average treatment effect of +2% overall, while the heterogeneous effects underneath are +15% for new users and -5% for returning users.

Why it matters

Looking only at average effects can hide important insights. A feature might be great for some users but harmful for others. Understanding these differences can lead to:

  • Personalized experiences (show features only to users who benefit)
  • Improved feature designs that work better for all segments
  • Deeper understanding of your users and their needs
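
As a hypothetical illustration (synthetic data and invented segment names, not results from any real experiment), a per-segment breakdown is one way such opposite-signed effects become visible:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 20_000
segment = rng.choice(["new", "returning"], size=n)
variant = rng.choice(["A", "B"], size=n)
base = np.where(segment == "new", 0.10, 0.20)             # baseline conversion per segment
lift = np.where(segment == "new", 0.015, -0.010)          # B helps new users, hurts returning
p = base + np.where(variant == "B", lift, 0.0)
df = pd.DataFrame({"segment": segment, "variant": variant,
                   "converted": rng.binomial(1, p)})

rates = df.groupby(["segment", "variant"])["converted"].mean().unstack("variant")
rates["lift"] = rates["B"] - rates["A"]                   # per-segment treatment effect
print(rates)
overall = df[df.variant == "B"].converted.mean() - df[df.variant == "A"].converted.mean()
print("average effect:", round(overall, 4))               # small average hides opposite signs
```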

Multiple Hypothesis Testing

When running multiple tests simultaneously, the probability of finding at least one false positive increases dramatically. Correction methods help control this risk.

The Multiple Testing Problem

If you run a single test with a significance level of 0.05, you have a 5% chance of a false positive. But what happens when you run multiple tests?

The problem:

With each additional test, your chance of finding at least one false positive increases:

  • 5 tests: 23% chance of ≥1 false positive
  • 10 tests: 40% chance of ≥1 false positive
  • 20 tests: 64% chance of ≥1 false positive
  • 100 tests: 99.4% chance of ≥1 false positive
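
These figures follow from 1 − (1 − α)^n for n independent tests at significance level α; a quick check:

```python
alpha = 0.05
for n in (1, 5, 10, 20, 100):
    print(f"{n:>3} tests: {1 - (1 - alpha) ** n:.1%} chance of at least one false positive")
```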

Common scenarios in A/B testing:

  • Testing multiple metrics (conversion, revenue, engagement)
  • Analyzing multiple user segments (new vs. returning, by country, by device)
  • Running multiple concurrent experiments
  • Performing repeated interim analyses (peeking at results)

Correction Methods

Several statistical methods can help control the false positive rate when running multiple tests:

Bonferroni Correction

Divides your significance level (α) by the number of tests.

α_adjusted = α / n

Example: For 5 tests with α=0.05, use α=0.01 for each test.

Pros: Simple to apply, controls family-wise error rate (FWER)

Cons: Very conservative, reduces statistical power
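
A quick sketch with five hypothetical p-values:

```python
alpha = 0.05
pvals = [0.001, 0.012, 0.02, 0.04, 0.3]                   # hypothetical results from 5 tests
threshold = alpha / len(pvals)                            # Bonferroni-adjusted threshold: 0.01
print([p <= threshold for p in pvals])                    # only the first test is significant
```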

Benjamini-Hochberg Procedure

Controls the false discovery rate (FDR) instead of the family-wise error rate.

Process: Sort p-values in ascending order, find the largest rank i with p(i) ≤ (i/m)×α (where m is the number of tests), and reject all hypotheses up to that rank.

Pros: Less conservative than Bonferroni, better statistical power

Cons: Slightly more complex to apply, controls proportion of false discoveries rather than probability of any false discovery
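
A minimal sketch of this step-up rule, using the same five hypothetical p-values as above:

```python
def benjamini_hochberg(pvals, alpha=0.05):
    """Return a reject/keep flag per p-value using the BH step-up rule."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])      # indices sorted by p-value
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * alpha:
            cutoff = rank                                  # largest rank passing its threshold
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        reject[i] = rank <= cutoff
    return reject

print(benjamini_hochberg([0.001, 0.012, 0.02, 0.04, 0.3]))  # rejects the first four
```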

Holm-Bonferroni Method

A step-down procedure that is less conservative than Bonferroni but still controls FWER.

Process: Sort p-values, compare smallest to α/n, next to α/(n-1), etc.

Pros: More powerful than Bonferroni, still controls FWER

Cons: More complex to apply
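
A small sketch of the step-down rule on the same hypothetical p-values; note it rejects fewer hypotheses than Benjamini-Hochberg here, but more than plain Bonferroni:

```python
def holm(pvals, alpha=0.05):
    """Return a reject/keep flag per p-value using the Holm step-down rule."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])      # indices sorted by p-value
    reject = [False] * n
    for step, i in enumerate(order):                      # step 0 compares against alpha/n
        if pvals[i] <= alpha / (n - step):
            reject[i] = True
        else:
            break                                         # retain this and all larger p-values
    return reject

print(holm([0.001, 0.012, 0.02, 0.04, 0.3]))              # rejects only the first two
```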

Sequential Testing

Adjusts significance thresholds for interim analyses during an experiment.

Methods: O'Brien-Fleming, Pocock, alpha-spending functions

Pros: Allows for early stopping while controlling error rates

Cons: Requires planning the number of interim analyses in advance
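
To see why interim looks need adjusted thresholds, here is a rough simulation (a sketch, not any specific platform's procedure) of a two-sample z-test under the null hypothesis, "peeked at" five times with a naive fixed threshold:

```python
import numpy as np

rng = np.random.default_rng(0)
n_sims, per_look, looks = 5_000, 200, 5
false_positives = 0
for _ in range(n_sims):
    a = rng.normal(size=per_look * looks)                 # control, no true effect
    b = rng.normal(size=per_look * looks)                 # treatment, no true effect
    for k in range(1, looks + 1):
        n = k * per_look
        z = (b[:n].mean() - a[:n].mean()) / np.sqrt(2 / n)
        if abs(z) > 1.96:                                 # naive 5% threshold at every look
            false_positives += 1
            break
print(false_positives / n_sims)                           # ~0.14 instead of the nominal 0.05
```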

Practical Recommendations

  • Define primary and secondary metrics before running experiments
  • Pre-specify which segments you'll analyze
  • Use Benjamini-Hochberg for exploratory analyses with many metrics
  • Use Bonferroni or Holm-Bonferroni when you need strong control of false positives
  • Consider sequential testing for long-running experiments with interim analyses
  • Always report which correction method you used and why