Statistical Concepts
Understanding the key statistical concepts behind reliable A/B testing platforms
Note: This page summarizes the statistical concepts behind trustworthy A/B testing, drawing on peer-reviewed research and industry practice. Each concept is explained concisely and intuitively, without assuming a statistics background.
Why Statistics Matter in A/B Testing
A/B testing platforms rely on statistical methods to determine whether observed differences between variants are meaningful or just random chance. Understanding these concepts helps you make better decisions and avoid common pitfalls.
Reliable Decisions
Proper statistical methods ensure you can trust your experiment results and make data-driven decisions with confidence.
Avoiding Pitfalls
Understanding statistical concepts helps you avoid common mistakes that can lead to incorrect conclusions and costly decisions.
Efficient Testing
Advanced statistical techniques can help you run experiments more efficiently, requiring fewer users and less time.
p-value
The p-value is the probability of observing a difference at least as extreme as the one in your experiment, assuming there is no real difference between variants (the null hypothesis).
What it means
A p-value of 0.05 means there's a 5% chance you'd see a difference this large (or larger) if there were actually no difference between your variants.
Lower p-values (like 0.01) indicate stronger evidence against the null hypothesis: the observed difference is harder to explain by chance alone.
Common threshold: p < 0.05 is typically considered "statistically significant," though this is an arbitrary convention.
Common misconceptions
Misconception: p = 0.05 means there's a 95% chance your result is correct.
Reality: It tells you how often a difference this extreme would occur by chance if the null hypothesis were true; it says nothing directly about the probability that your result is correct.
Misconception: A non-significant p-value proves there's no difference.
Reality: It just means you don't have enough evidence to reject the null hypothesis.
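To make this concrete, here is a minimal sketch of how a p-value might be computed for a difference in conversion rates, using a pooled two-proportion z-test; the user counts and conversions are made-up numbers, and the normal approximation is just one common choice.

```python
# Hypothetical example: two-sided p-value for a conversion-rate difference
# using a pooled two-proportion z-test. The counts are illustrative only.
from math import sqrt
from scipy.stats import norm

conversions_a, users_a = 480, 10_000   # control
conversions_b, users_b = 540, 10_000   # variant

p_a = conversions_a / users_a
p_b = conversions_b / users_b
p_pool = (conversions_a + conversions_b) / (users_a + users_b)

# Standard error of the difference under the null hypothesis (no true difference).
se = sqrt(p_pool * (1 - p_pool) * (1 / users_a + 1 / users_b))
z = (p_b - p_a) / se

# Two-sided p-value: how often a difference at least this extreme would occur
# by chance alone if the variants were truly identical.
p_value = 2 * norm.sf(abs(z))
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```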
Confidence Intervals
A confidence interval provides a range of plausible values for the true effect, giving you both the magnitude and uncertainty of your experiment results.
Understanding Confidence Intervals
A 95% confidence interval means that if you were to repeat your experiment many times, about 95% of the resulting intervals would contain the true effect.
What it tells you
- The most likely range for the true effect
- How precise your estimate is (narrower = more precise)
- Whether the effect is statistically significant (a 95% interval that excludes zero corresponds to p < 0.05)
Better than just p-values
- Shows the magnitude of the effect
- Indicates the precision of your estimate
- Helps assess practical significance, not just statistical significance
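Continuing the hypothetical conversion-rate numbers from the p-value sketch above, a 95% confidence interval for the difference in proportions can be computed with the normal (Wald) approximation; this is a minimal sketch rather than the only valid interval.

```python
# Hypothetical example: 95% confidence interval for the difference in
# conversion rates between two variants (normal/Wald approximation).
from math import sqrt
from scipy.stats import norm

conversions_a, users_a = 480, 10_000   # control
conversions_b, users_b = 540, 10_000   # variant

p_a = conversions_a / users_a
p_b = conversions_b / users_b
diff = p_b - p_a

# Unpooled standard error, since we are estimating the size of the effect.
se = sqrt(p_a * (1 - p_a) / users_a + p_b * (1 - p_b) / users_b)
z_crit = norm.ppf(0.975)  # about 1.96 for a 95% interval

lower, upper = diff - z_crit * se, diff + z_crit * se
print(f"difference = {diff:.4f}, 95% CI = [{lower:.4f}, {upper:.4f}]")
# If the interval excludes zero, the result is significant at the 5% level;
# its width shows how precisely the effect has been estimated.
```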
False Positives & Negatives
Understanding the two types of errors in hypothesis testing helps you balance the risks in your experimentation program.
False Positive (Type I Error)
Concluding there is an effect when there actually isn't one
Example scenario:
Your A/B test shows that variant B increases conversion by 15% with p=0.04, but in reality, there's no difference. You implement B and see no actual improvement.
Controlled by
Your significance level (α), typically set at 0.05, which means you accept a 5% chance of a false positive whenever there is truly no difference.
Business impact
Wasted resources implementing changes that don't actually improve metrics, potential negative impacts on user experience.
False Negative (Type II Error)
Concluding there is no effect when there actually is one
Controlled by
Statistical power (1 − β), which is driven mainly by sample size, the size of the true effect, and the variance of the metric.
Business impact
Genuinely better variants are discarded, so real improvements never reach users.
Multiple testing problem
When running many tests (multiple metrics, segments, or experiments), your false positive rate increases dramatically. If you run 20 tests with α=0.05, you have a 64% chance of at least one false positive.
Solution: Use correction methods like Bonferroni, Benjamini-Hochberg, or sequential testing to control the overall false positive rate.
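As a minimal sketch (assuming the statsmodels package is available), the multipletests helper can apply such corrections to a batch of p-values; the p-values below are made up for illustration.

```python
# Hypothetical example: correcting a batch of p-values for multiple testing.
from statsmodels.stats.multitest import multipletests

p_values = [0.002, 0.013, 0.049, 0.22, 0.64]  # illustrative only

# Bonferroni controls the family-wise error rate and is conservative.
reject_bonf, p_adj_bonf, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")

# Benjamini-Hochberg controls the false discovery rate and keeps more power.
reject_bh, p_adj_bh, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")

print("Bonferroni rejects:        ", list(reject_bonf))
print("Benjamini-Hochberg rejects:", list(reject_bh))
```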
CUPED
CUPED (Controlled-experiment Using Pre-Experiment Data) is a variance reduction technique that can dramatically increase the sensitivity of your experiments.
How CUPED Works
CUPED uses historical data from before the experiment to reduce noise in your metrics, making it easier to detect true effects with smaller sample sizes.
Key concept:
CUPED reduces the variance in your experiment data by accounting for pre-experiment metrics. This leads to narrower confidence intervals, allowing you to detect true effects more reliably with the same sample size, or use a smaller sample size to achieve the same statistical power.
Key benefits
- Can reduce variance by 30-80%, depending on how strongly the metric correlates with its pre-experiment value
- Allows you to detect smaller effects with the same sample size
- Can shorten experiment duration by 50-80% in favorable cases
- Works especially well for metrics with high user-to-user variability
How to implement
CUPED adjusts each user's experiment metric by subtracting a portion of their pre-experiment metric, reducing the noise while preserving the treatment effect:
Yi (adjusted) = Yi − θ (Xi − μX)
Where Yi is the user's metric during the experiment, Xi is their pre-experiment metric, μX is the mean of the pre-experiment metric, and θ is a coefficient determined by regressing Y on X (equivalently, Cov(X, Y) / Var(X)).
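Here is a minimal sketch of the adjustment on synthetic per-user data; the distributions and numbers are invented purely to show the mechanics and the resulting variance reduction.

```python
# Hypothetical example: CUPED adjustment on simulated per-user metrics.
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# X: each user's pre-experiment metric; Y: the same metric during the experiment.
# The correlation between the two is what CUPED exploits.
x = rng.gamma(shape=2.0, scale=5.0, size=n)
y = 0.8 * x + rng.normal(0.0, 4.0, size=n)

# theta is the regression coefficient of Y on X, i.e. Cov(X, Y) / Var(X).
theta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# Adjusted metric: same expected value, smaller variance.
y_cuped = y - theta * (x - x.mean())

print(f"variance before CUPED: {y.var():.2f}")
print(f"variance after CUPED:  {y_cuped.var():.2f}")
print(f"variance reduction:    {1 - y_cuped.var() / y.var():.1%}")
```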
Heterogeneous Treatment Effects
Heterogeneous treatment effects occur when an experiment affects different user segments differently, which can lead to missed opportunities if you only look at average effects.
Understanding Heterogeneity
Based on research from Uber, Microsoft, and others (see arxiv.org/pdf/1610.03917)
Average Treatment Effect: the single effect you get by averaging over every user in the experiment.
Heterogeneous Effects: how that effect varies across segments such as new vs. returning users, country, or device, which the average can mask.
Why it matters
Looking only at average effects can hide important insights. A feature might be great for some users but harmful for others. Understanding these differences can lead to:
- Personalized experiences (show features only to users who benefit)
- Improved feature designs that work better for all segments
- Deeper understanding of your users and their needs
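A minimal sketch of a segment-level breakdown with pandas; the segments, rates, and simulated lift are invented, and a real analysis would also attach confidence intervals and apply the multiple-testing corrections discussed in the next section.

```python
# Hypothetical example: per-segment treatment effects on simulated data.
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 20_000

df = pd.DataFrame({
    "variant": rng.choice(["control", "treatment"], size=n),
    "segment": rng.choice(["new_user", "returning_user"], size=n),
})

# Simulate a conversion metric where only new users benefit from the treatment.
lift = np.where(
    (df["variant"] == "treatment") & (df["segment"] == "new_user"), 0.03, 0.0
)
df["converted"] = rng.random(n) < (0.10 + lift)

# Conversion rate by segment and variant, plus the per-segment effect.
rates = df.groupby(["segment", "variant"])["converted"].mean().unstack()
rates["effect"] = rates["treatment"] - rates["control"]
print(rates)
# The pooled average effect would hide that the lift is concentrated
# in the new_user segment.
```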
Multiple Hypothesis Testing
When running multiple tests simultaneously, the probability of finding at least one false positive increases dramatically. Correction methods help control this risk.
The Multiple Testing Problem
If you run a single test with a significance level of 0.05, you have a 5% chance of a false positive. But what happens when you run multiple tests?
The problem:
With each additional test, your chance of finding at least one false positive increases:
- 5 tests: 23% chance of ≥1 false positive
- 10 tests: 40% chance of ≥1 false positive
- 20 tests: 64% chance of ≥1 false positive
- 100 tests: 99.4% chance of ≥1 false positive
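These figures assume the tests are independent, so the chance of avoiding a false positive in every test is (1 − α) raised to the number of tests:

```python
# Chance of at least one false positive across k independent tests,
# each run at significance level alpha.
alpha = 0.05
for k in (1, 5, 10, 20, 100):
    p_any = 1 - (1 - alpha) ** k
    print(f"{k:>3} tests: {p_any:.1%} chance of at least one false positive")
```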
Common scenarios in A/B testing:
- Testing multiple metrics (conversion, revenue, engagement)
- Analyzing multiple user segments (new vs. returning, by country, by device)
- Running multiple concurrent experiments
- Performing repeated interim analyses (peeking at results)
Correction Methods
Several statistical methods can help control the false positive rate when running multiple tests:
Bonferroni Correction
Divides your significance level (α) by the number of tests.
Example: For 5 tests with α=0.05, use α=0.01 for each test.
Pros: Simple to apply, controls family-wise error rate (FWER)
Cons: Very conservative, reduces statistical power
Benjamini-Hochberg Procedure
Controls the false discovery rate (FDR) instead of the family-wise error rate.
Process: Sort p-values, compare each to (i/m)×α where i is the rank and m is the number of tests.
Pros: Less conservative than Bonferroni, better statistical power
Cons: Slightly more complex to apply, controls proportion of false discoveries rather than probability of any false discovery
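A minimal sketch of the step-up procedure described above, written out by hand so the (i/m)×α comparison is explicit; the p-values are illustrative.

```python
# Hypothetical example: Benjamini-Hochberg step-up procedure by hand.
def benjamini_hochberg(p_values, alpha=0.05):
    """Return indices of hypotheses rejected at FDR level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # ranks by p-value
    # Find the largest rank i (1-based) with p_(i) <= (i / m) * alpha.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= (rank / m) * alpha:
            max_rank = rank
    # Reject every hypothesis ranked at or below that cutoff.
    return sorted(order[:max_rank])

print(benjamini_hochberg([0.002, 0.013, 0.049, 0.22, 0.64]))  # -> [0, 1]
```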
Holm-Bonferroni Method
A step-down procedure that is less conservative than Bonferroni but still controls FWER.
Process: Sort p-values, compare smallest to α/n, next to α/(n-1), etc.
Pros: More powerful than Bonferroni, still controls FWER
Cons: More complex to apply
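A minimal hand-written sketch of the step-down procedure, using the same illustrative p-values:

```python
# Hypothetical example: Holm-Bonferroni step-down procedure by hand.
def holm_bonferroni(p_values, alpha=0.05):
    """Return indices of hypotheses rejected while controlling FWER."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    rejected = []
    for step, idx in enumerate(order):
        # Compare the smallest p-value to alpha/m, the next to alpha/(m-1), ...
        if p_values[idx] <= alpha / (m - step):
            rejected.append(idx)
        else:
            break  # stop at the first non-rejection; keep all remaining hypotheses
    return sorted(rejected)

print(holm_bonferroni([0.002, 0.013, 0.049, 0.22, 0.64]))  # -> [0]
```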
Sequential Testing
Adjusts significance thresholds for interim analyses during an experiment.
Methods: O'Brien-Fleming, Pocock, alpha-spending functions
Pros: Allows for early stopping while controlling error rates
Cons: Requires planning the number of interim analyses in advance
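As one illustration, a Lan-DeMets O'Brien-Fleming-type alpha-spending function specifies how much of the overall α budget may be spent by each interim look. The sketch below only computes the cumulative spend; deriving the exact per-look critical values also requires modelling the correlation between looks, which dedicated group-sequential libraries handle.

```python
# Sketch: O'Brien-Fleming-type alpha-spending (Lan-DeMets form) for a
# two-sided test. t is the information fraction (share of planned data seen).
from math import sqrt
from scipy.stats import norm

alpha = 0.05
z = norm.ppf(1 - alpha / 2)

for t in (0.25, 0.50, 0.75, 1.00):
    spent = 2 - 2 * norm.cdf(z / sqrt(t))
    print(f"information fraction {t:.2f}: cumulative alpha spent = {spent:.4f}")
# Almost no alpha is available at early looks, so stopping early demands very
# strong evidence; the full 0.05 becomes available only at the final analysis.
```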
Practical Recommendations
- Define primary and secondary metrics before running experiments
- Pre-specify which segments you'll analyze
- Use Benjamini-Hochberg for exploratory analyses with many metrics
- Use Bonferroni or Holm-Bonferroni when you need strong control of false positives
- Consider sequential testing for long-running experiments with interim analyses
- Always report which correction method you used and why