A/B Testing Statistical Analysis
This skill provides guidance on correctly analyzing A/B test results using appropriate statistical methods.
Overview
A/B testing compares a control group against a treatment group to determine if a change has a statistically significant effect. The key challenge is choosing the right statistical test based on the metric type.
Choosing the Right Statistical Test
Binary Metrics (Conversion, Click-through, etc.)
For metrics that are 0/1 outcomes (did user convert?), use a two-proportion z-test:
from statsmodels.stats.proportion import proportions_ztest
# counts: number of successes in each group
# nobs: total observations in each group
counts = [treatment_conversions, control_conversions]
nobs = [treatment_total, control_total]
# Two-sided test
stat, p_value = proportions_ztest(counts, nobs, alternative='two-sided')
Why not chi-squared? The two-proportion z-test is mathematically equivalent for 2x2 tables but directly gives you the z-statistic which can be useful for confidence intervals.
Continuous Metrics (Revenue, Time, etc.)
For continuous measurements, use Welch's t-test (does not assume equal variances):
from scipy import stats
# Two-sided Welch's t-test
stat, p_value = stats.ttest_ind(
treatment_values,
control_values,
equal_var=False # Welch's t-test
)
Important: Use equal_var=False to get Welch's t-test, which is more robust than Student's t-test when sample sizes or variances differ between groups.
When to Use Which Test
| Metric Type | Examples | Test |
|---|---|---|
| Binary (0/1) | Conversion, Click, Purchase | Two-proportion z-test |
| Continuous | Revenue, Time, Page views | Welch's t-test |
| Count data | Number of items | Welch's t-test (if mean > 5) |
Multiple Comparison Corrections
When testing multiple hypotheses, the probability of at least one false positive increases. Apply corrections:
Bonferroni Correction
The simplest and most conservative approach:
# If testing k hypotheses at significance level alpha:
adjusted_alpha = alpha / k
# A result is significant only if p_value < adjusted_alpha
significant = p_value < (0.05 / num_tests)
Example: Testing 3 metrics with α = 0.05:
- Adjusted threshold: 0.05 / 3 = 0.0167
- Only p-values below 0.0167 are considered significant
When to Apply Bonferroni
Apply Bonferroni when:
- Testing multiple metrics in the same experiment
- Comparing one treatment against multiple controls
- Running multiple experiments and want to control family-wise error rate
Do NOT apply across independent experiments if you accept some false positives.
Calculating Effect Sizes
Relative Lift (Relative Change)
The most common way to express A/B test results:
# Relative lift = (treatment - control) / control
relative_lift = (treatment_mean - control_mean) / control_mean
Interpretation: A lift of 0.15 means the treatment is 15% better than control.
Conversion Rate Calculation
import pandas as pd
# For a dataframe with 'variant' and 'converted' columns
control_data = df[df['variant'] == 'control']
treatment_data = df[df['variant'] == 'treatment']
control_rate = control_data['converted'].mean()
treatment_rate = treatment_data['converted'].mean()
Complete Analysis Workflow
Step 1: Load and Split Data
import pandas as pd
df = pd.read_csv('experiment.csv')
control = df[df['variant'] == 'control']
treatment = df[df['variant'] == 'treatment']
Step 2: Analyze Binary Metric
from statsmodels.stats.proportion import proportions_ztest
# Calculate rates
control_rate = control['converted'].mean()
treatment_rate = treatment['converted'].mean()
# Run test
counts = [treatment['converted'].sum(), control['converted'].sum()]
nobs = [len(treatment), len(control)]
_, p_value = proportions_ztest(counts, nobs, alternative='two-sided')
# Calculate lift
lift = (treatment_rate - control_rate) / control_rate
Step 3: Analyze Continuous Metric
from scipy import stats
# Calculate means
control_mean = control['revenue'].mean()
treatment_mean = treatment['revenue'].mean()
# Run Welch's t-test
_, p_value = stats.ttest_ind(
treatment['revenue'],
control['revenue'],
equal_var=False
)
# Calculate lift
lift = (treatment_mean - control_mean) / control_mean
Step 4: Apply Multiple Testing Correction
num_tests = 3 # e.g., conversion, revenue, duration
adjusted_alpha = 0.05 / num_tests # 0.0167
# Determine significance
is_significant = p_value < adjusted_alpha
Power Analysis
Power analysis helps determine how many additional samples are needed to detect an effect. Use this when a result is not statistically significant but you want to know if more data could help.
For Continuous Metrics (t-test)
from statsmodels.stats.power import TTestIndPower
import numpy as np
def additional_samples_needed(control_data, treatment_data, alpha, power=0.8):
"""Calculate additional samples needed for significance."""
control_mean = control_data.mean()
treatment_mean = treatment_data.mean()
pooled_std = np.sqrt((control_data.var() + treatment_data.var()) / 2)
if pooled_std == 0 or control_mean == treatment_mean:
return 0
# Cohen's d effect size
effect_size = abs(treatment_mean - control_mean) / pooled_std
power_analysis = TTestIndPower()
required_n = power_analysis.solve_power(
effect_size=effect_size,
alpha=alpha,
power=power,
alternative='two-sided'
)
current_n = (len(control_data) + len(treatment_data)) / 2
return max(0, int(np.ceil(required_n - current_n)))
For Binary Metrics (proportions)
from statsmodels.stats.power import zt_ind_solve_power
import numpy as np
def additional_samples_proportion(control_prop, treatment_prop, n_control, n_treatment, alpha, power=0.8):
"""Calculate additional samples needed for proportion test."""
if control_prop == treatment_prop:
return 0
# Cohen's h effect size for proportions
effect_size = 2 * (np.arcsin(np.sqrt(treatment_prop)) - np.arcsin(np.sqrt(control_prop)))
required_n = zt_ind_solve_power(
effect_size=abs(effect_size),
alpha=alpha,
power=power,
alternative='two-sided'
)
current_n = (n_control + n_treatment) / 2
return max(0, int(np.ceil(required_n - current_n)))
Key Concepts
- Power (1 - β): Probability of detecting a true effect. Typically 0.8 (80%)
- Effect size: Standardized measure of the difference between groups
- Cohen's d for continuous:
(mean1 - mean2) / pooled_std - Cohen's h for proportions:
2 * (arcsin(√p1) - arcsin(√p2))
- Cohen's d for continuous:
- Alpha (α): Significance level (e.g., 0.05 or Bonferroni-adjusted)
When to Use
- Result is not significant but effect looks promising
- Planning sample size for future experiments
- Deciding whether to continue collecting data
Common Pitfalls
- Using chi-squared for proportions: While valid, proportions_ztest is more direct
- Forgetting equal_var=False: Student's t-test assumes equal variances
- Not correcting for multiple tests: Inflates false positive rate
- Division by zero in lift: Handle cases where control mean is 0
- Confusing one-tailed vs two-tailed: Use two-tailed unless you have a strong prior
- Ignoring power analysis: A non-significant result doesn't mean no effect exists
Dependencies
pip install scipy statsmodels pandas numpy
Key imports:
scipy.stats.ttest_ind- Welch's t-teststatsmodels.stats.proportion.proportions_ztest- Two-proportion z-test
