A/B Testing Statistical Analysis

This skill provides guidance on correctly analyzing A/B test results using appropriate statistical methods.

Overview

A/B testing compares a control group against a treatment group to determine if a change has a statistically significant effect. The key challenge is choosing the right statistical test based on the metric type.

Choosing the Right Statistical Test

Binary Metrics (Conversion, Click-through, etc.)

For metrics that are 0/1 outcomes (did user convert?), use a two-proportion z-test:

from statsmodels.stats.proportion import proportions_ztest

# counts: number of successes in each group
# nobs: total observations in each group
counts = [treatment_conversions, control_conversions]
nobs = [treatment_total, control_total]

# Two-sided test
stat, p_value = proportions_ztest(counts, nobs, alternative='two-sided')

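Why not chi-squared? For a 2x2 table the two-proportion z-test is mathematically equivalent to the chi-squared test without continuity correction (the chi-squared statistic equals z squared). The z-test returns the signed z-statistic directly, which shows the direction of the effect and is useful for confidence intervals.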
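
To make the equivalence concrete, here is a minimal sketch that computes the pooled two-proportion z-statistic and the Pearson chi-squared statistic (no continuity correction) by hand for the same 2x2 table. The function names and the counts are illustrative assumptions, not part of the skill. Only the standard library is used.

```python
import math
from statistics import NormalDist


def two_proportion_z(successes_a, n_a, successes_b, n_b):
    """Pooled two-proportion z-statistic and two-sided p-value."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled rate under the null hypothesis that both groups share one rate
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


def pearson_chi2_2x2(successes_a, n_a, successes_b, n_b):
    """Pearson chi-squared statistic for the 2x2 table, no continuity correction."""
    observed = [
        [successes_a, n_a - successes_a],
        [successes_b, n_b - successes_b],
    ]
    total = n_a + n_b
    row_totals = [n_a, n_b]
    col_totals = [observed[0][j] + observed[1][j] for j in range(2)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2


# Illustrative counts only (not real experiment data)
z, p_value = two_proportion_z(120, 1000, 100, 1000)
chi2 = pearson_chi2_2x2(120, 1000, 100, 1000)
print(f"z = {z:.4f}, z^2 = {z**2:.4f}, chi2 = {chi2:.4f}, p = {p_value:.4f}")
```

The squared z-statistic and the chi-squared statistic agree up to floating-point error, but only z carries the sign of the difference.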

Continuous Metrics (Revenue, Time, etc.)

For continuous measurements, use Welch's t-test (does not assume equal variances):

from scipy import stats

# Two-sided Welch's t-test
stat, p_value = stats.ttest_ind(
    treatment_values,
    control_values,
    equal_var=False  # Welch's t-test
)

Important: Use equal_var=False to get Welch's t-test, which is more robust than Student's t-test when sample sizes or variances differ between groups.
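
As a rough illustration of what Welch's test does differently, the sketch below computes the t-statistic and the Welch-Satterthwaite degrees of freedom by hand. The helper name `welch_t` and the sample values are assumptions made for this example. In practice `stats.ttest_ind(..., equal_var=False)` handles this and also returns the p-value.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return Welch's t-statistic and Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each group mean (sample variance uses n - 1)
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t_stat = (mean(sample_a) - mean(sample_b)) / math.sqrt(se2_a + se2_b)
    # Degrees of freedom shrink when one group is small or much noisier
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t_stat, df


# Made-up values for illustration: unequal sizes and unequal spread
treatment_values = [12.0, 15.5, 9.0, 22.0, 18.5, 30.0]
control_values = [10.0, 11.0, 9.5, 10.5]
t_stat, df = welch_t(treatment_values, control_values)
print(f"t = {t_stat:.3f}, df = {df:.2f}")
```

The degrees of freedom are generally not an integer and fall between the smaller group's n - 1 and the pooled n_a + n_b - 2, which is how the test avoids assuming equal variances.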

When to Use Which Test

Metric Type	Examples	Test
Binary (0/1)	Conversion, Click, Purchase	Two-proportion z-test
Continuous	Revenue, Time, Page views	Welch's t-test
Count data	Number of items	Welch's t-test (if mean > 5)

Multiple Comparison Corrections

When testing multiple hypotheses, the probability of at least one false positive increases. Apply corrections:

Bonferroni Correction

The simplest and most conservative approach:

# If testing k hypotheses at significance level alpha:
alpha = 0.05
k = 3
adjusted_alpha = alpha / k

# A result is significant only if p_value < adjusted_alpha
significant = p_value < adjusted_alpha

Example: Testing 3 metrics with α = 0.05:

Adjusted threshold: 0.05 / 3 = 0.0167
Only p-values below 0.0167 are considered significant
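
An equivalent way to apply the correction is to multiply each p-value by the number of tests and compare against the original α. The sketch below shows that form. The function name and the three p-values are hypothetical, chosen only to illustrate the arithmetic.

```python
def bonferroni_adjust(p_values):
    """Return Bonferroni-adjusted p-values, capped at 1.0."""
    k = len(p_values)
    return [min(1.0, p * k) for p in p_values]


alpha = 0.05
# Hypothetical p-values for three metrics (illustrative only)
raw_p_values = [0.010, 0.020, 0.040]
adjusted = bonferroni_adjust(raw_p_values)
significant = [p < alpha for p in adjusted]
print(adjusted)
print(significant)
```

Reporting adjusted p-values keeps the familiar 0.05 threshold in write-ups, while the decision for each metric is the same as comparing the raw p-value against 0.05 / 3.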

When to Apply Bonferroni

Apply Bonferroni when:

Testing multiple metrics in the same experiment
Comparing several variants against each other (for example, multiple treatments against one control)
Running multiple experiments and want to control family-wise error rate

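Do NOT apply Bonferroni across independent experiments if a false positive rate of α per experiment is acceptable. Correct across experiments only when you need to control the family-wise error rate over all of them.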
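
The family-wise error rate can be made concrete for the special case of independent tests, where the chance of at least one false positive is 1 - (1 - α)^k. The sketch below assumes independence, which real metrics from the same experiment often violate. The function name is illustrative.

```python
def family_wise_error_rate(alpha, num_tests):
    """P(at least one false positive) across independent tests under the null."""
    return 1 - (1 - alpha) ** num_tests


alpha = 0.05
for k in (1, 3, 10):
    uncorrected = family_wise_error_rate(alpha, k)
    corrected = family_wise_error_rate(alpha / k, k)
    print(f"k={k:2d}  uncorrected={uncorrected:.3f}  bonferroni={corrected:.3f}")
```

With three tests the uncorrected rate is already about 14%, while the Bonferroni threshold keeps it just under 5%.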

Calculating Effect Sizes

Relative Lift (Relative Change)

The most common way to express A/B test results:

# Relative lift = (treatment - control) / control
relative_lift = (treatment_mean - control_mean) / control_mean

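Interpretation: A lift of 0.15 means the treatment mean is 15% higher than the control mean. Whether higher is better depends on the metric: it is good for conversion rate and bad for page load time.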
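
Because the formula divides by the control mean, it is undefined when the control mean is zero. One possible way to guard against that is sketched below. Returning `None` is an assumption of this example, and callers may prefer to raise an error or report the absolute difference instead.

```python
def relative_lift(control_mean, treatment_mean):
    """Relative lift of treatment over control, or None if control is zero."""
    if control_mean == 0:
        return None  # lift is undefined when the baseline is zero
    return (treatment_mean - control_mean) / control_mean


# Illustrative rates only
lift = relative_lift(control_mean=0.10, treatment_mean=0.115)
print(f"{lift:.1%}")
```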

Conversion Rate Calculation

import pandas as pd

# For a dataframe with 'variant' and 'converted' columns
control_data = df[df['variant'] == 'control']
treatment_data = df[df['variant'] == 'treatment']

control_rate = control_data['converted'].mean()
treatment_rate = treatment_data['converted'].mean()
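
Alongside the two rates, it is often useful to report a confidence interval for their difference. The sketch below uses the simple Wald interval with an unpooled standard error. The function name and the counts are illustrative assumptions, and other interval methods may behave better for very small samples or rates near 0 or 1.

```python
import math
from statistics import NormalDist


def diff_in_rates_ci(control_conv, control_n, treatment_conv, treatment_n,
                     confidence=0.95):
    """Wald confidence interval for (treatment rate - control rate)."""
    control_rate = control_conv / control_n
    treatment_rate = treatment_conv / treatment_n
    # Unpooled standard error: each group keeps its own variance estimate
    se = math.sqrt(
        control_rate * (1 - control_rate) / control_n
        + treatment_rate * (1 - treatment_rate) / treatment_n
    )
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = treatment_rate - control_rate
    return diff - z_crit * se, diff + z_crit * se


# Illustrative counts only (not real experiment data)
low, high = diff_in_rates_ci(100, 1000, 120, 1000)
print(f"95% CI for the absolute difference: [{low:.4f}, {high:.4f}]")
```

An interval that contains zero is consistent with a two-sided test that does not reject at the matching significance level.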

Complete Analysis Workflow

Step 1: Load and Split Data

import pandas as pd

df = pd.read_csv('experiment.csv')
control = df[df['variant'] == 'control']
treatment = df[df['variant'] == 'treatment']

Step 2: Analyze Binary Metric

from statsmodels.stats.proportion import proportions_ztest

# Calculate rates
control_rate = control['converted'].mean()
treatment_rate = treatment['converted'].mean()

# Run test
counts = [treatment['converted'].sum(), control['converted'].sum()]
nobs = [len(treatment), len(control)]
_, p_value = proportions_ztest(counts, nobs, alternative='two-sided')

# Calculate lift
lift = (treatment_rate - control_rate) / control_rate

Step 3: Analyze Continuous Metric

from scipy import stats

# Calculate means
control_mean = control['revenue'].mean()
treatment_mean = treatment['revenue'].mean()

# Run Welch's t-test
_, p_value = stats.ttest_ind(
    treatment['revenue'],
    control['revenue'],
    equal_var=False
)

# Calculate lift
lift = (treatment_mean - control_mean) / control_mean

Step 4: Apply Multiple Testing Correction

num_tests = 3  # e.g., conversion, revenue, duration
adjusted_alpha = 0.05 / num_tests  # 0.0167

# Determine significance
is_significant = p_value < adjusted_alpha
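
Steps 2 and 3 both assign to `p_value`, so the later result overwrites the earlier one. One way to keep the results apart is to store each p-value under its metric name and apply the adjusted threshold to all of them together. The p-values below are hypothetical placeholders for the outputs of the earlier steps.

```python
# Hypothetical p-values, one per metric; in practice these would come from
# the tests in Steps 2 and 3, stored under distinct names.
p_values = {
    "conversion": 0.012,
    "revenue": 0.030,
    "duration": 0.200,
}

alpha = 0.05
adjusted_alpha = alpha / len(p_values)

# Apply the same Bonferroni threshold to every metric
results = {
    metric: {"p_value": p, "significant": p < adjusted_alpha}
    for metric, p in p_values.items()
}

for metric, result in results.items():
    print(f"{metric}: p={result['p_value']:.3f}, significant={result['significant']}")
```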

Power Analysis

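Power analysis estimates how many samples are needed to detect an effect of a given size with a given probability. Use it when a result is not statistically significant and you want to know whether collecting more data could help. The functions below treat the observed effect as the true effect, so their output is a rough estimate.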

For Continuous Metrics (t-test)

from statsmodels.stats.power import TTestIndPower
import numpy as np

def additional_samples_needed(control_data, treatment_data, alpha, power=0.8):
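    """Estimate additional samples needed per group to reach the target power."""
    control_mean = control_data.mean()
    treatment_mean = treatment_data.mean()
    # .var() is the sample variance (ddof=1) when the inputs are pandas Series
    pooled_std = np.sqrt((control_data.var() + treatment_data.var()) / 2)

    # No observed difference (or no variance): the required sample size is
    # undefined, so 0 here means "cannot estimate", not "enough data"
    if pooled_std == 0 or control_mean == treatment_mean:
        return 0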

    # Cohen's d effect size
    effect_size = abs(treatment_mean - control_mean) / pooled_std

    power_analysis = TTestIndPower()
    required_n = power_analysis.solve_power(
        effect_size=effect_size,
        alpha=alpha,
        power=power,
        alternative='two-sided'
    )

    current_n = (len(control_data) + len(treatment_data)) / 2
    return max(0, int(np.ceil(required_n - current_n)))

For Binary Metrics (proportions)

from statsmodels.stats.power import zt_ind_solve_power
import numpy as np

def additional_samples_proportion(control_prop, treatment_prop, n_control, n_treatment, alpha, power=0.8):
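    """Estimate additional samples needed per group for the proportion test."""
    # Identical proportions: the required sample size is undefined, so 0 here
    # means "cannot estimate", not "enough data"
    if control_prop == treatment_prop:
        return 0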

    # Cohen's h effect size for proportions
    effect_size = 2 * (np.arcsin(np.sqrt(treatment_prop)) - np.arcsin(np.sqrt(control_prop)))

    required_n = zt_ind_solve_power(
        effect_size=abs(effect_size),
        alpha=alpha,
        power=power,
        alternative='two-sided'
    )

    current_n = (n_control + n_treatment) / 2
    return max(0, int(np.ceil(required_n - current_n)))

Key Concepts

Power (1 - β): Probability of detecting a true effect. Typically 0.8 (80%)
Effect size: Standardized measure of the difference between groups
- Cohen's d for continuous: (mean1 - mean2) / pooled_std
- Cohen's h for proportions: 2 * (arcsin(√p1) - arcsin(√p2))
Alpha (α): Significance level (e.g., 0.05 or Bonferroni-adjusted)
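
To show how these three quantities combine, the sketch below computes both effect sizes and plugs them into the textbook normal-approximation formula for per-group sample size, n = 2 * (z_alpha + z_power)^2 / effect_size^2. This is a hand-rolled approximation using only the standard library, not the statsmodels solver. The function names and inputs are illustrative.

```python
import math
from statistics import NormalDist


def cohens_d(mean_1, mean_2, pooled_std):
    """Standardized mean difference."""
    return (mean_1 - mean_2) / pooled_std


def cohens_h(p1, p2):
    """Arcsine-transformed difference between two proportions."""
    return 2 * (math.asin(math.sqrt(p1)) - math.asin(math.sqrt(p2)))


def approx_n_per_group(effect_size, alpha=0.05, power=0.8):
    """Normal-approximation sample size per group for a two-sided test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)


# Illustrative inputs only
d = cohens_d(mean_1=10.5, mean_2=10.0, pooled_std=1.0)
h = cohens_h(0.12, 0.10)
print(approx_n_per_group(d))
print(approx_n_per_group(h))
```

Required sample size scales with the inverse square of the effect size, so halving the effect roughly quadruples the samples needed. A t-based solver can return a slightly larger n than this approximation when samples are small.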

When to Use

Result is not significant but effect looks promising
Planning sample size for future experiments
Deciding whether to continue collecting data

Common Pitfalls

Using chi-squared for proportions: While valid, proportions_ztest is more direct
Forgetting equal_var=False: Student's t-test assumes equal variances
Not correcting for multiple tests: Inflates false positive rate
Division by zero in lift: Handle cases where control mean is 0
Confusing one-tailed vs two-tailed: Use two-tailed unless a directional hypothesis was fixed before the experiment started
Ignoring power analysis: A non-significant result doesn't mean no effect exists

Dependencies

pip install scipy statsmodels pandas numpy

Key imports:

scipy.stats.ttest_ind - Welch's t-test
statsmodels.stats.proportion.proportions_ztest - Two-proportion z-test
