askill
ab-testing-statistics


Statistical methods for analyzing A/B test (controlled experiment) results. Covers hypothesis testing for different metric types, multiple comparison corrections, and effect size calculations. Use when analyzing experiment data with control and treatment groups.

Updated 1/22/2026

SKILL.md

A/B Testing Statistical Analysis

This skill provides guidance on correctly analyzing A/B test results using appropriate statistical methods.

Overview

A/B testing compares a control group against a treatment group to determine if a change has a statistically significant effect. The key challenge is choosing the right statistical test based on the metric type.

Choosing the Right Statistical Test

Binary Metrics (Conversion, Click-through, etc.)

For metrics that are 0/1 outcomes (did user convert?), use a two-proportion z-test:

from statsmodels.stats.proportion import proportions_ztest

# counts: number of successes in each group
# nobs: total observations in each group
counts = [treatment_conversions, control_conversions]
nobs = [treatment_total, control_total]

# Two-sided test
stat, p_value = proportions_ztest(counts, nobs, alternative='two-sided')

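Why not chi-squared? For a 2x2 table the two-proportion z-test is mathematically equivalent to the chi-squared test without continuity correction (the chi-squared statistic equals z squared). The z-test, however, gives you the signed z-statistic directly, which shows the direction of the effect and is convenient when building confidence intervals.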
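
The equivalence can be checked by hand. The following is a minimal sketch in plain Python (standard library only) with made-up counts. The helper names are illustrative and are not part of statsmodels or scipy. It computes the pooled two-proportion z-statistic and the Pearson chi-squared statistic (no continuity correction) for the same 2x2 table.

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-proportion z-statistic (hypothetical helper)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

def pearson_chi2_2x2(x1, n1, x2, n2):
    """Pearson chi-squared for the 2x2 table, no continuity correction."""
    observed = [[x1, n1 - x1], [x2, n2 - x2]]
    total = n1 + n2
    row_totals = [n1, n2]
    col_totals = [x1 + x2, total - (x1 + x2)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2

# Illustrative counts (not from a real experiment)
z = two_proportion_z(120, 1000, 100, 1000)
chi2 = pearson_chi2_2x2(120, 1000, 100, 1000)
print(z ** 2, chi2)  # the two values agree up to floating-point error
```

The sign of z is lost when squaring, which is why the z-test is the more convenient form when the direction of the effect matters.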

Continuous Metrics (Revenue, Time, etc.)

For continuous measurements, use Welch's t-test (does not assume equal variances):

from scipy import stats

# Two-sided Welch's t-test
stat, p_value = stats.ttest_ind(
    treatment_values,
    control_values,
    equal_var=False  # Welch's t-test
)

Important: Use equal_var=False to get Welch's t-test, which is more robust than Student's t-test when sample sizes or variances differ between groups.
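
To make the difference from Student's t-test concrete, here is a small sketch that computes Welch's t-statistic and the Welch-Satterthwaite degrees of freedom by hand, using only the standard library. The function name and sample data are assumptions for illustration; in practice `stats.ttest_ind(..., equal_var=False)` does this for you and also returns the p-value.

```python
import math
import statistics

def welch_t_and_df(a, b):
    """Welch's t-statistic and Welch-Satterthwaite degrees of freedom."""
    va, vb = statistics.variance(a), statistics.variance(b)  # sample variances
    na, nb = len(a), len(b)
    se2 = va / na + vb / nb  # squared standard error, variances not pooled
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

# Equal sizes and equal variances: df matches Student's n1 + n2 - 2
t_equal, df_equal = welch_t_and_df([1, 2, 3, 4], [2, 3, 4, 5])

# Unequal sizes and variances: df drops below Student's n1 + n2 - 2
t_unequal, df_unequal = welch_t_and_df([1, 2, 3, 4], [0, 5, 10, 15, 20, 25])
print(df_equal, df_unequal)
```

The lower degrees of freedom in the second case widen the reference distribution, which is how Welch's test protects against overconfident p-values when the groups differ.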

When to Use Which Test

| Metric Type | Examples | Test |
|---|---|---|
| Binary (0/1) | Conversion, Click, Purchase | Two-proportion z-test |
| Continuous | Revenue, Time, Page views | Welch's t-test |
| Count data | Number of items | Welch's t-test (if mean > 5) |
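
The table can be turned into a rough heuristic. The sketch below is an assumption-laden illustration, not part of any library: `suggest_test` is a hypothetical helper, and the behavior for count data with mean at or below 5 (flagging it for manual review) is a choice made here, since the table does not say what to do in that case.

```python
def suggest_test(values):
    """Suggest a test name following the table above (hypothetical helper)."""
    values = list(values)
    # Binary metric: every observation is 0 or 1
    if all(v in (0, 1) for v in values):
        return "two-proportion z-test"
    # Heuristic: non-negative whole numbers are treated as count data
    is_count = all(v >= 0 and float(v).is_integer() for v in values)
    mean = sum(values) / len(values)
    if is_count and mean <= 5:
        # Assumption: the table gives no test for this case
        return "review manually: count data with mean <= 5"
    return "Welch's t-test"

print(suggest_test([0, 1, 1, 0]))
print(suggest_test([12.5, 3.2, 40.0]))
print(suggest_test([1, 2, 0, 3]))
```

A continuous metric recorded in whole units (for example revenue in whole dollars) would be classified as count data by this heuristic, so treat the result as a prompt rather than a decision.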

Multiple Comparison Corrections

When testing multiple hypotheses, the probability of at least one false positive increases. Apply corrections:

Bonferroni Correction

The simplest and most conservative approach:

# If testing k hypotheses at significance level alpha:
adjusted_alpha = alpha / k

# A result is significant only if p_value < adjusted_alpha
significant = p_value < adjusted_alpha

Example: Testing 3 metrics with α = 0.05:

  • Adjusted threshold: 0.05 / 3 = 0.0167
  • Only p-values below 0.0167 are considered significant
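
As a worked version of this example, the sketch below applies the adjusted threshold to three p-values. The metric names and p-values are invented for illustration.

```python
alpha = 0.05

# Hypothetical p-values for three metrics in one experiment
p_values = {"conversion": 0.012, "revenue": 0.030, "duration": 0.200}

adjusted_alpha = alpha / len(p_values)  # 0.05 / 3
significant = {name: p < adjusted_alpha for name, p in p_values.items()}
print(significant)
# {'conversion': True, 'revenue': False, 'duration': False}
```

Note that revenue (p = 0.030) would pass an uncorrected 0.05 threshold but does not pass the adjusted one.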

When to Apply Bonferroni

Apply Bonferroni when:

  • Testing multiple metrics in the same experiment
  • Comparing several variants pairwise (for example, multiple treatments against one control)
  • Running multiple experiments and want to control family-wise error rate

Do NOT apply it across independent experiments if you are willing to accept some false positives among them.
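
To quantify the inflation that motivates the correction, the following sketch computes the family-wise error rate for k tests, with and without Bonferroni. It assumes the tests are independent, which is a simplification; the function name is illustrative.

```python
def family_wise_error_rate(alpha, k):
    """P(at least one false positive) across k independent tests at level alpha."""
    return 1 - (1 - alpha) ** k

for k in (1, 3, 10):
    uncorrected = family_wise_error_rate(0.05, k)
    corrected = family_wise_error_rate(0.05 / k, k)
    print(k, round(uncorrected, 4), round(corrected, 4))
```

With three tests at α = 0.05, the uncorrected rate is about 0.14; with the Bonferroni threshold it stays at or below 0.05.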

Calculating Effect Sizes

Relative Lift (Relative Change)

The most common way to express A/B test results:

# Relative lift = (treatment - control) / control
relative_lift = (treatment_mean - control_mean) / control_mean

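Interpretation: A lift of 0.15 means the treatment mean is 15% higher than the control mean. Whether that is an improvement depends on the metric (higher is better for conversion, worse for page load time).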
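
Because the formula divides by the control mean, a zero control mean needs handling. A minimal sketch, assuming that returning `None` is an acceptable way to signal an undefined lift (the function name and that convention are choices made here):

```python
def relative_lift(treatment_mean, control_mean):
    """Relative lift; returns None when the control mean is 0 (lift undefined)."""
    if control_mean == 0:
        return None
    return (treatment_mean - control_mean) / control_mean

print(relative_lift(0.115, 0.100))  # roughly 0.15, i.e. a 15% lift
print(relative_lift(1.0, 0.0))      # None
```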

Conversion Rate Calculation

import pandas as pd

# For a dataframe with 'variant' and 'converted' columns
control_data = df[df['variant'] == 'control']
treatment_data = df[df['variant'] == 'treatment']

control_rate = control_data['converted'].mean()
treatment_rate = treatment_data['converted'].mean()
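
Once the two rates are known, a confidence interval for their difference is often reported alongside the p-value. The sketch below computes a Wald interval with the standard library only. The function name and counts are assumptions for illustration, and the Wald interval is one common choice among several.

```python
import math
from statistics import NormalDist

def diff_confidence_interval(x_t, n_t, x_c, n_c, confidence=0.95):
    """Wald interval for (treatment rate - control rate)."""
    p_t, p_c = x_t / n_t, x_c / n_c
    # Unpooled standard error, the usual choice for an interval
    se = math.sqrt(p_t * (1 - p_t) / n_t + p_c * (1 - p_c) / n_c)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = p_t - p_c
    return diff - z_crit * se, diff + z_crit * se

# Illustrative counts: 120/1000 treatment conversions, 100/1000 control
low, high = diff_confidence_interval(120, 1000, 100, 1000)
print(low, high)
```

An interval that contains 0, as this one does, is consistent with a non-significant two-sided test at the same level.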

Complete Analysis Workflow

Step 1: Load and Split Data

import pandas as pd

df = pd.read_csv('experiment.csv')
control = df[df['variant'] == 'control']
treatment = df[df['variant'] == 'treatment']

Step 2: Analyze Binary Metric

from statsmodels.stats.proportion import proportions_ztest

# Calculate rates
control_rate = control['converted'].mean()
treatment_rate = treatment['converted'].mean()

# Run test
counts = [treatment['converted'].sum(), control['converted'].sum()]
nobs = [len(treatment), len(control)]
_, p_value = proportions_ztest(counts, nobs, alternative='two-sided')

# Calculate lift
lift = (treatment_rate - control_rate) / control_rate

Step 3: Analyze Continuous Metric

from scipy import stats

# Calculate means
control_mean = control['revenue'].mean()
treatment_mean = treatment['revenue'].mean()

# Run Welch's t-test
_, p_value = stats.ttest_ind(
    treatment['revenue'],
    control['revenue'],
    equal_var=False
)

# Calculate lift
lift = (treatment_mean - control_mean) / control_mean

Step 4: Apply Multiple Testing Correction

num_tests = 3  # e.g., conversion, revenue, duration
adjusted_alpha = 0.05 / num_tests  # 0.0167

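# Determine significance (repeat for each metric's p-value;
# Steps 2 and 3 both assign to p_value, so store them separately)
is_significant = p_value < adjusted_alpha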

Power Analysis

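Power analysis estimates how many samples per group are needed to detect an effect of a given size. Use it when a result is not statistically significant and you want to know whether more data could help. The functions below plug in the observed effect size, which is itself a noisy estimate, so treat the output as a rough guide.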

For Continuous Metrics (t-test)

from statsmodels.stats.power import TTestIndPower
import numpy as np

def additional_samples_needed(control_data, treatment_data, alpha, power=0.8):
    """Estimate additional samples needed per group to reach the target power."""
    control_mean = control_data.mean()
    treatment_mean = treatment_data.mean()
    pooled_std = np.sqrt((control_data.var() + treatment_data.var()) / 2)

    if pooled_std == 0 or control_mean == treatment_mean:
        return 0

    # Cohen's d effect size
    effect_size = abs(treatment_mean - control_mean) / pooled_std

    power_analysis = TTestIndPower()
    required_n = power_analysis.solve_power(
        effect_size=effect_size,
        alpha=alpha,
        power=power,
        alternative='two-sided'
    )

    current_n = (len(control_data) + len(treatment_data)) / 2
    return max(0, int(np.ceil(required_n - current_n)))

For Binary Metrics (proportions)

from statsmodels.stats.power import zt_ind_solve_power
import numpy as np

def additional_samples_proportion(control_prop, treatment_prop, n_control, n_treatment, alpha, power=0.8):
    """Estimate additional samples needed per group for the proportion test."""
    if control_prop == treatment_prop:
        return 0

    # Cohen's h effect size for proportions
    effect_size = 2 * (np.arcsin(np.sqrt(treatment_prop)) - np.arcsin(np.sqrt(control_prop)))

    required_n = zt_ind_solve_power(
        effect_size=abs(effect_size),
        alpha=alpha,
        power=power,
        alternative='two-sided'
    )

    current_n = (n_control + n_treatment) / 2
    return max(0, int(np.ceil(required_n - current_n)))

Key Concepts

  • Power (1 - β): Probability of detecting a true effect. Typically 0.8 (80%)
  • Effect size: Standardized measure of the difference between groups
    • Cohen's d for continuous: (mean1 - mean2) / pooled_std
    • Cohen's h for proportions: 2 * (arcsin(√p1) - arcsin(√p2))
  • Alpha (α): Significance level (e.g., 0.05 or Bonferroni-adjusted)
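
The three concepts above are tied together by a closed-form approximation. The sketch below uses the normal approximation for a two-sided, two-sample test with equal group sizes: n per group ≈ 2 * ((z for 1 - α/2) + (z for power))² / effect_size². It uses only the standard library, the function names are illustrative, and the result should be close to, but not necessarily identical to, what the statsmodels solvers return.

```python
import math
from statistics import NormalDist

def cohens_h(p1, p2):
    """Cohen's h effect size for two proportions."""
    return 2 * (math.asin(math.sqrt(p1)) - math.asin(math.sqrt(p2)))

def approx_n_per_group(effect_size, alpha=0.05, power=0.8):
    """Normal-approximation sample size per group, two-sided test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) / abs(effect_size)) ** 2)

# Illustrative rates: 12% treatment vs 10% control
h = cohens_h(0.12, 0.10)
n_props = approx_n_per_group(h)

# Illustrative standardized difference (Cohen's d) of 0.5
n_medium = approx_n_per_group(0.5)
print(n_props, n_medium)
```

Halving the effect size roughly quadruples the required sample, which is why small lifts need large experiments.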

When to Use

  • Result is not significant but effect looks promising
  • Planning sample size for future experiments
  • Deciding whether to continue collecting data

Common Pitfalls

  1. Using chi-squared for proportions: While valid, proportions_ztest is more direct
  2. Forgetting equal_var=False: Student's t-test assumes equal variances
  3. Not correcting for multiple tests: Inflates false positive rate
  4. Division by zero in lift: Handle cases where control mean is 0
  5. Confusing one-tailed vs two-tailed: Use two-tailed unless a directional hypothesis was fixed before looking at the data
  6. Ignoring power analysis: A non-significant result doesn't mean no effect exists
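
Pitfall 5 is easy to see numerically. The sketch below converts one z-statistic into both a two-sided and a one-sided (upper-tail) p-value using the standard library. The z value is invented for illustration and the function name is hypothetical.

```python
from statistics import NormalDist

def p_values_from_z(z):
    """Return (two_sided, one_sided_upper) p-values for a z-statistic."""
    upper_tail = 1 - NormalDist().cdf(z)
    two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    return two_sided, upper_tail

two_sided, one_sided = p_values_from_z(1.8)
print(round(two_sided, 4), round(one_sided, 4))
```

For z = 1.8 the one-sided p-value falls below 0.05 while the two-sided one does not, so switching to a one-tailed test after seeing the data can manufacture significance.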

Dependencies

pip install scipy statsmodels pandas numpy

Key imports:

  • scipy.stats.ttest_ind - Welch's t-test
  • statsmodels.stats.proportion.proportions_ztest - Two-proportion z-test



Metadata

License: unknown
Version: -
Updated: 1/22/2026
Publisher: KaiserWhoLearns

Tags

github-actions, observability, testing