Version Compatibility

Reference examples tested with: bcftools 1.19+, matplotlib 3.8+, numpy 1.26+, pandas 2.2+

Before using code patterns, verify installed versions match. If versions differ:

Python: pip show <package> then help(module.function) to check signatures
CLI: <tool> --version then <tool> --help to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.

Imputation QC

"Filter my imputed genotypes by quality" → Assess imputation accuracy using INFO/R2 scores, filter low-quality imputed variants, and validate against known genotypes before downstream GWAS.

CLI: bcftools query -f '%CHROM %POS %INFO/R2\n' to extract quality scores
Python: pandas for R2 distribution analysis and threshold selection

Extract INFO Scores

# Beagle DR2 (dosage R-squared)
bcftools query -f '%CHROM\t%POS\t%ID\t%REF\t%ALT\t%INFO/DR2\t%INFO/AF\n' \
    imputed.vcf.gz > info_scores.txt

# Minimac R2
bcftools query -f '%CHROM\t%POS\t%ID\t%REF\t%ALT\t%INFO/R2\t%INFO/MAF\n' \
    imputed.vcf.gz > info_scores.txt

# IMPUTE info
bcftools query -f '%CHROM\t%POS\t%ID\t%INFO\n' imputed.vcf.gz > info_scores.txt

Filter by INFO Score

# Standard threshold for GWAS
bcftools view -i 'INFO/DR2 > 0.3' imputed.vcf.gz -Oz -o imputed_r2_03.vcf.gz

# Strict threshold for fine-mapping
bcftools view -i 'INFO/DR2 > 0.8' imputed.vcf.gz -Oz -o imputed_r2_08.vcf.gz

# Combined filtering
bcftools view -i 'INFO/DR2 > 0.3 && INFO/AF > 0.01 && INFO/AF < 0.99' \
    imputed.vcf.gz -Oz -o imputed_filtered.vcf.gz

INFO Score Distribution

import pandas as pd
import matplotlib.pyplot as plt

# Load INFO scores
info = pd.read_csv('info_scores.txt', sep='\t',
    names=['CHR', 'POS', 'ID', 'REF', 'ALT', 'R2', 'AF'])

# Distribution plot
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Overall distribution
axes[0].hist(info['R2'], bins=50, edgecolor='black')
axes[0].axvline(0.3, color='red', linestyle='--', label='Threshold 0.3')
axes[0].set_xlabel('INFO Score (R2)')
axes[0].set_ylabel('Count')
axes[0].set_title('INFO Score Distribution')
axes[0].legend()

# R2 by MAF
info['MAF'] = info['AF'].apply(lambda x: min(x, 1-x))
info['MAF_bin'] = pd.cut(info['MAF'], bins=[0, 0.01, 0.05, 0.1, 0.5])
info.boxplot(column='R2', by='MAF_bin', ax=axes[1])
axes[1].set_xlabel('MAF Bin')
axes[1].set_ylabel('INFO Score')
axes[1].set_title('INFO by MAF')

# Scatter
axes[2].scatter(info['MAF'], info['R2'], alpha=0.1, s=1)
axes[2].set_xlabel('Minor Allele Frequency')
axes[2].set_ylabel('INFO Score')
axes[2].set_title('INFO vs MAF')

plt.tight_layout()
plt.savefig('imputation_qc.png', dpi=150)

Summarize Imputation Quality

# Count variants by quality
bcftools query -f '%INFO/DR2\n' imputed.vcf.gz | \
    awk '{
        if ($1 >= 0.8) high++;
        else if ($1 >= 0.3) med++;
        else low++
    } END {
        print "High quality (R2>=0.8):", high
        print "Medium quality (0.3<=R2<0.8):", med
        print "Low quality (R2<0.3):", low
    }'

# Variants passing filter
echo "Total variants: $(bcftools view -H imputed.vcf.gz | wc -l)"
echo "Passing R2>0.3: $(bcftools view -i 'INFO/DR2>0.3' imputed.vcf.gz -H | wc -l)"

Check Concordance with Typed Variants

# Extract typed variants from imputed file
bcftools view -i 'INFO/TYPED=1' imputed.vcf.gz -Oz -o typed.vcf.gz

# Compare imputed vs original genotypes
bcftools gtcheck -g original.vcf.gz typed.vcf.gz > concordance.txt

# Parse concordance
grep "^CN" concordance.txt

Python: Comprehensive QC Report

Goal: Generate a comprehensive imputation quality summary with overall statistics, MAF-stratified accuracy, and per-chromosome breakdowns.

Approach: Load INFO scores into a dataframe, compute aggregate R2 statistics, bin variants by minor allele frequency for quality stratification, and produce per-chromosome summaries.

import pandas as pd
import numpy as np

def imputation_qc_report(info_file, output_prefix):
    '''Generate comprehensive imputation QC report.'''
    info = pd.read_csv(info_file, sep='\t',
        names=['CHR', 'POS', 'ID', 'REF', 'ALT', 'R2', 'AF'])

    # Calculate MAF
    info['MAF'] = info['AF'].apply(lambda x: min(x, 1-x))

    # Basic statistics
    stats = {
        'total_variants': len(info),
        'mean_r2': info['R2'].mean(),
        'median_r2': info['R2'].median(),
        'variants_r2_03': (info['R2'] >= 0.3).sum(),
        'variants_r2_08': (info['R2'] >= 0.8).sum(),
        'pct_r2_03': 100 * (info['R2'] >= 0.3).mean(),
        'pct_r2_08': 100 * (info['R2'] >= 0.8).mean(),
    }

    # By MAF bin
    maf_bins = [(0, 0.001), (0.001, 0.01), (0.01, 0.05), (0.05, 0.5)]
    for low, high in maf_bins:
        mask = (info['MAF'] >= low) & (info['MAF'] < high)
        stats[f'mean_r2_maf_{low}_{high}'] = info.loc[mask, 'R2'].mean()
        stats[f'n_variants_maf_{low}_{high}'] = mask.sum()

    # By chromosome
    chr_stats = info.groupby('CHR').agg({
        'R2': ['mean', 'count'],
        'MAF': 'mean'
    }).round(3)

    # Write reports
    with open(f'{output_prefix}_summary.txt', 'w') as f:
        for k, v in stats.items():
            f.write(f'{k}: {v}\n')

    chr_stats.to_csv(f'{output_prefix}_by_chrom.txt', sep='\t')

    return stats, chr_stats

stats, chr_stats = imputation_qc_report('info_scores.txt', 'imputation_qc')

Compare Multiple Imputation Runs

def compare_imputations(vcf1, vcf2, output):
    '''Compare INFO scores between two imputation runs.'''
    import subprocess

    # Extract INFO from both
    cmd1 = f"bcftools query -f '%CHROM:%POS\t%INFO/DR2\n' {vcf1}"
    cmd2 = f"bcftools query -f '%CHROM:%POS\t%INFO/DR2\n' {vcf2}"

    info1 = pd.read_csv(subprocess.Popen(cmd1, shell=True, stdout=subprocess.PIPE).stdout,
        sep='\t', names=['ID', 'R2_1'])
    info2 = pd.read_csv(subprocess.Popen(cmd2, shell=True, stdout=subprocess.PIPE).stdout,
        sep='\t', names=['ID', 'R2_2'])

    merged = info1.merge(info2, on='ID')
    merged['R2_diff'] = merged['R2_1'] - merged['R2_2']

    # Correlation
    corr = merged['R2_1'].corr(merged['R2_2'])
    print(f'Correlation between R2 scores: {corr:.4f}')

    return merged

Hardy-Weinberg Filter

# Calculate HWE p-values (PLINK2)
plink2 --vcf imputed.vcf.gz \
    --hardy \
    --out hwe_check

# Filter extreme HWE deviations
plink2 --vcf imputed.vcf.gz \
    --hwe 1e-6 \
    --make-pgen \
    --out imputed_hwe_filtered

Final QC Pipeline

#!/bin/bash
# Complete post-imputation QC

INPUT=$1
OUTPUT=$2

# 1. Filter by INFO score
bcftools view -i 'INFO/DR2 > 0.3' $INPUT -Oz -o ${OUTPUT}_r2.vcf.gz

# 2. Filter by MAF
bcftools view -i 'INFO/AF > 0.01 && INFO/AF < 0.99' \
    ${OUTPUT}_r2.vcf.gz -Oz -o ${OUTPUT}_maf.vcf.gz

# 3. Remove duplicates
bcftools norm -d all ${OUTPUT}_maf.vcf.gz -Oz -o ${OUTPUT}_nodup.vcf.gz

# 4. Index
bcftools index ${OUTPUT}_nodup.vcf.gz

# 5. Final stats
echo "Input variants: $(bcftools view -H $INPUT | wc -l)"
echo "After R2 filter: $(bcftools view -H ${OUTPUT}_r2.vcf.gz | wc -l)"
echo "After MAF filter: $(bcftools view -H ${OUTPUT}_maf.vcf.gz | wc -l)"
echo "Final variants: $(bcftools view -H ${OUTPUT}_nodup.vcf.gz | wc -l)"

Quality Thresholds by Application

Application	R2 Threshold	MAF Threshold	Notes
GWAS discovery	0.3	0.01	Standard
GWAS replication	0.5	0.01	More stringent
Fine-mapping	0.8	0.001	High accuracy needed
Polygenic scores	0.9	0.01	Very high accuracy
Meta-analysis	0.5	Study-specific	Match across studies

Related Skills

phasing-imputation/genotype-imputation - Generate imputed data
variant-calling/filtering-best-practices - VCF filtering operations
population-genetics/association-testing - GWAS with imputed data

bio-phasing-imputation-imputation-qcSafety 96Repository

Package Files

Version Compatibility

Imputation QC

Extract INFO Scores

Filter by INFO Score

INFO Score Distribution

Summarize Imputation Quality

Check Concordance with Typed Variants

Python: Comprehensive QC Report

Compare Multiple Imputation Runs

Hardy-Weinberg Filter

Final QC Pipeline

Quality Thresholds by Application

Related Skills

Install

AI Quality Score

Metadata

Tags

bio-phasing-imputation-imputation-qcSafety 96Repository ShareFavorite skill

Package Files

Version Compatibility

Imputation QC

Extract INFO Scores

Filter by INFO Score

INFO Score Distribution

Summarize Imputation Quality

Check Concordance with Typed Variants

Python: Comprehensive QC Report

Compare Multiple Imputation Runs

Hardy-Weinberg Filter

Final QC Pipeline

Quality Thresholds by Application

Related Skills

Install

AI Quality Score

Metadata

Tags

bio-phasing-imputation-imputation-qcSafety 96Repository