Version Compatibility

Reference examples tested with: bcftools 1.19+, matplotlib 3.8+, numpy 1.26+

Before using code patterns, verify installed versions match. If versions differ:

Python: pip show <package> then help(module.function) to check signatures
CLI: <tool> --version then <tool> --help to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.

scikit-allel Analysis

"Analyze population genetics in Python" → Read VCF files into efficient array structures, compute allele frequencies, diversity statistics, PCA, and selection scans using scikit-allel.

Python: allel.read_vcf(), allel.GenotypeArray(), allel.mean_pairwise_difference()

Python library for population genetics analysis with efficient array data structures.

Installation

pip install scikit-allel
# Optional: zarr for chunked storage
pip install zarr

Reading VCF Files

Load VCF

import allel

callset = allel.read_vcf('data.vcf.gz')

print(callset.keys())
# dict_keys(['samples', 'calldata/GT', 'variants/CHROM', 'variants/POS', 'variants/REF', 'variants/ALT', ...])

samples = callset['samples']
genotypes = callset['calldata/GT']
positions = callset['variants/POS']
chroms = callset['variants/CHROM']

Specify Fields

callset = allel.read_vcf('data.vcf.gz',
    fields=['samples', 'calldata/GT', 'variants/POS', 'variants/CHROM', 'variants/QUAL'])

callset = allel.read_vcf('data.vcf.gz', fields='*')  # All fields

callset = allel.read_vcf('data.vcf.gz',
    region='chr1:1000000-2000000',
    samples=['sample1', 'sample2'])

Large Files (Chunked)

import zarr

allel.vcf_to_zarr('large.vcf.gz', 'data.zarr', fields='*', overwrite=True)
callset = zarr.open('data.zarr', mode='r')
gt = allel.GenotypeArray(callset['calldata/GT'])

Genotype Arrays

GenotypeArray

gt = allel.GenotypeArray(callset['calldata/GT'])

print(gt.shape)  # (n_variants, n_samples, ploidy)
print(gt.n_variants)
print(gt.n_samples)

print(gt[0])  # Genotypes at first variant
print(gt[:, 0])  # All variants for first sample

Basic Operations

ac = gt.count_alleles()
print(ac.shape)  # (n_variants, n_alleles)

af = ac.to_frequencies()
is_segregating = ac.is_segregating()
gt_filtered = gt.compress(is_segregating, axis=0)

Missing Data

is_called = gt.is_called()
is_missing = gt.is_missing()

miss_per_variant = (~is_called).sum(axis=1)
miss_per_sample = (~is_called).sum(axis=0)

call_rate_variant = is_called.mean(axis=1)
call_rate_sample = is_called.mean(axis=0)

Allele Counts and Frequencies

ac = gt.count_alleles()
ac_ref = ac[:, 0]
ac_alt = ac[:, 1]

af = ac.to_frequencies()
maf = af.min(axis=1)

n_singletons = (ac[:, 1] == 1).sum()
n_doubletons = (ac[:, 1] == 2).sum()

By Population

subpops = {
    'pop1': [0, 1, 2, 3, 4],
    'pop2': [5, 6, 7, 8, 9]
}

ac_subpops = gt.count_alleles_subpops(subpops)

ac_pop1 = ac_subpops['pop1']
ac_pop2 = ac_subpops['pop2']

Haplotype Arrays

h = gt.to_haplotypes()
print(h.shape)  # (n_variants, n_haplotypes)
print(h.n_haplotypes)

ac_hap = h.count_alleles()

PCA

import allel
import numpy as np

gn = gt.to_n_alt(fill=-1)
gn_filtered = gn[is_segregating]
gn_imputed = np.where(gn_filtered < 0, 0, gn_filtered)

coords, model = allel.pca(gn_imputed, n_components=10, scaler='patterson')
print(coords.shape)  # (n_samples, n_components)

Plot PCA

import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
plt.scatter(coords[:, 0], coords[:, 1], c=population_labels)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.savefig('pca.png')

Diversity Statistics

Heterozygosity

ho = allel.heterozygosity_observed(gt)
he = allel.heterozygosity_expected(ac, ploidy=2)

mean_ho = np.mean(ho)
mean_he = np.mean(he)

Nucleotide Diversity (Pi)

pi = allel.sequence_diversity(positions, ac)
print(f'Pi = {pi:.6f}')

windows = allel.moving_statistic(positions, statistic=lambda x: allel.sequence_diversity(x, ac), size=10000, step=5000)

Watterson's Theta

theta_w = allel.watterson_theta(positions, ac)
print(f'Theta_W = {theta_w:.6f}')

Site Frequency Spectrum

sfs = allel.sfs(ac[:, 1])

plt.figure(figsize=(10, 5))
allel.plot_sfs(sfs)
plt.savefig('sfs.png')

Folded SFS

sfs_folded = allel.sfs_folded(ac)

plt.figure(figsize=(10, 5))
allel.plot_sfs_folded(sfs_folded)
plt.savefig('sfs_folded.png')

Windowed Statistics

pos = np.array(positions)
windows = np.arange(0, pos.max(), 100000)

pi_windowed, windows_used, n_bases, counts = allel.windowed_diversity(pos, ac, size=100000, step=50000)

plt.figure(figsize=(14, 4))
plt.plot(windows_used[:, 0], pi_windowed)
plt.xlabel('Position')
plt.ylabel('Pi')
plt.savefig('pi_windows.png')

Sample Subsetting

pop1_idx = np.array([0, 1, 2, 3, 4])
pop2_idx = np.array([5, 6, 7, 8, 9])

gt_pop1 = gt.take(pop1_idx, axis=1)
gt_pop2 = gt.take(pop2_idx, axis=1)

ac_pop1 = gt_pop1.count_alleles()
ac_pop2 = gt_pop2.count_alleles()

Filter Variants

is_snp = callset['variants/is_snp']
is_biallelic = ac.max_allele() == 1
is_segregating = ac.is_segregating()
qual = callset['variants/QUAL']
is_high_qual = qual > 30

flt = is_snp & is_biallelic & is_segregating & is_high_qual

gt_filtered = gt.compress(flt, axis=0)
pos_filtered = positions[flt]

Complete Workflow Example

Goal: Load VCF data, filter to segregating biallelic variants, compute summary diversity statistics, and run PCA in a single Python workflow.

Approach: Read VCF into GenotypeArray, apply segregating and biallelic filters, calculate nucleotide diversity and heterozygosity from allele counts, then perform Patterson PCA on the alt-allele count matrix.

import allel
import numpy as np

callset = allel.read_vcf('data.vcf.gz', fields=['samples', 'calldata/GT', 'variants/POS'])
gt = allel.GenotypeArray(callset['calldata/GT'])
pos = callset['variants/POS']
samples = callset['samples']

ac = gt.count_alleles()
flt = ac.is_segregating() & (ac.max_allele() == 1)
gt = gt.compress(flt, axis=0)
pos = pos[flt]
ac = gt.count_alleles()

print(f'Variants after filtering: {gt.n_variants}')
print(f'Samples: {gt.n_samples}')
print(f'Nucleotide diversity: {allel.sequence_diversity(pos, ac):.6f}')
print(f'Mean Het observed: {allel.heterozygosity_observed(gt).mean():.4f}')

gn = gt.to_n_alt(fill=-1)
gn = np.where(gn < 0, 0, gn)
coords, model = allel.pca(gn, n_components=10, scaler='patterson')

Related Skills

selection-statistics - Fst, Tajima's D, iHS with scikit-allel
linkage-disequilibrium - LD calculations in Python
variant-calling/vcf-basics - VCF format and bcftools

bio-population-genetics-scikit-allel-analysisSafety 95Repository

Package Files

Version Compatibility

scikit-allel Analysis

Installation

Reading VCF Files

Load VCF

Specify Fields

Large Files (Chunked)

Genotype Arrays

GenotypeArray

Basic Operations

Missing Data

Allele Counts and Frequencies

By Population

Haplotype Arrays

PCA

Plot PCA

Diversity Statistics

Heterozygosity

Nucleotide Diversity (Pi)

Watterson's Theta

Site Frequency Spectrum

Folded SFS

Windowed Statistics

Sample Subsetting

Filter Variants

Complete Workflow Example

Related Skills

Install

AI Quality Score

Metadata

Tags

bio-population-genetics-scikit-allel-analysisSafety 95Repository ShareFavorite skill

Package Files

Version Compatibility

scikit-allel Analysis

Installation

Reading VCF Files

Load VCF

Specify Fields

Large Files (Chunked)

Genotype Arrays

GenotypeArray

Basic Operations

Missing Data

Allele Counts and Frequencies

By Population

Haplotype Arrays

PCA

Plot PCA

Diversity Statistics

Heterozygosity

Nucleotide Diversity (Pi)

Watterson's Theta

Site Frequency Spectrum

Folded SFS

Windowed Statistics

Sample Subsetting

Filter Variants

Complete Workflow Example

Related Skills

Install

AI Quality Score

Metadata

Tags

bio-population-genetics-scikit-allel-analysisSafety 95Repository