askill
bio-population-genetics-scikit-allel-analysis

bio-population-genetics-scikit-allel-analysisSafety 95Repository

Python population genetics with scikit-allel. Read VCF files, compute allele frequencies, calculate diversity statistics, perform PCA, and run selection scans using GenotypeArray and HaplotypeArray data structures. Use when analyzing population genetics in Python.

251 stars
5k downloads
Updated 2/14/2026

Package Files

Loading files...
SKILL.md

Version Compatibility

Reference examples tested with: bcftools 1.19+, matplotlib 3.8+, numpy 1.26+

Before using code patterns, verify installed versions match. If versions differ:

  • Python: pip show <package> then help(module.function) to check signatures
  • CLI: <tool> --version then <tool> --help to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.

scikit-allel Analysis

"Analyze population genetics in Python" → Read VCF files into efficient array structures, compute allele frequencies, diversity statistics, PCA, and selection scans using scikit-allel.

  • Python: allel.read_vcf(), allel.GenotypeArray(), allel.mean_pairwise_difference()

Python library for population genetics analysis with efficient array data structures.

Installation

pip install scikit-allel
# Optional: zarr for chunked storage
pip install zarr

Reading VCF Files

Load VCF

import allel

callset = allel.read_vcf('data.vcf.gz')

print(callset.keys())
# dict_keys(['samples', 'calldata/GT', 'variants/CHROM', 'variants/POS', 'variants/REF', 'variants/ALT', ...])

samples = callset['samples']
genotypes = callset['calldata/GT']
positions = callset['variants/POS']
chroms = callset['variants/CHROM']

Specify Fields

callset = allel.read_vcf('data.vcf.gz',
    fields=['samples', 'calldata/GT', 'variants/POS', 'variants/CHROM', 'variants/QUAL'])

callset = allel.read_vcf('data.vcf.gz', fields='*')  # All fields

callset = allel.read_vcf('data.vcf.gz',
    region='chr1:1000000-2000000',
    samples=['sample1', 'sample2'])

Large Files (Chunked)

import zarr

allel.vcf_to_zarr('large.vcf.gz', 'data.zarr', fields='*', overwrite=True)
callset = zarr.open('data.zarr', mode='r')
gt = allel.GenotypeArray(callset['calldata/GT'])

Genotype Arrays

GenotypeArray

gt = allel.GenotypeArray(callset['calldata/GT'])

print(gt.shape)  # (n_variants, n_samples, ploidy)
print(gt.n_variants)
print(gt.n_samples)

print(gt[0])  # Genotypes at first variant
print(gt[:, 0])  # All variants for first sample

Basic Operations

ac = gt.count_alleles()
print(ac.shape)  # (n_variants, n_alleles)

af = ac.to_frequencies()
is_segregating = ac.is_segregating()
gt_filtered = gt.compress(is_segregating, axis=0)

Missing Data

is_called = gt.is_called()
is_missing = gt.is_missing()

miss_per_variant = (~is_called).sum(axis=1)
miss_per_sample = (~is_called).sum(axis=0)

call_rate_variant = is_called.mean(axis=1)
call_rate_sample = is_called.mean(axis=0)

Allele Counts and Frequencies

ac = gt.count_alleles()
ac_ref = ac[:, 0]
ac_alt = ac[:, 1]

af = ac.to_frequencies()
maf = af.min(axis=1)

n_singletons = (ac[:, 1] == 1).sum()
n_doubletons = (ac[:, 1] == 2).sum()

By Population

subpops = {
    'pop1': [0, 1, 2, 3, 4],
    'pop2': [5, 6, 7, 8, 9]
}

ac_subpops = gt.count_alleles_subpops(subpops)

ac_pop1 = ac_subpops['pop1']
ac_pop2 = ac_subpops['pop2']

Haplotype Arrays

h = gt.to_haplotypes()
print(h.shape)  # (n_variants, n_haplotypes)
print(h.n_haplotypes)

ac_hap = h.count_alleles()

PCA

import allel
import numpy as np

gn = gt.to_n_alt(fill=-1)
gn_filtered = gn[is_segregating]
gn_imputed = np.where(gn_filtered < 0, 0, gn_filtered)

coords, model = allel.pca(gn_imputed, n_components=10, scaler='patterson')
print(coords.shape)  # (n_samples, n_components)

Plot PCA

import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
plt.scatter(coords[:, 0], coords[:, 1], c=population_labels)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.savefig('pca.png')

Diversity Statistics

Heterozygosity

ho = allel.heterozygosity_observed(gt)
he = allel.heterozygosity_expected(ac, ploidy=2)

mean_ho = np.mean(ho)
mean_he = np.mean(he)

Nucleotide Diversity (Pi)

pi = allel.sequence_diversity(positions, ac)
print(f'Pi = {pi:.6f}')

windows = allel.moving_statistic(positions, statistic=lambda x: allel.sequence_diversity(x, ac), size=10000, step=5000)

Watterson's Theta

theta_w = allel.watterson_theta(positions, ac)
print(f'Theta_W = {theta_w:.6f}')

Site Frequency Spectrum

sfs = allel.sfs(ac[:, 1])

plt.figure(figsize=(10, 5))
allel.plot_sfs(sfs)
plt.savefig('sfs.png')

Folded SFS

sfs_folded = allel.sfs_folded(ac)

plt.figure(figsize=(10, 5))
allel.plot_sfs_folded(sfs_folded)
plt.savefig('sfs_folded.png')

Windowed Statistics

pos = np.array(positions)
windows = np.arange(0, pos.max(), 100000)

pi_windowed, windows_used, n_bases, counts = allel.windowed_diversity(pos, ac, size=100000, step=50000)

plt.figure(figsize=(14, 4))
plt.plot(windows_used[:, 0], pi_windowed)
plt.xlabel('Position')
plt.ylabel('Pi')
plt.savefig('pi_windows.png')

Sample Subsetting

pop1_idx = np.array([0, 1, 2, 3, 4])
pop2_idx = np.array([5, 6, 7, 8, 9])

gt_pop1 = gt.take(pop1_idx, axis=1)
gt_pop2 = gt.take(pop2_idx, axis=1)

ac_pop1 = gt_pop1.count_alleles()
ac_pop2 = gt_pop2.count_alleles()

Filter Variants

is_snp = callset['variants/is_snp']
is_biallelic = ac.max_allele() == 1
is_segregating = ac.is_segregating()
qual = callset['variants/QUAL']
is_high_qual = qual > 30

flt = is_snp & is_biallelic & is_segregating & is_high_qual

gt_filtered = gt.compress(flt, axis=0)
pos_filtered = positions[flt]

Complete Workflow Example

Goal: Load VCF data, filter to segregating biallelic variants, compute summary diversity statistics, and run PCA in a single Python workflow.

Approach: Read VCF into GenotypeArray, apply segregating and biallelic filters, calculate nucleotide diversity and heterozygosity from allele counts, then perform Patterson PCA on the alt-allele count matrix.

import allel
import numpy as np

callset = allel.read_vcf('data.vcf.gz', fields=['samples', 'calldata/GT', 'variants/POS'])
gt = allel.GenotypeArray(callset['calldata/GT'])
pos = callset['variants/POS']
samples = callset['samples']

ac = gt.count_alleles()
flt = ac.is_segregating() & (ac.max_allele() == 1)
gt = gt.compress(flt, axis=0)
pos = pos[flt]
ac = gt.count_alleles()

print(f'Variants after filtering: {gt.n_variants}')
print(f'Samples: {gt.n_samples}')
print(f'Nucleotide diversity: {allel.sequence_diversity(pos, ac):.6f}')
print(f'Mean Het observed: {allel.heterozygosity_observed(gt).mean():.4f}')

gn = gt.to_n_alt(fill=-1)
gn = np.where(gn < 0, 0, gn)
coords, model = allel.pca(gn, n_components=10, scaler='patterson')

Related Skills

  • selection-statistics - Fst, Tajima's D, iHS with scikit-allel
  • linkage-disequilibrium - LD calculations in Python
  • variant-calling/vcf-basics - VCF format and bcftools

Install

Download ZIP
Requires askill CLI v1.0+

AI Quality Score

95/100Analyzed 2/15/2026

Metadata

Licenseunknown
Version-
Updated2/14/2026
PublisherGPTomics

Tags

apigithub-actions