askill
bio-population-genetics-population-structure

bio-population-genetics-population-structureSafety 96Repository

Analyze population structure using PCA and admixture analysis with PLINK and ADMIXTURE. Identify population clusters, assess ancestry proportions, visualize genetic structure, and choose optimal K for admixture models. Use when analyzing population stratification with PCA or admixture.

251 stars
5k downloads
Updated 2/14/2026

Package Files

Loading files...
SKILL.md

Version Compatibility

Reference examples tested with: matplotlib 3.8+, numpy 1.26+, pandas 2.2+

Before using code patterns, verify installed versions match. If versions differ:

  • Python: pip show <package> then help(module.function) to check signatures
  • CLI: <tool> --version then <tool> --help to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.

Population Structure

"Analyze population structure in my genotype data" → Detect population stratification using PCA of genotypes and estimate ancestry proportions with ADMIXTURE modeling.

  • CLI: plink2 --pca 20 for principal component analysis
  • CLI: admixture genotypes.bed K for admixture proportions

Analyze genetic ancestry and population stratification using PCA and ADMIXTURE.

Principal Component Analysis (PCA)

PLINK 2.0 PCA

# Basic PCA (10 PCs)
plink2 --bfile data --pca 10 --out pca_results

# More PCs
plink2 --bfile data --pca 20 --out pca_results

# Approximate PCA (faster for large datasets)
plink2 --bfile data --pca 10 approx --out pca_results

# Output variant loadings
plink2 --bfile data --pca 10 var-wts --out pca_results

Output Files

FileContents
.eigenvecPC scores per sample (FID, IID, PC1, PC2, ...)
.eigenvalEigenvalues (variance explained)
.eigenvec.varVariant loadings (if var-wts)

Variance Explained

import numpy as np

eigenvalues = np.loadtxt('pca_results.eigenval')
variance_explained = eigenvalues / eigenvalues.sum() * 100
cumulative = np.cumsum(variance_explained)

for i, (ve, cum) in enumerate(zip(variance_explained, cumulative), 1):
    print(f'PC{i}: {ve:.2f}% (cumulative: {cum:.2f}%)')

PCA Visualization

import pandas as pd
import matplotlib.pyplot as plt

eigenvec = pd.read_csv('pca_results.eigenvec', sep='\s+', header=None)
eigenvec.columns = ['FID', 'IID'] + [f'PC{i}' for i in range(1, len(eigenvec.columns) - 1)]

pop_info = pd.read_csv('population_labels.txt', sep='\t')  # FID, IID, Population
eigenvec = eigenvec.merge(pop_info, on=['FID', 'IID'])

plt.figure(figsize=(10, 8))
for pop in eigenvec['Population'].unique():
    subset = eigenvec[eigenvec['Population'] == pop]
    plt.scatter(subset['PC1'], subset['PC2'], label=pop, s=20, alpha=0.7)

plt.xlabel('PC1')
plt.ylabel('PC2')
plt.legend()
plt.savefig('pca_plot.png', dpi=150)

LD Pruning (Before Admixture)

ADMIXTURE requires LD-pruned SNPs:

# Calculate LD and identify pruned set
plink2 --bfile data --indep-pairwise 50 10 0.1 --out prune

# Extract pruned variants
plink2 --bfile data --extract prune.prune.in --make-bed --out data_pruned

Pruning Parameters

ParameterDescription
Window (50)SNPs in each window
Step (10)SNPs to shift per step
r² threshold (0.1)Max LD allowed

ADMIXTURE Analysis

Basic Usage

# Run ADMIXTURE for K=3 clusters
admixture data_pruned.bed 3

# With cross-validation
admixture --cv data_pruned.bed 3

# Multithreaded
admixture -j4 data_pruned.bed 3

Output Files

FileContents
.QAncestry proportions (samples × K)
.PAllele frequencies per cluster

Testing Multiple K Values

# Run for K=2 through K=10
for K in $(seq 2 10); do
    admixture --cv -j4 data_pruned.bed $K 2>&1 | tee log${K}.out
done

# Extract CV errors
grep -h "CV" log*.out | awk '{print NR+1, $4}' > cv_errors.txt

Choose Optimal K

import matplotlib.pyplot as plt

cv_errors = []
with open('cv_errors.txt') as f:
    for line in f:
        k, cv = line.strip().split()
        cv_errors.append((int(k), float(cv)))

ks, cvs = zip(*cv_errors)
plt.figure(figsize=(8, 5))
plt.plot(ks, cvs, 'o-')
plt.xlabel('K')
plt.ylabel('Cross-validation error')
plt.title('Admixture CV Error')
plt.savefig('admixture_cv.png', dpi=150)

optimal_k = ks[cvs.index(min(cvs))]
print(f'Optimal K: {optimal_k}')

Visualize Admixture

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

K = 3
Q = pd.read_csv(f'data_pruned.{K}.Q', sep='\s+', header=None)
fam = pd.read_csv('data_pruned.fam', sep='\s+', header=None)
Q.columns = [f'Cluster{i}' for i in range(1, K + 1)]
Q['IID'] = fam[1].values

pop_info = pd.read_csv('population_labels.txt', sep='\t')
Q = Q.merge(pop_info, on='IID')
Q = Q.sort_values('Population')

colors = plt.cm.Set1(range(K))
fig, ax = plt.subplots(figsize=(14, 4))

bottom = np.zeros(len(Q))
for i in range(K):
    ax.bar(range(len(Q)), Q[f'Cluster{i+1}'], bottom=bottom, color=colors[i], width=1)
    bottom += Q[f'Cluster{i+1}'].values

ax.set_xlim(0, len(Q))
ax.set_ylim(0, 1)
ax.set_ylabel('Ancestry proportion')
plt.savefig('admixture_barplot.png', dpi=150, bbox_inches='tight')

FlashPCA2 (Fast PCA for Large Datasets)

FlashPCA2 is optimized for very large datasets (100,000+ samples). Uses randomized algorithms for speed.

Installation

# From conda
conda install -c bioconda flashpca

# Or download binaries from GitHub
# https://github.com/gabraham/flashpca

Basic Usage

# Standard PCA
flashpca2 --bfile data --ndim 10 --outpc pcs.txt --outvec loadings.txt --outval eigenvalues.txt

# --ndim 10: Number of PCs to compute
# --outpc: Principal components output
# --outvec: Eigenvectors (variant loadings)
# --outval: Eigenvalues

FlashPCA2 Options

OptionDescription
--bfilePLINK binary prefix
--ndimNumber of PCs (default 10)
--outpcPC scores output file
--outvecEigenvectors output
--outvalEigenvalues output
--numthreadsCPU threads to use
--memMemory limit (GB)
--seedRandom seed for reproducibility

Large Dataset Settings

# For biobank-scale data (>100k samples)
# numthreads=16: Adjust to available cores.
# mem=64: Memory in GB. Increase for larger datasets.
flashpca2 \
    --bfile large_data \
    --ndim 20 \
    --numthreads 16 \
    --mem 64 \
    --outpc pcs.txt \
    --outval eigenvalues.txt \
    --seed 42

FlashPCA2 vs PLINK2

FeatureFlashPCA2PLINK2
Speed (100k samples)FasterGood
Memory efficiencyBetterGood
Randomized algorithmYesOptional (approx)
Part of standard toolkitNoYes

Use FlashPCA2 for biobank-scale data; PLINK2 sufficient for most studies.

Parse FlashPCA2 Output

import pandas as pd

# Load PCs
pcs = pd.read_csv('pcs.txt', sep='\t', header=None)
pcs.columns = ['FID', 'IID'] + [f'PC{i}' for i in range(1, len(pcs.columns) - 1)]

# Load eigenvalues
eigenvals = pd.read_csv('eigenvalues.txt', header=None)[0].values
var_explained = eigenvals / eigenvals.sum() * 100

print('Variance explained:')
for i, ve in enumerate(var_explained[:10], 1):
    print(f'  PC{i}: {ve:.2f}%')

MDS (Alternative to PCA)

# PLINK 1.9 MDS
plink --bfile data --cluster --mds-plot 10 --out mds_results

# Output: mds_results.mds (sample coordinates)

Kinship/Relatedness

PLINK 2.0 KING-robust

# Calculate kinship matrix
plink2 --bfile data --make-king-table --out kinship

# Output: kinship.kin0 (pairs with kinship > 0.0442)

Identify Related Individuals

import pandas as pd

kin = pd.read_csv('kinship.kin0', sep='\t')
related = kin[kin['KINSHIP'] > 0.0884]  # First-degree relatives
print(f'Related pairs (1st degree): {len(related)}')

related = kin[kin['KINSHIP'] > 0.0442]  # Second-degree
print(f'Related pairs (2nd degree): {len(related)}')

Remove Related Individuals

# Create list to remove (keep one per pair)
plink2 --bfile data --king-cutoff 0.0884 --out unrelated

# Filter to unrelated
plink2 --bfile data --keep unrelated.king.cutoff.in.id --make-bed --out unrelated

Complete Workflow

Goal: Analyze population structure from raw genotypes through PCA and admixture modeling with optimal K selection.

Approach: Apply QC filters, LD-prune for independent SNPs, run PCA for visual stratification assessment, then fit ADMIXTURE models across multiple K values and select the best fit by cross-validation error.

# 1. QC filtering
plink2 --bfile raw --maf 0.01 --geno 0.05 --hwe 1e-6 --make-bed --out qc

# 2. LD pruning
plink2 --bfile qc --indep-pairwise 50 10 0.1 --out prune
plink2 --bfile qc --extract prune.prune.in --make-bed --out pruned

# 3. PCA
plink2 --bfile pruned --pca 20 --out pca

# 4. Admixture (multiple K)
for K in 2 3 4 5 6; do
    admixture --cv -j4 pruned.bed $K 2>&1 | tee log${K}.out
done

Related Skills

  • plink-basics - Data preparation and QC
  • linkage-disequilibrium - LD pruning details
  • association-testing - Use PCs as covariates
  • ecological-genomics/landscape-genomics - Population structure correction for GEA

Install

Download ZIP
Requires askill CLI v1.0+

AI Quality Score

96/100Analyzed 15 hours ago

Metadata

Licenseunknown
Version-
Updated2/14/2026
PublisherGPTomics

Tags

apigithubgithub-actions