bio-machine-learning-biomarker-discovery

Selects informative features for biomarker discovery using Boruta all-relevant selection, mRMR minimum redundancy, and LASSO regularization. Use when identifying biomarkers from high-dimensional omics data.

Updated 2/16/2026

Feature Selection for Biomarker Discovery

Boruta All-Relevant Selection

Identifies every feature whose importance is significantly higher than that of randomly permuted copies of the features (shadow features).

from boruta import BorutaPy
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np

rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)

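# max_iter=100: typically sufficient; increase to 200 if many features remain tentative
# perc=100 (default): threshold is the maximum shadow importance; a lower percentile
#   (e.g. 90) relaxes the threshold and admits more features, at the cost of more false positives
boruta = BorutaPy(rf, n_estimators='auto', max_iter=100, perc=100, random_state=42, verbose=0)
# Pass NumPy arrays for both X and y
boruta.fit(X.values, np.asarray(y))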

selected = X.columns[boruta.support_]
tentative = X.columns[boruta.support_weak_]
print(f'Selected: {len(selected)}, Tentative: {len(tentative)}')

feature_ranks = pd.DataFrame({
    'feature': X.columns,
    'rank': boruta.ranking_,
    'selected': boruta.support_
}).sort_values('rank')
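
The shadow-feature comparison can be illustrated with a minimal, standard-library-only sketch of a single round. It is not BorutaPy's implementation: absolute Pearson correlation stands in for random forest importance, the helper names (`pearson`, `shadow_hits`) and the toy data are hypothetical, and the real algorithm repeats the round many times and applies a statistical test to the hit counts.

```python
import random

def pearson(a, b):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

def shadow_hits(features, y, rng):
    """One Boruta-style round: names of features that beat the best shadow."""
    # Stand-in importance (assumption): absolute correlation with the labels
    importance = {name: abs(pearson(col, y)) for name, col in features.items()}
    shadow_scores = []
    for col in features.values():
        shadow = col[:]          # copy the real feature
        rng.shuffle(shadow)      # permuting breaks any link with y
        shadow_scores.append(abs(pearson(shadow, y)))
    threshold = max(shadow_scores)   # corresponds to perc=100
    return [name for name, score in importance.items() if score > threshold]

rng = random.Random(0)
y = [float(i % 2) for i in range(200)]
features = {
    'signal': [label + rng.gauss(0, 0.3) for label in y],   # tracks the label
    'noise_a': [rng.gauss(0, 1) for _ in y],                # unrelated to the label
    'noise_b': [rng.gauss(0, 1) for _ in y],
}
hits = shadow_hits(features, y, rng)
print(hits)
```

A noise feature can still beat the best shadow in a single round by chance, which is why Boruta accumulates hits over many iterations before confirming or rejecting a feature.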

mRMR (Minimum Redundancy Maximum Relevance)

Selects features that are individually relevant but minimally redundant with each other.

from mrmr import mrmr_classif

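# K: number of features to select; start with 50-100 for omics
# y is given X's index so the two align row by row inside mrmr
y_series = pd.Series(np.asarray(y), index=X.index)
selected_features = mrmr_classif(X=X, y=y_series, K=50)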
X_selected = X[selected_features]
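
The relevance-versus-redundancy trade-off can be sketched as a greedy loop. This is an illustration under stated assumptions, not the `mrmr` package's code: it scores each candidate as relevance minus mean redundancy, uses absolute Pearson correlation for both terms, and runs on a numeric outcome for simplicity. The function `greedy_mrmr` and the gene names are hypothetical.

```python
def pearson(a, b):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

def greedy_mrmr(features, y, k):
    """Pick k features, each maximizing relevance minus mean redundancy."""
    relevance = {name: abs(pearson(col, y)) for name, col in features.items()}
    selected = []
    while len(selected) < min(k, len(features)):
        best_name, best_score = None, float('-inf')
        for name, col in features.items():
            if name in selected:
                continue
            if selected:
                # Mean absolute correlation with the features already chosen
                redundancy = sum(abs(pearson(col, features[s])) for s in selected) / len(selected)
            else:
                redundancy = 0.0
            score = relevance[name] - redundancy
            if score > best_score:
                best_name, best_score = name, score
        selected.append(best_name)
    return selected

outcome = [3, 3, 1, 1, -1, -1, -3, -3]
features = {
    'gene_a':     [1, 1, 1, 1, -1, -1, -1, -1],      # most relevant
    'gene_a_dup': [1, 1, 1, 0.8, -1, -1, -1, -0.8],  # near copy of gene_a
    'gene_b':     [1, 1, -1, -1, 1, 1, -1, -1],      # less relevant, not redundant
}
selected = greedy_mrmr(features, outcome, k=2)
by_relevance = sorted(features, key=lambda n: abs(pearson(features[n], outcome)), reverse=True)[:2]
print(selected, by_relevance)
```

Ranking by relevance alone keeps the near-duplicate, while the greedy mRMR loop skips it in favor of the complementary feature.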

LASSO Feature Selection

The L1 penalty shrinks all coefficients toward zero and sets those of uninformative features exactly to zero, leaving a sparse model.

from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

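# LassoCV fits a linear regression, so it suits a continuous outcome (or 0/1-coded labels
# treated as numeric); for class labels, L1-penalized logistic regression is the closer fit
# cv=5: 5-fold cross-validation picks alpha; eps and n_alphas control the alpha grid
lasso = LassoCV(cv=5, random_state=42)
lasso.fit(X_scaled, y)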

selected_mask = lasso.coef_ != 0
selected = X.columns[selected_mask]
print(f'LASSO selected {len(selected)} features at alpha={lasso.alpha_:.4f}')

coefs = pd.Series(lasso.coef_, index=X.columns)
nonzero = coefs[coefs != 0].sort_values(key=abs, ascending=False)
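
Why L1 produces exact zeros can be seen in the special case of an orthonormal design, where the lasso solution for the objective ½‖y − Xβ‖² + λ‖β‖₁ is the soft-thresholded least-squares coefficient. The sketch below shows only that operator; the coefficient values and the threshold `lam` are made up for illustration, and `lam` does not map one-to-one onto scikit-learn's `alpha`, which uses a differently scaled objective.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

# Hypothetical least-squares coefficients for four features
ols_coefs = {'gene_a': 2.5, 'gene_b': -0.3, 'gene_c': 0.1, 'gene_d': -1.2}
lam = 0.5

lasso_coefs = {name: soft_threshold(b, lam) for name, b in ols_coefs.items()}
selected = [name for name, b in lasso_coefs.items() if b != 0]
print(lasso_coefs)
print(selected)
```

Small coefficients are removed outright, and the surviving ones are shrunk by `lam`. That shrinkage is one reason a selected set is often refit without the penalty before effect sizes are interpreted.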

Univariate Filtering (Pre-filter)

Reduce dimensionality before more expensive methods.

from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif

# f_classif: fast ANOVA F-test, assumes roughly normal values; good for log-counts
# mutual_info_classif: captures nonlinear relationships but is slower
# k=1000: reasonable pre-filter; increase for larger omics datasets (>10k features)
# min() guards against datasets with fewer than k features
selector = SelectKBest(f_classif, k=min(1000, X.shape[1]))
X_filtered = selector.fit_transform(X, y)
selected_idx = selector.get_support(indices=True)
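
The score behind `f_classif` is the one-way ANOVA F-statistic: between-class variance of a feature divided by its within-class variance. A hand-rolled sketch on toy data is shown below; the helper `anova_f` and the gene names are hypothetical, and scikit-learn's own implementation is vectorized and also returns p-values.

```python
def anova_f(values, labels):
    """One-way ANOVA F-statistic of one feature across class labels."""
    groups = {}
    for v, lab in zip(values, labels):
        groups.setdefault(lab, []).append(v)
    n = len(values)
    grand_mean = sum(values) / n
    # Variation of the class means around the grand mean
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values())
    # Variation of samples around their own class mean
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g)
    df_between = len(groups) - 1
    df_within = n - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

labels = [0, 0, 0, 1, 1, 1]
features = {
    'gene_a': [1, 2, 3, 4, 5, 6],   # classes well separated
    'gene_b': [1, 5, 3, 2, 4, 6],   # classes overlap
}
f_scores = {name: anova_f(col, labels) for name, col in features.items()}
top_k = sorted(f_scores, key=f_scores.get, reverse=True)[:1]
print(f_scores, top_k)
```

Because each feature is scored on its own, two features that are informative only in combination would both receive low scores.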

Combined Pipeline

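from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from boruta import BorutaPy

# Pre-filter then Boruta for efficiency
pipe = Pipeline([
    ('prefilter', SelectKBest(f_classif, k=5000)),
    ('boruta', BorutaPy(RandomForestClassifier(n_jobs=-1, random_state=42),
                        n_estimators='auto', max_iter=100, random_state=42))
])
# Note: BorutaPy doesn't follow the sklearn API perfectly; if pipe.fit(X, y) fails,
# fit the two steps manually and pass NumPy arrays to BorutaPy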

Method Comparison

| Method | Strengths | Weaknesses | Use When |
|---|---|---|---|
| Boruta | Finds all relevant features | Slow on large data | Want complete biomarker panel |
| mRMR | Reduces redundancy | Fixed K | Want compact signature |
| LASSO | Sparse, interpretable | Picks one of correlated | Want minimal predictive set |
| Univariate | Fast | Ignores interactions | Pre-filtering |

Stability Selection

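from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
import numpy as np

rng = np.random.default_rng(42)  # seeded so the resampling is reproducible
y_arr = np.asarray(y)            # positional indexing, even if y is a pandas Series
X_scaled = StandardScaler().fit_transform(X)  # L1 penalty and saga need comparable feature scales

n_bootstrap = 100
selection_counts = np.zeros(X.shape[1])

for i in range(n_bootstrap):
    idx = rng.choice(len(X), size=len(X), replace=True)
    X_boot, y_boot = X_scaled[idx], y_arr[idx]

    # C=0.1: inverse regularization strength; smaller C gives a sparser model
    lasso = LogisticRegression(penalty='l1', solver='saga', C=0.1, max_iter=1000)
    lasso.fit(X_boot, y_boot)
    selection_counts += (lasso.coef_[0] != 0)  # coef_[0]: binary labels assumed

# stability_threshold=0.6: keep features selected in more than 60% of bootstrap samples
stability_threshold = 0.6
selection_freq = selection_counts / n_bootstrap
stable_features = X.columns[selection_freq > stability_threshold]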
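
The counting step does not depend on which model performs the selection inside each resample. A minimal sketch, assuming each run has already produced a list of selected feature names (the helper `stable_by_frequency` and the example runs are hypothetical):

```python
from collections import Counter

def stable_by_frequency(runs, threshold):
    """Return features selected in more than `threshold` of the runs, plus all frequencies."""
    # set(run) ensures a feature counts at most once per run
    counts = Counter(name for run in runs for name in set(run))
    n_runs = len(runs)
    freq = {name: count / n_runs for name, count in counts.items()}
    stable = sorted(name for name, f in freq.items() if f > threshold)
    return stable, freq

# Hypothetical selections from five resampled fits
runs = [
    ['gene_a', 'gene_b'],
    ['gene_a'],
    ['gene_a', 'gene_c'],
    ['gene_a', 'gene_b'],
    ['gene_c'],
]
stable, freq = stable_by_frequency(runs, threshold=0.6)
print(stable, freq)
```

Reporting the full frequency table alongside the thresholded list makes it easier to see which features sit just below the cutoff.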

Related Skills

  • differential-expression/de-results - Pre-filter with DE genes
  • pathway-analysis/go-enrichment - Functional enrichment of selected features
  • machine-learning/omics-classifiers - Use selected features for prediction

Metadata

License: unknown
Version: -
Updated: 2/16/2026
Publisher: mdbabumiamssm

Tags

api, ci-cd