
Use when the request mentions "scikit-learn", "sklearn", "machine learning", "classification", "regression", or "clustering", or asks about "train test split", "cross validation", "hyperparameter tuning", "ML pipeline", "random forest", "SVM", or "preprocessing".

Updated 1/15/2026


Scikit-learn Machine Learning

Industry-standard Python library for classical machine learning.

When to Use

  • Classification or regression tasks
  • Clustering or dimensionality reduction
  • Preprocessing and feature engineering
  • Model evaluation and cross-validation
  • Hyperparameter tuning
  • Building ML pipelines

Algorithm Selection

Classification

Algorithm | Best For | Strengths
Logistic Regression | Baseline, interpretable | Fast, probabilistic
Random Forest | General purpose | Handles non-linear relationships, provides feature importance
Gradient Boosting | Maximizing accuracy | Often state of the art for tabular data
SVM | High-dimensional data | Works well with few samples
KNN | Simple problems | No explicit training step, instance-based
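
As a rough illustration of choosing between the options above, the sketch below scores a logistic regression baseline and a random forest with 5-fold cross-validation. The dataset (scikit-learn's bundled breast cancer data) and the settings are assumptions made for the example, not recommendations from this guide.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Example dataset chosen for illustration only.
X, y = load_breast_cancer(return_X_y=True)

candidates = {
    # Scaling lives inside the pipeline so it is refit on each training fold.
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in candidates.items():
    fold_scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    scores[name] = fold_scores.mean()
    print(f"{name}: mean accuracy {scores[name]:.3f}")
```

Starting from the simple baseline makes it easier to judge whether a more complex model earns its extra cost.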

Regression

Algorithm | Best For | Notes
Linear Regression | Baseline | Interpretable coefficients
Ridge/Lasso | Regularization needed | L2 penalty (Ridge) vs L1 penalty (Lasso)
Random Forest | Non-linear relationships | Robust to outliers
Gradient Boosting | Maximizing accuracy | XGBoost and LightGBM are separate packages with scikit-learn-compatible wrappers
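
To make the L2 vs L1 distinction concrete, here is a minimal sketch on synthetic data (the data shape and alpha values are assumptions for illustration). Ridge shrinks coefficients but keeps them non-zero, while Lasso tends to set uninformative coefficients exactly to zero.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: only 5 of the 20 features carry signal.
X, y = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=0
)

ridge = Ridge(alpha=1.0).fit(X, y)  # L2 penalty
lasso = Lasso(alpha=1.0).fit(X, y)  # L1 penalty

ridge_zeros = int(np.sum(ridge.coef_ == 0))
lasso_zeros = int(np.sum(lasso.coef_ == 0))
print(f"Ridge coefficients equal to zero: {ridge_zeros}")
print(f"Lasso coefficients equal to zero: {lasso_zeros}")
```

The sparsity of Lasso is why it is sometimes used as a rough feature selector.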

Clustering

Algorithm | Best For | Key Parameter
KMeans | Spherical clusters | n_clusters (must specify)
DBSCAN | Arbitrary shapes | eps (neighborhood radius, controls density)
Agglomerative | Hierarchical | n_clusters or distance_threshold
Gaussian Mixture | Soft clustering | n_components
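
The difference between spherical and arbitrary-shape clustering can be sketched on the two-moons toy dataset. The eps and min_samples values below are assumptions that suit this particular dataset and would need tuning elsewhere.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaving half-circles: clusters that are not spherical.
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

# Adjusted Rand index: 1.0 means perfect agreement with the true grouping.
kmeans_ari = adjusted_rand_score(y_true, kmeans_labels)
dbscan_ari = adjusted_rand_score(y_true, dbscan_labels)
print(f"KMeans ARI: {kmeans_ari:.2f}")
print(f"DBSCAN ARI: {dbscan_ari:.2f}")
```

KMeans splits the plane with a straight boundary, so it cuts across both moons, whereas DBSCAN follows the dense regions.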

Dimensionality Reduction

Method | Preserves | Use Case
PCA | Global variance | Feature reduction
t-SNE | Local structure | 2D/3D visualization
UMAP | Both local/global | Visualization + downstream (provided by the separate umap-learn package, not scikit-learn itself)
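
A minimal PCA sketch, assuming the bundled iris dataset and two components purely for illustration. It scales the features first and then reports how much variance each component retains.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# PCA is sensitive to feature scale, so standardize before projecting.
pca_pipeline = make_pipeline(StandardScaler(), PCA(n_components=2))
X_2d = pca_pipeline.fit_transform(X)

explained = pca_pipeline.named_steps["pca"].explained_variance_ratio_
print("Projected shape:", X_2d.shape)
print("Variance retained per component:", explained.round(3))
```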

Pipeline Concepts

Key concept: Pipelines prevent data leakage by ensuring transformations are fit only on training data.

Component | Purpose
Pipeline | Sequential steps (transform → model)
ColumnTransformer | Apply different transforms to different columns
FeatureUnion | Combine multiple feature extraction methods

Common preprocessing flow:

  1. Impute missing values (SimpleImputer)
  2. Scale numeric features (StandardScaler, MinMaxScaler)
  3. Encode categoricals (OneHotEncoder, OrdinalEncoder)
  4. Optional: feature selection or polynomial features
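
The preprocessing flow above can be sketched with Pipeline and ColumnTransformer. The tiny DataFrame, its column names (age, income, city), and the labels are hypothetical and exist only to make the example runnable.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with missing values in numeric and categorical columns.
df = pd.DataFrame({
    "age": [25, 32, np.nan, 51, 46, 38, 29, 60],
    "income": [40_000, 52_000, 61_000, np.nan, 85_000, 58_000, 43_000, 90_000],
    "city": ["A", "B", "A", "C", np.nan, "B", "A", "C"],
})
y = np.array([0, 0, 0, 1, 1, 0, 0, 1])

# Steps 1 and 2: impute, then scale the numeric columns.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Steps 1 and 3: impute, then one-hot encode the categorical column.
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", numeric, ["age", "income"]),
    ("cat", categorical, ["city"]),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("classifier", LogisticRegression()),
])

# Fitting the whole pipeline learns every transform from the data passed
# here, so passing only training rows keeps test rows out of the statistics.
model.fit(df, y)
predictions = model.predict(df)
print(predictions)
```

Because the transforms and the estimator are one object, the same pipeline can be handed to cross_val_score or GridSearchCV without extra leakage precautions.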

Model Evaluation

Cross-Validation Strategies

Strategy | Use Case
KFold | General purpose
StratifiedKFold | Imbalanced classification
TimeSeriesSplit | Temporal data
LeaveOneOut | Very small datasets
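
A small sketch of why stratification matters for imbalanced labels. The 90/10 label array is a made-up example, and the features are placeholders because only the split indices are inspected.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Made-up imbalanced labels: 90 negatives followed by 10 positives.
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 1))  # placeholder features

def positives_per_fold(splitter):
    """Count positive samples in each test fold."""
    return [int(y[test_idx].sum()) for _, test_idx in splitter.split(X, y)]

kfold_counts = positives_per_fold(KFold(n_splits=5))
stratified_counts = positives_per_fold(StratifiedKFold(n_splits=5))

print("KFold:", kfold_counts)                 # [0, 0, 0, 0, 10]
print("StratifiedKFold:", stratified_counts)  # [2, 2, 2, 2, 2]
```

With unshuffled KFold, four of the five test folds contain no positives at all, which would make per-fold recall or F1 meaningless.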

Metrics

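Task | Metric | When to Use
Classification | Accuracy | Balanced classes
Classification | F1-score | Imbalanced classes
Classification | ROC-AUC | Ranking, threshold tuning
Classification | Precision/Recall | Domain-specific costs
Regression | RMSE | Penalize large errors
Regression | MAE | Robust to outliers
Regression | Explained variance | Share of target variance the model captures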
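
To illustrate why accuracy misleads on imbalanced classes, the sketch below scores a degenerate classifier that always predicts the majority class. The 95/5 label split is an assumption chosen for the example.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# Made-up labels: 95 negatives, 5 positives.
y_true = np.array([0] * 95 + [1] * 5)
# A useless model that always predicts the majority class.
y_pred = np.zeros(100, dtype=int)

accuracy = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, zero_division=0)
balanced = balanced_accuracy_score(y_true, y_pred)

print(f"Accuracy:          {accuracy:.2f}")  # 0.95
print(f"F1-score:          {f1:.2f}")        # 0.00
print(f"Balanced accuracy: {balanced:.2f}")  # 0.50
```

Accuracy looks strong even though no positive case is ever found, while F1 and balanced accuracy expose the failure.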

Hyperparameter Tuning

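Method | Pros | Cons
GridSearchCV | Exhaustive | Slow for many params
RandomizedSearchCV | Faster | May miss optimal
HalvingGridSearchCV | Efficient | Requires sklearn 0.24+; still experimental, so enable_halving_search_cv must be imported from sklearn.experimental first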

Key concept: Always tune on a validation set (or with cross-validation on the training data), then evaluate the final model once on a held-out test set.
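
One way to follow this rule is sketched below: the test split is set aside first, GridSearchCV tunes with cross-validation on the training split only, and the test split is scored once at the end. The dataset, estimator, and parameter grid are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Hold out the test set before any tuning happens.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

pipeline = Pipeline([("scale", StandardScaler()), ("svc", SVC())])

# Step-name prefixes ("svc__") route parameters to the right pipeline step.
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.1]}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)  # cross-validation uses training data only

test_accuracy = search.score(X_test, y_test)  # single final evaluation
print("Best params:", search.best_params_)
print(f"CV accuracy: {search.best_score_:.3f}")
print(f"Test accuracy: {test_accuracy:.3f}")
```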


Best Practices

Practice | Why
Split data first | Prevent leakage
Use pipelines | Reproducible, no leakage
Scale for scale-sensitive methods | KNN, SVM, PCA need scaled features
Stratify imbalanced data | Preserve class distribution
Cross-validate | Reliable performance estimates
Check learning curves | Diagnose over/underfitting
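
As a sketch of the learning-curve check, the example below compares training and validation accuracy at several training-set sizes. The synthetic data and the unpruned decision tree are assumptions picked so that a gap between the two scores is likely to show up.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=0)

train_sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(random_state=0),  # unpruned tree, prone to overfit
    X,
    y,
    train_sizes=[0.25, 0.5, 0.75, 1.0],
    cv=5,
    scoring="accuracy",
)

# Average across the cross-validation folds for each training size.
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)

for n, train_acc, val_acc in zip(train_sizes, train_mean, val_mean):
    print(f"n={n}: train={train_acc:.2f}, validation={val_acc:.2f}")
```

A persistent gap between high training scores and lower validation scores suggests overfitting, while two low, converged curves suggest underfitting.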

Common Pitfalls

Pitfall | Solution
Fitting scaler on all data | Use pipeline or fit only on train
Using accuracy for imbalanced data | Use F1, ROC-AUC, or balanced accuracy
Too many hyperparameters | Start simple, add complexity
Ignoring feature importance | Use feature_importances_ or permutation importance
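
The two importance options can be sketched side by side. The synthetic dataset below is an assumption built so that only the first two columns carry signal.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# shuffle=False keeps the 2 informative features in columns 0 and 1.
X, y = make_classification(
    n_samples=500,
    n_features=6,
    n_informative=2,
    n_redundant=0,
    shuffle=False,
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# Impurity-based importances, computed from the training data.
impurity_importance = forest.feature_importances_

# Permutation importance, measured on held-out data.
result = permutation_importance(
    forest, X_test, y_test, n_repeats=10, random_state=0
)
top_feature = int(result.importances_mean.argmax())

print("Impurity-based:", impurity_importance.round(3))
print("Permutation:   ", result.importances_mean.round(3))
print("Most important feature index:", top_feature)
```

Permutation importance is computed on held-out data, so it reflects what the model relies on when generalizing.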

Metadata

License: unknown
Version: 1.0.0
Updated: 1/15/2026
Publisher: eyadsibai

Tags

ci-cd, observability, testing