Master machine learning foundations - algorithms, preprocessing, feature engineering, and evaluation

Updated 1/5/2026

ML Fundamentals Skill

Master the building blocks of machine learning: from raw data to trained models.

Quick Start

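from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

# 1. Load and split data
# (load_breast_cancer is a stand-in; substitute your own feature matrix X and labels y)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)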

# 2. Create pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(random_state=42))
])

# 3. Train and evaluate
pipeline.fit(X_train, y_train)
score = pipeline.score(X_test, y_test)
print(f"Accuracy: {score:.4f}")

Key Topics

1. Data Preprocessing

| Step | Purpose | Implementation |
| --- | --- | --- |
| Missing Values | Handle NaN/None | SimpleImputer(strategy='median') |
| Scaling | Normalize ranges | StandardScaler() or MinMaxScaler() |
| Encoding | Convert categories | OneHotEncoder() or OrdinalEncoder() for features; LabelEncoder() for the target only |
| Outliers | Remove extremes | IQR method or Z-score |

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

# Define column types
numeric_features = ['age', 'income', 'score']
categorical_features = ['gender', 'city', 'category']

# Create preprocessor
preprocessor = ColumnTransformer([
    ('num', Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ]), numeric_features),
    ('cat', Pipeline([
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('encoder', OneHotEncoder(handle_unknown='ignore'))
    ]), categorical_features)
])
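
The preprocessor above covers imputation, scaling, and encoding, but not the outlier row of the table. A minimal sketch of the IQR and Z-score rules follows; the helper names `iqr_mask` and `zscore_mask`, the `k=1.5` and `threshold=3.0` defaults, and the income values are illustrative assumptions, not part of the original skill.

```python
import numpy as np

def iqr_mask(values, k=1.5):
    """True for values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values >= q1 - k * iqr) & (values <= q3 + k * iqr)

def zscore_mask(values, threshold=3.0):
    """True for values within `threshold` standard deviations of the mean."""
    z = (values - values.mean()) / values.std()
    return np.abs(z) <= threshold

# Hypothetical income column with one extreme value
income = np.array([32_000, 35_000, 38_000, 41_000,
                   44_000, 47_000, 50_000, 1_000_000])

filtered = income[iqr_mask(income)]   # the 1,000,000 entry is dropped

# On a sample this small the Z-score rule keeps every point, because the
# extreme value inflates the standard deviation it is measured against.
zscore_keeps_all = zscore_mask(income).all()
```

In practice the fences would be computed on the training split only and then applied to the test split, in line with the leakage guidance under Best Practices.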

2. Feature Engineering

| Technique | Use Case | Example |
| --- | --- | --- |
| Polynomial | Non-linear relationships | PolynomialFeatures(degree=2) |
| Binning | Discretize continuous | KBinsDiscretizer(n_bins=5) |
| Log Transform | Right-skewed data | np.log1p(x) |
| Interaction | Feature combinations | x1 * x2 |
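
The four techniques in the table can be tried side by side on a small synthetic matrix. This is a sketch under stated assumptions: the data is random, and `encode='ordinal'` with `strategy='uniform'` is one choice among several for `KBinsDiscretizer`.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer, PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(1.0, 100.0, size=(200, 2))  # two positive synthetic features

# Polynomial: adds squares and the pairwise product (bias column omitted)
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)  # columns: x0, x1, x0^2, x0*x1, x1^2

# Binning: equal-width bins encoded as integer codes 0..4
binner = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')
X_binned = binner.fit_transform(X)

# Log transform: log(1 + x), which is defined at x = 0
X_log = np.log1p(X)

# Interaction: explicit product of two features
interaction = X[:, 0] * X[:, 1]
```

The explicit `interaction` column equals the `x0*x1` column that `PolynomialFeatures` generates, so the two approaches are interchangeable for a single pair of features.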

3. Model Evaluation

from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report

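# `model` is any unfitted scikit-learn estimator or pipeline (e.g. the Quick Start
# pipeline); X, y and the train/test split are assumed to come from Quick Start.

# Cross-validation (fits clones of `model`; `model` itself stays unfitted)
cv_scores = cross_val_score(model, X, y, cv=5, scoring='f1_weighted')
print(f"CV F1: {cv_scores.mean():.4f} (+/- {cv_scores.std()*2:.4f})")

# Detailed report on the held-out test set
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))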

4. Cross-Validation Strategies

| Strategy | When to Use |
| --- | --- |
| KFold | Standard, balanced data |
| StratifiedKFold | Imbalanced classification |
| TimeSeriesSplit | Temporal data |
| GroupKFold | Grouped samples |
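
The differences between these splitters show up most clearly on a tiny dataset. The sketch below uses made-up labels and a hypothetical `groups` array; the fold counts are arbitrary.

```python
import numpy as np
from sklearn.model_selection import (
    GroupKFold, KFold, StratifiedKFold, TimeSeriesSplit
)

X = np.arange(40).reshape(20, 2)
y = np.array([0] * 16 + [1] * 4)       # imbalanced: 4 positives, all at the end
groups = np.repeat(np.arange(5), 4)    # 5 hypothetical groups of 4 samples each

# Plain KFold without shuffling: every positive lands in the last test fold
kfold_positives = [int(y[test].sum()) for _, test in KFold(n_splits=4).split(X)]

# StratifiedKFold preserves the class ratio in every test fold
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
stratified_positives = [int(y[test].sum()) for _, test in skf.split(X, y)]

# TimeSeriesSplit: training indices always precede test indices
ordered = all(
    train.max() < test.min()
    for train, test in TimeSeriesSplit(n_splits=4).split(X)
)

# GroupKFold: no group appears on both sides of a split
disjoint = all(
    set(groups[train]).isdisjoint(groups[test])
    for train, test in GroupKFold(n_splits=5).split(X, y, groups=groups)
)
```

Any of these splitter objects can be passed as the `cv` argument of `cross_val_score` in place of an integer.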

Best Practices

DO

  • Split data BEFORE any preprocessing
  • Use pipelines for reproducibility
  • Stratify splits for classification
  • Log all preprocessing parameters
  • Version your feature engineering code
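
One way to act on "Log all preprocessing parameters" is to dump the pipeline's settings with `get_params()`. The sketch below is an assumption about how logging could be done rather than a prescribed method; the `C=0.5` value is arbitrary, and the output would normally go to an experiment log instead of stdout.

```python
import json

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(C=0.5, max_iter=1000)),
])

# get_params(deep=True) also returns the step objects themselves;
# keep only JSON-friendly scalar settings such as 'classifier__C'.
params = {
    name: value
    for name, value in pipeline.get_params(deep=True).items()
    if isinstance(value, (bool, int, float, str, type(None)))
}

print(json.dumps(params, indent=2, sort_keys=True))
```

Because every setting lives on the pipeline object, the same dictionary can be stored next to the trained model and compared between runs.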

DON'T

  • Don't fit on test data
  • Don't ignore data leakage
  • Don't use accuracy for imbalanced data
  • Don't hard-code parameters
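
The warning about accuracy on imbalanced data can be illustrated with a baseline that always predicts the majority class. The 95/5 class split below is a made-up example.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# Synthetic labels: 95 negatives, 5 positives
y = np.array([0] * 95 + [1] * 5)
X = np.zeros((100, 1))  # features are irrelevant for this baseline

# Baseline that always predicts the most frequent class (0)
baseline = DummyClassifier(strategy='most_frequent').fit(X, y)
y_pred = baseline.predict(X)

acc = accuracy_score(y, y_pred)               # 0.95: looks good
bal_acc = balanced_accuracy_score(y, y_pred)  # 0.5: mean of per-class recall
f1 = f1_score(y, y_pred, zero_division=0)     # 0.0: no positive is ever found
```

A model that learns nothing scores 95% accuracy here, which is why F1 or balanced accuracy is the more informative choice when classes are skewed.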

Exercises

Exercise 1: Basic Pipeline

# TODO: Create a pipeline that:
# 1. Imputes missing values
# 2. Scales features
# 3. Trains a logistic regression
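
One possible solution sketch for Exercise 1 is shown below. The synthetic data, the median strategy, and the name `exercise_pipeline` are assumptions made for illustration; other imputers or scalers would satisfy the exercise equally well.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with every tenth value of the first feature missing
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
X[::10, 0] = np.nan

exercise_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),        # 1. impute
    ('scaler', StandardScaler()),                         # 2. scale
    ('classifier', LogisticRegression(max_iter=1000)),    # 3. classify
])

exercise_pipeline.fit(X, y)
predictions = exercise_pipeline.predict(X)
```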

Exercise 2: Cross-Validation

# TODO: Implement 5-fold stratified CV
# and report mean and std of F1 score
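
A possible approach to Exercise 2, sketched on synthetic imbalanced data. The 80/20 class weights and the choice of `LogisticRegression` are assumptions, not requirements of the exercise.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary problem with roughly 80% negatives and 20% positives
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=0
)

# 5-fold stratified CV; shuffling with a fixed seed keeps runs repeatable
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=cv, scoring='f1'
)

print(f"F1: {scores.mean():.4f} +/- {scores.std():.4f}")
```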

Unit Test Template

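import numpy as np
import pytest
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def create_preprocessing_pipeline():
    """Illustrative stand-in for the pipeline under test: impute, then scale."""
    return Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler()),
    ])


@pytest.fixture
def X():
    """Synthetic feature matrix with one missing value."""
    X, _ = make_classification(n_samples=100, n_features=10, random_state=42)
    X[0, 0] = np.nan  # Introduce missing value
    return X


def test_preprocessing_pipeline(X):
    """Test preprocessing handles missing values."""
    pipeline = create_preprocessing_pipeline()
    X_transformed = pipeline.fit_transform(X)

    assert not np.isnan(X_transformed).any()
    assert X_transformed.shape[0] == X.shape[0]


def test_no_data_leakage(X):
    """Verify the scaler's statistics come from the training rows only."""
    X_train, X_test = X[:80], X[80:]

    pipeline = create_preprocessing_pipeline()
    pipeline.fit(X_train)
    pipeline.transform(X_test)  # transform must not refit any step

    # The scaler mean must equal the mean of the imputed training data
    imputed_train = pipeline.named_steps['imputer'].transform(X_train)
    np.testing.assert_allclose(
        pipeline.named_steps['scaler'].mean_, imputed_train.mean(axis=0)
    )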

Troubleshooting

| Problem | Cause | Solution |
| --- | --- | --- |
| NaN in prediction | Missing imputer | Add SimpleImputer to pipeline |
| Shape mismatch | Inconsistent features | Use ColumnTransformer |
| Memory error | Too many one-hot features | Use max_categories or hashing |
| Poor CV variance | Data leakage | Check preprocessing order |

Version: 1.4.0 | Status: Production Ready

Install

Requires askill CLI v1.0+

AI Quality Score

96/100, analyzed 2/12/2026

An exceptional skill document providing a comprehensive, highly actionable guide to ML fundamentals with scikit-learn, including code, tests, and troubleshooting.

Metadata

License: unknown
Version: 1.4.0
Updated: 1/5/2026
Publisher: pluginagentmarketplace

Tags

ci-cd, observability, testing