Build production-ready classification and regression models with hyperparameter tuning

1 stars
1.2k downloads
Updated 1/5/2026

Supervised Learning Skill

Build, tune, and evaluate classification and regression models.

Quick Start

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Split data (X and y are assumed to be an already-loaded feature matrix and label vector)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluate
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='f1_weighted')
print(f"CV F1: {cv_scores.mean():.4f} (+/- {cv_scores.std()*2:.4f})")
print(f"Test Accuracy: {model.score(X_test, y_test):.4f}")

Key Topics

1. Classification Algorithms

| Algorithm | Best For | Complexity |
|---|---|---|
| Logistic Regression | Baseline, interpretable | O(n\*d) |
| Random Forest | Tabular, general | O(n\*d\*trees) |
| XGBoost | Competitions, accuracy | O(n\*d\*trees) |
| SVM | High-dim, small data | O(n²) |

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

classifiers = {
    'lr': LogisticRegression(max_iter=1000, class_weight='balanced'),
    'rf': RandomForestClassifier(n_estimators=100, class_weight='balanced'),
    'xgb': XGBClassifier(n_estimators=100, eval_metric='logloss')
}

2. Regression Algorithms

| Algorithm | Best For | Key Param |
|---|---|---|
| Ridge | Multicollinearity | alpha |
| Lasso | Feature selection | alpha |
| Random Forest | Non-linear | n_estimators |
| XGBoost | Best accuracy | learning_rate |
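
The table above can be turned into a quick cross-validated comparison. The following is a minimal sketch, not a definitive benchmark: the synthetic dataset, the `alpha=1.0` values, and the names `X_reg`, `y_reg`, `regressors`, and `reg_scores` are assumptions for illustration, and XGBoost is left out so the example depends only on scikit-learn.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for a real dataset (assumption for illustration).
X_reg, y_reg = make_regression(
    n_samples=300, n_features=20, n_informative=5, noise=10.0, random_state=42
)

# Linear models are scaled first so the alpha penalty treats all features alike.
regressors = {
    'ridge': make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    'lasso': make_pipeline(StandardScaler(), Lasso(alpha=1.0)),
    'rf': RandomForestRegressor(n_estimators=100, random_state=42),
}

reg_scores = {}
for name, reg in regressors.items():
    # scikit-learn negates error metrics so that higher is always better.
    neg_rmse = cross_val_score(
        reg, X_reg, y_reg, cv=5, scoring='neg_root_mean_squared_error'
    )
    reg_scores[name] = -neg_rmse.mean()
    print(f"{name}: CV RMSE = {reg_scores[name]:.2f}")
```

In practice `alpha` would itself be tuned, for example with the randomized search shown in the hyperparameter tuning section.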

3. Hyperparameter Tuning

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

param_dist = {
    'n_estimators': randint(50, 300),
    'max_depth': randint(3, 15),
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 10)
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_dist,
    n_iter=50,
    cv=5,
    scoring='f1_weighted',
    n_jobs=-1,
    random_state=42
)

search.fit(X_train, y_train)
print(f"Best params: {search.best_params_}")
print(f"Best CV score: {search.best_score_:.4f}")

4. Handling Class Imbalance

| Technique | Implementation |
|---|---|
| Class Weights | class_weight='balanced' |
| SMOTE | imblearn.over_sampling.SMOTE() |
| Threshold Tuning | Adjust prediction threshold |

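from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

# imblearn's Pipeline resamples only during fit, so validation data is never oversampled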

pipeline = Pipeline([
    ('smote', SMOTE(random_state=42)),
    ('classifier', RandomForestClassifier())
])
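
Threshold tuning, the third technique in the table, has no example above. One possible sketch follows, assuming a binary problem, a logistic regression model, and F1 as the metric to maximize; the synthetic data and the names `X_imb`, `y_imb`, `clf`, and `best_threshold` are illustrative only. The threshold is chosen on a validation split, not on the test set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data (assumption for illustration): about 10% positives.
X_imb, y_imb = make_classification(
    n_samples=2000, weights=[0.9, 0.1], random_state=42
)
X_fit, X_val, y_fit, y_val = train_test_split(
    X_imb, y_imb, test_size=0.25, stratify=y_imb, random_state=42
)

clf = LogisticRegression(max_iter=1000).fit(X_fit, y_fit)
proba = clf.predict_proba(X_val)[:, 1]  # probability of the positive class

precision, recall, thresholds = precision_recall_curve(y_val, proba)
# precision and recall carry one extra trailing point with no threshold; drop it.
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best_threshold = thresholds[np.argmax(f1)]

# Apply the tuned threshold instead of the default 0.5.
y_pred = (proba >= best_threshold).astype(int)
print(f"Best threshold: {best_threshold:.3f}, validation F1: {f1.max():.3f}")
```

Maximizing F1 is only one choice; a cost-sensitive application might instead pick the threshold that meets a minimum recall or precision.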

5. Model Comparison

from sklearn.model_selection import cross_validate
import pandas as pd

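def compare_models(models, X, y, cv=5):
    """Cross-validate each model and return mean train/test metrics as a DataFrame.

    Every model must implement predict_proba, which the ROC AUC scorer requires.
    """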
    results = []
    for name, model in models.items():
        cv_results = cross_validate(
            model, X, y, cv=cv,
            scoring=['accuracy', 'f1_weighted', 'roc_auc_ovr_weighted'],
            return_train_score=True
        )
        results.append({
            'model': name,
            'train_acc': cv_results['train_accuracy'].mean(),
            'test_acc': cv_results['test_accuracy'].mean(),
            'test_f1': cv_results['test_f1_weighted'].mean(),
            'test_auc': cv_results['test_roc_auc_ovr_weighted'].mean()
        })
    return pd.DataFrame(results).round(4)

Best Practices

DO

  • Start with a simple baseline
  • Use stratified splits for classification
  • Log all hyperparameters
  • Check for overfitting (train vs test gap)
  • Use early stopping for boosting
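
Early stopping for boosting could look like the sketch below. It uses scikit-learn's `GradientBoostingClassifier` rather than XGBoost, since the XGBoost early-stopping arguments differ between versions; the synthetic data and the specific values (500 rounds, patience of 10) are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data stands in for a real training set (assumption for illustration).
X_es, y_es = make_classification(n_samples=1000, random_state=42)

gb = GradientBoostingClassifier(
    n_estimators=500,         # upper bound on boosting rounds
    learning_rate=0.1,
    validation_fraction=0.1,  # share of training data held out to monitor the loss
    n_iter_no_change=10,      # stop after 10 rounds without improvement
    tol=1e-4,
    random_state=42,
)
gb.fit(X_es, y_es)

# n_estimators_ is the number of rounds actually fitted before stopping.
print(f"Rounds used: {gb.n_estimators_} of {gb.n_estimators}")
```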

DON'T

  • Don't tune on the test set
  • Don't ignore class imbalance
  • Don't skip feature importance analysis
  • Don't rely on accuracy alone for imbalanced data
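
For the feature importance point, permutation importance on held-out data is one option that works with any fitted estimator. This is a minimal sketch on synthetic data; the names `X_fi`, `y_fi`, `rf`, and `ranking` are illustrative, and a real analysis would use the project's own features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data with 3 informative features out of 10 (assumption for illustration).
X_fi, y_fi = make_classification(
    n_samples=500, n_features=10, n_informative=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_fi, y_fi, test_size=0.2, stratify=y_fi, random_state=42
)

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Shuffle each feature in turn and measure how much the held-out score drops.
result = permutation_importance(
    rf, X_te, y_te, n_repeats=10, scoring='f1_weighted', random_state=42
)

# Feature indices ordered from most to least important.
ranking = result.importances_mean.argsort()[::-1]
for idx in ranking[:5]:
    print(f"feature {idx}: {result.importances_mean[idx]:.4f} "
          f"(+/- {result.importances_std[idx]:.4f})")
```

Unlike the impurity-based `feature_importances_` attribute of tree ensembles, this measure is computed on data the model has not seen during training.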

Exercises

Exercise 1: Model Selection

# TODO: Compare 3 different classifiers using cross-validation
# Report F1 score for each

Exercise 2: Hyperparameter Tuning

# TODO: Use RandomizedSearchCV to tune XGBoost
# Find optimal n_estimators, max_depth, learning_rate

Unit Test Template

import pytest
from sklearn.datasets import make_classification

def test_classifier_trains():
    """Test classifier can fit and predict."""
    X, y = make_classification(n_samples=100, random_state=42)
    model = get_classifier()  # placeholder: supply a factory returning the estimator under test

    model.fit(X[:80], y[:80])
    predictions = model.predict(X[80:])

    assert len(predictions) == 20
    assert set(predictions).issubset({0, 1})

def test_handles_imbalance():
    """Test model handles imbalanced classes."""
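    # Fixed seed keeps the class split, and therefore the test, deterministic
    X, y = make_classification(n_samples=100, weights=[0.9, 0.1], random_state=42)
    model = get_balanced_classifier()  # placeholder: supply a factory returning a class-weighted estimator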

    model.fit(X, y)
    predictions = model.predict(X)

    # Should predict both classes
    assert len(set(predictions)) == 2

Troubleshooting

| Problem | Cause | Solution |
|---|---|---|
| Overfitting | Model too complex | Reduce depth, add regularization |
| Underfitting | Model too simple | Increase complexity |
| Class imbalance | Skewed data | Use SMOTE or class weights |
| Slow training | Large data | Use LightGBM, reduce estimators |
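
To diagnose the first two rows, the train versus validation gap can be measured across model complexity. The sketch below uses `validation_curve` over `max_depth` of a random forest; the synthetic data, the depth grid, and the names `X_vc`, `y_vc`, and `gaps` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

# Noisy synthetic data (assumption for illustration); flip_y adds label noise.
X_vc, y_vc = make_classification(
    n_samples=500, n_features=20, n_informative=5, flip_y=0.1, random_state=42
)

depths = [2, 4, 8, 16]
train_scores, val_scores = validation_curve(
    RandomForestClassifier(n_estimators=50, random_state=42),
    X_vc, y_vc,
    param_name='max_depth',
    param_range=depths,
    cv=5,
    scoring='f1_weighted',
)

# One row per depth, one column per CV fold; average over the folds.
gaps = train_scores.mean(axis=1) - val_scores.mean(axis=1)
for depth, train, val, gap in zip(
    depths, train_scores.mean(axis=1), val_scores.mean(axis=1), gaps
):
    print(f"max_depth={depth}: train={train:.3f} val={val:.3f} gap={gap:.3f}")
```

A gap that keeps growing while the validation score stalls points to overfitting; low scores on both sides point to underfitting.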

Related Resources

  • Agent: 02-supervised-learning
  • Previous: ml-fundamentals
  • Next: clustering

Version: 1.4.0 | Status: Production Ready

Install

Requires askill CLI v1.0+

AI Quality Score

95/100 (analyzed 2/11/2026)

An exceptional technical reference for supervised learning, providing comprehensive code examples, structured comparisons, and production-ready best practices.

Metadata

License: unknown
Version: 1.4.0
Updated: 1/5/2026
Publisher: pluginagentmarketplace

Tags

ci-cd, testing