Machine Learning with Scikit-learn: A Comprehensive Guide

Published at 03:30 PM

Scikit-learn is the most popular machine learning library for Python, providing simple and efficient tools for data mining and data analysis. Built on NumPy, SciPy, and matplotlib, it offers a consistent API for a wide range of machine learning algorithms and utilities.

What is Scikit-learn?

Scikit-learn is an open-source machine learning library that provides:

  - Supervised learning algorithms for classification and regression
  - Unsupervised learning algorithms for clustering and dimensionality reduction
  - Tools for model selection, cross-validation, and hyperparameter tuning
  - Utilities for data preprocessing and feature engineering

Key Features

  - A consistent estimator API: every model is trained with fit() and used with predict() or transform() (illustrated below)
  - Built on NumPy, SciPy, and matplotlib
  - Extensive documentation and a large, active community
  - Open source, with a commercially usable BSD license
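
The consistent API deserves a quick illustration. Here is a minimal sketch that trains two unrelated models with identical calls:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Two different algorithms, one identical interface
for model in (DecisionTreeClassifier(), KNeighborsClassifier()):
    model.fit(X, y)                                     # same training call
    print(type(model).__name__, model.predict(X[:2]))  # same prediction call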

Installation

Installing scikit-learn is straightforward using pip or conda:

# Using pip
pip install scikit-learn

# Using conda
conda install scikit-learn

# For additional dependencies
pip install scikit-learn matplotlib seaborn pandas
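
To confirm the install worked, print the installed version:

# Verify the installation
python -c "import sklearn; print(sklearn.__version__)"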

Basic Workflow

The typical machine learning workflow with scikit-learn follows these steps (a minimal end-to-end sketch follows the list):

  1. Load and explore data
  2. Preprocess the data
  3. Split data into training and testing sets
  4. Choose and train a model
  5. Make predictions
  6. Evaluate the model
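
The sections below expand each step with fuller examples. As a quick orientation, here is a minimal sketch that strings all six steps together, using logistic regression on the built-in iris dataset as a stand-in model:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                      # 1. Load data
X_train, X_test, y_train, y_test = train_test_split(   # 3. Split
    X, y, test_size=0.3, random_state=42
)
scaler = StandardScaler().fit(X_train)                 # 2. Preprocess (fit on train only)
model = LogisticRegression(max_iter=1000)              # 4. Choose a model
model.fit(scaler.transform(X_train), y_train)          #    ...and train it
y_pred = model.predict(scaler.transform(X_test))       # 5. Predict
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")  # 6. Evaluate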

Supervised Learning Examples

Classification Example

Let’s start with a simple classification example using the famous iris dataset:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Load the iris dataset
iris = datasets.load_iris()
X = iris.data  # Features
y = iris.target  # Labels

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Create and train the model
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.3f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))

Regression Example

Here’s a linear regression example with a synthetic dataset:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Generate sample data
np.random.seed(42)
X = np.random.rand(100, 1) * 10
y = 2 * X.ravel() + 1 + np.random.randn(100) * 2

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Create and train the model
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Make predictions
y_pred = regressor.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse:.3f}")
print(f"R² Score: {r2:.3f}")

# Visualize results
plt.figure(figsize=(10, 6))
plt.scatter(X_test, y_test, color='blue', label='Actual')
plt.plot(X_test, y_pred, color='red', linewidth=2, label='Predicted')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Linear Regression Results')
plt.legend()
plt.show()

Unsupervised Learning Examples

Clustering with K-Means

K-means clustering is one of the most popular unsupervised learning algorithms:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate sample data
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Apply K-means clustering (n_init pinned explicitly; its default changed to "auto" in newer scikit-learn)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
y_pred = kmeans.fit_predict(X)

# Visualize results
plt.figure(figsize=(10, 8))
plt.scatter(X[:, 0], X[:, 1], c=y_pred, cmap='viridis', marker='o')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            c='red', marker='x', s=200, linewidths=3, label='Centroids')
plt.title('K-Means Clustering Results')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.show()
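
Here the number of clusters is known in advance. With real data, a common heuristic is the elbow method: fit K-means for a range of k and look for the bend in the inertia curve. A minimal sketch, reusing X, KMeans, and plt from the block above:

# Elbow method: inertia (within-cluster sum of squares) for k = 1..8
inertias = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)

plt.plot(range(1, 9), inertias, marker='o')
plt.xlabel('Number of clusters k')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()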

Principal Component Analysis (PCA)

PCA is useful for dimensionality reduction:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import load_digits

# Load the digits dataset
digits = load_digits()
X = digits.data

# Apply PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Visualize results
plt.figure(figsize=(10, 8))
colors = ['red', 'blue', 'green', 'purple', 'orange',
          'brown', 'pink', 'gray', 'olive', 'cyan']

for i in range(10):
    plt.scatter(X_pca[digits.target == i, 0],
                X_pca[digits.target == i, 1],
                c=colors[i], label=str(i), alpha=0.7)

plt.xlabel(f'First Principal Component ({pca.explained_variance_ratio_[0]:.1%} variance)')
plt.ylabel(f'Second Principal Component ({pca.explained_variance_ratio_[1]:.1%} variance)')
plt.title('PCA of Digits Dataset')
plt.legend()
plt.show()

print(f"Total variance explained: {pca.explained_variance_ratio_.sum():.1%}")

Data Preprocessing

Scikit-learn provides powerful preprocessing tools:

Feature Scaling

from sklearn.preprocessing import StandardScaler, MinMaxScaler
import numpy as np

# Sample data
data = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])

# Standard Scaling (mean=0, std=1)
scaler_standard = StandardScaler()
data_standard = scaler_standard.fit_transform(data)

# Min-Max Scaling (range 0-1)
scaler_minmax = MinMaxScaler()
data_minmax = scaler_minmax.fit_transform(data)

print("Original data:")
print(data)
print("\nStandardized data:")
print(data_standard)
print("\nMin-Max scaled data:")
print(data_minmax)
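
As a rule of thumb, StandardScaler suits models that expect roughly centered, unit-variance inputs (linear models, SVMs, k-NN), while MinMaxScaler is useful when a bounded range is required. Neither is needed for tree-based models.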

Handling Categorical Data

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import pandas as pd

# Sample categorical data
data = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'red', 'blue'],
    'size': ['small', 'large', 'medium', 'large', 'small']
})

# Label Encoding (note: LabelEncoder is intended for target labels;
# prefer OrdinalEncoder for ordinal input features)
le_color = LabelEncoder()
data['color_encoded'] = le_color.fit_transform(data['color'])

# One-Hot Encoding (sparse_output=False returns a dense array; requires scikit-learn >= 1.2)
ohe = OneHotEncoder(sparse_output=False)
color_ohe = ohe.fit_transform(data[['color']])
color_ohe_df = pd.DataFrame(color_ohe, columns=ohe.get_feature_names_out(['color']))

print("Original data:")
print(data)
print("\nOne-hot encoded colors:")
print(color_ohe_df)

Model Evaluation and Selection

Cross-Validation

from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_wine

# Load dataset
wine = load_wine()
X, y = wine.data, wine.target

# Create model
rf = RandomForestClassifier(n_estimators=100, random_state=42)

# Perform stratified 5-fold cross-validation
# (an integer cv on a classifier also defaults to stratified folds)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(rf, X, y, cv=skf, scoring='accuracy')

print(f"Cross-validation scores: {cv_scores}")
print(f"Mean CV accuracy: {cv_scores.mean():.3f} (+/- {cv_scores.std() * 2:.3f})")

Grid Search for Hyperparameter Tuning

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.datasets import load_iris

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Define parameter grid
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1],
    'kernel': ['rbf', 'linear']
}

# Create SVM classifier
svm = SVC()

# Perform grid search (by default, the best model is refit on all data passed to fit)
grid_search = GridSearchCV(svm, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X, y)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_:.3f}")

Classification Algorithms

Thanks to the shared estimator API, comparing several classifiers on the same split takes only a few lines:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

# Dictionary of classifiers
classifiers = {
    'Random Forest': RandomForestClassifier(n_estimators=100),
    'SVM': SVC(kernel='rbf'),
    'Naive Bayes': GaussianNB(),
    'K-NN': KNeighborsClassifier(n_neighbors=5),
    'Logistic Regression': LogisticRegression(max_iter=1000)  # raised max_iter to avoid convergence warnings
}

# Compare performance on iris dataset
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    accuracy = clf.score(X_test, y_test)
    print(f"{name}: {accuracy:.3f}")

Best Practices

1. Always Split Your Data

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

2. Scale Your Features

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Don't fit on test data!

3. Use Pipelines

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

# Create a pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier())
])

# Fit and predict
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
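
Pipelines also plug directly into GridSearchCV: a step's hyperparameters are addressed with the step name as a prefix. A short sketch, reusing the pipeline above:

from sklearn.model_selection import GridSearchCV

# The 'classifier__' prefix routes the parameter to the RandomForestClassifier step
param_grid = {'classifier__n_estimators': [50, 100, 200]}
grid = GridSearchCV(pipeline, param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)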

4. Handle Missing Values

import numpy as np
from sklearn.impute import SimpleImputer

# Sample data with missing values
X_num = np.array([[1.0, 2.0], [np.nan, 4.0], [5.0, np.nan]])

# For numerical features: fill with the column mean
imputer_num = SimpleImputer(strategy='mean')
X_imputed = imputer_num.fit_transform(X_num)

# For categorical features: fill with the most frequent value
X_cat = np.array([['red'], [np.nan], ['red']], dtype=object)
imputer_cat = SimpleImputer(strategy='most_frequent')
X_cat_imputed = imputer_cat.fit_transform(X_cat)

Real-World Example: Predicting House Prices

Here’s a more comprehensive example using a regression problem:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.pipeline import Pipeline

# Generate synthetic house price data
np.random.seed(42)
n_samples = 1000

data = {
    'area': np.random.normal(2000, 500, n_samples),
    'bedrooms': np.random.randint(1, 6, n_samples),
    'bathrooms': np.random.randint(1, 4, n_samples),
    'age': np.random.randint(0, 50, n_samples),
    'garage': np.random.randint(0, 3, n_samples)
}

# Create target variable with some realistic relationships
price = (
    data['area'] * 100 +
    data['bedrooms'] * 10000 +
    data['bathrooms'] * 15000 +
    (50 - data['age']) * 1000 +
    data['garage'] * 5000 +
    np.random.normal(0, 20000, n_samples)
)

df = pd.DataFrame(data)
df['price'] = price

# Prepare features and target
X = df.drop('price', axis=1)
y = df['price']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Create and train model using pipeline
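# (Note: tree ensembles don't need feature scaling; the scaler is kept to illustrate the pattern)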
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', RandomForestRegressor(n_estimators=100, random_state=42))
])

pipeline.fit(X_train, y_train)

# Make predictions
y_pred = pipeline.predict(X_test)

# Evaluate the model
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print("Model Performance:")
print(f"Mean Absolute Error: ${mae:,.2f}")
print(f"Root Mean Square Error: ${rmse:,.2f}")
print(f"R² Score: {r2:.3f}")

# Feature importance
feature_importance = pipeline.named_steps['regressor'].feature_importances_
feature_names = X.columns

print("\nFeature Importance:")
for name, importance in zip(feature_names, feature_importance):
    print(f"{name}: {importance:.3f}")

Conclusion

Scikit-learn is an incredibly powerful and user-friendly library that makes machine learning accessible to Python developers. Its consistent API, comprehensive documentation, and wide range of algorithms make it an excellent choice for both beginners and experienced practitioners.

Key takeaways:

  - The consistent fit/predict/transform API makes it easy to swap algorithms.
  - Split your data first, and fit preprocessing steps on the training set only.
  - Pipelines bundle preprocessing and modeling together and help prevent data leakage.
  - Cross-validation and grid search give more reliable evaluation and tuning than a single split.

Whether you’re working on classification, regression, clustering, or dimensionality reduction, scikit-learn provides the tools you need to build robust machine learning solutions. The library’s emphasis on simplicity and consistency makes it an ideal starting point for anyone looking to dive into the world of machine learning with Python.

