Scikit-learn is one of the most popular machine learning libraries for Python, providing simple and efficient tools for data mining and data analysis. Built on NumPy, SciPy, and matplotlib, it offers a consistent API across a wide range of machine learning algorithms and utilities.
What is Scikit-learn?
Scikit-learn is an open-source machine learning library that provides:
- Simple and efficient tools for predictive data analysis
- An API that is accessible to everybody and reusable in various contexts
- A codebase built on NumPy, SciPy, and matplotlib
- A permissive BSD license that allows commercial use
Key Features
- Classification: Identifying which category an object belongs to
- Regression: Predicting a continuous-valued attribute associated with an object
- Clustering: Automatic grouping of similar objects into sets
- Dimensionality reduction: Reducing the number of random variables to consider
- Model selection: Comparing, validating and choosing parameters and models
- Preprocessing: Feature extraction and normalization
Installation
Installing scikit-learn is straightforward using pip or conda:
# Using pip
pip install scikit-learn
# Using conda
conda install scikit-learn
# Companion libraries commonly used alongside scikit-learn
pip install scikit-learn matplotlib seaborn pandas
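To check that the install worked, import the package and print its version:
import sklearn
print(sklearn.__version__)  # prints the installed version string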
Basic Workflow
The typical machine learning workflow with scikit-learn follows these steps (a minimal end-to-end sketch follows the list):
1. Load and explore data
2. Preprocess the data
3. Split data into training and testing sets
4. Choose and train a model
5. Make predictions
6. Evaluate the model
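Here is a minimal sketch of that workflow, using the built-in iris dataset and a logistic regression model as stand-ins for your own data and estimator:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# 1. Load the data
X, y = load_iris(return_X_y=True)
# 2 & 3. Split, then preprocess; fitting the scaler on the training data only avoids leakage
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# 4. Choose and train a model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
# 5. Make predictions
y_pred = model.predict(X_test)
# 6. Evaluate the model
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")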
Supervised Learning Examples
Classification Example
Let’s start with a simple classification example using the famous iris dataset:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
# Load the iris dataset
iris = datasets.load_iris()
X = iris.data # Features
y = iris.target # Labels
# Split the data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42
)
# Create and train the model
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
# Make predictions
y_pred = clf.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.3f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
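To see where the classifier errs rather than just how often, you can also print a confusion matrix from the same predictions:
from sklearn.metrics import confusion_matrix
# Rows are true classes, columns are predicted classes
print(confusion_matrix(y_test, y_pred))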
Regression Example
Here’s a linear regression example with a synthetic dataset:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
# Generate sample data
np.random.seed(42)
X = np.random.rand(100, 1) * 10
y = 2 * X.ravel() + 1 + np.random.randn(100) * 2
# Split the data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42
)
# Create and train the model
regressor = LinearRegression()
regressor.fit(X_train, y_train)
# Make predictions
y_pred = regressor.predict(X_test)
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse:.3f}")
print(f"R² Score: {r2:.3f}")
# Visualize results
plt.figure(figsize=(10, 6))
plt.scatter(X_test, y_test, color='blue', label='Actual')
plt.plot(X_test, y_pred, color='red', linewidth=2, label='Predicted')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Linear Regression Results')
plt.legend()
plt.show()
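Since the data was generated as y = 2x + 1 plus noise, you can sanity-check the fit by inspecting the learned parameters, which should land close to those values:
# Learned parameters; expect a slope near 2 and an intercept near 1
print(f"Slope: {regressor.coef_[0]:.3f}")
print(f"Intercept: {regressor.intercept_:.3f}")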
Unsupervised Learning Examples
Clustering with K-Means
K-means clustering is one of the most popular unsupervised learning algorithms:
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
# Generate sample data
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
# Apply K-means clustering
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)  # n_init set explicitly; its default changed in recent versions
y_pred = kmeans.fit_predict(X)
# Visualize results
plt.figure(figsize=(10, 8))
plt.scatter(X[:, 0], X[:, 1], c=y_pred, cmap='viridis', marker='o')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
c='red', marker='x', s=200, linewidths=3, label='Centroids')
plt.title('K-Means Clustering Results')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.show()
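In practice the number of clusters is rarely known in advance. One common heuristic, sketched below, is to compare silhouette scores across candidate values of k and pick the best-scoring one:
from sklearn.metrics import silhouette_score
# Higher silhouette scores indicate better-separated clusters
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(f"k={k}: silhouette score = {silhouette_score(X, labels):.3f}")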
Principal Component Analysis (PCA)
PCA is useful for dimensionality reduction:
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import load_digits
# Load the digits dataset
digits = load_digits()
X = digits.data
# Apply PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
# Visualize results
plt.figure(figsize=(10, 8))
colors = ['red', 'blue', 'green', 'purple', 'orange',
'brown', 'pink', 'gray', 'olive', 'cyan']
for i in range(10):
plt.scatter(X_pca[digits.target == i, 0],
X_pca[digits.target == i, 1],
c=colors[i], label=str(i), alpha=0.7)
plt.xlabel(f'First Principal Component ({pca.explained_variance_ratio_[0]:.1%} variance)')
plt.ylabel(f'Second Principal Component ({pca.explained_variance_ratio_[1]:.1%} variance)')
plt.title('PCA of Digits Dataset')
plt.legend()
plt.show()
print(f"Total variance explained: {pca.explained_variance_ratio_.sum():.1%}")
Data Preprocessing
Scikit-learn provides powerful preprocessing tools:
Feature Scaling
from sklearn.preprocessing import StandardScaler, MinMaxScaler
import numpy as np
# Sample data
data = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
# Standard Scaling (mean=0, std=1)
scaler_standard = StandardScaler()
data_standard = scaler_standard.fit_transform(data)
# Min-Max Scaling (range 0-1)
scaler_minmax = MinMaxScaler()
data_minmax = scaler_minmax.fit_transform(data)
print("Original data:")
print(data)
print("\nStandardized data:")
print(data_standard)
print("\nMin-Max scaled data:")
print(data_minmax)
Handling Categorical Data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import pandas as pd
# Sample categorical data
data = pd.DataFrame({
'color': ['red', 'blue', 'green', 'red', 'blue'],
'size': ['small', 'large', 'medium', 'large', 'small']
})
# Label Encoding (note: LabelEncoder is designed for target labels; for feature columns, prefer OrdinalEncoder or one-hot encoding)
le_color = LabelEncoder()
data['color_encoded'] = le_color.fit_transform(data['color'])
# One-Hot Encoding
ohe = OneHotEncoder(sparse_output=False)
color_ohe = ohe.fit_transform(data[['color']])
color_ohe_df = pd.DataFrame(color_ohe, columns=ohe.get_feature_names_out(['color']))
print("Original data:")
print(data)
print("\nOne-hot encoded colors:")
print(color_ohe_df)
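Real datasets usually mix categorical and numerical columns. ColumnTransformer routes each group of columns through its own preprocessing; here is a minimal sketch, assuming a made-up numerical 'price' column added to the toy frame above:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
# Hypothetical numerical column, added purely for illustration
data['price'] = [10.0, 25.0, 17.5, 12.0, 30.0]
preprocessor = ColumnTransformer([
    ('cat', OneHotEncoder(sparse_output=False), ['color', 'size']),  # one-hot the categorical columns
    ('num', StandardScaler(), ['price'])  # standardize the numerical column
])
print(preprocessor.fit_transform(data))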
Model Evaluation and Selection
Cross-Validation
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_wine
# Load dataset
wine = load_wine()
X, y = wine.data, wine.target
# Create model
rf = RandomForestClassifier(n_estimators=100, random_state=42)
# Perform cross-validation
cv_scores = cross_val_score(rf, X, y, cv=5, scoring='accuracy')
print(f"Cross-validation scores: {cv_scores}")
print(f"Mean CV accuracy: {cv_scores.mean():.3f} (+/- {cv_scores.std() * 2:.3f})")
Grid Search for Hyperparameter Tuning
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.datasets import load_iris
# Load dataset
iris = load_iris()
X, y = iris.data, iris.target
# Define parameter grid
param_grid = {
'C': [0.1, 1, 10, 100],
'gamma': [0.001, 0.01, 0.1, 1],
'kernel': ['rbf', 'linear']
}
# Create SVM classifier
svm = SVC()
# Perform grid search
grid_search = GridSearchCV(svm, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X, y)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_:.3f}")
print(f"Test set score: {grid_search.score(X, y):.3f}")
Popular Algorithms in Scikit-learn
Classification Algorithms
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
# Dictionary of classifiers
classifiers = {
'Random Forest': RandomForestClassifier(n_estimators=100),
'SVM': SVC(kernel='rbf'),
'Naive Bayes': GaussianNB(),
'K-NN': KNeighborsClassifier(n_neighbors=5),
    'Logistic Regression': LogisticRegression(max_iter=1000)  # raise max_iter to avoid convergence warnings on unscaled features
}
# Compare performance on iris dataset
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
for name, clf in classifiers.items():
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(f"{name}: {accuracy:.3f}")
Best Practices
1. Always Split Your Data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
2. Scale Your Features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test) # Don't fit on test data!
3. Use Pipelines
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
# Create a pipeline
pipeline = Pipeline([
('scaler', StandardScaler()),
('classifier', RandomForestClassifier())
])
# Fit and predict
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
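Pipelines also plug straight into model selection. Prefixing a hyperparameter with the step name and a double underscore lets GridSearchCV tune the whole pipeline, and the scaler is refit on each training fold so nothing leaks into the validation folds:
from sklearn.model_selection import GridSearchCV
# 'classifier__n_estimators' addresses n_estimators inside the 'classifier' step
param_grid = {'classifier__n_estimators': [50, 100, 200]}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)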
4. Handle Missing Values
from sklearn.impute import SimpleImputer
# For numerical features
imputer_num = SimpleImputer(strategy='mean')
X_imputed = imputer_num.fit_transform(X)
# For categorical features
imputer_cat = SimpleImputer(strategy='most_frequent')
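Imputers slot into pipelines like any other transformer, so the imputation statistics are learned from the training data only. A minimal sketch combining the pieces above:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
# Impute, scale, then classify in a single estimator
model = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000))
])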
Real-World Example: Predicting House Prices
Here’s a more comprehensive example using a regression problem:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.pipeline import Pipeline
# Generate synthetic house price data
np.random.seed(42)
n_samples = 1000
data = {
'area': np.random.normal(2000, 500, n_samples),
'bedrooms': np.random.randint(1, 6, n_samples),
'bathrooms': np.random.randint(1, 4, n_samples),
'age': np.random.randint(0, 50, n_samples),
'garage': np.random.randint(0, 3, n_samples)
}
# Create target variable with some realistic relationships
price = (
data['area'] * 100 +
data['bedrooms'] * 10000 +
data['bathrooms'] * 15000 +
(50 - data['age']) * 1000 +
data['garage'] * 5000 +
np.random.normal(0, 20000, n_samples)
)
df = pd.DataFrame(data)
df['price'] = price
# Prepare features and target
X = df.drop('price', axis=1)
y = df['price']
# Split the data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Create and train model using pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),  # not needed for tree models, but included to demonstrate the pipeline pattern
('regressor', RandomForestRegressor(n_estimators=100, random_state=42))
])
pipeline.fit(X_train, y_train)
# Make predictions
y_pred = pipeline.predict(X_test)
# Evaluate the model
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
print("Model Performance:")
print(f"Mean Absolute Error: ${mae:,.2f}")
print(f"Root Mean Square Error: ${rmse:,.2f}")
print(f"R² Score: {r2:.3f}")
# Feature importance
feature_importance = pipeline.named_steps['regressor'].feature_importances_
feature_names = X.columns
print("\nFeature Importance:")
for name, importance in zip(feature_names, feature_importance):
print(f"{name}: {importance:.3f}")
Conclusion
Scikit-learn is an incredibly powerful and user-friendly library that makes machine learning accessible to Python developers. Its consistent API, comprehensive documentation, and wide range of algorithms make it an excellent choice for both beginners and experienced practitioners.
Key takeaways:
- Start simple: Begin with basic algorithms before moving to complex ones
- Always preprocess your data: Scaling, encoding, and handling missing values are crucial
- Use cross-validation: Don’t rely on a single train-test split
- Leverage pipelines: They make your code cleaner and prevent data leakage
- Experiment with different algorithms: What works best depends on your specific problem
Whether you’re working on classification, regression, clustering, or dimensionality reduction, scikit-learn provides the tools you need to build robust machine learning solutions. The library’s emphasis on simplicity and consistency makes it an ideal starting point for anyone looking to dive into the world of machine learning with Python.