In many real-world machine learning problems, especially in areas like fraud detection, medical diagnosis, or churn prediction, datasets often have a highly imbalanced class distribution. This means that one class significantly outnumbers the other(s), which can result in a biased model that fails to properly learn the minority class.
To address this issue, two of the most commonly used techniques are class weights and SMOTE (Synthetic Minority Oversampling Technique).
In this tutorial, we will explore both of these techniques using a complete example in Python.
We will use make_classification from sklearn.datasets to create a binary classification dataset with an imbalance.
from sklearn.datasets import make_classification
import pandas as pd
from collections import Counter
# Create an imbalanced dataset
X, y = make_classification(n_samples=1000,
                           n_features=20,
                           n_informative=2,
                           n_redundant=10,
                           n_clusters_per_class=1,
                           weights=[0.9, 0.1],
                           flip_y=0,
                           random_state=1)
print("Original class distribution:", Counter(y))
Output -
Original class distribution: Counter({np.int64(0): 900, np.int64(1): 100})

Split the data into training and test sets.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    stratify=y,
                                                    random_state=42)

Some models like Logistic Regression, Random Forest, and SVM support a parameter called class_weight. Setting class_weight='balanced' tells the model to automatically adjust weights inversely proportional to class frequencies.
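As a rough illustration of what 'balanced' means, the sketch below reproduces the heuristic that scikit-learn documents for this setting, n_samples / (n_classes * count_of_class), in plain Python. The helper name balanced_class_weights and the list y_example are made up for this example, and the 630/70 counts assume the stratified 70% training split above.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count)
            for cls, count in sorted(counts.items())}

# Assumed training labels: 630 majority samples and 70 minority samples.
y_example = [0] * 630 + [1] * 70
weights = balanced_class_weights(y_example)
print({cls: round(w, 3) for cls, w in weights.items()})  # {0: 0.556, 1: 5.0}
```

With these counts the minority class ends up weighted about nine times more heavily than the majority class, so a mistake on a minority sample costs the model roughly nine times as much.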
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
model = LogisticRegression(class_weight='balanced', max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Classification Report with Class Weights:")
print(classification_report(y_test, y_pred))
Output -
Classification Report with Class Weights:
              precision    recall  f1-score   support

           0       0.98      0.94      0.96       270
           1       0.63      0.87      0.73        30

    accuracy                           0.94       300
   macro avg       0.81      0.91      0.85       300
weighted avg       0.95      0.94      0.94       300

Alternatively, you can specify custom weights:
model = LogisticRegression(class_weight={0: 1, 1: 9}, max_iter=1000)

SMOTE (Synthetic Minority Oversampling Technique) generates synthetic data points for the minority class rather than simply duplicating existing ones.
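To give a feel for how a synthetic point can differ from a duplicate, here is a minimal sketch of the interpolation idea behind SMOTE: pick a minority sample, pick one of its nearest minority neighbours, and place a new point somewhere on the line segment between them. This is a simplified illustration, not imbalanced-learn's implementation; the function smote_like_point and the toy minority_points are assumptions made up for this example.

```python
import math
import random

def smote_like_point(minority, k=2, rng=None):
    """Create one synthetic point between a minority sample and a near neighbour."""
    rng = rng or random.Random(0)
    i = rng.randrange(len(minority))
    base = minority[i]
    # The k nearest minority neighbours of the chosen sample, excluding itself.
    others = [p for j, p in enumerate(minority) if j != i]
    neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
    neighbour = rng.choice(neighbours)
    # Interpolate with a random gap in [0, 1): gap = 0 would reproduce base.
    gap = rng.random()
    return [b + gap * (n - b) for b, n in zip(base, neighbour)]

# Toy 2-D minority samples (hypothetical data).
minority_points = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
synthetic = smote_like_point(minority_points, k=2, rng=random.Random(42))
print(synthetic)
```

Because the new point is interpolated rather than copied, it stays inside the region spanned by the existing minority samples while still adding a value the model has not seen before.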
pip install imbalanced-learn

from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
print("After SMOTE:", Counter(y_train_smote))
Output -
After SMOTE: Counter({np.int64(0): 630, np.int64(1): 630})

model = LogisticRegression(max_iter=1000)
model.fit(X_train_smote, y_train_smote)
y_pred = model.predict(X_test)
print("Classification Report with SMOTE:")
print(classification_report(y_test, y_pred))
Output -
Classification Report with SMOTE:
              precision    recall  f1-score   support

           0       0.98      0.96      0.97       270
           1       0.70      0.87      0.78        30

    accuracy                           0.95       300
   macro avg       0.84      0.91      0.87       300
weighted avg       0.96      0.95      0.95       300

For imbalanced datasets, accuracy is often misleading. Instead, use metrics such as precision, recall, F1-score, and ROC AUC. For example, ROC AUC can be computed from the predicted probabilities:
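To see why accuracy can mislead here, consider a hypothetical "model" that always predicts the majority class. The sketch below scores it on labels with the same 270/30 split as the test set in the reports above; the variable names are chosen for this example only.

```python
# Assumed test labels: 270 majority (0) and 30 minority (1) samples.
y_true = [0] * 270 + [1] * 30
# A trivial baseline that never predicts the minority class.
y_always_majority = [0] * len(y_true)

correct = sum(t == p for t, p in zip(y_true, y_always_majority))
accuracy = correct / len(y_true)

true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_always_majority))
actual_positives = sum(t == 1 for t in y_true)
minority_recall = true_positives / actual_positives

print("Accuracy:", accuracy)                # 0.9
print("Minority recall:", minority_recall)  # 0.0
```

The baseline reaches 90% accuracy while detecting none of the minority samples, which is the failure that recall and F1-score make visible.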
from sklearn.metrics import roc_auc_score
y_proba = model.predict_proba(X_test)[:, 1]
roc_score = roc_auc_score(y_test, y_proba)
print("ROC AUC Score:", roc_score)
Output -
ROC AUC Score: 0.9669135802469137

Undersampling reduces the size of the majority class.
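The idea can be sketched in plain Python before turning to a library: randomly keep as many samples from each class as the smallest class has, and discard the rest. The helper random_undersample and the toy data below are assumptions for illustration and are not how imbalanced-learn implements its sampler.

```python
import random
from collections import Counter

def random_undersample(X, y, seed=42):
    """Randomly keep min-class-count samples from every class."""
    rng = random.Random(seed)
    counts = Counter(y)
    n_keep = min(counts.values())
    keep = []
    for cls in counts:
        indices = [i for i, label in enumerate(y) if label == cls]
        keep.extend(rng.sample(indices, n_keep))
    keep.sort()  # preserve the original sample order
    return [X[i] for i in keep], [y[i] for i in keep]

# Toy data mirroring the 630/70 training split (feature values are made up).
X_toy = [[float(i)] for i in range(700)]
y_toy = [0] * 630 + [1] * 70
X_small, y_small = random_undersample(X_toy, y_toy)
print(Counter(y_small))  # Counter({0: 70, 1: 70})
```

The trade-off is that 560 of the 630 majority samples are thrown away, so the model trains on far less data than with class weights or SMOTE.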
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(random_state=42)
X_train_rus, y_train_rus = rus.fit_resample(X_train, y_train)
print("After Undersampling:", Counter(y_train_rus))
model = LogisticRegression(max_iter=1000)
model.fit(X_train_rus, y_train_rus)
y_pred = model.predict(X_test)
print("Classification Report with Undersampling:")
print(classification_report(y_test, y_pred))
Output -
Classification Report with Undersampling:
              precision    recall  f1-score   support

           0       0.98      0.91      0.95       270
           1       0.52      0.83      0.64        30

    accuracy                           0.91       300
   macro avg       0.75      0.87      0.79       300
weighted avg       0.93      0.91      0.92       300

Handling imbalanced datasets is critical to building robust machine learning models that generalize well, especially when the minority class carries high importance (like detecting fraud or disease). Two major approaches are class weights, which penalize minority-class errors more heavily without changing the data, and SMOTE, which adds synthetic minority samples to the training set.
You should also evaluate models with metrics beyond accuracy, such as precision, recall, F1-score, and ROC AUC.