Handling Imbalanced Data (SMOTE, Class Weights)
In many real-world machine learning problems, especially in areas like fraud detection, medical diagnosis, or churn prediction, datasets have a highly imbalanced class distribution: one class significantly outnumbers the other(s). A model trained on such data tends to be biased toward the majority class and fails to learn the minority class properly.
To address this issue, two of the most commonly used techniques are:
- Using Class Weights – which adjusts the training process to penalize misclassification of minority classes more.
- Synthetic Oversampling (SMOTE) – which creates synthetic examples of the minority class to balance the dataset.
In this tutorial, we will explore both of these techniques using a complete example in Python.
Step 1: Simulate an Imbalanced Dataset
We will use make_classification from sklearn.datasets to create a binary classification dataset with a 90/10 class imbalance.
from collections import Counter
from sklearn.datasets import make_classification

# Create an imbalanced dataset: about 90% class 0 and 10% class 1
X, y = make_classification(n_samples=1000,
                           n_features=20,
                           n_informative=2,
                           n_redundant=10,
                           n_clusters_per_class=1,
                           weights=[0.9, 0.1],
                           flip_y=0,
                           random_state=1)
print("Original class distribution:", Counter(y))
Output -
Original class distribution: Counter({np.int64(0): 900, np.int64(1): 100})
Step 2: Train-Test Split
Split the data into training and test sets. Passing stratify=y keeps the same class ratio in both splits.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    stratify=y,
                                                    random_state=42)
Step 3: Using Class Weights
Some models, such as Logistic Regression, Random Forest, and SVM, support a parameter called class_weight. Setting class_weight='balanced' tells the model to weight each class inversely proportional to its frequency in the training data, so errors on the minority class cost more.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
model = LogisticRegression(class_weight='balanced', max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Classification Report with Class Weights:")
print(classification_report(y_test, y_pred))
Output -
Classification Report with Class Weights:
              precision    recall  f1-score   support
           0       0.98      0.94      0.96       270
           1       0.63      0.87      0.73        30
    accuracy                           0.94       300
   macro avg       0.81      0.91      0.85       300
weighted avg       0.95      0.94      0.94       300
Alternatively, you can specify custom weights:
model = LogisticRegression(class_weight={0: 1, 1: 9}, max_iter=1000)
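To see roughly what 'balanced' resolves to, the sketch below applies the heuristic described in the scikit-learn documentation, n_samples / (n_classes * class_count), to labels with the same 630 / 70 split as the training set here. The helper name balanced_class_weights and the variable y_train_demo are made up for this illustration.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count)
            for cls, count in counts.items()}

# Hypothetical labels with the same 630 / 70 split as the training set
y_train_demo = [0] * 630 + [1] * 70
weights = balanced_class_weights(y_train_demo)
print(weights)
print("minority / majority weight ratio:", weights[1] / weights[0])
```

Under these assumptions the minority class is weighted about 9 times as heavily as the majority class, the same ratio as the custom weights {0: 1, 1: 9}. The absolute scale differs, which can interact with the regularization strength, so the two settings are not guaranteed to produce identical models.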
Step 4: Using SMOTE for Oversampling
SMOTE (Synthetic Minority Oversampling Technique) generates synthetic data points for the minority class rather than simply duplicating existing ones.
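The core idea can be sketched in a few lines of NumPy: pick a minority sample, pick one of its k nearest minority neighbours, and place a new point at a random position on the line segment between them. This is a simplified illustration, not the imbalanced-learn implementation; the function name smote_like_samples and its parameters are assumptions made for this example.

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)

    # Pairwise Euclidean distances within the minority class
    diffs = X_min[:, None, :] - X_min[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour

    # Indices of the k nearest minority neighbours of every row
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)           # random base rows
    pick = rng.integers(0, neighbours.shape[1], size=n_new)  # random neighbour slot
    nbr = neighbours[base, pick]
    gap = rng.random((n_new, 1))                             # position on the segment
    return X_min[base] + gap * (X_min[nbr] - X_min[base])

# Toy minority class: the four corners of the unit square
X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_new = smote_like_samples(X_minority, n_new=6, k=2)
print(X_new.shape)
print(X_new)
```

Because each synthetic point lies between two existing minority points, the new rows stay inside the region the minority class already occupies, which is what distinguishes this approach from simply duplicating rows.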
Install imbalanced-learn (if not already installed)
pip install imbalanced-learn
Apply SMOTE
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
print("After SMOTE:", Counter(y_train_smote))
Output -
After SMOTE: Counter({np.int64(0): 630, np.int64(1): 630})
Train Model on SMOTE Data
model = LogisticRegression(max_iter=1000)
model.fit(X_train_smote, y_train_smote)
y_pred = model.predict(X_test)
print("Classification Report with SMOTE:")
print(classification_report(y_test, y_pred))
Output -
Classification Report with SMOTE:
              precision    recall  f1-score   support
           0       0.98      0.96      0.97       270
           1       0.70      0.87      0.78        30
    accuracy                           0.95       300
   macro avg       0.84      0.91      0.87       300
weighted avg       0.96      0.95      0.95       300
Step 5: Comparing Evaluation Metrics
For imbalanced datasets, accuracy is often misleading because the majority class dominates it. Instead, use:
- Precision: The fraction of predicted positives that are truly positive.
- Recall: The fraction of actual positives that were correctly predicted.
- F1-Score: The harmonic mean of precision and recall.
- ROC AUC Score: How well the model's scores rank positives above negatives across all decision thresholds.
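As a quick illustration of why accuracy misleads, consider a hypothetical baseline that predicts class 0 for every row of the 300-row test set (270 negatives, 30 positives). The helper precision_recall_f1 below is written for this example and returns 0.0 wherever a ratio would be undefined.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1), using 0.0 when a ratio is undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# Confusion counts for a baseline that always predicts class 0
tp, fp, fn, tn = 0, 0, 30, 270
accuracy = (tp + tn) / (tp + fp + fn + tn)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
# prints: accuracy=0.90 precision=0.00 recall=0.00 f1=0.00
```

The baseline reaches 90% accuracy while finding none of the positives, which is exactly the failure that precision, recall, and F1 expose.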
from sklearn.metrics import roc_auc_score
# `model` is still the SMOTE-trained classifier from Step 4
y_proba = model.predict_proba(X_test)[:, 1]  # predicted probability of class 1
roc_score = roc_auc_score(y_test, y_proba)
print("ROC AUC Score:", roc_score)
Output -
ROC AUC Score: 0.9669135802469137
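One way to read this number: ROC AUC equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, with ties counted as half. The brute-force sketch below computes that fraction directly on a tiny made-up example; pairwise_auc, y_demo, and scores_demo are illustrative names and values, not part of the tutorial's data.

```python
def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up labels and scores: 2 positives, 3 negatives
y_demo = [0, 0, 0, 1, 1]
scores_demo = [0.1, 0.4, 0.8, 0.35, 0.9]
auc_demo = pairwise_auc(y_demo, scores_demo)
print(round(auc_demo, 4))  # 4 of the 6 pairs are ordered correctly -> 0.6667
```

The double loop is quadratic in the number of samples, so it is only meant to show the definition; roc_auc_score is the practical choice.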
Step 6: Optional – Undersampling
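Undersampling balances the training set by discarding majority-class rows until the classes are the same size. It is fast, but it throws away data the model could have learned from.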
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(random_state=42)
X_train_rus, y_train_rus = rus.fit_resample(X_train, y_train)
print("After Undersampling:", Counter(y_train_rus))
model = LogisticRegression(max_iter=1000)
model.fit(X_train_rus, y_train_rus)
y_pred = model.predict(X_test)
print("Classification Report with Undersampling:")
print(classification_report(y_test, y_pred))
Output -
Classification Report with Undersampling:
              precision    recall  f1-score   support
           0       0.98      0.91      0.95       270
           1       0.52      0.83      0.64        30
    accuracy                           0.91       300
   macro avg       0.75      0.87      0.79       300
weighted avg       0.93      0.91      0.92       300
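What random undersampling does to the data can be sketched in NumPy: for every class, keep a random subset as large as the rarest class. This is an illustration of the idea under simple assumptions, not the RandomUnderSampler implementation; the function name random_undersample and the toy arrays are made up for the example.

```python
import numpy as np

def random_undersample(X, y, seed=0):
    """Keep a random subset of every class, sized to the rarest class."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == cls), size=n_keep, replace=False)
        for cls in classes
    ])
    keep.sort()  # preserve the original row order
    return X[keep], y[keep]

# Toy data: 8 rows of class 0 and 2 rows of class 1
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0] * 8 + [1] * 2)
X_small, y_small = random_undersample(X_demo, y_demo)
print(np.bincount(y_small))  # [2 2]
```

Six of the eight majority rows are dropped in this toy case, which shows how much information undersampling can discard when the imbalance is severe.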
Conclusion
Handling imbalanced datasets is critical to building robust machine learning models that generalize well, especially when the minority class carries high importance (as in detecting fraud or disease). The two main approaches covered here are:
- Adjusting class weights – simple and often effective.
- Oversampling using SMOTE – improves balance by adding synthetic examples.
You should also evaluate models with metrics beyond accuracy, such as precision, recall, F1-score, and ROC AUC.