Feature Engineering & Data Preprocessing
Overview
- Handling Missing Data in ML
- Feature Scaling (Normalization vs. Standardization)
- Encoding Categorical Variables
- Feature Selection Techniques
- Dimensionality Reduction Techniques
- Feature Extraction from Text and Images
- Handling Imbalanced Data (SMOTE, Class Weights)
Handling Imbalanced Data (SMOTE, Class Weights)
In many real-world machine learning problems, especially in areas like fraud detection, medical diagnosis, or churn prediction, datasets often have a highly imbalanced class distribution. This means that one class significantly outnumbers the other(s), which can result in a biased model that fails to properly learn the minority class.
To address this issue, two of the most commonly used techniques are:
- Using Class Weights – which adjusts the training process to penalize misclassification of minority classes more.
- Synthetic Oversampling (SMOTE) – which creates synthetic examples of the minority class to balance the dataset.
In this tutorial, we will explore both of these techniques using a complete example in Python.
Step 1: Simulate an Imbalanced Dataset
We will use make_classification from sklearn.datasets to create a binary classification dataset with a 90/10 class imbalance.
from sklearn.datasets import make_classification
from collections import Counter

# Create an imbalanced dataset (90% class 0, 10% class 1)
X, y = make_classification(n_samples=1000,
                           n_features=20,
                           n_informative=2,
                           n_redundant=10,
                           n_clusters_per_class=1,
                           weights=[0.9, 0.1],
                           flip_y=0,
                           random_state=1)

print("Original class distribution:", Counter(y))
Output -
Original class distribution: Counter({np.int64(0): 900, np.int64(1): 100})
Step 2: Train-Test Split
Split the data into training and test sets. Passing stratify=y keeps the 9:1 class ratio identical in both splits.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    stratify=y,
                                                    random_state=42)
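To double-check the stratification (and to see where the counts in later outputs come from), you can inspect both splits:

# Stratification keeps the 9:1 ratio in both splits
print("Train:", Counter(y_train))   # e.g. Counter({0: 630, 1: 70})
print("Test: ", Counter(y_test))    # e.g. Counter({0: 270, 1: 30})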
Step 3: Using Class Weights
Some models, such as Logistic Regression, Random Forest, and SVM, support a class_weight parameter. Setting class_weight='balanced' tells the model to automatically adjust weights inversely proportional to class frequencies.
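As a quick illustration, here is a sketch reproducing sklearn's balanced-weight formula, n_samples / (n_classes * np.bincount(y)), on our training split:

import numpy as np

# Reproduce what class_weight='balanced' computes internally
counts = np.bincount(y_train)             # [630, 70]
weights = len(y_train) / (2 * counts)     # n_samples / (n_classes * counts)
print(dict(enumerate(weights)))           # roughly {0: 0.56, 1: 5.0}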
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
model = LogisticRegression(class_weight='balanced', max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Classification Report with Class Weights:")
print(classification_report(y_test, y_pred))
Output -
Classification Report with Class Weights:
              precision    recall  f1-score   support

           0       0.98      0.94      0.96       270
           1       0.63      0.87      0.73        30

    accuracy                           0.94       300
   macro avg       0.81      0.91      0.85       300
weighted avg       0.95      0.94      0.94       300
Alternatively, you can specify custom weights, e.g. penalizing minority-class errors nine times as heavily to mirror the 90/10 imbalance:
model = LogisticRegression(class_weight={0: 1, 1: 9}, max_iter=1000)
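If you are unsure which weights to pick, one option is to treat class_weight as a hyperparameter and search over it. A sketch (the grid values below are illustrative, not recommendations):

from sklearn.model_selection import GridSearchCV

# Search a few candidate minority-class weights, scoring by F1
param_grid = {'class_weight': [{0: 1, 1: w} for w in (1, 3, 5, 9, 15)]}
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      param_grid, scoring='f1', cv=5)
search.fit(X_train, y_train)
print("Best class weights:", search.best_params_)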
Step 4: Using SMOTE for Oversampling
SMOTE (Synthetic Minority Oversampling Technique) generates synthetic data points for the minority class rather than simply duplicating existing ones. Each synthetic point is an interpolation between a minority sample and one of its k nearest minority-class neighbors.
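Conceptually, the idea looks like this (a simplified sketch with made-up points, not the library's implementation):

import numpy as np

rng = np.random.default_rng(0)
x_i = np.array([1.0, 2.0])          # a minority-class sample
x_nn = np.array([3.0, 1.0])         # one of its nearest minority neighbours
lam = rng.random()                  # interpolation factor in [0, 1]
x_new = x_i + lam * (x_nn - x_i)    # synthetic point on the segment between them
print(x_new)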
Install imbalanced-learn (if not already installed)
pip install imbalanced-learn
Apply SMOTE (to the training set only – resampling the test set would leak synthetic information into the evaluation)
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
print("After SMOTE:", Counter(y_train_smote))
Output -
After SMOTE: Counter({np.int64(0): 630, np.int64(1): 630})
Train Model on SMOTE Data
model = LogisticRegression(max_iter=1000)
model.fit(X_train_smote, y_train_smote)
y_pred = model.predict(X_test)
print("Classification Report with SMOTE:")
print(classification_report(y_test, y_pred))
Output -
Classification Report with SMOTE:
              precision    recall  f1-score   support

           0       0.98      0.96      0.97       270
           1       0.70      0.87      0.78        30

    accuracy                           0.95       300
   macro avg       0.84      0.91      0.87       300
weighted avg       0.96      0.95      0.95       300
Step 5: Comparing Evaluation Metrics
For imbalanced datasets, accuracy is often misleading. Instead, use:
- Precision: How many predicted positives are truly positive.
- Recall: How many actual positives were correctly predicted.
- F1-Score: Harmonic mean of precision and recall. For example, precision 0.70 and recall 0.87 give F1 = 2 × 0.70 × 0.87 / (0.70 + 0.87) ≈ 0.78, exactly the minority-class score in the SMOTE report above.
- ROC AUC Score: Overall ability of the model to distinguish between classes.
from sklearn.metrics import roc_auc_score
y_proba = model.predict_proba(X_test)[:, 1]
roc_score = roc_auc_score(y_test, y_proba)
print("ROC AUC Score:", roc_score)
Output -
ROC AUC Score: 0.9669135802469137
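When the positive class is rare, the precision-recall curve is often more informative than the ROC curve; its single-number summary is available in sklearn as average precision:

from sklearn.metrics import average_precision_score

# Summarizes the precision-recall curve for the minority (positive) class
pr_auc = average_precision_score(y_test, y_proba)
print("PR AUC (average precision):", pr_auc)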
Step 6: Optional – Undersampling
Undersampling balances the classes by discarding samples from the majority class. It is simple and fast, but throwing away data can cost performance, as the report below shows.
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(random_state=42)
X_train_rus, y_train_rus = rus.fit_resample(X_train, y_train)
print("After Undersampling:", Counter(y_train_rus))
model = LogisticRegression(max_iter=1000)
model.fit(X_train_rus, y_train_rus)
y_pred = model.predict(X_test)
print("Classification Report with Undersampling:")
print(classification_report(y_test, y_pred))
Output -
Classification Report with Undersampling:
              precision    recall  f1-score   support

           0       0.98      0.91      0.95       270
           1       0.52      0.83      0.64        30

    accuracy                           0.91       300
   macro avg       0.75      0.87      0.79       300
weighted avg       0.93      0.91      0.92       300
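One caveat worth noting before concluding: when combining resampling with cross-validation, the resampling must happen inside each fold. imbalanced-learn's Pipeline handles this; a sketch using the objects defined above:

from imblearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

# SMOTE runs only on the training folds, so no synthetic
# points leak into the validation folds
pipe = Pipeline([('smote', SMOTE(random_state=42)),
                 ('clf', LogisticRegression(max_iter=1000))])
scores = cross_val_score(pipe, X_train, y_train, scoring='f1', cv=5)
print("Cross-validated F1:", scores.mean())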
Conclusion
Handling imbalanced datasets is critical to building robust machine learning models that generalize well, especially when the minority class carries high importance (as in detecting fraud or disease). The main approaches covered here:
- Adjusting class weights – simple and often effective.
- Oversampling with SMOTE – improves balance by adding synthetic examples.
- Undersampling – balances classes by discarding majority samples, at the cost of data.
You should also evaluate models with metrics beyond accuracy, such as precision, recall, F1-score, and ROC AUC.