Handling Imbalanced Data (SMOTE, Class Weights)

In many real-world machine learning problems, especially in areas like fraud detection, medical diagnosis, or churn prediction, datasets have a highly imbalanced class distribution. One class significantly outnumbers the others, so a model can score well by favoring the majority class while failing to learn the minority class.

To address this issue, two of the most commonly used techniques are:

  1. Using Class Weights – which adjusts the training loss so that misclassifying minority-class samples is penalized more heavily.
  2. Synthetic Oversampling (SMOTE) – which creates synthetic examples of the minority class to balance the training set.

In this tutorial, we will explore both of these techniques using a complete example in Python.


Step 1: Simulate an Imbalanced Dataset

We will use make_classification from sklearn.datasets to create a binary classification dataset with an imbalance.

from sklearn.datasets import make_classification
from collections import Counter

# Create an imbalanced dataset
X, y = make_classification(n_samples=1000,
                           n_features=20,
                           n_informative=2,
                           n_redundant=10,
                           n_clusters_per_class=1,
                           weights=[0.9, 0.1],
                           flip_y=0,
                           random_state=1)

print("Original class distribution:", Counter(y))

Output -

Original class distribution: Counter({np.int64(0): 900, np.int64(1): 100})

Step 2: Train-Test Split

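Split the data into training and test sets. Passing stratify=y keeps the 9:1 class ratio in both splits, so the test set retains enough minority samples to evaluate on.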

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.3, 
                                                    stratify=y,
                                                    random_state=42)

Step 3: Using Class Weights

Many scikit-learn estimators, including LogisticRegression, RandomForestClassifier, and SVC, accept a parameter called class_weight. Setting class_weight='balanced' tells the model to weight each class inversely proportional to its frequency in the training data.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

model = LogisticRegression(class_weight='balanced', max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Classification Report with Class Weights:")
print(classification_report(y_test, y_pred))

Output - 

Classification Report with Class Weights:
              precision    recall  f1-score   support

           0       0.98      0.94      0.96       270
           1       0.63      0.87      0.73        30

    accuracy                           0.94       300
   macro avg       0.81      0.91      0.85       300
weighted avg       0.95      0.94      0.94       300
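
To make the 'balanced' setting less of a black box, the sketch below reproduces the heuristic that scikit-learn documents for it, n_samples / (n_classes * count_of_class), in plain Python. The helper name balanced_class_weights and the toy label list are illustrative assumptions; the list simply mimics the 630/70 training split used in this tutorial.

```python
from collections import Counter


def balanced_class_weights(y):
    """Return {label: weight} using n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n_samples = len(y)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}


# Hypothetical labels with the same 630/70 split as the training set above
y_train_like = [0] * 630 + [1] * 70
weights = balanced_class_weights(y_train_like)
print(weights)
```

With a 9:1 imbalance, the minority class ends up with a weight nine times larger than the majority class, which is why the custom dictionary {0: 1, 1: 9} behaves similarly up to a constant scale.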

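Alternatively, you can specify custom weights as a dictionary that maps each class label to its weight. Here the 1:9 weights mirror the 9:1 class imbalance: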

model = LogisticRegression(class_weight={0: 1, 1: 9}, max_iter=1000)
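
As a rough illustration of what a class weight does during training, the sketch below scales each sample's log loss by the weight of its true class. This is a simplified stand-in, not scikit-learn's actual objective (which also includes regularization); the function name weighted_log_loss and the probabilities are assumptions chosen for the example.

```python
import math


def weighted_log_loss(y_true, p_pos, class_weight):
    """Sum of per-sample log losses, each scaled by its true class's weight."""
    total = 0.0
    for y, p in zip(y_true, p_pos):
        # Probability the model assigned to the true class
        p_true = p if y == 1 else 1.0 - p
        total += class_weight[y] * -math.log(p_true)
    return total


weights = {0: 1, 1: 9}

# Two equally confident mistakes: a positive scored 0.2, a negative scored 0.8
miss_positive = weighted_log_loss([1], [0.2], weights)
miss_negative = weighted_log_loss([0], [0.8], weights)

# The minority-class mistake costs about nine times as much
ratio = miss_positive / miss_negative
print(ratio)
```

Because errors on class 1 cost more, the optimizer shifts the decision boundary toward catching more positives, which typically raises recall at some expense of precision.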

Step 4: Using SMOTE for Oversampling

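SMOTE (Synthetic Minority Over-sampling Technique) generates synthetic data points for the minority class rather than simply duplicating existing ones. Each synthetic point is interpolated between a minority sample and one of its nearest minority-class neighbors. Apply it only to the training set so that the test set still reflects the original class distribution.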
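
The core interpolation idea can be sketched in a few lines of plain Python. This is a minimal illustration, not the imbalanced-learn implementation: the function name smote_sketch, the toy points, and the choice to pick base samples at random are all assumptions made for the example.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbors than other points
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbors of the base point (excluding itself)
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        # Random position on the segment between base and neighbor
        lam = rng.random()
        synthetic.append([b + lam * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Hypothetical 2-D minority samples
X_minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
X_new = smote_sketch(X_minority, n_new=6, k=2)
print(len(X_new), "synthetic points")
```

Every synthetic point lies on a line segment between two real minority samples, so the new data stays inside the region the minority class already occupies.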

Install imbalanced-learn (if not already installed)

pip install imbalanced-learn

Apply SMOTE

from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

print("After SMOTE:", Counter(y_train_smote))

Output -

After SMOTE: Counter({np.int64(0): 630, np.int64(1): 630})

Train Model on SMOTE Data

model = LogisticRegression(max_iter=1000)
model.fit(X_train_smote, y_train_smote)
y_pred = model.predict(X_test)

print("Classification Report with SMOTE:")
print(classification_report(y_test, y_pred))

Output -

Classification Report with SMOTE:
              precision    recall  f1-score   support

           0       0.98      0.96      0.97       270
           1       0.70      0.87      0.78        30

    accuracy                           0.95       300
   macro avg       0.84      0.91      0.87       300
weighted avg       0.96      0.95      0.95       300

Step 5: Comparing Evaluation Metrics

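For imbalanced datasets, accuracy is often misleading: on this dataset, a model that always predicts class 0 would reach 90% accuracy while detecting no positives. Instead, use:

  • Precision: The fraction of predicted positives that are truly positive.
  • Recall: The fraction of actual positives that were correctly predicted.
  • F1-Score: The harmonic mean of precision and recall.
  • ROC AUC Score: The model's overall ability to rank positives above negatives, computed from predicted probabilities rather than hard labels.

The snippet below scores the most recently trained model, which is the SMOTE-trained classifier from Step 4 if the steps are run in order:

from sklearn.metrics import roc_auc_score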

y_proba = model.predict_proba(X_test)[:, 1]
roc_score = roc_auc_score(y_test, y_proba)

print("ROC AUC Score:", roc_score)

Output - 

ROC AUC Score: 0.9669135802469137
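
To see how precision, recall, and F1 relate to raw prediction counts, and why accuracy alone can mislead, here is a small dependency-free sketch. The helper binary_metrics and both sets of predictions are hypothetical; in practice classification_report computes the same quantities.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# 90 negatives followed by 10 positives
y_true = [0] * 90 + [1] * 10

# A "model" that always predicts the majority class
always = binary_metrics(y_true, [0] * 100)

# A model that finds 8 of 10 positives at the cost of 4 false alarms
better = binary_metrics(y_true, [0] * 86 + [1] * 4 + [1] * 8 + [0] * 2)

print(always)
print(better)
```

The always-majority predictor scores 0.9 accuracy with zero recall, while the second model has only slightly higher accuracy but actually detects most positives.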

Step 6: Optional – Undersampling

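Undersampling balances the classes by reducing the size of the majority class instead of growing the minority class. It is fast, but it discards data: with the default settings used here, the training set keeps only as many majority samples as there are minority samples.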

from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(random_state=42)
X_train_rus, y_train_rus = rus.fit_resample(X_train, y_train)

print("After Undersampling:", Counter(y_train_rus))

model = LogisticRegression(max_iter=1000)
model.fit(X_train_rus, y_train_rus)
y_pred = model.predict(X_test)

print("Classification Report with Undersampling:")
print(classification_report(y_test, y_pred))

Output -

Classification Report with Undersampling:
              precision    recall  f1-score   support

           0       0.98      0.91      0.95       270
           1       0.52      0.83      0.64        30

    accuracy                           0.91       300
   macro avg       0.75      0.87      0.79       300
weighted avg       0.93      0.91      0.92       300
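
For intuition, random undersampling amounts to keeping a random subset of each class no larger than the smallest class. The sketch below shows that idea in plain Python; the function random_undersample and the toy data are assumptions for illustration and do not reproduce RandomUnderSampler's exact sampling.

```python
import random
from collections import Counter


def random_undersample(X, y, seed=0):
    """Keep a random subset of every class, sized to match the smallest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    n_keep = min(counts.values())
    keep = []
    for label in counts:
        indices = [i for i, t in enumerate(y) if t == label]
        # Sample without replacement so no row is duplicated
        keep.extend(rng.sample(indices, n_keep))
    keep.sort()  # preserve the original row order
    return [X[i] for i in keep], [y[i] for i in keep]


# Hypothetical dataset: 90 majority rows, 10 minority rows
X_toy = [[i] for i in range(100)]
y_toy = [0] * 90 + [1] * 10

X_small, y_small = random_undersample(X_toy, y_toy)
print(Counter(y_small))
```

Here 80 of the 90 majority rows are thrown away, which illustrates the main trade-off: balance is achieved quickly, but the model sees far less data.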

Conclusion

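Handling imbalanced datasets is critical to building machine learning models that generalize well, especially when the minority class carries high importance (like detecting fraud or disease). The approaches covered here are:

  • Adjusting class weights – simple and often effective, with no change to the data.
  • Oversampling using SMOTE – improves balance by adding synthetic minority examples to the training set.
  • Random undersampling – balances the classes by discarding majority samples; in this example it gave the lowest minority-class precision (0.52).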

You should also evaluate models with metrics beyond accuracy, such as precision, recall, F1-score, and ROC AUC.