Feature scaling is one of the most important preprocessing steps in machine learning. Algorithms that rely on distances or gradients (such as KNN, SVM, or gradient-descent-based models) can perform poorly when features are on very different scales.
In this tutorial, we’ll explore two core techniques: Normalization and Standardization, understand when to use each, and implement them with Python.
Imagine you have two features: Age, ranging roughly from 20 to 40, and Income, ranging from 20,000 to 100,000.
Many ML algorithms (like K-Means, Logistic Regression, SVM, and KNN) treat larger-scale features as more important unless the data is scaled properly.
Without scaling, Income dominates any distance or gradient computation simply because its values are orders of magnitude larger than Age's.
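To make this concrete, here is a minimal sketch (using two hypothetical customers, not data from this tutorial) showing how much of the raw Euclidean distance comes from Income alone:

```python
import numpy as np

# Two hypothetical customers: [Age, Income]
a = np.array([25.0, 40000.0])
b = np.array([40.0, 42000.0])

# Raw Euclidean distance between the two points
raw_dist = np.linalg.norm(a - b)

# Contribution of each feature to the squared distance
age_part = (a[0] - b[0]) ** 2      # 15^2 = 225
income_part = (a[1] - b[1]) ** 2   # 2000^2 = 4,000,000

print(raw_dist)                                # ~2000.06
print(income_part / (age_part + income_part))  # Income contributes >99.9%
```

Even though a 15-year age gap is arguably more meaningful than a 2,000 income gap, Income accounts for virtually the entire distance.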
Normalization (min-max scaling) rescales each feature to a fixed range, usually [0, 1].
Formula:
X_norm = (X − X_min) / (X_max − X_min)
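As a quick sanity check, the min-max formula can be applied by hand with NumPy (a minimal sketch using the Age values from the sample data below):

```python
import numpy as np

# Age values from the sample data
age = np.array([20, 25, 30, 35, 40], dtype=float)

# Min-max normalization: (X - X_min) / (X_max - X_min)
age_norm = (age - age.min()) / (age.max() - age.min())

print(age_norm)  # 0, 0.25, 0.5, 0.75, 1
```

This matches what `MinMaxScaler` produces in the example that follows.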
from sklearn.preprocessing import MinMaxScaler
import pandas as pd
# Sample data
df = pd.DataFrame({
'Age': [20, 25, 30, 35, 40],
'Income': [20000, 40000, 60000, 80000, 100000]
})
scaler = MinMaxScaler()
normalized = scaler.fit_transform(df)
normalized_df = pd.DataFrame(normalized, columns=df.columns)
print(normalized_df)

Output:
Age Income
0 0.00 0.00
1 0.25 0.25
2 0.50 0.50
3 0.75 0.75
4 1.00 1.00

Standardization (z-score scaling) transforms data to have zero mean and unit variance.
Formula:
Z = (X − μ) / σ
Where μ is the mean of the feature and σ is its standard deviation.
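The same computation can be done by hand (a minimal sketch using the Age values from the sample data; note that sklearn's `StandardScaler` uses the population standard deviation, which is also NumPy's default):

```python
import numpy as np

# Age values from the sample data
age = np.array([20, 25, 30, 35, 40], dtype=float)

# Z-score: (X - mean) / std, with the population std (ddof=0),
# matching sklearn's StandardScaler
age_std = (age - age.mean()) / age.std()

print(age_std)  # -1.41421356, -0.70710678, 0, 0.70710678, 1.41421356
```

These values match the `StandardScaler` output shown below.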
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
standardized = scaler.fit_transform(df)
standardized_df = pd.DataFrame(standardized, columns=df.columns)
print(standardized_df)

Output:
Age Income
0 -1.414214 -1.414214
1 -0.707107 -0.707107
2 0.000000 0.000000
3 0.707107 0.707107
4 1.414214 1.414214

| Feature | Normalization | Standardization |
|---|---|---|
| Resulting scale | Fixed range [0, 1] | Mean = 0, Std = 1 |
| Sensitive to outliers | Yes | Less |
| When to use | KNN, NN, distance-based models | Linear models, PCA, Gaussian assumptions |
| Requires normal distribution | No | Preferably yes |
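To illustrate the outlier row above, here is a small sketch using a hypothetical extreme income of 1,000,000: the outlier pins X_max, so min-max scaling crushes the remaining values into a narrow slice near 0, while standardization has no fixed bounds, so the outlier simply receives a large z-score.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Incomes with one extreme (hypothetical) outlier
income = np.array([[20000], [40000], [60000], [80000], [1000000]], dtype=float)

minmax = MinMaxScaler().fit_transform(income)
zscore = StandardScaler().fit_transform(income)

# Min-max: the first four values are squashed into roughly [0, 0.06],
# because the outlier defines X_max
print(minmax.ravel())  # ~0, 0.02, 0.04, 0.06, 1

# Z-scores still have mean 0 and std 1; the outlier sits at a large
# positive z rather than compressing the range to [0, 1]
print(zscore.ravel())
```

If outliers are a serious concern, sklearn also provides `RobustScaler`, which scales using the median and interquartile range instead.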