Feature Scaling (Normalization vs. Standardization)


Feature scaling is an essential preprocessing step in machine learning. Algorithms that rely on distances or gradients (such as KNN, SVM, or gradient-descent-based models) can perform poorly when features are on different scales.

In this tutorial, we’ll explore two core techniques: Normalization and Standardization, understand when to use each, and implement them with Python.


Why Feature Scaling Is Needed

Imagine you have two features:

  • Age (values between 20–60)
  • Income (values in the thousands)

Many ML algorithms (such as K-Means, Logistic Regression, SVM, and KNN) effectively treat larger-scale features as more important unless the data is scaled properly.

Without scaling:

  • Distance-based models become biased.
  • Gradient descent may converge slowly or fail to converge.
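
To see the first point concretely, here is a minimal sketch (with made-up Age/Income values) of how an unscaled Euclidean distance is dominated by the larger-scale feature:

```python
import numpy as np

# Two people: a 35-year age gap vs. a 1,000-unit income gap
a = np.array([25, 40000])   # [Age, Income]
b = np.array([60, 41000])

dist = np.linalg.norm(a - b)
print(round(dist, 2))  # ~1000.61: the Income difference swamps the Age difference
```

Even though 35 years is a huge age gap, the distance is essentially just the income difference, which is why distance-based models need scaled inputs.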

Normalization (Min-Max Scaling)

Definition:

Rescales the feature to a fixed range, usually [0, 1].

Formula:

    x_scaled = (x - x_min) / (x_max - x_min)

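As a quick sanity check, min-max scaling can be computed by hand for the Age values used in the example below:

```python
ages = [20, 25, 30, 35, 40]
lo, hi = min(ages), max(ages)                 # x_min = 20, x_max = 40
scaled = [(x - lo) / (hi - lo) for x in ages]
print(scaled)  # [0.0, 0.25, 0.5, 0.75, 1.0]
```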
Use When:

  • You don’t know the distribution of your data.
  • You need to bound data between a specific range.
  • You're using models like KNN, Neural Networks, or K-Means.

Python Example:

from sklearn.preprocessing import MinMaxScaler
import pandas as pd
# Sample data
df = pd.DataFrame({
    'Age': [20, 25, 30, 35, 40],
    'Income': [20000, 40000, 60000, 80000, 100000]
})
scaler = MinMaxScaler()
normalized = scaler.fit_transform(df)
normalized_df = pd.DataFrame(normalized, columns=df.columns)
print(normalized_df)

Output:

    Age  Income
0  0.00    0.00
1  0.25    0.25
2  0.50    0.50
3  0.75    0.75
4  1.00    1.00
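
In practice you usually fit the scaler on training data only and then apply the same learned min/max to new data with transform. A minimal sketch (the new row is made up for illustration):

```python
from sklearn.preprocessing import MinMaxScaler
import pandas as pd

train = pd.DataFrame({
    'Age': [20, 25, 30, 35, 40],
    'Income': [20000, 40000, 60000, 80000, 100000]
})
new = pd.DataFrame({'Age': [28], 'Income': [50000]})

scaler = MinMaxScaler()
scaler.fit(train)              # learn min/max from training data only
print(scaler.transform(new))   # [[0.4   0.375]]
```

Calling fit_transform on the test set instead would leak test-set statistics into the scaler.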

Standardization (Z-Score Scaling)

Definition:

Transforms data to have zero mean and unit variance.

Formula:

    z = (x - μ) / σ

Where:

  • μ: mean
  • σ: standard deviation

Use When:

  • Data follows a Gaussian distribution
  • Algorithms assume standardized data, like Logistic Regression, Linear Regression, or PCA

Python Example:

from sklearn.preprocessing import StandardScaler
# Reuses the df defined in the normalization example above
scaler = StandardScaler()
standardized = scaler.fit_transform(df)
standardized_df = pd.DataFrame(standardized, columns=df.columns)
print(standardized_df)

Output:

        Age    Income
0 -1.414214 -1.414214
1 -0.707107 -0.707107
2  0.000000  0.000000
3  0.707107  0.707107
4  1.414214  1.414214
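
You can verify the zero-mean, unit-variance property directly on the transformed array:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    'Age': [20, 25, 30, 35, 40],
    'Income': [20000, 40000, 60000, 80000, 100000]
})
standardized = StandardScaler().fit_transform(df)

print(standardized.mean(axis=0))  # ~[0. 0.]
print(standardized.std(axis=0))   # [1. 1.] (StandardScaler uses the population std)
```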

Normalization vs. Standardization: Summary

Feature                        Normalization                     Standardization
Range                          [0, 1]                            Mean = 0, Std = 1
Sensitive to outliers          Yes                               Less
When to use                    KNN, NN, distance-based models    Linear models, PCA, Gaussian assumptions
Requires normal distribution   No                                Preferably yes

Which to Choose?

  • Use Normalization if you need bounded data or when using models sensitive to magnitude.
  • Use Standardization for linear models and when your data follows a normal distribution.
  • Use RobustScaler (not covered here) if your data has many outliers.
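
Whichever scaler you choose, a common scikit-learn idiom is to wrap it in a Pipeline with the model, so the scaler is fit on training data only. A minimal sketch with made-up data:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Scaling inside a Pipeline is fit during model.fit, so test-set
# statistics never leak into the scaler.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))

X_train = [[20, 20000], [25, 40000], [35, 80000], [40, 100000]]
y_train = [0, 0, 1, 1]
model.fit(X_train, y_train)
print(model.predict([[38, 90000]]))  # [1]
```

Swapping StandardScaler for MinMaxScaler (or RobustScaler) requires changing only that one line.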