Feature Scaling (Normalization vs. Standardization)
Feature scaling is one of the most essential preprocessing steps in machine learning. Algorithms that rely on distances or gradients (like KNN, SVM, or gradient-descent-based models) can perform poorly if features are on different scales.
In this tutorial, we’ll explore two core techniques: Normalization and Standardization, understand when to use each, and implement them with Python.
Why Feature Scaling Is Needed
Imagine you have two features:
- Age (values between 20 and 60)
- Income (values in the thousands)
Many ML algorithms (like K-Means, Logistic Regression, SVM, and KNN) treat larger-scale features as more important unless the features are scaled properly.
Without scaling:
- Distance-based models become biased toward the larger-scale feature.
- Gradient descent may converge slowly, because unscaled features produce an elongated loss surface.
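To see the distance bias concretely, here is a quick sketch (the two sample points are made up for illustration): the Euclidean distance between two people is dominated by the income difference, even though the age difference is large in relative terms.
import numpy as np

# Two people: very different ages, mildly different incomes (illustrative values)
a = np.array([20, 50000])   # [age, income]
b = np.array([60, 51000])

# The raw distance is dominated by income's larger scale
print(np.linalg.norm(a - b))   # ~1000.8; the 40-year age gap barely registers

# After min-max scaling each feature to [0, 1], both features contribute equally
a_scaled = np.array([0.0, 0.0])
b_scaled = np.array([1.0, 1.0])
print(np.linalg.norm(a_scaled - b_scaled))   # ~1.414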
Normalization (Min-Max Scaling)
Definition:
Rescales the feature to a fixed range, usually [0, 1].
Formula:
X_scaled = (X - X_min) / (X_max - X_min)
For example, with Age ranging from 20 to 40, the value 25 maps to (25 - 20) / (40 - 20) = 0.25, which matches the output below.
Use When:
- You don’t know the distribution of your data.
- You need to bound data between a specific range.
- You're using models like KNN, Neural Networks, or K-Means.
Python Example:
from sklearn.preprocessing import MinMaxScaler
import pandas as pd
# Sample data
df = pd.DataFrame({
'Age': [20, 25, 30, 35, 40],
'Income': [20000, 40000, 60000, 80000, 100000]
})
scaler = MinMaxScaler()
normalized = scaler.fit_transform(df)
normalized_df = pd.DataFrame(normalized, columns=df.columns)
print(normalized_df)
Output:
Age Income
0 0.00 0.00
1 0.25 0.25
2 0.50 0.50
3 0.75 0.75
4 1.00 1.00
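One practical caveat: fit the scaler on the training data only, then apply the same transformation to the test data, so that test-set information doesn't leak into the scaling parameters. A minimal sketch, assuming the df from above and a hypothetical train/test split:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical split of the df defined above
train, test = train_test_split(df, test_size=0.4, random_state=42)

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # learn min/max from training data only
test_scaled = scaler.transform(test)        # reuse the training min/max
# Note: test values outside the training range will fall outside [0, 1]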
Standardization (Z-Score Scaling)
Definition:
Transforms data to have zero mean and unit variance.
Formula:
z = (x - μ) / σ
Where:
- μ: the mean of the feature
- σ: the (population) standard deviation of the feature
For example, Age = 40 with mean 30 and standard deviation ≈ 7.07 gives (40 - 30) / 7.07 ≈ 1.41, matching the output below.
Use When:
- Your data follows (roughly) a Gaussian distribution.
- You're using algorithms that assume standardized inputs, such as Logistic Regression, Linear Regression, or PCA.
Python Example:
from sklearn.preprocessing import StandardScaler
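# Reusing the df DataFrame from the normalization example above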
scaler = StandardScaler()
standardized = scaler.fit_transform(df)
standardized_df = pd.DataFrame(standardized, columns=df.columns)
print(standardized_df)
Output:
Age Income
0 -1.414214 -1.414214
1 -0.707107 -0.707107
2 0.000000 0.000000
3 0.707107 0.707107
4 1.414214 1.414214
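You can verify the zero-mean, unit-variance property directly. Note that StandardScaler divides by the population standard deviation (ddof = 0), which is also NumPy's default:
import numpy as np

print(standardized.mean(axis=0))  # per-column means, ~[0. 0.] up to floating point
print(standardized.std(axis=0))   # per-column stds, [1. 1.]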
Normalization vs. Standardization: Summary
| Feature | Normalization | Standardization |
|---|---|---|
| Range | [0, 1] | Mean = 0, Std = 1 |
| Sensitive to outliers | Yes | Less |
| When to use | KNN, NN, distance-based models | Linear models, PCA, Gaussian assumptions |
| Requires normal distribution | No | Preferably yes |
Which to Choose?
- Use Normalization when you need data bounded to a fixed range or when using models sensitive to feature magnitude.
- Use Standardization for linear models and when your data follows a normal distribution.
- Use RobustScaler if your data has many outliers (see the brief sketch below).
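For reference, here is a minimal RobustScaler sketch (the outlier income value is made up for illustration); it centers each feature on its median and scales by the interquartile range, so extreme values have far less influence than with MinMaxScaler:
from sklearn.preprocessing import RobustScaler
import pandas as pd

# Same columns as before, but with one extreme income (illustrative values)
df_out = pd.DataFrame({
    'Age': [20, 25, 30, 35, 40],
    'Income': [20000, 40000, 60000, 80000, 1000000]  # last value is an outlier
})

robust = RobustScaler().fit_transform(df_out)  # (x - median) / IQR, per column
print(pd.DataFrame(robust, columns=df_out.columns))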