Feature Scaling (Normalization vs. Standardization)
Feature scaling is one of the most essential preprocessing steps in machine learning. Algorithms that rely on distances or gradients (such as KNN, SVM, or models trained with gradient descent) can perform poorly if features are on different scales.
In this tutorial, we’ll explore two core techniques, Normalization and Standardization, understand when to use each, and implement them in Python.
Why Feature Scaling Is Needed
Imagine you have two features:
- Age (values between 20–60)
- Income (values in the thousands)
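Many ML algorithms (like K-Means, SVM, KNN, and regularized Logistic Regression) effectively give more weight to features with larger numeric ranges unless the features are scaled.
Without scaling:
- Distance-based models become biased toward the feature with the largest range.
- Gradient descent may converge slowly or, with a poorly suited learning rate, fail to converge.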
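To make the bias concrete, here is a minimal sketch that measures how much of a raw Euclidean distance comes from each feature. The two people and their values are hypothetical, chosen only to match the Age and Income ranges described above.

```python
import math

# Two hypothetical people described as (Age, Income)
person_a = (25, 40000)
person_b = (60, 41000)

# Raw Euclidean distance, as a distance-based model would compute it
raw_distance = math.dist(person_a, person_b)

# Fraction of the squared distance contributed by the Income feature
age_sq = (person_a[0] - person_b[0]) ** 2
income_sq = (person_a[1] - person_b[1]) ** 2
income_share = income_sq / (age_sq + income_sq)

print(f"Raw distance: {raw_distance:.2f}")
print(f"Income share of squared distance: {income_share:.4f}")
```

Even though the age gap (35 years) is large relative to the Age range and the income gap (1,000) is small relative to the Income range, Income accounts for more than 99% of the squared distance here.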
Normalization (Min-Max Scaling)
Definition:
Rescales the feature to a fixed range, usually [0, 1].
Formula:
$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
Use When:
- You don’t know the distribution of your data.
- You need to bound data between a specific range.
- You're using models like KNN, Neural Networks, or K-Means.
Python Example:
from sklearn.preprocessing import MinMaxScaler
import pandas as pd
# Sample data
df = pd.DataFrame({
'Age': [20, 25, 30, 35, 40],
'Income': [20000, 40000, 60000, 80000, 100000]
})
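# Fit the scaler on the data and rescale each column to [0, 1]
scaler = MinMaxScaler()
normalized = scaler.fit_transform(df)
# fit_transform returns a NumPy array, so wrap it back into a DataFrame
normalized_df = pd.DataFrame(normalized, columns=df.columns)
print(normalized_df)
Output: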
Age Income
0 0.00 0.00
1 0.25 0.25
2 0.50 0.50
3 0.75 0.75
4 1.00 1.00
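As a rough cross-check, min-max scaling can also be computed by hand: subtract the column minimum and divide by the column range. The sketch below does this for the Income column with NumPy; the variable names are illustrative, and it assumes the column is not constant (a constant column would cause a division by zero).

```python
import numpy as np

# Income values from the sample data
income = np.array([20000, 40000, 60000, 80000, 100000], dtype=float)

# Min-max scaling: (x - min) / (max - min)
income_scaled = (income - income.min()) / (income.max() - income.min())

print(income_scaled)
```

The values should match the Income column in the output above.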
Standardization (Z-Score Scaling)
Definition:
Transforms data to have zero mean and unit variance.
Formula:
$$z = \frac{x - \mu}{\sigma}$$
Where:
- $\mu$: mean of the feature
- $\sigma$: standard deviation of the feature
Use When:
- Your data roughly follows a Gaussian distribution.
- The algorithm works best with centered, comparably scaled features, like Logistic Regression, Linear Regression, or PCA.
Python Example:
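from sklearn.preprocessing import StandardScaler
# Reuses df and pd from the normalization example above
scaler = StandardScaler()
# Center each column to mean 0 and scale it to unit variance
standardized = scaler.fit_transform(df)
standardized_df = pd.DataFrame(standardized, columns=df.columns)
print(standardized_df)
Output: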
Age Income
0 -1.414214 -1.414214
1 -0.707107 -0.707107
2 0.000000 0.000000
3 0.707107 0.707107
4 1.414214 1.414214
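The z-scores can be reproduced by hand as (x - mean) / standard deviation. One detail worth knowing: StandardScaler divides by the population standard deviation (ddof=0), whereas pandas' `Series.std()` defaults to the sample standard deviation (ddof=1), so a manual pandas computation can give slightly different numbers. A minimal sketch for the Age column, with illustrative variable names:

```python
import numpy as np

# Age values from the sample data
age = np.array([20, 25, 30, 35, 40], dtype=float)

# Population standard deviation (ddof=0), matching StandardScaler
z_population = (age - age.mean()) / age.std(ddof=0)

# Sample standard deviation (ddof=1), the pandas default
z_sample = (age - age.mean()) / age.std(ddof=1)

print(np.round(z_population, 6))
print(np.round(z_sample, 6))
```

The first array should match the Age column in the output above; the second is slightly smaller in magnitude because the sample standard deviation is larger.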
Normalization vs. Standardization: Summary
Aspect | Normalization | Standardization |
---|---|---|
Output range | [0, 1] | Unbounded (mean = 0, std = 1) |
Sensitive to outliers | Yes | Less |
When to use | KNN, neural networks, distance-based models | Linear models, PCA, Gaussian assumptions |
Requires normal distribution | No | Preferably yes |
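To see the outlier row in practice, the sketch below appends one extreme, hypothetical income (1,000,000) to the sample values and applies both scalers. It is only an illustration on made-up data, not a benchmark.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Sample incomes plus one hypothetical extreme value
income = np.array(
    [[20000], [40000], [60000], [80000], [100000], [1000000]], dtype=float
)

minmax = MinMaxScaler().fit_transform(income).ravel()
standard = StandardScaler().fit_transform(income).ravel()

# Min-max: the outlier maps to 1 and the typical values are squeezed near 0
print(np.round(minmax, 3))
# Standardization: the outlier still shifts the mean and std, but the
# result is not forced into a fixed [0, 1] interval
print(np.round(standard, 3))
```

With min-max scaling, the five typical incomes end up below 0.1 because the single outlier defines the top of the range. Standardization is also affected, since the outlier inflates both the mean and the standard deviation, which is why a scaler built on the median and interquartile range can be a better fit when outliers are common.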
Which to Choose?
- Use Normalization if you need bounded data or when using models sensitive to magnitude.
- Use Standardization for linear models and when your data follows a normal distribution.
- Use RobustScaler (not covered here) if your data has many outliers.