Handling Missing Data in ML
Missing data is one of the most common problems in real-world datasets. Whether it's due to manual entry errors, sensor failures, or system issues, missing values can negatively impact model accuracy and reliability.
This tutorial covers various techniques to detect, analyze, and handle missing data, with Python code examples using pandas and scikit-learn.
Why Missing Data Matters
Missing values can:
- Bias your model
- Cause errors during training or prediction
- Lead to inaccurate conclusions
Handling them properly ensures your model learns from clean, representative data.
Step 1: Detecting Missing Values
In Python, we typically use pandas to inspect missing values.
import pandas as pd
# Sample dataset
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', None],
'Age': [25, 30, None, 45, 28],
'Salary': [50000, None, 60000, 80000, 75000]
}
df = pd.DataFrame(data)
# Check for missing values
print(df.isnull())
print("\nMissing values per column:\n", df.isnull().sum())
Output:
    Name    Age  Salary
0  False  False   False
1  False  False    True
2  False   True   False
3  False  False   False
4   True  False   False

Missing values per column:
Name      1
Age       1
Salary    1
dtype: int64
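Beyond raw counts, it often helps to see what share of each column is missing and which rows are affected. The snippet below is a minimal sketch of that using the same sample data; the variable names `missing_pct`, `incomplete_rows`, and `complete_count` are illustrative and not part of the original example.

```python
import pandas as pd

# Same sample dataset as above
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", None],
    "Age": [25, 30, None, 45, 28],
    "Salary": [50000, None, 60000, 80000, 75000],
})

# Percentage of missing values in each column
missing_pct = df.isnull().mean() * 100

# Rows that contain at least one missing value
incomplete_rows = df[df.isnull().any(axis=1)]

# Number of rows with no missing values at all
complete_count = len(df.dropna())

print(missing_pct)
print(incomplete_rows)
print("Complete rows:", complete_count)
```

In this sample, each column is 20% missing, and only two of the five rows are fully populated, which already hints that dropping rows would be costly here.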
Step 2: Handling Missing Values
There are multiple strategies depending on context and type of data.
1. Dropping Missing Data
Best when only a small share of values is missing and the missingness is random, so the remaining rows stay representative.
# Drop rows with any missing values
cleaned_df = df.dropna()
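You can also drop columns instead:
# Returns a new DataFrame; in the sample data every column has a missing value, so all columns would be dropped
cols_dropped_df = df.dropna(axis=1)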
Use this only when you're sure the data removed is not critical.
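Dropping does not have to be all-or-nothing. As a rough sketch, the `subset` and `thresh` parameters of `DataFrame.dropna` let you limit what gets removed; the threshold of 4 below is an arbitrary choice for this sample, not a recommendation.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", None],
    "Age": [25, 30, None, 45, 28],
    "Salary": [50000, None, 60000, 80000, 75000],
})

# Drop a row only when Salary is missing; gaps in other columns are kept
has_salary = df.dropna(subset=["Salary"])

# Dropping every column with any missing value removes all three columns here
all_cols_dropped = df.dropna(axis=1)

# Keep columns that have at least 4 non-missing values (assumed threshold)
mostly_complete_cols = df.dropna(axis=1, thresh=4)

print(has_salary.shape)            # (4, 3)
print(all_cols_dropped.shape)      # (5, 0)
print(mostly_complete_cols.shape)  # (5, 3)
```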
2. Imputing Missing Values
More common than dropping, and useful when data is missing at random, because it keeps every row.
a. Mean/Median Imputation (for numeric columns)
# Fill missing Age with mean (assign back; inplace on a column selection is deprecated in recent pandas)
df['Age'] = df['Age'].fillna(df['Age'].mean())
# Fill missing Salary with median
df['Salary'] = df['Salary'].fillna(df['Salary'].median())
b. Mode Imputation (for categorical columns)
# Fill missing names with mode (most frequent); mode() can return several values, [0] takes the first
df['Name'] = df['Name'].fillna(df['Name'].mode()[0])
3. Using scikit-learn’s Imputer
from sklearn.impute import SimpleImputer
# Numeric imputer
imputer = SimpleImputer(strategy='mean')
df[['Age', 'Salary']] = imputer.fit_transform(df[['Age', 'Salary']])
You can also use strategy='median', strategy='most_frequent', or strategy='constant'.
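One practical benefit of SimpleImputer over fillna is that the fitted statistics can be reused on new data. The sketch below assumes a hypothetical train/test split (`X_train`, `X_test`) with Age and Salary columns: the medians are learned from the training rows only and then applied to the test rows.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical split; columns are Age and Salary
X_train = np.array([
    [25.0, 50000.0],
    [30.0, np.nan],
    [np.nan, 60000.0],
    [45.0, 80000.0],
])
X_test = np.array([
    [np.nan, 75000.0],
    [28.0, np.nan],
])

imputer = SimpleImputer(strategy="median")

# Learn the column medians from the training data and fill its gaps
X_train_filled = imputer.fit_transform(X_train)

# Reuse the training medians on the test data (no refitting)
X_test_filled = imputer.transform(X_test)

print(imputer.statistics_)  # [   30. 60000.]
print(X_test_filled)
```

Fitting on the training data alone keeps information from the test set out of the imputed values.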
4. Advanced: KNN Imputation / Iterative Imputer
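Used when you want a more context-aware imputation. KNNImputer fills each missing value with the mean of that feature across the n_neighbors most similar rows.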
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=2)
df[['Age', 'Salary']] = imputer.fit_transform(df[['Age', 'Salary']])
Suitable for datasets with correlated features.
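The heading above also mentions the Iterative Imputer. As a minimal sketch, scikit-learn's IterativeImputer models each feature with missing values as a function of the other features. It is still marked experimental, so it needs the explicit `enable_iterative_imputer` import; the `max_iter` and `random_state` values here are illustrative assumptions.

```python
import numpy as np
# IterativeImputer is experimental and must be enabled explicitly
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Columns are Age and Salary, as in the sample dataset
X = np.array([
    [25.0, 50000.0],
    [30.0, np.nan],
    [np.nan, 60000.0],
    [45.0, 80000.0],
    [28.0, 75000.0],
])

imputer = IterativeImputer(max_iter=10, random_state=0)

# Each missing entry is predicted from the other column
X_filled = imputer.fit_transform(X)

print(X_filled)
```

With only five rows the estimates are not meaningful; the approach tends to pay off on larger datasets where features are genuinely related.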
Tips for Choosing a Strategy
Situation | Recommended Strategy |
---|---|
< 5% data missing | Drop rows |
Numeric missing | Mean / Median / KNN Imputer |
Categorical missing | Mode or Most Frequent |
Sequential/Time data | Use forward/backward fill (ffill / bfill) |
Critical feature | Consider domain knowledge, or leave as NaN for models that handle missing values natively |
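For the sequential case in the table, here is a short sketch of forward and backward fill on a made-up daily series. The `readings` values and dates are invented for illustration.

```python
import pandas as pd

# Hypothetical daily sensor readings with gaps
readings = pd.Series(
    [20.5, None, None, 21.4, None],
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
)

# Forward fill: carry the last known value forward
forward = readings.ffill()

# Backward fill: pull the next known value backward
backward = readings.bfill()

print(forward.tolist())   # [20.5, 20.5, 20.5, 21.4, 21.4]
print(backward.tolist())  # last value stays NaN: nothing after it to copy
```

Note that backward fill leaves a trailing gap unfilled, and forward fill does the same for a leading gap, so the two are sometimes combined.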
Conclusion
Handling missing data is an essential step in your ML preprocessing pipeline. Always:
- Analyze the pattern of missingness
- Choose a strategy based on data type and domain
- Use automated imputation methods when needed
Clean data → better models → better results.