Handling Missing Data in ML
Missing data is one of the most common problems in real-world datasets. Whether it's due to manual entry errors, sensor failures, or system issues, missing values can negatively impact model accuracy and reliability.
This tutorial covers techniques to detect, analyze, and handle missing data, with Python code examples using pandas and scikit-learn.
Why Missing Data Matters
Missing values can:
- Bias your model
- Cause errors during training or prediction
- Lead to inaccurate conclusions
Handling them properly ensures your model learns from clean, representative data.
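As a small illustration of the second point, many scikit-learn estimators refuse input that contains NaN. The sketch below uses a made-up feature array and LinearRegression as the example estimator; both are assumptions chosen for illustration, not part of the dataset used later.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical feature matrix (age, salary) with one missing salary
X = np.array([
    [25.0, 50000.0],
    [30.0, np.nan],
    [45.0, 80000.0],
    [28.0, 75000.0],
])
y = np.array([1.0, 2.0, 3.0, 4.0])

try:
    LinearRegression().fit(X, y)
    training_failed = False
except ValueError as exc:
    # The estimator rejects NaN instead of silently ignoring it
    training_failed = True
    print("Training failed:", exc)
```

Some estimators do accept NaN natively, so the exact behavior depends on the model you pick.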
Step 1: Detecting Missing Values
In Python, we typically use pandas to inspect missing values.
import pandas as pd
# Sample dataset
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', None],
'Age': [25, 30, None, 45, 28],
'Salary': [50000, None, 60000, 80000, 75000]
}
df = pd.DataFrame(data)
# Check for missing values
print(df.isnull())
print("\nMissing values per column:\n", df.isnull().sum())
Output:
    Name    Age  Salary
0  False  False   False
1  False  False    True
2  False   True   False
3  False  False   False
4   True  False   False

Missing values per column:
 Name      1
Age       1
Salary    1
dtype: int64
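Raw counts are easier to judge as a share of each column. The following sketch rebuilds the same sample dataset and reports the percentage of missing values per column, plus the rows that contain any gap. The variable names missing_pct and incomplete_rows are illustrative.

```python
import pandas as pd

# Same sample dataset as above
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', None],
    'Age': [25, 30, None, 45, 28],
    'Salary': [50000, None, 60000, 80000, 75000],
})

# Share of missing values per column, in percent
missing_pct = df.isna().mean() * 100
print(missing_pct)

# Rows that contain at least one missing value
incomplete_rows = df[df.isna().any(axis=1)]
print(incomplete_rows)
```

Here each column is missing one value out of five, and three different rows are affected, so dropping every incomplete row would discard more than half of the data.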
Step 2: Handling Missing Values
There are multiple strategies depending on context and type of data.
1. Dropping Missing Data
Best when only a few values are missing and they are missing at random, so the removed rows are unlikely to bias the data.
# Drop rows with any missing values
cleaned_df = df.dropna()
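You can also drop columns instead. Note that in the sample dataset every column has a missing value, so this would remove all of them:
# Drop columns with any missing values (returns a new DataFrame, df is unchanged)
cleaned_cols_df = df.dropna(axis=1)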
Use this only when you're sure the data removed is not critical.
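When dropping everything with a gap is too aggressive, dropna can be narrowed. The sketch below shows the subset and thresh parameters on the same sample data; the 50% column cutoff is an arbitrary assumption for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', None],
    'Age': [25, 30, None, 45, 28],
    'Salary': [50000, None, 60000, 80000, 75000],
})

# Drop rows only when 'Age' is missing; gaps in other columns are kept
age_known = df.dropna(subset=['Age'])

# Keep rows that have at least 2 non-missing values
mostly_complete = df.dropna(thresh=2)

# Keep columns where at most 50% of values are missing (cutoff is an assumption)
dense_columns = df.loc[:, df.isna().mean() <= 0.5]

print(len(age_known), len(mostly_complete), list(dense_columns.columns))
```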
2. Imputing Missing Values
Imputation is more common than dropping because it keeps every row. It works best when data is missing at random.
a. Mean/Median Imputation (for numeric columns)
# Fill missing Age with the column mean (assign back; inplace=True on a selected column is deprecated)
df['Age'] = df['Age'].fillna(df['Age'].mean())
# Fill missing Salary with the column median
df['Salary'] = df['Salary'].fillna(df['Salary'].median())
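The choice between mean and median matters most when a column has outliers. The sketch below uses a hypothetical salary series with one extreme value to show how differently the two statistics fill the same gap.

```python
import pandas as pd

# Hypothetical salaries: one extreme value and one missing entry
salary = pd.Series([50000, 52000, 55000, 1000000, None])

mean_filled = salary.fillna(salary.mean())
median_filled = salary.fillna(salary.median())

print(mean_filled.iloc[-1])    # pulled upward by the outlier
print(median_filled.iloc[-1])  # stays close to the typical salaries
```

With this data the mean fills the gap with 289250.0 while the median fills it with 53500.0, which is why the median is usually the safer default for skewed columns.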
b. Mode Imputation (for categorical columns)
# Fill missing names with the mode (most frequent value)
df['Name'] = df['Name'].fillna(df['Name'].mode()[0])
3. Using scikit-learn’s Imputer
from sklearn.impute import SimpleImputer
# Numeric imputer
imputer = SimpleImputer(strategy='mean')
df[['Age', 'Salary']] = imputer.fit_transform(df[['Age', 'Salary']])
You can also use strategy='median', strategy='most_frequent', or strategy='constant'.
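One practical advantage of an imputer object is that it can learn its fill values from training data and reuse them on new data. The sketch below assumes a hypothetical train/test split; the frames train and test are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical training and test sets
train = pd.DataFrame({
    'Age': [25, 30, np.nan, 45],
    'Salary': [50000, np.nan, 60000, 80000],
})
test = pd.DataFrame({
    'Age': [np.nan, 28],
    'Salary': [75000, np.nan],
})

imputer = SimpleImputer(strategy='median')
train_filled = imputer.fit_transform(train)  # learns the medians from train only
test_filled = imputer.transform(test)        # reuses the train medians on test

print(imputer.statistics_)  # learned fill value per column
print(test_filled)
```

Fitting only on the training set keeps information from the test set out of the preprocessing step.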
4. Advanced: KNN Imputation / Iterative Imputer
Use these when you want imputed values to reflect relationships between features rather than a single column statistic.
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=2)
df[['Age', 'Salary']] = imputer.fit_transform(df[['Age', 'Salary']])
Suitable for datasets with correlated features.
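The heading also mentions the iterative imputer, which models each feature with missing values as a function of the other features. A minimal sketch follows, assuming a small made-up dataset; max_iter and random_state are illustrative settings, not recommendations. In scikit-learn this estimator is still experimental, so it has to be enabled with an explicit import.

```python
import numpy as np
import pandas as pd
# IterativeImputer is experimental and must be enabled before import
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical data with one gap in each column
df = pd.DataFrame({
    'Age': [25, 30, np.nan, 45, 28, 35],
    'Salary': [50000, np.nan, 60000, 80000, 75000, 65000],
})

imputer = IterativeImputer(max_iter=10, random_state=0)
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(filled)
```

Observed values are left unchanged; only the missing entries are estimated.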
Tips for Choosing a Strategy
Situation | Recommended Strategy |
---|---|
< 5% data missing | Drop rows |
Numeric missing | Mean / Median / KNN Imputer |
Categorical missing | Mode or Most Frequent |
Sequential/Time data | Use forward/backward fill (ffill / bfill) |
Critical feature | Consider domain knowledge or leave as NaN for modeling |
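The forward and backward fill mentioned in the table can be sketched as follows, using a hypothetical series of daily sensor readings. The dates and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical daily sensor readings with gaps
readings = pd.Series(
    [20.0, np.nan, np.nan, 23.0, np.nan],
    index=pd.date_range('2024-01-01', periods=5, freq='D'),
)

forward = readings.ffill()   # carry the last known value forward
backward = readings.bfill()  # pull the next known value backward

print(forward.tolist())
print(backward.tolist())
```

Note that a backward fill cannot fill a trailing gap, and a forward fill cannot fill a leading one, because there is no later or earlier value to copy.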
Conclusion
Handling missing data is an essential step in your ML preprocessing pipeline. Always:
- Analyze the pattern of missingness
- Choose a strategy based on data type and domain
- Use automated imputation methods when needed
Clean data → better models → better results.