Handling Missing Data in ML
Missing data is one of the most common problems in real-world datasets. Whether it's due to manual entry errors, sensor failures, or system issues, missing values can negatively impact model accuracy and reliability.
This tutorial covers techniques to detect, analyze, and handle missing data, with Python code examples using pandas and scikit-learn.
Why Missing Data Matters
Missing values can:
- Bias your model
- Cause errors during training or prediction
- Lead to inaccurate conclusions
Handling them properly ensures your model learns from clean, representative data.
Step 1: Detecting Missing Values
In Python, we typically use pandas to inspect missing values.
import pandas as pd
# Sample dataset
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', None],
'Age': [25, 30, None, 45, 28],
'Salary': [50000, None, 60000, 80000, 75000]
}
df = pd.DataFrame(data)
# Check for missing values
print(df.isnull())
print("\nMissing values per column:\n", df.isnull().sum())
Output:
    Name    Age  Salary
0  False  False   False
1  False  False    True
2  False   True   False
3  False  False   False
4   True  False   False

Missing values per column:
Name      1
Age       1
Salary    1
dtype: int64
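Raw counts are easier to judge as a share of the column. The sketch below reuses the sample dataset from above and reports the percentage of missing values per column, then selects the rows that contain at least one missing value. The variable names `missing_pct` and `rows_with_missing` are illustrative choices, not part of the original example.

```python
import pandas as pd

# Same sample dataset as above
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', None],
    'Age': [25, 30, None, 45, 28],
    'Salary': [50000, None, 60000, 80000, 75000],
})

# Percentage of missing values per column
missing_pct = df.isnull().mean() * 100
print(missing_pct)

# Rows that contain at least one missing value
rows_with_missing = df[df.isnull().any(axis=1)]
print(rows_with_missing)
```

Here each column is missing one value out of five, and the affected rows are 1, 2, and 4.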
Step 2: Handling Missing Values
There are multiple strategies depending on context and type of data.
1. Dropping Missing Data
Best when only a small share of values is missing and the missingness is random, so the dropped rows are unlikely to bias the remaining data.
# Drop rows with any missing values
cleaned_df = df.dropna()
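You can also drop columns that contain missing values instead:
# Drop columns with any missing values (returns a new DataFrame, df is unchanged)
cleaned_cols = df.dropna(axis=1)
Use this only when you're sure the removed data is not critical. In the sample dataset every column contains a missing value, so this call would remove all three columns.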
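Dropping does not have to be all-or-nothing. As a minimal sketch, `dropna` also accepts `subset` (only look at certain columns) and `thresh` (keep rows with at least that many non-missing values). The dataset below is the sample from above plus one hypothetical extra row ('Eve') with two missing values, added only to make the effect of `thresh` visible.

```python
import pandas as pd

# Sample dataset plus a hypothetical sixth row with two missing values
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', None, 'Eve'],
    'Age': [25, 30, None, 45, 28, None],
    'Salary': [50000, None, 60000, 80000, 75000, None],
})

# Drop rows only when 'Age' is missing
age_known = df.dropna(subset=['Age'])
print(age_known)

# Keep rows that have at least 2 non-missing values
mostly_complete = df.dropna(thresh=2)
print(mostly_complete)
```

With this data, `subset=['Age']` removes rows 2 and 5, while `thresh=2` removes only row 5, the one row that is mostly empty.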
2. Imputing Missing Values
More common and useful when data is missing at random.
a. Mean/Median Imputation (for numeric columns)
# Fill missing Age with the mean (assign back; inplace on a column selection is unreliable in recent pandas)
df['Age'] = df['Age'].fillna(df['Age'].mean())
# Fill missing Salary with the median
df['Salary'] = df['Salary'].fillna(df['Salary'].median())
b. Mode Imputation (for categorical columns)
# Fill missing names with the mode (most frequent value)
df['Name'] = df['Name'].fillna(df['Name'].mode()[0])
3. Using scikit-learn’s Imputer
from sklearn.impute import SimpleImputer
# Numeric imputer
imputer = SimpleImputer(strategy='mean')
df[['Age', 'Salary']] = imputer.fit_transform(df[['Age', 'Salary']])
You can also use strategy='median', strategy='most_frequent', or strategy='constant'.
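One reason to prefer an imputer object over `fillna` is that it remembers the statistics it learned. The sketch below assumes a hypothetical train/test split (the arrays `X_train` and `X_test` are made up for illustration, with columns Age and Salary): the imputer is fitted on the training data only, and the same learned medians are then applied to the test data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical split; columns are [Age, Salary]
X_train = np.array([
    [25.0, 50000.0],
    [30.0, np.nan],
    [np.nan, 60000.0],
    [45.0, 80000.0],
])
X_test = np.array([
    [np.nan, 75000.0],
    [28.0, np.nan],
])

imputer = SimpleImputer(strategy='median')

# Learn the medians from the training data and fill it
X_train_filled = imputer.fit_transform(X_train)

# Reuse the training medians on the test data (no refitting)
X_test_filled = imputer.transform(X_test)

print(imputer.statistics_)  # medians learned from X_train
print(X_test_filled)
```

The learned medians are 30 for Age and 60000 for Salary, so the test rows are filled with values that come from the training data rather than from the test set itself.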
4. Advanced: KNN Imputation / Iterative Imputer
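Use these when you want imputed values to reflect the other features in each row. KNNImputer fills a gap with the mean of that feature across the nearest neighboring rows.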
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=2)
df[['Age', 'Salary']] = imputer.fit_transform(df[['Age', 'Salary']])
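Suitable for datasets with correlated features. Because KNNImputer relies on distances between rows, consider scaling the features first; here the much larger Salary values would otherwise dominate Age.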
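The heading above also mentions the iterative imputer. A minimal sketch follows, using the Age and Salary values from the sample dataset as a NumPy array. scikit-learn still marks `IterativeImputer` as experimental, so it has to be enabled with an extra import; `max_iter=10` and `random_state=0` are arbitrary choices for the illustration.

```python
import numpy as np
# IterativeImputer is experimental and must be enabled explicitly
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Columns are [Age, Salary], taken from the sample dataset
X = np.array([
    [25, 50000],
    [30, np.nan],
    [np.nan, 60000],
    [45, 80000],
    [28, 75000],
], dtype=float)

# Each feature with missing values is modeled from the other features
imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X)
print(X_filled)
```

With only five rows the estimates are not meaningful; the point is the API shape, which mirrors SimpleImputer and KNNImputer.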
Tips for Choosing a Strategy
Situation | Recommended Strategy |
---|---|
< 5% data missing | Drop rows |
Numeric missing | Mean / Median / KNN Imputer |
Categorical missing | Mode or Most Frequent |
Sequential/time-series data | Forward/backward fill (ffill / bfill) |
Critical feature | Use domain knowledge, or leave as NaN if the model supports missing values |
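The forward and backward fill row in the table can be sketched as follows. The daily series `readings` is a made-up example, not data from this tutorial.

```python
import numpy as np
import pandas as pd

# Hypothetical daily sensor readings with gaps
readings = pd.Series(
    [20.0, np.nan, np.nan, 23.0, np.nan],
    index=pd.date_range('2024-01-01', periods=5, freq='D'),
)

# Forward fill: carry the last known value forward
forward = readings.ffill()
print(forward)

# Backward fill: pull the next known value backward
backward = readings.bfill()
print(backward)
```

Forward fill gives 20, 20, 20, 23, 23. Backward fill gives 20, 23, 23, 23 and leaves the last value as NaN, because there is no later observation to copy from.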
Conclusion
Handling missing data is an essential step in your ML preprocessing pipeline. Always:
- Analyze the pattern of missingness
- Choose a strategy based on data type and domain
- Use automated imputation methods when needed
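One simple way to analyze the pattern of missingness is to check whether a column's missing rate differs across groups of another column. The sketch below uses a hypothetical `Department` column and made-up salaries; a large gap between groups suggests the values are not missing at random.

```python
import pandas as pd

# Hypothetical data: is Salary missing more often in one department?
df = pd.DataFrame({
    'Department': ['Sales', 'Sales', 'Sales', 'Tech', 'Tech', 'Tech'],
    'Salary': [50000, None, None, 80000, 75000, 90000],
})

# Share of missing Salary values within each department
missing_rate_by_dept = df['Salary'].isnull().groupby(df['Department']).mean()
print(missing_rate_by_dept)
```

In this made-up data two of three Sales salaries are missing and none in Tech, so dropping rows or filling with the overall mean would skew the result toward Tech.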
Clean data → better models → better results.