Missing data is one of the most common problems in real-world datasets. Whether it's due to manual entry errors, sensor failures, or system issues, missing values can negatively impact model accuracy and reliability.
This tutorial covers various techniques to detect, analyze, and handle missing data with Python code examples using pandas and scikit-learn.
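Missing values can bias summary statistics, reduce the amount of usable data, and cause errors in algorithms that do not accept NaN inputs.
Handling them properly helps your model learn from clean, representative data.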
In Python, we typically use pandas to inspect missing values.
import pandas as pd
# Sample dataset
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', None],
'Age': [25, 30, None, 45, 28],
'Salary': [50000, None, 60000, 80000, 75000]
}
df = pd.DataFrame(data)
# Check for missing values
print(df.isnull())
print("\nMissing values per column:\n", df.isnull().sum())
Output:
    Name    Age  Salary
0  False  False   False
1  False  False    True
2  False   True   False
3  False  False   False
4   True  False   False

Missing values per column:
Name      1
Age       1
Salary    1
dtype: int64
There are multiple strategies, depending on context and type of data.
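Before picking a strategy, it can help to quantify how much is missing. The following is a minimal sketch that rebuilds the sample dataset from above and reports the share of missing values per column; the variable names `missing_pct` and `rows_with_missing` are illustrative and not part of the original tutorial.

```python
import pandas as pd

# Same sample dataset as above
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', None],
    'Age': [25, 30, None, 45, 28],
    'Salary': [50000, None, 60000, 80000, 75000],
})

# Share of missing entries per column, as a percentage
missing_pct = df.isna().mean() * 100
print(missing_pct)

# Number of rows that contain at least one missing value
rows_with_missing = int(df.isna().any(axis=1).sum())
print(rows_with_missing)
```

A per-column percentage is easier to compare against a rule of thumb (such as a small threshold for dropping rows) than raw counts.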
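Dropping rows or columns works best when the missing values are few or occur at random.
# Drop rows with any missing values
cleaned_df = df.dropna()
You can also drop columns instead:
# Drop columns with any missing values (returns a new DataFrame; df is unchanged)
reduced_df = df.dropna(axis=1)
Use this only when you're sure the data removed is not critical. In the sample dataset every column has one missing value, so dropping columns would remove all of them.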
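Dropping does not have to be all-or-nothing. As a sketch, `dropna` also accepts `subset` (only look at certain columns) and `thresh` (keep rows with at least that many non-missing values). The dataset is rebuilt here so the example runs on its own, and the names `salary_known` and `mostly_complete` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', None],
    'Age': [25, 30, None, 45, 28],
    'Salary': [50000, None, 60000, 80000, 75000],
})

# Drop only the rows where 'Salary' is missing; gaps in other columns are kept
salary_known = df.dropna(subset=['Salary'])

# Keep rows that have at least 2 non-missing values
mostly_complete = df.dropna(thresh=2)

print(len(df), len(salary_known), len(mostly_complete))
```

In this sample each incomplete row has only one gap, so `thresh=2` keeps every row, while `subset=['Salary']` removes just the row without a salary.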
Filling in (imputing) missing values is more common, and is useful when data is missing at random.
# Fill missing Age with mean
df['Age'] = df['Age'].fillna(df['Age'].mean())
# Fill missing Salary with median
df['Salary'] = df['Salary'].fillna(df['Salary'].median())
# Fill missing names with mode (most frequent)
df['Name'] = df['Name'].fillna(df['Name'].mode()[0])
scikit-learn's SimpleImputer does the same for one or more columns at once:
from sklearn.impute import SimpleImputer
# Numeric imputer
imputer = SimpleImputer(strategy='mean')
df[['Age', 'Salary']] = imputer.fit_transform(df[['Age', 'Salary']])
You can also use strategy='median', strategy='most_frequent', or strategy='constant'.
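One reason to prefer an imputer object over `fillna` is that the fill values can be learned once and reused. The sketch below assumes a hypothetical train/test split (the arrays `X_train` and `X_test` are made up for illustration): the imputer is fitted on the training rows only, and the learned medians are then applied to the test rows.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical split: columns are Age and Salary
X_train = np.array([
    [25.0, 50000.0],
    [30.0, np.nan],
    [np.nan, 60000.0],
    [45.0, 80000.0],
])
X_test = np.array([[np.nan, 75000.0]])

imputer = SimpleImputer(strategy='median')
imputer.fit(X_train)  # learn per-column medians from training data only

# Medians learned during fit, one per column
print(imputer.statistics_)

# Fill the test rows with the training medians
X_test_filled = imputer.transform(X_test)
print(X_test_filled)
```

Fitting on the training data alone keeps information from the test set out of the fill values, which is usually what you want when evaluating a model.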
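KNN imputation is useful when you want a more context-aware fill: each missing value is estimated from the rows most similar to it.
from sklearn.impute import KNNImputer
# Impute each missing value from its 2 nearest neighbours
imputer = KNNImputer(n_neighbors=2)
df[['Age', 'Salary']] = imputer.fit_transform(df[['Age', 'Salary']])
This is suitable for datasets with correlated features.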
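To see what the neighbour-based fill does, here is a small sketch on a toy numeric array (the values are illustrative and unrelated to the sample dataset). With the default uniform weights, each missing entry is replaced by the average of that column over the 2 nearest rows that have a value there, with distances computed on the features both rows have observed.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy data with two missing entries
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled)
```

Because the method is distance-based, features on very different scales (such as Age and Salary) can let one column dominate the neighbour search, so scaling the features first may be worth considering.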
| Situation | Recommended Strategy |
|---|---|
| < 5% data missing | Drop rows |
| Numeric feature missing | Mean / Median / KNN Imputer |
| Categorical feature missing | Mode (most frequent value) |
| Sequential/Time data | Use forward/backward fill (ffill / bfill) |
| Critical feature | Consider domain knowledge or leave as NaN for modeling |
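The forward and backward fill mentioned in the table can be sketched as follows. The series `readings` is a made-up daily sensor log used only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings with gaps
readings = pd.Series(
    [20.5, np.nan, np.nan, 21.0, np.nan],
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
)

# Forward fill: carry the last known value forward
forward = readings.ffill()

# Backward fill: pull the next known value backward
backward = readings.bfill()

print(pd.DataFrame({'raw': readings, 'ffill': forward, 'bfill': backward}))
```

Note that a backward fill leaves a trailing gap unfilled (and a forward fill leaves a leading gap), because there is no later (or earlier) value to copy from.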
Handling missing data is an essential step in your ML preprocessing pipeline.
Clean data → better models → better results.