Handling Missing Data in ML

Missing data is one of the most common problems in real-world datasets. Whether it's due to manual entry errors, sensor failures, or system issues, missing values can negatively impact model accuracy and reliability.

This tutorial covers various techniques to detect, analyze, and handle missing data with Python code examples using pandas and scikit-learn.


Why Missing Data Matters

Missing values can:

  • Bias your model if values are not missing at random
  • Cause errors during training or prediction (many algorithms cannot handle NaN inputs)
  • Lead to inaccurate conclusions

Handling them properly ensures your model learns from clean, representative data.
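
To see the second point concretely, here is a minimal sketch (with made-up numbers) of a scikit-learn estimator refusing to train on data containing NaN:

import numpy as np
from sklearn.linear_model import LinearRegression
# Toy feature matrix with one missing value (illustrative numbers only)
X = np.array([[25.0], [30.0], [np.nan], [45.0]])
y = np.array([50000, 52000, 60000, 80000])
try:
    LinearRegression().fit(X, y)
except ValueError as err:
    print("Training failed:", err)  # most estimators reject NaN inputs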


Step 1: Detecting Missing Values

In Python, we typically use pandas to inspect missing values.

import pandas as pd
# Sample dataset
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', None],
    'Age': [25, 30, None, 45, 28],
    'Salary': [50000, None, 60000, 80000, 75000]
}
df = pd.DataFrame(data)
# Check for missing values
print(df.isnull())
print("\nMissing values per column:\n", df.isnull().sum())

Output:

    Name    Age  Salary
0  False  False   False
1  False  False    True
2  False   True   False
3  False  False   False
4   True  False   False
Missing values per column:
 Name      1
Age       1
Salary    1
dtype: int64
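
Raw counts are easier to act on as percentages; a quick sketch using the same df:

# Percentage of missing values per column
missing_pct = df.isnull().mean() * 100
print(missing_pct.round(1))

This feeds directly into the "< 5% data missing" rule of thumb discussed later in this tutorial.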

Step 2: Handling Missing Values

There are multiple strategies; the right choice depends on the context and the type of data.

1. Dropping Missing Data

Dropping works best when the missing values are few and appear to be random, so little information is lost.

# Drop rows with any missing values
cleaned_df = df.dropna()

You can also drop columns instead:

df.dropna(axis=1, inplace=True)

Use this only when you're sure the removed data is not critical.
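
As a middle ground, dropna also accepts a thresh parameter, which keeps only rows with at least a given number of non-missing values; a minimal sketch:

# Keep rows that have at least 2 non-missing values
partially_cleaned_df = df.dropna(thresh=2)
print(partially_cleaned_df)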


2. Imputing Missing Values

Imputation is more common; it is useful when data is missing at random, because it preserves the rest of each row.

a. Mean/Median Imputation (for numeric columns)

# Fill missing Age with the column mean
df['Age'] = df['Age'].fillna(df['Age'].mean())
# Fill missing Salary with the column median
df['Salary'] = df['Salary'].fillna(df['Salary'].median())

b. Mode Imputation (for categorical columns)

# Fill missing names with the mode (most frequent value)
df['Name'] = df['Name'].fillna(df['Name'].mode()[0])

3. Using scikit-learn’s SimpleImputer

from sklearn.impute import SimpleImputer
# Numeric imputer
imputer = SimpleImputer(strategy='mean')
df[['Age', 'Salary']] = imputer.fit_transform(df[['Age', 'Salary']])

You can also use strategy='median', strategy='most_frequent', or strategy='constant'.
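
For example, strategy='constant' is handy for categorical columns, filling gaps with an explicit placeholder label. A minimal sketch on a small toy column (the 'Dept' data and the 'Unknown' label are purely illustrative):

import numpy as np
# Toy categorical column with missing entries encoded as np.nan (illustrative)
departments = pd.DataFrame({'Dept': ['HR', np.nan, 'IT', np.nan]})
cat_imputer = SimpleImputer(strategy='constant', fill_value='Unknown')
departments[['Dept']] = cat_imputer.fit_transform(departments[['Dept']])
print(departments)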


4. Advanced: KNN Imputation / Iterative Imputer

Use these when you want context-aware imputation: each missing value is estimated from the other features rather than from a single column statistic.

from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=2)
df[['Age', 'Salary']] = imputer.fit_transform(df[['Age', 'Salary']])

Suitable for datasets with correlated features.
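
The heading also mentions the IterativeImputer, which models each feature with missing values as a function of the other features. A minimal sketch (it still requires the experimental enable flag in current scikit-learn releases, and is shown on the same two columns purely for illustration):

from sklearn.experimental import enable_iterative_imputer  # required before importing IterativeImputer
from sklearn.impute import IterativeImputer
iter_imputer = IterativeImputer(max_iter=10, random_state=0)
df[['Age', 'Salary']] = iter_imputer.fit_transform(df[['Age', 'Salary']])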


Tips for Choosing a Strategy

Situation               Recommended Strategy
< 5% data missing       Drop rows
Numeric missing         Mean / Median / KNN Imputer
Categorical missing     Mode or Most Frequent
Sequential/Time data    Forward/backward fill (ffill / bfill); see the sketch below
Critical feature        Consider domain knowledge or leave as NaN for modeling
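
For the sequential/time-data row above, forward and backward fill propagate the nearest observed value; a minimal sketch with a hypothetical sensor-reading series:

# Hypothetical sensor readings with gaps (None becomes NaN in the float Series)
readings = pd.Series([1.0, None, None, 4.0, 5.0])
print(readings.ffill())  # carry the last observed value forward
print(readings.bfill())  # fill from the next observed value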

Conclusion

Handling missing data is an essential step in your ML preprocessing pipeline. Always:

  • Analyze the pattern of missingness
  • Choose a strategy based on data type and domain
  • Use automated imputation methods when needed

Clean data → better models → better results.