Handling Missing Data and Duplicates

  Add to Bookmark

Real-world data is often incomplete, containing missing values and duplicate entries. Handling these issues is crucial for accurate analysis. In this tutorial, we will cover:

  1. Identifying missing values
  2. Strategies to handle missing data
  3. Identifying and removing duplicate records

1. Identifying Missing Values

Before fixing missing data, we need to find where they exist. In Pandas, we can check for missing values using:

import pandas as pd

# Load dataset
df = pd.read_csv("data.csv")

# Check for missing values
print(df.isnull().sum())  # Count missing values in each column

Visualizing Missing Data

We can use Seaborn’s heatmap to visualize missing values:

import seaborn as sns
import matplotlib.pyplot as plt

sns.heatmap(df.isnull(), cmap="viridis", cbar=False)
plt.title("Missing Data Heatmap")
plt.show()

2. Strategies to Handle Missing Data

A. Removing Missing Data

If a column has too many missing values, it may be best to drop it:

df_cleaned = df.dropna()  # Removes rows with missing values

To remove only specific columns:

df_cleaned = df.drop(columns=["Column_Name"])

B. Filling Missing Data

1. Fill with a Specific Value

Replace missing values with 0 or another fixed value:

df.fillna(0, inplace=True)

2. Fill with Mean, Median, or Mode

This method is useful for numerical data.

df["Age"].fillna(df["Age"].mean(), inplace=True)  # Fill with mean
df["Salary"].fillna(df["Salary"].median(), inplace=True)  # Fill with median
df["City"].fillna(df["City"].mode()[0], inplace=True)  # Fill with most common value

3. Forward or Backward Fill

If missing values are sequential, we can use forward (ffill) or backward (bfill) filling:

df.fillna(method="ffill", inplace=True)  # Fill with previous value
df.fillna(method="bfill", inplace=True)  # Fill with next value

3. Handling Duplicate Data

Duplicates can occur due to data entry errors or merging datasets.

A. Identifying Duplicates

To find duplicate rows:

print(df.duplicated().sum())  # Count duplicate rows

To display duplicate rows:

print(df[df.duplicated()])

B. Removing Duplicates

To remove duplicate rows:

df_no_duplicates = df.drop_duplicates()

If only specific columns should be considered:

df_no_duplicates = df.drop_duplicates(subset=["Name", "Email"])

Conclusion

In this tutorial, we covered:

  • How to find missing values
  • How to handle missing data using deletion, filling, and interpolation
  • How to find and remove duplicate records