Real-world data is often incomplete, containing missing values and duplicate entries. Handling these issues is crucial for accurate analysis. In this tutorial, we will cover:
Before fixing missing data, we need to find where they exist. In Pandas, we can check for missing values using:
import pandas as pd
# Load dataset
df = pd.read_csv("data.csv")
# Check for missing values
print(df.isnull().sum()) # Count missing values in each columnWe can use Seaborn’s heatmap to visualize missing values:
import seaborn as sns
import matplotlib.pyplot as plt
sns.heatmap(df.isnull(), cmap="viridis", cbar=False)
plt.title("Missing Data Heatmap")
plt.show()If a column has too many missing values, it may be best to drop it:
df_cleaned = df.dropna() # Removes rows with missing valuesTo remove only specific columns:
df_cleaned = df.drop(columns=["Column_Name"])Replace missing values with 0 or another fixed value:
df.fillna(0, inplace=True)This method is useful for numerical data.
df["Age"].fillna(df["Age"].mean(), inplace=True) # Fill with mean
df["Salary"].fillna(df["Salary"].median(), inplace=True) # Fill with median
df["City"].fillna(df["City"].mode()[0], inplace=True) # Fill with most common valueIf missing values are sequential, we can use forward (ffill) or backward (bfill) filling:
df.fillna(method="ffill", inplace=True) # Fill with previous value
df.fillna(method="bfill", inplace=True) # Fill with next valueDuplicates can occur due to data entry errors or merging datasets.
To find duplicate rows:
print(df.duplicated().sum()) # Count duplicate rowsTo display duplicate rows:
print(df[df.duplicated()])To remove duplicate rows:
df_no_duplicates = df.drop_duplicates()If only specific columns should be considered:
df_no_duplicates = df.drop_duplicates(subset=["Name", "Email"])In this tutorial, we covered:
Sign in to join the discussion and post comments.
Sign in