Handling Missing Data and Duplicates
Real-world data is often incomplete, containing missing values and duplicate entries. Handling these issues is crucial for accurate analysis. In this tutorial, we will cover:
- Identifying missing values
- Strategies to handle missing data
- Identifying and removing duplicate records
1. Identifying Missing Values
Before fixing missing data, we need to find where it occurs. In Pandas, we can count the missing values in each column using:
import pandas as pd
# Load dataset
df = pd.read_csv("data.csv")
# Check for missing values
print(df.isnull().sum())  # Count missing values in each column
Visualizing Missing Data
We can use Seaborn’s heatmap to visualize missing values:
import seaborn as sns
import matplotlib.pyplot as plt
sns.heatmap(df.isnull(), cmap="viridis", cbar=False)
plt.title("Missing Data Heatmap")
plt.show()
2. Strategies to Handle Missing Data
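Which strategy fits usually depends on how much of each column is missing. As a minimal sketch, assuming a small made-up DataFrame (the column names, values, and the 40% cutoff are illustrative assumptions, not taken from `data.csv`), the share of missing values per column can be computed with `isnull().mean()`:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, used only for illustration
df_example = pd.DataFrame({
    "Age": [25, np.nan, 31, 40],
    "Salary": [50000, 62000, np.nan, np.nan],
    "City": ["Pune", "Delhi", None, "Pune"],
})

# isnull() gives True/False per cell; the mean of booleans is the fraction that is True
missing_share = df_example.isnull().mean()
print(missing_share)

# Columns where more than 40% of values are missing (the cutoff is an arbitrary assumption)
mostly_missing = missing_share[missing_share > 0.4].index.tolist()
print(mostly_missing)
```

A column with only a small share missing is often a candidate for filling, while a column that is mostly empty may be a candidate for dropping.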
A. Removing Missing Data
If only a small share of rows contain missing values, the simplest option is to drop those rows:
df_cleaned = df.dropna()  # Removes every row that contains a missing value
If a single column has too many missing values, it may be best to drop that column instead:
df_cleaned = df.drop(columns=["Column_Name"])
B. Filling Missing Data
1. Fill with a Specific Value
Replace missing values with 0 or another fixed value:
df.fillna(0, inplace=True)
2. Fill with Mean, Median, or Mode
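The mean and median are useful for numerical data; the mode (most common value) also works for categorical data. Assigning the result back to the column avoids the chained in-place call, which newer versions of Pandas warn about:
df["Age"] = df["Age"].fillna(df["Age"].mean())  # Fill with mean
df["Salary"] = df["Salary"].fillna(df["Salary"].median())  # Fill with median
df["City"] = df["City"].fillna(df["City"].mode()[0])  # Fill with most common value
3. Forward or Backward Fill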
If the rows are in a meaningful order, such as a time series, we can carry the previous value forward (ffill) or the next value backward (bfill). Recent versions of Pandas deprecate fillna(method=...) in favor of the dedicated methods:
df.ffill(inplace=True)  # Fill with previous value
df.bfill(inplace=True)  # Fill with next value
3. Handling Duplicate Data
Duplicates can occur due to data entry errors or merging datasets.
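As a small illustration of the second cause, the sketch below stacks two made-up tables that happen to share one record. The names, emails, and variable names are hypothetical.

```python
import pandas as pd

# Two hypothetical exports that overlap on one customer
batch_a = pd.DataFrame({
    "Name": ["Asha", "Ravi"],
    "Email": ["asha@example.com", "ravi@example.com"],
})
batch_b = pd.DataFrame({
    "Name": ["Ravi", "Meera"],
    "Email": ["ravi@example.com", "meera@example.com"],
})

# Stacking the two tables keeps both copies of the shared row
combined = pd.concat([batch_a, batch_b], ignore_index=True)

# duplicated() marks every repeat after the first occurrence
duplicate_count = int(combined.duplicated().sum())
print(duplicate_count)
```

Here the combined table has four rows, one of which repeats an earlier row.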
A. Identifying Duplicates
To find duplicate rows:
print(df.duplicated().sum())  # Count duplicate rows
To display duplicate rows:
print(df[df.duplicated()])
B. Removing Duplicates
To remove duplicate rows:
df_no_duplicates = df.drop_duplicates()
If only specific columns should be considered:
df_no_duplicates = df.drop_duplicates(subset=["Name", "Email"])
Conclusion
In this tutorial, we covered:
- How to find missing values
- How to handle missing data using deletion, filling with a fixed value or a summary statistic, and forward or backward fill
- How to find and remove duplicate records
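These steps can be combined into a single cleaning pass. The following is a minimal sketch under stated assumptions: the data is made up, the helper name `clean` is hypothetical, and the choice of median for `Age` and mode for `City` is one reasonable option rather than a rule.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values and one exact duplicate row
raw = pd.DataFrame({
    "Name": ["Asha", "Ravi", "Ravi", "Meera", "Tom"],
    "Age": [25.0, np.nan, np.nan, 40.0, 31.0],
    "City": ["Pune", "Delhi", "Delhi", None, "Pune"],
})


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Drop exact duplicates, then fill numeric gaps with the median and text gaps with the mode."""
    out = df.drop_duplicates().copy()
    out["Age"] = out["Age"].fillna(out["Age"].median())
    out["City"] = out["City"].fillna(out["City"].mode()[0])
    return out.reset_index(drop=True)


cleaned = clean(raw)
print(cleaned)
print(cleaned.isnull().sum().sum())  # Total missing values left after cleaning
```

Duplicates are removed before filling so that repeated rows do not skew the median or the mode.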
