- Data Analysis with Python
-
Overview
- Introduction to Data Science and Analytics
- Loading and Cleaning Data in Pandas
- Data Manipulation with NumPy and Pandas
- Exploratory Data Analysis (EDA) Techniques
- Handling Missing Data and Duplicates
- Merging, Joining, and Concatenating DataFrames
- Time Series Analysis Basics
- Data Visualization with Matplotlib and Seaborn
- Descriptive Statistics and Data Summarization
- Advanced Pandas Operations
Handling Missing Data and Duplicates
Add to BookmarkReal-world data is often incomplete, containing missing values and duplicate entries. Handling these issues is crucial for accurate analysis. In this tutorial, we will cover:
- Identifying missing values
- Strategies to handle missing data
- Identifying and removing duplicate records
1. Identifying Missing Values
Before fixing missing data, we need to find where they exist. In Pandas, we can check for missing values using:
import pandas as pd
# Load dataset
df = pd.read_csv("data.csv")
# Check for missing values
print(df.isnull().sum()) # Count missing values in each column
Visualizing Missing Data
We can use Seaborn’s heatmap to visualize missing values:
import seaborn as sns
import matplotlib.pyplot as plt
sns.heatmap(df.isnull(), cmap="viridis", cbar=False)
plt.title("Missing Data Heatmap")
plt.show()
2. Strategies to Handle Missing Data
A. Removing Missing Data
If a column has too many missing values, it may be best to drop it:
df_cleaned = df.dropna() # Removes rows with missing values
To remove only specific columns:
df_cleaned = df.drop(columns=["Column_Name"])
B. Filling Missing Data
1. Fill with a Specific Value
Replace missing values with 0 or another fixed value:
df.fillna(0, inplace=True)
2. Fill with Mean, Median, or Mode
This method is useful for numerical data.
df["Age"].fillna(df["Age"].mean(), inplace=True) # Fill with mean
df["Salary"].fillna(df["Salary"].median(), inplace=True) # Fill with median
df["City"].fillna(df["City"].mode()[0], inplace=True) # Fill with most common value
3. Forward or Backward Fill
If missing values are sequential, we can use forward (ffill
) or backward (bfill
) filling:
df.fillna(method="ffill", inplace=True) # Fill with previous value
df.fillna(method="bfill", inplace=True) # Fill with next value
3. Handling Duplicate Data
Duplicates can occur due to data entry errors or merging datasets.
A. Identifying Duplicates
To find duplicate rows:
print(df.duplicated().sum()) # Count duplicate rows
To display duplicate rows:
print(df[df.duplicated()])
B. Removing Duplicates
To remove duplicate rows:
df_no_duplicates = df.drop_duplicates()
If only specific columns should be considered:
df_no_duplicates = df.drop_duplicates(subset=["Name", "Email"])
Conclusion
In this tutorial, we covered:
- How to find missing values
- How to handle missing data using deletion, filling, and interpolation
- How to find and remove duplicate records
Prepare for Interview
- SQL Interview Questions for 2–5 Years Experience
- SQL Interview Questions for 1–2 Years Experience
- SQL Interview Questions for 0–1 Year Experience
- SQL Interview Questions for Freshers
- Design Patterns in Python
- Dynamic Programming and Recursion in Python
- Trees and Graphs in Python
- Linked Lists, Stacks, and Queues in Python
- Sorting and Searching in Python
- Debugging in Python
- Unit Testing in Python
- Asynchronous Programming in PYthon
- Multithreading and Multiprocessing in Python
- Context Managers in Python
- Decorators in Python
Random Blogs
- Generative AI - The Future of Artificial Intelligence
- What is YII? and How to Install it?
- Grow your business with Facebook Marketing
- Understanding OLTP vs OLAP Databases: How SQL Handles Query Optimization
- How AI is Making Humans Weaker – The Hidden Impact of Artificial Intelligence
- Career Guide: Natural Language Processing (NLP)
- Python Challenging Programming Exercises Part 2
- OLTP vs. OLAP Databases: Advanced Insights and Query Optimization Techniques
- The Ultimate Guide to Artificial Intelligence (AI) for Beginners
- Loan Default Prediction Project Using Machine Learning
- Datasets for analyze in Tableau
- What Is SEO and Why Is It Important?
- Types of Numbers in Python
- AI in Marketing & Advertising: The Future of AI-Driven Strategies
- Datasets for Speech Recognition Analysis
Datasets for Machine Learning
- Amazon Product Reviews Dataset
- Ozone Level Detection Dataset
- Bank Transaction Fraud Detection
- YouTube Trending Video Dataset (updated daily)
- Covid-19 Case Surveillance Public Use Dataset
- US Election 2020
- Forest Fires Dataset
- Mobile Robots Dataset
- Safety Helmet Detection
- All Space Missions from 1957
- OSIC Pulmonary Fibrosis Progression Dataset
- Wine Quality Dataset
- Google Audio Dataset
- Iris flower dataset
- Artificial Characters Dataset