- Data Analysis with Python
-
Overview
- Introduction to Data Science and Analytics
- Loading and Cleaning Data in Pandas
- Data Manipulation with NumPy and Pandas
- Exploratory Data Analysis (EDA) Techniques
- Handling Missing Data and Duplicates
- Merging, Joining, and Concatenating DataFrames
- Time Series Analysis Basics
- Data Visualization with Matplotlib and Seaborn
- Descriptive Statistics and Data Summarization
- Advanced Pandas Operations
Loading and Cleaning Data in Pandas
Add to BookmarkIn real-world projects, raw data is often messy, inconsistent, and incomplete. Pandas, a powerful Python library, helps in loading, cleaning, and transforming data efficiently. This tutorial covers how to:
- Load data from various sources
- Handle missing values
- Remove duplicates
- Clean and transform data for analysis
1. Installing and Importing Pandas
Before starting, ensure you have Pandas installed:
pip install pandas
Now, import Pandas in Python:
import pandas as pd
2. Loading Data into Pandas
Pandas supports multiple data sources like CSV, Excel, SQL, and JSON.
Loading a CSV File
CSV (Comma-Separated Values) is the most common format for storing data.
df = pd.read_csv("data.csv") # Load CSV file
print(df.head()) # Display first 5 rows
Example: Loading a sales dataset (sales_data.csv
) containing Customer Name, Product, Price, and Quantity.
Loading an Excel File
df = pd.read_excel("data.xlsx", sheet_name="Sheet1")
Loading JSON Data
df = pd.read_json("data.json")
Loading Data from SQL Database
import sqlite3
conn = sqlite3.connect("database.db") # Connect to a database
df = pd.read_sql("SELECT * FROM customers", conn)
3. Understanding Data Structure
After loading, it's important to explore the dataset.
df.info() # Basic info about dataset
df.describe() # Summary statistics
df.shape # (rows, columns)
df.columns # Column names
df.dtypes # Data types of columns
4. Handling Missing Data
Missing values can cause incorrect analysis. Pandas provides ways to handle them.
Checking for Missing Data
print(df.isnull().sum()) # Count missing values per column
Removing Missing Values
df.dropna(inplace=True) # Removes rows with missing values
Filling Missing Values
- Fill with a specific value
df.fillna(0, inplace=True) # Replace NaN with 0
- Fill with column mean/median
df["Price"].fillna(df["Price"].mean(), inplace=True)
5. Removing Duplicates
Duplicate data can cause bias in analysis.
df.drop_duplicates(inplace=True) # Remove duplicate rows
6. Handling Incorrect Data
Replacing Incorrect Values
df["Category"] = df["Category"].replace({"Eletronics": "Electronics"})
Filtering Outliers
df = df[df["Price"] < 50000] # Remove products priced above 50,000
7. Converting Data Types
Pandas allows converting data types for consistency.
df["Date"] = pd.to_datetime(df["Date"]) # Convert to datetime
df["Price"] = df["Price"].astype(float) # Convert to float
8. Renaming and Reordering Columns
Renaming Columns
df.rename(columns={"Cust_Name": "Customer Name"}, inplace=True)
Reordering Columns
df = df[["Customer Name", "Product", "Quantity", "Price"]]
9. Saving Cleaned Data
After cleaning, save the dataset for further analysis.
df.to_csv("cleaned_data.csv", index=False) # Save as CSV
df.to_excel("cleaned_data.xlsx", index=False) # Save as Excel
Conclusion
In this tutorial, we covered how to load, explore, clean, and save data using Pandas. Cleaning data is a crucial step before performing data analysis and visualization. In the next tutorial, we will explore data manipulation with NumPy and Pandas for better insights.
Prepare for Interview
- JavaScript Interview Questions for 0–1 Year Experience
- JavaScript Interview Questions For Fresher
- SQL Interview Questions for 5+ Years Experience
- SQL Interview Questions for 2–5 Years Experience
- SQL Interview Questions for 1–2 Years Experience
- SQL Interview Questions for 0–1 Year Experience
- SQL Interview Questions for Freshers
- Design Patterns in Python
- Dynamic Programming and Recursion in Python
- Trees and Graphs in Python
- Linked Lists, Stacks, and Queues in Python
- Sorting and Searching in Python
- Debugging in Python
- Unit Testing in Python
- Asynchronous Programming in PYthon
Random Blogs
- Datasets for Exploratory Data Analysis for Beginners
- What is YII? and How to Install it?
- 5 Ways Use Jupyter Notebook Online Free of Cost
- The Beginner’s Guide to Normalization and Denormalization in Databases
- AI in Marketing & Advertising: The Future of AI-Driven Strategies
- Understanding OLTP vs OLAP Databases: How SQL Handles Query Optimization
- Mastering Python in 2025: A Complete Roadmap for Beginners
- How to Start Your Career as a DevOps Engineer
- The Ultimate Guide to Machine Learning (ML) for Beginners
- String Operations in Python
- What Is SEO and Why Is It Important?
- Best Platform to Learn Digital Marketing in Free
- Store Data Into CSV File Using Python Tkinter GUI Library
- The Ultimate Guide to Data Science: Everything You Need to Know
- Window Functions in SQL – The Ultimate Guide
Datasets for Machine Learning
- Amazon Product Reviews Dataset
- Ozone Level Detection Dataset
- Bank Transaction Fraud Detection
- YouTube Trending Video Dataset (updated daily)
- Covid-19 Case Surveillance Public Use Dataset
- US Election 2020
- Forest Fires Dataset
- Mobile Robots Dataset
- Safety Helmet Detection
- All Space Missions from 1957
- OSIC Pulmonary Fibrosis Progression Dataset
- Wine Quality Dataset
- Google Audio Dataset
- Iris flower dataset
- Artificial Characters Dataset