Loading and Cleaning Data in Pandas
In real-world projects, raw data is often messy, inconsistent, and incomplete. Pandas, a powerful Python library, helps in loading, cleaning, and transforming data efficiently. This tutorial covers how to:
- Load data from various sources
- Handle missing values
- Remove duplicates
- Clean and transform data for analysis
1. Installing and Importing Pandas
Before starting, ensure you have Pandas installed:
pip install pandas
Now, import Pandas in Python:
import pandas as pd
2. Loading Data into Pandas
Pandas supports multiple data sources like CSV, Excel, SQL, and JSON.
Loading a CSV File
CSV (Comma-Separated Values) is one of the most common formats for storing tabular data.
df = pd.read_csv("data.csv") # Load CSV file
print(df.head()) # Display first 5 rows
Example: Loading a sales dataset (sales_data.csv) containing Customer Name, Product, Price, and Quantity.
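As a minimal sketch of the sales example, the snippet below reads CSV text from memory so it runs without a file on disk. The rows and the "unknown" placeholder are made up for illustration; with a real file you would pass its path instead of the StringIO object.

```python
import io
import pandas as pd

# Hypothetical rows standing in for sales_data.csv (illustrative values only).
csv_text = """Customer Name,Product,Price,Quantity
Alice,Laptop,55000,1
Bob,Mouse,unknown,2
Carol,Keyboard,1500,
"""

sales = pd.read_csv(
    io.StringIO(csv_text),    # a path such as "sales_data.csv" works the same way
    na_values=["unknown"],    # also treat the text "unknown" as a missing value
)

print(sales.head())
print(sales.dtypes)
```

Marking placeholders as missing at load time keeps Price numeric instead of turning the whole column into text.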
Loading an Excel File
df = pd.read_excel("data.xlsx", sheet_name="Sheet1") # Reading .xlsx needs an engine such as openpyxl installed
Loading JSON Data
df = pd.read_json("data.json")
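JSON often arrives as a list of records, sometimes with nested fields. The sketch below assumes made-up records and reads them from memory; pd.json_normalize is one way to flatten nested objects into columns.

```python
import io
import pandas as pd

# Hypothetical flat records (illustrative values only).
json_text = '[{"Customer Name": "Alice", "Price": 100}, {"Customer Name": "Bob", "Price": 250}]'
orders = pd.read_json(io.StringIO(json_text))
print(orders)

# Hypothetical nested records: each order is an object inside the customer record.
nested = [
    {"name": "Alice", "order": {"product": "Laptop", "price": 55000}},
    {"name": "Bob", "order": {"product": "Mouse", "price": 700}},
]
flat = pd.json_normalize(nested)  # nested keys become dotted column names
print(flat)
```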
Loading Data from SQL Database
import sqlite3
conn = sqlite3.connect("database.db") # Connect to a database
df = pd.read_sql("SELECT * FROM customers", conn)
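To try the SQL route without an existing database.db, the sketch below builds a small in-memory SQLite database. The table contents and the city filter are assumptions for illustration. It also shows passing values through params rather than building the query string by hand.

```python
import sqlite3
import pandas as pd

# In-memory database so the example does not depend on a file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Alice", "Pune"), (2, "Bob", "Delhi"), (3, "Carol", "Pune")],
)
conn.commit()

# The ? placeholder is filled from params by the database driver.
pune_customers = pd.read_sql_query(
    "SELECT * FROM customers WHERE city = ? ORDER BY id",
    conn,
    params=("Pune",),
)
conn.close()  # release the connection once the data is in a DataFrame

print(pune_customers)
```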
3. Understanding Data Structure
After loading, it's important to explore the dataset.
df.info() # Prints column names, non-null counts, and dtypes
print(df.describe()) # Summary statistics
print(df.shape) # (rows, columns)
print(df.columns) # Column names
print(df.dtypes) # Data types of columns
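The calls above can be tried on a small DataFrame. The data here is made up for illustration. Note that describe() summarizes only numeric columns by default; passing include="all" adds the others.

```python
import pandas as pd

# Hypothetical sales rows (illustrative values only).
df = pd.DataFrame({
    "Customer Name": ["Alice", "Bob", "Carol", "Dan"],
    "Price": [55000.0, 700.0, 1500.0, 1200.0],
    "Quantity": [1, 2, 1, 3],
})

df.info()                      # prints its report directly and returns None
summary = df.describe()        # numeric columns only by default
print(summary)
print(df.describe(include="all"))  # also covers the text column
print(df.shape)
```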
4. Handling Missing Data
Missing values can cause incorrect analysis. Pandas provides ways to handle them.
Checking for Missing Data
print(df.isnull().sum()) # Count missing values per column
Removing Missing Values
df.dropna(inplace=True) # Removes rows with missing values
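Dropping every row that has any missing value can discard a lot of data. As a sketch on made-up rows, dropna also accepts subset (check only certain columns) and thresh (require a minimum number of non-missing values per row).

```python
import numpy as np
import pandas as pd

# Hypothetical rows with gaps (illustrative values only).
df = pd.DataFrame({
    "Customer Name": ["Alice", "Bob", None, "Dan"],
    "Price": [100.0, np.nan, 250.0, 80.0],
    "Quantity": [1, 2, np.nan, np.nan],
})

print(df.isnull().sum())                   # missing values per column

complete_rows = df.dropna()                # rows with no missing values at all
has_price = df.dropna(subset=["Price"])    # drop rows only when Price is missing
mostly_complete = df.dropna(thresh=2)      # keep rows with at least 2 non-missing values

print(len(complete_rows), len(has_price), len(mostly_complete))
```

None of these calls use inplace=True, so the original df is left unchanged and the variants can be compared side by side.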
Filling Missing Values
- Fill with a specific value
df.fillna(0, inplace=True) # Replace NaN with 0
- Fill with column mean/median
df["Price"] = df["Price"].fillna(df["Price"].mean()) # Assign the result back; inplace=True on a selected column is deprecated in recent pandas
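Different columns usually need different fill values. A minimal sketch, assuming made-up data: fillna accepts a dictionary that maps column names to fill values, and the median is used for Price here because it is less affected by extreme values than the mean.

```python
import numpy as np
import pandas as pd

# Hypothetical rows (illustrative values only).
df = pd.DataFrame({
    "Price": [100.0, np.nan, 300.0, 1000.0],
    "Quantity": [1, np.nan, 3, 4],
})

filled = df.fillna({
    "Price": df["Price"].median(),  # median ignores NaN by default
    "Quantity": 0,                  # treat a missing quantity as zero
})

print(filled)
```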
5. Removing Duplicates
Duplicate rows give repeated records extra weight, which can bias counts, totals, and averages.
df.drop_duplicates(inplace=True) # Remove duplicate rows
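Rows are not always exact copies. The sketch below, on made-up data, counts exact duplicates first and then removes duplicates based on selected key columns, keeping the last occurrence of each key.

```python
import pandas as pd

# Hypothetical rows (illustrative values only).
df = pd.DataFrame({
    "Customer Name": ["Alice", "Alice", "Bob", "Alice"],
    "Product": ["Laptop", "Laptop", "Mouse", "Laptop"],
    "Quantity": [1, 1, 2, 3],
})

print(df.duplicated().sum())     # number of rows that repeat an earlier row exactly

exact = df.drop_duplicates()     # drops only fully identical rows
by_key = df.drop_duplicates(
    subset=["Customer Name", "Product"],  # compare only these columns
    keep="last",                           # keep the most recent row per key
)

print(exact)
print(by_key)
```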
6. Handling Incorrect Data
Replacing Incorrect Values
df["Category"] = df["Category"].replace({"Eletronics": "Electronics"})
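Misspellings often come with stray spaces and inconsistent capitalization. As a sketch on made-up category values, normalizing the text first means the replace mapping only has to cover true misspellings.

```python
import pandas as pd

# Hypothetical category values (illustrative only).
df = pd.DataFrame({
    "Category": [" electronics", "Eletronics", "ELECTRONICS ", "Furniture"],
})

cleaned = (
    df["Category"]
    .str.strip()                                # remove leading/trailing spaces
    .str.title()                                # normalize capitalization
    .replace({"Eletronics": "Electronics"})     # fix the known misspelling
)

print(cleaned.value_counts())
```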
Filtering Outliers
df = df[df["Price"] < 50000] # Keep only products priced below 50,000
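A fixed cutoff works when you know the valid range. When you do not, one common alternative (not the method used above) is the interquartile range (IQR) rule, sketched here on made-up prices with the conventional 1.5 multiplier.

```python
import pandas as pd

# Hypothetical prices with one extreme value (illustrative only).
df = pd.DataFrame({"Price": [120, 150, 130, 170, 160, 140, 90000]})

q1 = df["Price"].quantile(0.25)
q3 = df["Price"].quantile(0.75)
iqr = q3 - q1

lower = q1 - 1.5 * iqr   # values below this are flagged as outliers
upper = q3 + 1.5 * iqr   # values above this are flagged as outliers

filtered = df[df["Price"].between(lower, upper)]  # between() is inclusive by default
print(lower, upper)
print(filtered)
```

Whether a flagged value is an error or a genuine high-priced item is a judgment call, so it is worth inspecting the removed rows before discarding them.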
7. Converting Data Types
Pandas allows converting data types for consistency.
df["Date"] = pd.to_datetime(df["Date"]) # Convert to datetime
df["Price"] = df["Price"].astype(float) # Convert to float
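The conversions above raise an error if any value cannot be parsed. A minimal sketch, assuming made-up messy values: errors="coerce" turns unparseable entries into missing values so they can be handled with the techniques from section 4.

```python
import pandas as pd

# Hypothetical messy text columns (illustrative values only).
df = pd.DataFrame({
    "Date": ["2024-01-15", "2024-02-30", "not a date"],
    "Price": ["1200", "1,500", "abc"],
})

# Invalid or non-matching dates become NaT instead of raising an error.
df["Date"] = pd.to_datetime(df["Date"], format="%Y-%m-%d", errors="coerce")

# Remove thousands separators, then convert; unparseable text becomes NaN.
df["Price"] = pd.to_numeric(
    df["Price"].str.replace(",", "", regex=False),
    errors="coerce",
)

print(df)
print(df.dtypes)
```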
8. Renaming and Reordering Columns
Renaming Columns
df.rename(columns={"Cust_Name": "Customer Name"}, inplace=True)
Reordering Columns
df = df[["Customer Name", "Product", "Quantity", "Price"]]
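Column names from raw files often carry stray whitespace, which makes a rename silently miss its target. The sketch below uses made-up headers, strips them first, and then applies the rename and reorder steps.

```python
import pandas as pd

# Hypothetical headers with stray spaces (illustrative only).
df = pd.DataFrame({
    " Cust_Name": ["Alice"],
    "Price ": [100],
    "Product": ["Laptop"],
    "Quantity": [1],
})

df.columns = df.columns.str.strip()                       # clean up header whitespace
df = df.rename(columns={"Cust_Name": "Customer Name"})    # returns a new DataFrame
df = df[["Customer Name", "Product", "Quantity", "Price"]]  # select columns in the desired order

print(df.columns.tolist())
```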
9. Saving Cleaned Data
After cleaning, save the dataset for further analysis.
df.to_csv("cleaned_data.csv", index=False) # Save as CSV
df.to_excel("cleaned_data.xlsx", index=False) # Save as Excel
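A quick way to check a saved file is to read it back. The sketch below writes made-up data to a temporary folder (so nothing is left on disk) and reloads it. With index=False the row index is not written, so the reloaded file has no extra unnamed column.

```python
import os
import tempfile
import pandas as pd

# Hypothetical cleaned data (illustrative values only).
df = pd.DataFrame({
    "Customer Name": ["Alice", "Bob"],
    "Price": [100.0, 250.5],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cleaned_data.csv")
    df.to_csv(path, index=False)   # write without the row index
    reloaded = pd.read_csv(path)   # read back while the temp folder still exists

print(reloaded)
```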
Conclusion
In this tutorial, we covered how to load, explore, clean, and save data using Pandas. Cleaning data is a crucial step before performing data analysis and visualization. In the next tutorial, we will explore data manipulation with NumPy and Pandas for better insights.