Loading and Cleaning Data in Pandas

In real-world projects, raw data is often messy, inconsistent, and incomplete. Pandas, a powerful Python library, helps in loading, cleaning, and transforming data efficiently. This tutorial covers how to:

  • Load data from various sources
  • Handle missing values
  • Remove duplicates
  • Clean and transform data for analysis

1. Installing and Importing Pandas

Before starting, ensure you have Pandas installed:

pip install pandas

Now, import Pandas in Python:

import pandas as pd

2. Loading Data into Pandas

Pandas supports multiple data sources like CSV, Excel, SQL, and JSON.

Loading a CSV File

CSV (Comma-Separated Values) is the most common format for storing data.

df = pd.read_csv("data.csv")  # Load CSV file
print(df.head())  # Display first 5 rows

Example: Loading a sales dataset (sales_data.csv) containing Customer Name, Product, Price, and Quantity.

Loading an Excel File

df = pd.read_excel("data.xlsx", sheet_name="Sheet1")

Loading JSON Data

df = pd.read_json("data.json")

Loading Data from SQL Database

import sqlite3
conn = sqlite3.connect("database.db")  # Connect to a database
df = pd.read_sql("SELECT * FROM customers", conn)

3. Understanding Data Structure

After loading, it's important to explore the dataset.

df.info()  # Basic info about dataset
df.describe()  # Summary statistics
df.shape  # (rows, columns)
df.columns  # Column names
df.dtypes  # Data types of columns

4. Handling Missing Data

Missing values can cause incorrect analysis. Pandas provides ways to handle them.

Checking for Missing Data

print(df.isnull().sum())  # Count missing values per column

Removing Missing Values

df.dropna(inplace=True)  # Removes rows with missing values

Filling Missing Values

  • Fill with a specific value
df.fillna(0, inplace=True)  # Replace NaN with 0
  • Fill with column mean/median
df["Price"].fillna(df["Price"].mean(), inplace=True)

5. Removing Duplicates

Duplicate data can cause bias in analysis.

df.drop_duplicates(inplace=True)  # Remove duplicate rows

6. Handling Incorrect Data

Replacing Incorrect Values

df["Category"] = df["Category"].replace({"Eletronics": "Electronics"})

Filtering Outliers

df = df[df["Price"] < 50000]  # Remove products priced above 50,000

7. Converting Data Types

Pandas allows converting data types for consistency.

df["Date"] = pd.to_datetime(df["Date"])  # Convert to datetime
df["Price"] = df["Price"].astype(float)  # Convert to float

8. Renaming and Reordering Columns

Renaming Columns

df.rename(columns={"Cust_Name": "Customer Name"}, inplace=True)

Reordering Columns

df = df[["Customer Name", "Product", "Quantity", "Price"]]

9. Saving Cleaned Data

After cleaning, save the dataset for further analysis.

df.to_csv("cleaned_data.csv", index=False)  # Save as CSV
df.to_excel("cleaned_data.xlsx", index=False)  # Save as Excel

Conclusion

In this tutorial, we covered how to load, explore, clean, and save data using Pandas. Cleaning data is a crucial step before performing data analysis and visualization. In the next tutorial, we will explore data manipulation with NumPy and Pandas for better insights.