In real-world projects, raw data is often messy, inconsistent, and incomplete. Pandas, a powerful Python library, helps you load, clean, and transform data efficiently. This tutorial covers how to load, explore, clean, and save data with Pandas.
Before starting, ensure you have Pandas installed:
pip install pandas
Now, import Pandas in Python:
import pandas as pd
Pandas supports multiple data sources, including CSV, Excel, SQL, and JSON.
CSV (Comma-Separated Values) is one of the most common formats for storing tabular data.
df = pd.read_csv("data.csv") # Load CSV file
print(df.head()) # Display first 5 rows
Example: Loading a sales dataset (sales_data.csv) containing Customer Name, Product, Price, and Quantity.
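The sales example can be tried without a file on disk. This is a minimal sketch: the rows are made up, and `csv_text` and `sales` are illustrative names that stand in for the contents of sales_data.csv and the loaded DataFrame.

```python
import io

import pandas as pd

# Hypothetical contents standing in for sales_data.csv (made-up rows).
csv_text = """Customer Name,Product,Price,Quantity
Alice,Laptop,1200.50,1
Bob,Mouse,25.00,2
Carol,Keyboard,45.99,1
"""

# pd.read_csv accepts a path or a file-like object, so StringIO works here.
sales = pd.read_csv(io.StringIO(csv_text))

print(sales.head())   # all three rows, since the frame is small
print(sales.shape)    # (3, 4)
```

With a real file, the `io.StringIO(csv_text)` argument would simply be replaced by the file path.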
df = pd.read_excel("data.xlsx", sheet_name="Sheet1") # Load an Excel sheet (.xlsx needs the openpyxl package)
df = pd.read_json("data.json") # Load a JSON file
import sqlite3
conn = sqlite3.connect("database.db") # Connect to a database
df = pd.read_sql("SELECT * FROM customers", conn) # Load the query result
After loading, it's important to explore the dataset.
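To try loading from SQL and taking a first look at the result without any files on disk, one option is an in-memory SQLite database. This is only a sketch: the table name matches the query above, but the columns and rows are made up for illustration.

```python
import sqlite3

import pandas as pd

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, city TEXT, orders INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [("Alice", "Pune", 3), ("Bob", "Delhi", 1)],  # made-up rows
)
conn.commit()

customers = pd.read_sql("SELECT * FROM customers", conn)
conn.close()

print(customers.shape)          # (2, 3)
print(list(customers.columns))  # ['name', 'city', 'orders']
print(customers.dtypes)         # one dtype per column
```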
df.info() # Basic info about dataset
df.describe() # Summary statistics
df.shape # (rows, columns)
df.columns # Column names
df.dtypes # Data types of columns
Missing values can cause incorrect analysis. Pandas provides ways to handle them.
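A small sketch of why this matters, using made-up prices: the way a missing value is treated changes a summary statistic such as the mean. The variable names here are illustrative only.

```python
import numpy as np
import pandas as pd

prices = pd.Series([100.0, np.nan, 300.0], name="Price")  # made-up values

missing_count = prices.isnull().sum()           # 1
mean_skipping_nan = prices.mean()               # NaN is skipped: (100 + 300) / 2 = 200.0
mean_after_zero_fill = prices.fillna(0).mean()  # (100 + 0 + 300) / 3, about 133.33

print(missing_count, mean_skipping_nan, round(mean_after_zero_fill, 2))
```

Neither number is automatically the right one; which treatment is appropriate depends on what a missing price means in the data.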
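print(df.isnull().sum()) # Count missing values per column
The following are alternatives; choose one per column, because once rows are dropped or NaN is replaced there is nothing left for a later fill to act on.
df.dropna(inplace=True) # Option 1: remove rows with missing values
df.fillna(0, inplace=True) # Option 2: replace NaN with 0
df["Price"] = df["Price"].fillna(df["Price"].mean()) # Option 3: fill a column with its mean
Duplicate data can cause bias in analysis.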
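As an illustration of that bias, the sketch below uses a made-up orders table in which one row was entered twice, and compares total revenue before and after removing the repeat. The column names follow the sales example; the values are invented.

```python
import pandas as pd

orders = pd.DataFrame({
    "Customer Name": ["Alice", "Bob", "Bob"],
    "Product": ["Laptop", "Mouse", "Mouse"],
    "Price": [1200.0, 25.0, 25.0],
    "Quantity": [1, 2, 2],
})  # the last two rows are identical (made-up data)

dupe_count = orders.duplicated().sum()  # 1 repeated row
revenue_with_dupes = (orders["Price"] * orders["Quantity"]).sum()  # 1200 + 50 + 50 = 1300.0

deduped = orders.drop_duplicates()
revenue_deduped = (deduped["Price"] * deduped["Quantity"]).sum()   # 1200 + 50 = 1250.0

print(dupe_count, revenue_with_dupes, revenue_deduped)
```

Identical rows are not always errors (a customer may place the same order twice), so it is worth checking for an order ID or timestamp column before dropping them.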
df.drop_duplicates(inplace=True) # Remove duplicate rows
df["Category"] = df["Category"].replace({"Eletronics": "Electronics"}) # Fix a misspelled category
df = df[df["Price"] < 50000] # Keep only products priced below 50,000
Pandas allows converting data types for consistency.
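One reason this matters: numbers stored as text sort alphabetically rather than by value. The sketch below uses made-up data to show the difference before and after conversion; `raw` and the other names are illustrative.

```python
import pandas as pd

raw = pd.DataFrame({
    "Date": ["2024-03-01", "2024-01-15", "2024-02-10"],
    "Price": ["100", "20", "3"],  # prices read in as text (made-up values)
})

text_order = raw["Price"].sort_values().tolist()     # ['100', '20', '3'], alphabetical

raw["Price"] = raw["Price"].astype(float)
raw["Date"] = pd.to_datetime(raw["Date"])

numeric_order = raw["Price"].sort_values().tolist()  # [3.0, 20.0, 100.0], by value
earliest_month = raw["Date"].min().month             # 1, date arithmetic now works

print(text_order, numeric_order, earliest_month)
```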
df["Date"] = pd.to_datetime(df["Date"]) # Convert to datetime
df["Price"] = df["Price"].astype(float) # Convert to float
df.rename(columns={"Cust_Name": "Customer Name"}, inplace=True) # Rename a column
df = df[["Customer Name", "Product", "Quantity", "Price"]] # Select and reorder columns
After cleaning, save the dataset for further analysis.
df.to_csv("cleaned_data.csv", index=False) # Save as CSV
df.to_excel("cleaned_data.xlsx", index=False) # Save as Excel
In this tutorial, we covered how to load, explore, clean, and save data using Pandas. Cleaning data is a crucial step before performing data analysis and visualization. In the next tutorial, we will explore data manipulation with NumPy and Pandas for better insights.
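The steps covered in this tutorial can be combined into a single pass. The sketch below is one possible ordering, not the only one; the data is made up, and an in-memory buffer stands in for cleaned_data.csv so that nothing is written to disk.

```python
import io

import pandas as pd

# Made-up raw data with a duplicate row, a misspelled category, and missing prices.
raw_csv = """Cust_Name,Product,Category,Price,Quantity
Alice,Laptop,Eletronics,1200,1
Bob,Mouse,Electronics,,2
Bob,Mouse,Electronics,,2
Carol,Desk,Furniture,300,1
"""

df = pd.read_csv(io.StringIO(raw_csv))

df = df.drop_duplicates()                                               # 4 rows -> 3
df["Category"] = df["Category"].replace({"Eletronics": "Electronics"})  # fix the typo
df["Price"] = df["Price"].fillna(df["Price"].mean())                    # mean of 1200 and 300 is 750
df = df.rename(columns={"Cust_Name": "Customer Name"})
df = df[["Customer Name", "Product", "Category", "Quantity", "Price"]]

# Save to a buffer; pass a file path instead to write a real CSV.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
cleaned_text = buffer.getvalue()
print(cleaned_text)
```

Duplicates are removed before the mean is computed here so that repeated rows do not influence the fill value; a different order could give a different result on other data.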