Exploratory Data Analysis (EDA) Techniques
Exploratory Data Analysis (EDA) is the process of understanding, summarizing, and visualizing data before applying machine learning or making decisions. EDA helps us identify patterns, missing values, outliers, and relationships in data.
In this tutorial, we will cover:
- Understanding the dataset
- Descriptive statistics
- Handling missing values
- Detecting and removing outliers
- Data visualization techniques
1. Understanding the Dataset
Before analyzing data, we need to load and inspect it. Let's use the Pandas library to load a dataset.
import pandas as pd
df = pd.read_csv("data.csv")  # Load dataset
print(df.head())  # Show first 5 rows
df.info()  # Summary of dataset (prints directly and returns None, so no print() is needed)
print(df.describe())  # Statistical summary of numeric columns
Key Methods to Understand Data
- df.head(n): Shows the first n rows
- df.tail(n): Shows the last n rows
- df.info(): Provides an overview of the dataset (column data types, non-null counts)
- df.describe(): Shows summary statistics such as count, mean, standard deviation, minimum, quartiles (the 50% row is the median), and maximum
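The snippets in this tutorial assume a file named data.csv with at least Salary and Age columns; its contents are not shown here. For readers without such a file, the following is a minimal stand-in that loads made-up CSV text from memory and runs the same inspection calls. The names and values are illustrative assumptions only.

```python
import io

import pandas as pd

# Made-up CSV text standing in for data.csv (values are illustrative only).
csv_text = """Name,Age,Salary
Asha,25,42000
Ben,31,55000
Chen,47,
Dina,38,61000
Eli,29,48000
"""

# read_csv accepts any file-like object, so StringIO works in place of a path.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)    # (number of rows, number of columns)
print(df.dtypes)   # data type of each column
print(df.head(3))  # first 3 rows
print(df.tail(2))  # last 2 rows
df.info()          # prints its report directly and returns None
```

Because one Salary value is missing, pandas reads that column as floating point rather than integer, which is a useful first hint that the column contains gaps.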
2. Descriptive Statistics
Descriptive statistics help summarize numerical data.
print(df["Salary"].mean()) # Average salary
print(df["Salary"].median()) # Middle value
print(df["Salary"].mode())  # Most frequent value(s); returns a Series because ties are possible
print(df["Salary"].std())  # Standard deviation (spread of data)
Understanding These Metrics
- Mean (Average): Sum of all values divided by total count
- Median: Middle value in sorted data (useful for skewed data)
- Mode: Most frequent value
- Standard Deviation: Shows how spread out values are
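To make the difference between these metrics concrete, the following sketch uses a small made-up Series of salaries that includes one very large value. The numbers are illustrative assumptions, not data from the tutorial.

```python
import pandas as pd

# Made-up salaries; the last value is deliberately extreme.
salaries = pd.Series([30000, 35000, 35000, 40000, 45000, 45000, 250000])

mean_salary = salaries.mean()      # pulled upward by the single large value
median_salary = salaries.median()  # middle of the sorted values
modes = salaries.mode()            # a Series: every value tied for most frequent
std_salary = salaries.std()        # sample standard deviation (ddof=1 by default)

print("Mean:", round(mean_salary, 2))
print("Median:", median_salary)
print("Modes:", modes.tolist())
print("Std:", round(std_salary, 2))
```

Here the mean (about 68,571) sits well above the median (40,000), which is why the median is usually the better summary for skewed data. The mode call returns two values, 35000 and 45000, because both appear twice.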
3. Handling Missing Values
Missing values can affect data quality. We can find and fix them.
print(df.isnull().sum())  # Count missing values in each column
Fixing Missing Data
Choose one of the following options, depending on the column:
- Remove rows that contain missing values
df_cleaned = df.dropna()
- Fill missing values with a specific number (e.g., 0)
df.fillna(0, inplace=True)
- Fill with the column’s mean, median, or mode
df["Salary"] = df["Salary"].fillna(df["Salary"].median())  # Assign back; inplace=True on a selected column is unreliable in recent pandas
4. Detecting and Removing Outliers
Outliers are values that are very different from the rest of the data. They can affect the accuracy of analysis.
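One common way to make "very different" concrete is the 1.5 * IQR rule, the same rule the removal step later in this section applies. As a minimal sketch on made-up values, the following flags outliers with a boolean mask without deleting anything, so they can be reviewed first.

```python
import pandas as pd

# Made-up salaries with one obviously extreme entry.
salaries = pd.Series([32000, 38000, 41000, 44000, 47000, 52000, 58000, 400000])

q1 = salaries.quantile(0.25)  # 25th percentile
q3 = salaries.quantile(0.75)  # 75th percentile
iqr = q3 - q1                 # interquartile range

lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# between() is inclusive on both ends by default, so values exactly on a
# limit count as normal; ~ inverts the mask to mark the outliers.
is_outlier = ~salaries.between(lower, upper)

print("Limits:", lower, upper)
print(salaries[is_outlier])
```

Only the 400000 entry falls outside the limits here. Whether such a value is an error or a genuine observation is a judgment call, so it is worth inspecting flagged rows before dropping them.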
Using Boxplot to Detect Outliers
import seaborn as sns
import matplotlib.pyplot as plt
sns.boxplot(x=df["Salary"])
plt.show()
Removing Outliers using the Interquartile Range (IQR)
Q1 = df["Salary"].quantile(0.25) # 25th percentile
Q3 = df["Salary"].quantile(0.75) # 75th percentile
IQR = Q3 - Q1 # Interquartile range
# Define lower and upper limits
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# Keep only rows inside the limits (values exactly on a limit are not outliers)
df_no_outliers = df[df["Salary"].between(lower_bound, upper_bound)]
5. Data Visualization Techniques
Visualizing data helps in understanding trends, distributions, and relationships.
Histogram (For Data Distribution)
df["Salary"].hist(bins=20)
plt.xlabel("Salary")
plt.ylabel("Frequency")
plt.title("Salary Distribution")
plt.show()
Scatter Plot (For Relationships Between Variables)
sns.scatterplot(x=df["Age"], y=df["Salary"])
plt.xlabel("Age")
plt.ylabel("Salary")
plt.title("Age vs Salary")
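plt.show()
Correlation Heatmap (For Finding Relationships Between Multiple Columns)
# numeric_only=True (pandas 1.5+) skips text columns, which otherwise raise an error in pandas 2.x
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm")
plt.title("Feature Correlation Heatmap")
plt.show()
Conclusion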
In this tutorial, we learned:
- How to load and inspect a dataset
- How to calculate descriptive statistics
- How to handle missing values
- How to detect and remove outliers
- How to visualize data using histograms, scatter plots, and heatmaps
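The cleaning steps listed above can be chained into one short pass. The sketch below is one possible arrangement, not the tutorial's own pipeline: the helper name clean_salary and the sample values are hypothetical, and it assumes a numeric Salary column.

```python
import pandas as pd


def clean_salary(df: pd.DataFrame, column: str = "Salary") -> pd.DataFrame:
    """Fill missing values with the median, then drop 1.5 * IQR outliers."""
    result = df.copy()  # leave the caller's DataFrame untouched

    # Step 1: fill gaps with the column median.
    result[column] = result[column].fillna(result[column].median())

    # Step 2: keep only rows inside the 1.5 * IQR limits.
    q1 = result[column].quantile(0.25)
    q3 = result[column].quantile(0.75)
    iqr = q3 - q1
    in_range = result[column].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    return result[in_range].reset_index(drop=True)


# Made-up data: one missing salary and one extreme salary.
raw = pd.DataFrame({
    "Age": [25, 31, 47, 38, 29, 52, 44],
    "Salary": [42000, 55000, float("nan"), 61000, 48000, 900000, 58000],
})

print(raw.isnull().sum())  # missing values per column before cleaning
cleaned = clean_salary(raw)
print(cleaned.describe())  # summary statistics after cleaning
```

Filling with the median before computing the limits is a deliberate choice in this sketch: the median is barely affected by the extreme value, whereas a mean-based fill would be pulled toward it.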
