Exploratory Data Analysis (EDA) Techniques

Exploratory Data Analysis (EDA) is the process of understanding, summarizing, and visualizing data before applying machine learning or making decisions. EDA helps us identify patterns, missing values, outliers, and relationships in data.

In this tutorial, we will cover:

Understanding the dataset
Descriptive statistics
Handling missing values
Detecting and removing outliers
Data visualization techniques

1. Understanding the Dataset

Before analyzing data, we need to load and inspect it. Let's use the pandas library to load a dataset from a CSV file.

import pandas as pd

df = pd.read_csv("data.csv")  # Load dataset
print(df.head())  # Show first 5 rows
df.info()  # Summary of dataset (info() prints directly and returns None)
print(df.describe())  # Statistical summary

Key Methods to Understand Data

df.head(n): Shows the first n rows
df.tail(n): Shows the last n rows
df.info(): Provides an overview of the dataset (column data types, non-null counts, memory usage)
df.describe(): Shows summary statistics for numeric columns: count, mean, standard deviation, min, quartiles (the 50% row is the median), and max
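
If no data.csv is at hand, the same inspection methods can be tried on a small in-memory table. The sketch below is only an illustration: the column names (Name, Age, Salary) and all values are made up for the example and do not come from a real dataset.

```python
import pandas as pd

# Hypothetical sample data; names and values are invented for illustration.
sample = pd.DataFrame(
    {
        "Name": ["Ana", "Ben", "Cara", "Dev"],
        "Age": [25, 32, 47, 51],
        "Salary": [40000.0, 52000.0, None, 75000.0],
    }
)

print(sample.shape)       # (number of rows, number of columns)
print(sample.dtypes)      # Data type of each column
sample.info()             # Non-null counts reveal the missing Salary
print(sample.describe())  # Numeric columns only, by default

n_rows, n_cols = sample.shape
non_null_salary = sample["Salary"].count()  # count() skips missing values
```

Comparing the non-null count of a column with the total number of rows is a quick way to spot missing data before going further.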

2. Descriptive Statistics

Descriptive statistics help summarize numerical data.

print(df["Salary"].mean())  # Average salary
print(df["Salary"].median())  # Middle value
print(df["Salary"].mode())  # Most frequent value(s); returns a Series because ties are possible
print(df["Salary"].std())  # Standard deviation (spread of data)

Understanding These Metrics

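Mean (Average): Sum of all values divided by the number of values
Median: Middle value in sorted data (less sensitive to extreme values than the mean, so useful for skewed data)
Mode: Most frequent value (there can be more than one)
Standard Deviation: Shows how spread out values are around the mean (pandas computes the sample standard deviation, ddof=1, by default)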
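
To see why the median is often preferred for skewed data, here is a minimal sketch with invented salary figures: five similar values and one very large one. The numbers are assumptions chosen only to make the effect visible.

```python
import pandas as pd

# Hypothetical salaries: five similar values and one very large one.
salaries = pd.Series([40000, 42000, 45000, 45000, 48000, 500000])

mean_salary = salaries.mean()      # Pulled upward by the single large value
median_salary = salaries.median()  # Average of the two middle values
mode_salary = salaries.mode()      # Series holding the most frequent value(s)
std_salary = salaries.std()        # Sample standard deviation (ddof=1)

print(mean_salary)    # 120000.0
print(median_salary)  # 45000.0
print(mode_salary.tolist())
print(std_salary)
```

The mean lands far above what most people in this sample earn, while the median stays close to the typical value.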

3. Handling Missing Values

Missing values can bias statistics and cause errors in later steps. First count them per column, then decide whether to drop or fill them.

print(df.isnull().sum())  # Check missing values in each column

Fixing Missing Data

Remove rows that contain missing values

df_cleaned = df.dropna()

Fill missing values with a specific number (e.g., 0)

df.fillna(0, inplace=True)

Fill with the column’s mean, median, or mode

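# Assign the result back; calling fillna(inplace=True) on a selected column may not update df in recent pandas versions
df["Salary"] = df["Salary"].fillna(df["Salary"].median())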
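
The strategies above can be compared side by side on a small table. This is a sketch under stated assumptions: df_demo and its values are hypothetical, and the choice of median for Salary and mean for Age is only one reasonable option.

```python
import pandas as pd

# Hypothetical data: two missing salaries and one missing age.
df_demo = pd.DataFrame(
    {
        "Age": [25.0, 32.0, None, 51.0, 38.0],
        "Salary": [40000.0, None, 58000.0, None, 75000.0],
    }
)

missing_per_column = df_demo.isnull().sum()  # Missing count per column
print(missing_per_column)

# Option 1: drop every row that has at least one missing value.
dropped = df_demo.dropna()

# Option 2: fill each column with a statistic computed from that column.
filled = df_demo.copy()  # Work on a copy so df_demo stays unchanged
filled["Salary"] = filled["Salary"].fillna(filled["Salary"].median())
filled["Age"] = filled["Age"].fillna(filled["Age"].mean())

print(len(dropped), "rows left after dropna()")
print(filled)
```

Dropping rows is simple but loses data here (three of five rows), whereas filling keeps every row at the cost of introducing estimated values.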

4. Detecting and Removing Outliers

Outliers are values that are very different from the rest of the data. They can distort statistics such as the mean and standard deviation and mislead the analysis.

Using Boxplot to Detect Outliers

import seaborn as sns
import matplotlib.pyplot as plt

sns.boxplot(x=df["Salary"])
plt.show()

Removing Outliers using the Interquartile Range (IQR)

Q1 = df["Salary"].quantile(0.25)  # 25th percentile
Q3 = df["Salary"].quantile(0.75)  # 75th percentile
IQR = Q3 - Q1  # Interquartile range

# Define lower and upper limits
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Keep rows within the limits (between() is inclusive, so values exactly on a limit are kept)
df_no_outliers = df[df["Salary"].between(lower_bound, upper_bound)]
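
The IQR rule can be checked by hand on a few numbers. In the sketch below, iqr_bounds is a hypothetical helper written for this example (it is not a pandas function), and the values are invented salaries in thousands.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) limits using the k * IQR rule."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical salaries in thousands; 200 is far from the rest.
salaries = pd.Series([30, 35, 40, 45, 50, 55, 60, 200])

lower, upper = iqr_bounds(salaries)
is_outlier = ~salaries.between(lower, upper)  # True outside the limits

outliers = salaries[is_outlier]
kept = salaries[~is_outlier]

print(lower, upper)       # 12.5 82.5
print(outliers.tolist())  # [200]
```

Here Q1 is 38.75 and Q3 is 56.25, so the IQR is 17.5 and the limits are 12.5 and 82.5. Inspecting the flagged values before dropping them is usually worthwhile, since an outlier may be a genuine observation rather than an error.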

5. Data Visualization Techniques

Visualizing data helps in understanding trends, distributions, and relationships.

Histogram (For Data Distribution)

df["Salary"].hist(bins=20)
plt.xlabel("Salary")
plt.ylabel("Frequency")
plt.title("Salary Distribution")
plt.show()

Scatter Plot (For Relationships Between Variables)

sns.scatterplot(x=df["Age"], y=df["Salary"])
plt.xlabel("Age")
plt.ylabel("Salary")
plt.title("Age vs Salary")
plt.show()

Correlation Heatmap (For Finding Relationships Between Multiple Columns)

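# numeric_only=True skips text columns, which would otherwise raise an error in pandas 2.0+
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm")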
plt.title("Feature Correlation Heatmap")
plt.show()
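
A heatmap only colours the numbers in a correlation matrix, so it can help to look at that matrix directly. The sketch below uses a hypothetical table (the columns Department, Age, Salary, and SickDays and their values are invented) whose relationships are deliberately perfectly linear so the result is easy to verify.

```python
import pandas as pd

# Hypothetical data with exact linear relationships, for illustration only.
people = pd.DataFrame(
    {
        "Department": ["HR", "IT", "IT", "Sales", "HR"],
        "Age": [25, 30, 35, 40, 45],
        "Salary": [40000, 50000, 60000, 70000, 80000],
        "SickDays": [10, 8, 6, 4, 2],
    }
)

# Keep numeric columns only; text columns cannot be correlated.
numeric = people.select_dtypes(include="number")
corr = numeric.corr()  # Pearson correlation by default

print(corr.round(2))
```

Values near 1 mean two columns rise together, values near -1 mean one rises as the other falls, and values near 0 suggest no linear relationship. Real data rarely shows correlations of exactly 1 or -1 as this toy table does.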

Conclusion

In this tutorial, we learned:

How to load and inspect a dataset
How to calculate descriptive statistics
How to handle missing values
How to detect and remove outliers
How to visualize data using histograms, scatter plots, and heatmaps
