Exploratory Data Analysis (EDA) is the process of understanding, summarizing, and visualizing data before applying machine learning or making decisions. EDA helps us identify patterns, missing values, outliers, and relationships in data.
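In this tutorial, we will cover loading and inspecting data, descriptive statistics, handling missing values, detecting outliers, and visualizing data.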
Before analyzing data, we need to load and inspect it. Let's use the Pandas library to load a dataset.
import pandas as pd
df = pd.read_csv("data.csv") # Load dataset
print(df.head()) # Show first 5 rows
df.info() # Summary of dataset (info() prints directly and returns None)
print(df.describe()) # Statistical summary

df.head(n): Shows the first n rows
df.tail(n): Shows the last n rows
df.info(): Provides an overview of the dataset (data types, non-null counts)
df.describe(): Shows statistics like mean, standard deviation, and quartiles (the 50% row is the median)

Descriptive statistics help summarize numerical data.
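As a minimal sketch of what these measures report, the example below uses a small made-up Salary series (the values are assumptions for illustration, not taken from data.csv). One large value pulls the mean well above the median, which is why the two are usually read together.

```python
import pandas as pd

# Made-up salaries for illustration only; not taken from data.csv
salaries = pd.Series([40000, 45000, 45000, 50000, 120000], name="Salary")

mean_salary = salaries.mean()      # Sum of the values divided by their count
median_salary = salaries.median()  # Middle value after sorting
mode_salary = salaries.mode()      # Series holding the most frequent value(s)
std_salary = salaries.std()        # Sample standard deviation (ddof=1 by default)

print(mean_salary)           # 60000.0
print(median_salary)         # 45000.0
print(mode_salary.tolist())  # [45000]
print(round(std_salary, 2))
```

Note that mode() returns a Series rather than a single number, because a column can have several equally frequent values.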
print(df["Salary"].mean()) # Average salary
print(df["Salary"].median()) # Middle value
print(df["Salary"].mode()) # Most frequent value(s), returned as a Series
print(df["Salary"].std()) # Standard deviation (spread of data)

Missing values can affect data quality. We can find and fix them.
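The sketch below illustrates the common options on a small made-up frame: counting gaps, dropping incomplete rows, filling with 0, and filling with each column's median. The `toy` data and the variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up frame with gaps, for illustration only
toy = pd.DataFrame({
    "Age": [25, 30, np.nan, 40],
    "Salary": [50000.0, np.nan, 60000.0, 70000.0],
})

missing_per_column = toy.isnull().sum()   # Number of NaN values in each column
dropped = toy.dropna()                    # Keep only rows with no missing values
zero_filled = toy.fillna(0)               # Replace every NaN with 0
median_filled = toy.fillna(toy.median())  # Replace NaN with that column's median

print(missing_per_column)
print(len(dropped))                    # 2
print(zero_filled["Salary"].mean())    # 45000.0
print(median_filled["Salary"].mean())  # 60000.0
```

Filling with 0 drags the average salary down to 45000, while the median fill leaves it at 60000. For a column like Salary, a median fill is often the less distorting choice.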
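print(df.isnull().sum()) # Check missing values in each column
# Option 1: drop rows that contain missing values
df_cleaned = df.dropna()
# Option 2: replace every missing value with 0 (kept in a separate frame)
df_filled = df.fillna(0)
# Option 3: replace missing salaries with the median salary
df["Salary"] = df["Salary"].fillna(df["Salary"].median())

Outliers are values that are very different from the rest of the data. They can affect the accuracy of analysis.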
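A common rule of thumb, and the one this tutorial applies to the Salary column, flags values that lie more than 1.5 × IQR below the first quartile or above the third. The sketch below checks that rule on made-up numbers small enough to verify by hand. The helper `iqr_bounds` is hypothetical and not part of pandas.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5):
    """Return (lower, upper) fences for the k * IQR rule (hypothetical helper)."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up salaries for illustration only
salaries = pd.Series([40000, 45000, 45000, 50000, 120000])

lower, upper = iqr_bounds(salaries)
outliers = salaries[(salaries < lower) | (salaries > upper)]
kept = salaries[(salaries >= lower) & (salaries <= upper)]

print(lower, upper)       # 37500.0 57500.0
print(outliers.tolist())  # [120000]
print(salaries.mean())    # 60000.0
print(kept.mean())        # 45000.0
```

Here Q1 is 45000 and Q3 is 50000, so the IQR is 5000 and the fences sit at 37500 and 57500. Removing the single flagged value moves the mean from 60000 to 45000, which shows how strongly one extreme value can skew a summary.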
import seaborn as sns
import matplotlib.pyplot as plt
sns.boxplot(x=df["Salary"])
plt.show()
Q1 = df["Salary"].quantile(0.25) # 25th percentile
Q3 = df["Salary"].quantile(0.75) # 75th percentile
IQR = Q3 - Q1 # Interquartile range
# Define lower and upper limits
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# Remove outliers
df_no_outliers = df[(df["Salary"] > lower_bound) & (df["Salary"] < upper_bound)]

Visualizing data helps in understanding trends, distributions, and relationships.
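The quantities that plots draw can also be inspected as plain numbers, which is a useful sanity check before reading a chart. The sketch below uses made-up Age and Salary values (assumptions for illustration) to compute the bin counts behind a histogram and the Pearson correlation matrix behind a correlation heatmap.

```python
import numpy as np
import pandas as pd

# Made-up, perfectly linear data for illustration only
toy = pd.DataFrame({
    "Age": [25, 30, 35, 40, 45],
    "Salary": [40000, 50000, 60000, 70000, 80000],
})

# Bin counts and bin edges: what a histogram draws as bars
counts, edges = np.histogram(toy["Salary"], bins=4)

# Pearson correlation matrix: what a correlation heatmap colors
corr = toy.corr()

print(counts.tolist())  # [1, 1, 1, 2]
print(edges.tolist())   # [40000.0, 50000.0, 60000.0, 70000.0, 80000.0]
print(corr)
```

The last bin holds two values because NumPy treats the final bin as closed on both ends. Since Salary rises by a fixed amount with Age in this toy data, the off-diagonal correlation comes out at essentially 1.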
df["Salary"].hist(bins=20)
plt.xlabel("Salary")
plt.ylabel("Frequency")
plt.title("Salary Distribution")
plt.show()
sns.scatterplot(x=df["Age"], y=df["Salary"])
plt.xlabel("Age")
plt.ylabel("Salary")
plt.title("Age vs Salary")
plt.show()
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm") # Numeric columns only (numeric_only needs pandas 1.5+)
plt.title("Feature Correlation Heatmap")
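plt.show()

In this tutorial, we learned how to load and inspect a dataset, compute descriptive statistics, handle missing values, detect and remove outliers, and visualize distributions and relationships.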