Descriptive Statistics and Data Summarization
Introduction
Descriptive statistics summarize the key characteristics of a dataset. They include measures such as the mean, median, mode, standard deviation, and percentiles. With NumPy, Pandas, and Seaborn, we can compute these measures and visualize distributions in a few lines of code.
In this tutorial, we will cover:
- Introduction to Descriptive Statistics
- Measures of Central Tendency (Mean, Median, Mode)
- Measures of Dispersion (Variance, Standard Deviation, Range, IQR)
- Summary Statistics with Pandas
- Data Distribution Visualization
- Percentiles and Quartiles
1. Introduction to Descriptive Statistics
Descriptive statistics summarize data without making predictions. Some commonly used measures include:
- Central Tendency: Mean, Median, Mode
- Dispersion: Variance, Standard Deviation, Range, Interquartile Range (IQR)
- Data Distribution: Percentiles, Quartiles, Skewness
We will use the Pandas library for data summarization and Seaborn for visualization.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Load sample dataset
df = sns.load_dataset("tips")
df.head()
2. Measures of Central Tendency
Mean (Average)
The mean is the sum of all values divided by the total count.
mean_value = df['total_bill'].mean()
print("Mean of total bill:", mean_value)
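To make the definition concrete, the sketch below computes the mean by hand and compares it with Series.mean(). The `bills` values are a small hypothetical sample, not rows from the tips dataset.

```python
import pandas as pd

# Hypothetical sample of bill amounts (not taken from the tips dataset)
bills = pd.Series([10.0, 12.5, 20.0, 17.5])

# Mean = sum of all values / number of values
manual_mean = bills.sum() / bills.count()

# Series.mean() applies the same definition
pandas_mean = bills.mean()

print("Manual mean:", manual_mean)
print("Pandas mean:", pandas_mean)
```

Both lines should print the same number, since the sum is 60.0 over 4 values.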
Median
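The median is the middle value when the data is sorted. With an even number of values, it is the average of the two middle values. Unlike the mean, it is barely affected by extreme values.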
median_value = df['total_bill'].median()
print("Median of total bill:", median_value)
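A short sketch of how the median behaves for odd and even counts, and how it compares with the mean when one value is extreme. All three samples are made up for illustration.

```python
import pandas as pd

# Hypothetical samples chosen so the results are easy to check by hand
odd_sample = pd.Series([7, 3, 9])        # sorted: 3, 7, 9 -> middle value is 7
even_sample = pd.Series([7, 3, 9, 11])   # sorted: 3, 7, 9, 11 -> average of 7 and 9

odd_median = odd_sample.median()
even_median = even_sample.median()

# One extreme value pulls the mean up but leaves the median near the bulk of the data
with_outlier = pd.Series([10, 12, 14, 100])
outlier_mean = with_outlier.mean()
outlier_median = with_outlier.median()

print("Odd-count median:", odd_median)
print("Even-count median:", even_median)
print("Mean with outlier:", outlier_mean)
print("Median with outlier:", outlier_median)
```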
Mode
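The mode is the most frequently occurring value in a dataset. A dataset can have more than one mode: .mode() returns all of them in ascending order, so [0] selects the smallest. For continuous columns such as total_bill, where most values occur only once or twice, the mode is often less informative than the mean or median.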
mode_value = df['total_bill'].mode()[0]
print("Mode of total bill:", mode_value)
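The following sketch shows what happens when several values tie for most frequent. The `sizes` series is a hypothetical example.

```python
import pandas as pd

# Hypothetical party sizes: 2 and 3 both appear twice
sizes = pd.Series([2, 3, 2, 4, 3, 5])

# mode() returns a Series containing every tied value, sorted ascending
modes = sizes.mode()

# Indexing with [0] keeps only the smallest of the tied modes
first_mode = modes[0]

# value_counts() shows how often each value occurs
top_count = sizes.value_counts().max()

print("All modes:", modes.tolist())
print("First mode:", first_mode)
print("Occurrences of each mode:", top_count)
```

When ties matter for the analysis, inspecting the full result of .mode() or .value_counts() is usually safer than taking only the first entry.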
3. Measures of Dispersion
Variance
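Variance is the average squared deviation of the data points from the mean. By default, Pandas computes the sample variance (ddof=1, which divides by n - 1).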
variance_value = df['total_bill'].var()
print("Variance of total bill:", variance_value)
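Pandas and NumPy use different defaults for the divisor, which can cause small mismatches between results. The sketch below illustrates this on a hypothetical four-value sample.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: mean is 4, squared deviations are 4, 0, 0, 4 (sum = 8)
values = pd.Series([2.0, 4.0, 4.0, 6.0])

sample_var = values.var()                 # Pandas default ddof=1 -> 8 / 3
population_var = values.var(ddof=0)       # divide by n -> 8 / 4
numpy_var = np.var(values.to_numpy())     # NumPy default ddof=0 -> 8 / 4

# Standard deviation is the square root of the variance
sample_std = values.std()

print("Sample variance (ddof=1):", sample_var)
print("Population variance (ddof=0):", population_var)
print("NumPy variance (default):", numpy_var)
print("Sample standard deviation:", sample_std)
```

Passing ddof explicitly on both sides is one way to keep Pandas and NumPy results comparable.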
Standard Deviation
Standard deviation is the square root of the variance, so it measures the spread of data around the mean in the same units as the data itself.
std_dev = df['total_bill'].std()
print("Standard Deviation of total bill:", std_dev)
Range
Range is the difference between the maximum and minimum values.
data_range = df['total_bill'].max() - df['total_bill'].min()
print("Range of total bill:", data_range)
Interquartile Range (IQR)
IQR measures the spread of the middle 50% of data (Q3 - Q1).
Q1 = df['total_bill'].quantile(0.25)
Q3 = df['total_bill'].quantile(0.75)
IQR = Q3 - Q1
print("Interquartile Range (IQR):", IQR)
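A common use of the IQR is flagging potential outliers with the 1.5 x IQR rule, the same convention box plots typically use for their whiskers. The sketch below applies it to a hypothetical sample. The 1.5 multiplier is a convention, not a fixed requirement.

```python
import pandas as pd

# Hypothetical bill amounts with one unusually large value
bills = pd.Series([10, 12, 13, 14, 15, 16, 18, 50])

q1 = bills.quantile(0.25)
q3 = bills.quantile(0.75)
iqr = q3 - q1

# Fences: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = bills[(bills < lower_fence) | (bills > upper_fence)]

print("IQR:", iqr)
print("Fences:", lower_fence, upper_fence)
print("Flagged values:", outliers.tolist())
```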
4. Summary Statistics with Pandas
The .describe() method provides a quick summary of numerical columns.
df.describe()
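To get a summary of categorical columns, select them by dtype. In the tips dataset these columns use the category dtype rather than object, so include both:
df.describe(include=['object', 'category'])  # text (object) and categorical columns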
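Beyond the defaults, .describe() accepts a percentiles argument, and .groupby().agg() can produce the same kind of summary per group. The sketch below uses a hypothetical miniature DataFrame with column names borrowed from the tips dataset.

```python
import pandas as pd

# Hypothetical miniature stand-in for the tips data
sample = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri", "Fri", "Sat", "Sat"],
    "total_bill": [10.0, 20.0, 15.0, 25.0, 30.0, 50.0],
})

# Request the 10th and 90th percentiles instead of the default quartiles
summary = sample["total_bill"].describe(percentiles=[0.1, 0.9])

# Summarize total_bill separately for each day
by_day = sample.groupby("day")["total_bill"].agg(["mean", "median", "std"])

print(summary)
print(by_day)
```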
5. Data Distribution Visualization
Histogram (Visualizing Distribution)
sns.histplot(df['total_bill'], bins=20, kde=True)
plt.title("Distribution of Total Bill")
plt.show()
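The asymmetry visible in a histogram can also be expressed as a number. Skewness, listed among the distribution measures in the introduction, is available through Series.skew(). The sketch below compares two hypothetical samples: a value near 0 suggests a symmetric distribution, and a positive value suggests a longer right tail.

```python
import pandas as pd

# Hypothetical samples: one symmetric, one with a long right tail
symmetric = pd.Series([1, 2, 3, 4, 5])
right_skewed = pd.Series([1, 2, 3, 4, 20])

symmetric_skew = symmetric.skew()
right_skew = right_skewed.skew()

print("Skewness (symmetric):", symmetric_skew)
print("Skewness (right-skewed):", right_skew)

# In a right-skewed sample the mean is pulled above the median
print("Mean vs median:", right_skewed.mean(), right_skewed.median())
```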
Box Plot (Detecting Outliers)
sns.boxplot(y=df['total_bill'])
plt.title("Box Plot of Total Bill")
plt.show()
Pair Plot (Checking Relationships)
sns.pairplot(df)
plt.show()
6. Percentiles and Quartiles
The p-th percentile is the value below which roughly p% of the observations fall. Quartiles are the 25th, 50th (median), and 75th percentiles.
percentiles = df['total_bill'].quantile([0.25, 0.5, 0.75, 0.95])
print(percentiles)
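As a small check on how percentiles are computed, the sketch below uses a hypothetical five-value series and compares Series.quantile() with np.percentile(). Both use linear interpolation between neighboring values by default. Note that quantile() takes fractions (0 to 1) while np.percentile() takes percentages (0 to 100).

```python
import numpy as np
import pandas as pd

# Hypothetical evenly spaced scores, easy to verify by hand
scores = pd.Series([10, 20, 30, 40, 50])

# Quartiles: 25th, 50th, and 75th percentiles
quartiles = scores.quantile([0.25, 0.5, 0.75])

# 90th percentile falls between 40 and 50 and is interpolated linearly
p90_pandas = scores.quantile(0.90)
p90_numpy = np.percentile(scores.to_numpy(), 90)

print(quartiles)
print("90th percentile (Pandas):", p90_pandas)
print("90th percentile (NumPy):", p90_numpy)
```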
Conclusion
- Descriptive statistics help summarize and understand datasets.
- Mean, median, and mode measure central tendency.
- Variance, standard deviation, and IQR measure data spread.
- The Pandas .describe() method provides a quick statistical summary.
- Histograms and box plots visualize distributions and outliers.