Descriptive Statistics and Data Summarization

Introduction

Descriptive statistics summarize the key properties of a dataset. They include measures such as the mean, median, mode, standard deviation, and percentiles. With NumPy, Pandas, and Seaborn, we can compute and visualize these measures in a few lines of code.

In this tutorial, we will cover:

  1. Introduction to Descriptive Statistics
  2. Measures of Central Tendency (Mean, Median, Mode)
  3. Measures of Dispersion (Variance, Standard Deviation, Range, IQR)
  4. Summary Statistics with Pandas
  5. Data Distribution Visualization
  6. Percentiles and Quartiles

1. Introduction to Descriptive Statistics

Descriptive statistics summarize data without making predictions. Some commonly used measures include:

  • Central Tendency: Mean, Median, Mode
  • Dispersion: Variance, Standard Deviation, Range, IQR
  • Data Distribution: Percentiles, Quartiles, Skewness

We will use the Pandas library for data summarization and Seaborn for visualization.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the sample "tips" dataset (fetched from seaborn's online data repository on first use, then cached)
df = sns.load_dataset("tips")
df.head()

2. Measures of Central Tendency

Mean (Average)

The mean is the sum of all values divided by the total count.

mean_value = df['total_bill'].mean()
print("Mean of total bill:", mean_value)
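To make the definition concrete, here is a minimal sketch that computes the mean by hand and compares it with the Pandas result. The bill amounts are hypothetical values chosen for illustration, not taken from the tips dataset.

```python
import pandas as pd

# Hypothetical bill amounts, used only for illustration
bills = pd.Series([10.0, 12.0, 14.0, 16.0, 48.0])

# Sum of all values divided by the number of values
manual_mean = bills.sum() / bills.count()
pandas_mean = bills.mean()

print("Manual mean:", manual_mean)   # 100.0 / 5 = 20.0
print("Pandas mean:", pandas_mean)
```

Note that the single large value (48.0) pulls the mean above four of the five observations, which is why the mean is usually read alongside the median.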

Median

The median is the middle value when the data is sorted. With an even number of values, it is the average of the two middle values.

median_value = df['total_bill'].median()
print("Median of total bill:", median_value)
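The short sketch below, using made-up numbers, shows how the median behaves for odd and even counts and how little it reacts to a single extreme value compared with the mean.

```python
import pandas as pd

# Hypothetical values, used only for illustration
odd_count = pd.Series([9, 1, 5, 3, 7])    # sorted: 1, 3, 5, 7, 9
even_count = pd.Series([7, 1, 5, 3])      # sorted: 1, 3, 5, 7

median_odd = odd_count.median()     # the single middle value
median_even = even_count.median()   # average of the two middle values (3 and 5)

# Replace the largest value with an extreme one
with_outlier = pd.Series([1, 3, 5, 7, 900])
median_outlier = with_outlier.median()
mean_outlier = with_outlier.mean()

print("Median (odd count):", median_odd)
print("Median (even count):", median_even)
print("Median with outlier:", median_outlier, "| Mean with outlier:", mean_outlier)
```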

Mode

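The mode is the most frequently occurring value in a dataset. A dataset can have several modes, so .mode() returns a Series of all of them in ascending order.

# Take the first (smallest) mode; inspect the full Series if ties matter
mode_value = df['total_bill'].mode()[0]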
print("Mode of total bill:", mode_value)
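Because .mode() can return more than one value, indexing with [0] silently drops any ties. The following sketch, on a small hypothetical Series, shows what the full result looks like when two values are equally frequent.

```python
import pandas as pd

# Hypothetical values: 1 and 3 both appear twice
values = pd.Series([3, 1, 3, 2, 1])

modes = values.mode()    # Series holding every tied mode, sorted ascending
first_mode = modes[0]    # only the smallest of the tied modes

print("All modes:", modes.tolist())
print("First mode:", first_mode)
```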

3. Measures of Dispersion

Variance

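Variance is the average squared deviation of the data points from the mean. Pandas computes the sample variance by default (ddof=1, which divides by n - 1 instead of n).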

variance_value = df['total_bill'].var()
print("Variance of total bill:", variance_value)
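Pandas and NumPy use different defaults for variance, which is a common source of small mismatches. This sketch, on hypothetical data, compares the two conventions: Series.var() defaults to ddof=1 (sample variance), while np.var() defaults to ddof=0 (population variance).

```python
import numpy as np
import pandas as pd

# Hypothetical values with mean 5 and squared deviations summing to 32
data = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_var = data.var()               # ddof=1: divides by n - 1
population_var = data.var(ddof=0)     # ddof=0: divides by n
numpy_var = np.var(data.to_numpy())   # NumPy default is ddof=0

print("Sample variance:", sample_var)
print("Population variance:", population_var)
print("NumPy default variance:", numpy_var)
```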

Standard Deviation

Standard deviation is the square root of the variance, so it measures the spread of data around the mean in the same units as the data itself.

std_dev = df['total_bill'].std()
print("Standard Deviation of total bill:", std_dev)
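As a quick check of the relationship between the two measures, this sketch (hypothetical data) confirms that .std() matches the square root of .var() when both use the same ddof.

```python
import numpy as np
import pandas as pd

# Hypothetical values, used only for illustration
data = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_std = data.std()                  # ddof=1 by default
root_of_variance = np.sqrt(data.var())   # same ddof, so the two should agree
population_std = data.std(ddof=0)        # divides by n instead of n - 1

print("Sample std:", sample_std)
print("sqrt(sample variance):", root_of_variance)
print("Population std:", population_std)
```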

Range

Range is the difference between the maximum and minimum values.

data_range = df['total_bill'].max() - df['total_bill'].min()
print("Range of total bill:", data_range)

Interquartile Range (IQR)

IQR measures the spread of the middle 50% of data (Q3 - Q1).

Q1 = df['total_bill'].quantile(0.25)
Q3 = df['total_bill'].quantile(0.75)
IQR = Q3 - Q1
print("Interquartile Range (IQR):", IQR)
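A common use of the IQR is flagging potential outliers with the 1.5 x IQR rule, which is also the default whisker rule in Seaborn box plots. The sketch below applies that rule to hypothetical data; the 1.5 multiplier is a convention, not a requirement, and the variable names are illustrative.

```python
import pandas as pd

# Hypothetical values with one obviously extreme observation
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles (conventional choice)
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Keep only the observations that fall outside the fences
outliers = values[(values < lower_fence) | (values > upper_fence)]

print("IQR:", iqr)
print("Fences:", lower_fence, upper_fence)
print("Flagged values:", outliers.tolist())
```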

4. Summary Statistics with Pandas

The .describe() method summarizes each numerical column: count, mean, standard deviation, minimum, quartiles, and maximum.

df.describe()

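To summarize non-numeric columns, pass their dtypes to include. In the tips dataset these columns (sex, smoker, day, time) use the category dtype rather than object, so include=['O'] alone would find no matching columns:

df.describe(include=['object', 'category'])  # string (object) and categorical columns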
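To see what .describe() returns without downloading anything, here is a minimal sketch on a small hypothetical DataFrame. The column names amount and day and all values are invented for illustration, and the sketch assumes the non-numeric column is stored with the category dtype.

```python
import pandas as pd

# Hypothetical data: one numeric column and one categorical column
df_small = pd.DataFrame({
    "amount": [10.0, 20.0, 30.0, 40.0],
    "day": pd.Categorical(["Sat", "Sun", "Sat", "Sat"]),
})

# Default: numeric columns only
numeric_summary = df_small.describe()

# Categorical columns: count, number of unique values, most common value, its frequency
categorical_summary = df_small.describe(include=["category"])

print(numeric_summary)
print(categorical_summary)
```

The numeric summary is indexed by count, mean, std, min, 25%, 50%, 75%, and max, while the categorical summary is indexed by count, unique, top, and freq.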

5. Data Distribution Visualization

Histogram (Visualizing Distribution)

sns.histplot(df['total_bill'], bins=20, kde=True)
plt.title("Distribution of Total Bill")
plt.show()
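The asymmetry visible in a histogram can also be expressed as a number. Skewness, listed among the distribution measures in the introduction, is positive when the right tail is longer, which typically places the mean above the median. The sketch below uses hypothetical right-skewed data and the Pandas .skew() method.

```python
import pandas as pd

# Hypothetical right-skewed values: most are small, one is large
amounts = pd.Series([1, 2, 2, 3, 3, 3, 4, 20])

skewness = amounts.skew()        # sample skewness; > 0 suggests a longer right tail
mean_amount = amounts.mean()
median_amount = amounts.median()

print("Skewness:", skewness)
print("Mean:", mean_amount, "| Median:", median_amount)
```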

Box Plot (Detecting Outliers)

sns.boxplot(y=df['total_bill'])
plt.title("Box Plot of Total Bill")
plt.show()

Pair Plot (Checking Relationships)

sns.pairplot(df)
plt.show()

6. Percentiles and Quartiles

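A percentile is the value below which a given percentage of observations fall. The 25th, 50th, and 75th percentiles are the quartiles Q1, Q2 (the median), and Q3.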

percentiles = df['total_bill'].quantile([0.25, 0.5, 0.75, 0.95])
print(percentiles)
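Pandas expresses quantiles as fractions between 0 and 1, while NumPy's np.percentile takes values between 0 and 100. This sketch, on a hypothetical Series of the integers 0 to 100, shows that the two agree and that the 50th percentile is the median.

```python
import numpy as np
import pandas as pd

# Hypothetical scores: the integers 0, 1, ..., 100
scores = pd.Series(np.arange(0, 101))

quartiles = scores.quantile([0.25, 0.5, 0.75])      # Q1, Q2 (median), Q3
p95_pandas = scores.quantile(0.95)                  # fraction in [0, 1]
p95_numpy = np.percentile(scores.to_numpy(), 95)    # percentage in [0, 100]
median_score = scores.median()

print(quartiles)
print("95th percentile (Pandas):", p95_pandas)
print("95th percentile (NumPy):", p95_numpy)
print("Median:", median_score)
```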

Conclusion

  • Descriptive statistics help summarize and understand datasets.
  • The mean, median, and mode measure central tendency.
  • Variance, standard deviation, range, and IQR measure data spread.
  • The Pandas .describe() method provides a quick summary of each column.
  • Histograms and box plots visualize distributions and outliers.