- Data Analysis with Python
-
Overview
- Introduction to Data Science and Analytics
- Loading and Cleaning Data in Pandas
- Data Manipulation with NumPy and Pandas
- Exploratory Data Analysis (EDA) Techniques
- Handling Missing Data and Duplicates
- Merging, Joining, and Concatenating DataFrames
- Time Series Analysis Basics
- Data Visualization with Matplotlib and Seaborn
- Descriptive Statistics and Data Summarization
- Advanced Pandas Operations
Advanced Pandas Operations
Add to BookmarkPandas is a powerful data manipulation library, and mastering advanced operations can significantly improve data analysis efficiency. This tutorial covers essential Advanced Pandas Operations, including:
- Advanced Indexing and Selection
- Applying Custom Functions with
apply()
- Pivot Tables and Crosstabs
- Working with MultiIndex DataFrames
- Efficient Data Filtering and Querying
- Performance Optimization with Pandas
- Parallel Processing in Pandas
By the end of this tutorial, you’ll be able to handle complex data structures and optimize performance in Pandas.
1. Advanced Indexing and Selection
Pandas provides several ways to select data efficiently.
Using .loc[]
for Label-Based Indexing
import pandas as pd
import numpy as np
df = pd.DataFrame({
'Name': ['Amit', 'Priya', 'Raj', 'Neha', 'Vikram'],
'Age': [25, 30, 22, 28, 26],
'Salary': [50000, 60000, 45000, 52000, 58000]
}, index=['A', 'B', 'C', 'D', 'E'])
# Selecting a row by index label
print(df.loc['B'])
# Selecting multiple columns
print(df.loc[:, ['Name', 'Salary']])
Using .iloc[]
for Position-Based Indexing
# Select the first row
print(df.iloc[0])
# Select the first two rows and first two columns
print(df.iloc[0:2, 0:2])
Boolean Indexing
# Selecting rows where Salary > 50,000
high_salary = df[df['Salary'] > 50000]
print(high_salary)
2. Applying Custom Functions with apply()
Pandas' apply()
function allows custom operations on DataFrames.
# Define a function to categorize salary
def salary_category(sal):
if sal > 55000:
return 'High'
elif sal > 50000:
return 'Medium'
else:
return 'Low'
df['Salary Category'] = df['Salary'].apply(salary_category)
print(df)
Applying functions on multiple columns:
df['Age in 5 Years'] = df['Age'].apply(lambda x: x + 5)
print(df)
3. Pivot Tables and Crosstabs
Pivot tables summarize data and are useful for grouping large datasets.
df = pd.DataFrame({
'Department': ['HR', 'IT', 'Finance', 'IT', 'HR'],
'Employee': ['Amit', 'Raj', 'Neha', 'Vikram', 'Priya'],
'Salary': [50000, 60000, 55000, 58000, 53000]
})
# Pivot Table to calculate mean salary per department
pivot = df.pivot_table(values='Salary', index='Department', aggfunc='mean')
print(pivot)
Using Crosstab for Frequency Count
df['Experience Level'] = ['Junior', 'Senior', 'Senior', 'Junior', 'Junior']
crosstab = pd.crosstab(df['Department'], df['Experience Level'])
print(crosstab)
4. Working with MultiIndex DataFrames
MultiIndex allows hierarchical indexing for better data organization.
arrays = [['HR', 'HR', 'IT', 'IT'], ['Amit', 'Priya', 'Raj', 'Vikram']]
index = pd.MultiIndex.from_tuples(list(zip(*arrays)), names=['Department', 'Employee'])
df_multi = pd.DataFrame({'Salary': [50000, 53000, 60000, 58000]}, index=index)
print(df_multi)
# Accessing data from MultiIndex DataFrame
print(df_multi.loc['IT'])
5. Efficient Data Filtering and Querying
Using .query()
for Filtering
df = pd.DataFrame({
'City': ['Delhi', 'Mumbai', 'Bangalore', 'Chennai', 'Kolkata'],
'Temperature': [32, 35, 29, 31, 30],
'Humidity': [70, 65, 80, 75, 72]
})
# Selecting cities where temperature > 30
filtered_df = df.query('Temperature > 30')
print(filtered_df)
6. Performance Optimization with Pandas
Using astype()
to Reduce Memory Usage
df['Temperature'] = df['Temperature'].astype('int16')
df['Humidity'] = df['Humidity'].astype('int16')
print(df.info())
Using categorical
Data Type for Better Performance
df['City'] = df['City'].astype('category')
print(df.info())
Using chunk_size
for Large CSV Files
Instead of loading large files at once, use chunk_size
.
chunk_size = 10000
for chunk in pd.read_csv("large_file.csv", chunksize=chunk_size):
process(chunk) # Define your processing function
7. Parallel Processing in Pandas
For large datasets, swifter
allows parallel execution of apply()
.
import swifter # Install using: pip install swifter
df['Processed Salary'] = df['Salary'].swifter.apply(lambda x: x * 1.1)
print(df)
Conclusion
- Advanced Indexing:
.loc[]
,.iloc[]
, and Boolean filtering. - Custom Functions:
apply()
for transforming data. - Pivot Tables & Crosstabs: Data summarization and frequency tables.
- MultiIndex DataFrames: Hierarchical indexing for complex datasets.
- Querying Data:
.query()
for filtering efficiently. - Performance Optimization: Memory-efficient data types, chunk processing.
- Parallel Processing:
swifter
for faster computations.
Prepare for Interview
- JavaScript Interview Questions for 5+ Years Experience
- JavaScript Interview Questions for 2–5 Years Experience
- JavaScript Interview Questions for 1–2 Years Experience
- JavaScript Interview Questions for 0–1 Year Experience
- JavaScript Interview Questions For Fresher
- SQL Interview Questions for 5+ Years Experience
- SQL Interview Questions for 2–5 Years Experience
- SQL Interview Questions for 1–2 Years Experience
- SQL Interview Questions for 0–1 Year Experience
- SQL Interview Questions for Freshers
- Design Patterns in Python
- Dynamic Programming and Recursion in Python
- Trees and Graphs in Python
- Linked Lists, Stacks, and Queues in Python
- Sorting and Searching in Python
Random Blogs
- What Is SEO and Why Is It Important?
- Deep Learning (DL): The Core of Modern AI
- AI & Space Exploration – AI’s Role in Deep Space Missions and Planetary Research
- Why to learn Digital Marketing?
- Important Mistakes to Avoid While Advertising on Facebook
- Variable Assignment in Python
- Python Challenging Programming Exercises Part 2
- Top 15 Recommended SEO Tools
- Store Data Into CSV File Using Python Tkinter GUI Library
- Mastering Python in 2025: A Complete Roadmap for Beginners
- Understanding OLTP vs OLAP Databases: How SQL Handles Query Optimization
- Understanding AI, ML, Data Science, and More: A Beginner's Guide to Choosing Your Career Path
- Create Virtual Host for Nginx on Ubuntu (For Yii2 Basic & Advanced Templates)
- The Ultimate Guide to Starting a Career in Computer Vision
- Types of Numbers in Python
Datasets for Machine Learning
- Awesome-ChatGPT-Prompts
- Amazon Product Reviews Dataset
- Ozone Level Detection Dataset
- Bank Transaction Fraud Detection
- YouTube Trending Video Dataset (updated daily)
- Covid-19 Case Surveillance Public Use Dataset
- US Election 2020
- Forest Fires Dataset
- Mobile Robots Dataset
- Safety Helmet Detection
- All Space Missions from 1957
- OSIC Pulmonary Fibrosis Progression Dataset
- Wine Quality Dataset
- Google Audio Dataset
- Iris flower dataset