Pandas is a powerful data manipulation library, and mastering its advanced operations can make data analysis considerably more efficient. This tutorial covers indexing and selection, apply(), pivot tables and cross-tabulation, MultiIndex, .query() filtering, and performance optimization.
By the end of this tutorial, you'll be able to handle complex data structures and optimize performance in Pandas.
Pandas provides several ways to select data efficiently.

.loc[] for Label-Based Indexing
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'Name': ['Amit', 'Priya', 'Raj', 'Neha', 'Vikram'],
    'Age': [25, 30, 22, 28, 26],
    'Salary': [50000, 60000, 45000, 52000, 58000]
}, index=['A', 'B', 'C', 'D', 'E'])
# Selecting a row by index label
print(df.loc['B'])
# Selecting multiple columns
print(df.loc[:, ['Name', 'Salary']])

.iloc[] for Position-Based Indexing
# Select the first row
print(df.iloc[0])
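
One difference between the two indexers is easy to miss: a slice passed to .loc[] includes both endpoint labels, while a slice passed to .iloc[] excludes the stop position, like ordinary Python slicing. The following is a minimal sketch of that contrast. The name people is an assumption introduced here so the snippet runs on its own without touching df.

```python
import pandas as pd

# Same sample data as above, rebuilt under a different (hypothetical) name
people = pd.DataFrame({
    'Name': ['Amit', 'Priya', 'Raj', 'Neha', 'Vikram'],
    'Age': [25, 30, 22, 28, 26],
    'Salary': [50000, 60000, 45000, 52000, 58000]
}, index=['A', 'B', 'C', 'D', 'E'])

# Label slice: both 'A' and 'C' are included, as are 'Name' and 'Age'
by_label = people.loc['A':'C', 'Name':'Age']

# Position slice: the stop position 2 is excluded for rows and columns
by_position = people.iloc[0:2, 0:2]

print(by_label)
print(by_position)
```

Here by_label has three rows and by_position has two, even though both slices look similar at a glance.
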
# Select the first two rows and first two columns
print(df.iloc[0:2, 0:2])

# Selecting rows where Salary > 50,000
high_salary = df[df['Salary'] > 50000]
print(high_salary)

apply()
Pandas' apply() function runs a custom function on each element of a Series, or on each row or column of a DataFrame.
# Define a function to categorize salary
def salary_category(sal):
    if sal > 55000:
        return 'High'
    elif sal > 50000:
        return 'Medium'
    else:
        return 'Low'

df['Salary Category'] = df['Salary'].apply(salary_category)
print(df)

Applying a lambda function to a column:
df['Age in 5 Years'] = df['Age'].apply(lambda x: x + 5)
print(df)

Pivot tables summarize data and are useful for grouping large datasets.
df = pd.DataFrame({
    'Department': ['HR', 'IT', 'Finance', 'IT', 'HR'],
    'Employee': ['Amit', 'Raj', 'Neha', 'Vikram', 'Priya'],
    'Salary': [50000, 60000, 55000, 58000, 53000]
})
# Pivot Table to calculate mean salary per department
pivot = df.pivot_table(values='Salary', index='Department', aggfunc='mean')
print(pivot)

# Cross-tabulation: count employees per department and experience level
df['Experience Level'] = ['Junior', 'Senior', 'Senior', 'Junior', 'Junior']
crosstab = pd.crosstab(df['Department'], df['Experience Level'])
print(crosstab)

MultiIndex allows hierarchical indexing for better data organization.
arrays = [['HR', 'HR', 'IT', 'IT'], ['Amit', 'Priya', 'Raj', 'Vikram']]
index = pd.MultiIndex.from_tuples(list(zip(*arrays)), names=['Department', 'Employee'])
df_multi = pd.DataFrame({'Salary': [50000, 53000, 60000, 58000]}, index=index)
print(df_multi)
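
A MultiIndex does not have to be built by hand. As a hedged sketch, the same hierarchy can come from ordinary columns via set_index(), and xs() can then select on the inner level. The names flat and hier are assumptions introduced for illustration, and the data mirrors the example above.

```python
import pandas as pd

# Hypothetical flat table with the same values as the example above
flat = pd.DataFrame({
    'Department': ['HR', 'HR', 'IT', 'IT'],
    'Employee': ['Amit', 'Priya', 'Raj', 'Vikram'],
    'Salary': [50000, 53000, 60000, 58000]
})

# Passing a list of columns to set_index produces a MultiIndex
hier = flat.set_index(['Department', 'Employee'])

# Select a single value by giving one label per level as a tuple
raj_salary = hier.loc[('IT', 'Raj'), 'Salary']

# xs() selects on the inner level without naming the outer one
priya = hier.xs('Priya', level='Employee')

print(hier)
print(raj_salary)
print(priya)
```
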
# Accessing data from MultiIndex DataFrame
print(df_multi.loc['IT'])

.query() for Filtering
df = pd.DataFrame({
    'City': ['Delhi', 'Mumbai', 'Bangalore', 'Chennai', 'Kolkata'],
    'Temperature': [32, 35, 29, 31, 30],
    'Humidity': [70, 65, 80, 75, 72]
})
# Selecting cities where temperature > 30
filtered_df = df.query('Temperature > 30')
print(filtered_df)

astype() to Reduce Memory Usage
df['Temperature'] = df['Temperature'].astype('int16')
df['Humidity'] = df['Humidity'].astype('int16')
df.info()  # info() prints its report directly and returns None

Categorical Data Type for Better Performance
df['City'] = df['City'].astype('category')
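df.info()  # info() prints its report directly and returns None

chunksize for Large CSV Files
Instead of loading a large file all at once, pass the chunksize parameter to pd.read_csv() to read it in pieces.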
chunk_size = 10000
for chunk in pd.read_csv("large_file.csv", chunksize=chunk_size):
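    process(chunk)  # process() is a placeholder; define your own processing function

For large datasets, the third-party swifter package can speed up apply() by vectorizing the function where possible or running it in parallel.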
import swifter  # Install using: pip install swifter

# df currently holds the city data, so rebuild a DataFrame with a Salary column
df = pd.DataFrame({'Salary': [50000, 60000, 55000, 58000, 53000]})
df['Processed Salary'] = df['Salary'].swifter.apply(lambda x: x * 1.1)
print(df)

This tutorial covered:
.loc[], .iloc[], and Boolean filtering for selecting data
apply() for transforming data
.query() for filtering efficiently
swifter for faster computations
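
On the subject of faster computations, simple transformations can often skip apply() altogether. The sketch below is an alternative to the tutorial's approach, not a replacement for it: it redoes the salary scaling and the salary categories with whole-column operations. The name salaries and the use of np.select are assumptions introduced here, with thresholds copied from salary_category above.

```python
import numpy as np
import pandas as pd

# Hypothetical standalone table reusing the salary values from the first example
salaries = pd.DataFrame({'Salary': [50000, 60000, 45000, 52000, 58000]})

# Element-wise arithmetic operates on the whole column at once
salaries['Processed Salary'] = salaries['Salary'] * 1.1

# np.select checks the conditions in order, like the if/elif chain
conditions = [salaries['Salary'] > 55000, salaries['Salary'] > 50000]
salaries['Salary Category'] = np.select(
    conditions, ['High', 'Medium'], default='Low'
)

print(salaries)
```

Whether this is faster in practice depends on the data size and the function being replaced, so it is worth timing both versions on real data.
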
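
The chunked-reading example earlier leaves process() undefined. As one possible shape for it, the sketch below computes a mean salary per department across chunks by keeping running sums and counts. The in-memory CSV, the column names, and the tiny chunksize are all assumptions chosen so the snippet runs without a real file.

```python
import io
import pandas as pd

# Stand-in for a large CSV file; a real path would be passed to read_csv instead
csv_data = io.StringIO(
    "Department,Salary\n"
    "HR,50000\nIT,60000\nFinance,55000\nIT,58000\nHR,53000\n"
)

totals = {}  # running salary sum per department
counts = {}  # running row count per department

# chunksize=2 is for demonstration only; real files would use thousands of rows
for chunk in pd.read_csv(csv_data, chunksize=2):
    grouped = chunk.groupby('Department')['Salary'].agg(['sum', 'count'])
    for dept, row in grouped.iterrows():
        totals[dept] = totals.get(dept, 0) + row['sum']
        counts[dept] = counts.get(dept, 0) + row['count']

# Combine the partial results once every chunk has been read
mean_salary = {dept: totals[dept] / counts[dept] for dept in totals}
print(mean_salary)
```

Sums and counts are tracked separately because averaging the per-chunk means would give the wrong answer when a department's rows are spread unevenly across chunks.
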