Pandas - Interview Questions and Answers
Pandas is an open-source Python library used for data manipulation and analysis. It provides flexible data structures like Series and DataFrame to handle structured data efficiently.
- Series: One-dimensional labeled array.
- DataFrame: Two-dimensional table-like structure.
- Panel (removed in pandas 1.0): Three-dimensional data structure.
Use the following command:
pip install pandas
import pandas as pd
import pandas as pd
data = pd.Series([10, 20, 30, 40])
print(data)
data = {'Name': ['Kartik', 'Bob'], 'Age': [25, 30]}
df = pd.DataFrame(data)
print(df)
df = pd.read_csv("data.csv")
df.to_csv("output.csv", index=False)
df.head()
df.shape
df.info() displays metadata about the DataFrame, including column data types, non-null counts, and memory usage.
df.describe() provides summary statistics for the numerical columns.
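A small sketch of both methods on a made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Kartik", "Bob"], "Age": [25, 30]})

df.info()              # prints dtypes, non-null counts, and memory usage
stats = df.describe()  # count, mean, std, min, quartiles, max for numeric columns
```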
df["column_name"]
df[df['Age'] > 25]
df['Salary'] = [50000, 60000]
df.drop(columns=['Salary'], inplace=True)
loc[] selects by label (index name); iloc[] selects by position (integer location).
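A minimal sketch contrasting the two (the DataFrame and its labels are made up for illustration):

```python
import pandas as pd

# Hypothetical DataFrame with string index labels.
df = pd.DataFrame({"Age": [25, 30, 35]}, index=["a", "b", "c"])

by_label = df.loc["b", "Age"]     # label-based lookup: row "b"
by_position = df.iloc[1]["Age"]   # position-based lookup: second row
```

Both expressions return the same value here only because row "b" happens to be the second row.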
df.reset_index(drop=True, inplace=True)
df.set_index('Name', inplace=True)
- Fill with a value:
df.fillna(0)
- Drop missing values:
df.dropna()
- Forward fill:
df.ffill()
- Backward fill:
df.bfill()
(The older df.fillna(method='ffill') form is deprecated in recent pandas versions.)
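The strategies above, shown on a small made-up Series (using the modern ffill()/bfill() methods):

```python
import pandas as pd

s = pd.Series([1.0, None, 3.0])

filled = s.fillna(0)    # replace NaN with 0
dropped = s.dropna()    # drop entries containing NaN
forward = s.ffill()     # propagate the previous value forward
backward = s.bfill()    # pull the next value backward
```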
df.isnull().sum()
df.replace("old_value", "new_value", inplace=True)
df["Age"] = df["Age"].astype(int)
df.sort_values(by="Age", ascending=True)
df.drop_duplicates(inplace=True)
df.rename(columns={'old_name': 'new_name'}, inplace=True)
df['Age'] = df['Age'].apply(lambda x: x + 1)
apply() works on whole rows or columns of a DataFrame (and element-wise on a Series); map() is for element-wise transformations in a Series.
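A short sketch of the distinction (the data is made up):

```python
import pandas as pd

df = pd.DataFrame({"Age": [25, 30]})

# Series.map transforms each element; DataFrame.apply runs a
# function over whole columns (or rows with axis=1).
mapped = df["Age"].map(lambda x: x + 1)
col_sums = df.apply(lambda col: col.sum())  # one result per column
```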
df.groupby("Department").mean(numeric_only=True)
merged_df = pd.merge(df1, df2, on='ID', how='inner')
pd.concat([df1, df2], axis=0)
pivot_table() is used to summarize and aggregate data dynamically.
df.pivot_table(index="Category", values="Sales", aggfunc="sum")
pd.get_dummies(df['Category'])
df.sample(n=5)
df.corr()
df.reset_index(drop=True, inplace=True)
This resets the index to the default integer index while optionally dropping the old index.
df.set_index("ColumnName", inplace=True)
This makes the specified column the index of the DataFrame.
df[(df['Age'] > 30) & (df['Salary'] > 50000)]
The & operator is used to combine conditions with AND; use | for OR. Each condition must be wrapped in parentheses.
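A small example with made-up values, showing both operators:

```python
import pandas as pd

df = pd.DataFrame({"Age": [28, 35, 40], "Salary": [40000, 55000, 70000]})

both = df[(df["Age"] > 30) & (df["Salary"] > 50000)]    # AND: both must hold
either = df[(df["Age"] > 38) | (df["Salary"] < 45000)]  # OR: at least one holds
```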
df["ColumnName"].unique()
Returns an array of unique values from the specified column.
df["ColumnName"].nunique()
Returns the count of unique values in the specified column.
df["ColumnName"].value_counts()
This method returns a Series with the frequency of each unique value.
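The three methods above on a small made-up Series:

```python
import pandas as pd

s = pd.Series(["a", "b", "a", "c", "a"])

uniques = s.unique()       # distinct values, in order of appearance
n = s.nunique()            # number of distinct values
counts = s.value_counts()  # frequency of each value, most common first
```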
df.rename(index={0: "FirstRow", 1: "SecondRow"}, inplace=True)
Renames specific index labels in the DataFrame.
df["ColumnName"] = df["ColumnName"].astype(int)
Changes the data type of a column to the specified type (e.g., int, float, string).
df.replace({"old_value": "new_value"}, inplace=True)
Replaces specified values in the DataFrame.
df["ColumnName"].isnull().sum()
Returns the count of missing values in the specified column.
df.dropna(inplace=True)
Removes all rows that contain at least one missing value.
df.fillna(value="Default Value", inplace=True)
Fills missing values with a specified default value.
df.ffill(inplace=True)
Replaces missing values with the previous row's value.
df.bfill(inplace=True)
Replaces missing values with the next row's value.
df["ColumnName"] = df["ColumnName"].apply(lambda x: x * 2)
The apply() function applies a function or lambda expression to every element of a column (and can also operate row-wise on a DataFrame with axis=1).
df = pd.read_csv("data.csv", usecols=['Name', 'Age'])
chunks = pd.read_csv("large_file.csv", chunksize=1000)  # iterator of DataFrames
df.query("Age > 30 and Salary < 60000")
df.set_index(['Region', 'Department'], inplace=True)
df['RollingAvg'] = df['Sales'].rolling(window=3).mean()
Use Parquet format:
df.to_parquet("data.parquet")
df = pd.read_parquet("data.parquet")
df.explode("ListColumn")
df['Date'] = pd.to_datetime(df['Date'])
df.set_index("Date", inplace=True)
df.resample("M").sum()
df['Category'] = df['Category'].astype('category')
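Converting a repetitive string column to the category dtype typically reduces memory, since each distinct value is stored once and rows hold small integer codes. A sketch with made-up data:

```python
import pandas as pd

s = pd.Series(["red", "blue", "red"] * 1000)  # many repeats, few distinct values
cat = s.astype("category")

# The categorical version stores the two strings once plus compact codes.
savings = s.memory_usage(deep=True) - cat.memory_usage(deep=True)
```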
df.eval("NewColumn = Age * 2")
df_merged = df1.merge(df2, on=["Column1", "Column2"], how="inner")
This merges two DataFrames using multiple columns as keys.
merge() is used for combining DataFrames based on key columns (like SQL joins); concat() is used for stacking DataFrames either row-wise (axis=0) or column-wise (axis=1).
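A minimal illustration of the difference (the DataFrames are made up):

```python
import pandas as pd

df1 = pd.DataFrame({"ID": [1, 2], "Name": ["Kartik", "Bob"]})
df2 = pd.DataFrame({"ID": [2, 3], "Salary": [60000, 70000]})

joined = pd.merge(df1, df2, on="ID", how="inner")  # SQL-style join on ID
stacked = pd.concat([df1, df2], axis=0)            # stack rows; columns are unioned
```

Only ID 2 appears in both inputs, so the inner join keeps one row, while concat keeps all four.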
df.duplicated()
Returns a boolean Series indicating duplicate rows.
df.drop_duplicates(inplace=True)
Removes duplicate rows while keeping the first occurrence by default.
df.groupby("ColumnName")["ValueColumn"].sum()
Groups the DataFrame by a column and calculates the sum for each group.
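A quick sketch of the pattern on made-up data:

```python
import pandas as pd

df = pd.DataFrame({"Dept": ["IT", "IT", "HR"], "Salary": [50, 70, 40]})

# One summed value per group.
totals = df.groupby("Dept")["Salary"].sum()
```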
df.pivot_table(values="Sales", index="Region", columns="Product", aggfunc="sum")
Creates a pivot table summarizing data based on specified columns and an aggregation function.
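For example, on a small made-up sales table:

```python
import pandas as pd

df = pd.DataFrame({
    "Region": ["N", "N", "S"],
    "Product": ["A", "B", "A"],
    "Sales": [10, 20, 30],
})

# Rows become regions, columns become products, cells hold summed sales.
pivot = df.pivot_table(values="Sales", index="Region", columns="Product", aggfunc="sum")
```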
df.melt(id_vars=["ID"], value_vars=["Sales", "Profit"])
Converts wide-form DataFrame into a long-form, making it easier to analyze.
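A short sketch of melt() on a made-up wide table:

```python
import pandas as pd

wide = pd.DataFrame({"ID": [1, 2], "Sales": [100, 200], "Profit": [10, 20]})

# Each (ID, measure) pair becomes its own row with "variable"/"value" columns.
long_df = pd.melt(wide, id_vars=["ID"], value_vars=["Sales", "Profit"])
```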
df.resample("M").sum()
Resamples data to monthly frequency and calculates the sum for each period.
df["RollingAvg"] = df["Sales"].rolling(window=3).mean()
Calculates a rolling average for a given window size.
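For example, with a window of 3 the first two results are NaN because the window is incomplete (the values are made up):

```python
import pandas as pd

sales = pd.Series([10, 20, 30, 40])
rolling = sales.rolling(window=3).mean()  # mean over each 3-value window
```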
chunk_size = 10000
for chunk in pd.read_csv("large_file.csv", chunksize=chunk_size):
    process(chunk)  # Process each chunk separately
Reading large files in chunks keeps memory usage manageable, because only one chunk is loaded at a time.
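A self-contained sketch of the pattern, simulating the large file with an in-memory buffer so the example runs anywhere (the per-chunk aggregation is an assumption; real processing depends on the task):

```python
import io

import pandas as pd

# Stand-in for a large CSV file on disk.
csv_data = io.StringIO("value\n" + "\n".join(str(i) for i in range(10)))

total = 0
for chunk in pd.read_csv(csv_data, chunksize=4):
    total += chunk["value"].sum()  # aggregate per chunk; memory stays bounded
```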