Pandas - Interview Questions and Answers

Pandas is an open-source Python library used for data manipulation and analysis. It provides flexible data structures like Series and DataFrame to handle structured data efficiently.

  • Series: One-dimensional labeled array.
  • DataFrame: Two-dimensional table-like structure.
  • Panel (deprecated and removed in pandas 1.0): Three-dimensional data structure, superseded by MultiIndex DataFrames.

Use the following command:

pip install pandas

 

import pandas as pd

 

import pandas as pd
data = pd.Series([10, 20, 30, 40])
print(data)

 

data = {'Name': ['Kartik', 'Bob'], 'Age': [25, 30]}
df = pd.DataFrame(data)
print(df)

 

df = pd.read_csv("data.csv")

 

df.to_csv("output.csv", index=False)

 

df.head()

 

df.shape

 

df.info() displays metadata about the DataFrame, including column data types, non-null counts, and memory usage.

df.describe() provides summary statistics (count, mean, std, min, quartiles, max) for numerical columns.
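A quick sketch of both methods on a tiny DataFrame (the column names and data are illustrative):

```python
import pandas as pd

# Small hypothetical DataFrame
df = pd.DataFrame({"Name": ["Kartik", "Bob"], "Age": [25, 30]})

df.info()               # prints dtypes, non-null counts, memory usage
stats = df.describe()   # summary statistics for the numeric column "Age"
```

With mixed dtypes, describe() reports only the numeric columns by default; here stats contains the count, mean, std, and quartiles of "Age".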

df["column_name"]

 

df[df['Age'] > 25]

 

df['Salary'] = [50000, 60000]

 

df.drop(columns=['Salary'], inplace=True)

 

  • loc[] selects by label (index name).
  • iloc[] selects by position (index number).
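A minimal sketch of the difference, using an illustrative string index:

```python
import pandas as pd

df = pd.DataFrame({"Age": [25, 30, 35]}, index=["a", "b", "c"])

by_label = df.loc["b", "Age"]   # label-based: row whose index name is "b"
by_position = df.iloc[1, 0]     # position-based: second row, first column
```

Both expressions select the same value here because label "b" happens to sit at position 1.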

df.reset_index(drop=True, inplace=True)

 

df.set_index('Name', inplace=True)

 

  • Fill with a value: df.fillna(0)
  • Drop missing values: df.dropna()
  • Forward fill: df.ffill() (fillna(method='ffill') is deprecated)
  • Backward fill: df.bfill() (fillna(method='bfill') is deprecated)
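The four strategies above can be compared on one small Series (values are illustrative):

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 3.0])

filled = s.fillna(0)    # [1.0, 0.0, 3.0] - replace NaN with a constant
forward = s.ffill()     # [1.0, 1.0, 3.0] - carry the previous value forward
backward = s.bfill()    # [1.0, 3.0, 3.0] - pull the next value backward
dropped = s.dropna()    # [1.0, 3.0]      - remove rows containing NaN
```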

df.isnull().sum()

 

df.replace("old_value", "new_value", inplace=True)

 

df["Age"] = df["Age"].astype(int)

 

df.sort_values(by="Age", ascending=True)

 

df.drop_duplicates(inplace=True)

 

df.rename(columns={'old_name': 'new_name'}, inplace=True)

 

df['Age'] = df['Age'].apply(lambda x: x + 1)

 

  • apply() works on a Series or along the rows/columns of a DataFrame.
  • map() is for element-wise transformations on a Series, and also accepts a dict or Series for value mapping.
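A small sketch contrasting the two (data is illustrative):

```python
import pandas as pd

s = pd.Series([1, 2, 3])
doubled = s.map(lambda x: x * 2)   # element-wise on a Series: [2, 4, 6]

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
col_sums = df.apply(sum)           # apply runs per column by default: a -> 3, b -> 7
```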

df.groupby("Department").mean(numeric_only=True)

 

merged_df = pd.merge(df1, df2, on='ID', how='inner')

 

pd.concat([df1, df2], axis=0)

 

pivot_table() summarizes and aggregates data dynamically, similar to a spreadsheet pivot table.

df.pivot_table(index="Category", values="Sales", aggfunc="sum")
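A concrete sketch with hypothetical sales data:

```python
import pandas as pd

df = pd.DataFrame({"Category": ["A", "A", "B"], "Sales": [10, 20, 30]})

# One row per category, with Sales summed within each group
summary = df.pivot_table(index="Category", values="Sales", aggfunc="sum")
```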

 

pd.get_dummies(df['Category'])

 

df.sample(n=5)

 

df.corr(numeric_only=True)

 

df.reset_index(drop=True, inplace=True)

This resets the index to the default integer index while optionally dropping the old index.

df.set_index("ColumnName", inplace=True)

This makes the specified column the index of the DataFrame.

df[(df['Age'] > 30) & (df['Salary'] > 50000)]

The & operator is used for multiple conditions; use | for OR conditions.
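Both operators in one sketch (values are illustrative; each condition must be wrapped in parentheses):

```python
import pandas as pd

df = pd.DataFrame({"Age": [25, 35, 45], "Salary": [40000, 55000, 70000]})

both = df[(df["Age"] > 30) & (df["Salary"] > 50000)]    # AND: both conditions hold
either = df[(df["Age"] > 40) | (df["Salary"] < 45000)]  # OR: at least one holds
```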

df["ColumnName"].unique()

Returns an array of unique values from the specified column.

df["ColumnName"].nunique()

Returns the count of unique values in the specified column.

df["ColumnName"].value_counts()

This method returns a Series with the frequency of each unique value.

df.rename(index={0: "FirstRow", 1: "SecondRow"}, inplace=True)

Renames specific index labels in the DataFrame.

 

df["ColumnName"] = df["ColumnName"].astype(int)

Changes the data type of a column to the specified type (e.g., int, float, string).

 

df.replace({"old_value": "new_value"}, inplace=True)

Replaces specified values in the DataFrame.

 

df["ColumnName"].isnull().sum()

Returns the count of missing values in the specified column.

 

df.dropna(inplace=True)

Removes all rows that contain at least one missing value.

 

df.fillna(value="Default Value", inplace=True)

Fills missing values with a specified default value.

df.ffill(inplace=True)

Replaces missing values with the previous row's value. (fillna(method="ffill") is deprecated in favor of ffill().)

 

df.bfill(inplace=True)

Replaces missing values with the next row's value. (fillna(method="bfill") is deprecated in favor of bfill().)

 

df["ColumnName"] = df["ColumnName"].apply(lambda x: x * 2)

The apply() function applies a function or lambda expression to every element of the column.

df = pd.read_csv("data.csv", usecols=['Name', 'Age'])

 

chunks = pd.read_csv("large_file.csv", chunksize=1000)  # returns an iterator of DataFrames

 

df.query("Age > 30 and Salary < 60000")

 

df.set_index(['Region', 'Department'], inplace=True)

 

df['RollingAvg'] = df['Sales'].rolling(window=3).mean()

 

Use Parquet format:

df.to_parquet("data.parquet")
df = pd.read_parquet("data.parquet")

 

df.explode("ListColumn")

 

df['Date'] = pd.to_datetime(df['Date'])
df.set_index("Date", inplace=True)
df.resample("M").sum()

 

df['Category'] = df['Category'].astype('category')
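The memory benefit is easy to demonstrate: a categorical stores small integer codes plus one copy of each label, so a column of repeated strings shrinks considerably. A sketch with illustrative data:

```python
import pandas as pd

colors = pd.Series(["red", "green", "blue"] * 1000)
as_cat = colors.astype("category")

# deep=True counts the actual string storage, not just pointers
obj_bytes = colors.memory_usage(deep=True)
cat_bytes = as_cat.memory_usage(deep=True)
```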

 

df = df.eval("NewColumn = Age * 2")

 

df_merged = df1.merge(df2, on=["Column1", "Column2"], how="inner")

This merges two DataFrames using multiple columns as keys.

  • merge() is used for combining DataFrames based on key columns (like SQL joins).
  • concat() is used for stacking DataFrames either row-wise (axis=0) or column-wise (axis=1).
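The contrast in one sketch (column names and data are illustrative):

```python
import pandas as pd

df1 = pd.DataFrame({"ID": [1, 2], "Name": ["Kartik", "Bob"]})
df2 = pd.DataFrame({"ID": [2, 3], "Salary": [60000, 70000]})

# merge: SQL-style inner join on the key column -> only ID 2 matches
merged = pd.merge(df1, df2, on="ID", how="inner")

# concat: stack row-wise; non-shared columns are filled with NaN
stacked = pd.concat([df1, df2], axis=0)
```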

 

df.duplicated()

Returns a boolean Series indicating duplicate rows.

df.drop_duplicates(inplace=True)

Removes duplicate rows while keeping the first occurrence by default.

 

df.groupby("ColumnName")["ValueColumn"].sum()

Groups the DataFrame by a column and calculates the sum for each group.
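A minimal sketch with hypothetical departments:

```python
import pandas as pd

df = pd.DataFrame({"Dept": ["IT", "IT", "HR"], "Salary": [50, 70, 40]})

# One sum per department: IT -> 120, HR -> 40
totals = df.groupby("Dept")["Salary"].sum()
```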

 

df.pivot_table(values="Sales", index="Region", columns="Product", aggfunc="sum")

Creates a pivot table summarizing data based on specified columns and an aggregation function.

df.melt(id_vars=["ID"], value_vars=["Sales", "Profit"])

Converts wide-form DataFrame into a long-form, making it easier to analyze.
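A sketch of the reshape on illustrative data, two rows wide becoming four rows long:

```python
import pandas as pd

wide = pd.DataFrame({"ID": [1, 2], "Sales": [100, 200], "Profit": [10, 20]})

# Result has columns ID, variable, value: one row per (ID, measure) pair
long_df = pd.melt(wide, id_vars=["ID"], value_vars=["Sales", "Profit"])
```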

df.resample("M").sum()

Resamples data to monthly frequency and calculates the sum for each period. (Pandas 2.2+ prefers the alias "ME" over "M"; resampling requires a DatetimeIndex.)
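A small sketch with three hypothetical dated sales figures:

```python
import pandas as pd

idx = pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-10"])
sales = pd.Series([10, 20, 30], index=idx)

# One row per month: January 10 + 20 = 30, February 30
monthly = sales.resample("M").sum()
```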

df["RollingAvg"] = df["Sales"].rolling(window=3).mean()

Calculates a rolling average for a given window size.
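With a window of 3, the first two positions have no complete window and come back as NaN. A sketch on illustrative numbers:

```python
import pandas as pd

sales = pd.Series([10, 20, 30, 40])
rolling_avg = sales.rolling(window=3).mean()
# [NaN, NaN, 20.0, 30.0] - each value averages itself and the two before it
```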

chunk_size = 10000
for chunk in pd.read_csv("large_file.csv", chunksize=chunk_size):
    process(chunk)  # Process each chunk separately

Reading large files in chunks helps in handling memory efficiently.
