Data Manipulation with NumPy and Pandas

Manipulating data efficiently is essential in data analysis. NumPy and Pandas provide powerful tools for handling and transforming data. This tutorial covers:

  • Using NumPy for numerical operations
  • Manipulating DataFrames in Pandas
  • Filtering, sorting, and grouping data

1. Why Use NumPy and Pandas?

  • NumPy: Optimized for numerical computations with fast operations on large arrays.
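  • Pandas: Built on NumPy; provides high-level tools for manipulating structured (tabular) data.

Both libraries are conventionally imported under short aliases, which the rest of this tutorial assumes:

import numpy as np
import pandas as pd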

2. NumPy Basics for Data Manipulation

NumPy's core data structure is the ndarray, an N-dimensional array of a single data type that supports fast element-wise operations.

Creating and Manipulating NumPy Arrays

arr = np.array([10, 20, 30, 40, 50])
print(arr * 2)  # Multiply each element by 2
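The multiplication above is applied to every element without an explicit loop. A minimal sketch comparing it with the equivalent list comprehension and showing a boolean mask; the names `doubled_loop` and `mask` are illustrative only:

```python
import numpy as np

arr = np.array([10, 20, 30, 40, 50])

# Vectorized: the operation is applied to every element at once
doubled = arr * 2

# Equivalent pure-Python loop, shown only for comparison
doubled_loop = [x * 2 for x in arr.tolist()]

# Comparisons are vectorized too and produce a boolean mask
mask = arr > 25
selected = arr[mask]  # keeps 30, 40 and 50

print(doubled)
print(selected)
```

The vectorized form is usually both shorter and faster on large arrays, because the loop runs in compiled code rather than in the Python interpreter.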

Generating Random Data

rand_arr = np.random.rand(5)  # 5 random floats drawn uniformly from [0, 1)
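Random values change on every run unless a seed is fixed. One way to get reproducible data is NumPy's `default_rng` generator; the seed value 42 and the variable names here are arbitrary choices for illustration:

```python
import numpy as np

# Seeded generator: the same seed reproduces the same numbers
rng = np.random.default_rng(seed=42)

uniform = rng.random(5)                           # floats in [0, 1)
integers = rng.integers(low=1, high=7, size=5)    # ints from 1 to 6 (high is exclusive)
normal = rng.normal(loc=0.0, scale=1.0, size=5)   # draws from a standard normal

# Re-creating the generator with the same seed gives identical values
repeat = np.random.default_rng(seed=42).random(5)
print(np.array_equal(uniform, repeat))  # True
```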

Reshaping Arrays

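arr2D = np.array([[1, 2, 3], [4, 5, 6]])
print(arr2D.T)  # Transpose: shape (2, 3) becomes (3, 2)
print(arr2D.reshape(3, 2))  # Reshape to 3 rows and 2 columns, keeping element order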
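Transposing swaps the axes, while reshaping keeps the elements in the same order and only changes how they are divided into rows and columns. A small sketch of the difference; the variable names are illustrative:

```python
import numpy as np

flat = np.arange(1, 7)       # 1-D array holding 1 through 6
grid = flat.reshape(2, 3)    # 2 rows, 3 columns
tall = grid.reshape(-1, 2)   # -1 lets NumPy infer the row count (3 here)
back = grid.ravel()          # flatten back to 1-D

print(grid.shape, tall.shape, grid.T.shape)  # (2, 3) (3, 2) (3, 2)
print(tall)    # rows are [1 2], [3 4], [5 6]
print(grid.T)  # rows are [1 4], [2 5], [3 6]
```

Both `tall` and `grid.T` have shape (3, 2), but they hold the values in different positions.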

Mathematical Operations

arr = np.array([1, 2, 3, 4, 5])
print(np.mean(arr))  # Calculate mean
print(np.sum(arr))   # Calculate sum
print(np.sqrt(arr))  # Square root
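On a two-dimensional array the same functions accept an `axis` argument, which controls whether the result is computed per column or per row. A minimal sketch with a made-up `scores` array:

```python
import numpy as np

scores = np.array([[1, 2, 3],
                   [4, 5, 6]])

col_means = scores.mean(axis=0)  # one mean per column: 2.5, 3.5, 4.5
row_sums = scores.sum(axis=1)    # one sum per row: 6, 15
overall_max = scores.max()       # no axis: a single value for the whole array, 6

print(col_means, row_sums, overall_max)
```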

3. Data Manipulation with Pandas

Pandas is built around two structures: the Series (a labeled one-dimensional column) and the DataFrame (a table whose columns are Series sharing one index).

Creating a DataFrame

data = {"Name": ["Amit", "Pooja", "Rahul", "Neha"],
        "Age": [25, 30, 22, 35],
        "Salary": [50000, 60000, 45000, 70000]}

df = pd.DataFrame(data)
print(df)
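Before changing a DataFrame it usually helps to inspect its size, column types, and summary statistics. A short sketch using the same sample data:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Amit", "Pooja", "Rahul", "Neha"],
                   "Age": [25, 30, 22, 35],
                   "Salary": [50000, 60000, 45000, 70000]})

print(df.shape)    # (4, 3): 4 rows, 3 columns
print(df.dtypes)   # data type of each column
print(df.head(2))  # first two rows

# describe() summarizes numeric columns: count, mean, std, min, quartiles, max
stats = df.describe()
print(stats.loc["mean", "Age"])  # 28.0
```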

4. Selecting and Filtering Data

Selecting Columns

print(df["Name"])  # Select single column
print(df[["Name", "Salary"]])  # Select multiple columns

Selecting Rows

print(df.iloc[1])  # Select the second row by position (0-based)
print(df.loc[df["Age"] > 25])  # Filter rows where Age > 25
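`iloc` selects by integer position and `loc` selects by label, and the two treat slice endpoints differently. A sketch of the distinction on the sample data; the variable names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Amit", "Pooja", "Rahul", "Neha"],
                   "Age": [25, 30, 22, 35],
                   "Salary": [50000, 60000, 45000, 70000]})

by_position = df.iloc[0:2]   # positions 0 and 1 (end excluded)
by_label = df.loc[0:2]       # labels 0 through 2 (end included)
cell = df.loc[1, "Salary"]   # single value at row label 1, column "Salary"

# loc also accepts a boolean mask for rows plus a list of columns
subset = df.loc[df["Age"] > 25, ["Name", "Age"]]

print(len(by_position), len(by_label))  # 2 3
print(subset)
```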

Conditional Filtering

high_salary = df[df["Salary"] > 50000]
print(high_salary)
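Filters can be combined. Each condition needs its own parentheses, with `&` for AND and `|` for OR; `isin` and `query` are alternative spellings for common cases. A minimal sketch with illustrative variable names:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Amit", "Pooja", "Rahul", "Neha"],
                   "Age": [25, 30, 22, 35],
                   "Salary": [50000, 60000, 45000, 70000]})

# AND: under 30 and earning at least 50000
young_and_paid = df[(df["Age"] < 30) & (df["Salary"] >= 50000)]

# OR: over 30 or earning less than 50000
either = df[(df["Age"] > 30) | (df["Salary"] < 50000)]

# Membership test against a list of values
named = df[df["Name"].isin(["Amit", "Neha"])]

# The same AND filter written as a query string
same_with_query = df.query("Age < 30 and Salary >= 50000")

print(young_and_paid)
```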

5. Adding, Updating, and Deleting Data

Adding a New Column

df["Bonus"] = df["Salary"] * 0.10  # 10% Bonus

Updating Values

df.loc[df["Name"] == "Rahul", "Salary"] = 50000
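When a new value depends on a condition for every row, `np.where` is a common choice, and `assign` adds a column without modifying the original DataFrame. The `Band` column and its 60000 threshold are invented for this sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Name": ["Amit", "Pooja", "Rahul", "Neha"],
                   "Age": [25, 30, 22, 35],
                   "Salary": [50000, 60000, 45000, 70000]})

# np.where(condition, value_if_true, value_if_false), evaluated per row
df["Band"] = np.where(df["Salary"] >= 60000, "Senior", "Junior")

# assign returns a new DataFrame and leaves df unchanged
with_bonus = df.assign(Bonus=df["Salary"] * 0.10)

print(df)
print(with_bonus)
```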

Deleting a Column

df.drop(columns=["Bonus"], inplace=True)

Deleting a Row

df.drop(index=2, inplace=True)  # Remove the row with index label 2 (Rahul)
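Without `inplace=True`, `drop` returns a new DataFrame and leaves the original intact, which makes mistakes easier to undo. Dropping rows also leaves a gap in the index, which `reset_index` can close. A sketch with illustrative names:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Amit", "Pooja", "Rahul", "Neha"],
                   "Age": [25, 30, 22, 35],
                   "Salary": [50000, 60000, 45000, 70000]})

trimmed = df.drop(index=2)                # new DataFrame; df still has 4 rows
trimmed = trimmed.reset_index(drop=True)  # renumber the remaining rows 0, 1, 2
no_age = df.drop(columns=["Age"])         # new DataFrame without the Age column

print(len(df), len(trimmed))  # 4 3
```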

6. Sorting and Rearranging Data

Sorting by a Column

df_sorted = df.sort_values(by="Salary", ascending=False)
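Sorting can use several columns, each with its own direction. The sketch assumes a hypothetical `Dept` column, which is not part of the sample data above:

```python
import pandas as pd

# "Dept" is a made-up column added for this illustration
staff = pd.DataFrame({"Dept": ["IT", "HR", "IT", "HR"],
                      "Name": ["Amit", "Pooja", "Rahul", "Neha"],
                      "Salary": [50000, 60000, 45000, 70000]})

# Department A to Z, then highest salary first within each department
ranked = staff.sort_values(by=["Dept", "Salary"], ascending=[True, False])
ranked = ranked.reset_index(drop=True)

# Shortcut for the top N rows by one column
top_two = staff.nlargest(2, "Salary")

print(ranked)
```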

Reordering Columns

df = df[["Name", "Salary", "Age"]]

7. Grouping and Aggregating Data

Grouping helps in summarizing large datasets.

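df_grouped = df.groupby("Age").mean(numeric_only=True)  # Mean of numeric columns per age; skips the text column "Name"
salary_by_age = df.groupby("Age")["Salary"].sum()  # Total salary per age
print(salary_by_age)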
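Grouping is most useful when several rows share a key. The sketch adds a hypothetical `Dept` column (not in the original data) and computes several statistics in one call with named aggregation:

```python
import pandas as pd

# "Dept" is a made-up column added for this illustration
staff = pd.DataFrame({"Dept": ["IT", "HR", "IT", "HR"],
                      "Name": ["Amit", "Pooja", "Rahul", "Neha"],
                      "Age": [25, 30, 22, 35],
                      "Salary": [50000, 60000, 45000, 70000]})

# Each keyword is output_column=(input_column, aggregation)
summary = staff.groupby("Dept").agg(
    headcount=("Name", "count"),
    avg_salary=("Salary", "mean"),
    max_age=("Age", "max"),
)

print(summary)
```

The result has one row per department, indexed by `Dept`.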

8. Merging and Joining DataFrames

Merging DataFrames

df1 = pd.DataFrame({"ID": [1, 2], "Name": ["Amit", "Pooja"]})
df2 = pd.DataFrame({"ID": [1, 2], "Salary": [50000, 60000]})

df_merged = pd.merge(df1, df2, on="ID")
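By default `pd.merge` performs an inner join, keeping only keys present in both frames. The `how` parameter changes this. The sketch uses slightly different frames from the ones above (IDs 3 and 4 are added for illustration) so that the join types give different results:

```python
import pandas as pd

left_df = pd.DataFrame({"ID": [1, 2, 3], "Name": ["Amit", "Pooja", "Rahul"]})
right_df = pd.DataFrame({"ID": [1, 2, 4], "Salary": [50000, 60000, 70000]})

inner = pd.merge(left_df, right_df, on="ID")              # IDs 1 and 2 only
left = pd.merge(left_df, right_df, on="ID", how="left")   # all IDs from left_df; Salary is NaN for ID 3

# indicator=True adds a "_merge" column saying where each row came from
outer = pd.merge(left_df, right_df, on="ID", how="outer", indicator=True)

print(outer)
```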

Concatenating DataFrames

df_concat = pd.concat([df1, df2], axis=0)  # Stack rows; columns missing from one frame are filled with NaN
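Concatenation is typically used for frames that share the same columns (stacking rows) or the same index (placing columns side by side). A sketch with invented frames `jan`, `feb`, and `salaries`:

```python
import pandas as pd

jan = pd.DataFrame({"ID": [1, 2], "Name": ["Amit", "Pooja"]})
feb = pd.DataFrame({"ID": [3, 4], "Name": ["Rahul", "Neha"]})

# Same columns: stack rows and renumber the index 0..3
stacked = pd.concat([jan, feb], ignore_index=True)

# Same index: place columns side by side
salaries = pd.DataFrame({"Salary": [50000, 60000]})
side_by_side = pd.concat([jan, salaries], axis=1)

print(stacked)
print(side_by_side)
```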

9. Handling Missing Data

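Pick one strategy per column; running both is redundant, because nothing is left to drop once the gaps are filled.

df_filled = df.fillna(0)  # Option 1: replace missing values with 0
df_dropped = df.dropna()  # Option 2: remove rows that contain missing values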
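Filling every gap with 0 can distort statistics such as the mean, so it is worth counting the missing values first and choosing a fill value per column. The sketch uses made-up data with two gaps; filling with the median and mean is one possible choice, not a recommendation for every dataset:

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame({"Name": ["Amit", "Pooja", "Rahul", "Neha"],
                    "Age": [25, np.nan, 22, 35],
                    "Salary": [50000, 60000, np.nan, 70000]})

# Count the missing values in each column
missing_per_column = raw.isna().sum()

# Fill each column with its own statistic
filled = raw.fillna({"Age": raw["Age"].median(),
                     "Salary": raw["Salary"].mean()})

# Drop rows only when a specific column is missing
complete_only = raw.dropna(subset=["Salary"])

print(missing_per_column)
print(filled)
```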

10. Saving Processed Data

df.to_csv("processed_data.csv", index=False)  # Save as CSV
df.to_excel("processed_data.xlsx", index=False)  # Save as Excel (requires an engine such as openpyxl)
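Saved data can be read back with `pd.read_csv` to confirm it was written as expected. This sketch writes to an in-memory `io.StringIO` buffer instead of a real file so that it leaves nothing on disk; that substitution is made only for illustration:

```python
import io
import pandas as pd

df = pd.DataFrame({"Name": ["Amit", "Pooja", "Rahul", "Neha"],
                   "Age": [25, 30, 22, 35],
                   "Salary": [50000, 60000, 45000, 70000]})

# Write CSV text into the buffer, then rewind and read it back
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored.shape)          # (4, 3)
print(list(restored.columns))  # ['Name', 'Age', 'Salary']
```

With `index=False` the row index is not written, so the restored frame has the same three columns as the original.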

Conclusion

In this tutorial, we explored NumPy and Pandas for data manipulation: creating arrays and DataFrames, then filtering, sorting, grouping, merging, and cleaning data. The next tutorial focuses on Exploratory Data Analysis (EDA) techniques.