Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a dimensionality reduction technique used in unsupervised learning. It helps simplify datasets by transforming features into a smaller set of uncorrelated variables called principal components, which capture the maximum variance in the data.

PCA is widely used for:

  • Visualization of high-dimensional data
  • Speeding up training by reducing input size
  • Eliminating multicollinearity
  • Noise reduction

Why Use PCA?

  • Real-world datasets can have dozens or hundreds of features.
  • Not all features contribute equally to the model.
  • PCA projects data to a lower-dimensional space while preserving the most critical information (variance).

How PCA Works

  1. Standardize the data
    PCA is sensitive to feature scale, so standardize each feature to zero mean and unit variance.
  2. Compute the covariance matrix
    Captures the relationship between features.
  3. Compute eigenvectors and eigenvalues
    Eigenvectors (principal components) define new axes, and eigenvalues represent variance captured.
  4. Select top k components
    Based on the cumulative explained variance.
  5. Project original data
    Transform the data onto the new k-dimensional subspace (a minimal NumPy sketch of all five steps follows this list).
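
To make these steps concrete, here is a minimal NumPy sketch of the same pipeline on stand-in random data. The array shape and the choice of k = 2 are illustrative assumptions, not prescribed by the method:

import numpy as np

# Stand-in data: 100 samples, 4 features (illustrative assumption)
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 4))

# 1. Standardize each feature to zero mean, unit variance
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the standardized features (4 x 4)
cov = np.cov(X_std, rowvar=False)

# 3. Eigendecomposition (eigh is appropriate for symmetric matrices)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 4. Sort by descending eigenvalue and keep the top k components
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
k = 2  # illustrative choice
print("Explained variance ratio:", eigenvalues[:k] / eigenvalues.sum())

# 5. Project the data onto the new k-dimensional subspace
X_projected = X_std @ eigenvectors[:, :k]
print("Projected shape:", X_projected.shape)  # (100, 2)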

Python Example of PCA with sklearn

from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# 1. Load dataset
iris = load_iris()
X = iris.data
y = iris.target

# 2. Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 3. Apply PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# 4. Plot results
plt.figure(figsize=(8, 5))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis', edgecolor='k')
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.title("PCA of Iris Dataset")
plt.colorbar(label='Target Class')
plt.show()

# 5. Explained Variance
print("Explained Variance Ratio:", pca.explained_variance_ratio_)

Output:

Explained Variance Ratio: [0.72962445 0.22850762]

Explanation

  • Standardization ensures that features with different scales are normalized.
  • n_components=2 reduces 4-dimensional Iris data to 2D for visualization.
  • explained_variance_ratio_ shows how much variance is captured by each component.

Interpreting PCA Output

  • Principal Components are new axes (linear combinations of original features).
  • Explained Variance Ratio tells how much information (variance) each principal component retains.
  • Scree Plot: plot cumulative explained variance to decide how many components to keep (see the sketch after this list).
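
As one way to put this into practice, the following sketch fits PCA on the standardized Iris data from the earlier example, plots the cumulative explained variance, and picks the smallest k that crosses a 95% threshold (the threshold is an example choice):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_iris().data)

# Fit PCA with all components to inspect the full variance spectrum
pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Scree plot: cumulative variance vs. number of components
plt.plot(range(1, len(cumulative) + 1), cumulative, marker='o')
plt.axhline(0.95, color='r', linestyle='--', label='95% threshold')  # example threshold
plt.xlabel("Number of Components")
plt.ylabel("Cumulative Explained Variance")
plt.title("Scree Plot (Cumulative)")
plt.legend()
plt.show()

# Smallest k reaching 95% of the variance
k = int(np.argmax(cumulative >= 0.95)) + 1
print("Components needed for 95% variance:", k)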

Applications of PCA

  • Data Visualization (especially in 2D/3D)
  • Noise filtering
  • Preprocessing before clustering or classification
  • Feature compression in image processing (a reconstruction sketch follows this list)
  • Speeding up machine learning algorithms
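
As an illustration of feature compression, the sketch below applies PCA to scikit-learn's 8x8 digits images and maps the compressed representation back to pixel space with inverse_transform. The choice of 16 components is an arbitrary assumption for demonstration:

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# 8x8 grayscale digits: 64 pixel features per image
X = load_digits().data

# Compress 64 features down to 16 components (illustrative choice)...
pca = PCA(n_components=16).fit(X)
X_compressed = pca.transform(X)

# ...and map back to pixel space to gauge the information loss
X_reconstructed = pca.inverse_transform(X_compressed)

print("Original shape:  ", X.shape)             # (1797, 64)
print("Compressed shape:", X_compressed.shape)  # (1797, 16)
print("Variance retained: %.3f" % pca.explained_variance_ratio_.sum())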

Advantages

  • Reduces dimensionality and simplifies models
  • Minimizes data redundancy
  • Helps remove multicollinearity
  • Fast and computationally efficient

Limitations

  • Loses interpretability of original features
  • Linear method: it cannot capture nonlinear relationships (kernel PCA, sketched after this list, is one nonlinear extension)
  • Sensitive to outliers
  • Requires scaled data
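
For nonlinear structure, scikit-learn provides KernelPCA as one alternative. Here is a brief sketch on the two-moons toy dataset; the RBF kernel and gamma=15 are illustrative choices, not universal settings:

from sklearn.datasets import make_moons
from sklearn.decomposition import KernelPCA, PCA

# Two interleaving half-moons: a classic nonlinear structure
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

# Linear PCA can only rotate and rescale the axes...
X_linear = PCA(n_components=2).fit_transform(X)

# ...while an RBF kernel can unfold the nonlinear shape
X_kernel = KernelPCA(n_components=2, kernel='rbf', gamma=15).fit_transform(X)

print("Linear PCA output:", X_linear.shape)
print("Kernel PCA output:", X_kernel.shape)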

Tips for Beginners

  • Always scale your data before applying PCA.
  • Use the elbow method on explained variance to decide the number of components.
  • Use PCA for visualizing clusters in high-dimensional datasets, e.g., with K-Means (sketched below).
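
A minimal sketch of that workflow: cluster the standardized Iris data with K-Means in the full 4-D feature space, then project to 2D with PCA only for plotting. The choice of 3 clusters is an assumption based on the known number of Iris species:

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

X_scaled = StandardScaler().fit_transform(load_iris().data)

# Cluster in the full 4-D space (3 clusters assumed), then project to 2D
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)
X_pca = PCA(n_components=2).fit_transform(X_scaled)

plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, cmap='viridis', edgecolor='k')
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.title("K-Means Clusters Projected onto 2 Principal Components")
plt.show()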

Tips for Professionals

  • Combine PCA with clustering algorithms (e.g., DBSCAN, GMM) for better results.
  • Use PCA before t-SNE or UMAP for improved 2D visualizations.
  • Monitor explained_variance_ratio_ and retain 90–95% of the variance for a good compression/information trade-off (see the sketch after this list).
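
scikit-learn supports this directly: passing a float between 0 and 1 as n_components keeps the fewest components whose cumulative explained variance reaches that fraction. A short sketch on the Iris data, where the 0.95 threshold is an example choice:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_iris().data)

# A float in (0, 1) tells PCA to keep the smallest number of
# components whose cumulative variance reaches that fraction
pca = PCA(n_components=0.95)  # 0.95 is an example threshold
X_reduced = pca.fit_transform(X_scaled)

print("Components kept:", pca.n_components_)
print("Variance retained: %.4f" % pca.explained_variance_ratio_.sum())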

Summary

  • PCA transforms high-dimensional data into fewer, uncorrelated components.
  • It's useful for reducing overfitting, visualizing complex datasets, and speeding up learning algorithms.
  • While powerful, PCA should be used with care, especially when interpretability is important.