Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a dimensionality reduction technique used in unsupervised learning. It helps simplify datasets by transforming features into a smaller set of uncorrelated variables called principal components, which capture the maximum variance in the data.

PCA is widely used for:

  • Visualization of high-dimensional data
  • Speeding up training by reducing input size
  • Eliminating multicollinearity
  • Noise reduction

Why Use PCA?

  • Real-world datasets can have dozens or hundreds of features.
  • Not all features contribute equally to the model.
  • PCA projects data to a lower-dimensional space while preserving the most critical information (variance).

How PCA Works

  1. Standardize the data
    PCA is sensitive to feature scale, so standardize each feature to zero mean and unit variance.
  2. Compute the covariance matrix
    Captures the relationship between features.
  3. Compute eigenvectors and eigenvalues
    Eigenvectors (principal components) define new axes, and eigenvalues represent variance captured.
  4. Select top k components
    Based on the cumulative explained variance.
  5. Project original data
    Transform the data onto the new k-dimensional subspace (a minimal NumPy sketch of all five steps follows this list).
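
To make these steps concrete, here is a minimal NumPy sketch of the same pipeline on stand-in random data. The array shape and the choice of k = 2 are illustrative assumptions, not prescribed by the method:

import numpy as np

# Stand-in data: 100 samples, 4 features (illustrative assumption)
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 4))

# 1. Standardize each feature to zero mean, unit variance
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the standardized features (4 x 4)
cov = np.cov(X_std, rowvar=False)

# 3. Eigendecomposition (eigh is appropriate for symmetric matrices)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 4. Sort by descending eigenvalue and keep the top k components
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
k = 2  # illustrative choice
print("Explained variance ratio:", eigenvalues[:k] / eigenvalues.sum())

# 5. Project the data onto the new k-dimensional subspace
X_projected = X_std @ eigenvectors[:, :k]
print("Projected shape:", X_projected.shape)  # (100, 2)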

Python Example of PCA with sklearn

from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# 1. Load dataset
iris = load_iris()
X = iris.data
y = iris.target

# 2. Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 3. Apply PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# 4. Plot results
plt.figure(figsize=(8, 5))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis', edgecolor='k')
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.title("PCA of Iris Dataset")
plt.colorbar(label='Target Class')
plt.show()

# 5. Explained Variance
print("Explained Variance Ratio:", pca.explained_variance_ratio_)

Output:

Explained Variance Ratio: [0.72962445 0.22850762]

Explanation

  • Standardization ensures that features with different scales are normalized.
  • n_components=2 reduces 4-dimensional Iris data to 2D for visualization.
  • explained_variance_ratio_ shows how much variance is captured by each component.

Interpreting PCA Output

  • Principal Components are new axes (linear combinations of original features).
  • Explained Variance Ratio tells how much information (variance) each principal component retains.
  • Scree Plot: plot cumulative explained variance to decide how many components to keep (see the sketch after this list).
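
As one way to put this into practice, the following sketch fits PCA on the standardized Iris data from the earlier example, plots the cumulative explained variance, and picks the smallest k that crosses a 95% threshold (the threshold is an example choice):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_iris().data)

# Fit PCA with all components to inspect the full variance spectrum
pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Scree plot: cumulative variance vs. number of components
plt.plot(range(1, len(cumulative) + 1), cumulative, marker='o')
plt.axhline(0.95, color='r', linestyle='--', label='95% threshold')  # example threshold
plt.xlabel("Number of Components")
plt.ylabel("Cumulative Explained Variance")
plt.title("Scree Plot (Cumulative)")
plt.legend()
plt.show()

# Smallest k reaching 95% of the variance
k = int(np.argmax(cumulative >= 0.95)) + 1
print("Components needed for 95% variance:", k)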

Applications of PCA

  • Data Visualization (especially in 2D/3D)
  • Noise filtering
  • Preprocessing before clustering or classification
  • Feature compression in image processing (a reconstruction sketch follows this list)
  • Speeding up machine learning algorithms
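
As an illustration of feature compression, the sketch below applies PCA to scikit-learn's 8x8 digits images and maps the compressed representation back to pixel space with inverse_transform. The choice of 16 components is an arbitrary assumption for demonstration:

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# 8x8 grayscale digits: 64 pixel features per image
X = load_digits().data

# Compress 64 features down to 16 components (illustrative choice)...
pca = PCA(n_components=16).fit(X)
X_compressed = pca.transform(X)

# ...and map back to pixel space to gauge the information loss
X_reconstructed = pca.inverse_transform(X_compressed)

print("Original shape:  ", X.shape)             # (1797, 64)
print("Compressed shape:", X_compressed.shape)  # (1797, 16)
print("Variance retained: %.3f" % pca.explained_variance_ratio_.sum())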

Advantages

  • Reduces dimensionality and simplifies models
  • Minimizes data redundancy
  • Helps remove multicollinearity
  • Fast and computationally efficient

Limitations

  • Loses interpretability of original features
  • Linear method: it cannot capture nonlinear relationships (kernel PCA, sketched after this list, is one nonlinear extension)
  • Sensitive to outliers
  • Requires scaled data
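
For nonlinear structure, scikit-learn provides KernelPCA as one alternative. Here is a brief sketch on the two-moons toy dataset; the RBF kernel and gamma=15 are illustrative choices, not universal settings:

from sklearn.datasets import make_moons
from sklearn.decomposition import KernelPCA, PCA

# Two interleaving half-moons: a classic nonlinear structure
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

# Linear PCA can only rotate and rescale the axes...
X_linear = PCA(n_components=2).fit_transform(X)

# ...while an RBF kernel can unfold the nonlinear shape
X_kernel = KernelPCA(n_components=2, kernel='rbf', gamma=15).fit_transform(X)

print("Linear PCA output:", X_linear.shape)
print("Kernel PCA output:", X_kernel.shape)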

Tips for Beginners

  • Always scale your data before applying PCA.
  • Use the elbow method on explained variance to decide the number of components.
  • Use PCA for visualizing clusters in high-dimensional datasets, e.g., with K-Means (sketched below).
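
A minimal sketch of that workflow: cluster the standardized Iris data with K-Means in the full 4-D feature space, then project to 2D with PCA only for plotting. The choice of 3 clusters is an assumption based on the known number of Iris species:

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

X_scaled = StandardScaler().fit_transform(load_iris().data)

# Cluster in the full 4-D space (3 clusters assumed), then project to 2D
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)
X_pca = PCA(n_components=2).fit_transform(X_scaled)

plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, cmap='viridis', edgecolor='k')
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.title("K-Means Clusters Projected onto 2 Principal Components")
plt.show()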

Tips for Professionals

  • Combine PCA with clustering algorithms (e.g., DBSCAN, GMM) for better results.
  • Use PCA before t-SNE or UMAP for improved 2D visualizations.
  • Monitor explained_variance_ratio_ and retain 90–95% of the variance for a good compression/information trade-off (see the sketch after this list).
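
scikit-learn supports this directly: passing a float between 0 and 1 as n_components keeps the fewest components whose cumulative explained variance reaches that fraction. A short sketch on the Iris data, where the 0.95 threshold is an example choice:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_iris().data)

# A float in (0, 1) tells PCA to keep the smallest number of
# components whose cumulative variance reaches that fraction
pca = PCA(n_components=0.95)  # 0.95 is an example threshold
X_reduced = pca.fit_transform(X_scaled)

print("Components kept:", pca.n_components_)
print("Variance retained: %.4f" % pca.explained_variance_ratio_.sum())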

Summary

  • PCA transforms high-dimensional data into fewer, uncorrelated components.
  • It's useful for reducing overfitting, visualizing complex datasets, and speeding up learning algorithms.
  • While powerful, PCA should be used with care, especially when interpretability is important.