Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is a dimensionality reduction technique used in unsupervised learning. It helps simplify datasets by transforming features into a smaller set of uncorrelated variables called principal components, which capture the maximum variance in the data.
PCA is widely used for:
- Visualization of high-dimensional data
- Speeding up training by reducing input size
- Eliminating multicollinearity
- Noise reduction
Why Use PCA?
- Real-world datasets can have dozens or hundreds of features.
- Not all features contribute equally to the model.
- PCA projects data to a lower-dimensional space while preserving the most critical information (variance).
How PCA Works
- Standardize the data: PCA is sensitive to scale, so the data must be normalized first.
- Compute the covariance matrix: this captures the pairwise relationships between features.
- Compute eigenvectors and eigenvalues: eigenvectors (the principal components) define new axes, and eigenvalues measure the variance captured along each axis.
- Select the top k components: choose k based on the cumulative explained variance.
- Project the original data: transform the data onto the new k-dimensional subspace (a from-scratch sketch of these steps follows below).
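To make the steps concrete, here is a minimal from-scratch sketch in NumPy using eigendecomposition of the covariance matrix (the function and variable names are illustrative, not from any library):
import numpy as np

def pca_from_scratch(X, k):
    # 1. Standardize: zero mean, unit variance per feature
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. Covariance matrix of the standardized features
    cov = np.cov(X_std, rowvar=False)
    # 3. Eigendecomposition (eigh suits symmetric matrices; eigenvalues ascend)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # 4. Sort components by descending eigenvalue and keep the top k
    order = np.argsort(eigenvalues)[::-1][:k]
    components = eigenvectors[:, order]
    explained_ratio = eigenvalues[order] / eigenvalues.sum()
    # 5. Project the standardized data onto the k-dimensional subspace
    return X_std @ components, explained_ratio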
Python Example of PCA with sklearn
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
# 1. Load dataset
iris = load_iris()
X = iris.data
y = iris.target
# 2. Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# 3. Apply PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
# 4. Plot results
plt.figure(figsize=(8, 5))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis', edgecolor='k')
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.title("PCA of Iris Dataset")
plt.colorbar(label='Target Class')
plt.show()
# 5. Explained Variance
print("Explained Variance Ratio:", pca.explained_variance_ratio_)
Output:
Explained Variance Ratio: [0.72962445 0.22850762]
Explanation
- Standardization ensures that features with different scales are normalized.
- n_components=2 reduces the 4-dimensional Iris data to 2D for visualization.
- explained_variance_ratio_ shows how much variance is captured by each component.
Interpreting PCA Output
- Principal Components are new axes (linear combinations of original features).
- Explained Variance Ratio tells how much information (variance) each principal component retains.
- Scree Plot: plot the cumulative explained variance to choose how many components to keep, as sketched below.
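A minimal scree-plot sketch, reusing X_scaled, PCA, and plt from the example above (the 95% threshold line is just an illustrative convention):
import numpy as np

pca_full = PCA().fit(X_scaled)  # keep all components
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

plt.plot(range(1, len(cumulative) + 1), cumulative, marker='o')
plt.axhline(0.95, color='red', linestyle='--', label='95% variance')
plt.xlabel("Number of Components")
plt.ylabel("Cumulative Explained Variance")
plt.title("Scree Plot of Iris PCA")
plt.legend()
plt.show()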
Applications of PCA
- Data Visualization (especially in 2D/3D)
- Noise filtering
- Preprocessing before clustering or classification
- Feature compression in image processing
- Speeding up machine learning algorithms
Advantages
- Reduces dimensionality and simplifies models
- Minimizes data redundancy
- Helps remove multicollinearity
- Fast and computationally efficient
Limitations
- Loses interpretability of original features
- Linear method – can't capture nonlinear relationships
- Sensitive to outliers
- Requires scaled data
Tips for Beginners
- Always scale your data before applying PCA.
- Use the elbow method on explained variance to decide the number of components.
- Use PCA for visualizing clusters in high-dimensional datasets (e.g., with K-Means), as sketched below.
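A minimal sketch of that last tip, reusing X_scaled and plt from the earlier example (n_clusters=3 and random_state=42 are illustrative choices):
from sklearn.cluster import KMeans

# Cluster in the full standardized feature space, not the reduced one
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_scaled)

# Reduce to 2D only for plotting
X_2d = PCA(n_components=2).fit_transform(X_scaled)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, cmap='viridis', edgecolor='k')
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.title("K-Means Clusters Visualized with PCA")
plt.show()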
Tips for Professionals
- Combine PCA with clustering algorithms (e.g., DBSCAN, GMM) for better results.
- Use PCA before t-SNE or UMAP for improved 2D visualizations.
- Monitor explained_variance_ratio_ to retain 90–95% of variance for optimal compression (see the sketch below).
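scikit-learn supports this directly: passing a float between 0 and 1 as n_components keeps just enough components to reach that variance fraction. A minimal sketch, reusing X_scaled from above:
# Keep however many components are needed to retain 95% of the variance
pca_95 = PCA(n_components=0.95)
X_reduced = pca_95.fit_transform(X_scaled)
print("Components kept:", pca_95.n_components_)
print("Variance retained:", pca_95.explained_variance_ratio_.sum())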
Summary
- PCA transforms high-dimensional data into fewer, uncorrelated components.
- It's useful for reducing overfitting, visualizing complex datasets, and speeding up learning algorithms.
- While powerful, PCA should be used with care, especially when interpretability is important.