Unsupervised Learning
Overview
- Introduction to Unsupervised Learning
- K-Means Clustering Algorithm
- Hierarchical Clustering
- Principal Component Analysis (PCA)
- Autoencoders for Dimensionality Reduction
- Gaussian Mixture Models (GMM)
- Association Rule Learning (Apriori, FP-Growth)
- DBSCAN Clustering Algorithm
- Self-Organizing Maps (SOM)
- Applications of Unsupervised Learning
Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is a dimensionality reduction technique used in unsupervised learning. It simplifies datasets by transforming the features into a smaller set of uncorrelated variables called principal components, which capture the maximum variance in the data.
PCA is widely used for:
- Visualization of high-dimensional data
- Speeding up training by reducing input size
- Eliminating multicollinearity
- Noise reduction
Why Use PCA?
- Real-world datasets can have dozens or hundreds of features.
- Not all features contribute equally to the model.
- PCA projects data to a lower-dimensional space while preserving the most critical information (variance).
How PCA Works
- Standardize the data: PCA is sensitive to scale, so features must be normalized first.
- Compute the covariance matrix: captures the pairwise relationships between features.
- Compute eigenvectors and eigenvalues: the eigenvectors (principal components) define the new axes, and the eigenvalues give the variance each axis captures.
- Select the top k components: chosen based on the cumulative explained variance.
- Project the original data: transform the data onto the new k-dimensional subspace.
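The steps above can be sketched directly with NumPy. This is a minimal illustration on a small toy matrix (the data values and variable names here are our own, chosen only for demonstration):

```python
import numpy as np

# Toy data: 6 samples, 3 features
X = np.array([[2.5, 2.4, 0.5],
              [0.5, 0.7, 1.9],
              [2.2, 2.9, 0.4],
              [1.9, 2.2, 0.8],
              [3.1, 3.0, 0.2],
              [2.3, 2.7, 0.6]])

# 1. Standardize (zero mean, unit variance per feature)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the standardized features
cov = np.cov(X_std, rowvar=False)

# 3. Eigendecomposition (eigh is appropriate: covariance matrices are symmetric)
eigvals, eigvecs = np.linalg.eigh(cov)

# 4. Sort components by descending eigenvalue and keep the top k = 2
order = np.argsort(eigvals)[::-1]
W = eigvecs[:, order[:2]]

# 5. Project the data onto the 2-dimensional subspace
X_proj = X_std @ W
print(X_proj.shape)  # (6, 2)
```

In practice you would use `sklearn.decomposition.PCA` (shown next), which wraps these steps and handles numerical details via the SVD.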
Python Example of PCA with sklearn
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
# 1. Load dataset
iris = load_iris()
X = iris.data
y = iris.target
# 2. Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# 3. Apply PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
# 4. Plot results
plt.figure(figsize=(8, 5))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis', edgecolor='k')
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.title("PCA of Iris Dataset")
plt.colorbar(label='Target Class')
plt.show()
# 5. Explained Variance
print("Explained Variance Ratio:", pca.explained_variance_ratio_)
Output:
Explained Variance Ratio: [0.72962445 0.22850762]
Explanation
- Standardization ensures that features measured on different scales contribute equally to the analysis.
- n_components=2 reduces the 4-dimensional Iris data to 2D for visualization.
- explained_variance_ratio_ shows how much variance each component captures.
Interpreting PCA Output
- Principal Components are new axes (linear combinations of original features).
- Explained Variance Ratio tells how much information (variance) each principal component retains.
- Scree Plot: Plot cumulative variance to choose how many components to keep.
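A scree plot is straightforward to produce: fit PCA with all components and plot the cumulative explained variance. A short sketch, reusing the standardized Iris data from the example above:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)

# Fit PCA with all components to inspect the full variance spectrum
pca = PCA().fit(X)
cum_var = np.cumsum(pca.explained_variance_ratio_)

plt.plot(range(1, len(cum_var) + 1), cum_var, marker='o')
plt.xlabel("Number of Components")
plt.ylabel("Cumulative Explained Variance")
plt.title("Scree Plot (Iris)")
plt.show()
```

Look for the "elbow" where the curve flattens; components past that point add little variance.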
Applications of PCA
- Data Visualization (especially in 2D/3D)
- Noise filtering
- Preprocessing before clustering or classification
- Feature compression in image processing
- Speeding up machine learning algorithms
Advantages
- Reduces dimensionality and simplifies models
- Minimizes data redundancy
- Helps remove multicollinearity
- Fast and computationally efficient
Limitations
- Loses interpretability of original features
- Linear method – can't capture nonlinear relationships
- Sensitive to outliers
- Requires scaled data
Tips for Beginners
- Always scale your data before applying PCA.
- Use the elbow method on explained variance to decide the number of components.
- Use PCA for visualizing clusters in high-dimensional datasets (e.g., with K-Means).
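For instance, you can cluster with K-Means in the full feature space and then use PCA only to visualize the resulting clusters in 2D. A sketch (the choice of 3 clusters here is ours, matching the three Iris species):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)

# Cluster in the original 4-D feature space
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Project to 2-D only for visualization
X_2d = PCA(n_components=2).fit_transform(X)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, cmap='viridis', edgecolor='k')
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.title("K-Means Clusters in PCA Space")
plt.show()
```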
Tips for Professionals
- Combine PCA with clustering algorithms (e.g., DBSCAN, GMM) for better results.
- Use PCA before t-SNE or UMAP for improved 2D visualizations.
- Monitor explained_variance_ratio_ to retain 90–95% of the variance for good compression.
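scikit-learn supports this directly: passing a float between 0 and 1 as n_components keeps the smallest number of components whose cumulative explained variance exceeds that fraction. A sketch on the Iris data:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)

# Keep as many components as needed to explain at least 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(pca.n_components_)                    # number of components actually kept
print(pca.explained_variance_ratio_.sum())  # at least 0.95
```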
Summary
- PCA transforms high-dimensional data into fewer, uncorrelated components.
- It's useful for reducing overfitting, visualizing complex datasets, and speeding up learning algorithms.
- While powerful, PCA should be used with care, especially when interpretability is important.