Unsupervised Learning
Overview
- Introduction to Unsupervised Learning
- K-Means Clustering Algorithm
- Hierarchical Clustering
- Principal Component Analysis (PCA)
- Autoencoders for Dimensionality Reduction
- Gaussian Mixture Models (GMM)
- Association Rule Learning (Apriori, FP-Growth)
- DBSCAN Clustering Algorithm
- Self-Organizing Maps (SOM)
- Applications of Unsupervised Learning
K-Means Clustering Algorithm
K-Means is a popular unsupervised learning algorithm used to partition data into K distinct clusters based on feature similarity. It is iterative and aims to minimize within-cluster variance by assigning each data point to the nearest cluster center (centroid) and then updating centroids until convergence.
What You’ll Learn
- How K-Means works (intuition and algorithm steps)
- Choosing the number of clusters (K)
- Python implementation using scikit-learn
- Evaluating clustering quality
- Advantages and limitations
- Practical tips for both beginners and professionals
How K-Means Works
- Initialization: Select K initial centroids (randomly or via methods like K-Means++).
- Assignment Step: For each data point, compute the distance to each centroid (usually Euclidean) and assign the point to the nearest centroid's cluster.
- Update Step: Recompute each cluster's centroid as the mean of the points assigned to it.
- Repeat: Alternate the Assignment and Update steps until the centroids no longer move (convergence) or a maximum number of iterations is reached, as sketched in the NumPy example below.
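To make these steps concrete, here is a minimal from-scratch sketch in NumPy. It is illustrative rather than a library API: it uses plain random initialization instead of K-Means++ and assumes no cluster ends up empty.
import numpy as np

def kmeans_sketch(X, k, max_iter=100, seed=42):
    rng = np.random.default_rng(seed)
    # Initialization: pick K distinct data points as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: label each point with the index of its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        # (assumes every cluster keeps at least one point)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Stop when centroids no longer move (convergence)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids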
Choosing K: The Elbow Method
- Run K-Means for a range of K values (e.g., 1 through 10).
- Compute the Sum of Squared Errors (SSE), also called the within-cluster sum of squares, for each K (defined below).
- Plot SSE vs. K.
- Look for the “elbow” point where SSE improvement significantly slows; that K is a good choice.
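For reference, the SSE that K-Means minimizes is the total squared distance from each point to the centroid of its cluster:
\[ \mathrm{SSE} = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2 \]
where C_j is the set of points assigned to cluster j and \mu_j is its centroid.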
Python Implementation
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
# 1. Generate synthetic data
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=42)
# 2. Use the Elbow Method to find optimal K
sse = []
k_values = range(1, 11)
for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X)
    sse.append(kmeans.inertia_)  # inertia_ is the SSE
plt.plot(k_values, sse, marker='o')
plt.xlabel("Number of Clusters (K)")
plt.ylabel("Sum of Squared Errors (SSE)")
plt.title("Elbow Method for Choosing K")
plt.show()
# 3. Apply K-Means with chosen K (e.g., 4)
optimal_k = 4
kmeans = KMeans(n_clusters=optimal_k, random_state=42)
labels = kmeans.fit_predict(X)
centroids = kmeans.cluster_centers_
# 4. Visualize the clusters and centroids
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=30)
plt.scatter(centroids[:, 0], centroids[:, 1], color='red', marker='x', s=100, linewidths=2)
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.title(f"K-Means Clustering (K={optimal_k})")
plt.show()
Output: an elbow plot of SSE vs. K, followed by a scatter plot of the four clusters with their centroids marked as red crosses.
Explanation
- Data Generation: make_blobs creates a dataset with 4 centers.
- Elbow Method: We plot SSE (inertia) for K from 1 to 10. The "elbow" indicates the optimal K.
- Fit K-Means: We choose K = 4, fit the model, and get cluster labels and centroids.
- Visualization: Clusters are colored by label; centroids marked with a red “×”.
Evaluating Clustering Quality
- Inertia (SSE)
- Lower inertia indicates tighter clusters, but inertia always decreases as K grows, so values for different K cannot be compared directly.
- Silhouette Score
- Measures how similar a point is to its own cluster compared to other clusters: s = (b − a) / max(a, b), where a = average intra-cluster distance and b = average distance to the points of the nearest other cluster. Values range from -1 to 1; higher is better.
from sklearn.metrics import silhouette_score
score = silhouette_score(X, labels)
print("Silhouette Score:", score)
Output:
Silhouette Score: 0.8756469540734731
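Beyond scoring a single clustering, the silhouette can also guide the choice of K. A short sketch reusing X and the imports from the examples above (the K range here is an arbitrary choice):
# Silhouette needs at least 2 clusters, so start the search at K = 2
for k in range(2, 7):
    labels_k = KMeans(n_clusters=k, random_state=42).fit_predict(X)
    print(f"K={k}: silhouette={silhouette_score(X, labels_k):.3f}")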
Advantages
- Simple to understand and implement.
- Scales to large datasets (time complexity roughly O(n × K × i × d), where i = iterations and d = number of features).
- Works well when clusters are roughly spherical and balanced in size.
Limitations
- Must specify K in advance.
- Sensitive to initial centroid placement (K-Means++ can help).
- Not robust to clusters of different shapes or densities.
- Sensitive to outliers.
Tips for Beginners
- Visualize data before applying K-Means to estimate reasonable K values.
- Preprocess data: scale features so that distance metrics aren’t dominated by high-variance features.
- Use K-Means++ initialization (init='k-means++', the default in scikit-learn) to improve convergence; a scaling sketch follows this list.
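As a minimal sketch of the preprocessing tip, assuming the X generated earlier (the pipeline layout is illustrative):
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Standardize features to zero mean and unit variance so that no single
# high-variance feature dominates the Euclidean distances, then cluster.
model = make_pipeline(StandardScaler(), KMeans(n_clusters=4, random_state=42))
labels = model.fit_predict(X)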
Tips for Professionals
- If clusters have different shapes or densities, consider alternatives like DBSCAN or Gaussian Mixture Models.
- For high-dimensional data, apply dimensionality reduction (e.g., PCA) before K-Means to improve performance.
- Monitor the convergence tolerance (tol parameter) and maximum iterations (max_iter) for large datasets.
- Use Mini-Batch K-Means (sklearn.cluster.MiniBatchKMeans) for very large datasets to reduce computation time; see the sketch below.
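A minimal Mini-Batch sketch, again reusing X; the batch_size value is an illustrative assumption you would tune:
from sklearn.cluster import MiniBatchKMeans

# Updates centroids from small random batches instead of the full dataset,
# trading a little accuracy for a large speedup on big data.
mbk = MiniBatchKMeans(n_clusters=4, batch_size=1024, random_state=42)
labels = mbk.fit_predict(X)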
Summary
- K-Means partitions data into K clusters by minimizing within-cluster variance.
- The Elbow Method and Silhouette Score help choose an appropriate K.
- Preprocessing (scaling, dimensionality reduction) and initialization strategies (K-Means++) improve results.
- K-Means is effective for roughly spherical, equally sized clusters but has limitations with complex structures.