K-Means Clustering Algorithm

K-Means is a popular unsupervised learning algorithm used to partition data into K distinct clusters based on feature similarity. It is iterative and aims to minimize within-cluster variance by assigning each data point to the nearest cluster center (centroid) and then updating centroids until convergence.


What You’ll Learn

  • How K-Means works (intuition and algorithm steps)
  • Choosing the number of clusters (K)
  • Python implementation using scikit-learn
  • Evaluating clustering quality
  • Advantages and limitations
  • Practical tips for both beginners and professionals

How K-Means Works

  1. Initialization
    • Select K initial centroids (randomly or via methods like K-Means++).
  2. Assignment Step
    • For each data point, calculate the distance to each centroid (usually Euclidean).
    • Assign the point to the nearest centroid’s cluster.
  3. Update Step
    • Compute the new centroid of each cluster by averaging the points assigned to it.
  4. Repeat
    • Repeat the Assignment and Update steps until centroids no longer move (convergence) or a maximum number of iterations is reached; a from-scratch sketch of these steps follows below.
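
To make the four steps concrete, here is a minimal from-scratch sketch in NumPy. The function name kmeans and its parameters are illustrative rather than taken from any library; it assumes X is an (n_samples, n_features) array and that no cluster becomes empty during the updates.

import numpy as np

def kmeans(X, k, max_iter=100, seed=42):
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick k distinct data points as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # 2. Assignment: label each point with the index of its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # 3. Update: move each centroid to the mean of its assigned points
        # (a production version would also handle clusters that become empty)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 4. Repeat until the centroids stop moving (convergence)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

Calling labels, centroids = kmeans(X, k=4) returns one label per point plus the final centroid positions, mirroring what scikit-learn's KMeans exposes as labels_ and cluster_centers_.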

Choosing K: The Elbow Method

  1. Run K-Means for a range of K values (e.g., 1 through 10).
  2. Compute the Sum of Squared Errors (SSE), also called the within-cluster sum of squares, for each K.
  3. Plot SSE vs. K.
  4. Look for the “elbow” point where SSE improvement significantly slows; that K is a good choice.

Python Implementation

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

# 1. Generate synthetic data
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=42)

# 2. Use the Elbow Method to find optimal K
sse = []
k_values = range(1, 11)
for k in k_values:
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)  # explicit n_init keeps behavior consistent across scikit-learn versions
    kmeans.fit(X)
    sse.append(kmeans.inertia_)  # inertia_ is the SSE

plt.plot(k_values, sse, marker='o')
plt.xlabel("Number of Clusters (K)")
plt.ylabel("Sum of Squared Errors (SSE)")
plt.title("Elbow Method for Choosing K")
plt.show()

# 3. Apply K-Means with chosen K (e.g., 4)
optimal_k = 4
kmeans = KMeans(n_clusters=optimal_k, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)
centroids = kmeans.cluster_centers_

# 4. Visualize the clusters and centroids
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=30)
plt.scatter(centroids[:, 0], centroids[:, 1], color='red', marker='x', s=100, linewidths=2)
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.title(f"K-Means Clustering (K={optimal_k})")
plt.show()

Output: two plots, the elbow curve of SSE vs. K followed by the cluster scatter with centroids marked as red "×" marks.

Explanation

  • Data Generation: make_blobs creates a dataset with 4 centers.
  • Elbow Method: We plot SSE (inertia) for K from 1 to 10. The “elbow” indicates the optimal K.
  • Fit K-Means: We choose K = 4, fit the model, and get cluster labels and centroids.
  • Visualization: Clusters are colored by label; centroids marked with a red “×”.

Evaluating Clustering Quality

  1. Inertia (SSE)
    • Lower inertia indicates tighter clusters, but inertia always decreases as K increases, so values cannot be compared directly across different K.
  2. Silhouette Score
    • Measures how similar a point is to its own cluster compared to other clusters:
      s = (b - a) / max(a, b)
      where a = average intra-cluster distance, b = average nearest-cluster distance.
    • Values range from -1 to 1; higher is better.

from sklearn.metrics import silhouette_score

score = silhouette_score(X, labels)
print("Silhouette Score:", score)

Output:

Silhouette Score: 0.8756469540734731

Advantages

  • Simple to understand and implement.
  • Scales well to large datasets (time complexity roughly O(n × k × i × d), where i = iterations and d = features).
  • Works well when clusters are roughly spherical and balanced in size.

Limitations

  • Must specify K in advance.
  • Sensitive to initial centroid placement (K-Means++ can help).
  • Not robust to clusters of different shapes or densities (illustrated in the sketch after this list).
  • Sensitive to outliers.
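
To see the shape limitation concretely, here is a small sketch (the make_moons dataset and all parameter values are illustrative assumptions) where K-Means fails even though the two groups are well separated:

from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

# Two interlocking crescents: clearly separated, but not spherical
X_moons, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

labels_moons = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_moons)
# K-Means draws a straight boundary between the two centroids instead of
# following the crescents, because nearest-centroid assignment always
# produces convex regions.

Density-based methods such as DBSCAN (mentioned in the tips below) recover the two crescents correctly on this kind of data.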

Tips for Beginners

  • Visualize data before applying K-Means to estimate reasonable K values.
  • Preprocess data: scale features so that distance metrics aren’t dominated by high-variance features.
  • Use K-Means++ initialization (init='k-means++') to improve convergence; both tips are combined in the sketch below.
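
Here is a minimal sketch combining both tips in a scikit-learn Pipeline; the synthetic data is just a stand-in for your own features, and n_clusters=4 is an assumed value you would choose via the Elbow Method:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

pipeline = make_pipeline(
    StandardScaler(),  # rescale each feature to zero mean and unit variance
    KMeans(n_clusters=4, init='k-means++', n_init=10, random_state=42),
)
labels = pipeline.fit_predict(X)

Bundling scaling and clustering in one Pipeline ensures the same scaling is applied whenever the model is reused on new data.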

Tips for Professionals

  • If clusters have different shapes or densities, consider alternatives like DBSCAN or Gaussian Mixture Models.
  • For high-dimensional data, apply dimensionality reduction (e.g., PCA) before K-Means to improve performance.
  • Monitor the convergence tolerance (tol parameter) and maximum iterations (max_iter) for large datasets.
  • Use Mini-Batch K-Means (sklearn.cluster.MiniBatchKMeans) for very large datasets to reduce computation time, as in the sketch below.
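
As a sketch of the last two tips, the snippet below runs Mini-Batch K-Means with explicit tol and max_iter settings; the dataset size and batch_size are illustrative assumptions, not tuned recommendations:

from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

X_large, _ = make_blobs(n_samples=100_000, centers=4, random_state=42)

mbk = MiniBatchKMeans(
    n_clusters=4,
    batch_size=1024,   # centroids are updated from random mini-batches
    max_iter=100,      # upper bound on passes over the data
    tol=1e-4,          # stop early once centroid movement is this small
    n_init=3,          # explicit for consistency across scikit-learn versions
    random_state=42,
)
labels_large = mbk.fit_predict(X_large)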

Summary

  • K-Means partitions data into K clusters by minimizing within-cluster variance.
  • The Elbow Method and Silhouette Score help choose an appropriate K.
  • Preprocessing (scaling, dimensionality reduction) and initialization strategies (K-Means++) improve results.
  • K-Means is effective for roughly spherical, equally sized clusters but has limitations with complex structures.