K-Means Clustering Algorithm

K-Means is a popular unsupervised learning algorithm used to partition data into K distinct clusters based on feature similarity. It is iterative and aims to minimize within-cluster variance by assigning each data point to the nearest cluster center (centroid) and then updating centroids until convergence.


What You’ll Learn

  • How K-Means works (intuition and algorithm steps)
  • Choosing the number of clusters (K)
  • Python implementation using scikit-learn
  • Evaluating clustering quality
  • Advantages and limitations
  • Practical tips for both beginners and professionals

How K-Means Works

  1. Initialization
    • Select K initial centroids (randomly or via methods like K-Means++).
  2. Assignment Step
    • For each data point, calculate the distance to each centroid (usually Euclidean).
    • Assign the point to the nearest centroid’s cluster.
  3. Update Step
    • Compute the new centroid of each cluster by averaging the points assigned to it.
  4. Repeat
    • Repeat the Assignment and Update steps until centroids no longer move (convergence) or a maximum number of iterations is reached; a from-scratch sketch of these steps follows below.
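
To make the four steps concrete, here is a minimal from-scratch sketch in NumPy. The function name kmeans and its parameters are illustrative rather than taken from any library; it assumes X is an (n_samples, n_features) array and that no cluster becomes empty during the updates.

import numpy as np

def kmeans(X, k, max_iter=100, seed=42):
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick k distinct data points as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # 2. Assignment: label each point with the index of its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # 3. Update: move each centroid to the mean of its assigned points
        # (a production version would also handle clusters that become empty)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 4. Repeat until the centroids stop moving (convergence)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

Calling labels, centroids = kmeans(X, k=4) returns one label per point plus the final centroid positions, mirroring what scikit-learn's KMeans exposes as labels_ and cluster_centers_.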

Choosing K: The Elbow Method

  1. Run K-Means for a range of K values (e.g., 1 through 10).
  2. Compute the Sum of Squared Errors (SSE), also called the within-cluster sum of squares, for each K.
  3. Plot SSE vs. K.
  4. Look for the “elbow” point where SSE improvement significantly slows; that K is a good choice.

Python Implementation

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

# 1. Generate synthetic data
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=42)

# 2. Use the Elbow Method to find optimal K
sse = []
k_values = range(1, 11)
for k in k_values:
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)  # explicit n_init keeps behavior consistent across scikit-learn versions
    kmeans.fit(X)
    sse.append(kmeans.inertia_)  # inertia_ is the SSE

plt.plot(k_values, sse, marker='o')
plt.xlabel("Number of Clusters (K)")
plt.ylabel("Sum of Squared Errors (SSE)")
plt.title("Elbow Method for Choosing K")
plt.show()

# 3. Apply K-Means with chosen K (e.g., 4)
optimal_k = 4
kmeans = KMeans(n_clusters=optimal_k, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)
centroids = kmeans.cluster_centers_

# 4. Visualize the clusters and centroids
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=30)
plt.scatter(centroids[:, 0], centroids[:, 1], color='red', marker='x', s=100, linewidths=2)
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.title(f"K-Means Clustering (K={optimal_k})")
plt.show()

Output: two plots, the elbow curve of SSE vs. K followed by the cluster scatter with centroids marked as red "×" marks.

Explanation

  • Data Generation: make_blobs creates a dataset with 4 centers.
  • Elbow Method: We plot SSE (inertia) for K from 1 to 10. The “elbow” indicates the optimal K.
  • Fit K-Means: We choose K = 4, fit the model, and get cluster labels and centroids.
  • Visualization: Clusters are colored by label; centroids marked with a red “×”.

Evaluating Clustering Quality

  1. Inertia (SSE)
    • Lower inertia indicates tighter clusters, but inertia always decreases as K increases, so values cannot be compared directly across different K.
  2. Silhouette Score
    • Measures how similar a point is to its own cluster compared to other clusters:
      s = (b - a) / max(a, b)
      where a = average intra-cluster distance, b = average nearest-cluster distance.
    • Values range from -1 to 1; higher is better.

from sklearn.metrics import silhouette_score

score = silhouette_score(X, labels)
print("Silhouette Score:", score)

Output:

Silhouette Score: 0.8756469540734731

Advantages

  • Simple to understand and implement.
  • Scales well to large datasets (time complexity roughly O(n × k × i × d), where i = iterations and d = features).
  • Works well when clusters are roughly spherical and balanced in size.

Limitations

  • Must specify K in advance.
  • Sensitive to initial centroid placement (K-Means++ can help).
  • Not robust to clusters of different shapes or densities (illustrated in the sketch after this list).
  • Sensitive to outliers.
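
To see the shape limitation concretely, here is a small sketch (the make_moons dataset and all parameter values are illustrative assumptions) where K-Means fails even though the two groups are well separated:

from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

# Two interlocking crescents: clearly separated, but not spherical
X_moons, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

labels_moons = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_moons)
# K-Means draws a straight boundary between the two centroids instead of
# following the crescents, because nearest-centroid assignment always
# produces convex regions.

Density-based methods such as DBSCAN (mentioned in the tips below) recover the two crescents correctly on this kind of data.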

Tips for Beginners

  • Visualize data before applying K-Means to estimate reasonable K values.
  • Preprocess data: scale features so that distance metrics aren’t dominated by high-variance features.
  • Use K-Means++ initialization (init='k-means++') to improve convergence; both tips are combined in the sketch below.
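
Here is a minimal sketch combining both tips in a scikit-learn Pipeline; the synthetic data is just a stand-in for your own features, and n_clusters=4 is an assumed value you would choose via the Elbow Method:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

pipeline = make_pipeline(
    StandardScaler(),  # rescale each feature to zero mean and unit variance
    KMeans(n_clusters=4, init='k-means++', n_init=10, random_state=42),
)
labels = pipeline.fit_predict(X)

Bundling scaling and clustering in one Pipeline ensures the same scaling is applied whenever the model is reused on new data.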

Tips for Professionals

  • If clusters have different shapes or densities, consider alternatives like DBSCAN or Gaussian Mixture Models.
  • For high-dimensional data, apply dimensionality reduction (e.g., PCA) before K-Means to improve performance.
  • Monitor the convergence tolerance (tol parameter) and maximum iterations (max_iter) for large datasets.
  • Use Mini-Batch K-Means (sklearn.cluster.MiniBatchKMeans) for very large datasets to reduce computation time, as in the sketch below.
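
As a sketch of the last two tips, the snippet below runs Mini-Batch K-Means with explicit tol and max_iter settings; the dataset size and batch_size are illustrative assumptions, not tuned recommendations:

from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

X_large, _ = make_blobs(n_samples=100_000, centers=4, random_state=42)

mbk = MiniBatchKMeans(
    n_clusters=4,
    batch_size=1024,   # centroids are updated from random mini-batches
    max_iter=100,      # upper bound on passes over the data
    tol=1e-4,          # stop early once centroid movement is this small
    n_init=3,          # explicit for consistency across scikit-learn versions
    random_state=42,
)
labels_large = mbk.fit_predict(X_large)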

Summary

  • K-Means partitions data into K clusters by minimizing within-cluster variance.
  • The Elbow Method and Silhouette Score help choose an appropriate K.
  • Preprocessing (scaling, dimensionality reduction) and initialization strategies (K-Means++) improve results.
  • K-Means is effective for roughly spherical, equally sized clusters but has limitations with complex structures.