Hierarchical Clustering
Hierarchical Clustering is an unsupervised learning technique used to group similar data points into clusters based on a hierarchy. Unlike K-Means, you don't need to predefine the number of clusters. Instead, the algorithm builds a tree-like structure (dendrogram) representing nested groupings of data.
There are two main types of hierarchical clustering:
- Agglomerative (bottom-up): Start with individual data points, merge clusters iteratively.
- Divisive (top-down): Start with all data in one cluster and split recursively.
Key Concepts
Dendrogram
A dendrogram is a tree-like diagram that records the sequences of merges or splits. Cutting the dendrogram at a chosen level yields the desired number of clusters.
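Cutting the tree at a height can be done programmatically with SciPy's `fcluster`. The sketch below (synthetic data and parameters chosen for illustration) picks a cut height between the third-to-last and second-to-last merges, which leaves exactly three clusters below the cut:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

# Toy data: three well-separated blobs (illustrative parameters)
X, _ = make_blobs(n_samples=30, centers=3, cluster_std=0.5, random_state=42)

Z = linkage(X, method='ward')   # Z records every merge and its height
# Cut between the 3rd-to-last and 2nd-to-last merge heights,
# undoing the final two merges and leaving three clusters
cut = (Z[-3, 2] + Z[-2, 2]) / 2
labels = fcluster(Z, t=cut, criterion='distance')
print(len(np.unique(labels)))   # 3
```

Because Ward linkage produces monotonically increasing merge heights, any cut between two consecutive merge heights yields a well-defined number of clusters.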
Distance Metrics
Hierarchical clustering relies on a distance metric:
- Euclidean (default)
- Manhattan
- Cosine, etc.
Linkage Criteria
Defines how the distance between clusters is calculated:
- Single Linkage – Minimum distance between points of the clusters
- Complete Linkage – Maximum distance between points
- Average Linkage – Average distance between points
- Ward’s Method – Minimizes variance within clusters
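The first three criteria can be checked by hand on two tiny point sets. This sketch uses `scipy.spatial.distance.cdist` to build the pairwise distance matrix between two clusters and then reads off the single, complete, and average linkage distances:

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two tiny clusters, each a set of 2-D points
A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[4.0, 0.0], [6.0, 0.0]])

D = cdist(A, B)      # all pairwise distances between the clusters
print(D.min())       # single linkage:   3.0 (closest pair)
print(D.max())       # complete linkage: 6.0 (farthest pair)
print(D.mean())      # average linkage:  4.5 (mean of all pairs)
```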
Python Example: Agglomerative Clustering
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as sch
# 1. Generate synthetic data
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.60, random_state=42)
# 2. Plot dendrogram to visualize hierarchy
plt.figure(figsize=(8, 5))
dendrogram = sch.dendrogram(sch.linkage(X, method='ward'))
plt.title("Dendrogram")
plt.xlabel("Data Points")
plt.ylabel("Euclidean Distance")
plt.show()
# 3. Fit Agglomerative Clustering with 3 clusters
model = AgglomerativeClustering(n_clusters=3, metric='euclidean', linkage='ward')  # 'metric' replaced 'affinity' in scikit-learn 1.2
labels = model.fit_predict(X)
# 4. Visualize Clusters
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.title("Agglomerative Clustering (3 Clusters)")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()
Output: the dendrogram, followed by a scatter plot of the three clusters.
Explanation
- Step 1: Create data with known centers.
- Step 2: Plot a dendrogram to observe the natural cluster hierarchy.
- Step 3: Use AgglomerativeClustering with ward linkage.
- Step 4: Visualize the clustering results.
When to Use Hierarchical Clustering
- When the number of clusters is unknown.
- When you want to understand the data structure or nested groupings.
- For smaller datasets (does not scale well for very large data).
Advantages
- No need to predefine number of clusters.
- Produces a dendrogram for visual analysis.
- Can capture nested data groupings.
Limitations
- Computationally expensive (O(n²) memory; O(n³) time for the naive agglomerative algorithm).
- Sensitive to noise and outliers.
- Final clusters are not always optimal due to early merge decisions (no reassignments).
Tips for Beginners
- Use dendrograms to visually decide on the number of clusters.
- Stick with ward linkage and euclidean distance for standard tasks.
- Normalize data if features have different scales.
Tips for Professionals
- For large datasets, use scikit-learn’s connectivity constraints to speed up clustering.
- Combine PCA with hierarchical clustering for better performance and visualization.
- Use hierarchical clustering as a preprocessing step to inform other models.
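The first two tips above can be sketched together: reduce dimensionality with PCA, then pass a k-nearest-neighbors connectivity graph so that merges are only considered between nearby points. The data and parameters here are illustrative, not prescriptive:

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.neighbors import kneighbors_graph
from sklearn.cluster import AgglomerativeClustering

X, _ = make_blobs(n_samples=500, centers=3, n_features=10, random_state=42)

# 1. Project 10-D data down to 2 principal components
X_reduced = PCA(n_components=2).fit_transform(X)

# 2. Restrict merges to each point's 10 nearest neighbors; the sparse
#    graph speeds up agglomeration on larger datasets
connectivity = kneighbors_graph(X_reduced, n_neighbors=10, include_self=False)
model = AgglomerativeClustering(n_clusters=3, connectivity=connectivity,
                                linkage='ward')
labels = model.fit_predict(X_reduced)
```

Note that scikit-learn may warn and repair the connectivity graph if it is not fully connected; the result is still a valid 3-way partition.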
Summary
- Hierarchical clustering builds a hierarchy of clusters without needing to specify K.
- Dendrograms help visualize and select cluster groups.
- Best suited for small to medium datasets with clear hierarchical structures.