Introduction to Hierarchical Clustering in Machine Learning

Sharmasaravanan
2 min read · Mar 12, 2023


Hierarchical clustering is a popular unsupervised machine learning algorithm used for grouping similar data points together. It is a simple yet powerful method that builds a hierarchy of clusters, either by iteratively merging the most similar clusters (bottom-up) or by iteratively splitting clusters apart (top-down). These two approaches are called Agglomerative clustering and Divisive clustering, respectively.

Agglomerative clustering is the more commonly used of the two. It starts with each data point as a separate cluster and iteratively merges the most similar clusters until a single cluster remains (or a desired number of clusters is reached).

Here is a step-by-step explanation of the Agglomerative clustering algorithm:

  1. Start with each data point as a separate cluster.
  2. Compute the distance between all pairs of clusters using a chosen distance metric.
  3. Merge the two closest clusters into a single cluster.
  4. Recompute the distance between the new cluster and all other clusters.
  5. Repeat steps 3 and 4 until all data points are in a single cluster or a desired number of clusters is reached.
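
To make these steps concrete, here is a minimal from-scratch sketch in plain NumPy. It uses single linkage (the distance between two clusters is the smallest point-to-point distance); the function name agglomerative_sketch is just for this illustration, not from any library:

# A minimal from-scratch sketch of the five steps above
import numpy as np

def agglomerative_sketch(X, n_clusters):
    # Step 1: every data point starts as its own cluster
    clusters = [[i] for i in range(len(X))]
    # Step 5: keep merging until the desired number of clusters remains
    while len(clusters) > n_clusters:
        best = (np.inf, 0, 1)
        # Steps 2 and 4: distance between every pair of clusters,
        # using single linkage (smallest point-to-point distance)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if d < best[0]:
                    best = (d, a, b)
        # Step 3: merge the two closest clusters into one
        _, a, b = best
        clusters[a].extend(clusters[b])
        del clusters[b]
    return clusters

This naive version recomputes every pairwise distance on each merge, so it is far too slow for real data sets; it exists only to map the five steps onto code. In practice you would use a library implementation, like the Scikit-learn example below.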

Here is a Python code snippet for implementing Agglomerative clustering using the Scikit-learn library:

# Import necessary libraries
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

# Generate random data points
X, y = make_blobs(n_samples=100, centers=3, n_features=2, random_state=42)

# Create an AgglomerativeClustering object
agg = AgglomerativeClustering(n_clusters=3)

# Fit the AgglomerativeClustering model to the data
agg.fit(X)

# Get the cluster labels
labels = agg.labels_

# Plot the data points and clusters
plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.show()

(Output: a scatter plot of the 100 generated points, colored by their assigned cluster.)

In this code snippet, we first generate random data points using the make_blobs function from Scikit-learn. We then create an AgglomerativeClustering object with 3 clusters and fit the model to the data. We get the cluster labels using the labels_ attribute of the AgglomerativeClustering object. Finally, we plot the data points and clusters using the scatter function from Matplotlib.

The n_clusters parameter of the AgglomerativeClustering object specifies the number of clusters to form. The fit method fits the model to the data, after which the labels_ attribute holds the cluster assigned to each data point; the fit_predict method does both in a single call.
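
As a small illustration (reusing X from the snippet above; the variable names here are just for this example), fit_predict returns the labels directly, and the linkage parameter controls how the distance between two clusters is measured:

# fit_predict fits the model and returns the labels in one call.
# linkage can be "ward" (the default), "complete", "average", or "single"
agg_single = AgglomerativeClustering(n_clusters=3, linkage="single")
labels_single = agg_single.fit_predict(X)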

Hierarchical clustering is a powerful and widely used unsupervised machine learning algorithm. However, it also has some limitations, such as sensitivity to the distance metric and the need for a dendrogram to determine the number of clusters. Therefore, it is important to choose the appropriate distance metric and carefully inspect the dendrogram to determine the optimal number of clusters.
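
As a quick sketch of that inspection step (assuming the same X and plt from the example above), SciPy's hierarchy module can compute the full merge tree and draw the dendrogram:

# Build the full merge hierarchy and draw the dendrogram
from scipy.cluster.hierarchy import dendrogram, linkage

Z = linkage(X, method="ward")
dendrogram(Z)  # long vertical branches suggest a natural place to cut
plt.show()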

Stay tuned for more!

I am always happy to connect with my followers and readers on LinkedIn. If you have any questions or just want to say hello, please don’t hesitate to reach out.

https://www.linkedin.com/in/sharmasaravanan/

Happy learning!

Adios, me gusta!! 🤗🤗
