This tutorial introduces clustering, a technique used in unsupervised learning. We will focus primarily on K-Means and Hierarchical Clustering.
By the end of this tutorial, you will have a solid understanding of these clustering techniques and be able to implement K-Means and Hierarchical clustering in Python.
You should have a basic understanding of Python programming and some familiarity with machine learning concepts.
Clustering is the task of dividing data points into groups such that points in the same group are more similar to each other than to points in other groups. In simple terms, the aim is to segregate data with similar traits into clusters.
There are two types of clustering we'll focus on:
1. K-Means Clustering: K-Means is a centroid-based (distance-based) algorithm: each cluster is associated with a centroid, and each data point is assigned to the cluster whose centroid is nearest.
2. Hierarchical Clustering: Hierarchical clustering, also known as hierarchical cluster analysis, builds a hierarchy of clusters, in the agglomerative case by repeatedly merging the most similar groups of objects.
In K-Means clustering, we initialize 'k' centroids randomly, assign each data point to the nearest centroid, and then recompute each centroid as the average of all points in its cluster. These assignment and update steps are repeated until the centroids no longer change (or a maximum number of iterations is reached).
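The steps above can be written out from scratch for intuition. This is a minimal sketch (the function name `kmeans_sketch` is ours, this is not how scikit-learn implements K-Means, and it does not handle the edge case of a cluster losing all its points):

```python
import numpy as np

def kmeans_sketch(points, k, iters=100, seed=0):
    """Minimal K-Means: random init, assign to nearest centroid, update, repeat."""
    rng = np.random.default_rng(seed)
    # Initialize k centroids by sampling k distinct data points
    centroids = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest centroid
        dists = np.linalg.norm(points[:, None] - centroids[None, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points
        new_centroids = np.array([points[labels == i].mean(axis=0) for i in range(k)])
        if np.allclose(new_centroids, centroids):  # converged: centroids stopped moving
            break
        centroids = new_centroids
    return labels, centroids
```

Library implementations add refinements such as smarter initialization (k-means++) and multiple restarts, but the core loop is the same.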
In Hierarchical (agglomerative) Clustering, we start by treating each data point as its own cluster. Then we repeatedly merge the two closest clusters, according to some distance measure, until only a single cluster is left. The sequence of merges forms a tree that can be visualized as a dendrogram.
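For intuition, the merge loop can also be written out naively. A sketch assuming single linkage (the distance between two clusters is the distance between their closest members) with brute-force O(n^3) search, names ours:

```python
import numpy as np

def single_linkage_sketch(points, target_clusters=1):
    """Naive agglomerative clustering: merge the two closest clusters until target_clusters remain."""
    clusters = [[i] for i in range(len(points))]  # every point starts as its own cluster
    while len(clusters) > target_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest members of the two clusters
                d = min(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge cluster b into cluster a
        del clusters[b]
    return clusters
```

SciPy's `linkage`, used below, computes this hierarchy far more efficiently and records every merge rather than stopping at a fixed count.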
# Importing Required Libraries
from sklearn.cluster import KMeans
import pandas as pd
# Creating a Dataframe
data = pd.DataFrame({
'x': [12, 20, 28, 18, 29, 33, 24, 45, 45, 52, 51, 52, 55, 53, 55, 61, 64, 69, 72],
'y': [39, 36, 30, 52, 54, 46, 55, 59, 63, 70, 66, 63, 58, 23, 14, 8, 19, 7, 24]
})
# Initializing KMeans with 3 clusters (fixed random_state for reproducibility)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
# Fitting the model to the data
kmeans.fit(data)
# Predicting the clusters
labels = kmeans.predict(data)  # cluster number for each data point
# Getting the cluster centers
centroids = kmeans.cluster_centers_  # coordinates of the three centroids
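Once the model is fitted, the labels can be attached back to the DataFrame to inspect each cluster. A small self-contained sketch (the column name 'cluster' and the tiny sample data are our own choices):

```python
import pandas as pd
from sklearn.cluster import KMeans

data = pd.DataFrame({'x': [12, 20, 55, 61], 'y': [39, 36, 14, 8]})  # tiny sample
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
data['cluster'] = kmeans.labels_  # attach the cluster id of each row
# Per-cluster means of x and y; at convergence these match kmeans.cluster_centers_
print(data.groupby('cluster').mean())
```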
# Importing Required Libraries
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
# Creating a Dataframe
data = pd.DataFrame({
'x': [12, 20, 28, 18, 29, 33, 24, 45, 45, 52, 51, 52, 55, 53, 55, 61, 64, 69, 72],
'y': [39, 36, 30, 52, 54, 46, 55, 59, 63, 70, 66, 63, 58, 23, 14, 8, 19, 7, 24]
})
# Creating a linkage matrix with single linkage (cluster distance = closest pair of points)
linked = linkage(data, 'single')
# Plotting the dendrogram
dendrogram(
    linked,
    orientation='top',
    labels=data.index.tolist(),
    distance_sort='descending',
    show_leaf_counts=True,
)
plt.show()
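The dendrogram only visualizes the hierarchy; to obtain actual cluster labels, you can cut the tree with SciPy's `fcluster`. A sketch using the same data and linkage (the choice of 3 clusters is ours):

```python
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster

data = pd.DataFrame({
    'x': [12, 20, 28, 18, 29, 33, 24, 45, 45, 52, 51, 52, 55, 53, 55, 61, 64, 69, 72],
    'y': [39, 36, 30, 52, 54, 46, 55, 59, 63, 70, 66, 63, 58, 23, 14, 8, 19, 7, 24]
})
linked = linkage(data, 'single')
# Cut the tree so that at most 3 clusters remain
labels = fcluster(linked, t=3, criterion='maxclust')
print(labels)  # one cluster id (1..3) per row of data
```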
In this tutorial, we covered the basics of clustering techniques in unsupervised learning, focusing on K-Means and Hierarchical Clustering. We went through a step-by-step guide on how to implement these techniques and provided code snippets for better understanding.
Next, explore more advanced clustering techniques such as DBSCAN and Mean Shift, and study methods for choosing the optimal number of clusters, such as the Elbow Method and the Silhouette Method.
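As a brief taste of the Elbow Method mentioned above: fit K-Means for several values of k and look for the point where inertia (the within-cluster sum of squares) stops dropping sharply. A sketch on the tutorial's data (the range of k values is our choice):

```python
import pandas as pd
from sklearn.cluster import KMeans

data = pd.DataFrame({
    'x': [12, 20, 28, 18, 29, 33, 24, 45, 45, 52, 51, 52, 55, 53, 55, 61, 64, 69, 72],
    'y': [39, 36, 30, 52, 54, 46, 55, 59, 63, 70, 66, 63, 58, 23, 14, 8, 19, 7, 24]
})
inertias = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    inertias.append(km.inertia_)  # within-cluster sum of squares
    print(k, round(km.inertia_, 1))
# The "elbow" is the k after which the inertia curve flattens out
```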