PolarSPARC

Machine Learning - K-Means Clustering using Scikit-Learn


Bhaskar S 07/24/2022


Overview

Clustering is an unsupervised machine learning technique that attempts to group the samples from the unlabeled data set into meaningful clusters.

In other words, the clustering algorithm divides the unlabeled data population into groups, such that the samples within a group are similar to each other.

K-Means Clustering Algorithm

In the following sections, we will unravel the steps behind the K-Means Clustering algorithm using a very simple data set related to Math and Logic scores.

The following illustration displays rows from the simple data set:

Data Set
Figure.1

The following illustration shows the plot of the samples from the data set:

Initial Plot
Figure.2

Before we get started, the following are some of terminology referenced in this article:

The following are the steps in the K-Means Clustering algorithm:

  1. Choose the number of clusters $K$. For our simple data set we select $K = 2$

  2. Randomly choose $K$ cluster centroids $C_k$ (where $k = 1 ... K$) from the data set. For our data set, let $C_1 = [45, 55]$ and $C_2 = [70, 70]$.

    The following illustration shows the initial centroids for our simple data set:

    Initial Centroids
    Figure.3

  3. For each sample $X_i$ in the data set, assign the cluster it belongs to. To do that, compute the distance to each of the centroids $C_k$ (where $k = 1 ... K$) and the find the minimum of the distances. If the distance to centroid $C_c$ (where $c \le K$) is the minimum, then assign the sample $X_i$ to the cluster $c$.

    The following illustration shows the computed L1 distances and the cluster assignment for our data set:

    Cluster Assignment
    Figure.4

  4. Compute the new cluster centroids $C_k$ (where $k = 1 ... K$) using the assigned samples $X_i$ to each cluster.

    For our data set, the samples $[45, 55]$, $[55, 50]$, and $[50, 60]$ are assigned to cluster $1$. The new cluster centroid for cluster $1$ ($C_1$) would be $\Large{[\frac{45+55+50}{3},\frac{55+50+60}{3}]}$ $= [50, 55]$. Similarly, the new cluster centroid for cluster $2$ ($C_2$) would be $[75, 70]$.

  5. Once again, assign a cluster for each of the samples $X_i$ from the data set.

    The following illustration shows the computed L1 distances and the cluster assignment for our data set:

    Cluster Assignment 2
    Figure.5

  6. Continue this process of computing the new cluster centroids and assigning samples from the data set to the appropriate clusters until the previous cluster centroids and current cluster centroids are the same.

The following illustration shows the plot of the samples from the data set after the final cluster assignment:

Final Plot
Figure.6

For data sets that are unlabeled, how does one determine the number of clusters $K$ ???

This is where the visual Elbow Method comes in handy. The Elbow Method is a plot of the cluster size $K$ vs the Distortion. From the plot, the idea is to determine the value of $K$, after which the Distortion starts to decrease in smaller steps.

Hands-on Demo

In the following sections, we will demonstrate the use of the K-Means model for clustering (using scikit-learn) by leveraging the Palmer Penguins data set.

The Palmer Penguins includes samples with the following features for the penguin species near the Palmer Station, Antarctica:

The first step is to import all the necessary Python modules such as, pandas and scikit-learn as shown below:


import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans

The next step is to load the palmer penguins data set into a pandas dataframe, drop the index column, and then display the dataframe as shown below:


url = 'https://vincentarelbundock.github.io/Rdatasets/csv/palmerpenguins/penguins.csv'
penguins_df = pd.read_csv(url)
penguins_df = penguins_df.drop(penguins_df.columns[0], axis=1)
penguins_df

The following illustration displays few rows from the palmer penguins dataframe:

Dataframe Display
Figure.7

The next step is to remediate the missing values from the palmer penguins dataframe. Note that we went through the steps to fix the missing values in the article Random Forest using Scikit-Learn, so we will not repeat it here.

For each of the missing values, we perform data analysis such as comparing the mean or (mean + std) values for the features and apply the following fixes as shown below:


penguins_df = penguins_df.drop([3, 271], axis=0)
penguins_df.loc[[8, 10, 11], 'sex'] = 'female'
penguins_df.at[9, 'sex'] = 'male'
penguins_df.at[47, 'sex'] = 'female'
penguins_df.loc[[178, 218, 256, 268], 'sex'] = 'female'

The next step is to display information about the palmer penguins dataframe, such as index and column types, memory usage, etc., as shown below:


penguins_df.info()

The following illustration displays information about the palmer penguins dataframe:

Dataframe Information
Figure.8

Note that there are no missing values at this point in the palmer penguins dataframe.

The next step is to drop the target variable species from the cleansed palmer penguins dataframe to make it an unlabeled data set as shown below:


penguins_df = penguins_df.drop('species', axis=1)

The next step is to create and display dummy binary variables for the two categorical (nominal) feature variables - island and sex from the cleansed palmer penguins dataframe as shown below:


nom_features = ['island', 'sex']
nom_encoded_df = pd.get_dummies(penguins_df[nom_features], prefix_sep='.', drop_first=True, sparse=False)
nom_encoded_df

The following illustration displays the dataframe of all dummy binary variables for the two categorical (nominal) feature variables from the cleansed palmer penguins dataframe:

Dummy Variables
Figure.9

The next step is to drop the two nominal categorical features and merge the dataframe of dummy binary variables into the cleansed palmer penguins dataframe as shown below:


penguins_df = penguins_df.drop(penguins_df[nom_features], axis=1)
penguins_df = pd.concat([penguins_df, nom_encoded_df], axis=1)
penguins_df

The following illustration displays few columns/rows from the merged palmer penguins dataframe:

Merged Dataframe
Figure.10

The next step is to initialize the K-Means model class from scikit-learn and train the model using the data set as shown below:


model = KMeans(n_clusters=3, init='random', max_iter=100, random_state=101)
model.fit(penguins_df)

The following are a brief description of some of the hyperparameters used by the KMeans model:

The next step is to display a scatter plot of the samples from the palmer penguins data set segregated into the 3 clusters as shown below:


sns.scatterplot(x=penguins_df['bill_length_mm'], y=penguins_df['body_mass_g'], hue=model.labels_, palette='tab10')
plt.show()

The following illustration displays the scatter plot of the 3 clusters:

Identified Clusters
Figure.11

Demo Notebook

The following is the link to the Jupyter Notebook that provides an hands-on demo for this article:



© PolarSPARC