Simple Unsupervised Learning Example¶

Use the labelled Iris flower dataset to demonstrate unsupervised clustering with a k-means model.

In [ ]:
import pandas as pd

Load the dataset, adding column names to the loaded DataFrame

In [ ]:
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

col_names = ['sepal-length', 'sepal-width', 'petal-length',
             'petal-width', 'class']

iris_df = pd.read_csv(url, names=col_names)

iris_df.head()
Out[ ]:
   sepal-length  sepal-width  petal-length  petal-width        class
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa

A quick look at how many classes of Iris there are

In [ ]:
# how many unique iris classes are there?
iris_df['class'].unique()
Out[ ]:
array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)

Separate the dataframe into:

  • X: independent variable matrix - this is the first 4 columns
  • y: dependent variable vector - this is just the 'class' column
In [ ]:
# X is all columns except class
X = iris_df.iloc[:,:4]

# y is the class column
y = iris_df[ ['class'] ]

display(X.head())
display(y)
   sepal-length  sepal-width  petal-length  petal-width
0           5.1          3.5           1.4          0.2
1           4.9          3.0           1.4          0.2
2           4.7          3.2           1.3          0.2
3           4.6          3.1           1.5          0.2
4           5.0          3.6           1.4          0.2

              class
0       Iris-setosa
1       Iris-setosa
2       Iris-setosa
3       Iris-setosa
4       Iris-setosa
..              ...
145  Iris-virginica
146  Iris-virginica
147  Iris-virginica
148  Iris-virginica
149  Iris-virginica

150 rows × 1 columns

KMeans clustering model¶

We create and train a clustering model object as with supervised models.

In this case the model does not use the y data at all during training.

Generally we need to give a clustering model the number of clusters to identify.

In [ ]:
# use KMeans from sklearn.cluster
from sklearn.cluster import KMeans

k_means_model = KMeans(n_clusters=3)

k_means_model.fit(X)
Out[ ]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=3, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)
In [ ]:
# display the labels the model determined
k_means_model.labels_
Out[ ]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2,
       2, 2, 2, 1, 1, 2, 2, 2, 2, 1, 2, 1, 2, 1, 2, 2, 1, 1, 2, 2, 2, 2,
       2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 1])
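
As with a supervised model, the fitted estimator can assign new, unseen observations to one of the learned clusters with predict. A minimal sketch, using a made-up flower measurement (the values below are hypothetical, not taken from the dataset):

In [ ]:
# assign a cluster to an unseen flower (hypothetical measurements)
import numpy as np

new_flower = np.array([[5.0, 3.4, 1.5, 0.2]])  # sepal-length, sepal-width, petal-length, petal-width
k_means_model.predict(new_flower)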

Verify the model against the original labels¶

Note that the model may have assigned a different numeric code (0, 1 or 2) to each class than the one we use below. For simplicity you can change the class codes below to match what the model has assigned (a sketch that automates this alignment follows the confusion matrix).

In [ ]:
# let's create a new "verification_data" DataFrame, with the same class encoding as the model appears to use
verification_data = iris_df[ ['class'] ].copy()

verification_data['label'] = 0   # This is the code for Iris-setosa
verification_data.loc[ verification_data['class'] == 'Iris-versicolor', 'label'] = 1
verification_data.loc[ verification_data['class'] == 'Iris-virginica', 'label'] = 2

verification_data
Out[ ]:
              class  label
0       Iris-setosa      0
1       Iris-setosa      0
2       Iris-setosa      0
3       Iris-setosa      0
4       Iris-setosa      0
..              ...    ...
145  Iris-virginica      2
146  Iris-virginica      2
147  Iris-virginica      2
148  Iris-virginica      2
149  Iris-virginica      2

150 rows × 2 columns

In [ ]:
# create a confusion matrix with sklearn.metrics
from sklearn.metrics import confusion_matrix

print(confusion_matrix(k_means_model.labels_, verification_data['label']))
[[50  0  0]
 [ 0 48 14]
 [ 0  2 36]]
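
Rather than matching the cluster codes to the classes by eye, the alignment can be automated. A minimal sketch using the Hungarian algorithm (assumes scipy is installed); linear_sum_assignment picks the cluster-to-class mapping that maximises agreement on the diagonal of the confusion matrix:

In [ ]:
# automatically map each cluster label to its best-matching class code (sketch; assumes scipy)
import numpy as np
from scipy.optimize import linear_sum_assignment

cm = confusion_matrix(k_means_model.labels_, verification_data['label'])
rows, cols = linear_sum_assignment(-cm)   # negate: the solver minimises cost
mapping = dict(zip(rows, cols))           # cluster label -> class code

aligned = np.array([mapping[label] for label in k_means_model.labels_])

# proportion of points whose aligned cluster matches the true class
(aligned == verification_data['label'].values).mean()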

How to determine the number of clusters?¶

So, how did we determine that 3 clusters was the correct number?

A common approach is the elbow method: we run the algorithm for a range of cluster counts and plot the results.

For each number of clusters we plot the "inertia", which is the sum of squared distances from each point to its assigned cluster centroid. Effectively this tells us how "coherent" the clusters are.
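
As a quick sanity check on that definition, the model's inertia_ attribute can be recomputed by hand from the fitted centroids. A minimal sketch:

In [ ]:
# recompute the inertia by hand: sum of squared distances from each point
# to its assigned centroid (sketch)
import numpy as np

assigned_centroids = k_means_model.cluster_centers_[k_means_model.labels_]
manual_wcss = ((X.values - assigned_centroids) ** 2).sum()

manual_wcss, k_means_model.inertia_   # the two values should agree (up to float rounding)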

The "elbow" will tell us the optimum number of clusters for the dataset

In [ ]:
%matplotlib inline
import matplotlib.pyplot as plt

# Finding the optimum number of clusters for k-means clustering
wcss = []

# loop over a range of cluster counts (1 to 10), create and train a model for each
# append the model's "inertia_" attribute to the wcss list
for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
    
# Plot the results as a line graph, allowing us to observe 'the elbow'
# Here we plot the range of cluster counts (1 to 10) against the inertia values
plt.plot(range(1, 11), wcss)
plt.title('The elbow method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS') #within cluster sum of squares
plt.show()

Reference - a more complete example: https://www.kaggle.com/tonzowonzo/simple-k-means-clustering-on-the-iris-dataset