Use the labelled Iris flower dataset to demonstrate unsupervised clustering with a k-means model.
import pandas as pd
Load the dataset, adding column names to the loaded DataFrame
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
col_names = ['sepal-length', 'sepal-width', 'petal-length',
'petal-width', 'class']
iris_df = pd.read_csv(url, names=col_names)
iris_df.head()
| | sepal-length | sepal-width | petal-length | petal-width | class |
---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
A quick look at how many classes of Iris there are
# list the unique iris classes
iris_df['class'].unique()
array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)
Separate the DataFrame into the feature columns (X) and the class column (y):
# X is all columns except class
X = iris_df.iloc[:,:4]
# y is the class column
y = iris_df[ ['class'] ]
display(X.head())
display(y)
| | sepal-length | sepal-width | petal-length | petal-width |
---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 |
1 | 4.9 | 3.0 | 1.4 | 0.2 |
2 | 4.7 | 3.2 | 1.3 | 0.2 |
3 | 4.6 | 3.1 | 1.5 | 0.2 |
4 | 5.0 | 3.6 | 1.4 | 0.2 |
| | class |
---|---|
0 | Iris-setosa |
1 | Iris-setosa |
2 | Iris-setosa |
3 | Iris-setosa |
4 | Iris-setosa |
... | ... |
145 | Iris-virginica |
146 | Iris-virginica |
147 | Iris-virginica |
148 | Iris-virginica |
149 | Iris-virginica |
150 rows × 1 columns
We create and train a clustering model object in the same way as a supervised model.
The difference is that the model does not use the y data at all during training.
Most clustering models, k-means included, need to be told how many clusters to look for.
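To make the training step concrete, here is a minimal pure-Python sketch of the usual k-means iteration (Lloyd's algorithm): assign each point to its nearest centroid, then move each centroid to the mean of its points. It is an illustration under simplifying assumptions, not scikit-learn's implementation. The helper name `simple_kmeans`, the toy points, and the choice of the first k points as starting centroids are all hypothetical.

```python
import math

def simple_kmeans(points, k, n_iter=10):
    """Toy k-means (Lloyd's algorithm); not scikit-learn's implementation."""
    # assumption: start from the first k points (sklearn defaults to k-means++)
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # assignment step: each point gets the index of its nearest centroid
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # update step: move each centroid to the mean of its assigned points
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# hypothetical toy data: two well-separated groups, no class labels involved
toy_points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (5.1, 4.9), (0.9, 1.1), (4.8, 5.2)]
toy_labels, toy_centroids = simple_kmeans(toy_points, k=2)
print(toy_labels)
```

Note that the function only ever sees the points and the value of k; nothing resembling y is passed in.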
# use KMeans from sklearn.cluster
from sklearn.cluster import KMeans
k_means_model = KMeans(n_clusters=3)
k_means_model.fit(X)
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300, n_clusters=3, n_init=10, n_jobs=None, precompute_distances='auto', random_state=None, tol=0.0001, verbose=0)
# display the labels the model determined
k_means_model.labels_
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 1, 2, 1, 2, 1, 2, 2, 1, 1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 1])
Note that the cluster numbers are arbitrary: the model may assign a different "code" to each class than the one we use below, and the assignment can change from run to run. For simplicity, change the class codes (0, 1 or 2) below to match what the model has assigned.
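As an alternative to editing the codes by hand, the matching can be sketched as a majority vote: each cluster is given the class that most of its members belong to. This is only a sketch. The helper `map_clusters_to_classes` and the toy lists are hypothetical, and the approach assumes each cluster is dominated by a different class (otherwise two clusters can map to the same class).

```python
from collections import Counter

def map_clusters_to_classes(cluster_ids, class_names):
    """Map each cluster id to the most common true class among its members."""
    votes = {}
    for cluster_id, class_name in zip(cluster_ids, class_names):
        votes.setdefault(cluster_id, Counter())[class_name] += 1
    return {cid: counts.most_common(1)[0][0] for cid, counts in votes.items()}

# hypothetical toy example: the model used 2 for setosa and 0 for versicolor
toy_clusters = [2, 2, 2, 0, 0, 1, 1, 0]
toy_classes = ['Iris-setosa', 'Iris-setosa', 'Iris-setosa',
               'Iris-versicolor', 'Iris-versicolor',
               'Iris-virginica', 'Iris-virginica', 'Iris-virginica']
mapping = map_clusters_to_classes(toy_clusters, toy_classes)
print(mapping)
```

In this notebook the same helper could presumably be called as `map_clusters_to_classes(k_means_model.labels_, iris_df['class'])`.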
# let's create a new "verification_data" DataFrame, with the same class encoding as the model appears to use
verification_data = iris_df[ ['class'] ].copy()
verification_data['label'] = 0 # This is the code for Iris-setosa
verification_data.loc[ verification_data['class'] == 'Iris-versicolor', 'label'] = 1
verification_data.loc[ verification_data['class'] == 'Iris-virginica', 'label'] = 2
verification_data
| | class | label |
---|---|---|
0 | Iris-setosa | 0 |
1 | Iris-setosa | 0 |
2 | Iris-setosa | 0 |
3 | Iris-setosa | 0 |
4 | Iris-setosa | 0 |
... | ... | ... |
145 | Iris-virginica | 2 |
146 | Iris-virginica | 2 |
147 | Iris-virginica | 2 |
148 | Iris-virginica | 2 |
149 | Iris-virginica | 2 |
150 rows × 2 columns
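# create a confusion matrix with sklearn.metrics
# note: the first argument defines the rows, so here the rows are the
# model's clusters and the columns are the true classes
from sklearn.metrics import confusion_matrix
print(confusion_matrix(k_means_model.labels_, verification_data['label']))
[[50  0  0]
 [ 0 48 14]
 [ 0  2 36]]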
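The matrix can be condensed into a single agreement figure by summing the diagonal, where the cluster number equals the class code. The short sketch below copies the values from the output above by hand; the variable names are illustrative only.

```python
# confusion matrix copied from the output above
# (rows: clusters found by the model, columns: true classes)
cm = [[50, 0, 0],
      [0, 48, 14],
      [0, 2, 36]]

total = sum(sum(row) for row in cm)
# diagonal entries: samples whose cluster number equals their class code
agreeing = sum(cm[i][i] for i in range(len(cm)))
agreement = agreeing / total
print(f"{agreeing} of {total} samples agree ({agreement:.1%})")
```

Reading the off-diagonal entries: Iris-setosa is separated perfectly, while 14 Iris-virginica samples fall into the mostly-versicolor cluster and 2 Iris-versicolor samples fall into the mostly-virginica cluster.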
So, how do we determine that 3 is the right number of clusters?
A common approach is the elbow method: we run the algorithm for a range of cluster counts and plot the results.
For each cluster count we plot the "inertia", the sum of squared distances from each sample to the centroid of its cluster. Effectively this tells us how "coherent" (tight) the clusters are.
Inertia generally keeps falling as clusters are added, so we look for the "elbow": the point after which extra clusters bring only a small improvement. That point suggests a good number of clusters for the dataset.
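To show what the inertia value measures, here is a small hand-rolled version of the calculation on toy data. It is a sketch for illustration: the function name `inertia` and the toy points, labels and centroids are hypothetical, and in practice the value is read from the fitted model rather than computed this way.

```python
def inertia(points, labels, centroids):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    total = 0.0
    for point, label in zip(points, labels):
        centroid = centroids[label]
        total += sum((x - c) ** 2 for x, c in zip(point, centroid))
    return total

# hypothetical toy data: two clusters of two points each
toy_points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 4.0)]
toy_labels = [0, 0, 1, 1]
toy_centroids = [(0.0, 1.0), (10.0, 2.0)]
toy_inertia = inertia(toy_points, toy_labels, toy_centroids)
print(toy_inertia)  # 1 + 1 + 4 + 4 = 10.0
```

The tighter first cluster contributes 2 to the total and the looser second cluster contributes 8, which is why a lower inertia indicates more compact clusters.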
%matplotlib inline
import matplotlib.pyplot as plt
# Finding a suitable number of clusters for k-means clustering
wcss = []
# loop over a range of cluster counts (1 to 10), create and train a model for each
# append the "inertia_" attribute of the model to the wcss list
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
# Plotting the results onto a line graph, allowing us to observe 'The elbow'
# In this case we want to plot our range of cluster counts (1 to 10) against the inertia values
plt.plot(range(1, 11), wcss)
plt.title('The elbow method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS') #within cluster sum of squares
plt.show()
Reference - a more complete example: https://www.kaggle.com/tonzowonzo/simple-k-means-clustering-on-the-iris-dataset