Welcome to the 34th part of our machine learning tutorial series and the start of a new section: Clustering and Unsupervised machine learning. Up to this point, everything we have covered has been "supervised" machine learning, which means we, the scientists, have told the machine what the classes of the featuresets were. With unsupervised machine learning, the role of the scientist begins to be removed. To begin, we're going to cover clustering, which comes in two major forms: Flat and Hierarchical.
With both forms of clustering, the machine receives a dataset that is just featuresets, and then searches for groups and assigns classes on its own. With Flat clustering, the scientist tells the machine how many classes/clusters to find. With Hierarchical clustering, the machine figures out what the groups are and how many of them there are.
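To make that distinction concrete, here is a minimal sketch using Scikit-Learn's KMeans (flat) and MeanShift (hierarchical) on a tiny made-up dataset; the point is simply where the number of clusters comes from in each case:

import numpy as np
from sklearn.cluster import KMeans, MeanShift

# Tiny made-up dataset, purely for illustration
X = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]])

# Flat clustering: we tell the machine to find exactly 2 clusters
flat = KMeans(n_clusters=2)
flat.fit(X)
print(flat.labels_)

# Hierarchical clustering (Mean Shift): the machine decides how many clusters there are
hierarchical = MeanShift()
hierarchical.fit(X)
print(len(hierarchical.cluster_centers_))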
What are some reasons why we might utilize clustering? The objective of clustering is to find relationships and meaning in data. In most cases that I have personally seen, people use clustering mostly as a step in what's known as "semi-supervised" machine learning. The idea here is that you might use clustering to define classes, and then use a supervised machine learning algorithm to do further classification. Another use is for feature selection and validation. For example, consider our Breast Cancer dataset that we've been using. We might believe the features we've chosen are indeed descriptive and meaningful. One option we have is to feed this data through a K-Means algorithm, and then see whether or not the two groups we thought we were tracking are actually described by the data in the way we expected.
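As a rough sketch of that validation idea (using Scikit-Learn's bundled breast cancer dataset as a stand-in for the file-based one we've been working with), you might do something like:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.cluster import KMeans

# Does K-Means recover the two groups we believe are in the data?
data = load_breast_cancer()
X, y = data.data, data.target

clf = KMeans(n_clusters=2)
clf.fit(X)

# Cluster labels are arbitrary (0/1 may be flipped relative to y), so check both orientations
agreement = np.mean(clf.labels_ == y)
print(max(agreement, 1 - agreement))

If the clusters line up well with the known labels, that is some evidence the features really do separate the two groups we care about.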
Consider next that you're a data scientist for Amazon. Your CTO has collected data that they believe could be used to predict whether or not customers are buyers or non-buyers. They want you to use K-Means to see if K-Means correctly organizes the customers by the data that the CTO thinks is meaningful.
What about Hierarchical Clustering? Consider you're still the same data scientist for Amazon. This time, you run the seemingly meaningful data through a hierarchical clustering algorithm, such as Mean Shift, and you actually get 5 groups. After further analysis, you realize visitors aren't actually just buyers and non-buyers; they are a spectrum! You really have non-buyers, unlikely-to-buy, somewhat-likely-to-buy, highly-likely-to-buy, and certain-to-buy visitors.
Clustering can also be used on truly unknown data, in an attempt to find structure. Consider you're an alien race first discovering human texts in North America. You might take all of the written characters you find and compile them into a big list of features. Then you might feed this list through a Hierarchical Clustering algorithm just to see if you can find specific groups, so that you can start to decode the language by characters.
The field of "Big Data Analysis" is generally a prime area for clustering. There's plenty of data, but most companies have absolutely no idea what to do with it or how to actually get meaning from it. Clustering can help data scientists to begin to structure and look into expansive datasets for meaning.
Finally, clustering can also be used for typical classification; you just don't actually need to feed it the classifications beforehand. If you use clustering on most of the popular classification datasets, you should find that it tends to figure out the groups anyway. This is mainly useful for "testing" that the algorithm actually works.
Our first algorithm to cover is the K-Means algorithm. The idea of K-Means is to attempt to cluster a given dataset into K clusters. The way it works is actually pretty impressive and, luckily for us, quite simple. The process goes as follows:

1. Take your entire dataset and choose K starting centroids. Centroids are just the "centers" of your clusters. To start, I typically just take the first K values in my dataset and use those as the starting centroids, but you could also randomly select them if you wanted. It should not matter, but, if the algorithm is not optimizing for some reason, it may make sense to shuffle the data and try again.
2. Calculate the distance from each featureset to each centroid, and assign each featureset to the cluster of the closest centroid.
3. Take the mean of each cluster's featuresets, and make that mean the new centroid for that cluster.
4. Repeat steps 2 and 3 until the centroids stop moving, at which point the algorithm is optimized. (A bare-bones from-scratch sketch of this loop appears after the Scikit-Learn example below.)

Simple enough, much easier than the Support Vector Machine to understand! Let's see a quick basic code example. To start, we'll have some data like:
import matplotlib.pyplot as plt
from matplotlib import style
import numpy as np
from sklearn.cluster import KMeans

style.use('ggplot')

X = np.array([[1, 2],
              [1.5, 1.8],
              [5, 8],
              [8, 8],
              [1, 0.6],
              [9, 11]])

plt.scatter(X[:, 0], X[:, 1], s=150, linewidths=5, zorder=10)
plt.show()
Our data looks like:
Awesome, this looks pretty simple, so our K-Means algorithm better figure this one out! First we will fit everything:
clf = KMeans(n_clusters=2)
clf.fit(X)
Simple as that, but we'd probably like to see it. As we saw before with the Support Vector Machine, most of the Scikit-Learn classifiers also have various attributes. With the K-Means algorithm, we can grab both the centroids and labels:
centroids = clf.cluster_centers_
labels = clf.labels_
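As a quick side note that isn't part of the original walk-through: the fitted KMeans object can also assign brand new featuresets to the nearest learned centroid via its predict method (this assumes the code above has already been run; the sample points here are made up):

# Classify new, unseen featuresets against the learned centroids
print(clf.predict(np.array([[0, 1], [8, 9]])))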
Now to graph, we can do something like:
colors = ["g.", "r.", "c.", "y."]

for i in range(len(X)):
    plt.plot(X[i][0], X[i][1], colors[labels[i]], markersize=10)

plt.scatter(centroids[:, 0], centroids[:, 1], marker="x", s=150, linewidths=5, zorder=10)
plt.show()
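To tie this back to the steps listed earlier, here is a bare-bones, from-scratch sketch of the K-Means loop. This is not Scikit-Learn's actual implementation, and simple_kmeans is just a made-up helper name; it only illustrates the assign/re-center cycle:

import numpy as np

def simple_kmeans(X, k=2, max_iter=300):
    # Step 1: start with the first k featuresets as the centroids
    centroids = X[:k].astype(float)
    for _ in range(max_iter):
        # Step 2: assign each featureset to its closest centroid
        distances = np.linalg.norm(X[:, np.newaxis] - centroids, axis=2)
        labels = np.argmin(distances, axis=1)
        # Step 3: move each centroid to the mean of the featuresets assigned to it
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        # Step 4: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

data = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]])
print(simple_kmeans(data))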
Next, we're going to apply K-Means to an actual dataset, and cover what happens when your data contains information that is non-numerical.