Welcome to the 40th part of our machine learning tutorial series, and another tutorial within the topic of clustering. We continue with clustering and unsupervised machine learning using Mean Shift, this time applying it to our Titanic dataset.
There is some degree of randomness here, so your results may not match mine exactly. If you get something very different, you can re-run the program a few times until you see similar results.
We're going to take a look at the Titanic dataset via clustering with Mean Shift. What we're interested to know is whether Mean Shift will automatically separate passengers into groups. If so, it will be interesting to inspect the groups that are created. The first obvious curiosity will be the survival rates of the groups found, but then we will also poke into the attributes of these groups to see if we can understand why the Mean Shift algorithm decided on these specific groups.
To begin, we will use code you have seen already up to this point:
import numpy as np
from sklearn.cluster import MeanShift, KMeans
from sklearn import preprocessing
import pandas as pd
import matplotlib.pyplot as plt

'''
Pclass Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
survival Survival (0 = No; 1 = Yes)
name Name
sex Sex
age Age
sibsp Number of Siblings/Spouses Aboard
parch Number of Parents/Children Aboard
ticket Ticket Number
fare Passenger Fare (British pound)
cabin Cabin
embarked Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
boat Lifeboat
body Body Identification Number
home.dest Home/Destination
'''

# https://pythonprogramming.net/static/downloads/machine-learning-data/titanic.xls
df = pd.read_excel('titanic.xls')

# keep an untouched copy so we can inspect the original text values later
original_df = pd.DataFrame.copy(df)
df.drop(['body','name'], axis=1, inplace=True)
df.fillna(0,inplace=True)

def handle_non_numerical_data(df):

    # handling non-numerical data: must convert.
    columns = df.columns.values

    for column in columns:
        text_digit_vals = {}
        def convert_to_int(val):
            return text_digit_vals[val]

        #print(column,df[column].dtype)
        if df[column].dtype != np.int64 and df[column].dtype != np.float64:

            column_contents = df[column].values.tolist()
            #finding just the uniques
            unique_elements = set(column_contents)
            # great, found them.
            x = 0
            for unique in unique_elements:
                if unique not in text_digit_vals:
                    # creating dict that contains new
                    # id per unique string
                    text_digit_vals[unique] = x
                    x+=1
            # now we map the new "id" value
            # to replace the string.
            df[column] = list(map(convert_to_int,df[column]))

    return df

df = handle_non_numerical_data(df)
df.drop(['ticket','home.dest'], axis=1, inplace=True)

X = np.array(df.drop(['survived'], axis=1).astype(float))
X = preprocessing.scale(X)
y = np.array(df['survived'])

clf = MeanShift()
clf.fit(X)
...except for two additions. One is original_df = pd.DataFrame.copy(df), right after we read the Excel file into our df object; the other is importing MeanShift from sklearn.cluster (and using MeanShift as our clustering algorithm). We are making the copy so that we can later reference the data in its original non-numerical form.
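One likely contributor to the run-to-run variation mentioned at the start is the use of a set in handle_non_numerical_data: Python randomizes string hashes per process by default, so the order in which the unique strings are visited, and therefore the integer id each one receives, can change between runs. The sketch below is not part of the original code. encode_column is a hypothetical helper that sorts the unique values before numbering them (by their string form, since fillna(0) leaves a mix of 0 and strings in the text columns), so the mapping comes out the same every time.

```python
def encode_column(values):
    """Map each distinct value to an integer id in a repeatable order.

    Hypothetical helper for illustration; not from the tutorial's code.
    """
    # Sort by string form so a column mixing ints and strings can be
    # ordered, and so the ids do not depend on set iteration order.
    uniques = sorted(set(values), key=str)
    ids = {value: idx for idx, value in enumerate(uniques)}
    return [ids[value] for value in values]


# 'S', 'C', 'Q' mimic embarkation ports; 0 mimics a filled-in missing value
encoded = encode_column(['S', 'C', 0, 'S', 'Q'])
print(encoded)
```

Swapping something like this in for the set-based loop would trade the arbitrary ids for stable ones; the ids are still just labels, so the clustering sees them as numbers either way.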
Now that we've fit the model, we can get some attributes from our clf object:
labels = clf.labels_
cluster_centers = clf.cluster_centers_
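To make these two attributes concrete, here is a minimal sketch on made-up 2-D points rather than the Titanic data. The points and the bandwidth value are assumptions, chosen so that two groups are obvious. labels_ holds one cluster index per input row, and cluster_centers_ holds one row of coordinates per cluster found.

```python
import numpy as np
from sklearn.cluster import MeanShift

# Toy data (assumed for illustration): two tight, well-separated groups
X_toy = np.array([
    [0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
    [10.0, 10.0], [10.1, 10.2], [10.2, 10.1],
])

# bandwidth is set by hand here; the tutorial lets MeanShift estimate it
toy_clf = MeanShift(bandwidth=2)
toy_clf.fit(X_toy)

toy_labels = toy_clf.labels_            # one cluster index per row of X_toy
toy_centers = toy_clf.cluster_centers_  # one coordinate row per cluster

print(toy_labels)
print(toy_centers.shape)
```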
Next, we're going to add a new column to our original dataframe:
original_df['cluster_group']=np.nan
Now, we can iterate through the labels and populate the labels to the empty column:
for i in range(len(X)):
    # single-step assignment (avoids chained indexing on a possible copy)
    original_df.loc[original_df.index[i], 'cluster_group'] = labels[i]
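Because labels has one entry per row of X, and X was built from the same rows in the same order as original_df, the loop can also be written as a single column assignment. The sketch below uses a tiny stand-in DataFrame and label array (both made up) to show the idea.

```python
import numpy as np
import pandas as pd

# Stand-ins for original_df and clf.labels_ (made-up values)
demo_df = pd.DataFrame({'name': ['A', 'B', 'C', 'D'],
                        'survived': [1, 0, 1, 0]})
demo_labels = np.array([0, 2, 0, 1])

# Assigning an array of matching length fills the column row by row
demo_df['cluster_group'] = demo_labels

print(demo_df)
```

The column ends up with an integer dtype here rather than float, but comparisons such as == 1 behave the same way.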
Next, we can check the survival rates for each of the groups we happen to find:
n_clusters_ = len(np.unique(labels))
survival_rates = {}
for i in range(n_clusters_):
    temp_df = original_df[ (original_df['cluster_group']==float(i)) ]
    #print(temp_df.head())

    survival_cluster = temp_df[ (temp_df['survived'] == 1) ]

    survival_rate = len(survival_cluster) / len(temp_df)
    #print(i,survival_rate)
    survival_rates[i] = survival_rate

print(survival_rates)
If we run this, we get something like:
{0: 0.3796583850931677, 1: 0.9090909090909091, 2: 0.1}
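The same per-cluster rates can be computed without an explicit loop: since survived is 0 or 1, its mean within a group is the fraction that survived. A minimal sketch on made-up data (the small DataFrame below is an assumption, standing in for original_df):

```python
import pandas as pd

# Made-up stand-in for original_df with only the two columns needed
demo_df = pd.DataFrame({
    'cluster_group': [0, 0, 0, 0, 1, 1],
    'survived':      [1, 0, 0, 0, 1, 1],
})

# mean of a 0/1 column per group == survival rate per group
survival_rates = demo_df.groupby('cluster_group')['survived'].mean().to_dict()
print(survival_rates)
```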
Again, you may get more groups. I got three here, but I've personally gotten up to six groups on this same dataset. Right away, we see that group 0 has a 38% survival rate, group 1 has a 91% survival rate, and group 2 has a 10% survival rate. This is somewhat curious, as we know there were three actual "passenger classes" on the ship. I immediately wonder if 0 is the 2nd class group, 1 is 1st class, and 2 is 3rd class. The classes on the ship were ordered with 3rd class on the bottom and 1st class on the top. The bottom flooded first, and the top is where the lifeboats were. I can look deeper by doing:
print(original_df[ (original_df['cluster_group']==1) ])
What this does is give us just the rows from original_df where the cluster_group column is 1.
Printing this out:
     pclass  survived                                               name  \
17        1         1    Baxter, Mrs. James (Helene DeLaudeniere Chaput)
49        1         1                 Cardeza, Mr. Thomas Drake Martinez
50        1         1  Cardeza, Mrs. James Warburton Martinez (Charlo...
66        1         1                        Chaudanson, Miss. Victorine
97        1         1  Douglas, Mrs. Frederick Charles (Mary Helene B...
116       1         1                Fortune, Mrs. Mark (Mary McDougald)
183       1         1                             Lesurer, Mr. Gustave J
251       1         1              Ryerson, Miss. Susan Parker "Suzette"
252       1         0                         Ryerson, Mr. Arthur Larned
253       1         1    Ryerson, Mrs. Arthur Larned (Emily Maria Borie)
302       1         1                                   Ward, Miss. Anna

        sex   age  sibsp  parch    ticket      fare            cabin  embarked  \
17   female  50.0      0      1  PC 17558  247.5208          B58 B60         C
49     male  36.0      0      1  PC 17755  512.3292      B51 B53 B55         C
50   female  58.0      0      1  PC 17755  512.3292      B51 B53 B55         C
66   female  36.0      0      0  PC 17608  262.3750              B61         C
97   female  27.0      1      1  PC 17558  247.5208          B58 B60         C
116  female  60.0      1      4     19950  263.0000      C23 C25 C27         S
183    male  35.0      0      0  PC 17755  512.3292             B101         C
251  female  21.0      2      2  PC 17608  262.3750  B57 B59 B63 B66         C
252    male  61.0      1      3  PC 17608  262.3750  B57 B59 B63 B66         C
253  female  48.0      1      3  PC 17608  262.3750  B57 B59 B63 B66         C
302  female  35.0      0      0  PC 17755  512.3292              NaN         C

     boat  body                                       home.dest  cluster_group
17      6   NaN                                    Montreal, PQ            1.0
49      3   NaN  Austria-Hungary / Germantown, Philadelphia, PA            1.0
50      3   NaN                    Germantown, Philadelphia, PA            1.0
66      4   NaN                                             NaN            1.0
97      6   NaN                                    Montreal, PQ            1.0
116    10   NaN                                    Winnipeg, MB            1.0
183     3   NaN                                             NaN            1.0
251     4   NaN                 Haverford, PA / Cooperstown, NY            1.0
252   NaN   NaN                 Haverford, PA / Cooperstown, NY            1.0
253     4   NaN                 Haverford, PA / Cooperstown, NY            1.0
302     3   NaN                                             NaN            1.0
Sure enough, this entire group is first-class. That said, there are only 11 people here. Let's look into group 0, which seemed a bit more diverse. This time, we will use the .describe() method via Pandas:
print(original_df[ (original_df['cluster_group']==0) ].describe())
            pclass     survived          age        sibsp        parch  \
count  1288.000000  1288.000000  1027.000000  1288.000000  1288.000000
mean      2.300466     0.379658    29.668614     0.496118     0.332298
std       0.833785     0.485490    14.395610     1.047430     0.686068
min       1.000000     0.000000     0.166700     0.000000     0.000000
25%       2.000000     0.000000    21.000000     0.000000     0.000000
50%       3.000000     0.000000    28.000000     0.000000     0.000000
75%       3.000000     1.000000    38.000000     1.000000     0.000000
max       3.000000     1.000000    80.000000     8.000000     4.000000

              fare        body  cluster_group
count  1287.000000  119.000000         1288.0
mean     30.510172  159.571429            0.0
std      41.511032   97.302914            0.0
min       0.000000    1.000000            0.0
25%       7.895800   71.000000            0.0
50%      14.108300  155.000000            0.0
75%      30.070800  255.500000            0.0
max     263.000000  328.000000            0.0
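There are 1,288 people here (the fare count is 1,287 only because one fare is missing). The mean pclass value is about 2.3, which falls between 2nd and 3rd class, and the group spans everything from 1st to 3rd class.
Let's check the final group, 2, which we expect to be all 3rd class: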
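A compact way to see the class mix of every cluster at once, rather than one describe() call per cluster, is a cross-tabulation of cluster against passenger class. This is a sketch on made-up rows, with demo_df standing in for original_df; on the real data you would pass original_df['cluster_group'] and original_df['pclass'].

```python
import pandas as pd

# Made-up stand-in for original_df
demo_df = pd.DataFrame({
    'cluster_group': [0, 0, 0, 1, 1, 2],
    'pclass':        [1, 2, 3, 1, 1, 3],
})

# rows: cluster id, columns: passenger class, cells: passenger counts
class_mix = pd.crosstab(demo_df['cluster_group'], demo_df['pclass'])
print(class_mix)
```

A row with a single non-zero cell is a cluster made up of one passenger class only.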
print(original_df[ (original_df['cluster_group']==2) ].describe())
       pclass   survived        age      sibsp      parch       fare  \
count    10.0  10.000000   8.000000  10.000000  10.000000  10.000000
mean      3.0   0.100000  39.875000   0.800000   6.000000  42.703750
std       0.0   0.316228   1.552648   0.421637   1.632993  15.590194
min       3.0   0.000000  38.000000   0.000000   5.000000  29.125000
25%       3.0   0.000000  39.000000   1.000000   5.000000  31.303125
50%       3.0   0.000000  39.500000   1.000000   5.000000  35.537500
75%       3.0   0.000000  40.250000   1.000000   6.000000  46.900000
max       3.0   1.000000  43.000000   1.000000   9.000000  69.550000

             body  cluster_group
count    2.000000           10.0
mean   234.500000            2.0
std    130.814755            0.0
min    142.000000            2.0
25%    188.250000            2.0
50%    234.500000            2.0
75%    280.750000            2.0
max    327.000000            2.0
Sure enough, we are correct: this group, which had the worst survival rate, is all 3rd class.
Interestingly, group 2, the worst-faring group, also has the narrowest range of ticket prices, from about 29 to 69 pounds, and the lowest maximum fare of the three groups. Its fares are not the cheapest, though: the median fare in cluster 0 is only about 14 pounds.
When we look at cluster 0, the range of fares goes up to 263 pounds. This is the largest group, with 38% survival.
When we revisit cluster 1, which is all first-class, we see the fares here range from about 247 to 512 pounds, with a mean of roughly 350. Although cluster 0 also contains some 1st class passengers, cluster 1 is clearly the most elite group.
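The fare comparison across clusters can also be pulled into one small table instead of reading it off several describe() outputs. A sketch with invented fares (demo_df and its values are assumptions, not the Titanic numbers):

```python
import pandas as pd

# Made-up stand-in for original_df with invented fares
demo_df = pd.DataFrame({
    'cluster_group': [0, 0, 0, 1, 1],
    'fare':          [0.0, 10.0, 50.0, 200.0, 500.0],
})

# one row per cluster with the minimum, maximum and mean fare
fare_summary = demo_df.groupby('cluster_group')['fare'].agg(['min', 'max', 'mean'])
print(fare_summary)
```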
Out of curiosity, what is the survival rate of the 1st class passengers in cluster 0, compared to the overall survival rate of cluster 0?
>>> cluster_0 = (original_df[ (original_df['cluster_group']==0) ])
>>> cluster_0_fc = (cluster_0[ (cluster_0['pclass']==1) ])
>>> print(cluster_0_fc.describe())
       pclass    survived         age       sibsp       parch        fare  \
count   312.0  312.000000  273.000000  312.000000  312.000000  312.000000
mean      1.0    0.608974   39.027167    0.432692    0.326923   78.232519
std       0.0    0.488764   14.589592    0.606997    0.653100   60.300654
min       1.0    0.000000    0.916700    0.000000    0.000000    0.000000
25%       1.0    0.000000   28.000000    0.000000    0.000000   30.500000
50%       1.0    1.000000   39.000000    0.000000    0.000000   58.689600
75%       1.0    1.000000   49.000000    1.000000    0.000000   91.079200
max       1.0    1.000000   80.000000    3.000000    4.000000  263.000000

             body  cluster_group
count   35.000000          312.0
mean   162.828571            0.0
std     82.652172            0.0
min     16.000000            0.0
25%    109.500000            0.0
50%    166.000000            0.0
75%    233.000000            0.0
max    307.000000            0.0
>>>
Sure enough, they have a better survival rate, ~61%, than cluster 0 as a whole (38%), but still much worse than the 91% of the more apparently elite group (by both ticket price and survival rate). Spend some time poking around to see what you can find if you like. Otherwise, we're going to head on next to writing a Mean Shift algorithm of our own.