Welcome to part four of the Machine Learning with Python tutorial series. In the previous tutorials, we got our initial data, we transformed and manipulated it a bit to our liking, and then we began to define our features. Scikit-Learn does not actually need to work with Pandas and dataframes; I just prefer to handle my data with Pandas because it is fast and efficient. What Scikit-Learn fundamentally requires is NumPy arrays. Pandas dataframes can be easily converted to NumPy arrays, so it just so happens to work out for us!
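For example, here's a minimal sketch of that conversion, using a throwaway toy dataframe made up purely for illustration:

import numpy as np
import pandas as pd

# Purely illustrative toy dataframe (not part of the tutorial's data).
toy = pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]})

arr = np.array(toy)   # converts the dataframe to a plain NumPy array
print(type(arr))      # <class 'numpy.ndarray'>
print(arr.shape)      # (3, 2)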
Our code up to this point:
import Quandl, math
import numpy as np
import pandas as pd
from sklearn import preprocessing, cross_validation, svm
from sklearn.linear_model import LinearRegression

df = Quandl.get("WIKI/GOOGL")
print(df.head())
#print(df.tail())

df = df[['Adj. Open', 'Adj. High', 'Adj. Low', 'Adj. Close', 'Adj. Volume']]
df['HL_PCT'] = (df['Adj. High'] - df['Adj. Low']) / df['Adj. Close'] * 100.0
df['PCT_change'] = (df['Adj. Close'] - df['Adj. Open']) / df['Adj. Open'] * 100.0

df = df[['Adj. Close', 'HL_PCT', 'PCT_change', 'Adj. Volume']]
print(df.head())

forecast_col = 'Adj. Close'
df.fillna(value=-99999, inplace=True)
forecast_out = int(math.ceil(0.01 * len(df)))

df['label'] = df[forecast_col].shift(-forecast_out)
We'll then drop any still-NaN information from the dataframe. The shift above leaves the last forecast_out rows without a label, so those rows have to go:
df.dropna(inplace=True)
It is a typical convention in machine learning code to define X (capital X) as the features and y (lowercase y) as the label that corresponds to the features. As such, we can define our features and labels like so:
X = np.array(df.drop(['label'], axis=1))
y = np.array(df['label'])
Above, we've defined X (the features) as our entire dataframe EXCEPT for the label column, converted to a NumPy array. We do this using the .drop method, which can be applied to dataframes and returns a new dataframe. Next, we define our y variable, which is our label, as simply the label column of the dataframe, converted to a NumPy array.
We could leave it at this and move on to training and testing, but we're going to do some pre-processing first. Generally, you want your features in machine learning to be in roughly a -1 to 1 range. This may do nothing for some algorithms, but it usually speeds up processing and can also help with accuracy. Because this kind of scaling is so commonly needed, it is included in Scikit-Learn's preprocessing module. To use it, apply preprocessing.scale to your X variable:
X = preprocessing.scale(X)
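To be precise, preprocessing.scale standardizes each column to zero mean and unit variance rather than literally forcing every value into -1 to 1. You can sanity-check the result on the scaled X like this:

# Each feature column should now have (approximately) zero mean and unit variance.
print(X.mean(axis=0))  # roughly 0 for every column
print(X.std(axis=0))   # roughly 1 for every column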
Next, create the label, y:
y = np.array(df['label'])
Now comes the training and testing. The way this works is you take, for example, 75% of your data and use it to train the machine learning classifier. Then you take the remaining 25% of your data and test the classifier on it. Since this is your sample data, you already have the features and the known labels, so testing on that last 25% gives you a measure of accuracy and reliability, often called the confidence score. There are many ways to do this, but probably the best way is to use the built-in cross_validation.train_test_split, since it also shuffles your data for you. The code to do this:
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.2)
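If you want to see exactly what came back, a quick optional look at the shapes makes the split concrete (these are just the variables from the line above):

# With test_size=0.2, roughly 80% of the rows land in the training arrays
# and 20% in the testing arrays.
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)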
The return from train_test_split is the training set of features, the testing set of features, the training set of labels, and the testing set of labels. Now we're ready to define our classifier. There are many classifiers available through Scikit-Learn, including a few specifically for regression. We'll show a couple in this example, but for now let's use Support Vector Regression from Scikit-Learn's svm package:
clf = svm.SVR()
We're just going to use all of the defaults to keep things simple here, but you can learn much more about Support Vector Regression in the sklearn.svm.SVR documentation.
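If you'd rather not leave your editor, one quick way to see those defaults is get_params(), which every scikit-learn estimator provides. A small optional sketch:

# Every scikit-learn estimator exposes its current parameters
# (here, all defaults) via get_params():
print(clf.get_params())
# Among other things, you should see kernel='rbf' and C=1.0 in the output.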
Once you have defined the classifier, you're ready to train it. With Scikit-Learn (sklearn), you train with .fit:
clf.fit(X_train, y_train)
Here, we're "fitting" our training features and training labels.
Our classifier is now trained. Wow that was easy. Now we can test it!
confidence = clf.score(X_test, y_test)
Boom tested, and then:
print(confidence)
0.960075071072
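A quick note on what that number actually is: for regressors, .score returns the coefficient of determination (R squared) on the test data, which is what we've been calling the confidence. If you want to compute it explicitly, here's a small optional sketch:

from sklearn.metrics import r2_score

# For a regressor, clf.score(X_test, y_test) is just the R^2 of its
# predictions on the test set, so this prints essentially the same number:
print(r2_score(y_test, clf.predict(X_test)))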
So here, we can see the confidence is about 96%. Nothing to write home about. Let's try another classifier, this time using LinearRegression from sklearn:
clf = LinearRegression()
0.963311624499
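That second number comes from re-fitting and re-scoring exactly as before, just with the new estimator. For completeness, a minimal sketch of the full swap:

# Swap in LinearRegression, then train and test the same way as with SVR.
clf = LinearRegression()
clf.fit(X_train, y_train)
confidence = clf.score(X_test, y_test)
print(confidence)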
A bit better, but basically the same. So how might we know, as scientists, which algorithm to choose? After a while, you will get used to what works in most situations and what doesn't. You can also check out choosing the right estimator on scikit-learn's website, which can walk you through some basic choices. If you ask people who use machine learning often, though, it's really trial and error: you try a handful of algorithms and simply go with the one that works best. Another thing to note is that some of the algorithms must run linearly, others not (do not confuse linear regression with the requirement to run linearly, by the way). So what does that all mean? Some of the machine learning algorithms here will process one step at a time, with no threading; others can thread and use all of the CPU cores you have available. You could learn a lot about each algorithm to figure out which ones can thread, or you can visit the documentation and look for the n_jobs parameter. If it has n_jobs, you have an algorithm that can be threaded for high performance; if not, tough luck! Thus, if you are processing massive amounts of data, or you need to process a moderate amount of data at very high speed, you would want something threaded. Let's check for our two algorithms.
Heading to the docs for sklearn.svm.SVR and looking through the parameters, do you see n_jobs? Not me. So no, no threading here. As you can see, on our small data it makes very little difference, but on even as little as 20 MB of data it can make a massive difference. Next up, let's check out the LinearRegression algorithm. Do you see n_jobs here? Indeed! So here, you can specify exactly how many threads (jobs) you want. If you put in -1 for the value, then the algorithm will use all available threads.
To do this:
clf = LinearRegression(n_jobs=-1)
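If digging through the docs gets old, one programmatic way to check whether an estimator accepts n_jobs is to inspect its constructor signature. A small optional sketch, separate from the tutorial pipeline:

import inspect

from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR

# An estimator that accepts n_jobs lists it in its constructor signature.
for est in (SVR, LinearRegression):
    has_n_jobs = 'n_jobs' in inspect.signature(est).parameters
    print(est.__name__, 'accepts n_jobs:', has_n_jobs)
# SVR accepts n_jobs: False
# LinearRegression accepts n_jobs: True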
That's all. While I have you doing such a rare thing (looking at documentation), let me draw your attention to the fact that, just because machine learning algorithms work with default parameters, it doesn't mean you can simply ignore them. For example, let's revisit svm.SVR. SVR is support vector regression, which is one kind of architecture you can use when doing machine learning. I highly encourage anyone interested in learning more to research the topic and learn from people who are far more educated than I am on the fundamentals; I will do my best to explain things simply here, but I am not an expert. Back on topic, however: one of the parameters to svm.SVR is kernel. What in the heck is that? Think of a kernel like a transformation applied to your data. It's a way to grossly, and I mean grossly, simplify your data, which makes processing go much faster. In the case of svm.SVR, the default is 'rbf', which is one type of kernel. You have a few other choices, though. Check the documentation: you have 'linear', 'poly', 'rbf', 'sigmoid', 'precomputed', or a callable. Again, just like the suggestion to try the various ML algorithms that can do what you want, try the kernels. Let's do a few:
for k in ['linear','poly','rbf','sigmoid']:
    clf = svm.SVR(kernel=k)
    clf.fit(X_train, y_train)
    confidence = clf.score(X_test, y_test)
    print(k, confidence)
linear 0.960075071072
poly 0.63712232551
rbf 0.802831714511
sigmoid -0.125347960903
As we can see, the linear kernel performed the best, followed by rbf, then poly, and sigmoid was clearly just goofing off and definitely needs to be kicked from the team.
So we have trained and tested. Let's say we're happy with the confidence we're getting at this point. What would we do next? We've trained and tested, but now we want to move forward and forecast out, which is what we'll be covering in the next tutorial.