Welcome to part four of the Machine Learning with Python tutorial series. In the previous tutorials, we got our initial data, we transformed and manipulated it a bit to our liking, and then we began to define our features. Scikit-Learn does not actually need to work with Pandas and dataframes; I just prefer to handle my data with Pandas because it is fast and efficient. What Scikit-Learn fundamentally requires is NumPy arrays. Pandas dataframes can be easily converted to NumPy arrays, so it just so happens to work out for us!
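For example, here's a minimal sketch of that conversion, using a throwaway toy dataframe made up purely for illustration:

import numpy as np
import pandas as pd

# Purely illustrative toy dataframe (not part of the tutorial's data).
toy = pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]})

arr = np.array(toy)   # converts the dataframe to a plain NumPy array
print(type(arr))      # <class 'numpy.ndarray'>
print(arr.shape)      # (3, 2)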
Our code up to this point:
import Quandl, math
import numpy as np
import pandas as pd
from sklearn import preprocessing, cross_validation, svm
from sklearn.linear_model import LinearRegression

df = Quandl.get("WIKI/GOOGL")
print(df.head())
#print(df.tail())

df = df[['Adj. Open', 'Adj. High', 'Adj. Low', 'Adj. Close', 'Adj. Volume']]
df['HL_PCT'] = (df['Adj. High'] - df['Adj. Low']) / df['Adj. Close'] * 100.0
df['PCT_change'] = (df['Adj. Close'] - df['Adj. Open']) / df['Adj. Open'] * 100.0

df = df[['Adj. Close', 'HL_PCT', 'PCT_change', 'Adj. Volume']]
print(df.head())

forecast_col = 'Adj. Close'
df.fillna(value=-99999, inplace=True)
forecast_out = int(math.ceil(0.01 * len(df)))

df['label'] = df[forecast_col].shift(-forecast_out)
We'll then drop any still-NaN information from the dataframe. The shift above leaves the last forecast_out rows without a label, so those rows have to go:
df.dropna(inplace=True)
It is a typical convention in machine learning code to define X (capital X) as the features and y (lowercase y) as the label that corresponds to the features. As such, we can define our features and labels like so:
X = np.array(df.drop(['label'], axis=1))
y = np.array(df['label'])
Above, we've defined X (the features) as our entire dataframe EXCEPT for the label column, converted to a NumPy array. We do this using the .drop method, which can be applied to dataframes and returns a new dataframe. Next, we define our y variable, which is our label, as simply the label column of the dataframe, converted to a NumPy array.
We could leave it at this and move on to training and testing, but we're going to do some pre-processing first. Generally, you want your features in machine learning to be in roughly a -1 to 1 range. This may do nothing for some algorithms, but it usually speeds up processing and can also help with accuracy. Because this kind of scaling is so commonly needed, it is included in Scikit-Learn's preprocessing module. To use it, apply preprocessing.scale to your X variable:
X = preprocessing.scale(X)
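To be precise, preprocessing.scale standardizes each column to zero mean and unit variance rather than literally forcing every value into -1 to 1. You can sanity-check the result on the scaled X like this:

# Each feature column should now have (approximately) zero mean and unit variance.
print(X.mean(axis=0))  # roughly 0 for every column
print(X.std(axis=0))   # roughly 1 for every column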
Next, create the label, y:
y = np.array(df['label'])
Now comes the training and testing. The way this works is you take, for example, 75% of your data and use it to train the machine learning classifier. Then you take the remaining 25% of your data and test the classifier on it. Since this is your sample data, you already have the features and the known labels, so testing on that last 25% gives you a measure of accuracy and reliability, often called the confidence score. There are many ways to do this, but probably the best way is to use the built-in cross_validation.train_test_split, since it also shuffles your data for you. The code to do this:
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.2)
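If you want to see exactly what came back, a quick optional look at the shapes makes the split concrete (these are just the variables from the line above):

# With test_size=0.2, roughly 80% of the rows land in the training arrays
# and 20% in the testing arrays.
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)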
The return from train_test_split is the training set of features, the testing set of features, the training set of labels, and the testing set of labels. Now we're ready to define our classifier. There are many classifiers available through Scikit-Learn, including a few specifically for regression. We'll show a couple in this example, but for now let's use Support Vector Regression from Scikit-Learn's svm package:
clf = svm.SVR()
We're just going to use all of the defaults to keep things simple here, but you can learn much more about Support Vector Regression in the sklearn.svm.SVR documentation.
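If you'd rather not leave your editor, one quick way to see those defaults is get_params(), which every scikit-learn estimator provides. A small optional sketch:

# Every scikit-learn estimator exposes its current parameters
# (here, all defaults) via get_params():
print(clf.get_params())
# Among other things, you should see kernel='rbf' and C=1.0 in the output.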
Once you have defined the classifier, you're ready to train it. With Scikit-Learn (sklearn), you train with .fit:
clf.fit(X_train, y_train)
Here, we're "fitting" our training features and training labels.
Our classifier is now trained. Wow that was easy. Now we can test it!
confidence = clf.score(X_test, y_test)
Boom tested, and then:
print(confidence)
0.960075071072
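A quick note on what that number actually is: for regressors, .score returns the coefficient of determination (R squared) on the test data, which is what we've been calling the confidence. If you want to compute it explicitly, here's a small optional sketch:

from sklearn.metrics import r2_score

# For a regressor, clf.score(X_test, y_test) is just the R^2 of its
# predictions on the test set, so this prints essentially the same number:
print(r2_score(y_test, clf.predict(X_test)))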
So here, we can see the confidence is about 96%. Nothing to write home about. Let's try another classifier, this time using LinearRegression from sklearn:
clf = LinearRegression()
0.963311624499
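That second number comes from re-fitting and re-scoring exactly as before, just with the new estimator. For completeness, a minimal sketch of the full swap:

# Swap in LinearRegression, then train and test the same way as with SVR.
clf = LinearRegression()
clf.fit(X_train, y_train)
confidence = clf.score(X_test, y_test)
print(confidence)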
A bit better, but basically the same. So how might we know, as scientists, which algorithm to choose? After a while, you will get used to what works in most situations and what doesn't. You can also check out choosing the right estimator on scikit-learn's website, which can walk you through some basic choices. If you ask people who use machine learning often, though, it's really trial and error: you try a handful of algorithms and simply go with the one that works best. Another thing to note is that some of the algorithms must run linearly, others not (do not confuse linear regression with the requirement to run linearly, by the way). So what does that all mean? Some of the machine learning algorithms here will process one step at a time, with no threading; others can thread and use all of the CPU cores you have available. You could learn a lot about each algorithm to figure out which ones can thread, or you can visit the documentation and look for the n_jobs parameter. If it has n_jobs, you have an algorithm that can be threaded for high performance; if not, tough luck! Thus, if you are processing massive amounts of data, or you need to process a moderate amount of data at very high speed, you would want something threaded. Let's check for our two algorithms.
Heading to the docs for sklearn.svm.SVR and looking through the parameters, do you see n_jobs? Not me. So no, no threading here. As you can see, on our small data it makes very little difference, but on even as little as 20 MB of data it can make a massive difference. Next up, let's check out the LinearRegression algorithm. Do you see n_jobs here? Indeed! So here, you can specify exactly how many threads (jobs) you want. If you put in -1 for the value, then the algorithm will use all available threads.
To do this:
clf = LinearRegression(n_jobs=-1)
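If digging through the docs gets old, one programmatic way to check whether an estimator accepts n_jobs is to inspect its constructor signature. A small optional sketch, separate from the tutorial pipeline:

import inspect

from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR

# An estimator that accepts n_jobs lists it in its constructor signature.
for est in (SVR, LinearRegression):
    has_n_jobs = 'n_jobs' in inspect.signature(est).parameters
    print(est.__name__, 'accepts n_jobs:', has_n_jobs)
# SVR accepts n_jobs: False
# LinearRegression accepts n_jobs: True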
That's all. While I have you doing such a rare thing (looking at documentation), let me draw your attention to the fact that, just because machine learning algorithms work with default parameters, it doesn't mean you can simply ignore them. For example, let's revisit svm.SVR. SVR is support vector regression, which is one kind of architecture you can use when doing machine learning. I highly encourage anyone interested in learning more to research the topic and learn from people who are far more educated than I am on the fundamentals; I will do my best to explain things simply here, but I am not an expert. Back on topic, however: one of the parameters to svm.SVR is kernel. What in the heck is that? Think of a kernel like a transformation applied to your data. It's a way to grossly, and I mean grossly, simplify your data, which makes processing go much faster. In the case of svm.SVR, the default is 'rbf', which is one type of kernel. You have a few other choices, though. Check the documentation: you have 'linear', 'poly', 'rbf', 'sigmoid', 'precomputed', or a callable. Again, just like the suggestion to try the various ML algorithms that can do what you want, try the kernels. Let's do a few:
for k in ['linear','poly','rbf','sigmoid']:
    clf = svm.SVR(kernel=k)
    clf.fit(X_train, y_train)
    confidence = clf.score(X_test, y_test)
    print(k, confidence)
linear 0.960075071072
poly 0.63712232551
rbf 0.802831714511
sigmoid -0.125347960903
As we can see, the linear kernel performed the best, followed by rbf, then poly, and sigmoid was clearly just goofing off and definitely needs to be kicked from the team.
So we have trained and tested. Let's say we're happy with the confidence we're getting at this point. What would we do next? We've trained and tested, but now we want to move forward and forecast out, which is what we'll be covering in the next tutorial.