Welcome to the 8th part of our machine learning regression tutorial within our Machine Learning with Python tutorial series. Where we left off, we had just realized that we needed to replicate some non-trivial algorithms into Python code in an attempt to calculate a best-fit line for a given dataset.
Before we embark on that, why bother with all of this? Linear regression is basically the brick of the machine learning building. It is used in almost every single major machine learning algorithm, so understanding it will give you the foundation for most of them. For the enthusiastic among us, understanding linear regression and general linear algebra is the first step towards writing your own custom machine learning algorithms and branching out into the bleeding edge of machine learning, using whatever the best processing is at the time. As processing improves and hardware architecture changes, the methodologies used for machine learning also change. The more recent rise of neural networks has had much to do with general-purpose graphics processing units. Ever wonder what's at the heart of an artificial neural network? You guessed it: linear regression.
If you recall, the calculation for the best-fit/regression/'y-hat' line's slope, m, is:

m = ( mean(x) * mean(y) - mean(x*y) ) / ( mean(x)² - mean(x²) )

where mean(x*y) is the mean of the products of the paired x and y values, and mean(x²) is the mean of the squared x values.
Alright, we'll break it down into parts. First, let's grab a couple imports:
from statistics import mean
import numpy as np
We're importing mean from statistics so we can easily get the mean of a list or array. Next, we're grabbing numpy as np so that we can create NumPy arrays. We can do a lot with lists, but we need to be able to do some simple matrix operations, which aren't available with plain lists, so we'll be using NumPy. We won't be getting too complex at this stage with NumPy, but later on NumPy is going to be your best friend. Next, let's define some starting datapoints:
xs = [1,2,3,4,5]
ys = [5,4,6,5,6]
So these are the datapoints we're going to use: xs and ys. You can already be framing this as the xs being the features and the ys being the labels, or maybe these are both features and we're establishing a relationship between them. As mentioned earlier, we actually want these to be NumPy arrays so we can perform matrix operations, so let's modify those two lines:
xs = np.array([1,2,3,4,5], dtype=np.float64)
ys = np.array([5,4,6,5,6], dtype=np.float64)
Now these are NumPy arrays. We're also being explicit with the datatype here. Without getting in too deep, datatypes have certain attributes, and those attributes boil down to how the data itself is stored in memory and can be manipulated. This won't matter as much right now as it will down the line, when and if we're doing massive operations and hoping to do them on our GPUs rather than our CPUs.
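As a quick aside (this isn't part of the main script), here's why plain lists wouldn't cut it: multiplying two lists raises a TypeError, while multiplying two NumPy arrays works element-wise, and the array's dtype attribute shows exactly how the values are stored. Assuming xs and ys are defined as above:

# Plain lists can't be multiplied together:
# [1,2,3,4,5] * [5,4,6,5,6]   # TypeError: can't multiply sequence by non-int

# NumPy arrays multiply element-wise:
print(xs * ys)     # [ 5.  8. 18. 20. 30.]
print(xs.dtype)    # float64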
If graphed, our data is just a handful of scattered points with a loose upward trend.
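The chart itself isn't reproduced here, but if you'd like to see the plot for yourself, a minimal sketch using Matplotlib (assuming you have it installed; it isn't required for anything else in this part) would be:

import matplotlib.pyplot as plt

plt.scatter(xs, ys)   # just the raw datapoints, no regression line yet
plt.show()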
Okay, now we're ready to build a function to calculate m, which is our regression line's slope:
def best_fit_slope(xs,ys):
    return m

m = best_fit_slope(xs,ys)
Done!
Just kidding. That's just our skeleton; now we'll fill it in.
Our first order of business is to multiply the mean of the x points by the mean of our y points. Continuing to fill out our skeleton:
def best_fit_slope(xs,ys):
    m = (mean(xs) * mean(ys))
    return m
Easy enough so far. You can use the mean function on lists, tuples, or arrays. Notice my use of parentheses here. Python honors the standard order of operations, so if you want to guarantee a particular grouping, make sure you're explicit with parentheses. Remember your PEMDAS!
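As a quick aside on mean itself, all three of these calls work and give the same mean of 3 for the values 1 through 5 (just an illustration, not part of the script):

print(mean([1, 2, 3, 4, 5]))                                  # a list
print(mean((1, 2, 3, 4, 5)))                                  # a tuple
print(mean(np.array([1, 2, 3, 4, 5], dtype=np.float64)))      # a NumPy array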
Next, we need to subtract the mean of x*y, which is going to be our matrix operation: mean(xs*ys). (Since xs and ys are NumPy arrays, xs*ys multiplies them element-wise before we take the mean.) In full now:
def best_fit_slope(xs,ys):
    m = ( (mean(xs)*mean(ys)) - mean(xs*ys) )
    return m
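If you want to sanity-check the numerator against our data, mean(xs) is 3.0, mean(ys) is 5.2, and mean(xs*ys) is 16.2, so the top of the equation works out to roughly -0.6 (give or take a little floating-point noise):

print(mean(xs))                            # 3.0
print(mean(ys))                            # 5.2
print(mean(xs*ys))                         # 16.2
print(mean(xs)*mean(ys) - mean(xs*ys))     # roughly -0.6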
We're done with the top part of our equation; now we're going to work on the denominator, starting with the squared mean of x: (mean(xs)*mean(xs)), or equivalently mean(xs)**2. Note that the caret (^) is not an exponent in Python, it's the bitwise XOR operator, so something like mean(xs)^2 isn't going to work on our NumPy float64 datatype; the exponent operator is **. Adding this in:
def best_fit_slope(xs,ys):
    m = ( ((mean(xs)*mean(ys)) - mean(xs*ys)) /
           (mean(xs)**2))
    return m
While the order of operations doesn't require us to encase the entire calculation in parentheses, I am doing it here so I can add a new line after our division, making things a bit easier to read and follow. Without it, we'd get a syntax error at the new line. We're almost complete here; now we just need to subtract the mean of the squared x values: mean(xs*xs). Again, we can't get away with a simple caret 2, but we can multiply the array by itself and get the same outcome we desire. All together now:
def best_fit_slope(xs,ys):
    m = (((mean(xs)*mean(ys)) - mean(xs*ys)) /
         ((mean(xs)**2) - mean(xs*xs)))
    return m
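If you're curious, here's a quick check (not in the original script) that squaring with ** and multiplying the array by itself really do give the same result, and that the caret is no substitute:

print(mean(xs**2))     # 11.0
print(mean(xs*xs))     # 11.0, same thing
# xs ^ 2               # TypeError: ^ is bitwise XOR, which float64 arrays don't support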
Great, our full script is now:
from statistics import mean
import numpy as np

xs = np.array([1,2,3,4,5], dtype=np.float64)
ys = np.array([5,4,6,5,6], dtype=np.float64)

def best_fit_slope(xs,ys):
    m = (((mean(xs)*mean(ys)) - mean(xs*ys)) /
         ((mean(xs)**2) - mean(xs**2)))
    return m

m = best_fit_slope(xs,ys)
print(m)
Output: 0.3
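As an extra sanity check (this isn't part of the tutorial's script), you can compare our slope against NumPy's own least-squares fit; np.polyfit with degree 1 returns the slope and intercept of the best-fit line, and its slope should agree with ours:

slope, intercept = np.polyfit(xs, ys, 1)
print(slope)   # roughly 0.3, matching best_fit_slope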
What's next? We need to calculate the y-intercept, b. We will be tackling that in the next tutorial, along with completing the overall best-fit line calculation. It's an easier calculation than the slope was, so try to write your own function to do it. Even if you get it, don't skip the next tutorial, as we'll be doing more than simply calculating b.