Welcome to the 9th part of our machine learning regression tutorial within our Machine Learning with Python tutorial series. We've been working on calculating the regression, or best-fit, line for a given dataset in Python. Previously, we wrote a function that computes the slope, and now we need to calculate the y-intercept. Our code up to this point:
from statistics import mean
import numpy as np

xs = np.array([1,2,3,4,5], dtype=np.float64)
ys = np.array([5,4,6,5,6], dtype=np.float64)

def best_fit_slope(xs,ys):
    m = (((mean(xs)*mean(ys)) - mean(xs*ys)) /
         ((mean(xs)*mean(xs)) - mean(xs*xs)))
    return m

m = best_fit_slope(xs,ys)
print(m)
As a reminder, the calculation for the best-fit line's y-intercept is b = ȳ - m·x̄, where x̄ and ȳ are the means of the xs and ys.
This one will be a bit easier than the slope was. We can save a few lines by incorporating this into our other function. We'll rename it to best_fit_slope_and_intercept.
Next, we can fill in b = mean(ys) - (m*mean(xs)), and return m and b:
def best_fit_slope_and_intercept(xs,ys):
    m = (((mean(xs)*mean(ys)) - mean(xs*ys)) /
         ((mean(xs)*mean(xs)) - mean(xs*xs)))
    b = mean(ys) - m*mean(xs)
    return m, b
Now we can call upon it with: m, b = best_fit_slope_and_intercept(xs,ys)
Our full code up to this point:
from statistics import mean
import numpy as np

xs = np.array([1,2,3,4,5], dtype=np.float64)
ys = np.array([5,4,6,5,6], dtype=np.float64)

def best_fit_slope_and_intercept(xs,ys):
    m = (((mean(xs)*mean(ys)) - mean(xs*ys)) /
         ((mean(xs)*mean(xs)) - mean(xs*xs)))
    b = mean(ys) - m*mean(xs)
    return m, b

m, b = best_fit_slope_and_intercept(xs,ys)
print(m,b)
Output should be approximately: 0.3 4.3 (floating-point rounding may show a few extra trailing digits)
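If you want a quick sanity check on these numbers, one option is to compare them against NumPy's own least-squares fit. The sketch below assumes np.polyfit with degree 1 as the reference; the names np_m and np_b are just illustrative and are not part of this series' code.

```python
from statistics import mean
import numpy as np

xs = np.array([1, 2, 3, 4, 5], dtype=np.float64)
ys = np.array([5, 4, 6, 5, 6], dtype=np.float64)

def best_fit_slope_and_intercept(xs, ys):
    m = (((mean(xs) * mean(ys)) - mean(xs * ys)) /
         ((mean(xs) * mean(xs)) - mean(xs * xs)))
    b = mean(ys) - m * mean(xs)
    return m, b

m, b = best_fit_slope_and_intercept(xs, ys)

# A degree-1 polyfit returns the coefficients highest power first:
# [slope, intercept]. np_m and np_b are illustrative names.
np_m, np_b = np.polyfit(xs, ys, 1)

# Compare with a tolerance rather than ==, since both are floating point
print(np.isclose(m, np_m), np.isclose(b, np_b))
```

Comparing with np.isclose rather than == avoids false alarms from tiny floating-point differences between the two approaches.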
Now we just need to create a line for the data:
Recall that y = mx + b. We could make a function for this, or just knock it out with a one-line list comprehension:
regression_line = [(m*x)+b for x in xs]
The list comprehension above is the same as doing:
regression_line = []
for x in xs:
    regression_line.append((m*x)+b)
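Since xs is already a NumPy array, a third way to get the same values is to let NumPy do the arithmetic on the whole array at once. This is only a sketch of an alternative, not what the rest of this tutorial uses; regression_line_np is an illustrative name, and m and b are hard-coded here to the rounded values found above so the snippet runs on its own.

```python
import numpy as np

xs = np.array([1, 2, 3, 4, 5], dtype=np.float64)
m, b = 0.3, 4.3  # rounded slope and intercept for this dataset

# List comprehension, as used in this tutorial
regression_line = [(m * x) + b for x in xs]

# Vectorized alternative: NumPy applies m*x + b to every element of xs
regression_line_np = m * xs + b

# Both approaches give the same y values
print(np.allclose(regression_line, regression_line_np))
```

The vectorized form returns a NumPy array rather than a list, which Matplotlib accepts just the same.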
Great, let's finally reap the fruits of our labor! Add the following imports:
import matplotlib.pyplot as plt
from matplotlib import style
style.use('ggplot')
This will allow us to make graphs, and make them not so ugly. Now at the end:
plt.scatter(xs,ys,color='#003F72')
plt.plot(xs, regression_line)
plt.show()
First we draw a scatter plot of the existing data, then we graph our regression line, and finally we show the figure. If you're not familiar with Matplotlib, you can check out the Data Visualization with Python and Matplotlib tutorial series.
Output: a scatter plot of the five data points with the regression line drawn through them.
Congratulations on making it this far! So, how might you go about actually making a prediction based on this model you just made? Simple enough, right? You have your model, so you just fill in x. For example, let's predict a point:
predict_x = 7
We have our input data, our "feature" so to speak. What's the label?
predict_y = (m*predict_x)+b
print(predict_y)
Output: 6.4 (possibly with floating-point rounding in the trailing digits)
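To predict more than one point, the same formula can be wrapped in a small helper. The sketch below is one way to do it; the predict function and the predict_xs values are hypothetical and not part of this series' code, and m and b are hard-coded to the rounded values from above so the snippet is self-contained.

```python
import numpy as np

m, b = 0.3, 4.3  # rounded slope and intercept for this dataset

def predict(x, m, b):
    """Return y = m*x + b. x may be a single number or a sequence."""
    return m * np.asarray(x, dtype=np.float64) + b

# Hypothetical feature values we want labels for
predict_xs = [7, 8, 9]
predict_ys = predict(predict_xs, m, b)

print(predict_ys)
```

Converting the input with np.asarray lets the same function handle a single value such as predict(7, m, b) as well as a whole list.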
We can even graph it:
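predict_x = 7
predict_y = (m*predict_x)+b

plt.scatter(xs,ys,color='#003F72',label='data')
# plot the predicted point too, so it actually shows up on the graph
plt.scatter(predict_x,predict_y,color='g',label='prediction')
plt.plot(xs, regression_line, label='regression line')
plt.legend(loc=4)
plt.show()
Output: the plot now includes a legend in the lower-right corner and the predicted point at x = 7.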
We now know how to create our own models, which is great, but we're still missing something integral: how accurate is our model? This is the topic for discussion in the next tutorial!