All content here is under a Creative Commons Attribution CC-BY 4.0 and all source code is released under a BSD-2 clause license.
Please reuse, remix, revise, and reshare this content in any way, keeping this notice.
This is the fifth of several modules (11, 12, 13, 14, 15 and 16) that refocus the course material from the prior 10 modules in a slightly different way. They place more emphasis on a single theme.
In short: *how to extract value from your data*.
In this module we will cover:
- the `.iloc` accessor in Pandas
- correlation matrices and heat maps
- fitting linear regression models with scikit-learn
Requirements before starting
Have your Python installation working as you had for modules 11 to 14, including the Pandas library installed.
An extra requirement: install the `scikit-learn` and `seaborn` libraries. See instructions below.
Newer versions of packages are released frequently. You can update your packages (libraries) with these commands. Do this at the command line (not in a Jupyter notebook!):
conda update -n base conda
conda update --all
You will come across people recommending different packages in Python for all sorts of interesting applications. For example, the `seaborn` library is often recommended for visualization, but you might not have it installed yet. In this module we will use both `seaborn` and `scikit-learn`.
This is how you can install the `seaborn` and `scikit-learn` packages in your virtual environment called `myenv`:
conda activate myenv    # <-- change "myenv" to the name of your actual environment
conda install seaborn scikit-learn
Or in one line:
conda install -n myenv seaborn scikit-learn
Similar to the above, you can update a package to the latest version: just change `install` to `update`.
Or in one line:
conda update -n myenv seaborn scikit-learn
In the prior module you learned about setting the date and time when importing data, visualizing your data with box plots, histograms, line or time-series plots, and scatter plots. You applied these to your own data, and learned about the very powerful `groupby` function in Pandas.
In this module we will take these skills a step further, but first, we will learn about fitting regression models to some data.
Start by importing Pandas, and also a tool to build regression models from the `scikit-learn` library. This library is imported under the name `sklearn`. You can read more about scikit-learn at their website: https://scikit-learn.org/stable/
import pandas as pd
from sklearn.linear_model import LinearRegression
We will use a data set from a distillation column, and predict an important output variable, called the Reid Vapour Pressure (RVP).
Read in the data set and set the column called "Case" to be the index:
distill = pd.read_csv("https://openmv.net/file/distillation-tower.csv", index_col="Case")
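If you like, verify that the import worked as intended:
# Optional check: the index name should now be "Case"
print(distill.index.name)
display(distill.head())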
In the prior module you were asked to use your own data and calculate the correlations between its columns. We will do this interactively below, but also introduce a new plot: the heat map.
The correlation matrix of numbers can be tedious to read on a screen. You can easily visualize it though:
import seaborn as sns
display(distill.corr());
# Let's visualize it instead, as a heat map.
sns.set(rc={'figure.figsize':(15, 15)})
sns.heatmap(distill.corr());
# This is not so attractive. Use a different colour map (cmap):
cmap = sns.diverging_palette(220, 10, as_cmap=True)
sns.heatmap(distill.corr(), cmap=cmap, square=True, linewidths=0.2, cbar_kws={"shrink": 0.5});
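If you would rather read the numbers off the plot directly, `sns.heatmap` can annotate each cell. A small variant of the call above:
# Annotate each cell with the correlation, to 1 decimal place
sns.heatmap(distill.corr(), cmap=cmap, square=True, linewidths=0.2,
            cbar_kws={"shrink": 0.5}, annot=True, fmt=".1f", annot_kws={"size": 7});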
We saw the scatter plots before, and the scatter plot matrix.
from pandas.plotting import scatter_matrix
scatter_matrix(distill, alpha = 0.2, figsize=(15, 15), diagonal = "kde");
The data set is quite big and takes some time to generate all the scatter plot combinations.
We can use every third row instead.
print(distill.shape)
subset = distill.iloc[0::3, :]
subset.shape
The `.iloc` accessor retrieves data by index (the `i` in `iloc`) at a given location (the `loc`), so `iloc` = index location.
Some examples:
- `mydata.iloc[0:10, :]` will return the first 10 rows, and all columns
- `mydata.iloc[20, 2:4]` will return only row ___, and columns ___
- `mydata.iloc[0:10:2, :]` will return only rows with index ___; and ___ columns
- `mydata.iloc[0::2, :]` will return ___ row; and ___ columns
- `mydata.iloc[:, -1]` will return ___ rows of the ___ column

Try some examples of using `.iloc` below on the `distill` data:
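For instance, a few quick experiments you could start with (printing only the shapes, to keep the output short):
print(distill.iloc[0:5, :].shape)    # the first 5 rows, all columns
print(distill.iloc[:, 0:3].shape)    # all rows, the first 3 columns
print(distill.iloc[::10, :].shape)   # every 10th row, all columns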
Now that you understand what `.iloc` is doing, you can understand why this code is faster: it uses only half the data set to create the scatter plot matrix:
scatter_matrix(distill.iloc[0::2,:], alpha = 0.2, figsize=(15, 15), diagonal = "kde");
Which 2 columns are the most correlated with the outcome variable called "VapourPressure"?
distill.corr().shape
distill.corr().iloc[...] # <-- fill in some code to show only the last row of the correlation matrix
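Another way to answer the question (a sketch, not the only approach) is to sort the absolute correlations with the outcome column:
# Rank the columns by absolute correlation with VapourPressure
corr_with_vp = distill.corr()["VapourPressure"].abs().sort_values(ascending=False)
print(corr_with_vp.head(4))  # the first entry is VapourPressure itself (correlation = 1)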
Let us build a regression model using the `InvTemp3` measurement, the inverse temperature measured at position 3 in the distillation column, to predict the `VapourPressure`.
There are 253 measurements (rows) in the dataset. We will use the first 150 rows in the data set to build the model, and then use the remaining rows to test the model, to see how well we can predict vapour pressure. This is good statistical practice: do not use all the data to build the prediction model, or you will get an inflated sense of how well the model works. Always keep some data aside for testing (validation/verification).
Create the model building data set from the first 150 rows:
build = distill.iloc[___] # <-- what goes here?
display(build)
Then create the testing data from the rest of the data frame:
test = distill.iloc[___] # <-- what goes here?
display(test)
Try it below, and use the `.shape` attribute to verify that the `build`ing and `test`ing data have the correct shape.
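One quick sanity check (a sketch, assuming you named the two pieces `build` and `test`):
# Together, the two pieces should account for every row in the original data
print(build.shape, test.shape)
assert build.shape[0] + test.shape[0] == distill.shape[0]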
First, we set up an instance of the linear regression model:
mymodel = LinearRegression()
type(mymodel)
The `mymodel` variable is an object. It is a linear regression model object, but it is empty at the moment. We will give it some training data to build the model:
mymodel = LinearRegression()
mymodel.fit(X, y)
but we have to tell it what `X` is and what `y` is. So we take a small detour...
We need numeric values for `X` and `y`. Scikit-learn requires the `X` data (the values used to predict `y`) to be a column vector or a matrix.
Notice that a column vector is just a matrix with 1 column. This is because, as you will see later, you can use 1 or more columns to predict `y`. Therefore every input used to predict `y` must be in its own column, and each row in the input matrix is one observation.
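For example, a minimal sketch with plain NumPy arrays (hypothetical numbers, just to show the required shapes):
import numpy as np

y_example = np.array([1.0, 2.0, 3.0])        # shape (3,): three target values
X_example = np.array([[0.1], [0.2], [0.3]])  # shape (3, 1): one input column, three observations
print(y_example.shape, X_example.shape)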
There is a shortcut, double square brackets, to force Pandas to extract a single column as a one-column data frame, and not as a `Series`:
print(build["InvTemp3"].shape)
print(build[["InvTemp3"]].shape)
So this will work to build your regression model:
# A single column in matrix X (capital X indicates one or more input columns)
X = build[["InvTemp3"]].values
y = build["VapourPressure"].values
mymodel = LinearRegression()
mymodel.fit(X, y)
If you run this code and see no error messages, then the model has been built. But it is not that exciting ... yet.
So what can you do with this model? Use
dir(mymodel)
to ask Python what can be done with that object. Note that the `dir(...)` function works on any object and is something that you will use regularly.
There are several interesting attributes and methods listed there which we will get to use:
- `coef_`
- `intercept_`
- `predict`
- `score`

The first two are, as you might guess, the coefficient (slope) and the intercept of the model.
print(f"The intercept is {mymodel.intercept_} and the slope is = {mymodel.coef_}")
It is not so handy having all those decimal places, though. Python allows you to round the display to the desired number of digits:
print(f"The intercept is {mymodel.intercept_:.5g} and the slope is = {mymodel.coef_}")
We have to be a bit more careful with the slope. It is an array (see the square brackets?): so we need to extract the first entry from that vector before displaying it:
print(f"The intercept is {mymodel.intercept_:.5g} and the slope is = {mymodel.coef_[0]:.5g}")
Try this as well:
print(f"The intercept is {mymodel.intercept_:.5f} and the slope is = {mymodel.coef_[0]:.5f}")
There is a subtle difference between the `f` and the `g` format specifiers: `f` fixes the number of digits after the decimal point, while `g` fixes the total number of significant digits, switching to scientific notation when needed.
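A small illustration of the difference:
value = 1234.56789
print(f"{value:.5f}")  # 1234.56789 -> exactly 5 digits after the decimal point
print(f"{value:.5g}")  # 1234.6     -> 5 significant digits in total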
After building the regression model it is helpful to visualize it. The Seaborn library has a useful function to do so.
import seaborn as sns
ax = sns.regplot(x="InvTemp3", y="VapourPressure", data=distill)
ax.grid(True)
Take a look at the documentation for the `regplot` function at https://seaborn.pydata.org/generated/seaborn.regplot.html for more options; the simple call above already does most of what you would expect.
An "upgrade" you might be interested in, is the joint plot, which adds the histograms to the axes:
sns.jointplot(x="InvTemp3", y="VapourPressure", data=distill, kind="reg");
# Or, show the kde=kernel density estimate
sns.jointplot(x="InvTemp3", y="VapourPressure", data=distill, kind="kde");
Next, we would like to extract some idea of how the model performs. For that we can look at the prediction errors from:
- the `build`ing data;
- the `test`ing data, which gives a more accurate estimate.

Firstly, for the model building data:
# Get the predicted values from the data used to build the model
X_build = build[["InvTemp3"]]
y_build = build["VapourPressure"]
prediction_build = mymodel.predict(X_build)
errors_build = y_build - prediction_build # error = actual minus predicted
# There are several ways to see "how good" the model is, but the average
# of the absolute values of the errors gives a good feeling. Smaller is better.
avg_absolute_error = pd.Series(errors_build).abs().mean()
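The assignment above does not display anything on its own; print the value to see it:
print(f"Average absolute error (building data): {avg_absolute_error:.3f}")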
The above gives an idea of how the model works on the data used to build the model. But of course, the idea is to use the model in the future, on data not seen before. So let's test the model on the remaining rows:
# Create the testing data set
test = distill.iloc[150:]
X_test = test[["InvTemp3"]]
y_test = test["VapourPressure"]
Then use the `predict(...)` function again, but on the testing data. Notice how simple scikit-learn makes this: just replace the input to the `predict` function with a different data frame:
prediction_test = mymodel.predict(X_test)
errors_test = y_test - prediction_test
avg_absolute_error = pd.Series(errors_test).abs().mean()
avg_absolute_error, errors_test.std()
Using the `score` method, we can get the $R^2$ value. The function needs two inputs:
mymodel.score(X_build, y_build)
and you will get a value that shows how the two variables are correlated. NOTE: the $R^2$ value is *not a measure of prediction precision or accuracy*. It is only an estimate of the degree of correlation. High correlation is no guarantee of good prediction.
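Since a high $R^2$ on the building data can be misleading, it is also worth scoring the model on the testing data (a sketch, reusing `X_test` and `y_test` from above):
# R^2 on the building data versus the testing data; expect the latter to be lower
print(f"R^2 (building data): {mymodel.score(X_build, y_build):.3f}")
print(f"R^2 (testing data):  {mymodel.score(X_test, y_test):.3f}")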
A least squares model $y = b_0 + b_1x_1$ with an intercept $b_0$ and a single slope of $b_1$ can be improved by adding a second, or more predictors: $$ y = b_0 + b_1x_1 + b_2x_2 + \ldots$$
This is called a multiple linear regression (MLR) model. We will try to improve our model by adding an extra predictor, `InvPressure1`:
# Specify the predictors in a list:
predictors = ["InvTemp3", "InvPressure1"]
# Specify the training data:
X_build_MLR = build[predictors]
y_build = build["VapourPressure"]
# Fit the model
full_model = LinearRegression()
full_model.fit(X=X_build_MLR, y=y_build)
# Print some stats:
predict_MLR_build = full_model.predict(X_build_MLR)
errors_MLR_build = y_build - predict_MLR_build
avg_absolute_error_MLR_build = pd.Series(errors_MLR_build).abs().mean()
print(f"The average absolute error {avg_absolute_error_MLR_build:.3f}")
pd.Series(errors_MLR_build).plot(grid=True, title="Error = Actual - Predicted")
Notice the power here: you only have to change the first line to add a new predictor. The rest of the code is the same as before (just more generic variable names have been used).
Try creating code for the testing data, which uses the 2 predictors:
- Create `X_test_MLR` and `y_test`.
- Use the `full_model.predict(...)` function, but on the testing data.
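If you get stuck, one possible solution (a sketch, following the same pattern as the building data) is:
X_test_MLR = test[predictors]
y_test = test["VapourPressure"]
predict_MLR_test = full_model.predict(X_test_MLR)
errors_MLR_test = y_test - predict_MLR_test
print(f"Average absolute error (testing data): {pd.Series(errors_MLR_test).abs().mean():.3f}")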