All content here is under a Creative Commons Attribution CC-BY 4.0 and all source code is released under a BSD-2 clause license.
Please reuse, remix, revise, and reshare this content in any way, keeping this notice.
This is the fifth of several modules (11, 12, 13, 14, 15 and 16) that refocus the course material from the prior 10 modules in a slightly different way. They place more emphasis on a single theme.
In short: *how to extract value from your data*.
In this module we will cover:
- the `.iloc` accessor in Pandas
- correlation matrices and heat maps
- fitting linear regression models with scikit-learn
Requirements before starting
Have your Python installation working as you had for modules 11 to 14, including the Pandas library installed.
An extra requirement: install the `scikit-learn` and `seaborn` libraries. See instructions below.
Newer versions of packages are released frequently. You can update your packages (libraries) with these commands. Do this at the command line (not in a Jupyter notebook!):
conda update -n base conda
conda update --all
You will come across people recommending different packages in Python for all sorts of interesting applications. For example, the `seaborn` library is often recommended for visualization, but you might not have it installed yet. In this module we will use both `seaborn` and `scikit-learn`.
This is how you can install the `seaborn` and `scikit-learn` packages in your virtual environment called `myenv`:
conda activate myenv    # <-- change "myenv" to the name of your actual environment
conda install seaborn scikit-learn
Or in one line:
conda install -n myenv seaborn scikit-learn
Similar to the above, you can update a package to the latest version: just change `install` to `update`.
Or in one line:
conda update -n myenv seaborn scikit-learn
In the prior module you learned about setting the date and time when importing data, visualizing your data with box plots, histograms, line or time-series plots, and scatter plots. You applied these to your own data, and learned about the very powerful `groupby` function in Pandas.
In this module we will take these skills a step further, but first, we will learn about fitting regression models to some data.
Start by importing Pandas, and also a tool to build regression models from the `scikit-learn` library. This library is imported under the name `sklearn`. You can read more about scikit-learn at their website: https://scikit-learn.org/stable/
import pandas as pd
from sklearn.linear_model import LinearRegression
We will use a data set from a distillation column, and predict an important output variable, called the Reid Vapour Pressure (RVP).
Read in the data set and set the column called "Case" to be the index:
distill = pd.read_csv("https://openmv.net/file/distillation-tower.csv", index_col="Case")
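If you like, verify that the import worked as intended:
# Optional check: the index name should now be "Case"
print(distill.index.name)
display(distill.head())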
In the prior module you were asked to use your own data and calculate the correlations between its columns. We will do this interactively below, but also introduce a new plot: the heat map.
The correlation matrix of numbers can be tedious to read on a screen. You can easily visualize it though:
import seaborn as sns
display(distill.corr());
# Let's visualize it instead, as a heat map.
sns.set(rc={'figure.figsize':(15, 15)})
sns.heatmap(distill.corr());
# This is not so attractive. Use a different colour map (cmap):
cmap = sns.diverging_palette(220, 10, as_cmap=True)
sns.heatmap(distill.corr(), cmap=cmap, square=True, linewidths=0.2, cbar_kws={"shrink": 0.5});
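If you would rather read the numbers off the plot directly, `sns.heatmap` can annotate each cell. A small variant of the call above:
# Annotate each cell with the correlation, to 1 decimal place
sns.heatmap(distill.corr(), cmap=cmap, square=True, linewidths=0.2,
            cbar_kws={"shrink": 0.5}, annot=True, fmt=".1f", annot_kws={"size": 7});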
We saw the scatter plots before, and the scatter plot matrix.
from pandas.plotting import scatter_matrix
scatter_matrix(distill, alpha = 0.2, figsize=(15, 15), diagonal = "kde");
The data set is quite big and takes some time to generate all the scatter plot combinations.
We can use every third row instead.
print(distill.shape)
subset = distill.iloc[0::3, :]
subset.shape
The `.iloc` accessor retrieves data by index (the `i` in `iloc`) at a given location (the `loc`), so `iloc` = index location.
Some examples:
- `mydata.iloc[0:10, :]` will return the first 10 rows, and all columns
- `mydata.iloc[20, 2:4]` will return only row ___, and columns ___
- `mydata.iloc[0:10:2, :]` will return only rows with index ___; and ___ columns
- `mydata.iloc[0::2, :]` will return ___ row; and ___ columns
- `mydata.iloc[:, -1]` will return ___ rows of the ___ column

Try some examples of using `.iloc` below on the `distill` data:
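For instance, a few quick experiments you could start with (printing only the shapes, to keep the output short):
print(distill.iloc[0:5, :].shape)    # the first 5 rows, all columns
print(distill.iloc[:, 0:3].shape)    # all rows, the first 3 columns
print(distill.iloc[::10, :].shape)   # every 10th row, all columns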
Now that you understand what `.iloc` is doing, you can understand why this code is faster: it uses only half the data set to create the scatter plot matrix:
scatter_matrix(distill.iloc[0::2,:], alpha = 0.2, figsize=(15, 15), diagonal = "kde");
Which 2 columns are the most correlated with the outcome variable called "VapourPressure"?
distill.corr().shape
distill.corr().iloc[...] # <-- fill in some code to show only the last row of the correlation matrix
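Another way to answer the question (a sketch, not the only approach) is to sort the absolute correlations with the outcome column:
# Rank the columns by absolute correlation with VapourPressure
corr_with_vp = distill.corr()["VapourPressure"].abs().sort_values(ascending=False)
print(corr_with_vp.head(4))  # the first entry is VapourPressure itself (correlation = 1)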
Let us build a regression model using the `InvTemp3` measurement, the inverse temperature measured at position 3 in the distillation column, to predict the `VapourPressure`.
There are 253 measurements (rows) in the dataset. We will use the first 150 rows in the data set to build the model, and then use the remaining rows to test the model, to see how well we can predict vapour pressure. This is good statistical practice: do not use all the data to build the prediction model, or you will get an inflated sense of how well the model works. Always keep some data aside for testing (validation/verification).
Create the model building data set from the first 150 rows:
build = distill.iloc[___] # <-- what goes here?
display(build)
Then create the testing data from the rest of the data frame:
test = distill.iloc[___] # <-- what goes here?
display(test)
Try it below, and use the `.shape` attribute to verify that the `build`ing and `test`ing data have the correct shape.
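One quick sanity check (a sketch, assuming you named the two pieces `build` and `test`):
# Together, the two pieces should account for every row in the original data
print(build.shape, test.shape)
assert build.shape[0] + test.shape[0] == distill.shape[0]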
First, we set up an instance of the linear regression model:
mymodel = LinearRegression()
type(mymodel)
The `mymodel` variable is an object. It is a linear regression model object, but it is empty at the moment. We will give it some training data to build the model:
mymodel = LinearRegression()
mymodel.fit(X, y)
but we have to tell it what `X` is and what `y` is. So we take a small detour...
We need numeric values for `X` and `y`. Scikit-learn requires the `X` data (the values used to predict `y`) to be a column vector or a matrix.
Notice that a column vector is just a matrix with 1 column. This is because, as you will see later, you can use 1 or more columns to predict `y`. Therefore every input used to predict `y` must be in its own column, and each row in the input matrix is one observation.
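For example, a minimal sketch with plain NumPy arrays (hypothetical numbers, just to show the required shapes):
import numpy as np

y_example = np.array([1.0, 2.0, 3.0])        # shape (3,): three target values
X_example = np.array([[0.1], [0.2], [0.3]])  # shape (3, 1): one input column, three observations
print(y_example.shape, X_example.shape)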
There is a shortcut, double square brackets, to force Pandas to extract a single column as a one-column data frame, and not as a `Series`:
print(build["InvTemp3"].shape)
print(build[["InvTemp3"]].shape)
So this will work to build your regression model:
# A single column in matrix X (capital X indicates one or more input columns)
X = build[["InvTemp3"]].values
y = build["VapourPressure"].values
mymodel = LinearRegression()
mymodel.fit(X, y)
If you run this code and see no error messages, then the model has been built. But it is not that exciting ... yet.
So what can you do with this model? Use
dir(mymodel)
to ask Python what can be done with that object. Note that the `dir(...)` function works on any object and is something that you will use regularly.
There are several interesting attributes and methods listed there which we will get to use:
- `coef_`
- `intercept_`
- `predict`
- `score`

The first two are, as you might guess, the coefficient (slope) and the intercept of the model.
print(f"The intercept is {mymodel.intercept_} and the slope is = {mymodel.coef_}")
It is not so handy having all those decimal places, though. Python allows you to round the display to the desired number of digits:
print(f"The intercept is {mymodel.intercept_:.5g} and the slope is = {mymodel.coef_}")
We have to be a bit more careful with the slope. It is an array (see the square brackets?): so we need to extract the first entry from that vector before displaying it:
print(f"The intercept is {mymodel.intercept_:.5g} and the slope is = {mymodel.coef_[0]:.5g}")
Try this as well:
print(f"The intercept is {mymodel.intercept_:.5f} and the slope is = {mymodel.coef_[0]:.5f}")
There is a subtle difference between the `f` and the `g` format specifiers: `f` fixes the number of digits after the decimal point, while `g` fixes the total number of significant digits, switching to scientific notation when needed.
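A small illustration of the difference:
value = 1234.56789
print(f"{value:.5f}")  # 1234.56789 -> exactly 5 digits after the decimal point
print(f"{value:.5g}")  # 1234.6     -> 5 significant digits in total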
After building the regression model it is helpful to visualize it. The Seaborn library has a useful function to do so.
import seaborn as sns
ax = sns.regplot(x="InvTemp3", y="VapourPressure", data=distill)
ax.grid(True)
Take a look at the documentation for the `regplot` function at https://seaborn.pydata.org/generated/seaborn.regplot.html for more options; the simple call above already does most of what you would expect.
An "upgrade" you might be interested in, is the joint plot, which adds the histograms to the axes:
sns.jointplot(x="InvTemp3", y="VapourPressure", data=distill, kind="reg");
# Or, show the kde=kernel density estimate
sns.jointplot(x="InvTemp3", y="VapourPressure", data=distill, kind="kde");
Next, we would like to extract some idea of how the model performs. For that we can look at the prediction errors from:
- the `build`ing data;
- the `test`ing data, which gives a more accurate estimate.

Firstly, for the model building data:
# Get the predicted values from the data used to build the model
X_build = build[["InvTemp3"]]
y_build = build["VapourPressure"]
prediction_build = mymodel.predict(X_build)
errors_build = y_build - prediction_build # error = actual minus predicted
# There are several ways to see "how good" the model is, but the average
# of the absolute values of the errors gives a good feeling. Smaller is better.
avg_absolute_error = pd.Series(errors_build).abs().mean()
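The assignment above does not display anything on its own; print the value to see it:
print(f"Average absolute error (building data): {avg_absolute_error:.3f}")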
The above gives an idea of how the model works on the data used to build the model. But of course, the idea is to use the model in the future, on data not seen before. So let's test the model on the remaining rows:
# Create the testing data set
test = distill.iloc[150:]
X_test = test[["InvTemp3"]]
y_test = test["VapourPressure"]
Then use the `predict(...)` function again, but on the testing data. Notice how simple scikit-learn makes this: just replace the input to the `predict` function with a different data frame:
prediction_test = mymodel.predict(X_test)
errors_test = y_test - prediction_test
avg_absolute_error = pd.Series(errors_test).abs().mean()
avg_absolute_error, errors_test.std()
Using the `score` method, we can get the $R^2$ value. The function needs two inputs:
mymodel.score(X_build, y_build)
and you will get a value that shows how the two variables are correlated. NOTE: the $R^2$ value is *not a measure of prediction precision or accuracy*. It is only an estimate of the degree of correlation. High correlation is no guarantee of good prediction.
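Since a high $R^2$ on the building data can be misleading, it is also worth scoring the model on the testing data (a sketch, reusing `X_test` and `y_test` from above):
# R^2 on the building data versus the testing data; expect the latter to be lower
print(f"R^2 (building data): {mymodel.score(X_build, y_build):.3f}")
print(f"R^2 (testing data):  {mymodel.score(X_test, y_test):.3f}")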
A least squares model $y = b_0 + b_1x_1$ with an intercept $b_0$ and a single slope of $b_1$ can be improved by adding a second, or more predictors: $$ y = b_0 + b_1x_1 + b_2x_2 + \ldots$$
This is called a multiple linear regression (MLR) model. We will try to improve our model by adding an extra predictor, `InvPressure1`:
# Specify the predictors in a list:
predictors = ["InvTemp3", "InvPressure1"]
# Specify the training data:
X_build_MLR = build[predictors]
y_build = build["VapourPressure"]
# Fit the model
full_model = LinearRegression()
full_model.fit(X=X_build_MLR, y=y_build)
# Print some stats:
predict_MLR_build = full_model.predict(X_build_MLR)
errors_MLR_build = y_build - predict_MLR_build
avg_absolute_error_MLR_build = pd.Series(errors_MLR_build).abs().mean()
print(f"The average absolute error {avg_absolute_error_MLR_build:.3f}")
pd.Series(errors_MLR_build).plot(grid=True, title="Error = Actual - Predicted")
Notice the power here: you only have to change the first line to add a new predictor. The rest of the code is the same as before (just more generic variable names have been used).
Try creating code for the testing data, which uses the 2 predictors:
- Create `X_test_MLR` and `y_test`.
- Use the `full_model.predict(...)` function, but on the testing data.
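If you get stuck, one possible solution (a sketch, following the same pattern as the building data) is:
X_test_MLR = test[predictors]
y_test = test["VapourPressure"]
predict_MLR_test = full_model.predict(X_test_MLR)
errors_MLR_test = y_test - predict_MLR_test
print(f"Average absolute error (testing data): {pd.Series(errors_MLR_test).abs().mean():.3f}")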