1
00:00:02,186 --> 00:00:05,716
In this video, I am going to show
you how we can analyze the results

2
00:00:05,716 --> 00:00:09,606
from the wastewater treatment example
we were considering in the prior module.

3
00:00:10,806 --> 00:00:14,306
Remember how tedious it was to
analyze the results by hand?

4
00:00:14,496 --> 00:00:18,736
I am going to show you a really fast
way of using R in today's class.

5
00:00:20,686 --> 00:00:26,926
In the previous video, I showed you how we used
the software, RStudio, to analyze the results

6
00:00:26,926 --> 00:00:30,426
for a very small two-factor system.

7
00:00:30,426 --> 00:00:35,746
Open RStudio now, and start by creating a new
file for this wastewater treatment example.

8
00:00:36,776 --> 00:00:38,546
And allow me to work backwards again.

9
00:00:38,876 --> 00:00:44,226
Start by creating a least squares model
called "water", which is a linear model,

10
00:00:44,226 --> 00:00:48,926
where we predict the outcome
value "y" from several factors.

11
00:00:50,416 --> 00:00:55,776
Remember that we had three factors in the water
treatment example, "C" the chemical factor,

12
00:00:56,066 --> 00:00:59,886
"T" the temperature, and "S"
the stirring speed factor.

13
00:01:00,926 --> 00:01:05,326
We also have several two-factor
interactions, CT, CS,

14
00:01:05,326 --> 00:01:11,996
and ST. And there's also a
three-factor interaction: CTS.

15
00:01:12,376 --> 00:01:18,096
In the water treatment example, recall that we
had eight experiments required for this system.

16
00:01:18,996 --> 00:01:23,236
We learned that we will always
require at least as many experiments

17
00:01:23,236 --> 00:01:25,236
as there are parameters being estimated.

18
00:01:25,376 --> 00:01:32,976
For example, in the popcorn video, we
had four parameters and four experiments.

19
00:01:32,976 --> 00:01:35,426
In this example, we have eight experiments.

20
00:01:35,576 --> 00:01:38,126
So we are able to estimate eight parameters.

21
00:01:39,436 --> 00:01:43,466
However, as before, we see only
seven of them represented here.

22
00:01:43,746 --> 00:01:48,056
The main effects of C, T and S,
the three two-factor interactions,

23
00:01:48,396 --> 00:01:49,986
and the three-factor interaction.

24
00:01:50,996 --> 00:01:54,366
The eighth parameter, the
intercept, is built in.

25
00:01:54,956 --> 00:01:57,526
R will automatically calculate that for us.

26
00:01:57,526 --> 00:02:01,446
So there are actually eight
parameters here, in our linear model.

27
00:02:03,506 --> 00:02:09,056
We need to define what C, T and S are, and
also provide the outcome variable, "y".

28
00:02:10,306 --> 00:02:14,536
We can let R automatically create
C, T and S, using the special form

29
00:02:14,536 --> 00:02:16,166
of the command, shown on the screen.

30
00:02:16,556 --> 00:02:21,016
The first line of the code
creates three variables in one line.

31
00:02:21,706 --> 00:02:27,046
If we inspect the variables, we can
see they are simply -1, followed by +1.

32
00:02:27,196 --> 00:02:32,886
Next, expand this into the standard order
table, using the code shown here on the screen.

33
00:02:33,996 --> 00:02:38,156
We also need to extract the
C, T, and S columns from this.

34
00:02:39,006 --> 00:02:41,316
These are the columns from
our standard order table.

35
00:02:41,406 --> 00:02:46,166
Feel free to reuse this code at the start
of every experimental analysis you do

36
00:02:46,166 --> 00:02:52,826
in R. The last step that we need is the "y"
vector containing the eight outcome values.

37
00:02:53,406 --> 00:02:56,166
We can take these directly
from the standard order table.

38
00:02:57,516 --> 00:03:00,206
Now we are ready to let the
software calculate the model.

39
00:03:00,896 --> 00:03:05,146
Run this code to create the linear
model; and use the "summary" command

40
00:03:05,146 --> 00:03:06,906
to display the model on the screen.

41
00:03:07,906 --> 00:03:12,186
Notice that the parameters in this prediction
model are identical to those we calculated

42
00:03:12,186 --> 00:03:18,276
in the prior module: 11.25,
6.25, 0.75, and so on.
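[Code note: a sketch of the model-fitting step, written out term by term as described. The outcome values are the same assumed, illustrative numbers; they are not read from the video. With eight experiments and eight parameters there are no degrees of freedom left over, so the summary reports the coefficients but cannot estimate their standard errors.]

```r
# Standard order table for three factors at two coded levels.
design <- expand.grid(C = c(-1, +1), T = c(-1, +1), S = c(-1, +1))
C <- design$C
T <- design$T
S <- design$S

# ASSUMPTION: illustrative outcome values in standard order.
y <- c(5, 30, 6, 33, 4, 3, 5, 4)

# Main effects, the three two-factor interactions and the
# three-factor interaction. R adds the intercept automatically.
water <- lm(y ~ C + T + S + C*T + C*S + S*T + C*T*S)

# Display the fitted model.
summary(water)

# The eight estimated parameters, intercept first.
print(coef(water))
```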

43
00:03:19,176 --> 00:03:24,166
I want to show you a quick shortcut that you can
use in R. Instead of writing the linear model

44
00:03:24,166 --> 00:03:29,466
by hand, where you could make mistakes writing
out all those two- and three-factor interactions,

45
00:03:29,916 --> 00:03:32,336
rather use this notation shown on the screen.

46
00:03:32,856 --> 00:03:35,726
This gives you exactly the
same model as you had before.
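[Code note: the shortcut is presumably R's formula operator "*", which expands to all main effects and all interactions of the listed factors. The sketch below fits both forms and checks that the coefficients match. The name "water_long" and the outcome values are assumptions for illustration.]

```r
# Standard order table for three factors at two coded levels.
design <- expand.grid(C = c(-1, +1), T = c(-1, +1), S = c(-1, +1))
C <- design$C
T <- design$T
S <- design$S

# ASSUMPTION: illustrative outcome values in standard order.
y <- c(5, 30, 6, 33, 4, 3, 5, 4)

# Long form: every term written out by hand.
water_long <- lm(y ~ C + T + S + C*T + C*S + S*T + C*T*S)

# Short form: C*T*S expands to the same eight-parameter model.
water <- lm(y ~ C*T*S)

# Compare the two sets of estimates, ignoring the term labels.
same <- isTRUE(all.equal(unname(coef(water_long)), unname(coef(water))))
print(same)
```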

47
00:03:36,746 --> 00:03:38,316
Now it's time for some advice.

48
00:03:39,086 --> 00:03:43,216
Always perform your analysis using
software code that you write out by hand.

49
00:03:43,696 --> 00:03:48,916
This is a permanent record of your work and is
especially helpful if you add many comments.

50
00:03:52,196 --> 00:03:54,356
Devon: Um, that seems like a lot of work.

51
00:03:54,786 --> 00:03:57,926
Couldn't I use Excel, or
other statistical tools to,

52
00:03:57,926 --> 00:04:00,346
sort of click a few buttons
and get the same result?

53
00:04:00,676 --> 00:04:03,516
Kevin: Absolutely, you can use
those other software tools.

54
00:04:03,926 --> 00:04:07,826
The problem often is, and I've seen
this happen so many times in companies

55
00:04:07,826 --> 00:04:11,766
that I've worked with, is that you do the
work and then a few months later you have

56
00:04:11,766 --> 00:04:14,936
to come back to it, and try to
answer questions from your boss.

57
00:04:15,386 --> 00:04:17,856
Or another colleague continues
on with the project.

58
00:04:17,856 --> 00:04:21,366
If you only give them an Excel file,

59
00:04:21,366 --> 00:04:25,546
or some document that doesn't
record the exact steps you took,

60
00:04:25,856 --> 00:04:28,616
it's very hard to reconstruct
what you were doing,

61
00:04:28,856 --> 00:04:30,896
and what was going through
your head at the time.

62
00:04:31,856 --> 00:04:35,726
Writing out the code like this
explicitly creates a very traceable

63
00:04:35,726 --> 00:04:37,796
and reproducible record of your work.

64
00:04:38,416 --> 00:04:40,786
This is so important in many companies

65
00:04:41,136 --> 00:04:43,986
where there are regulatory
requirements for traceability.

66
00:04:44,876 --> 00:04:47,466
There is one other piece of code
I would like to share with you

67
00:04:47,466 --> 00:04:50,316
to help you visualize each
of the effects in the model.

68
00:05:01,976 --> 00:05:06,996
What I mean by this is that we have eight
parameters and sometimes we would like to know

69
00:05:06,996 --> 00:05:10,196
which ones are the most influential
on our outcome variable.

70
00:05:11,116 --> 00:05:16,086
We can get an idea of the important factors
in our system, by examining the equation model

71
00:05:16,086 --> 00:05:19,646
in R. We pick out the numbers
that are the largest.

72
00:05:20,046 --> 00:05:25,536
It's easy to visualize this, though, and
here's some code that will create a bar plot.

73
00:05:25,756 --> 00:05:29,636
The bar plot shows the absolute
value of the model parameters.

74
00:05:31,116 --> 00:05:33,116
Why should we use the absolute values?

75
00:05:33,836 --> 00:05:37,596
We do this, because we want to
compare the magnitude of each factor.

76
00:05:38,226 --> 00:05:43,966
The sign is important for sure, but it is easier
to compare large negatives and large positives

77
00:05:44,016 --> 00:05:45,956
if they're on the same side of the plot.

78
00:05:46,406 --> 00:05:51,306
In order to retain the sign information, I've
used light grey for negative coefficients

79
00:05:51,306 --> 00:05:53,426
and black for positive coefficients.
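[Code note: one way such a bar plot could be produced with base R graphics. This is a sketch, not the code shown in the video; the variable names and the outcome values are assumptions. The intercept is dropped, the bars show absolute values, and light grey marks negative coefficients while black marks positive ones.]

```r
# Standard order table and model, as before.
design <- expand.grid(C = c(-1, +1), T = c(-1, +1), S = c(-1, +1))
C <- design$C
T <- design$T
S <- design$S

# ASSUMPTION: illustrative outcome values in standard order.
y <- c(5, 30, 6, 33, 4, 3, 5, 4)
water <- lm(y ~ C*T*S)

# Drop the intercept: it is not a factor we can change.
coeff <- coef(water)[-1]

# Sort by magnitude, smallest first. A horizontal barplot draws
# the first bar at the bottom, so the largest ends up at the top.
coeff <- coeff[order(abs(coeff))]

# Keep the sign information in the bar colour.
bar_col <- ifelse(coeff < 0, "lightgrey", "black")

barplot(abs(coeff), horiz = TRUE, las = 1, col = bar_col,
        xlab = "Magnitude of coefficient")
legend("bottomright", legend = c("Negative", "Positive"),
       fill = c("lightgrey", "black"))
```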

80
00:05:54,146 --> 00:05:58,646
This is important: not everyone can
perceive colour, or sometimes you have

81
00:05:58,646 --> 00:06:00,306
to print a report in black and white.

82
00:06:01,256 --> 00:06:05,296
You can modify how you use the code to
get alternative colour schemes though.

83
00:06:05,646 --> 00:06:07,576
R is really flexible in this way.

84
00:06:08,046 --> 00:06:09,736
Now we need to interpret the plot.

85
00:06:10,566 --> 00:06:13,776
We can quickly see that the C
times T times S interaction

86
00:06:13,776 --> 00:06:19,256
and the CT and TS interactions are
really small compared to the other terms.

87
00:06:20,566 --> 00:06:22,536
Devon: Why don't you show
the intercept in the plot?

88
00:06:22,536 --> 00:06:24,696
Kevin: There are several reasons.

89
00:06:25,536 --> 00:06:28,246
We always keep an intercept term in our models.

90
00:06:28,386 --> 00:06:31,766
The intent of this Pareto plot
is for you to compare the effect

91
00:06:31,766 --> 00:06:33,726
of the various factors against each other.

92
00:06:34,296 --> 00:06:37,276
But the intercept isn't something
that you can really change.

93
00:06:38,526 --> 00:06:43,636
These plots are often used to locate variables
that are uninteresting and then remove them.

94
00:06:44,226 --> 00:06:48,526
But since the intercept will always be important
in our models, we will never remove it,

95
00:06:48,596 --> 00:06:50,366
so we don't really need to plot it.

96
00:06:50,366 --> 00:06:55,316
Furthermore, the intercept can
sometimes be a really large value,

97
00:06:55,316 --> 00:06:58,116
relative to the other bars,
which will distort the plot.

98
00:06:58,666 --> 00:07:02,886
The plot shows us the parameters, ranked
from largest absolute value at the top,

99
00:07:02,936 --> 00:07:05,196
to smallest absolute values at the bottom.

100
00:07:05,886 --> 00:07:09,956
This quickly allows us to find the
most influential factors in our system.

101
00:07:10,096 --> 00:07:12,256
And this plot is called a Pareto plot.

102
00:07:13,126 --> 00:07:18,666
The largest-magnitude bars correspond to those
factors which most strongly affect the outcome.

103
00:07:19,356 --> 00:07:23,506
In this case, we have S.
This colour here indicates

104
00:07:23,506 --> 00:07:25,786
that it has a reducing effect on the outcome.

105
00:07:26,816 --> 00:07:30,186
Remember that our objective was
to reduce the amount of pollution.

106
00:07:30,616 --> 00:07:34,436
So we can quickly see here,
that increasing S will result

107
00:07:34,436 --> 00:07:36,536
in less pollution, which is desirable.

108
00:07:37,566 --> 00:07:41,576
We investigated the interpretation of
this interaction term in the prior class,

109
00:07:41,846 --> 00:07:43,716
so I'm not going to repeat it over here.

110
00:07:43,716 --> 00:07:47,486
And similarly, for the bar that
represents the chemical effect,

111
00:07:47,486 --> 00:07:53,516
C. Just before finishing this example
today, I'd like to quickly share

112
00:07:53,516 --> 00:07:56,396
with you what the matrix form
of this problem looks like.

113
00:07:57,386 --> 00:08:01,416
I know that there are those of you that
are more math-oriented, and will like -

114
00:08:01,416 --> 00:08:04,276
and actually have a better
understanding of - this representation.

115
00:08:04,816 --> 00:08:07,456
There's certainly something
for everyone in this course.

116
00:08:08,526 --> 00:08:11,936
So there it is, the "X" matrix
and the "y" vector.

117
00:08:12,576 --> 00:08:14,836
And what R is doing behind the scenes,

118
00:08:14,936 --> 00:08:18,936
is efficiently finding the solution
to the least squares problem.
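[Code note: a sketch of the matrix form, assuming the same illustrative outcome values as a stand-in for the real data. It builds the "X" matrix with model.matrix and solves the normal equations, b = (X'X)^(-1) X'y, directly. This is for illustration only: lm itself uses a QR decomposition rather than forming X'X, but for this design both routes give the same estimates.]

```r
# Standard order table for three factors at two coded levels.
design <- expand.grid(C = c(-1, +1), T = c(-1, +1), S = c(-1, +1))

# ASSUMPTION: illustrative outcome values in standard order.
y <- c(5, 30, 6, 33, 4, 3, 5, 4)

# The X matrix: a column of ones for the intercept, the three
# factor columns, and the interaction columns (their products).
X <- model.matrix(~ C*T*S, data = design)
print(X)

# Solve the normal equations (X'X) b = X'y for b.
b <- solve(t(X) %*% X, t(X) %*% y)
print(b)

# Check against lm, which solves the same least squares problem.
water <- lm(y ~ C*T*S, data = design)
same <- isTRUE(all.equal(as.vector(b), unname(coef(water))))
print(same)
```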

119
00:08:19,666 --> 00:08:24,626
To end this video, we hope that you are
enjoying using R to solve your linear models

120
00:08:24,626 --> 00:08:27,006
and to fit your outcome values
to make predictions.

121
00:08:27,476 --> 00:08:28,636
Please keep using it.

122
00:08:29,016 --> 00:08:32,586
There are plenty of practice problems in
this module which you should be attempting.

123
00:08:33,146 --> 00:08:36,186
Those problems provide full
R code in the solutions.