1 00:00:02,186 --> 00:00:05,716 In this video, I am going to show you how we can analyze the results 2 00:00:05,716 --> 00:00:09,606 from the wastewater treatment example we were considering in the prior module. 3 00:00:10,806 --> 00:00:14,306 Remember how tedious it was to analyze the results by hand? 4 00:00:14,496 --> 00:00:18,736 I am going to show you a really fast way of using R in today's class. 5 00:00:20,686 --> 00:00:26,926 In the previous video, I showed you how we used the software, RStudio, to analyze the results 6 00:00:26,926 --> 00:00:30,426 for a very small two-factor system. 7 00:00:30,426 --> 00:00:35,746 Open RStudio now, and start by creating a new file for this wastewater treatment example. 8 00:00:36,776 --> 00:00:38,546 And allow me to work backwards again. 9 00:00:38,876 --> 00:00:44,226 Start by creating a least squares model called "water", which is a linear model, 10 00:00:44,226 --> 00:00:48,926 where we predict the outcome value "y" from several factors. 11 00:00:50,416 --> 00:00:55,776 Remember that we had three factors in a water treatment example, "C" the chemical factor, 12 00:00:56,066 --> 00:00:59,886 "T" the temperature, and "S" the stirring speed factor. 13 00:01:00,926 --> 00:01:05,326 We also have several two factor interactions, CT, CS, 14 00:01:05,326 --> 00:01:11,996 and ST. And there's also a three factor interaction: CTS. 15 00:01:12,376 --> 00:01:18,096 In the water treatment example, recall that we had eight experiments required for this system. 16 00:01:18,996 --> 00:01:23,236 We learned that we will always require at least as many experiments 17 00:01:23,236 --> 00:01:25,236 as there are parameters being estimated. 18 00:01:25,376 --> 00:01:32,976 For example, in the popcorn video, we had four parameters and four experiments. 19 00:01:32,976 --> 00:01:35,426 In this example, we have eight experiments. 20 00:01:35,576 --> 00:01:38,126 So we are able to estimate eight parameters. 
21 00:01:39,436 --> 00:01:43,466 However, as before, we see only seven of them represented here. 22 00:01:43,746 --> 00:01:48,056 The main effects of C, T and S, the three two factor interactions, 23 00:01:48,396 --> 00:01:49,986 and the three factor interaction. 24 00:01:50,996 --> 00:01:54,366 The eighth parameter, the intercept, is built in. 25 00:01:54,956 --> 00:01:57,526 R will automatically calculate that for us. 26 00:01:57,526 --> 00:02:01,446 So there are actually eight parameters here, in our linear model. 27 00:02:03,506 --> 00:02:09,056 We need to define what C, T and S are, and also provide the outcome variable, "y". 28 00:02:10,306 --> 00:02:14,536 We can let R automatically create C, T and S, using the special form 29 00:02:14,536 --> 00:02:16,166 of the command, shown on the screen. 30 00:02:16,556 --> 00:02:21,016 The first line of the code creates three variables in one line. 31 00:02:21,706 --> 00:02:27,046 If we inspect the variables, we can see they are simply -1, followed by +1. 32 00:02:27,196 --> 00:02:32,886 Next, expand this into the standard order table, using the code shown here on the screen. 33 00:02:33,996 --> 00:02:38,156 We also need to extract the C, T, and S columns from this. 34 00:02:39,006 --> 00:02:41,316 These are the columns from our standard order table. 35 00:02:41,406 --> 00:02:46,166 Feel free to reuse this code at the start of every experimental analysis you do 36 00:02:46,166 --> 00:02:52,826 in R. The last step that we need is the "y" vector containing the eight outcome values. 37 00:02:53,406 --> 00:02:56,166 We can take these directly from the standard order table. 38 00:02:57,516 --> 00:03:00,206 Now we are ready to let the software calculate the model. 39 00:03:00,896 --> 00:03:05,146 Run this code to create the linear model and use the "summary" command 40 00:03:05,146 --> 00:03:06,906 to display the model on the screen.
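The on-screen code is not captured in this transcript, so here is a minimal sketch of the steps just described. The names "design" and "water.short" are illustrative. The eight outcome values are an assumption: they are not stated in the transcript, and were chosen to be consistent with the coefficients quoted for this example (11.25, 6.25, 0.75). The last line fits the same model with R's compact formula notation, where C * T * S expands to all main effects and all interactions.

```r
# Coded levels for the three factors: low = -1, high = +1.
# Note: assigning to T masks R's built-in shorthand T for TRUE in this session.
C <- T <- S <- c(-1, +1)

# Expand into the 8-run standard order table (C varies fastest, then T, then S)
design <- expand.grid(C = C, T = T, S = S)

# Extract the C, T and S columns of the standard order table
C <- design$C
T <- design$T
S <- design$S

# Outcome values in standard order.
# ASSUMPTION: these values are not given in the transcript; they are assumed
# here because they reproduce the quoted coefficients 11.25, 6.25 and 0.75.
y <- c(5, 30, 6, 33, 4, 3, 5, 4)

# Full model written out term by term; R adds the intercept automatically
water <- lm(y ~ C + T + S + C:T + C:S + T:S + C:T:S)
summary(water)

# Compact notation: C * T * S expands to the same seven terms
water.short <- lm(y ~ C * T * S)
```

Because eight parameters are estimated from eight experiments, summary() reports that there are no residual degrees of freedom. That is expected for this kind of design.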
41 00:03:07,906 --> 00:03:12,186 Notice that the parameters in this prediction model are identical to those we calculated 42 00:03:12,186 --> 00:03:18,276 in the prior module: 11.25, 6.25, 0.75, and so on. 43 00:03:19,176 --> 00:03:24,166 I want to show you a quick shortcut that you can use in R. Instead of writing the linear model 44 00:03:24,166 --> 00:03:29,466 by hand, where you could make mistakes writing out all those two and three factor interactions; 45 00:03:29,916 --> 00:03:32,336 rather use this notation shown on the screen. 46 00:03:32,856 --> 00:03:35,726 This gives you exactly the same model as you had before. 47 00:03:36,746 --> 00:03:38,316 Now it's time for some advice. 48 00:03:39,086 --> 00:03:43,216 Always perform your analysis using software code that you write out by hand. 49 00:03:43,696 --> 00:03:48,916 This is a permanent record of your work and is especially helpful if you add many comments. 50 00:03:52,196 --> 00:03:54,356 Devon: Um, that seems like a lot of work. 51 00:03:54,786 --> 00:03:57,926 Couldn't I use Excel, or other statistical tools to, 52 00:03:57,926 --> 00:04:00,346 sort of click a few buttons and get the same result? 53 00:04:00,676 --> 00:04:03,516 Kevin: Absolutely, you can use those other software tools. 54 00:04:03,926 --> 00:04:07,826 The problem often is, and I've seen this happen so many times in companies 55 00:04:07,826 --> 00:04:11,766 that I've worked with, is that you do the work and then a few months later you have 56 00:04:11,766 --> 00:04:14,936 to come back to it, and try to answer questions from your boss. 57 00:04:15,386 --> 00:04:17,856 Or another colleague continues on with the project. 
58 00:04:17,856 --> 00:04:21,366 If you only give them an Excel file, 59 00:04:21,366 --> 00:04:25,546 or some document that doesn't record the exact steps you took, 60 00:04:25,856 --> 00:04:28,616 it's very hard to reconstruct what you were doing, 61 00:04:28,856 --> 00:04:30,896 and what was going through your head at the time. 62 00:04:31,856 --> 00:04:35,726 Writing out the code like this explicitly creates a very traceable 63 00:04:35,726 --> 00:04:37,796 and reproducible record of your work. 64 00:04:38,416 --> 00:04:40,786 This is so important in many companies 65 00:04:41,136 --> 00:04:43,986 where there are regulatory requirements for traceability. 66 00:04:44,876 --> 00:04:47,466 There is one other piece of code I would like to share with you 67 00:04:47,466 --> 00:04:50,316 to help you visualize each of the effects in the model. 68 00:05:01,976 --> 00:05:06,996 What I mean by this, is that we have eight parameters and sometimes we would like to know 69 00:05:06,996 --> 00:05:10,196 which ones are the most influential on our outcome variable. 70 00:05:11,116 --> 00:05:16,086 We can get an idea of the important factors in our system, by examining the equation model 71 00:05:16,086 --> 00:05:19,646 in R. We pick out the numbers that are the largest. 72 00:05:20,046 --> 00:05:25,536 It's easy to visualize this though and here's some code that will create a barplot. 73 00:05:25,756 --> 00:05:29,636 The bar plot shows the absolute value of the model parameters. 74 00:05:31,116 --> 00:05:33,116 Why should we use the absolute values? 75 00:05:33,836 --> 00:05:37,596 We do this, because we want to compare the magnitude of each factor. 76 00:05:38,226 --> 00:05:43,966 The sign is important for sure, but it is easier to compare large negatives and large positives 77 00:05:44,016 --> 00:05:45,956 if they're on the same side of the plot. 
78 00:05:46,406 --> 00:05:51,306 In order to retain the sign information, I've used light grey for negative coefficients 79 00:05:51,306 --> 00:05:53,426 and black for positive coefficients. 80 00:05:54,146 --> 00:05:58,646 This is important: not everyone can perceive colour, or sometimes you have 81 00:05:58,646 --> 00:06:00,306 to print a report in black and white. 82 00:06:01,256 --> 00:06:05,296 You can modify how you use the code to get alternative colour schemes though. 83 00:06:05,646 --> 00:06:07,576 R is really flexible in this way. 84 00:06:08,046 --> 00:06:09,736 Now we need to interpret the plot. 85 00:06:10,566 --> 00:06:13,776 We can quickly see that the C times T times S interaction 86 00:06:13,776 --> 00:06:19,256 and the CT and TS interactions are really small compared to the other terms. 87 00:06:20,566 --> 00:06:22,536 Devon: Why don't you show the intercept in the plot? 88 00:06:22,536 --> 00:06:24,696 Kevin: There are several reasons. 89 00:06:25,536 --> 00:06:28,246 We always keep an intercept term in our models. 90 00:06:28,386 --> 00:06:31,766 The intent of this Pareto plot is for you to compare the effect 91 00:06:31,766 --> 00:06:33,726 of the various factors against each other. 92 00:06:34,296 --> 00:06:37,276 But the intercept isn't something that you can really change. 93 00:06:38,526 --> 00:06:43,636 These plots are often used to locate variables that are uninteresting and then remove them. 94 00:06:44,226 --> 00:06:48,526 But since the intercept will always be important in our models, we will never remove it, 95 00:06:48,596 --> 00:06:50,366 so we don't really need to plot it. 96 00:06:50,366 --> 00:06:55,316 Furthermore, the intercept can sometimes be a really large value, 97 00:06:55,316 --> 00:06:58,116 relative to the other bars, which will distort the plot.
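The bar plot described here could be sketched in base R as follows. This is only an illustration and not the code shown on screen in the video: the names "coefs", "ranked" and "bar.colours" are made up for this sketch, and the outcome values are assumed, since the transcript does not list them.

```r
# Refit the model so that this sketch runs on its own.
# ASSUMPTION: the y values are not given in the transcript; they are assumed
# here because they reproduce the quoted coefficients 11.25, 6.25 and 0.75.
design <- expand.grid(C = c(-1, +1), T = c(-1, +1), S = c(-1, +1))
design$y <- c(5, 30, 6, 33, 4, 3, 5, 4)
water <- lm(y ~ C * T * S, data = design)

# Drop the intercept: it always stays in the model and could dwarf the other bars
coefs <- coef(water)[-1]

# Sort by absolute value. With horiz = TRUE, barplot() draws the first bar at
# the bottom, so ascending order places the largest bar at the top.
ranked <- coefs[order(abs(coefs))]

# Light grey for negative coefficients, black for positive ones
bar.colours <- ifelse(ranked < 0, "lightgrey", "black")

barplot(abs(ranked), horiz = TRUE, col = bar.colours, las = 1,
        xlab = "Absolute value of the coefficient")
legend("bottomright", legend = c("Negative", "Positive"),
       fill = c("lightgrey", "black"))
```

Swapping the two colour names in ifelse() and legend() is enough to get a different colour scheme.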
98 00:06:58,666 --> 00:07:02,886 The plot shows us the parameters, ranked from largest absolute value at the top, 99 00:07:02,936 --> 00:07:05,196 to smallest absolute value at the bottom. 100 00:07:05,886 --> 00:07:09,956 This quickly allows us to find the most influential factors in our system. 101 00:07:10,096 --> 00:07:12,256 And this plot is called a Pareto plot. 102 00:07:13,126 --> 00:07:18,666 The largest magnitude bars correspond to those factors which most strongly affect the outcome. 103 00:07:19,356 --> 00:07:23,506 In this case, we have S. This colour here indicates 104 00:07:23,506 --> 00:07:25,786 that it has a reducing effect on the outcome. 105 00:07:26,816 --> 00:07:30,186 Remember that our objective was to reduce the amount of pollution. 106 00:07:30,616 --> 00:07:34,436 So we can quickly see here, that increasing S will result 107 00:07:34,436 --> 00:07:36,536 in less pollution, which is desirable. 108 00:07:37,566 --> 00:07:41,576 We investigated the interpretation of this interaction term in the prior class, 109 00:07:41,846 --> 00:07:43,716 so I'm not going to repeat it over here. 110 00:07:43,716 --> 00:07:47,486 And similarly, for the bar that represents the chemical effect, 111 00:07:47,486 --> 00:07:53,516 C. Just before finishing this example today, I'd like to quickly share 112 00:07:53,516 --> 00:07:56,396 with you what the matrix form of this problem looks like. 113 00:07:57,386 --> 00:08:01,416 I know that there are those of you that are more math-oriented, and will like - 114 00:08:01,416 --> 00:08:04,276 and actually have a better understanding of - this representation. 115 00:08:04,816 --> 00:08:07,456 There's certainly something for everyone in this course. 116 00:08:08,526 --> 00:08:11,936 So there it is, the "X" matrix and the "y" vector. 117 00:08:12,576 --> 00:08:14,836 And what R is doing behind the scenes, 118 00:08:14,936 --> 00:08:18,936 is efficiently finding the solution to the least squares problem.
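For those who prefer the matrix form, a minimal sketch of the "X" matrix and the least squares solution is given below. The outcome values are an assumption, since the transcript does not list them, and the name "b" for the coefficient vector is illustrative. The sketch solves the normal equations directly to make the algebra visible; lm() itself uses a QR decomposition, which gives the same answer here.

```r
# Standard order table for three factors at two coded levels each
design <- expand.grid(C = c(-1, +1), T = c(-1, +1), S = c(-1, +1))

# ASSUMPTION: the y values are not given in the transcript; they are assumed
# here because they reproduce the quoted coefficients 11.25, 6.25 and 0.75.
y <- c(5, 30, 6, 33, 4, 3, 5, 4)

# X matrix: a column of ones for the intercept, then the seven columns for
# C, T, S, C:T, C:S, T:S and C:T:S
X <- model.matrix(~ C * T * S, data = design)

# Least squares solution of (X'X) b = X'y, without forming the inverse explicitly
b <- solve(crossprod(X), crossprod(X, y))

# In this design the columns of X are orthogonal, so X'X is 8 times the
# identity matrix and each coefficient is simply (column of X)'y divided by 8
print(crossprod(X))
print(b)
```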
119 00:08:19,666 --> 00:08:24,626 To end this video, we hope that you are enjoying using R to solve your linear models 120 00:08:24,626 --> 00:08:27,006 and to fit your outcome values to make predictions. 121 00:08:27,476 --> 00:08:28,636 Please keep using it. 122 00:08:29,016 --> 00:08:32,586 There are plenty of practice problems in this module which you should be attempting. 123 00:08:33,146 --> 00:08:36,186 Those problems provide full R code in the solutions.