1 00:00:00,036 --> 00:00:05,766 Let's look at a good case study now, involving four factors, and two outcome variables. 2 00:00:06,286 --> 00:00:08,546 We're stepping up the complexity here a little bit. 3 00:00:08,646 --> 00:00:13,066 This is a good question, from the textbook by Box, Hunter and Hunter. 4 00:00:13,206 --> 00:00:18,006 It's a case study where we are using solar panels, with a storage tank. 5 00:00:18,706 --> 00:00:21,176 The outcome values were from a computer simulation. 6 00:00:21,176 --> 00:00:24,986 Now just a quick piece of advice when using simulations. 7 00:00:25,756 --> 00:00:28,046 Running simulations is often really easy. 8 00:00:28,646 --> 00:00:31,126 But there's a temptation to really do this inefficiently. 9 00:00:31,126 --> 00:00:35,806 I often see people just playing around with the software, trying out different values, 10 00:00:35,806 --> 00:00:37,496 until they get an answer they like. 11 00:00:38,426 --> 00:00:41,906 You shouldn't treat simulations any differently from real life. 12 00:00:41,906 --> 00:00:43,646 Always use a systematic method. 13 00:00:44,516 --> 00:00:48,356 In this case we're going to use a set of experiments as our systematic method. 14 00:00:49,096 --> 00:00:52,576 There are two key advantages though to using simulations. 15 00:00:52,576 --> 00:00:57,086 Firstly, you can run the simulations in parallel at very little cost, and secondly, 16 00:00:57,336 --> 00:01:00,916 you don't have to randomize the order of experiments. 17 00:01:00,916 --> 00:01:02,566 And the reason for that is quite simple. 18 00:01:03,536 --> 00:01:06,546 When you repeat the simulation, you get the same answer, 19 00:01:06,546 --> 00:01:10,286 so the need for randomization, which was to 20 00:01:10,286 --> 00:01:13,266 minimize the impact of disturbances, isn't there anymore. 21 00:01:13,266 --> 00:01:18,916 Be careful though: certain computer experiments, when repeated, don't give identical results. 
22 00:01:18,916 --> 00:01:21,116 So then you should randomize. 23 00:01:21,556 --> 00:01:24,056 In fact, I always recommend you randomize. 24 00:01:24,286 --> 00:01:29,096 The cost of doing so is minimal, and it guards against all sorts of problems. 25 00:01:29,616 --> 00:01:31,356 More on that in the next module though. 26 00:01:31,356 --> 00:01:34,366 Let's go back to the solar panel system. 27 00:01:34,846 --> 00:01:36,006 There are four factors. 28 00:01:36,356 --> 00:01:42,396 A: the total amount of insolation, or sunlight received; B: the capacity of the storage tank; 29 00:01:42,486 --> 00:01:47,316 C: the water flow rate through the absorber; and D: the intermittency of the sunlight. 30 00:01:48,046 --> 00:01:51,206 You can read more about these types of systems, by following this link. 31 00:01:52,296 --> 00:01:56,146 The two outcome variables were "y_1" the collection efficiency, 32 00:01:56,146 --> 00:01:58,766 and "y_2" the energy delivery efficiency. 33 00:01:59,456 --> 00:02:02,566 You should be able to quickly tell how many experiments will be done, 34 00:02:02,566 --> 00:02:06,116 if each factor is operated at the low level and the high level. 35 00:02:06,116 --> 00:02:11,786 You should have: two to the power of four (2^4) which is 16. 36 00:02:11,916 --> 00:02:18,086 So 16 experiments were run, and I've put the results and the R code here on the screen. 37 00:02:18,506 --> 00:02:21,706 They're available on the course website. 38 00:02:21,706 --> 00:02:25,306 Copy and paste that code and follow along with me for the rest of the video. 39 00:02:25,356 --> 00:02:30,836 So here we define the four factors: A, B, C and D, and I've manually typed 40 00:02:30,836 --> 00:02:33,536 in the two outcome variables, "y_1" and "y_2". 41 00:02:33,536 --> 00:02:38,566 This is what you would do in practice, but to make things a bit simpler, 42 00:02:38,776 --> 00:02:43,176 and to avoid typing errors, you can also use the PID package in R. 
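The factor definitions described above can be sketched in R roughly as follows. This is only an illustration and not the code shown on screen in the video: it assumes the usual coded levels of -1 (low) and +1 (high), and the names `coded`, `design`, `n_runs` and `run_order` are made up for this sketch. The outcome values are deliberately left out, because they come from the course material.

```r
# Coded levels for every factor: -1 is the low level, +1 is the high level
# (an assumed coding; the video does not state the values typed in).
coded <- c(-1, +1)

# expand.grid() builds every combination of the four factors.
# The first factor (A) varies fastest, which gives standard order.
design <- expand.grid(A = coded, B = coded, C = coded, D = coded)

# Number of runs in the full factorial: 2^4.
n_runs <- nrow(design)
print(design)

# If the simulation is not perfectly repeatable, shuffle the run order.
run_order <- sample(n_runs)
```

Each row of `design` is one simulation to run. The shuffled `run_order` only matters when repeated runs can give different answers, but it costs almost nothing to use it every time.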
43 00:02:44,026 --> 00:02:46,986 In a prior video I showed how you can download and install 44 00:02:46,986 --> 00:02:49,276 that package, to extend R's capability. 45 00:02:50,016 --> 00:02:53,106 That package includes the numeric results for this case study. 46 00:02:53,816 --> 00:02:59,026 And you can get that dataset by typing the following command: data(solar). 47 00:02:59,086 --> 00:03:04,866 So since we ran 16 experiments, we are able to estimate 16 parameters: 48 00:03:05,366 --> 00:03:08,546 there are four main effects (one each for A, B, C and D). 49 00:03:09,066 --> 00:03:14,056 There are 6 two-factor interactions, there are 4 three-factor interactions, 50 00:03:14,056 --> 00:03:16,526 and then the single four-factor interaction. 51 00:03:16,526 --> 00:03:21,766 That's a total of 15 parameters, and it adds up to 16 if you count the intercept. 52 00:03:21,766 --> 00:03:24,876 The software can create all of this for you, 53 00:03:24,876 --> 00:03:27,816 very compactly with the "lm(...)" command, as shown here. 54 00:03:29,536 --> 00:03:36,386 The reason this A*B*C*D shorthand works is the principle of model hierarchy. 55 00:03:36,386 --> 00:03:43,886 Let's take a simple example: if you wrote just A*B, then R will expand 56 00:03:43,886 --> 00:03:46,796 that to include factor A and factor B in the model. 57 00:03:46,796 --> 00:03:53,176 After all, you can't have the two-factor interaction A*B if you don't also have factor A 58 00:03:53,176 --> 00:03:58,076 and factor B. Similarly, when R encounters A*B*C, 59 00:03:58,076 --> 00:04:05,016 it ensures that the AB interaction is present, as well as factor C. But, 60 00:04:05,126 --> 00:04:08,486 we've already mentioned that AB will be expanded into factors A 61 00:04:08,486 --> 00:04:15,046 and B. So it will ensure the BC interaction is present, and in a similar line of thinking, 62 00:04:15,296 --> 00:04:17,556 the AC interaction will also be present. 
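One way to check this expansion for yourself, without any data, is to ask R which terms a formula contains. The sketch below uses only base R's `terms()` function. The response name `y` is just a placeholder, and the variable names are made up for this illustration. Note that R prints an interaction with a colon, so the AB interaction appears as `A:B`, while `A*B` is the shorthand that expands to `A + B + A:B`.

```r
# A*B expands to both main effects plus the two-factor interaction.
ab_terms <- attr(terms(y ~ A * B), "term.labels")
print(ab_terms)

# A*B*C also brings in the A:C and B:C interactions, and A:B:C.
abc_terms <- attr(terms(y ~ A * B * C), "term.labels")
print(abc_terms)

# The full model for four factors.
full <- terms(y ~ A * B * C * D)
full_terms <- attr(full, "term.labels")

# Count the terms by order: 1 = main effect, 2 = two-factor, and so on.
counts <- table(attr(full, "order"))
print(counts)

# Add one for the intercept to get the number of parameters.
n_parameters <- length(full_terms) + 1
print(n_parameters)
```

The counts by order should match the 4, 6, 4 and 1 listed earlier, so the full model uses exactly as many parameters as there are experiments.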
63 00:04:17,556 --> 00:04:23,536 So now you can understand why, when we write A*B*C*D here in the lm(...) 64 00:04:23,536 --> 00:04:28,416 command, R will recursively expand this into all the main effects, 65 00:04:28,416 --> 00:04:31,916 all the 2-factor interactions, all the 3-factor interactions 66 00:04:32,166 --> 00:04:34,036 as well as the 4-factor interaction. 67 00:04:34,886 --> 00:04:37,756 It is as if we had written it all out by hand as shown here. 68 00:04:38,426 --> 00:04:42,546 But obviously that is tedious, and error-prone, so let R do the work for you. 69 00:04:42,546 --> 00:04:46,756 Now let's build those two separate linear models: for the collection efficiency, "y1", 70 00:04:46,756 --> 00:04:49,816 and for the energy delivery efficiency, "y2". 71 00:04:49,816 --> 00:04:51,936 If you use the summary(...) 72 00:04:51,936 --> 00:04:55,316 command, as we've done before, it might be fairly difficult 73 00:04:55,316 --> 00:04:59,136 to quickly locate what the important factors are that influence y_1. 74 00:05:00,036 --> 00:05:03,366 Rather, let's use the Pareto plot to show us what the important parameters are. 75 00:05:04,096 --> 00:05:07,946 Here it is: the grey bars represent the terms with a negative sign. 76 00:05:08,446 --> 00:05:12,306 And black bars represent the terms with a positive sign. 77 00:05:12,306 --> 00:05:16,796 The most important terms are B, A, the AB interaction, 78 00:05:17,046 --> 00:05:21,236 and factor C. The other terms have a diminishing effect on the outcome. 79 00:05:22,716 --> 00:05:26,916 The collection efficiency will decrease when factor B is increased. 80 00:05:26,916 --> 00:05:32,676 In other words, as the storage tank capacity is increased, the collection efficiency drops. 81 00:05:32,676 --> 00:05:35,276 This is the most influential variable in the system. 82 00:05:35,706 --> 00:05:42,246 Next is factor A, the amount of insolation, which has a positive effect on the collection efficiency. 
83 00:05:42,246 --> 00:05:46,126 Now try answering this question here on the screen: pause the video, 84 00:05:46,336 --> 00:05:48,406 and think about the AB interaction. 85 00:05:50,266 --> 00:05:53,426 The correct answer is the one that uses a high level for factor A, 86 00:05:53,426 --> 00:05:58,546 and a low level for factor B. We can see this in the equation, and from the Pareto plot. 87 00:05:59,346 --> 00:06:04,156 In this case, setting factor B to a negative sign helps boost our objective, 88 00:06:04,536 --> 00:06:07,756 but it also makes the two-factor interaction work in our favour. 89 00:06:09,316 --> 00:06:13,716 So A, B and the AB interaction are the three most influential terms in the model. 90 00:06:14,296 --> 00:06:17,586 But you also notice that factor D has little impact on the outcome. 91 00:06:17,586 --> 00:06:22,216 That's a useful result as it indicates we are relatively insensitive 92 00:06:22,216 --> 00:06:24,436 to the variation in the solar intermittency. 93 00:06:24,436 --> 00:06:27,526 If we were to run more experiments in the future, 94 00:06:27,896 --> 00:06:30,106 we might leave factor D out of consideration. 95 00:06:30,136 --> 00:06:35,016 Similarly, when trying to optimize the process for collection efficiency, y1, 96 00:06:35,016 --> 00:06:39,976 we can be confident that solar intermittency won't play a major role; 97 00:06:40,426 --> 00:06:42,646 at least according to this simulation system. 98 00:06:44,196 --> 00:06:49,346 Now let's take a look at our second outcome variable, y_2, the energy delivery efficiency. 99 00:06:49,346 --> 00:06:54,846 If we rebuild the model and look at the Pareto plot, we see extremely strong effects 100 00:06:54,846 --> 00:06:58,116 from factor A, and the two-factor interaction AB. 101 00:06:58,726 --> 00:07:01,426 The other factors, C and D, are small. 
102 00:07:02,376 --> 00:07:05,596 What you also notice here, and this is a very common result, 103 00:07:06,136 --> 00:07:09,526 is that many of the higher level interactions, such as the 3- 104 00:07:09,526 --> 00:07:12,166 and 4-factor interactions, are small, or zero. 105 00:07:12,166 --> 00:07:17,116 I would like to point out an important issue at this moment using this example. 106 00:07:17,846 --> 00:07:22,826 Take a look at factor B: its effect is small, and based on what we've done you might be tempted 107 00:07:22,826 --> 00:07:25,046 to conclude that factor B is not important. 108 00:07:25,646 --> 00:07:26,996 That's not entirely correct. 109 00:07:27,396 --> 00:07:32,646 We cannot exclude factor B from consideration, because the AB interaction is very important. 110 00:07:33,406 --> 00:07:35,586 Remember what an interaction was defined as. 111 00:07:36,136 --> 00:07:41,976 In this example the AB interaction means that the effect of factor A is dependent on the level 112 00:07:41,976 --> 00:07:48,346 of factor B. Alternatively, the effect of factor B is dependent on the level of factor A. 113 00:07:50,026 --> 00:07:54,966 So because the AB interaction is strong, we cannot ignore factor B. The level 114 00:07:54,966 --> 00:07:57,546 that factor B is set at is also important. 115 00:07:57,546 --> 00:08:00,786 And so we cannot remove factor B from the model either. 116 00:08:02,606 --> 00:08:05,796 So let's end off today's class with this question for you to think about. 117 00:08:06,426 --> 00:08:10,226 Can you maximize both y_1 and y_2 simultaneously? 118 00:08:10,226 --> 00:08:15,106 What would be the best combination of settings of the factors to get that maximum? 119 00:08:15,106 --> 00:08:18,426 This is a question that we will discuss in the course forums. 120 00:08:18,816 --> 00:08:25,776 Please go ahead and participate in the forums, and discuss that issue. 121 00:08:25,846 --> 00:08:28,046 So that's a wrap. 
122 00:08:28,046 --> 00:08:32,306 In this module, and in the prior one, you've seen how we can use pen and paper, 123 00:08:32,306 --> 00:08:37,156 or use computer software to analyze experiments to make improvements. 124 00:08:37,156 --> 00:08:40,386 Now in the coming module we start to get a little bit lazy. 125 00:08:40,386 --> 00:08:42,336 We want to do fewer experiments, 126 00:08:42,336 --> 00:08:46,646 but still extract the most information we can from the system. 127 00:08:46,646 --> 00:08:50,746 Well, we are not actually being lazy, we really just want to save money 128 00:08:50,746 --> 00:08:53,506 and time, because experiments are costly. 129 00:08:53,626 --> 00:08:58,216 So run as few experiments as possible, but extract the most information we possibly can. 130 00:08:58,216 --> 00:09:00,426 I'm looking forward to one way we might do that. 131 00:09:00,426 --> 00:09:01,786 See you over there.