1 00:00:05,606 --> 00:00:09,426 Our focus with this module is to understand how to do less work 2 00:00:09,426 --> 00:00:13,936 and still get mostly the same amount of information, as if we had done all the work. 3 00:00:14,586 --> 00:00:19,296 A bit of educated guessing is required, and some assumptions are used along the way. 4 00:00:20,146 --> 00:00:24,436 Now, do you remember that rule that when we were dealing with a system with "k" factors, 5 00:00:24,436 --> 00:00:29,836 and there are two levels for each factor, that we will have 2 to the power of "k" experiments? 6 00:00:30,496 --> 00:00:32,696 That's a lot of experiments in many cases. 7 00:00:33,576 --> 00:00:36,716 We saw that in the prior module, that when we used the software, 8 00:00:36,716 --> 00:00:38,736 we could estimate all those coefficients. 9 00:00:39,716 --> 00:00:42,826 The key insight that you will take away from these videos is 10 00:00:43,016 --> 00:00:45,446 that we don't have to run all those experiments. 11 00:00:45,826 --> 00:00:50,296 We can do fewer, but there's going to be a price to pay; and we're going to figure 12 00:00:50,296 --> 00:00:52,206 out what that price is in this video. 13 00:00:52,476 --> 00:00:55,696 Here's an experiment with two factors at two levels 14 00:00:56,326 --> 00:00:58,836 and there are the four parameters that we can estimate. 15 00:00:59,186 --> 00:01:03,946 The intercept, the main effect of the first factor, the main effect of the second factor 16 00:01:03,946 --> 00:01:06,566 and the two factor interaction between the two. 17 00:01:07,826 --> 00:01:12,696 Here is a system with three factors, and as we can see, we can estimate eight parameters 18 00:01:12,766 --> 00:01:15,686 after we have completed the eight experiments. 19 00:01:15,966 --> 00:01:20,886 A system with four factors will have a total of 16 experiments in a full factorial. 20 00:01:20,886 --> 00:01:26,426 Such a system will have 16 parameters that we can estimate using computer software. 
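The 2 to the power of "k" rule above can be sketched in a few lines of Python. This is only an illustration, not the software used in the video: the function name `full_factorial` and the -1/+1 coding for low and high levels are assumptions made here.

```python
from itertools import product


def full_factorial(k):
    """Return the 2**k runs of a two-level full factorial in standard order.

    Each run is a tuple of -1 (low) / +1 (high) settings, one per factor.
    """
    # product() varies the last position fastest, so each tuple is reversed
    # to make the first factor (A) alternate fastest, as in standard order.
    return [tuple(reversed(levels)) for levels in product((-1, +1), repeat=k)]


for k in (2, 3, 4):
    print(f"k = {k}: {len(full_factorial(k))} experiments")

# Standard order table for three factors A, B and C
for number, (a, b, c) in enumerate(full_factorial(3), start=1):
    print(number, a, b, c)
```

The number of rows doubles with every extra factor, which is why the count of experiments (and of coefficients in the full model) grows so quickly.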
21 00:01:28,236 --> 00:01:31,796 You can probably appreciate that this procedure quickly becomes prohibitive 22 00:01:31,906 --> 00:01:33,586 for most practical systems. 23 00:01:34,306 --> 00:01:37,856 There are many systems where there are 6, 7, or more factors. 24 00:01:38,356 --> 00:01:42,386 We do not want to perform the many experiments required by the full factorial. 25 00:01:43,616 --> 00:01:46,786 It will be both time prohibitive and cost prohibitive. 26 00:01:48,116 --> 00:01:54,006 This is even true for systems that can be highly automated, e.g. systems with DNA sequencing 27 00:01:54,356 --> 00:01:57,486 or systems that are done using computer software and simulation. 28 00:01:57,976 --> 00:02:04,126 There is also very little use in estimating all 2 to the power of "k" coefficients, that's many, 29 00:02:04,126 --> 00:02:07,086 many coefficients in some experiments. 30 00:02:07,086 --> 00:02:09,796 These higher order interactions are non-existent, 31 00:02:10,246 --> 00:02:14,666 and many of those coefficients will be so small, that they're practically zero. 32 00:02:15,586 --> 00:02:20,216 You'll seldom see a 3 factor interaction that is actually present in a real system. 33 00:02:21,006 --> 00:02:26,526 And a 4th order, and higher level interactions, almost certainly don't exist in practice. 34 00:02:26,576 --> 00:02:31,586 By using some educated guessing, and making reasonable assumptions about our system, 35 00:02:31,996 --> 00:02:35,336 we are going to figure out a way to do fewer experiments 36 00:02:35,336 --> 00:02:41,676 and still retain the essential information of the important effects in our system. 37 00:02:41,676 --> 00:02:44,616 At the core of this approach is an implicit assumption 38 00:02:44,816 --> 00:02:47,836 that we ignore these higher-order coefficients in the model. 
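To get a feel for how many of the 2 to the power of "k" coefficients are higher-order interactions, here is a small counting sketch. It relies only on the fact that a model in k factors has "k choose j" interaction terms of order j; the function name `terms_by_order` is made up for this illustration.

```python
from math import comb


def terms_by_order(k):
    """Number of model terms of each order in a 2**k full factorial model.

    Index 0 is the intercept, index 1 the main effects, index 2 the
    two-factor interactions, and so on up to the single k-factor interaction.
    """
    return [comb(k, order) for order in range(k + 1)]


for k in (3, 4, 6, 7):
    counts = terms_by_order(k)
    # Intercept + main effects + two-factor interactions
    low_order = sum(counts[:3])
    print(f"k = {k}: {sum(counts)} coefficients in total, "
          f"{low_order} are intercept, main effects or two-factor interactions")
```

For seven factors, for instance, the full model has 128 coefficients, and 99 of them are interactions of three or more factors, the kind that is seldom present in a real system.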
39 00:02:48,686 --> 00:02:51,286 There are occasions when it is appropriate to do that, 40 00:02:51,956 --> 00:02:54,696 and there will be times when our assumptions are faulty. 41 00:02:56,076 --> 00:03:00,536 It is critical to understand that there are practical situations where it's quite okay 42 00:03:00,536 --> 00:03:04,046 to lose some of this prediction accuracy from the higher-order terms. 43 00:03:05,236 --> 00:03:09,776 Those higher-order terms definitely help you fine-tune the predictions but the cost 44 00:03:09,776 --> 00:03:11,796 of obtaining them can be prohibitive. 45 00:03:12,766 --> 00:03:16,386 You'll need to decide whether or not it is worth doing that work. 46 00:03:16,986 --> 00:03:18,896 And that's the subject of today's video. 47 00:03:20,526 --> 00:03:26,256 Perhaps let me ask you to consider the question this way: if we only had the time and a budget 48 00:03:26,256 --> 00:03:30,736 to do 4 experiments, which 4 of these original 8 would you do? 49 00:03:31,656 --> 00:03:35,556 You might start by considering running only the 4 experiments here at the front, 50 00:03:36,306 --> 00:03:40,796 but that won't work so well because you will only have factor C at its low level. 51 00:03:41,216 --> 00:03:43,926 There will be no experiments at the high level for factor C, 52 00:03:44,426 --> 00:03:48,256 and so you won't really know what factor C does in the system. 53 00:03:48,256 --> 00:03:54,626 So then you might say: "what if I select these two at the front and those two at the back?" 54 00:03:55,456 --> 00:03:58,626 Those represent the middle four rows from the standard order table. 55 00:03:59,216 --> 00:04:01,726 That's not a bad choice, but it's not the best. 56 00:04:01,826 --> 00:04:05,186 Let me show you a better choice, and then I will explain it afterwards. 57 00:04:06,236 --> 00:04:08,566 Here is the set of 4 experiments that you should do. 58 00:04:09,166 --> 00:04:13,236 Either select the 4 with open circles or the 4 with closed circles. 
59 00:04:13,236 --> 00:04:16,186 Notice the interesting pattern in the cube. 60 00:04:16,186 --> 00:04:19,966 It is intentionally selected that way and let me explain why. 61 00:04:20,986 --> 00:04:21,946 We'll work backwards here. 62 00:04:21,946 --> 00:04:27,286 Assuming we have completed these 4 experiments - the 4 with open circles. 63 00:04:27,846 --> 00:04:30,576 And now when we analyze the data we discover 64 00:04:30,576 --> 00:04:33,736 that factor A is not significant from the Pareto plot. 65 00:04:33,736 --> 00:04:39,796 If A is not significant then it essentially implies that we could have ignored factor A, 66 00:04:39,796 --> 00:04:44,286 and never really needed to include it in our experiments. 67 00:04:44,286 --> 00:04:47,416 Another way of saying that, is that factor A could have been at the - 68 00:04:47,416 --> 00:04:52,896 level or at the + level, and it really wouldn't have affected our outcome variable much. 69 00:04:52,896 --> 00:04:58,016 If A can exist at two levels and not really affect our outcome, 70 00:04:58,486 --> 00:05:02,176 that means that we can collapse the minus and the plus layers together. 71 00:05:02,786 --> 00:05:04,346 And notice then what happens. 72 00:05:05,196 --> 00:05:13,266 As we do that, we recover 4 experiments in factors B and C. Four experiments 73 00:05:13,266 --> 00:05:15,956 in two factors; that's a full factorial! 74 00:05:16,536 --> 00:05:18,426 We don't have to do any more work here. 75 00:05:18,836 --> 00:05:24,496 These four experiments that we've already run, now complete a full factorial in factors B 76 00:05:24,496 --> 00:05:31,336 and C. In fact you can prove this to yourself for the case when factor B is not significant. 77 00:05:31,916 --> 00:05:35,756 Then it collapses to a full factorial in factor A and factor C. 78 00:05:35,756 --> 00:05:42,706 If factor C is not significant then it collapses to a full factorial in factor A and factor B. 
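The collapsing argument can be checked mechanically. The sketch below assumes the four selected runs are the ones where the sign of C equals the product of the signs of A and B (the construction given later in this video); which of the two sets is drawn with open circles is not stated here, and the same check works for the complementary set. The helper name `collapse` is hypothetical.

```python
from itertools import product

# The four selected runs as (A, B, C), with -1 = low level and +1 = high level.
# Assumption: these are the runs where C = A * B.
half_fraction = [(-1, -1, +1), (+1, -1, -1), (-1, +1, -1), (+1, +1, +1)]

# All four corner points of a full factorial in two factors
full_two_factor = set(product((-1, +1), repeat=2))


def collapse(runs, dropped):
    """Drop one factor (0 = A, 1 = B, 2 = C) and return the distinct settings left."""
    return {tuple(x for i, x in enumerate(run) if i != dropped) for run in runs}


for dropped, name in enumerate("ABC"):
    projected = collapse(half_fraction, dropped)
    print(f"Ignoring {name}: full factorial in the other two factors?",
          projected == full_two_factor)

# For contrast: the four runs at the front of the cube, where C is always low
front_face = [(-1, -1, -1), (+1, -1, -1), (-1, +1, -1), (+1, +1, -1)]
print("Front face, ignoring A:", sorted(collapse(front_face, 0)))
```

Whichever single factor turns out to be unimportant, the four runs still cover all four combinations of the other two. The front face does not have this property: ignoring A there leaves only two distinct points, both at low C.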
79 00:05:44,536 --> 00:05:49,016 So from that perspective, these are really a good set of 4 experiments to use. 80 00:05:49,016 --> 00:05:54,596 So now let's imagine that we've run only these 4 experiments. 81 00:05:54,596 --> 00:05:58,296 I'd like to show you how we could analyze the data and I'm going 82 00:05:58,296 --> 00:06:00,426 to use the water treatment example again. 83 00:06:01,426 --> 00:06:08,216 I hope you don't mind if I rename the factors to A, B, and C. I'm doing this because I want 84 00:06:08,216 --> 00:06:12,316 to use the water treatment example that you're comfortable with, but at the end, 85 00:06:12,646 --> 00:06:17,236 I want to extend what we learned here today to any system, and A, B, 86 00:06:17,236 --> 00:06:19,876 and C are the most generic way to do that. 87 00:06:21,626 --> 00:06:24,986 Now assume that each of these experiments were very expensive. 88 00:06:25,596 --> 00:06:28,046 Maybe they cost around $10,000 each. 89 00:06:28,046 --> 00:06:34,456 So instead of doing 8, let's assume we've only done these 4: half the work. 90 00:06:35,076 --> 00:06:38,686 Our boss is going to be pretty impressed that we've saved $40,000. 91 00:06:39,716 --> 00:06:42,386 Open the software and let's see what happens. 92 00:06:43,536 --> 00:06:49,086 Using the best choice design I talked about earlier, where you've only done experiments 2, 93 00:06:49,086 --> 00:06:56,606 3, 5 and 8 from the original set, I'm going to ask the software to create new variables for A, 94 00:06:56,926 --> 00:07:01,756 B and C, which only include those 4 experiments. 95 00:07:01,756 --> 00:07:04,316 And here are the 4 outcomes at those conditions. 96 00:07:05,076 --> 00:07:09,086 Now if you just go ahead and type in the code from the previous class, 97 00:07:09,086 --> 00:07:13,376 you can see that the software will create a model from A, B and C; 98 00:07:13,886 --> 00:07:16,786 and it includes 2 and 3 factor interactions. 
99 00:07:17,976 --> 00:07:22,176 But what you will notice that's different from last time, is all these NA terms. 100 00:07:22,916 --> 00:07:27,666 That NA stands for "Not Available"; those terms cannot be estimated. 101 00:07:28,826 --> 00:07:34,256 But we got 4 estimates of 4 coefficients, we ran 4 experiments so we expected that. 102 00:07:34,376 --> 00:07:37,086 The full model prediction has 8 parameters 103 00:07:37,086 --> 00:07:40,416 and would have required 8 experiments to calculate all 8 of them. 104 00:07:40,416 --> 00:07:48,206 Now I hope you're still curious about how I selected those four experiments to run. 105 00:07:48,336 --> 00:07:53,726 Hold that question in your mind, I'll come back to it, I promise. 106 00:07:53,726 --> 00:07:59,796 But I want to show you first what we lost by doing less work. 107 00:07:59,796 --> 00:08:03,336 That way you can judge whether it was worth it. 108 00:08:03,336 --> 00:08:08,496 Let me assume we've done all 8 experiments. 109 00:08:08,496 --> 00:08:15,946 And let me compare that to the case where we've only done 4 of the experiments. 110 00:08:15,946 --> 00:08:20,946 We're going to write out the two prediction models side-by-side 111 00:08:20,946 --> 00:08:23,696 so that you can see the differences between them. 112 00:08:23,696 --> 00:08:29,536 In this particular example, you can see that three of the terms are numerically similar; 113 00:08:29,536 --> 00:08:33,836 it's not going to lead to serious misinterpretation. 114 00:08:33,836 --> 00:08:35,656 However, there is one term that is very different. 115 00:08:35,656 --> 00:08:36,956 What has happened over there? 116 00:08:36,956 --> 00:08:40,886 I'm going to show you now how that reduced design was found. 117 00:08:40,886 --> 00:08:43,056 How did we come to that best choice? 118 00:08:43,056 --> 00:08:45,486 We call this a half fraction. 119 00:08:45,516 --> 00:08:51,466 The full set of experiments for 3 factors would've required 2 to the 3 experiments. 
120 00:08:51,546 --> 00:08:56,656 If we want to do half the work, then we can divide by 2 here, which is equal to 4. 121 00:08:56,656 --> 00:09:01,516 Or for those of you that remember your exponent rules, we could write this 122 00:09:01,516 --> 00:09:05,046 as 2 to the power of (3 minus 1). 123 00:09:05,046 --> 00:09:09,236 This equals 2 to the power of 2, which equals 4. 124 00:09:09,356 --> 00:09:13,806 There is a systematic way to select those four runs. 125 00:09:13,806 --> 00:09:20,176 Since we know that we will have 4 experiments, we can quite happily go ahead and write 126 00:09:20,176 --> 00:09:26,206 out our standard order table for the first two factors, A and B. We do this 127 00:09:26,206 --> 00:09:30,356 because we know two factors require 4 experiments. 128 00:09:30,576 --> 00:09:35,626 Okay, but what about that third factor, factor C? 129 00:09:35,626 --> 00:09:39,966 At what settings should we write out that factor? 130 00:09:39,966 --> 00:09:47,016 We write it out as C equals A times B. In fact, we say "generate factor C as A times B". 131 00:09:47,016 --> 00:09:56,256 So there we have that factor C is equal to +, -, -, + for the 4 experiments; the multiplication 132 00:09:56,256 --> 00:10:01,526 of the values in column A and column B. Let's visualize 133 00:10:01,526 --> 00:10:06,126 where those 4 points are on the original cube. 134 00:10:06,236 --> 00:10:11,866 The first row is at low A and low B, and high C, so it appears here. 135 00:10:11,866 --> 00:10:17,286 The next point is at high A, low B, and then low C. So that's over here. 136 00:10:17,286 --> 00:10:22,186 The third experiment is there, and the last experiment is at high A, high B, 137 00:10:22,186 --> 00:10:27,376 and high C. Notice how that corresponds to the ideal selection 138 00:10:27,376 --> 00:10:30,846 of four experiments we made at the start of this video. 
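The recipe just described (write out A and B in standard order, then generate C as A times B) can be sketched as follows. The -1/+1 coding and the variable names are assumptions made for this illustration; the video itself uses different software.

```python
from itertools import product

# Standard order table for the first two factors, A and B (A changes fastest).
base = [(a, b) for b, a in product((-1, +1), repeat=2)]

# Generate the third factor as C = A * B.
half_fraction = [(a, b, a * b) for a, b in base]

# Full 2**3 design in standard order, used to look up the original run numbers.
full = [(a, b, c) for c, b, a in product((-1, +1), repeat=3)]

print("Runs in the half fraction:", 2 ** (3 - 1))
for run in half_fraction:
    print(run, "-> original experiment", full.index(run) + 1)

c_column = [c for _, _, c in half_fraction]
run_numbers = sorted(full.index(run) + 1 for run in half_fraction)
print("C column:", c_column)
print("Original experiments used:", run_numbers)
```

The generated C column is +, -, -, + and the four rows are experiments 2, 3, 5 and 8 of the original eight, matching the subset used in the software demonstration earlier.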
139 00:10:30,846 --> 00:10:36,766 In the next video I'm going to show you where I got that rule where C should equal A times B. 140 00:10:36,766 --> 00:10:40,666 So let's understand the trade off here. 141 00:10:40,666 --> 00:10:45,996 If we do half the amount of experiments we have to accept 142 00:10:46,356 --> 00:10:49,696 that we get less information from the system. 143 00:10:49,696 --> 00:10:54,436 I guess you can say there's no such thing as a free lunch. 144 00:10:54,436 --> 00:10:56,286 You can't get something for nothing. 145 00:10:56,286 --> 00:11:00,836 The question is: "what is the penalty for doing fewer experiments?" 146 00:11:00,836 --> 00:11:04,046 "What is this free lunch costing me?" 147 00:11:04,416 --> 00:11:07,966 I mean, if we had paid an extra $40,000, 148 00:11:07,966 --> 00:11:12,706 and did the extra four experiments we'd have that extra information. 149 00:11:12,706 --> 00:11:14,666 You can already see that over here. 150 00:11:14,736 --> 00:11:20,276 We had some good estimates of the three parameters. 151 00:11:20,276 --> 00:11:23,146 The intercept, the A main effect, the C main effect. 152 00:11:23,146 --> 00:11:26,516 But the B main effect was actually quite wrong. 153 00:11:26,516 --> 00:11:33,326 Also you notice that we didn't get any estimates of the two-factor interactions. 154 00:11:34,236 --> 00:11:40,346 Let me drop in two words that we will come back to in later classes. 155 00:11:40,346 --> 00:11:42,526 "Screening" and "optimization". 156 00:11:42,526 --> 00:11:48,416 When we are screening, we don't mind having reduced knowledge of the system. 157 00:11:48,416 --> 00:11:53,046 For example, we don't mind if the two-factor interactions are not all known, 158 00:11:53,586 --> 00:11:58,126 or if the estimates of the factors are not quite correct. 
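One way to see why only four of the eight coefficients could be estimated is to write out every column of the full model at the four chosen runs. This sketch uses the design settings only, with no response data, and assumes the half fraction generated by C = A times B. The details are left for the next class; this merely shows the pattern.

```python
# The four runs of the half fraction as (A, B, C), with C = A * B.
half_fraction = [(-1, -1, +1), (+1, -1, -1), (-1, +1, -1), (+1, +1, +1)]

# Every column of the full eight-term model, evaluated at the four runs.
columns = {
    "intercept": [1 for a, b, c in half_fraction],
    "A": [a for a, b, c in half_fraction],
    "B": [b for a, b, c in half_fraction],
    "C": [c for a, b, c in half_fraction],
    "AB": [a * b for a, b, c in half_fraction],
    "AC": [a * c for a, b, c in half_fraction],
    "BC": [b * c for a, b, c in half_fraction],
    "ABC": [a * b * c for a, b, c in half_fraction],
}

# Group the terms whose columns are identical across the four runs.
groups = {}
for name, column in columns.items():
    groups.setdefault(tuple(column), []).append(name)

for names in groups.values():
    print(" = ".join(names))
```

The eight columns fall into four identical pairs, so four runs can only separate four quantities. Within each pair the data cannot tell the two terms apart, which is consistent with the software reporting the interaction terms as NA.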
159 00:11:58,126 --> 00:12:03,296 Later on, when optimizing, though, we want more specific information about the system: 160 00:12:03,296 --> 00:12:05,076 a better level of prediction accuracy. 161 00:12:05,076 --> 00:12:12,526 That is the point at which we will require better resolution of the main effects and interactions. 162 00:12:12,526 --> 00:12:18,646 So this is what the $40,000 is costing us: a reduction in the model's prediction quality. 163 00:12:18,646 --> 00:12:22,466 You could ask whether that's worth the money saved. 164 00:12:22,466 --> 00:12:24,496 Well, you'll never really know the correct answer, 165 00:12:24,496 --> 00:12:27,536 unless you do the full set of experiments. 166 00:12:27,536 --> 00:12:32,236 But I'm going to show you how we can make some educated guesses later in this module. 167 00:12:32,396 --> 00:12:36,636 What we've done here by not running those extra experiments, is, 168 00:12:36,636 --> 00:12:39,936 we've rather cleverly selected a subset of them to save $40,000. 169 00:12:39,936 --> 00:12:42,226 We can use this money later on, 170 00:12:42,226 --> 00:12:49,596 for when we require a more detailed model to find the optimum in the system. 171 00:12:49,596 --> 00:12:55,606 George Box, the famous statistician from whose textbook we're using this example, 172 00:12:55,606 --> 00:12:58,116 said, as a rough rule, that only a portion, about 25%, of the experimental effort 173 00:12:58,146 --> 00:12:59,856 and budget should be invested in the first experimental designs. 174 00:12:59,886 --> 00:13:00,546 I paraphrased that slightly. 175 00:13:00,576 --> 00:13:02,196 But basically he is saying that you should leave some money, and time, 176 00:13:02,226 --> 00:13:03,216 for later on to figure out the details. 177 00:13:03,246 --> 00:13:05,376 In the beginning you don't even know yet if A, B or C are actually significant. 
178 00:13:05,406 --> 00:13:06,876 First figure that out before you go build a detailed model, 179 00:13:06,906 --> 00:13:08,016 with two-factor and three-factor interactions. 180 00:13:08,046 --> 00:13:09,156 That's where we're going to leave the class today. 181 00:13:09,186 --> 00:13:11,316 We've shown you the end point: that when you do half the work, you lose a bit of accuracy 182 00:13:11,346 --> 00:13:12,876 in your model, but there's a great built-in backup strategy 183 00:13:12,906 --> 00:13:14,316 in the clever selection of which half of the work to do. 184 00:13:14,346 --> 00:13:16,176 I guess you could say at least be smart about which half of the work to do. 185 00:13:16,206 --> 00:13:18,156 In the next class we're going to learn the technical terms and the mechanics 186 00:13:18,186 --> 00:13:19,056 around creating these half-fractions.