1 00:00:00,636 --> 00:00:03,966 In this lecture, we're going to look at the term "disturbances". 2 00:00:04,656 --> 00:00:08,576 We're going to gain an insight into why we randomize our experiments, 3 00:00:08,996 --> 00:00:12,026 a concept that we've seen several times in the course so far. 4 00:00:12,906 --> 00:00:15,856 In many ways, this is one of the more important videos. 5 00:00:17,036 --> 00:00:20,536 Now, I could have covered the topic of randomization a little earlier, 6 00:00:21,146 --> 00:00:25,186 but I needed the terminology of confounding from the previous class 7 00:00:25,186 --> 00:00:27,886 to clearly explain the purpose of randomization. 8 00:00:28,826 --> 00:00:32,756 Also, by waiting, you've started asking really good questions 9 00:00:32,756 --> 00:00:35,016 about randomization on the course forums. 10 00:00:35,716 --> 00:00:39,036 One of the best times to learn anything is just when you need 11 00:00:39,036 --> 00:00:41,456 that information: just-in-time learning. 12 00:00:42,266 --> 00:00:44,216 Now that you've got all these questions in your mind 13 00:00:44,216 --> 00:00:47,676 about randomization, we're ready to learn about it. 14 00:00:48,436 --> 00:00:54,556 To do so, we must understand the nature of variables or factors in our experiments. 15 00:00:54,556 --> 00:00:57,326 We can categorize our variables in several ways. 16 00:00:58,596 --> 00:01:02,356 The first way is to talk about variables that we know about, 17 00:01:02,356 --> 00:01:05,106 and those that we don't know about, the unknowns. 18 00:01:05,106 --> 00:01:10,386 The second way is to talk about variables that we can control 19 00:01:10,386 --> 00:01:11,816 and those that we cannot control. 20 00:01:12,756 --> 00:01:17,146 And thirdly, we can consider variables that we can measure and those that we cannot measure. 21 00:01:17,146 --> 00:01:20,866 We mostly deal with variables we know about. 
22 00:01:20,866 --> 00:01:23,846 So the first distinction is a little unnecessary, 23 00:01:24,086 --> 00:01:26,126 but we will see where we use it later on. 24 00:01:27,416 --> 00:01:33,156 There almost certainly are variables in your system that will affect your outcome, 25 00:01:33,626 --> 00:01:36,786 that you did not think about when you started experimenting. 26 00:01:37,326 --> 00:01:41,826 And you'll see why these unknown variables can play an important role later on. 27 00:01:42,446 --> 00:01:45,726 I'm going to walk through an example to show what I mean 28 00:01:45,726 --> 00:01:48,356 by controllable and measurable factors. 29 00:01:49,046 --> 00:01:53,976 Once you understand the terminology, you will see why randomization is so crucial. 30 00:02:00,386 --> 00:02:06,036 A great example was posted on the forums about why randomization must be done. 31 00:02:06,036 --> 00:02:08,306 Let's take a look. 32 00:02:08,306 --> 00:02:12,776 Imagine we're baking ginger biscuits and I'm investigating 33 00:02:12,776 --> 00:02:17,406 that with eight experiments because I have three factors. 34 00:02:17,406 --> 00:02:24,776 I decide I'm going to do all these experiments in one day. 35 00:02:24,776 --> 00:02:30,176 So I start in the morning and by the end of the day, I'm going to be pretty tired. 36 00:02:30,176 --> 00:02:35,866 When I mix the ingredients by hand, I may not be doing it so well by the end of the day. 37 00:02:35,866 --> 00:02:45,966 Tiredness is something I cannot control in the experiments if I do them all in one go. 38 00:02:46,026 --> 00:02:49,656 And I am also not able to measure my tiredness. 39 00:02:49,656 --> 00:02:52,516 But being tired will affect some of the outcome variables, 40 00:02:52,516 --> 00:02:56,726 such as the texture of the biscuit if I am not mixing things properly. 
41 00:02:56,726 --> 00:02:59,136 A disturbance is defined as something that you do not have control over, 42 00:02:59,136 --> 00:03:01,966 and something that you are not able to measure. 43 00:03:01,966 --> 00:03:07,436 So tiredness is a variable in our system that we call a disturbance: 44 00:03:07,436 --> 00:03:10,386 it's uncontrolled and unmeasured. 45 00:03:10,386 --> 00:03:15,686 I'd like to note here that the term control refers to your ability to adjust the variable. 46 00:03:15,686 --> 00:03:21,756 When we say you cannot control something, it means you cannot actively change it. 47 00:03:21,756 --> 00:03:27,166 For example, you are not able to change the temperature outside or the humidity outside. 48 00:03:27,166 --> 00:03:31,276 But you can control the temperature inside your house if you have air conditioning 49 00:03:31,336 --> 00:03:36,866 and you might also be able to control the humidity in your home. 50 00:03:36,926 --> 00:03:39,336 When I say that you're not able to measure something, 51 00:03:39,336 --> 00:03:45,916 I don't mean that it's impossible to measure it or quantify it. 52 00:03:45,916 --> 00:03:53,336 I just mean that you might not have the right tool, instrument, or way to do it. 53 00:03:53,386 --> 00:03:55,836 Humidity could affect my system, but I might not have enough money 54 00:03:55,876 --> 00:03:59,146 to buy a decent humidity sensor. 55 00:03:59,146 --> 00:04:02,196 That's why tiredness is a disturbance. 56 00:04:02,196 --> 00:04:05,756 I cannot control it, and I don't have a way to reliably measure it. 57 00:04:05,756 --> 00:04:07,576 Back to the baking. 58 00:04:07,576 --> 00:04:15,706 You can visualize that there's this increasing tiredness factor taking place over time. 59 00:04:15,776 --> 00:04:19,256 Now here are the eight experiments written in standard order. 
60 00:04:19,256 --> 00:04:24,636 If I choose to run the experiments in that same order, 61 00:04:24,946 --> 00:04:28,356 you will notice that the third factor, baking time, 62 00:04:28,426 --> 00:04:33,436 will have four low-level experiments first and then four high-level experiments last. 63 00:04:33,436 --> 00:04:40,616 Those first four experiments were run when I had lots of energy and was mixing my ingredients well. 64 00:04:40,616 --> 00:04:43,396 The last four experiments were run when I was tired. 65 00:04:43,396 --> 00:04:48,076 What has happened here is that we've confounded the effects of tiredness 66 00:04:48,076 --> 00:04:49,416 with this last factor of baking time. 67 00:04:49,416 --> 00:04:52,306 If I analyze my experimental results, 68 00:04:52,306 --> 00:04:57,346 I might find that baking time has an important effect on the outcome. 69 00:04:57,346 --> 00:05:00,596 Was it baking time that caused it or was it me getting more 70 00:05:00,596 --> 00:05:04,786 and more tired throughout the experiments? 71 00:05:04,786 --> 00:05:10,416 You can also imagine a situation where we find that baking time has no effect. 72 00:05:10,416 --> 00:05:18,276 Maybe baking time actually did have an effect, but it was counteracted or cancelled 73 00:05:18,276 --> 00:05:21,596 out by this unmeasured tiredness factor due to confounding. 74 00:05:21,996 --> 00:05:26,556 Both of these hypothetical examples would lead to a wrong conclusion. 75 00:05:26,556 --> 00:05:29,176 Now recall that in an earlier video, I said 76 00:05:29,176 --> 00:05:36,556 that computer simulation experiments generally don't need to be randomized. 77 00:05:37,956 --> 00:05:39,076 Now you can see why. 78 00:05:39,076 --> 00:05:42,956 Those types of experiments generally don't have unmeasured 79 00:05:42,956 --> 00:05:46,136 and uncontrolled disturbances happening. 
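The confounding between run order and baking time described above can be sketched with a small simulation. This is only an illustration under stated assumptions: the texture score, the baseline of 10, and the drift of -0.5 per run are hypothetical numbers chosen for the sketch, and the helper names `standard_order`, `simulate_texture` and `main_effect` are not from the lecture. The true effect of baking time (factor C) is set to zero, so any apparent effect comes from the tiredness drift alone.

```python
import itertools
import random


def standard_order(n_factors=3):
    """Two-level full factorial in standard order: factor A changes fastest."""
    # itertools.product varies its last element fastest, so reverse each tuple.
    return [tuple(reversed(levels))
            for levels in itertools.product((-1, +1), repeat=n_factors)]


def simulate_texture(runs, true_effect_c=0.0, drift_per_run=-0.5):
    """Hypothetical outcome: baseline + effect of C + tiredness drift by run position."""
    return [10.0 + (true_effect_c / 2) * run[2] + drift_per_run * position
            for position, run in enumerate(runs)]


def main_effect(runs, y, factor):
    """Mean outcome at the high level minus mean outcome at the low level."""
    high = [yi for run, yi in zip(runs, y) if run[factor] == +1]
    low = [yi for run, yi in zip(runs, y) if run[factor] == -1]
    return sum(high) / len(high) - sum(low) / len(low)


runs = standard_order()

# Standard order: baking time (index 2) is low for runs 1-4 and high for
# runs 5-8, so the tiredness drift is attributed to baking time.
effect_standard = main_effect(runs, simulate_texture(runs), factor=2)
print("apparent effect of C, standard order:", effect_standard)

# Randomized run order: the same eight runs, shuffled many times.
rng = random.Random(0)  # fixed seed only so the sketch is repeatable
effects = []
for _ in range(2000):
    shuffled = runs[:]
    rng.shuffle(shuffled)
    effects.append(main_effect(shuffled, simulate_texture(shuffled), factor=2))
average_effect_randomized = sum(effects) / len(effects)
print("average apparent effect of C over random orders:", average_effect_randomized)
```

In standard order the sketch reports an apparent baking-time effect of -2.0 even though the true effect was set to zero. Averaged over many random run orders, the apparent effect is close to zero. A single randomized order can still show some apparent effect; randomization does not remove the disturbance, it stops the disturbance from lining up systematically with one factor.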
80 00:05:46,136 --> 00:05:52,336 If you repeat the computer experiment today, tomorrow or next week, 81 00:05:52,336 --> 00:05:54,336 you should obtain the same results. 82 00:05:54,336 --> 00:05:59,996 So you must randomize your experiments if you are not running your experiments 83 00:05:59,996 --> 00:06:03,686 in a very controlled environment like a computer simulation. 84 00:06:03,686 --> 00:06:08,086 In the discussion forums, a number of you have proposed experimental designs 85 00:06:10,806 --> 00:06:19,296 that might be confounded by disturbances if you don't randomize the trials. 86 00:06:19,496 --> 00:06:28,236 For example, if you experiment with a way to learn a new language or multiplication tables 87 00:06:28,236 --> 00:06:34,186 for your child or going to the gym or memorizing 20 digits, every time you do one 88 00:06:34,186 --> 00:06:39,396 of those experiments you are naturally going to get better and better, 89 00:06:39,466 --> 00:06:48,326 simply because you are practicing and not necessarily due to the experimental conditions. 90 00:06:48,326 --> 00:06:58,266 In other systems, simply experimenting makes the system worse and worse. 91 00:06:58,266 --> 00:07:03,596 For example, if you run tests for gas mileage, 92 00:07:03,596 --> 00:07:08,916 your car is naturally deteriorating over time at a very slow rate. 93 00:07:08,916 --> 00:07:14,396 In the baking experiments, I was getting tired. 94 00:07:14,396 --> 00:07:20,446 People in the chemical industry know that our equipment slowly deteriorates over time, 95 00:07:20,526 --> 00:07:26,316 and we shut it down periodically to clean it out and restart it. 96 00:07:26,316 --> 00:07:31,616 You can thank the second law of thermodynamics for that unhelpful feature. 97 00:07:31,616 --> 00:07:36,786 Now you can see why it is so important to randomize. 98 00:07:36,786 --> 00:07:42,906 It ensures that our system is minimally affected by variables which we cannot control. 
99 00:07:42,906 --> 00:07:50,176 Minimally affected by variables which we cannot measure and also minimally affected 100 00:07:50,176 --> 00:07:54,026 by variables which we don't even know about. 101 00:07:54,026 --> 00:08:00,156 When we randomize, we do so in order not to be confounded or confused 102 00:08:00,436 --> 00:08:05,146 by the uncontrolled and unmeasured disturbances taking place. 103 00:08:05,526 --> 00:08:09,066 However, there are disturbances that we can measure. 104 00:08:09,066 --> 00:08:13,626 The advice here is that you should record such variables and add them 105 00:08:13,626 --> 00:08:17,636 to your table of results as extra columns. 106 00:08:17,636 --> 00:08:22,396 In the baking example, I may not be able to control the temperature 107 00:08:22,396 --> 00:08:25,366 in my house, but I can still measure it. 108 00:08:25,366 --> 00:08:28,116 It might be possible that I could measure the humidity. 109 00:08:28,116 --> 00:08:36,446 Humidity plays an important role for certain baked products and if I can measure it, 110 00:08:36,446 --> 00:08:40,486 I should do so and add it to my table. 111 00:08:40,486 --> 00:08:47,096 We call such additional measurements covariates, which is a term for a variable 112 00:08:47,096 --> 00:08:52,546 that is not really our primary focus, but might still vary 113 00:08:52,546 --> 00:08:56,416 and potentially influence the outcome variable directly. 114 00:08:56,416 --> 00:09:01,656 Or, as is more likely the case, that covariate will affect one of the factors 115 00:09:01,656 --> 00:09:06,736 in our experiment, which then in turn influences the outcome. 116 00:09:06,736 --> 00:09:10,516 You can use this extra information from the covariates in two ways. 117 00:09:10,516 --> 00:09:14,536 Firstly, the simplest way is to use them 118 00:09:14,536 --> 00:09:19,306 to understand unusual results after the fact. 
119 00:09:19,306 --> 00:09:27,106 If one of your experiments had, for example, an unusually low outcome value, 120 00:09:27,106 --> 00:09:29,406 the reason might be a covariate. 121 00:09:29,406 --> 00:09:34,946 The second way we could use it, for those of you who have an understanding of least squares, 122 00:09:34,946 --> 00:09:41,416 is to add these covariates as additional regression variables 123 00:09:41,416 --> 00:09:44,626 in our model to separate out their effect. 124 00:09:44,626 --> 00:09:49,186 There's more to say on that topic, but now that you know the terminology, you're in a position 125 00:09:49,186 --> 00:09:54,176 to do the extra research outside of this class and see how you could deal with covariate data. 126 00:09:54,176 --> 00:09:56,586 Let's test your knowledge. 127 00:09:56,586 --> 00:10:06,676 We're going to use this example in the next video, so try to remember it for next time. 128 00:10:06,676 --> 00:10:10,926 The example was inspired by an article that appeared in the Harvard Business Review. 129 00:10:10,926 --> 00:10:12,816 You are marketing a calendaring app for a cellphone; let's call it Cal app. 130 00:10:12,846 --> 00:10:13,956 The basic functionality in the app is free, 131 00:10:13,986 --> 00:10:15,966 but inside the app the users can pay small amounts to upgrade various features. 132 00:10:15,996 --> 00:10:16,806 These are known as in-app purchases. 133 00:10:16,836 --> 00:10:18,846 For example, you could pay $1 and get the sync-to-other-devices feature 134 00:10:18,876 --> 00:10:20,076 or pay another dollar extra for SMS reminders. 135 00:10:20,106 --> 00:10:21,576 Or you might really need that extra feature for integration 136 00:10:21,606 --> 00:10:23,136 with your desktop calendaring application, and you guessed it, 137 00:10:23,166 --> 00:10:24,096 that's going to cost you another dollar. 138 00:10:24,126 --> 00:10:26,196 So your company has created the app, but it's your job to sell and promote it. 
139 00:10:26,226 --> 00:10:28,536 Each experiment you do might involve about 2,000 people, and you'll measure the percentage 140 00:10:28,566 --> 00:10:30,126 of those people that are still using the app after 60 days. 141 00:10:30,156 --> 00:10:30,846 That's your outcome variable. 142 00:10:30,876 --> 00:10:32,196 The factors that you consider might be, for example, 143 00:10:32,226 --> 00:10:34,356 factor A is whether you provide a single free in-app upgrade for one of the features 144 00:10:34,386 --> 00:10:36,216 or at the plus level you provide a 30-day trial for all the features. 145 00:10:36,246 --> 00:10:38,256 Factor B might be that the sales message is: "This app has your schedule available 146 00:10:38,286 --> 00:10:39,156 at your fingertips on any device". 147 00:10:39,186 --> 00:10:40,716 Or the alternative message: "Cal app features are configurable, 148 00:10:40,746 --> 00:10:41,586 only pay for the features you want". 149 00:10:41,616 --> 00:10:43,566 Factor C might be the price for in-app purchases: $0.89 per feature. 150 00:10:43,596 --> 00:10:45,486 Or at the plus level, the price might be $0.99 per feature. 151 00:10:45,516 --> 00:10:46,086 So those were the factors. 152 00:10:46,116 --> 00:10:46,956 Now consider these other variables. 153 00:10:46,986 --> 00:10:48,276 Determine whether they are disturbances or covariates. 154 00:10:48,306 --> 00:10:50,406 Note that a variety of answers could be correct, depending on the assumptions you make 155 00:10:50,436 --> 00:10:52,026 and on your knowledge regarding the capability of cellphone apps. 156 00:10:52,056 --> 00:10:53,886 So, in summary, the distinction between covariates and disturbances is 157 00:10:53,916 --> 00:10:55,356 that covariates are measurable, while disturbances are not. 158 00:10:55,386 --> 00:10:56,166 Neither of them is controllable. 
159 00:10:56,196 --> 00:10:57,846 In the next video, we are going to consider a subtle difference 160 00:10:57,876 --> 00:10:58,926 in the variables that can be controlled.
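Returning to the earlier suggestion of adding covariates as additional regression variables, here is a minimal sketch of what that could look like for the baking example. Everything specific in it is an assumption made for illustration: the run order, the humidity readings, the "true" coefficients used to generate the outcome, and the helper name `least_squares` are all hypothetical. The outcome is generated without noise, so the fit should recover the assumed coefficients; real data would not behave so neatly.

```python
def least_squares(X, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    M = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(k)]
    for col in range(k):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, k), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(k):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * b for a, b in zip(M[r], M[col])]
    return [M[i][k] / M[i][i] for i in range(k)]


# Hypothetical randomized run order for factors (A, B, C) in coded units.
runs = [(+1, -1, +1), (-1, -1, -1), (+1, +1, -1), (-1, +1, +1),
        (-1, -1, +1), (+1, +1, +1), (+1, -1, -1), (-1, +1, -1)]

# Hypothetical humidity readings (%), recorded as an extra column per run.
humidity = [52, 47, 61, 44, 58, 49, 55, 63]

# Hypothetical noise-free outcome: intercept, A, B, C and humidity terms.
y = [5.0 + 1.0 * a + 0.5 * b - 1.5 * c + 0.1 * h
     for (a, b, c), h in zip(runs, humidity)]

# Model matrix: intercept, the three factors, and the covariate.
X = [[1.0, a, b, c, h] for (a, b, c), h in zip(runs, humidity)]

coefficients = least_squares(X, y)
for name, value in zip(["intercept", "A", "B", "C", "humidity"], coefficients):
    print(f"{name:>9}: {value:.3f}")
```

In coded units each factor coefficient is half of the corresponding main effect. Including the humidity column lets the fit attribute humidity's contribution to its own term instead of letting it leak into the factor estimates.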