1
00:00:00,636 --> 00:00:03,966
In this lecture, we're going to
look at the term "disturbances".

2
00:00:04,656 --> 00:00:08,576
We're going to gain an insight into
why we randomize our experiments,

3
00:00:08,996 --> 00:00:12,026
the concept that we've seen
several times in the course so far.

4
00:00:12,906 --> 00:00:15,856
In many ways, this is one of
the more important videos.

5
00:00:17,036 --> 00:00:20,536
Now, I could have covered the topic
of randomization a little earlier,

6
00:00:21,146 --> 00:00:25,186
but I needed the terminology of
confounding from the previous class

7
00:00:25,186 --> 00:00:27,886
to clearly explain the purpose of randomization.

8
00:00:28,826 --> 00:00:32,756
Also, by waiting, you've started
asking really good questions

9
00:00:32,756 --> 00:00:35,016
about randomization on the course forums.

10
00:00:35,716 --> 00:00:39,036
One of the best times to learn
anything is just when you need

11
00:00:39,036 --> 00:00:41,456
that information: just-in-time learning.

12
00:00:42,266 --> 00:00:44,216
Now that you've got all these
questions in your mind

13
00:00:44,216 --> 00:00:47,676
about randomization, we're
ready to learn about it.

14
00:00:48,436 --> 00:00:54,556
To do so, we must understand the nature of
variables or factors in our experiments.

15
00:00:54,556 --> 00:00:57,326
We can categorize our variables in several ways.

16
00:00:58,596 --> 00:01:02,356
The first way is to talk about
variables that we know about,

17
00:01:02,356 --> 00:01:05,106
and those that we don't know
about, the unknowns.

18
00:01:05,106 --> 00:01:10,386
The second way is to talk
about variables that we can control

19
00:01:10,386 --> 00:01:11,816
and those that we cannot control.

20
00:01:12,756 --> 00:01:17,146
And thirdly, we can consider variables that we
can measure and those that we cannot measure.

21
00:01:17,146 --> 00:01:20,866
We mostly deal with variables we know about.

22
00:01:20,866 --> 00:01:23,846
So the first distinction
is a little unnecessary,

23
00:01:24,086 --> 00:01:26,126
but we will see where we use it later on.

24
00:01:27,416 --> 00:01:33,156
There almost certainly are variables in
your system that will affect your outcome,

25
00:01:33,626 --> 00:01:36,786
that you did not think about
when you started experimenting.

26
00:01:37,326 --> 00:01:41,826
And you'll see why these unknown variables
can play an important role later on.

27
00:01:42,446 --> 00:01:45,726
I'm going to walk through an
example to show what I mean

28
00:01:45,726 --> 00:01:48,356
by controllable and measurable factors.

29
00:01:49,046 --> 00:01:53,976
Once you understand the terminology, you
will see why randomization is so crucial.

30
00:02:00,386 --> 00:02:06,036
A great example was posted on the forums
about why randomization must be done.

31
00:02:06,036 --> 00:02:08,306
Let's take a look.

32
00:02:08,306 --> 00:02:12,776
Imagine we're baking ginger
biscuits and I'm investigating

33
00:02:12,776 --> 00:02:17,406
that with eight experiments
because I have three factors, each at two levels.

34
00:02:17,406 --> 00:02:24,776
I decide I'm going to do all
these experiments in one day.

35
00:02:24,776 --> 00:02:30,176
So I start in the morning and by the end
of the day, I'm going to be pretty tired.

36
00:02:30,176 --> 00:02:35,866
When I mix the ingredients by hand, I may not
be doing it so well by the end of the day.

37
00:02:35,866 --> 00:02:45,966
Tiredness is something I cannot control in
the experiments if I do them all in one go.

38
00:02:46,026 --> 00:02:49,656
And I am also not able to measure my tiredness.

39
00:02:49,656 --> 00:02:52,516
But being tired will affect
some of the outcome variables,

40
00:02:52,516 --> 00:02:56,726
such as the texture of the biscuit
if I am not mixing things properly.

41
00:02:56,726 --> 00:02:59,136
A disturbance is defined as something
that you do not have control over,

42
00:02:59,136 --> 00:03:01,966
and something that you are not able to measure.

43
00:03:01,966 --> 00:03:07,436
So tiredness is a variable in our
system that we call a disturbance,

44
00:03:07,436 --> 00:03:10,386
it's uncontrolled and unmeasured.

45
00:03:10,386 --> 00:03:15,686
I'd like to note here that the term control
refers to your ability to adjust the variable.

46
00:03:15,686 --> 00:03:21,756
When we say you cannot control something,
it means you cannot actively change it.

47
00:03:21,756 --> 00:03:27,166
For example, you are not able to change the
temperature outside or the humidity outside.

48
00:03:27,166 --> 00:03:31,276
But you can control the temperature inside
your house if you have air conditioning

49
00:03:31,336 --> 00:03:36,866
and you might also be able to
control the humidity in your home.

50
00:03:36,926 --> 00:03:39,336
When I say that you're not
able to measure something,

51
00:03:39,336 --> 00:03:45,916
I don't mean that it's impossible
to measure it or quantify it.

52
00:03:45,916 --> 00:03:53,336
I just mean that you might not have the right
tool, instrument, or method to do it.

53
00:03:53,386 --> 00:03:55,836
Humidity could affect my system,
but I might not have enough money

54
00:03:55,876 --> 00:03:59,146
to buy a decent humidity sensor.

55
00:03:59,146 --> 00:04:02,196
That's why tiredness is a disturbance.

56
00:04:02,196 --> 00:04:05,756
I cannot control it, and I don't
have a way to reliably measure it.

57
00:04:05,756 --> 00:04:07,576
Back to the baking.

58
00:04:07,576 --> 00:04:15,706
You can visualize that there's this increasing
tiredness factor taking place over time.

59
00:04:15,776 --> 00:04:19,256
Now here are the eight experiments
written in standard order.

60
00:04:19,256 --> 00:04:24,636
If I chose to run the experiments
in that same order,

61
00:04:24,946 --> 00:04:28,356
you will notice that the third
factor, baking time,

62
00:04:28,426 --> 00:04:33,436
will have four low-level experiments
first and then four high-level experiments last.

63
00:04:33,436 --> 00:04:40,616
Those first four experiments were run when I had
lots of energy and was mixing my ingredients well.

64
00:04:40,616 --> 00:04:43,396
The last four experiments
were run when I was tired.

65
00:04:43,396 --> 00:04:48,076
What has happened here is that we've
confounded the effects of tiredness

66
00:04:48,076 --> 00:04:49,416
with this last factor of baking time.

67
00:04:49,416 --> 00:04:52,306
If I analyze my experimental results,

68
00:04:52,306 --> 00:04:57,346
I might find that baking time has
an important effect on the outcome.

69
00:04:57,346 --> 00:05:00,596
Was it baking time that caused
it or was it me getting more

70
00:05:00,596 --> 00:05:04,786
and more tired throughout the experiments?

71
00:05:04,786 --> 00:05:10,416
You can also imagine a situation where
we find that baking time has no effect.

72
00:05:10,416 --> 00:05:18,276
But maybe baking time actually did have an
effect, but it was counteracted or cancelled

73
00:05:18,276 --> 00:05:21,596
out by this unmeasured tiredness
factor due to confounding.

74
00:05:21,996 --> 00:05:26,556
Both of those hypothetical examples
would lead to a wrong conclusion.
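
Both situations can be reproduced in a small simulation. This is only a sketch under assumed numbers: the true effects, the linear tiredness drift of -0.5 per run, and the names `standard_order`, `simulate` and `main_effect` are all hypothetical and are not taken from the course materials.

```python
from itertools import product

def standard_order(k=3):
    """2^k full factorial in standard order (first factor alternates fastest)."""
    # product() varies its last element fastest, so reverse each row.
    return [tuple(reversed(row)) for row in product((-1, +1), repeat=k)]

def simulate(design, true_effects, drift_per_run):
    """Hypothetical outcome: half of each true effect times its coded level,
    plus a tiredness drift that grows linearly with run position."""
    return [
        sum(0.5 * e * x for e, x in zip(true_effects, run)) + drift_per_run * i
        for i, run in enumerate(design)
    ]

def main_effect(design, outcomes, factor):
    """Average outcome at the high level minus average at the low level."""
    high = [y for run, y in zip(design, outcomes) if run[factor] == +1]
    low = [y for run, y in zip(design, outcomes) if run[factor] == -1]
    return sum(high) / len(high) - sum(low) / len(low)

design = standard_order(3)            # runs executed in standard order
true_effects = (2.0, 1.0, 0.0)        # assumed: factor C (baking time) truly does nothing
outcomes = simulate(design, true_effects, drift_per_run=-0.5)
estimates = [main_effect(design, outcomes, f) for f in range(3)]

for name, true, est in zip("ABC", true_effects, estimates):
    print(f"factor {name}: true effect {true:+.1f}, estimated {est:+.1f}")
```

With these assumed numbers, factor C appears to have an effect of -2.0 although its true effect is zero, and factor B appears to have no effect although its true effect is 1.0, because the drift happens to cancel it exactly.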

75
00:05:26,556 --> 00:05:29,176
Now recall that in an earlier video, I said

76
00:05:29,176 --> 00:05:36,556
that computer simulation experiments
generally don't need to be randomized.

77
00:05:37,956 --> 00:05:39,076
Now you can see why.

78
00:05:39,076 --> 00:05:42,956
Those types of experiments
generally don't have unmeasured

79
00:05:42,956 --> 00:05:46,136
and uncontrolled disturbances happening.

80
00:05:46,136 --> 00:05:52,336
If you repeat the computer experiment,
today, tomorrow or next week,

81
00:05:52,336 --> 00:05:54,336
you should obtain the same results.

82
00:05:54,336 --> 00:05:59,996
So you must randomize your experiments
if you are not running your experiments

83
00:05:59,996 --> 00:06:03,686
in a very controlled environment
like a computer simulation.

84
00:06:03,686 --> 00:06:08,086
In the discussion forums, a number of
you have proposed experimental designs

85
00:06:10,806 --> 00:06:19,296
that might be confounded by disturbances,
if you don't randomize the trials.

86
00:06:19,496 --> 00:06:28,236
For example, if you experiment with a way to
learn a new language or multiplication tables

87
00:06:28,236 --> 00:06:34,186
for your child, or going to the gym, or
memorizing 20 digits, every time you do one

88
00:06:34,186 --> 00:06:39,396
of those experiments you are naturally
going to get better and better,

89
00:06:39,466 --> 00:06:48,326
simply because you are practicing and not
necessarily due to the experimental conditions.

90
00:06:48,326 --> 00:06:58,266
In other systems, simply experimenting
makes the system worse and worse.

91
00:06:58,266 --> 00:07:03,596
For example, if you run tests for gas mileage,

92
00:07:03,596 --> 00:07:08,916
your car is naturally deteriorating
over time at a very slow rate.

93
00:07:08,916 --> 00:07:14,396
In the baking experiments, I was getting tired.

94
00:07:14,396 --> 00:07:20,446
People in the chemical industry know that
our equipment slowly deteriorates over time,

95
00:07:20,526 --> 00:07:26,316
and we shut it down periodically
to clean it out and restart it.

96
00:07:26,316 --> 00:07:31,616
You can thank the second law of
thermodynamics for that unhelpful feature.

97
00:07:31,616 --> 00:07:36,786
Now you can see why it is
so important to randomize.

98
00:07:36,786 --> 00:07:42,906
It ensures that our system is minimally
affected by variables which we cannot control,

99
00:07:42,906 --> 00:07:50,176
minimally affected by variables which we
cannot measure and also minimally affected

100
00:07:50,176 --> 00:07:54,026
by variables which we don't even know about.

101
00:07:54,026 --> 00:08:00,156
When we randomize, we do so in order
that we're not confounded or confused

102
00:08:00,436 --> 00:08:05,146
by the uncontrolled and unmeasured
disturbances taking place.
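
A rough way to see this numerically is to compare the bias that a drifting disturbance adds to one factor under standard order with the bias under many shuffled run orders. This is a sketch only: the linear drift of -0.5 per run, the seed, and the helper `drift_bias` are assumptions made for illustration.

```python
import random
from itertools import product

# 2^3 design in standard order: the third factor is low for runs 1-4, high for 5-8.
design = [tuple(reversed(row)) for row in product((-1, +1), repeat=3)]

def drift_bias(run_order, factor, drift_per_run):
    """Bias a linear drift adds to one factor's main-effect estimate:
    drift * (mean position of high-level runs - mean position of low-level runs)."""
    high = [pos for pos, run in enumerate(run_order) if run[factor] == +1]
    low = [pos for pos, run in enumerate(run_order) if run[factor] == -1]
    return drift_per_run * (sum(high) / len(high) - sum(low) / len(low))

DRIFT = -0.5                 # assumed tiredness drift per run
rng = random.Random(2024)    # seed fixed only so the sketch is repeatable

standard_bias = drift_bias(design, factor=2, drift_per_run=DRIFT)

biases = []
for _ in range(2000):
    order = rng.sample(design, k=len(design))   # one randomized run order
    biases.append(drift_bias(order, factor=2, drift_per_run=DRIFT))
mean_bias = sum(biases) / len(biases)

print(f"bias on factor C in standard order: {standard_bias:+.2f}")
print(f"average bias over 2000 random orders: {mean_bias:+.3f}")
```

Any single random order can still pick up some bias by chance. What randomization removes is the systematic alignment between the drift and a factor, so that the bias averages out close to zero over possible run orders.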

103
00:08:05,526 --> 00:08:09,066
However, there are disturbances
that we can measure.

104
00:08:09,066 --> 00:08:13,626
The advice here is that you should
record such variables and add them

105
00:08:13,626 --> 00:08:17,636
to your table of results as extra columns.

106
00:08:17,636 --> 00:08:22,396
In the baking example, I may not
be able to control the temperature

107
00:08:22,396 --> 00:08:25,366
in my house, but I can still measure it.

108
00:08:25,366 --> 00:08:28,116
It might be possible that I
could measure the humidity.

109
00:08:28,116 --> 00:08:36,446
Humidity plays an important role for certain
baked products and if I can measure it,

110
00:08:36,446 --> 00:08:40,486
I should do so and add it to my table.

111
00:08:40,486 --> 00:08:47,096
We call such additional measurements
covariates, which is a term for a variable

112
00:08:47,096 --> 00:08:52,546
that is not really our primary
focus, but might still vary

113
00:08:52,546 --> 00:08:56,416
and potentially influence the
outcome variable directly.

114
00:08:56,416 --> 00:09:01,656
Or, as is more likely the case, that
covariate will affect one of the factors

115
00:09:01,656 --> 00:09:06,736
in our experiment, which then
in turn influences the outcome.

116
00:09:06,736 --> 00:09:10,516
You can use this extra information
from the covariates in two ways.

117
00:09:10,516 --> 00:09:14,536
Firstly, the simplest
way is to use them

118
00:09:14,536 --> 00:09:19,306
to understand unusual results after the fact.

119
00:09:19,306 --> 00:09:27,106
If one of your experiments had, for
example, an unusually low outcome value,

120
00:09:27,106 --> 00:09:29,406
the reason might be due to a covariate.

121
00:09:29,406 --> 00:09:34,946
The second way we could use it, for those of
you who have an understanding of least squares,

122
00:09:34,946 --> 00:09:41,416
is we can add these covariates as
additional regression variables

123
00:09:41,416 --> 00:09:44,626
in our model to separate out their effect.
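
As a hedged illustration of that second use, the sketch below fits a least-squares model with and without a humidity column. The humidity readings, the coefficients used to generate the data, and the helpers `solve` and `least_squares` are all hypothetical; the data are noise-free so that the separation is easy to see.

```python
from itertools import product

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def least_squares(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    p = len(rows[0])
    xtx = [[sum(r[j] * r[k] for r in rows) for k in range(p)] for j in range(p)]
    xty = [sum(r[j] * yi for r, yi in zip(rows, y)) for j in range(p)]
    return solve(xtx, xty)

design = [tuple(reversed(row)) for row in product((-1, +1), repeat=3)]
humidity = [40, 55, 47, 62, 51, 44, 58, 49]   # hypothetical % reading for each run

# Assumed data-generating model in coded units (a coefficient is half a main effect):
# intercept, A, B, C, humidity
true = [10.0, 1.0, 0.5, 0.0, 0.2]
y = [true[0] + true[1] * a + true[2] * b + true[3] * c + true[4] * h
     for (a, b, c), h in zip(design, humidity)]

without_cov = least_squares([[1, a, b, c] for a, b, c in design], y)
with_cov = least_squares([[1, a, b, c, h] for (a, b, c), h in zip(design, humidity)], y)

print("without covariate:", [round(v, 3) for v in without_cov])
print("with covariate:   ", [round(v, 3) for v in with_cov])
```

In this made-up data set, leaving humidity out lets part of its influence leak into the factor coefficients, while including it as an extra column recovers the coefficients that generated the data.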

124
00:09:44,626 --> 00:09:49,186
There's more to say on that topic, but now that
you know that terminology, you're in a position

125
00:09:49,186 --> 00:09:54,176
to do the extra research outside of this class
and see how you could deal with covariate data.

126
00:09:54,176 --> 00:09:56,586
Let's test your knowledge.

127
00:09:56,586 --> 00:10:06,676
We're going to use this example in the next
video, so try to remember it for next time.

128
00:10:06,676 --> 00:10:10,926
The example was inspired by an article that
appeared in the Harvard Business Review.

129
00:10:10,926 --> 00:10:12,816
You are marketing a calendaring app
for a cellphone; let's call it Cal app.

130
00:10:12,846 --> 00:10:13,956
The basic functionality in the app is free,

131
00:10:13,986 --> 00:10:15,966
but inside the app the users can pay
small amounts to upgrade various features.

132
00:10:15,996 --> 00:10:16,806
These are known as in-app purchases.

133
00:10:16,836 --> 00:10:18,846
For example, you could pay $1 and
get the sync-to-other-devices feature

134
00:10:18,876 --> 00:10:20,076
or pay another dollar extra for SMS reminders.

135
00:10:20,106 --> 00:10:21,576
Or you might really need that
extra feature for integration

136
00:10:21,606 --> 00:10:23,136
with your desktop calendaring
application, and you guessed it,

137
00:10:23,166 --> 00:10:24,096
that's going to cost you another dollar.

138
00:10:24,126 --> 00:10:26,196
So your company has created the app, but
it's your job to sell and promote it.

139
00:10:26,226 --> 00:10:28,536
Each experiment you do might involve about
2,000 people, and you'll measure the percentage

140
00:10:28,566 --> 00:10:30,126
of those people that are still
using the app after 60 days.

141
00:10:30,156 --> 00:10:30,846
That's your outcome variable.

142
00:10:30,876 --> 00:10:32,196
The factors that you consider
might be, for example,

143
00:10:32,226 --> 00:10:34,356
factor A is whether you provide a single
free in-app upgrade for one of the features

144
00:10:34,386 --> 00:10:36,216
or at the plus level you provide a
30-day trial for all the features.

145
00:10:36,246 --> 00:10:38,256
Factor B might be that the sales message
is: "This app has your schedule available

146
00:10:38,286 --> 00:10:39,156
at your fingertips on any device".

147
00:10:39,186 --> 00:10:40,716
Or the alternative message: "Cal
app features are configurable,

148
00:10:40,746 --> 00:10:41,586
only pay for the features you want".

149
00:10:41,616 --> 00:10:43,566
Factor C might be that the price for
in-app purchases is $0.89 per feature.

150
00:10:43,596 --> 00:10:45,486
Or at the plus level, the price
might be $0.99 per feature.

151
00:10:45,516 --> 00:10:46,086
So those were the factors.

152
00:10:46,116 --> 00:10:46,956
Now consider these other variables.

153
00:10:46,986 --> 00:10:48,276
Determine whether they are
disturbances or covariates.

154
00:10:48,306 --> 00:10:50,406
Note that a variety of answers could be
correct, depending on the assumptions you make

155
00:10:50,436 --> 00:10:52,026
and on your knowledge regarding
the capability of cellphone apps.

156
00:10:52,056 --> 00:10:53,886
So, in summary, the distinction
between covariates and disturbances is

157
00:10:53,916 --> 00:10:55,356
that covariates are measurable,
while disturbances are not.

158
00:10:55,386 --> 00:10:56,166
Neither of them is controllable.

159
00:10:56,196 --> 00:10:57,846
In the next video, we are going
to consider a subtle difference

160
00:10:57,876 --> 00:10:58,926
in the variables that can be controlled.