In the prior video I introduced the concept of phase 1 and phase 2. Phase 1 is where you build your chart and test it internally. Phase 2 is when you test the chart on new, unseen data, usually online and in real time. Let's look at phase 1 in some detail for the Shewhart chart in this video.

We could, of course, plot our data in its raw form and monitor that: one value shown on the plot for every sample that we take from our process. However, we know that these raw data will be noisy and have high variability. What we can do instead is group our samples, calculate the average of each group of values, and then show that group average. For example, if I were using group sizes of 5 samples each, I would wait until I had acquired 5 new values, calculate the average of those 5 numbers, and call it x-bar-1. On my monitoring chart I show x-bar-1, not the raw data. Then I collect another 5 raw samples and display their average, x-bar-2, next. Then another five, calling that x-bar-3, and then x-bar-4, and so on.

You have the freedom to select that subgroup size, lowercase 'n'. The more samples in your subgroup, the smoother your plots. But it will also take longer to acquire those 'n' samples within the subgroup, and so the longer it will take to detect a problem once it has occurred.
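As an illustration of that grouping step, here is a minimal Python sketch. The raw values and the subgroup size are hypothetical, not from the course data:

```python
import numpy as np

# Hypothetical raw measurements from the process (any numeric stream works).
raw = np.array([6.2, 5.1, 7.3, 6.8, 5.9,
                4.7, 6.4, 6.1, 7.0, 5.5,
                6.6, 5.8, 6.3, 7.1, 6.0])

n = 5  # chosen subgroup size
# Reshape into K rows of n samples each, then average each row
# to get x-bar-1, x-bar-2, x-bar-3:
xbar = raw.reshape(-1, n).mean(axis=1)
print(xbar)
```

The monitoring chart would then display the three values in `xbar`, not the fifteen raw samples.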
Using a small subgroup, on the other hand, means that one or two noisy data points could potentially create a false alarm. So it is a tradeoff between accuracy and speed of detection. We discuss this more in an upcoming video.

Let's assume that you are using n values in your subgroup and that you have calculated this average, x-bar. Do you recall from an earlier part of this course what the distribution of x-bar will be? You should recall that it comes from the normal distribution, and furthermore that this normal distribution has a mean of 'mu' and a variance of sigma-squared over 'n'. We can define the standard deviation of the subgroup average, sigma x-bar, which will be equal to sigma over root 'n'.

We can safely assume that we know what 'mu' is: 'mu' is the average of the population we are sampling from. For a well-controlled process this average should be the target value, the desired value that we would like to be at. Sigma x-bar then represents the spread around the target, and so it is not surprising that the upper and lower control limits are going to be a function of that sigma x-bar value.

Here's a plot to help visualize that. We have the raw data shown by the thin black line over here.
This process has a mean of 6 and a variance of 4, i.e. a standard deviation of 2. When we take subgroups of five samples each, the subgroup averages now come from a new distribution, given by the thicker black line. It still has the same mean, but the variance is tighter, or pulled in. In this case the standard deviation of the subgroup averages is 2 over root 5, or 0.894, and you can verify that. A reasonable lower and upper bound would be those that span from -3 times sigma x-bar to +3 times sigma x-bar around the mean. This captures a very large percentage of the regular process operation.

Here's a quick check for you. Based on the prior sections of this course on univariate data analysis, we can calculate the bounds that capture normal process operation between -3 sigma x-bar and +3 sigma x-bar: in this example the lower bound is approximately 3.32 and the upper bound is 8.68. As long as the process is operating under normal conditions, our x-bar values should lie within those two bounds.

We can also theoretically calculate a z-value for this system and unpack it, by selecting a critical value, c_n. This critical value can reasonably be selected as c_n = 3.0. Given that selection, the area between the lower and upper control limits would span 99.73 percent.
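The numbers in this example are easy to verify in Python. Everything here follows directly from the values stated above; the standard normal coverage is computed from the error function:

```python
import math

mu, sigma, n = 6.0, 2.0, 5          # population mean, std. dev., subgroup size
sigma_xbar = sigma / math.sqrt(n)   # standard deviation of the subgroup average
lcl = mu - 3 * sigma_xbar
ucl = mu + 3 * sigma_xbar
print(round(sigma_xbar, 3), round(lcl, 2), round(ucl, 2))  # 0.894 3.32 8.68

# Coverage between -3 and +3 standard deviations of a normal distribution,
# using the standard normal CDF written via the error function:
Phi = lambda z: 0.5 * (1 + math.erf(z / math.sqrt(2)))
coverage = Phi(3.0) - Phi(-3.0)
print(round(coverage * 100, 2))        # 99.73
print(round(1 / (1 - coverage)))       # about 1 false alarm in 370 points
```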
That corresponds to a 1 in 370 chance that a data point, x-bar, will lie outside the bounds even though it comes from good, normal operation. The value of 3 is reasonable because it is a good tradeoff between flagging abnormal operation and leaving normal operation alone: it avoids raising too many false alarms without producing too many false negatives.

Of course, all of the above assumes that we know what 'mu' and 'sigma' are. For 'mu' we don't know the population value; however, a reasonable approximation is the mean, or the median, of a long sequence of historical data for the variable. Or we could use the target value, or we could calculate the average of the x-bar values. If we use that last option we call it x-double-bar, and capital 'K' in this formula represents the number of subgroups we have formed.

Now, a reasonable estimate for 'sigma' might be to first calculate the standard deviation of the values within each subgroup. For the case of 5 samples per subgroup, calculate the standard deviation of those five values: for the first subgroup, the second subgroup, all the way up to the Kth subgroup. Once we have these standard deviations we can find their average, and call that value S-bar: the mean of the standard deviations of the 'K' subgroups.
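The x-double-bar and S-bar calculations can be sketched in a few lines of Python. The subgroup values below are hypothetical, chosen only to show the mechanics:

```python
import statistics

# Hypothetical raw data, grouped into K = 3 subgroups of n = 5 samples each:
subgroups = [
    [232, 241, 228, 239, 245],
    [250, 247, 238, 241, 252],
    [236, 230, 244, 239, 233],
]

xbars = [statistics.mean(g) for g in subgroups]    # per-subgroup averages
sds   = [statistics.stdev(g) for g in subgroups]   # per-subgroup std. deviations

xdbar = statistics.mean(xbars)  # x-double-bar: mean of the K subgroup averages
sbar  = statistics.mean(sds)    # S-bar: mean of the K standard deviations
print(round(xdbar, 1), round(sbar, 2))
```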
Now, that is not quite an unbiased estimate of 'sigma'; we need to correct it to match what 'sigma' would truly be, and this correction factor is a theoretical number that has been derived for us for the various subgroup sizes of 2, 3, 4, and so on. Notice that there is some intuitive expectation here: when our subgroup size becomes large, the correction factor approaches 1, which indicates that with a large subgroup we can estimate the standard deviation more accurately. It is clear that estimating a standard deviation from a very small number of samples is going to be biased, and that is what the correction factor fixes up for us.

So now we are in a position to estimate our lower and upper control limits using sampled data, not population data. We take x-double-bar as the target, and add and subtract 3 times the estimated standard deviation of the subgroup averages: S-bar divided by the correction factor, then divided by root 'n'. These give the estimated lower and upper control limits.

So it is time for an example. Let's use this case where we are measuring the colour of a product using a digital camera. We are going to use 5 values in our subgroup. I have shown the raw data here on the screen; the average of the first 5 data points is 237, and their standard deviation is 9.38. Now let's take the next set of five numbers.
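That correction factor is commonly tabulated as c4 in SPC references, and it can be computed directly from the gamma function. This is a sketch of that standard formula, not something specific to the course software:

```python
import math

def c4(n):
    """Bias-correction factor for the subgroup standard deviation,
    commonly tabulated as c4 in SPC references."""
    return math.sqrt(2.0 / (n - 1)) * math.gamma(n / 2) / math.gamma((n - 1) / 2)

for n in (2, 3, 4, 5, 10, 25, 100):
    print(n, round(c4(n), 4))
# The factor approaches 1 as the subgroup size grows, matching the intuition
# that larger subgroups estimate the standard deviation with less bias.
```

For subgroups of 5 samples the factor is about 0.94, which is the value used in the colour example below.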
They're shown there on the screen; the mean this time is 245.6 and the standard deviation is 8.44. If we had a hundred such raw colour values, we could create 20 subgroups, and that is what I've shown now on the screen: the averages of the 20 subgroups. I've calculated the overall mean of all 20 subgroup averages, x-double-bar, as 238.8, and I've also recorded the mean of the standard deviations for you: S-bar is 9.28.

Now for the phase 1 step we need to calculate the lower and upper control limits. Implicit in the definition of phase 1 is that when we calculate the limits, we do so from data collected during good-quality, normal operation. You would not use contaminated data, collected while problems were occurring, to calculate the phase 1 limits; otherwise those limits would be too wide. So we should in fact verify that the subgroup data are from normal operation, and we do that by testing this initial set of lower and upper limits on the very data we used to calculate them.

You might want to try this entire process on your own. Download the data from this link, copy and paste it into R, or into a spreadsheet, or whichever software you prefer, and make sure you can reproduce the lower and upper control limit calculations.
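Using the summary values stated on screen (x-double-bar = 238.8, S-bar = 9.28, n = 5) and the standard tabulated correction factor for n = 5, roughly 0.94, the initial phase-1 limits can be sketched as follows:

```python
import math

xdbar = 238.8   # overall mean of the 20 subgroup averages
sbar  = 9.28    # mean of the 20 subgroup standard deviations
n     = 5       # subgroup size
a_n   = 0.9400  # bias-correction factor for n = 5, from standard SPC tables

sigma_hat  = sbar / a_n                 # corrected estimate of sigma
sigma_xbar = sigma_hat / math.sqrt(n)   # std. dev. of the subgroup averages

lcl = xdbar - 3 * sigma_xbar
ucl = xdbar + 3 * sigma_xbar
print(round(lcl, 1), round(ucl, 1))  # roughly 225.6 and 252.0
```

Note that a subgroup average of 253 falls just above this upper limit, which is why it gets flagged in the check that follows.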
Do you observe any subgroup averages that lie outside the limits? If you do, you should exclude that subgroup and recalculate the lower and upper control limits. Notice that one of the subgroup averages, 253, exceeds the limits. We exclude that single subgroup, leaving 19 subgroups. Recalculate x-double-bar, recalculate S-bar, and then recalculate the lower and upper control limits. This time we get 224 and 252 respectively; they don't change by very much. It's not uncommon to go through one or two iterations of this process of pruning out bad data, to get a clean dataset that represents the lower and upper control limits fairly.

So that process is phase 1, and now we are ready to go on to phase 2. Phase 2 is where you put this all into a computerized system. You show the upper and lower control limits and the target value, and the computerized system calculates your subgroup averages for you, superimposes them on the plot here on the right-hand side, and then bumps off a value on the left-hand side, as was shown earlier in the animation which I'm repeating here now on the screen for you. Every one of these data points coming in on the right-hand side is a subgroup average, not the raw data.
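The exclude-and-recalculate loop can be written generically. This sketch uses hypothetical subgroup summaries (not the colour data), with the n = 5 correction factor taken as 0.94:

```python
import math

def phase1_limits(means, sds, n, a_n, max_iter=10):
    """Iteratively compute Shewhart limits, excluding subgroups whose
    averages fall outside the current limits, until none remain outside.
    `means`/`sds` are the per-subgroup averages and standard deviations."""
    means, sds = list(means), list(sds)
    for _ in range(max_iter):
        xdbar = sum(means) / len(means)
        sbar = sum(sds) / len(sds)
        half_width = 3 * sbar / (a_n * math.sqrt(n))
        lcl, ucl = xdbar - half_width, xdbar + half_width
        keep = [i for i, m in enumerate(means) if lcl <= m <= ucl]
        if len(keep) == len(means):   # nothing excluded: converged
            return lcl, ucl, means
        means = [means[i] for i in keep]
        sds = [sds[i] for i in keep]
    return lcl, ucl, means

# Hypothetical subgroup summaries: one average (256) sits far from the rest.
means = [238, 240, 237, 256, 239, 236, 241, 238]
sds   = [9.4, 8.4, 9.1, 9.6, 8.9, 9.2, 9.0, 9.5]
lcl, ucl, kept = phase1_limits(means, sds, n=5, a_n=0.94)
print(round(lcl, 1), round(ucl, 1), len(kept))  # → 225.5 251.4 7
```

One pass excludes the outlying subgroup, and the limits computed from the remaining seven subgroups contain all of their averages, so the loop stops.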
Now in the next video, we're going to show how we can judge whether this chart performs well or not. How many times do we get a false alarm? How many times does the opposite occur: the process is actually behaving badly, but we're still showing data within the limits? Neither of these situations is desirable, but how do we quantify that? We'll see that next.