1
00:00:00,933 --> 00:00:01,666

2
00:00:01,666 --> 00:00:08,799
Hello, and welcome to a video on probability in Python. We'll be working with the scipy.stats.norm.cdf

3
00:00:08,800 --> 00:00:09,266

4
00:00:09,266 --> 00:00:15,999
function. We won't need to load in any data today, so let's start by loading in our library. From scipy.stats,

5
00:00:16,000 --> 00:00:16,166

6
00:00:16,166 --> 00:00:22,432
we'll import norm, okay? Now, we're going to be working on a very special

7
00:00:22,433 --> 00:00:22,466

8
00:00:22,466 --> 00:00:28,432
kind of problem today called finding the area under the curve. What that means is

9
00:00:28,433 --> 00:00:28,599

10
00:00:28,600 --> 00:00:34,733
we'll have a distribution, and we're looking for the area under the curve of that distribution. Here's what that

11
00:00:34,733 --> 00:00:34,766

12
00:00:34,766 --> 00:00:40,899
type of problem looks like. In a normal distribution with a mean of 80 and a standard deviation

13
00:00:40,900 --> 00:00:45,766
of 12, what is the probability that a randomly selected score will fall below 92?

14
00:00:45,766 --> 00:00:47,532

15
00:00:47,533 --> 00:00:53,633
Let's take a look at a visualization of that. You can see here

16
00:00:53,633 --> 00:00:53,666

17
00:00:53,666 --> 00:00:58,499
we have a normal distribution. And this is where our 92 is.

18
00:00:58,500 --> 00:01:00,666

19
00:01:00,666 --> 00:01:06,732
We have our mean of 80 and

20
00:01:06,733 --> 00:01:09,466

21
00:01:09,466 --> 00:01:14,299
our standard deviation of 12. So right here would be the first standard deviation,

22
00:01:14,300 --> 00:01:16,200

23
00:01:16,200 --> 00:01:22,433
and here is where it is on the other side. So we know that 92 is one standard deviation above the mean.

24
00:01:22,433 --> 00:01:22,699

25
00:01:22,700 --> 00:01:28,733
And what we're looking for is all of this blue shaded area here.
We

26
00:01:28,733 --> 00:01:28,766

27
00:01:28,766 --> 00:01:31,999
want to know how much area this contains.

28
00:01:32,000 --> 00:01:41,033

29
00:01:41,033 --> 00:01:47,133
So to do that, we're going to use norm.cdf. And it will ask us for a couple

30
00:01:47,133 --> 00:01:51,699
of arguments. We're going to tell it that we have a score

31
00:01:51,700 --> 00:01:54,100

32
00:01:54,100 --> 00:02:00,200
of 92 that we're looking for, a mean of 80, and a standard deviation of 12.

33
00:02:00,200 --> 00:02:00,733

34
00:02:00,733 --> 00:02:06,966
The mean and standard deviation arguments are called loc and scale. You don't actually

35
00:02:06,966 --> 00:02:13,132
need to include these argument labels. This line works just as well if you just have 92, 80,

36
00:02:13,133 --> 00:02:13,399

37
00:02:13,400 --> 00:02:18,366
12. All right. Let's go ahead and run this and see what our answer is.

38
00:02:18,366 --> 00:02:21,766

39
00:02:21,766 --> 00:02:27,966
Okay. So this says that the probability of selecting a value below 92 is about 0.8413.

40
00:02:27,966 --> 00:02:28,566

41
00:02:28,566 --> 00:02:34,566
This looks about right. If we go back to our image, we

42
00:02:34,566 --> 00:02:34,599

43
00:02:34,600 --> 00:02:40,700
can see that this could be about 84% of our distribution. We know that we have most

44
00:02:40,700 --> 00:02:41,666

45
00:02:41,666 --> 00:02:45,099
of our distribution in the center here.

46
00:02:45,100 --> 00:02:51,833

47
00:02:51,833 --> 00:02:57,899
So we can say that that visually checks out. So what if we want to find the probability of a value falling between two

48
00:02:57,900 --> 00:02:57,933

49
00:02:57,933 --> 00:03:03,966
values? Here's another type of problem you might run into. What is the probability in a normal

50
00:03:03,966 --> 00:03:03,999

51
00:03:04,000 --> 00:03:10,033
distribution of selecting an X value between 54 and 78 with a mean of 70 and

52
00:03:10,033 --> 00:03:15,733
a standard deviation of 8?
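The norm.cdf call described for the first problem (a score below 92, mean 80, standard deviation 12) might look roughly like the sketch below. The variable names prob_below and prob_below_positional are assumptions for illustration; the video does not show what the result was assigned to.

```python
from scipy.stats import norm

# Probability that a score falls below 92 when scores are normally
# distributed with a mean of 80 and a standard deviation of 12.
# loc is the mean and scale is the standard deviation.
prob_below = norm.cdf(92, loc=80, scale=12)

# The same call without the argument labels (positional form).
prob_below_positional = norm.cdf(92, 80, 12)

print(round(prob_below, 4))  # 0.8413
```

Both forms return the same value, so the labels are mostly there for readability.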
Let's go back to our visualization. This

53
00:03:15,733 --> 00:03:16,699

54
00:03:16,700 --> 00:03:20,700
time, we're looking for this shaded area here.

55
00:03:20,700 --> 00:03:22,900

56
00:03:22,900 --> 00:03:28,400
How much of our distribution is between these two X values? To do

57
00:03:28,400 --> 00:03:33,033

58
00:03:33,033 --> 00:03:38,866
that, we will subtract the area below the smaller X value from the area below the larger X value.

59
00:03:38,866 --> 00:03:39,466

60
00:03:39,466 --> 00:03:45,666
You always want to do it in that order, because we can't have negative values in probability. Probabilities only

61
00:03:45,666 --> 00:03:45,932

62
00:03:45,933 --> 00:03:51,133
take values between zero and one.

63
00:03:51,133 --> 00:03:53,933

64
00:03:53,933 --> 00:04:00,033
So we're going to find the probability between using norm.cdf. And we're taking the

65
00:04:00,033 --> 00:04:00,066

66
00:04:00,066 --> 00:04:06,332
larger of our two X values, 78, with a mean of 70 and a standard deviation of 8, minus

67
00:04:06,333 --> 00:04:08,166

68
00:04:08,166 --> 00:04:12,332
an X value of 54 with a mean of 70 and a standard deviation of 8.

69
00:04:12,333 --> 00:04:14,566

70
00:04:14,566 --> 00:04:20,632
Try to guess what it's going to be just from looking at the picture. And if

71
00:04:20,633 --> 00:04:20,666

72
00:04:20,666 --> 00:04:24,332
you get something close to 82%, then you are about right.

73
00:04:24,333 --> 00:04:27,799

74
00:04:27,800 --> 00:04:33,800
We can take another look at this and mentally verify, because we know that not very much of our

75
00:04:33,800 --> 00:04:33,833

76
00:04:33,833 --> 00:04:39,599
distribution is in these two tail ends here. Most of it falls in the center. So again, we've got about 82%

77
00:04:39,600 --> 00:04:39,966

78
00:04:39,966 --> 00:04:42,632
of our distribution between the two values.

79
00:04:42,633 --> 00:04:46,533

80
00:04:46,533 --> 00:04:52,799
So we've got another type of problem you might run into.
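The between-two-values calculation described above (X between 54 and 78, mean 70, standard deviation 8) can be sketched as follows. The names mean, sd, and prob_between are illustrative assumptions rather than names confirmed by the video.

```python
from scipy.stats import norm

mean, sd = 70, 8  # parameters of the distribution in the problem

# Area below the larger X value minus area below the smaller X value.
# Subtracting in this order keeps the result non-negative.
prob_between = norm.cdf(78, mean, sd) - norm.cdf(54, mean, sd)

print(round(prob_between, 4))  # 0.8186
```

Here 78 is one standard deviation above the mean and 54 is two below it, which is why the result lands near 82%.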
What if we want to find the probability of randomly selecting a value

81
00:04:52,800 --> 00:04:53,233

82
00:04:53,233 --> 00:04:59,699
above X? Let's use that first problem as an example. What if we wanted to find the probability of selecting

83
00:04:59,700 --> 00:05:03,866
a value above 92 with a mean of 80 and a standard deviation of 12?

84
00:05:03,866 --> 00:05:07,599

85
00:05:07,600 --> 00:05:13,700
Let's go back to our visualization. This time we have that smaller part of the distribution that we're

86
00:05:13,700 --> 00:05:18,566
looking for. So we know that norm.cdf will find anything below X.

87
00:05:18,566 --> 00:05:19,832

88
00:05:19,833 --> 00:05:25,766
So how would we tell it we want the opposite of that? The answer is

89
00:05:25,766 --> 00:05:26,299

90
00:05:26,300 --> 00:05:31,700
that we subtract from one. So, prob_above equals one minus

91
00:05:31,700 --> 00:05:32,533

92
00:05:32,533 --> 00:05:35,866
norm.cdf(92, 80, 12).

93
00:05:35,866 --> 00:05:39,466

94
00:05:39,466 --> 00:05:45,499
And this will essentially cut away the part of

95
00:05:45,500 --> 00:05:51,566
the distribution that we were looking at before. We start from one because the total area under the

96
00:05:51,566 --> 00:05:57,599
curve is one. One minus this value here leaves us with

97
00:05:57,600 --> 00:06:02,933
this red shaded area. Okay,

98
00:06:02,933 --> 00:06:05,199

99
00:06:05,200 --> 00:06:11,166
and that comes out to 0.1587, roughly, which makes sense because

100
00:06:11,166 --> 00:06:12,332

101
00:06:12,333 --> 00:06:18,499
we got 0.8413 up here. And those two numbers together add up to

102
00:06:18,500 --> 00:06:18,533

103
00:06:18,533 --> 00:06:23,633
one. So what if we'd like to calculate a percentile?

104
00:06:23,633 --> 00:06:24,966

105
00:06:24,966 --> 00:06:30,566
Let's say a clinic wants to identify patients who score low on a test so that those patients can be offered a new therapy.
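The "one minus" step for the area above 92 can be sketched like this. prob_above is the name spoken in the video; the norm.sf comparison is an addition that the video does not mention, included only as a cross-check.

```python
from scipy.stats import norm

# Area above 92 is the total area (1) minus the area below 92.
prob_above = 1 - norm.cdf(92, 80, 12)

# scipy also provides the survival function, norm.sf, which
# computes 1 - cdf directly and should agree with the line above.
prob_above_sf = norm.sf(92, 80, 12)

print(round(prob_above, 4))  # 0.1587
```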
106
00:06:30,566 --> 00:06:31,532

107
00:06:31,533 --> 00:06:37,866
The scores are normally distributed with a mean of 80 and a standard deviation of 12. The clinic decides

108
00:06:37,866 --> 00:06:38,032

109
00:06:38,033 --> 00:06:44,099
on the lowest 40 percent of scores. What is the score that marks the 40th percentile? Now to do

110
00:06:44,100 --> 00:06:49,933
this, you just use norm.ppf, give it the percentile as a proportion, 0.40,

111
00:06:49,933 --> 00:06:50,899

112
00:06:50,900 --> 00:06:57,300
give it the mean, give it your standard deviation. And that's it. It will find the score at that percentile.

113
00:06:57,300 --> 00:06:59,600

114
00:06:59,600 --> 00:07:04,733
And here we have that for our 40th percentile, the raw score is about 76.96.

115
00:07:04,733 --> 00:07:07,933

116
00:07:07,933 --> 00:07:13,633
Okay, that's it for this lesson on finding the area under the curve. Have fun coding.
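The percentile lookup from the clinic example might be written as in the sketch below. The variable name cutoff is an assumption for illustration; norm.ppf expects the percentile as a proportion between 0 and 1, so the 40th percentile is passed as 0.40.

```python
from scipy.stats import norm

# Score that marks the 40th percentile for scores with
# a mean of 80 and a standard deviation of 12.
cutoff = norm.ppf(0.40, loc=80, scale=12)

print(round(cutoff, 2))  # 76.96

# Sanity check: the area below the cutoff should be the 0.40 we asked for.
area_below_cutoff = norm.cdf(cutoff, loc=80, scale=12)
```

norm.ppf is the inverse of norm.cdf, so feeding the cutoff back into norm.cdf recovers the original proportion.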