1
00:00:00,933 --> 00:00:01,666


2
00:00:01,666 --> 00:00:08,799
Hello, and welcome to a video on probability in Python. We'll be working with the scipy.stats.norm.cdf

3
00:00:08,800 --> 00:00:09,266


4
00:00:09,266 --> 00:00:15,999
function. We won't need to load in any data today. So let's start by loading in our library. So from scipy.stats,

5
00:00:16,000 --> 00:00:16,166


6
00:00:16,166 --> 00:00:22,432
we'll import norm, okay? Now, we're going to be working on a very special

7
00:00:22,433 --> 00:00:22,466


8
00:00:22,466 --> 00:00:28,432
kind of problem today called finding the area under the curve. And what that means is

9
00:00:28,433 --> 00:00:28,599


10
00:00:28,600 --> 00:00:34,733
we'll have a distribution. And we're looking for the area under the curve of that distribution. Here's what that

11
00:00:34,733 --> 00:00:34,766


12
00:00:34,766 --> 00:00:40,899
type of problem looks like. In a normal distribution with a mean of 80 and a standard deviation

13
00:00:40,900 --> 00:00:45,766
of 12, what is the probability that a randomly selected score will fall below 92?

14
00:00:45,766 --> 00:00:47,532


15
00:00:47,533 --> 00:00:53,633
Let's take a look at a visualization of that. You can see here

16
00:00:53,633 --> 00:00:53,666


17
00:00:53,666 --> 00:00:58,499
we have a normal distribution. And this is where our 92 is.

18
00:00:58,500 --> 00:01:00,666


19
00:01:00,666 --> 00:01:06,732
We have our mean of 80 and

20
00:01:06,733 --> 00:01:09,466


21
00:01:09,466 --> 00:01:14,299
our standard deviation of 12. So right here would be the first standard deviation.

22
00:01:14,300 --> 00:01:16,200


23
00:01:16,200 --> 00:01:22,433
where it is on the other side. So we know that 92 is one standard deviation above the mean.

24
00:01:22,433 --> 00:01:22,699


25
00:01:22,700 --> 00:01:28,733
And what we're looking for is all of this blue shaded area here. We

26
00:01:28,733 --> 00:01:28,766


27
00:01:28,766 --> 00:01:31,999
want to know how much area this contains.

28
00:01:32,000 --> 00:01:41,033


29
00:01:41,033 --> 00:01:47,133
So to do that, we're going to use norm.cdf. And it will ask us for a couple

30
00:01:47,133 --> 00:01:51,699
of arguments. We're going to tell it that we have a score

31
00:01:51,700 --> 00:01:54,100


32
00:01:54,100 --> 00:02:00,200
of 92 that we're looking for. A mean of 80. And a standard deviation of 12.

33
00:02:00,200 --> 00:02:00,733


34
00:02:00,733 --> 00:02:06,966
The mean and standard deviation arguments are called loc and scale. You don't actually

35
00:02:06,966 --> 00:02:13,132
need to include these argument labels. This line works just as well if you just have 92 comma 80 comma

36
00:02:13,133 --> 00:02:13,399


37
00:02:13,400 --> 00:02:18,366
12. All right. Let's go ahead and run this and see what our answer is.

38
00:02:18,366 --> 00:02:21,766


39
00:02:21,766 --> 00:02:27,966
Okay. So this says that the probability of selecting a value below 92 is about 0.8413.
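As a minimal sketch of the call just described, assuming SciPy is installed. The variable names are illustrative and are not taken from the video:

```python
from scipy.stats import norm

# Probability that a score falls below 92 when scores are normally
# distributed with mean 80 and standard deviation 12.
# loc is the mean and scale is the standard deviation.
prob_below = norm.cdf(92, loc=80, scale=12)

# The same call with positional arguments in the order (x, loc, scale).
prob_below_positional = norm.cdf(92, 80, 12)

print(round(prob_below, 4))  # 0.8413
```

Both forms should return the same value; the keyword form simply makes it harder to swap the mean and standard deviation by accident.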

40
00:02:27,966 --> 00:02:28,566


41
00:02:28,566 --> 00:02:34,566
This looks about right. If we go back to our image, we

42
00:02:34,566 --> 00:02:34,599


43
00:02:34,600 --> 00:02:40,700
can see that this could be about 84% of our distribution. We know that we have most

44
00:02:40,700 --> 00:02:41,666


45
00:02:41,666 --> 00:02:45,099
of our distribution in the center here.

46
00:02:45,100 --> 00:02:51,833


47
00:02:51,833 --> 00:02:57,899
So we can say that that visually checks out. So what if we want to find the probability of a value falling between two

48
00:02:57,900 --> 00:02:57,933


49
00:02:57,933 --> 00:03:03,966
values? Here's another type of problem you might run into. What is the probability in a normally distributed

50
00:03:03,966 --> 00:03:03,999


51
00:03:04,000 --> 00:03:10,033
distribution of selecting an X value between 54 and 78 with a mean of 70 and

52
00:03:10,033 --> 00:03:15,733
a standard deviation of 8? Let's go back to our visualization this

53
00:03:15,733 --> 00:03:16,699


54
00:03:16,700 --> 00:03:20,700
time. We're looking for this shaded area here.

55
00:03:20,700 --> 00:03:22,900


56
00:03:22,900 --> 00:03:28,400
How much of our distribution is between these two X values? To do

57
00:03:28,400 --> 00:03:33,033


58
00:03:33,033 --> 00:03:38,866
that, we will subtract the cumulative probability at the smaller X value from the cumulative probability at the larger X value.

59
00:03:38,866 --> 00:03:39,466


60
00:03:39,466 --> 00:03:45,666
You always want to do it in that order, because we can't have negative values in probability. You can only have

61
00:03:45,666 --> 00:03:45,932


62
00:03:45,933 --> 00:03:51,133
values between zero and one.

63
00:03:51,133 --> 00:03:53,933


64
00:03:53,933 --> 00:04:00,033
So we're going to find the probability between the two values with norm.cdf. And we're taking the

65
00:04:00,033 --> 00:04:00,066


66
00:04:00,066 --> 00:04:06,332
larger of our two X values, 78, with a mean of 70 and a standard deviation of 8, minus

67
00:04:06,333 --> 00:04:08,166


68
00:04:08,166 --> 00:04:12,332
an X value of 54 with a mean of 70 and a standard deviation of 8.

69
00:04:12,333 --> 00:04:14,566


70
00:04:14,566 --> 00:04:20,632
to guess what it's going for just from looking at the picture. And if

71
00:04:20,633 --> 00:04:20,666


72
00:04:20,666 --> 00:04:24,332
you get something close to 82%, then you are about right.

73
00:04:24,333 --> 00:04:27,799


74
00:04:27,800 --> 00:04:33,800
We can take another look at this again and kind of mentally verify, because we know that not very much of our

75
00:04:33,800 --> 00:04:33,833


76
00:04:33,833 --> 00:04:39,599
distribution is in these two tail ends here. Most of it falls in the center. So again, we've got about 82%

77
00:04:39,600 --> 00:04:39,966


78
00:04:39,966 --> 00:04:42,632
of our distribution between two values.
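The between-two-values calculation can be sketched as follows, assuming SciPy is installed. The name prob_between is illustrative:

```python
from scipy.stats import norm

# Area below the larger X value (78) minus area below the smaller
# X value (54), for a normal distribution with mean 70 and
# standard deviation 8.
prob_between = norm.cdf(78, 70, 8) - norm.cdf(54, 70, 8)

print(round(prob_between, 4))  # 0.8186
```

Subtracting in this order keeps the result non-negative, since the cumulative probability never decreases as X grows.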

79
00:04:42,633 --> 00:04:46,533


80
00:04:46,533 --> 00:04:52,799
So we've got another type of problem you might run into. What if we want to find the probability of randomly selecting a value

81
00:04:52,800 --> 00:04:53,233


82
00:04:53,233 --> 00:04:59,699
above X? Let's use that first problem as an example. What if we wanted to find the probability of selecting

83
00:04:59,700 --> 00:05:03,866
a value above 92 with a mean of 80 and standard deviation of 12?

84
00:05:03,866 --> 00:05:07,599


85
00:05:07,600 --> 00:05:13,700
Let's go back to our visualization. This time we have that smaller part of the distribution that we're

86
00:05:13,700 --> 00:05:18,566
looking for. So we know that norm.cdf will find anything below our X value.

87
00:05:18,566 --> 00:05:19,832


88
00:05:19,833 --> 00:05:25,766
So how would we tell it we want the opposite of that? And the answer is

89
00:05:25,766 --> 00:05:26,299


90
00:05:26,300 --> 00:05:31,700
that we subtract from one. So, prob_above equals one minus,

91
00:05:31,700 --> 00:05:32,533


92
00:05:32,533 --> 00:05:35,866
norm.cdf, 92, 80, 12.
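A minimal sketch of the line read out above, assuming SciPy is installed. The comparison with norm.sf (SciPy's survival function, which is defined as 1 - cdf) is an addition for illustration and is not used in the video:

```python
from scipy.stats import norm

# Probability of a value above 92: one minus the area below 92.
prob_above = 1 - norm.cdf(92, 80, 12)

# The survival function computes the same upper-tail area directly.
prob_above_sf = norm.sf(92, 80, 12)

print(round(prob_above, 4))  # 0.1587
```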

93
00:05:35,866 --> 00:05:39,466


94
00:05:39,466 --> 00:05:45,499
And this will essentially cut away the part of

95
00:05:45,500 --> 00:05:51,566
the distribution that we were looking at before. So we're saying one because the total area under the curve is one, and probabilities only exist between zero

96
00:05:51,566 --> 00:05:57,599
and one. One minus, this value here leaves us with

97
00:05:57,600 --> 00:06:02,933
this red shaded area. Okay,

98
00:06:02,933 --> 00:06:05,199


99
00:06:05,200 --> 00:06:11,166
and that comes out to 0.1587, roughly. Which makes sense because

100
00:06:11,166 --> 00:06:12,332


101
00:06:12,333 --> 00:06:18,499
we got 0.8413 up here. And those two numbers together will equal

102
00:06:18,500 --> 00:06:18,533


103
00:06:18,533 --> 00:06:23,633
about one. So what if we'd like to calculate a percentile?

104
00:06:23,633 --> 00:06:24,966


105
00:06:24,966 --> 00:06:30,566
Let's say a clinic wants to identify patients who score low on a test. So the patients can be offered a new therapy.

106
00:06:30,566 --> 00:06:31,532


107
00:06:31,533 --> 00:06:37,866
The scores are normally distributed with a mean of 80 and a standard deviation of 12. The clinic decides

108
00:06:37,866 --> 00:06:38,032


109
00:06:38,033 --> 00:06:44,099
on the lowest 40 percent of scores. What is the score that marks the 40th percentile? Now to do

110
00:06:44,100 --> 00:06:49,933
this, you just use norm.ppf, give it the percentile as a proportion, 0.40,

111
00:06:49,933 --> 00:06:50,899


112
00:06:50,900 --> 00:06:57,300
give it the mean, give it your standard deviation. And that's it. It will find you the score at that percentile.

113
00:06:57,300 --> 00:06:59,600


114
00:06:59,600 --> 00:07:04,733
And here we have that at our 40th percentile, the raw score is about 76.96.
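The percentile lookup can be sketched like this, assuming SciPy is installed. The names cutoff and check are illustrative:

```python
from scipy.stats import norm

# norm.ppf is the inverse of norm.cdf: given a proportion (0.40 for
# the 40th percentile), it returns the score below which that share
# of the distribution falls. Mean 80, standard deviation 12.
cutoff = norm.ppf(0.40, loc=80, scale=12)

# Round trip: the area below the cutoff should be 0.40 again.
check = norm.cdf(cutoff, loc=80, scale=12)

print(round(cutoff, 2), round(check, 2))
```

Note that the first argument is a proportion between zero and one, not a whole-number percent such as 40.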

115
00:07:04,733 --> 00:07:07,933


116
00:07:07,933 --> 00:07:13,633
Okay, that's it for this lesson on finding the area under the curve. Have fun coding.