1
00:00:00,400 --> 00:00:00,966


2
00:00:00,966 --> 00:00:07,132
Hello and welcome to a module on calculating mean, standard deviation, and variance in Python. We'll also

3
00:00:07,133 --> 00:00:13,199
be covering median and mode. These are commonly used to help us understand the shape of our data. First, let's

4
00:00:13,200 --> 00:00:13,233


5
00:00:13,233 --> 00:00:19,399
import some data to use. As usual, we'll be importing our libraries, our packages, up at the

6
00:00:19,400 --> 00:00:25,566
top of our file here. We will be using statistics and numpy today to perform our calculations.

7
00:00:25,566 --> 00:00:27,899


8
00:00:27,900 --> 00:00:33,966
Next, we'll set our file path, read our file in, save it to a data frame that

9
00:00:33,966 --> 00:00:38,099
we'll call df, and print the head of df so we can see what data we're working with.
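As a rough sketch of that loading step: the class file and its path aren't shown here, so this uses a made-up inline CSV with assumed columns `IV1` and `DV` in place of a real file path.

```python
import io

import pandas as pd

# Made-up stand-in for the class data file (the column names are assumptions).
csv_text = """IV1,DV
70.0,1
85.0,0
85.0,1
90.0,1
100.0,0
95.0,1
"""

# With a real file you would pass its path to pd.read_csv instead of a StringIO.
df = pd.read_csv(io.StringIO(csv_text))

# head() shows the first five rows so we can see what we're working with.
print(df.head())
```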

10
00:00:38,100 --> 00:00:42,066


11
00:00:42,066 --> 00:00:47,299
As you can see, we've got the same class data that we've been using the whole time, and today we'll be working with IV1.

12
00:00:47,300 --> 00:00:50,000


13
00:00:50,000 --> 00:00:56,233
So in this case, we're working with something called a sample. This means what we have is representative of a population,

14
00:00:56,233 --> 00:00:56,599


15
00:00:56,600 --> 00:01:02,733
but is not the entire population. There are two different packages we'll introduce today. The first is the

16
00:01:02,733 --> 00:01:09,266
statistics package. So these are measures of central tendency,

17
00:01:09,266 --> 00:01:15,866
mean, median, and mode. statistics.mean will find you the mean. statistics.median

18
00:01:15,866 --> 00:01:15,899


19
00:01:15,900 --> 00:01:20,233
will find you the median. statistics.mode will get you the mode.
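Here's a small sketch of those three calls. The list `scores` is made-up data for illustration, not the class data.

```python
import statistics

# Made-up sample data for illustration.
scores = [70, 85, 85, 90, 100]

mean_value = statistics.mean(scores)      # arithmetic average
median_value = statistics.median(scores)  # middle value of the sorted data
mode_value = statistics.mode(scores)      # most frequent value

print(mean_value, median_value, mode_value)
```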

20
00:01:20,233 --> 00:01:21,999


21
00:01:22,000 --> 00:01:27,466
In each of these cases, you just change out whatever variable it is you're trying to find for what

22
00:01:27,466 --> 00:01:28,599


23
00:01:28,600 --> 00:01:34,600
is inside these parentheses. And next we have our

24
00:01:34,600 --> 00:01:40,900
measures of dispersion: standard deviation and variance. First, in the statistics package,

25
00:01:40,900 --> 00:01:45,866
that is statistics.stdev and then statistics.variance.
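A minimal sketch of those two calls, again on a made-up list called `scores`:

```python
import statistics

# Made-up sample data for illustration.
scores = [70, 85, 85, 90, 100]

# Both functions treat the data as a sample (they divide by n - 1).
sample_sd = statistics.stdev(scores)
sample_var = statistics.variance(scores)

print(sample_sd, sample_var)
```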

26
00:01:45,866 --> 00:01:47,899


27
00:01:47,900 --> 00:01:54,133
Now, if you had numeric data that wasn't inside a data frame, so if you

28
00:01:54,133 --> 00:01:54,166


29
00:01:54,166 --> 00:02:00,499
had an array or a list or something like that in Python, or if you had

30
00:02:00,500 --> 00:02:00,833


31
00:02:00,833 --> 00:02:06,899
saved df["IV1"] to another variable, such as,

32
00:02:06,900 --> 00:02:11,000
you know, var1 equals

33
00:02:11,000 --> 00:02:13,033


34
00:02:13,033 --> 00:02:17,833
df["IV1"]. Then you could actually just put

35
00:02:17,833 --> 00:02:20,133


36
00:02:20,133 --> 00:02:26,533
var1 in here and it would work just fine. I'm just directly referring

37
00:02:26,533 --> 00:02:26,566


38
00:02:26,566 --> 00:02:28,932
to the data frame because it is faster.
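To illustrate, here's a sketch showing that the column, a variable holding the column, and a plain list all work the same way. The data frame is built inline from made-up numbers, and the bracket syntax `df["IV1"]` is one assumed way of referring to the column.

```python
import statistics

import pandas as pd

# Made-up data frame with a single column named IV1.
df = pd.DataFrame({"IV1": [70.0, 85.0, 85.0, 90.0, 100.0]})

# Option 1: refer to the data frame column directly.
mean_from_column = statistics.mean(df["IV1"])

# Option 2: save the column to another variable first.
var1 = df["IV1"]
mean_from_variable = statistics.mean(var1)

# Option 3: a plain Python list works too.
mean_from_list = statistics.mean([70.0, 85.0, 85.0, 90.0, 100.0])

print(mean_from_column, mean_from_variable, mean_from_list)
```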

39
00:02:28,933 --> 00:02:33,666


40
00:02:33,666 --> 00:02:39,666
So these next lines will print out our answers. If we'd like to give something a title in an answer,

41
00:02:39,666 --> 00:02:40,066


42
00:02:40,066 --> 00:02:45,599
then we can do that by using f quotations, writing our title,

43
00:02:45,600 --> 00:02:46,300


44
00:02:46,300 --> 00:02:52,300
and then inside these quotations, we use curly brackets and we denote which variable we'd like to print out.
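A quick sketch of a titled print using an f-string. The title text and the `scores` list are made up for illustration.

```python
import statistics

# Made-up sample data for illustration.
scores = [70, 85, 85, 90, 100]
mean_value = statistics.mean(scores)

# The f goes before the opening quotation mark; the variable goes in curly brackets.
label = f"Mean: {mean_value}"
print(label)
```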

45
00:02:52,300 --> 00:02:56,466


46
00:02:56,466 --> 00:03:02,499
So in this case, let's go ahead and print out our mean, median, mode, standard deviation, and variance, and then

47
00:03:02,500 --> 00:03:02,533


48
00:03:02,533 --> 00:03:08,399
just for fun, I'll show you how to round these answers as well. You just add a colon and dot 4f to get to

49
00:03:08,400 --> 00:03:08,600


50
00:03:08,600 --> 00:03:14,800
4 decimal places. If you wanted 3 decimal places, it would be dot 3f. If you wanted 2, it would be dot 2f,

51
00:03:14,800 --> 00:03:14,833


52
00:03:14,833 --> 00:03:20,933
and so on and so forth. So let's go ahead and run this. You

53
00:03:20,933 --> 00:03:27,299
can see we've got our mean, median, and mode, which are the same in this data set. They won't always be the same.

54
00:03:27,300 --> 00:03:27,800


55
00:03:27,800 --> 00:03:33,433
Standard deviation of 18.53, variance of 343.41, and then this

56
00:03:33,433 --> 00:03:37,333


57
00:03:37,333 --> 00:03:44,033
is just our answers rounded to four decimal places. Now, never round your answers unless specifically

58
00:03:44,033 --> 00:03:44,066


59
00:03:44,066 --> 00:03:48,966
instructed to do so in the assignment. Generally, you just leave your answers exactly the way they come out.
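As a small sketch of the rounding syntax: inside the curly brackets, the format goes after a colon. The number here is just an arbitrary example value. Note that this only changes how the value is displayed; the variable itself keeps full precision.

```python
# Arbitrary example value, not a result from the class data.
value = 3.14159265

four_places = f"{value:.4f}"   # 4 decimal places
three_places = f"{value:.3f}"  # 3 decimal places
two_places = f"{value:.2f}"    # 2 decimal places

print(four_places, three_places, two_places)
```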

60
00:03:48,966 --> 00:03:50,999


61
00:03:51,000 --> 00:03:57,100
So this statistics package just assumes we are working with a sample. So there's no extra work to do when using this package to find the standard

62
00:03:57,100 --> 00:04:03,333
deviation and variance for sample data. The only two calculations that would be affected by this are standard deviation and variance.

63
00:04:03,333 --> 00:04:03,899


64
00:04:03,900 --> 00:04:09,933
Since for the sample calculation, we would divide by n-1 and in the population calculation, we simply

65
00:04:09,933 --> 00:04:09,966


66
00:04:09,966 --> 00:04:16,032
divide by n. But what if we are working with a population? There are times where this may come up.
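To make the n minus 1 versus n difference concrete, here's a sketch that does both calculations by hand on a made-up list and checks them against the statistics package. `statistics.pvariance` is the package's population version.

```python
import statistics

# Made-up data for illustration.
scores = [70, 85, 85, 90, 100]

n = len(scores)
mean_value = sum(scores) / n

# Sum of squared deviations from the mean.
sum_sq = sum((x - mean_value) ** 2 for x in scores)

sample_variance = sum_sq / (n - 1)   # sample: divide by n - 1
population_variance = sum_sq / n     # population: divide by n

print(sample_variance, statistics.variance(scores))
print(population_variance, statistics.pvariance(scores))
```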

67
00:04:16,033 --> 00:04:16,199


68
00:04:16,200 --> 00:04:22,366
If you end up working with big data, that is, data so large that it could reasonably be considered the entire population.

69
00:04:22,366 --> 00:04:23,032


70
00:04:23,033 --> 00:04:28,499
Or you may have data that has been collected, such as sales numbers that do represent the entire data set.

71
00:04:28,500 --> 00:04:29,133


72
00:04:29,133 --> 00:04:35,166
In that case, you could use NumPy. Now our mean and median aren't going to be affected by

73
00:04:35,166 --> 00:04:41,799
this, but we'll go ahead and use NumPy to calculate them anyway. So we've got np.mean and np.median.

74
00:04:41,800 --> 00:04:42,733


75
00:04:42,733 --> 00:04:47,999
These find the mean and the median. NumPy lacks a mode function, unfortunately.
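A short sketch of the NumPy versions, on a made-up array. For the mode, one possible workaround (an assumption here, not necessarily what's on screen) is to fall back on `statistics.mode`.

```python
import statistics

import numpy as np

# Made-up data for illustration.
data = np.array([70, 85, 85, 90, 100])

np_mean = np.mean(data)
np_median = np.median(data)

# NumPy has no mode function, so borrow the one from the statistics package.
mode_value = statistics.mode(data.tolist())

print(np_mean, np_median, mode_value)
```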

76
00:04:48,000 --> 00:04:49,300


77
00:04:49,300 --> 00:04:54,333
And now we'll calculate the standard deviation and variance for the population.

78
00:04:54,333 --> 00:04:55,399


79
00:04:55,400 --> 00:05:01,800
Remember, NumPy assumes that you have a population. So it automatically calculates by dividing

80
00:05:01,800 --> 00:05:08,300
by n instead of n minus 1. So for standard deviation, it's just std; for variance,

81
00:05:08,300 --> 00:05:08,833


82
00:05:08,833 --> 00:05:15,099
it's just var. And we're going to save those into two floats, called standard

83
00:05:15,100 --> 00:05:15,133


84
00:05:15,133 --> 00:05:21,266
deviation underscore population and var underscore population. We can

85
00:05:21,266 --> 00:05:27,599
still calculate sample standard deviation and variance. But in order to do that, we have to add this argument

86
00:05:27,600 --> 00:05:33,666
ddof equals 1 to let it know that we want to subtract 1 degree of freedom, that is, divide by n

87
00:05:33,666 --> 00:05:33,699


88
00:05:33,700 --> 00:05:39,133
minus 1 instead of n. Let's go ahead and print our results.
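Here's a sketch of both NumPy calculations side by side. The array is made up, and the variable names are assumptions for illustration.

```python
import numpy as np

# Made-up data for illustration.
data = np.array([70, 85, 85, 90, 100])

# Default (ddof=0): population formulas, dividing by n.
std_population = np.std(data)
var_population = np.var(data)

# ddof=1: sample formulas, dividing by n - 1.
std_sample = np.std(data, ddof=1)
var_sample = np.var(data, ddof=1)

print(f"Population: sd={std_population}, var={var_population}")
print(f"Sample:     sd={std_sample}, var={var_sample}")
```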

89
00:05:39,133 --> 00:05:42,899


90
00:05:42,900 --> 00:05:48,533
And as you can see, the standard deviation for the population is now slightly different at 18.47, whereas

91
00:05:48,533 --> 00:05:49,666


92
00:05:49,666 --> 00:05:55,966
before it was 18.53. Our variance is slightly different as well. For the

93
00:05:55,966 --> 00:06:01,932
population it is 341.39, and for our sample it is 343.41.

94
00:06:01,933 --> 00:06:02,066


95
00:06:02,066 --> 00:06:07,632
But you may recognize this as the same sample standard deviation and variance that we got before.

96
00:06:07,633 --> 00:06:14,766


97
00:06:14,766 --> 00:06:20,866
So as data sets get larger and larger, the differences between a sample calculation and a population calculation are smaller and

98
00:06:20,866 --> 00:06:20,899


99
00:06:20,900 --> 00:06:26,600
smaller until they are almost negligible. Our data set is small, so the difference is more noticeable.
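One way to see that shrinking gap is a quick sketch on randomly generated data of different sizes. The sizes and the random data are made up for illustration.

```python
import numpy as np

# Seeded generator so the made-up data is reproducible.
rng = np.random.default_rng(0)

gaps = {}
for n in (10, 100, 10_000):
    sample = rng.normal(loc=50, scale=10, size=n)
    pop_sd = np.std(sample)           # divides by n
    samp_sd = np.std(sample, ddof=1)  # divides by n - 1
    # Relative gap between the sample and population standard deviations.
    gaps[n] = (samp_sd - pop_sd) / samp_sd
    print(f"n={n}: relative gap = {gaps[n]:.6f}")
```

The relative gap depends only on n, since the two results differ by a factor of the square root of (n minus 1) over n.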

100
00:06:26,600 --> 00:06:28,000


101
00:06:28,000 --> 00:06:34,100
Remember in this class we are always working with samples. So no matter which method you choose to use to find

102
00:06:34,100 --> 00:06:34,133


103
00:06:34,133 --> 00:06:40,599
a measure of dispersion, make sure you are keeping that in mind. However, it is useful to know how to find population

104
00:06:40,600 --> 00:06:46,766
standard deviation and variance. Here's a quick recap. The statistics package assumes you are working with

105
00:06:46,766 --> 00:06:53,032
samples by default. NumPy assumes you are working with a population by default. NumPy

106
00:06:53,033 --> 00:06:53,066


107
00:06:53,066 --> 00:06:59,166
is still useful to learn as it is widely used in big data and other Python applications. You may choose either

108
00:06:59,166 --> 00:07:07,099
method for your assignment, but make sure you're doing sample calculations, with degrees of freedom equal to 1.