1 00:00:00,400 --> 00:00:00,966 2 00:00:00,966 --> 00:00:07,132 Hello and welcome to a module on calculating mean, standard deviation, and variance in Python. We'll also 3 00:00:07,133 --> 00:00:13,199 be covering median and mode. These are commonly used to help us understand the shape of our data. First, let's 4 00:00:13,200 --> 00:00:13,233 5 00:00:13,233 --> 00:00:19,399 import some data to use. As usual, we'll be importing our libraries, our packages, up at the 6 00:00:19,400 --> 00:00:25,566 top of our file here. We will be using statistics and NumPy today to perform our calculations. 7 00:00:25,566 --> 00:00:27,899 8 00:00:27,900 --> 00:00:33,966 Next, we'll set our file path, read our file in, save it to a data frame that 9 00:00:33,966 --> 00:00:38,099 we'll call df, and print the head of df so we can see what data we're working with. 10 00:00:38,100 --> 00:00:42,066 11 00:00:42,066 --> 00:00:47,299 As you can see, we've got the same class data that we've been using the whole time, and today we'll be working with IV1. 12 00:00:47,300 --> 00:00:50,000 13 00:00:50,000 --> 00:00:56,233 So in this case, we're working with something called a sample. This means what we have is representative of a population, 14 00:00:56,233 --> 00:00:56,599 15 00:00:56,600 --> 00:01:02,733 but is not the entire population. There are two different packages we'll introduce today. The first is the 16 00:01:02,733 --> 00:01:09,266 statistics package. We'll start with the measures of central tendency: 17 00:01:09,266 --> 00:01:15,866 mean, median, and mode. statistics.mean will find you the mean. statistics.median 18 00:01:15,866 --> 00:01:15,899 19 00:01:15,900 --> 00:01:20,233 will find you the median. statistics.mode will get you the mode.
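As a rough sketch of the three central tendency calls just described: the list below is made up for illustration and is not the class data, and the name iv1 is a stand-in for the IV1 column of the data frame.

```python
import statistics

# Hypothetical scores standing in for the IV1 column; not the class data.
iv1 = [70, 85, 85, 90, 100]

mean_iv1 = statistics.mean(iv1)      # arithmetic average of the values
median_iv1 = statistics.median(iv1)  # middle value once the data is sorted
mode_iv1 = statistics.mode(iv1)      # most frequently occurring value

print(mean_iv1, median_iv1, mode_iv1)
```

With the real data, the same three calls would presumably take the data frame column in place of the list.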
20 00:01:20,233 --> 00:01:21,999 21 00:01:22,000 --> 00:01:27,466 In each of these cases, you just swap out what is inside these parentheses for whatever variable 22 00:01:27,466 --> 00:01:28,599 23 00:01:28,600 --> 00:01:34,600 it is you're trying to analyze. And next we have our 24 00:01:34,600 --> 00:01:40,900 measures of dispersion, standard deviation and variance. In the statistics package, 25 00:01:40,900 --> 00:01:45,866 that is statistics.stdev and then statistics.variance. 26 00:01:45,866 --> 00:01:47,899 27 00:01:47,900 --> 00:01:54,133 Now, suppose you had numeric data that wasn't inside a data frame. So if you 28 00:01:54,133 --> 00:01:54,166 29 00:01:54,166 --> 00:02:00,499 had an array or a list or something like that in Python, or if you had 30 00:02:00,500 --> 00:02:00,833 31 00:02:00,833 --> 00:02:06,899 saved df['IV1'] to another variable such as, 32 00:02:06,900 --> 00:02:11,000 you know, var1 equals 33 00:02:11,000 --> 00:02:13,033 34 00:02:13,033 --> 00:02:17,833 df['IV1'], then you could actually just put 35 00:02:17,833 --> 00:02:20,133 36 00:02:20,133 --> 00:02:26,533 var1 in here and it would work just fine. I'm just directly referring 37 00:02:26,533 --> 00:02:26,566 38 00:02:26,566 --> 00:02:28,932 to the data frame because it is faster. 39 00:02:28,933 --> 00:02:33,666 40 00:02:33,666 --> 00:02:39,666 So these next lines will print out our answers. If we'd like to give something a title in an answer, 41 00:02:39,666 --> 00:02:40,066 42 00:02:40,066 --> 00:02:45,599 then we can do that by putting an f before the quotations, writing our title, 43 00:02:45,600 --> 00:02:46,300 44 00:02:46,300 --> 00:02:52,300 and then inside these quotations, we use curly brackets and we denote which variable we'd like to print out.
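Here is a minimal sketch of the dispersion calls together with a titled print. The values in var1 are invented for illustration, and the names sd_sample, var_sample, line_sd, and line_var are hypothetical, not the ones used on screen.

```python
import statistics

# Hypothetical values standing in for the saved column; not the class data.
var1 = [2, 4, 4, 4, 5, 5, 7, 9]

sd_sample = statistics.stdev(var1)      # sample standard deviation
var_sample = statistics.variance(var1)  # sample variance

# The f before the quotes lets the curly brackets pull in a variable's value.
line_sd = f"Standard deviation: {sd_sample}"
line_var = f"Variance: {var_sample}"

print(line_sd)
print(line_var)
```

Building the string first and then printing it is just one way to lay this out. Putting the f-string directly inside print should behave the same.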
45 00:02:52,300 --> 00:02:56,466 46 00:02:56,466 --> 00:03:02,499 So in this case, let's go ahead and print out our mean, median, mode, standard deviation, and variance, and then 47 00:03:02,500 --> 00:03:02,533 48 00:03:02,533 --> 00:03:08,399 just for fun, I'll show you how to round these answers as well. You just add a colon and .4f after the variable name to get 49 00:03:08,400 --> 00:03:08,600 50 00:03:08,600 --> 00:03:14,800 4 decimal places. If you wanted 3 decimal places, it would be .3f. If you wanted 2, it would be .2f, 51 00:03:14,800 --> 00:03:14,833 52 00:03:14,833 --> 00:03:20,933 and so on and so forth. So let's go ahead and run this. You 53 00:03:20,933 --> 00:03:27,299 can see we've got our mean, median, and mode, which are the same in this data set. They won't always be the same. 54 00:03:27,300 --> 00:03:27,800 55 00:03:27,800 --> 00:03:33,433 We have a standard deviation of 18.53 and a variance of 343.41, and then these 56 00:03:33,433 --> 00:03:37,333 57 00:03:37,333 --> 00:03:44,033 are just our answers rounded to four decimal places. Now, never round your answers unless specifically 58 00:03:44,033 --> 00:03:44,066 59 00:03:44,066 --> 00:03:48,966 instructed to do so in the assignment. Generally, you just leave your answers exactly the way they come out. 60 00:03:48,966 --> 00:03:50,999 61 00:03:51,000 --> 00:03:57,100 So the statistics package just assumes we are working with a sample. So there's no extra work to do when using this package to find the standard 62 00:03:57,100 --> 00:04:03,333 deviation and variance for sample data. The only two calculations that would be affected by this are standard deviation and variance, 63 00:04:03,333 --> 00:04:03,899 64 00:04:03,900 --> 00:04:09,933 since for the sample calculation we would divide by n-1 and in the population calculation we simply 65 00:04:09,933 --> 00:04:09,966 66 00:04:09,966 --> 00:04:16,032 divide by n. But what if we are working with a population? There are times when this may come up.
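To make the n-1 versus n distinction concrete, here is a small sketch that works the variance out by hand both ways and compares the sample version against statistics.variance. The data and the variable names are hypothetical, and the .4f rounding is only there to show the format.

```python
import statistics

# Hypothetical values; not the class data.
data = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(data)
mean = sum(data) / n
sum_sq = sum((x - mean) ** 2 for x in data)  # sum of squared deviations

var_sample = sum_sq / (n - 1)  # sample calculation: divide by n - 1
var_population = sum_sq / n    # population calculation: divide by n

# statistics.variance should agree with the hand-computed sample version.
print(f"Sample variance by hand: {var_sample:.4f}")
print(f"statistics.variance:     {statistics.variance(data):.4f}")
print(f"Population variance:     {var_population:.4f}")
```

The standard deviation in each case is the square root of the corresponding variance, so the same divisor choice carries through.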
67 00:04:16,033 --> 00:04:16,199 68 00:04:16,200 --> 00:04:22,366 You may end up working with big data, where the data is so large it could reasonably be considered the entire population. 69 00:04:22,366 --> 00:04:23,032 70 00:04:23,033 --> 00:04:28,499 Or you may have data that has been collected, such as sales numbers, that does represent the entire data set. 71 00:04:28,500 --> 00:04:29,133 72 00:04:29,133 --> 00:04:35,166 In that case, you could use NumPy. Now our mean and median aren't going to be affected by 73 00:04:35,166 --> 00:04:41,799 this, but we'll go ahead and use NumPy to calculate them anyway. So we've got np.mean and np.median 74 00:04:41,800 --> 00:04:42,733 75 00:04:42,733 --> 00:04:47,999 to find the mean and the median. NumPy lacks a mode function, unfortunately. 76 00:04:48,000 --> 00:04:49,300 77 00:04:49,300 --> 00:04:54,333 And now we'll calculate the standard deviation and variance for the population. 78 00:04:54,333 --> 00:04:55,399 79 00:04:55,400 --> 00:05:01,800 Remember, NumPy assumes that you have a population. So it automatically calculates by dividing 80 00:05:01,800 --> 00:05:08,300 by n instead of n minus 1. So for standard deviation, it's just np.std. For variance, 81 00:05:08,300 --> 00:05:08,833 82 00:05:08,833 --> 00:05:15,099 it's just np.var. And we're going to save those into two floats, called standard 83 00:05:15,100 --> 00:05:15,133 84 00:05:15,133 --> 00:05:21,266 deviation underscore population and var underscore population. We can 85 00:05:21,266 --> 00:05:27,599 still calculate sample standard deviation and variance. But in order to do that, we have to add this argument 86 00:05:27,600 --> 00:05:33,666 ddof equals 1 to let it know that we want to subtract 1 degree of freedom and divide by n 87 00:05:33,666 --> 00:05:33,699 88 00:05:33,700 --> 00:05:39,133 minus 1 instead of n. Let's go ahead and print our results.
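A sketch of the NumPy calls described above, assuming NumPy is installed. The array is made up for illustration, and the variable names are guesses at the ones spoken aloud, not confirmed from the screen.

```python
import numpy as np

# Hypothetical values; not the class data.
values = np.array([2, 4, 4, 4, 5, 5, 7, 9])

mean_np = np.mean(values)
median_np = np.median(values)

# NumPy's default is the population calculation (divide by n).
std_population = np.std(values)
var_population = np.var(values)

# ddof=1 switches to the sample calculation (divide by n - 1).
std_sample = np.std(values, ddof=1)
var_sample = np.var(values, ddof=1)

print(f"Population: sd = {std_population}, variance = {var_population}")
print(f"Sample:     sd = {std_sample}, variance = {var_sample}")
```

The sample results here should match what the statistics package gives for the same values, since both divide by n - 1.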
89 00:05:39,133 --> 00:05:42,899 90 00:05:42,900 --> 00:05:48,533 And as you can see, the standard deviation for the population is now slightly different at 18.47, whereas 91 00:05:48,533 --> 00:05:49,666 92 00:05:49,666 --> 00:05:55,966 before it was 18.53. Our variance is slightly different as well. For the 93 00:05:55,966 --> 00:06:01,932 population it is 341.39 and for our sample it is 343.41. 94 00:06:01,933 --> 00:06:02,066 95 00:06:02,066 --> 00:06:07,632 And you may recognize these as the same sample standard deviation and variance that we got before. 96 00:06:07,633 --> 00:06:14,766 97 00:06:14,766 --> 00:06:20,866 So as data sets get larger and larger, the differences between a sample calculation and a population calculation get smaller and 98 00:06:20,866 --> 00:06:20,899 99 00:06:20,900 --> 00:06:26,600 smaller until they are almost negligible. Our data set is small, so the difference is more noticeable. 100 00:06:26,600 --> 00:06:28,000 101 00:06:28,000 --> 00:06:34,100 Remember, in this class we are always working with samples. So no matter which method you choose to use to find 102 00:06:34,100 --> 00:06:34,133 103 00:06:34,133 --> 00:06:40,599 a measure of dispersion, make sure you are keeping that in mind. However, it is useful to know how to find population 104 00:06:40,600 --> 00:06:46,766 standard deviation and variance. Here's a quick recap. The statistics package assumes you are working with 105 00:06:46,766 --> 00:06:53,032 samples by default. NumPy assumes you are working with a population by default. NumPy 106 00:06:53,033 --> 00:06:53,066 107 00:06:53,066 --> 00:06:59,166 is still useful to learn as it is widely used in big data and other Python applications. You may choose either 108 00:06:59,166 --> 00:07:07,099 method for your assignment, but make sure you're doing sample calculations, which in NumPy means setting ddof equal to 1.
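One way to see why the gap shrinks as data sets grow: the two variances share the same sum of squared deviations and differ only in the divisor, so the population standard deviation equals the sample standard deviation times the square root of (n - 1) / n. The sketch below just evaluates that factor for a few sizes. The function name shrink_factor and the chosen values of n are hypothetical.

```python
import math


def shrink_factor(n: int) -> float:
    """Ratio of population SD to sample SD for a data set of size n."""
    return math.sqrt((n - 1) / n)


# The factor approaches 1 as n grows, so the two answers converge.
for n in (10, 100, 10_000):
    print(f"n = {n}: population SD is {shrink_factor(n):.4f} times the sample SD")
```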