1
00:00:00,333 --> 00:00:00,933

2
00:00:00,933 --> 00:00:07,199
Hello and welcome to a module on scaling variables in Python. You may remember z-score scaling

3
00:00:07,200 --> 00:00:07,233

4
00:00:07,233 --> 00:00:13,366
from your textbook. This is what we will be doing today. First, let's load in some data and import our libraries.

5
00:00:13,366 --> 00:00:15,332

6
00:00:15,333 --> 00:00:21,433
We've imported these two before a couple of times. We're going to be using those to import our

7
00:00:21,433 --> 00:00:27,466
data. We'll also be importing the statistics module so that we can use it down below to do some

8
00:00:27,466 --> 00:00:27,599

9
00:00:27,600 --> 00:00:33,666
basic statistics. We're importing matplotlib.pyplot to do some visualizations

10
00:00:33,666 --> 00:00:33,699

11
00:00:33,700 --> 00:00:39,866
of our variables. And we're importing zscore from scipy.stats to actually perform

12
00:00:39,866 --> 00:00:46,032
our z-score transformation. This is some code we're pretty familiar with by now. So let's go ahead

13
00:00:46,033 --> 00:00:46,066

14
00:00:46,066 --> 00:00:52,132
and run it, and now we have all the variables that we'll be using today. Let's work

15
00:00:52,133 --> 00:00:58,333
with dv. Let's make a histogram for dv, so we can visualize the shape of its distribution. Then we'll scale

16
00:00:58,333 --> 00:01:04,833
the variable and then we'll take a look at the distribution afterward. The best-case scenario for applying a z-score transformation

17
00:01:04,833 --> 00:01:10,733
is when your variable is approximately normally distributed. This is the code for our histogram.

18
00:01:10,733 --> 00:01:15,766

19
00:01:15,766 --> 00:01:21,999
And we can see that dv is not quite normally distributed, but it's close enough for our purposes. Now let's

20
00:01:22,000 --> 00:01:22,033

21
00:01:22,033 --> 00:01:27,999
scale dv.
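
As a rough sketch of the import and histogram steps described so far: the video does not name the two libraries used to load the data, so pandas and NumPy are assumed here, and because the course dataset is not shown, a made-up, roughly normal dv column stands in for it. The column values, bin count, and axis labels are illustrative choices, not the instructor's.

```python
import statistics  # used in later steps for mean and standard deviation

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.stats import zscore  # used in later steps to scale dv

# Stand-in data (assumption): the real dataset is not shown in the video,
# so we generate a hypothetical, roughly normal "dv" column.
rng = np.random.default_rng(seed=0)
df = pd.DataFrame({"dv": rng.normal(loc=50, scale=10, size=500)})

# Histogram of dv, to look at the shape of its distribution before scaling.
fig, ax = plt.subplots()
counts, bin_edges, _ = ax.hist(df["dv"], bins=20)
ax.set_xlabel("dv")
ax.set_ylabel("Frequency")
ax.set_title("Distribution of dv")
```

In an interactive session you would typically finish with plt.show() to display the figure.
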
So first, to z-score it, or to do

22
00:01:28,000 --> 00:01:28,433

23
00:01:28,433 --> 00:01:33,566
a z-score transformation of it, we are going to use the zscore function.

24
00:01:33,566 --> 00:01:34,566

25
00:01:34,566 --> 00:01:40,766
We'll apply it to dv in our data frame, and we will put the result in a new variable called z-scores in our data

26
00:01:40,766 --> 00:01:46,832
frame. Let's print the head so we can take a look. You'll notice after we printed them, we got

27
00:01:46,833 --> 00:01:46,866

28
00:01:46,866 --> 00:01:53,732
a new distribution of z-scores as a result of transforming dv. When we transform a variable into a z-distribution,

29
00:01:53,733 --> 00:01:53,899

30
00:01:53,900 --> 00:02:00,266
the shape of the distribution remains the same. Let's take a look at a histogram of z-scores to see the shape of its distribution.

31
00:02:00,266 --> 00:02:01,499

32
00:02:01,500 --> 00:02:07,533
So this is the same histogram code as above, but we are using the z-scores variable instead of dv.

33
00:02:07,533 --> 00:02:08,533

34
00:02:08,533 --> 00:02:14,666
And you can see that the shape of this distribution is exactly the same as the one for

35
00:02:14,666 --> 00:02:21,066
dv. It did not change. This will always be the case when performing a z-score

36
00:02:21,066 --> 00:02:21,099

37
00:02:21,100 --> 00:02:27,166
transformation of a distribution. However, z-score distributions are special in other ways. For example, let's

38
00:02:27,166 --> 00:02:33,299
find the mean and standard deviation of our z-score distribution. We'll be using

39
00:02:33,300 --> 00:02:33,333

40
00:02:33,333 --> 00:02:37,866
statistics.mean and statistics.stdev (standard deviation) to find these numbers.

41
00:02:37,866 --> 00:02:39,766

42
00:02:39,766 --> 00:02:45,899
You'll see that our mean is zero and our standard deviation is one.
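
The scaling step and the "same shape" comparison described above might look something like this. It is only a sketch: the data is a made-up stand-in for the course dataset, and the column name zscores is an assumed spelling of the variable the video calls z-scores.

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Stand-in data (assumption): a hypothetical, roughly normal "dv" column.
rng = np.random.default_rng(seed=0)
df = pd.DataFrame({"dv": rng.normal(loc=50, scale=10, size=500)})

# zscore computes (x - mean) / standard deviation for every value of dv.
# "zscores" is an assumed column name for the new variable.
df["zscores"] = zscore(df["dv"])
print(df.head())

# A z-score transformation only shifts and rescales the values, so binning
# both columns into the same number of bins should give matching counts
# (barring floating-point ties that land exactly on a bin edge).
dv_counts, _ = np.histogram(df["dv"], bins=20)
z_counts, _ = np.histogram(df["zscores"], bins=20)
print(dv_counts)
print(z_counts)
```

Plotting df["zscores"] with the same histogram code used for dv gives the visual version of this comparison: only the x-axis values change.
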
A z-score distribution will always

43
00:02:45,900 --> 00:02:45,933

44
00:02:45,933 --> 00:02:51,966
have a mean of zero and a standard deviation of one. This is a good way to check your work to make sure that you haven't used the wrong

45
00:02:51,966 --> 00:02:57,232
variable or something like that. All right, see you in the next video.
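
The check mentioned at the end can be sketched as follows, again on made-up stand-in data. One detail worth knowing: scipy.stats.zscore divides by the population standard deviation by default (ddof=0), while statistics.stdev computes the sample standard deviation, so stdev comes out very slightly above one; statistics.pstdev gives one up to rounding error.

```python
import statistics

import numpy as np
from scipy.stats import zscore

# Stand-in data (assumption): hypothetical values in place of the real dv.
rng = np.random.default_rng(seed=0)
dv = rng.normal(loc=50, scale=10, size=500)

# Convert to a plain list of floats for the statistics module.
z = zscore(dv).tolist()

z_mean = statistics.mean(z)          # should be zero, up to rounding error
z_sd_sample = statistics.stdev(z)    # sample SD (divides by n - 1)
z_sd_pop = statistics.pstdev(z)      # population SD (divides by n)

print(f"mean:               {z_mean:.6f}")
print(f"stdev  (sample):    {z_sd_sample:.6f}")
print(f"pstdev (population): {z_sd_pop:.6f}")
```

If the mean is far from zero or the standard deviation is far from one, that usually means the statistics were computed on the wrong variable, such as the original dv.
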