1 00:00:00,000 --> 00:00:00,900 2 00:00:00,900 --> 00:00:07,266 Hello and welcome to a module on Pearson correlations. First, let's load in our data and the packages we'll be using today. 3 00:00:07,266 --> 00:00:07,732 4 00:00:07,733 --> 00:00:13,299 So we'll be importing pandas as pd, drive from google.colab, numpy as np, 5 00:00:13,300 --> 00:00:14,066 6 00:00:14,066 --> 00:00:20,266 matplotlib.pyplot as plt, and we'll be importing pearsonr from scipy.stats. 7 00:00:20,266 --> 00:00:21,032 8 00:00:21,033 --> 00:00:26,933 We'll be doing some scatter plots today, so that's why we've got matplotlib imported. 9 00:00:26,933 --> 00:00:28,233 10 00:00:28,233 --> 00:00:33,899 So let's go ahead and create our variables for the class data set. This should look very familiar. 11 00:00:33,900 --> 00:00:35,433 12 00:00:35,433 --> 00:00:41,566 Okay, and these are variables that we're used to seeing by now, and we know how these variables were created. So we're aware that 13 00:00:41,566 --> 00:00:47,599 there are relationships between IV1, IV2, DV, and the rest of these variables here. So let's 14 00:00:47,600 --> 00:00:54,033 investigate those relationships and see how closely related these variables actually are. Let's calculate 15 00:00:54,033 --> 00:00:54,066 16 00:00:54,066 --> 00:01:00,332 the Pearson correlation between RIV2 and RDV. In another video, we created a scatter plot between 17 00:01:00,333 --> 00:01:06,399 those two variables. Let's do that again and take a look. So we're going to create a scatter plot between 18 00:01:06,400 --> 00:01:06,633 19 00:01:06,633 --> 00:01:12,766 RIV2 and RDV, create our line of best fit, and then actually produce our 20 00:01:12,766 --> 00:01:19,032 scatter plot, label it, title it, and show it. If you need a review on how to do that, 21 00:01:19,033 --> 00:01:25,333 go visit the scatter plot video to see more. 
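For anyone following along without the notebook, here is a rough sketch of the scatter plot step just described. It is not the code from the video: the data is synthetic, the column names RIV2 and RDV are assumed from the narration, and np.polyfit is one common way to get a line of best fit, not necessarily the one used on screen.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the class data set (hypothetical values and column names).
rng = np.random.default_rng(0)
riv2 = rng.normal(loc=50, scale=10, size=200)
rdv = 0.3 * riv2 + rng.normal(loc=0, scale=10, size=200)
df = pd.DataFrame({"RIV2": riv2, "RDV": rdv})

# Line of best fit: a degree-1 polynomial (slope and intercept) fit by least squares.
slope, intercept = np.polyfit(df["RIV2"], df["RDV"], 1)

# Scatter plot of the two variables with the fitted line drawn on top.
fig, ax = plt.subplots()
ax.scatter(df["RIV2"], df["RDV"], alpha=0.6)
x_line = np.linspace(df["RIV2"].min(), df["RIV2"].max(), 100)
ax.plot(x_line, slope * x_line + intercept, color="red")

# Label it, title it, and show it.
ax.set_xlabel("RIV2")
ax.set_ylabel("RDV")
ax.set_title("RIV2 vs. RDV with line of best fit")
plt.show()
```

A positive slope on the fitted line is the visual hint of a positive correlation; how tightly the points hug the line hints at its strength.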
So looking at this 22 00:01:25,333 --> 00:01:25,366 23 00:01:25,366 --> 00:01:31,766 scatter plot, we can see that our line is not completely flat, and it's not negative, 24 00:01:31,766 --> 00:01:31,799 25 00:01:31,800 --> 00:01:38,133 so we have a positive correlation here, but it's maybe not a very strong positive correlation. We might 26 00:01:38,133 --> 00:01:44,233 guess that it's going to be moderate in size, and I already know what it's going to be, so 27 00:01:44,233 --> 00:01:44,266 28 00:01:44,266 --> 00:01:50,466 my guess is a very good guess. But judging from our trend line, we might expect to see a moderate positive 29 00:01:50,466 --> 00:01:50,499 30 00:01:50,500 --> 00:01:56,933 correlation between these two variables. Let's check with a Pearson calculation to see if that is the case. 31 00:01:56,933 --> 00:01:57,299 32 00:01:57,300 --> 00:02:03,800 So here, we are giving Python two variables to put our function's 33 00:02:03,800 --> 00:02:04,266 34 00:02:04,266 --> 00:02:10,132 return values in, because we know that we're going to get two return values 35 00:02:10,133 --> 00:02:10,333 36 00:02:10,333 --> 00:02:16,766 from this function. We're going to give it two arguments, RIV2 and RDV. 37 00:02:16,766 --> 00:02:22,466 It doesn't really matter what order you do them in, so long as they are both there. 38 00:02:22,466 --> 00:02:23,432 39 00:02:23,433 --> 00:02:29,833 Next, we'll be calculating the t value using this correlation coefficient up here that we've already 40 00:02:29,833 --> 00:02:30,199 41 00:02:30,200 --> 00:02:35,833 run. So n is equal to the length of df. This is just returning how many rows of data are 42 00:02:35,833 --> 00:02:36,233 43 00:02:36,233 --> 00:02:42,299 in our data frame. len is short for length. Our t value is calculated 44 00:02:42,300 --> 00:02:47,200 using this formula, which uses the correlation coefficient, and next we're 45 00:02:47,200 --> 00:02:48,366 46 00:02:48,366 --> 00:02:54,466 going to print the results. 
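The steps just described can be sketched like this. Treat it as an illustration under assumptions: the data is synthetic, the column names RIV2 and RDV are taken from the narration, and the formula shown is the standard t statistic for a Pearson correlation, t = r * sqrt(n - 2) / sqrt(1 - r^2), which may or may not be written the same way in the notebook.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# Synthetic stand-in for the class data set (hypothetical values and column names).
rng = np.random.default_rng(0)
riv2 = rng.normal(size=200)
rdv = 0.3 * riv2 + rng.normal(size=200)
df = pd.DataFrame({"RIV2": riv2, "RDV": rdv})

# pearsonr gives back two values: the correlation coefficient and its p-value.
# Swapping the two arguments gives the same coefficient.
r, p_value = pearsonr(df["RIV2"], df["RDV"])

# n is the number of rows in the data frame.
n = len(df)

# Standard t statistic for a Pearson correlation, with n - 2 degrees of freedom.
t_value = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)

print(f"r = {r:.4f}, t = {t_value:.4f}, p = {p_value:.4g}")
```

The numbers printed here come from the made-up data, so they will not match the values shown in the video.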
Okay, 47 00:02:54,466 --> 00:02:54,499 48 00:02:54,500 --> 00:03:00,800 so we can see that we do have a moderate positive correlation between these two 49 00:03:00,800 --> 00:03:07,200 because we have a coefficient of 0.3246. So let's do this again for some of our other variables. 50 00:03:07,200 --> 00:03:07,733 51 00:03:07,733 --> 00:03:13,733 What about the relationship between RIV1 and two IVs? Remember, two IVs was created 52 00:03:13,733 --> 00:03:13,766 53 00:03:13,766 --> 00:03:19,999 by multiplying RIV1 and RIV2. So we should expect to see a pretty strong correlation 54 00:03:20,000 --> 00:03:26,333 here because the two are so highly related. Let's check out a visualization of this relationship by making a scatter 55 00:03:26,333 --> 00:03:26,366 56 00:03:26,366 --> 00:03:32,399 plot first. So just as before, we'll make our scatter plot. I'll go ahead and run this. And 57 00:03:32,400 --> 00:03:38,733 yeah, we can see now our line looks very different. It's almost completely diagonal 58 00:03:38,733 --> 00:03:40,266 59 00:03:40,266 --> 00:03:46,366 and our scatter - our points - are spread very closely around 60 00:03:46,366 --> 00:03:46,699 61 00:03:46,700 --> 00:03:52,900 the line. So up here, we can see that things were spread out quite a bit more. Our line isn't 62 00:03:52,900 --> 00:03:59,066 quite as diagonal and here we have a very different looking graph. 63 00:03:59,066 --> 00:03:59,599 64 00:03:59,600 --> 00:04:05,766 So it looks like a very strong correlation. Let's see if that's the case by running a Pearson correlation. 65 00:04:05,766 --> 00:04:06,532 66 00:04:06,533 --> 00:04:12,666 So just as above, we did all the same things, we just used two different variables this time. And we have a Pearson correlation 67 00:04:12,666 --> 00:04:12,699 68 00:04:12,700 --> 00:04:18,733 coefficient of 0.9303. That's pretty strong. 
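Here is a small sketch of that second check, a variable correlated with a product it helped create. Everything in it is an assumption for illustration: the data is synthetic, the predictors are assumed to take positive values centered well away from zero, and twoIVs is a hypothetical column name standing in for the product variable mentioned in the narration.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# Synthetic predictors (hypothetical values), both centered well above zero.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "RIV1": rng.normal(loc=5, scale=1, size=200),
    "RIV2": rng.normal(loc=5, scale=1, size=200),
})

# The product variable: each row is RIV1 times RIV2 ("twoIVs" is a made-up name).
df["twoIVs"] = df["RIV1"] * df["RIV2"]

# Correlate one of the original variables with the product built from it.
r, p_value = pearsonr(df["RIV1"], df["twoIVs"])
print(f"r = {r:.4f}, p = {p_value:.4g}")
```

The coefficient from this made-up data will differ from the 0.9303 in the video, since the strength depends on the actual values in the class data set.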
And we should expect it again because 69 00:04:18,733 --> 00:04:25,266 those two variables are so highly related since one is a transformation of the other. All right. 70 00:04:25,266 --> 00:04:25,766 71 00:04:25,766 --> 00:04:29,666 That's it for Pearson correlation coefficients. Happy coding! 72 00:04:29,666 --> 00:04:36,599