1
00:00:00,000 --> 00:00:00,900


2
00:00:00,900 --> 00:00:07,266
Hello and welcome to a module on Pearson correlations. First, let's load in our data and the packages we'll be using today.

3
00:00:07,266 --> 00:00:07,732


4
00:00:07,733 --> 00:00:13,299
So we'll be importing pandas as pd, drive from google.colab, numpy as np,

5
00:00:13,300 --> 00:00:14,066


6
00:00:14,066 --> 00:00:20,266
matplotlib.pyplot as plt, and we'll be importing pearsonr from scipy.stats.

7
00:00:20,266 --> 00:00:21,032


8
00:00:21,033 --> 00:00:26,933
We'll be doing some scatter plots today, so that's why we've got matplotlib imported.

9
00:00:26,933 --> 00:00:28,233


10
00:00:28,233 --> 00:00:33,899
So let's go ahead and create our variables for the class data set. This should look very familiar.

11
00:00:33,900 --> 00:00:35,433


12
00:00:35,433 --> 00:00:41,566
Okay, and these are variables that we're used to seeing by now, and we know how these variables were created. So we're aware that

13
00:00:41,566 --> 00:00:47,599
there are relationships between IV1, IV2, DV, and the rest of these variables here. So let's

14
00:00:47,600 --> 00:00:54,033
investigate those relationships and see how closely related these variables actually are. Let's calculate

15
00:00:54,033 --> 00:00:54,066


16
00:00:54,066 --> 00:01:00,332
the Pearson correlation between RIV2 and RDV. In another video, we created a scatter plot between

17
00:01:00,333 --> 00:01:06,399
those two variables. Let's do that again and take a look. So we're going to create a scatter plot between

18
00:01:06,400 --> 00:01:06,633


19
00:01:06,633 --> 00:01:12,766
RIV2 and RDV, create our line of best fit, and then actually produce our

20
00:01:12,766 --> 00:01:19,032
scatter plot, label it, title it, and show it. If you need a review on how to do that,

21
00:01:19,033 --> 00:01:25,333
go visit the scatter plot video to see more. So looking at this

22
00:01:25,333 --> 00:01:25,366


23
00:01:25,366 --> 00:01:31,766
scatter plot, we can see that our line is not completely flat, and it's not negative,

24
00:01:31,766 --> 00:01:31,799


25
00:01:31,800 --> 00:01:38,133
so we have a positive correlation here, but it's maybe not a very strong positive correlation. We might

26
00:01:38,133 --> 00:01:44,233
guess that it's going to be moderate in size. I already know what it's going to be, so

27
00:01:44,233 --> 00:01:44,266


28
00:01:44,266 --> 00:01:50,466
my guess is a very good guess. But judging from our trend line, we might expect to see a moderate positive

29
00:01:50,466 --> 00:01:50,499


30
00:01:50,500 --> 00:01:56,933
correlation between these two variables. Let's check with a Pearson calculation to see if that is the case.
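
The scatter plot and line of best fit described above could be sketched roughly like this. The data here is synthetic stand-in data, not the class data set, and the names x, y, slope, and intercept are just illustrative assumptions.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical stand-in data; the class data set is not reproduced here.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(size=200)

# A degree-1 polyfit returns the slope and intercept of the least-squares line.
slope, intercept = np.polyfit(x, y, 1)

fig, ax = plt.subplots()
ax.scatter(x, y, alpha=0.6)
xs = np.sort(x)
ax.plot(xs, slope * xs + intercept, color="red")  # line of best fit
ax.set_xlabel("RIV2 (synthetic)")
ax.set_ylabel("RDV (synthetic)")
ax.set_title("Scatter plot with line of best fit")
plt.show()
```

A positive slope with a wide cloud of points around the line is what a weak-to-moderate positive correlation tends to look like.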

31
00:01:56,933 --> 00:01:57,299


32
00:01:57,300 --> 00:02:03,800
So here, we are giving Python two variables to put our function's

33
00:02:03,800 --> 00:02:04,266


34
00:02:04,266 --> 00:02:10,132
returns in because we know that we're going to get two returns

35
00:02:10,133 --> 00:02:10,333


36
00:02:10,333 --> 00:02:16,766
from this function. We're going to give it two arguments, RIV2 and RDV.

37
00:02:16,766 --> 00:02:22,466
It doesn't really matter what order you do them in, so long as they are both there.
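
As a rough sketch of that call: pearsonr from scipy.stats returns the correlation coefficient and a p-value, so we unpack it into two names. The arrays riv2 and rdv below are synthetic stand-ins (an assumption, not the class data), and the second call just shows that swapping the argument order gives the same coefficient.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical stand-in data for RIV2 and RDV.
rng = np.random.default_rng(1)
riv2 = rng.normal(size=200)
rdv = 0.5 * riv2 + rng.normal(size=200)

# pearsonr returns two values: the coefficient r and the p-value.
r, p = pearsonr(riv2, rdv)

# Pearson's r is symmetric, so the argument order does not change it.
r_swapped, p_swapped = pearsonr(rdv, riv2)

print(f"r = {r:.4f}, p = {p:.4g}")
print(f"swapped: r = {r_swapped:.4f}")
```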

38
00:02:22,466 --> 00:02:23,432


39
00:02:23,433 --> 00:02:29,833
Next, we'll be calculating the t value using this correlation coefficient up here that we've already

40
00:02:29,833 --> 00:02:30,199


41
00:02:30,200 --> 00:02:35,833
run. So n is equal to the length of df. This just returns how many rows are

42
00:02:35,833 --> 00:02:36,233


43
00:02:36,233 --> 00:02:42,299
in our data frame. len is short for length. Our t value is calculated

44
00:02:42,300 --> 00:02:47,200
using this formula, which uses the correlation coefficient, and next we're

45
00:02:47,200 --> 00:02:48,366


46
00:02:48,366 --> 00:02:54,466
going to print the results. Okay,

47
00:02:54,466 --> 00:02:54,499


48
00:02:54,500 --> 00:03:00,800
so we can see that we do have a moderate positive correlation between these two

49
00:03:00,800 --> 00:03:07,200
because we have a coefficient of 0.3246. So let's do this again for some of our other variables.
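
The formula itself isn't read out in the video, so here is a sketch that assumes the standard t statistic for a Pearson correlation, t = r * sqrt(n - 2) / sqrt(1 - r^2). The helper name t_from_r and the sample size n = 100 are hypothetical; only r = 0.3246 comes from the video.

```python
import math

def t_from_r(r, n):
    """t statistic for a Pearson r from n paired observations (df = n - 2)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

r = 0.3246  # coefficient reported in the video
n = 100     # hypothetical sample size; in the notebook this would be len(df)
t = t_from_r(r, n)
print(f"t = {t:.4f} with {n - 2} degrees of freedom")
```

For the same r, a larger n gives a larger t value, which is why the sample size is part of the formula.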

50
00:03:07,200 --> 00:03:07,733


51
00:03:07,733 --> 00:03:13,733
What about the relationship between RIV1 and two IVs? Remember, two IVs was created

52
00:03:13,733 --> 00:03:13,766


53
00:03:13,766 --> 00:03:19,999
by multiplying RIV1 and RIV2. So we should expect to see a pretty strong correlation

54
00:03:20,000 --> 00:03:26,333
here because the two are so highly related. Let's check out a visualization of this relationship by making a scatter

55
00:03:26,333 --> 00:03:26,366


56
00:03:26,366 --> 00:03:32,399
plot first. So just as before, we'll make our scatter plot. I'll go ahead and run this. And

57
00:03:32,400 --> 00:03:38,733
yeah, we can see now our line looks very different. It's almost completely diagonal

58
00:03:38,733 --> 00:03:40,266


59
00:03:40,266 --> 00:03:46,366
and our points are clustered very closely around

60
00:03:46,366 --> 00:03:46,699


61
00:03:46,700 --> 00:03:52,900
the line. So up here, we can see that things were spread out quite a bit more. Our line isn't

62
00:03:52,900 --> 00:03:59,066
quite as diagonal and here we have a very different looking graph.

63
00:03:59,066 --> 00:03:59,599


64
00:03:59,600 --> 00:04:05,766
So it looks like a very strong correlation. Let's see if that's the case by running a Pearson correlation.

65
00:04:05,766 --> 00:04:06,532


66
00:04:06,533 --> 00:04:12,666
So just as above, we did all the same things, we just used two different variables this time. And we have a Pearson correlation

67
00:04:12,666 --> 00:04:12,699


68
00:04:12,700 --> 00:04:18,733
coefficient of 0.9303. That's pretty strong. And we should expect it again because

69
00:04:18,733 --> 00:04:25,266
those two variables are so highly related since one is a transformation of the other. All right.
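
To see why a product variable tends to correlate with the variables it was built from, here is a small sketch on synthetic data. The arrays riv1, riv2, and two_ivs are assumptions made up for illustration, so the coefficient it prints will not match the one from the class data set.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical positive-valued stand-ins for RIV1 and RIV2.
rng = np.random.default_rng(2)
riv1 = rng.uniform(1, 10, size=500)
riv2 = rng.uniform(1, 10, size=500)

# Build the product variable the same way the video describes.
two_ivs = riv1 * riv2

r, p = pearsonr(riv1, two_ivs)
print(f"r = {r:.4f}")
```

How strong the correlation comes out depends on how the two original variables are distributed.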

70
00:04:25,266 --> 00:04:25,766


71
00:04:25,766 --> 00:04:29,666
That's it for Pearson correlation coefficients. Happy coding!

72
00:04:29,666 --> 00:04:36,599