1
00:00:00,133 --> 00:00:00,599


2
00:00:00,600 --> 00:00:07,200
Hello and welcome to a video on chi-square analysis. Chi-square is used to test whether there is a significant

3
00:00:07,200 --> 00:00:13,233
association between two categorical variables. In simpler terms, a chi-square test helps you to

4
00:00:13,233 --> 00:00:13,266


5
00:00:13,266 --> 00:00:19,299
figure out if the distribution of one variable differs from what you'd expect based on the distribution of

6
00:00:19,300 --> 00:00:19,333


7
00:00:19,333 --> 00:00:25,833
another variable. So let's start by loading our packages, loading in some data, and producing some categorical variables.

8
00:00:25,833 --> 00:00:28,433


9
00:00:28,433 --> 00:00:34,533
All right, as usual, we have our packages that we're loading in here. Today we'll be loading in statistics and also the

10
00:00:34,533 --> 00:00:34,566


11
00:00:34,566 --> 00:00:40,132
chi2_contingency function from scipy.stats.

12
00:00:40,133 --> 00:00:41,266


13
00:00:41,266 --> 00:00:47,332
Next we're mounting our drive, setting our file path, reading in our data. This should look

14
00:00:47,333 --> 00:00:47,366


15
00:00:47,366 --> 00:00:53,499
very familiar by now. So we've got our variables here to play with. Let's go ahead and collapse a few of them.

16
00:00:53,500 --> 00:00:53,666


17
00:00:53,666 --> 00:00:59,666
Now there is an entire video on collapsing variables. If you need a refresher on that, I would suggest revisiting it. I'm

18
00:00:59,666 --> 00:00:59,699


19
00:00:59,700 --> 00:01:05,733
not going to sit here and explain all of the ins and outs on how to do this. But just know

20
00:01:05,733 --> 00:01:06,066


21
00:01:06,066 --> 00:01:12,066
that we have set our mean for two different variables here, R_IV1 and R_DV. And

22
00:01:12,066 --> 00:01:12,099


23
00:01:12,100 --> 00:01:18,433
then we're going to use np.where to determine if we are above or below

24
00:01:18,433 --> 00:01:24,499
the mean and use zeros and ones to denote that. So if we're below the mean, there will be a zero there

25
00:01:24,500 --> 00:01:24,533


26
00:01:24,533 --> 00:01:30,633
and if we're above the mean, there will be a one there. Okay, let's go ahead and run that code.
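A minimal sketch of the collapsing step just described. The data values here are made up for illustration (the video reads its own file), and the column names R_IV1 and R_DV are assumed from the names shown on screen.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data; the video loads its own dataset instead.
df = pd.DataFrame({
    "R_IV1": [5.0, 4.0, 4.5, 1.0, 2.0, 1.5],
    "R_DV": [1.0, 2.0, 5.0, 4.0, 1.5, 4.5],
})

# Compute the mean of each variable.
mean_iv1 = df["R_IV1"].mean()
mean_dv = df["R_DV"].mean()

# 1 where the value is above the mean, 0 otherwise (at or below the mean).
df["D_R_IV1"] = np.where(df["R_IV1"] > mean_iv1, 1, 0)
df["D_R_DV"] = np.where(df["R_DV"] > mean_dv, 1, 0)

print(df)
```

Note that a value exactly equal to the mean ends up in the 0 group with this condition; use >= if you want it in the 1 group.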

27
00:01:30,633 --> 00:01:31,399


28
00:01:31,400 --> 00:01:37,566
You can see now we've got two new variables. We've got D_R_IV1, and this is actually a column of ones

29
00:01:37,566 --> 00:01:43,766
and zeros. It's just that for the first five, there are ones. And then D_R_DV, which

30
00:01:43,766 --> 00:01:43,799


31
00:01:43,800 --> 00:01:49,933
is now a column of zeros and ones. So now we

32
00:01:49,933 --> 00:01:56,266
have two categorical variables and we know that they're categorical because we have two categories,

33
00:01:56,266 --> 00:01:56,399


34
00:01:56,400 --> 00:02:02,366
zero or one, that our variable could fall into. We could have more categories than this.

35
00:02:02,366 --> 00:02:02,732


36
00:02:02,733 --> 00:02:09,233
There are ways to split your data differently where you may decide, oh, I need three or four categories for my data.

37
00:02:09,233 --> 00:02:09,533


38
00:02:09,533 --> 00:02:15,033
Perhaps you have data for employment and you have unemployed, employed,

39
00:02:15,033 --> 00:02:15,633


40
00:02:15,633 --> 00:02:21,933
part-time employed, retired, that kind of thing. So you've got four or five, maybe even six categories.

41
00:02:21,933 --> 00:02:22,233


42
00:02:22,233 --> 00:02:28,233
And in that case, you would use a different method to break that down into different

43
00:02:28,233 --> 00:02:32,733
categories. But for today, we're just going to keep it simple and use two.
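The video does not say which method it has in mind for more than two categories. One possible option, shown here purely as an assumption, is pandas' pd.cut, which splits a numeric column into labeled bins. The values, bin edges, and labels below are hypothetical.

```python
import pandas as pd

# Hypothetical numeric scores to split into three categories.
scores = pd.Series([1.0, 2.5, 3.0, 4.5, 5.0, 2.0])

# Bins are right-inclusive by default: (0, 2], (2, 4], (4, 5].
levels = pd.cut(scores, bins=[0, 2, 4, 5], labels=["low", "medium", "high"])

# Count how many rows landed in each category.
print(levels.value_counts())
```

A variable like employment status is already categorical, so it would not need cutting at all; it could go straight into a contingency table.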

44
00:02:32,733 --> 00:02:34,533


45
00:02:34,533 --> 00:02:41,066
So now that we have two categorical variables, we can make a contingency table. Contingency tables are special.

46
00:02:41,066 --> 00:02:41,766


47
00:02:41,766 --> 00:02:47,832
This one is a two-by-two contingency table, and what it does is set up your data so that you can see the

48
00:02:47,833 --> 00:02:53,533
relationship better between two variables based on whether they are yes or

49
00:02:53,533 --> 00:02:53,899


50
00:02:53,900 --> 00:03:00,366
no, based on what category they're in, essentially. So we're going to use pd.crosstab,

51
00:03:00,366 --> 00:03:05,599
which makes a cross-tabulation of whatever variables you put into it.

52
00:03:05,600 --> 00:03:06,933


53
00:03:06,933 --> 00:03:12,566
So here we've used our new D_R_IV1 and D_R_DV.

54
00:03:12,566 --> 00:03:13,232


55
00:03:13,233 --> 00:03:19,699
And we're going to save that in a variable called contingency_table.

56
00:03:19,700 --> 00:03:20,633


57
00:03:20,633 --> 00:03:27,033
Once we've made contingency_table, we'll use its columns and index attributes to set the

58
00:03:27,033 --> 00:03:32,566
labels for the contingency table. Okay, let's go ahead and run that.
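Here is a rough sketch of building and labeling a two-by-two table with pd.crosstab. The data is invented, and which variable goes on the rows versus the columns is an assumption; the label text is also only a guess at what appears on screen.

```python
import pandas as pd

# Hypothetical 0/1 columns standing in for the collapsed variables.
df = pd.DataFrame({
    "D_R_IV1": [1, 1, 1, 0, 0, 0, 1, 0],
    "D_R_DV": [1, 1, 0, 0, 0, 1, 1, 0],
})

# Rows come from the first argument, columns from the second.
contingency_table = pd.crosstab(df["D_R_IV1"], df["D_R_DV"])

# Categories are sorted, so 0 (below the mean) comes before 1 (above).
contingency_table.index = ["R_IV1 below", "R_IV1 above"]
contingency_table.columns = ["R_DV below", "R_DV above"]

print(contingency_table)
```

Because index and columns are attributes, you assign a list of labels to them rather than calling them like functions.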

59
00:03:32,566 --> 00:03:37,132


60
00:03:37,133 --> 00:03:43,233
Okay, so now we have R_DV below

61
00:03:43,233 --> 00:03:43,266


62
00:03:43,266 --> 00:03:49,432
the mean, R_DV above the mean, R_IV1 below, and R_IV1 above, and they're set

63
00:03:49,433 --> 00:03:55,799
up like that. Now that we have our contingency table, it's finally time to do some statistics.

64
00:03:55,800 --> 00:03:56,500


65
00:03:56,500 --> 00:04:01,800
So we're going to be using the chi2_contingency function from scipy.stats, and in

66
00:04:01,800 --> 00:04:02,533


67
00:04:02,533 --> 00:04:08,299
it we're going to put the contingency table that we just made. Now you may notice we're returning

68
00:04:08,300 --> 00:04:08,600


69
00:04:08,600 --> 00:04:14,166
four different variables here. And we're going to save them as chi-square, p,

70
00:04:14,166 --> 00:04:14,966


71
00:04:14,966 --> 00:04:20,132
dof for degrees of freedom, and expected for expected values. There we go.
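As a small sketch of that call, assuming a made-up two-by-two table of counts (the variable names chi2, p, dof, and expected are illustrative, not necessarily the ones used in the video):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical observed counts for a 2x2 contingency table.
observed = np.array([[30, 10],
                     [10, 30]])

# The result unpacks into four values, always in this order.
chi2, p, dof, expected = chi2_contingency(observed)

print(f"chi-square = {chi2:.3f}")
print(f"p = {p:.2e}")  # scientific notation, e.g. something like 1.23e-06
print(f"dof = {dof}")
print(expected)
```

For a table with one degree of freedom, chi2_contingency applies Yates' continuity correction by default; pass correction=False if you want the uncorrected statistic.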

72
00:04:20,133 --> 00:04:24,266


73
00:04:24,266 --> 00:04:29,899
Now we have our expected frequencies table. Our chi-square statistic,

74
00:04:29,900 --> 00:04:30,333


75
00:04:30,333 --> 00:04:36,433
the p-value, which is actually a very small number. We can see this e-06. This means our

76
00:04:36,433 --> 00:04:42,566
decimal point is actually six places to the left. So we have a very small value, close to 0, and

77
00:04:42,566 --> 00:04:42,599


78
00:04:42,600 --> 00:04:48,633
one degree of freedom. Okay,

79
00:04:48,633 --> 00:04:54,933
so how are we able to declare four variables in one line? The answer is that the chi2_contingency

80
00:04:54,933 --> 00:04:54,966


81
00:04:54,966 --> 00:05:01,032
function returns four values in a set order. It always does. We can also unpack values into variables from lists,

82
00:05:01,033 --> 00:05:06,433
arrays, or tuples. How do we know that chi2_contingency returns four values?

83
00:05:06,433 --> 00:05:07,166


84
00:05:07,166 --> 00:05:13,166
Well, for starters, we can check the official documentation for a function or package, usually easily found

85
00:05:13,166 --> 00:05:19,166
through Google. We can also use the built-in help function in Python, like so. So I've

86
00:05:19,166 --> 00:05:19,199


87
00:05:19,200 --> 00:05:25,166
got help. And then in the parentheses, you put the function that you're looking for help

88
00:05:25,166 --> 00:05:26,432


89
00:05:26,433 --> 00:05:32,433
with. Then you click run. And if we scroll all the way to the

90
00:05:32,433 --> 00:05:38,533
top, you can see it gives us some information about what a chi-square test is. Then

91
00:05:38,533 --> 00:05:38,566


92
00:05:38,566 --> 00:05:44,699
it tells us the parameters. So this is how you know what you need to put in the function. So if you're ever confused about what

93
00:05:44,700 --> 00:05:50,700
goes in the parentheses of your function and the tooltips aren't helping you, you can go to help and you can see that oh,

94
00:05:50,700 --> 00:05:55,500
this one has two optional arguments, but the

95
00:05:55,500 --> 00:05:56,766


96
00:05:56,766 --> 00:06:01,699
only requirement is that it has a contingency table. Next you can

97
00:06:01,700 --> 00:06:03,733


98
00:06:03,733 --> 00:06:08,533
see that there is a Returns section here. This lists what the function returns.
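A short sketch of looking up that documentation from inside Python. The help call is the built-in described above; reading the __doc__ attribute is an extra option not shown in the video.

```python
from scipy.stats import chi2_contingency

# Prints the full docstring, including the Parameters and Returns sections.
help(chi2_contingency)

# The same text is available as a plain string if you want to search it.
doc = chi2_contingency.__doc__
print("Parameters" in doc, "Returns" in doc)
```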

99
00:06:08,533 --> 00:06:09,866


100
00:06:09,866 --> 00:06:15,999
So this one returns statistic, which is the chi-square statistic, p-value, which is our

101
00:06:16,000 --> 00:06:16,033


102
00:06:16,033 --> 00:06:22,333
p-value of the test, dof, which is the degrees of freedom, and expected_freq,

103
00:06:22,333 --> 00:06:22,599


104
00:06:22,600 --> 00:06:28,266
which is the expected frequencies based on the marginal totals. All right,

105
00:06:28,266 --> 00:06:29,032


106
00:06:29,033 --> 00:06:35,033
hopefully this is enough to help you do your homework. If you're having trouble with this, I recommend

107
00:06:35,033 --> 00:06:35,066


108
00:06:35,066 --> 00:06:40,299
going back again to the collapsing variables video and reviewing that a bit,

109
00:06:40,300 --> 00:06:41,600


110
00:06:41,600 --> 00:06:46,666
and also just taking some time to make sure that your contingency table is set up correctly.
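One way to sanity-check a contingency table, sketched here with invented counts, is to work out the expected frequencies by hand from the marginal totals. Each expected count is the row total times the column total, divided by the grand total.

```python
import numpy as np

# Hypothetical observed counts for a 2x2 contingency table.
observed = np.array([[30, 10],
                     [10, 30]])

row_totals = observed.sum(axis=1)  # total for each row
col_totals = observed.sum(axis=0)  # total for each column
n = observed.sum()                 # grand total

# Expected count for each cell: row total * column total / grand total.
expected = np.outer(row_totals, col_totals) / n

print(expected)
```

If the grand total does not match the number of rows in your data, or a cell is missing, the table is probably not set up the way you intended.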

111
00:06:46,666 --> 00:06:48,232


112
00:06:48,233 --> 00:06:52,466
All right, have a wonderful day and have fun coding!

113
00:06:52,466 --> 00:06:57,199

