1 00:00:00,133 --> 00:00:00,599 2 00:00:00,600 --> 00:00:07,200 Hello and welcome to a video on chi-square analysis. Chi-square is used to test whether there is a significant 3 00:00:07,200 --> 00:00:13,233 association between two categorical variables. In simpler terms, a chi-square test helps you to 4 00:00:13,233 --> 00:00:13,266 5 00:00:13,266 --> 00:00:19,299 figure out if the distribution of one variable differs from what you'd expect based on the distribution of 6 00:00:19,300 --> 00:00:19,333 7 00:00:19,333 --> 00:00:25,833 another variable. So let's start by loading our packages, loading in some data, and producing some categorical variables. 8 00:00:25,833 --> 00:00:28,433 9 00:00:28,433 --> 00:00:34,533 All right, as usual, we have our packages that we're loading in here. Today we'll be loading in statistics and also the 10 00:00:34,533 --> 00:00:34,566 11 00:00:34,566 --> 00:00:40,132 chi2_contingency function from scipy.stats. 12 00:00:40,133 --> 00:00:41,266 13 00:00:41,266 --> 00:00:47,332 Next we're mounting our drive, setting our file path, and reading in our data. This should look 14 00:00:47,333 --> 00:00:47,366 15 00:00:47,366 --> 00:00:53,499 very familiar by now. So we've got our variables here to play with. Let's go ahead and collapse a few of them. 16 00:00:53,500 --> 00:00:53,666 17 00:00:53,666 --> 00:00:59,666 Now, there is an entire video on collapsing variables. If you need a refresher on that, I would suggest revisiting it. I'm 18 00:00:59,666 --> 00:00:59,699 19 00:00:59,700 --> 00:01:05,733 not going to sit here and explain all of the ins and outs of how to do this. But just know 20 00:01:05,733 --> 00:01:06,066 21 00:01:06,066 --> 00:01:12,066 that we have computed the mean for two different variables here, R_IV1 and R_DV.
And 22 00:01:12,066 --> 00:01:12,099 23 00:01:12,100 --> 00:01:18,433 then we're going to use np.where to determine if we are above or below 24 00:01:18,433 --> 00:01:24,499 the mean and use zeros and ones to denote that. So if we're below the mean, there will be a zero there, 25 00:01:24,500 --> 00:01:24,533 26 00:01:24,533 --> 00:01:30,633 and if we're above the mean, there will be a one there. Okay, let's go ahead and run that code. 27 00:01:30,633 --> 00:01:31,399 28 00:01:31,400 --> 00:01:37,566 You can see now we've got two new variables. We've got D_R_IV1, and this is actually a column of ones 29 00:01:37,566 --> 00:01:43,766 and zeros. It's just that the first five rows are all ones. And then D_R_DV, which 30 00:01:43,766 --> 00:01:43,799 31 00:01:43,800 --> 00:01:49,933 is now a column of zeros and ones. So now we 32 00:01:49,933 --> 00:01:56,266 have two categorical variables, and we know that they're categorical because each has two categories, 33 00:01:56,266 --> 00:01:56,399 34 00:01:56,400 --> 00:02:02,366 zero or one, that a value could fall into. We could have more categories than this. 35 00:02:02,366 --> 00:02:02,732 36 00:02:02,733 --> 00:02:09,233 There are ways to split your data differently where you may decide, oh, I need three or four categories for my data. 37 00:02:09,233 --> 00:02:09,533 38 00:02:09,533 --> 00:02:15,033 Perhaps you have data for employment and you have unemployed, employed, 39 00:02:15,033 --> 00:02:15,633 40 00:02:15,633 --> 00:02:21,933 part-time employed, retired, that kind of thing. So you've got four or five, maybe even six categories. 41 00:02:21,933 --> 00:02:22,233 42 00:02:22,233 --> 00:02:28,233 And in that case, you would use a different method to break that down into different 43 00:02:28,233 --> 00:02:32,733 categories. But for today, we're just going to keep it simple and use two.
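The collapsing step described above can be sketched roughly as follows. This is only an illustration under assumptions: the small DataFrame is made-up stand-in data (the video reads its data from a file), and the column names R_IV1, R_DV, D_R_IV1, and D_R_DV are taken from how the variables are spoken in the video, so the exact spelling in the notebook may differ.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data; the video loads its data from a file instead.
df = pd.DataFrame({
    "R_IV1": [8, 9, 7, 9, 8, 2, 3, 1, 2, 3],
    "R_DV": [1, 2, 3, 9, 8, 1, 2, 8, 9, 7],
})

# The mean of each variable is the cut point between "below" and "above".
iv1_mean = df["R_IV1"].mean()
dv_mean = df["R_DV"].mean()

# np.where(condition, value_if_true, value_if_false):
# 1 where the value is above the mean, 0 otherwise.
df["D_R_IV1"] = np.where(df["R_IV1"] > iv1_mean, 1, 0)
df["D_R_DV"] = np.where(df["R_DV"] > dv_mean, 1, 0)

print(df.head())
```

In this made-up data the first five rows of D_R_IV1 happen to be ones, which mirrors what the video shows; the column still contains zeros further down.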
44 00:02:32,733 --> 00:02:34,533 45 00:02:34,533 --> 00:02:41,066 So now that we have two categorical variables, we can make a contingency table. Contingency tables are special. 46 00:02:41,066 --> 00:02:41,766 47 00:02:41,766 --> 00:02:47,832 This one is a two-by-two contingency table, and what it does is set up your data so that you can see the 48 00:02:47,833 --> 00:02:53,533 relationship between two variables better, based on whether they are yes or 49 00:02:53,533 --> 00:02:53,899 50 00:02:53,900 --> 00:03:00,366 no, based on what category they're in, essentially. So we're going to use pd.crosstab, 51 00:03:00,366 --> 00:03:05,599 which makes a cross-tabulation of whatever variables you put into it. 52 00:03:05,600 --> 00:03:06,933 53 00:03:06,933 --> 00:03:12,566 So here we've used our new D_R_IV1 and D_R_DV. 54 00:03:12,566 --> 00:03:13,232 55 00:03:13,233 --> 00:03:19,699 And we're going to save that in a variable called contingency table. 56 00:03:19,700 --> 00:03:20,633 57 00:03:20,633 --> 00:03:27,033 Once we've made the contingency table, we'll use its columns and index attributes to set the 58 00:03:27,033 --> 00:03:32,566 labels for the contingency table. Okay, let's go ahead and run that. 59 00:03:32,566 --> 00:03:37,132 60 00:03:37,133 --> 00:03:43,233 Okay, so now we have R_DV below 61 00:03:43,233 --> 00:03:43,266 62 00:03:43,266 --> 00:03:49,432 the mean, R_DV above the mean, R_IV1 below, and R_IV1 above, and they're set 63 00:03:49,433 --> 00:03:55,799 up like that. Now that we have our contingency table, it's finally time to do some statistics. 64 00:03:55,800 --> 00:03:56,500 65 00:03:56,500 --> 00:04:01,800 So we're going to be using the chi2_contingency function from scipy.stats, and in 66 00:04:01,800 --> 00:04:02,533 67 00:04:02,533 --> 00:04:08,299 it we're going to put the contingency table that we just made. Now you may notice we're returning 68 00:04:08,300 --> 00:04:08,600 69 00:04:08,600 --> 00:04:14,166 four different variables here.
And we're going to save them in chi-square, p, 70 00:04:14,166 --> 00:04:14,966 71 00:04:14,966 --> 00:04:20,132 dof for degrees of freedom, and expected for expected values. There we go. 72 00:04:20,133 --> 00:04:24,266 73 00:04:24,266 --> 00:04:29,899 Now we have our expected frequency table, our chi-square statistic, 74 00:04:29,900 --> 00:04:30,333 75 00:04:30,333 --> 00:04:36,433 and the p-value, which is actually a very small number. We can see this e-06 here. This means our 76 00:04:36,433 --> 00:04:42,566 decimal point is actually six places to the left. So we have a very small p-value, close to zero, and 77 00:04:42,566 --> 00:04:42,599 78 00:04:42,600 --> 00:04:48,633 one degree of freedom. Okay, 79 00:04:48,633 --> 00:04:54,933 so how are we able to declare four variables in one line? The answer is that the chi2_contingency 80 00:04:54,933 --> 00:04:54,966 81 00:04:54,966 --> 00:05:01,032 function returns four values in a set order. It always does. We can unpack variables like this from 82 00:05:01,033 --> 00:05:06,433 lists, arrays, or tuples. How do we know that chi2_contingency returns four values? 83 00:05:06,433 --> 00:05:07,166 84 00:05:07,166 --> 00:05:13,166 Well, for starters, we can check the official documentation for a function or package, usually easily found 85 00:05:13,166 --> 00:05:19,166 through Google. We can also use the built-in help function in Python, like so. So I've 86 00:05:19,166 --> 00:05:19,199 87 00:05:19,200 --> 00:05:25,166 got help, and then in the parentheses, you put the function that you're looking for help 88 00:05:25,166 --> 00:05:26,432 89 00:05:26,433 --> 00:05:32,433 with. Then you click run. And if we scroll all the way to the 90 00:05:32,433 --> 00:05:38,533 top, you can see it gives us some information about what a chi-square test is. Then 91 00:05:38,533 --> 00:05:38,566 92 00:05:38,566 --> 00:05:44,699 it tells us the parameters. So this is how you know what you need to put in the function.
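Putting the contingency table and the test together, a minimal sketch might look like the one below. The 0/1 data here is invented for illustration, and the variable names (contingency_table, chi2, p, dof, expected) and label strings are assumptions based on what is said in the video rather than a copy of the actual notebook. The last few lines also recompute the expected frequencies by hand from the row and column totals, just to show where those numbers come from.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical collapsed 0/1 columns (0 = below the mean, 1 = above the mean).
df = pd.DataFrame({
    "D_R_IV1": [0] * 30 + [1] * 30,
    "D_R_DV": [0] * 20 + [1] * 10 + [0] * 10 + [1] * 20,
})

# Cross-tabulate the two categorical variables into a 2x2 table of counts.
contingency_table = pd.crosstab(df["D_R_IV1"], df["D_R_DV"])

# Relabel rows (index) and columns so the table is easier to read.
contingency_table.index = ["R_IV1 below mean", "R_IV1 above mean"]
contingency_table.columns = ["R_DV below mean", "R_DV above mean"]
print(contingency_table)

# chi2_contingency returns four values in a fixed order, so they can be
# unpacked into four names on one line.
chi2, p, dof, expected = chi2_contingency(contingency_table)
print("chi-square:", chi2)
print("p-value:", p)
print("degrees of freedom:", dof)
print("expected frequencies:\n", expected)

# Expected count for each cell = row total * column total / grand total.
row_totals = contingency_table.sum(axis=1).to_numpy()
col_totals = contingency_table.sum(axis=0).to_numpy()
grand_total = contingency_table.to_numpy().sum()
manual_expected = np.outer(row_totals, col_totals) / grand_total
```

Running help(chi2_contingency) in the same session prints the function's docstring, including its Parameters and Returns sections, which is how the order of the four returned values can be confirmed.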
So if you're ever confused about what 93 00:05:44,700 --> 00:05:50,700 goes in the parentheses of your function and the tooltips aren't helping you, you can go to help and you can see that, oh, 94 00:05:50,700 --> 00:05:55,500 this one has two optional arguments, but the 95 00:05:55,500 --> 00:05:56,766 96 00:05:56,766 --> 00:06:01,699 only required argument is the contingency table. Next you can 97 00:06:01,700 --> 00:06:03,733 98 00:06:03,733 --> 00:06:08,533 see that there is a Returns section here. This describes what the function returns. 99 00:06:08,533 --> 00:06:09,866 100 00:06:09,866 --> 00:06:15,999 So this one returns statistic, which is the chi-square statistic; p-value, which is the 101 00:06:16,000 --> 00:06:16,033 102 00:06:16,033 --> 00:06:22,333 p-value of the test; dof, which is the degrees of freedom; and the expected frequencies, 103 00:06:22,333 --> 00:06:22,599 104 00:06:22,600 --> 00:06:28,266 which are computed from the marginal totals. All right, 105 00:06:28,266 --> 00:06:29,032 106 00:06:29,033 --> 00:06:35,033 hopefully this is enough to help you do your homework. If you're having trouble with this, I recommend 107 00:06:35,033 --> 00:06:35,066 108 00:06:35,066 --> 00:06:40,299 going back again to the collapsing variables video and reviewing that a bit, 109 00:06:40,300 --> 00:06:41,600 110 00:06:41,600 --> 00:06:46,666 and also just taking some time to make sure that your contingency table is set up correctly. 111 00:06:46,666 --> 00:06:48,232 112 00:06:48,233 --> 00:06:52,466 All right, have a wonderful day and have fun coding! 113 00:06:52,466 --> 00:06:57,199