Hello and welcome to a video on collapsing variables. Let's start as usual by loading in our data and any packages we need to access. First we'll be importing pandas as pd, NumPy as np, and the statistics package for some stats that we're going to be calculating at the bottom. Next we'll mount our drive, set our file path, read in our data, and today we're going to create the same data that you created in assignment one. This may look a little different from semester to semester, but you'll recognize it as similar.

Let's go ahead and run that. Okay, so now we have all of our data created for us, all of our variables set and ready to go. We have several reasons to collapse a variable when working with data. We might collapse data that has too many categories, making it cumbersome to visualize or interpret. We might collapse data so that we can use certain types of statistical analyses on it. For example, we may want to perform a chi-square test on our data, but it is continuously measured. In that case, we would want to make it categorical. A categorical variable is one where the data can fall into one of several categories. One example may be job status: employed or unemployed. There are several ways we can collapse data.
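The setup described above can be sketched as follows. A minimal stand-in for the assignment data is used here: the real notebook mounts Google Drive and reads a file, and the column names R_IV1 and R_DV and the sample size of 170 are assumptions based on what the video mentions later.

```python
import numpy as np
import pandas as pd
import statistics  # used later in the video for some basic stats

rng = np.random.default_rng(42)  # seeded so the sketch is reproducible
df = pd.DataFrame({
    "R_IV1": rng.normal(30, 5, 170),   # pretend: speed-test scores in seconds
    "R_DV": rng.normal(100, 15, 170),  # pretend: an outcome measure
})
print(df.head())
```

In Colab the data would instead come from `pd.read_csv` on a mounted Drive path; the synthetic frame above just gives the later steps something to run against.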
Let's use NumPy to solve this problem, since we're already familiar with it. However, you should know that there are several other methods you could use that would produce the same results. We want to find all the values of R_IV1 that are above and below the mean. Let's pretend R_IV1 represents a score in seconds on a speed test, and we'd like to know who performed above the mean. Those above the mean will get a score of one, and those below will get a score of zero. First, we need to know the mean of R_IV1. We've already calculated a mean in a previous video, so if you need a refresher on this, go back to that. Okay, and we've got our mean set, ready to go.

Now we want to split R_IV1 into two categories using a new variable. We'll call it D_R_IV1, for dichotomous R_IV1. So D_R_IV1 is equal to np.where — this is the function we'll be using, and it's where from the NumPy library. Where R_IV1 is less than or equal to the mean, put a zero; otherwise, put a one. So what this is saying is: if a value is less than or equal to the mean, it should get a zero there, and if it's above the mean, it should get a one. Okay, let's go ahead and run that.
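The dichotomizing step just described can be sketched like this. The variable names D_R_IV1 and R_IV1 follow the video; the data itself is synthetic stand-in data, not the assignment file.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"R_IV1": rng.normal(30, 5, 170)})

mean_r_iv1 = df["R_IV1"].mean()

# np.where(condition, value_if_true, value_if_false):
# scores at or below the mean become 0, scores above it become 1
df["D_R_IV1"] = np.where(df["R_IV1"] <= mean_r_iv1, 0, 1)
print(df["D_R_IV1"].value_counts())
```

Note the boundary choice: values exactly equal to the mean land in the zero group, because the condition uses `<=` just as in the video.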
And you can see now we've got a column of ones and zeros here. This collapsed our variable into two categories: those above and those below the mean. Usually, collapsing the variable is just the first step. Let's collapse another variable and then create a contingency table using the two. So we'll do the same thing again, where we find the mean of R_DV. And then we'll create dichotomous R_DV the same way: where R_DV is less than or equal to the mean of R_DV, put a zero; otherwise, put a one. And then we'll print that. Great.

Now that we have two variables that are collapsed, let's make a two-by-two contingency table so we can perform some basic calculations. Okay. So what this code has done is create a contingency table, just naming it contingency_table — no need to get too creative here. We're using the pd.crosstab function from pandas. We're taking the DataFrame's D_R_IV1 and D_R_DV columns, and we're creating a little table from them. These are the cells where both had a zero, where one had a zero and the other had a one, where one had a one and the other had a zero, and where they both had ones.
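The second collapse and the crosstab step can be sketched together like this, again on synthetic stand-in data with the video's variable names assumed:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "R_IV1": rng.normal(30, 5, 170),
    "R_DV": rng.normal(100, 15, 170),
})

# Collapse both variables around their own means, as in the video
df["D_R_IV1"] = np.where(df["R_IV1"] <= df["R_IV1"].mean(), 0, 1)
df["D_R_DV"] = np.where(df["R_DV"] <= df["R_DV"].mean(), 0, 1)

# pd.crosstab counts how many rows fall into each (row value, column value) pair
contingency_table = pd.crosstab(df["D_R_IV1"], df["D_R_DV"])
print(contingency_table)
```

The result is a 2x2 table whose rows are the D_R_IV1 categories and whose columns are the D_R_DV categories.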
And to double-check our work, we can add up the sum of all those values to make sure we're not missing anything. We know we have 170 rows of data (remember, Python starts numbering rows at zero), so those four numbers should add up to 170. They do. Let's add some labels to our rows and columns for clarity. To do this, we'll set contingency_table.columns to name the columns, and then contingency_table.index to name the rows.

Okay, and now you can see above here, we had zero and zero — that's below the mean and below the mean — and there are 45 there. We have 14 where one variable was above the mean and the other below it, 43 where R_DV was below the mean and R_IV1 was above it, and 68 where they were both above the mean. All right, that's it for collapsing variables. Have fun coding.
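The labeling and sanity-check steps can be sketched as follows. The label wording is my own, not necessarily what the video shows on screen, and the data is again a synthetic stand-in:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "R_IV1": rng.normal(30, 5, 170),
    "R_DV": rng.normal(100, 15, 170),
})
df["D_R_IV1"] = np.where(df["R_IV1"] <= df["R_IV1"].mean(), 0, 1)
df["D_R_DV"] = np.where(df["R_DV"] <= df["R_DV"].mean(), 0, 1)
contingency_table = pd.crosstab(df["D_R_IV1"], df["D_R_DV"])

# Replace the 0/1 axis labels with readable names
contingency_table.index = ["R_IV1 below mean", "R_IV1 above mean"]
contingency_table.columns = ["R_DV below mean", "R_DV above mean"]
print(contingency_table)

# Sanity check: the four cells together should account for all 170 rows
print(contingency_table.to_numpy().sum())
```

Because `crosstab` counts every row exactly once, the cell total always equals the number of rows, which makes this an easy check that nothing was dropped.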