Hello and welcome to a video on collapsing variables. Let's start as usual by loading in our data and any packages we need to access. First we'll be importing pandas as pd, NumPy as np, and the statistics package for some stats that we're going to be calculating at the bottom. Next we'll mount our drive, set our file path, read in our data, and today we're going to create the same data that you created in assignment one. This may look a little different from semester to semester, but you'll recognize it as similar.

Let's go ahead and run that. Okay, so now we have all of our data created for us, all of our variables set and ready to go. We have several reasons to collapse a variable when working with data. We might collapse data that has too many categories, making it cumbersome to visualize or interpret. We might collapse data so that we can use certain types of statistical analyses on it. For example, we may want to perform a chi-square test on our data, but it is continuously measured. In that case, we would want to make it categorical. A categorical variable is one where the data can fall into one of several categories. One example may be job status: employed or unemployed. There are several ways we can collapse data.
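The setup described above can be sketched as follows. A minimal stand-in for the assignment data is used here: the real notebook mounts Google Drive and reads a file, and the column names R_IV1 and R_DV and the sample size of 170 are assumptions based on what the video mentions later.

```python
import numpy as np
import pandas as pd
import statistics  # used later in the video for some basic stats

rng = np.random.default_rng(42)  # seeded so the sketch is reproducible
df = pd.DataFrame({
    "R_IV1": rng.normal(30, 5, 170),   # pretend: speed-test scores in seconds
    "R_DV": rng.normal(100, 15, 170),  # pretend: an outcome measure
})
print(df.head())
```

In Colab the data would instead come from `pd.read_csv` on a mounted Drive path; the synthetic frame above just gives the later steps something to run against.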
Let's use NumPy to solve this problem, since we're already familiar with it. However, you should know that there are several other methods you could use that would produce the same results. We want to find all the values of R_IV1 that are above and below the mean. Let's pretend R_IV1 represents a score in seconds on a speed test, and we'd like to know who performed above the mean. Those above the mean will get a score of one, and those below will get a score of zero. First, we need to know the mean of R_IV1. We've already calculated a mean in a previous video, so if you need a refresher on this, go back to that. Okay, and we've got our mean set, ready to go.

Now we want to split R_IV1 into two categories using a new variable. We'll call it D_R_IV1, for dichotomous R_IV1. So D_R_IV1 is equal to np.where — this is the function we'll be using, and it's where from the NumPy library. Where R_IV1 is less than or equal to the mean, put a zero; otherwise, put a one. So what this is saying is: if a value is less than or equal to the mean, it should get a zero there, and if it's above the mean, it should get a one. Okay, let's go ahead and run that.
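The dichotomizing step just described can be sketched like this. The variable names D_R_IV1 and R_IV1 follow the video; the data itself is synthetic stand-in data, not the assignment file.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"R_IV1": rng.normal(30, 5, 170)})

mean_r_iv1 = df["R_IV1"].mean()

# np.where(condition, value_if_true, value_if_false):
# scores at or below the mean become 0, scores above it become 1
df["D_R_IV1"] = np.where(df["R_IV1"] <= mean_r_iv1, 0, 1)
print(df["D_R_IV1"].value_counts())
```

Note the boundary choice: values exactly equal to the mean land in the zero group, because the condition uses `<=` just as in the video.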
And you can see now we've got a column of ones and zeros here. This collapsed our variable into two categories: those above and those below the mean. Usually, collapsing the variable is just the first step. Let's collapse another variable and then create a contingency table using the two. So we'll do the same thing again, where we find the mean of R_DV. And then we'll create dichotomous R_DV the same way: where R_DV is less than or equal to the mean of R_DV, put a zero; otherwise, put a one. And then we'll print that. Great.

Now that we have two variables that are collapsed, let's make a two-by-two contingency table so we can perform some basic calculations. Okay. So what this code has done is create a contingency table, just naming it contingency_table — no need to get too creative here. We're using the pd.crosstab function from pandas. We're taking the DataFrame's D_R_IV1 and D_R_DV columns, and we're creating a little table from them. These are the cells where both had a zero, where one had a zero and the other had a one, where one had a one and the other had a zero, and where they both had ones.
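The second collapse and the crosstab step can be sketched together like this, again on synthetic stand-in data with the video's variable names assumed:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "R_IV1": rng.normal(30, 5, 170),
    "R_DV": rng.normal(100, 15, 170),
})

# Collapse both variables around their own means, as in the video
df["D_R_IV1"] = np.where(df["R_IV1"] <= df["R_IV1"].mean(), 0, 1)
df["D_R_DV"] = np.where(df["R_DV"] <= df["R_DV"].mean(), 0, 1)

# pd.crosstab counts how many rows fall into each (row value, column value) pair
contingency_table = pd.crosstab(df["D_R_IV1"], df["D_R_DV"])
print(contingency_table)
```

The result is a 2x2 table whose rows are the D_R_IV1 categories and whose columns are the D_R_DV categories.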
And to double-check our work, we can add up the sum of all those values to make sure we're not missing anything. We know we have 170 rows of data (remember, Python starts numbering rows at zero), so those four numbers should add up to 170. They do. Let's add some labels to our rows and columns for clarity. To do this, we'll set contingency_table.columns to name the columns, and then contingency_table.index to name the rows.

Okay, and now you can see above here, we had zero and zero — that's below the mean and below the mean — and there are 45 there. We have 14 where one variable was above the mean and the other below it, 43 where R_DV was below the mean and R_IV1 was above it, and 68 where they were both above the mean. All right, that's it for collapsing variables. Have fun coding.
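The labeling and sanity-check steps can be sketched as follows. The label wording is my own, not necessarily what the video shows on screen, and the data is again a synthetic stand-in:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "R_IV1": rng.normal(30, 5, 170),
    "R_DV": rng.normal(100, 15, 170),
})
df["D_R_IV1"] = np.where(df["R_IV1"] <= df["R_IV1"].mean(), 0, 1)
df["D_R_DV"] = np.where(df["R_DV"] <= df["R_DV"].mean(), 0, 1)
contingency_table = pd.crosstab(df["D_R_IV1"], df["D_R_DV"])

# Replace the 0/1 axis labels with readable names
contingency_table.index = ["R_IV1 below mean", "R_IV1 above mean"]
contingency_table.columns = ["R_DV below mean", "R_DV above mean"]
print(contingency_table)

# Sanity check: the four cells together should account for all 170 rows
print(contingency_table.to_numpy().sum())
```

Because `crosstab` counts every row exactly once, the cell total always equals the number of rows, which makes this an easy check that nothing was dropped.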