1 00:00:00,000 --> 00:00:03,033 2 00:00:03,033 --> 00:00:09,366 Hello and welcome to a video on collapsing variables in Excel. Today we're going to be talking 3 00:00:09,366 --> 00:00:15,799 about how to take a variable and one that is continuously measured and collapse 4 00:00:15,800 --> 00:00:15,833 5 00:00:15,833 --> 00:00:21,933 it into one that is categorical. This is what we mean when we say collapsing of variable. So 6 00:00:21,933 --> 00:00:21,966 7 00:00:21,966 --> 00:00:28,032 first let's say we want to find the average of our column 8 00:00:28,033 --> 00:00:28,166 9 00:00:28,166 --> 00:00:34,299 dv and find only values that are 10 00:00:34,300 --> 00:00:40,533 above the average and for those values we want to put a one and for any other values we want 11 00:00:40,533 --> 00:00:40,566 12 00:00:40,566 --> 00:00:46,799 to put a zero. So the first thing that we'll do is find the average of dv and remember in Excel 13 00:00:46,800 --> 00:00:46,833 14 00:00:46,833 --> 00:00:53,266 that is equals average and then you select your entire column of data. 15 00:00:53,266 --> 00:00:57,366 16 00:00:57,366 --> 00:01:03,399 So here we've got 73.25662 and you can see we're just repeating the same step 17 00:01:03,400 --> 00:01:04,066 18 00:01:04,066 --> 00:01:10,266 that we've done right here so that you can see what the end result is before we get there. So we've 19 00:01:10,266 --> 00:01:16,299 got our mean of dv. Now we're going to create a variable and we're going to call it underscore. I'm 20 00:01:16,300 --> 00:01:19,566 going to call it dv2 since I've already made d_dv. 21 00:01:19,566 --> 00:01:23,499 22 00:01:23,500 --> 00:01:28,633 And now what we'd like to know is if 23 00:01:28,633 --> 00:01:32,199 24 00:01:32,200 --> 00:01:37,333 this cell is greater 25 00:01:37,333 --> 00:01:39,399 26 00:01:39,400 --> 00:01:45,666 than this cell. And we actually want to put 27 00:01:45,666 --> 00:01:51,732 dollar signs in front of f and five, because we want to make 28 00:01:51,733 --> 00:01:51,766 29 00:01:51,766 --> 00:01:58,032 this a constant. The value if that is true. Now remember we're 30 00:01:58,033 --> 00:02:04,133 saying if c2 is greater than f5, we want a 1 to be there because that means 31 00:02:04,133 --> 00:02:10,133 it's greater than the mean. Otherwise, and here I put a comma, and now it's our value if 32 00:02:10,133 --> 00:02:10,166 33 00:02:10,166 --> 00:02:15,699 false. We want a zero. Close parentheses and hit enter. 34 00:02:15,700 --> 00:02:17,000 35 00:02:17,000 --> 00:02:22,866 And you can see I've got a zero there in a double click, so that that populates the rest of the way down. 36 00:02:22,866 --> 00:02:23,166 37 00:02:23,166 --> 00:02:29,399 And you can see we've got our ones in zeros now. And here's that formula again, 38 00:02:29,400 --> 00:02:31,166 39 00:02:31,166 --> 00:02:37,332 just so you can look it over. But that is one way that we can collapse a variable 40 00:02:37,333 --> 00:02:37,366 41 00:02:37,366 --> 00:02:43,566 in Excel. But what if we have multiple categories? 42 00:02:43,566 --> 00:02:44,199 43 00:02:44,200 --> 00:02:50,500 We need to collapse from our variables. We can use a different formula for this. 44 00:02:50,500 --> 00:02:51,233 45 00:02:51,233 --> 00:02:57,433 And let's go over here. To this 2x2 contingency table that I've made, where 46 00:02:57,433 --> 00:03:03,599 we have 0, 1, low and high. Now the 0s and 1s are going to be using the 47 00:03:03,600 --> 00:03:03,633 48 00:03:03,633 --> 00:03:10,066 d_d_v2 that we just made. And the lows and highs are from IV2. 49 00:03:10,066 --> 00:03:10,466 50 00:03:10,466 --> 00:03:16,632 So what we would like to know is what if IV2 is below 51 00:03:16,633 --> 00:03:16,833 52 00:03:16,833 --> 00:03:22,866 the mean and d_d_dv = 0? Where 53 00:03:22,866 --> 00:03:29,032 does it equal? Where is 1, low and the other 1? Where is 1, high and the other 54 00:03:29,033 --> 00:03:29,066 55 00:03:29,066 --> 00:03:33,766 low? So basically I want to know. Is IV2 IV2, 56 00:03:33,766 --> 00:03:40,366 57 00:03:40,366 --> 00:03:47,299 dv? Okay, 58 00:03:47,300 --> 00:03:48,000 59 00:03:48,000 --> 00:03:50,233 let's see if I can say that again, so it makes more sense. 60 00:03:50,233 --> 00:03:54,099 61 00:03:54,100 --> 00:03:59,833 Where is IV2 below the mean? Where d_dv is also 0? 62 00:03:59,833 --> 00:04:00,799 63 00:04:00,800 --> 00:04:05,900 Where is IV2 below the mean? Where d_dv is 1? 64 00:04:05,900 --> 00:04:07,300 65 00:04:07,300 --> 00:04:13,500 Where is IV2 above the mean? Where d_dv is 0? And where is IV2 66 00:04:13,500 --> 00:04:19,633 above the mean? Where d_dv is equal to 1? Those are the questions that we're asking of this contingency 67 00:04:19,633 --> 00:04:19,666 68 00:04:19,666 --> 00:04:25,566 table. So we're going to have low. Hi. 69 00:04:25,566 --> 00:04:25,732 70 00:04:25,733 --> 00:04:31,799 Let me move this down a little. 0 71 00:04:31,800 --> 00:04:34,966 72 00:04:34,966 --> 00:04:40,966 and 1. Now in your data, this may be males and females. It may be yes, no. It may 73 00:04:40,966 --> 00:04:40,999 74 00:04:41,000 --> 00:04:47,000 be arrested or not arrested. It may be all kinds of other categories 75 00:04:47,000 --> 00:04:47,033 76 00:04:47,033 --> 00:04:53,166 that we could have here. But for this example, let's pretend that the zeros are males. The ones 77 00:04:53,166 --> 00:04:53,932 78 00:04:53,933 --> 00:04:59,933 are females. And we'll say that these IV2 is scores 79 00:04:59,933 --> 00:04:59,966 80 00:04:59,966 --> 00:05:06,099 on a test. Some sort of maybe like a-- what do these go up 81 00:05:06,100 --> 00:05:10,433 to? We'll say there's scores on a math 82 00:05:10,433 --> 00:05:16,466 83 00:05:16,466 --> 00:05:22,599 test. And we want to know how many 84 00:05:22,600 --> 00:05:22,633 85 00:05:22,633 --> 00:05:28,733 males scored low on the math test, how many females scored low on the math test, how many males scored high on 86 00:05:28,733 --> 00:05:34,799 the math test, and how many females scored high on the math test. That's what we would like to know. And we want that to widen up into 87 00:05:34,800 --> 00:05:40,733 four categories. So we can use equals count 88 00:05:40,733 --> 00:05:40,966 89 00:05:40,966 --> 00:05:47,299 ifs. We've got multiple conditions that we're trying to meet remember. We've got two different variables 90 00:05:47,300 --> 00:05:47,333 91 00:05:47,333 --> 00:05:53,333 that we're looking at here. So we want to know, let's open up this parentheses. It's going to give us 92 00:05:53,333 --> 00:05:53,366 93 00:05:53,366 --> 00:05:59,432 our first criteria range here. So criteria range means what is the range 94 00:05:59,433 --> 00:05:59,466 95 00:05:59,466 --> 00:06:05,732 of values you're looking at where we want to set a criteria for. So in this case, 96 00:06:05,733 --> 00:06:06,299 97 00:06:06,300 --> 00:06:08,700 we're going to select de_db. I'm going 98 00:06:08,700 --> 00:06:12,333 99 00:06:12,333 --> 00:06:16,066 to scroll back up to the top so you can see. Comet criteria 100 00:06:16,066 --> 00:06:19,166 101 00:06:19,166 --> 00:06:26,766 one is that de_db is equal to, and we've decided males are zero. So zero. 102 00:06:26,766 --> 00:06:28,566 103 00:06:28,566 --> 00:06:32,832 Comet criteria range two, IV2, or our test scores. And I've already 104 00:06:32,833 --> 00:06:34,699 105 00:06:34,700 --> 00:06:38,900 found the mean of IV2 right 106 00:06:38,900 --> 00:06:41,066 107 00:06:41,066 --> 00:06:45,699 here in F3. When we 108 00:06:45,700 --> 00:06:47,200 109 00:06:47,200 --> 00:06:53,433 want to use a cell when using count ifs, 110 00:06:53,433 --> 00:06:53,566 111 00:06:53,566 --> 00:06:59,666 we have to do it in a very particular way. You put quotation marks around your 112 00:06:59,666 --> 00:06:59,699 113 00:06:59,700 --> 00:07:05,933 less than or equal to sign. In this case, we're going to say, because we're looking for low test scores. 114 00:07:05,933 --> 00:07:12,166 We're going to say less than or equal to on the low one. Then we use the hamper sand symbol, and then 115 00:07:12,166 --> 00:07:18,466 you can select your box where your 116 00:07:18,466 --> 00:07:18,499 117 00:07:18,500 --> 00:07:24,500 mean is in this case. 118 00:07:24,500 --> 00:07:24,533 119 00:07:24,533 --> 00:07:30,999 And we don't need to use the dollar signs to make this a constant this time because 120 00:07:31,000 --> 00:07:37,166 we're using count ifs, and it looks at the entire column simultaneously. So we'll close our parenthesis here 121 00:07:37,166 --> 00:07:37,199 122 00:07:37,200 --> 00:07:43,300 and we'll hit enter and we've got 49. And we can see that's the same 123 00:07:43,300 --> 00:07:43,333 124 00:07:43,333 --> 00:07:49,099 number we had here. So we're doing good. Okay, so let's look at how many males 125 00:07:49,100 --> 00:07:49,600 126 00:07:49,600 --> 00:07:55,833 scored high on the test. And in this case, we're going to say equals count 127 00:07:55,833 --> 00:08:01,933 ifs. See if I can type today. We're 128 00:08:01,933 --> 00:08:01,966 129 00:08:01,966 --> 00:08:04,666 going to select the same criteria ranges. 130 00:08:04,666 --> 00:08:09,199 131 00:08:09,200 --> 00:08:15,000 Scroll back up to the top so you can see comma 132 00:08:15,000 --> 00:08:16,033 133 00:08:16,033 --> 00:08:21,633 zero because we're still on males. comma criteria range two is IV two because 134 00:08:21,633 --> 00:08:22,199 135 00:08:22,200 --> 00:08:24,866 those are the math test scores that we're looking at. 136 00:08:24,866 --> 00:08:30,099 137 00:08:30,100 --> 00:08:36,200 comma and 138 00:08:36,200 --> 00:08:36,233 139 00:08:36,233 --> 00:08:42,566 we've already used equals for our less than so we're not going to use equals again. We'll 140 00:08:42,566 --> 00:08:42,599 141 00:08:42,600 --> 00:08:49,600 just say greater than ampersand F3 close parentheses. 142 00:08:49,600 --> 00:08:50,833 143 00:08:50,833 --> 00:08:57,466 So these are the males who scored higher than the average on the math test. 144 00:08:57,466 --> 00:08:57,666 145 00:08:57,666 --> 00:09:03,232 Now let's do the same thing, but for females so we use equals count hips. Our criteria 146 00:09:03,233 --> 00:09:04,733 147 00:09:04,733 --> 00:09:08,799 range is d_d_b for the first one. We're 148 00:09:08,800 --> 00:09:11,433 149 00:09:11,433 --> 00:09:17,599 still except this time, we're 150 00:09:17,600 --> 00:09:19,066 151 00:09:19,066 --> 00:09:25,332 going to say we want this equal to 1. Because 0 is for males and 1 is for females. Okay, 152 00:09:25,333 --> 00:09:26,166 153 00:09:26,166 --> 00:09:29,232 our second criteria range is IV2. And we want 154 00:09:29,233 --> 00:09:39,233 155 00:09:39,233 --> 00:09:41,733 to know 156 00:09:41,733 --> 00:09:45,399 157 00:09:45,400 --> 00:09:51,700 who scored low because that's the, that's the box we're in right now. So open quotation 158 00:09:51,700 --> 00:09:57,766 marks less than or equal to quotation marks comma, oops, no 159 00:09:57,766 --> 00:10:04,132 ampersand F3. Close parentheses. 160 00:10:04,133 --> 00:10:05,333 161 00:10:05,333 --> 00:10:09,333 Okay, and we've got 14 females who scored lower on this math test. 162 00:10:09,333 --> 00:10:17,433 163 00:10:17,433 --> 00:10:23,266 Okay, and now we're going to have another one. We want to know how many females scored high on the math test. 164 00:10:23,266 --> 00:10:25,566 165 00:10:25,566 --> 00:10:29,566 So again, we're going to select our criteria range one. 166 00:10:29,566 --> 00:10:38,132 167 00:10:38,133 --> 00:10:43,166 One for females, comma, criteria range 168 00:10:43,166 --> 00:10:47,066 169 00:10:47,066 --> 00:10:53,399 two. And for this one, we're looking for greater than. So open quotation, greater than symbol, 170 00:10:53,400 --> 00:10:58,166 closed quotations, ampersand F3. Okay, 171 00:10:58,166 --> 00:11:01,099 172 00:11:01,100 --> 00:11:07,200 and we've got 68. So I know these are correct because these are the same numbers that I got up here. But 173 00:11:07,200 --> 00:11:07,333 174 00:11:07,333 --> 00:11:13,466 we can check these to make sure that we have correct answers and something that makes sense by 175 00:11:13,466 --> 00:11:13,499 176 00:11:13,500 --> 00:11:19,300 using some. So we're going to sum up our marginal totals. So you 177 00:11:19,300 --> 00:11:20,633 178 00:11:20,633 --> 00:11:26,633 can see we've got 88 and 82, but some these 179 00:11:26,633 --> 00:11:26,666 180 00:11:26,666 --> 00:11:28,832 marginal totals now. And I'm just 181 00:11:28,833 --> 00:11:33,466 182 00:11:33,466 --> 00:11:39,366 dragging this formula down. There we go. So if you click up in here, 183 00:11:39,366 --> 00:11:39,599 184 00:11:39,600 --> 00:11:45,633 you can see that this is the sum of 88 plus 82 equals 170. And 185 00:11:45,633 --> 00:11:45,666 186 00:11:45,666 --> 00:11:51,766 remember, we have labels so we don't count the first row. But if we scroll all the way to the bottom, we can 187 00:11:51,766 --> 00:11:57,932 see that we have 170 points of data. We have 171 rows. But that first row is labels. 188 00:11:57,933 --> 00:11:58,299 189 00:11:58,300 --> 00:12:04,500 To be of 170 points of data in our data frame observations. And we have 190 00:12:04,500 --> 00:12:05,166 191 00:12:05,166 --> 00:12:11,199 170 observations here. So we know that we've collapsed our variables correctly from our two categories for 192 00:12:11,200 --> 00:12:11,233 193 00:12:11,233 --> 00:12:17,399 our four categories rather. All right. That should 194 00:12:17,400 --> 00:12:23,833 be all you need to know to collapse variables in Excel. There are multiple different ways to do this. 195 00:12:23,833 --> 00:12:23,866 196 00:12:23,866 --> 00:12:29,999 So if you find another way that works better for you and gets you correct results, feel free 197 00:12:30,000 --> 00:12:30,033 198 00:12:30,033 --> 00:12:36,166 to use it. But that should be all you need. So have a great day and have 199 00:12:36,166 --> 00:12:37,566 fun learning Excel.