1
00:00:00,733 --> 00:00:02,233

2
00:00:02,233 --> 00:00:07,499
Hello, and welcome to a video on sub-setting in Python. What is sub-setting?

3
00:00:07,500 --> 00:00:08,300

4
00:00:08,300 --> 00:00:14,700
Sub-setting is the process of selecting a smaller part of a larger data set based on specific criteria.

5
00:00:14,700 --> 00:00:14,933

6
00:00:14,933 --> 00:00:21,099
Think of it like picking out only the red apples from a big basket of mixed fruit. In data analysis,

7
00:00:21,100 --> 00:00:21,133

8
00:00:21,133 --> 00:00:27,233
sub-setting allows us to focus on specific rows or columns that meet certain conditions, making it easier to

9
00:00:27,233 --> 00:00:33,533
analyze relevant information. For example, if you have a data set of students' test scores, you might subset

10
00:00:33,533 --> 00:00:33,566

11
00:00:33,566 --> 00:00:39,632
the data to look at only students who scored above 80. Let's do what we usually do and load in some data.

12
00:00:39,633 --> 00:00:40,599

13
00:00:40,600 --> 00:00:45,833
So we're going to import pandas as pd and, from google.colab, we're going to import drive.

14
00:00:45,833 --> 00:00:48,266

15
00:00:48,266 --> 00:00:54,499
So we're going to mount our drive, set our file path, read our file into a data frame, and make some

16
00:00:54,500 --> 00:00:54,666

17
00:00:54,666 --> 00:00:57,166
variables. So let's go ahead and do that.

18
00:00:57,166 --> 00:01:01,899

19
00:01:01,900 --> 00:01:08,200
So now we have some variables to work with. Let's use that example of student test scores and

20
00:01:08,200 --> 00:01:08,233

21
00:01:08,233 --> 00:01:14,599
pretend IV2 is a column of test scores. What if we only wanted the data where IV2 was over 80?

22
00:01:14,600 --> 00:01:15,766

23
00:01:15,766 --> 00:01:21,832
To do that, we're going to set a variable called scores over 80 and we'll put a data frame into

24
00:01:21,833 --> 00:01:21,866

25
00:01:21,866 --> 00:01:28,032
it.
So this is letting Python know I want to make a new data frame based on this data

26
00:01:28,033 --> 00:01:34,066
frame, where DF2 IV2 is greater than 80. And we're going to

27
00:01:34,066 --> 00:01:34,099

28
00:01:34,100 --> 00:01:36,300
call that new data frame scores over 80.

29
00:01:36,300 --> 00:01:40,433

30
00:01:40,433 --> 00:01:43,799
Okay, so now the last row is numbered 124.

31
00:01:43,800 --> 00:01:50,066

32
00:01:50,066 --> 00:01:54,132
And we can see that, from our original 170 rows, now we only have

33
00:01:54,133 --> 00:01:58,333

34
00:01:58,333 --> 00:02:04,433
125, because, remember, Python begins counting at zero. What if

35
00:02:04,433 --> 00:02:10,433
we want to select only certain columns? Let's say we're only interested in IV2, RIV1 and

36
00:02:10,433 --> 00:02:16,566
RIV2, where the score of IV2 is over 80. To do

37
00:02:16,566 --> 00:02:22,332
that, we do the same thing as before, except now we're going to place a comma after the 80.

38
00:02:22,333 --> 00:02:22,766

39
00:02:22,766 --> 00:02:28,766
I'm going to tell it, okay, but we only want these three columns. And

40
00:02:28,766 --> 00:02:34,866
we'll select them using brackets. We'll call this data set

41
00:02:34,866 --> 00:02:35,066

42
00:02:35,066 --> 00:02:41,599
subset. Okay, that's much more manageable.

43
00:02:41,600 --> 00:02:43,333

44
00:02:43,333 --> 00:02:49,099
In pandas, .loc is a way to access or filter data based on labels, that is, row and column names.

45
00:02:49,100 --> 00:02:49,566

46
00:02:49,566 --> 00:02:55,299
It is particularly useful when selecting rows and columns based on labels rather than numerical positions.

47
00:02:55,300 --> 00:02:56,733

48
00:02:56,733 --> 00:03:03,466
So you can see we've used .loc up here to select labels

49
00:03:03,466 --> 00:03:04,732

50
00:03:04,733 --> 00:03:05,633
and a position.

51
00:03:05,633 --> 00:03:12,966

52
00:03:12,966 --> 00:03:19,099
Okay, what if we have two conditions we'd like to meet?
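The single-condition filters described so far can be sketched as follows. This is only an illustration: the course file is not shown in the video transcript, so the small `df2` table and all of its values are made up, and the variable names `scores_over_80` and `subset` are written the way the spoken names would most plausibly be typed.

```python
import pandas as pd

# Hypothetical stand-in for the course data; the real file and its values are not shown.
df2 = pd.DataFrame({
    "IV1": [3, 7, 5, 9, 2],
    "IV2": [72, 85, 91, 80, 88],
    "RIV1": [8, 10, 6, 7, 12],
    "RIV2": [1, 4, 2, 5, 3],
})

# Boolean filter: keep only the rows where IV2 is greater than 80 (all columns kept).
scores_over_80 = df2[df2["IV2"] > 80]

# .loc takes the row condition first, then a comma, then a list of column labels.
subset = df2.loc[df2["IV2"] > 80, ["IV2", "RIV1", "RIV2"]]

print(scores_over_80)
print(subset)
```

Filtering keeps the original row labels, so the numbers printed down the left side are the labels of the surviving rows. Calling `reset_index(drop=True)` on the result renumbers them from zero if that is wanted.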
Let's say we want IV2 over 80

53
00:03:19,100 --> 00:03:19,433

54
00:03:19,433 --> 00:03:26,133
and RIV1 less than 9. But we still only want the columns IV2, RIV1 and RIV2.

55
00:03:26,133 --> 00:03:26,633

56
00:03:26,633 --> 00:03:32,866
We can combine the last two techniques into one like so. So we're going to use DF.loc

57
00:03:32,866 --> 00:03:32,899

58
00:03:32,900 --> 00:03:38,933
again to make a data frame where IV2 is greater than 80 and

59
00:03:38,933 --> 00:03:45,066
(that's what that ampersand symbol is here) RIV1 is less than

60
00:03:45,066 --> 00:03:51,432
9. And in addition to that, we only want the variables IV2, RIV1,

61
00:03:51,433 --> 00:03:51,466

62
00:03:51,466 --> 00:03:57,566
and RIV2. Now we have a much smaller version

63
00:03:57,566 --> 00:04:03,866
of our data set that will actually print in its entirety in our window

64
00:04:03,866 --> 00:04:09,966
here. But not all data is

65
00:04:09,966 --> 00:04:09,999

66
00:04:10,000 --> 00:04:15,800
continuously measured, and some may not fit this pattern. What if we have categorical data that we'd like to subset?

67
00:04:15,800 --> 00:04:16,366

68
00:04:16,366 --> 00:04:22,399
Let's take a look at some sample data. So here we'll make some quick sample data. We'll have

69
00:04:22,400 --> 00:04:28,233
name, major, and score, and we'll have some names, a couple of different majors for these people,

70
00:04:28,233 --> 00:04:28,599

71
00:04:28,600 --> 00:04:33,200
and whatever scores they've gotten on, maybe, a test that we've recently had.

72
00:04:33,200 --> 00:04:36,400

73
00:04:36,400 --> 00:04:42,566
So we're going to subset only the psychology students. And to do this, we're going to say,

74
00:04:42,566 --> 00:04:42,599

75
00:04:42,600 --> 00:04:47,400
if major is equal to psychology, put it in the data frame.

76
00:04:47,400 --> 00:04:48,933

77
00:04:48,933 --> 00:04:53,533
And we're going to call that data frame psychology students, and then we'll print it out.
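The two filters walked through in this part, a two-condition `.loc` selection and an equality test on a categorical column, might look something like the sketch below. The numeric values are invented. In the sample table, only the names Alice, Bob, Charlie and Eve come from the video; the fifth student (`David`), the `History` major, and every score are hypothetical placeholders.

```python
import pandas as pd

# Hypothetical numeric data standing in for the course file.
df = pd.DataFrame({
    "IV2": [72, 85, 91, 80, 88],
    "RIV1": [8, 10, 6, 7, 12],
    "RIV2": [1, 4, 2, 5, 3],
})

# Each condition sits in its own parentheses; & combines them as an element-wise "and".
# After the comma comes the list of column labels to keep.
small = df.loc[(df["IV2"] > 80) & (df["RIV1"] < 9), ["IV2", "RIV1", "RIV2"]]
print(small)

# Hypothetical sample data: "David", "History" and all scores are made up.
students = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "major": ["Psychology", "Biology", "Psychology", "History", "Biology"],
    "score": [88, 92, 79, 85, 90],
})

# Equality test on a categorical column: keep rows whose major is exactly "Psychology".
psychology_students = students[students["major"] == "Psychology"]
print(psychology_students)
```

The parentheses around each condition matter: `&` binds more tightly than `>` and `<` in Python, so leaving them out raises an error or gives an unintended result.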
78
00:04:53,533 --> 00:04:58,166

79
00:04:58,166 --> 00:05:04,366
So now we can see that only Alice and Charlie are part of the psychology program in this group. What if we wanted both

80
00:05:04,366 --> 00:05:09,699
the psychology and biology students? To do that, we'll use .isin to

81
00:05:09,700 --> 00:05:10,766

82
00:05:10,766 --> 00:05:16,232
give Python a list and say: anything in this list, we want to know if

83
00:05:16,233 --> 00:05:17,433

84
00:05:17,433 --> 00:05:22,766
they're in that major. So we're selecting major from it,

85
00:05:22,766 --> 00:05:24,332

86
00:05:24,333 --> 00:05:30,699
but in addition saying .isin this list of things that we're passing through.

87
00:05:30,700 --> 00:05:33,600

88
00:05:33,600 --> 00:05:39,700
And we'll call those science students. So now we have Alice, Bob,

89
00:05:39,700 --> 00:05:45,766
Charlie, and Eve in our subset. Alright, that's pretty much all the information

90
00:05:45,766 --> 00:05:45,799

91
00:05:45,800 --> 00:05:49,033
you need about subsets to do the homework. So have fun coding.
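As a rough sketch of the `.isin` step described above: the table below is a hypothetical reconstruction of the sample data. The names Alice, Bob, Charlie and Eve come from the video, while `David`, the `History` major, and all of the scores are invented for illustration.

```python
import pandas as pd

# Hypothetical sample data: "David", "History" and all scores are made up.
students = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "major": ["Psychology", "Biology", "Psychology", "History", "Biology"],
    "score": [88, 92, 79, 85, 90],
})

# .isin returns True for every row whose major appears in the list passed to it.
science_students = students[students["major"].isin(["Psychology", "Biology"])]
print(science_students)
```

Putting `~` in front of the condition, as in `students[~students["major"].isin(["Psychology", "Biology"])]`, flips it and keeps everyone who is not in the list.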