1
00:00:00,733 --> 00:00:02,233


2
00:00:02,233 --> 00:00:07,499
Hello, and welcome to a video on sub-setting in Python. What is sub-setting?

3
00:00:07,500 --> 00:00:08,300


4
00:00:08,300 --> 00:00:14,700
Sub-setting is the process of selecting a smaller part of a larger data set based on specific criteria.

5
00:00:14,700 --> 00:00:14,933


6
00:00:14,933 --> 00:00:21,099
Think of it like picking out only the red apples from a big basket of mixed fruit. In data analysis,

7
00:00:21,100 --> 00:00:21,133


8
00:00:21,133 --> 00:00:27,233
sub-setting allows us to focus on specific rows or columns that meet certain conditions, making it easier to

9
00:00:27,233 --> 00:00:33,533
analyze relevant information. For example, if you have a data set of students' test scores, you might subset

10
00:00:33,533 --> 00:00:33,566


11
00:00:33,566 --> 00:00:39,632
the data to look at only students who scored above 80. Let's do what we usually do and load in some data.

12
00:00:39,633 --> 00:00:40,599


13
00:00:40,600 --> 00:00:45,833
So we're going to import pandas as pd, and from google.colab we're going to import drive.

14
00:00:45,833 --> 00:00:48,266


15
00:00:48,266 --> 00:00:54,499
So we're going to mount our drive, set our file path, read our file into a data frame, and make some

16
00:00:54,500 --> 00:00:54,666


17
00:00:54,666 --> 00:00:57,166
variables. So let's go ahead and do that.
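The loading steps just described might look roughly like the sketch below. The function name load_data, the mount point default, and the CSV file type are assumptions for illustration; the video does not show the actual file path or format.

```python
import pandas as pd

try:
    # google.colab is only importable inside a Google Colab notebook.
    from google.colab import drive
    IN_COLAB = True
except ImportError:
    IN_COLAB = False


def load_data(file_path, mount_point="/content/drive"):
    """Mount Google Drive when running in Colab, then read a CSV file.

    load_data is a hypothetical helper; CSV is an assumed file type.
    """
    if IN_COLAB:
        drive.mount(mount_point)
    return pd.read_csv(file_path)


# Hypothetical usage inside Colab (the path is a placeholder, not the real one):
# df = load_data("/content/drive/MyDrive/my_data.csv")
```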

18
00:00:57,166 --> 00:01:01,899


19
00:01:01,900 --> 00:01:08,200
So now we have some variables to work with. Let's use that example of student test scores and

20
00:01:08,200 --> 00:01:08,233


21
00:01:08,233 --> 00:01:14,599
pretend IV2 is a column of test scores. What if we only wanted the data where IV2 was over 80?

22
00:01:14,600 --> 00:01:15,766


23
00:01:15,766 --> 00:01:21,832
To do that, we're going to set a variable called scores over 80 and we'll put a data frame into

24
00:01:21,833 --> 00:01:21,866


25
00:01:21,866 --> 00:01:28,032
it. So this is letting Python know I want to make a new data frame based on this data

26
00:01:28,033 --> 00:01:34,066
frame where DF2 IV2 is greater than 80. And we're going to

27
00:01:34,066 --> 00:01:34,099


28
00:01:34,100 --> 00:01:36,300
call that new data frame scores over 80.
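As a minimal sketch of the filter just described: the data below is made up, and only the column name IV2 and the variable name scores_over_80 follow the video.

```python
import pandas as pd

# Hypothetical stand-in data; the real file from the video is not shown.
df = pd.DataFrame({
    "IV2": [72, 85, 91, 80, 64, 88],
    "RIV1": [5, 12, 7, 9, 3, 8],
    "RIV2": [1.2, 3.4, 2.2, 0.9, 4.1, 2.8],
})

# The comparison builds a True/False Series with one value per row.
mask = df["IV2"] > 80

# Indexing the data frame with that mask keeps only the rows marked True.
scores_over_80 = df[mask]

print(scores_over_80)
print(len(scores_over_80))  # number of rows that passed the filter
```

The filtered data frame keeps its original row labels, so len(scores_over_80) is a more dependable row count than the last label printed on the left.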

29
00:01:36,300 --> 00:01:40,433


30
00:01:40,433 --> 00:01:43,799
Okay, so now we have 124 rows.

31
00:01:43,800 --> 00:01:50,066


32
00:01:50,066 --> 00:01:54,132
And we can see that, of our original 170 rows, we now only have

33
00:01:54,133 --> 00:01:58,333


34
00:01:58,333 --> 00:02:04,433
125, because remember, Python begins counting at zero. What if

35
00:02:04,433 --> 00:02:10,433
we want to select only certain columns? Let's say we're only interested in IV2, RIV1 and

36
00:02:10,433 --> 00:02:16,566
RIV2, where the score of IV2 is over 80. To do

37
00:02:16,566 --> 00:02:22,332
that, we do the same thing as before, except now we're going to place a comma after the 80.

38
00:02:22,333 --> 00:02:22,766


39
00:02:22,766 --> 00:02:28,766
I'm going to tell it, okay, but we only want these three columns. And

40
00:02:28,766 --> 00:02:34,866
we'll select them using brackets. We'll call this data set

41
00:02:34,866 --> 00:02:35,066


42
00:02:35,066 --> 00:02:41,599
subset. Okay, that's much more manageable.
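Here is a rough sketch of that row-and-column selection. The values and the extra IV1 column are invented for illustration; the names IV2, RIV1, RIV2, and subset come from the video.

```python
import pandas as pd

# Hypothetical stand-in data; IV1 is an assumed extra column that gets dropped.
df = pd.DataFrame({
    "IV1": [10, 20, 30, 40, 50, 60],
    "IV2": [72, 85, 91, 80, 64, 88],
    "RIV1": [5, 12, 7, 9, 3, 8],
    "RIV2": [1.2, 3.4, 2.2, 0.9, 4.1, 2.8],
})

# Before the comma: which rows to keep (a True/False condition).
# After the comma: which columns to keep (a list of column labels).
subset = df.loc[df["IV2"] > 80, ["IV2", "RIV1", "RIV2"]]

print(subset)
```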

43
00:02:41,600 --> 00:02:43,333


44
00:02:43,333 --> 00:02:49,099
In pandas, .loc is a way to access or filter data based on labels, that is, row and column names.

45
00:02:49,100 --> 00:02:49,566


46
00:02:49,566 --> 00:02:55,299
It is particularly useful when selecting rows and columns based on labels rather than numerical positions.

47
00:02:55,300 --> 00:02:56,733


48
00:02:56,733 --> 00:03:03,466
So you can see we've used .loc up here to select labels

49
00:03:03,466 --> 00:03:04,732


50
00:03:04,733 --> 00:03:05,633
and a position.

51
00:03:05,633 --> 00:03:12,966


52
00:03:12,966 --> 00:03:19,099
Okay, what if we have two conditions we'd like to meet? Let's say we want IV2 over 80

53
00:03:19,100 --> 00:03:19,433


54
00:03:19,433 --> 00:03:26,133
and RIV1 less than 9. But we still only want the columns IV2, RIV1 and RIV2.

55
00:03:26,133 --> 00:03:26,633


56
00:03:26,633 --> 00:03:32,866
We can combine the last two techniques into one like so. So we're going to use df.loc

57
00:03:32,866 --> 00:03:32,899


58
00:03:32,900 --> 00:03:38,933
again to make a data frame where IV2 is greater than 80 and,

59
00:03:38,933 --> 00:03:45,066
that's what that ampersand symbol is here, RIV1 is less than

60
00:03:45,066 --> 00:03:51,432
9. And in addition to that, we only want the variables IV2, RIV1,

61
00:03:51,433 --> 00:03:51,466


62
00:03:51,466 --> 00:03:57,566
and RIV2. Now we have a much smaller version

63
00:03:57,566 --> 00:04:03,866
of our data set that will actually print in its entirety in our window

64
00:04:03,866 --> 00:04:09,966
here. But not all data is

65
00:04:09,966 --> 00:04:09,999


66
00:04:10,000 --> 00:04:15,800
continuously measured and may not fit with this pattern. What if we have categorical data that we'd like to subset?
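Before moving on to categorical data, the two-condition selection described a moment ago can be sketched like this. The data is hypothetical; the thresholds (IV2 over 80, RIV1 under 9) and column names follow the video.

```python
import pandas as pd

# Hypothetical stand-in data; the real file from the video is not shown.
df = pd.DataFrame({
    "IV2": [72, 85, 91, 80, 64, 88],
    "RIV1": [5, 12, 7, 9, 3, 8],
    "RIV2": [1.2, 3.4, 2.2, 0.9, 4.1, 2.8],
})

# Each condition sits in its own parentheses because & binds more tightly
# than > and <; the ampersand keeps rows where both conditions are True.
subset = df.loc[(df["IV2"] > 80) & (df["RIV1"] < 9), ["IV2", "RIV1", "RIV2"]]

print(subset)
```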

67
00:04:15,800 --> 00:04:16,366


68
00:04:16,366 --> 00:04:22,399
Let's take a look at some sample data. So here we'll make some quick sample data. We'll have

69
00:04:22,400 --> 00:04:28,233
name, major, and score, and we'll have some names, a couple different majors for these people,

70
00:04:28,233 --> 00:04:28,599


71
00:04:28,600 --> 00:04:33,200
and whatever scores they've gotten on, maybe it's a test that we've recently had.

72
00:04:33,200 --> 00:04:36,400


73
00:04:36,400 --> 00:04:42,566
So we're going to subset only the psychology students. And to do this, we're going to say

74
00:04:42,566 --> 00:04:42,599


75
00:04:42,600 --> 00:04:47,400
if major is equal to psychology, put it in the data frame.

76
00:04:47,400 --> 00:04:48,933


77
00:04:48,933 --> 00:04:53,533
And we're going to call that data frame psychology students, and then we'll print it out.
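A minimal sketch of that categorical filter follows. The names Alice, Bob, Charlie, and Eve come from the video; David, the Math major, the column capitalization, and all scores are assumptions made up for this example.

```python
import pandas as pd

# Hypothetical sample data modeled on the video's name / major / score layout.
data = {
    "Name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "Major": ["Psychology", "Biology", "Psychology", "Math", "Biology"],
    "Score": [88, 92, 79, 85, 95],
}
df = pd.DataFrame(data)

# == compares every value in the Major column against the string,
# and the resulting True/False Series selects the matching rows.
psychology_students = df[df["Major"] == "Psychology"]

print(psychology_students)
```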

78
00:04:53,533 --> 00:04:58,166


79
00:04:58,166 --> 00:05:04,366
So now we can see that only Alice and Charlie are part of the psychology program in this group. What if we wanted both

80
00:05:04,366 --> 00:05:09,699
the psychology and biology students? To do that, we'll use .isin to

81
00:05:09,700 --> 00:05:10,766


82
00:05:10,766 --> 00:05:16,232
give Python a list and say, for anything in this list, we want to know if

83
00:05:16,233 --> 00:05:17,433


84
00:05:17,433 --> 00:05:22,766
they're in that major. So we're selecting major from it.

85
00:05:22,766 --> 00:05:24,332


86
00:05:24,333 --> 00:05:30,699
But in addition, saying .isin this list of things that we're passing through.
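The .isin step could be sketched as below, using the same kind of made-up sample data. David, the Math major, the column capitalization, and the scores are assumptions; science_students is the name used in the video.

```python
import pandas as pd

# Hypothetical sample data modeled on the video's name / major / score layout.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "Major": ["Psychology", "Biology", "Psychology", "Math", "Biology"],
    "Score": [88, 92, 79, 85, 95],
})

# .isin checks each Major value against the list and returns True
# when the value matches any item in it.
science_students = df[df["Major"].isin(["Psychology", "Biology"])]

print(science_students)
```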

87
00:05:30,700 --> 00:05:33,600


88
00:05:33,600 --> 00:05:39,700
And we'll call those science students. So now we have Alice, Bob,

89
00:05:39,700 --> 00:05:45,766
Charlie, and Eve in our subset. All right, that's pretty much all the information

90
00:05:45,766 --> 00:05:45,799


91
00:05:45,800 --> 00:05:49,033
you need about subsets to do the homework. So have fun coding.