Hello, and welcome to a module on bivariate regression in Python. First, let's start by loading in some data. Let's try out the ADD.xlsx data this time, just to shake things up.

First, we'll be importing pandas as pd and drive from google.colab. We'll also import sm from statsmodels.api, which we need today to do our regression, and matplotlib.pyplot so we can do some visualizations of our regression as well.

Next, we'll mount our drive and set our file path. Remember, we're using a different data set, so make sure your file path is set to something different than what it usually is, and make sure to save the file in your Google Drive so that it is available to read in through your file path. Then let's print the head of our data.

So here we have some variables in our data set; we'll go over a couple of the ones we're most concerned with. We've got gender, which is pretty self-explanatory; repeat, which is whether they have repeated a grade; IQ scores; GPA; a variable that I think stands for social problems; and drop out, which means whether they have dropped out. And finally ADDSC, which is ADD score. This is the score that a student received on an ADD or ADHD test.
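Here's a minimal sketch of those setup steps, assuming ADD.xlsx is saved somewhere in your Google Drive. The exact folder in the path below is just an example; point it at wherever you put the file.

```python
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
from google.colab import drive

# Mount Google Drive so the notebook can read files saved there
drive.mount('/content/drive')

# Example path -- adjust it to wherever you saved ADD.xlsx in your Drive
file_path = '/content/drive/MyDrive/ADD.xlsx'

# Read the Excel file into a data frame and look at the first few rows
df = pd.read_excel(file_path)
print(df.head())
```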
So what is bivariate regression? Bivariate means two variables, where one is the independent variable, the predictor, and one is the dependent variable, the outcome. We regress the two variables when we want to do things like make predictions. For example, if we had a data frame with number of hours studied and exam scores, we could make predictions about what a score would be given the number of hours studied. We can also use it to understand relationships; bivariate regression, in the above example, would help answer: does studying more actually lead to higher exam scores? The third way we can use it is to quantify impact: the regression equation tells us how much the dependent variable (exam score) changes for each unit increase in the independent variable (study hours).

So let's take a look at the variables in our data. What if we would like to predict GPA based on ADDSC? In other words, our X variable (independent) will be ADDSC, and our Y variable (dependent) will be GPA. In this data set, the ADDSC scores are ADHD test scores and GPA is for grades, so keep that in mind as we go through our analysis.

The first thing we'll want to do in our code is assign X and Y to variables. This just makes it easier for us to read the code later, rather than having data frame column references all over the place. X is the independent variable; Y is the dependent variable.

Next, we're going to add a constant for the intercept. You have to do this manually in Python. The constant is a point on the line of best fit: it is the value of the line where X is equal to zero. This line of code adds that constant to our X values and saves it in a new variable called x_const, which stands for X constant.

Next, we'll actually run our model. In this case we're using sm, which is the statsmodels module we imported above, and calling .OLS with Y and the X constant we created a few moments ago as the arguments. Then we'll use .fit() to actually fit the model to our data, save it in a variable called results, and print the results summary. Let's go ahead and do it.
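Putting those steps into code, a minimal sketch might look like this. I'm assuming the data frame is called df (as in the loading sketch above) and that the columns are named ADDSC and GPA exactly as they print in the head; check your own output for the exact spelling.

```python
# Independent (predictor) and dependent (outcome) variables
x = df['ADDSC']
y = df['GPA']

# statsmodels does not add the intercept automatically, so add a constant column to x
x_const = sm.add_constant(x)

# Build the ordinary least squares model, fit it, and print the full summary table
model = sm.OLS(y, x_const)
results = model.fit()
print(results.summary())
```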
Okay, so this is a lot of numbers. We're not going to go over every single number that's here, but I will highlight a few of them for you so that you have an idea of what these numbers are trying to tell you.

In our example summary output we can see that the P value is less than 0.001. This is P right here: it's reported as 0.000, which gives us strong evidence that we can reject the null hypothesis. This suggests that there is a relationship between ADDSC and GPA. If we want to know the strength of that relationship, we should look at the effect size, or R-squared value. The R-squared value is right here, the number in the top right: in this case R-squared is 0.378, which means 37.8% of the variance in GPA is explained by ADDSC. So let's talk about a few more of these numbers.

Next up we have the intercept, which is this number right here: the constant coefficient. The intercept in linear regression is the predicted value of Y when X is 0; it's where the line crosses the y-axis, kind of like the starting point of the prediction. For example, if you're predicting someone's salary based on experience, the intercept would be the predicted salary for someone with zero years of experience.

The standard error of the estimate tells us how much the actual data points differ from the predicted values on the regression line. It's like saying, on average, our predictions are off by this much. The standard error is right here.

The unstandardized regression coefficient tells you how much Y changes when X increases by one unit. It uses the original units of the variables, like dollars, hours, or inches, so you can directly see the real-world impact. This is our unstandardized regression coefficient, and we can see that it indicates a negative relationship between ADD scores and GPA. We'll check that out a little more later.

The P value is one you probably already know: it helps you measure the strength of the evidence against the null hypothesis, and we've already pointed out where the P value is in our output.

R-squared tells you how much of the variation in Y (the outcome) is explained by X (the predictor). It's a number between 0 and 1, and closer to 1 means the model fits the data better. If R-squared is 0.8, it means that 80% of the changes in Y can be explained by the model. And again, our R-squared value is right here.

Beta shows how strongly each variable affects the outcome, but it uses standardized units, so you can compare the importance of different predictors even if they're measured differently. If one variable has a beta of 0.7 and another has a beta of 0.2, the first one has a bigger impact on the outcome. Beta will not be produced automatically by your code, so it will not appear in your results summary.
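If you'd rather pull individual numbers out of the fitted results object instead of reading them off the summary table, and compute the standardized beta yourself since it isn't in the summary, one possible sketch is below. The variable names follow the snippets above; z-scoring both variables and refitting is just one common way to get a standardized coefficient.

```python
# Pull individual numbers out of the fitted results object
print(results.params)      # intercept and unstandardized slope
print(results.bse)         # standard errors of the coefficients
print(results.pvalues)     # p-values
print(results.rsquared)    # R-squared

# One common way to get a standardized (beta) coefficient:
# z-score both variables, then refit the same bivariate model
x_z = (x - x.mean()) / x.std()
y_z = (y - y.mean()) / y.std()
beta_results = sm.OLS(y_z, sm.add_constant(x_z)).fit()
print(beta_results.params)  # the slope here is the standardized beta
```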
Okay, we've talked about what these results mean, but now let's take a look at what these variables look like when they're actually plotted on a scatterplot.

We've done a scatterplot before, so I won't be going over this code too much; you can see that it's pretty much the same as what we've done before. Except now, we have a Y prediction computed using the X constant, and we plot X against that Y prediction. The data points are in blue, and we'll make the line of best fit red. (There's a sketch of this plotting code at the end of this module.)

And we can see that there is a negative relationship between these two variables, meaning as ADDSC scores go up, GPA goes down. This is probably not surprising if you're familiar with the literature on ADHD. We can also check this by doing a correlation of ADDSC and GPA, and we can indeed see that GPA goes down as ADDSC goes up.

Okay, this should be all you need to get started on your homework. Please let us know if there's anything we can do to help you understand it better. If we have tutors this semester, you can talk to the tutors, and if we don't have tutors, then you can talk to your instructor for help. All right, have a great day and have fun learning Python!
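For reference, here's a minimal sketch of the scatterplot and correlation steps from this module, reusing the variable names from the earlier snippets.

```python
# Predicted GPA values from the fitted model
y_pred = results.predict(x_const)

# Scatter the observed data in blue and draw the fitted line in red
plt.scatter(x, y, color='blue', label='Observed data')
plt.plot(x, y_pred, color='red', label='Line of best fit')
plt.xlabel('ADDSC')
plt.ylabel('GPA')
plt.title('GPA predicted from ADDSC')
plt.legend()
plt.show()

# Quick correlation check of the negative relationship
print(df['ADDSC'].corr(df['GPA']))
```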