Hello, and welcome to a module on bivariate regression in Python. First, let's start by loading in some data. Let's try out the ADD.xlsx data this time, just to shake things up.

First, we'll be importing pandas as pd and drive from google.colab. We'll also import sm from statsmodels.api, which we need today to do our regression, and matplotlib.pyplot so we can do some visualizations of our regression as well.

Next, we'll mount our drive and set our file path. Remember, we're using a different data set, so make sure your file path is set to something different than what it usually is, and make sure to save the file in your Google Drive so that it is available to read in through your file path. Then let's print the head of our data.

So here we have some variables in our data set; we'll go over a couple of the ones we're most concerned with. We've got gender, which is pretty self-explanatory; repeat, which is whether they have repeated a grade; IQ scores; GPA; a variable that I think stands for social problems; and drop out, which means whether they have dropped out. And finally ADDSC, which is ADD score. This is the score that a student received on an ADD or ADHD test.
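Here's a minimal sketch of those setup steps, assuming ADD.xlsx is saved somewhere in your Google Drive. The exact folder in the path below is just an example; point it at wherever you put the file.

```python
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
from google.colab import drive

# Mount Google Drive so the notebook can read files saved there
drive.mount('/content/drive')

# Example path -- adjust it to wherever you saved ADD.xlsx in your Drive
file_path = '/content/drive/MyDrive/ADD.xlsx'

# Read the Excel file into a data frame and look at the first few rows
df = pd.read_excel(file_path)
print(df.head())
```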
So what is bivariate regression? Bivariate means two variables, where one is the independent variable, the predictor, and one is the dependent variable, the outcome. We regress the two variables when we want to do things like make predictions. For example, if we had a data frame with number of hours studied and exam scores, we could make predictions about what a score would be given the number of hours studied. We can also use it to understand relationships; bivariate regression, in the above example, would help answer: does studying more actually lead to higher exam scores? The third way we can use it is to quantify impact: the regression equation tells us how much the dependent variable (exam score) changes for each unit increase in the independent variable (study hours).

So let's take a look at the variables in our data. What if we would like to predict GPA based on ADDSC? In other words, our X variable (independent) will be ADDSC, and our Y variable (dependent) will be GPA. In this data set, the ADDSC scores are ADHD test scores and GPA is for grades, so keep that in mind as we go through our analysis.

The first thing we'll want to do in our code is assign X and Y to variables. This just makes it easier for us to read the code later, rather than having data frame column references all over the place. X is the independent variable; Y is the dependent variable.

Next, we're going to add a constant for the intercept. You have to do this manually in Python. The constant is a point on the line of best fit: it is the value of the line where X is equal to zero. This line of code adds that constant to our X values and saves it in a new variable called x_const, which stands for X constant.

Next, we'll actually run our model. In this case we're using sm, which is the statsmodels module we imported above, and calling .OLS with Y and the X constant we created a few moments ago as the arguments. Then we'll use .fit() to actually fit the model to our data, save it in a variable called results, and print the results summary. Let's go ahead and do it.
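Putting those steps into code, a minimal sketch might look like this. I'm assuming the data frame is called df (as in the loading sketch above) and that the columns are named ADDSC and GPA exactly as they print in the head; check your own output for the exact spelling.

```python
# Independent (predictor) and dependent (outcome) variables
x = df['ADDSC']
y = df['GPA']

# statsmodels does not add the intercept automatically, so add a constant column to x
x_const = sm.add_constant(x)

# Build the ordinary least squares model, fit it, and print the full summary table
model = sm.OLS(y, x_const)
results = model.fit()
print(results.summary())
```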
Okay, so this is a lot of numbers. We're not going to go over every single number that's here, but I will highlight a few of them for you so that you have an idea of what these numbers are trying to tell you.

In our example summary output we can see that the P value is less than 0.001. This is P right here: it's reported as 0.000, which gives us strong evidence that we can reject the null hypothesis. This suggests that there is a relationship between ADDSC and GPA. If we want to know the strength of that relationship, we should look at the effect size, or R-squared value. The R-squared value is right here, the number in the top right: in this case R-squared is 0.378, which means 37.8% of the variance in GPA is explained by ADDSC. So let's talk about a few more of these numbers.

Next up we have the intercept, which is this number right here: the constant coefficient. The intercept in linear regression is the predicted value of Y when X is 0; it's where the line crosses the y-axis, kind of like the starting point of the prediction. For example, if you're predicting someone's salary based on experience, the intercept would be the predicted salary for someone with zero years of experience.

The standard error of the estimate tells us how much the actual data points differ from the predicted values on the regression line. It's like saying, on average, our predictions are off by this much. The standard error is right here.

The unstandardized regression coefficient tells you how much Y changes when X increases by one unit. It uses the original units of the variables, like dollars, hours, or inches, so you can directly see the real-world impact. This is our unstandardized regression coefficient, and we can see that it indicates a negative relationship between ADD scores and GPA. We'll check that out a little more later.

The P value is one you probably already know: it helps you measure the strength of the evidence against the null hypothesis, and we've already pointed out where the P value is in our output.

R-squared tells you how much of the variation in Y (the outcome) is explained by X (the predictor). It's a number between 0 and 1, and closer to 1 means the model fits the data better. If R-squared is 0.8, it means that 80% of the changes in Y can be explained by the model. And again, our R-squared value is right here.

Beta shows how strongly each variable affects the outcome, but it uses standardized units, so you can compare the importance of different predictors even if they're measured differently. If one variable has a beta of 0.7 and another has a beta of 0.2, the first one has a bigger impact on the outcome. Beta will not be produced automatically by your code, so it will not appear in your results summary.
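If you'd rather pull individual numbers out of the fitted results object instead of reading them off the summary table, and compute the standardized beta yourself since it isn't in the summary, one possible sketch is below. The variable names follow the snippets above; z-scoring both variables and refitting is just one common way to get a standardized coefficient.

```python
# Pull individual numbers out of the fitted results object
print(results.params)      # intercept and unstandardized slope
print(results.bse)         # standard errors of the coefficients
print(results.pvalues)     # p-values
print(results.rsquared)    # R-squared

# One common way to get a standardized (beta) coefficient:
# z-score both variables, then refit the same bivariate model
x_z = (x - x.mean()) / x.std()
y_z = (y - y.mean()) / y.std()
beta_results = sm.OLS(y_z, sm.add_constant(x_z)).fit()
print(beta_results.params)  # the slope here is the standardized beta
```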
Okay, we've talked about what these results mean, but now let's take a look at what these variables look like when they're actually plotted on a scatterplot.

We've done a scatterplot before, so I won't be going over this code too much; you can see that it's pretty much the same as what we've done before. Except now, we have a Y prediction computed using the X constant, and we plot X against that Y prediction. The data points are in blue, and we'll make the line of best fit red. (There's a sketch of this plotting code at the end of this module.)

And we can see that there is a negative relationship between these two variables, meaning as ADDSC scores go up, GPA goes down. This is probably not surprising if you're familiar with the literature on ADHD. We can also check this by doing a correlation of ADDSC and GPA, and we can indeed see that GPA goes down as ADDSC goes up.

Okay, this should be all you need to get started on your homework. Please let us know if there's anything we can do to help you understand it better. If we have tutors this semester, you can talk to the tutors, and if we don't have tutors, then you can talk to your instructor for help. All right, have a great day and have fun learning Python!
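For reference, here's a minimal sketch of the scatterplot and correlation steps from this module, reusing the variable names from the earlier snippets.

```python
# Predicted GPA values from the fitted model
y_pred = results.predict(x_const)

# Scatter the observed data in blue and draw the fitted line in red
plt.scatter(x, y, color='blue', label='Observed data')
plt.plot(x, y_pred, color='red', label='Line of best fit')
plt.xlabel('ADDSC')
plt.ylabel('GPA')
plt.title('GPA predicted from ADDSC')
plt.legend()
plt.show()

# Quick correlation check of the negative relationship
print(df['ADDSC'].corr(df['GPA']))
```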