1 00:00:00,000 --> 00:00:01,133 2 00:00:01,133 --> 00:00:07,333 Welcome to a module on Performing Calculations in Python. Today we'll be discussing how to find a sum of squares in Python, 3 00:00:07,333 --> 00:00:13,399 several different ways. First, let's load in some data. As usual, I'm importing our libraries at 4 00:00:13,400 --> 00:00:20,200 the top that we'll be using and included is numpy as np because we will be using numpy today to perform some calculations. 5 00:00:20,200 --> 00:00:21,733 6 00:00:21,733 --> 00:00:28,199 This next bit we've seen before a couple of times now, but I will go over it. Filepath is set to the filepath 7 00:00:28,200 --> 00:00:28,233 8 00:00:28,233 --> 00:00:34,299 where classdataS20 is kept. pd.read_excel will read that filepath into a data frame 9 00:00:34,300 --> 00:00:39,333 called df, and then I'd like to print the head of df so I can see what variables I'm working with. 10 00:00:39,333 --> 00:00:40,766 11 00:00:40,766 --> 00:00:46,832 We're pretty familiar with this data set by now. So let's talk about the formula we need to 12 00:00:46,833 --> 00:00:46,866 13 00:00:46,866 --> 00:00:52,899 find the sum of squares. It is the sum, over all values of x, of x minus xbar 14 00:00:52,900 --> 00:00:52,933 15 00:00:52,933 --> 00:00:59,133 raised to the power of 2. To do this, we first need to know our 16 00:00:59,133 --> 00:01:05,266 x value. Then we need to find our xbar. Then we'll subtract xbar from 17 00:01:05,266 --> 00:01:05,299 18 00:01:05,300 --> 00:01:11,600 x. Then we'll raise it to the power of 2. And then we'll find the sum of all those values. 19 00:01:11,600 --> 00:01:12,100 20 00:01:12,100 --> 00:01:18,433 So we've got a couple of different steps here that we're doing. Let's find our x value first or decide 21 00:01:18,433 --> 00:01:24,533 what our x value is. I'm going to say let's find the sum of squares for 22 00:01:24,533 --> 00:01:24,566 23 00:01:24,566 --> 00:01:29,399 DV in our class data set. 
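Written out, the formula described in words above looks roughly like the following. The index i and the count n are notation assumed here for illustration; the video only names x and xbar.

```latex
SS = \sum_{i=1}^{n} \left( x_i - \bar{x} \right)^2,
\qquad \text{where} \qquad
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
```

Read from the inside out, this matches the order of the steps: take each x, subtract xbar, square the result, then add everything up.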
Now that we know x, we need to find xbar. Let's use numpy 24 00:01:29,400 --> 00:01:30,800 25 00:01:30,800 --> 00:01:35,666 to find the mean because we're using other numpy functions. First find 26 00:01:35,666 --> 00:01:37,132 27 00:01:37,133 --> 00:01:43,166 the mean. So the mean is equal to np.mean. So you're calling 28 00:01:43,166 --> 00:01:43,199 29 00:01:43,200 --> 00:01:49,366 a function here. The mean function from np. And we want to perform that 30 00:01:49,366 --> 00:01:49,399 31 00:01:49,400 --> 00:01:55,533 mean function on our data frame with dv. So df open brackets 32 00:01:55,533 --> 00:01:56,499 33 00:01:56,500 --> 00:02:02,233 quotation dv end quotation end brackets and parentheses. 34 00:02:02,233 --> 00:02:03,066 35 00:02:03,066 --> 00:02:09,099 Okay. Now that we've got our mean, let's 36 00:02:09,100 --> 00:02:09,133 37 00:02:09,133 --> 00:02:15,166 find a sum of squares. So remember we said that our next step would be to subtract the mean from 38 00:02:15,166 --> 00:02:21,232 all values of dv. So that's exactly what we're doing here. And we're going to save it in a variable called variance. So all you have to 39 00:02:21,233 --> 00:02:21,866 40 00:02:21,866 --> 00:02:28,132 do to do this is say I'd like to take dv from this data frame and subtract 41 00:02:28,133 --> 00:02:28,599 42 00:02:28,600 --> 00:02:34,666 mean from it. Done. At this point, if you'd like to check your work, you can 43 00:02:34,666 --> 00:02:34,699 44 00:02:34,700 --> 00:02:41,000 do it by summing up all the values of variance. So we'll do that here and put it in a variable called 45 00:02:41,000 --> 00:02:47,100 variance sum, which is equal to np.sum of variance. And here, 46 00:02:47,100 --> 00:02:47,133 47 00:02:47,133 --> 00:02:53,499 variance is the variable we named up here. So that's why we're using just variance and we don't have to refer to it as df['variance'] 48 00:02:53,500 --> 00:02:59,900 or any other thing because it's not in the data frame. So we'll print variance sum. 
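The steps so far can be sketched as below. This is a minimal sketch, not the notebook from the video: the small data frame is a made-up stand-in for the class data set, and the lowercase column name "dv" and the name variance_sum are assumptions based on how the names are spoken.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the class data set (the real file is not shown here).
df = pd.DataFrame({"dv": [4.0, 7.0, 5.0, 8.0, 6.0]})

# Find xbar: the mean of the dv column.
mean = np.mean(df["dv"])

# Subtract the mean from every value of dv.
# The video names this "variance"; strictly, these are deviations from the mean.
variance = df["dv"] - mean

# Check the work: the deviations should add up to (approximately) zero.
variance_sum = np.sum(variance)
print(variance_sum)
```

With real data the check usually prints a tiny number in scientific notation rather than an exact zero, because of floating-point rounding.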
49 00:02:59,900 --> 00:02:59,933 50 00:02:59,933 --> 00:03:05,399 Next, we'll square that variance. So variance squared 51 00:03:05,400 --> 00:03:07,100 52 00:03:07,100 --> 00:03:13,733 is equal to variance raised to the power of two. 53 00:03:13,733 --> 00:03:14,066 54 00:03:14,066 --> 00:03:19,599 And remember in our first video, we talked about how to raise things to the power of another number in Python. 55 00:03:19,600 --> 00:03:20,100 56 00:03:20,100 --> 00:03:25,400 This is a little reminder that this is how you do that. Use the double asterisk. And then last, 57 00:03:25,400 --> 00:03:27,133 58 00:03:27,133 --> 00:03:33,166 we sum up all those values of variance squared. So sum of squares -- and this one I'm calling one -- will be equal to 59 00:03:33,166 --> 00:03:33,199 60 00:03:33,200 --> 00:03:39,566 the sum of variance squared. Now we'll print it down 61 00:03:39,566 --> 00:03:44,799 here. The little f tells it that I'd like to put a title in front of it and 62 00:03:44,800 --> 00:03:46,133 63 00:03:46,133 --> 00:03:52,299 then call sum of squares one as our printed 64 00:03:52,300 --> 00:03:52,333 65 00:03:52,333 --> 00:03:58,399 value. But you don't have to do that. You can also just print sum of 66 00:03:58,400 --> 00:03:58,433 67 00:03:58,433 --> 00:04:03,533 squares one or sum of squares. If that's what you named it. Okay, 68 00:04:03,533 --> 00:04:04,499 69 00:04:04,500 --> 00:04:10,600 so here we can see we've got what looks like an integer that's very close to -- 70 00:04:10,600 --> 00:04:16,800 not an integer but a float -- that's very close to one. But actually because of this E minus 71 00:04:16,800 --> 00:04:16,833 72 00:04:16,833 --> 00:04:22,133 12, this is actually scientific notation and this is negligibly close to zero. 73 00:04:22,133 --> 00:04:24,233 74 00:04:24,233 --> 00:04:30,399 So that is what you're expecting to see when you sum up x minus x bar. 
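The squaring and summing steps just described might look like this. Again this is a sketch on made-up stand-in data; the names variance_squared and sum_of_squares_1 and the "Sum of squares" label are assumed spellings of what is said aloud.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the class data set.
df = pd.DataFrame({"dv": [4.0, 7.0, 5.0, 8.0, 6.0]})

mean = np.mean(df["dv"])
variance = df["dv"] - mean  # deviations from the mean

# Square each deviation with the double asterisk.
variance_squared = variance ** 2

# Add up the squared deviations to get the sum of squares.
sum_of_squares_1 = np.sum(variance_squared)

# The leading f makes this an f-string, so the value in braces is
# filled in after the label text.
print(f"Sum of squares: {sum_of_squares_1}")
```

Printing sum_of_squares_1 on its own, without the f-string label, gives the same number.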
You want the sum of your 75 00:04:30,400 --> 00:04:34,300 variance for x minus x bar to be very close to zero. 76 00:04:34,300 --> 00:04:38,433 77 00:04:38,433 --> 00:04:45,466 Next, we have our sum of squares, which is just over 66,347. 78 00:04:45,466 --> 00:04:46,999 79 00:04:47,000 --> 00:04:53,066 Always leave this number unrounded. So here's another way to do that in fewer 80 00:04:53,066 --> 00:04:53,099 81 00:04:53,100 --> 00:04:57,600 lines. First, we still need to find our mean, 82 00:04:57,600 --> 00:04:59,100 83 00:04:59,100 --> 00:05:02,100 so this line here would still be necessary. 84 00:05:02,100 --> 00:05:08,033 85 00:05:08,033 --> 00:05:14,233 We're going to use mean down here. We're saying our dv minus the mean in parentheses 86 00:05:14,233 --> 00:05:14,933 87 00:05:14,933 --> 00:05:21,099 raised to the power of 2, wrapped in more parentheses and np.sum all the way 88 00:05:21,100 --> 00:05:21,133 89 00:05:21,133 --> 00:05:27,233 on the outside of those parentheses. If you feel like that makes more sense 90 00:05:27,233 --> 00:05:27,266 91 00:05:27,266 --> 00:05:32,266 to you, then by all means use this method. It will return the same result 92 00:05:32,266 --> 00:05:34,099 93 00:05:34,100 --> 00:05:35,166 as the one we've got up 94 00:05:35,166 --> 00:05:43,932 95 00:05:43,933 --> 00:05:49,966 here. Okay. So using numpy, we could actually accomplish this all in one line. But you 96 00:05:49,966 --> 00:05:49,999 97 00:05:50,000 --> 00:05:56,066 would have to be pretty familiar with Python and pretty familiar with numpy, and pretty comfortable 98 00:05:56,066 --> 00:05:56,599 99 00:05:56,600 --> 00:06:02,666 with nesting parentheses to get this to work for you. By all means, if you would like to 100 00:06:02,666 --> 00:06:02,699 101 00:06:02,700 --> 00:06:08,700 do it all in one line, give it a go. And definitely try out all three methods and see which one works best for 102 00:06:08,700 --> 00:06:14,466 you. 
And you can see here we get the same result as above.
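For comparison, here is a sketch of the two shorter versions described above, run on the same made-up stand-in data. The names sum_of_squares_2 and sum_of_squares_3 are hypothetical; the video does not spell out what these variables are called.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the class data set.
df = pd.DataFrame({"dv": [4.0, 7.0, 5.0, 8.0, 6.0]})

# Fewer lines: find the mean first, then subtract, square, and sum in one expression.
mean = np.mean(df["dv"])
sum_of_squares_2 = np.sum((df["dv"] - mean) ** 2)

# One line: the mean is computed inside the nested parentheses.
sum_of_squares_3 = np.sum((df["dv"] - np.mean(df["dv"])) ** 2)

print(sum_of_squares_2, sum_of_squares_3)
```

Both expressions do the same arithmetic as the step-by-step version, so they should agree with it; the tradeoff is only in how easy the nested parentheses are to read.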