2 00:00:01,000 --> 00:00:07,700 Hello and welcome to a video on histograms. We use a histogram when we'd like to see the shape of our distribution. 4 00:00:08,100 --> 00:00:14,100 Histograms show frequency data, which means they show how many data points fall into set ranges called bins. 6 00:00:14,566 --> 00:00:20,966 Let's start by loading in some data. First we'll load our libraries. We have pandas as pd, and from google.colab 8 00:00:21,000 --> 00:00:27,166 we're importing drive. These are the two that we'll use to set our file path and import 9 00:00:27,166 --> 00:00:33,332 our Excel file as a DataFrame. I'm using NumPy as np to generate some 11 00:00:33,366 --> 00:00:39,199 random numbers below for an example histogram, but you most likely won't use that in your homework, for your 13 00:00:39,433 --> 00:00:45,633 histogram at least. We're also importing matplotlib.pyplot as plt, and this is really 15 00:00:46,000 --> 00:00:52,133 important because it's what we're going to use to generate our histogram. Let's go ahead 17 00:00:52,166 --> 00:00:58,232 and run this block of code and get our variables all set in place. Now we have IV1, IV2, and DV available 18 00:00:58,233 --> 00:01:04,233 to us. Now what if we would like to make a histogram? Let's see what one looks like 19 00:01:04,233 --> 00:01:10,433 with some simple example data. Then we'll make a histogram for IV1. Remember, histograms show 20 00:01:10,433 --> 00:01:16,599 frequency data. So this section here is just to generate some random numbers. Let's 21 00:01:16,600 --> 00:01:22,866 actually set the number of values equal to 1,000.
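As a rough sketch of the random-number step described above (not the notebook's actual code): this assumes the example values come from a standard normal distribution, and the generator, seed, and variable names are illustrative choices. The Colab drive mount and Excel import are left out because the file path is not given in the video.

```python
import numpy as np

# Assumption: the example data is drawn from a standard normal distribution.
# The seed is arbitrary and only makes the sketch repeatable.
rng = np.random.default_rng(seed=0)
n_values = 1000
data = rng.normal(loc=0.0, scale=1.0, size=n_values)

# A histogram is just counts per bin. np.histogram returns those counts
# together with the bin edges (30 bins means 31 edges).
counts, edges = np.histogram(data, bins=30)
print("values counted:", counts.sum())
print("number of bin edges:", len(edges))
```

Every value lands in exactly one bin, so the counts always add up to the number of values generated.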
Next 23 00:01:22,900 --> 00:01:28,800 we have plt.hist, which is the function that you'll use to create histograms. 25 00:01:30,433 --> 00:01:36,433 We'll pass data as the first argument, set our number of bins equal to 30, and set the color to 27 00:01:36,466 --> 00:01:41,199 skyblue and the edge color to black. Next we'll add a title and axis labels. We're choosing 29 00:01:43,300 --> 00:01:49,500 to show the grid with this plot, but we could set this to False if we wanted to. And we'll display the plot with plt.show. 31 00:01:49,533 --> 00:01:55,466 Let's go ahead and run this code block. And you can see now we have 33 00:01:55,733 --> 00:02:01,933 a histogram of random data. What's interesting about generating random numbers 34 00:02:01,933 --> 00:02:07,966 from a normal distribution like this is that as you generate more and more numbers, your histogram will look more and more like a normal 35 00:02:07,966 --> 00:02:10,866 curve. So if I were to set this to 250, 37 00:02:16,500 --> 00:02:19,100 you can see that it looks a little less normal. 39 00:02:22,533 --> 00:02:25,366 If I were to set this to 100, we can see 41 00:02:28,900 --> 00:02:34,933 it looks even less normal. If we were to set this to 2000, 43 00:02:39,000 --> 00:02:44,366 now we can see it looks very normal. And that is what happens when you draw more random numbers from the same distribution. 45 00:02:45,366 --> 00:02:50,799 This is the law of large numbers at work, and it's a close cousin of the central limit theorem, which is something that you'll study very briefly in class.
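The plotting call walked through above could look roughly like this. It is a sketch under assumptions: the values are drawn from a standard normal distribution, and the title and axis label strings are placeholders rather than the ones used in the video.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=0)  # arbitrary seed for repeatability

# Repeat the same plot for the sample sizes tried in the video.
for n_values in (100, 250, 1000, 2000):
    data = rng.normal(size=n_values)  # assumption: normally distributed values

    plt.figure()
    # plt.hist returns the bin counts, the bin edges, and the drawn bars.
    counts, edges, patches = plt.hist(
        data, bins=30, color="skyblue", edgecolor="black"
    )
    plt.title(f"Histogram of {n_values} random values")  # placeholder title
    plt.xlabel("Value")       # placeholder label
    plt.ylabel("Frequency")   # placeholder label
    plt.grid(True)            # set to False to hide the grid

plt.show()
```

Running it should give four figures, and the larger samples tend to trace the bell shape more smoothly than the smaller ones.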
47 00:02:51,833 --> 00:02:58,133 But it's really cool to see an actual visual example of it working. So if you're curious 49 00:02:58,166 --> 00:03:04,532 about whether you can use a color other than skyblue, you can. You can check out the colors available on this website, 51 00:03:04,633 --> 00:03:10,866 or you can just Google "what colors can I use in Python" and it will come up with several different websites that will 53 00:03:10,900 --> 00:03:17,100 show you examples of colors you can use. I just like this one because it lists both 55 00:03:17,133 --> 00:03:22,999 the simple colors you can use and also the more involved ones. You can use things like coral 57 00:03:23,133 --> 00:03:28,499 or chocolate or greenyellow or whatever speaks to you. 59 00:03:32,233 --> 00:03:38,266 Okay, now let's take a look at some real data using IV1. So we're going to set data equal to df['IV1'] this 60 00:03:38,266 --> 00:03:44,099 time. You don't have to do it this way. We've got data right here in plt.hist, which refers to the 62 00:03:44,533 --> 00:03:50,533 data variable above. We could actually just pass df['IV1'] right here. But for 64 00:03:51,733 --> 00:03:57,899 the sake of consistency with the code above, we're going to leave that as data. 66 00:04:00,933 --> 00:04:07,033 The bins are set to 30. The color is skyblue and the edge color is black. Next we'll set 68 00:04:07,066 --> 00:04:12,566 our title, x-label, and y-label, set grid equal to True, and display the plot. 70 00:04:13,266 --> 00:04:19,299 Let's go. All right, that's pretty different.
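A minimal sketch of the IV1 plot described above. The course's Excel file is not available here, so this assumes a made-up stand-in DataFrame with a single IV1 column of 170 right-skewed values; the title and label strings are placeholders too.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Stand-in for the course data (assumption): 170 invented, right-skewed values.
rng = np.random.default_rng(seed=1)
df = pd.DataFrame({"IV1": rng.exponential(scale=10.0, size=170)})

# Keep the same variable name as the earlier example for consistency.
# Passing df["IV1"] straight into plt.hist would work just as well.
data = df["IV1"]

counts, edges, patches = plt.hist(
    data, bins=30, color="skyblue", edgecolor="black"
)
plt.title("Histogram of IV1")  # placeholder title
plt.xlabel("IV1")              # placeholder label
plt.ylabel("Frequency")        # placeholder label
plt.grid(True)
plt.show()
```

To try a different color, swap "skyblue" for another Matplotlib named color such as "coral" or "chocolate".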
We can see in the first example 71 00:04:19,300 --> 00:04:25,466 histogram that the data is very close to normally distributed, whereas in this second example it's more positively skewed. We've 72 00:04:25,466 --> 00:04:31,532 got some skewness to our data this time. Interesting. This tells us something about our variable that we did not 74 00:04:31,566 --> 00:04:37,599 know previously, and it may affect how we perform statistical analysis on it. It's always 75 00:04:37,600 --> 00:04:43,700 important to visualize your data before you actually do any kind of analysis. Now, what if we want our histogram to look 76 00:04:43,700 --> 00:04:47,933 a little different? We could set our bins equal to 20 and 78 00:04:49,866 --> 00:04:56,099 see how that affects the overall shape. I think I preferred 80 00:04:56,133 --> 00:04:59,499 30 because we 82 00:05:07,733 --> 00:05:13,733 can see that we have a drop-off here now. But that's just a matter of personal preference, 84 00:05:14,233 --> 00:05:20,466 so long as you have more than 10 or so bins, and you have fewer bins than you actually 85 00:05:20,466 --> 00:05:26,499 have data points in your distribution. So for example, we have 170 data points in the 86 00:05:26,500 --> 00:05:32,400 DataFrame that we have loaded in. We don't want 170 bins. I'll show you what that looks like, 88 00:05:32,933 --> 00:05:39,099 because then we're not actually binning our data. It's pretty much just showing each and every single value 89 00:05:39,100 --> 00:05:45,300 that we have. And this doesn't really look like a shape that we can use.
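To put some numbers on the skewness and bin-count discussion above, here is a small sketch. It assumes made-up, right-skewed stand-in values in place of the real IV1 column, and the summary fields are illustrative choices.

```python
import numpy as np
import pandas as pd

# Stand-in data (assumption): 170 invented, right-skewed values in place of IV1.
rng = np.random.default_rng(seed=1)
iv1 = pd.Series(rng.exponential(scale=10.0, size=170), name="IV1")

# A positive sample skewness means a longer tail on the right.
print(f"skewness: {iv1.skew():.2f}")

# Count how the same values spread across different numbers of bins.
summary = {}
for n_bins in (20, 30, 170):
    counts, edges = np.histogram(iv1, bins=n_bins)
    summary[n_bins] = {
        "total": int(counts.sum()),               # always the number of values
        "empty_bins": int((counts == 0).sum()),   # bins holding nothing
        "tallest_bin": int(counts.max()),         # largest single count
    }
    print(n_bins, summary[n_bins])
```

With as many bins as data points, most bins typically hold zero or one value, which is why that plot tends to look like a comb rather than a shape.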
91 00:05:45,733 --> 00:05:50,399 Whereas if we set it to something between 30 and 50, suddenly we 93 00:05:54,633 --> 00:06:00,566 can see the shape of our data much more clearly. So just be careful when you're setting your bins that you're 95 00:06:00,966 --> 00:06:07,332 using enough, but not too many. And if you're unsure whether you've got too many or too few, reach out to your instructor 96 00:06:07,333 --> 00:06:13,633 or, you know, talk to another student or a teacher and see if you can get some feedback. All right, 98 00:06:15,033 --> 00:06:18,033 that's it. Everybody have a great day.