1
00:00:00,000 --> 00:00:00,866

2
00:00:00,866 --> 00:00:06,899
Hello, and welcome to a video on scatterplots in Python. This will be similar to how we've done histograms, line

3
00:00:06,900 --> 00:00:12,833
graphs, and other types of visualization. First, let's load in our packages and data. As per usual,

4
00:00:12,833 --> 00:00:13,266

5
00:00:13,266 --> 00:00:19,066
we've got all of our libraries that we're importing here: pandas as pd,

6
00:00:19,066 --> 00:00:19,666

7
00:00:19,666 --> 00:00:25,766
drive from google.colab. Of course, we're importing matplotlib.pyplot because that's what we use to actually create

8
00:00:25,766 --> 00:00:25,799

9
00:00:25,800 --> 00:00:32,033
our scatterplot, and numpy as np. So first,

10
00:00:32,033 --> 00:00:32,066

11
00:00:32,066 --> 00:00:38,466
we'll mount our drive, set our file path, read in our data, and actually create our variables

12
00:00:38,466 --> 00:00:43,766
here. There we go, we've got some variables created to use now.

13
00:00:43,766 --> 00:00:44,932

14
00:00:44,933 --> 00:00:50,999
So a scatterplot is a useful tool for visualizing the relationship between two continuous variables.

15
00:00:51,000 --> 00:00:51,033

16
00:00:51,033 --> 00:00:57,133
It displays individual data points on a graph, where each point represents one observation, with one variable

17
00:00:57,133 --> 00:01:03,333
plotted on the x-axis and the other on the y-axis. In psychology, scatterplots are commonly

18
00:01:03,333 --> 00:01:03,366

19
00:01:03,366 --> 00:01:10,299
used to identify relationships, check for trends, detect outliers, and visualize data distributions.

20
00:01:10,300 --> 00:01:13,166

21
00:01:13,166 --> 00:01:19,432
Let's produce a scatterplot for R_IV_2 and R_DV. We'll put R_IV_2 on the x-axis

22
00:01:19,433 --> 00:01:21,566
and R_DV on the y-axis.

23
00:01:21,566 --> 00:01:25,732

24
00:01:25,733 --> 00:01:31,799
So we'll call plt.scatter.
Our x-axis is first, our y-axis is

25
00:01:31,800 --> 00:01:31,833

26
00:01:31,833 --> 00:01:36,533
second, and we're going to set our color to blue and our label to "Data points".

27
00:01:36,533 --> 00:01:38,399

28
00:01:38,400 --> 00:01:42,700
Next, we'll set x- and y-axis labels and a title for our scatterplot.

29
00:01:42,700 --> 00:01:44,700

30
00:01:44,700 --> 00:01:47,900
Don't forget to use plt.show, or else you won't see anything at all.

31
00:01:47,900 --> 00:01:51,200

32
00:01:51,200 --> 00:01:57,300
Okay, that's a pretty nice-looking scatterplot. What if we wanted to add a regression line, or line of best

33
00:01:57,300 --> 00:01:57,333

34
00:01:57,333 --> 00:02:03,433
fit, to our scatterplot? First, we'll need to define a slope and

35
00:02:03,433 --> 00:02:03,466

36
00:02:03,466 --> 00:02:09,799
intercept by calling np.polyfit and setting our two variables

37
00:02:09,800 --> 00:02:15,733
as the variables for the linear model. You'll notice it's in the same order

38
00:02:15,733 --> 00:02:16,699

39
00:02:16,700 --> 00:02:22,766
as what we did above here: R_IV_2 and then R_DV. Next,

40
00:02:22,766 --> 00:02:22,966

41
00:02:22,966 --> 00:02:28,599
we'll create a line using the np.polyfit variables that we've just

42
00:02:28,600 --> 00:02:29,000

43
00:02:29,000 --> 00:02:34,100
created, slope and intercept, and call them here.

44
00:02:34,100 --> 00:02:37,366

45
00:02:37,366 --> 00:02:43,032
Finally, we'll create our scatterplot again. Except this time,

46
00:02:43,033 --> 00:02:46,599

47
00:02:46,600 --> 00:02:52,733
we'll also plot the line against R_IV_2 from df and color it red.

48
00:02:52,733 --> 00:03:00,233

49
00:03:00,233 --> 00:03:03,833
Okay, so now we have our line of best fit through our scatterplot.

50
00:03:03,833 --> 00:03:09,366

51
00:03:09,366 --> 00:03:15,532
Feel free to play with this code and see if you can make it different colors. Come up with different variables

52
00:03:15,533 --> 00:03:20,999
to use, make different lines, and see how you get on.
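The steps narrated above (scatterplot, np.polyfit, then a red line over the points) might look roughly like the sketch below. The column names R_IV_2 and R_DV come from the video, but the data here is randomly generated stand-in data, since the real file is read from Google Drive in the video and is not available here. The label and title strings are assumptions as well.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in data: the video loads the real values from a file.
rng = np.random.default_rng(seed=0)
r_iv_2 = rng.normal(loc=50, scale=10, size=100)
df = pd.DataFrame({
    "R_IV_2": r_iv_2,
    "R_DV": 0.3 * r_iv_2 + rng.normal(loc=0, scale=8, size=100),
})

# Fit a straight line (degree 1): x first, then y, same order as plt.scatter.
slope, intercept = np.polyfit(df["R_IV_2"], df["R_DV"], 1)

# The fitted y value for every observed x value.
line = slope * df["R_IV_2"] + intercept

# Scatterplot of the raw observations.
plt.scatter(df["R_IV_2"], df["R_DV"], color="blue", label="Data points")

# Line of best fit drawn over the same x values.
plt.plot(df["R_IV_2"], line, color="red", label="Line of best fit")

# Axis labels, title, and legend.
plt.xlabel("R_IV_2")
plt.ylabel("R_DV")
plt.title("Scatterplot of R_IV_2 and R_DV")
plt.legend()
```

In a notebook, end the cell with plt.show(), as the video reminds us, so the figure is actually displayed. Changing the color arguments or swapping in other columns is an easy way to experiment.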
53
00:03:21,000 --> 00:03:22,133

54
00:03:22,133 --> 00:03:28,599
What does our line of best fit tell us? It can tell us about trends or relationships. We can have a positive, negative, or no correlation

55
00:03:28,600 --> 00:03:28,633

56
00:03:28,633 --> 00:03:34,733
present in our data. What kind of relationship do R_DV and R_IV_2 have? It looks like a slightly

57
00:03:34,733 --> 00:03:34,766

58
00:03:34,766 --> 00:03:40,932
positive relationship here, right? We can use it to predict a value of

59
00:03:40,933 --> 00:03:41,066

60
00:03:41,066 --> 00:03:47,232
y given a value of x, or vice versa. We can visually identify outliers, points that are

61
00:03:47,233 --> 00:03:47,266

62
00:03:47,266 --> 00:03:52,866
far from the line in our data. It can also give us an idea of variability from the amount of spread around the line.

63
00:03:52,866 --> 00:03:53,299

64
00:03:53,300 --> 00:03:59,366
Does our data have any outliers that you've noticed? Maybe these two, we might

65
00:03:59,366 --> 00:03:59,399

66
00:03:59,400 --> 00:04:05,433
say these are outliers. And then we also have the strength of our relationship. The

67
00:04:05,433 --> 00:04:05,466

68
00:04:05,466 --> 00:04:11,466
steepness of the line tells us the strength of the relationship. A steep line is a strong relationship, whereas

69
00:04:11,466 --> 00:04:15,099
a flatter line suggests a weak or no relationship at all.

70
00:04:15,100 --> 00:04:21,466

71
00:04:21,466 --> 00:04:25,932
Okay. That's all for scatterplots. Have fun coding!
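The readings of the line described above (direction, prediction, outliers, spread) can also be put into numbers. This is a minimal sketch under stated assumptions: the x and y arrays are made-up stand-in values, the x value of 60 is an arbitrary example, and the cutoff of two standard deviations for flagging outliers is one common rule of thumb rather than anything from the video. Because the slope changes with the units of the variables, the sketch also reports the Pearson correlation coefficient, a unit-free number that is often used alongside the line to describe strength.

```python
import numpy as np

# Hypothetical stand-in values for an x variable and a y variable.
rng = np.random.default_rng(seed=1)
x = rng.normal(loc=50, scale=10, size=100)
y = 0.3 * x + rng.normal(loc=0, scale=8, size=100)

# Slope and intercept of the line of best fit.
slope, intercept = np.polyfit(x, y, 1)

# Direction: the sign of the slope says positive or negative trend.
direction = "positive" if slope > 0 else "negative" if slope < 0 else "none"

# Prediction: the y value the line gives for a chosen x value (60 is arbitrary).
predicted_y = slope * 60.0 + intercept

# Residuals: vertical distance of each point from the line.
residuals = y - (slope * x + intercept)

# Spread around the line, and points far from it (assumed 2-SD cutoff).
spread = residuals.std()
outlier_mask = np.abs(residuals) > 2 * spread

# Pearson correlation: unit-free, always between -1 and 1.
r = np.corrcoef(x, y)[0, 1]

print(f"direction: {direction}, slope: {slope:.3f}, r: {r:.3f}")
print(f"predicted y at x=60: {predicted_y:.2f}")
print(f"spread around line: {spread:.2f}, flagged points: {outlier_mask.sum()}")
```

Points flagged by outlier_mask are only candidates worth a closer look on the plot, not automatic exclusions.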