1
00:00:00,000 --> 00:00:00,866


2
00:00:00,866 --> 00:00:06,899
Hello, and welcome to a video on scatterplots in Python. This will be similar to how we've done histograms, line

3
00:00:06,900 --> 00:00:12,833
graphs, and other types of visualization. First, let's load in our packages and data. As per usual,

4
00:00:12,833 --> 00:00:13,266


5
00:00:13,266 --> 00:00:19,066
we've got all of our libraries that we're importing here: pandas as pd,

6
00:00:19,066 --> 00:00:19,666


7
00:00:19,666 --> 00:00:25,766
drive from google.colab. Of course, we're importing matplotlib.pyplot because that's what we use to actually create

8
00:00:25,766 --> 00:00:25,799


9
00:00:25,800 --> 00:00:32,033
our scatterplot and numpy as np. So first,

10
00:00:32,033 --> 00:00:32,066


11
00:00:32,066 --> 00:00:38,466
we'll mount our drive, set our file path, read in our data, and actually create our variables

12
00:00:38,466 --> 00:00:43,766
here. There we go, we've got some variables created to use now.

13
00:00:43,766 --> 00:00:44,932


14
00:00:44,933 --> 00:00:50,999
So a scatterplot is a useful tool for visualizing the relationship between two continuous variables.

15
00:00:51,000 --> 00:00:51,033


16
00:00:51,033 --> 00:00:57,133
It displays individual data points on a graph, where each point represents one observation with one variable

17
00:00:57,133 --> 00:01:03,333
plotted on the x-axis, and the other on the y-axis. In psychology, scatterplots are commonly

18
00:01:03,333 --> 00:01:03,366


19
00:01:03,366 --> 00:01:10,299
used to identify relationships, check for trends, detect outliers, and visualize data distributions.

20
00:01:10,300 --> 00:01:13,166


21
00:01:13,166 --> 00:01:19,432
Let's produce a scatterplot for R_IV2 and R_DV. We'll put R_IV2 on the x-axis,

22
00:01:19,433 --> 00:01:21,566
and R_DV on the y-axis.

23
00:01:21,566 --> 00:01:25,732


24
00:01:25,733 --> 00:01:31,799
So we'll call plt.scatter. Our x-axis is first, our y-axis is

25
00:01:31,800 --> 00:01:31,833


26
00:01:31,833 --> 00:01:36,533
second, and we're going to set our color to blue, and our label to data points.

27
00:01:36,533 --> 00:01:38,399


28
00:01:38,400 --> 00:01:42,700
Next, we'll set x- and y-axis labels and a title for our scatterplot.

29
00:01:42,700 --> 00:01:44,700


30
00:01:44,700 --> 00:01:47,900
Don't forget to use plt.show, or else you won't see anything at all.
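
The steps just described can be sketched roughly as follows. This is a minimal sketch, not the notebook from the video: the data here is randomly generated as a hypothetical stand-in for the file read from Drive, the title text is an assumption, and the plt.legend call is an addition so that the label actually appears on the plot.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in data: the video reads R_IV2 and R_DV from a file,
# which is not shown, so we generate two loosely related columns instead.
rng = np.random.default_rng(0)
df = pd.DataFrame({"R_IV2": rng.normal(50, 10, 100)})
df["R_DV"] = 0.3 * df["R_IV2"] + rng.normal(0, 8, 100)

# The x-axis variable goes first, the y-axis variable second.
plt.scatter(df["R_IV2"], df["R_DV"], color="blue", label="Data Points")

# Axis labels and a title (the title wording is an assumption).
plt.xlabel("R_IV2")
plt.ylabel("R_DV")
plt.title("Scatterplot of R_IV2 and R_DV")

# Addition for this sketch: a legend, so the label is displayed.
plt.legend()

# Keep a handle on the axes before showing, in case we want to inspect them.
ax = plt.gca()

# Without plt.show() nothing is displayed in many environments.
plt.show()
```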

31
00:01:47,900 --> 00:01:51,200


32
00:01:51,200 --> 00:01:57,300
Okay, that's a pretty nice looking scatterplot. What if we wanted to add a regression line, or line of best

33
00:01:57,300 --> 00:01:57,333


34
00:01:57,333 --> 00:02:03,433
fit to our scatterplot? First, we'll need to define a slope and

35
00:02:03,433 --> 00:02:03,466


36
00:02:03,466 --> 00:02:09,799
intercept by calling np.polyfit and setting our two variables

37
00:02:09,800 --> 00:02:15,733
as the variables for the linear model. You'll notice it's in the same order

38
00:02:15,733 --> 00:02:16,699


39
00:02:16,700 --> 00:02:22,766
as what we did above here: R_IV2 and then R_DV. Next,

40
00:02:22,766 --> 00:02:22,966


41
00:02:22,966 --> 00:02:28,599
we'll create a line using the polyfit variables that we've just

42
00:02:28,600 --> 00:02:29,000


43
00:02:29,000 --> 00:02:34,100
created, slope and intercept, and call them here.

44
00:02:34,100 --> 00:02:37,366


45
00:02:37,366 --> 00:02:43,032
Finally, we'll create our scatterplot again. Except this time,

46
00:02:43,033 --> 00:02:46,599


47
00:02:46,600 --> 00:02:52,733
we'll plot the line against R_IV2 and color it red.

48
00:02:52,733 --> 00:03:00,233


49
00:03:00,233 --> 00:03:03,833
Okay, so now we have our line of best fit through our scatterplot.
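
Put together, the line-of-best-fit steps might look something like the sketch below. It assumes a degree of 1 is passed to np.polyfit (a straight line), which the narration does not state, and it again uses randomly generated stand-in data rather than the file from the video. The legend and the label text are additions.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in data for R_IV2 and R_DV (the real file is not shown).
rng = np.random.default_rng(0)
df = pd.DataFrame({"R_IV2": rng.normal(50, 10, 100)})
df["R_DV"] = 0.3 * df["R_IV2"] + rng.normal(0, 8, 100)

# Fit a straight line (degree 1). Same order as the scatterplot: x, then y.
# np.polyfit returns coefficients from highest power down: slope, intercept.
slope, intercept = np.polyfit(df["R_IV2"], df["R_DV"], 1)

# The fitted y-value for every x in the data.
line = slope * df["R_IV2"] + intercept

# Scatterplot again, this time with the fitted line drawn in red.
plt.scatter(df["R_IV2"], df["R_DV"], color="blue", label="Data Points")
plt.plot(df["R_IV2"], line, color="red", label="Line of best fit")
plt.xlabel("R_IV2")
plt.ylabel("R_DV")
plt.title("R_IV2 and R_DV with line of best fit")
plt.legend()
plt.show()
```

Changing the color arguments or swapping in other columns is an easy way to experiment with this.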

50
00:03:03,833 --> 00:03:09,366


51
00:03:09,366 --> 00:03:15,532
Feel free to play with this code and see if you can make it different colors. Come up with different variables

52
00:03:15,533 --> 00:03:20,999
to use, make different lines, and see how you get on.

53
00:03:21,000 --> 00:03:22,133


54
00:03:22,133 --> 00:03:28,599
What does our line of best fit tell us? It can tell us about trends or relationships. We can have a positive, negative, or no correlation

55
00:03:28,600 --> 00:03:28,633


56
00:03:28,633 --> 00:03:34,733
present in our data. What kind of relationship do R_DV and R_IV2 have? It looks like a slightly

57
00:03:34,733 --> 00:03:34,766


58
00:03:34,766 --> 00:03:40,932
positive relationship here, right? We can use it to predict values for a value of

59
00:03:40,933 --> 00:03:41,066


60
00:03:41,066 --> 00:03:47,232
y given a value of x, or vice versa. We can visually identify outliers, points that are

61
00:03:47,233 --> 00:03:47,266


62
00:03:47,266 --> 00:03:52,866
far from the line in our data. We can also get an idea of variability from the amount of spread around the line.

63
00:03:52,866 --> 00:03:53,299


64
00:03:53,300 --> 00:03:59,366
Does our data have any outliers that you've noticed? Maybe these two, we might

65
00:03:59,366 --> 00:03:59,399


66
00:03:59,400 --> 00:04:05,433
say these are outliers. And then we also have the strength of our relationship. The

67
00:04:05,433 --> 00:04:05,466


68
00:04:05,466 --> 00:04:11,466
steepness of the line tells us the strength of the relationship. A steep line suggests a strong relationship, whereas

69
00:04:11,466 --> 00:04:15,099
a flatter line suggests a weak one or no relationship at all.
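
The uses of the line listed above can be tried out numerically with a small sketch like this one. The five data points are made up for illustration, and the cutoff of two standard deviations for flagging a possible outlier is an arbitrary assumption, not a rule from the video. Because the slope depends on the units of the two variables, the sketch also computes Pearson's r with np.corrcoef, which is one common way to put a number on the strength of a linear relationship.

```python
import numpy as np

# Made-up example values standing in for R_IV2 (x) and R_DV (y).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Fit a straight line: slope and intercept.
slope, intercept = np.polyfit(x, y, 1)

# Direction: the sign of the slope says whether the trend is positive or negative.
direction = "positive" if slope > 0 else "negative" if slope < 0 else "flat"

# Prediction: plug a new x value into the fitted line to get a predicted y.
new_x = 6.0
predicted_y = slope * new_x + intercept

# Outliers and spread: residuals are the vertical distances from the line.
residuals = y - (slope * x + intercept)
spread = residuals.std()
# Assumed cutoff: flag points more than 2 standard deviations from the line.
flagged = np.abs(residuals) > 2 * spread

# Strength: Pearson's r ranges from -1 to 1; values near 0 mean a weak linear relationship.
r = np.corrcoef(x, y)[0, 1]

print("direction:", direction)
print("predicted y at x = 6:", predicted_y)
print("possible outliers:", x[flagged])
print("Pearson's r:", r)
```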

70
00:04:15,100 --> 00:04:21,466


71
00:04:21,466 --> 00:04:25,932
Okay. That's all for scatterplots. Have fun coding!