Correlation and regression are two terms in statistics that are related, but not quite the same. In this tutorial, we’ll provide a brief explanation of both terms and explain how they’re similar and different.
For example, suppose we have the following dataset that contains two variables: (1) Hours studied and (2) Exam Score received for 20 different students:
If we created a scatterplot of hours studied vs. exam score, here’s what it would look like:
Just from looking at the plot, we can tell that students who study more tend to earn higher exam scores. In other words, we can visually see that there is a positive correlation between the two variables.
Using a calculator, we can find that the correlation between these two variables is r = 0.915. Since this value is close to 1, it confirms that there is a strong positive correlation between the two variables.
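For readers who would rather compute the correlation in code than with a calculator, here is a minimal sketch of the Pearson correlation coefficient in plain Python. The function name `pearson_r` and the six data points are assumptions made up for illustration; they are not the 20-student dataset from this tutorial, so the result will not equal 0.915.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical values for illustration only (NOT the tutorial's dataset)
hours = [1, 2, 3, 4, 5, 6]
scores = [66, 71, 72, 78, 80, 85]

r = pearson_r(hours, scores)
print(f"r = {r:.3f}")
```

A value near 1 indicates a strong positive linear relationship, a value near -1 a strong negative one, and a value near 0 little or no linear relationship.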
Regression is a method we can use to understand how changing the value of the x variable affects the value of the y variable.
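A regression model uses one variable, x, as the predictor variable and the other variable, y, as the response variable. It then finds an equation of the following form that best describes the relationship between the two variables:
ŷ = b0 + b1*x
Here ŷ is the predicted value of the response variable, b0 is the y-intercept, and b1 is the slope (the regression coefficient).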
For example, consider our dataset from earlier:
Using a linear regression calculator, we find that the following equation best describes the relationship between these two variables:
Predicted exam score = 65.47 + 2.58*(hours studied)
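The way to interpret this equation is as follows: the intercept, 65.47, is the predicted exam score for a student who studies zero hours, and the slope, 2.58, is the average increase in predicted exam score for each additional hour studied.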
We can also use this equation to predict the score that a student will receive based on the number of hours studied.
For example, a student who studies 6 hours is expected to receive a score of 80.95:
Predicted exam score = 65.47 + 2.58*(6) = 80.95.
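The same prediction can be written as a small function. This is only a sketch: the coefficients come from the fitted equation above, while the function name `predict_exam_score` is a hypothetical helper introduced for illustration.

```python
# Coefficients taken from the fitted equation in the text
INTERCEPT = 65.47  # predicted score at zero hours studied
SLOPE = 2.58       # predicted points gained per additional hour studied

def predict_exam_score(hours_studied):
    """Predicted exam score for a given number of hours studied (hypothetical helper)."""
    return INTERCEPT + SLOPE * hours_studied

print(round(predict_exam_score(6), 2))  # 80.95
```

Predictions are most trustworthy for hours values inside the range covered by the original data.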
We can also plot this equation as a line on a scatterplot:
We can see that the regression line “fits” the data quite well.
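To make the idea of a "best" line concrete, here is a minimal sketch of how an intercept and slope can be found. It assumes ordinary least squares, the standard method for simple linear regression, although the tutorial does not say which method its calculator uses. The function name `fit_line` and the data are hypothetical, so the fitted coefficients will differ from 65.47 and 2.58.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = sum of cross-products of deviations / sum of squared x deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The least squares line always passes through (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical values for illustration only (NOT the tutorial's dataset)
hours = [1, 2, 3, 4, 5, 6]
scores = [66, 71, 72, 78, 80, 85]

intercept, slope = fit_line(hours, scores)
print(f"Predicted exam score = {intercept:.2f} + {slope:.2f}*(hours studied)")
```

Least squares chooses the line that minimizes the sum of squared vertical distances between the observed scores and the line, which is the sense in which the line "fits" the data.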
Recall that the correlation between these two variables was r = 0.915. It turns out that we can square this value to get a number called “r-squared” that describes the proportion of the variance in the response variable that can be explained by the predictor variable.
In this example, r² = 0.915² = 0.837. This means that 83.7% of the variation in exam scores can be explained by the number of hours studied.
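The link between correlation and r-squared can be checked numerically. The sketch below first reproduces the arithmetic from the text, then confirms on a small dataset that the squared correlation equals the proportion of variance explained by a least squares line. The function names and the data are hypothetical and chosen only for illustration; the equality holds for simple linear regression with a single predictor.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def regression_r_squared(xs, ys):
    """Proportion of variance in ys explained by a least squares line on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual (unexplained) and total sums of squares
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Arithmetic from the text
print(round(0.915 ** 2, 3))  # 0.837

# Hypothetical values for illustration only (NOT the tutorial's dataset)
hours = [1, 2, 3, 4, 5, 6]
scores = [66, 71, 72, 78, 80, 85]

r = pearson_r(hours, scores)
r2 = regression_r_squared(hours, scores)
print(f"r^2 = {r ** 2:.4f}, regression R^2 = {r2:.4f}")
```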
Here is a summary of the similarities and differences between correlation and regression:
Similarities:
Differences:
Hey there. My name is Zach Bobbitt. I have a Masters of Science degree in Applied Statistics and I’ve worked on machine learning algorithms for professional businesses in both healthcare and retail. I’m passionate about statistics, machine learning, and data visualization and I created Statology to be a resource for both students and teachers alike. My goal with this site is to help you learn statistics through using simple terms, plenty of real-world examples, and helpful illustrations.