3.3 Least-Squares Linear Regression

Once we see that a line fits the trend in a scatterplot, we have to decide how to find the equation of that line. We could eyeball a best-fitting line and make our best guess, but then each of us would get a different line. We need a more systematic way to fit a linear function to a set of data because, as you likely suspect, it is highly improbable that any set of data falls exactly on a line. Any statistical program or package will fit a line to a set of data. There are many ways to do this, but one standard and popular method is called Least-Squares Linear Regression.

Remember, it is always important to plot a scatter diagram first. If the scatter plot indicates that there is a linear relationship between the variables, then it is reasonable to use a best fit line to make predictions for y given x within the domain of x-values in the sample data, but not necessarily for x-values outside that domain.

Example 1 – Final Exam and Third Exam

A random sample of 11 statistics students produced the following data, where x is the third exam score out of 80, and y is the final exam score out of 200. Create a scatterplot, find the correlation coefficient, and comment on whether a line appears to be a good fit for this data.

Third Exam Score, x    Final Exam Score, y
65                     175
67                     133
71                     185
71                     163
66                     126
75                     198
67                     153
70                     163
71                     159
69                     151
69                     159

Solution:

A scatterplot shows a positive relationship and the correlation coefficient value is 0.66. A line appears to mimic the trend in the data, with higher third exam scores occurring with higher final exam scores and lower third and final exam scores occurring together.

This is a scatter plot of the data provided. The third exam score is plotted on the x-axis, and the final exam score is plotted on the y-axis. The points form a positive, roughly linear pattern.
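Statistical software produces both the scatterplot and the correlation coefficient. Below is a minimal sketch of the same computation in Python, assuming NumPy and Matplotlib are available; the text does not tie the example to any particular package.

```python
import numpy as np
import matplotlib.pyplot as plt

# Third exam scores (x) and final exam scores (y) from Example 1
x = np.array([65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69])
y = np.array([175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159])

# Pearson correlation coefficient; np.corrcoef returns a 2x2 matrix
# whose off-diagonal entry is r
r = np.corrcoef(x, y)[0, 1]
print(f"r = {r:.2f}")  # approximately 0.66

# Scatterplot of the data
plt.scatter(x, y)
plt.xlabel("Third Exam Score")
plt.ylabel("Final Exam Score")
plt.show()
```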

 

Consider Example 1. We would like to predict the final exam score by knowing the third exam score. This would be valuable information for students to use while projecting course grades before the end of an academic quarter. We want to use the third exam score to make a prediction, so we hope the third exam score can explain or predict the final exam score. This is why it is called the explanatory variable or the predictor. The resulting final exam score is the response variable.

A line that fits the data best would have the form:

\text{Final Exam Score} = \text{Slope} \times \text{Third Exam Score} + \text{Vertical Intercept}

If we adopt conventional notation for the slope, \beta_{1}, the vertical intercept, \beta_{0}, the dependent variable, y, and the independent variable, x, then the line of best fit would have the form:

y = \beta_{1}x + \beta_{0}

When we create the linear regression, we get estimates of \beta_{1} and \beta_{0}. We denote estimates with “hats”, so the estimated values would be labeled \hat{\beta_{1}} and \hat{\beta_{0}}. We call these least-squares coefficients. Once we have a linear regression line, we know the prediction of a final exam score is just that, a prediction, so we use a hat to denote the predicted value.

Then for each point, the least-squares regression line would have the form:

\hat{y}_{i} = \hat{\beta}_{1}x_{i} + \hat{\beta}_{0}

Returning to Example 1 and using a statistical program to create the least-squares regression line, we have: \hat{y} = 4.83x - 173.51. This linear regression line has been drawn on the original scatter plot in Figure 1.

The scatter plot of exam scores with the line of best fit. One data point is highlighted along with the corresponding point on the line of best fit. Both points have the same x-coordinate. The vertical distance between these two points is the error, or residual, that enters the sum of squared errors.
Figure 1: Regression Line and Scatter Plot
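Any statistical package reports the same least-squares coefficients. As one illustration (not the specific software used by the text), SciPy's linregress returns the slope and intercept directly:

```python
import numpy as np
from scipy import stats

x = np.array([65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69])
y = np.array([175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159])

# Least-squares fit; the result carries the slope, intercept,
# correlation coefficient, p-value, and standard error
fit = stats.linregress(x, y)
print(f"slope     = {fit.slope:.2f}")      # approximately 4.83
print(f"intercept = {fit.intercept:.2f}")  # approximately -173.51
```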

Each data point is of the form (x_{i}, y_{i}), and each point on the line of best fit from least-squares linear regression has the form (x_{i}, \hat{y}_{i}). The symbol \hat{y} is read “y hat” and is the estimated value of y, that is, the value of y obtained using the regression line. It is not generally equal to the observed y from the data, but it can be.

Notice in Figure 1 that the data point (x_{0}, y_{0}) and the point on the regression line (x_{0}, \hat{y}_{0}) have a vertical line drawn between them. The length of this line (an absolute value) is the distance between the actual final exam score and the predicted final exam score for a particular third exam score. Because the linear regression line does not pass exactly through the original data point, there is error in the prediction. This error is not a mistake in the conventional sense. It is error because we expect the line to follow the general trend of the data even though it will not pass through every data value. We expect error whenever we work with data and regression lines. By keeping track of the error, we can study it more closely. For least-squares regression, it is this error that plays the critical role in forming the regression line.

Residual

The term y_{0}-\hat{y}_{0}=\epsilon_{0} is called the error, but it is better known as the residual. The residual is denoted by the Greek letter epsilon, \epsilon. The absolute value of a residual measures the vertical distance between the actual value of y and the estimated value of y. In other words, it measures the vertical distance between the actual data point and the predicted point on the line.

If the observed data point lies above the line, the residual is positive, and the line underestimates the actual data value for y. If the observed data point lies below the line, the residual is negative, and the line overestimates the actual data value for y.

For each data point, the residual can be calculated as y_{i}-\hat{y}_{i}=\epsilon_{i} for i = 1, 2, 3, …, n.

For Example 1, for the 11 statistics students, there are 11 data points. Therefore, there are 11 values of \epsilon, the residual. If you square each \epsilon and add them, you get \epsilon_{1}^{2}+\epsilon_{2}^{2}+\ldots+\epsilon_{11}^{2}=\sum_{i=1}^{11}\epsilon_{i}^2. This is called the Sum of Squared Errors (SSE).
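A short sketch of this bookkeeping in Python follows, using the rounded coefficients 4.83 and -173.51 from the fitted line; with unrounded coefficients the SSE would differ slightly.

```python
import numpy as np

x = np.array([65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69])
y = np.array([175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159])

# Predicted final exam scores from the fitted line
y_hat = 4.83 * x - 173.51

# Residuals: observed value minus predicted value, one per data point
residuals = y - y_hat

# Sum of Squared Errors (SSE)
sse = np.sum(residuals ** 2)
print(residuals.round(2))
print(f"SSE = {sse:.1f}")
```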

Least-Squares Criteria for Best Fit

What happens if we try to make that sum of squared errors as small as possible? Minimizing a sum sounds a lot like an optimization problem from calculus. In fact, using calculus, we can determine the values of the slope, \hat{\beta}_{1}, and vertical intercept, \hat{\beta}_{0}, that make the SSE a minimum. When the SSE is a minimum, we have created the least-squares linear regression line, and the process is called Linear Regression. Any other line you might choose would have a higher SSE than this best fit line, which is why it is called the least-squares regression line.

Without showing the derivation, the values of \hat{\beta_{1}} and \hat{\beta_{0}} that make SSE a minimum are calculated as

\hat{\beta}_{1}=\frac{\sum_{i=1}^n(x_{i}-\overline{x})(y_{i}-\overline{y})}{\sum_{i=1}^n(x_{i}-\overline{x})^2}

\hat{\beta}_{0}=\overline{y}-\hat{\beta}_{1}\overline{x}

Keep in mind that we will not calculate these regression coefficients by hand, but it is instructive to see how the values are created.
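For readers who want to see the formulas in action, here is a minimal sketch that applies them directly to the Example 1 data; it reproduces the coefficients reported by statistical software.

```python
import numpy as np

x = np.array([65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69])
y = np.array([175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159])

x_bar, y_bar = x.mean(), y.mean()

# Slope: sum of cross deviations divided by sum of squared x deviations
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)

# Intercept: forces the line to pass through the point (x_bar, y_bar)
beta0_hat = y_bar - beta1_hat * x_bar

print(f"slope     = {beta1_hat:.2f}")  # approximately 4.83
print(f"intercept = {beta0_hat:.2f}")  # approximately -173.51
```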

Understanding Slope

The slope of the line \hat{\beta_{1}} describes how changes in the variables are related. It is important to interpret the slope of the line in the context of the situation represented by the data. You should be able to write a sentence interpreting the slope in plain English. The slope of the best-fit line tells us how the dependent variable (y) changes for every one unit increase in the independent (x) variable, on average.

Example 2 – Interpreting the Regression Model

For the 11 statistics students from Example 1 with final exam and third exam data, the least-squares regression line was \hat{y} = 4.83x - 173.51. Interpret the slope.

Solution:

The slope is estimated to be 4.83, but we must remember that the slope of a line is a comparison. It gives the approximate change in the final exam score for a one-point increase in the third exam score. If we look at the slope as a fraction, we see \frac{4.83\space \text{final exam points}}{1\space \text{third exam point}}. We are predicting a 4.83-point increase on the final exam for every one-point increase in the third exam score.

Interpolation, Prediction, and Extrapolation

Once we have a regression line, we want to use it to make predictions. For Example 2, we could use the regression line to predict the final exam score for a student who earned a grade of 73 on the third exam. Note that a third exam score of 73 is within the range of the original third exam scores. If we use a regression line to make a prediction based on a value that is within the domain of the original data, then this is called interpolation. Using the regression line, we have the following:

\hat{y} = 4.83(73) - 173.51 = 179.08, so if a student earns a score of 73 on the third exam, they have a predicted final exam score of 179.

For many variables, linear relationships will hold only within a certain range of values. If we were to try to predict a final exam score for a student who earned a 26 or a 79 on the third exam, we might not obtain a reliable estimate. Our linear model was created with a specific domain of third exam scores. It is possible that scores outside of that range no longer follow a linear pattern. Always be cautious if you try to predict using a value that is not in the original domain. Extrapolation means that we try to predict a value outside of the given set of data. There is no guarantee the original model holds for extrapolated values.
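One way to build this caution into a calculation is to check whether the input lies inside the observed range of third exam scores before predicting. The helper function below, predict_final, is a hypothetical illustration, not something defined in the text.

```python
import numpy as np

# Third exam scores from Example 1 define the domain of the model
x = np.array([65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69])

def predict_final(third_exam_score):
    """Predict a final exam score, warning when the input falls outside
    the range of third exam scores used to build the model."""
    if not (x.min() <= third_exam_score <= x.max()):
        print(f"Warning: {third_exam_score} is outside the observed range "
              f"({x.min()}-{x.max()}); this is extrapolation.")
    return 4.83 * third_exam_score - 173.51

print(predict_final(73))  # interpolation: about 179.08
print(predict_final(26))  # extrapolation: warning printed first
```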

 

License


Introduction to Statistics for Engineers Copyright © by Vikki Maurer & Jeff Crabill & Linn-Benton Community College is licensed under a Creative Commons Attribution 4.0 International License, except where otherwise noted.
