Illuminations: The Regression Line and Correlation


The Centroid and the Regression Line

Interactive computer-based tools give students the opportunity to investigate the relationship between a set of data points and a curve used to fit them. As students work with bivariate data in grades 9-12, they will be able to investigate relationships between the variables using linear, exponential, power, logarithmic, and other functions for curve fitting (see the related 9-12 Data Analysis & Probability Standard). Using interactive tools like the one below, students can investigate the properties of regression lines and correlation.

Learning Objectives

 

Students will

  • investigate the centroid of a data set and its significance for the line fitted to the data

Materials

 
  • Computer and Internet connection

Instructional Plan

In part 4 you will investigate the centroid of a data set and its significance for the line fitted to the data.

Recall that the idea of a centroid of a set of points comes from the idea of center of mass in physics, where each point has the same mass. In physics, the center of mass is a point whose motion represents the motion of the entire set of points. In data analysis, we don't have masses and motion, but we can think of the centroid as a representative "average" point that represents the entire set of points. It may not be one of the data points, but it is somewhere in the "middle" of them. It is a "center" about which the data points are scattered.

How can we find such a point? Remember that each data point (x, y) has a value of the x-variable as its first coordinate and the corresponding value of the y-variable as its second coordinate. We can find a centroid for a given data set by separating the x-values from the y-values, finding a "central" value for the x-values and, separately, a "central" value for the y-values, and then putting these two central values together as a point. The mean is a "central" value for a set of univariate data, so we use the mean of the x-values and the mean of the y-values.

For example, at the beginning of this i-Math we were interested in the relationship between a person's height and weight. The data points were pairs (x, y) where x was height and y was weight. But we can think of the height measurements as a univariate data set in their own right, and the weight measurements as a univariate data set in their own right.

In the graph below, the x-values of the data points are points on the x-axis and the y-values of the data points are points on the y-axis. You can compute the mean of each of these univariate sets of data, that is, the mean of the x-values (x-bar) (for example, the mean height) and the mean of the y-values (y-bar) (for example, the mean weight) and locate these values on the x-axis and the y-axis, respectively. Next plot the point (x-bar, y-bar). This is the point we call the centroid of the bivariate data set.
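The computation just described can also be sketched in a few lines of Python. This is only an illustration: the function name centroid and the four (height, weight) pairs are made up for the example and are not data from this investigation.

```python
def centroid(points):
    """Return (x_bar, y_bar) for a list of (x, y) data points."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n  # mean of the x-values
    y_bar = sum(y for _, y in points) / n  # mean of the y-values
    return x_bar, y_bar


# Hypothetical (height in inches, weight in pounds) pairs, for illustration only.
sample = [(62, 120), (66, 140), (70, 165), (74, 195)]

print(centroid(sample))  # prints (68.0, 155.0)
```

Notice that (68.0, 155.0) is not one of the four data points, which matches the earlier remark that the centroid need not belong to the data set.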

Finding the Relationship Between the Regression Line and the Centroid

  1. CLEAR the graph and plot two points.
    a. On your own paper, determine the centroid of these two points.
    b. On your own paper, compute the midpoint of the two points. Compare this midpoint to the centroid you computed. Explain the connection.
    c. Click on SHOW CENTROID. Compare the coordinates of the centroid shown to the coordinates you computed in parts (a) and (b). Explain and resolve any differences.
  2. CLEAR the graph and plot three points.
    a. On your own paper, determine the centroid of these three points.
    b. Click on SHOW CENTROID. Confirm that you get the same point as in part (a).
    c. Click on SHOW LINE. What happens?
  3. Experiment with several scatterplots to find the relationship between the centroid and the least-squares regression line. Describe this relationship.
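After you have formed your own conjecture with the applet, one way to test it away from the computer screen is a short numerical check. The sketch below is an assumption-laden illustration, not the applet's own code: the helper names and the three points are invented, and the slope and intercept come from the standard least-squares formulas written in terms of sums.

```python
def centroid(points):
    """Return (x_bar, y_bar) for a list of (x, y) data points."""
    n = len(points)
    return (sum(x for x, _ in points) / n,
            sum(y for _, y in points) / n)


def least_squares_line(points):
    """Return (slope, intercept) of the least-squares line y = slope*x + intercept.

    Requires at least two different x-values; otherwise the denominator is zero.
    """
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xy = sum(x * y for x, y in points)
    sum_xx = sum(x * x for x, _ in points)
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    intercept = (sum_y - slope * sum_x) / n
    return slope, intercept


# Three made-up points, standing in for points you might plot in the applet.
points = [(1, 2), (3, 3), (5, 7)]

x_bar, y_bar = centroid(points)
slope, intercept = least_squares_line(points)

# Height of the fitted line directly above (or below) x_bar.
line_at_x_bar = slope * x_bar + intercept

print("centroid:", (x_bar, y_bar))
print("line value at x_bar:", line_at_x_bar)
```

Try replacing the list of points with the coordinates you plotted in the applet and compare the two printed values.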

Note: use the link below to go to the applet, rather than scrolling down.
Go to the Regression Line Applet




Reflection Questions

Look back over the activities in this i-Math Investigation and consider the following questions:

  1. When modeling real data using the regression line, what meaning does the centroid have?
  2. Think about how the equation of the regression line can be used to predict the value of one variable given the value of the other variable in a real life situation, such as the height-weight example. In that example, we took a sample of 40 people from the general population, measured both height and weight for each of the 40 people, and fitted a regression line to the data. Is it valid to then measure the height of another person who wasn't one of the 40 sampled people and say that their weight is given by the formula for the regression line?
  3. Suppose you sample another 40 people's heights and weights and fit a regression line to that data. Would the new line be the same as before? Which of these lines should you use to predict weight from height? Both of these questions are variations on the theme: once we have determined a particular curve and equation for our 40 data points, how reliable is our equation for predicting any person's weight, given that we know their height?

    It is also very important to use only the variable on the horizontal axis (the independent variable) to predict the value of the variable on the vertical axis, and not the reverse. That is, the original regression line does not predict height from a given weight; it only predicts weight from a given height. To predict height from a given weight, we would need to put weight on the horizontal axis and height on the vertical axis, and fit a new regression line.

    These questions of reliability are very important in applying the ideas in this i-Math to data from any real life experiment. The statistical theory of regression deals with questions like this and answers can be found by looking under "linear regression" in any standard textbook on statistics.

  4. What would happen if we switched the roles of weight and height? That is, suppose we made weight the independent variable (input) and height the dependent variable (output). Now weight would appear on the horizontal axis and height on the vertical axis. If we fit a regression line to this new scatterplot, would the new regression line be the inverse of the original regression line? Why or why not?
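One way to explore question 4 numerically is to fit a line to a small data set, swap the coordinates, fit again, and compare the result with the algebraic inverse of the first line. The sketch below uses invented points and a hypothetical helper, least_squares_line, built from the standard least-squares formulas; it is meant as a starting point for experimentation, not as the answer key.

```python
def least_squares_line(points):
    """Return (slope, intercept) of the least-squares line for (input, output) pairs."""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xy = sum(x * y for x, y in points)
    sum_xx = sum(x * x for x, _ in points)
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    intercept = (sum_y - slope * sum_x) / n
    return slope, intercept


# Made-up data: first coordinate is the input, second is the output.
data = [(1, 2), (3, 3), (5, 7)]
swapped = [(y, x) for x, y in data]  # switch the roles of the two variables

m, b = least_squares_line(data)        # original line:  y = m*x + b
m2, b2 = least_squares_line(swapped)   # new line:       x = m2*y + b2

# Algebraic inverse of the original line: solve y = m*x + b for x.
inverse_slope = 1 / m
inverse_intercept = -b / m

print("line fitted to swapped data:", (m2, b2))
print("inverse of original line:   ", (inverse_slope, inverse_intercept))
print("product of the two fitted slopes:", m * m2)
```

The product of the two fitted slopes equals the square of the correlation coefficient, so it is worth trying a data set whose points lie exactly on a line and seeing how the comparison changes.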

Answers

Teacher Reflection

 

Look back over the activities in this i-Math Investigation and consider the following questions:

  • What mathematical ideas can be developed through these activities?
  • What is the value of using computer-based tools in activities like this?
  • What should be the role of the teacher during activities like this?
  • What assessment tasks can be used with these activities?

NCTM Standards and Expectations

 
Data Analysis & Probability 9-12
  1. Recognize how linear transformations of univariate data affect shape, center, and spread.
  2. For bivariate measurement data, be able to display a scatterplot, describe its shape, and determine regression coefficients, regression equations, and correlation coefficients using technological tools.
  3. Display and discuss bivariate data where at least one variable is categorical.

References

 
  • NSF Copyright Notice: Applet generously provided by: L. O. Cannon, James Dorward, E. Robert Heal, Richard Wellman (Utah State University, www.matti.usu.edu). The USU MATTI project is supported by the National Science Foundation (Award #9819107). Copyright 1999.

     

  
1 period   

© 2000 National Council of Teachers of Mathematics