## The Centroid and the Regression Line

Interactive computer-based tools provide students with the opportunity to easily investigate the relationship between a set of data points and a curve used to fit the data points. As students work with bivariate data in grades 9-12, they will be able to investigate relationships between the variables using linear, exponential, power, logarithmic, and other functions for curve fitting. Using interactive tools like the one below, students can investigate the properties of regression lines and correlation.

Recall that the idea of a centroid of a set of points comes from the idea of center of mass in physics, where each point has the same mass. In physics, the center of mass is a point whose motion represents the motion of the entire set of points. In data analysis, we don't have masses and motion, but we can think of the centroid as an "average" point that represents the entire set of points. It may not be one of the data points, but it is somewhere in the "middle" of them. It is a "center" about which the data points are scattered.

How can we find such a point? Remember that the data points (*x, y*) have the value of the *x*-variable as the first coordinate and the corresponding value of the *y*-variable as the second coordinate. We can find such a point for a given data set by separating the *x*-values from the *y*-values, finding a "central" value for the *x*-values and, separately, a "central" value for the *y*-values, and then putting these two central values together. The mean is a "central" value for a set of univariate data, so we use the mean of each set.

For example, at the beginning of this i-Math we were interested in
the relationship between a person's height and weight. The data points
were pairs (*x, y*) where *x* was height and *y* was
weight. But we can think of the height measurements as a univariate data
set in their own right, and the weight measurements as a univariate
data set in their own right.

In the graph below, the *x*-values of the data points are points on the *x*-axis and the *y*-values of the data points are points on the *y*-axis. You can compute the mean of each of these univariate sets of data, that is, the mean of the *x*-values (*x*-bar; for example, the mean height) and the mean of the *y*-values (*y*-bar; for example, the mean weight), and locate these values on the *x*-axis and the *y*-axis, respectively. Next, plot the point (*x*-bar, *y*-bar). This is the point we call the centroid of the bivariate data set.
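The computation described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the applet: the function name `centroid` and the height-weight pairs are hypothetical and chosen only to make the arithmetic easy to check by hand.

```python
from statistics import mean

def centroid(points):
    """Return (x-bar, y-bar) for a list of (x, y) pairs."""
    xs = [x for x, _ in points]  # treat the x-values as one univariate data set
    ys = [y for _, y in points]  # treat the y-values as another
    return (mean(xs), mean(ys))

# Hypothetical (height in inches, weight in pounds) pairs, for illustration only.
sample = [(60, 110), (64, 130), (68, 150), (72, 190)]
x_bar, y_bar = centroid(sample)
print(x_bar, y_bar)
```

Note that the centroid of this sample is not one of the four data points, which matches the remark that the centroid need not belong to the data set.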

Linear Regression II

**Instructions:**

- To add a data point, click in the white area.
- Hold down shift, and click on a point to drag that point to a new location.
- In order to remove a point, hold down control and click on the point.
- Be sure that the circle around the point is showing before you click or drag a point that is already on the graph.
- The origin is at the center of the grid, but will move if you change the scale.

### Finding the Relationship Between the Regression Line and the Centroid

1. CLEAR the graph and plot two points.
   - (a) On your own paper, determine the centroid of these two points.
   - (b) On your own paper, compute the midpoint of the two points. Compare this midpoint to the centroid you computed. Explain the connection.
   - (c) Click on SHOW CENTROID. Compare the coordinates of the centroid shown to the coordinates you computed in parts (a) and (b). Explain and resolve any differences.

2. CLEAR the graph and plot three points.
   - (a) On your own paper, determine the centroid of these three points.
   - (b) Click on SHOW CENTROID. Confirm that you get the same point as in part (a).
   - (c) Click on SHOW LINE. What happens?

3. Experiment with several scatterplots to find the relationship between the centroid and the least-squares regression line. Describe this relationship.
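After experimenting with the applet, one way to check a conjecture numerically is to fit a least-squares line by hand and evaluate it at *x*-bar. The sketch below uses the standard least-squares formulas for slope and intercept; the function name `least_squares_line` and the four data points are hypothetical, and the applet's own implementation is not known.

```python
from statistics import mean

def least_squares_line(points):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x-values are equal, so no least-squares slope exists")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Hypothetical scatterplot, for illustration only.
points = [(1, 2), (2, 1), (4, 5), (5, 4)]
slope, intercept = least_squares_line(points)

x_bar = mean([x for x, _ in points])
y_bar = mean([y for _, y in points])

# Height of the fitted line directly above x-bar, to compare with y-bar.
predicted_at_x_bar = slope * x_bar + intercept
print(predicted_at_x_bar, y_bar)
```

Changing the entries of `points` and rerunning the comparison mirrors the experiment suggested in the activity.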

**Reflection Questions**

Look back over the activities in this i-Math Investigation and consider the following questions:

- When modeling real data using the regression line, what meaning can be given to the centroid?
- Think about how the equation of the regression line can be used to predict the value of one variable given the value of the other variable in a real life situation, such as the height-weight example. In that example, we took a sample of 40 people from the general population, measured both height and weight for each of the 40 people, and fitted a regression line to the data. Is it valid to then measure the height of another person who wasn't one of the 40 sampled people and say that their weight is given by the formula for the regression line?
- Suppose you sample another 40 people's heights and weights and fit a regression line to that data. Would the new line be the same as before? Which of these lines should you use to predict weight from height? Both of these questions are variations on the theme: once we have determined a particular curve and equation for our 40 data points, how reliable is our equation for predicting any person's weight, given that we know their height?

These questions of reliability are very important in applying the ideas in this i-Math to data from any real-life experiment. The statistical theory of regression deals with questions like this, and answers can be found by looking under "linear regression" in any standard textbook on statistics.

- What would happen if we switched the roles of weight and height? That is, suppose we made weight the independent variable (input) and height the dependent variable (output). Now weight would appear on the horizontal axis and height on the vertical axis. If we fit a regression line to this new scatterplot, would the new regression line be the inverse of the original regression line? Why or why not?

Because the two lines generally differ, it is very important to use only the variable on the horizontal axis (the independent variable) to predict the value of the variable on the vertical axis, and not the reverse. That is, the original regression line does not predict height from a given weight; it only predicts weight from a given height. To predict height from a given weight, we would need to put weight on the horizontal axis and height on the vertical axis, and fit a new regression line.
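The question about switching the roles of the two variables can be explored numerically. The sketch below fits one line that predicts *y* from *x* and another that predicts *x* from *y*, then compares the second slope with the slope an inverse line would have. The helper `fit_line` and the data are hypothetical, and the fit uses the standard least-squares formulas rather than anything taken from the applet.

```python
from statistics import mean

def fit_line(points):
    """Least-squares (slope, intercept) for predicting the second coordinate from the first."""
    us = [u for u, _ in points]
    vs = [v for _, v in points]
    u_bar, v_bar = mean(us), mean(vs)
    suu = sum((u - u_bar) ** 2 for u in us)
    suv = sum((u - u_bar) * (v - v_bar) for u, v in points)
    slope = suv / suu
    return slope, v_bar - slope * u_bar

# Hypothetical (x, y) data, for illustration only.
points = [(1, 2), (2, 1), (4, 5), (5, 4)]
swapped = [(y, x) for x, y in points]  # put the old y-variable on the horizontal axis

m_yx, b_yx = fit_line(points)   # predicts y from x
m_xy, b_xy = fit_line(swapped)  # predicts x from y

# If the second line were the inverse of the first, its slope would be 1 / m_yx.
inverse_slope = 1 / m_yx
print(m_yx, m_xy, inverse_slope)
```

For this data set the slope of the swapped fit does not match the slope an inverse line would need. The two agree only when the points lie exactly on a line, which is worth testing with a perfectly linear data set.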

### Reference

Copyright Notice: Applet generously provided by: L. O. Cannon, James Dorward, E. Robert Heal, Richard Wellman (Utah State University, www.matti.usu.edu). The USU MATTI project is supported by the National Science Foundation (Award #9819107). Copyright 1999.

- Computers with internet connection

### Teacher Reflection

Look back over the activities in this i-Math Investigation and consider the following questions:

- What mathematical ideas can be developed through these activities?
- What is the value of using computer-based tools in activities like this?
- What should be the role of the teacher during activities like this?
- What assessment tasks can be used with these activities?

### Learning Objectives

Students will:

- Investigate the centroid of a data set and its significance for the line fitted to the data.

### NCTM Standards and Expectations

- For bivariate measurement data, be able to display a scatterplot, describe its shape, and determine regression coefficients, regression equations, and correlation coefficients using technological tools.

- Display and discuss bivariate data where at least one variable is categorical.

- Recognize how linear transformations of univariate data affect shape, center, and spread.