Least-Square Linear Regression
In data mining, we often need to learn a model from a training data set so that we can use the model to predict the unknown value of the class attribute of any given new instance, based on the known values of the other attributes. When the class attribute is a nominal attribute (i.e., there are only a finite number c of possible values for the class attribute), this is a discrete classification problem: our job is to classify the new instance into one of the c categories corresponding to the c known possible values of the class attribute.
However, when the class attribute is a numeric attribute (for example, when the value can be any real number in an interval), there are infinitely many possible values for the class attribute, and the training data set can show only a finite subset of them. Obviously we need to learn a model that can classify new instances into any of the possible values, not just the values observed in the training data set. The task of predicting the value of such a numeric class attribute is often called a regression problem. There are many regression algorithms for the job. This exercise is about a very simple linear regression algorithm for prediction, to give you an idea of the role of regression algorithms.
Linear Regression Model and Least-Square Solution:
· Ideally, a solution x to the linear system Ax = b (where each of the m rows of A holds the values of the first n attributes for one training instance, and the corresponding entry of b holds that instance's value for the (n+1)-th attribute) would provide us with n perfect coefficients for combining the values of the first n attributes to generate the exact value of the (n+1)-th attribute. Unfortunately, Ax = b in general has no solution when m is greater than n. In other words, it is impossible to use a linear equation to exactly predict the value of the (n+1)-th attribute from the values of the first n attributes. Instead, we can look for the least-square solution x, which minimizes the sum of the squares of the prediction errors over the m instances. See the next section for how to find the least-square-error solution of an over-determined linear system.
Finding the Least-Square-Error Solution for Over-Determined Linear Equations:
· Typically, Ax = b has no solution when m is greater than n. In other words, Ax - b is not the zero vector for any vector x. Instead of an exact solution, we can look for a least-square solution x such that the length of the error vector Ax - b is minimized. In other words, we would like to minimize the sum of the squares of the individual components of the error vector Ax - b.
· You can find the least-square solution x by solving the linear system (A^T A) x = (A^T b), where A^T is the transpose of A.
· Note that (A^T A) is an n-by-n matrix and (A^T b) is an n-by-1 vector, so you are now solving a new linear system (A^T A) x = (A^T b) of n equations in n variables to find the least-square solution of the original system Ax = b.
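The procedure above can be sketched in a few lines of NumPy. The small 4-by-2 system below is a made-up example (not from the exercise data); it just shows that solving the normal equations (A^T A) x = (A^T b) gives the same answer as NumPy's built-in least-squares routine.

```python
import numpy as np

# Hypothetical over-determined system: m = 4 equations, n = 2 unknowns.
A = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0]])
b = np.array([3.0, 4.0, 6.0, 8.0])

# Form and solve the normal equations (A^T A) x = (A^T b).
x = np.linalg.solve(A.T @ A, A.T @ b)

# The same least-square solution via NumPy's built-in routine, for comparison.
x_check, *_ = np.linalg.lstsq(A, b, rcond=None)

print(x)                        # least-square coefficients
print(np.allclose(x, x_check))  # True
```

Note that in practice `np.linalg.lstsq` (which uses an SVD) is numerically safer than forming A^T A explicitly, but the normal-equations route matches the derivation in this handout.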
Questions:
Instance | H | W
1 | 147 | 51
2 | 155 | 57
3 | 163 | 58
4 | 173 | 68
5 | 180 | 72
Part 1:
Part 2:
Note that a dummy attribute with a constant value of 1 is often artificially included in every instance of the data set, to allow extra flexibility in finding a good linear regression learner. Let's apply it to our case here to see whether it yields a better linear regression model.
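The dummy-attribute idea can be sketched as follows: prepending a constant-1 column to A turns the model into W ≈ x0·1 + x1·H, i.e., a line with an intercept. The least-square solution is found from the same normal equations as before.

```python
import numpy as np

# The five (H, W) instances from the table, with a constant dummy
# attribute of 1 prepended to each instance.
H = np.array([147.0, 155.0, 163.0, 173.0, 180.0])
W = np.array([51.0, 57.0, 58.0, 68.0, 72.0])

A = np.column_stack([np.ones_like(H), H])  # 5-by-2: [1, H]

# Least-square solution via the normal equations (A^T A) x = (A^T b).
x = np.linalg.solve(A.T @ A, A.T @ W)
print(x)  # [intercept, slope]

# Sum of squared prediction errors; compare it with the error of the
# no-dummy model to see whether the extra flexibility helped.
sse = np.sum((A @ x - W) ** 2)
print(sse)
```

Since the column space of the augmented A contains that of the original one-column A, the sum of squared errors with the dummy attribute can never be worse than without it.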