Solution
OLS in Matrix Form
1 The True Model
• Let X be an n × k matrix where we have observations on k independent variables for n
observations. Since our model will usually contain a constant term, one of the columns in
the X matrix will contain only ones. This column should be treated exactly the same as any
other column in the X matrix.
• Let y be an n × 1 vector of observations on the dependent variable.
• Let ε be an n × 1 vector of disturbances or errors.
• Let β be a k × 1 vector of unknown population parameters that we want to estimate.
Our statistical model will essentially look something like the following:
$$
\underbrace{\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix}}_{n \times 1}
=
\underbrace{\begin{bmatrix}
1 & X_{11} & X_{21} & \cdots & X_{k1} \\
1 & X_{12} & X_{22} & \cdots & X_{k2} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & X_{1n} & X_{2n} & \cdots & X_{kn}
\end{bmatrix}}_{n \times k}
\underbrace{\begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{bmatrix}}_{k \times 1}
+
\underbrace{\begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}}_{n \times 1}
$$
This can be rewritten more simply as:
y = Xβ + ε (1)

This is assumed to be an accurate reflection of the real world. The model has a systematic component (Xβ) and a stochastic component (ε). Our goal is to obtain estimates of the population parameters in the β vector.
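To make the dimensions concrete, here is a minimal NumPy sketch (not part of the original notes) that builds an X matrix with a leading column of ones, a made-up β vector, simulated disturbances, and the resulting y. The sample size, number of regressors, and coefficient values are arbitrary choices for illustration.

```python
# Minimal sketch of y = X*beta + epsilon with simulated data.
# All numbers (n, k, beta_true, the noise scale) are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

n, k = 100, 3                                   # n observations, k parameters
X = np.column_stack([
    np.ones(n),                                 # constant term: a column of ones
    rng.normal(size=(n, k - 1)),                # two simulated regressors
])                                              # X is n x k
beta_true = np.array([1.0, 2.0, -0.5])          # "true" population parameters (made up)
eps = rng.normal(size=n)                        # disturbances, n x 1

y = X @ beta_true + eps                         # the model in Eq. 1
```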
2 Criteria for Estimates
Our estimates of the population parameters are referred to as β̂. Recall that the criterion we use for obtaining our estimates is to find the estimator β̂ that minimizes the sum of squared residuals (∑ e²ᵢ in scalar notation).1 Why this criterion? Where does this criterion come from?
The vector of residuals e is given by:

e = y − Xβ̂ (2)
1 Make sure that you are always careful about distinguishing between disturbances (ε) that refer to things that cannot be observed and residuals (e) that can be observed. It is important to remember that ε ≠ e.
The sum of squared residuals (RSS) is e′e.2
$$
\underbrace{\begin{bmatrix} e_1 & e_2 & \cdots & e_n \end{bmatrix}}_{1 \times n}
\underbrace{\begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{bmatrix}}_{n \times 1}
=
\underbrace{\begin{bmatrix} e_1 \times e_1 + e_2 \times e_2 + \dots + e_n \times e_n \end{bmatrix}}_{1 \times 1}
\tag{3}
$$
It should be obvious that we can write the sum of squared residuals as:
e′e = (y − Xβ̂)′(y − Xβ̂)
= y′y − β̂′X′y − y′Xβ̂ + β̂′X′Xβ̂
= y′y − 2β̂′X′y + β̂′X′Xβ̂ (4)

where this development uses the fact that the transpose of a scalar is the scalar, i.e. y′Xβ̂ = (y′Xβ̂)′ = β̂′X′y.
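As a quick numerical check of Eq. 4, the sketch below (using simulated data and an arbitrary candidate β̂, both my own assumptions) confirms that e′e computed directly equals the expanded quadratic form.

```python
# Sketch: verify that e'e = y'y - 2*bhat'X'y + bhat'X'X*bhat (Eq. 4)
# for an arbitrary candidate beta-hat, on simulated data.
import numpy as np

rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

beta_hat = np.array([0.9, 1.8, -0.4])           # any candidate estimate will do here

e = y - X @ beta_hat                            # residuals, Eq. 2
rss_direct = e @ e                              # e'e
rss_expanded = y @ y - 2 * beta_hat @ X.T @ y + beta_hat @ X.T @ X @ beta_hat

print(np.isclose(rss_direct, rss_expanded))     # True (up to floating-point error)
```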
To find the β̂ that minimizes the sum of squared residuals, we need to take the derivative of Eq. 4
with respect to β̂. This gives us the following equation:
∂e′e/∂β̂ = −2X′y + 2X′Xβ̂ = 0 (5)
To check that this is a minimum, we would take the derivative of this with respect to β̂ again; this gives us 2X′X. It is easy to see that, so long as X has full rank, this is a positive definite matrix (analogous to a positive real number) and hence a minimum.3
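The sketch below is one way to sanity-check Eq. 5 numerically; it is not part of the original notes. On simulated data it compares the analytic gradient −2X′y + 2X′Xβ̂ with a central finite-difference approximation at an arbitrary point, and confirms that 2X′X has strictly positive eigenvalues when X has full rank.

```python
# Sketch: check the gradient in Eq. 5 by central finite differences, and check
# that the second derivative 2X'X is positive definite. Data are simulated.
import numpy as np

rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

def rss(b):
    e = y - X @ b
    return e @ e                                      # e'e as a function of beta

b0 = np.array([0.5, 1.0, 0.0])                        # arbitrary evaluation point
analytic = -2 * X.T @ y + 2 * X.T @ X @ b0            # gradient from Eq. 5

h = 1e-6
numeric = np.array([
    (rss(b0 + h * np.eye(3)[j]) - rss(b0 - h * np.eye(3)[j])) / (2 * h)
    for j in range(3)
])

print(np.allclose(analytic, numeric, rtol=1e-4))      # True
print(np.all(np.linalg.eigvalsh(2 * X.T @ X) > 0))    # True: positive definite
```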
2 It is important to note that this is very different from ee′, the n × n outer product of the residual vector (which is related to the variance-covariance matrix of the residuals).
3 Here is a brief overview of matrix differentiation.
∂(a′b)/∂b = ∂(b′a)/∂b = a (6)

when a and b are K × 1 vectors.

∂(b′Ab)/∂b = 2Ab = 2b′A (7)

when A is any symmetric matrix. Note that you can write the derivative as either 2Ab or 2b′A.

∂(2β′X′y)/∂β = ∂(2β′(X′y))/∂β = 2X′y (8)

and

∂(β′X′Xβ)/∂β = ∂(β′Aβ)/∂β = 2Aβ = 2X′Xβ (9)

when X′X is a K × K matrix. For more information, see Greene (2003, 837-841) and Gujarati (2003, 925).
From Eq. 5 we get what are called the ‘normal equations’:

(X′X)β̂ = X′y (10)

There are two things to note about the (X′X) matrix. First, it is always square since it is k × k. Second, it is always symmetric.

Recall that (X′X) and X′y are known from our data but β̂ is unknown. If the inverse of (X′X) exists (i.e. (X′X)−1), then pre-multiplying both sides by this inverse gives us the following equation:4

(X′X)−1(X′X)β̂ = (X′X)−1X′y (11)

We know that by definition, (X′X)−1(X′X) = I, where I in this case is a k × k identity matrix. This gives us:

Iβ̂ = (X′X)−1X′y
β̂ = (X′X)−1X′y (12)
Note that we have not had to make any assumptions to get this far! Since the OLS estimators in
the β̂ vector are a linear combination of existing random variables (X and y), they themselves are
random variables with certain straightforward properties.
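Here is a short sketch (again on simulated data of my own construction) of what Eq. 12 looks like in NumPy. Forming (X′X)−1 explicitly works, but solving the normal equations of Eq. 10 directly is the numerically preferred route; both give the same β̂ when X′X is invertible.

```python
# Sketch: compute beta-hat = (X'X)^{-1} X'y (Eq. 12) on simulated data.
import numpy as np

rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

beta_hat_inv = np.linalg.inv(X.T @ X) @ X.T @ y       # textbook formula, Eq. 12
beta_hat_solve = np.linalg.solve(X.T @ X, X.T @ y)    # normal equations, Eq. 10

print(np.allclose(beta_hat_inv, beta_hat_solve))      # True
print(beta_hat_solve)                                 # close to the "true" beta used above
```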
3 Properties of the OLS Estimators
The primary property of OLS estimators is that they satisfy the criterion of minimizing the sum of squared residuals. However, there are other properties. These properties do not depend on any assumptions; they will always be true so long as we compute them in the manner just shown.
Recall the normal equations from earlier in Eq. 10:

(X′X)β̂ = X′y (13)
Now substitute in y = Xβ̂ + e to get

(X′X)β̂ = X′(Xβ̂ + e)
(X′X)β̂ = (X′X)β̂ + X′e
X′e = 0 (14)
4 The inverse of (X′X) may not exist. If this is the case, then this matrix is called non-invertible or singular and is said to be of less than full rank. There are two possible reasons why this matrix might be non-invertible. One, based on a trivial theorem about rank, is that n < k, i.e. we have more independent variables than observations. This is unlikely to be a problem for us in practice. The other is that one or more of the independent variables are a linear combination of the other variables, i.e. perfect multicollinearity.
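To illustrate the second case in the footnote, here is a small sketch (with simulated data of my own) in which one regressor is an exact linear combination of the constant and another regressor; X then has less than full rank, so X′X is singular and (X′X)−1 is not usable.

```python
# Sketch: perfect multicollinearity makes X'X singular (less than full rank).
import numpy as np

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, 2 * x1 + 3])   # third column = 2*x1 + 3*constant

print(np.linalg.matrix_rank(X))                     # 2, less than k = 3
print(np.linalg.cond(X.T @ X))                      # enormous: (X'X)^{-1} is not usable
```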
What does X′e look like?

$$
\begin{bmatrix}
X_{11} & X_{12} & \cdots & X_{1n} \\
X_{21} & X_{22} & \cdots & X_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
X_{k1} & X_{k2} & \cdots & X_{kn}
\end{bmatrix}
\begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{bmatrix}
=
\begin{bmatrix}
X_{11} \times e_1 + X_{12} \times e_2 + \dots + X_{1n} \times e_n \\
X_{21} \times e_1 + X_{22} \times e_2 + \dots + X_{2n} \times e_n \\
\vdots \\
X_{k1} \times e_1 + X_{k2} \times e_2 + \dots + X_{kn} \times e_n
\end{bmatrix}
=
\begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}
\tag{15}
$$
From X ′e = 0, we can derive a number of properties.
1. The observed values of X are uncorrelated with the residuals.
X′e = 0 implies that for every column xₖ of X, xₖ′e = 0. In other words, each regressor has zero sample correlation with the residuals. Note that this does not mean that X is uncorrelated with the disturbances; we’ll have to assume this.
If our regression includes a constant, then the following properties also hold.
2. The sum of the residuals is zero.
If there is a constant, then the first column in X (i.e. X1) will be a column of ones. This means that for the first element in the X′e vector (i.e. X11 × e1 + X12 × e2 + . . . + X1n × en) to be zero, it must be the case that ∑ ei = 0.
3. The sample mean of the residuals is zero.
This follows straightforwardly from the previous property, i.e. ē = ∑ ei / n = 0.
4. The regression hyperplane passes through the means of the observed values (x̄ and ȳ).
This follows from the fact that ē = 0. Recall that e = y − Xβ̂. Dividing by the number of observations, we get ē = ȳ − x̄β̂ = 0. This implies that ȳ = x̄β̂. This shows that the regression hyperplane goes through the point of means of the data.
5. The predicted values of y are uncorrelated with the residuals.
The predicted values of y are equal to Xβ̂, i.e. ŷ = Xβ̂. From this we have

ŷ′e = (Xβ̂)′e = β̂′X′e = 0 (16)

This last development takes account of the fact that X′e = 0.
6. The mean of the predicted Y’s for the sample will equal the mean of the observed Y’s, i.e. the sample mean of ŷ equals the sample mean of y.
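The six properties above are easy to check numerically. The sketch below fits OLS on simulated data that include a constant and verifies each one; the data and coefficients are illustrative assumptions, not part of the original notes.

```python
# Sketch: verify properties 1-6 on simulated data with a constant term.
import numpy as np

rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)    # OLS estimates
y_hat = X @ beta_hat                            # predicted values
e = y - y_hat                                   # residuals

print(np.allclose(X.T @ e, 0))                            # 1. X'e = 0
print(np.isclose(e.sum(), 0))                             # 2. residuals sum to zero
print(np.isclose(e.mean(), 0))                            # 3. mean residual is zero
print(np.isclose(y.mean(), X.mean(axis=0) @ beta_hat))    # 4. hyperplane through the means
print(np.isclose(y_hat @ e, 0))                           # 5. y-hat uncorrelated with e
print(np.isclose(y_hat.mean(), y.mean()))                 # 6. mean of y-hat equals mean of y
```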
These properties always hold true. You should be careful not to infer...