Setup. The goal of classical linear regression is to obtain the affine function of best fit given training data $(x_1, y_1), \dots, (x_m, y_m)$, where $x_i \in \mathbb{R}^n$ and $y_i \in \mathbb{R}$. An affine function $H$ in the variable $x = [x_1, \dots, x_n]$ is specified by $n+1$ parameters $\beta^T = [\beta_0, \beta_1, \dots, \beta_n]$ by
$$H(x) := \beta_0 + \sum_{i=1}^n x_i \beta_i = \tilde{x}^T \beta, \tag{1}$$
where $\tilde{x}^T = [1, x^T]$. The parameter $\beta_0$ controls the affine shift, while the parameters $\beta_1, \dots, \beta_n$ define a normal to any level set of $H$. "Best fit" is understood in the sense of quadratic error minimization. Thus, our problem is to find
$$\hat{\beta} = \operatorname*{argmin}_{H \text{ affine}} \left\{ \sum_{i=1}^m |y_i - H(x_i)|^2 \right\} = \operatorname*{argmin}_{\beta \in \mathbb{R}^{n+1}} \left\{ \sum_{i=1}^m |y_i - \tilde{x}_i^T \beta|^2 \right\} = \operatorname*{argmin}_{\beta \in \mathbb{R}^{n+1}} \left\{ \|y - X\beta\|^2 \right\}. \tag{2}$$
For the last equality, we define the design matrix $X$ as the $m \times (n+1)$ matrix whose $i$th row is the vector $\tilde{x}_i^T$. Also, $y^T := [y_1, \dots, y_m]$.
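To make the setup concrete, here is a minimal NumPy sketch of the design matrix construction. The data values and the names `X_raw`, `y`, and `design_matrix` are purely illustrative and not part of the derivation.

```python
import numpy as np

def design_matrix(X_raw):
    """Prepend a column of ones, so that row i equals tilde{x}_i^T = [1, x_i^T]."""
    m = X_raw.shape[0]
    return np.hstack([np.ones((m, 1)), X_raw])

# Toy data with m = 5 samples in R^2 (values chosen only for illustration).
X_raw = np.array([[0.0, 1.0],
                  [1.0, 0.5],
                  [2.0, -1.0],
                  [3.0, 2.0],
                  [4.0, 0.0]])
y = np.array([1.0, 2.0, 2.5, 4.0, 5.0])

X = design_matrix(X_raw)   # shape (m, n + 1) = (5, 3)
```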
Minimum. We can interpret the final form of (2) as the following question: what is the projection of a vector $y \in \mathbb{R}^m$ onto the column space of $X$? Note that the objective is convex in $\beta$, so we get a local-to-global property: any critical point is a global minimizer. Therefore, it suffices to compute the first order optimality conditions for $\|y - X\beta\|^2$ to find its minimum. We expand this out as
$$\begin{aligned}
\hat{\beta} &= \operatorname*{argmin}_{\beta} \|y - X\beta\|^2 \\
&= \operatorname*{argmin}_{\beta} \, (y - X\beta)^T (y - X\beta) \\
&= \operatorname*{argmin}_{\beta} \, (y^T - \beta^T X^T)(y - X\beta) \\
&= \operatorname*{argmin}_{\beta} \, \left( y^T y - y^T X \beta - \beta^T X^T y + \beta^T X^T X \beta \right) \\
&= \operatorname*{argmin}_{\beta} \, \left( \|y\|^2 - 2\, y^T X \beta + \beta^T (X^T X) \beta \right),
\end{aligned}$$
where the last line follows from the fact that transposition is the identity on scalars, so $\beta^T X^T y = (y^T X \beta)^T = y^T X \beta$. Since the first term does not involve $\beta$, it will not appear in the first order optimality condition. For the second term, we compute the partial derivative in coordinates as
$$\partial_{\beta_k} (y^T X \beta) = \partial_{\beta_k} \sum_{i,j} y_i X_{ij} \beta_j = \sum_{i,j} y_i X_{ij} \, \partial_{\beta_k} \beta_j = \sum_{i,j} y_i X_{ij} \delta_{jk} = \sum_i y_i X_{ik} = (y^T X)_k = (X^T y)_k.$$
The last equality follows from the fact that the $k$th entry of a vector and its dual agree, together with the convention that tangent vectors are column vectors. It follows that
$$\nabla_\beta (y^T X \beta) = X^T y.$$
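As a sanity check (not part of the argument), this identity can be verified numerically by central finite differences, reusing the illustrative `X` and `y` from the sketch above; `beta` and `eps` below are arbitrary choices.

```python
# Numerical check of grad_beta (y^T X beta) = X^T y on the toy data above.
rng = np.random.default_rng(0)
beta = rng.normal(size=X.shape[1])   # arbitrary test point
eps = 1e-6

f = lambda b: y @ X @ b              # the linear term y^T X beta
I = np.eye(len(beta))
numeric = np.array([(f(beta + eps * I[k]) - f(beta - eps * I[k])) / (2 * eps)
                    for k in range(len(beta))])
assert np.allclose(numeric, X.T @ y, atol=1e-5)
```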
For the third term, we set $A := X^T X$ and compute
$$\begin{aligned}
\partial_{\beta_k} \left( \beta^T (X^T X) \beta \right) = \partial_{\beta_k} (\beta^T A \beta) &= \partial_{\beta_k} \sum_{i,j} \beta_i A_{ij} \beta_j = \sum_{i,j} A_{ij} \, \partial_{\beta_k} (\beta_i \beta_j) = \sum_{i,j} A_{ij} (\delta_{ik} \beta_j + \beta_i \delta_{jk}) \\
&= \sum_j A_{kj} \beta_j + \sum_i A_{ik} \beta_i = (A\beta)_k + (\beta^T A)_k = (A\beta)_k + (A^T \beta)_k \\
&= ((A + A^T)\beta)_k = (2 (X^T X) \beta)_k.
\end{aligned}$$
It follows that
$$\nabla_\beta \left( \beta^T (X^T X) \beta \right) = 2 (X^T X) \beta.$$
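The same kind of finite-difference check applies to the quadratic term; the sketch below simply continues the one above (same `X`, `beta`, `eps`, `I`) and is again only illustrative.

```python
# Numerical check of grad_beta (beta^T X^T X beta) = 2 (X^T X) beta.
g = lambda b: b @ X.T @ X @ b        # the quadratic term beta^T (X^T X) beta
numeric = np.array([(g(beta + eps * I[k]) - g(beta - eps * I[k])) / (2 * eps)
                    for k in range(len(beta))])
assert np.allclose(numeric, 2 * (X.T @ X) @ beta, atol=1e-5)
```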
We use these computations to obtain the first order optimality conditions. The critical point $\hat{\beta}$ satisfies
$$0 = \nabla_\beta \|y - X\beta\|^2 \Big|_{\beta = \hat{\beta}} = \left( \nabla_\beta \|y\|^2 - 2 \nabla_\beta (y^T X \beta) + \nabla_\beta \left( \beta^T (X^T X) \beta \right) \right) \Big|_{\beta = \hat{\beta}} = -2 X^T y + 2 (X^T X) \hat{\beta}.$$
Dividing by two gives the normal equations $(X^T X) \hat{\beta} = X^T y$.
Assuming $X^T X$ is invertible, it follows that the minimizer is given by
$$\hat{\beta} = (X^T X)^{-1} X^T y.$$
This gives us the affine function of best fit, $H(x) = \tilde{x}^T \hat{\beta}$.
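To close the loop, the closed form can be evaluated on the illustrative data from above; the sketch compares the normal-equation solution with NumPy's built-in least-squares routine. In practice one solves the linear system rather than forming the explicit inverse, which is numerically safer.

```python
# Closed-form minimizer via the normal equations (X^T X) beta_hat = X^T y,
# using the illustrative X and y defined earlier.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's built-in least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_hat, beta_lstsq)

def H(x, beta):
    """The fitted affine function H(x) = beta_0 + x^T [beta_1, ..., beta_n]."""
    return beta[0] + x @ beta[1:]
```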