Bias-Variance, Randomness, and Gauss-Markov

We use the bias-variance decomposition in machine learning as a backdrop to explore some probability-theoretic ways of thinking. We also compare the sense of "bias" in the Gauss-Markov theorem with the one appearing in the bias-variance decomposition.

22 February 2025

All the computer scientists I've talked to have a very intuitive grasp of randomness, which comes through in the language they use. They'll use phrases like "sources of randomness," which reflects a level of comfort that I'd like to achieve. Let's explore this line of thinking.


Setup. Let's suppose the "true" relation between features $x$ and labels $y$ is given by the function $f$. The goal is to learn $f$, to which we only have restricted access through $y$. This is further complicated by noise in the system (due to measurement error, random fluctuations, white noise, etc.). This leads us to model the relation between features and labels as a deterministic-random decomposition

\begin{equation} y=f(x)+\epsilon, \end{equation}

where $\epsilon$ is assumed to have mean $0$ and variance $\sigma^2$. The mean-$0$ assumption is natural, and as far as I can tell, finite variance is a smallness assumption on the noise. Furthermore, we assume that $x,\epsilon$ are independent, because why wouldn't they be?
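
To make the setup concrete, here is a minimal sketch of this data-generating model in Python, with a made-up quadratic $f$ and Gaussian noise (both choices are illustrative assumptions, not part of the setup):

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # A hypothetical "true" relation; any deterministic function would do.
    return 1.0 + 2.0 * x - 0.5 * x ** 2

def sample(n, sigma=0.3):
    # Features x are drawn independently of the noise eps, as assumed above.
    x = rng.uniform(-2.0, 2.0, size=n)
    eps = rng.normal(0.0, sigma, size=n)  # mean 0, variance sigma^2
    return x, f(x) + eps                  # y = f(x) + eps
```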


Let $\hat{y}$ be whatever predictor we built in our effort to learn $f$. Building it requires training data, denoted $\mathcal{D}$, and different training data will lead to different predictors $\hat{y}$. This accounts for the first source of randomness, namely the variability in the training data. Moreover, since $\hat{y}$ is created via a deterministic process, all of the randomness in $\hat{y}$ is concentrated in the variability of $\mathcal{D}$.
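
A quick way to see this source of randomness: fit the same model class to two independent draws of $\mathcal{D}$ and compare the fitted predictors. A sketch, reusing the `f` and `sample` helpers from above:

```python
# Two independent training sets D1, D2 ...
x1, y1 = sample(20)
x2, y2 = sample(20)

# ... fed through the same deterministic fitting procedure (least squares
# on a degree-1 polynomial) yield two different predictors y-hat.
coef1 = np.polyfit(x1, y1, deg=1)
coef2 = np.polyfit(x2, y2, deg=1)
print(coef1, coef2)  # different coefficients: all randomness in y-hat comes from D
```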


Statistics. In order to measure how badly our model $\hat{y}$ predicts labels $y$, we introduce a loss function $L(x)=L(y(x),\hat{y}(x))$, where $x$ is a fixed but arbitrary test point. Of course, testing over individual points is too much information, so we average over test points $x$ to define the test error statistic

\begin{equation} Err_{\mathcal{D}}:=\mathbb{E}^x[L|\mathcal{D}]. \end{equation}

This accounts for the second source of randomness, namely averaging over test data. $Err_\mathcal{D}$ is a measurement, with respect to a chosen $L$, of the performance of the estimator $\hat{y}$. We emphasize again that $\hat{y}$ is fixed since $\mathcal{D}$ is, and test points $x$ are averaged over. This turns out to be a dead-end approach. To see why, assume that $x$ is drawn from some distribution with a density $p$. Then the test error is given by the formula

\begin{equation} Err_{\mathcal{D}}=\int_{x\in \mathbb{R}^n} L(x)p(x)\,dx. \end{equation}

In order to compute $Err_{\mathcal{D}}$, we would need complete knowledge of the density function $p$. Equivalently, this means we need to know the distribution that $x$ was sampled from. However, since we can only draw finitely many samples (via a training set $\mathcal{D}$), we can only estimate $p$.
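
In practice, then, one settles for a Monte Carlo estimate of the integral: draw a large held-out test set and average the loss over it. A sketch, continuing the code above:

```python
def test_error(coef, n_test=10_000):
    # Monte Carlo estimate of Err_D = E^x[L | D] under squared-error loss:
    # we can sample from p even though we never know its formula.
    x, y = sample(n_test)
    y_hat = np.polyval(coef, x)
    return np.mean((y - y_hat) ** 2)

print(test_error(coef1), test_error(coef2))  # Err_D varies with the training set
```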


To circumvent this problem, we instead compute the expected generalization error by averaging the test error over all training data $\mathcal{D}$,

\begin{equation} Err:=\mathbb{E}^\mathcal{D}[Err_{\mathcal{D}}]=\mathbb{E}^{x,\mathcal{D}}[L], \end{equation}

where we take a joint expectation over both the test points and the training data $\mathcal{D}$ used to build our model $\hat{y}$. The insight is to use the tower law / Fubini's theorem to flip the order of the iterated expectations. This allows us to write the expected generalization error as

\begin{equation} Err=\mathbb{E}^{x}[\mathbb{E}^{\mathcal{D}}[L|x]], \end{equation}

where the inner expectation averages over training data $\mathcal{D}$ and the outer expectation averages over test points $x$.
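
Numerically, the outer expectation is just one more loop: redraw the training set many times, refit, and average the test errors. A sketch, reusing the helpers above:

```python
def expected_error(deg, n_train=20, n_datasets=200):
    # Estimate Err = E^D[Err_D] by averaging over many training sets D.
    errs = []
    for _ in range(n_datasets):
        x, y = sample(n_train)
        errs.append(test_error(np.polyfit(x, y, deg)))
    return np.mean(errs)

print(expected_error(deg=1), expected_error(deg=9))  # rigid vs. wiggly model class
```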


Bias-variance. The goal of this section is to calculate the inner expectation $Err(x):=\mathbb{E}^{\mathcal{D}}[L|x]$ when $L$ is the $L^2$ loss (how else would variance enter?). For notation, we denote the conditional expectation with subscripts

\begin{equation} \mathbb{E}_x[f]:=\mathbb{E}[f|x]. \end{equation}

On a test data point $(x,y)$,

\begin{equation} \begin{split} Err(x) &= \mathbb{E}[(y-\hat{y})^2|x]\\ &=\mathbb{E}_x[(y-\hat{y})^2]\\ &=\mathbb{E}_x[(f+\epsilon-\hat{y})^2]\\ &=\mathbb{E}_x[(f-\hat{y})^2+\epsilon^2+2(f-\hat{y})\epsilon]\\ &=\mathbb{E}_x[(f-\hat{y})^2]+\mathbb{E}_x[\epsilon^2]+2\mathbb{E}_x[(f-\hat{y})\epsilon]\\ &=\mathbb{E}_x[(f-\hat{y})^2]+\mathbb{E}_x[\epsilon^2]+2\mathbb{E}_x[(f-\hat{y})]\mathbb{E}_x[\epsilon].\\ \end{split} \end{equation}

The last line requires some justification. Recall that $x,\epsilon$ are assumed to be independent. First, since $f$ is a deterministic function of $x$, it follows that $f,\epsilon$ are independent. The second point is a bit more subtle. Recall that the sources of randomness in $\hat{y}$ are the variability in the training data $\mathcal{D}$ and the test point $x$. When conditioned on $x$, the only remaining source of randomness is $\mathcal{D}$. In particular, none of the randomness in $\hat{y}$ comes from the noise $\epsilon$ in the test data. Therefore, $\epsilon,\hat{y}$ are also independent, and the last line follows. We continue:

\begin{equation} \begin{split} Err(x) &= \mathbb{E}_x[(f-\hat{y})^2]+\mathbb{E}_x[\epsilon^2]+2\mathbb{E}_x[(f-\hat{y})]\mathbb{E}_x[\epsilon]\\ &=\mathbb{E}_x[(f-\hat{y})^2]+\mathbb{E}[\epsilon^2]+2\mathbb{E}_x[(f-\hat{y})]\underbrace{\mathbb{E}[\epsilon]}_{=0}\\ &=\mathbb{E}_x[(f-\hat{y})^2]+\mathbb{E}[\epsilon^2]-\underbrace{\mathbb{E}[\epsilon]^2}_{=0^2}\\ &=\mathbb{E}_x[f^2-2f\hat{y}+\hat{y}^2]+\sigma^2\\ &=\mathbb{E}_x[f^2]-2\mathbb{E}_x[f\hat{y}]+\mathbb{E}_x[\hat{y}^2]+\sigma^2\\ &=\mathbb{E}_x[f^2]-2\mathbb{E}_x[f\hat{y}]+\mathbb{E}_x[\hat{y}^2]\pm\mathbb{E}_x[\hat{y}]^2+\sigma^2\\ &=\mathbb{E}_x[f^2]-2\mathbb{E}_x[f\hat{y}]+\mathbb{E}_x[\hat{y}]^2+\mathbb{V}_x[\hat{y}]+\sigma^2\\ &=f(x)^2\underbrace{\mathbb{E}_x[1]}_{=1}-2f(x)\mathbb{E}_x[\hat{y}]+\mathbb{E}_x[\hat{y}]^2+\mathbb{V}_x[\hat{y}]+\sigma^2.\\ \end{split} \end{equation}

In the second line, we use the independence of $x,\epsilon$ to get rid of the conditioning on $x$. In the last line, we use the fact that $f$ is deterministic, so conditioning on $x$ is just evaluation at $x$. Therefore,

\begin{equation} \begin{split} Err(x)&=f(x)^2-2f(x)\mathbb{E}_x[\hat{y}]+\mathbb{E}_x[\hat{y}]^2+\mathbb{V}_x[\hat{y}]+\sigma^2\\ &=\underbrace{(f(x)-\mathbb{E}_x[\hat{y}])^2}_{\text{Bias}(x)^2}+\mathbb{V}_x[\hat{y}]+\sigma^2. \end{split} \end{equation}
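
This decomposition can be checked numerically at a fixed test point $x$: simulate many training sets, record each $\hat{y}(x)$, and compare the empirical bias and variance against the average error. A sketch, again reusing the helpers above (the degree and $\sigma$ are illustrative choices):

```python
def bias_variance_at(x0, deg, n_train=20, n_datasets=2000, sigma=0.3):
    # Refit on a fresh training set each time; predict at the fixed point x0.
    preds = np.empty(n_datasets)
    for i in range(n_datasets):
        x, y = sample(n_train, sigma)
        preds[i] = np.polyval(np.polyfit(x, y, deg), x0)
    bias_sq = (f(x0) - preds.mean()) ** 2  # (f(x) - E_x[y-hat])^2
    var = preds.var()                      # V_x[y-hat]
    return bias_sq, var, bias_sq + var + sigma ** 2  # the three terms of Err(x)

print(bias_variance_at(1.5, deg=1))  # underfit: high bias, low variance
print(bias_variance_at(1.5, deg=9))  # overfit: low bias, high variance
```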

Gauss-Markov. The Gauss-Markov theorem tells us that OLS is unbiased and variance-minimizing among all linear unbiased estimators. At a glance, this seems to contradict most of the pictures drawn when discussing bias-variance. Usually, the picture suggests that the data follows a (say) quadratic $f$; linear models are said to underfit with high bias and low variance, while high-degree polynomials are said to overfit with low bias and high variance. So how can Gauss-Markov claim that OLS, a linear function, is unbiased? Are we overloading the word "bias"?


First, realize that built into Gauss-Markov are stronger assumptions on the errors, but more importantly, the underlying assumption that the true function $f$ is linear in the features. This restricts your search space to affine functions from the get-go. In other words, we assume that

\begin{equation} y= X\beta +\epsilon, \end{equation}

for some column vector $\beta$ and design matrix $X$. This resolves the apparent contradiction that arose with the pictures. Now for a discussion of the two senses of "bias". In classical OLS, we solve for the estimator $\hat{\beta}$ as

\begin{equation} \hat{\beta} = (X^TX)^{-1}X^Ty, \end{equation}

which is calculated in this blog post. The claim that "OLS is unbiased" is classically understood in terms of parameter estimation, namely $\mathbb{E}[\hat{\beta}|X]=\beta$. Note that the design matrix (the training data) is treated as a deterministic given. The proof comes from expanding out $\hat{\beta}$ as

\begin{equation} \begin{split} \hat{\beta}&=(X^TX)^{-1}X^Ty\\ &=(X^TX)^{-1}X^T (X\beta+\epsilon)\\ &=(X^TX)^{-1}X^TX\beta+(X^TX)^{-1}X^T\epsilon\\ & =\beta + (X^TX)^{-1}X^T\epsilon, \end{split} \end{equation}

and so

\begin{equation} \begin{split} \mathbb{E}[\hat{\beta}-\beta|X]&=\mathbb{E}[(X^TX)^{-1}X^T\epsilon|X]\\ &=(X^TX)^{-1}X^T\mathbb{E}[\epsilon|X]\\ &=0. \end{split} \end{equation}
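
This can also be spot-checked by simulation: fix one design matrix $X$, redraw only the noise, and average the resulting estimates $\hat{\beta}$. A self-contained sketch with made-up dimensions and coefficients:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
X = rng.normal(size=(n, p))         # the design is fixed: a deterministic given
beta = np.array([1.0, -2.0, 0.5])   # made-up true coefficients

beta_hats = []
for _ in range(5000):
    eps = rng.normal(0.0, 0.3, size=n)   # only the noise is redrawn
    y = X @ beta + eps
    beta_hats.append(np.linalg.lstsq(X, y, rcond=None)[0])  # solves OLS

print(np.mean(beta_hats, axis=0))  # approx. beta: E[beta-hat | X] = beta
```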

In machine learning, however, we're more interested in predictions. So let's instead view OLS as a prediction model. In this case, the bias of OLS is now understood as

\begin{equation} \begin{split} \text{Bias}(x)&=f(x)-\mathbb{E}^X[\hat{y}|x]\\ &= x^T\beta-\mathbb{E}^X[x^T\hat{\beta}|x], \end{split} \end{equation}

where we now take the expectation over design matrices $X$, and $(x, x^T\hat{\beta})$ is a fixed test point. Note that the design matrix is now random. Unbiasedness of OLS in this sense follows from the fact that $\mathbb{E}^X[\hat{\beta}|x]=\beta$. Assuming this, we compute

\begin{equation} \begin{split} \text{Bias}(x)&=x^T\beta-\mathbb{E}^X[x^T\hat{\beta}|x]\\ &=x^T(\beta-\mathbb{E}^X[\hat{\beta}|x])\\ &=0. \end{split} \end{equation}

To show this fact, we compute

\begin{equation} \begin{split} \mathbb{E}^X[\hat{\beta}|x]&=\mathbb{E}^X[(X^TX)^{-1}X^Ty|x]\\ &=\mathbb{E}^X[(X^TX)^{-1}X^T(X\beta+\epsilon)|x]\\ &=\mathbb{E}^X[\beta|x]+\mathbb{E}^X[(X^TX)^{-1}X^T\epsilon|x]\\ &=\beta+\mathbb{E}^X[(X^TX)^{-1}X^T|x]\underbrace{\mathbb{E}^X[\epsilon|x]}_{=0}, \end{split} \end{equation}

where the last equality comes from the fact that the noise $\epsilon$ is assumed independent of the design matrix $X$, so the expectation factors, and $\mathbb{E}[\epsilon|x]=0$ finishes the computation. Compare equations $(13)$ and $(16)$. It's interesting to note that the two senses of "unbiased" follow essentially the same proof, up until the last line.
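
The prediction-level claim can be spot-checked the same way as the parameter-level one, except the design matrix is now redrawn on every iteration along with the noise. A sketch, with the same made-up dimensions as before:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 3
beta = np.array([1.0, -2.0, 0.5])  # made-up true coefficients
x_test = rng.normal(size=p)        # one fixed test point

preds = []
for _ in range(5000):
    X = rng.normal(size=(n, p))    # the training design is now random too
    eps = rng.normal(0.0, 0.3, size=n)
    beta_hat = np.linalg.lstsq(X, X @ beta + eps, rcond=None)[0]
    preds.append(x_test @ beta_hat)

print(np.mean(preds), x_test @ beta)  # approximately equal: Bias(x) = 0
```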