Principal Component Analysis

We discuss some very basic ideas from empirical processes and use them to motivate PCA.

29 March 2025

Measure Theory. What's the most natural probability measure associated to a sample $x_1,\dots,x_m\in\mathbb{R}^n$? The easiest, most unbiased way is to treat these points as your entire measure space with the uniform measure. For simplicity, take $n=1$ so that we're dealing with random variables. We take the Dirac measure $𝟙_{x_j}$ for each data point $x_j$ and define the empirical measure on $\mathbb{R}$ as

\begin{equation}
\mathbb{P}_{\mathbb{X}} := \frac{1}{m}\sum_{j=1}^m 𝟙_{x_j},
\end{equation}

where we set $\mathbb{X}:=\{x_1,\dots,x_m\}$. Let's compute the expectation and variance of this distribution. Let $x\sim\mathbb{P}_\mathbb{X}$. We compute the sample expectation as

\begin{equation}
\begin{split}
\mathbb{E}[x] &= \sum_{i=1}^m x_i\,\mathbb{P}_\mathbb{X}(x=x_i)\\
&= \sum_{i=1}^m x_i\,\mathbb{P}_\mathbb{X}(x_i)\\
&= \sum_{i=1}^m x_i\left(\frac{1}{m}\sum_{j=1}^m 𝟙_{x_j}(x_i)\right)\\
&= \sum_{i=1}^m x_i\left(\frac{1}{m}\sum_{j=1}^m \delta_{ij}\right)\\
&= \frac{1}{m}\sum_{i=1}^m x_i.
\end{split}
\end{equation}

Under this measure, the expected value yields the center of mass. Therefore, we set $\bar{x}=\mathbb{E}[x]$. Next, we compute the sample variance as

\begin{equation}
\begin{split}
\mathbb{V}[x] &= \mathbb{E}\left[(x-\bar{x})^2\right]\\
&= \sum_{i=1}^m (x_i-\bar{x})^2\,\underbrace{\mathbb{P}_\mathbb{X}(x=x_i)}_{=\frac{1}{m}}\\
&= \frac{1}{m}\sum_{i=1}^m (x_i-\bar{x})^2.
\end{split}
\end{equation}

We mention that the second equality in the calculation of the variance relies on an application of LOTUS (the law of the unconscious statistician). Repeating these calculations in higher dimensions yields the center-of-mass vector for the expectation and the sample covariance matrix for the variance. Observe that the natural $\frac{1}{m}$ normalization makes the sample variance a biased estimator of the population variance (the unbiased convention divides by $m-1$).
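As a quick sanity check on these formulas, here is a minimal NumPy sketch; the synthetic data and the variable names in it are illustrative assumptions rather than anything fixed by the text.

```python
import numpy as np

# Illustrative data: m = 500 samples in R^3 (purely synthetic).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
m = X.shape[0]

# Empirical mean: the center of mass of the sample.
x_bar = X.mean(axis=0)

# Sample covariance with the natural 1/m normalization (a biased estimator).
C = (X - x_bar).T @ (X - x_bar) / m

# np.cov divides by m - 1 by default; bias=True recovers the 1/m convention.
assert np.allclose(C, np.cov(X, rowvar=False, bias=True))
```

The final assertion confirms that the $\frac{1}{m}$ convention is exactly what NumPy calls the biased estimator.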


$p_1$. The goal is to find a direction $p\in\mathbb{S}^{n-1}\subset\mathbb{R}^n$ in which the data has maximal spread. Since data spread is a purely empirical notion, we quantify it à la sample variance. Let $y_i:=p\cdot x_i$ be the projection of $x_i$ in the direction of $p$. Our goal, therefore, is to find a direction that maximizes the sample variance of $\mathbb{Y}=\{y_1,\dots,y_m\}$. This is formalized by the optimization problem

\begin{equation}
\begin{split}
&\max \; \mathbb{V}_p[y]\\
&\text{s.t.} \quad \|p\|=1,
\end{split}
\end{equation}

for $y\sim\mathbb{P}_\mathbb{Y}$, the empirical distribution over the projected points. The subscript on the variance emphasizes that this is a post-projection computation. Our first goal is to expand out $\mathbb{V}_p$ as a quadratic form.

\begin{equation}
\begin{split}
\mathbb{V}_p[y] &= \frac{1}{m}\sum_{i=1}^m (y_i-\bar{y})^2\\
&= \frac{1}{m}\sum_{i=1}^m \left(p^T x_i-\left(\frac{1}{m}\sum_{j=1}^m p^T x_j\right)\right)^2\\
&= \frac{1}{m}\sum_{i=1}^m \left(p^T(x_i-\bar{x})\right)^2\\
&= \frac{1}{m}\sum_{i=1}^m \left(p^T(x_i-\bar{x})\right)\left(p^T(x_i-\bar{x})\right)^T\\
&= p^T\underbrace{\left(\frac{1}{m}\sum_{i=1}^m (x_i-\bar{x})(x_i-\bar{x})^T\right)}_{=C}\,p,
\end{split}
\end{equation}

where $C$ is the sample covariance matrix of the data $\mathbb{X}$. Therefore, our maximization problem becomes

\begin{equation}
\begin{split}
&\max \; p^T C p\\
&\text{s.t.} \quad \|p\|=1,
\end{split}
\end{equation}

which can be solved with Lagrange multipliers. Our objective function and constraint together define the Lagrangian

\begin{equation}
\begin{split}
\mathcal{L}(p,\lambda) &= p^T C p - \lambda(p^T p - 1)\\
&= p^T(C-\lambda I)p + \lambda,
\end{split}
\end{equation}

which has $p$-critical points at

\begin{equation}
0 = \frac{\partial \mathcal{L}}{\partial p}(p) = 2(C-\lambda I)p.
\end{equation}

This tells us that a maximizer of (6) is an eigendirection of $C$. At this point there's an ambiguity in the sign of $\lambda$, since the constraints are equality constraints. We will fix this ambiguity later. We define a first principal direction $p_1$ as an eigendirection with the largest eigenvalue $\lambda_1$. We claim that $p_1$ solves the optimization problem (4). This is because

\begin{equation}
\mathbb{V}_p[y] = p^T C p = p^T(\lambda p) = \lambda,
\end{equation}

so maximizing the eigenvalue $\lambda$ corresponds to maximizing the sample variance. Incidentally, the same computation resolves the sign ambiguity of $\lambda$ and shows that the covariance matrix $C$ is positive semi-definite, since variance is a non-negative quantity.
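The eigenvalue argument is easy to verify numerically. Below is a sketch assuming NumPy, with synthetic anisotropic data and illustrative names; it extracts the top eigenpair of $C$ and checks that the projected sample variance equals $\lambda_1$.

```python
import numpy as np

# Synthetic data with deliberately unequal spread along the coordinate axes.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 4)) * np.array([3.0, 1.0, 0.5, 0.1])
m = X.shape[0]

x_bar = X.mean(axis=0)
C = (X - x_bar).T @ (X - x_bar) / m           # 1/m-normalized covariance matrix

# C is symmetric PSD, so eigh returns real eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(C)
lam1, p1 = eigvals[-1], eigvecs[:, -1]        # top eigenpair: first principal direction

# The projected sample variance V_p[y] = p^T C p equals the top eigenvalue.
y = (X - x_bar) @ p1
assert np.isclose(y.var(), lam1)              # np.var also uses the 1/m convention
assert np.isclose(p1 @ C @ p1, lam1)
```

Using np.linalg.eigh here is a deliberate choice: $C$ is symmetric positive semi-definite, so its eigenvalues are real and its eigenvectors can be taken orthonormal.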


Why orthogonality? The remaining principal directions are computed iteratively under the same variance-maximization scheme, but under the new constraint that $p_{i+1}$ be orthogonal to the previous principal directions $\{p_1,\dots,p_i\}$. This orthogonality condition enables the use of Rayleigh's principle, allowing us to compute the eigenvalues of the covariance matrix iteratively via Lagrangians.
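Concretely, the step that produces $p_2$ can be sketched in the same notation as the constrained problem

\begin{equation*}
\begin{split}
&\max \; p^T C p\\
&\text{s.t.} \quad \|p\|=1, \quad p\cdot p_1=0.
\end{split}
\end{equation*}

Introducing a second multiplier $\mu$ for the new constraint, the critical-point equation reads $2Cp - 2\lambda p - \mu p_1 = 0$; pairing with $p_1$ and using $Cp_1=\lambda_1 p_1$ forces $\mu=0$, so the maximizer is again an eigendirection, now orthogonal to $p_1$ and attaining the second-largest eigenvalue $\lambda_2$.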


It's not immediately clear why orthogonality is more than just a convenient mathematical trick. What is the statistical motivation behind enforcing this condition? The answer gets at what covariance is and why we employ it as the natural inner product. For our purposes, it's best to define covariance as the error term blocking variance from being an additive homomorphism. In other words, for random variables $X,Y$ we define their covariance as

\begin{equation}
\text{Cov}(X,Y) := \frac{1}{2}\bigg(\mathbb{V}[X+Y] - \mathbb{V}[X] - \mathbb{V}[Y]\bigg).
\end{equation}

The prefactor of one half is for the convenient normalization $\text{Cov}(X,X)=\mathbb{V}[X]$. We say random variables are uncorrelated if and only if their covariance is $0$. The key here is that the concept of being uncorrelated, or equivalently variance additivity, captures the idea of variance independence.
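To connect this definition with the more familiar formula, expand the right-hand side using linearity of expectation (a one-line check):

\begin{equation*}
\mathbb{V}[X+Y] = \mathbb{E}\Big[\big((X-\mathbb{E}[X])+(Y-\mathbb{E}[Y])\big)^2\Big] = \mathbb{V}[X]+\mathbb{V}[Y]+2\,\mathbb{E}\big[(X-\mathbb{E}[X])(Y-\mathbb{E}[Y])\big],
\end{equation*}

so $\text{Cov}(X,Y)=\mathbb{E}[(X-\mathbb{E}[X])(Y-\mathbb{E}[Y])]$, and the covariance vanishes exactly when the variance of a sum is the sum of the variances.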


Back to PCA. If we were in the business of maximizing variance through unscrupulous means, ideally we'd try to reach a total variance of $n\mathbb{V}_{p_1}[y_1]$. Being the crafty and morally dubious mathematician that I am, I fix a cone $C_\epsilon$ of small opening angle $\epsilon$ around $p_1$ and arbitrarily pick directions $p_2,\dots,p_n\in C_\epsilon$ to form the basis $\{p_1,\dots,p_n\}$. The total variance of the basis is nearly $n\mathbb{V}_{p_1}[y_1]$, since I've essentially captured the variance in the $p_1$ direction $n$ times. The issue is that the sources of variation are highly dependent; the fix is to impose variance independence. As discussed, this is equivalent to asking that the projections onto the principal directions be uncorrelated. This motivates us to define covariance as our inner product, and the orthogonality constraint on the principal directions follows.
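For eigendirections of $C$, being uncorrelated reduces to ordinary orthogonality: if $Cp_j=\lambda_j p_j$, the covariance of the projections onto $p_i$ and $p_j$ is

\begin{equation*}
\text{Cov}(p_i\cdot x,\,p_j\cdot x) = p_i^T C\, p_j = \lambda_j\, p_i^T p_j,
\end{equation*}

which vanishes precisely when $p_i\perp p_j$ (whenever $\lambda_j\neq 0$).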