Measure Theory. What's the most natural probability measure associated to a sample $\{x_1, \dots, x_n\} \subset \mathbb{R}^d$? The easiest, most unbiased way is to treat these points as your entire measure space with the uniform measure. For simplicity, take $d = 1$ so that we're dealing with real-valued random variables. We take the Dirac measure $\delta_{x_i}$ for each data point and define the empirical measure on $\mathbb{R}$ as
$$\hat{\mu}_n := \frac{1}{n} \sum_{i=1}^n \delta_{x_i},$$
where we set $\delta_{x}(A) := \mathbf{1}_A(x)$. Let's compute the expectation and variance of this distribution. Let $X \sim \hat{\mu}_n$. We compute the sample expectation as
$$\mathbb{E}[X] = \int_{\mathbb{R}} x \, d\hat{\mu}_n(x) = \frac{1}{n} \sum_{i=1}^n x_i.$$
Under this measure, the expected value yields the center-of-mass. Therefore, we set $\bar{x} := \frac{1}{n} \sum_{i=1}^n x_i$. Next, we compute the sample variance as
$$\operatorname{Var}[X] = \mathbb{E}\left[(X - \bar{x})^2\right] = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2.$$
We mention that the second equality in the calculation of the variance relies on an application of LOTUS (the law of the unconscious statistician). Repeating these calculations in higher dimensions yields the center-of-mass vector for the expectation and the sample covariance matrix for the variance. Observe that the natural $\frac{1}{n}$ normalization for the sample variance makes it a biased estimator of the population variance.
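A minimal numerical sketch of the formulas above: the empirical measure puts mass $\frac{1}{n}$ on each data point, so expectation and variance under it reduce to the familiar sample mean and $\frac{1}{n}$-normalized sample variance. The data points are made up for illustration.

```python
# Tiny synthetic sample; the empirical measure puts mass 1/n on each point.
xs = [1.0, 2.0, 4.0, 7.0]
n = len(xs)

# E[X] under the empirical measure: the center of mass.
mean = sum(xs) / n

# Var[X] under the empirical measure: note the 1/n prefactor, which makes
# this a biased estimator of the population variance.
var_biased = sum((x - mean) ** 2 for x in xs) / n

# The unbiased (Bessel-corrected) estimator divides by n - 1 instead.
var_unbiased = sum((x - mean) ** 2 for x in xs) / (n - 1)

print(mean, var_biased, var_unbiased)  # 3.5 5.25 7.0
```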
PCA. The goal is to find a direction in which the data has maximal spread. Since data spread is a purely empirical notion, we quantify it à la sample variance. Let $y_i := \langle x_i, w \rangle$ be the projection of $x_i$ in the direction of a unit vector $w$. Our goal, therefore, is to find a direction $w$ that maximizes the sample variance $\operatorname{Var}_w[y]$. This is formalized by the optimization problem
$$\max_{\|w\| = 1} \operatorname{Var}_w[y] \tag{5}$$
for the empirical distribution over the projected points $y_1, \dots, y_n$. The subscript on the variance emphasizes that this is a post-projection computation. Our first goal is to expand out $\operatorname{Var}_w[y]$ as a quadratic form:
$$\operatorname{Var}_w[y] = \frac{1}{n} \sum_{i=1}^n (y_i - \bar{y})^2 = \frac{1}{n} \sum_{i=1}^n \langle x_i - \bar{x}, w \rangle^2 = w^\top S w,$$
where $S := \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^\top$ is the sample covariance matrix of the data $x_1, \dots, x_n$. Therefore, our maximization problem becomes
$$\max_{\|w\| = 1} w^\top S w, \tag{6}$$
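The quadratic-form identity is easy to check numerically: for a unit vector $w$, the empirical variance of the projections $\langle x_i, w \rangle$ equals $w^\top S w$, with $S$ the $\frac{1}{n}$-normalized sample covariance matrix. The data and direction below are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))             # 50 synthetic points in R^3

xbar = X.mean(axis=0)
Xc = X - xbar                            # center the data
S = Xc.T @ Xc / len(X)                   # sample covariance matrix, 1/n convention

w = np.array([1.0, 2.0, -1.0])
w /= np.linalg.norm(w)                   # unit direction

y = X @ w                                # projected points
var_proj = ((y - y.mean()) ** 2).mean()  # empirical variance of the projections

# Post-projection variance equals the quadratic form w^T S w.
assert np.isclose(var_proj, w @ S @ w)
```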
which can be solved with Lagrange multipliers. Our objective function and constraint together define the Lagrangian
$$\mathcal{L}(w, \lambda) := w^\top S w - \lambda \left( w^\top w - 1 \right),$$
which has $w$-critical points at
$$\nabla_w \mathcal{L} = 2 S w - 2 \lambda w = 0 \iff S w = \lambda w.$$
This tells us that a maximum of (6) is an eigendirection of $S$. At this point there's an ambiguity in the sign of $w$, since the constraints are equality constraints. We will fix this ambiguity later. We define a first principal direction $w_1$ as an eigendirection with the largest eigenvalue $\lambda_1$. We claim that $w_1$ solves the optimization problem (5). This is because
$$\operatorname{Var}_w[y] = w^\top S w = w^\top (\lambda w) = \lambda \|w\|^2 = \lambda,$$
so maximizing over the eigenvalues corresponds to maximizing the sample variance. Incidentally, the same computation constrains the sign of $\lambda$ and shows that the covariance matrix is positive semi-definite, since variance is a non-negative quantity.
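The eigenvalue characterization can be sketched numerically: the top eigenvector of $S$ attains variance $\lambda_1$, every eigenvalue is non-negative, and no random unit direction beats the top eigendirection. The anisotropic synthetic data below is for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic data with deliberately unequal spread along the coordinate axes.
X = rng.normal(size=(200, 4)) @ np.diag([3.0, 1.0, 0.5, 0.1])

Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / len(X)                 # sample covariance matrix

# S is symmetric, so eigh applies; eigenvalues come back in ascending order.
evals, evecs = np.linalg.eigh(S)
lam1, w1 = evals[-1], evecs[:, -1]     # top eigenvalue / first principal direction

assert np.all(evals >= -1e-12)         # PSD: every eigenvalue is non-negative
assert np.isclose(w1 @ S @ w1, lam1)   # variance along w_1 equals lambda_1

# No random unit direction achieves more projected variance than w_1.
for _ in range(1000):
    w = rng.normal(size=4)
    w /= np.linalg.norm(w)
    assert w @ S @ w <= lam1 + 1e-12
```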
Why orthogonality? The further principal components are computed iteratively under the same variance-maximization scheme, but under the new constraint that each new direction $w_{k+1}$ is orthogonal to the previous principal directions $w_1, \dots, w_k$. This orthogonality condition enables the use of Rayleigh's principle, allowing us to compute the eigenvalues of the covariance matrix iteratively via Lagrangians.
It's not immediately clear why orthogonality is more than just a convenient mathematical trick. What is the statistical motivation behind enforcing this condition? The answer gets to what covariance is and why we employ it as the natural inner product. For our purposes, it's best to define covariance as the error term blocking variance from being an additive homomorphism. In other words, define for random variables $X$ and $Y$ their covariance
$$\operatorname{Cov}(X, Y) := \frac{1}{2} \left( \operatorname{Var}[X + Y] - \operatorname{Var}[X] - \operatorname{Var}[Y] \right).$$
The prefactor of one half is for the convenient normalization $\operatorname{Cov}(X, X) = \operatorname{Var}[X]$. We say random variables are uncorrelated if and only if their covariance is $0$. The key here is that the concept of being uncorrelated, or equivalently variance additivity, captures the idea of variance independence.
Back to PCA. If we were in the business of maximizing variance through unscrupulous means, ideally we'd try to reach a total variance of $d \cdot \lambda_1$. Being the crafty and morally onerous mathematician that I am, I fix a cone of small opening angle around $w_1$ and arbitrarily pick $d$ directions inside it to form the basis $\{v_1, \dots, v_d\}$. The total variance of the basis is nearly $d \cdot \lambda_1$, since I've essentially captured the variance in the direction $w_1$ $d$ times. The issue is that the sources of variation are highly dependent; the fix is to impose variance independence. As discussed, this is equivalent to asking that the principal directions be uncorrelated. This motivates us to define covariance as our inner product, and the orthogonality constraint on the principal directions follows.
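The overcounting argument can be sketched numerically: $d$ nearly-parallel unit directions inside a tiny cone around $w_1$ report a total variance of roughly $d \cdot \lambda_1$, while any orthonormal basis (here, the eigenbasis) can only account for $\operatorname{tr}(S)$ in total. The data and the cone's opening angle are synthetic choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 4
# Synthetic data with one dominant direction of spread.
X = rng.normal(size=(500, d)) @ np.diag([3.0, 1.0, 0.5, 0.1])
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / len(X)

evals, evecs = np.linalg.eigh(S)
lam1, w1 = evals[-1], evecs[:, -1]     # top eigenvalue and direction

# d unit directions inside a tiny cone around w_1 (small random tilts).
cone = []
for _ in range(d):
    v = w1 + 1e-3 * rng.normal(size=d)
    cone.append(v / np.linalg.norm(v))
total_cone = sum(v @ S @ v for v in cone)

# An orthonormal basis splits the variance without double counting:
# the per-direction variances sum to trace(S).
total_ortho = sum(evecs[:, i] @ S @ evecs[:, i] for i in range(d))

assert np.isclose(total_cone, d * lam1, rtol=1e-3)   # ~ d * lambda_1
assert np.isclose(total_ortho, np.trace(S))          # honest accounting
assert total_cone > total_ortho                      # the overcount is real
```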