
Principal components analysis

July 12, 2021

Principal components analysis (PCA) is a ubiquitous method of dimensionality reduction, used in fields ranging from finance to genomics. In this post I'm going to look at PCA from several different standpoints, each giving its own perspective on the method.

Principal components analysis was discovered multiple times throughout history. The most relevant milestones are Pearson's use of it in 1901 and its later development by Hotelling in the 1930s, who coined the name and popularized the method as a general-purpose tool.

Optimization problem perspective

Identify the axes that preserve the most information about the original data

Suppose that we've obtained $n$ $p$-dimensional data points. For instance, let $n = 100\,000$ be the number of sequenced Russian citizens and $p = 300\,000$ the number of polymorphic single-nucleotide polymorphisms (SNPs) that characterize each citizen. We want to cluster the Russian population into sub-populations and visually inspect the results, by projecting each person from the initial 300,000-dimensional space onto a lower-dimensional space (2D or 3D).

For our 2D/3D representation, we want to select axes that preserve as much information about a person as possible. To put it more formally: if we had a single point, we would want the distance from that point to its projection onto our 2D/3D space to be minimal. Since we have $n$ points, we want the sum of squared distances from each of those points to its projection to be minimal.

E.g. if we choose to project the data onto a 2D space, we are looking for axes $v$ and $w$ such that the sum of squared distances from the points to their projections is minimal:

$$v, w = \arg \min_{v, w} \sum \limits_{i=1}^{n} d^2(X_i, proj_{v,w} X_i)$$
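In practice such axes are computed by a library rather than by solving this optimization by hand. Below is a minimal sketch using scikit-learn's PCA (the toy matrix size and all variable names are made up for illustration), which projects the data onto the two best axes:

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy stand-in for the genotype matrix: n samples x p features
# (the real matrix would be ~100,000 x 300,000).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))

# Find the 2 axes (v, w) that minimize the total squared distance
# between the points and their projections, and project onto them.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)            # shape (500, 2): coordinates along v and w
print(pca.components_.shape)           # (2, 50): the axes v and w as row vectors
print(pca.explained_variance_ratio_)   # fraction of variance each axis preserves
```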

Minimal sum of squared distances to the axes $\iff$ maximum explained variance

By the Pythagorean theorem, for each point the squared projection length and the squared distance to the projection add up to the squared length of the point itself:

$$proj^2_{v,w} X_i + d^2(X_i, proj_{v,w} X_i) = X_i^2$$

Summing over all points, the right-hand side is a constant that does not depend on the choice of axes:

$$\sum \limits_{i=1}^{n} proj^2_{v,w} X_i + \sum \limits_{i=1}^{n} d^2(X_i, proj_{v,w} X_i) = \sum \limits_{i=1}^{n} X_i^2 = const$$

Hence:

$$v, w = \arg \min_{v, w} \sum \limits_{i=1}^{n} d^2(X_i, proj_{v,w} X_i) = \arg \max_{v, w} \sum \limits_{i=1}^{n} proj^2_{v,w} X_i$$

Thus, by choosing $v, w$ that minimize the sum of squared distances from the points to their projections, we simultaneously maximize the sum of squared lengths of the projections themselves.
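As a quick numerical sanity check of this equivalence, here is a sketch on toy data (variable names are made up for illustration) verifying that the squared projection lengths and the squared residual distances add up to the total squared norm of the centered data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
Xc = X - X.mean(axis=0)                  # center the data, as PCA does internally

pca = PCA(n_components=2).fit(Xc)
coords = pca.transform(Xc)               # coordinates of each point in the (v, w) plane
Xc_hat = pca.inverse_transform(coords)   # the projections, expressed in the original space

sum_sq_projections = (coords ** 2).sum()        # total squared projection length
sum_sq_residuals = ((Xc - Xc_hat) ** 2).sum()   # total squared distance to the plane
total = (Xc ** 2).sum()                         # constant, independent of the axes

assert np.isclose(sum_sq_projections + sum_sq_residuals, total)
```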

Iterative optimization process

Observation: it turns out that you can use a greedy algorithm to perform the required optimization.

Indeed, if the dimensionality of the space we are going to project the data onto is $m$, we can always choose an orthogonal basis of the $m$-dimensional subspace we are looking for.

Thus, we can first find one axis $v$ such that:

$$v = \arg \min_{v} \sum \limits_{i=1}^{n} d^2(X_i, proj_{v} X_i)$$

Then we find $w$, which must be orthogonal to $v$, such that:

$$w = \arg \min_{w} \sum \limits_{i=1}^{n} d^2(X_i, proj_{w} X_i), \text{ where } w \text{ is orthogonal to } v$$

The next basis vector should be orthogonal to both $v$ and $w$, and so on. Luckily, as we'll see in the next section, PCA automatically makes each new axis orthogonal to the previous ones.
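To make the greedy idea concrete, here is a rough numpy sketch (toy data; the function name and the power-iteration approach are my own choices for illustration, not the procedure used by any particular library): it finds the first axis, removes that direction from the data, and finds the second axis on what remains.

```python
import numpy as np

def leading_axis(X, n_iter=1000, seed=0):
    """Power iteration on the sample covariance matrix: returns the unit vector
    that maximizes the variance of the projections of the rows of X onto it."""
    cov = X.T @ X / (X.shape[0] - 1)
    v = np.random.default_rng(seed).normal(size=X.shape[1])
    for _ in range(n_iter):
        v = cov @ v
        v /= np.linalg.norm(v)
    return v

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5)) @ rng.normal(size=(5, 5))  # correlated toy data
X = X - X.mean(axis=0)                                   # center it

v = leading_axis(X)                      # first principal axis
X_deflated = X - np.outer(X @ v, v)      # remove each point's component along v
w = leading_axis(X_deflated)             # second axis, found greedily on what is left

print(np.dot(v, w))   # ~0: the second axis comes out orthogonal to the first
```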

Reformulation as an eigenvectors/eigenvalues problem

Alright, how do we find a single vector such that the projection onto it is maximal? Pretty simple: if the axis vector is $v$, the squared length of the projection of a data point $X_i$ onto it is $\frac{(v^T X_i)^2}{v^T v}$.

We are looking to maximize the sum of these squared projection lengths over all the data points:

$v = \arg \max_v \frac{v^T \bm{X}^T \bm{X} v}{v^T v} = \arg \max_v \frac{v^T \Sigma v}{v^T v}$, where $\bm{X}$ is the centered data matrix and $\Sigma = \frac{1}{n-1} \sum \limits_{i=1}^{n} (X_i - \hat{X})(X_i - \hat{X})^T = \frac{\bm{X}^T \bm{X}}{n-1}$ is the sample covariance matrix (the constant factor $\frac{1}{n-1}$ does not affect the $\arg \max$).

By the way, the quantity $R(\Sigma, v) = \frac{v^T \Sigma v}{v^T v}$ is called the Rayleigh quotient (it is usually considered for symmetric matrices, and the covariance matrix is symmetric).

Since $\Sigma$ is symmetric, it admits an eigendecomposition $\Sigma = E \Lambda E^T$, where $E$ is the orthogonal matrix of eigenvectors and $\Lambda$ is the diagonal matrix of eigenvalues. Thus $R(\Sigma, v) = \frac{(v^T E) \Lambda (E^T v)}{v^T v} = \frac{Y^T \Lambda Y}{v^T v}$, where $Y = E^T v$.

The elements $y_i$ of the vector $Y$ are the coordinates of $v$ in the basis of eigenvectors; the eigenvectors are orthonormal, because the covariance matrix is symmetric, thus $v^T v = v^T E E^T v = Y^T Y$.

Thus, $R(\Sigma, v) = R(\Lambda, Y) = \frac{Y^T \Lambda Y}{Y^T Y} = \frac{\sum \limits_{i=1}^{p} \lambda_i y_i^2}{\sum \limits_{i=1}^{p} y_i^2}$.

If one of the eigenvalues, say $\lambda_q$, is larger than all the others, as often happens in practice, it is called the dominant (leading) eigenvalue.

The Rayleigh quotient then attains its maximum value when $Y = (0, 0, ..., \underbrace{1}_\text{q-th position}, 0, ..., 0)^T$, i.e. when $v$ is the corresponding eigenvector of $\Sigma$.

Indeed, let $z_i = y_i^2$; then we are maximizing the sum $\sum \limits_{i=1}^{p} \lambda_i z_i$ over non-negative $z_i$, subject to the constraint $\sum \limits_{i=1}^{p} z_i = const$. Obviously, the maximum is reached by putting all of the weight into $z_q$, where $\lambda_q$ is the largest eigenvalue.
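To illustrate this conclusion numerically, here is a small sketch on toy data (the `rayleigh` helper is a hypothetical name of mine): the Rayleigh quotient evaluated at the leading eigenvector of the sample covariance matrix equals the largest eigenvalue, and no randomly chosen direction exceeds it.

```python
import numpy as np

def rayleigh(sigma, v):
    """Rayleigh quotient R(sigma, v) = v^T sigma v / v^T v."""
    return (v @ sigma @ v) / (v @ v)

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 6)) @ rng.normal(size=(6, 6))   # correlated toy data
Xc = X - X.mean(axis=0)
sigma = Xc.T @ Xc / (Xc.shape[0] - 1)                     # sample covariance matrix

eigvals, eigvecs = np.linalg.eigh(sigma)                  # eigenvalues in ascending order
lambda_max, v_max = eigvals[-1], eigvecs[:, -1]

# At the leading eigenvector the quotient equals the largest eigenvalue...
assert np.isclose(rayleigh(sigma, v_max), lambda_max)
# ...and no randomly chosen direction beats it.
assert all(rayleigh(sigma, v) <= lambda_max + 1e-9 for v in rng.normal(size=(1000, 6)))
```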

Stochastic processes and signal processing perspective

From the standpoint of the Karhunen-Loève theorem, PCA is similar to a Fourier series in digital signal processing: essentially, a way to perform lossy compression.

TODO

Classical statistics perspective

From the perspective of the multivariate normal distribution.

TODO

SVD perspective

TODO

Bayesian optimization perspective

TODO: Vetrov’s lecture on VAE
