BorisBurkov.net
cover

Principal components analysis

July 12, 2021 7 min read

Principal components analysis is a ubiquitous method of dimensionality reduction, used in various fields from finance to genomics. In this post I'm going to consider PCA from different standpoints, resulting in various perspectives on it.

Principal components analysis was discovered multiple times throughout the history. The most relevant of them are Pearson’s use of it in 1901 and, later, development by Hotelling in the 1930s, who coined the name and popularized the method as a generic tool.

Optimization problem perspective

Identify axes, which preserve the most information on original data

Suppose that we’ve obtained nn pp-dimensional data points. For instance, let n=100000n=100 000 be the number of sequenced Russian citizen, and n=300000n=300 000 polymorphic single-nucleotide polymorphisms that characterize each Russian citizen. We want to clusterize the Russian population into sub-populations and visually inspect the results, by projecting each person from the initial 300,000-dimensional space onto a lower-dimensional space (2D or 3D).

For our 2D/3D representation, we want to select such axes, that preserve the most information about a person. To say it more formally, if we had one point, we want the distance from the real point to its projection onto our 2D/3D space to be minimal. Now that we have n points, we want the sum of squared distances from each of those points to its projection to be minimal.

E.g. if we chose to project the data onto a 2D-space, we are looking for such axes vv and ww, that sum of square distances from points to their projections were minimal:

v,w=argminv,wi=1nd2(Xi,projv,wXi)v, w = \arg \min_{v, w} \sum \limits_{i=1}^{n} d^2(X_i, proj_{v,w} X_i)

Minimal sum of square distance to the axes     \iff maximum explained variance

projv,w2Xi+d2(Xi,projvXi)=Xi2proj^2_{v,w}X_i + d^2(X_i, proj_{v} X_i) = X_i^2

i=1nprojv,w2Xi+i=1nd2(Xi,projvXi)=i=1nXi2=const\sum \limits_{i=1}^{n} proj^2_{v,w}X_i + \sum \limits_{i=1}^{n} d^2(X_i, proj_{v} X_i) = \sum \limits_{i=1}^{n} X_i^2 = const

v,w=argminv,wi=1nd2(Xi,projv,wXi)=argmaxv,wi=1nprojv,wXiv, w = \arg \min_{v, w} \sum \limits_{i=1}^{n} d^2(X_i, proj_{v,w} X_i) = \arg \max_{v, w} \sum \limits_{i=1}^{n} proj_{v,w} X_i

Thus, by choosing v,wv, w such, that they minimize the distances from the points to their projections, we simultaneously maximize the lengths of the projections themselves.

Iterative optimization process

Observation: turns out, you can use a greedy algorithm to perform the required optimization.

Indeed, if the dimensionality of space, that we are going to project the data to, is mm, we can always choose an orthogonal basis in that m-dimensional subspace that we are looking for.

Thus, we can first find one axis vv, such that:

v=argminvi=1nd2(Xi,projvXi)v = \arg \min_{v} \sum \limits_{i=1}^{n} d^2(X_i, proj_{v} X_i)

Then we should find ww, which is supposed to be orthogonal to vv so that:

w=argminwi=1nd2(Xi,projwXi)w = \arg \min_{w} \sum \limits_{i=1}^{n} d^2(X_i, proj_{w} X_i), and ww is orthogonal to vv.

Next basis vector should be orthogonal to both vv and ww, and so on. Luckily, as we’ll see in the next chapter, PCA automatically makes each next axis orthogonal to the previous ones.

Reformulation as an eigenvectors/eigenvalues problem

Alright, how do we find a single vector, such that projection onto it is maximal? Pretty simple, if the axis vector is vv, the length of projection of XiX_i data point onto it is (vTXi)2vTv\frac{(v^T X_i)^2}{v^T v}.

We are looking to maximize the sum of squares of lengths of the projections over all the data points:

v=argmaxvvTXTXvvTv=argmaxvvTΣvvTvv = \arg \max_v \frac{v^T \bm{X}^T \bm{X} v}{v^T v} = \arg \max_v \frac{v^T \Sigma v}{v^T v}, where Σ=1n1i=1n(XiX^)(XiX^)T\Sigma = \frac{1}{n-1} \sum \limits_{i=1}^{n} (X_i - \hat{X})(X_i - \hat{X})^T is a sample covariance matrix.

By the way, the quantity R(Σ,v)=vTΣvvTvR(\Sigma, v) = \frac{v^T \Sigma v}{v^T v} is called Rayleigh quotient, as the covariance matrix is symmetric.

Σ=EΛET\Sigma = E \Lambda E^T, thus, R(Σ,v)=(vTE)Λ(ETv)vTv=YTΛYvTvR(\Sigma, v) = \frac{(v^T E) \Lambda (E^T v)}{v^T v} = \frac{Y^T \Lambda Y}{v^T v}, where Y=ETvY = E^T v.

Elements of vectorYY, yiy_i, are coefficients for i-th eigenvector; eigenvectors are orthogonal, because the covariance matrix is symmetric, thus vTv=vTEETv=YTYv^T v = v^T E E^T v = Y^T Y.

Thus, R(Σ,v)=R(Σ,Y)=YTΛYYTY=i=1pλiyi2i=1pyi2R(\Sigma, v) = R(\Sigma, Y) = \frac{Y^T \Lambda Y}{Y^T Y} = \frac{\sum \limits_{i=1}^{p} \lambda_i y_i^2}{ \sum \limits_{i=1}^{p} y_i^2 }.

If one of the eigenvalues, say, λq\lambda_q, is larger than the others, as it often happens, it is called main eigenvalue.

The Rayleigh quotient then will take its maximal value, if Y=(0,0,...,1q-th position,0,...,0)TY = (0, 0, ..., \underbrace{1}_\text{q-th position}, 0, ..., 0)^T.

Indeed, let zi=yi2z_i = y_i^2, then we maximize the sum i=1pλizi\sum \limits_{i=1}^{p} \lambda_i z_i for non-negative ziz_i, given constraint i=1pzi=const\sum \limits_{i=1}^{p} z_i = const. Obviously, we reach maximum, if we maximize zqz_q, where λq\lambda_q is the largest eigenvalue.

Classical statistics perspective

From multivariate normal distribution perspective.

TODO

SVD perspective

As was mentioned previously, PCA works with covariance matrix Σ=XTX\Sigma = X^T X, where XX is our data vectors.

From the technical perspective, this matrix is as a special case of well-known in mathematics Gram matrix.

Gram matrices are positive semi-definite and symmetric. Moreover, usually they are non-full rank, if XX is nn x pp matrix, where n>pn > p, which means that although Σ\Sigma has dimensionality n×nn \times n, its rank is only pp and it has only pp non-zero eigenvalues.

Hence, a convenient technical way to find the eigenvalues of covariance matrix is Singular Value Decomposition:

X=UDVTX = U D V^T, where UU and VV are orthogonal p×pp \times p and n×nn \times n matrices, and DD is a diagonal n×pn \times p matrix of singular values (squares of singular values are eigenvalues of Σ\Sigma), where only pp elements are non-zero.

For more details on SVD, read this post. For more details on properties of Gram matrix - read this one.

Stochastic processes and signal processing perspective

From the standpoint of Karhunen-Loeve theorem, PCA is similar to a Fourier series in digital signal processing - essentially, a way to perform a lossy compression.

We can interpret PCA as a decomposition of the input data matrix XX into an ordered series of matrices of rank 1, created by outer products of principal components. Let us re-write the SVD transform as a series of outer products:

X=i=1pλiuiviTX = \sum \limits_{i=1}^{p} \lambda_i u_i v_i^T, where λi\lambda_i are eigenvalues, uiu_i are column-vectors of matrix UU and viTv_i^T are row-vectors of matrix VT,sothatV^T, so that u_i v_i^T$ is an outer-product, resulting in a matrix of rank 1.

Each term of this sum contributes more to the resulting data matrix, the larger is the eigenvalue λi\lambda_i in it. As in the case of Fourier transform, if we aim to compress the data, we can truncate the series early, when we believe that enough information is already contained in the first xx terms of the series. This approach is often used with SVD for data compression - e.g. eigenfaces approach, where whole datasets of human photos are compressed by early truncation of SVD series.

Bayesian optimization perspective

TODO: Vetrov’s lecture on VAE

References


Boris Burkov

Written by Boris Burkov who lives in Moscow, Russia and Cambridge, UK, loves to take part in development of cutting-edge technologies, reflects on how the world works and admires the giants of the past. You can follow me in Telegram