Principal components analysis
July 12, 2021 6 min read
Principal components analysis is a ubiquitous method of dimensionality reduction, used in fields ranging from finance to genomics. In this post I'm going to consider PCA from several different standpoints, each offering its own perspective on the method.
Principal components analysis was discovered multiple times throughout history. The most relevant milestones are Pearson's use of it in 1901 and, later, its development by Hotelling in the 1930s, who coined the name and popularized the method as a generic tool.
Optimization problem perspective
Identify axes that preserve the most information about the original data
Suppose that we’ve obtained $n$ $d$-dimensional data points. For instance, let $n$ be the number of sequenced Russian citizens, and $d = 300{,}000$ the number of single-nucleotide polymorphisms (SNPs) that characterize each of them. We want to cluster the Russian population into sub-populations and visually inspect the results by projecting each person from the initial 300,000-dimensional space onto a lower-dimensional space (2D or 3D).
For our 2D/3D representation, we want to select axes that preserve the most information about a person. To say it more formally: if we had one point, we would want the distance from the real point to its projection onto our 2D/3D space to be minimal. Since we have $n$ points, we want the sum of squared distances from each of those points to its projection to be minimal.
E.g. if we chose to project the data onto a 2D space, we are looking for axes $v_1$ and $v_2$ such that the sum of squared distances from the points to their projections is minimal:

$\min_{v_1, v_2} \sum_{i=1}^{n} \| x_i - \hat{x}_i \|^2$, where $\hat{x}_i$ is the projection of $x_i$ onto the plane spanned by $v_1$ and $v_2$.
Minimal sum of squared distances to the axes ⟺ maximum explained variance
Thus, by choosing axes that minimize the distances from the points $x_i$ to their projections $\hat{x}_i$, we simultaneously maximize the lengths of the projections themselves: by the Pythagorean theorem, $\| x_i \|^2 = \| x_i - \hat{x}_i \|^2 + \| \hat{x}_i \|^2$, and the left-hand side is fixed by the data.
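Here is a quick NumPy sanity check of that Pythagorean identity (the axes below are arbitrary orthonormal vectors, not the principal ones; the point is only that residual plus projection energy is constant for any orthonormal basis):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X -= X.mean(axis=0)  # center the data

# Any two orthonormal axes (here: from a QR decomposition of random vectors).
W, _ = np.linalg.qr(rng.normal(size=(5, 2)))  # columns of W are orthonormal

proj = X @ W          # 2D coordinates of each point in the subspace
X_hat = proj @ W.T    # projections, expressed back in the original 5D space
residual = X - X_hat

# Pythagoras: residual energy + projection energy = total energy of the data.
total = (X ** 2).sum()
assert np.isclose((residual ** 2).sum() + (X_hat ** 2).sum(), total)
```

Since the total is fixed, shrinking the residual term is the same thing as growing the projection term.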
Iterative optimization process
Observation: it turns out that you can use a greedy algorithm to perform the required optimization.
Indeed, if the dimensionality of the space that we are going to project the data onto is $m$, we can always choose an orthogonal basis of the $m$-dimensional subspace that we are looking for.
Thus, we can first find one axis $v_1$, such that:

$v_1 = \arg\max_{\|v\| = 1} \sum_{i=1}^{n} (x_i^T v)^2$
Then we should find $v_2$, which is supposed to be orthogonal to $v_1$, so that:

$v_2 = \arg\max \sum_{i=1}^{n} (x_i^T v)^2$, where $\|v_2\| = 1$ and $v_2$ is orthogonal to $v_1$.
The next basis vector $v_3$ should be orthogonal to both $v_1$ and $v_2$, and so on. Luckily, as we’ll see in the next chapter, PCA automatically makes each next axis orthogonal to the previous ones.
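The greedy scheme above can be sketched in NumPy: find the best single axis by power iteration, subtract (deflate) that direction from the covariance matrix, and repeat. The function names (`first_component`, `greedy_pca`) are mine, not from the post:

```python
import numpy as np

def first_component(S, n_iter=5000):
    """Power iteration: the dominant eigenvector of a symmetric matrix S."""
    v = np.random.default_rng(0).normal(size=S.shape[0])
    for _ in range(n_iter):
        v = S @ v
        v /= np.linalg.norm(v)
    return v

def greedy_pca(X, m):
    """Find m principal axes one at a time, deflating S after each."""
    X = X - X.mean(axis=0)
    S = X.T @ X / len(X)              # sample covariance matrix
    axes = []
    for _ in range(m):
        v = first_component(S)
        axes.append(v)
        lam = v @ S @ v               # variance explained along v
        S = S - lam * np.outer(v, v)  # deflate: remove the found direction
    return np.array(axes)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))
V = greedy_pca(X, 2)
print(np.abs(V[0] @ V[1]))  # close to 0: the axes come out orthogonal
```

Each deflation step removes the variance already explained, so the next power iteration converges to an axis orthogonal to the previous ones.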
Reformulation as an eigenvectors/eigenvalues problem
Alright, how do we find a single vector such that the projection onto it is maximal? Pretty simple: if the axis is a unit vector $v$, the length of the projection of data point $x_i$ onto it is $x_i^T v$.
We are looking to maximize the sum of squares of the lengths of the projections over all the data points:

$\max_{\|v\| = 1} \sum_{i=1}^{n} (x_i^T v)^2 = \max_{\|v\| = 1} v^T X^T X v = n \cdot \max_{\|v\| = 1} v^T S v$, where $S = \frac{1}{n} X^T X$ is the sample covariance matrix (assuming the data are centered).
By the way, as the covariance matrix $S$ is symmetric, the quantity $\frac{v^T S v}{v^T v}$ is called the Rayleigh quotient of $S$.
As $S$ is symmetric, it admits an eigendecomposition $S = E \Lambda E^T$ with orthonormal eigenvectors $e_i$; thus, $v^T S v = v^T E \Lambda E^T v = w^T \Lambda w = \sum_{i} \lambda_i w_i^2$, where $w = E^T v$.

Elements of the vector $w$, $w_i = e_i^T v$, are the coefficients of $v$ in the basis of eigenvectors; the eigenvectors are orthogonal, because the covariance matrix is symmetric, thus $\|w\| = \|v\| = 1$.
If one of the eigenvalues, say, $\lambda_1$, is larger than the others, as often happens, it is called the dominant eigenvalue.
The Rayleigh quotient then takes its maximal value if $v = e_1$, the eigenvector corresponding to $\lambda_1$.
Indeed, let $w = E^T v$; then we maximize the sum $\sum_i \lambda_i w_i^2$ for non-negative weights $w_i^2$, given the constraint $\sum_i w_i^2 = 1$. Obviously, we reach the maximum if we put all the weight on the largest eigenvalue $\lambda_1$, i.e. set $w_1 = 1$ and the rest to zero.
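A small NumPy check of this claim: the eigenvector of the largest eigenvalue of the sample covariance matrix attains the maximal Rayleigh quotient, and no other vector exceeds it.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 6))
X -= X.mean(axis=0)
S = X.T @ X / len(X)                  # sample covariance matrix

# eigh returns eigenvalues in ascending order with orthonormal eigenvectors.
eigvals, eigvecs = np.linalg.eigh(S)
v1 = eigvecs[:, -1]                   # eigenvector of the largest eigenvalue

def rayleigh(v):
    return (v @ S @ v) / (v @ v)

# The top eigenvector attains the largest eigenvalue as its Rayleigh quotient;
# any other vector gives a smaller (or equal) value.
assert np.isclose(rayleigh(v1), eigvals[-1])
for _ in range(100):
    v = rng.normal(size=6)
    assert rayleigh(v) <= eigvals[-1] + 1e-10
```

Note that the Rayleigh quotient is insensitive to the scale of $v$, which is why the unit-norm constraint costs us nothing.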
Stochastic processes and signal processing perspective
From the standpoint of the Karhunen–Loève theorem, PCA is similar to a Fourier series in digital signal processing: essentially, a way to perform lossy compression.
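A minimal sketch of the lossy-compression view: store only the top-$k$ principal coordinates per point, reconstruct, and observe that the mean squared reconstruction error equals exactly the sum of the discarded eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 8)) @ rng.normal(size=(8, 8))
X -= X.mean(axis=0)

S = X.T @ X / len(X)
eigvals, eigvecs = np.linalg.eigh(S)  # ascending eigenvalues
W = eigvecs[:, -3:]                   # keep the top 3 components ("compress")

X_hat = X @ W @ W.T                   # "decompress" from 3 numbers per point
mse = ((X - X_hat) ** 2).sum() / len(X)

# The lossy-compression view of PCA: reconstruction MSE per point equals
# the sum of the discarded (smallest) eigenvalues of the covariance matrix.
print(mse, eigvals[:-3].sum())
```

Just like truncating a Fourier series, truncating the principal components keeps the directions that carry the most energy and discards the rest.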
Classical statistics perspective
From the perspective of the multivariate normal distribution: the principal components are the axes of its equal-density ellipsoids, and the lengths of those axes are proportional to the square roots of the corresponding eigenvalues of the covariance matrix.
Bayesian optimization perspective
TODO: Vetrov’s lecture on VAE
Written by Boris Burkov, who lives in Moscow, Russia and Cambridge, UK, loves to take part in the development of cutting-edge technologies, reflects on how the world works and admires the giants of the past. You can follow me on Telegram