Singular Value Decomposition
August 26, 2021 · 7 min read
Singular value decomposition is a way of understanding a rectangular (i.e. not necessarily square) matrix from the operator norm standpoint. It is a perspective complementary to eigenvalue decomposition that finds numerous applications in statistics, machine learning, bioinformatics, quantum computing, etc. This post explains its nature and its connections to the operator norm, least squares fitting, PCA, condition numbers, regularization problems, etc.
Intuition for SVD from the operator norm perspective
Suppose that you have a rectangular 2-by-3 matrix $A$, where the columns are people and the two elements of each column say how much that person loves horror movies and drama movies (I'm obviously drawing some inspiration from the Netflix challenge).
Suppose that you are going to estimate the fractions of your income produced by horror movies and by drama movies. So you might be interested in multiplying this matrix by a vector of weights. For instance, if you are Netflix and 0.5 of your income is created by person 1, 0.2 by person 2 and 0.3 by person 3, your weight vector is $x = (0.5, 0.2, 0.3)^T$, and $Ax$ is a 2-vector that reflects what fractions of your total income horror movies and drama movies produce.
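Here is a minimal numpy sketch of this setup (the ratings in the matrix are made-up numbers, just for illustration):

```python
import numpy as np

# Hypothetical ratings matrix: rows are genres (horror, drama),
# columns are the 3 viewers; entries are how much each viewer
# loves each genre (illustrative numbers).
A = np.array([[0.9, 0.1, 0.4],   # horror
              [0.2, 0.8, 0.6]])  # drama

# Income weights: person 1 generates 0.5 of income, person 2 - 0.2, person 3 - 0.3
x = np.array([0.5, 0.2, 0.3])

# Ax is a 2-vector: income fractions attributable to horror vs drama
print(A @ x)  # [0.59 0.44]
```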
In the spirit of matrix/operator norms you might ask: if I applied this matrix to a vector of length 1, what is the largest length of the vector I might receive as output?
The answer is simple: you need to calculate the following dot product (and then take the square root of it to get the length):

$$|Ax|^2 = (Ax)^T (Ax) = x^T A^T A x$$
As you can see, in the middle of this dot product we have a square 3-by-3 matrix $A^T A$, which is symmetric. That means that its eigenvectors are orthogonal/unitary. So we can represent it as an eigendecomposition: $A^T A = V D V^T$, where $V$ is an orthogonal matrix of eigenvectors and $D$ is a diagonal matrix of eigenvalues.
Then the squared length of the vector $Ax$ takes the form $x^T V D V^T x$, and the answer to our question becomes straightforward: the largest length of the vector $Ax$ is achieved when $x$ is the eigenvector corresponding to the largest eigenvalue $\lambda_{max}$. The square roots of the elements of the diagonal matrix $D$ of eigenvalues of the matrix $A^T A$ are known as singular values and denoted $\sigma_i = \sqrt{\lambda_i}$. We've already run into them previously, while exploring condition numbers.
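We can verify this numerically: the largest stretch $|Ax|$ over unit vectors $x$ is exactly the largest singular value. A sketch with a random matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(2, 3))

# Eigendecomposition of the Gram matrix A^T A (eigh handles symmetric matrices)
eigvals, V = np.linalg.eigh(A.T @ A)

# The largest stretch |Ax| over unit x is achieved at the top eigenvector
x_best = V[:, np.argmax(eigvals)]
print(np.linalg.norm(A @ x_best))             # achieved stretch
print(np.sqrt(eigvals.max()))                 # sqrt(lambda_max)
print(np.linalg.svd(A, compute_uv=False)[0])  # sigma_1 - the same number
```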
On the other hand, we can consider the matrix $A A^T$ instead of $A^T A$. It is a 2-by-2 matrix, which, again, is symmetric. Again, it has an orthogonal/unitary matrix of eigenvectors, which we will denote $U$.
Note that the matrices $A^T A$ and $A A^T$ are known as Gram matrices. As we've seen above, each Gram matrix produces a squared vector length by design ($x^T A^T A x = |Ax|^2 \geq 0$), so Gram matrices are positive semi-definite and, thus, their eigenvalues are non-negative.
Here comes the key point of SVD: we can express the eigenvectors $u_i$ of $A A^T$ through the eigenvectors $v_i$ of $A^T A$. I will show how, but first take note that each $u_i$ is a 2-vector, while each $v_i$ is a 3-vector. The matrix $U$ has only 2 eigenvectors, while the matrix $V$ has 3. Thus, in this example every $u_i$ can be expressed through some $v_i$, but not the other way round.
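As a sanity check, the two Gram matrices share their nonzero eigenvalues, which is easy to confirm on a random matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(2, 3))

lam_3x3 = np.linalg.eigvalsh(A.T @ A)  # 3 eigenvalues, one of them ~0
lam_2x2 = np.linalg.eigvalsh(A @ A.T)  # 2 eigenvalues

print(np.sort(lam_3x3)[::-1])  # [lambda_1, lambda_2, ~0]
print(np.sort(lam_2x2)[::-1])  # [lambda_1, lambda_2] - the nonzero ones coincide
```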
Now let us show that $u_i = \frac{1}{\sigma_i} A v_i$, where $\sigma_i$ are the singular values of the matrix $A$ ($\sigma_i = \sqrt{\lambda_i}$). Indeed:

$$(A A^T) A v_i = A (A^T A) v_i = A \lambda_i v_i = \lambda_i (A v_i),$$

so $A v_i$ is an eigenvector of $A A^T$ with the same eigenvalue $\lambda_i$. Its length is $|A v_i| = \sqrt{v_i^T A^T A v_i} = \sqrt{\lambda_i} = \sigma_i$, hence the unit eigenvector is $u_i = \frac{1}{\sigma_i} A v_i$.
If we re-write $A v_i = \sigma_i u_i$ in matrix form, we get:

$$A V = U \Sigma$$
or, equivalently, $A = U \Sigma V^T$, or $\Sigma = U^T A V$.
This proves that the singular value decomposition exists whenever the Gram matrices $A^T A$ and $A A^T$ have eigenvalue decompositions (which they always do, being symmetric).
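Here is a sketch of this construction in numpy: we build $V$ and $\Sigma$ from the eigendecomposition of $A^T A$, obtain $U$ via $u_i = \frac{1}{\sigma_i} A v_i$, and check the reconstruction (a random full-row-rank test matrix is assumed, so that no $\sigma_i$ is zero):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(2, 3))

# Eigendecomposition of A^T A, sorted by decreasing eigenvalue
lam, V = np.linalg.eigh(A.T @ A)
order = np.argsort(lam)[::-1]
lam, V = lam[order], V[:, order]

# Keep the 2 nonzero eigenvalues; sigma_i = sqrt(lambda_i)
sigma = np.sqrt(np.clip(lam[:2], 0, None))

# u_i = A v_i / sigma_i
U = (A @ V[:, :2]) / sigma

# Reconstruction: A = U Sigma V^T (exact, since U is derived from V)
print(np.allclose((U * sigma) @ V[:, :2].T, A))  # True
```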
Matrix norms from singular values perspective
Nuclear norm and Schatten norm
In the condition numbers post we considered two kinds of matrix norms: the operator norm and the Frobenius norm. Singular values give rise to a whole family of norms: the Schatten $p$-norm is the $l_p$ norm of the vector of singular values, $\|A\|_p = (\sum_i \sigma_i^p)^{1/p}$, and its special case at $p = 1$, the nuclear norm $\|A\|_* = \sum_i \sigma_i$, is simply the sum of singular values.
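A quick sketch of computing these norms from singular values (the helper `schatten_norm` is my illustrative name, not a standard API):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(4, 3))
sigma = np.linalg.svd(A, compute_uv=False)

def schatten_norm(sigma, p):
    """Schatten p-norm: the l_p norm of the vector of singular values."""
    return (sigma ** p).sum() ** (1 / p)

print(schatten_norm(sigma, 1))   # nuclear norm: sum of singular values
print(np.linalg.norm(A, 'nuc'))  # numpy's built-in nuclear norm - same value
```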
TODO: connection to L1-norm, compressed sensing, affine rank minimization problem, sparse identification of nonlinear dynamical systems.
A new perspective on Frobenius norm
Singular values also provide another perspective on the Frobenius norm: it is the square root of the sum of squares of the singular values:

$$\|A\|_F = \sqrt{\sum_i \sigma_i^2}$$
To prove this fact, use the cyclic property of the trace: $\|A\|_F^2 = \mathrm{tr}(A^T A) = \mathrm{tr}(V D V^T) = \mathrm{tr}(D V^T V) = \mathrm{tr}(D) = \sum_i \lambda_i = \sum_i \sigma_i^2$.
The matrix spectral (operator) norm is simply its largest singular value: $\|A\|_2 = \sigma_1$.
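Both facts are easy to check numerically:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(4, 3))
sigma = np.linalg.svd(A, compute_uv=False)

print(np.linalg.norm(A, 'fro'), np.sqrt((sigma ** 2).sum()))  # Frobenius norm, two ways
print(np.linalg.norm(A, 2), sigma[0])                         # spectral norm = sigma_1
```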
Application: low-rank matrix approximation
As we've seen, singular values provide a convenient representation of a matrix as a sum of outer products of column and row vectors (each outer product, thus, results in a matrix of rank 1):

$$A = U \Sigma V^T = \sum_i \sigma_i u_i v_i^T$$
Here's the trick: as singular values are sorted in decreasing order, we can use SVD as a means of compressing our data.
If we take only the first $k$ terms of our SVD sum, instead of the full $r = \mathrm{rank}(A)$ terms, it is very likely that we will preserve most of the information contained in our data, while representing it with only a limited number of eigenvectors. This feels very similar to Fourier series. This is also a reason why PCA works (it is basically a special case of SVD).
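A sketch of rank-$k$ truncation on a random matrix (`rank_k_approx` is an illustrative helper, not a library function; the approximation error shrinks as $k$ grows and vanishes at full rank):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.normal(size=(50, 30))
U, sigma, Vt = np.linalg.svd(A, full_matrices=False)

def rank_k_approx(U, sigma, Vt, k):
    """Keep only the first k terms of the sum: sum_i sigma_i u_i v_i^T."""
    return (U[:, :k] * sigma[:k]) @ Vt[:k, :]

for k in (1, 5, 15, 30):
    err = np.linalg.norm(A - rank_k_approx(U, sigma, Vt, k), 'fro')
    print(f"rank {k:2d}: Frobenius error {err:.3f}")
```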
- https://personal.utdallas.edu/~m.vidyasagar/Fall-2015/6v80/CS-Notes.pdf - intro to compressed sensing
- https://arxiv.org/pdf/1611.07777.pdf - Affine matrix rank minimization problem
- https://www.diva-portal.org/smash/get/diva2:900592/FULLTEXT01.pdf - a great thesis on dynamic systems sparse model discovery
- https://www.youtube.com/watch?v=DvbbXX8Bd90 - video on dynamic systems sparse model discovery
- https://www.quora.com/What-is-the-significance-of-the-nuclear-norm - references to the meaning of nuclear norm
Written by Boris Burkov, who lives in Moscow, Russia and Cambridge, UK, loves to take part in the development of cutting-edge technologies, reflects on how the world works and admires the giants of the past. You can follow me on Telegram.