September 21, 2023
Conjugate gradients is one of the most popular and computationally efficient methods used to solve systems of linear equations. Along with other Krylov subspace methods, it can also be used for finding eigenvalues and eigenvectors of a matrix.
System of linear equations as a quadratic optimization problem
In case $A$ is a Gram matrix (i.e. symmetric and positive definite), so that its eigenvalues are real and positive, it is useful to think about the solution of the system of linear equations $Ax = b$ in terms of an equivalent quadratic problem: finding the minimum of a quadratic function, given by a quadratic form: $f(x) = \frac{1}{2} x^T A x - b^T x$. Its gradient is $\nabla f(x) = Ax - b$, so the minimum is attained exactly at the solution of $Ax = b$.
The quadratic form represents an ellipsoid: the eigenvectors of the matrix $A$ are its axes, the eigenvalues are the lengths of its semi-axes, the difference between the optimal solution and the current point is the error $e_i = x^* - x_i$; the residual vector $r_i = b - A x_i$ will be discussed later.
How to solve a system of linear equations computationally?
We shall briefly consider two naive approaches, coordinate descent and steepest descent, before discussing conjugate gradients and their advantages.
First, let us look at the simplest approach: coordinate descent. That is, we iteratively choose coordinates $x_i$ of the vector $x$ and minimize the value of the function $f$ in that coordinate by solving $\frac{\partial f(x)}{\partial x_i} = 0$.
Coordinate descent clearly does not guarantee convergence in $N$ steps.
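To make this concrete, here is a minimal numerical sketch (the toy matrix and function names are mine, purely illustrative): each coordinate update solves $\frac{\partial f}{\partial x_i} = 0$ with the other coordinates fixed, which amounts to a Gauss-Seidel-like sweep. It typically needs many sweeps rather than $N$ exact steps.

```python
import numpy as np

def coordinate_descent(A, b, x0, n_sweeps=50):
    """Minimize f(x) = 0.5 x^T A x - b^T x one coordinate at a time.

    Setting df/dx_i = 0 with all other coordinates fixed gives
    x_i = (b_i - sum_{j != i} A_ij x_j) / A_ii.
    """
    x = x0.astype(float).copy()
    for _ in range(n_sweeps):
        for i in range(len(b)):
            x[i] = (b[i] - A[i] @ x + A[i, i] * x[i]) / A[i, i]
    return x

# toy SPD system: many sweeps get close, but there is no N-step guarantee
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
print(coordinate_descent(A, b, np.zeros(2)), np.linalg.solve(A, b))
```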
A smarter approach is steepest descent.
At each step of steepest descent we calculate the anti-gradient (negative gradient) of the function $f$ and use it as the direction of descent.
Here and below I will denote the direction vector for descent as $d$. If we are speaking of multiple steps, I will denote it as $d_i$. In the case of steepest descent at each step $d_i = -\nabla f(x_i) = b - A x_i = r_i$, where $x_i$ is the approximation of the solution $x^*$, obtained at the $i$-th step, starting from an initial approximation $x_0$.
If we are currently at the point $x_i$, we shall move to some point $x_{i+1} = x_i + \alpha d_i$, where $\alpha$ is the step length. We know the direction, but how do we find the optimal step of descent?
The step length $\alpha$ is optimal when the function is minimal along the ray, at the point $x_{i+1}$. It means that the directional derivative vanishes, $\frac{d}{d\alpha} f(x_i + \alpha d_i) = 0$, and the isolevels of the function become tangent to the direction of descent.
Speaking in terms of multivariate calculus, $\frac{d}{d\alpha} f(x_{i+1}) = \nabla f(x_{i+1})^T d_i = -r_{i+1}^T d_i = 0$.
Simplifying this expression (using $r_{i+1} = r_i - \alpha A d_i$), we obtain the step length $\alpha = \frac{d_i^T r_i}{d_i^T A d_i}$, which for steepest descent ($d_i = r_i$) becomes $\alpha = \frac{r_i^T r_i}{r_i^T A r_i}$.
Now our steepest descent algorithm is fully defined. By the way, this formula for step length is going to be useful below.
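Here is a minimal sketch of steepest descent with this optimal step length (the function name and stopping rule are mine, not from the original text):

```python
import numpy as np

def steepest_descent(A, b, x0, tol=1e-10, max_iter=10_000):
    """Steepest descent for f(x) = 0.5 x^T A x - b^T x with SPD A."""
    x = x0.astype(float).copy()
    for _ in range(max_iter):
        r = b - A @ x                    # residual = negative gradient = descent direction
        if np.linalg.norm(r) < tol:
            break
        alpha = (r @ r) / (r @ (A @ r))  # optimal step length derived above
        x = x + alpha * r
    return x
```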
The convergence rate of steepest descent varies. For instance, it might converge in one iteration if $A$ is diagonal and all of its eigenvalues are equal (the isolevels are then perfect spheres and the negative gradient points straight at the minimum).
However, in general we cannot guarantee fast convergence for steepest descent. E.g. if the matrix $A$ is ill-conditioned, i.e. the difference in magnitude between its largest and smallest eigenvalues is big, steepest descent will start jumping from one side of the ellipsoid to the other.
Hence, the convergence rate is related to the condition number of the matrix. One can actually derive a formula for the convergence rate: $\|e_i\|_A \leq \left(\frac{\kappa - 1}{\kappa + 1}\right)^i \|e_0\|_A$, where $\kappa = \frac{\lambda_{max}}{\lambda_{min}}$ is the condition number and $\|e\|_A = \sqrt{e^T A e}$ is the energy norm of the error (more on it below).
The exact derivation of this fact is somewhat tedious (although not hard to understand) and can be found in J.R. Shewchuk's tutorial, section 6.2.
Conjugate directions and conjugate gradients
As we've seen, neither coordinate descent nor steepest descent is guaranteed to converge in $N$ steps.
Here we discuss a better approach, conjugate directions, which converges in $N$ steps or fewer. It has a special case, conjugate gradients, which is even more computationally efficient.
Let us start with an example of orthogonal axes and then transition to a general case.
Consider the case when all the eigenvectors of our surface are parallel to the coordinate axes. In such a case the matrix $A$ is diagonal, and performing line search along the coordinate axes guarantees convergence in $N$ iterations.
However, in the general case we cannot rely on this: if the eigenvectors of the matrix $A$ are not parallel to the coordinate axes, line search along mutually orthogonal directions will not converge in $N$ steps.
Instead of orthogonality, we introduce the notion of A-orthogonality: two directions $d_i$ and $d_j$ are called A-orthogonal (or conjugate) if $d_i^T A d_j = 0$.
We will apply two transformations of coordinates.
First, apply the eigendecomposition of $A$: $A = E \Lambda E^T$, where $E$ is the matrix of eigenvectors (each eigenvector is a column vector) and is orthogonal, because the eigenvectors of a symmetric matrix are orthogonal; $\Lambda$ is the diagonal matrix of eigenvalues, where each eigenvalue is known to be real, positive and equal to the corresponding singular value, $\sigma_i = \lambda_i$.
Use the fact that $E$ is orthogonal for a symmetric matrix $A$, hence $E^{-1} = E^T$. Hence, $d_i^T A d_j = d_i^T E \Lambda E^T d_j = (E^T d_i)^T \Lambda (E^T d_j)$.
What did we do? We changed the coordinates, so that the coordinate axes are now the eigenvectors of $A$.
In these coordinates the vector $d_i$ has coordinates $\hat{d}_i = E^T d_i$ and the vector $d_j$ has coordinates $\hat{d}_j = E^T d_j$.
Second, define $\tilde{d}_i = \Lambda^{1/2} \hat{d}_i$ and $\tilde{d}_j = \Lambda^{1/2} \hat{d}_j$.
What did we do here? We stretched our coordinate axes by the square roots of their respective eigenvalues $\sqrt{\lambda_k}$, transforming our vectors $\hat{d}_i$ into $\tilde{d}_i$ and $\hat{d}_j$ into $\tilde{d}_j$.
The eigenvectors form the semi-axes of the ellipsoids; upon this change of coordinates the semi-axes take identical lengths, so that our ellipsoids become spheres.
So, here is how A-orthogonality works: first we go from the original coordinates to coordinates whose axes are the eigenvectors of $A$. We apply this transform to both the $d_i$ and $d_j$ vectors, obtaining their representations $\hat{d}_i$ and $\hat{d}_j$ in the eigenbasis.
Second, we come to a coordinate system where the isolevels are spheres. In these coordinates the resulting vectors $\tilde{d}_i$ and $\tilde{d}_j$ must be orthogonal: $d_i^T A d_j = \tilde{d}_i^T \tilde{d}_j = 0$.
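A small numerical check of this argument (a sketch; the random SPD matrix and the pair of conjugate vectors are arbitrary): rotate into the eigenbasis with $E^T$, stretch by $\Lambda^{1/2}$, and A-orthogonal vectors become plainly orthogonal.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(4, 4))
A = M @ M.T + 4 * np.eye(4)                 # random SPD matrix

lam, E = np.linalg.eigh(A)                  # A = E diag(lam) E^T

# build a pair of A-orthogonal (conjugate) vectors
d_i = rng.normal(size=4)
d_j = rng.normal(size=4)
d_j = d_j - (d_i @ A @ d_j) / (d_i @ A @ d_i) * d_i

# transform 1: rotate into the eigenbasis; transform 2: stretch by sqrt(eigenvalues)
t_i = np.sqrt(lam) * (E.T @ d_i)
t_j = np.sqrt(lam) * (E.T @ d_j)

print(d_i @ A @ d_j)   # ~0: A-orthogonal in the original coordinates
print(t_i @ t_j)       # ~0: plainly orthogonal after the two transforms
```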
Let us prove the guarantee of convergence of the conjugate directions algorithm in $N$ steps.
Consider a descent algorithm which uses A-orthogonal vectors $d_0, d_1, \ldots, d_{N-1}$ as descent directions and step sizes $\alpha_i = \frac{d_i^T r_i}{d_i^T A d_i}$, where $r_i = b - A x_i$ is the current residual.
We want to prove that if $d_0, \ldots, d_{N-1}$ are conjugate directions, then $x_N = x^*$.
Indeed, we know that the vectors $d_0, d_1, \ldots, d_{N-1}$ span the whole space. Hence, there exist coefficients $\delta_i$, such that we can come to the desired solution point by walking the directions: $x^* - x_0 = \sum_{i=0}^{N-1} \delta_i d_i$.
Pre-multiply this expression by $d_k^T A$: all terms with $i \neq k$ vanish by A-orthogonality, so $d_k^T A (x^* - x_0) = \delta_k\, d_k^T A d_k$, and hence $\delta_k = \frac{d_k^T A (x^* - x_0)}{d_k^T A d_k}$.
Recall that $A x^* = b$ (because $x^*$ is the exact solution of our system of linear equations), so the numerator equals $d_k^T (b - A x_0)$.
Now, consider $x_k - x_0 = \sum_{i=0}^{k-1} \alpha_i d_i$ and pre-multiply it by $d_k^T A$:
$d_k^T A (x_k - x_0) = \sum_{i=0}^{k-1} \alpha_i\, d_k^T A d_i = 0$, hence, $d_k^T A x_k = d_k^T A x_0$.
Then substitute this into the expression for $\delta_k$: $\delta_k = \frac{d_k^T (b - A x_0)}{d_k^T A d_k} = \frac{d_k^T (b - A x_k)}{d_k^T A d_k} = \frac{d_k^T r_k}{d_k^T A d_k}$.
This produces exactly the step size $\alpha_k$ used by the algorithm. Hence, after $N$ steps the algorithm has walked the full decomposition of $x^* - x_0$ and arrives at $x_N = x^*$.
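The $N$-step guarantee is easy to verify numerically. Below is a sketch under the assumption that the conjugate directions are obtained by A-orthogonalizing the standard basis with Gram-Schmidt (any other conjugate set would work the same way):

```python
import numpy as np

def conjugate_directions_solve(A, b, x0):
    """Walk N A-orthogonal directions with steps alpha_k = d_k^T r_k / d_k^T A d_k."""
    n = len(b)
    # A-orthogonalize the standard basis (Gram-Schmidt in the A-inner product)
    dirs = []
    for k in range(n):
        d = np.eye(n)[k]
        for p in dirs:
            d = d - (p @ A @ d) / (p @ A @ p) * p
        dirs.append(d)
    x = x0.astype(float).copy()
    for d in dirs:
        r = b - A @ x
        x = x + (d @ r) / (d @ (A @ d)) * d
    return x

rng = np.random.default_rng(1)
M = rng.normal(size=(5, 5))
A = M @ M.T + 5 * np.eye(5)
b = rng.normal(size=5)
x = conjugate_directions_solve(A, b, np.zeros(5))
print(np.allclose(x, np.linalg.solve(A, b)))  # True: exact solution after N = 5 steps
```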
Conjugate directions as an energy norm minimization
An insightful way of thinking about conjugate gradients is through norm minimization/maximization.
One can show that the conjugate directions method is similar to PCA in terms of minimization/maximization of a certain norm.
PCA at each iteration finds a projection that minimizes the Frobenius norm of the error of approximating our matrix by a low-rank matrix.
TODO: check PCA analogy
TODO: duality between energy norm and Frobenius norm (maximum of energy norm corresponds to minimum of Frobenius norm)
Similarly, we can introduce the notion of the energy norm of a vector $e$: $\|e\|_A = \sqrt{e^T A e}$.
One may show that at each step the conjugate directions algorithm chooses the point that minimizes the energy norm of the error, $\|e_i\|_A$, over the subspace explored so far.
TODO: check notation and finish this part
Conjugate directions and eigenvectors
Let $A = E \Lambda E^T$ be the eigenvector decomposition of $A$.
Consider the definition of conjugate directions in matrix form: $D^T A D = \Sigma$, where $\Sigma$ is some diagonal matrix and $D$ is the matrix whose columns are conjugate direction vectors. In particular, the eigenvectors themselves are conjugate directions, since $E$ is orthogonal and $E^T A E = \Lambda$ is diagonal.
How do we obtain conjugate directions in practice?
A simple and straightforward approach: let us use the negative gradients (i.e. the residuals) as directions: $d_i = r_i$. The problem is that these "conjugate" directions are not actually conjugate: in general $r_{i+1}^T A r_i \neq 0$, so the directions are not A-orthogonal (although, by the way, we know that $r_{i+1}$ is orthogonal, not A-orthogonal, to the previous search direction due to tangency of the search direction to the isolevels at the point $x_{i+1}$).
So, we correct the gradient direction by subtracting the previous direction with some coefficient $\beta_{i+1}$ in order to A-orthogonalize $d_{i+1}$ and $d_i$: $d_{i+1} = r_{i+1} - \beta_{i+1} d_i$.
We know that $r_{i+1}$ is orthogonal to each of the previous search directions $d_0, d_1, \ldots, d_i$ (this is a general property of conjugate directions methods).
Hence, since the previous directions span the same subspace as the previous residuals, $r_{i+1}$ is already orthogonal to $r_0, r_1, \ldots, r_i$.
But more importantly for conjugate gradients, it is not only orthogonal, but also A-orthogonal to the earlier directions $d_0, \ldots, d_{i-1}$!
To see this, take the residual recurrence $r_{k+1} = r_k - \alpha_k A d_k$ and pre-multiply it with $r_{i+1}^T$: $r_{i+1}^T r_{k+1} = r_{i+1}^T r_k - \alpha_k\, r_{i+1}^T A d_k$.
The following cases arise: for $k < i$ both $r_{i+1}^T r_{k+1}$ and $r_{i+1}^T r_k$ are zero, so $r_{i+1}^T A d_k = 0$; for $k = i$ we get $r_{i+1}^T A d_i = -\frac{1}{\alpha_i} r_{i+1}^T r_{i+1} \neq 0$.
Hence, $r_{i+1}$ is already A-orthogonal to $d_k$ for $k < i$. To A-orthogonalize the new direction derived from $r_{i+1}$, all we need to do is to A-orthogonalize it with respect to the single previous direction $d_i$. This greatly simplifies the calculation process and saves us a lot of memory.
The calculation of $\beta_{i+1}$ is evident from the A-orthogonality of $d_{i+1}$ and $d_i$: pre-multiply the expression $d_{i+1} = r_{i+1} - \beta_{i+1} d_i$ with $d_i^T A$:
$0 = d_i^T A d_{i+1} = d_i^T A r_{i+1} - \beta_{i+1}\, d_i^T A d_i$, hence $\beta_{i+1} = \frac{d_i^T A r_{i+1}}{d_i^T A d_i}$ (all other terms of the full Gram-Schmidt sum vanish, as shown above).
This is just a special case of Gram-Schmidt orthogonalization, carried out in the A-inner product.
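Putting the recurrences together, here is a minimal conjugate gradients sketch in the notation used above; note that production implementations usually use the algebraically equivalent update $\beta_{i+1} = \frac{r_{i+1}^T r_{i+1}}{r_i^T r_i}$ (with a plus sign in the direction update), but the form below follows the derivation in this text.

```python
import numpy as np

def conjugate_gradients(A, b, x0=None, tol=1e-10, max_iter=None):
    """Conjugate gradients for SPD A, following the recurrences derived above:
        alpha_i    = d_i^T r_i / d_i^T A d_i
        beta_{i+1} = d_i^T A r_{i+1} / d_i^T A d_i
        d_{i+1}    = r_{i+1} - beta_{i+1} d_i
    """
    n = len(b)
    x = np.zeros(n) if x0 is None else x0.astype(float).copy()
    r = b - A @ x
    d = r.copy()                        # the first direction is just the residual
    for _ in range(n if max_iter is None else max_iter):
        if np.linalg.norm(r) < tol:
            break
        Ad = A @ d
        alpha = (d @ r) / (d @ Ad)
        x = x + alpha * d
        r = b - A @ x
        beta = (Ad @ r) / (d @ Ad)      # A-orthogonalize against the last direction only
        d = r - beta * d
    return x
```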
One more important aspect: assume $d_0 = r_0$ and consider $r_1 = r_0 - \alpha_0 A d_0 = r_0 - \alpha_0 A r_0$. Hence, $r_1$ lies in the span of $r_0$ and $A r_0$.
Similarly, if we consider $r_2$, we get a polynomial in $A$ applied to $r_0$, i.e. a combination of $r_0$, $A r_0$ and $A^2 r_0$. And so on.
Hence, in the case of conjugate gradients we are working with the space of vectors $\mathcal{K}_i = \mathrm{span}\{r_0, A r_0, A^2 r_0, \ldots, A^{i-1} r_0\}$.
This space is called a Krylov subspace and is useful in related problems, such as numerical eigenvector/eigenvalue estimation. Such problems are often solved with a related family of algorithms, such as Arnoldi iteration and the Lanczos algorithm.
TODO: fix bugs
Convergence rate: inexact arithmetic
If we allow for a certain error, conjugate gradients might not even require $N$ iterations. Consider the case when the eigenvalues of our matrix $A$ are clustered into a few tight groups.
Recall the convergence rate formula; for conjugate gradients it reads $\|e_i\|_A \leq 2\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^i \|e_0\|_A$.
Suppose that we've already made $i$ iterations. Then the factor $\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^i$ is very small, and so is the corresponding bound on the remaining error.
Hence, $\|e_i\|_A \leq \varepsilon \|e_0\|_A$, where $\varepsilon$ is small, and we've basically converged to the solution.
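A quick numerical illustration of this effect (a sketch; the cluster locations and widths are arbitrary): for a matrix whose spectrum consists of a few tight clusters, CG reaches high accuracy after roughly as many iterations as there are clusters, far fewer than $N$.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
# spectrum clustered around three values instead of spread over the whole range
eigs = np.concatenate([1.0   + 1e-6 * rng.normal(size=n // 4),
                       10.0  + 1e-6 * rng.normal(size=n // 2),
                       100.0 + 1e-6 * rng.normal(size=n // 4)])
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))
A = Q @ np.diag(eigs) @ Q.T
b = rng.normal(size=n)

x_true = np.linalg.solve(A, b)
x, r = np.zeros(n), b.copy()
d = r.copy()
for i in range(n):
    Ad = A @ d
    x = x + (d @ r) / (d @ Ad) * d
    r = b - A @ x
    if np.linalg.norm(x - x_true) < 1e-8 * np.linalg.norm(x_true):
        print(f"converged after {i + 1} of {n} iterations")  # a handful, not 200
        break
    d = r - (Ad @ r) / (d @ Ad) * d
```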
Preconditioning and clusters of eigenvalues
If the matrix is ill-conditioned, iterative multiplication of a vector by the matrix accumulates floating point rounding errors, resulting in a failure to converge to the correct solution. Hence, we need to "normalize" the matrix, similarly to how we apply batch norm or layer norm to normalize the layers of neural networks.
This process is called pre-conditioning. We are multiplying both sides of our equation by a pre-conditioner matrix $M^{-1}$: $M^{-1} A x = M^{-1} b$.
An ideal pre-conditioner would convert the isolevels of our quadratic form into a perfect sphere. Then the pre-conditioning axes should be the eigenvectors and the coefficients the eigenvalues, so the perfect pre-conditioner is the matrix inverse $A^{-1}$. However, this is computationally impractical, as computing it is more or less equivalent to solving the initial system and would take $O(N^3)$ operations.
The simplest pre-conditioner matrix is the diagonal matrix formed by the values on the diagonal of the matrix $A$. This is called the Jacobi pre-conditioner. It is equivalent to rescaling the matrix along the coordinate axes by the values on the diagonal of the matrix $A$. Obviously, this is an imperfect solution.
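A sketch of the Jacobi pre-conditioner (the symmetric split form $D^{-1/2} A D^{-1/2}$ is my choice here, so that the rescaled matrix stays symmetric positive definite and plain CG still applies; the text itself only describes multiplying by the diagonal):

```python
import numpy as np

def jacobi_preconditioned_system(A, b):
    """Jacobi pre-conditioning: rescale A by its own diagonal D.

    Solving (D^{-1/2} A D^{-1/2}) y = D^{-1/2} b and mapping back x = D^{-1/2} y
    is equivalent to the original system, but the rescaled matrix is usually
    better conditioned.
    """
    s = 1.0 / np.sqrt(np.diag(A))             # D^{-1/2} as a vector
    A_hat = A * np.outer(s, s)                # D^{-1/2} A D^{-1/2}
    b_hat = s * b
    return A_hat, b_hat, s

# usage: A_hat, b_hat, s = jacobi_preconditioned_system(A, b)
#        x = s * conjugate_gradients(A_hat, b_hat)
```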
A more practical pre-conditioner would be an incomplete Cholesky decomposition. I will first discuss a regular Cholesky decomposition in the context of conjugate directions, and then its sparse variation.
Cholesky decomposition as a special case of conjugate directions
Cholesky decomposition is a representation of our matrix $A$ as a product of a lower-triangular matrix $L$ and its transpose, an upper-triangular matrix $L^T$: $A = L L^T$.
One can see that the columns of the inverse of the Cholesky factor are actually conjugate directions of the matrix $A$, such that the first direction is parallel to the first coordinate axis, the second direction lies in the plane of the first two axes, etc. Consider the upper-triangular matrix $(L^T)^{-1}$: its $k$-th column has non-zero entries only in the first $k$ positions, which gives exactly this staircase structure.
Now we can denote $D = (L^T)^{-1}$ and $\Sigma = I$, and we get $D^T A D = L^{-1} (L L^T) (L^T)^{-1} = I = \Sigma$, i.e. the columns of $D$ are mutually A-orthogonal conjugate directions.
To see that the inverse of an upper-triangular matrix is itself upper-triangular, just apply the textbook method of matrix inversion: write down the matrix and an identity matrix next to each other, Gauss-eliminate the left one, and you will obtain its inverse on the right.
Cholesky decomposition algorithm
The algorithm for Cholesky decomposition is actually pretty straightforward.
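A minimal sketch of that algorithm (column-by-column, dense and unoptimized):

```python
import numpy as np

def cholesky(A):
    """Return the lower-triangular L such that A = L @ L.T (A must be SPD)."""
    n = A.shape[0]
    L = np.zeros_like(A, dtype=float)
    for j in range(n):
        # diagonal entry: what remains of A_jj after subtracting the already-factored part
        L[j, j] = np.sqrt(A[j, j] - L[j, :j] @ L[j, :j])
        # entries below the diagonal in column j
        for i in range(j + 1, n):
            L[i, j] = (A[i, j] - L[i, :j] @ L[j, :j]) / L[j, j]
    return L
```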
Incomplete Cholesky decomposition
Full Cholesky decomposition is expensive and takes on the order of $N^3$ operations.
In case of sparse matrices (or if we introduce sparsification ourselves), we can achieve a similar result with much less compute.
The simplest form of incomplete Cholesky decomposition repeats the sparsity pattern of the original matrix: its factor $L$ has zeros above the main diagonal because $L$ is lower-triangular, and the zeros below the main diagonal reproduce the sparsity pattern of the matrix $A$.
This approach is known as IC(0). If you want more precision and less sparsity, you can take the sparsity pattern of $A^2$, which has fewer zero elements, and calculate the Cholesky decomposition for it. This would be called IC(1). And so on: for higher powers it would be called IC(k). Greater $k$ results in better pre-conditioning at the cost of more compute.
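A sketch of the zero fill-in variant, IC(0): the same recurrences as in the full algorithm above, but positions where the original matrix has zeros are simply skipped, so the factor inherits the sparsity pattern of $A$ (dense arrays are used here only for clarity; a real implementation would use a sparse format, and robust ones also guard against breakdown of the square root):

```python
import numpy as np

def incomplete_cholesky_ic0(A):
    """IC(0): Cholesky recurrences restricted to the sparsity pattern of A."""
    n = A.shape[0]
    L = np.zeros_like(A, dtype=float)
    for j in range(n):
        L[j, j] = np.sqrt(A[j, j] - L[j, :j] @ L[j, :j])
        for i in range(j + 1, n):
            if A[i, j] != 0:              # keep only positions where A itself is non-zero
                L[i, j] = (A[i, j] - L[i, :j] @ L[j, :j]) / L[j, j]
    return L

# M = L @ L.T then serves as a cheap approximation of A for pre-conditioning
```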
If the matrix is not sparse at all, we can employ a threshold-based approach, IC($\tau$): choose some threshold for the absolute value of an element of the factor, called $\tau$, and if the element is smaller than that, consider it 0. Thus, we obtain a sparse factor whose product is more or less close to the original matrix $A$ in Frobenius norm. Hence, we get a reasonable pre-conditioner at an affordable compute and memory cost.
References
- http://www.cs.cmu.edu/~quake-papers/painless-conjugate-gradient.pdf - a great monograph on conjugate gradients by Jonathan Richard Shewchuk
- https://www.csie.ntu.edu.tw/~r97002/temp/num_optimization.pdf - Numerical Optimization by Nocedal and Wright
- https://dl.acm.org/doi/pdf/10.1145/87252.88084 - a nice autobiography and explanation by Magnus Hestenes
- https://math.mit.edu/classes/18.086/2006/am64.pdf - a text on Krylov subspace methods: CG, Arnoldi iteration, Lanczos
- https://math.stackexchange.com/questions/3439589/are-conjugate-vectors-unique - on the relation between conjugate vectors and eigenvectors
- http://www.people.vcu.edu/~ysong3/lecture11.pdf - a good introduction, with the construction of each search direction well explained
- https://en.wikipedia.org/wiki/Conjugate_gradient_method - Wikipedia
- https://en.wikipedia.org/wiki/Cholesky_decomposition#Geometric_interpretation - on the interpretation of Cholesky as conjugate directions
- https://algowiki-project.org/en/Cholesky_method#Usage_of_Cholesky_decomposition_in_iterative_methods - on variants of incomplete Cholesky decomposition