
Condition numbers

August 23, 2021 8 min read

The notion of condition numbers arises when you study the numerical stability of solutions of ordinary linear equation systems (OLES). This concept is really important in such practical applications as least-squares fitting in regression problems or computation of a matrix inverse (which can be the inverse of a covariance matrix in such machine learning applications as Gaussian processes). Another example of their use is the time complexity of quantum algorithms for solving OLES: the complexity of those algorithms is usually a polynomial or (poly-)logarithmic function of the condition number. This post gives a brief review of condition numbers.

The condition number is actually a more general notion that can be applied not only to matrices, but to an arbitrary function. Moreover, its definition may vary depending on the kind of problem you are solving (e.g. OLES solution or matrix inversion) and the kind of error (absolute or relative) you are measuring.

Informal explanation of the nature of condition numbers

The nature of condition numbers in the case of the matrix inversion problem is quite simple: if one of the eigenvalues of your matrix is 0, the matrix determinant is 0, and the matrix has no inverse. Such a matrix is called singular or degenerate.

If none of the eigenvalues is exactly 0, but one of them is close to 0, the matrix determinant is close to 0. It then turns out that the error you get when numerically calculating the inverse of such a matrix is huge. Such a matrix is called ill-conditioned.

However, it turns out that it is not the absolute value of the smallest eigenvalue that determines whether the matrix is well- or ill-conditioned, but (speaking vaguely) the ratio between the absolute values of the largest and the smallest ones. I’ll explain this below.

Condition numbers for stability of OLES solution

Let us formally define what a condition number is. We’ll need a few definitions.

First, let us define the vector norm as a generalization of the notion of vector length, having the properties of being positive, sub-additive and absolutely homogeneous with respect to multiplication by a scalar. Let’s think of it as just the Euclidean vector length for now: $\Vert x \Vert = (\sum\limits_i x_i^2)^{\frac{1}{2}}$.

The matrix norm (or, more generally, operator norm) is the largest vector length you can achieve by multiplying any vector by this matrix, divided by that vector’s length:

$\Vert A \Vert = \sup \frac{\Vert Ax \Vert}{\Vert x \Vert}$ (when $\Vert x \Vert \neq 0$, obviously).

For the purpose of studying the numerical estimation of the matrix inverse, it is useful to consider the opposite quantity: the strongest shrinkage one can achieve by multiplying a vector by the matrix:

$m = \inf \frac{\Vert Ax \Vert}{\Vert x \Vert}$

It is the reciprocal of the norm of the matrix inverse (substituting $y = Ax$): $m = \inf \frac{\Vert Ax \Vert}{\Vert x \Vert} = \inf \frac{\Vert y \Vert}{\Vert A^{-1}y \Vert} = \frac{1}{\sup \frac{\Vert A^{-1}y \Vert}{\Vert y \Vert}} = \frac{1}{\Vert A^{-1} \Vert}$.

The ratio of the largest stretch to the strongest shrink of a vector produced by the matrix $A$ is called the condition number: $\kappa(A) = \frac{\sup \frac{\Vert Ax \Vert}{\Vert x \Vert}}{\inf \frac{\Vert Ax \Vert}{\Vert x \Vert}} = \Vert A \Vert \Vert A^{-1} \Vert$.
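
To make these definitions concrete, here is a small numerical sketch (assuming numpy; the matrix is just a random example, and the sup/inf are only approximated by sampling random directions) that compares the largest stretch and the strongest shrink with the singular values, and checks that $\kappa(A) = \Vert A \Vert \Vert A^{-1} \Vert$ matches numpy's built-in condition number:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))  # an arbitrary example matrix

# Approximate sup ||Ax|| / ||x|| and inf ||Ax|| / ||x|| by sampling random directions
xs = rng.normal(size=(3, 100_000))
ratios = np.linalg.norm(A @ xs, axis=0) / np.linalg.norm(xs, axis=0)

sigma = np.linalg.svd(A, compute_uv=False)  # singular values, in descending order
print(ratios.max(), sigma[0])    # largest stretch ~ ||A|| (spectral norm)
print(ratios.min(), sigma[-1])   # strongest shrink m ~ 1 / ||A^{-1}||

# kappa(A) = ||A|| * ||A^{-1}|| coincides with numpy's built-in condition number
kappa = np.linalg.norm(A, 2) * np.linalg.norm(np.linalg.inv(A), 2)
print(kappa, np.linalg.cond(A, 2))
```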

Why does this notion make sense? Consider a system of linear equations: $Ax = y$. Suppose that you make an error in the estimation of $y$:

$A(x + \delta x) = y + \delta y$; hence, by subtracting $Ax = y$, we get $A \delta x = \delta y$, or $\delta x = A^{-1} \delta y$. We now know the bound on the absolute error: $\Vert \delta x \Vert \leq \Vert A^{-1} \Vert \Vert \delta y \Vert$.

The bound on the relative error follows from multiplying two inequalities, $\Vert y \Vert \leq \Vert A \Vert \Vert x \Vert$ and $\Vert \delta x \Vert \leq \Vert A^{-1} \Vert \Vert \delta y \Vert$:

$\Vert y \Vert \Vert \delta x \Vert \leq \Vert A \Vert \Vert x \Vert \Vert A^{-1} \Vert \Vert \delta y \Vert \implies \frac{\Vert \delta x \Vert}{\Vert x \Vert} \leq \kappa(A) \frac{\Vert \delta y \Vert}{\Vert y \Vert}$.

So the condition number bounds the relative error of your numerical solution in terms of the relative error of the input.
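
Here is a minimal sketch of this bound in action, assuming numpy and a deliberately ill-conditioned $2 \times 2$ example matrix: a tiny relative error in $y$ turns into a much larger relative error in $x$, but stays within $\kappa(A) \frac{\Vert \delta y \Vert}{\Vert y \Vert}$.

```python
import numpy as np

# A nearly singular (ill-conditioned) matrix: the second row is almost
# a multiple of the first one.
A = np.array([[1.0, 2.0],
              [1.0, 2.0001]])
x_true = np.array([1.0, 1.0])
y = A @ x_true

# Small perturbation of the right-hand side
delta_y = 1e-6 * np.array([1.0, -1.0])
x_pert = np.linalg.solve(A, y + delta_y)

rel_err_y = np.linalg.norm(delta_y) / np.linalg.norm(y)
rel_err_x = np.linalg.norm(x_pert - x_true) / np.linalg.norm(x_true)
kappa = np.linalg.cond(A, 2)

print(f"kappa(A)            = {kappa:.2e}")
print(f"relative error in y = {rel_err_y:.2e}")
print(f"relative error in x = {rel_err_x:.2e}")
print(f"bound kappa * err_y = {kappa * rel_err_y:.2e}")
```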

Matrix norms: operator norm vs spectral norm vs $L^p$-norm vs Frobenius norm

Before we move on, we should make a short stop to clarify the notion of matrix norms.

There are several approaches to define matrix norms; they extend the notion of vector norm in different ways.

The first approach, which we used above, extends the notion of a generic operator norm to the case of matrices. It says by how much, at most, the norm/length of the vector $Ax$ can be larger than the norm/length of the vector $x$.

An important special case of the operator norm is called the spectral norm, which implies that the vector norm we use is the Euclidean ($L^2$) norm, i.e. length.

Don’t confuse the “spectral norm” (I won’t go deeper into the topic of singular values here, but will just mention that the spectral norm of a matrix $A$ equals the largest singular value of $A$, which equals the square root of the largest eigenvalue of the matrix $A^TA$) with the “spectral radius” (which is just the largest absolute value of an eigenvalue of $A$); they can differ in the general case, see Gelfand’s formula. For normal matrices, though, e.g. orthogonal/unitary and symmetric/Hermitian matrices, singular values correspond to absolute values of eigenvalues.

The second approach treats the matrix as a mere vector, effectively flattening its 2 dimensions into 1, and directly applies a vector norm to it as though it were just a vector. These norms are sometimes called “entrywise”.

Again, an important special case of this kind of norm is the Frobenius norm, which is the application of the regular $L^2$ vector norm to our matrices.
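
To see how these norms differ, here is a small sketch (assuming numpy; the nilpotent matrix below is an arbitrary non-normal example) comparing the spectral norm, the spectral radius and the Frobenius norm:

```python
import numpy as np

# A non-normal example matrix: its spectral norm and spectral radius differ
A = np.array([[0.0, 2.0],
              [0.0, 0.0]])

spectral_norm = np.linalg.norm(A, 2)                  # largest singular value of A
spectral_radius = np.abs(np.linalg.eigvals(A)).max()  # largest |eigenvalue| of A
frobenius_norm = np.linalg.norm(A, 'fro')             # L2 norm of A flattened into a vector

print(spectral_norm)    # 2.0 = sqrt of the largest eigenvalue of A^T A
print(spectral_radius)  # 0.0 -- both eigenvalues of A are 0
print(frobenius_norm)   # 2.0 = sqrt(0^2 + 2^2 + 0^2 + 0^2)
```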

Matrix inverse

When you are solving the problem of matrix inversion, $A A^{-1} = I$, each column $x$ of your inverse matrix is a solution of a linear equation $Ax = v$, where $v$ is one of the columns of the identity matrix, a one-hot vector with all but one coordinate equal to 0 and one coordinate equal to 1: $v = (0, 0, ..., 1, ..., 0)$.

Hence, for each column of the matrix $A^{-1}$, the relative error of its calculation, again, is determined by the condition number. We can say that for each column $x$: $\frac{\Vert \delta x \Vert}{\Vert x \Vert} \leq \kappa(A) \frac{\Vert \delta y \Vert}{\Vert y \Vert}$.

If we assume that all the matrix columns have errors of comparable relative size, we can directly express the Frobenius norm of the error of the inverse matrix through condition numbers, not just the operator/spectral norm.
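
As an illustration of how the error of a numerically computed inverse grows with the condition number, here is a sketch assuming numpy; the Hilbert matrix built by the helper below is a classic ill-conditioned example:

```python
import numpy as np

def hilbert(n):
    """Hilbert matrix H[i, j] = 1 / (i + j + 1) -- a classic ill-conditioned example."""
    i, j = np.indices((n, n))
    return 1.0 / (i + j + 1)

for n in (4, 8, 12):
    A = hilbert(n)
    A_inv = np.linalg.inv(A)
    # How far the computed A @ A^{-1} is from the identity, in Frobenius norm
    residual = np.linalg.norm(A @ A_inv - np.eye(n), 'fro')
    print(f"n = {n:2d}  cond(A) = {np.linalg.cond(A):.1e}  residual = {residual:.1e}")
```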

L2-regularization from the condition numbers perspective

In regression problems, L2-regularization is often introduced in order to prevent the regression weights from exploding.

Here is how it works. In regression problems we look for a weight vector $\mathbf{w}$ that minimizes the sum of squares:

$\hat{\mathbf{w}} = \underset{\mathbf{w}}{\arg\min} \sum\limits_i (y_i - \langle \mathbf{w}, \mathbf{x}_i \rangle)^2$

In order to prevent the solution from overfitting, it is common practice to add an L2-norm Tikhonov regularization term:

$\hat{\mathbf{w}} = \underset{\mathbf{w}}{\arg\min} \sum\limits_i (y_i - \langle \mathbf{w}, \mathbf{x}_i \rangle)^2 + \lambda \Vert \mathbf{w} \Vert^2$

Why does this work? Because upon minimization we take the derivative with respect to each coordinate $w_i$, set it to zero, and end up with the following expression:

$\sum\limits_i (y_i - \mathbf{w}^T \mathbf{x}_i)\mathbf{x}_i = \lambda \mathbf{w} \implies \mathbf{w} = (\lambda \mathbf{I} + \sum\limits_i \mathbf{x}_i \mathbf{x}_i^T)^{-1} (\sum\limits_i y_i \mathbf{x}_i)$

Now, if the matrix $C = \sum\limits_i \mathbf{x}_i \mathbf{x}_i^T$ is ill-conditioned or even singular (i.e. has 0 eigenvalues) — which happens when it is not full-rank, e.g. if the number of terms in the sum is less than the dimensionality of the vectors $\mathbf{x}_i$ — then its inverse $C^{-1} = (\sum\limits_i \mathbf{x}_i \mathbf{x}_i^T)^{-1}$ is either calculated with a huge error (if the smallest eigenvalue by absolute value is very small) or just does not exist at all (if the smallest eigenvalue by absolute value is 0, and the matrix is not full-rank).

Regularization is our solution to this problem, because it increments all the eigenvalues by the regularization coefficient $\lambda$. Indeed, if $\sigma^2$ is an eigenvalue of this matrix and $v$ is its eigenvector:

$C v = \sigma^2 v$

Then $v$ is still an eigenvector of the regularized matrix $C + \lambda \mathbf{I}$:

$(C + \lambda \mathbf{I}) v = (\sigma^2 + \lambda) v$

Thus the smallest eigenvalue of the regularized matrix is at least $\lambda$, because all the eigenvalues of the matrix $C$ are non-negative: it is a Gram matrix, which is symmetric positive semi-definite. By choosing $\lambda$ big enough, we keep the regularized matrix well-conditioned.
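
Here is a sketch of this effect, assuming numpy; the data is a small synthetic example with fewer samples than dimensions, so the Gram matrix $C$ is singular without regularization:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, dim = 5, 10           # fewer samples than dimensions => C is rank-deficient
X = rng.normal(size=(n_samples, dim))
y = rng.normal(size=n_samples)

C = X.T @ X                      # Gram matrix sum_i x_i x_i^T, positive semi-definite
print(np.linalg.cond(C))         # huge: C is singular up to round-off

lam = 0.1                        # regularization coefficient lambda
C_reg = C + lam * np.eye(dim)    # every eigenvalue is shifted up by lambda
print(np.linalg.cond(C_reg))     # now bounded by (sigma_max^2 + lambda) / lambda

# Closed-form regularized weights from the derivation above
w = np.linalg.solve(C_reg, X.T @ y)
print(np.linalg.norm(w))         # weights stay bounded instead of exploding
```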
