Kernel methods and Reproducing Kernel Hilbert Space (RKHS)

August 03, 2021 43 min read

Kernel methods are yet another approach to automatic feature selection/engineering in the machine learning engineer's toolbox. It is based on theoretical results from the field of functional analysis, dating to 1900s and 1950s, called Reproducing Kernel Hilbert Space. Kernel methods, such as kernel SVMs, kernel ridge regressions, gaussian processes, kernel PCA or kernel spectral clustering are very popular in machine learning. In this post I'll try to summarize my readings about this topic, linearizing the pre-requisites into a coherent story.

Kernel methods and kernel trick

Suppose, you are trying to tell the difference between two sets of points with a linear binary classifier, like SVM, that can take decisions, using a linear decision boundary only.

What if your points are lying in a way that cannot be split by a single hyperplane in that space? E.g. there is no single line that could split the points, lying on a plane like a XOR function:

XOR function cannot be learnt by a linear classifier in 2D; but we can go to 3D instead

XOR function cannot be learnt by a linear classifier in 2D. Here blue points and red points are two classes, we need to learn, and there is no single hyperplane in 2D (i.e. a line), such that red points were on one side of the line and blue points - on the other. However, we can engineer extra features, mapping each of our data points onto a higher dimensional space (3D) instead, where an extra dimension would allow us to come up with a valid linear decision boundary (i.e. a 2D plane, which is a hyperplane in 3D space).

Feature maps

So what you can do is come up with extra features, made from the basic ones.

I.e. if each of your data points x\bf x \bf was something like a 2-vector xi=(x1,ix2,i)\bf x_i \bf = \begin{pmatrix} x_{1,i} \\ x_{2,i} \end{pmatrix}, you could engineer a third feature to be, e.g. a product of the first two φ(xi)=(x1,ix2,ix1,ix2,i)\bf \varphi(x_i) \bf = \begin{pmatrix} x_{1,i} \\ x_{2,i} \\ x_{1,i} \cdot x_{2,i} \end{pmatrix}, and do the classification in 3D.

Such a mapping from the initial space to the new space is called a feature map and denoted φ(x)\varphi(\bf x \bf).

Implicit features

Now, oftentimes you don’t need to express your features explicitly to perform the classification. Moreover, there is a way to make mathematics auto-magically engineer and select fairly complex features for you!

This magic is called Kernel trick, and it is widely applicable in machine learning. Many linear methods, based on least squares problem, can be “upgraded” with kernel trick to perform in a high-dimensional feature map space. This family of methods is called “kernel methods”. For example, ridge regression, PCA, SVM. Gaussian processes, important by themselves and for bayesian neural networks, also belong to this class.

To explain kernel trick, I will first formally state a general formulation of the optimization problem we solve in statistical learning theory, called Empirical Risk Minimization problem, then describe the basic least squares problem, and then will “kernelize” it.

General form of Empirical Risk Minimization problem in statistical learning

Don’t worry, Empirical Risk Minimization is just a fancy name for your typical statistical learning problem.

Suppose that we have a training set of points (x,y)(x, y), where each point consists of a vector of features x\bf x \bf and a class label yy. If the set of points is infinitely large, we can denote it as some join distribution P(x,y).

We are trying to come up with some classification/regression function h(x)h(x) from a space of functions H\mathcal{H}, so that h(x)h(x) approximates yy well enough to classify/predict value of each point correctly.

The error between our prediction h(x)h(x) and true class/value yy is measured by a loss function L(f(x),y)L(f(x), y).

The risk R(h)R(h), associated with a choice of classification/regression function h(x)h(x) is measured as expectation of loss function over all the data points:

R(h)=E[L(h(x),y)]=L(h(x),y)dP(x,y)R(h) = \mathbb{E}[L(h(x),y)] = \int L(h(x), y) dP(x,y)

Thus, the optimal function h(x)h(x) (let us denote it with a hat h^(x)\hat{h}(x)) would be the one to minimize the risk:

h^=arg minhHL(h(x),y)dP(x,y)\hat{h} = \underset{h \in \mathcal{H}}{\argmin} \int L(h(x), y) dP(x,y)

Often you would also want to add some kind of regularization term R(h)R({\lVert h \rVert}), e.g. L2-regularization R(h)=h2R({\lVert h \rVert}) = {\lVert h \rVert}^2 ), with some weight λ\lambda into this equation in order to prevent h^\hat{h} from overfitting the data:

h^=arg minhHL(h(x),y)dP(x,y)+λR(h)\hat{h} = \underset{h \in \mathcal{H}}{\argmin} \int L(h(x), y) dP(x,y) + \lambda R({\lVert h \rVert})

In case the amount of data available is finite (as is normally the case), the integral becomes a sum:

h^=arg minhHi=1nL(h(x),y)+λR(h)\hat{h} = \underset{h \in \mathcal{H}}{\argmin} \sum \limits_{i=1}^n L(h(x), y) + \lambda R({\lVert h \rVert})

Example: simple regression problem with basic least squares

Recall, how the least squares work in case of normal features.

Let X=(x1...xn)=(x1,1x1,2...x1,p............xn,1xn,2...xn,p)X = \begin{pmatrix} {\bf x_1} \\ ... \\ {\bf x_n} \end{pmatrix} = \begin{pmatrix} x_{1,1} && x_{1,2} && ... && x_{1,p} \\ ... && ... && ... && ... \\ x_{n,1} && x_{n,2} && ... && x_{n,p} \end{pmatrix} be the nn x pp matrix of data (e.g. n genes, p patients),

where xi=(xi,1...xi,n){\bf x_i} = (x_{i,1} ... x_{i,n}) are n-vectors, corresponding to each element of data (e.g. gene expressions for the patient xi{\bf x_i}).

Let y=(y1...yn){\bf y} = \begin{pmatrix} y_1 \\ ... \\ y_n \end{pmatrix} be the vector of results and w=(w1...wp){\bf w} = \begin{pmatrix} w_1 \\ ... \\ w_p \end{pmatrix} be the vector of weights of factors. So, our prediction is h(X)=Xwh(X) = X{\bf w}.

The aim is to minimize the following sum of squares of prediction errors:

L(h(X),y)=(yXw)T(yXw)=yTywTXTyyTXw+wTXTXwL(h(X), y) = ({\bf y} - X{\bf w})^{T} ({\bf y} - X{\bf w}) = y^T y - {\bf w}^T X^T {\bf y} - {\bf y}^T X {\bf w} + {\bf w}^T X^T X {\bf w},

so that the optimal weights w^\hat{\bf w} are:

w^=arg minw(yXw)T(yXw)\hat{\bf w} = \underset{\bf w}{\argmin} ({\bf y} - X {\bf w })^{T} ({\bf y} - X {\bf w})

We solve this minimization problem by taking the derivative and equating it to 0. In my opinion this is an abuse of notation, because we use a whole vector as independent variable, and we should’ve written pp equations instead for each Lwi\frac{\partial L}{\partial w_i}:

Lw=(yTywTXTyyTXw+wTXTXw)w=2XTy+2XTXw=0\frac{\partial L}{\partial {\bf w}} = \frac{\partial(y^T y - {\bf w}^T X^T {\bf y} - {\bf y}^T X {\bf w} + {\bf w}^T X^T X {\bf w})}{\partial{\bf w}} = -2 X^T {\bf y} + 2X^TX{\bf w} = 0

Hence, (XTX)w=XTy(X^T X) {\bf w} = X^T{\bf y}.

Now, if the matrix (XTX)(X^T X) were a full-rank non-singular matrix, we could’ve solved this equation, finding a unique solution: w=(XTX)1XTy{\bf w} = (X^T X)^{-1} \cdot X^T{\bf y}.

But note that XX is a rectangular, not a square matrix, thus, we don’t simplify this expression, as we cannot guarantee that (XTX)1(X^T X)^{-1} is invertible.

Consider an example. Let X=(123456)X = \begin{pmatrix} 1 && 2 && 3 \\ 4 && 5 && 6 \\ \end{pmatrix}. Then XTX=(142536)(123456)=(172227222936273645)X^T X = \begin{pmatrix} 1 && 4 \\ 2 && 5 \\ 3 && 6 \\\end{pmatrix} \begin{pmatrix} 1 && 2 && 3 \\ 4 && 5 && 6 \\ \end{pmatrix} = \begin{pmatrix} 17 && 22 && 27 \\ 22 && 29 && 36 \\ 27 && 36 && 45 \\ \end{pmatrix}.

This matrix has rank 2: you can see that the third column of this matrix is a linear combination of the first two: 1(172227)+2(222936)=(273645) -1 \cdot \begin{pmatrix} 17 \\ 22 \\ 27 \end{pmatrix} + 2 \cdot \begin{pmatrix} 22 \\ 29 \\ 36 \end{pmatrix} = \begin{pmatrix} 27 \\ 36 \\ 45 \end{pmatrix}.

Hence, the equation (XTX)w=XTy(X^T X) {\bf w} = X^T{\bf y} will either have no solutions, or have a whole linear space of dimensionality 1 of solutions.

To see this, consider another example, a matrix (1111)\begin{pmatrix} 1 && 1 \\ 1 && 1 \end{pmatrix}.

If we were to solve (1111)(xy)=(23)\begin{pmatrix} 1 && 1 \\ 1 && 1 \end{pmatrix} \cdot \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} 2 \\ 3 \end{pmatrix}, this system obviously has no solutions.

If we were to solve (1111)(xy)=(22)\begin{pmatrix} 1 && 1 \\ 1 && 1 \end{pmatrix} \cdot \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} 2 \\ 2 \end{pmatrix}, this system has a whole line of solutions y=2xy = 2 - x.

Expression (XTX)1XT(X^T X)^{-1} \cdot X^T is called Moore-Penrose pseudo-inverse matrix.

The nn x nn matrix C=XTX=(x1,1xn,1x1,2xn,2......x1,pxn,p)(x1,1x1,2...x1,pxn,1xn,2...xn,p)C = X^T X = \begin{pmatrix} x_{1,1} && x_{n,1} \\ x_{1,2} && x_{n,2} \\ ... && ... \\ x_{1,p} && x_{n,p} \\ \end{pmatrix} \cdot \begin{pmatrix} x_{1,1} && x_{1,2} && ... && x_{1,p} \\ x_{n,1} && x_{n,2} && ... && x_{n,p} \end{pmatrix} is often called a covariance matrix.

It is a Gram matrix by construction, and, thus, is positive (semi-)definite with positive/non-negative eigenvalues.

Kernel trick

Now, we shall replace the matrix XX of basic features with matrix Φ\Phi of advanced features, where each row xi{\bf x_i} of length pp, corresponding to gene expressions of a single patient, is replaced by a row of feature maps φ(xi){\bf \varphi(x_i)} of length PP of feature maps of those gene expressions:

Φ=(φ1(x1)φ2(x1)...φP(x1)φ1(xn)φ2(xn)...φP(xn))=(φ(x1)φ(xn))\Phi = \begin{pmatrix} \varphi_1({\bf x_1}) && \varphi_2({\bf x_1}) && ... && \varphi_P({\bf x_1}) \\ \varphi_1({\bf x_n}) && \varphi_2({\bf x_n}) && ... && \varphi_P({\bf x_n}) \end{pmatrix} = \begin{pmatrix} {\bf \varphi(x_1)} \\ {\bf \varphi(x_n)} \end{pmatrix}

Accordingly, the estimation function h(X)h(X) is replaced with h(Φ)h(\Phi), such that w=(ΦTΦ)1ΦTy{\bf w} = (\Phi^T \Phi)^{-1} \cdot \Phi^T{\bf y} (again, the vector of weights w{\bf w} is now NN-dimensional).

pp x pp positive-definite symmetric covariance matrix C=XTXC = X^TX is replaced with a PP x PP positive-definite symmetric matrix C=ΦTΦC = \Phi^T \Phi. Note that PP can be infinite, in which case matrices Φ\Phi and CC become infinite-dimensional.

Later in this post I am going to show that when we are looking for our function h(φ(xi))h({\bf \varphi({\bf x_i})}), we can actually get rid of an explicit Φ\Phi in its expression and make hh depend only on the kernel matrix KK.

We can already see that function h=φ(xi)wh = {\bf \varphi(x_i)} \cdot {\bf w}. A more general form of this result is known as Representer theorem (see below).

To get a feel of how this works, consider a nice special case.

If we assumed that ΦTΦ\Phi^T \Phi is invertible, which might not be true, then h(φ(xi))=φ(xi)(ΦTΦ)1ΦTyh({\bf \varphi({\bf x_i})}) = {\bf \varphi(x_i)} \cdot (\Phi^T \Phi)^{-1} \cdot \Phi^T{\bf y}.

Moreover, if both ΦTΦ\Phi^T \Phi and ΦΦT\Phi \Phi^T were invertible, our pseudo-inverse matrix could be represented differently:

(ΦTΦ)1ΦT=ΦT(ΦΦT)1(\Phi^T \Phi)^{-1} \Phi^T = \Phi^T (\Phi \Phi^T)^{-1}

Denote kernel matrix K=ΦΦTK = \Phi \Phi^T. Each element of that matrix Ki,jK_{i,j} can be expressed as Ki,j=φ(xi),φ(xj)K_{i,j} = \langle \varphi(x_i), \varphi(x_j) \rangle. Then:

h(φ(xi))=φ(xi)ΦT(ΦΦT)1y=KiK1yh({\bf \varphi({\bf x_i})}) = {\bf \varphi(x_i)} \cdot \Phi^T (\Phi \Phi^T)^{-1} {\bf y} = K_i K^{-1} {\bf y}

Our φ(xi)ΦT{\bf \varphi(x_i)} \Phi^T is a row p-vector, corresponding to a single row of matrix KK, Ki=(φ(xi),φ(x1),φ(xi),φ(x2),...,φ(xi),φ(xp))K_i = (\langle \varphi(x_i), \varphi(x_1) \rangle, \langle \varphi(x_i), \varphi(x_2) \rangle, ..., \langle \varphi(x_i), \varphi(x_p) \rangle).

Multiplication of a vector KiK_i by inverse matrix K1K^{-1} results in a one-hot vector (where every coodinate is 0, except by i-th, which is 1).

Thus, anticlimactically, h(φ(xi))=yih({\bf \varphi({\bf x_i})}) = y_i. For previously unobserved data points xp+1x_{p+1}, though, it would produce meaningful results. However, in reality npn \neq p, and one of the matrices ΦΦT\Phi \Phi^T or ΦTΦ\Phi^T \Phi is usually singular and cannot be inverted. This case is considered later.

Another notation convention is to denote K1yK^{-1} {\bf y} as a column p-vector α\bf \alpha. Hence, our h(φ(xi))=φ(xi)ΦT(ΦΦT)1y=j=1pαjφ(xi),φ(xj)=j=1pαjKi,jh({\bf \varphi({\bf x_i})}) = {\bf \varphi(x_i)} \cdot \Phi^T (\Phi \Phi^T)^{-1} {\bf y} = \sum \limits_{j=1}^{p} \alpha_j \langle \varphi(x_i), \varphi(x_j) \rangle = \sum \limits_{j=1}^{p} \alpha_j K_{i, j}.

Tikhonov (L2) regularization and Woodbury-Sherman-Morrison formula

In reality, the number of samples in your data matrix pp is rarely equal to the number of factors/features nn.

In case of kernel ridge regression, for many kernels the number of features is infinite (see RBF kernel example below).

If PNP \gg N, ΦTΦ\Phi^T \Phi is a non-full rank matrix with lots of zero eigenvalues, thus, its inverse (ΦTΦ)1(\Phi^T \Phi)^{-1} does not exist.

However, in order to overcome this problem, we can add a Tikhonov regularization term: (λI+ΦTΦ)(\lambda I + \Phi^T \Phi), effectively increasing all the matrix eigenvalues by λ\lambda and thus making it invertible.

After that we apply Woodbury-Sherman-Morrison formula to invert the regularized matrix and express the solution of kernel ridge regression through kernel matrix KK:

h(φ(xi))=φ(xi)(λI+ΦTΦ)1ΦTy=φ(xi)ΦT(ΦΦT+λI)1y=Ki(K+λI)1yh(\varphi({\bf x_i})) = \varphi({\bf x_i}) \cdot (\lambda I + \Phi^T \Phi)^{-1} \Phi^T \cdot {\bf y} = \varphi({\bf x_i}) \cdot \Phi^T (\Phi \Phi^T + \lambda I)^{-1} \cdot {\bf y} = K_i \cdot (K + \lambda I)^{-1} \cdot {\bf y}

So, yes, if not for the regularization term, this formula would’ve just selected yiy_i from outputs vector as an estimate of h(φ(xi))h(\varphi(x_i)). However, addition of regularization to the kernel matrix complicates this expression.

Anyway, we end up with an explicit solution of KRR through kernel matrix KK and regression outputs y{\bf y} and through mathematical magic avoid (possibly infinite-dimensional) feature space vectors φ(xi)\varphi({\bf x_i}).

Example: Radial Basis Functions (RBF) kernel

To give this idea a practical sense, let us consider a popular Radial Basis Functions (RBF) kernel, based on Gaussian distribution.

If we had xi\bf x_i \bf and xj\bf x_j \bf vectors of input features, the RBF kernel for them would look like this:

Ki,j=k(xi,xj)=exixj22σ2K_{i,j} = k(x_i, x_j) = e^-{\frac{{|| x_i - x_j ||}^2}{2\sigma^2}}

To see how this kernel produces feature maps, let us decompose it into separate multipliers. For now let us assume σ2=1\sigma^2=1 for shortness of notation, as it doesn’t affect the general idea.

k(xi,xj)=exixj22=e12xi212xj2+xiTxj=exiTxje12xi2e12xj2=k=1(xiTxj)kk!e12xi2e12xj2=k(x_i, x_j) = e^{-\frac{{|| x_i - x_j ||}^2}{2}} = e^{-\frac{1}{2}{||x_i||}^2 -\frac{1}{2}{||x_j||}^2 + x_i^T x_j} = e^{x_i^T x_j} \cdot e^{-\frac{1}{2}{||x_i||}^2} \cdot e^{-\frac{1}{2}{||x_j||}^2} = \sum \limits_{k=1}^{\infty} \frac{(x_i^T x_j)^k}{k!} \cdot e^{-\frac{1}{2}{||x_i||}^2} \cdot e^{-\frac{1}{2}{||x_j||}^2} =

=k=1nl=k(xi,1n1...xi,knk)n1!!e12xi2φk(xi)(xj,1n1...xj,knk)n1!!e12xj2φk(xj)=φ(xi),φ(xj) = \sum \limits_{k=1}^{\infty} \sum \limits_{\sum n_l=k} \underbrace{ \frac{(x_{i,1}^{n_1} \cdot ... \cdot x_{i,k}^{n_k})}{\sqrt{n_1!...n_l!}} \cdot e^{-\frac{1}{2}{||x_i||}^2}}_{\varphi_k(x_i)} \cdot \underbrace{\frac{(x_{j,1}^{n_1} \cdot ... \cdot x_{j,k}^{n_k})}{\sqrt{n_1!...n_l!}} \cdot e^{-\frac{1}{2}{||x_j||}^2}}_{\varphi_k(x_j)} = \langle \varphi(\bf x_i \bf), \varphi(\bf x_j \bf) \rangle

We’ve just shown, how to split RBF kernel into an inner product of feature maps of each data point. Note that the sum is actually an infinite series - kk index runs to infinity. It would not be possible to explicitly write out feature maps.

Reproducing Kernel Hilbert Space formalism

Previously we’ve seen some practical examples of how kernelization can help us automatically engineer features to make our linear methods much more powerful. With kernel trick we can automatically generate complex (e.g. polynomial) features out of our basic ones, and our linear methods, such as linear regression, SVMs, PCA, Fisher discriminant analysis, spectral clustering etc. would become so much more powerful.

However, there is a whole layer of theoretical foundation behind this, that allows to prove that kernel methods are applicable to a wide range of kernels. Turns out that each kernel gives rise to a whole space of functions, called Reproducing Kernel Hilbert Space. This topic requires a pretty strong background in functional analysis, unfortunately.

There are 4 words and 3 concepts in Reproducing, Kernel Hilbert spaces. I’ll first explain the meaning of each of them separately, providing the missing context, and then will consider them together. RKHS is a beast with multiple faces, so you can approach it from different perspectives and then find out that they are all interconnected - quite a typical situation for the kaleidoscope of mathematics. I’ll discuss those approaches in the second part of the post.

Integral transform and its kernel

The term “kernel” here in Reproducing Kernel Hilbert Space is used in the context of integral transforms, which has absolutely nothing (1, 2) to do with the kernel of homomorphism or linear transformation.

As an example of integral transformation consider a sound signal - you have a WAV-file, which is a function f(t)f(t) in the time domain, which equals to the air pressure, produced by your headphones at every given moment of time, while they play the signal. WAV-files are typically huge, so you want to compress the singal by decomposing the sound into harmonics of various frequencies and sacrificing the very high-frequency ones, as 1) their amplitudes are typically very low and 2) human ear can’t hear them anyway. This is more or less how MP3 files work and that’s why they are much smaller than WAVs.


A spectrogram of sound from Audacity software. Air pressure is at the top, amplitudes of harmonics are at the bottom.

So, you do a Fourier series decomposition of your initial signal f(t)f(t), which is an integral transform:

f^(ω)=0te2πiωtf(t)dt\hat{f}(\omega) = \int \limits_{0}^{t} e^{-2 \pi i \omega t} f(t) dt , where f^(ω)\hat{f}(\omega) is the amplitude of harmonic with frequency ω\omega.

Speaking generally, what we did here was an integral transform from a function f(t)f(t) in the time domain to a function f^(ω)\hat{f}(\omega) in the frequency domain.

This is formally written as:

(Tf)(u)=t1t2K(t,u)f(t)dt(Tf)(u) = \int \limits_{t_1}^{t_2} K(t, u) f(t) dt

Here TfTf is an integral transform of function f(t)f(t), which results in a function in a different domain uu, and K(t,u)K(t, u) is the kernel of integral transform. In case of Fourier transform our kernel was K(t,ω)=e2πiωtK(t, \omega) = e^{-2 \pi i \omega t}.

RKHS deals with some special case of kernels, which possess some nice and helpful properties, which we shall discuss later.

Hilbert spaces

Next term in the Reproducing Kernel Hilbert Space abbreviation, that we are going to explore, is “Hilbert space”.

Hilbert space is a slightly more abstract concept, that typical Euclidean space we all are used to.

Speaking informally, in regular Euclidean spaces we are considering finite-dimensional vectors.

Hilbert spaces are a bit more general, because they allow the vectors to be infinite-dimensional. The motivation can be very simple.

Imagine a function, e.g. f(x)=x2f(x) = x^2. You can measure its values in discrete points e.g. {x=-2, x=-1, x=0, x=1, x=2} and get a vector of funciton values {f(-2)=4, f(-1)=1, f(0)=0, f(1)=1, f(2)=4}.

Now, suppose that you have infinite number of points, where you measure the value of your function. So, instead of a finite-dimensional vector, you get a continuous vector. Such infinite-dimensional vector, representing a function, is a single point in a space that is called a Hilbert space.

Moreover, you can define distances and angles on the Hilbert space, by introducing dot product constructuon. A dot product between two functions f(x)f(x) and g(x)g(x) would be:

f,g=+f(t)g(t)dt\langle f, g \rangle = \int \limits_{-\infty}^{+\infty}f(t)g(t)dt

Hence, you can define “length” (or, rather, its formal generalization, called norm) of your infinite-dimentional vector ff as:

f,f=+f2(t)dt\langle f, f \rangle = \int \limits_{-\infty}^{+\infty}f^2(t)dt

And the angle between two functions ff and gg then can be defined as cos(f,g)=f,gf,fg,g\cos(f,g) = \frac{\langle f,g \rangle}{\sqrt{\langle f,f \rangle \langle g, g \rangle }}.

Note, how this definition is useful for Fourier transforms - you can treat Fourier transform of a function as a dot product of that function and a Fourier harmonic:

e2πiωt,f=0te2πiωtf(t)dt=f^(ω)\langle e^{-2 \pi i \omega t}, f \rangle = \int \limits_{0}^{t} e^{-2 \pi i \omega t} f(t) dt = \hat{f}(\omega)

Angles are super-important, because you can say that some functions are orthogonal to each other (e.g. Fourier harmonics are orthogonal to each other - you may prove it yourself). Having a set of orthogonal functions, such as Fourier harmonics, is super-useful, because you can use it as a basis in your Hilbert space, and easily represent other functions as a linear combination of your basis functions.

Hierarchy of topological spaces

If we were to address the formal definition of Hilbert space, we would note that it is defined as

  1. complete
  2. topological vector space
  3. with a dot product and a norm, generated by the notion of dot product

We’ve already discussed the concept of (3) dot product and norm. What’s left is (1) completeness and the notion of (2) topological vector space.

Mathematicians introduced more and more abstract concepts of spaces, which could be treated similarly to hierarchies of inheritance in object-oriented programming.

Let’s take a look at a diagram of various spaces, they defined:

Hierarchy of topological spaces

Hierarchy of topological spaces. Hilbert space is a Banach space with a norm, induced by inner product.

As you can see, the Hilbert space has a whole hierarchy of parents. I’ll consider only the most important ones:

  • (1a) Vector spaces or linear spaces - spaces of objects, that can be multiplied by a scalar or added, while preserving linearity
  • (1b) Topological spaces - spaces, where the notion of topology (basically, the relation of neighbourhood between points) is defined; note that neighbourhood is more abstract concept than the concept of length - you can define the notion of neighbourhood on a rubber band, and while the distances between its point can be altered, as you stretch it, the notion of neighbourhood is preserved
  • (2) Metric spaces - topological vector spaces (multiple inheritance of topological and vector spaces), with the notion of distance between two points, that induces the topology
  • (3) Normed vector spaces - metric spaces with the notion of norm (generalization of length) of a vector/point
  • (4) Banach spaces - complete normed vector spaces; completeness means that if a Cauchy sequence converges to some limit, that limit is also an element of the space - this is a valuable property for proving theorems in functional analysis
  • (5) Hilbert spaces - banach spaces with the notion of dot product/inner product that induces the norm

Reproducing property

Informally speaking, the definition of reproducing property is that if two functions from the Hilbert space are close in norm, it guarantees that evaluations of those functions at any point are also close:

f(x)h(x)<Mxfh|f(x) - h(x)| < M_x || f - h || , where MxM_x is a constant specific for point xx.

Operation of evaluation of a function is defined by an evaluation funcional Lx:ff(x)L_x: f \mapsto f(x) that accepts any funciton on input and returns its value at the point xx. As we just stated infromally, our evaluation functional is continuous, hence, bounded.

Using that fact, we can apply Riesz-Frechet representation theorem, which states that there is a representation of this functional as an integral with some function KxK_x:

Lx(f)=f,KxL_x(f) = \langle f, K_x \rangle

Now, the function KxK_x is itself a function in RKHS, so we can apply evaluation functional to it, but at a different point:

Ly(Kx)=Kx,KyL_y(K_x) = \langle K_x, K_y \rangle

If our inner product for functions ff and gg were defined just as a regular integral f,g=abf(x)g(x)dx\langle f, g \rangle = \int \limits_a^b f(x) g(x) dx, then evaluation functional KxK_x is a trivial function, which equals 1 in xx and 0 everywhere else.

However, we may define custom inner products (see an example of inner product with roughness penalty in the lecture by Arthur Gretton).

In this case reproducing property becomes non-trivial, and the custom inner product is the reason to define a custom Hilbert space.

Properties of the kernel function follow from the properties of inner product. It is symmetric, because the inner product is symmetric, and it is positive-definite, because, again, the inner product is positive definite. For the definition of positive definiteness for an operator, please see Mercer theorem section below.

Symmetry (follows from symmetry/conjugate symmetry property of inner product):

K(x,y)=Kx,Ky=Ky,Kx=K(y,x)K(x, y) = \langle K_x, K_y \rangle = \langle K_y, K_x \rangle = K(y, x)

Positive semidefiniteness: for any set of numbers ci,cjc_i, c_j the following expression is required to be non-negative:

i,j=1ncicjK(xi,xj)=i=1nciKxi,j=1ncjKxj=i=1nciKxi,j=1ncjKxj=i=1nciKxi20\sum \limits_{i,j=1}^{n} c_i c_j K(x_i, x_j) = \sum \limits_{i=1}^{n} c_i \langle K_{x_i}, \sum \limits_{j=1}^{n} c_j K_{x_j} \rangle = \langle \sum \limits_{i=1}^{n} c_i K_{x_i}, \sum \limits_{j=1}^{n} c_j K_{x_j} \rangle = || \sum \limits_{i=1}^{n} c_i K_{x_i} ||^2 \ge 0

There is an inverse statement, called Moore–Aronszajn theorem - every symmetric, positive semidefinite kernel gives rise to a unique Hilbert space. I leave the proof up to wikipedia. Note that only the kernel is unique for RKHS, feature map must not be unique for RKHS.

Representer theorem

Remember, we previously showed an example of how to represent the optimal solution of Empirical Risk Minimization problem via an inner product of some feature map with a set of coefficients?

This is actually a theorem, called Representer theorem.

In its general form it is as follows. If we are seeking the solution of an optimization problem:

h^=arg minhHL(h(x),y)dP(x,y)+λR(h)\hat{h} = \underset{h \in \mathcal{H}}{\argmin} \int L(h(x), y) dP(x,y) + \lambda R({\lVert h \rVert}),

where RR is a strictly monotonically increasing function on [0,][0, \infty] and the solution is required to be a function h(y)h(y) from the class of feature maps f(y)=i=1βiK(xi,y)f(y) = \sum \limits_{i=1}^{\infty} \beta_i K(x_i, y), corresponding to a kernel K(x,y)K(x, y), then the solution can be represented as h(y)=i=1mαiK(xi,y)h(y) = \sum \limits_{i=1}^{m} \alpha_i K(x_i, y).

We’ve already seen an example of how this works before in case of kernel regression. This is a more general statement that claims that even if we are optimizing a more or less arbitrary functional and use a rather wide class of regularization functions (as long as they are monotonic), the kernel method still works.

The proof is as follows:

Suppose that we found a way to decompose our function h(x)h(x) into a sum of two orthogonal parts: one part is the subspace, covered by all the feature map functions φ(xi)\varphi(x_i), and the other part is its orthogonal complement vv:

h=i=1mφ(xi)+vh = \sum \limits_{i=1}^{m} \varphi(x_i) + v

If we were to evaluate this function at a specific point xjx_j, the reproducing property and orthogonality of vv would kick in:

h(xj)=i=1mφ(xi)+v,φ(xj)=i=1mαiφ(xi),φ(xj)h(x_j) = \langle \sum \limits_{i=1}^{m} \varphi(x_i) + v, \varphi(x_j) \rangle = \sum \limits_{i=1}^{m}\alpha_i \langle \varphi(x_i), \varphi(x_j) \rangle

Hence, when we use this formula for h(xj)h(x_j) to evaluate the loss in our estimator, the loss does not depend on vv. As for the regularization term:

g(h)=g(i=1nαiφ(xi)+v)=g(i=1nαiφ(xi)2+v2)g(i=1nαiφ(xi)2)g(||h||) = g(|| \sum \limits_{i=1}^{n} \alpha_i \varphi(x_i) + v ||) = g(\sqrt{|| \sum \limits_{i=1}^{n} \alpha_i \varphi(x_i) ||^2 + ||v||^2}) \ge g(\sqrt{|| \sum \limits_{i=1}^{n} \alpha_i \varphi(x_i) ||^2})

Optimum of hh is achieved when vv is 0 (or minimum) due to monotonicity of regularizer gg in its argument. Hence,

h^=i=1nαiφ(xi)=\hat{h} = \sum \limits_{i=1}^{n} \alpha_i \varphi(x_i) = , which proves the theorem.

Mercer theorem

Finite-dimensional case: Gram matrix

To get an intuition of Mercer theorem let us first consider a discrete case, where our operator is just a real-valued matrix.

In discrete case a matrix is called a Gram matrix, if it is constructed as a multiplication of a data matrix XX with its inverse:

K=XTX=(x1,x1x1,x2x1,x3x2,x1x2,x2x2,x3x3,x1x3,x2x3,x3)K = X^T X = \begin{pmatrix} \langle {\bf x_1}, {\bf x_1} \rangle && \langle {\bf x_1}, {\bf x_2} \rangle && \langle {\bf x_1}, {\bf x_3} \rangle \\ \langle {\bf x_2}, {\bf x_1} \rangle && \langle {\bf x_2}, {\bf x_2} \rangle && \langle {\bf x_2}, {\bf x_3} \rangle \\ \langle {\bf x_3}, {\bf x_1} \rangle && \langle {\bf x_3}, {\bf x_2} \rangle && \langle {\bf x_3}, {\bf x_3} \rangle \\ \end{pmatrix}

A matrix like this has 2 important properties: it is symmetric and positive (semi-)definite.

Moreover, the reverse is also true: if the matrix KK is positive semidefinite, it is always a Gram matrix of dot products of some matrices BB, called realizations: M=BBM = B^{*}B. This statement is a finite-dimensional version of Mercer’s theorem. I will first explore the properties of a Gram matrix, constructed as a dot product matrix, and after that will proof the discrete analogue of Mercer theorem.


If out matrix KK of dimensionality nn is symmetric (so that its elements across the diagonal ki,j=kj,ik_{i,j} = k_{j,i} are identical), it means that its eigenvectors are orthogonal and form an orthogonal matrix EE, where each column of such a matrix is an eigenvector eie_i, and eigenvectors are orthogonal, so that ei,ej=0\langle e_i, e_j \rangle = 0 if iji \neq j and ei,ej=1\langle e_i, e_j \rangle = 1 if i=ji = j.

This leads to eigenvalue decomposition of our matrix as:

K=EΛE1=EΛETK = E \Lambda E^{-1} = E \Lambda E^T,

where Λ\Lambda is the diagonal matrix of eigenvalues λi\lambda_i corresponding to eigenvectors eie_i: Λ=(λ1000λ20.........00λn)\Lambda = \begin{pmatrix} \lambda_1 && 0 && 0 \\ 0 && \lambda_2 && 0 \\ ... && ... && ... \\ 0 && 0 && \lambda_n \end{pmatrix}, and we use the fact that eigenvector matrix is orthogonal, so that E1=ETE^{-1} = E^T.

Hence, we can use an alternative representation of eigenvalue decomposition through outer product:

K=EΛET=i=1nλieieiTK = E \Lambda E^T = \sum \limits_{i=1}^{n} \lambda_i e_i e_i^T

For more details on the topic, see this post.

Positive (semi-) definiteness

Positive definite matrix is such a matrix, that being considered as a quadratic form matrix, it always produces a positive (for positive definite) or non-negative (for positive semidefinite) scalar:

xTKx0{\bf x^T} K {\bf x} \ge 0 for any vector x{\bf x}

This leads to the following consequence:

As Gram matrix is symmetric, its eigenvalues are real (again, see this post). Moreover, all of them have to be non-negative, because if any eigenvalue λi\lambda_i were negative, it would mean that for en eigenvector xx:

Kx=λixKx = \lambda_i x, and xKx=xλix=λix2<0x^*Kx = x^* \lambda_i x = \lambda_i {|x|}^2 < 0,

as square of vector xx length is positive, and λi\lambda_i is negative by proposal), which contradicts the definition of positive semidefinite matrix.

Note that positive definite matrices generally don’t have to be symmetric. E.g. (1111)\begin{pmatrix} 1 && 1 \\ -1 && 1 \end{pmatrix} is positive definite, because:

(xy)(1111)(xy)=(xy)(x+yx+y)=x2+y2\begin{pmatrix} x && y \\ \end{pmatrix} \begin{pmatrix} 1 && 1 \\ -1 && 1 \\ \end{pmatrix} \begin{pmatrix} x \\ y \\ \end{pmatrix} = \begin{pmatrix} x && y \end{pmatrix} \begin{pmatrix} x + y \\ -x + y \\ \end{pmatrix} = x^2 + y^2

But it is not symmetric. See this StackOverflow post.

Non-full rank Gram matrices

Interestingly, an nn x nn Gram matrix is oftentimes not a full-rank matrix.

Suppose, that realization matrix X=(x1,1x1,2x1,3x2,1x2,2x2,3)X = \begin{pmatrix} x_{1,1} && x_{1,2} && x_{1,3} \\ x_{2,1} && x_{2,2} && x_{2,3} \\ \end{pmatrix}, and we were looking for the Gram matrix

K=XTX=(x1,1x2,1x1,2x2,2x1,3x2,3)(x1,1x1,2x1,3x2,1x2,2x2,3)K = X^T X = \begin{pmatrix} x_{1,1} && x_{2,1} \\ x_{1,2} && x_{2,2} \\ x_{1,3} && x_{2,3} \\ \end{pmatrix} \begin{pmatrix} x_{1,1} && x_{1,2} && x_{1,3} \\ x_{2,1} && x_{2,2} && x_{2,3} \end{pmatrix}.

Again, you can use the perspective of outer product to see that matrix KK has the rank of 2:

K=(x1Tx2T)(x1x2)=x1Tx1+x2Tx2=(x1,1x1,2x1,3)(x1,1x1,2x1,3)+(x2,1x2,2x2,3)(x2,1x2,2x2,3)K = ({\bf x_1}^T {\bf x_2}^T) \begin{pmatrix} {\bf x_1} \\ {\bf x_2} \\ \end{pmatrix} = {\bf x_1}^T {\bf x_1} + {\bf x_2}^T {\bf x_2} = \begin{pmatrix} x_{1,1} \\ x_{1,2} \\ x_{1,3} \end{pmatrix} \begin{pmatrix} x_{1,1} && x_{1,2} && x_{1,3} \end{pmatrix} + \begin{pmatrix} x_{2,1} \\ x_{2,2} \\ x_{2,3} \end{pmatrix} \begin{pmatrix} x_{2,1} && x_{2,2} && x_{2,3} \end{pmatrix}.

You can tell that there are only 2 linearly independent rows in this matrix, so the third row x1,3(x1,1x1,2x1,3)+x2,3(x2,1x2,2x2,3)x_{1,3} \cdot \begin{pmatrix} x_{1,1} && x_{1,2} && x_{1,3} \end{pmatrix} + x_{2,3} \begin{pmatrix} x_{2,1} && x_{2,2} && x_{2,3} \end{pmatrix} can be represented as a linear combination of the first two rows.

What if it is the other way around, and X=(x1,1x1,2x2,1x2,2x3,1x3,2)X = \begin{pmatrix} x_{1,1} && x_{1,2} \\ x_{2,1} && x_{2,2} \\ x_{3,1} && x_{3,2} \end{pmatrix}?

Then your Gram matrix is K=XTX=(x1,1x1,2x1,3x2,1x2,2x2,3)(x1,1x2,1x1,2x2,2x1,3x2,3)K = X^T X = \begin{pmatrix} x_{1,1} && x_{1,2} && x_{1,3} \\ x_{2,1} && x_{2,2} && x_{2,3} \end{pmatrix} \cdot \begin{pmatrix} x_{1,1} && x_{2,1} \\ x_{1,2} && x_{2,2} \\ x_{1,3} && x_{2,3} \\ \end{pmatrix} is a full-rank matrix, and realizations are not unique.

However, realizations are non-unique only up to an orthogonal transformations. For instance, if you know a matrix of realizations XX of nn x pp, and you want to find a YY realization matrix of NN by pp dimensionality, it will have a form of YY = QXQ X, where QQ is an NN x nn matrix with orthogonal columns:


YTY=XTQTQX=(x1,1x2,1x1,2x2,2x1,3x2,3)(q1,1q1,2q1,3q2,1q2,2q2,3)(q1,1q2,1q1,2q2,2q1,3q2,3)(x1,1x1,2x1,3x2,1x2,2x2,3) Y^T Y = X^T Q^T Q X = \begin{pmatrix} x_{1,1} && x_{2,1} \\ x_{1,2} && x_{2,2} \\ x_{1,3} && x_{2,3} \\ \end{pmatrix} \begin{pmatrix} q_{1,1} && q_{1,2} && q_{1,3} \\ q_{2,1} && q_{2,2} && q_{2,3} \\ \end{pmatrix} \begin{pmatrix} q_{1,1} && q_{2,1} \\ q_{1,2} && q_{2,2} \\ q_{1,3} && q_{2,3} \\ \end{pmatrix} \begin{pmatrix} x_{1,1} && x_{1,2} && x_{1,3} \\ x_{2,1} && x_{2,2} && x_{2,3} \end{pmatrix} =(x1,1x2,1x1,2x2,2x1,3x2,3)(1001)(x1,1x1,2x1,3x2,1x2,2x2,3)= \begin{pmatrix} x_{1,1} && x_{2,1} \\ x_{1,2} && x_{2,2} \\ x_{1,3} && x_{2,3} \\ \end{pmatrix} \begin{pmatrix} 1 && 0 \\ 0 && 1 \\ \end{pmatrix} \begin{pmatrix} x_{1,1} && x_{1,2} && x_{1,3} \\ x_{2,1} && x_{2,2} && x_{2,3} \end{pmatrix}.

“Discrete Mercer theorem” proof

Now it is time to prove the inverse statement: if a matrix is positive definite and symmetric, it is a Gram matrix for some vector realizations.

The proof is constructive and pretty simple.

Our nn x nn matrix KK must have an eigenvalue decomposition M=EΛETM = E \Lambda E^{T}, which we can split into 2 multipliers:

M=EΛE1=EΛ12Λ12ET=(EΛ12)TT(EΛ12)T=BTBM = E \Lambda E^{-1} = E \Lambda^{\frac{1}{2}} \Lambda^{\frac{1}{2}} E^{T} = (E \Lambda^{\frac{1}{2}})^{T^T} (E \Lambda^{\frac{1}{2}})^T = B^T B, where B=(EΛ12)TB = (E \Lambda^{\frac{1}{2}})^T.

Hence, BB is a realization matrix. Note that if MM is strictly positive definite, all the eigenvalues in Λ\Lambda matrix are positive. If it is non-negatively definite, some of the eigenvalues are 0, the matrix KK is non-full rank, and the matrix BB is rectangular with 0 rows rather than square.

Infinite-dimensional case: positive-definite kernel

Now, we can generalize this statement to infinite-dimensional case. Let ki,jk_{i,j} no longer be elements of a symmetric matrix, but instead become a kernel function k(x,y)k(x,y) of a self-adjoint operator instead.

Tkf(x)=abk(x,y)f(y)dyT_k f(x) = \int \limits_a^b k(x,y)f(y)dy

Then, if our operator also is positive (semi-) definite, we can consider it as an analogue of a Gram matrix. Then the definition of positive (semi-) definite naturally extends the definition of positive (semi-) definite matrix.

i=1nj=1nk(x,y)cicj0\sum \limits_{i=1}^{n} \sum \limits_{j=1}^{n} k(x, y) c_i c_j \ge 0 for any numbers ci,cjc_i, c_j and all finite sequences of points x1,...,xnx_1, ..., x_n / y1,...,yny_1, ..., y_n.

Then Mercer’s theorem states that the eigenfunctions ei{e_i} of TkT_k form an orthonormal basis in L2[a,b]L^2[a,b], their corresponding eigenvalues λi\lambda_i are non-negative, and the kernel has a representation as a converging absolutely and uniformly series:

k(x,y)=i=0λiei(x)ei(y)k(x,y) = \sum \limits_{i=0}^{\infty} \lambda_i e_i(x) e_i(y)

The proof builds upon the fact, that our operator TkT_k is a compact operator. For compact operators eigenfunctions form a discrete orthonormal basis by the spectral theorem, making the rest of the proof similar to finite-dimensional case.

For more details on spectral theorems in functional analysis, see this post.


Boris Burkov

Written by Boris Burkov who lives in Moscow, Russia, loves to take part in development of cutting-edge technologies, reflects on how the world works and admires the giants of the past. You can follow me in Telegram