
MDS, Isomap, LLE, Spectral embedding

December 10, 2022

In this post I investigate the multi-dimensional scaling algorithm and its manifold structure-aware flavours.

Multi-dimensional scaling (MDS)

Intuition

As usual, suppose that we have an $n$-by-$p$ data matrix $X = \begin{pmatrix} x_{1,1} & x_{1,2} & ... & x_{1,p} \\ ... & ... & ... & ... \\ x_{n,1} & x_{n,2} & ... & x_{n,p} \end{pmatrix}$.

Suppose that we’ve constructed an $n$-by-$n$ matrix $D^{(2)}$ of squared distances between the rows $X_i$ of the data matrix $X$:

$$D^{(2)} = \begin{pmatrix} d^2(X_1, X_1) & d^2(X_1, X_2) & ... & d^2(X_1, X_n) \\ ... & ... & ... & ... \\ d^2(X_n, X_1) & d^2(X_n, X_2) & ... & d^2(X_n, X_n) \end{pmatrix}.$$

We know that $d^2(X_1, X_2) = ||X_1||^2_2 - 2 \langle X_1, X_2 \rangle + ||X_2||^2_2$.

With a certain trick we can move away from the distance matrix $D^{(2)}$ to a kernel matrix $K$, such that:

$$K = \begin{pmatrix} \langle X_1, X_1 \rangle & \langle X_1, X_2 \rangle & ... & \langle X_1, X_n \rangle \\ ... & ... & ... & ... \\ \langle X_n, X_1 \rangle & \langle X_n, X_2 \rangle & ... & \langle X_n, X_n \rangle \end{pmatrix}$$

The kernel matrix basically contains the cosines of the angles between vectors $X_i$ and $X_j$ (given that their lengths are 1).

Now, in MDS we are seeking to approximate each $p$-dimensional data point $X_i$ with a low-dimensional vector $Y_i$, so that the similarity matrix between the $Y_i$ vectors approximates the kernel matrix $K$ between the $X_i$ vectors.

Suppose, for instance, that $Y_i$ has a dimensionality of 2.

$$\hat{K} = Y Y^T = \begin{pmatrix} Y_{1,1} & Y_{1,2} \\ Y_{2,1} & Y_{2,2} \\ ... & ... \\ Y_{n,1} & Y_{n,2} \end{pmatrix} \cdot \begin{pmatrix} Y_{1,1} & Y_{2,1} & ... & Y_{n,1} \\ Y_{1,2} & Y_{2,2} & ... & Y_{n,2} \end{pmatrix} = \begin{pmatrix} \langle {\bf Y_{1}}, {\bf Y_{1}} \rangle & \langle {\bf Y_{1}}, {\bf Y_{2}} \rangle & ... & \langle {\bf Y_{1}}, {\bf Y_{n}} \rangle \\ ... & ... & ... & ... \\ \langle {\bf Y_{n}}, {\bf Y_{1}} \rangle & \langle {\bf Y_{n}}, {\bf Y_{2}} \rangle & ... & \langle {\bf Y_{n}}, {\bf Y_{n}} \rangle \end{pmatrix},$$

where ${\bf Y_1} = (Y_{1,1}, Y_{1,2})$ is a row vector of the matrix $Y$.

What are the optimal column vectors $\begin{pmatrix} Y_{1,1} \\ Y_{2,1} \\ ... \\ Y_{n,1} \end{pmatrix}$ of this matrix $Y$?

The answer is simple: they are the leading eigenvectors/singular vectors of $K$ (in this case these coincide, because $K$ is a Gram matrix), scaled by the square roots of the corresponding eigenvalues.

The best rank-2 approximation of $K$ in terms of the Frobenius norm is given by its truncated SVD: $\hat{K} = \sum \limits_{i=1}^2 \sigma_i u_i v_i^T$.

This can be shown by iteratively applying $|| K - \sigma_1 u_1 v_1^T||^2_F = \sum \limits_{i=2}^n \sigma_i^2$ (e.g. see this SO post).

In this case $K$ is a Gram matrix, so its SVD is symmetric: $K = U D V^T = U D U^T$.
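As a quick sanity check, here is a minimal numpy sketch (toy data, my own variable names) that builds a Gram matrix, truncates its eigendecomposition to rank 2, and compares the squared Frobenius error with the sum of squares of the discarded eigenvalues:

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(10, 5)          # toy data matrix: n=10 points, p=5 features
K = X @ X.T                         # Gram (kernel) matrix

# eigendecomposition of the symmetric matrix K; eigh returns eigenvalues in ascending order
eigvals, eigvecs = np.linalg.eigh(K)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # sort descending

# best rank-2 approximation: keep the two leading eigenpairs
K_hat = eigvecs[:, :2] @ np.diag(eigvals[:2]) @ eigvecs[:, :2].T

# Eckart-Young: the squared Frobenius error equals the sum of squared discarded eigenvalues
err = np.linalg.norm(K - K_hat, 'fro') ** 2
print(err, np.sum(eigvals[2:] ** 2))                 # the two numbers should match
```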

Now, how do we get from the matrix $D^{(2)}$ to the matrix $K$? See the next section.

Kernelization and double centering

For now consider a matrix $X_c = \begin{pmatrix} x_{1,1} - \frac{\sum_{i=1}^{n} x_{i,1}}{n} & x_{1,2} - \frac{\sum_{i=1}^{n} x_{i,2}}{n} & ... & x_{1,p} - \frac{\sum_{i=1}^{n} x_{i,p}}{n} \\ ... & ... & ... & ... \\ x_{n,1} - \frac{\sum_{i=1}^{n} x_{i,1}}{n} & x_{n,2} - \frac{\sum_{i=1}^{n} x_{i,2}}{n} & ... & x_{n,p} - \frac{\sum_{i=1}^{n} x_{i,p}}{n} \end{pmatrix}$, where we subtracted column means from each element, so that all the columns have zero sum.

Through SVD we can represent it as $X_c = U D V^T$ (here $D$ temporarily denotes the diagonal matrix of singular values, not the distance matrix), and we can obtain two Gram matrices out of it, left and right.

One of our Gram matrices would be the $p$-by-$p$ covariance matrix, used in PCA:

$$C_c = X_c^T X_c = V D U^T U D V^T = V D^2 V^T$$

The other Gram matrix would be an $n$-by-$n$ kernel matrix:

$$K_c = X_c X_c^T = U D V^T V D U^T = U D^2 U^T$$

The eigenvectors $V$ of the covariance matrix $C_c$ are the right singular vectors of the matrix $X_c$ and, incidentally, the projections $X_c V = U D$ are the principal components obtained in PCA.

The eigenvectors $U$ of the kernel matrix $K_c$ are the left singular vectors of the matrix $X_c$.

In matrix notation, how do we get $X_c$ from $X$? Easy: we multiply it from the left by $\begin{pmatrix} 1 & 0 & ... & 0 \\ ... & ... & ... & ... \\ 0 & 0 & ... & 1 \end{pmatrix} - \frac{1}{n} \begin{pmatrix} 1 & 1 & ... & 1 \\ ... & ... & ... & ... \\ 1 & 1 & ... & 1 \end{pmatrix} = I - \frac{1}{n} {\bf 1}{\bf 1}^T$, where ${\bf 1}$ is a column vector of $n$ ones.

As a result, we get $X_c = (I - \frac{1}{n} {\bf 1}{\bf 1}^T) X$. Apply this to get the centered kernel $K_c$ from the raw kernel matrix $K$:

$$K_c = X_c X_c^T = (I - \frac{1}{n}{\bf 1}{\bf 1}^T) X X^T (I - \frac{1}{n}{\bf 1}{\bf 1}^T)^T = (I - \frac{1}{n}{\bf 1}{\bf 1}^T) K (I - \frac{1}{n}{\bf 1}{\bf 1}^T) = K - \frac{1}{n}{\bf 1}{\bf 1}^T K - \frac{1}{n}K{\bf 1}{\bf 1}^T + \frac{1}{n^2} {\bf 1}{\bf 1}^T K {\bf 1}{\bf 1}^T.$$
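A quick numeric sketch of this identity (toy data, names invented for illustration): centering the raw kernel with $I - \frac{1}{n}{\bf 1}{\bf 1}^T$ on both sides should give exactly the Gram matrix of the column-centered data.

```python
import numpy as np

np.random.seed(1)
n, p = 8, 3
X = np.random.randn(n, p)

J = np.eye(n) - np.ones((n, n)) / n    # centering matrix I - (1/n) 1 1^T
X_c = J @ X                            # column-centered data
K = X @ X.T                            # raw kernel
K_c = J @ K @ J                        # double-centered kernel

# should be (numerically) identical to the Gram matrix of the centered data
print(np.allclose(K_c, X_c @ X_c.T))   # True
```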

Now let us consider the matrix $D^{(2)}$ of pairwise squared distances between the $n$ data points:

$$D^{(2)} = \begin{pmatrix} ||X_1||^2_2 - 2 \langle X_1, X_1 \rangle + ||X_1||^2_2 & ||X_1||^2_2 - 2 \langle X_1, X_2 \rangle + ||X_2||^2_2 & ... & ||X_1||^2_2 - 2 \langle X_1, X_n \rangle + ||X_n||^2_2 \\ ... & ... & ... & ... \\ ||X_n||^2_2 - 2 \langle X_n, X_1 \rangle + ||X_1||^2_2 & ||X_n||^2_2 - 2 \langle X_n, X_2 \rangle + ||X_2||^2_2 & ... & ||X_n||^2_2 - 2 \langle X_n, X_n \rangle + ||X_n||^2_2 \end{pmatrix} =$$

$$= \underbrace{\begin{pmatrix}||X_1||^2_2 & ||X_1||^2_2 & ... & ||X_1||^2_2 \\ ... & ... & ... & ... \\ ||X_n||^2_2 & ||X_n||^2_2 & ... & ||X_n||^2_2 \end{pmatrix}}_{Z} - 2 \underbrace{\begin{pmatrix} \langle X_1, X_1 \rangle & \langle X_1, X_2 \rangle & ... & \langle X_1, X_n \rangle \\ ... & ... & ... & ... \\ \langle X_n, X_1 \rangle & \langle X_n, X_2 \rangle & ... & \langle X_n, X_n \rangle \end{pmatrix}}_{X X^T} + \underbrace{\begin{pmatrix} ||X_1||^2_2 & ||X_2||^2_2 & ... & ||X_n||^2_2 \\ ... & ... & ... & ... \\ ||X_1||^2_2 & ||X_2||^2_2 & ... & ||X_n||^2_2 \end{pmatrix}}_{Z^T} =$$

$$= Z - 2 X X^T + Z^T = {\bf z} {\bf 1}^T - 2 X X^T + {\bf 1} {\bf z}^T,$$

where ${\bf z} = (||X_1||^2_2, ..., ||X_n||^2_2)^T$ is the column vector of squared norms.

Now,

$$(I - \frac{1}{n} {\bf 1}{\bf 1}^T) D^{(2)} (I - \frac{1}{n} {\bf 1}{\bf 1}^T)^T = (I - \frac{1}{n} {\bf 1}{\bf 1}^T) ({\bf z} {\bf 1}^T - 2 X X^T + {\bf 1} {\bf z}^T) (I - \frac{1}{n} {\bf 1}{\bf 1}^T)^T =$$

$$= \Big({\bf z}{\bf 1}^T - \frac{||{\bf z}||_1}{n} {\bf 1}{\bf 1}^T - 2 X X^T + \frac{2}{n} {\bf 1}{\bf 1}^T X X^T + \cancel{{\bf 1} {\bf z}^T} - \cancel{{\bf 1} {\bf z}^T}\Big) (I - \frac{1}{n} {\bf 1}{\bf 1}^T)^T =$$

$$= \cancel{{\bf z}{\bf 1}^T} - \cancel{{\bf z}{\bf 1}^T} - \cancel{\frac{||{\bf z}||_1}{n} {\bf 1}{\bf 1}^T} + \cancel{\frac{||{\bf z}||_1}{n} {\bf 1}{\bf 1}^T} - 2 X X^T + 2 X X^T \frac{1}{n} {\bf 1}{\bf 1}^T + \frac{2}{n} {\bf 1}{\bf 1}^T X X^T - \frac{2}{n} {\bf 1}{\bf 1}^T X X^T \frac{1}{n} {\bf 1}{\bf 1}^T =$$

$$= (I - \frac{1}{n} {\bf 1} {\bf 1}^T ) \, 2 X X^T \, \frac{1}{n} {\bf 1} {\bf 1}^T - (I - \frac{1}{n} {\bf 1} {\bf 1}^T ) \, 2 X X^T = -2 (I - \frac{1}{n} {\bf 1} {\bf 1}^T ) X X^T (I - \frac{1}{n} {\bf 1} {\bf 1}^T ) = -2 (I - \frac{1}{n} {\bf 1} {\bf 1}^T ) K (I - \frac{1}{n} {\bf 1} {\bf 1}^T ) = -2 K_c.$$

Hence, as we’ve just seen, double centering of the squared distance matrix gives us $-2$ times the centered kernel matrix: $K_c = -\frac{1}{2} (I - \frac{1}{n} {\bf 1} {\bf 1}^T ) D^{(2)} (I - \frac{1}{n} {\bf 1} {\bf 1}^T )$.

For this matrix we can easily find an EVD/SVD and use its truncation as a low-rank approximation of the matrix $K_c$ (and, hence, of $D^{(2)}$), which provides us with the dimensionality reduction method.
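Putting the pieces together, here is a minimal numpy sketch of classical MDS (function and variable names are my own, made up for illustration): double-center the squared distance matrix, take $-\frac{1}{2}$ of it to get $K_c$, and embed using the top eigenvectors scaled by the square roots of the eigenvalues.

```python
import numpy as np

def classical_mds(D2, dim=2):
    """Classical MDS from a matrix of squared pairwise distances D2."""
    n = D2.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    K_c = -0.5 * J @ D2 @ J                      # centered kernel via double centering
    eigvals, eigvecs = np.linalg.eigh(K_c)       # ascending order
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
    # keep the top `dim` non-negative eigenvalues, scale eigenvectors by their square roots
    lam = np.clip(eigvals[:dim], 0, None)
    return eigvecs[:, :dim] * np.sqrt(lam)

# toy usage: MDS on exact Euclidean squared distances of 2-D data
np.random.seed(2)
X = np.random.randn(20, 2)
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
Y = classical_mds(D2, dim=2)
D2_hat = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
print(np.allclose(D2, D2_hat))                   # True: pairwise distances are preserved
```

The recovered embedding coincides with the original data only up to rotation and translation, which is all one can hope for when starting from distances alone.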

Isomap and Locally Linear Embeddings (LLE)

It is easy to observe a limitation of the classical MDS algorithm: oftentimes data points form a so-called manifold in the enveloping space. For instance, real-life 640x480 photos occupy only a small part of the space of all theoretically possible 640x480 pixel signals. Moreover, this shape is continuous and smooth: you can transition from one real-life photo to another one by applying small changes.

Hence, the correct way to measure the distances between our data points is not Euclidean distance in the enveloping space, but geodesic distance on the manifold. For instance, if we compare photos of a lesser panda and a giant panda, they’d be close in the enveloping space and the Euclidean distance $d_e$ between them would be small, but they’d be far apart on the manifold, because the lesser panda is a relative of raccoons, while the giant panda is a bear, so the geodesic distance $d_g$ between them would be large.


Manifold of photos. Each data point corresponds to a photo. Photos of a giant panda and a lesser panda are close in Euclidean distance $d_e$, because they look similar, but far apart in geodesic distance $d_g$, because they actually belong to very different taxa.

In 2000, two manifold-aware methods were published in the same issue of Science magazine.

Isomap

Isomap works in 3 simple steps:

  1. Find k nearest neighbours of each point. Construct a weighted graph, connecting neighbouring points with edges, where the weight of each edge is the Euclidean distance between those points.

  2. Construct a distance matrix on that graph, using e.g. Dijkstra’s algorithm.

  3. Using this distance matrix, perform MDS.

I don’t know what to add. The method is straightforward and self-explanatory, but beautiful.
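Here is a minimal sketch of those 3 steps in Python (assuming scipy is available; the helper names are mine, and it reuses the `classical_mds` function from the MDS sketch above): build a k-nearest-neighbour graph, run shortest paths on it to get geodesic distances, and feed the squared geodesic distances into MDS.

```python
import numpy as np
from scipy.sparse import lil_matrix
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import cdist

def isomap(X, n_neighbors=10, dim=2):
    n = X.shape[0]
    D = cdist(X, X)                                   # all pairwise Euclidean distances

    # Step 1: kNN graph with Euclidean edge weights
    G = lil_matrix((n, n))
    for i in range(n):
        nn = np.argsort(D[i])[1:n_neighbors + 1]      # skip the point itself
        G[i, nn] = D[i, nn]

    # Step 2: geodesic distances = shortest paths on the graph (Dijkstra);
    # assumes the kNN graph is connected, otherwise D_geo contains infinities
    D_geo = shortest_path(G.tocsr(), method='D', directed=False)

    # Step 3: classical MDS on squared geodesic distances
    return classical_mds(D_geo ** 2, dim=dim)         # helper from the MDS sketch above
```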

Locally Linear Embeddings (LLE)

LLE is very similar to Isomap, but with a slightly different premise.

  1. Find k nearest neighbours of each point. Construct a weighted graph, connecting neighbouring points with edges, where the weight of each edge is the Euclidean distance between those points.

  2. Construct a matrix $W$, such that each data point $X_i$ is most accurately represented as a linear combination of its neighbours with the weights from this matrix: $\bar{X}_i = \sum \limits_{j=1}^{k} W_{i,j} X_j$, so that $\Phi(X) = \sum \limits_{i=1}^n |X_i - \sum \limits_{j=1}^k W_{i,j} X_j|^2$ is minimal (a least squares problem).

  3. Using this matrix as a similarity matrix, find a matrix $Y$ of low-dimensional vectors $Y_i$, such that the following function is minimized: $\Phi(Y) = \sum \limits_{i=1}^n |Y_i - \sum \limits_{j=1}^{k} W_{i,j} Y_j|^2$.

The matrix $W$ can be found with a simple least squares regression.

If you have a data matrix $X$ and you’re looking for $W$ which best approximates $X$ as $\hat{X} = W \cdot X$:

$$\begin{pmatrix} \hat{X}_{1,1} & \hat{X}_{1,2} & \hat{X}_{1,3} & \hat{X}_{1,4} \\ \hat{X}_{2,1} & \hat{X}_{2,2} & \hat{X}_{2,3} & \hat{X}_{2,4} \\ \hat{X}_{3,1} & \hat{X}_{3,2} & \hat{X}_{3,3} & \hat{X}_{3,4} \end{pmatrix} = \begin{pmatrix} W_{1,1} & W_{1,2} & W_{1,3} \\ W_{2,1} & W_{2,2} & W_{2,3} \\ W_{3,1} & W_{3,2} & W_{3,3} \end{pmatrix} \cdot \begin{pmatrix} X_{1,1} & X_{1,2} & X_{1,3} & X_{1,4} \\ X_{2,1} & X_{2,2} & X_{2,3} & X_{2,4} \\ X_{3,1} & X_{3,2} & X_{3,3} & X_{3,4} \end{pmatrix}$$

Essentially we have a linear regression problem for each row vector $\begin{pmatrix}\hat{X}_{1,1} & \hat{X}_{1,2} & \hat{X}_{1,3} & \hat{X}_{1,4}\end{pmatrix}$, where each of its coordinates can be perceived as the $y$-s of a data point in the regression problem, the weights $(W_{1,1}, W_{1,2}, W_{1,3})$ can be viewed as regression weights, and the column vectors $\begin{pmatrix} X_{1,1} \\ X_{2,1} \\ X_{3,1} \end{pmatrix}$ can be viewed as regression factors for the current data point.
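As a sketch of this local least squares step (helper names are mine; I also impose the usual LLE sum-to-one constraint on each row of $W$, which the original method uses but which is not spelled out above), the weights of a single point can be obtained from the local Gram matrix of its neighbours:

```python
import numpy as np

def local_lle_weights(x_i, neighbors, reg=1e-3):
    """Weights w minimizing ||x_i - sum_j w_j * neighbors[j]||^2 subject to sum(w) = 1."""
    Z = neighbors - x_i                      # shift neighbours so that x_i is the origin
    C = Z @ Z.T                              # local Gram matrix, k x k
    C += reg * np.trace(C) * np.eye(len(C))  # small regularization in case C is singular
    w = np.linalg.solve(C, np.ones(len(C)))  # solve C w = 1
    return w / w.sum()                       # enforce the sum-to-one constraint

# toy usage with a hypothetical point and its 3 nearest neighbours
x_i = np.array([1.0, 2.0])
nbrs = np.array([[0.9, 2.1], [1.2, 1.8], [0.8, 1.9]])
w = local_lle_weights(x_i, nbrs)
print(w, np.allclose(x_i, w @ nbrs, atol=0.2))   # reconstruction is approximately x_i
```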

An alternative representation of our optimization problem (for a single embedding dimension $Y$) is $\Phi(Y) = \sum \limits_i (Y_i - \sum \limits_j W_{i,j}Y_j)^2 = Y^T (I-W)^T (I-W) Y$. Again, this is a quadratic form, which can be minimized by finding the eigenvectors and eigenvalues of the matrix $E = (I-W)^T (I-W)$:

$$E Y = Y \Lambda,$$

where $\Lambda$ is a diagonal matrix of eigenvalues and $Y$ is the matrix of eigenvectors. We might need double centering just as in MDS, which will lead us to a generalized EVD problem.
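Given the full weight matrix $W$, the embedding step is then an eigenvector problem for $E = (I - W)^T (I - W)$; here is a short sketch (again my own naming) following the standard LLE recipe of taking the eigenvectors with the smallest eigenvalues and discarding the constant one:

```python
import numpy as np

def lle_embedding(W, dim=2):
    """Embed points given an LLE weight matrix W whose rows sum to 1."""
    n = W.shape[0]
    E = (np.eye(n) - W).T @ (np.eye(n) - W)
    eigvals, eigvecs = np.linalg.eigh(E)   # eigenvalues in ascending order
    # eigvecs[:, 0] is (approximately) the constant vector with eigenvalue ~0; skip it
    return eigvecs[:, 1:dim + 1]
```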

Spectral embedding through Laplacian Eigenmaps

Finally, let us consider the Laplacian Eigenmaps algorithm as an implementation of spectral embedding.

It turns out that it draws connections to both LLE and the normalized cut algorithm, considered in my previous post.

Again, the algorithm consists of 3 steps, similar to those considered above:

  1. Find k nearest neighbours (or, alternatively, an $\epsilon$-neighbourhood) of each data point.

  2. Construct a weighted graph, so that the weight of each edge connecting neighbouring vertices is given by a heat kernel: $W_{i,j} = e^{-\frac{|| x_i - x_j ||^2}{t}}$. Alternatively, just assign weight 1 to neighbour edges and weight 0 to non-neighbour ones (which actually corresponds to $t = \infty$).

  3. Calculate the Laplacian matrix of this graph $L = D - W$ (where $D$ is, as usual, the diagonal matrix of vertex degrees) and solve the normalized Laplacian equation for eigenvectors and eigenvalues:

$$L f = \lambda D f$$

Again, the 0 eigenvalue corresponds to a trivial solution; the eigenvectors corresponding to the smallest non-zero eigenvalues give the embedding.
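A minimal sketch of these 3 steps (assuming scipy; helper names are mine, and I use the simple 0/1 neighbour weights, i.e. the $t = \infty$ variant):

```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist

def laplacian_eigenmaps(X, n_neighbors=10, dim=2):
    n = X.shape[0]
    D_euc = cdist(X, X)

    # Steps 1-2: symmetric kNN graph with 0/1 weights (the t = infinity variant)
    W = np.zeros((n, n))
    for i in range(n):
        nn = np.argsort(D_euc[i])[1:n_neighbors + 1]
        W[i, nn] = 1.0
    W = np.maximum(W, W.T)

    # Step 3: graph Laplacian and the generalized eigenproblem L f = lambda D f
    D = np.diag(W.sum(axis=1))
    L = D - W
    eigvals, eigvecs = eigh(L, D)          # generalized symmetric eigenproblem, ascending
    # skip the trivial constant eigenvector (eigenvalue 0)
    return eigvecs[:, 1:dim + 1]
```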

Justification of the algorithm

Representation as a trace of Laplacian

Suppose that you want to project the dataset onto a single line, so that points connected by heavy edges (large $W_{i,j}$) end up as close to each other on that line as possible. The way we find this projection is similar in spirit to a directional derivative: when you’re looking for a unit direction vector $\bf d$ which maximizes the derivative of a function $f()$ in that direction, i.e. maximizes $\langle {\bf d}, \nabla f \rangle$, the optimal $\bf d$ is proportional to $\nabla f$, because $\langle \nabla f, \nabla f \rangle$ is the maximal attainable value. Here, similarly, the optimum of a constrained quadratic form will be attained at an eigenvector.

So, let us write this mathematically: we’re looking for a vector ${\bf y} = (y_1, y_2, ..., y_n)^T$ minimizing $\langle W, Y^{(i-j)} \rangle$, where

$$Y^{(i-j)} = \begin{pmatrix} (y_{1} - y_{1})^2 & (y_{1} - y_{2})^2 & ... & (y_{1} - y_{n})^2 \\ (y_{2} - y_{1})^2 & (y_{2} - y_{2})^2 & ... & (y_{2} - y_{n})^2 \\ ... & ... & ... & ... \\ (y_{n} - y_{1})^2 & (y_{n} - y_{2})^2 & ... & (y_{n} - y_{n})^2 \end{pmatrix}$$

Rewriting this in one-line notation, we’re looking for ${\bf y} = (y_1, y_2, ..., y_n)^T$ such that ${\bf y^*} = \arg \min_{\bf y} \sum \limits_i \sum \limits_j W_{i,j} (y_i - y_j)^2$.

This is a minimization problem for a quadratic form with the graph Laplacian, subject to normalization constraints:

$${\bf y^*} = \arg \min_{\bf y} \sum \limits_i \sum \limits_j W_{i,j} (y_i - y_j)^2 = \arg \min \limits_{\substack{ {\bf y}^T D {\bf y} = 1 \\ {\bf y}^T D {\bf 1} = 0 } } {\bf y}^T L {\bf y}.$$

Let us show this fact:

$${\bf y}^T L {\bf y} = \begin{pmatrix} y_1 & y_2 & ... & y_n \end{pmatrix} \begin{pmatrix} \sum_i W_{1,i} - W_{1,1} & -W_{1,2} & ... & -W_{1,n} \\ -W_{2,1} & \sum_i W_{2,i} - W_{2,2} & ... & -W_{2,n} \\ ... & ... & ... & ... \\ -W_{n,1} & -W_{n,2} & ... & \sum_i W_{n,i} - W_{n,n} \end{pmatrix} \begin{pmatrix} y_1 \\ y_2 \\ ... \\ y_n \end{pmatrix} =$$

$$= \sum_i \Big( \sum_j W_{i,j} \Big) y_i^2 - \sum_{i,j} W_{i,j} y_i y_j = \frac{1}{2} \Big(\sum_i \sum_j W_{i,j} y_i^2 - 2 \sum_{i,j} W_{i,j} y_i y_j + \sum_i \sum_j W_{i,j} y_j^2 \Big) = \frac{1}{2} \sum_i \sum_j W_{i,j} (y_i - y_j)^2,$$

where we used the symmetry $W_{i,j} = W_{j,i}$, so that $\sum_i \sum_j W_{i,j} y_i^2 = \sum_i \sum_j W_{i,j} y_j^2$.
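A quick numeric check of this identity on a random symmetric weight matrix (toy data, numpy):

```python
import numpy as np

np.random.seed(3)
n = 6
W = np.random.rand(n, n)
W = (W + W.T) / 2                      # symmetric weights
L = np.diag(W.sum(axis=1)) - W         # graph Laplacian L = D - W
y = np.random.randn(n)

lhs = y @ L @ y
rhs = 0.5 * sum(W[i, j] * (y[i] - y[j]) ** 2 for i in range(n) for j in range(n))
print(np.isclose(lhs, rhs))            # True
```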

The conditions are the same as in the Normalized Cut algorithm; see my two previous posts on ECI and the final Normalized Cut part of NMF/k-means/Normalized Cut.

In a more general case, if we’re looking for a low-dimensional ($p$-dimensional) embedding $Y = \begin{pmatrix} y_{1,1} & y_{1,2} & ... & y_{1,p} \\ y_{2,1} & y_{2,2} & ... & y_{2,p} \\ ... & ... & ... & ... \\ y_{n,1} & y_{n,2} & ... & y_{n,p} \end{pmatrix}$ consistent with our weight matrix $W$, we come to:

$$\arg \min \limits_{Y^T D Y = I} Tr(Y^T L Y)$$

Again, this result was shown in the section of one of my previous posts, dedicated to the Normalized Cut algorithm.

Laplace-Beltrami operator on a manifold guarantees optimality of mapping

Interestingly, here algebraic graph theory connects with differential geometry.

Recall that the Laplacian of a graph bears this name for a reason: it is essentially the same Laplacian as in field theory (see my post on the Laplacian in field theory and this excellent post by Matthew Bernstein). In the case of functions defined on manifolds, it is generalized by the Laplace-Beltrami operator.

Suppose that our data points lie on a manifold. We shall be looking for a function $f: \mathcal{M} \to \mathbb{R}$, mapping points of our manifold to a line ($\mathbb{R}^1$). This approach is somewhat similar to incrementally finding more and more directions that maximize explained variance in data in PCA.

In the discrete graph case we were looking for vectors $\bf y$, analogous to $f$, such that the quantity $\sum \limits_{i} \sum \limits_{j} W_{i,j} (y_i - y_j)^2$ was minimized. Recall that the optimized quantity was alternatively represented as $\sum \limits_{i} \sum \limits_{j} W_{i,j} (y_i - y_j)^2 = 2 {\bf y}^T L {\bf y} = 2 {\bf y}^T V^T V {\bf y}$. Here $V$ is the incidence matrix of the graph, a discrete analogue of the gradient operator (again, see Matthew Bernstein's post), applied to the vector $\bf y$.

In the continuous case the counterpart of this expression is $\int_{\mathcal{M}} \langle \nabla f(x), \nabla f(x) \rangle = \int_{\mathcal{M}} ||\nabla f(x)||^2$. Here we replaced the discrete matrix $V$ with the continuous gradient, and the sum is replaced with an integral over the manifold.

Recall that the Laplacian is the divergence of the gradient by definition (again, see my post on the Laplacian in field theory); here we use the sign convention with a minus:

$$\mathcal{L}(f) = - \mathrm{div}\, \nabla f = - \nabla \cdot \nabla f$$

We shall establish that $\int_{\mathcal{M}} || \nabla f(x) ||^2 = \int_{\mathcal{M}} \mathcal{L}(f) f$.

This fact is shown in 3 steps (see this answer on StackOverflow). Consider a function $f$ and a vector field $X$:

  1. The multivariate/operator calculus version of the product rule: $\nabla \cdot (f X) = \langle \nabla f, X \rangle + f \nabla \cdot X$

  2. An application of the Gauss-Ostrogradsky theorem: $\int_{\mathcal{M}} \nabla \cdot (f X) dV = \int_{\partial \mathcal{M}} \langle f X, n \rangle dS$

  3. The manifold has no boundary (or the boundary term vanishes), hence $\int_{\partial \mathcal{M}} \langle f X, n \rangle dS = 0$

Integrating (1) over the manifold and applying (2) and (3), we get $\int_{\mathcal{M}} \langle \nabla f, X \rangle dV + \int_{\mathcal{M}} f \nabla \cdot X dV = 0$, i.e. $\int_{\mathcal{M}} \langle \nabla f, X \rangle dV = - \int_{\mathcal{M}} f \nabla \cdot X dV$.

Apply this to our previous line, taking $X = \nabla f$:

$$\int_{\mathcal{M}} || \nabla f(x) ||^2 = \int_{\mathcal{M}} \langle \nabla f, \nabla f \rangle = -\int_{\mathcal{M}} \mathrm{div}(\nabla f) \, f = \int_{\mathcal{M}} \mathcal{L}(f) f$$

Heat Kernel

The solution of our optimization problem draws close analogies with the diffusion/heat equation:

$$\frac{\partial u}{\partial t} = - \mathcal{L} u$$

This equation gives us an idea of how to approximate the Laplacian: the heat kernel in $\mathbb{R}^m$ is

$$H_t(x,y) = (4 \pi t)^{-\frac{m}{2}} e^{-\frac{||x - y||^2}{4t}}$$

and hence the Laplacian itself can be approximated as

$$\mathcal{L} f(x) \approx \frac{1}{t} \Big(f(x) - (4 \pi t)^{-\frac{m}{2}} \int_{\mathcal{M}} e^{- \frac{|| x - y ||^2}{4t} } f(y) dy \Big)$$

In the discrete case, this expression becomes:

$$\mathcal{L} f(x_i) \approx \frac{1}{t} \Big(f(x_i) - \frac{1}{k} (4 \pi t)^{-\frac{m}{2}} \sum \limits_{x_j: 0 < || x_i - x_j || < \epsilon} e^{ -\frac{||x_i - x_j||^2}{4t}} f(x_j)\Big)$$

The inherent dimensionality $m$ of $\mathcal{M}$ might be unknown, hence we put $\alpha = \frac{1}{k} (4 \pi t)^{-\frac{m}{2}}$.

Since the Laplacian of a constant function is 0, we can find $\alpha$ by plugging in $f \equiv 1$: $\alpha = \Big( \sum \limits_{x_j: 0 < || x_i - x_j || < \epsilon} e^{ -\frac{||x_i - x_j||^2}{4t}} \Big)^{-1}$.

This leads to several approximation schemes, and the authors of Laplacian eigenmaps use the following:

$$W_{i,j} = \begin{cases} e^{-\frac{|| x_i - x_j ||^2}{4t}}, & \text{if } || x_i - x_j || \le \epsilon \\ 0, & \text{otherwise} \end{cases}$$
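For completeness, a tiny sketch of this weighting scheme (the helper name is mine); it could replace the 0/1 weights in the Laplacian eigenmaps sketch above:

```python
import numpy as np
from scipy.spatial.distance import cdist

def heat_kernel_weights(X, eps, t):
    """W_ij = exp(-||x_i - x_j||^2 / (4t)) if ||x_i - x_j|| <= eps, else 0."""
    D = cdist(X, X)
    W = np.exp(-D ** 2 / (4 * t))
    W[D > eps] = 0.0
    np.fill_diagonal(W, 0.0)   # no self-loops
    return W
```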

Connection to Normalized Cut

Please refer to my previous posts on Normalized Cut and the Economic Complexity Index, where the Normalized Cut algorithm is derived and explained. The conditions of Laplacian Eigenmaps are identical to the conditions of Normalized Cut.

Connection to LLE

This one is interesting. Recall the matrix used in the LLE embedding step:

$$E = (I - W)^T (I - W)$$

It turns out that $E f \approx \frac{1}{2} \mathcal{L}^2 f$.

Hence, the eigenvectors of $E$ coincide with the eigenvectors of the Laplacian $\mathcal{L}$. Let us prove this.

Step 1. Hessian approximation

We will show that the $i$-th component of $(I-W) f$ satisfies

$$[(I-W) f]_i \approx -\frac{1}{2} \sum \limits_j W_{i,j} (x_i - x_j)^T H (x_i - x_j),$$

where $H$ is the Hessian of $f$ at $x_i$.

Let us use the Taylor series approximation of our function $f()$ at each of the neighbouring points $x_j$ near the point $x_i$:

$$f(x_j) = f(x_i) + \langle \nabla f(x_i), (x_j - x_i) \rangle + \frac{1}{2} (x_i - x_j)^T H (x_i - x_j) + o(||x_i - x_j||^2)$$

At the same time, the $i$-th component of $(I-W) f$ equals $f(x_i) - \sum \limits_j W_{i,j} f(x_j)$. Substitute the Taylor series approximation of $f(x_j)$ into it:

$$f(x_i) - \sum \limits_j W_{i,j} f(x_j) \approx f(x_i) - \sum \limits_j W_{i,j} f(x_i) - \sum \limits_j W_{i,j} \langle \nabla f(x_i), (x_j - x_i) \rangle - \frac{1}{2} \sum \limits_j W_{i,j} (x_i - x_j)^T H (x_i - x_j) + o(||x_i - x_j||^2)$$

First, recall that $\sum \limits_j W_{i,j} = 1$, so $\sum \limits_j W_{i,j} f(x_i) = f(x_i)$.

Second, $\sum \limits_j W_{i,j} (x_j - x_i) \approx 0$, because $\sum \limits_j W_{i,j} x_i = x_i$ and $\sum \limits_j W_{i,j} x_j \approx x_i$.

So, $\sum \limits_j W_{i,j} \langle \nabla f(x_i), (x_j - x_i) \rangle = \langle \nabla f(x_i), \sum \limits_j W_{i,j} (x_j - x_i) \rangle \approx \langle \nabla f(x_i), 0 \rangle = 0$.

Hence, $f(x_i) - \sum \limits_j W_{i,j} f(x_j) \approx \cancel{f(x_i)} - \cancel{f(x_i)} - \cancel{ \sum \limits_j W_{i,j} \langle \nabla f(x_i), (x_j - x_i) \rangle } - \frac{1}{2} \sum \limits_j W_{i,j} (x_i - x_j)^T H (x_i - x_j) + o(||x_i - x_j||^2) \approx$

$$\approx - \frac{1}{2} \sum \limits_j W_{i,j} (x_i - x_j)^T H (x_i - x_j).$$

Step 2. Laplacian is a trace of Hessian

$$\sum \limits_j W_{i,j} (x_i - x_j)^T H (x_i - x_j) \propto Tr(H) \propto \mathcal{L} f$$

Denote the eigenvectors of the Hessian ${\bf e_k}$ and the corresponding eigenvalues $\lambda_k$.

$$(x_i - x_j) = \sum \limits_k \langle {\bf e_k}, (x_i - x_j) \rangle {\bf e_k}$$

Hence, $(x_i - x_j)^T H (x_i - x_j) = \sum \limits_k \langle {\bf e_k}, (x_i - x_j) \rangle^2 \lambda_k \underbrace{||{\bf e_k}||^2}_{=1} = \sum \limits_k \langle {\bf e_k}, (x_i - x_j) \rangle^2 \lambda_k$.

Under the assumption that the points $x_j$ are sampled isotropically around $x_i$, the expectation $\mathbb{E} \langle {\bf e_k}, (x_i - x_j) \rangle^2 = r$ is the same for every direction ${\bf e_k}$, and hence:

$$\mathbb{E} \, (x_i - x_j)^T H (x_i - x_j) = r \sum \limits_k \lambda_k = r \cdot Tr(H) = r \cdot \mathcal{L} f$$

(up to the sign convention chosen for $\mathcal{L}$).

Hence, $\sum \limits_j W_{i,j} (x_i - x_j)^T H (x_i - x_j) \approx \mathbb{E} \, (x_i - x_j)^T H (x_i - x_j) = r \cdot Tr(H) = r \cdot \mathcal{L} f$.

Step 3. Apply the results of steps 1 and 2

$$(I - W)^T (I - W) f \approx \frac{1}{2} \mathcal{L}^2 f$$

$$f^T (I - W)^T (I - W) f \approx \frac{1}{2} f^T \mathcal{L}^2 f = \frac{1}{2} (\mathcal{L} f)^T (\mathcal{L} f)$$
