cover

Correspondence between symmetric NMF, k-means, biclustering and spectral clustering

August 31, 2022 42 min read

Non-negative matrix factorization (NMF) is the most fashionable method of matrix decomposition among productive, but mathematically illiterate biology students, much more popular than PCA due to its perceived simplicity. However, if you start digging deeper, it happens to have profound connections to a multitude of other techniques from linear algebra and matrix analysis. In this post I discuss those connections.

I am following the general logic of this paper, including the numeration of chapters, in this post. However, I will provide lots of my own remarks, as I spent several months, working with objects, discussed in this post and “rehydrate” this very clear, but somewhat condensed 4.5-page paper into a much larger blog post.

In this post I will present 4 seemingly different algorithms: symmetric NMF, k-means, biclustering and spectral clustering via normalized cuts method. I will establish the correspondence between all 4 methods and show that essentially they represent variations of the same mathematics.

1. Introduction to NMF and k-means

NMF and recommender systems

Consider an n×pn \times p matrix V=(v1,1v1,2v1,3v1,p............vn,1vn,2vn,3vn,p)V = \begin{pmatrix} v_{1,1} && v_{1,2} && v_{1,3} && v_{1,p} \\ ... && ... && ... && ... \\ v_{n,1} && v_{n,2} && v_{n,3} && v_{n,p} \end{pmatrix}.

Oftentimes you would want to approximate it by a low-rank matrix V^\hat{V}. E.g. suppose that “low-rank” means rank kk:

V^=(w1,1w1,2w1,n)(h1,1h1,2h1,p)+...+(wk,1wk,2wk,n)(h1,1h2,1hp,1)\hat{V} = \begin{pmatrix}w_{1,1} \\ w_{1,2} \\ w_{1,n}\end{pmatrix} \cdot \begin{pmatrix}h_{1,1} && h_{1,2} && h_{1,p}\end{pmatrix} + ... + \begin{pmatrix}w_{k,1} \\ w_{k,2} \\ w_{k,n}\end{pmatrix} \cdot \begin{pmatrix}h_{1,1} && h_{2,1} && h_{p,1}\end{pmatrix}

Here I use outer product notation to represent V^\hat{V} as a product of two rectangular matrices: n×kn \times k matrix WW and k×pk \times p matrix HH:

V^=WH=(w1,1w1,kw2,1w2,kwn,1wn,k)(h1,1h1,2h1,phk,1hk,2hk,p)\hat{V} = W \cdot H = \begin{pmatrix} w_{1,1} && w_{1,k} \\ w_{2,1} && w_{2,k} \\ w_{n,1} && w_{n,k} \end{pmatrix} \cdot \begin{pmatrix} h_{1,1} && h_{1,2} && h_{1,p} \\ h_{k,1} && h_{k,2} && h_{k,p} \end{pmatrix}

Since the Netflix challenge, this approach was very popular in the recommender systems. If VV matrix represents nn users, who assigned ratings to pp movies, then WW matrix represents the characterization of each of the nn users in terms of kk psychological traits (e.g. how much the user likes humour, violence, sentimentality etc.) and HH matrix characterizes each of pp movies in terms of these kk scales - how humorous, violent, sentimental etc. they are.

Then an obvious application of NMF is data imputation - most of the user reviews are missing, hence, we need to predict user ratings for movies they haven’t seen and suggest the movies with the highest expected ratings.

How would we search for NMF in practice? We would require Frobenius norm of the difference between NMF and the original matrix VV to be as small as possible:

minW,H0VWHF=i=1nj=1p(vi,jwi,k,hk,j)2\min \limits_{W,H \ge 0} ||V - WH ||_F = \sum \limits_{i=1}^n \sum \limits_{j=1}^p (v_{i,j} - \langle {\bf w_{i,k}}, {\bf h_{k,j}} \rangle)^2

Frobenius norm is just a square root of sum of squares of elements of matrix (if we considered the matrix a regular vector, it would’be been its L2 norm). Thus it could be interpreted as the distance between matrices.

However, Frobenius norm is a good measure for the quality of our approximation not only because of its intuitiveness. A family of theorems in matrix analysis (such as Hoffman-Wielandt inequality) show that if matrix approximation converges to the true matrix in terms of Frobenius norm, then so do their eigenvalues and eigenvectors.

Another important way of representing the Frobenius norm is through the trace of Gram matrix, created out of the initial matrix: VWHF=Tr(VWH)T(VWH)||V - WH||_F = \sqrt{Tr{(V - WH)^T (V-WH)}}. We will heavily rely on this representation in this post.

An alternative option for the measure of quality of approximation is Kullback-Leibler divergence D(VWH)D(V || W H), which in matrix case evaluates to D(AB)=i,j(Ai,jlogAi,jBi,jAi,j+Bi,j)D(A||B) = \sum \limits_{i,j} (A_{i,j} \log \frac{A_{i,j}}{B_{i,j}} - A_{i,j} + B_{i,j}). In this case NMF bears similarities to various Bayesian methods, such as Probabilistic Latent Semantic Analysis (PLSA) method. However, I won’t delve into Kullback-Liebler-minimizing approach and will focus on the Frobenius norm as approximation quality metric.

NMF solver

One of the reasons of NMF popularity is its simple iterative solver.

In a series of alternating steps it updates the values of WW and HH matrices using the following update rules:

WW(VHT)(WHHT)W \leftarrow W \cdot \frac{(VH^T)}{(WHH^T)}

HH(WTV)(WTWH)H \leftarrow H \cdot \frac{(W^TV)}{(W^TWH)}

Let us derive this formula.

We want to solve the following optimization problem:

minW0,H0VWHF\min \limits_{W \ge 0, H \ge 0}|| V - WH ||_F

Or, equivalently, expressing Frobenius norm through trace notation, we have:

W,H=argminW0,H0Tr((VWH)T(VWH))=argminW0,H0Tr((VWH)T(VWH))W^*, H^* = \arg \min \limits_{W \ge 0, H \ge 0} \sqrt{Tr((V - WH)^T(V - WH))} = \arg \min \limits_{W \ge 0, H \ge 0} Tr((V - WH)^T(V - WH))

This is a minimization problem which can be solved through taking a derivative of a scalar function with respect to a matrix argument (the concept and mechanics of derivative of a scalar function with respect to a matrix argument is explained in detail here and here).

We will be doing a gradient descent, iteratively decreasing Frobenius norm with respect to WW matrix with HH fixed every odd step and with respect to HH matrix with WW matrix fixed every even step:

WWηWWf(W,H)W \leftarrow W - \eta_W \cdot \nabla_W f(W,H)

HHηHHf(W,H)H \leftarrow H - \eta_H \cdot \nabla_H f(W,H)

Here ηW\eta_W and ηH\eta_H are gradient step size.

Now let us decompose our trace into 4 terms and take its derivative WTr((VWH)T(VWH))\nabla_W Tr((V - WH)^T(V - WH)) and HTr((VWH)T(VWH))\nabla_H Tr((V - WH)^T(V - WH)) with respect to matrices WW and HH:

Tr((VTHTWT)(VWH))=Tr(VTVVTWHHTWTV+HTWWWH)=Tr((V^T - H^TW^T)(V-WH) ) = Tr( V^TV - V^TWH - H^TW^TV + H^TW^WWH) =

=Tr(VTV)Tr(VTWH)Tr(HTWTV)+Tr(HTWTWH) = Tr( V^TV) - Tr(V^TWH) - Tr(H^TW^TV) + Tr(H^TW^TWH)

Taking derivative of each term separately:

WTr(VTV)=0\nabla_W Tr( V^TV) = 0

WTr(VTWH)=WTr(HVTW)=(HVT)T=VHT\nabla_W Tr(V^TWH) = \nabla_W Tr(HV^T W) = (H V^T)^T = VH^T (using the fact that WTr(AW)=AT\nabla_W Tr(AW) = A^T)

WTr(HTWTV)=WTr(WTVHT)=VHT\nabla_W Tr(H^TW^TV) = \nabla_W Tr(W^T V H^T) = VH^T (using the fact that WTr(WTA)=A\nabla_W Tr(W^TA) = A)

WTr(HTWTWH)=Tr(HHTWTW)=2WHHT\nabla_W Tr(H^TW^TWH) = \nabla Tr(HH^TW^TW) = 2 W HH^T (using the fact that WTr(WHHTWT)=W(HHT+HHT)\nabla_W Tr(W HH^T W^T) = W (HH^T + HH^T))

Aggregating the results:

WTr((VWH)T(VWH))=0VHTVHT+2WHHT=2(WHHTVHT)=2(WHV)HT\nabla_W Tr((V - WH)^T(V - WH)) = 0 - VH^T - VH^T + 2 W HH^T = 2 (W HH^T - VH^T) = 2 (WH - V) H^T

And carrying out similar calculations for HH we get:

HTr((VWH)T(VWH))=2WTV+2WTWH=2WT(WHV)\nabla_H Tr((V - WH)^T(V - WH)) = -2 W^T V + 2 W^TW H = 2 W^T (WH - V)

Substituting this back into the original update formula (ignoring “2”, as it can be included into the gradient step):

WWηW(WHV)HTW \leftarrow W - \eta_W \cdot (WH - V) H^T

HHηHWT(WHV)H \leftarrow H - \eta_H \cdot W^T (WH - V)

Now the authors of the original paper suggest to set specific values of gradient update steps:

ηW=WWHHT\eta_W = \frac{W}{WHH^T}

ηH=HWTWH\eta_H = \frac{H}{W^TWH}

This leads us to the declared algorithm:

WWWWHHT(WHV)HT=WWHHT(WHHTWHHT+VHT)=WVHTWHHTW \leftarrow W - \frac{W}{W H H^T} (WH - V)H^T = \frac{W}{WHH^T} \cdot (\cancel{WHH^T} - \cancel{WHH^T} + VH^T) = W \cdot \frac{VH^T}{WHH^T}

HHHWTWHWT(WHV)=HWTWWTWHH \leftarrow H - \frac{H}{W^T W H} W^T (WH - V) = H \cdot \frac{W^TW}{W^TWH}

Here is a python implementation:

from typing import Tuple

import numpy as np
import numpy.typing as npt


class ConvergenceException(Exception):
    def __init__(self):
        Exception.__init__(self, "Iterations limit exceeded; algorithm failed to converge!")


def nmf(V: npt.NDArray, k: int, tol: float = 1e-4, max_iter: int=100) -> Tuple[npt.NDArray, npt.NDArray]:
    """An NMF  solver implementation which approximates an n x p input matrix V
    as a product W x H of n x k matrix W and k x p matrix H. This solver
    minimizes Frobenius norm of the difference between V and W H and implements
    an algorithm, described by Lee and Seung (2000) in:
    https://proceedings.neurips.cc/paper/2000/file/f9d1152547c0bde01830b7e8bd60024c-Paper.pdf.

    :param V: n x p matrix to be approximated with NMF low-rank approximation
    :param k: rank of low-rank approximation
    :param tol: threshold value of decrease in energy function between two consecutive iterations for the algorithm to
     be considered converged
    :param max_iter: maximum number of iterations allowed; if reached before convergence, ConvergenceException is thrown
    :return: 2-tuple of two matrices:
     - W - n x k matrix
     - H - k x p matrix
    """
    n = V.shape[0]
    p = V.shape[1]

    previous_energy = np.inf
    next_energy = np.inf

    iterations = 0

    W = np.ones((n, k))
    H = np.ones((k, p))

    while (previous_energy - next_energy > tol) or next_energy == np.inf:
        W_update = np.dot(V, H.T) / np.dot(W, np.dot(H, H.T))
        W = W * W_update

        H_update = np.dot(W.T, V) / np.dot(W.T, np.dot(W, H))
        H = H * H_update

        if iterations > max_iter:
            raise ConvergenceException
        else:
            iterations += 1
            previous_energy = next_energy
            next_energy = np.linalg.norm(V - np.dot(W, H), 'fro')  # calculate Frobenius norm

    return W, H

If we run this code, the approximation is very decent:

import unittest
import numpy as np

from ..biclustering_lasso import nmf


class NMFTestCase(unittest.TestCase):
    """
    python3 -m unittest biclustering_lasso.tests.test_nmf.NMFTestCase
    """
    def test_nmf(self):
        V = np.array([[2.1, 0.4, 1.2, 0.3, 1.1],
                      [2.1, 0.7, 2.3, 0.4, 2.2],
                      [2.4, 0.5, 3.2, 0.7, 3.3]])

        W, H = nmf(V, k=2)

        print(f"W = {W}")
        print(f"H = {H}")
>>> W
array([[0.49893966, 0.49893966],
       [0.76064226, 0.76064226],
       [1.02271335, 1.02271335]])

>>> H
array([[1.36102025, 0.33184111, 1.50013542, 0.31221326, 1.49381373],
       [1.36102025, 0.33184111, 1.50013542, 0.31221326, 1.49381373]])

>>> np.dot(W, H)
array([[1.35813395, 0.33113738, 1.49695411, 0.31155116, 1.49064583],
       [2.07049903, 0.50482475, 2.2821328 , 0.47496521, 2.27251571],
       [2.78386716, 0.67875667, 3.06841706, 0.63860935, 3.05548651]])

>>> V
array([[2.1, 0.4, 1.2, 0.3, 1.1],
       [2.1, 0.7, 2.3, 0.4, 2.2],
       [2.4, 0.5, 3.2, 0.7, 3.3]])

As you can see, np.dot(W, H) approximates V very well. And due to the fact that it is so easy to implement, it is probably the favourite algorithm of your favourite bioinformatics student:

Your favourite bioinformatics student

Your favourite bioinformatics student. Cause doing things fast is what truly matters, right?

NMF as a special case of k-means

A less obvious application of NMF is data clustering. Turns out, k-means clustering is a special case of NMF.

Suppose that we have identified kk clusters CkC_k among our data points vi\bf v_i.

We can interpret the rows hk{\bf h_k} of matrix HH as centroids of our clusters: hk=iCkvi{\bf h_k} = \sum_{i \in C_k }{\bf v_i}

Then matrix WW can be made orthogonal non-negative, representing attribution of data points to clusters. E.g.:

V^=WH=(10012012)(h1,1h1,2h1,phk,1hk,2hk,p)\hat{V} = W \cdot H = \begin{pmatrix} 1 && 0 \\ 0 && \frac{1}{\sqrt{2}} \\ 0 && \frac{1}{\sqrt{2}} \end{pmatrix} \cdot \begin{pmatrix} h_{1,1} && h_{1,2} && h_{1,p} \\ h_{k,1} && h_{k,2} && h_{k,p} \end{pmatrix}

Here matrix WW describes two clusters, the first contains data point {v1\bf v_1} and the second - {v2\bf v_2, v3\bf v_3}. If all the coordinates of the data are non-negative, this means that coordinates of all the cluster centroids hi\bf h_i are non-negative as well. WW is non-negative, too. So, all the requirements of NMF are satisfied.

k-means

k-means clustering. Here we see 3 clusters. Data points are depicted with rectangles, cluster centroids are depicted with circles.

k-means solution with EM-like algorithm

In practice, k-means can be solved with a two-step iterative EM-like algorithm.

Initialize cluster centroids with random values (obviously, we can get smarter with initialization, but even random will do for now).

Each iteration of EM algorithm consists of two steps:

  • E-step: assign each data point to the cluster with the nearest centroid
  • M-step: re-calculate the coordinates of each centroid as a center of mass of the data points, which belong to its cluster

This algorithm converges, because there exists a non-negative monotonically decreasing (non-increasing) “energy” function, which decreases at each iteration, both on E-step and M-step.

The exact choice of the energy function could vary, and depending on the one selected, the algorithm takes new interesting interpretations. For instance, NMF with Kullback-Leibler divergence as energy function results in intepretaiton of NMF as probabilistic latent semantic analysi (PLSA) algorithm. We are going to use Frobenius norm as the energy function, which results in a multitude of truncated PCA/biclustering/spectral clustering interpretations.

Here is an implementation of k-means (with k-means++ initialization):

import math
import random
from typing import Tuple, Union

import numpy as np
import numpy.typing as npt


class ConvergenceException(Exception):
    def __init__(self):
        Exception.__init__(self, "Iterations limit exceeded; algorithm failed to converge!")


def k_means_clustering(
        X: npt.NDArray,
        k: int,
        tol: float = 1e-5,
        max_iter: int = 100,
        energy: Union['KL', 'fro'] = 'fro'
) -> Tuple[npt.NDArray, npt.NDArray]:
    """A minimal implementation of k-means clustering algorithm.

    :param X: n x p data matrix, where each point of data is a p-dimensional np.array
    :param k: number of cluster centroids, we aim to find
    :param tol: tolerance in energy; stop and return result, if the decrease in energy between 2 steps is below tol
    :param max_iter: maximum number of iterations of the algorithm allowed; abort with ConvergenceException if exceeded
    :param energy: monotonically non-increasing energy function to calculate (options: Kullback-Leibler, Frobenius norm)
    :return: (centroids, clusters) - a 2-Tuple of 2 matrices:
     - centroids - k x p matrix of cluster centroids
     - clusters - n x k indicator vectors, which define a set of data points, which belongs to each cluster
    """
    n = X.shape[0]
    p = X.shape[1]

    centroids = init_centroids(X, k)
    clusters = np.empty(shape=(n, k))

    iterations = 0
    next_energy = np.Inf
    previous_energy = np.Inf
    while not (previous_energy - next_energy < tol):
        clusters = e_step(X, centroids)
        centroids = m_step(X, centroids, clusters)

        if iterations > max_iter:
            raise ConvergenceException
        else:
            iterations += 1
            previous_energy = next_energy
            next_energy = calculate_energy(X, centroids, clusters, energy)

    return centroids, clusters


def init_centroids(X, k) -> npt.NDArray:
    """Initialization procedure for settings the initial locations of
    centroids, based on k-means++ (2006-2007) algorithm:
    https://en.wikipedia.org/wiki/K-means%2B%2B.
    """
    n = X.shape[0]
    p = X.shape[1]
    centroids = np.zeros((1, p))
    random_point_index = random.randrange(n)
    np.copyto(centroids[0], X[random_point_index])  # use a random row of X matrix as a centroid
    for _ in range(k-1):
        # find closest centroid to each data point
        clusters = e_step(X, centroids)

        # construct probability distribution of selecting data points as centroids
        distribution = np.zeros(n)
        for index, data_point in enumerate(X):
            # find the coordinate of 1 in clusters[index] - it will be the index of centroid
            nearest_centroid_index = np.argmax(clusters[index])  # finds the location of 1
            nearest_centroid = centroids[nearest_centroid_index]

            # probability of a point being selected as a new centroid is ~ square distance D(point, centroid)
            distribution[index] = np.dot(data_point - nearest_centroid, data_point - nearest_centroid)

        # pick a data point to be the next centroid at random
        new_centroid = random.choices(X, weights=distribution)
        centroids = np.vstack([centroids, new_centroid])

    return centroids


def e_step(X, centroids):
    """Assign each data point to a cluster with the nearest centroid."""
    clusters = np.zeros(shape=(X.shape[0], centroids.shape[0]))

    for data_point_index, data_point in enumerate(X):
        nearest_centroid_index = 0
        shortest_distance_to_centroid = np.infty
        for index, centroid in enumerate(centroids):
            direction = data_point - centroid
            distance = math.sqrt(np.dot(direction, direction))
            if distance < shortest_distance_to_centroid:
                shortest_distance_to_centroid = distance
                nearest_centroid_index = index

        clusters[data_point_index][nearest_centroid_index] = 1

    return clusters


def m_step(X, centroids, clusters):
    """Re-calculate new centroids based on the updated clusters."""
    # divide each cluster element by the number of elements in it for averaging (e.g. [0, 1, 1] -> [0, 0.5, 0.5])
    normalized_clusters = clusters.T
    for index, cluster in enumerate(normalized_clusters):
        np.copyto(normalized_clusters[index], cluster / cluster.sum())
    normalized_clusters = normalized_clusters.T

    # move each centroid to the center of mass of its respective cluster
    new_centroids = np.dot(X.T, normalized_clusters).T

    return new_centroids


def calculate_energy(X, centroids, clusters, energy: Union['KL', 'fro']) -> float:
    """Implementation of several energy functions calculation."""
    if energy == 'fro':
        result = np.linalg.norm(X - np.dot(clusters, centroids), 'fro')
    elif energy == 'KL':
        result = 0  # TODO
    else:
        raise ValueError(f"Undefined energy function type '{energy}'")

    return result

Let us test our implementation:

import unittest
import numpy as np

from ..biclustering_lasso import k_means_clustering


class KMeansTestCase(unittest.TestCase):
    """
    python3 -m unittest biclustering_lasso.tests.test_nmf.NMFTestCase
    """
    def test_k_means(self):
        X = np.array([[2.1, 0.4, 1.2, 0.3, 1.1],
                      [2.1, 0.7, 2.3, 0.4, 2.2],
                      [2.4, 0.5, 3.2, 0.7, 3.3]])

        centroids, clusters = k_means_clustering(X, k=2)

        print(f"centroids = {centroids}")
        print(f"clusters = {clusters}")
>>> centroids
array([[2.1 , 0.4 , 1.2 , 0.3 , 1.1 ],
       [2.25, 0.6 , 2.75, 0.55, 2.75]])

>>> clusters
array([[1. , 0. ],
       [0. , 0.5],
       [0. , 0.5]])

>>> np.dot(clusters, centroids)
array([[2.1  , 0.4  , 1.2  , 0.3  , 1.1  ],
       [1.125, 0.3  , 1.375, 0.275, 1.375],
       [1.125, 0.3  , 1.375, 0.275, 1.375]])

>>> X
array([[2.1, 0.4, 1.2, 0.3, 1.1],
       [2.1, 0.7, 2.3, 0.4, 2.2],
       [2.4, 0.5, 3.2, 0.7, 3.3]])

Again, we get a decent low-rank approximation of X matrix by k-means cluster np.dot(clusters, centroids).

Symmetric NMF

NMF is not unique. For instance, you could obviously insert a matrix and its inverse in between WW and HH:

V=WH=WBB1H=W^H^V = WH = W B B^{-1} H = \hat{W} \hat{H}.

Hence, there are plenty of options for additional constraints on WW and HH.

You could demand that one of them, WW or HH is orthogonal. This special case leads to k-means interpretation (where the orthogonal matrix represents clusters and the other one - coordinates of centroids).

Or you can demand that NMF is symmetric and W=HTW = H^T. This special case is called symmetric NMF and will be considered further.

2. Symmetric NMF interpretation through k-means

It turns out that symmetric NMF algorithm can be interpreted as a variation of “soft” k-means clustering.

In “hard” k-means clustering each data point belongs to strictly one cluster. In “soft” clustering every point is assigned weights, which show how close (or likely to belong) it is to each cluster.

In “hard” clustering clusters matrix is orthogonal. As we’ll see further in this post, in “soft” clustering weights matrix is not orthogonal, but tends to be approximately orthogonal.

In order to establish correspondence between symmetric NMF and k-means I will first need a technical lemma, which shows that minimum of Frobenius norm of error of rank-11 matrix approximation coincides with the maximum of quadratic form of the approximated matrix. Moreover, in case of rank-kk matrix approximation via V^=WH\hat{V} = W \cdot H, minimum of Frobenius norm of error VV^F||V - \hat{V}||_F corresponds to a trace Tr(HVW)Tr(H V W).

This fact will prove useful for all the 4 algorithms, described in this post.

Lemma 2.0. Minimum of Frobenius norm corresponds to the maximum of Rayleigh quotient

Assume that matrices WW and HTH^T are orthogonal, i.e.

wi,wj={0,ij1,i=j\langle {\bf w_i}, {\bf w_j} \rangle = \begin{cases}0, i \ne j \\ 1, i = j \end{cases}

hi,hj={0,ij1,i=j\langle {\bf h_i}, {\bf h_j} \rangle = \begin{cases}0, i \ne j \\ 1, i = j \end{cases}

Then the following holds:

W,H=argminW,HVWHF=argmaxW,HTr(HVW)W^*, H^* = \arg \min \limits_{W, H} || V - WH ||_F = \arg \max \limits_{W, H} Tr (H V W),

or, in case of rank k=1k=1, and single vectors w\bf w and h\bf h we get:

w,h=argminw,hVwhTF=argmaxw,hhTVw{\bf w}^*, {\bf h}^* = \arg \min \limits_{ {\bf w}, {\bf h} } || V - {\bf w} {\bf h}^T ||_F = \arg \max \limits_{ {\bf w}, {\bf h} } {\bf h}^T V {\bf w},

Let us show this result for the rank k=1k=1 case (a pair of vectors) and then generalize to arbitrary rank.

VwhTF=Tr(VwhT)T(VwhT)=Tr(VTV)constTr(VTwhT)Tr(hTVTw)=Tr(wTVh)Tr(hwTV)Tr(wTVh)+Tr(hwTwhT)=wTwTr(hhT)=w2h2=1|| V - {\bf w}{\bf h}^T ||_F = Tr (V - {\bf w}{\bf h}^T)^T (V - {\bf w}{\bf h}^T) = \underbrace{Tr (V^T V)}_\text{const} - \underbrace{Tr(V^T {\bf w}{\bf h}^T)}_{Tr({\bf h}^T V^T {\bf w}) = Tr({\bf w}^T V {\bf h}) } - \underbrace{Tr({\bf h} {\bf w}^T V )}_{Tr({\bf w}^T V {\bf h})} + \underbrace{ Tr({\bf h} {\bf w}^T {\bf w} {\bf h}^T) }_{= {\bf w}^T{\bf w} \cdot Tr({\bf h}{\bf h}^T) = ||w||_2 \cdot ||h||_2 = 1}.

Hence, VwhTF|| V - {\bf w}{\bf h}^T ||_F attains its minimum when 2Tr(wTVh)- 2 \cdot Tr({\bf w}^T V {\bf h}) attains its minimum or Tr(wTVh)Tr({\bf w}^T V {\bf h}) attains its maximum.

Now consider k=2k=2 case (k=3,4,...k=3, 4, ... etc. works similarly):

VWHF=Tr((VWH)T(VWH))=Tr(VTV)constTr(VTWH)Tr(HTWTV)+Tr(HTWTWH)const=2Tr(WTVHT)+const|| V - WH ||_F = Tr ((V - WH)^T (V - WH)) = \underbrace{Tr (V^T V)}_\text{const} - Tr (V^T WH) - Tr(H^T W^T V) + \underbrace{Tr(H^T W^T W H)}_{const} = - 2 Tr(W^T V H^T) + const.

Lemma 2.1. Symmetric NMF is equivalent to kernel K-means clustering

Consider a special case of NMF, when W=HTW = H^T, called symmetric NMF. In this case we’re looking for an approximation:

V=HTHV = H^T H

And the goal is to find H=argminH:H0VHTHFH^* = \arg \min \limits_{H: H \ge 0} ||V - H^T H ||_F.

If we assume that HH is orthogonal (HTH=IH^T H = I, where II is identity matrix), then by Lemma 2.0 we immediately see that:

H=argminH:H0,HTH=IVHTHF=argmaxH:H0,HTH=IHVHTH^* = \arg \min \limits_{H: H \ge 0, H^T H = I} ||V - H^T H ||_F = \arg \max \limits_{H: H \ge 0, H^T H = I} H V H^T.

Lemma 2.2. Symmetric NMF matrix is near orthogonal

Interestingly, we might not need to explicitly impose the requirement that HTH=IH^T H = I, yet still HH turns out to be approximately orthogonal, if we find it through a symmetric NMF.

H=argminHVHTHFH^* = \arg \min \limits_{H}||V - H^T H||_F

VHTHF=Tr(VHTH)T(VHTH)=Tr(VTV)constTr((HTH)TV)Tr(VT(HTH))+Tr(HTHHTH)=const2Tr((HTH)TV)+Tr(HTHHTH)=const2Tr(HVHT)+HHTF||V - H^T H||_F = Tr (V - H^T H)^T(V - H^T H) = \underbrace{Tr(V^T V)}_\text{const} - Tr((H^T H)^T V) - Tr(V^T (H^T H)) + Tr(H^T H H^T H) = const -2 Tr((H^T H)^T V) + Tr(H^T H H^T H) = const -2 Tr(HVH^T) + || H H^T ||_F

Here the term Tr(HVHT)Tr(HVH^T) corresponds to the minimization problem, similar to k-means. The term HHTF|| H H^T ||_F works almost like a constraint. Let’s focus on it:

HHT=(h1,1h1,2h1,3h1,4h2,1h2,2h2,3h2,4)(h1,1h2,1h1,2h2,2h1,3h2,3h1,4h2,4)=(h1,h1h1,h2h2,h1h2,h2)H H^T = \begin{pmatrix} h_{1,1} && h_{1,2} && h_{1,3} && h_{1,4} \\ h_{2,1} && h_{2,2} && h_{2,3} && h_{2,4} \end{pmatrix} \cdot \begin{pmatrix} h_{1,1} && h_{2,1} \\ h_{1,2} && h_{2,2} \\ h_{1,3} && h_{2,3} \\ h_{1,4} && h_{2,4} \end{pmatrix} = \begin{pmatrix} \langle {\bf h_1}, {\bf h_1} \rangle && \langle {\bf h_1}, {\bf h_2} \rangle \\ \langle {\bf h_2}, {\bf h_1} \rangle && \langle {\bf h_2}, {\bf h_2} \rangle \end{pmatrix},

where e.g. h1=(h1,1h1,2h1,3h1,4){\bf h_1} = \begin{pmatrix} h_{1,1} \\ h_{1,2} \\ h_{1,3} \\ h_{1,4} \end{pmatrix}.

From the definition of Frobenius norm then HHTF=h1,h12+h1,h22+...+h2,h12+h2,h22+...+h4,h42=k=1pl=1nhk,hl2|| H H^T ||_F = \langle {\bf h_1}, {\bf h_1} \rangle^2 + \langle {\bf h_1}, {\bf h_2} \rangle^2 + ... + \langle {\bf h_2}, {\bf h_1} \rangle^2 + \langle {\bf h_2}, {\bf h_2} \rangle^2 + ... + \langle {\bf h_4}, {\bf h_4} \rangle^2 = \sum \limits_{k=1}^p \sum \limits_{l=1}^n \langle {\bf h_k}, {\bf h_l} \rangle^2.

Let us split this sum into two parts: diagonal elements and non-diagonal elements separately:

HHTF=lkhl,hk2+l=khk,hk2|| H H^T ||_F = \sum \limits_{l \ne k} \langle {\bf h_l}, {\bf h_k} \rangle^2 + \sum \limits_{l=k} \langle {\bf h_k}, {\bf h_k} \rangle^2

Minimization of the term lkhl,hk2\sum \limits_{l \ne k} \langle {\bf h_l}, {\bf h_k} \rangle^2 essentially enforces approximate orthogonality, while the second term is responsible for normalization l=khk,hk2=l=khk24\sum \limits_{l=k} \langle {\bf h_k}, {\bf h_k} \rangle^2 = \sum \limits_{l=k} ||{\bf h_k}||^4_2. This term attains minimum when all the norms of vectors are approximately balanced (h12h22...hk2||{\bf h_1}||_2 \approx ||{\bf h_2}||_2 \approx ... \approx ||{\bf h_k}||_2).

3. Biclustering problem and quadratic programming/NMF

Another related problem is the problem of biclustering.

Suppose that you have a matrix with expressions of pp genes, measured in nn patients, and you want to find sub-groups of patients with similar patterns of expression (e.g. you’re looking for subtypes of cancer).

Biclustering

Biclustering in a differential expression matrix. Rows of the matrix correspond to patients; columns - to genes; the intensity of color of each pixel represents the gene expression in a patient (e.g. how much RNA product is synthesized from this gene). By rearranging the order of genes and patients we can clearly identify groups of patients, in which groups of genes behave similarly.

So, you want to simultaneously detect subsets of columns and subsets of rows, such that they explain e.g. a large chunk of variance in your data (the variance in a bicluster is expected to be low, though).

Equivalently, this problem can be re-formulated as detection of dense subgraphs in a bipartite graph

Bipartite graph clustering

Dense subgraphs in a bipartite graph. Again, left part of the graph represents patients, right - genes, edges are expressions of genes in a patient.

Lemma 3.1. Biclustering problem is equivalent to quadratic programming/NMF low-rank approximation of a Jordan-Wielandt matrix

Let us formalize the biclustering problem.

We need to find a pair of indicator vectors

x:xi={1,iA0,iB{\bf x}: x_i = \begin{cases} 1, i \in A \\ 0, i \in B \end{cases}

y:yi={1,iA0,iB{\bf y}: y_i = \begin{cases} 1, i \in A \\ 0, i \in B \end{cases}

such that xTVymax/min{\bf x}^T V {\bf y} \to \max / \min

I call this bilinear programming, as we need to optimize the value of a bilinear form. In a more general case we might need to find matrices (XX, YY) (i.e. multiple pairs (xi\bf x_i, yi\bf y_i), potentially covering the whole matrix; vector pairs may intersect or may not, depending on the problem statement).

Turns out, we can re-formulate this bilinear optimization as a quadratic optimization problem. Let us put the vectors xx and yy on top of each other, concatenating them into a unified vector h=(x1,x2,...,xn,y1,y2,...,yp)Th = (x_1, x_2, ..., x_n, y_1, y_2, ..., y_p)^T.

Replace the VV matrix with the following Jordan-Wielandt matrix:

B=(0VVT0)B = \begin{pmatrix} 0 && V \\ V^T && 0 \end{pmatrix}

Then our optimization problem is to find 12hTBhmax/min\frac{1}{2}{\bf h}^T B {\bf h} \to \max / \min.

Again, if we want to find multiple biclusters, instead of a single vector hh, we use a whole matrix HH, and then our optimization problem takes a familiar form maxH12TrHTBH\max \limits_{H} \frac{1}{2} Tr H^T B H.

This problem already looks awfully familiar, right? Let us consider one more face of it.

4. k-means corresponds to spectral clustering

Another interesting interpretation of the same algorithm comes from spectral graph theory.

To get a taste of this field first, suppose that we have a graph that consists of multiple disjoint subsets.

It can be shown, that each disjoint connected component in this graph is an eigenvector of so-called graph Laplacian.

For example, consider this disjoint graph, its degree matrix DD, adjacency matrix AA and Laplacian L=DAL = D - A:

disjoint graph

A graph with 2 connected components

D=(2000002000002000001000001)D = \begin{pmatrix} 2 && 0 && 0 && 0 && 0 \\ 0 && 2 && 0 && 0 && 0 \\ 0 && 0 && 2 && 0 && 0 \\ 0 && 0 && 0 && 1 && 0 \\ 0 && 0 && 0 && 0 && 1 \end{pmatrix}, A=(0110010100110000000100010)A = \begin{pmatrix} 0 && 1 && 1 && 0 && 0 \\ 1 && 0 && 1 && 0 && 0 \\ 1 && 1 && 0 && 0 && 0 \\ 0 && 0 && 0 && 0 && 1 \\ 0 && 0 && 0 && 1 && 0 \end{pmatrix}, L=(2110012100112000001100011)L = \begin{pmatrix} 2 && -1 && -1 && 0 && 0 \\ -1 && 2 && -1 && 0 && 0 \\ -1 && -1 && 2 && 0 && 0 \\ 0 && 0 && 0 && 1 && -1 \\ 0 && 0 && 0 && -1 && 1 \end{pmatrix}

It is easy to see that vectors (11100)\begin{pmatrix} 1 \\ 1 \\ 1 \\ 0 \\ 0 \end{pmatrix} and (00011)\begin{pmatrix} 0 \\ 0 \\ 0 \\ 1 \\ 1 \end{pmatrix}, which define two connected components, are eigenvectors with 0 eigenvalue for graph Laplacian:

(2110012100112000001100011)(11100)=0(11100)\begin{pmatrix} 2 && -1 && -1 && 0 && 0 \\ -1 && 2 && -1 && 0 && 0 \\ -1 && -1 && 2 && 0 && 0 \\ 0 && 0 && 0 && 1 && -1 \\ 0 && 0 && 0 && -1 && 1 \end{pmatrix} \cdot \begin{pmatrix} 1 \\ 1 \\ 1 \\ 0 \\ 0 \end{pmatrix} = 0 \cdot \begin{pmatrix} 1 \\ 1 \\ 1 \\ 0 \\ 0 \end{pmatrix}

Now, instead of strictly disjoint graph, we might have a loosely disjoint one - with relatively dense subgraphs, interconnected with just a few “bridge” edges. We might want to find those dense sub-graphs. And they would serve as good clusters as well, if we re-formulated the problem as clustering?

Lemma 4.0. MinMax Cut solution through optimization of a quadratic form with Laplacian matrix

In 1993 Wu and Leahy suggested an algorithm of finding a minimal/maximal cut in graphs, based on spectral clustering.

The cut was meant to be the set of edges, that would split the graph into 2 halves, AA and BB, so that the sum of edges, which connect these two parts, is minimal: cut(A,B)=uA,vBw(u,v)cut(A, B) = \sum \limits_{u \in A, v \in B} w(u, v).

Speaking in terms of Laplacian and spectral graph theory, we can re-formulate MinCut optimization problem as follows. We introduce indicator vectors h=(1,1,...,1,1,1,...,1){\bf h} = (1, 1, ..., 1, -1, -1, ..., -1), where hi=1h_i = 1, if vertex iAi \in A and hi=1h_i = -1, if iBi \in B.

Now, you can see that if we defined some partition of our graph VV into halves AA and BB, adjacency matrix gets split into 4 blocks: block diagonal elements of the matrix represent nodes within AA and BB subgraphs, and off-diagonal elements represent cut(A,B)cut(A, B):

MinCut

MinCut adjacency matrix: diagonal blocks represent connections within AA and BB halves of the graph, and off-diagonal blocks represnt graph cut.

Now consider a quadratic form hTAh{\bf h}^T A {\bf h}:

hTAh=(11111)(w1,1w1,2w1,3w1,4w1,5w2,1w2,2w2,3w2,4w2,5w3,1w3,2w3,3w3,4w3,5w4,1w4,2w4,3w4,4w4,5w5,1w5,2w5,3w5,4w5,5)(11111)={\bf h}^T A {\bf h} = \begin{pmatrix} 1 && 1 && -1 && -1 && -1 \end{pmatrix} \cdot \begin{pmatrix} w_{1,1} && w_{1,2} && w_{1,3} && w_{1,4} && w_{1,5} \\ w_{2,1} && w_{2,2} && w_{2,3} && w_{2,4} && w_{2,5} \\ w_{3,1} && w_{3,2} && w_{3,3} && w_{3,4} && w_{3,5} \\ w_{4,1} && w_{4,2} && w_{4,3} && w_{4,4} && w_{4,5} \\ w_{5,1} && w_{5,2} && w_{5,3} && w_{5,4} && w_{5,5} \end{pmatrix} \cdot \begin{pmatrix} 1 \\ 1 \\ -1 \\ -1 \\ -1 \end{pmatrix} =

=i,jAwi,j+i,jBwi,jiA,jBwi,jiB,jAwi,j=i,jAwi,j+i,jBwi,j2i,jcut(A,B)wi,j = \sum \limits_{i,j \in A} w_{i,j} + \sum \limits_{i,j \in B} w_{i,j} - \sum \limits_{i \in A, j \in B} w_{i,j} - \sum \limits_{i \in B, j \in A} w_{i,j} = \sum \limits_{i,j \in A} w_{i,j} + \sum \limits_{i,j \in B} w_{i,j} - 2 \cdot \sum \limits_{i, j \in cut(A, B)} w_{i,j}.

Now, take the degree matrix DD, create a similar quadratic form out of if:

hTDh=(11111)(jw1,j...00jwi,j00...jwV,j)(11111){\bf h}^T D {\bf h} = \begin{pmatrix} 1 && 1 && -1 && -1 && -1 \end{pmatrix} \cdot \begin{pmatrix} \sum \limits_j{w_{1,j}} && ... && 0 \\ 0 && \sum \limits_j{w_{i,j}} && 0 \\ 0 && ... && \sum \limits_j{w_{|V|,j}} \end{pmatrix} \cdot \begin{pmatrix} 1 \\ 1 \\ -1 \\ -1 \\ -1 \end{pmatrix}.

If we now subtract the quadratic forms, we’d see that all edges, except by the members of cut(A,B)cut(A, B), cancel each other out:

hTDhhTAh=hT(DA)h=hTLh=4i,jcut(A,B)wi,j{\bf h}^T D {\bf h} - {\bf h}^T A {\bf h} = {\bf h}^T (D-A) {\bf h} = {\bf h}^T L {\bf h} = 4 \cdot \sum \limits_{i, j \in cut(A, B)} w_{i,j}.

In other words, to minimize the cut(A,B)cut(A, B), we need to find indicator vectors h{\bf h}, such that they minimize the quadratic form hTLh{\bf h}^T L {\bf h}, where LL is graph Laplacian. We just discovered another interesting connection with linear algebra and spectral graph theory, which is going to be instrumental a bit later.

For now, I shall notice that although MinMaxCut algorithm was enlightening, in its raw form it was prone to a fallacy of selecting a single most disconnected vertex as AA and cutting it off of the rest of the graph:

MinCut

MinCut: typical undesired behaviour of the algorithm, when it cuts away single vertices.

Lemma 4.1. Normalized Cut problems on a graph correspond to the k-means clustering problem

To address the problem of undesired behaviour of MinMaxCut, Shi and Malik (1997, 2000) came up with their (now famous) Normalized Cut algorithm. They tweaked the optimized function to cut away a sizable portion of the edges of the graph relative to the total weights of edges, kept within its halves:

Ncut(A,B)=cut(A,B)assoc(A,V)+cut(A,B)assoc(B,V)Ncut(A, B) = \frac{cut(A, B)}{assoc(A, V)} + \frac{cut(A, B)}{assoc(B, V)}, where assoc(A,V)=uA,tVw(u,t)assoc(A, V) = \sum \limits_{u \in A, t \in V} w(u, t).

If we take a look at the weighted adjacency matrix, representing the graph, and would cluster subgraphs AA and BB at diagonal blocks, we’d see that cut(A,B)cut(A, B) corresponds to the sum of weights of the off-diagonal blocks, while denominators assoc(A,V)assoc(A, V) and assoc(B,V)assoc(B, V) form full-width horizontal (or full-width vertical) blocks:

Normalized cut matrix

Normalized cut matrix.

Shi and Malik use a bit different version of indicator vectors. Unlike MinMaxCut their indicator vectors are comprised of 0-1 coordinates h\bf h: hi={1,iA0,iBh_i = \begin{cases} 1, i \in A \\ 0, i \in B \end{cases}.

Then Ncut(A,B)=cut(A,B)assoc(A,V)+cut(A,B)assoc(B,V)=hT(DA)hhTDh+(1h)T(DA)(1h)(1h)TD(1h)Ncut(A, B) = \frac{cut(A, B)}{assoc(A, V)} + \frac{cut(A, B)}{assoc(B, V)} = \frac{{\bf h}^T(D-A){\bf h}}{{\bf h}^T D {\bf h}} + \frac{{\bf (1-h)}^T(D-A){\bf (1-h)}}{{\bf (1-h)}^T D {\bf (1-h)}}.

Now, if we take a look at the denominators, hTDh=1TD1(1h)TD(1h){\bf h}^T D {\bf h} = {\bf 1}^T D {\bf 1} - {\bf (1-h)}^T D {\bf (1-h)}.

Shi and Malik then define a constant kk, such that hTDh=k1TD1{\bf h}^T D {\bf h} = k \cdot {\bf 1}^T D {\bf 1} and (1h)TD(1h)=(1k)1TD1{\bf (1-h)}^T D {\bf (1-h)} = (1-k) \cdot {\bf 1}^T D {\bf 1}.

Then Ncut(A,B)=hT(DA)hhTDh+(1h)T(DA)(1h)(1h)TD(1h)=hT(DA)hk1TD1+(1h)T(DA)(1h)(1k)1TD1=(1k)hT(DA)h+k(1h)T(DA)(1h)k(1k)1TD1Ncut(A, B) = \frac{{\bf h}^T(D-A){\bf h}}{{\bf h}^T D {\bf h}} + \frac{{\bf (1-h)}^T(D-A){\bf (1-h)}}{{\bf (1-h)}^T D {\bf (1-h)}} = \frac{{\bf h}^T(D-A){\bf h}}{k \cdot {\bf 1}^T D {\bf 1}} + \frac{{\bf (1-h)}^T(D-A){\bf (1-h)}}{(1-k) \cdot {\bf 1}^T D {\bf 1}} = \frac{ (1-k) \cdot {\bf h}^T(D-A){\bf h} + k \cdot {\bf (1-h)}^T(D-A){\bf (1-h)} }{k \cdot (1-k) \cdot {\bf 1}^T D {\bf 1}}.

After that Shi and Malik define a ratio b=k1kb = \frac{k}{1-k}. Then they perform a long series of simplifications and show that:

Ncut(A,B)=(1k)hT(DA)h+k(1h)T(DA)(1h)k(1k)1TD1=(hb(1h))T(DA)(hb(1h))b1TD1Ncut(A, B) = \frac{ (1-k) \cdot {\bf h}^T(D-A){\bf h} + k \cdot {\bf (1-h)}^T(D-A){\bf (1-h)} }{k \cdot (1-k) \cdot {\bf 1}^T D {\bf 1}} = \frac{({\bf h} - b \cdot ({\bf 1-h}))^T (D-A) ({\bf h} - b \cdot ({\bf 1-h})) }{b \cdot {\bf 1}^T D {\bf 1}}.

Finally, Shi and Malik define a new variable y=hb(1h)\bf y = {\bf h} - b ({\bf 1-h)} and apply two changes to the resulting formula:

  1. They notice that the denominator can be altered: yTDy=iAdi+b2iBdi=biBdi+b2iBdi=b(iBdi+biBdi)=b1TD1{\bf y}^T D {\bf y} = \sum \limits_{i \in A}d_i + b^2 \sum \limits_{i \in B}d_i = b \sum \limits_{i \in B}d_i + b^2 \sum \limits_{i \in B}d_i = b (\sum \limits_{i \in B}d_i + b \sum \limits_{i \in B}d_i) = b {\bf 1}^T D {\bf 1}