BorisBurkov.net
cover

Multivariate normal distribution

July 01, 2021 20 min read

Multivariate normal distribution arises in many aspects of mathematical statistics and machine learning. For instance, Cochran's theorem in statistics, PCA and Gaussian processes in ML heavily rely on its properties. Thus, I'll discuss it here in detail.

Motivational example

Sir Francis Galton

Sir Francis Galton (1822 - 1911)

Sir Francis Galton might be considered the grandfather of modern statistics. He analyzed the data on height of parents and their children:

Galton's correlation diagram

Heights of 205 parents and 930 adult children, from “Regression towards mediocrity in hereditary stature” by F.Galton, 1886

Heights of parents were normally-distributed, as well as the heights of their children. However, the distributions were obviously not independent, as taller parents generally give birth to taller children. Probably, this plate is one of the first depictions of 2-variate normal distribution’s isocontour. As a side note, Galton has also come up with the ratio of male to female heights here, which is 1.08 (according to the modern data, it is closer to 1.07).

Galton actually rediscovered the concept of correlation two years after this paper, in 1888.

Multivariate normal distribution

A random vector X=[x1,x2,...,xn]T\bm{X} = [x_1, x_2, ..., x_n]^T is called multivariate normal distribution, if each dimension of it represents a one-dimensional normal distribution.

They write XN(μ,Σ)\bm{X} \sim \mathcal{N}(\bm{\mu}, \bm{\Sigma}), where μ=[μ1,μ2,...,μn]T\bm{\mu} = [\mu_1, \mu_2, ..., \mu_n]^T is a vector of means, and the elements of matrix Σ\bm{\Sigma} are covariances between pairs of individual coordinates (xix_i, xjx_j):

Σ=(E((x1μ1)(x1μ1))E((x1μ1)(x2μ2))E((x1μ1)(xnμn))E((x2μ2)(x1μ1))E((x2μ2)(x2μ2))E((x2μ2)(xnμn))E((xnμn)(x1μ1))E((xnμn)(x2μ2))E((xnμn)(xnμn)))\bm{\Sigma} = \begin{pmatrix} \mathbb{E}((x_1-\mu_1)(x_1-\mu_1)) & \mathbb{E}((x_1-\mu_1)(x_2-\mu_2)) & \cdots & \mathbb{E}((x_1-\mu_1)(x_n-\mu_n)) \\ \mathbb{E}((x_2-\mu_2)(x_1-\mu_1)) & \mathbb{E}((x_2-\mu_2)(x_2-\mu_2)) & \cdots & \mathbb{E}((x_2-\mu_2)(x_n-\mu_n)) \\ \cdots & \cdots & \ddots & \cdots \\ \mathbb{E}((x_n-\mu_n)(x_1-\mu_1)) & \mathbb{E}((x_n-\mu_n)(x_2-\mu_2)) & \cdots & \mathbb{E}((x_n-\mu_n)(x_n-\mu_n)) \\ \end{pmatrix}

Cumulative density function of X looks like this:

FX(x1,x2,...,xn)=1(2π)ndet(Σ)t1=x1t2=x2...tn=xne(Tμ)TΣ1(Tμ)2dt1dt2...dtnF_X(x_1, x_2, ..., x_n) = \frac{1}{ \sqrt{{(2\pi)}^n |\det(\Sigma)|} } \int \limits_{t_1=-\infty}^{x_1} \int \limits_{t_2=-\infty}^{x_2} ... \int \limits_{t_n=-\infty}^{x_n} e^{-\frac{(\bm{T}-\bm{\mu})^T \bm{\Sigma}^{-1} (\bm{T}-\bm{\mu}) }{ 2 }} dt_1dt_2...dt_n, where T=(t1,t2,...,tn)TT=(t_1, t_2, ..., t_n)^T

and probability density function is:

fX(x1,x2,...,xn)=1(2π)ndet(Σ)e(Xμ)TΣ1(Xμ)2f_X(x_1, x_2, ..., x_n) = \frac{1}{ \sqrt{ {(2\pi)}^n |\det(\Sigma)| } } e^{-\frac{(\bm{X}-\bm{\mu})^T \bm{\Sigma^{-1}} (\bm{X}-\bm{\mu}) }{ 2 }}

Mahalanobis distance and covariance matrix

What is the meaning of covariance matrix and what does it do in the probability density function of multivariate normal?

Square root of a quadratic form (XY)TΣ1(XY)\sqrt{(\bm{X}-\bm{Y})^T\bm{\Sigma}^{-1}(\bm{X} - \bm{Y})}, where Y\bm{Y} and X\bm{X} are n-vectors and Σ\bm{\Sigma} is an n x n matrix, is called Mahalanobis distance between vectors X\bm{X} and Y\bm{Y}.

If the matrix Σ\bm{\Sigma} is a unit matrix, e.g. I=(100010001)I = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ \end{pmatrix}, Mahalanobis distance is the same as Euclidean.

However, if the coordinates of the vector X are strongly correlated, Mahalanobis distance could be much more helpful to e.g. detect outliers. For instance, imagine, that your vector X\bm{X} contains flat properties: (x1x_1 = total_area, x2x_2 = living_rooms_area, x3x_3 = distance from center). You can tell that total flat area and living rooms area have a reasonably strong correlation (as an edge case they could completely duplicate each other).

For instance, here is a possible covariance matrix for your flat’s properties Σ=(10.70.10.710.20.10.21)\bm{\Sigma} = \begin{pmatrix} 1 & 0.7 & 0.1 \\ 0.7 & 1 & 0.2 \\ 0.1 & 0.2 & 1 \\ \end{pmatrix}, Σ1=(1.9671.3930.0821.3932.0290.2660.0820.2661.045)\bm{\Sigma}^{-1} = \begin{pmatrix} 1.967 & -1.393 & 0.082 \\ -1.393 & 2.029 & -0.266 \\ 0.082 & -0.266 & 1.045 \\ \end{pmatrix}.

The key to understanding the covariance matrix is analysis of its eigen decomposition. Let E\bm{E} be the matrix of eigenvectors of Σ\bm{\Sigma}, let Λ\bm{\Lambda} be the diagonal matrix of eigenvalues of Σ\bm{\Sigma}.

Covariance matrix is symmetric (and positively semi-definite). For a symmetric matrix, its eigenvectors are orthogonal (so that inverse matrix of an orthogonal matrix is its transpose): Σ=EΛE1\bm{\Sigma} = \bm{E} \bm{\Lambda} \bm{E}^{-1} => ΣT=(E1)TΛTET=EΛE1=Σ\bm{\Sigma}^{T} = (\bm{E}^{-1})^{T} \bm{\Lambda}^{T} \bm{E}^{T} = \bm{E} \bm{\Lambda} \bm{E}^{-1} = \bm{\Sigma}.

ΛT=Λ\bm{\Lambda}^T = \bm{\Lambda}, thus, E1(E1)TΛETE=Λ\bm{E}^{-1}(\bm{E}^{-1})^T \Lambda \bm{E}^T\bm{E} = \bm{\Lambda}, indicating that ETE=I\bm{E}^T\bm{E}=\bm{I}, or E\bm{E} is orthogonal.

So the logic of Mahalanobis distance can be seen as follows: (XY)TΣ1(XY)=(XY)TEΛ1ET(XY)(\bm{X}-\bm{Y})^T\bm{\Sigma}^{-1}(\bm{X} - \bm{Y}) = (\bm{X}-\bm{Y})^T \bm{E} \bm{\Lambda}^{-1} \bm{E}^T (\bm{X} - \bm{Y}).

By multiplying XY\bm{X-Y} by the inverse/transposed eigen matrix ET\bm{E}^T (and doing the same in transposed way to the left side from Λ1\bm{\Lambda}^{-1}, when multiplying (XY)TE(\bm{X} - \bm{Y})^T \bm{E}), we de-correlate the dimensions of the vectors, transforming those inter-dependent factors into orthogonal, independent. Then we take the sum of squares of those de-correlated factors, but a weighted one, we give some dimensions more weight then the others, by multiplying by the matrix of eigenvalues Λ\bm{\Lambda}.

Let us show that correlated vectors, multiplied by ET\bm{E}^T, become uncorrelated. If eigenvector EiE_i had coordinates (ei,1ei,2ei,3)\begin{pmatrix} e_{i,1} \\ e_{i,2} \\ e_{i,3} \\ \end{pmatrix}, then:

Y=ETX=(e1,1e1,2e1,3e2,1e2,2e2,3e3,1e3,2e3,3)(x1x2x3)=(e1,1x1+e1,2x2+e1,3x3e2,1x1+e2,2x2+e2,3x3e3,1x1+e3,2x2+e3,3x3)\bm{Y} = \bm{E}^T X = \begin{pmatrix} e_{1,1} & e_{1,2} & e_{1,3} \\ e_{2,1} & e_{2,2} & e_{2,3} \\ e_{3,1} & e_{3,2} & e_{3,3} \\ \end{pmatrix} \cdot \begin{pmatrix} x_{1} \\ x_{2} \\ x_{3} \\ \end{pmatrix} = \begin{pmatrix} e_{1,1}x_{1} + e_{1,2}x_2 + e_{1,3}x_3 \\ e_{2,1}x_{1} + e_{2,2}x_2 + e_{2,3}x_3 \\ e_{3,1}x_{1} + e_{3,2}x_2 + e_{3,3}x_3 \\ \end{pmatrix}

Let’s now calculate the correlation between two coordinates of Y\bm{Y}, e.g.:

Cov[y1,y2]=Cov((e1,1x1+e1,2x2+e1,3x3)(e2,1x1+e2,2x2+e2,3x3))=\mathrm{Cov}[y_1, y_2] = \mathrm{Cov}( (e_{1,1}x_1 + e_{1,2}x_2 + e_{1,3}x_3) \cdot (e_{2,1}x_1 + e_{2,2}x_2 + e_{2,3}x_3)) =

=Cov(e1,1e2,1x1x1+e1,1e2,2x1x2+e1,1e2,3x1x3+e1,2e2,1x2x1+e1,2e2,2x2x2+e1,2e2,3x2x3+e1,3e2,1x3x1+e1,3e2,2x3x2+e1,3e2,3x3x3)== \mathrm{Cov}(e_{1,1}e_{2,1}x_1x_1 + e_{1,1}e_{2,2}x_1x_2 + e_{1,1}e_{2,3}x_1x_3 + e_{1,2}e_{2,1}x_2x_1 + e_{1,2}e_{2,2}x_2x_2 + e_{1,2}e_{2,3}x_2x_3 + e_{1,3}e_{2,1}x_3x_1 + e_{1,3}e_{2,2}x_3x_2 + e_{1,3}e_{2,3}x_3x_3) =

=(e2,1e2,2e2,3)(Σ1,1Σ1,2Σ1,3Σ2,1Σ2,2Σ2,3Σ3,1Σ3,2Σ3,3)(e1,1e1,2e1,3)=(e2,1e2,2e2,3)λ1(e1,1e1,2e1,3)=λ10=0= \begin{pmatrix} e_{2,1} & e_{2,2} & e_{2,3} \\ \end{pmatrix} \cdot \begin{pmatrix} \Sigma_{1,1} & \Sigma_{1,2} & \Sigma_{1,3} \\ \Sigma_{2,1} & \Sigma_{2,2} & \Sigma_{2,3} \\ \Sigma_{3,1} & \Sigma_{3,2} & \Sigma_{3,3} \\ \end{pmatrix} \cdot \begin{pmatrix} e_{1,1} \\ e_{1,2} \\ e_{1,3} \\ \end{pmatrix} = \begin{pmatrix} e_{2,1} & e_{2,2} & e_{2,3} \\ \end{pmatrix} \cdot \lambda_1 \cdot \begin{pmatrix} e_{1,1} \\ e_{1,2} \\ e_{1,3} \\ \end{pmatrix} = \lambda_1 \cdot 0 = 0.

First, we used the fact that covariance of a linear combination of random variables is a linear combination of covariances. Then we used the fact that E1E_1 is the eigenvector of matrix Σ\bm{\Sigma}, and ΣE1=λ1E1\bm{\Sigma} E_1 = \lambda_1 E_1. Lastly, we used the fact that eigenvectors E1E_1 and E2E_2 are orthogonal, and their dot product is 0.

Now, as you can see, the power of exponent in multivariate normal distribution, is the square of Mahalanobis distance between the vector and its mean, divided by 2. So, it works in the same way, it converts our correlated factors into uncorrelated ones, and takes sum of their squares, weighted by eigenvalues of respective directions.

This also explains, why the denominator contains det(Σ)|\det(\bm{\Sigma})|: the eigenvalues of the covariance matrix are the elements of diagonal matrix Λ\bm{\Lambda}, which are the variances of de-correlated normal distributions. By Binet-Cauchy formula the determinant of det(Σ)=det(Λ)\det(\bm{\Sigma}) = \det(\bm{\Lambda}). Thus, by normalizing the probability density function by det(Σ)|\det(\bm{\Sigma})|, we do the same as by normalizing pdf of one-dimensional normal distribution by σ2\sqrt{\sigma^2}.

Uncorrelated multidimensional normal variables are independent

This property of multidimensional normal distribution is fairly obvious from the previous property.

For instance, suppose that your covariance matrix is as in the following example from StackOverflow:

Σ1=(410150002)\Sigma^{-1} = \begin{pmatrix} 4 & -1 & 0 \\ -1 & 5 & 0 \\ 0 & 0 & 2 \\ \end{pmatrix}, Σ=(0.263157890.0526315800.052631580.210526320000.5)\Sigma = \begin{pmatrix} 0.26315789 & 0.05263158 & 0 \\ 0.05263158 & 0.21052632 & 0 \\ 0 & 0 & 0.5 \\ \end{pmatrix}

Then:

(Xμ)TΣ1(Xμ)=(x1x2x3)(410150002)(x1x2x3)=(x1x2)(4115)(x1x2)+2x32(\bm{X}-\bm{\mu})^T \Sigma^{-1} (\bm{X}-\bm{\mu}) = \begin{pmatrix} x_1 & x_2 & x_3 \\ \end{pmatrix} \cdot \begin{pmatrix} 4 & -1 & 0 \\ -1 & 5 & 0 \\ 0 & 0 & 2 \\ \end{pmatrix} \cdot \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ \end{pmatrix} = \begin{pmatrix} x_1 & x_2 \\ \end{pmatrix} \begin{pmatrix} 4 & -1 \\ -1 & 5 \\ \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ \end{pmatrix} + 2{x_3}^2 .

Now, if we substitute this into the probability density function, we get:

fX(x1,x2,...,xn)=1(2π)ndet(Σ)e(Xμ)TΣ(Xμ)2=1(2π)3det(Σ)e12((x1x2)(4115)(x1x2)+2x32)=f_X(x_1, x_2, ..., x_n) = \frac{1}{ \sqrt{ {(2\pi)}^n |\det(\Sigma)| } } e^{-\frac{(\bm{X}-\bm{\mu})^T \bm{\Sigma} (\bm{X}-\bm{\mu}) }{ 2 }} = \frac{1}{ \sqrt{ {(2\pi)}^3 |\det(\Sigma)| } } e^{- \frac{1}{2} ( \begin{pmatrix} x_1 & x_2 \\ \end{pmatrix} \begin{pmatrix} 4 & -1 \\ -1 & 5 \\ \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ \end{pmatrix} + 2{x_3}^2)} =

=1(2π)0.5e2x3221(2π)2det[0.263157890.052631580.052631580.21052632]e12((x1x2)(4115)(x1x2)= \frac{1}{ \sqrt{ {(2\pi)} \cdot 0.5 } } e^{-\frac{ 2{x_3}^2}{2} } \cdot \frac{1}{ \sqrt{ {(2\pi)^2} |\det{ \begin{bmatrix} 0.26315789 & 0.05263158 \\ 0.05263158 & 0.21052632 \\ \end{bmatrix} }| } } e^{- \frac{1}{2} ( \begin{pmatrix} x_1 & x_2 \\ \end{pmatrix} \begin{pmatrix} 4 & -1 \\ -1 & 5 \\ \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ \end{pmatrix} }

Thus, we can see that uncorrelated dimensions of random vector can be factored-out into independent random variables.

Marginalization and Conditioning

Marginalization

You marginalize multivariate normal distribution by taking an integral over 1 of its dimensions.

For instance, if you integrate Galton’s 2-variate normal distribution over the heights of all the fathers, you get 1-dimensional distribution of heights all the children.

Conditioning

You do conditioning, when you fix the value of one dimension of multivariate normal distribution and achieve a lower-variate one.

For instance, you can choose fathers, who are x2x_2 inches tall, and achieve the conditional distribution of heights of their children, which is one-dimensional normal:

fξ(x1)=12π(Σ11Σ12Σ21)Σ22e12(x1μ1Σ1,2Σ22(x2μ2))2(Σ1,1Σ1,2Σ2,1Σ2,2)2f_{\xi}(x_1) = \frac{1}{\sqrt{2\pi(\Sigma_{11} - \frac{\Sigma_{12}\Sigma_{21})}{\Sigma_{22}}}} e^{-\frac{1}{2}\frac{ (x_1 - \mu_1 - \frac{ \Sigma_{1,2} }{ \Sigma_{22} } (x_2 - \mu_2))^2 }{ (\Sigma_{1,1} - \frac{ \Sigma_{1,2}\Sigma_{2,1} }{\Sigma_{2,2}})^2 }}

Note that the mean and variance of this distribution differ from the marginalized one - children of taller fathers are, obviously, taller.

Quadratic forms, their ranks and special cases of quadratic forms

The power of exponent of p.d.f. of a multivariate normal (Xμ)TΣ1(Xμ)2\frac{(\bm{X}-\bm{\mu})^T \bm{\Sigma}^{-1} (\bm{X}-\bm{\mu}) }{ 2 } is a quadratic form.

Speaking of the matrix Σ\bm{\Sigma}, there is a useful concept of matrix rank, which is the number of linearly independent rows/columns in the matrix.

For instance, if our quadratic form is just a product of two vectors (a1x1+a2x2+a3x3)(b1x1+b2x2+b3x3)(a_1x_1 + a_2x_2 + a_3x_3)(b_1x_1 + b_2x_2 + b_3x_3), the rank equals 1, because all the rows are linearly dependent.

Indeed, we could see this, when we write the product in matrix notation:

(x1x2x3)(a1a2a3)(b1b2b3)(x1x2x3)\begin{pmatrix} x_1 & x_2 & x_3 \\ \end{pmatrix} \cdot \begin{pmatrix} a_1 \\ a_2 \\ a_3 \end{pmatrix} \cdot \begin{pmatrix} b_1 & b_2 & b_3 \\ \end{pmatrix} \cdot \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}

The multiplication of row-vector by column-vector in linear algebra is called dot product, or inner product: (x1x2x3)(a1a2a3)=a1x1+a2xx+a3x3\begin{pmatrix} x_1 & x_2 & x_3 \\ \end{pmatrix} \cdot \begin{pmatrix} a_1 \\ a_2 \\ a_3 \end{pmatrix}= a_1x_1 + a_2x_x + a_3x_3.

Less commonly used, the multiplication of columns-vector by row vector is called an outer product, and it results in a matrix, where each element is a product of respective elements of column-vector and row-vector:

(a1a2a3)(b1b2b3)=(a1b1a1b2a1b3a2b1a2b2a2b3a3b1a3b2a3b3)\begin{pmatrix} a_1 \\ a_2 \\ a_3 \end{pmatrix} \cdot \begin{pmatrix} b_1 & b_2 & b_3 \\ \end{pmatrix} = \begin{pmatrix} a_1 b_1 & a_1 b_2 & a_1 b_3 \\ a_2 b_1 & a_2 b_2 & a_2 b_3 \\ a_3 b_1 & a_3 b_2 & a_3 b_3 \\ \end{pmatrix}

You can choose an arbitrary order of application of outer-product and inner-product operations:

(x1x2x3)(a1a2a3)(b1b2b3)(x1x2x3)=(a1x1+a2x2+a3x3)(b1x1+b2x2+b3x3)=(x1x2x3)(a1b1a1b2a1b3a2b1a2b2a2b3a3b1a3b2a3b3)(x1x2x3)\begin{pmatrix} x_1 & x_2 & x_3 \\ \end{pmatrix} \cdot \begin{pmatrix} a_1 \\ a_2 \\ a_3 \end{pmatrix} \cdot \begin{pmatrix} b_1 & b_2 & b_3 \\ \end{pmatrix} \cdot \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = (a_1x_1 + a_2x_2 + a_3x_3) \cdot (b_1x_1 + b_2x_2 + b_3x_3) = \begin{pmatrix} x_1 & x_2 & x_3 \\ \end{pmatrix} \cdot \begin{pmatrix} a_1 b_1 & a_1 b_2 & a_1 b_3 \\ a_2 b_1 & a_2 b_2 & a_2 b_3 \\ a_3 b_1 & a_3 b_2 & a_3 b_3 \\ \end{pmatrix} \cdot \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}

If you choose the latter way, it becomes obvious that the rank of matrix (a1b1a1b2a1b3a2b1a2b2a2b3a3b1a3b2a3b3)\begin{pmatrix} a_1 b_1 & a_1 b_2 & a_1 b_3 \\ a_2 b_1 & a_2 b_2 & a_2 b_3 \\ a_3 b_1 & a_3 b_2 & a_3 b_3 \\ \end{pmatrix} , formed by an outer product of coefficients a\bm{a} and b\bm{b}, equals 1.

Indeed, its i-th row is a multiple of aia_i by row-vector bT=(b1b2b3)\bm{b}^T = \begin{pmatrix} b_1 & b_2 & b_3 \\ \end{pmatrix}, so all the rows differ just by a scalar aia_i, so there is just 1 linearly independent row.

If we chose a real-life covariance matrix (10.50.51)\begin{pmatrix} 1 & 0.5 \\ 0.5 & 1 \\ \end{pmatrix}, it is clear that its rank equals 2, so it cannot be represented as (a1x1+a2x2)(b1x1+b2x2)(a_1x_1 + a_2x_2)(b_1x_1 + b_2x_2).

References


Boris Burkov

Written by Boris Burkov who lives in Moscow, Russia and Cambridge, UK, loves to take part in development of cutting-edge technologies, reflects on how the world works and admires the giants of the past. You can follow me on Telegram