The multivariate normal distribution arises in many areas of mathematical statistics and machine learning. For instance, Cochran's theorem in statistics, as well as PCA and Gaussian processes in ML, rely heavily on its properties. Thus, I'll discuss it here in detail.
Heights of 205 parents and 930 adult children, from “Regression towards mediocrity in hereditary stature” by F. Galton, 1886
The heights of parents were normally distributed, as were the heights of their children. However, the two distributions were
clearly not independent, as taller parents generally have taller children. This plate is probably one of the first depictions of an isocontour of a 2-variate normal distribution. As a side note,
Galton also came up with the ratio of male to female heights here, which is 1.08 (according to modern data, it is closer to 1.07).
Galton actually rediscovered the concept of correlation two years after this paper, in 1888.
Multivariate normal distribution
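A random vector $X = [x_1, x_2, ..., x_n]^T$ is said to have a multivariate normal distribution if every linear combination of its coordinates is normally distributed. In particular, each coordinate $x_i$ is itself a one-dimensional normal variable (the converse does not hold: normal coordinates alone do not guarantee joint normality).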
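We write $X \sim \mathcal{N}(\mu, \Sigma)$, where $\mu = [\mu_1, \mu_2, ..., \mu_n]^T$ is the vector of means, and the elements of the matrix $\Sigma$ are the covariances between pairs of individual coordinates $(x_i, x_j)$:

$$\Sigma_{i,j} = \mathrm{Cov}(x_i, x_j) = \mathbb{E}\left[(x_i - \mu_i)(x_j - \mu_j)\right]$$

The probability density function of the multivariate normal distribution is:

$$f(X) = \frac{1}{\sqrt{(2\pi)^n |\det(\Sigma)|}} \exp\left(-\frac{1}{2}(X - \mu)^T \Sigma^{-1} (X - \mu)\right)$$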
What is the meaning of the covariance matrix, and what role does it play in the probability density function of the multivariate normal distribution?
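The square root of the quadratic form $(X - Y)^T \Sigma^{-1} (X - Y)$, where $X$ and $Y$ are $n$-vectors and $\Sigma$ is an $n \times n$ matrix, is
called the Mahalanobis distance between vectors $X$ and $Y$.
If the matrix $\Sigma$ is the identity matrix, e.g. $I = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$, the Mahalanobis distance is the same as the Euclidean distance.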
However, if the coordinates of the vector $X$ are strongly correlated, the Mahalanobis distance can be much more helpful, e.g. for detecting outliers.
For instance, imagine that your vector $X$ contains the properties of a flat: ($x_1$ = total_area, $x_2$ = living_rooms_area, $x_3$ = distance from center).
Total flat area and living rooms area clearly have a reasonably strong correlation (in the extreme case they could duplicate each other completely).
For instance, here is a possible covariance matrix for your flat’s properties: $\Sigma = \begin{pmatrix} 1 & 0.7 & 0.1 \\ 0.7 & 1 & 0.2 \\ 0.1 & 0.2 & 1 \end{pmatrix}$, $\Sigma^{-1} = \begin{pmatrix} 1.967 & -1.393 & 0.082 \\ -1.393 & 2.029 & -0.266 \\ 0.082 & -0.266 & 1.045 \end{pmatrix}$.
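To make the flat example concrete, here is a minimal NumPy sketch. The helper `mahalanobis` and the two sample flats (`typical`, `unusual`) are my own illustrative assumptions, not data from a real listing; the features are assumed to be standardized, so the mean is the zero vector.

```python
import numpy as np

# Covariance matrix for (total_area, living_rooms_area, distance_from_center),
# copied from the text above. Features are assumed to be standardized.
sigma = np.array([
    [1.0, 0.7, 0.1],
    [0.7, 1.0, 0.2],
    [0.1, 0.2, 1.0],
])
sigma_inv = np.linalg.inv(sigma)


def mahalanobis(x, y, sigma_inv):
    """Square root of the quadratic form (x - y)^T Sigma^{-1} (x - y)."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(diff @ sigma_inv @ diff))


mean = np.zeros(3)
# Two hypothetical flats at the same Euclidean distance from the mean.
typical = np.array([1.0, 1.0, 0.0])   # both areas above average together
unusual = np.array([1.0, -1.0, 0.0])  # large total area, small living area

print(np.round(sigma_inv, 3))
print("Euclidean:  ", np.linalg.norm(typical - mean), np.linalg.norm(unusual - mean))
print("Mahalanobis:", mahalanobis(typical, mean, sigma_inv),
      mahalanobis(unusual, mean, sigma_inv))
```

Both flats are equally far from the mean in the Euclidean sense, but the second one contradicts the positive correlation between the two areas, so its Mahalanobis distance is noticeably larger.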
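The key to understanding the covariance matrix is the analysis of its eigendecomposition. Let $E$ be the matrix whose columns are the eigenvectors of $\Sigma$, and let $\Lambda$ be the diagonal matrix of eigenvalues of $\Sigma$.
A covariance matrix is symmetric (and positive semi-definite). For a symmetric matrix, the eigenvectors are orthogonal (so that the inverse of the eigenvector matrix is its transpose): $\Sigma = E \Lambda E^{-1} \Rightarrow \Sigma^T = (E^{-1})^T \Lambda^T E^T = E \Lambda E^{-1} = \Sigma$.
Since $\Lambda^T = \Lambda$, multiplying by $E^{-1}$ on the left and by $E$ on the right gives $E^{-1} (E^{-1})^T \Lambda E^T E = \Lambda$, indicating that $E^T E = I$, i.e. $E$ is orthogonal.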
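The orthogonality of the eigenvector matrix is easy to check numerically. The sketch below assumes the flat covariance matrix from the example above and uses `np.linalg.eigh`, NumPy's eigensolver for symmetric matrices; the variable names are mine.

```python
import numpy as np

sigma = np.array([
    [1.0, 0.7, 0.1],
    [0.7, 1.0, 0.2],
    [0.1, 0.2, 1.0],
])

# eigh returns real eigenvalues in ascending order and
# orthonormal eigenvectors as the columns of E.
eigenvalues, E = np.linalg.eigh(sigma)
lam_matrix = np.diag(eigenvalues)

print("E^T E = I:           ", np.allclose(E.T @ E, np.eye(3)))
print("E Lambda E^T = Sigma:", np.allclose(E @ lam_matrix @ E.T, sigma))
print("E^{-1} = E^T:        ", np.allclose(np.linalg.inv(E), E.T))
print("eigenvalues:", eigenvalues)
```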
So the logic of the Mahalanobis distance can be seen as follows: $(X - Y)^T \Sigma^{-1} (X - Y) = (X - Y)^T E \Lambda^{-1} E^T (X - Y)$.
By multiplying $X - Y$ by the inverse (equivalently, transposed) eigenvector matrix $E^T$ (and doing the same in transposed form on the left of $\Lambda^{-1}$, when multiplying $(X - Y)^T E$), we de-correlate the dimensions of the vectors, transforming the inter-dependent factors into uncorrelated ones along orthogonal directions.
Then we take the sum of squares of those de-correlated factors, but a weighted one: we give some dimensions more weight than others by multiplying by the matrix of inverse eigenvalues $\Lambda^{-1}$.
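The two-step recipe (rotate, then take a weighted sum of squares) can be sketched as follows. The vectors `x` and `y` are arbitrary illustrative values, and the covariance matrix is assumed to be the one from the flat example.

```python
import numpy as np

sigma = np.array([
    [1.0, 0.7, 0.1],
    [0.7, 1.0, 0.2],
    [0.1, 0.2, 1.0],
])
eigenvalues, E = np.linalg.eigh(sigma)

# Hypothetical pair of vectors to compare.
x = np.array([1.0, -1.0, 0.5])
y = np.zeros(3)

# Step 1: de-correlate by projecting the difference onto the eigenvectors.
z = E.T @ (x - y)

# Step 2: sum of squares, each direction weighted by 1 / eigenvalue.
d2_eigen = np.sum(z ** 2 / eigenvalues)

# Reference: the quadratic form computed directly with the inverse matrix.
d2_direct = (x - y) @ np.linalg.inv(sigma) @ (x - y)

print(d2_eigen, d2_direct)
```

Directions with small eigenvalues (little variance) receive a large weight, so a deviation along them counts as more surprising.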
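Let us show that correlated coordinates, multiplied by $E^T$, become uncorrelated. If the eigenvector $E_i$ has coordinates $\begin{pmatrix} e_{i,1} \\ e_{i,2} \\ e_{i,3} \end{pmatrix}$, then:

$$\mathrm{Cov}(E_1^T X, E_2^T X) = \mathrm{Cov}\Big(\sum_{i} e_{1,i} x_i, \sum_{j} e_{2,j} x_j\Big) = \sum_{i} \sum_{j} e_{1,i} e_{2,j} \mathrm{Cov}(x_i, x_j) = E_1^T \Sigma E_2 = (\Sigma E_1)^T E_2 = \lambda_1 E_1^T E_2 = 0$$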
First, we used the fact that the covariance of linear combinations of random variables is a linear combination of covariances. Then we used the fact that $E_1$ is an eigenvector of the matrix $\Sigma$, so $\Sigma E_1 = \lambda_1 E_1$. Lastly, we used the fact that the eigenvectors $E_1$ and $E_2$ are orthogonal, so their dot product is 0.
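The same statement can be checked empirically. This is a rough simulation sketch, assuming the flat covariance matrix from above, a zero mean, and an arbitrary seed and sample size of my choosing.

```python
import numpy as np

rng = np.random.default_rng(0)

sigma = np.array([
    [1.0, 0.7, 0.1],
    [0.7, 1.0, 0.2],
    [0.1, 0.2, 1.0],
])
eigenvalues, E = np.linalg.eigh(sigma)

# Exact statement: in the eigenvector basis the covariance matrix is diagonal.
rotated_sigma = E.T @ sigma @ E

# Empirical statement: draw correlated samples and rotate each of them by E^T.
samples = rng.multivariate_normal(mean=np.zeros(3), cov=sigma, size=200_000)
rotated = samples @ E  # row k equals (E^T x_k)^T
cov_rotated = np.cov(rotated, rowvar=False)

print(np.round(rotated_sigma, 6))
print(np.round(cov_rotated, 3))
print(np.round(eigenvalues, 3))
```

Up to sampling noise, the off-diagonal entries of `cov_rotated` vanish and its diagonal reproduces the eigenvalues of $\Sigma$.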
Now, as you can see, the exponent in the density of the multivariate normal distribution is the squared Mahalanobis distance between the vector and its mean, divided by 2 and taken with a minus sign.
So it works in the same way: it converts our correlated factors into uncorrelated ones and takes the sum of their squares, weighted by the inverse eigenvalues of the respective directions.
This also explains why the denominator contains $\sqrt{|\det(\Sigma)|}$: the eigenvalues of the covariance matrix are the elements of the diagonal matrix $\Lambda$, which are the variances of the de-correlated normal distributions.
By the Binet-Cauchy formula, $\det(\Sigma) = \det(E) \det(\Lambda) \det(E^{-1}) = \det(\Lambda)$, the product of the eigenvalues. Thus, by normalizing the probability density function by $\sqrt{|\det(\Sigma)|}$,
we do the same as when normalizing the pdf of a one-dimensional normal distribution by $\sqrt{\sigma^2} = \sigma$.
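As a sanity check of this interpretation, the sketch below evaluates the density in two ways: with the textbook formula, and as a product of one-dimensional normal densities along the eigenvector directions. The function names `mvn_pdf` and `mvn_pdf_eigen` and the test point are my own illustrative choices.

```python
import numpy as np


def mvn_pdf(x, mu, sigma):
    """Multivariate normal density via the inverse and determinant of sigma."""
    n = len(mu)
    diff = x - mu
    d2 = diff @ np.linalg.inv(sigma) @ diff  # squared Mahalanobis distance
    norm = np.sqrt((2 * np.pi) ** n * np.linalg.det(sigma))
    return np.exp(-0.5 * d2) / norm


def mvn_pdf_eigen(x, mu, sigma):
    """Same density as a product of 1D normal densities in the eigenbasis."""
    eigenvalues, E = np.linalg.eigh(sigma)
    z = E.T @ (x - mu)  # de-correlated coordinates
    # Each eigenvalue plays the role of sigma^2 of a 1D normal distribution.
    one_dim = np.exp(-0.5 * z ** 2 / eigenvalues) / np.sqrt(2 * np.pi * eigenvalues)
    return np.prod(one_dim)


sigma = np.array([
    [1.0, 0.7, 0.1],
    [0.7, 1.0, 0.2],
    [0.1, 0.2, 1.0],
])
mu = np.zeros(3)
x = np.array([0.5, -0.3, 1.2])  # hypothetical point

print(mvn_pdf(x, mu, sigma), mvn_pdf_eigen(x, mu, sigma))
print(np.linalg.det(sigma), np.prod(np.linalg.eigvalsh(sigma)))
```

The two densities agree, and so do the determinant and the product of eigenvalues, which is exactly why the normalizing constant has the form it has.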
Uncorrelated multidimensional normal variables are independent
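This property of the multidimensional normal distribution follows almost directly from the previous one: if the coordinates are uncorrelated, $\Sigma$ is diagonal, so the exponent splits into a sum of per-coordinate terms and the density factorizes into a product of one-dimensional normal densities.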
The multiplication of a row vector by a column vector in linear algebra is called the dot product, or inner product. For column vectors $a$ and $b$ with $n$ coordinates each:

$$a^T b = \sum_{i=1}^{n} a_i b_i$$

Less commonly used, the multiplication of a column vector by a row vector is called the outer product, and it results in a matrix where each element is the product of the respective elements of the column vector and the row vector:

$$(a b^T)_{i,j} = a_i b_j$$
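In NumPy terms, the difference between the two products looks roughly like this (the vectors `a` and `b` are arbitrary examples):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

# Inner product: row vector times column vector, a single number.
inner = a @ b  # 1*4 + 2*5 + 3*6 = 32.0

# Outer product: column vector times row vector, a 3x3 matrix
# whose (i, j) element is a[i] * b[j].
outer = np.outer(a, b)

print(inner)
print(outer)
```

Note that the trace of the outer product equals the inner product, since the diagonal holds the terms $a_i b_i$.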
Written by Boris Burkov who lives in Moscow, Russia, loves to take part in development of cutting-edge technologies, reflects on how the world works and admires the giants of the past. You can follow me in Telegram