
Cochran's theorem

June 30, 2021

Here I discuss Cochran's theorem, which is used to prove the independence of quadratic forms of random variables, such as the sample variance and the sample mean.

Tearing through the unintelligible formulation

Cochran's theorem is a somewhat specialized result, used in the analysis of chi-squared and multivariate normal distributions.

Probably the two main reasons to care about it are the proof of independence between the sample variance and the sample mean in the derivation of Student's t-distribution, and ANOVA, where the total variance is split into multiple sources of variance, which are usually expressed as quadratic forms.

As an example of the latter case (maybe not the most relevant one), consider the famous bias-variance tradeoff, where the total expected error of a regression model is divided into a sum of two sources: squared bias (the systematic error due to a crude model being unable to fit the more complex nature of the data) and variance (the error created by the fact that the amount of available data is limited and somewhat insufficient for the model to learn to fit the data perfectly).

Anyway, the formulation of the theorem sounds really technical and is hard to digest. Basically, it says that if you have a sum of squares of $n$ independent identically distributed standard normal variables and manage to split it into several non-full-rank quadratic forms whose ranks sum to $n$, then each of those quadratic forms is an independent random variable, distributed as $\chi^2_{r_i}$, a chi-square with $r_i$ degrees of freedom, where $r_i$ is the rank of the corresponding quadratic form.

For instance, if we've split our sum of squares of i.i.d. normal variables with zero mean and unit variance (so that $X^T X \sim \chi_{n}^2$) into 2 quadratic forms $X^T X = X^T B_1 X + X^T B_2 X$, with the matrix $B_1$ having rank $r_1$ and $B_2$ having rank $r_2 = n - r_1$, then $X^T B_1 X \sim \chi_{r_1}^2$, $X^T B_2 X \sim \chi_{r_2}^2$, and $X^T B_1 X$ and $X^T B_2 X$ are independent.
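As a quick sanity check, here is a minimal simulation sketch of this statement (not part of the original derivation; it assumes numpy and scipy are available, and the dimensions, sample sizes and random seed are arbitrary illustration choices): we build two complementary projection matrices of ranks $r_1$ and $n - r_1$ and check that the two quadratic forms behave like independent $\chi^2_{r_1}$ and $\chi^2_{r_2}$ variables.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, r1 = 6, 2                                   # arbitrary illustration values
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))   # random orthogonal basis
B1 = Q[:, :r1] @ Q[:, :r1].T                   # symmetric matrix of rank r1
B2 = np.eye(n) - B1                            # rank n - r1, so B1 + B2 = I

X = rng.standard_normal((100_000, n))          # rows are i.i.d. N(0, I_n) vectors
q1 = np.einsum('ij,jk,ik->i', X, B1, X)        # X^T B1 X for each sample
q2 = np.einsum('ij,jk,ik->i', X, B2, X)        # X^T B2 X for each sample

print(stats.kstest(q1, 'chi2', args=(r1,)).pvalue)      # consistent with chi2_{r1}
print(stats.kstest(q2, 'chi2', args=(n - r1,)).pvalue)  # consistent with chi2_{n-r1}
print(np.corrcoef(q1, q2)[0, 1])                        # ~0, hinting at independence
```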

See the examples below to understand why this result is valuable.

Cochran’s theorem proof

Here I’ll outline the proof of Cochran’s theorem.

The central piece of the proof is the following lemma, which is a result from pure linear algebra, not probability theory - it deals with matrices and real numbers, not random variables. Once we are done with the lemma, the proof of the theorem itself gets pretty straightforward.

Lemma

Suppose that we have an $n$-dimensional vector $\bm{X}$ and a quadratic form ${\bm{X}}^T \bm{I} \bm{X} = {\bm{X}}^T \bm{X}$.

Suppose that we found a way to split this quadratic form into several: ${\bm{X}}^T \bm{X} = {\bm{X}}^T \bm{B_1} \bm{X} + {\bm{X}}^T \bm{B_2} \bm{X} + ... + {\bm{X}}^T \bm{B_k} \bm{X}$, where the matrices $\bm{B_1}, \bm{B_2}, ..., \bm{B_k}$ have lower ranks $r_1, r_2, ..., r_k$, so that the sum of those ranks equals $n$: $\sum \limits_{i=1}^{k} r_i = n$.

Then all those matrices can be simultaneously diagonalized: there exists an orthogonal matrix $\bm{E}$ of their joint eigenvectors, and after diagonalizing, we get:

$${\bm{X}}^T \bm{X} = {\bm{X}}^T \bm{B_1} \bm{X} + {\bm{X}}^T \bm{B_2} \bm{X} + ... + {\bm{X}}^T \bm{B_k} \bm{X} = {\bm{X}}^T \bm{E} \bm{\Lambda_1} \bm{E}^T \bm{X} + {\bm{X}}^T \bm{E} \bm{\Lambda_2} \bm{E}^T \bm{X} + ... + {\bm{X}}^T \bm{E} \bm{\Lambda_k} \bm{E}^T \bm{X} =$$

$$= \bm{Y}^T \bm{\Lambda_1} \bm{Y} + \bm{Y}^T \bm{\Lambda_2} \bm{Y} + ... + \bm{Y}^T \bm{\Lambda_k} \bm{Y},$$

where $\bm{Y} = \bm{E}^T \bm{X}$ is such a transform of the vector $\bm{X}$ that in the resulting quadratic forms $\bm{Y}^T \bm{\Lambda_i} \bm{Y}$ the matrices $\bm{\Lambda_i}$ are diagonal.

As the matrices $\Lambda_i$ contain only $r_i$ non-zero eigenvalues:

$$\begin{pmatrix} 0 & 0 & \cdots & 0 & 0 & 0 \\ \cdots & \cdots & \ddots & \ddots & \cdots & \cdots \\ 0 & \cdots & \lambda_{j} & 0 & \cdots & 0 \\ 0 & \cdots & 0 & \lambda_{j+1} & 0 & 0 \\ \cdots & \cdots & \ddots & \ddots & \cdots & \cdots \\ 0 & 0 & \cdots & 0 & 0 & 0 \end{pmatrix},$$

where $j$ runs from $r_1 + ... + r_{i-1} + 1$ to $r_1 + ... + r_{i-1} + r_i$, in each expression $\bm{Y}^T \bm{\Lambda_i} \bm{Y}$ only the coordinates $y_j$ with those indices actually matter.

Moreover, and importantly, all those eigenvalues $\lambda_j$ equal 1, so each quadratic form $\bm{Y}^T \bm{\Lambda_i} \bm{Y}$ actually ends up being just a sum of squares of $r_i$ of the i.i.d. normal coordinates of $\bm{Y}$, which means it is chi-square-distributed.

Preparations for the lemma: eigenvalues and eigenvectors of lower-rank matrices

As you see, the statement of the lemma deals with lower-rank matrices and their eigendecompositions. We need to learn how to work with them in order to understand the proof of the lemma.

For starters, let us consider a rank-1 matrix, e.g. $A = \begin{pmatrix} 1 & 2 \\ 1 & 2 \end{pmatrix}$.

Being a $2 \times 2$ matrix, it has 2 eigenvalues (counting multiplicities).

We can represent it as an outer product of two vectors, $u$ and $v$: $A = uv^T$; here $u = (1, 1)^T$ and $v = (1, 2)^T$.

Then one of its eigenvalues is $\lambda_1 = u^T v$, because $Au = (uv^T)u = u(v^Tu) = u \lambda_1 = \lambda_1 u$, and $u$ is the corresponding eigenvector.

As for the rest of the eigenvalues, they equal zero, and the corresponding eigenvectors have to cover the remaining space of dimensionality $(n-1)$. For instance, it is clear that $(2, -1)^T$ is an eigenvector with eigenvalue 0.

Obviously, in the case of a rank-1 matrix we have just one linearly independent equation, which makes all but one coordinate of the eigenvectors corresponding to eigenvalue 0 arbitrary, while the last coordinate is determined by that single equation.
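A quick numerical check of this rank-1 example (a sketch assuming numpy is available):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [1.0, 2.0]])            # the rank-1 matrix from the text
u = np.array([1.0, 1.0])              # A = u v^T
v = np.array([1.0, 2.0])
assert np.allclose(A, np.outer(u, v))

eigvals, eigvecs = np.linalg.eig(A)
print(eigvals)                        # 3 and 0: lambda_1 = u^T v = 3, the other eigenvalue is 0
print(A @ np.array([2.0, -1.0]))      # (0, 0): (2, -1) spans the zero-eigenvalue eigenspace
```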

Now, suppose that the dimension of your matrix is $n=4$ and its rank is $r=2$, for instance: $A = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 2 \\ 1 & 1 & 1 & 2 \\ 1 & 1 & 1 & 2 \end{pmatrix}$.

Then 2 eigenvectors are fixed and correspond to non-zero eigenvalues $\lambda_1$ and $\lambda_2$, while a 2-dimensional subspace is left, corresponding to the eigenvalues $\lambda_3 = \lambda_4 = 0$. Indeed, if you try solving $A x = \lambda x$ with $\lambda=0$, i.e. $Ax = 0$, you can clearly see that you'll have just 2 constraints on the solutions: $x_1 + x_2 + x_3 + x_4 = 0$ and $x_1 + x_2 + x_3 + 2x_4 = 0$, which allows for an eigenspace of dimensionality 2. You can choose arbitrary values for $x_1$ and $x_2$, then $x_3 = -(x_1 + x_2)$ and $x_4 = 0$.
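The same check for this rank-2 example (a sketch assuming numpy and scipy are available):

```python
import numpy as np
from scipy.linalg import null_space

A = np.array([[1, 1, 1, 1],
              [1, 1, 1, 2],
              [1, 1, 1, 2],
              [1, 1, 1, 2]], dtype=float)

print(np.linalg.matrix_rank(A))                # 2
print(null_space(A).shape[1])                  # 2-dimensional eigenspace for eigenvalue 0
print(A @ np.array([1.0, 0.0, -1.0, 0.0]))     # zero vector: x3 = -(x1 + x2), x4 = 0
print(A @ np.array([0.0, 1.0, -1.0, 0.0]))     # zero vector as well
```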

Preparations for the lemma: simultaneous diagonalization

Sometimes two lower-rank matrices can be simultaneously diagonalized.

Example: say, you have 2 matrices, $B_1 = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 2 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}$, $B_2 = \begin{pmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 3 & 0 \\ 0 & 0 & 0 & 4 \end{pmatrix}$.

For matrix $B_1$ the eigenvectors are $E_1 = (1, 0, 0, 0)^T$ with eigenvalue $\lambda_1 = 1$ and $E_2 = (0, 1, 0, 0)^T$ with eigenvalue $\lambda_2 = 2$; $\lambda_3 = \lambda_4 = 0$, and the corresponding eigenspace is $V = (0, 0, c_1, c_2)^T$, where $c_1, c_2$ are arbitrary values.

For matrix $B_2$ the eigenvectors are $E_1 = (0, 0, 1, 0)^T$ with eigenvalue $\lambda_1 = 3$ and $E_2 = (0, 0, 0, 1)^T$ with eigenvalue $\lambda_2 = 4$; $\lambda_3 = \lambda_4 = 0$, and the corresponding eigenspace is $V = (c_1, c_2, 0, 0)^T$, where $c_1, c_2$ are arbitrary values.

Thus, we can simultaneously diagonalize these two matrices, because the eigenvectors of matrix $B_1$ with non-zero eigenvalues lie in the zero-eigenvalue eigenspace of $B_2$ and vice versa.
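To see this in a less trivial basis, here is a small numpy sketch (the rotation and seed are arbitrary choices): rotate both matrices by the same orthogonal matrix, then recover a common eigenbasis. The rotated matrices commute, and $B_1 + B_2$ here has distinct eigenvalues, so its eigenvectors diagonalize both.

```python
import numpy as np

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))        # random orthogonal change of basis
B1 = Q @ np.diag([1.0, 2.0, 0.0, 0.0]) @ Q.T        # rank 2
B2 = Q @ np.diag([0.0, 0.0, 3.0, 4.0]) @ Q.T        # rank 2

assert np.allclose(B1 @ B2, B2 @ B1)                # they commute
_, E = np.linalg.eigh(B1 + B2)                      # joint eigenvectors (eigenvalues 1..4 are distinct)
print(np.round(E.T @ B1 @ E, 8))                    # diagonal
print(np.round(E.T @ B2 @ E, 8))                    # diagonal
```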

Now, we won't be using the following results below, but I'll still mention them. Simultaneous diagonalization of lower-rank matrices is NOT always possible in general. However, for two symmetric, positive-definite matrices of the same size there is always a way - via the Cholesky decomposition. This result is well-known because it has important practical applications in Lagrangian and quantum mechanics. In quantum mechanics two operators can be observed simultaneously if they commute (i.e. their eigenfunctions are the same, or they are simultaneously diagonalizable).

Proof of the lemma

Let us start with the case of two quadratic forms ($k = 2$). We start with:

$$X^T X = X^T B_1 X + X^T B_2 X$$

Do an eigendecomposition of $B_1$: $B_1 = E_1 \Lambda_1 E_1^T$, or $E_1^T B_1 E_1 = \Lambda_1$.

Note that as the rank of matrix $B_1$ is not full, it has an eigenspace corresponding to the 0 eigenvalue, where we can choose an arbitrary basis for the corresponding columns of $E_1$. Let us do this in such a way that the resulting matrix $E_1$ of eigenvectors is full-rank orthogonal (this is possible because $B_1$ is symmetric).

$$X^T X = X^T E_1 \Lambda_1 E_1^T X + X^T B_2 X$$

Now, denote $Y = E_1^T X$ and recall that $Y^T Y = X^T E_1 E_1^T X = X^T X$.

$$Y^TY = Y^T \Lambda_1 Y + X^T B_2 X,$$ or, equivalently, $$Y^TY = Y^T \Lambda_1 Y + Y^T E_1^T B_2 E_1 Y$$

Now, if $rank(\Lambda_1) = r_1$ (for instance, 2), we can re-arrange this as follows:

$$\begin{pmatrix} y_1 & y_2 & ... & y_n \end{pmatrix} \cdot \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \cdots & \cdots & \ddots & \cdots \\ 0 & 0 & \cdots & 1 \end{pmatrix} \cdot \begin{pmatrix} y_1 \\ y_2 \\ ... \\ y_n \end{pmatrix} = \begin{pmatrix} y_1 & y_2 & ... & y_n \end{pmatrix} \cdot \begin{pmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \cdots & \cdots & \ddots & \cdots \\ 0 & 0 & \cdots & 0 \end{pmatrix} \cdot \begin{pmatrix} y_1 \\ y_2 \\ ... \\ y_n \end{pmatrix} + Y^T E_1^T B_2 E_1 Y$$

Rearrange the terms to get:

$$\sum \limits_{i=1}^{r_1} (1-\lambda_i) y_i^2 + \sum \limits_{j=r_1+1}^{n} y_j^2 = Y^T E_1^T B_2 E_1 Y$$

Now, as we know that $rank(B_2) = r_2 = n - r_1$, we can come to the conclusion that $\lambda_1 = ... = \lambda_{r_1} = 1$.

Indeed, recall that $rank(AB) \leq \min(rank(A), rank(B))$. So, $rank(E_1^T B_2 E_1) \leq rank(B_2) = n - r_1$ (see wiki).

As a result we have:

$$\begin{pmatrix} y_1 & y_2 & ... & y_n \end{pmatrix} \cdot \begin{pmatrix} 1-\lambda_1 & 0 & \cdots & 0 \\ 0 & 1-\lambda_2 & \cdots & 0 \\ \cdots & \cdots & \ddots & \cdots \\ 0 & 0 & \cdots & 1 \end{pmatrix} \cdot \begin{pmatrix} y_1 \\ y_2 \\ ... \\ y_n \end{pmatrix} = \begin{pmatrix} y_1 & y_2 & ... & y_n \end{pmatrix} \cdot E_1^T B_2 E_1 \cdot \begin{pmatrix} y_1 \\ y_2 \\ ... \\ y_n \end{pmatrix}$$

There is only one way for the matrix $E_1^T B_2 E_1$ (and hence the left-hand side matrix) to have rank $n-r_1$: all the non-zero eigenvalues of $B_1$ should equal 1, $\lambda_1 = \lambda_2 = ... = \lambda_{r_1} = 1$.
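To see why the eigenvalues are forced to be 1, here is a toy numpy check (a sketch with arbitrary small values): if a non-zero eigenvalue of $B_1$ differed from 1 (say, equaled 2), then $B_2 = I - B_1$ would have rank larger than $n - r_1$, contradicting the assumption on the ranks.

```python
import numpy as np

n, r1 = 4, 1
good_B1 = np.diag([1.0, 0.0, 0.0, 0.0])   # rank 1, non-zero eigenvalue equal to 1
bad_B1 = np.diag([2.0, 0.0, 0.0, 0.0])    # rank 1, but non-zero eigenvalue equal to 2

print(np.linalg.matrix_rank(np.eye(n) - good_B1))  # 3 = n - r1, as the lemma requires
print(np.linalg.matrix_rank(np.eye(n) - bad_B1))   # 4 > n - r1, so such a split is impossible
```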

Now, what if the number of quadratic forms is greater than 2, e.g. $k=3$? The key observation for this case is the fact that rank is subadditive: $rank(A+B) \leq rank(A) + rank(B)$. So we can be sure that $B_2 + B_3$ is a matrix of rank no greater than $n-r_1$, and the argument above applies to the pair $B_1$ and $B_2 + B_3$. Hence, we can disregard the first $r_1$ coordinates $y_i$ and apply the same argument again to the remaining $y_j$.

Theorem proof

Now that we have proven the lemma, which constitutes the core of Cochran’s theorem, we can apply it to our random variables.

By analogy with the lemma, we apply an orthogonal transform $C$ to our random vector, $\bm{Y} = C \bm{X}$, so that our sum of quadratic forms takes the following form:

$$Q_1 = Y_1^2 + Y_2^2 + ... + Y_{r_1}^2$$

$$Q_2 = Y_{r_1+1}^2 + Y_{r_1+2}^2 + ... + Y_{r_1+r_2}^2$$

$$\cdots$$

$$Q_k = Y_{(r_1 + ... + r_{k-1})+1}^2 + Y_{(r_1 + ... + r_{k-1})+2}^2 + ... + Y_{n}^2$$

Let us now show that all the $Y_i^2$ random variables are independent. Recall that the covariance matrix of $\bm{X}$ is $\bm{\Sigma_X} = \begin{pmatrix} \sigma_1^2 & 0 & ... & 0 \\ 0 & \sigma_2^2 & ... & 0 \\ ... & ... & \ddots & ... \\ 0 & 0 & ... & \sigma_n^2 \end{pmatrix} = \sigma^2 I$, since $\sigma_1 = \sigma_2 = ... = \sigma_n = \sigma$.

Now, since $\bm{Y} = C \bm{X}$, where $C$ is an orthogonal matrix, the covariance matrix of $\bm{Y}$ is:

$$\bm{\Sigma_Y} = \mathrm{Cov}[Y, Y^T] = \mathrm{Cov}[C X, X^T C^T] = C \bm{\Sigma_X} C^T = C \sigma^2 I C^T = \sigma^2 I.$$

So, all $Y_i$ are independent identically distributed random variables (they are jointly normal and uncorrelated, hence independent).

Since every $Y_i^2$ occurs in exactly one $Q_j$ and the $Y_i$'s are all independent $\mathcal{N}(0, \sigma^2)$ random variables (because $C$ is an orthogonal matrix), Cochran's theorem follows.
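As a quick numerical aside (a sketch assuming numpy; the matrix $C$, the constants and the seed are arbitrary illustration choices): an orthogonal transform of a vector of i.i.d. $\mathcal{N}(0, \sigma^2)$ variables again has covariance $\sigma^2 I$, which is exactly what the argument above relies on.

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma = 4, 2.0
C, _ = np.linalg.qr(rng.normal(size=(n, n)))    # arbitrary orthogonal matrix
X = sigma * rng.standard_normal((200_000, n))   # rows are i.i.d. N(0, sigma^2 I) vectors
Y = X @ C.T                                     # each row is C @ x

print(np.round(np.cov(Y.T), 2))                 # ~ sigma^2 * I = 4 * I
```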

Example: Application to ANOVA

The total sum of squares (SSTO) can be split into two terms: $SSTO = \sum \limits_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum \limits_{i=1}^{n} (Y_i^2 - 2Y_i\bar{Y} + \bar{Y}^2) = \sum \limits_{i=1}^{n} Y_i^2 - 2\bar{Y} \cdot n \bar{Y} + n\bar{Y}^2 = \sum \limits_{i=1}^{n} Y_i^2 - n\bar{Y}^2 = \sum \limits_{i=1}^{n} Y_i^2 - \frac{(\sum Y_i)^2}{n}$.

Thus, $\sum \limits_{i=1}^{n} Y_i^2 = \sum \limits_{i=1}^{n} (Y_i - \bar{Y})^2 + \frac{(\sum Y_i)^2}{n}$.

Now, both terms of the sum can be represented in matrix notation as quadratic forms (written out below for $n = 3$):

$$\sum \limits_{i=1}^{n} Y_i^2 = \begin{pmatrix} Y_1 & Y_2 & Y_3 \end{pmatrix} \cdot \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} \cdot \begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \end{pmatrix},$$

$$\sum \limits_{i=1}^{n} (Y_i - \bar{Y})^2 = \begin{pmatrix} Y_1 & Y_2 & Y_3 \end{pmatrix} \cdot \left( \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} - \frac{1}{n} \begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix} \right) \cdot \begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \end{pmatrix},$$

$$\frac{(\sum Y_i)^2}{n} = \begin{pmatrix} Y_1 & Y_2 & Y_3 \end{pmatrix} \cdot \frac{1}{n} \begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix} \cdot \begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \end{pmatrix}$$
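Putting the pieces together numerically (a sketch assuming numpy and scipy, with standard normal $Y_i$; the sample size and seed are arbitrary): the two matrices have ranks $n-1$ and $1$, which sum to $n$, so by Cochran's theorem the two quadratic forms should be independent, with $\sum (Y_i - \bar{Y})^2 \sim \chi^2_{n-1}$ and $\frac{(\sum Y_i)^2}{n} \sim \chi^2_1$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 5
I, J = np.eye(n), np.ones((n, n))
print(np.linalg.matrix_rank(I - J / n), np.linalg.matrix_rank(J / n))  # n-1 and 1

Y = rng.standard_normal((100_000, n))                            # rows are samples of size n
ss_dev = ((Y - Y.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)  # sum (Y_i - Ybar)^2
ss_mean = Y.sum(axis=1) ** 2 / n                                 # (sum Y_i)^2 / n

print(stats.kstest(ss_dev, 'chi2', args=(n - 1,)).pvalue)   # consistent with chi2_{n-1}
print(stats.kstest(ss_mean, 'chi2', args=(1,)).pvalue)      # consistent with chi2_1
print(np.corrcoef(ss_dev, ss_mean)[0, 1])                   # ~0, hinting at independence
```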

Application to sample mean and sample variance

TODO; see wikipedia
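While this section is not written up yet, here is a minimal simulation sketch of the claim it refers to (assuming numpy; the sample size, number of repetitions and seed are arbitrary): for normal data, across many repeated samples the sample mean and the sample variance are uncorrelated, consistent with the independence that Cochran's theorem guarantees.

```python
import numpy as np

rng = np.random.default_rng(7)
samples = rng.standard_normal((200_000, 10))   # many samples of size 10
means = samples.mean(axis=1)                   # sample means
variances = samples.var(axis=1, ddof=1)        # unbiased sample variances

print(np.corrcoef(means, variances)[0, 1])     # ~0 for normal data
```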
