Pearson's Chi-square tests - intuition and derivation
June 17, 2021 21 min read
Here I discuss how an average mathematically inclined person like myself could stumble upon Karl Pearson's chi-squared test (it doesn't seem intuitive at all at first glance). I demonstrate the intuition behind it and then prove its applicability to the multinomial distribution.
In one of my previous posts I derived the chi-square distribution for the sum of squares of Gaussian random variables and showed that it is a special case of the Gamma distribution and very similar to the Erlang distribution. You can look it up for reference.
A motivational example
Suppose you've taken 5 samples of a random variable that you assumed to be standard normal and received suspicious results: you have a feeling that your dots have landed too far away from the peak of the distribution.
Given the probability density function (abbreviated pdf) of a Gaussian distribution and the knowledge that the standard deviation is 1, you would expect your samples to cluster much closer to the peak, something like this:
You could ask yourself a question: what is the average value of the probability density function that I would observe when sampling a point from a standard normal?
Well, you can actually calculate it, following the expectation formula:

$$\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} \underbrace{e^{-\frac{x^2}{2}}}_{\text{this is the variable you're averaging}} \cdot \underbrace{e^{-\frac{x^2}{2}}}_{\text{this is the pdf over which you're averaging}} \, dx = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} e^{-x^2} \, dx.$$
Since $\int_{-\infty}^{+\infty} e^{-x^2} \, dx = \sqrt{\pi}$, the probability density value of an average sampled point is expected to be $\frac{\sqrt{\pi}}{\sqrt{2\pi}} = \frac{1}{\sqrt{2}} \approx 0.70711$.
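If you'd like to double-check this value numerically, here is a minimal sketch (in Python with numpy/scipy, which the post itself doesn't use; the function names are mine) integrating the unnormalized density against the standard normal pdf:

```python
# Numerically verify E[e^(-xi^2/2)] = 1/sqrt(2) for xi ~ N(0, 1).
import numpy as np
from scipy import integrate

pdf = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)  # standard normal pdf
f = lambda x: np.exp(-x**2 / 2)                          # the quantity being averaged

avg, _ = integrate.quad(lambda x: f(x) * pdf(x), -np.inf, np.inf)
print(avg, 1 / np.sqrt(2))  # both ~0.70711
```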
Let us calculate the pdf for several points:
if $x=0$ (which is the most probable point), your $f_\xi(0) = e^{-0/2} = 1$, a bit more probable than average.
if $x=1$ (which is 1 standard deviation), your $f_\xi(1) = e^{-1/2} = 0.60653$, a bit less probable than average.
if $x=2$ (which is 2 standard deviations), your $f_\xi(2) = e^{-4/2} = 0.13533$, not much.
if $x=3$ (which is 3 standard deviations), your $f_\xi(3) = e^{-9/2} = 0.01111$, very small.
With these numbers let's return to the 5 points I've observed. Out of those five points, two points are at 3 standard deviations, two points are at 2 standard deviations and one point is at 1 standard deviation. So the probability density of such an observation is $f_\xi^2(3) \cdot f_\xi^2(2) \cdot f_\xi(1) = 0.01111^2 \cdot 0.13533^2 \cdot 0.60653 = 1.37 \cdot 10^{-6}$.
At the same time, the expected pdf of five average points is $0.70711^5 = 0.17678$. Seems like my observation was a really improbable one: it is less probable than average by over 100,000 times.
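For completeness, a short sketch (same assumed Python/numpy setup, with hypothetical variable names) that reproduces both numbers:

```python
import numpy as np

f = lambda x: np.exp(-x**2 / 2)       # unnormalized density values used above

samples = np.array([3, 3, 2, 2, 1])   # the five observed points, in standard deviations
print(np.prod(f(samples)))            # ~1.37e-6, the observed joint density
print((1 / np.sqrt(2))**5)            # ~0.17678, what five "average" points would give
```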
Intuition behind Pearson’s chi-square test
We need a more rigorous tool than just comparison of the probability of our observed five points with the average.
Let us call our vector of observations $X = (x_1, x_2, x_3, x_4, x_5)^T$. The combined pdf of observing exactly our 5 points is the 5-dimensional standard normal density $\frac{1}{\sqrt{2\pi}^5} e^{-\frac{x_1^2 + x_2^2 + x_3^2 + x_4^2 + x_5^2}{2}}$.
But note that we don't want exactly our point. Any "equiprobable" point will do. For instance, $e^{-\frac{1+1+1+1+1}{2}} = e^{-\frac{0+2+1+1+1}{2}}$, so the 5-dimensional points $(1,1,1,1,1)^T$ and $(0,\sqrt{2},1,1,1)^T$ are "equiprobable", and we want to group them into one.
So, we are actually interested in the distribution of the sum $x_1^2 + x_2^2 + x_3^2 + x_4^2 + x_5^2$, since for identical values of this sum the pdfs of the observation vectors $X = (x_1, x_2, x_3, x_4, x_5)^T$ are identical.
Each $x_i \sim \mathcal{N}(0,1)$ is a standard Gaussian-distributed random variable, so the sum in question is a chi-square-distributed random variable: $\sum_i^N x_i^2 \sim \chi^2_N$. Please refer to my older post on the Gamma/Erlang/Chi-square distributions for the proof.
That's why the chi-square distribution is the right tool to answer the question we're interested in: is it likely that the observed set of points was sampled from a standard normal distribution?
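As a practical illustration (a sketch, not part of the original post; it assumes scipy is available), the sum of squares of the five suspicious samples can be compared against the $\chi^2_5$ distribution directly:

```python
import numpy as np
from scipy.stats import chi2

x = np.array([3, 3, 2, 2, 1])      # the five suspicious samples from the example above
stat = np.sum(x**2)                # 9 + 9 + 4 + 4 + 1 = 27
print(chi2.sf(stat, df=5))         # tail probability P(chi2_5 >= 27) ~ 6e-5: very unlikely
```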
Derivation of Pearson’s goodness of fit test statistic
The chi-square test is widely used to validate the hypothesis that a number of samples were taken from a multinomial distribution.
Suppose you've rolled a $k=6$-sided dice $n=120$ times, and you expect it to be fair. You would expect $E_i = 20$ occurrences of each value $i \in \{1,2,3,4,5,6\}$ (row E, expected); instead you see somewhat different outcomes $O_i$ (row O, observed):
|   | 1  | 2  | 3  | 4  | 5  | 6  |
|---|----|----|----|----|----|----|
| E | 20 | 20 | 20 | 20 | 20 | 20 |
| O | 15 | 14 | 22 | 21 | 25 | 23 |
We need to estimate the likelihood of an outcome like this if the dice were fair. It turns out that a certain statistic based on these data follows the chi-square distribution: $\chi^2_{k-1} = \sum_i^k \frac{(O_i - E_i)^2}{E_i}$. I'll prove this fact here by induction on the number of dice sides $k$, loosely following some constructs from the de Moivre-Laplace theorem (which is a special case of the Central Limit Theorem, proven before the adoption of much more convenient Fourier analysis techniques).
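Before the derivation, here is a quick sketch of how this statistic looks for the dice table above (Python/scipy assumed; scipy's built-in chisquare is used only as a cross-check of the hand-rolled formula):

```python
import numpy as np
from scipy.stats import chi2, chisquare

observed = np.array([15, 14, 22, 21, 25, 23])
expected = np.full(6, 20)

stat = np.sum((observed - expected)**2 / expected)
print(stat)                          # 5.0
print(chi2.sf(stat, df=6 - 1))       # ~0.42, so no evidence that the dice is unfair
print(chisquare(observed, expected)) # scipy agrees with the manual computation
```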
Induction step: (k+1)-sided dice from k-sided dice
In order to perform the induction step, we need to show that if we had a $k$-sided dice described by a $k$-nomial distribution, to which the $\chi^2_{k-1}$ test applies, then a dice with $k+1$ sides constructed from it is described by a $(k+1)$-nomial distribution and can be validated with a $\chi^2_k$ test.
Here's an example to illustrate the process of creating an additional side of our dice.
Imagine that we had a continuous random variable, representing the distribution of human heights.
We transformed the real-valued random variable of height into a discrete one by dividing all people into one of two bins: "height < 170cm" and "height >= 170cm" with probabilities $p_1$ and $p_2$, so that $p_1 + p_2 = 1$.
The chi-square test works for our data according to the induction basis. We can view sampling a value from this r.v. as a roll of a 2-dice (or a coin flip).
Now we decided to split the second bin (“height >= 170cm”) into two separate bins: “170cm <= height < 180cm” and “height >= 180cm”.
So, what used to be one side of our 2-dice has become 2 sides, and now we have a 3-dice. Let's show that the chi-squared test will just get another degree of freedom (it's going to be $\chi^2_2$ instead of $\chi^2_1$ now), but will still work for our new 3-dice.
Let's write down the formula of Pearson's test for a 3-nomial distribution and split it into a binomial part and an additional term:
$$\sum_{i=1}^{3}\frac{(O_i - np_i)^2}{np_i} = \frac{(O_1 - np_1)^2}{np_1} + \frac{(O_2 - np_2)^2}{np_2} + \frac{(O_3 - np_3)^2}{np_3} = \underbrace{\frac{(O_1 - np_1)^2}{np_1} + \frac{(O_2 + O_3 - n(p_2 + p_3))^2}{n(p_2 + p_3)}}_{\sum_{j=1}^{2}\frac{(O'_j - np_j)^2}{np_j} \,\sim\, \chi^2_1 \text{ for a sum of } k=2 \text{ terms by the induction base}} + \underbrace{\left(-\frac{(O_2 + O_3 - n(p_2 + p_3))^2}{n(p_2 + p_3)} + \frac{(O_2 - np_2)^2}{np_2} + \frac{(O_3 - np_3)^2}{np_3}\right)}_{\text{this part should also be } \sim \chi^2_1 \text{, let's prove this}}$$
The random variable that we've received has a $\chi^2_1$ distribution because it is the square of the random variable $\xi = \frac{O_2 p_3 - O_3 p_2}{\sqrt{n p_2 p_3 (p_2 + p_3)}}$, which is a standard normal one.
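Before proving that $\xi$ is standard normal, here is a small numeric sanity check (a Python/numpy sketch of my own, with arbitrary probabilities) that the bracketed "extra" term above is indeed equal to $\xi^2$:

```python
# Check that the extra term of the decomposition equals xi^2 for arbitrary counts.
import numpy as np

rng = np.random.default_rng(0)
n, p = 1000, np.array([0.5, 0.3, 0.2])
counts = rng.multinomial(n, p)       # one draw of (O1, O2, O3)
O2, O3 = counts[1], counts[2]
p2, p3 = p[1], p[2]

extra = (-(O2 + O3 - n * (p2 + p3))**2 / (n * (p2 + p3))
         + (O2 - n * p2)**2 / (n * p2)
         + (O3 - n * p3)**2 / (n * p3))
xi = (O2 * p3 - O3 * p2) / np.sqrt(n * p2 * p3 * (p2 + p3))
print(np.isclose(extra, xi**2))      # True: the identity holds for any counts
```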
Let's show this fact: indeed, $O_2$ and $O_3$ are Gaussian r.v.s (by de Moivre-Laplace/C.L.T.) with expectations $\mathbb{E}[O_2] = np_2$ and $\mathbb{E}[O_3] = np_3$ and variances $\mathrm{Var}[O_2] = np_2(1-p_2)$ and $\mathrm{Var}[O_3] = np_3(1-p_3)$, respectively.
The sum of 2 Gaussian random variables is Gaussian, with expectation equal to the sum of expectations and variance equal to the sum of variances plus twice the covariance: $\sigma_{X+Y} = \sqrt{\sigma_X^2 + \sigma_Y^2 + 2\rho\sigma_X\sigma_Y}$. This fact can be proved using either convolutions or the Fourier transform (traditionally known as characteristic functions in probability theory).
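A tiny simulation sketch of that variance formula (the covariance matrix here is arbitrary; Python/numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
cov = np.array([[2.0, -0.7],
                [-0.7, 1.5]])                     # Var[X]=2.0, Var[Y]=1.5, Cov[X,Y]=-0.7
X, Y = rng.multivariate_normal([0, 0], cov, size=1_000_000).T

print(np.var(X + Y))                              # ~2.1, up to sampling noise
print(cov[0, 0] + cov[1, 1] + 2 * cov[0, 1])      # 2.0 + 1.5 + 2*(-0.7) = 2.1
```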
Now, how do we calculate the covariance Cov[O2,O3]?
We take a look at a single roll of our dice and consider the indicator Bernoulli random variables $o_2 = \begin{cases} 1, & \text{dice roll} = 2 \\ 0, & \text{dice roll} \neq 2 \end{cases}$ and $o_3 = \begin{cases} 1, & \text{dice roll} = 3 \\ 0, & \text{dice roll} \neq 3 \end{cases}$:
$$\mathrm{Cov}[o_2, o_3] = \mathbb{E}[(o_2 - \mathbb{E}o_2)(o_3 - \mathbb{E}o_3)] = \mathbb{E}[o_2 o_3] - 2\,\mathbb{E}o_2 \mathbb{E}o_3 + \mathbb{E}o_2 \mathbb{E}o_3 = \underbrace{\mathbb{E}[o_2 o_3]}_{=0,\text{ because } o_2 \text{ and } o_3 \text{ can never be } 1 \text{ at the same time}} - \underbrace{\mathbb{E}o_2 \mathbb{E}o_3}_{p_2 \cdot p_3} = -p_2 p_3.$$
For $n$ independent rolls the covariance scales linearly, so $\mathrm{Cov}[O_2, O_3] = n \, \mathrm{Cov}[o_2, o_3] = -np_2p_3$.
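A Monte Carlo sketch (Python/numpy, with made-up probabilities) confirming this covariance of the single-roll indicators:

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])                      # probabilities of sides 1, 2, 3
rolls = rng.choice(3, size=1_000_000, p=p)         # sides encoded as 0, 1, 2
o2, o3 = (rolls == 1).astype(float), (rolls == 2).astype(float)

print(np.cov(o2, o3)[0, 1])                        # ~ -0.06
print(-p[1] * p[2])                                # -p2 * p3 = -0.06
```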
Many thanks to Ming Zhang for finding an error in this post.
Written by Boris Burkov who lives in Moscow, Russia, loves to take part in development of cutting-edge technologies, reflects on how the world works and admires the giants of the past. You can follow me on Telegram.