Intro to the Extreme Value Theory and Extreme Value Distribution
April 30, 2023 176 min read
Quite often in mathematical statistics I run into the Extreme Value Distribution - an analogue of the Central Limit Theorem, which describes the distribution of the maximum/minimum observed in a series of i.i.d. random variables. This is an introductory text with the basic concepts and proofs of results from extreme value theory, such as the Generalized Extreme Value and Pareto distributions, the Fisher-Tippett-Gnedenko theorem, von Mises conditions, the Pickands-Balkema-de Haan theorem and their applications.
Contents:
- Problem statement and Generalized Extreme Value distribution
- Type I: Gumbel distribution
- Type II: Frechet distribution
- Type III: Inverse Weibull distribution
- Fisher-Tippett-Gnedenko theorem
- General approach: max-stable distributions as invariants/fixed points/attractors and EVD types as equivalence classes
- Khinchin’s theorem (Law of Convergence of Types)
- Necessary conditions of maximum stability
- Fisher-Tippett-Gnedenko theorem (Extreme Value Theorem)
- Distributions not in domains of attraction of any maximum-stable distributions
- Von Mises sufficient conditions for a distribution to belong to a type I, II or III
- Pre-requisites from survival analysis
- Von Mises conditions proof
- Generalizations of von Mises condition for Type I EVD: auxiliary function and von Mises function
- Necessary and sufficient conditions for a distribution to belong to a type I, II or III
- Pre-requisites from Karamata’s theory of slow/regular/Г-/П- variation
- Necessary and sufficient conditions of convergence to Types II or III EVD
- Necessary and sufficient conditions of convergence to Type I EVD
- Residual life time
- Generalized Pareto distribution
- Residual life time problem
- Pickands-Balkema-de Haan theorem (a.k.a. Second Extreme Value Theorem)
- Order statistics and parameter estimation
- Order statistics
- Hill’s estimator
- Pickands’ estimator
- Other estimators
- Summary and examples of practical application
- Examples of Type I Gumbel distribution
- Examples of Type II Frechet distribution
- Examples of Type III Inverse Weibull distribution
- Concluding remarks
1. Problem statement and Generalized Extreme Value distribution
One of the most famous results in probability theory is the Central Limit Theorem, which states that the sum of i.i.d. random variables, after centering and normalizing, converges to the Gaussian distribution.
Now, what if we ask a similar question about maximum of those i.i.d. random variables instead of sum? Does it converge to any distribution?
Turns out that it depends on the properties of the distribution of $\xi_i$, but not much really. Regardless of the distribution of $\xi_i$, the distribution of the (properly centered and normalized) maximum of $n$ random variables is:
$G_{\gamma}(x) = \exp(-(1 + \gamma x)^{-\frac{1}{\gamma}})$
This distribution is called Generalized Extreme Value Distribution. Depending on the sign of the coefficient $\gamma$ it can take one of three specific forms (a quick visual comparison is sketched right below):
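Before looking at each of the three types separately, here is a quick visual comparison. This is my own sketch, not part of the derivation; it assumes scipy is available and uses scipy.stats.genextreme, whose shape parameter c corresponds to $-\gamma$ in the notation above.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import genextreme
x = np.arange(-4, 6, 0.01)
fig, ax = plt.subplots(figsize=(12, 8), dpi=100)
# scipy's genextreme is parametrized by c = -gamma
for c, label in [(0.0, 'gamma = 0 (Gumbel)'), (-0.5, 'gamma = 0.5 (Frechet)'), (0.5, 'gamma = -0.5 (Inverse Weibull)')]:
    ax.plot(x, genextreme.cdf(x, c), label=label)
ax.legend()
ax.set_xlabel('x')
ax.set_ylabel('cdf')
ax.set_title('Generalized EVD cdf for different values of gamma')
plt.show()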
Type I: Gumbel distribution
If $\gamma \to 0$, we can assume that $\gamma = \frac{1}{n}$, where $n \to \infty$. Then generalized EVD converges to a doubly-exponential distribution (sometimes this is called a law of double logarithm) by the definitions of $e = \lim_{n \to \infty} (1 + \frac{1}{n})^{n}$ and $e^{-x} = \lim_{n \to \infty} (1 + \frac{x}{n})^{-n}$:
$G(x) = \lim_{n \to \infty} \exp(-(1 + \frac{x}{n})^{-n}) = \exp(-e^{-x})$.
This is the Gumbel distribution. It oftentimes occurs in various areas, e.g. in bioinformatics, where it describes the distribution of the longest series of successes in coin tosses in experiments of tossing a coin 100 times.
It is often parametrized by scale and center parameters. I will keep it centered here, but will add a scale parameter $\beta$:
$G(x) = e^{-e^{-x / \beta}}$, or, in a more intuitive notation, $G(x) = \frac{1}{e^{1 / e^{x / \beta}}}$.
It is straightforward to derive the probability density function from here:
$g(x) = G'(x) = \frac{1}{\beta} e^{-(\frac{x}{\beta} + e^{-x / \beta})}$.
import numpy as np
import matplotlib.pyplot as plt
scale = 1  # scale parameter beta
# Generate x values from -20 to 20 with a step size of 0.1
x = np.arange(-20, 20, 0.1)
# Calculate y values
gumbel_cdf = np.exp(-np.exp(-x / scale))
gumbel_pdf = (1 / scale) * np.exp(-(x / scale + np.exp(-x / scale)))
# Create the figure and axis objects
fig, ax = plt.subplots(figsize=(12,8), dpi=100)
# Plot cdf
ax.plot(x, gumbel_cdf, label='cdf')
# Plot pdf
ax.plot(x, gumbel_pdf, label='pdf')
# Set up the legend
ax.legend()
# Set up the labels and title
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_title('Plot of Gumbel pdf and cdf')
# Display the plot
plt.show()
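As a quick numerical sanity check (my own addition; the choice of the exponential distribution, the sample sizes and the random seed are arbitrary), the maximum of n standard exponential variables, shifted by ln n, lands almost exactly on the Gumbel cdf.
import numpy as np
import matplotlib.pyplot as plt
rng = np.random.default_rng(42)
n, n_trials = 1000, 10000
# maximum of n standard exponential variables, centered by log(n)
maxima = rng.exponential(size=(n_trials, n)).max(axis=1) - np.log(n)
# empirical cdf of the centered maxima vs the Gumbel cdf
xs = np.sort(maxima)
empirical_cdf = np.arange(1, n_trials + 1) / n_trials
gumbel_cdf = np.exp(-np.exp(-xs))
fig, ax = plt.subplots(figsize=(12, 8), dpi=100)
ax.plot(xs, empirical_cdf, label='empirical cdf of max - log(n)')
ax.plot(xs, gumbel_cdf, '--', label='Gumbel cdf')
ax.legend()
ax.set_title('Maximum of exponential variables converges to Gumbel')
plt.show()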
Type II: Frechet distribution
If $\gamma > 0$, let us denote $k = \frac{1}{\gamma}$ (k > 0). Then, up to centering and scaling, the distribution takes the shape:
$G(x) = e^{-(\frac{\beta}{x})^{\alpha}}$ for $x > 0$ (and $G(x) = 0$ for $x \le 0$), where $\alpha$ is called shape parameter and $\beta$ - scale parameter.
To make it more intuitive, I'll re-write cdf in the following way: $G(x) = \frac{1}{e^{(\beta / x)^{\alpha}}}$.
This is the Frechet distribution. It arises when the tail of the original cumulative distribution function is heavy, e.g. when it is a Pareto distribution.
Let us derive the probability density function for it:
$g(x) = G'(x) = \frac{\alpha}{\beta} \left(\frac{\beta}{x}\right)^{\alpha + 1} e^{-(\frac{\beta}{x})^{\alpha}}$.
Here is the plot:
import numpy as np
import matplotlib.pyplot as plt
shape = 2  # alpha
scale = 2  # beta
# Generate x values from 0.1 to 20 with a step size of 0.1 (start above 0 to avoid division by zero)
x = np.arange(0.1, 20, 0.1)
# Calculate y values
frechet_cdf = np.exp(-(scale / x) ** shape)
frechet_pdf = (shape / scale) * ((scale / x) ** (shape + 1)) * np.exp(-((scale / x) ** shape))
# Create the figure and axis objects
fig, ax = plt.subplots(figsize=(12,8), dpi=100)
# Plot cdf
ax.plot(x, frechet_cdf, label='cdf')
# Plot pdf
ax.plot(x, frechet_pdf, label='pdf')
# Set up the legend
ax.legend()
# Set up the labels and title
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_title('Plot of Frechet distribution pdf and cdf')
# Display the plot
plt.show()
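Here is a hedged simulation sketch of that claim (the Pareto example with alpha = 2, the sample sizes and the seed are my own choice): the maximum of n Pareto variables with survival function x^(-alpha), divided by n^(1/alpha), follows the Frechet cdf e^(-x^(-alpha)).
import numpy as np
import matplotlib.pyplot as plt
rng = np.random.default_rng(42)
alpha = 2
n, n_trials = 1000, 10000
# numpy's pareto is the Lomax distribution; adding 1 gives S(x) = x**(-alpha) for x >= 1
samples = rng.pareto(alpha, size=(n_trials, n)) + 1
maxima = samples.max(axis=1) / n ** (1 / alpha)
xs = np.sort(maxima)
empirical_cdf = np.arange(1, n_trials + 1) / n_trials
frechet_cdf = np.exp(-xs ** (-alpha))
fig, ax = plt.subplots(figsize=(12, 8), dpi=100)
ax.plot(xs, empirical_cdf, label='empirical cdf of max / n^(1/alpha)')
ax.plot(xs, frechet_cdf, '--', label='Frechet cdf, alpha = 2')
ax.legend()
ax.set_title('Maximum of Pareto variables converges to Frechet')
plt.show()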
Type III: Inverse Weibull distribution
If $\gamma < 0$, let us denote $k = -\frac{1}{\gamma}$ (k > 0; different kinds of behaviour of the density are observed at $0 < k < 1$, $k = 1$ and $k > 1$), $\alpha = k$.
Then, up to centering and scaling, the distribution takes the shape:
$G(x) = e^{-(-\frac{x}{\beta})^{\alpha}}$ for $x \le 0$ (and $G(x) = 1$ for $x > 0$), where $\alpha$ is the shape parameter and $\beta$ is the scale parameter.
This is the Inverse Weibull distribution. Its direct counterpart (the Weibull distribution) often occurs in survival analysis as a distribution with a power-law hazard rate function. It also arises in mining - there it describes the mass distribution of particles of size $x$ and is closely connected to the Pareto distribution. We shall discuss this connection later.
Generalized extreme value distribution converges to Inverse Weibull, when the distribution of our random variable $\xi$ is bounded from above. E.g. consider the uniform distribution $\xi \sim U(0, 1)$. It is clear that the maximum of $n$ uniformly distributed variables will be approaching 1 as $n \to \infty$. Turns out that the rate of this convergence is described by the Inverse Weibull distribution.
To make it more intuitive, we can re-write the cdf as $G(x) = \frac{1}{e^{(-x / \beta)^{\alpha}}}$.
Derive from the cumulative distribution function the probability density function:
$g(x) = G'(x) = \frac{\alpha}{\beta} \left(-\frac{x}{\beta}\right)^{\alpha - 1} e^{-(-\frac{x}{\beta})^{\alpha}}$.
Let us draw the plot:
import numpy as np
import matplotlib.pyplot as plt
shape = 2  # alpha
scale = 2  # beta
# Generate x values from -20 to 0 with a step size of 0.1
x = np.arange(-20, 0, 0.1)
# Calculate y values
inverse_weibull_cdf = np.exp(-(-x / scale) ** shape)
inverse_weibull_pdf = (shape / scale) * ((-x / scale) ** (shape - 1)) * np.exp(-((-x / scale) ** shape))
# Create the figure and axis objects
fig, ax = plt.subplots(figsize=(12,8), dpi=100)
# Plot cdf
ax.plot(x, inverse_weibull_cdf, label='cdf')
# Plot pdf
ax.plot(x, inverse_weibull_pdf, label='pdf')
# Set up the legend
ax.legend()
# Set up the labels and title
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_title('Plot of Inverse Weibull pdf and cdf')
# Display the plot
plt.show()
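Here is a quick simulation sketch of the uniform example mentioned above (the sample sizes and seed are my own choice): for xi ~ U(0, 1) the scaled maximum n(M_n - 1) follows the Type III cdf e^x for x <= 0, i.e. Inverse Weibull with alpha = 1, beta = 1.
import numpy as np
import matplotlib.pyplot as plt
rng = np.random.default_rng(42)
n, n_trials = 1000, 10000
# scaled maximum of n uniform variables: n * (M_n - 1) <= 0
maxima = n * (rng.uniform(size=(n_trials, n)).max(axis=1) - 1)
xs = np.sort(maxima)
empirical_cdf = np.arange(1, n_trials + 1) / n_trials
weibull_cdf = np.exp(xs)  # e^{-(-x)^alpha} with alpha = 1, beta = 1
fig, ax = plt.subplots(figsize=(12, 8), dpi=100)
ax.plot(xs, empirical_cdf, label='empirical cdf of n * (max - 1)')
ax.plot(xs, weibull_cdf, '--', label='Inverse Weibull cdf, alpha = 1')
ax.legend()
ax.set_title('Maximum of uniform variables converges to Inverse Weibull')
plt.show()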
2. Fisher-Tippett-Gnedenko theorem
Extreme Value Theorem is a series of theorems, proven in the first half of the 20th century. They claim that the maximum of several tosses of i.i.d. random variables converges to just one of 3 possible distributions: Gumbel, Frechet or Weibull.
Here I will lay out the outline of the proof with my comments. The proof includes introduction of several technical tools, but I will comment on their function and rationale behind each of them.
Consider a random variable $M_n$, which describes the distribution of the maximum of $\xi_i$, $i \in \{1, ..., n\}$:
$M_n = \max(\xi_1, ..., \xi_n)$.
Similarly to the Central Limit Theorem, a convergence theorem might be applicable to the distribution of a normalized version of $M_n$ rather than to $M_n$ itself:
We aim to show that for some series of constants $\{a_n > 0\}$ and $\{b_n\}$,
$\frac{M_n - b_n}{a_n}$ as $n \to \infty$ converges in distribution to some non-degenerate distribution $G(x)$: $P(\frac{M_n - b_n}{a_n} \le x) \to G(x)$.
Now I will informally describe the proof outline, before introducing the mathematical formalism.
General approach: max-stable distributions as invariants/fixed points/attractors and EVD types as equivalence classes
I assume that all three types of Extreme Value Distribution were first discovered experimentally. Later statisticians came up with a proof that EVD can converge to just one of three possible types of distributions and no other types of EVD can exist. Finally, they came up with criteria for a distribution to belong to each type.
Design of this proof is similar to many other proofs. I will outline it informally here:
Assume that as the number of random variables increases, approaching infinity, the distribution of the observed maximum approaches some type of distribution. Then such a distribution type can be considered an invariant, or attractor, or fixed point, similar to many other mathematical problems. For instance, eigenvectors are fixed points of matrix multiplication: a matrix's eigenvector, multiplied by that matrix, results in itself, multiplied by a scalar. Or, no matter how many times you take the derivative of $e^{kx}$, you get $e^{kx}$ back, multiplied by a scalar $k$.
Similarly, maximum-stable distributions are invariant objects. Those are distributions, the maximum of i.i.d. variables of which converges to themselves, no matter how many more i.i.d. random variables you toss. E.g. if for one Gumbel-distributed random variable we know that $P(\xi \le x) = e^{-e^{-x}}$, then for $n$ Gumbel-distributed random variables the maximum of $\xi_1, ..., \xi_n$ is still Gumbel-distributed (after centering and normalizing them by some numbers $a_n$, $b_n$): $P(\frac{\max(\xi_1, ..., \xi_n) - b_n}{a_n} \le x) = e^{-e^{-x}}$. The sketch below checks this numerically.
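In this little check (my own illustration; here $a_n = 1$, $b_n = \ln n$, and the sample sizes and seed are arbitrary), the maximum of n standard Gumbel variables, shifted by ln n, passes a Kolmogorov-Smirnov test against the standard Gumbel distribution.
import numpy as np
from scipy import stats
rng = np.random.default_rng(42)
n, n_trials = 100, 10000
# maximum of n standard Gumbel variables, re-centered with a_n = 1, b_n = log(n)
maxima = rng.gumbel(size=(n_trials, n)).max(axis=1) - np.log(n)
# compare with the standard Gumbel distribution
print(stats.kstest(maxima, 'gumbel_r'))  # large p-value: the maximum is still Gumbel-distributed
The shift by $\ln n$ works because $P(\max(\xi_1, ..., \xi_n) \le x) = e^{-n e^{-x}} = e^{-e^{-(x - \ln n)}}$.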
Ok. Then after we established that there are some distributions, for which maximum of centered and normalized i.i.d. variables produces a random variable with the same distribution, how do we show that all distributions converge to one of them?
We'll use another classical mathematical tool: equivalence classes and equivalence relation. For instance, odd numbers and even numbers form two equivalence classes under the operation of taking the remainder modulo 2. Odd numbers are equivalent to each other in terms of producing remainder 1 (e.g. $3 \sim 5$, where $\sim$ is the equivalence relation of equality modulo 2), and even numbers are equivalent in terms of producing remainder 0.
Similarly, we will show that types of EVD form equivalence classes under the operation of finding the maximum of i.i.d. random variables with any distribution, and as a result all distributions converge to one of those types. E.g. the Pareto distribution is equivalent to the Cauchy distribution under the equivalence relation of the maxima of Pareto/Cauchy i.i.d.'s converging to the same max-stable type II (Frechet) EVD; see the simulation sketch below.
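Here is a hedged simulation sketch of that Pareto/Cauchy example (the sample sizes, seed and the alpha = 1 choice are mine): the maxima of standard Cauchy and of Pareto(alpha = 1) samples, both divided by n, land on Frechet curves of the same type, e^(-1/(pi*x)) and e^(-1/x), which differ only by a scale factor.
import numpy as np
import matplotlib.pyplot as plt
rng = np.random.default_rng(42)
n, n_trials = 1000, 10000
cauchy_maxima = rng.standard_cauchy(size=(n_trials, n)).max(axis=1) / n
pareto_maxima = (rng.pareto(1, size=(n_trials, n)) + 1).max(axis=1) / n
fig, ax = plt.subplots(figsize=(12, 8), dpi=100)
for maxima, scale, label in [(cauchy_maxima, 1 / np.pi, 'Cauchy'), (pareto_maxima, 1, 'Pareto, alpha = 1')]:
    xs = np.sort(maxima)
    ax.plot(xs, np.arange(1, n_trials + 1) / n_trials, label=f'empirical cdf of max / n, {label}')
    ax.plot(xs, np.exp(-scale / xs), '--', label=f'Frechet cdf, scale = {scale:.3f}')
ax.set_xlim(0, 20)
ax.legend()
ax.set_title('Cauchy and Pareto maxima converge to the same Frechet type')
plt.show()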
Now that I’ve laid out the plan of the proof, it is time to get into technicalities. I will formally introduce the concepts I mentioned above and prove some lemmas about their relatedness.
Definition 2.1: Max-stable cumulative distribution function
$G$ is max-stable if for all $n \in \mathbb{N}$ there exist $a_n > 0$, $b_n \in \mathbb{R}$ such that for all $x \in \mathbb{R}$: $G(x) = G^n(a_n x + b_n)$.
Definition 2.2: Domain of attraction
If $G$ is a non-degenerate cdf, then $F$ is in the domain of attraction (for maxima) of $G$, and it is written $F \in \mathcal{D}(G)$, when there exist sequences $\{a_n > 0\}$, $\{b_n\}$ such that $F^n(a_n x + b_n) \to G(x)$.
Definition 2.3: Type of convergence
If $G^*$ is another non-degenerate cdf, we say that $G$ and $G^*$ have the same type if there exist $a > 0$ and $b \in \mathbb{R}$ such that for every x ∈ R: $G^*(x) = G(a x + b)$.
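As a simple worked example of this definition (my own addition, not from the original text): all Gaussian cdfs are of the same type, because each of them is an affine re-scaling of the standard normal cdf $\Phi$:
$F_{\mu, \sigma}(x) = \Phi\left(\frac{x - \mu}{\sigma}\right) = \Phi(a x + b)$, where $a = \frac{1}{\sigma} > 0$ and $b = -\frac{\mu}{\sigma}$.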
Khinchin’s theorem (Law of Convergence of Types)
Lemma 2.1: Khinchin’s theorem (law of Convergence of Types)
Suppose that we have a sequence of distribution functions $\{F_n\}$ (e.g. the distributions of the maximum of a random variable $\xi$ in $n$ experiments).
Let those distribution functions upon $n \to \infty$ converge to a certain distribution $G(x)$: $F_n(a_n x + b_n) \to G(x)$, where $\{a_n\}$ and $\{b_n\}$ are two series of normalizing constants.
Suppose there is another non-degenerate distribution function $H(x)$ such that the sequence of distributions converges to that function with a different pair of series $\{\alpha_n\}$, $\{\beta_n\}$: $F_n(\alpha_n x + \beta_n) \to H(x)$.
Then $\frac{\alpha_n}{a_n} \to A > 0$, $\frac{\beta_n - b_n}{a_n} \to B$ and $H(x) = G(Ax + B)$.
Proof:
Consider two distribution functions $G(x)$ and $H(x)$, such that for every $x$: $F_n(a_n x + b_n) \to G(x)$ and $F_n(\alpha_n x + \beta_n) \to H(x)$.
Denote $y_n(x) = a_n x + b_n$. Then $F_n(y_n(x)) \to G(x)$ and $x = \frac{y_n(x) - b_n}{a_n}$.
Similarly, denote $z_n(x) = \alpha_n x + \beta_n$, so that $F_n(z_n(x)) \to H(x)$ and $x = \frac{z_n(x) - \beta_n}{\alpha_n}$.
Now choose two points: $\tilde{x}_1$, corresponding to $x_1$ (i.e. such that $H(\tilde{x}_1) = G(x_1)$, so that $z_n(\tilde{x}_1)$ and $y_n(x_1)$ are asymptotically the same quantile of $F_n$: $z_n(\tilde{x}_1) - y_n(x_1) = o(a_n)$), and $\tilde{x}_2$, corresponding to $x_2$, and subtract the relations for the two points from each other:
$\alpha_n (\tilde{x}_2 - \tilde{x}_1) - a_n (x_2 - x_1) = o(a_n)$
Apply the same for the pair $(\tilde{x}_1, x_1)$ alone:
$\alpha_n \tilde{x}_1 + \beta_n - (a_n x_1 + b_n) = o(a_n)$
Which results in $\frac{\alpha_n}{a_n} \to \frac{x_2 - x_1}{\tilde{x}_2 - \tilde{x}_1} = A > 0$.
Substitute $\frac{\alpha_n}{a_n} \to A$ into the second relation and divide it by $a_n$: $\frac{\beta_n - b_n}{a_n} \to x_1 - A \tilde{x}_1 = B$.
On the other hand, we recall that $F_n(\alpha_n x + \beta_n) \to H(x)$, while $\alpha_n x + \beta_n = a_n (\frac{\alpha_n}{a_n} x + \frac{\beta_n - b_n}{a_n}) + b_n$, and the normalized argument $\frac{\alpha_n}{a_n} x + \frac{\beta_n - b_n}{a_n} \to A x + B$. Subtracting these, we get: $F_n(\alpha_n x + \beta_n) \to G(Ax + B)$ or $H(x) = G(Ax + B)$.
Hence, $G$ and $H$ are of the same type, with constants $A$ and $B$ as stated.
Lemma 2.2: Necessary condition of maximum-stability
Given G a non-degenerate cdf:
- G is max-stable if and only if there exists a sequence of cdf's $\{F_n\}$ and sequences $\{a_n\}$, $a_n > 0$, $\{b_n\}$, such that for all $k \in \mathbb{N}$: $F_n(a_{nk} x + b_{nk}) \to G^{1/k}(x)$ as $n \to \infty$
- $\mathcal{D}(G) \neq \varnothing$ if and only if $G$ is max-stable. In that case, $G \in \mathcal{D}(G)$.
Proof:
Proposition 1 direct statement: if $G$ is max-stable, there exists a sequence $\{F_n\}$ such that $F_n(a_{nk} x + b_{nk}) \to G^{1/k}(x)$
If $G$ is max-stable, then by definition for every $n$ there exist $a_n$, $b_n$, such that $G^n(a_n x + b_n) = G(x)$.
Define $F_n = G^n$. Then $F_n(a_{nk} x + b_{nk}) = G^n(a_{nk} x + b_{nk}) = \left(G^{nk}(a_{nk} x + b_{nk})\right)^{1/k} = G^{1/k}(x)$. We arrive at the direct statement.
Proposition 1 reverse statement: if there exists a sequence $\{F_n\}$ such that $F_n(a_{nk} x + b_{nk}) \to G^{1/k}(x)$, then $G$ is max-stable
Let us prove the reverse statement: suppose that the sequences $\{F_n\}$, $\{a_n\}$, $\{b_n\}$ exist, such that for all $k \in \mathbb{N}$:
$F_n(a_{nk} x + b_{nk}) \to G^{1/k}(x)$ as $n \to \infty$
Then consider $k = 1$ and $k = 2$:
$F_n(a_n x + b_n) \to G(x)$ and $F_n(a_{2n} x + b_{2n}) \to G^{1/2}(x)$
By Khinchin's lemma there exist $\alpha_2 > 0$, $\beta_2$ such that $G^{1/2}(x) = G(\alpha_2 x + \beta_2)$.
Similarly, for every other $k$: $G^{1/k}(x) = G(\alpha_k x + \beta_k)$ or $G(x) = G^k(\alpha_k x + \beta_k)$, which is the definition of max-stability.
Proposition 2 direct statement: if $G$ is max-stable, then $\mathcal{D}(G) \neq \varnothing$
The proof is self-evident: if G is max-stable, $G^n(a_n x + b_n) = G(x)$, and $G \in \mathcal{D}(G)$ by definition.
Proposition 2 reverse statement: if $\mathcal{D}(G) \neq \varnothing$, then $G$ is max-stable
Assume $F \in \mathcal{D}(G)$, i.e. $F^n(a_n x + b_n) \to G(x)$.
For all $k \in \mathbb{N}$ we have $F^{nk}(a_{nk} x + b_{nk}) \to G(x)$.
Hence, $F^n(a_{nk} x + b_{nk}) \to G^{1/k}(x)$.
This makes $F_n = F^n$ and $\{a_{nk}\}$, $\{b_{nk}\}$ fit the conditions of the previous result, proving that $G$ is max-stable.
Corollary 2.1:
Let $G$ be a max-stable cdf. Then there exist functions $a(s) > 0$ and $b(s)$ such that for all $x \in \mathbb{R}$, for all $s > 0$: $G^s(a(s) x + b(s)) = G(x)$.
The corollary is self-evident from inversion of indices: max-stability gives $G(x) = G^n(a_n x + b_n)$, i.e. $G^{1/n}(x) = G(a_n x + b_n)$, and these relations extend from integer $n$ and $\frac{1}{n}$ to all real $s > 0$.
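To make Corollary 2.1 a bit more concrete, here is a quick check (my own worked example) of what the functions $a(s)$ and $b(s)$ look like for the Gumbel and Frechet distributions:
For Gumbel, $G(x) = e^{-e^{-x}}$: $G^s(x + \ln s) = e^{-s e^{-x - \ln s}} = e^{-e^{-x}} = G(x)$, so $a(s) = 1$ and $b(s) = \ln s$.
For Frechet, $G(x) = e^{-x^{-\alpha}}$: $G^s(s^{1/\alpha} x) = e^{-s (s^{1/\alpha} x)^{-\alpha}} = e^{-x^{-\alpha}} = G(x)$, so $a(s) = s^{1/\alpha}$ and $b(s) = 0$.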
Fisher-Tippett-Gnedenko theorem (Extreme Value Theorem)
(Portraits: Sir Ronald Aylmer Fisher, Leonard Henry Caleb Tippett, Boris Vladimirovich Gnedenko.)
Theorem 2.1: Fisher-Tippett-Gnedenko theorem (Extreme Value Theorem)
Let $\xi_1, \xi_2, ..., \xi_n$ be a sequence of i.i.d. random variables and $M_n = \max(\xi_1, ..., \xi_n)$.
If there exist constants $a_n > 0$ and $b_n \in \mathbb{R}$ and some non-degenerate cumulative distribution function $G$ such that $P(\frac{M_n - b_n}{a_n} \le x) \to G(x)$, then $G$ is one of these:
(Type I) Gumbel: $G(x) = e^{-e^{-x}}$, $x \in \mathbb{R}$,
(Type II) Frechet: $G(x) = e^{-x^{-\alpha}}$ for $x > 0$ ($G(x) = 0$ for $x \le 0$), $\alpha > 0$,
(Type III) Inverse Weibull: $G(x) = e^{-(-x)^{\alpha}}$ for $x \le 0$ ($G(x) = 1$ for $x > 0$), $\alpha > 0$.
Proof
Here we give the proof of the Fisher-Tippett-Gnedenko theorem without introducing any additional pre-requisites and intermediate constructs. Because of that it might look like black magic now; it is not clear how anyone could have come up with this proof.
However, later on in parts 3 and 4 we will give the definitions of tail quantile function and tools from Karamata’s theory of slow/regular variation.
If you revisit this proof afterwards, you will notice that we're making use of those tools without naming them explicitly.
Step 1.
Consider the double negative logarithm of the max-stable distribution $G^s(a(s) x + b(s)) = G(x)$:
$-\ln(-\ln(G^s(a(s) x + b(s)))) = -\ln(-s \ln(G(a(s) x + b(s)))) = -\ln(-\ln(G(a(s) x + b(s)))) - \ln s = -\ln(-\ln(G(x)))$
Step 2.
Denote $\phi(x) = -\ln(-\ln(G(x)))$. Then from the previous step $\phi(a(s) x + b(s)) - \ln s = \phi(x)$.
Step 3.
Denote $y = \phi(x)$. Apply $\phi^{-1}$ to both sides of $\phi(a(s) x + b(s)) = y + \ln s$. We get: $a(s) \phi^{-1}(y) + b(s) = \phi^{-1}(y + \ln s)$.
Step 4.
Note that the same relation holds at $y = 0$: $a(s) \phi^{-1}(0) + b(s) = \phi^{-1}(\ln s)$. Subtract it from both sides:
$a(s) (\phi^{-1}(y) - \phi^{-1}(0)) = \phi^{-1}(y + \ln s) - \phi^{-1}(\ln s)$
Step 5.
Substitute variables: $U(y) = \phi^{-1}(y) - \phi^{-1}(0)$, $z = \ln s$, $\tilde{a}(z) = a(e^z)$. Then:
$U(y + z) - U(z) = U(y) \tilde{a}(z)$
Step 6.
We can swap $y$ and $z$ in the previous equation, setting $y := z$ and $z := y$: $U(z + y) - U(y) = U(z) \tilde{a}(y)$.
After that subtract the second equation from the first:
$U(y) - U(z) = U(y) \tilde{a}(z) - U(z) \tilde{a}(y)$
Here we consider two cases.
Step 7a.
If $\tilde{a}(z) \equiv 1$, the previous equation turns into the trivial identity $U(y) - U(z) = U(y) - U(z)$ and gives us nothing new. But then let's substitute $\tilde{a}(z) = 1$ into the result of step 5:
$U(y + z) = U(y) + U(z)$
This means that $U$ is additive and, being monotone, linear: $U(y) = \rho y$. Denoting $\nu = \phi^{-1}(0)$, so that $\phi^{-1}(y) = \rho y + \nu$ and $\phi(x) = \frac{x - \nu}{\rho}$, we get:
$G(x) = e^{-e^{-\phi(x)}} = e^{-e^{-\frac{x - \nu}{\rho}}}$, which is Gumbel (Type I) EVD.
Step 7b.
If $\tilde{a}(z) \not\equiv 1$:
$U(y) (1 - \tilde{a}(z)) = U(z) (1 - \tilde{a}(y))$, i.e. $\frac{U(y)}{1 - \tilde{a}(y)} = \frac{U(z)}{1 - \tilde{a}(z)} = c$ for all $y, z$, so that $U(y) = c (1 - \tilde{a}(y))$.
Now recall that $U(y + z) - U(z) = U(y) \tilde{a}(z)$ and substitute $U(y) = c (1 - \tilde{a}(y))$ there:
$c (1 - \tilde{a}(y + z)) - c (1 - \tilde{a}(z)) = c (1 - \tilde{a}(y)) \tilde{a}(z)$
This leads us to the equation $\tilde{a}(y + z) = \tilde{a}(y) \tilde{a}(z)$, which, upon monotonous $\tilde{a}$, has a solution $\tilde{a}(y) = e^{\rho y}$, $\rho \neq 0$. Hence:
$U(y) = c (1 - e^{\rho y})$, where $c \neq 0$.
Now recall that $U(y) = \phi^{-1}(y) - \phi^{-1}(0)$, and, denoting $\nu = \phi^{-1}(0)$, we get: $\phi^{-1}(y) = c (1 - e^{\rho y}) + \nu$, so that $e^{-\phi(x)} = \left(1 - \frac{x - \nu}{c}\right)^{-1/\rho}$. Hence:
$G(x) = e^{-e^{-\phi(x)}} = e^{-\left(1 - \frac{x - \nu}{c}\right)^{-1/\rho}}$, which (depending on the signs of $\rho$ and $c$) is either a Frechet (Type II), or an Inverse Weibull (Type III) EVD.
Distributions not in domains of attraction of any maximum-stable distributions
We've shown that if the maximum of n i.i.d. random variables of a given distribution converges to some max-stable distribution, then that limit is of one of the 3 described types. However, the maximum might not converge to any max-stable distribution at all.
For instance, the Poisson distribution and the Geometric distribution do not converge to any type of Extreme Value Distribution. To show this we will need many more tools in our toolbox; the corresponding theorem will be proven at the end of section 4.
3. Von Mises sufficient conditions for a distribution to belong to a type I, II or III
The Fisher-Tippett-Gnedenko theorem is an important theoretical result, but it does not provide an answer to the basic question: what type of EVD does our distribution function belong to?
Fortunately, there are two sets of criteria that let us determine the domain of attraction of a given distribution function $F$. First, there are von Mises conditions, which are sufficient, but not necessary. Still, they are more intuitive and give a good insight into which kinds of distributions converge to which types of EVD and why. Second, there are general sufficient and necessary conditions. Proving them is a much more technical task and requires some extra preliminaries.
We will start with von Mises conditions, postulated by Richard von Mises in 1936, 7 years before Fisher-Tippett-Gnedenko theorem was proved by Boris Gnedenko in 1943. Von Mises conditions are formulated in terms of survival analysis. We shall introduce some basic notions from survival analysis first.
Pre-requisites from survival analysis
Definition 3.1: Survival function
Survival function $S(t)$ is the complement of the cumulative distribution function $F(t)$: $S(t) = 1 - F(t)$.
Basically, if our random variable's value represents a human lifetime, the cumulative distribution function $F(t)$ represents the fraction of people who have died by the time $t$.
The survival function, on the contrary, is the fraction of people who are still alive at the time $t$.
Proposition 3.1: integral of survival function equals average life expectancy
$\mathbb{E}[\xi] = \int_0^{\infty} S(t) dt$
Basically, rotate the survival function plot by 90 degrees to see that this integral is the expectation of lifetime (just swap the x and y axes and it becomes obvious).
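For a non-negative random variable $\xi$ with density $f$, this also follows from changing the order of integration (a standard derivation, added here for completeness):
$\int_0^{\infty} S(t) \, dt = \int_0^{\infty} \int_t^{\infty} f(u) \, du \, dt = \int_0^{\infty} \left( \int_0^{u} dt \right) f(u) \, du = \int_0^{\infty} u f(u) \, du = \mathbb{E}[\xi]$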
Definition 3.2: Survival function end point
We shall denote the end point of the survival function $x_F = \sup\{x : F(x) < 1\}$. It is also sometimes denoted $\omega(F)$.
Basically, $x_F$ is the smallest point $x$, where the survival function $S(x)$ becomes exactly 0. For instance, if we're studying the survival of humans, and there are known survivors at the age of 128, but everybody dies by the age of 129 years, $x_F = 129$.
If there is no such limit (e.g. the population dies out exponentially, $S(t) = e^{-t}$, or polynomially, $S(t) = \frac{1}{t}$), we say that $x_F = \infty$.
Definition 3.3: Tail quantile function
Tail quantile function of $n$ is the smallest time $t$, when the fraction of survivors becomes smaller than $\frac{1}{n}$:
$\gamma(n) = \inf \{ t : S(t) \le \frac{1}{n} \}$
For instance, the tail quantile function of 10 is the time, when 1/10 of the population is still alive.
Lemma 3.1: convergence of tail quantile function to exponent
Consider a sequence of data points $\{x_n\}$, such that $n S(x_n) \to \tau$ as $n \to \infty$ (for instance, $x_n = \gamma(\frac{n}{\tau})$, the values of the tail quantile function, for which $S(x_n) = \frac{\tau}{n}$ exactly).
Then $F^n(x_n) \to e^{-\tau}$.
Proof:
$F^n(x_n) = (1 - S(x_n))^n = (1 - \frac{\tau}{n} + o(\frac{1}{n}))^n \to e^{-\tau}$ (last equality by definition of exponent)
Definition 3.4: Hazard rate
Hazard rate $r(t)$ in the same context of survival analysis is your chance of dying at the time $t$, given that you have survived up to it.
Basically, what are your chances to die at 64, if you're an average person? It is the ratio of the number of people who died aged 64 to the number of people who survived until 64. In mathematical terms it is the ratio of probability density function to survival function:
$r(t) = \frac{f(t)}{S(t)}$
Definition 3.5: Cumulative hazard rate
Cumulative hazard rate $R(t) = \int_0^t r(u) du$ is the integral of the hazard rate over some period of time.
Cumulative hazard rate is basically the number of times you avoided death by now. Suppose you're a train robber in the Wild West. At your first robbery your chance of being killed (hazard rate) is 1/2. Then you get more experienced and at the second and third times your hazard rate is 1/3 and 1/4. If you survived 3 robberies, your cumulative hazard rate equals 1/2 + 1/3 + 1/4 = 13/12. Basically, you "deserved" more than 1 death by now and are lucky to still be alive.
Proposition 3.2. Cumulative hazard rate relation to survival function
$R(t) = \int_0^t r(u) du = \int_0^t \frac{f(u)}{S(u)} du = -\int_0^t \frac{dS(u)}{S(u)} = -\ln S(t)$, hence $S(t) = e^{-R(t)}$.
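A quick sanity check of this relation on the exponential distribution (my own example): for $S(t) = e^{-\lambda t}$ the hazard rate is constant and the cumulative hazard rate is indeed $-\ln S(t)$:
$r(t) = \frac{f(t)}{S(t)} = \frac{\lambda e^{-\lambda t}}{e^{-\lambda t}} = \lambda$, so $R(t) = \int_0^t \lambda \, du = \lambda t = -\ln S(t)$.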
Von Mises conditions proofs
Theorem 3.1: Von Mises sufficient condition for a distribution to belong to type II (Frechet) EVD
If a distribution function $F$ has an infinite end point $x_F = \infty$ and $\lim_{t \to \infty} t \cdot r(t) = \alpha > 0$, then the distribution $F$ belongs to the domain of attraction of type II (Frechet) EVD.
Proof:
Speaking informally, what we aim to show is that if the hazard rate function $r(t)$ basically behaves as a hyperbolic function $\frac{\alpha}{t}$ as $t \to \infty$ (i.e. the survival function has a fat tail, decreasing much slower than $e^{-t}$), the corresponding cumulative distribution function $F$ is in the domain of attraction of Frechet (type II) EVD.
I will drop the indices under $\xi_i$, $F_{\xi_i}$ and $S_{\xi_i}$ and will just write $\xi$, $F$, $S$ in the context of the random variable in question.
We start the proof by recalling the connection between the cumulative hazard rate function $R(t)$ and the survival function $S(t)$:
$-\ln S(t) = R(t) = \int_0^t r(u) du$
Exponentiation of both sides gets us:
$S(t) = e^{-\int_0^t r(u) du}$, so that for the ratio of survival functions at two points $\frac{S(tx)}{S(t)} = e^{-\int_t^{tx} r(u) du}$.
Recalling that $r(u) \to \frac{\alpha}{u}$ upon $u \to \infty$ by the conditions of the theorem and $\int_t^{tx} \frac{\alpha}{u} du = \alpha \ln x$:
$\frac{S(tx)}{S(t)} \to e^{-\alpha \ln x} = x^{-\alpha}$ as $t \to \infty$.
Now take $t = \gamma(n)$ (i.e. such a point in time, where the survival function $S(\gamma(n)) = \frac{1}{n}$; we just expressed this through the tail quantile function $\gamma$) and $tx = \gamma(n) x$ and substitute it into the previous line:
$\frac{S(\gamma(n) x)}{S(\gamma(n))} \to x^{-\alpha}$ and $S(\gamma(n)) = \frac{1}{n}$
In other words $n S(\gamma(n) x) \to x^{-\alpha}$ or, by Lemma 3.1, $F^n(\gamma(n) x) \to e^{-x^{-\alpha}}$ or $P(\frac{M_n}{\gamma(n)} \le x) \to e^{-x^{-\alpha}}$.
We've just shown that the random variable $\frac{M_n}{\gamma(n)}$ converges to Frechet Type II EVD, with $a_n = \gamma(n)$ and $b_n = 0$.
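As a worked example of this condition (my own addition): for the Pareto distribution with survival function $S(t) = t^{-\alpha}$, $t \ge 1$, the product $t \cdot r(t)$ equals $\alpha$ not just in the limit, but exactly, so Pareto is in the domain of attraction of the Frechet EVD:
$r(t) = \frac{f(t)}{S(t)} = \frac{\alpha t^{-\alpha - 1}}{t^{-\alpha}} = \frac{\alpha}{t}$, hence $\lim_{t \to \infty} t \cdot r(t) = \alpha$.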
Theorem 3.2: Von Mises sufficient condition for a distribution to belong to type III (Inverse Weibull) EVD
If a distribution function $F$ has a finite end point $x_F < \infty$ and $\lim_{x \to x_F} (x_F - x) \cdot r(x) = \alpha > 0$, then the distribution $F$ belongs to type III (Inverse Weibull).
Proof:
If our original random variable $\xi$ had a finite upper end point $x_F$, let us consider a derived random variable $\eta = \frac{1}{x_F - \xi}$.
$\eta$ approaches $+\infty$ as $\xi$ approaches its upper end point $x_F$, and $\eta$ approaches $0+$ as $\xi$ approaches $-\infty$.
Let us look at the connection between the c.d.f.s of $\eta$ and $\xi$:
$F_{\eta}(x) = P(\eta \le x) = P(\frac{1}{x_F - \xi} \le x) = P(\xi \le x_F - \frac{1}{x}) = F_{\xi}(x_F - \frac{1}{x})$.
Basically, with $\eta$ we created a mapping of $\xi$ onto the $(0, +\infty)$ domain. Suppose that the random variable $\eta$ fits the conditions of Theorem 3.1:
$\lim_{x \to \infty} x \cdot r_{\eta}(x) = \lim_{x \to \infty} x \frac{f_{\eta}(x)}{S_{\eta}(x)} = \lim_{x \to \infty} x \cdot \frac{\frac{1}{x^2} f_{\xi}(x_F - \frac{1}{x})}{S_{\xi}(x_F - \frac{1}{x})} = \lim_{x \to \infty} \frac{1}{x} \cdot \frac{f_{\xi}(x_F - \frac{1}{x})}{S_{\xi}(x_F - \frac{1}{x})} = \alpha$
Denote $u = x_F - \frac{1}{x}$, note that $\frac{1}{x} = x_F - u$ and substitute this into the previous result:
$\lim_{u \to x_F} (x_F - u) \frac{f_{\xi}(u)}{S_{\xi}(u)} = \lim_{u \to x_F} (x_F - u) r_{\xi}(u) = \alpha$
We came to the expression in the conditions of our theorem exactly: $\eta$ satisfies the conditions of Theorem 3.1 if and only if $\xi$ satisfies the conditions of the current theorem.
I.e. if and only if the conditions of this theorem are satisfied, $\eta$ is in the domain of attraction of Type II (Frechet) EVD, and, undoing the mapping via $\xi = x_F - \frac{1}{\eta}$, $\xi$ itself is in the domain of attraction of Type III (Inverse Weibull) EVD.
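A worked example of this condition (my own addition): for the uniform distribution on $[0, 1]$ the end point is $x_F = 1$ and the condition holds with $\alpha = 1$, which is consistent with the Type III limit we observed for maxima of uniform variables in section 1:
$r(x) = \frac{f(x)}{S(x)} = \frac{1}{1 - x}$, hence $\lim_{x \to x_F} (x_F - x) r(x) = \lim_{x \to 1} \frac{1 - x}{1 - x} = 1$.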
Theorem 3.3: Von Mises sufficient condition for a distribution to belong to type I (Gumbel) EVD
If a distribution function