
Intro to the Extreme Value Theory and Extreme Value Distribution

April 30, 2023 167 min read

Quite often in mathematical statistics I run into the Extreme Value Distribution, an analogue of the Central Limit Theorem, which describes the distribution of the maximum/minimum observed in a series of i.i.d. random variable tosses. This is an introductory text with the basic concepts and proofs of results from extreme value theory, such as the Generalized Extreme Value and Generalized Pareto distributions, the Fisher-Tippett-Gnedenko theorem, von Mises conditions, the Pickands-Balkema-de Haan theorem and their applications.

Contents:

  1. Problem statement and Generalized Extreme Value distribution
    • Type I: Gumbel distribution
    • Type II: Frechet distribution
    • Type III: Inverse Weibull distribution
  2. Fisher-Tippett-Gnedenko theorem
    • Examples of convergence
    • General approach: max-stable distributions as invariants/fixed points/attractors and EVD types as equivalence classes
    • Khinchin’s theorem (Law of Convergence of Types)
    • Necessary conditions of maximum stability
    • Fisher-Tippett-Gnedenko theorem (Extreme Value Theorem)
    • Distributions not in domains of attraction of any maximum-stable distributions
  3. Von Mises sufficient conditions for a distribution to belong to a type I, II or III
    • Pre-requisites from survival analysis
    • Von Mises conditions proof
    • Generalizations of von Mises condition for Type I EVD: auxiliary function and von Mises function
  4. Necessary and sufficient conditions for a distribution to belong to a type I, II or III
    • Pre-requisites from Karamata’s theory of slow/regular/Г-/П- variation
    • Necessary and sufficient conditions of convergence to Types II or III EVD
    • Necessary and sufficient conditions of convergence to Type I EVD
  5. Residual life time
    • Generalized Pareto distribution
    • Residual life time problem
    • Pickands-Balkema-de Haan theorem (a.k.a. Second Extreme Value Theorem)
  6. Order statistics and parameter estimation
    • Order statistics
    • Hill’s estimator
    • Pickands’ estimator
    • Other estimators
  7. Summary and examples of practical application
    • Examples of Type I Gumbel distribution
    • Examples of Type II Frechet distribution
    • Examples of Type III Inverse Weibull distribution
  8. Concluding remarks

1. Problem statement and Generalized Extreme Value distribution

One of the most famous results in probability is the Central Limit Theorem, which claims that the sum of $n \to \infty$ i.i.d. random variables $\xi_i$, after centering and normalizing, converges to the Gaussian distribution.

Now, what if we ask a similar question about the maximum of those $n \to \infty$ i.i.d. random variables instead of the sum? Does it converge to any distribution?

It turns out that the answer depends on the properties of the distribution of $\xi_i$, but not much, really. Whenever the (properly centered and normalized) maximum of $n$ random variables $\xi_i$ converges to a non-degenerate distribution at all, that limiting distribution is:

$G_{\gamma}(x) = exp(-(1 + \gamma x)^{-\frac{1}{\gamma}})$

This distribution is called the Generalized Extreme Value Distribution. Depending on the coefficient $\gamma$ it can take one of three specific forms:
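To get a quick feel for the three regimes before going through them one by one, here is a minimal plotting sketch (my own addition, using the centered, unit-scale form above); the $\gamma = 0$ case is taken as the limiting form $e^{-e^{-x}}$:

import numpy as np
import matplotlib.pyplot as plt


def gev_cdf(x, gamma):
    """GEV cdf exp(-(1 + gamma*x)^(-1/gamma)); gamma = 0 is the Gumbel limit exp(-exp(-x))."""
    if gamma == 0:
        return np.exp(-np.exp(-x))
    # the cdf is defined where 1 + gamma*x > 0; clipping to 0 gives the correct
    # limiting values (0 below the support for gamma > 0, 1 above it for gamma < 0)
    t = np.maximum(1 + gamma * x, 0)
    with np.errstate(divide='ignore'):
        return np.exp(-t ** (-1.0 / gamma))


x = np.arange(-4, 4, 0.01)

fig, ax = plt.subplots(figsize=(12, 8), dpi=100)
for gamma in (-0.5, 0.0, 0.5):
    ax.plot(x, gev_cdf(x, gamma), label=f'gamma = {gamma}')
ax.legend()
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_title('GEV cdf for negative, zero and positive gamma')
plt.show()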

Type I: Gumbel distribution

If $\gamma \to 0$, we can denote $k = \frac{1}{\gamma} \to \infty$. Then the generalized EVD converges to a doubly-exponential distribution (sometimes this is called the law of the double logarithm), using the limits $e = \lim_{k \to \infty} (1 + \frac{1}{k})^k$ and, more generally, $e^x = \lim_{k \to \infty} (1 + \frac{x}{k})^k$:

$G_{\gamma}(x) = exp(-(1 + \gamma x)^{-\frac{1}{\gamma}}) = exp(-(1 + \frac{1}{k} x)^{-k}) \to exp(-e^{-x})$.

This is the Gumbel distribution. It often occurs in various areas, e.g. in bioinformatics, where it describes the distribution of the longest series of successes in $n$ experiments of tossing a coin 100 times each.

It is often parametrized by location and scale parameters. I will keep it centered here, but will add a scale parameter $\lambda$:

$F(x) = e^{-e^{-\frac{x}{\lambda}}}$, or, in a more intuitive notation, $F(x) = \frac{1}{\sqrt[e^{x/\lambda}]{e}}$.

It is straightforward to derive the probability density function $f(x)$ from here:

$f(x) = \frac{\partial F}{\partial x} = -e^{-\frac{x}{\lambda}} \cdot (-\frac{1}{\lambda}) \cdot e^{-e^{-\frac{x}{\lambda}}} = \frac{1}{\lambda} e^{-(\frac{x}{\lambda} + e^{-\frac{x}{\lambda}})}$.

import math

import numpy as np
import matplotlib.pyplot as plt


scale = 1

# Generate x values from -20 to 20 with a step size of 0.1
x = np.arange(-20, 20, 0.1)

# Calculate y values
gumbel_cdf = math.e**(-math.e**(-(x/scale)))
gumbel_pdf = (1 / scale) * np.exp(-( x/scale + math.e**(-(x / scale))))

# Create the figure and axis objects
fig, ax = plt.subplots(figsize=(12,8), dpi=100)

# Plot cdf
ax.plot(x, gumbel_cdf, label='cdf')

# Plot pdf
ax.plot(x, gumbel_pdf, label='pdf')

# Set up the legend
ax.legend()

# Set up the labels and title
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_title('Plot of Gumbel pdf and cdf')

# Display the plot
plt.show()

Gumbel distribution plot

Gumbel distribution plot, scale = 1.

Type II: Frechet distribution

If $\gamma > 0$, let us denote $k = \frac{1}{\gamma}$ ($k > 0$) and $y = \lambda \cdot (1 + \gamma x)$, where $k$ is called the shape parameter and $\lambda$ the scale parameter. Then the distribution takes the shape:

$G_{\gamma}(x) = exp(-(1 + \gamma x)^{-\frac{1}{\gamma}}) = exp(-(\frac{y}{\lambda})^{-k})$.

To make it more intuitive, I'll re-write the cdf in the following way: $F(x) = \frac{1}{e^{(\frac{\lambda}{x})^k}}$.

This is the Frechet distribution. It arises when the tail of the original cumulative distribution function $F_{\xi}(x)$ is heavy, e.g. when it is the Pareto distribution.

Let us derive the probability density function for it:

$f(x) = \frac{\partial F}{\partial x} = -(\frac{x}{\lambda})^{-k - 1} \cdot (-k) \cdot \frac{1}{\lambda} \cdot e^{-(\frac{x}{\lambda})^{-k}} = \frac{k}{\lambda} \cdot (\frac{x}{\lambda})^{-k-1} \cdot e^{-(\frac{x}{\lambda})^{-k}}$.

Here is the plot:

import math

import numpy as np
import matplotlib.pyplot as plt


shape = 2  # alpha
scale = 2  # beta

# Generate x values from 0.1 to 20 with a step size of 0.1 (start at 0.1 to avoid division by zero)
x = np.arange(0.1, 20, 0.1)

# Calculate y values
frechet_cdf = math.e**(-(scale / x) ** shape)
frechet_pdf = (shape / scale) * ((scale / x) ** (shape + 1)) * np.exp(-((scale / x) ** shape))

# Create the figure and axis objects
fig, ax = plt.subplots(figsize=(12,8), dpi=100)

# Plot cdf
ax.plot(x, frechet_cdf, label='cdf')

# Plot pdf
ax.plot(x, frechet_pdf, label='pdf')

# Set up the legend
ax.legend()

# Set up the labels and title
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_title('Plot of Frechet distribution pdf and cdf')

# Display the plot
plt.show()

Frechet distribution plot

Frechet distribution plot, shape = 2, scale = 2.

Type III: Inverse Weibull distribution

If $\gamma < 0$, let us denote $k = -\frac{1}{\gamma}$ ($k > 0$; different kinds of behaviour are observed at $0 < k < 1$, $k = 1$ and $k > 1$) and $y = \lambda (1 + \gamma x)$.

Then the distribution takes the shape:

$G_{\gamma}(x) = exp(-(1 + \gamma x)^{-\frac{1}{\gamma}}) = exp(-(\frac{y}{\lambda})^k)$.

Expressed in terms of a (renamed) variable with the upper endpoint of the support at 0, this becomes:

$G_{\gamma}(x) = \begin{cases} exp(-(\frac{-x}{\lambda})^{k}), & x \le 0 \\ 1, & x > 0 \end{cases}$

This is the Inverse Weibull distribution. Its direct counterpart, the Weibull distribution, often occurs in survival analysis, where its hazard rate is a simple power of time. It also arises in mining, where it describes the mass distribution of particles of size $x$ and is closely connected to the Pareto distribution. We shall discuss this connection later.

The maximum converges to the Inverse Weibull when the distribution of our random variable $\xi$ is bounded from above. E.g. consider the uniform distribution $\xi \sim U(0, 1)$. It is clear that the maximum of $n$ uniformly distributed variables will approach 1 as $n \to \infty$. It turns out that the rate of this convergence is described by the Inverse Weibull distribution.
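A quick numerical sketch of this (my own addition; the constants $a_n = 1/n$, $b_n = 1$ and the sample sizes are my choice) compares the empirical cdf of the normalized maximum $n(M_n - 1)$ of uniform samples with the limiting cdf $e^x$ on $x \le 0$, i.e. the Inverse Weibull with $k = 1$:

import numpy as np
import matplotlib.pyplot as plt


rng = np.random.default_rng(0)

n = 1000          # number of uniform variables per maximum
n_trials = 10000  # number of simulated maxima

# maximum of n U(0, 1) variables, centered at 1 and scaled by n
maxima = rng.uniform(0, 1, size=(n_trials, n)).max(axis=1)
normalized = n * (maxima - 1)

x = np.arange(-6, 0, 0.05)
empirical_cdf = np.array([(normalized <= t).mean() for t in x])
limit_cdf = np.exp(x)  # Inverse Weibull cdf with k = 1 on x <= 0

fig, ax = plt.subplots(figsize=(12, 8), dpi=100)
ax.plot(x, empirical_cdf, label='empirical cdf of n * (max - 1)')
ax.plot(x, limit_cdf, '--', label='exp(x), x <= 0')
ax.legend()
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_title('Maximum of uniform variables vs Inverse Weibull cdf (k = 1)')
plt.show()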

To make it more intuitive, we can re-write the cdf as $F(x) = \begin{cases} \frac{1}{e^{(\frac{-x}{\lambda})^k}}, & x \le 0 \\ 1, & x > 0 \end{cases}$

Let us derive the probability density function from the cumulative distribution function $F(x) = exp(-(\frac{-x}{\lambda})^{k})$:

$f(x) = \frac{\partial F}{\partial x} = -(\frac{-x}{\lambda})^{k-1} \cdot \frac{-k}{\lambda} \cdot exp(-(\frac{-x}{\lambda})^{k}) = \frac{k}{\lambda} \cdot (\frac{-x}{\lambda})^{k-1} \cdot exp(-(\frac{-x}{\lambda})^{k})$.

Let us draw the plot:

import math

import numpy as np
import matplotlib.pyplot as plt


shape = 2  # alpha
scale = 2  # beta

# Generate x values from -20 to 0 with a step size of 0.1
x = np.arange(-20, 0, 0.1)

# Calculate y values
inverse_weibull_cdf = math.e**(-(-x/scale) ** shape)
inverse_weibull_pdf = (shape / scale) * ((-x / scale) ** (shape - 1)) * np.exp(-((-x / scale) ** shape))

# Create the figure and axis objects
fig, ax = plt.subplots(figsize=(12,8), dpi=100)

# Plot cdf
ax.plot(x, inverse_weibull_cdf, label='cdf')

# Plot pdf
ax.plot(x, inverse_weibull_pdf, label='pdf')

# Set up the legend
ax.legend()

# Set up the labels and title
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_title('Plot of Inverse Weibull pdf and cdf')

# Display the plot
plt.show()

Inverse Weibull plot

Plot of Inverse Weibull distribution, shape = 2, scale = 2.


2. Fisher-Tippett-Gnedenko theorem

The Extreme Value Theorem is a series of theorems proven in the first half of the 20th century. They claim that the maximum of a sample of i.i.d. random variables, after centering and normalization, converges to just one of 3 possible distributions: Gumbel, Frechet or Inverse Weibull.

Here I will lay out the outline of the proof with my comments. The proof includes introduction of several technical tools, but I will comment on their function and rationale behind each of them.

Consider a random variable $M_n$, which describes the maximum of $\xi_i$, $i \in 1..n$:

$p(M_n \le x) = \prod \limits_{i=1}^{n} p(\xi_i \le x) = F^n(x)$

Similarly to the Central Limit Theorem, a convergence theorem is applicable not to the distribution of $M_n$ itself, but to the distribution of a normalized version of it:

$p(\frac{M_n - b_n}{a_n} \le x) = p(M_n \le a_n x + b_n) = F^n(a_n x + b_n)$

We aim to show that for some sequences of constants $a_n$ and $b_n$,

$F^n(a_n x + b_n)$ as $n \to \infty$ converges in distribution to some distribution $G(x)$: $F^n(a_n x + b_n) \xrightarrow[n \to \infty]{w} G(x)$.

Now I will provide several examples and then informally describe the proof outline, before introducing the mathematical formalism.

Examples of convergence

Let us start with simple examples of convergence to types I and II to get a taste of how convergence to maxima works.

Example 2.1. Convergence of maximum of i.i.d. exponential random variables to Gumbel distribution

horse_and_wagon.png

I’ll never believe that a horse and a wagon age in the same way.

- Alex Comfort (as quoted by Vladimir P. Skulachev)

Consider a wagon with an exponentially-distributed lifetime. For instance, assume that the probability for a wagon to survive any given year is $1/e$.

Then let the random variable $\xi$ be its lifetime, with cdf:

$F_{\xi}(x) = p (\xi \le x) = 1 - e^{-x}$

Given $n \to \infty$ wagons, what is the distribution of the lifetime of the longest-living one?

$p (\max \xi_i \le x) = F^n(x) = (1 - p (\xi \ge x))^n = (1 - e^{-x})^n = (1 - \frac{e^{-x + \ln n}}{n})^n \xrightarrow{n \to \infty} e^{-e^{-x + \ln n}}$

We see that this is almost the standard Gumbel distribution (Type I EVD).

In order to get the standard one, we need to consider a shifted random variable $p (\max \xi_i \le a_n x + b_n)$, choosing $a_n = 1$, $b_n = \log n$, so that we get:

$F^n(x + \log n) = (1 - e^{-(x + \log n)})^n = (1 - \frac{e^{-x}}{n})^n \to exp(-e^{-x})$.
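Here is a quick simulation sketch checking this numerically (my own addition; the sample sizes and the seed are arbitrary): draw maxima of $n$ standard exponential variables, subtract $b_n = \log n$, and compare the empirical cdf with the Gumbel cdf.

import numpy as np
import matplotlib.pyplot as plt


rng = np.random.default_rng(0)

n = 1000          # number of exponential variables per maximum
n_trials = 10000  # number of simulated maxima

# maxima of n standard exponential variables, shifted by b_n = log(n), a_n = 1
maxima = rng.exponential(scale=1.0, size=(n_trials, n)).max(axis=1)
shifted = maxima - np.log(n)

x = np.arange(-3, 8, 0.05)
empirical_cdf = np.array([(shifted <= t).mean() for t in x])
gumbel_cdf = np.exp(-np.exp(-x))

fig, ax = plt.subplots(figsize=(12, 8), dpi=100)
ax.plot(x, empirical_cdf, label='empirical cdf of max - log(n)')
ax.plot(x, gumbel_cdf, '--', label='Gumbel cdf exp(-exp(-x))')
ax.legend()
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_title('Maximum of exponential lifetimes vs Gumbel cdf')
plt.show()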

Example 2.2. Convergence of maximum of i.i.d. Pareto random variables to Frechet distribution

Consider another example: the Pareto distribution. It describes the distribution of wealth (e.g. the fraction of people having net worth above $x$), sizes of cities (e.g. the fraction of cities with a population greater than $x$), sizes of orders on an exchange (e.g. the fraction of orders of size greater than $x$), or the number of Telegram channels with more than $x$ subscribers. Here is its cdf:

$F_{\xi}(x) = p (\xi \le x) = 1 - (\frac{x_0}{x})^{\alpha}$

It is often convenient to depict it in log-log coordinates, because $\ln (1 - F_{\xi}(x)) = \alpha (\ln x_0 - \ln x)$:

log_log_Pareto.png

The Pareto distribution also describes the distribution of lifetimes of memes, comedy shows etc. This phenomenon is known as the Lindy effect. With a little calculation one can show that the expected remaining lifetime of a comedy show is proportional to the time it has already been around, which leads to a Pareto-shaped cdf.

Again, let us consider the distribution of the maximum lifetime of $n$ comedy shows:

$p (\max \xi_i \le x) = F^n(x) = (1 - p (\xi \ge x))^n = (1 - (\frac{x_0}{x})^{\alpha})^n = (1 - \frac{(\frac{\sqrt[\alpha]{n} x_0}{x})^{\alpha}}{n})^n = (1 - \frac{(\frac{\lambda}{x})^{\alpha}}{n})^n \to e^{-(\frac{\lambda}{x})^{\alpha}}$, where $\lambda = \sqrt[\alpha]{n} x_0$.

We see that it converges to the Frechet (Type II EVD) distribution.
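As before, a small simulation sketch (my own addition; parameters are arbitrary) lets us check the convergence: normalize the maximum of $n$ Pareto variables by $a_n = \sqrt[\alpha]{n} x_0$ and compare with the Frechet cdf.

import numpy as np
import matplotlib.pyplot as plt


rng = np.random.default_rng(0)

x0, alpha = 1.0, 2.0  # Pareto scale and tail exponent
n = 1000              # number of Pareto variables per maximum
n_trials = 10000      # number of simulated maxima

# inverse-transform sampling: if U ~ U(0, 1), then x0 * U^(-1/alpha) is Pareto-distributed
samples = x0 * rng.uniform(0, 1, size=(n_trials, n)) ** (-1.0 / alpha)
normalized = samples.max(axis=1) / (n ** (1.0 / alpha) * x0)  # a_n = n^(1/alpha) * x0

x = np.arange(0.1, 5, 0.05)
empirical_cdf = np.array([(normalized <= t).mean() for t in x])
frechet_cdf = np.exp(-x ** (-alpha))

fig, ax = plt.subplots(figsize=(12, 8), dpi=100)
ax.plot(x, empirical_cdf, label='empirical cdf of normalized maximum')
ax.plot(x, frechet_cdf, '--', label='Frechet cdf exp(-x^(-alpha))')
ax.legend()
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_title('Maximum of Pareto variables vs Frechet cdf')
plt.show()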

Example 2.3. Convergence of minimum/maximum of i.i.d. Weibull/inverse Weibull random variables to Weibull/inverse Weibull distribution

Consider a single link of a chain, the strength of which is described by a Weibull distribution $F_{\xi}(x) = p(\xi \le x) = 1 - e^{-(\frac{x}{x_0})^m}$:

link

In this specific application of the Weibull distribution we may say that the strength of a link $S(F)$ is the probability that it does not break under application of force $F$ (sorry for the overloading of the symbol $F$, used for both the cdf and the force).

Then we can show that the strength of the whole chain is also described by a Weibull distribution, as a chain is only as strong as its weakest link:

chain

$S_{chain}(F) = p(\min \limits_i \xi_i > F) = \prod \limits_{i=1}^{n} e^{-(\frac{F}{F_0})^m} = e^{-n (\frac{F}{F_0})^m} = e^{-(\frac{F}{F_0 / \sqrt[m]{n}})^m}$

We see that the whole chain's strength also follows a Weibull distribution, but with the scale parameter rescaled from $F_0$ to $F_0 / \sqrt[m]{n}$:

Weibull stability

Similarly, we may consider a zipper, whose strength is the strength of its hardest-to-open element. Hence, if the strength of each element follows an inverse Weibull distribution, the strength of the whole zipper does as well:

zipper

Later on we will introduce the concept of max-stable/min-stable distributions, i.e. distributions such that the maximum/minimum of i.i.d. copies follows the same distribution (up to centering and scaling). With this example we can already see that the Weibull distribution is min-stable and the inverse Weibull distribution is max-stable.
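Here is a simulation sketch of the chain example (my own addition; the shape, scale, and number of links are arbitrary), verifying that the minimum of $n$ Weibull-distributed link strengths is again Weibull with scale $F_0 / \sqrt[m]{n}$.

import numpy as np
import matplotlib.pyplot as plt


rng = np.random.default_rng(0)

m, F0 = 1.5, 2.0   # Weibull shape and scale of a single link
n = 50             # number of links in the chain
n_trials = 10000   # number of simulated chains

# strength of each chain = strength of its weakest link
links = F0 * rng.weibull(m, size=(n_trials, n))
chain_strength = links.min(axis=1)

x = np.arange(0.01, 1.0, 0.01)
empirical_cdf = np.array([(chain_strength <= t).mean() for t in x])
# predicted: Weibull with the same shape m and rescaled scale F0 / n^(1/m)
predicted_cdf = 1 - np.exp(-(x / (F0 / n ** (1.0 / m))) ** m)

fig, ax = plt.subplots(figsize=(12, 8), dpi=100)
ax.plot(x, empirical_cdf, label='empirical cdf of chain strength')
ax.plot(x, predicted_cdf, '--', label='Weibull cdf with scale F0 / n^(1/m)')
ax.legend()
ax.set_xlabel('force')
ax.set_ylabel('y')
ax.set_title('Minimum of Weibull link strengths is again Weibull')
plt.show()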


In these examples we used the same argument: $(1 - \frac{1 - F(x)}{n})^n \xrightarrow{n \to \infty} e^{-(1 - F(x))}$.

It might seem that we could substitute any distribution into $(1 - F(x))$ and make the distribution of the maximum, $e^{-(1 - F(x))}$, take almost any shape.

However, it turns out that this intuition is wrong, as there are just 3 possible shapes of Extreme Value Distribution.

Let's see how (counter-intuitively) the maximum of i.i.d. Gaussian random variables converges to Gumbel.

Example 2.4. Convergence of maximum of i.i.d. Gaussian random variables to Gumbel distribution

Step 1. Integration in parts

Let us consider the cdf of the normal distribution: $F(x) = p(\xi \le x) = \frac{1}{\sqrt{2 \pi}} \int \limits_{-\infty}^{x} e^{-t^2/2} dt = 1 - p(\xi \ge x) = 1 - \frac{1}{\sqrt{2 \pi}} \int \limits_{x}^{\infty} e^{-t^2/2} dt$.

Integrating by parts: $\frac{1}{\sqrt{2 \pi}} \int \limits_{x}^{\infty} e^{-t^2/2} dt = \frac{1}{\sqrt{2 \pi}} \frac{e^{-\frac{x^2}{2}}}{x} - \frac{1}{\sqrt{2\pi}} \int \limits_{x}^{\infty} \frac{e^{-t^2/2}}{t^2} dt$

Recall that we are interested in the maximum at $a_n x + b_n$, not just at $x$.

Hence, we want to study the behaviour of $\frac{1}{\sqrt{2 \pi}} \frac{e^{-\frac{(a_n x + b_n)^2}{2}}}{a_n x + b_n}$ and to show that $\frac{1}{\sqrt{2 \pi}} \frac{e^{-\frac{(a_n x + b_n)^2}{2}}}{a_n x + b_n} \sim \frac{e^{-x}}{n}$.

Step 2. Split the result into two multipliers, select $a_n$

Let us use our freedom to choose $a_n$ and $b_n$.

First let us choose $a_n$, while letting $b_n \to \infty$ (we will choose a specific $b_n$ at a later stage).

Let $a_n = \frac{1}{b_n}$:

$\frac{1}{\sqrt{2 \pi}} \frac{e^{-\frac{(a_n x + b_n)^2}{2}}}{a_n x + b_n} = \frac{1}{\sqrt{2 \pi}} \frac{e^{-x^2 / 2 b_n^2} e^{-x}}{x / {b_n}^2 + 1} \frac{e^{-{b_n}^2 / 2}}{b_n}$

By expanding the square of $a_n x + b_n$ we've managed to split the expression into 2 multipliers. The first of them contains $e^{-x}$, which is required for convergence to the Gumbel distribution. Note that as $b_n \to \infty$, the first term converges to $e^{-x}$:

$\frac{e^{-x^2 / 2 b_n^2} e^{-x}}{x / {b_n}^2 + 1} \xrightarrow{b_n \to \infty} e^{-x}$

The second term $\frac{e^{-{b_n}^2 / 2}}{b_n}$ will produce the $\frac{1}{n}$ that we need to obtain an expression of the form $(1 - \frac{e^{-x}}{n})^n$.

Step 3. Select $b_n$

Choose $b_n = F^{-1}(1 - 1/n)$, so that:

$\frac{1}{\sqrt{2 \pi}} \frac{e^{-{b_n}^2 / 2}}{b_n} \approx 1 - F(b_n) = 1 - F(F^{-1}(1 - 1/n)) = \frac{1}{n}$

Hence, we end up having $F(a_n x + b_n) \to 1 - \frac{e^{-x}}{n}$, where $b_n = F^{-1}(1 - 1/n)$ and $a_n = \frac{1}{b_n}$.

Then $p(\max \xi_i \le a_n x + b_n) = p(\xi_i \le a_n x + b_n)^n = (1 - \frac{e^{-x}}{n})^n \to e^{-e^{-x}}$.
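To see this convergence numerically, here is a simulation sketch (my own addition; it relies on scipy for the normal quantile, and the convergence for the Gaussian is known to be quite slow, so the fit is only approximate):

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm


rng = np.random.default_rng(0)

n = 5000          # number of Gaussian variables per maximum
n_trials = 2000   # number of simulated maxima

# normalizing constants from the derivation above: b_n = F^{-1}(1 - 1/n), a_n = 1 / b_n
b_n = norm.ppf(1 - 1.0 / n)
a_n = 1.0 / b_n

maxima = rng.standard_normal(size=(n_trials, n)).max(axis=1)
normalized = (maxima - b_n) / a_n

x = np.arange(-3, 6, 0.05)
empirical_cdf = np.array([(normalized <= t).mean() for t in x])
gumbel_cdf = np.exp(-np.exp(-x))

fig, ax = plt.subplots(figsize=(12, 8), dpi=100)
ax.plot(x, empirical_cdf, label='empirical cdf of (max - b_n) / a_n')
ax.plot(x, gumbel_cdf, '--', label='Gumbel cdf exp(-exp(-x))')
ax.legend()
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_title('Maximum of Gaussian variables vs Gumbel cdf')
plt.show()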

This is just one example of a distribution for which it doesn't appear at first glance that its maximum would converge to one of the 3 types of EVD, but its maximum actually does converge to Type I. We shall prove as a theorem that for i.i.d. random variables from any distribution there are just 3 options for what their maximum can converge to. Let's first discuss the general approach of how this could be shown, and then proceed with a formal proof.

General approach: max-stable distributions as invariants/fixed points/attractors and EVD types as equivalence classes

I assume that all three types of Extreme Value Distribution were first discovered experimentally. Later, statisticians came up with a proof that the maximum can converge to just one of three possible types of distributions and that no other types of EVD can exist. Finally, they came up with criteria for a distribution to belong to each type.

Design of this proof is similar to many other proofs. I will outline it informally here:

Assume that as the number of random variables $n$ approaches infinity, the distribution of the observed maximum approaches some type of distribution. Then such a distribution type can be considered an invariant, or attractor, or fixed point, similar to many other mathematical problems. For instance, eigenvectors are fixed points of matrix multiplication: an eigenvector, multiplied by a matrix, results in itself, multiplied by a scalar. Or, no matter how many times you take the derivative of $e^{kx}$, you get $e^{kx}$, multiplied by a scalar $k$.

Similarly, maximum-stable distributions are invariant objects. Those are distributions such that the maximum of i.i.d. variables drawn from them follows the same distribution, no matter how many more i.i.d. random variables you toss. E.g. if for one Gumbel-distributed random variable $\xi$ we know that $p_{\xi}(\frac{M_1 - b_1}{a_1} \le x) = e^{-e^{-x}}$, then for $n \to \infty$ Gumbel-distributed random variables the maximum of $\xi_1 .. \xi_n$ is still Gumbel-distributed (after centering and normalizing by some numbers $a_n$, $b_n$): $p_{M_n}(\frac{M_n - b_n}{a_n} \le x) = e^{-e^{-x}}$.

Ok. Then, after we have established that there are some distributions for which the maximum of $n \to \infty$ centered and normalized i.i.d. variables produces a random variable with the same distribution, how do we show that every distribution whose maximum converges at all converges to one of them?

We'll use another classical mathematical tool: equivalence classes and equivalence relations. For instance, odd numbers and even numbers form two equivalence classes under the operation of taking the remainder modulo 2. Odd numbers are equivalent to each other in terms of producing remainder 1 (e.g. $3 \sim 5$, where $\sim$ is the equivalence relation modulo 2), and even numbers are equivalent in terms of producing remainder 0.

Similarly, we will show that the types of EVD form equivalence classes under the operation of taking the maximum of $n \to \infty$ i.i.d. random variables, so that every distribution whose maximum converges falls into the domain of attraction of one of those types. E.g. the Pareto distribution is equivalent to the Cauchy distribution under the equivalence relation of the maximum of $n \to \infty$ Pareto/Cauchy i.i.d.'s converging to the same max-stable Type II (Frechet) EVD.

Now that I’ve laid out the plan of the proof, it is time to get into technicalities. I will formally introduce the concepts I mentioned above and prove some lemmas about their relatedness.

Definition 2.1: Max-stable cumulative distribution function

$G$ is max-stable if for all $n \in \mathbb{N}$ there exist $a_n > 0$ and $b_n \in \mathbb{R}$ such that for all $x \in \mathbb{R}$: $G(x) = G^n(a_n x + b_n)$.

Definition 2.2: Domain of attraction

If $F$ is a cdf, then $F$ is in the domain of attraction (for maxima) of $G$, written $F \in \mathcal{D}(G)$, when there exist sequences $\{a_n\} \subset \mathbb{R}^+$ and $\{b_n\} \subset \mathbb{R}$ such that $F^n (a_n x + b_n) \xrightarrow[n \to \infty]{w} G(x)$.

Definition 2.3: Type of convergence

If $G^*(x)$ is another non-degenerate cdf, we say that $G$ and $G^*$ have the same type if there exist $a > 0$ and $b \in \mathbb{R}$ such that for every $x \in \mathbb{R}$: $G^*(ax + b) = G(x)$.

Khinchin’s theorem (Law of Convergence of Types)

Khinchin

Aleksandr Yakovlevich Khinchin

Lemma 2.1: Khinchin’s theorem (law of Convergence of Types)

Suppose that we have a sequence of distribution functions $\{F_n\}$ (e.g. the distributions of the maximum of a random variable $\xi_i$ in $n$ experiments).

Let those distribution functions, upon $n \to \infty$, converge to a certain distribution $G(x)$: $F_n(a_n x + b_n) \xrightarrow[n \to \infty]{w} G(x)$ for some sequences of constants $\{a_n\}, \{b_n\}$.

Suppose there is another distribution function $H(x)$ such that, for a different pair of sequences $\{\alpha_n\}, \{\beta_n\}$, the sequence $F_n(\alpha_n x + \beta_n)$ converges to it: $F_n(\alpha_n x + \beta_n) \xrightarrow[n \to \infty]{w} H(x)$.

Then $H(x) = G(Ax + B)$, where $A = \lim \limits_{n \to \infty} \frac{\alpha_n}{a_n}$ and $B = \lim \limits_{n \to \infty} \frac{\beta_n - b_n}{a_n}$.

Proof:

Consider two distribution functions $G(x)$ and $H(x)$, obtained as the limits of $F(ax+b)$ and $F(\alpha x + \beta)$ (here I drop the index $n$ on $F$, $a$, $b$, $\alpha$, $\beta$ for brevity).

Denote $y = F(ax + b) \to G(x)$. Then $F^{-1}(y) = ax + b$ and $x = \frac{F^{-1}(y) - b}{a} \to G^{-1}(y)$.

Similarly, $y = F(\alpha x + \beta) \to H(x)$, so $F^{-1}(y) = \alpha x + \beta$ and $x = \frac{F^{-1}(y) - \beta}{\alpha} \to H^{-1}(y)$.

Now choose two points: $x_1$, corresponding to $y_1$, and $x_2$, corresponding to $y_2$, and subtract $x_1$ and $x_2$ from each other:

$x_1 - x_2 = \frac{F^{-1}(y_1) - F^{-1}(y_2)}{a} \to G^{-1}(y_1) - G^{-1}(y_2)$

Apply the same for $H^{-1}$:

$x_1 - x_2 = \frac{F^{-1}(y_1) - F^{-1}(y_2)}{\alpha} \to H^{-1}(y_1) - H^{-1}(y_2)$

Which results in $\frac{G^{-1}(y_1) - G^{-1}(y_2)}{H^{-1}(y_1) - H^{-1}(y_2)} \to \frac{\alpha}{a} = A$.

Substitute $\alpha = A \cdot a$ into $x = \frac{F^{-1}(y) - \beta}{A \cdot a} \to H^{-1}(y)$, so that $A \cdot x = \frac{F^{-1}(y) - \beta}{a} \to A \cdot H^{-1}(y)$.

On the other hand, we recall that $x = \frac{F^{-1}(y) - b}{a} \to G^{-1}(y)$. Subtracting these, we get: $\frac{F^{-1}(y) - \beta}{a} - \frac{F^{-1}(y) - b}{a} = \frac{b - \beta}{a} \to A \cdot H^{-1}(y) - G^{-1}(y)$, or $\frac{\beta - b}{a} = B \to G^{-1}(y) - A \cdot H^{-1}(y)$.

Hence, $G^{-1}(y) \to A \cdot H^{-1}(y) + B$, which is equivalent to $H(x) = G(Ax + B)$.

Lemma 2.2: Necessary condition of maximum-stability

Let $G$ be a non-degenerate cdf:

  1. $G$ is max-stable if and only if there exist a sequence $\{F_n\}$ of cdf's and sequences $\{a_n\} \subset \mathbb{R}^+$, $\{b_n\}$ such that for all $k \in \mathbb{N}$: $F_n(a_{nk} x + b_{nk}) \xrightarrow[n \to \infty]{w} G^{1/k}(x)$
  2. $\mathcal{D}(G) \neq \varnothing$ if and only if $G$ is max-stable. In that case, $G \in \mathcal{D}(G)$.

Proof:

Proposition 1, direct statement: if $G$ is max-stable, there exists $\{F_n\}$ such that …

If $G$ is max-stable, then by definition for every $n \in \mathbb{N}$ there exist $a_n$, $b_n$ such that $G^{n}(a_n x + b_n) = G(x)$.

Define $F_n = G^n$. Then $F^k_n(a_{nk} x + b_{nk}) = G^{nk}(a_{nk} x + b_{nk}) = G(x)$, i.e. $F_n(a_{nk} x + b_{nk}) = G^{1/k}(x)$. We arrive at the direct statement.

Proposition 1, reverse statement: if there exists $\{F_n\}$ such that …, then $G$ is max-stable

Let us prove the reverse statement: suppose that the sequences $\{F_n\}$, $\{a_n\}$, $\{b_n\}$ exist, such that for all $k \in \mathbb{N}$:

$F_n(a_{nk}x + b_{nk}) \xrightarrow[n \to \infty]{w} G^{1/k}(x)$

Then consider $k=1$ and $k=2$:

$F_n(a_{n}x + b_{n}) \xrightarrow[n \to \infty]{w} G(x)$ and $F_n(a_{2n}x + b_{2n}) \xrightarrow[n \to \infty]{w} G^{1/2}(x)$

By Khinchin's lemma there exist $\alpha_2$, $\beta_2$ such that $G(\alpha_2 x + \beta_2) = G^{1/2}(x)$.

Similarly, for every other $k$: $G(\alpha_k x + \beta_k) = G^{1/k}(x)$, or $G^k(\alpha_k x + \beta_k) = G(x)$, which is the definition of max-stability.

Proposition 2 direct statement:

The proof is self-evident: if $G$ is max-stable, $G^n(a_n x + b_n) = G(x)$, and $G \in \mathcal{D}(G)$ by definition.

Proposition 2 reverse statement:

Assume $F \in \mathcal{D}(G)$, i.e. $F^n (a_n x + b_n) \xrightarrow[n \to \infty]{w} G(x)$.

For all $k \in \mathbb{N}$ we have $F^{nk} (a_{nk} x + b_{nk}) \xrightarrow[n \to \infty]{w} G(x)$.

Hence, $F^{n} (a_{nk} x + b_{nk}) \xrightarrow[n \to \infty]{w} G^{1/k}(x)$

This makes $F^n$ (in the role of $F_n$) fit the conditions of the previous result, proving that $G$ is max-stable.

Corollary 2.1:

Let $G$ be a max-stable cdf. Then there exist functions $a(s) > 0$ and $b(s)$ such that for all $x \in \mathbb{R}$ and for all $s > 0$: $G^s(a(s)x + b(s)) = G(x)$.

The corollary is self-evident from the inversion of indices $s = \frac{1}{k}$.

Fisher-Tippett-Gnedenko theorem (Extreme Value Theorem)

Sir Ronald Aylmer Fisher, Leonard Henry Caleb Tippett, Boris Vladimirovich Gnedenko
Fisher Tippett Gnedenko

Theorem 2.1: Fisher-Tippett-Gnedenko theorem (Extreme Value Theorem)

Let $\xi_i$ be a sequence of i.i.d. random variables.

If there exist constants $a_n > 0$, $b_n \in \mathbb{R}$ and some non-degenerate cumulative distribution function $G$ such that $\frac{M_n - b_n}{a_n} \xrightarrow[n \to \infty]{w} G$, then $G$ is one of these:

(Type I) Gumbel: $G(x) = exp(-e^{-x})$, $x \in \mathbb{R}$,

(Type II) Frechet: $G(x) = exp(-x^{-\alpha})$, $x \ge 0$, $\alpha > 0$,

(Type III) Inverse Weibull: $G(x) = exp(-(-x)^{\alpha})$, $x \le 0$, $\alpha > 0$.

Proof

Here we give the proof of the Fisher-Tippett-Gnedenko theorem without introducing any additional pre-requisites and intermediate constructs. Because of that it might look like black magic now: it is not clear how anyone could have come up with this proof.

However, later on in parts 3 and 4 we will give the definitions of tail quantile function and tools from Karamata’s theory of slow/regular variation.

If you revisit this proof afterwards, you will notice that we’re making use of those tools, without naming them explicitly.

Step 1.

Consider the double negative logarithm of the max-stable distribution $G(a(s)x + b(s))^s = G(x)$:

$-\ln(-\ln(G(a(s)x + b(s))^{s})) = -\ln( -s \cdot \ln(G(a(s)x + b(s)))) = -\ln(-\ln(G(a(s)x + b(s)))) - \ln s = -\ln(-\ln G(x))$

Step 2.

Denote $\phi(x) = -\ln(-\ln(G(x)))$. Then the previous line reads $\phi(a(s)x + b(s)) - \ln s = \phi(x)$.

Step 3.

Denote $y = \phi(x)$; then the previous equation gives $\phi(a(s)x + b(s)) = y + \ln s$. Apply $\phi^{-1}$ to both sides:

$a(s)x + b(s) = \phi^{-1}(y + \ln s)$

Since $x = \phi^{-1}(y)$, this means:

$a(s) \phi^{-1}(y) + b(s) = \phi^{-1}(y + \ln s)$

$\phi^{-1}(y) = \frac{\phi^{-1}(y + \ln s) - b(s)}{a(s)}$

Step 4.

Note that $\phi^{-1}(0) = \frac{\phi^{-1}(\ln s) - b(s)}{a(s)}$. Subtract $\phi^{-1}(0)$ from both sides:

$\phi^{-1}(y) - \phi^{-1}(0) = \frac{\phi^{-1}(y + \ln s) - b(s)}{a(s)} - \frac{\phi^{-1}(\ln s) - b(s)}{a(s)} = \frac{\phi^{-1}(y + \ln s) - \phi^{-1}(\ln s)}{a(s)}$

Step 5.

Substitute variables: $\psi^{-1}(y) = \phi^{-1}(y) - \phi^{-1}(0)$, $z = \ln s$, $\tilde a(z) = a(e^z)$. Then:

$\psi^{-1}(y) = \phi^{-1}(y) - \phi^{-1}(0) = \frac{\phi^{-1}(y + \ln s) - \phi^{-1}(\ln s)}{a(s)} = \frac{\psi^{-1}(y + z) - \psi^{-1}(z)}{\tilde a(z)}$

$\psi^{-1}(y + z) - \psi^{-1}(z) = \psi^{-1}(y) \tilde a(z)$

Step 6.

We can swap $y$ and $z$ in the previous equation, setting $y = z$ and $z = y$:

$\psi^{-1}(y + z) - \psi^{-1}(y) = \psi^{-1}(z) \tilde a(y)$

After that, subtract $\psi^{-1}(y + z) - \psi^{-1}(z) = \psi^{-1}(y) \tilde a(z)$ from $\psi^{-1}(y + z) - \psi^{-1}(y) = \psi^{-1}(z) \tilde a(y)$:

$\psi^{-1}(z) - \psi^{-1}(y) = \psi^{-1}(z) \tilde a(y) - \psi^{-1}(y) \tilde a(z)$

$\psi^{-1}(z) (1 - \tilde a(y)) = \psi^{-1}(y) (1 - \tilde a(z))$

Here we consider two cases.

Step 7a.

If $\tilde{a} \equiv 1$, the previous equation just reads $0 = 0$. But then let's substitute $\tilde{a}(z) = 1$ into the result of step 5:

$\psi^{-1}(y + z) = \psi^{-1}(y) + \psi^{-1}(z)$

This is Cauchy's functional equation, whose only monotone solutions are linear, so $\psi^{-1}(y) = \rho y$. Denoting $\nu = \phi^{-1}(0)$, we get:

$\rho y = \psi^{-1}(y) = \phi^{-1}(y) - \phi^{-1}(0) = \phi^{-1}(y) - \nu$

$\phi^{-1}(y) = \nu + \rho y$

$x = \phi^{-1}(\phi(x)) = \nu - \rho \ln(-\ln G(x))$

$G(x) = exp(-e^{-\frac{x - \nu}{\rho}})$, which is the Gumbel (Type I) EVD.

Step 7b.

If $\tilde{a}(z) \ne 1$:

$\psi^{-1}(y) = \frac{ \psi^{-1}(z) }{ 1 - \tilde a(z) } (1 - \tilde a(y)) = c (1 - \tilde a(y))$

Now recall that $\psi^{-1}(y + z) - \psi^{-1}(z) = \psi^{-1}(y) \tilde a(z)$ and substitute $\psi^{-1}(y) = c (1 - \tilde a(y))$ there:

$c (1 - \tilde{a}(y + z)) - c (1 - \tilde{a}(z)) = c (1 - \tilde{a}(y)) \tilde a(z)$

This leads us to the equation $\tilde{a}(z + y) = \tilde{a}(y) \tilde{a}(z)$, which, for monotone $\tilde{a}$, has the solution $\tilde{a}(y) = e^{\rho y}$. Hence:

$\psi^{-1}(y) = c (1 - e^{\rho y}) = \phi^{-1}(y) - \phi^{-1}(0)$

$\phi^{-1}(y) = \nu + c (1 - e^{\rho y})$, where $\nu = \phi^{-1}(0)$.

Now recall that $\phi(x) = -\ln(-\ln(G(x)))$, and we get: $x = \phi^{-1}(\phi(x)) = \nu + c (1 - e^{-\rho \ln(-\ln(G(x)))})$. Hence:

$\frac{x - \nu}{c} = 1 - (-\ln G(x))^{-\rho}$

$(-\ln G(x))^{-\rho} = 1 - \frac{x - \nu}{c}$

$-\ln G(x) = (1 - \frac{x - \nu}{c})^{-\frac{1}{\rho}}$

$G(x) = e^{-(1 - \frac{x - \nu}{c})^{-\frac{1}{\rho}}}$, which is either a Frechet (Type II) or an Inverse Weibull (Type III) EVD.
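Which of the two it is depends on the sign of $\rho$ (this bit of bookkeeping is my own addition, using only the quantities defined above). Monotonicity of $\phi^{-1}$ forces $c \rho < 0$, since $\frac{d}{dy} \psi^{-1}(y) = -c \rho e^{\rho y}$ must be positive. Denoting the endpoint $x^* = \nu + c$, we have $1 - \frac{x - \nu}{c} = \frac{x - x^*}{-c}$, so:

$G(x) = exp(-(\frac{x - x^*}{-c})^{-\frac{1}{\rho}}) = \begin{cases} exp(-(\frac{x - x^*}{|c|})^{-\frac{1}{\rho}}), & \rho > 0, c < 0 \text{ (Frechet, Type II, } x \ge x^* \text{)} \\ exp(-(\frac{x^* - x}{c})^{\frac{1}{|\rho|}}), & \rho < 0, c > 0 \text{ (Inverse Weibull, Type III, } x \le x^* \text{)} \end{cases}$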

Distributions not in domains of attraction of any maximum-stable distributions

We've shown that if the maximum of $n$ i.i.d. random variables of a given distribution converges to any max-stable distribution, it is one of the 3 described types. However, the maximum might not converge to any max-stable distribution at all.

For instance, the Poisson distribution and the Geometric distribution do not converge to any type of Extreme Value Distribution. To show this we will need many more tools in our toolbox; the corresponding theorem will be proven at the end of section 4.


3. Von Mises sufficient conditions for a distribution to belong to a type I, II or III

Richard von Mises

Richard von Mises

The Fisher-Tippett-Gnedenko theorem is an important theoretical result, but it does not provide an answer to the basic question: what type of EVD does our distribution function $F$ belong to?

Fortunately, there are two sets of criteria that let us determine the domain of attraction of $F$. First, there are the von Mises conditions, which are sufficient, but not necessary. Still, they are more intuitive and give a good insight into what kinds of distributions converge to what types of EVD and why. Second, there are general necessary and sufficient conditions. Proving them is a much more technical task and requires some extra preliminaries.

We will start with the von Mises conditions, postulated by Richard von Mises in 1936, 7 years before the Fisher-Tippett-Gnedenko theorem was proved by Boris Gnedenko in 1943. The von Mises conditions are formulated in terms of survival analysis, so we shall introduce some basic notions from survival analysis first.

Pre-requisites from survival analysis

Definition 3.1: Survival function

The survival function $S(t)$ is the complement of the cumulative distribution function $F(t)$: $S(t) = 1 - F(t)$.

Basically, if our random variable's value represents a human lifespan, the cumulative distribution function $F(t) = p(\xi \le t) = \int \limits_{-\infty}^{t} f(x) dx$ represents the fraction of people who die by the time $t$.

The survival function $S(t) = p(\xi \ge t) = 1 - p(\xi \le t) = 1 - F(t)$, on the contrary, is the fraction of people who are still alive at time $t$.