Sampling from the Ewens distribution

library(ewens)

The Ewens distribution

The Ewens distribution is a distribution over partitions of integers. It has a single non-negative parameter θ, which determines how many parts one can expect in the partition: as θ goes to zero, the Ewens distribution returns a single part; as θ goes to infinity, the Ewens distribution returns n parts each with a value of one.

The Ewens distribution also gives the expected count of objects belonging to category k when a sample of size n is drawn from an infinitely large population. For this reason, the formula for the probability mass function of the Ewens distribution is also called the Ewens sampling formula [@ewens1972sampling].

Suppose that we have m1 categories which appear once in the sample, m2 categories which appear twice in the sample, and so on. Then the probability of the vector m is as follows:

$$ p(m_1, ..., m_n; \theta) = \frac{n!}{\theta (\theta + 1) ... (\theta + n - 1)} \prod_{j=1}^n \frac{\theta^{m_j}}{j^{m_j} m_j! } \qquad(1)$$

The Ewens distribution is related to other constructions in statistics. For example: to sample from a Ewens distribution, and return a sample of size n where each sample observation belongs to some category k, we follow the Chinese Restaurant Process [@aldous1983exchangeability, p. 92] for n steps. We imagine n customers entering the restaurant and choosing to sit at one of an infinite number of infinitely large banquet tables. The first customer enters, and sits at a table. The second customer enters, and either

The third customer enters, and either

In general, the jth customer

From the Ewens sampling formula, we can derive an expression for the probability of there being k parts in the partition of n – or alternately, the probability distribution of the number of tables in our Chinese restaurant. This formula is sometimes referred to as the Chinese restaurant table distribution – or at least so says Wikipedia. This probability mass function is given by

$$ Pr(K = k) = \lvert{} S^k_n \rvert{} \frac{\theta^k}{\theta (\theta + 1) ... (\theta + n - 1)} \qquad(2)$$

where |Snk$ is the absolute value of a Stirling number of the first kind. If we are solely interested in the expected number of tables, rather than the entire distribution, we can use the following expression:

$$ \newcommand{\Expect}{\operatorname{\mathbb{E}}} \Expect{[k]} = \theta \sum_{j=1}^{n} \frac{1}{\theta + j - 1} \qquad(3)$$

Thus, if we know our sample size and the value of θ, we can work out the expected number of classes. One might think that it wouldn’t be possible to apply this logic in reverse, and arrive at an expression for θ in terms of the sample size and the observed number of classes. After all, the observed number of classes reports much less information than the full vector of frequencies m1, ..., mn. This seems plausible on the face of it, and yet it is wrong. As @tavare2021magical writes, discussing the Ewens sampling formula in its original context of population genetics:

“In statistical parlance, the number Kn of different alleles observed in the sample is sufficient for the parameter θ. It follows from the Rao–Blackwell theorem that estimation of should be based on Kn; earlier, estimation of had been based on the observed allele frequencies”

The log-likelihood for θ given K and n is:

$$ K \log \theta - \sum_{i=0}^{n-1}\log (\theta + i) + constant $$

If, for the purposes of maximum likelihood estimation, we differentiate this and set to zero, we get:

$$ \frac{\mathrm{d} \ell}{\mathrm{d} \theta} = \frac{K}{\theta} - \sum_{i=0}^{n-1}\frac{1}{\theta +i} \qquad(4)$$

Useful readings on the Ewens distribution include @crane2016ubiquitous and @tavare2021magical; @ewens1972sampling is probably of historical interest only.

R implementation

The dewens function gives the probability mass function in Equation 1. It accepts a vector with class memberships and tabulates the number of classes that appear once, twice, and so on.

The rewens function samples from the Ewens distribution by following the Chinese Restaurant Process.

The probability mass function given in Equation 2 is implemented in dewens_k; the expression for the mean K in Equation 3 in function ewens_k_exact.

Finally, function ewens_mle takes an input vector and estimates θ.