---
title: "Sampling from the Ewens distribution"
vignette: >
  %\VignetteIndexEntry{Sampling from the Ewens distribution}
  %\VignetteEngine{quarto::html}
  %\VignetteEncoding{UTF-8}
knitr:
  opts_chunk:
    collapse: true
    comment: '#>'
---

```{r}
#| label: setup
library(ewens)
```

# The Ewens distribution

The Ewens distribution is a distribution over partitions of
integers. It has a single non-negative parameter $\theta$, which
determines how many parts one can expect in the partition: as $\theta$
goes to zero, the Ewens distribution returns a single part; as
$\theta$ goes to infinity, the Ewens distribution returns $n$ parts
each with a value of one.

The Ewens distribution also gives the expected count of objects
belonging to category $k$ when a sample of size $n$ is drawn from an
infinitely large population. For this reason, the formula for the
probability mass function of the Ewens distribution is also called the
Ewens sampling formula [@ewens1972sampling]. 

Suppose that we have $m_1$ categories which appear once in the sample,
$m_2$ categories which appear twice in the sample, and so on. Then the
probability of the vector $m$ is as follows:

$$
p(m_1, ..., m_n; \theta) = \frac{n!}{\theta (\theta + 1) ... (\theta + n - 1)} \prod_{j=1}^n \frac{\theta^{m_j}}{j^{m_j} m_j! }
$$ {#eq-ewens-pmf}

The Ewens distribution is related to other constructions in
statistics. For example: to *sample* from a Ewens distribution, and
return a sample of size $n$ where each sample observation belongs to
some category $k$, we follow the Chinese Restaurant Process
[@aldous1983exchangeability, p. 92] for $n$ steps. We imagine $n$
customers entering the restaurant and choosing to sit at one of an
infinite number of infinitely large banquet tables. The first customer
enters, and sits at a table. The second customer enters, and either

 - sits at a new table with probability $\frac{\theta}{2 - 1 + \theta}$, or
 - sits at the occupied table with probability $\frac{1}{2 - 1 + \theta}$

The third customer enters, and either

 - sits at a new table with probability $\frac{\theta}{3 - 1 + \theta}$, or 
 - sits at one of the occupied tables with probability $\frac{c}{3 -
   1 + \theta}$, where $c$ is the count of people at that table.

In general, the $j$th customer

 - sits at a new table with probability $\frac{\theta}{j - 1 + \theta}$, or 
 - sits at an occupied table with probability $\frac{c}{j - 1 + \theta}$
 
From the Ewens sampling formula, we can derive an expression for the
probability of there being $k$ parts in the partition of $n$ -- or
alternately, the probability distribution of the number of tables in
our Chinese restaurant. This formula is sometimes referred to as the
Chinese restaurant table *distribution* -- or at least so says
[Wikipedia](https://en.wikipedia.org/wiki/Chinese_restaurant_process#Expected_number_of_tables). This
probability mass function is given by

$$
Pr(K = k) = \lvert{} S^k_n \rvert{} \frac{\theta^k}{\theta (\theta + 1) ... (\theta + n - 1)}
$$ {#eq-crt-pmf}

where $\lvert{}S^k_n$\rvert{}$ is the absolute value of a Stirling
number of the first kind. If we are solely interested in the expected
number of tables, rather than the entire distribution, we can use the
following expression:

$$
\newcommand{\Expect}{\operatorname{\mathbb{E}}}
\Expect{[k]} = \theta \sum_{j=1}^{n} \frac{1}{\theta + j - 1}
$$ {#eq-crt-expectation}

Thus, if we know our sample size and the value of $\theta$, we can
work out the expected number of classes. One might think that it
wouldn't be possible to apply this logic in reverse, and arrive at an
expression for $\theta$ in terms of the sample size and the observed
number of classes. After all, the observed number of classes reports
much less information than the full vector of frequencies $m_1, ...,
m_n$. This seems plausible on the face of it, and yet it is wrong. As
@tavare2021magical writes, discussing the Ewens sampling formula in
its original context of population genetics:

> "In statistical parlance, the number $K_n$ of different alleles
> observed in the sample is sufficient for the parameter $\theta$. It
> follows from the Rao–Blackwell theorem that estimation of \theta
> should be based on $K_n$; earlier, estimation of \theta had been
> based on the observed allele frequencies"

The log-likelihood for $\theta$ given $K$ and $n$ is:

$$
K \log \theta - \sum_{i=0}^{n-1}\log (\theta + i) + constant
$$

If, for the purposes of maximum likelihood estimation, we
differentiate this and set to zero, we get:

$$
\frac{\mathrm{d} \ell}{\mathrm{d} \theta} = \frac{K}{\theta} - \sum_{i=0}^{n-1}\frac{1}{\theta +i}
$$ {#eq-mle}

Useful readings on the Ewens distribution include @crane2016ubiquitous
and @tavare2021magical; @ewens1972sampling is probably of historical
interest only.

# R implementation

The `dewens` function gives the probability mass function in
@eq-ewens-pmf. It accepts a vector with class memberships and
tabulates the number of classes that appear once, twice, and so on.

The `rewens` function samples from the Ewens distribution by following
the Chinese Restaurant Process.

The probability mass function given in @eq-crt-pmf is implemented in
`dewens_k`; the expression for the mean K in @eq-crt-expectation in
function `ewens_k_exact`.

Finally, function `ewens_mle` takes an input vector and estimates $\theta$.