--- title: "Sampling from the Ewens distribution" vignette: > %\VignetteIndexEntry{Sampling from the Ewens distribution} %\VignetteEngine{quarto::html} %\VignetteEncoding{UTF-8} knitr: opts_chunk: collapse: true comment: '#>' --- ```{r} #| label: setup library(ewens) ``` # The Ewens distribution The Ewens distribution is a distribution over partitions of integers. It has a single non-negative parameter $\theta$, which determines how many parts one can expect in the partition: as $\theta$ goes to zero, the Ewens distribution returns a single part; as $\theta$ goes to infinity, the Ewens distribution returns $n$ parts each with a value of one. The Ewens distribution also gives the expected count of objects belonging to category $k$ when a sample of size $n$ is drawn from an infinitely large population. For this reason, the formula for the probability mass function of the Ewens distribution is also called the Ewens sampling formula [@ewens1972sampling]. Suppose that we have $m_1$ categories which appear once in the sample, $m_2$ categories which appear twice in the sample, and so on. Then the probability of the vector $m$ is as follows: $$ p(m_1, ..., m_n; \theta) = \frac{n!}{\theta (\theta + 1) ... (\theta + n - 1)} \prod_{j=1}^n \frac{\theta^{m_j}}{j^{m_j} m_j! } $$ {#eq-ewens-pmf} The Ewens distribution is related to other constructions in statistics. For example: to *sample* from a Ewens distribution, and return a sample of size $n$ where each sample observation belongs to some category $k$, we follow the Chinese Restaurant Process [@aldous1983exchangeability, p. 92] for $n$ steps. We imagine $n$ customers entering the restaurant and choosing to sit at one of an infinite number of infinitely large banquet tables. The first customer enters, and sits at a table. The second customer enters, and either - sits at a new table with probability $\frac{\theta}{2 - 1 + \theta}$, or - sits at the occupied table with probability $\frac{1}{2 - 1 + \theta}$ The third customer enters, and either - sits at a new table with probability $\frac{\theta}{3 - 1 + \theta}$, or - sits at one of the occupied tables with probability $\frac{c}{3 - 1 + \theta}$, where $c$ is the count of people at that table. In general, the $j$th customer - sits at a new table with probability $\frac{\theta}{j - 1 + \theta}$, or - sits at an occupied table with probability $\frac{c}{j - 1 + \theta}$ From the Ewens sampling formula, we can derive an expression for the probability of there being $k$ parts in the partition of $n$ -- or alternately, the probability distribution of the number of tables in our Chinese restaurant. This formula is sometimes referred to as the Chinese restaurant table *distribution* -- or at least so says [Wikipedia](https://en.wikipedia.org/wiki/Chinese_restaurant_process#Expected_number_of_tables). This probability mass function is given by $$ Pr(K = k) = \lvert{} S^k_n \rvert{} \frac{\theta^k}{\theta (\theta + 1) ... (\theta + n - 1)} $$ {#eq-crt-pmf} where $\lvert{}S^k_n$\rvert{}$ is the absolute value of a Stirling number of the first kind. If we are solely interested in the expected number of tables, rather than the entire distribution, we can use the following expression: $$ \newcommand{\Expect}{\operatorname{\mathbb{E}}} \Expect{[k]} = \theta \sum_{j=1}^{n} \frac{1}{\theta + j - 1} $$ {#eq-crt-expectation} Thus, if we know our sample size and the value of $\theta$, we can work out the expected number of classes. One might think that it wouldn't be possible to apply this logic in reverse, and arrive at an expression for $\theta$ in terms of the sample size and the observed number of classes. After all, the observed number of classes reports much less information than the full vector of frequencies $m_1, ..., m_n$. This seems plausible on the face of it, and yet it is wrong. As @tavare2021magical writes, discussing the Ewens sampling formula in its original context of population genetics: > "In statistical parlance, the number $K_n$ of different alleles > observed in the sample is sufficient for the parameter $\theta$. It > follows from the Rao–Blackwell theorem that estimation of \theta > should be based on $K_n$; earlier, estimation of \theta had been > based on the observed allele frequencies" The log-likelihood for $\theta$ given $K$ and $n$ is: $$ K \log \theta - \sum_{i=0}^{n-1}\log (\theta + i) + constant $$ If, for the purposes of maximum likelihood estimation, we differentiate this and set to zero, we get: $$ \frac{\mathrm{d} \ell}{\mathrm{d} \theta} = \frac{K}{\theta} - \sum_{i=0}^{n-1}\frac{1}{\theta +i} $$ {#eq-mle} Useful readings on the Ewens distribution include @crane2016ubiquitous and @tavare2021magical; @ewens1972sampling is probably of historical interest only. # R implementation The `dewens` function gives the probability mass function in @eq-ewens-pmf. It accepts a vector with class memberships and tabulates the number of classes that appear once, twice, and so on. The `rewens` function samples from the Ewens distribution by following the Chinese Restaurant Process. The probability mass function given in @eq-crt-pmf is implemented in `dewens_k`; the expression for the mean K in @eq-crt-expectation in function `ewens_k_exact`. Finally, function `ewens_mle` takes an input vector and estimates $\theta$.