\documentclass{article} \usepackage{url} %\VignetteIndexEntry{Quantile rules} \usepackage{Sweave} \author{Thomas Lumley} \title{Estimating quantiles} \newlength{\wdth} \newcommand{\strike}[1]{\settowidth{\wdth}{#1}\rlap{\rule[.5ex]{\wdth}{.4pt}}#1} \begin{document} \SweaveOpts{concordance=TRUE} \maketitle The $p$th quantile is defined as the value where the estimated cumulative distribution function is equal to $p$. As with quantiles in unweighted data, this definition only pins down the quantile to an interval between two observations, and a rule is needed to interpolate. As the help for the base R function \texttt{quantile} explains, even before considering sampling weights there are many possible rules. Rules in the \texttt{svyquantile()} function can be divided into three classes \begin{itemize} \item Discrete rules, following types 1 to 3 in \texttt{quantile} \item Continuous rules, following types 4 to 9 in \texttt{quantile} \item A rule proposed by Shah \& Vaish (2006) and used in some versions of \textsf{SUDAAN} \end{itemize} \subsection*{Discrete rules} These are based on the discrete empirical CDF that puts weight proportional to the weight $w_k$ on values $x_k$. $$\hat F(x) = \frac{\sum_i \{x_i\leq x\}w_i}{\sum_i w_i}$$ \paragraph{The mathematical inverse} The mathematical inverse $\hat F^{-1}(p)$ of the CDF is the smallest $x$ such that $F(x)\geq p$. This is rule \texttt{hf1} and \texttt{math} and in equally-weighted data gives the same answer as \texttt{type=1} in \texttt{quantile} \paragraph{The primary-school median} The school definition of the median for an even number of observations is the average of the middle two observations. We extend this to say that the $p$th quantile is $q_{\mathrm{low}}=\hat F^{-1}(p)$ if $\hat F(q_{\mathrm{low}})=p$ and otherwise is the the average of $\hat F^{-1}(p)$ and the next higher observation. This is \texttt{school} and \texttt{hf2} and is the same as \texttt{type=2} in \texttt{quantile}. \paragraph{Nearest even order statistic} The $p$th quantile is whichever of $\hat F^{-1}(p)$ and the next higher observation is at an even-numbered position when the distinct data values are sorted. This is \texttt{hf3} and is the same as \texttt{type=3} in \texttt{quantile}. \subsection*{Continuous rules} These construct the empirical CDF as a piecewise-linear function and read off the quantile. They differ in the choice of points to interpolate. Hyndman \& Fan describe these as interpolating the points $(p_k,x_k)$ where $p_k$ is defined in terms of $k$ and $n$. For weighted use they have been redefined in terms of the cumulative weights $C_k=\sum_{i\leq k} w_i$, the total weight $C_n=\sum w_i$, and the weight $w_k$ on the $k$th observation. \begin{tabular}{lll} {\tt qrule} & Hyndman \& Fan & Weighted\\ \hline {\tt hf4} & $p_k= k/n$ & $p_k = C_k/C_n$\\ {\tt hf5} & $p_k= (k-0.5)/n$ & $p_k = (C_k-w_k)/C_n$\\ {\tt hf6} & $p_k= k/(n+1)$ & $p_k = C_k/(C_n+w_n)$\\ {\tt hf7} & $p_k= (k-1)/(n-1)$ & $p_k = C_{k-1}/C_{n-1}$\\ {\tt hf8} & $p_k= (k-1/3)/(n+2/3)$ & $p_k = (C_k-w_k/3)/(C_n+w_n/3)$\\ {\tt hf9} & $p_k= (k-3/8)/(n+1/4)$ & $p_k = (C_k-3w_k./8)/(C_n+w_n/4)$\\ \hline \end{tabular} \subsection*{Shah \& Vaish} This rule is related to {\tt hf6}, but it is discrete and more complicated. First, define $w^*_i=w_in/C_n$, so that $w^*_i$ sum to the sample size rather than the population size, and $C^*k$ as partial sums of $w^*_k$. Now define the estimated CDF by $$\hat F(x_k) =\frac{1}{n+1}\left(C^*_k+1/2-w_k/2 \right)$$ and take $\hat F^{-1}(p)$ as the $p$th quantile. \subsection*{Other options} It would be possible to redefine all the continuous estimators in terms of $w^*$, so that type 8, for example, would use $$p_k = (C^*_k-1/3)/(C^*_n+2/3)$$ Or a compromise, eg using $w^*_k$ in the numerator and numbers in the denominator, such as $$p_k = (C^*_k-w^*_k/3)/(C^*_n+2/3).$$ Comparing these would be \strike{a worthwhile}\dots\strike{an interesting}... a research question for simulation. \end{document}