\documentclass[11pt]{article} % \VignetteEngine{knitr::knitr} %\VignetteIndexEntry{The nieve package: Yet Another Extreme Value package?} %\VignetteEncoding{UTF-8} %% =========================================================================== %% Copyright Yves Deville 2022 %% =========================================================================== \usepackage{amsmath,amssymb,amsthm} \usepackage[english]{babel} \usepackage{hyperref} \usepackage[T1]{fontenc} \usepackage{fullpage} \usepackage{natbib} \setcitestyle{author} \usepackage{hyperref} \definecolor{MonVert}{rgb}{0.398,0.801,0.000} \definecolor{MonVertF}{rgb}{0.13,0.54,0.13} \definecolor{MonBleu}{rgb}{0.000,0.602,0.801} \definecolor{SteelBlue2}{rgb}{0.359375,0.671875,0.9296875} \definecolor{orange}{rgb}{1.0,0.6470,0.0} \definecolor{SteelBlue4}{rgb}{0.212, 0.392, 0.545} \definecolor{orange1}{rgb}{0.996,0.645,0} \newcommand{\m}{\mathbf} \newcommand{\bs}{\boldsymbol} \newcommand{\pkg}[1]{\textbf{#1}} \newcommand{\code}[1]{\texttt{#1}} \newcommand{\Gre}[1]{{\color{MonVertF}#1}} \title{\bf \Large \textbf{nieve}: Yet Another Extreme Value package?} \author{Yves Deville \href{mailto:deville.yves@alpestat.com}%% {deville.yves@alpestat.com} } %% \date{} \begin{document} \maketitle{} \tableofcontents{} \section{Probability functions of Extreme-Value distributions} The probability functions for the Generalized Pareto (GP) and Generalized Extreme Value (GEV) distributions \citep{Coles_ISMEV} are of ubiquitous use in Extreme Value (EV) analysis. These functions depend smoothly on the parameters: they are infinitely differentiable functions of the parameters. However, these functions are not \textit{analytic} functions of the parameters and a singularity exists for all of them when the shape parameter, say $\xi$, is zero. In practice, the functions are given with different formulas depending on whether the shape parameter~$\xi$ is zero or not; the formulas for $\xi = 0$ relate to the exponential and Gumbel distributions and correspond to the limit for $\xi \to 0$ of the functions given by the formulas for $\xi \neq 0$. As an example, consider the quantile function of the Generalized Pareto distribution with shape $\xi$ and unit scale \begin{equation} \label{eq:Quant} q(p) = \begin{cases} [(1 - p)^{-\xi} - 1] / \xi & \xi \neq 0 \\ -\log(1 - p) & \xi = 0, \end{cases} \qquad 0 < p < 1. \end{equation} It can be shown that for $\xi \approx 0$ whatever be $p$ $$ q \approx - \log(1-p), \qquad \frac{\partial q}{\partial \xi} \approx \frac{1}{2}\, \log^2(1-p), \qquad \frac{\partial^2 q}{\partial \xi^2} \approx -\frac{1}{3}\, \log^3(1-p). $$ It is quite easy to obtain expressions for the derivatives w.r.t. $\xi$ using the definition (\ref{eq:Quant}). We can even rely on the symbolic differentiation method \verb@D@ available in R which, as opposed to me and many other humans, never makes any mistake when differentiating. <>= qEx <- function(p, xi) ((1 - p)^(-xi) - 1) / xi dqEx <- D(expression(((1 - p)^(-xi) - 1) / xi), name = "xi") d2qEx <- D(dqEx, name = "xi") p <- 0.99; pBar <- 1 - p xis <- c(1e-4, 1e-5, 1e-6, 1e-7, 1e-8, 1e-9) for (xi in xis) { r <- rbind("ord 0" = c("lim" = -log(pBar), "der" = qEx(p = p, xi = xi)), "ord 1" = c("lim" = log(pBar)^2 / 2, "der" = eval(dqEx, list(p = p, xi = xi))), "ord 2" = c("lim" = -log(pBar)^3 / 3, "der" = eval(d2qEx, list(x = p, xi = xi)))) cat("xi = ", xi, "\n") print(r) } @ \noindent We see that the formula for the function works fine. However, the formula for the 2-nd order derivative can be completely wrong when $\xi$ is about $\text{1e-6}$ and the formula for the 1-st order derivative can also be wrong when $\xi$ is about $\text{1e-8}$. Although evaluated at a reasonably small value of $\xi$, the derivative given in the column \code{der} differs much from its limit in column~\code{lim}. The reason is that the formulas for the derivatives involve difference and/or fractions or small quantities because~$\xi$ or $\xi^2$ comes at the denominator. As a general rule, the derivatives with higher order are more difficult to evaluate numerically, since they involve more complex expressions. Note that using a shape $\xi$ with $|\xi| \leqslant \text{1e-6}$ is quite common in EV analysis because the values of $\xi$ used in practice are often quite small, and moreover very small values of $\xi$ are often used in the initialization of the Maximum-Likelihood (ML) optimization. Let us now see what \pkg{nieve} tells about the derivatives. These can be found as attributes of the result returned by the \verb@qGPD2@ function, with names \verb@"gradient"@ and \verb@"hessian"@. If the formal argument \code{p} is a vector with length \code{n}, the attributes are arrays with dimension \code{c(n, p)} and \code{c(n, p, p)} where \code{p} stands for the number of parameters, here \code{p = 2}. The arrays of derivatives have suitable \verb@dimnames@ hence can be indexed with characters as in \verb@H[1 , "scale", "shape"]@ if \verb@H@ is the Hessian found in the corresponding attribute. <>= library(nieve) for (xi in xis) { x <- qGPD2(p, shape = xi, deriv = TRUE, hessian = TRUE) r <- rbind("ord 0" = c("lim" = -log(pBar), "der" = x), "ord 1" = c("lim" = log(pBar)^2 / 2, "der" = attr(x, "gradient")[1, "shape"]), "ord 2" = c("lim" = -log(pBar)^3 / 3, "der" = attr(x, "hessian")[1, "shape", "shape"])) cat("xi = ", xi, "\n") print(r) } @ \noindent So, no more major departures from the theory can be seen. Although not yet widespread, the use of the exact formulas for the derivatives w.r.t. the parameters can be of great help in the optimization tasks required in EV analysis. These tasks of course involve the ML estimation, but also profile likelihood inference for models with covariates. Differential equations methods can be also used to derive confidence intervals. Note that the use of formulas for the derivatives is called \textit{symbolic} differentiation and differs from \textit{automatic} differentiation as increasingly available. However, from the previous analysis it transpires that the cure might be worse than the disease. Unless the derivatives are evaluated with great care, they can merely ruin an optimization in which they are involved, instead of improving it. Our strategy consists in fixing a small $\epsilon >0$ and use the formulas for $\xi \neq 0$ only when $|\xi| > \epsilon$. When $|\xi| \leqslant \epsilon$, we use a few terms from the Taylor series at $\xi =0$ e.g., $$ q(p;\,\xi) \approx q(p;\, 0) + \left. \partial_{\xi} q(p;\, \xi)\right|_{\xi = 0} \times \xi + \frac{1}{2} \, \left.\partial^2_{\xi,\xi} q(p;\, \xi)\right|_{\xi = 0} \times \xi^2 + o(\xi^2). $$ In order to maintain the consistency between the derivatives, it seems good to use the same $\epsilon$ for all the derivatives and use consistent Taylor approximations, so use the order $1$ for the derivative~$\partial_{\xi} q$ and the order order~$0$ for~$\partial^2_{\xi,\xi} q$ or for a crossed derivative such as~$\partial^2_{\sigma,\xi} q$. Since the $2$-nd order derivatives can be required, we must take a value for $\epsilon$ which is not too small: $\text{1e-4}$ or $\text{1e-5}$, not much smaller. Note that~$\epsilon$ gives the level of error for the $2$-nd order derivative; since the error on the function is~$O(\xi^3)$, using $\epsilon = \text{1e-4}$ leads to an error of order~$\text{1e-12}$ on the function, which seems acceptable in practice. This kind of approximation is used in some codes of the \textbf{revdbayes} package by Paul Northrop, see the code if the \code{dgev} and \code{pgev} functions on \href{https://github.com/paulnorthrop/revdbayes/blob/master/R/distributions.R}{GitHub repos}. \section{Deriving the formulas} The technical reports provided with \textbf{nieve}: \href{https://github.com/yvesdeville/nieve/blob/main/inst/computing/GEV.pdf}{GEV.pdf}, \href{https://github.com/yvesdeville/nieve/blob/main/inst/computing/GPD2.pdf}{GPD2.pdf}, \href{https://github.com/yvesdeville/nieve/blob/main/inst/computing/PoisGP2PP.pdf}{PoisGP2PP.pdf} and \href{https://github.com/yvesdeville/nieve/blob/main/inst/computing/PP2PoisGP.pdf}{PP2PoisGP.pdf} give the exact expressions for the first-order and the second-order derivatives of the probability functions w.r.t. the parameters and also provide workable approximations for the case $\xi \approx 0$. We used the \href{https://maxima.sourceforge.io/}{\pkg{Maxima} Computer Algebra System}~\citep{Maxima} along with the \href{https://maxima.sourceforge.io/contrib/maxiplot/maxiplot.sty}{\pkg{maxiplot}} package for \LaTeX{}. The technical reports follow the following rules. \begin{itemize} \item The {\color{MonVertF} \bf raw expressions given by Maxima are reported in green}. The expressions can be regarded as exact, not being influenced by manual computations. However these formulas are usually difficult to use in a compiled code and require some manual transformation for this aim. \item The {\color{red} \bf simplified expressions are reported in red}. These expressions are derived by us from the raw expressions; they are influenced by manual computations hence could in principle contain errors, although they have been carefully checked. These formulas are used to write the compiled code. They often use auxiliary variables that are shared across several formulas. \end{itemize} \section{Testing the derivatives} The \textbf{nieve} package comes with a series of tests in the format of the \textbf{testthat} package. The \textbf{numDeriv} package is used to compute the derivatives by numeric differentiation up to the order~2; these derivatives are compared to those provided by the formulas. A quite difficult task when checking derivatives is to give a threshold used to decide if the difference between the numeric derivative and the symbolic derivative, say the ``error'', is acceptable or not. This error has two sources: one is the numerical evaluation of the symbolic derivative or of its approximation for $\xi \approx 0$ (see example above), and the other is the approximation used in numeric differentiation where the limit defining the derivative is replaced by a finite difference. We should use small values of $\xi$ with $|\xi| < \epsilon$ to check that the approximation for small~$\xi$ is correct, although we can only test the approximation at the first order by doing so. Then, with a good choice of $\epsilon$ the error should be mainly due to the numeric differentiation. But when the true derivative is small, the relative error may be large (think of a true derivative which is exactly zero). On the other hand, when a derivative is large in absolute value, the absolute error may also be quite large. Mind that the derivatives can in practice be very small or very large, and also that a gradient vector or a Hessian matrix often contain values that are not of the same order of magnitude. We check that either the \textit{absolute} error or the \textit{relative} error is small. The idea is that none of these two things can come by chance, and if one holds, the formula used must be good even if the other criterion suggests an opposed conclusion. The test is made \textit{elementwise}, meaning that the relative error is computed for each element of a gradient vector or Hessian matrix ignoring the other elements. \section{Parameterizations for Peaks Over Threshold models} Peaks Over Threshold (POT) models are very popular in EV analysis. These models relate to marked Poisson Process: at each time $T_i$ in a sequence or random times $T_1 < T_2 < \dots$ we observe a random variable $Y_i$ called the \textit{mark}. In applications, the time $T_i$ is typically for the occurrence a storm and the mark $Y_i$ can be an amount of precipitation or a sea level. The following two frameworks are commonly used, see~\cite{NorthropEtAl_NSExtremes} for more details. \begin{itemize} \item The \textit{Poisson-GP} framework involves a given threshold $u$ having the same physical dimension as the marks $Y_i$. It assumes that the $T_i$ form an homogeneous Poisson Process with rate $\lambda_u$, and that $Y_i$ are i.i.d. with a Generalized Pareto distribution $\text{GPD}(u, \, \sigma_u, \,\xi)$ or equivalently that the so-called \textit{excesses} over the threshold $Y_i -u$ follow the two-parameter GP distribution with scale $\sigma >0$ and shape $\xi$. The marks $Y_i$ are further assumed to be independent of the event process $\{T_i\}_i$. The parameters form the vector $\bs{\theta}_u=[\lambda_u,\, \sigma_u,\, \xi]$. \begin{figure} \centering \label{BlockMaxima} \includegraphics[width=0.8\textwidth]{images/POTAggreg.pdf} \caption{\sf \small Block maxima by aggregation of the Poisson-GP.} \end{figure} \item An alternative framework is often named \textit{Point Process} (PP) or \textit{Non-Homogeneous Poisson Process} (NHPP). It involves a reference duration $w>0$, usually chosen to be one year, and a vector of parameters $\bs{\theta}^\star := [\mu_w^\star,\,\sigma_w^\star,\, \xi^\star]$. The random observations $[T_i,\, Y_i]$ are given by a Poisson process on the $(t,\,y)$-plane with intensity $$ \gamma^\star_w(t,\, y) := \frac{1}{w} \times \frac{1}{\sigma^\star_w}\, \left[1 + \xi^\star \, \frac{y- \mu^\star_w}{\sigma^\star_w} \right]^{-1 / \xi^\star -1} \, 1_{\mathcal{S}_{\bs{\theta}^\star_w}}(t,\, y), $$ where $\mathcal{S}_{\bs{\theta}^\star_w}$ is the domain of the plane $$ \mathcal{S}_{\bs{\theta}_w^\star} := \left\{ [t,\, y]: \: \sigma_w^\star + \xi^\star [y - \mu_w^\star] > 0 \right\}. $$ This is a half-plane for $\xi^\star \neq 0$ and the whole plane for $\xi^\star = 0$. \end{itemize} If we take the observation region for the PP model as being the product of the time interval $(0, \, t^\dag)$ by the $y$-interval $(u, \, \infty)$ where the threshold~$u$ is such that $\sigma_w^\star + \xi^\star [u - \mu_w^\star] > 0$, we get the same model as the Poisson-GP on $(0, \, t^\dag)$, up to a re-parameterization. As an interesting feature of the PP parameter $\bs{\theta}^\star_w$, it does not depend on the threshold~$u$. It relates to the distribution of the maximum $M$ of the marks $Y_i$ corresponding to a time interval with duration~$w$, see Figure~\ref{BlockMaxima}. This distribution is indeed essentially the $\text{GEV}(\mu_w^\star, \, \sigma_w^\star,\,\xi^\star)$, up to an atom corresponding to the possibility that no mark is observed during the time interval. We can then define $M:= -\infty$ since this is arguably the maximum of the empty set, and the probability of the corresponding event is $\exp\{- \lambda_u w\}$. We may speak of $\text{GEV}(\mu_w^\star, \, \sigma_w^\star,\,\xi^\star)$ as the \textit{GEV reference distribution} in relation to~$w$, although this is not exactly the distribution of a maximum~$M$ over the reference duration. For a given threshold $u$ and a given reference duration $w>0$, the one-to-one correspondence between the vectors $\bs{\theta}_u$ and $\bs{\theta}_w^\star$ is given by \begin{equation} \label{eq:poisGP2PP} \left\{ \begin{array}{c c l} \mu^\star_w &=& u + \frac{(\lambda_u w)^\xi - 1}{\xi} \, \sigma_u, \\ \sigma^\star_w &=& (\lambda_u w)^\xi \, \sigma_u, \rule{0pt}{1em}\\ \xi^\star &=& \xi, \rule{0pt}{1em} \end{array} \right. \end{equation} the fraction $[(\lambda_u w)^\xi - 1]/\xi$ of the first equation being to be replaced for $\xi = 0$ by its limit $\log(\lambda_u w)$. The reciprocal transformation is \begin{equation} \label{eq:PP2poisGP} \left\{ \begin{array}{c c l} \sigma_u &=& \sigma_w^\star + \xi^\star \left[ u - \mu_w^\star \right],\\ \lambda_u &=& w^{-1} \, \left[\sigma_u / \sigma_w^\star \right]^{-1/ \xi^\star}, \rule{0pt}{1.1em}\\ \xi &=& \xi^\star, \rule{0pt}{1.1em} \\ \end{array} \right. \end{equation} where the second equation becomes $\lambda_u = w^{-1}$ for $\xi^\star = 0$. \begin{figure} \centering \includegraphics[width=0.8\textwidth]{images/TwoParameterisationsC.pdf} \caption{\sf \small Two parameterizations for POT models.} \end{figure} When a vector $\m{x}$ of covariates can be used, two different kinds of so-called \textit{non-stationary} POT models can be obtained by relating or ``linking'' the three parameters to the covariates, either parametrically or non-parametrically. For instance the Poisson-GP scale or its logarithm can be specified as having the parametric form $\m{x}^\top \bs{\beta}^\sigma$ where $\bs{\beta}^\sigma$ is a vector of parameters. Different forms of models arise from the two frameworks. Moreover, parametric Poisson-GP models relate to a specific threshold since the same form of link can not persist when the threshold is changed. Remind also that the threshold should in general depend on the covariates. Anyway, for non-stationary models it is often useful to transform one of the two parameterizations into the other. If a Poisson-GP model is used, we may be interested in the GEV reference distribution conditional on a given value $\m{x}$. When instead a PP model is used, it may be useful to investigate the relation of the implied rate $\lambda_u$ with the covariates, and possibly to compare the POT estimate with a non-parametric estimate. The \textbf{nieve} package provides the two transformations~(\ref{eq:poisGP2PP}) and~(\ref{eq:PP2poisGP}) required for all these tasks --~along with their Jacobian, under the names \verb@poisGP2PP@ and \verb@PP2poisGP@. As for the probability functions, the singularity for $\xi = 0$ or $\xi^\star = 0$ is coped with by using a second-order Taylor approximation. As often required when coping with non-stationary POT models, the functions \verb@poisGP2PP@ and \verb@PP2poisGP@ are vectorized w.r.t. their arguments including \verb@threshold@ for the \verb@PP2poisGP@ transformation. Each element $i$ in the provided vector arguments (such as \verb@lambda@) correspond to a value $\m{x}_i$ of the covariates, and most often the threshold is then also chosen as depending on the covariates, hence used via a vector with $n$ elements $u_i = u(\m{x}_i)$. The condition $$ \sigma^\star_i + \xi^\star_i [u_i - \mu^\star_i] > 0, \qquad i=1,\, \dots,\, n $$ should then hold. There does not seem to exist any motivation for using a reference duration~$w$ depending on $i$, hence the function \code{poisGP2PP} only accepts a length-one argument~\code{w}. \section{EV distributions from other R packages} The EV distributions are implemented in many R packages. A variety of strategies regarding the problem $\xi = 0$ can be found. We now describe these strategies and provide for each of them a ``code'' (shown as framed: \fbox{NT}, ...) that is used in Table~\ref{TabXi}, columns $\xi = 0$. Each strategy is briefly discussed. \begin{enumerate} \item \fbox{NT} Use only the formula for $\xi \neq 0$ i.e., {``do nothing''}. In practice, an optimization or sampling algorithm will never come to the case $\xi = 0$ \textit{exactly} and this can only happen when the user gives this value e.g., as an initial value. There will be some numerical problems when $\xi$ is very small, say $\xi = \text{1e-14}$ or less. These problems are not so crucial for the usual probability functions: we get some wiggling when plotting the curves and zooming. Mind however that the random generation functions\footnote{Usually given names such as \code{"rgev"} or \code{"rgpd"}.} will produce silly results with a very small $\xi$ if they are based on the corresponding quantile functions. \item \fbox{$0.0$} Test the exact equality $\xi = 0$, and if this is true, \textit{switch to the exponential/Gumbel formula}. This helps only when the user gives \code{xi = 0.0}, but we are essentially doing the same thing as in \fbox{NT}. \item \fbox{$\epsilon$/S} Test the equality $ |\xi| \leqslant \epsilon$ where $\epsilon >0$ is very small, and if this is true, \textit{switch to the exponential/Gumbel formula}. So this produces a (very small) discontinuity. E.g., \pkg{Renext} uses $\epsilon \approx \text{2e-14}$. \item \fbox{$\epsilon$/AI} Test the equality $ |\xi| \leqslant \epsilon$ where $\epsilon >0$ is very small, and if this is true, use a dedicated \textit{approximation or interpolation}. Several methods can be used including Taylor approximations. The discontinuity should then be undetectable. Mind the probability functions although not being \textit{analytic} functions, are infinitely differentiable $C^\infty$ w.r.t. the parameters. \end{enumerate} Note that some of the cited packages are quite old: \pkg{evir}~\citep{pack_evir}, \pkg{evd}~\citep{pack_evd}, \pkg{ismev}~\citep{pack_ismev}, \pkg{Renext}~\citep{pack_Renext}, \pkg{POT}~\citep{pack_POT} and \pkg{SpatialExtremes}~\cite{pack_SpatialExtremes}. The packages \pkg{revdbayes}~\citep{pack_revdbayes} and \pkg{mev}~\citep{pack_mev} are more recent. See the CRAN Task View on Extreme Value Analysis \citep{TV_ExtremeValue} for an extended list of packages devoted to EV. Also it is worth mentioning that the \pkg{extRemes} package~\citep{pack_extRemes} optionally uses the exact gradient of the log-likelihood for models with GEV and GP margins but the derivatives are coded (in R) only for internal use in optimization tasks. \begin{table} \centering \small \begin{tabular}{| l || c | c | c | c | c || c | c | c | c | c |} \hline \multicolumn{1}{|c||}{\raisebox{-0.3em}{Package}} & \multicolumn{5}{c||}{GEV} & \multicolumn{5}{c|}{GPD}\\ \cline{2-6} \cline{7-11} \multicolumn{1}{|c||}{} & \multicolumn{1}{c|}{Lang.} & \multicolumn{1}{c|}{Vec. $\bs{\theta}$} & \multicolumn{1}{c|}{Grad.} & \multicolumn{1}{c|}{Hess.} & \multicolumn{1}{c||}{$\xi=0$} & \multicolumn{1}{c|}{Lang.} & \multicolumn{1}{c|}{Vec. $\bs{\theta}$} & \multicolumn{1}{c|}{Grad.} & \multicolumn{1}{c|}{Hess.} & \multicolumn{1}{c|}{$\xi=0$} \\ \hline\hline \textbf{evir} & R & no & no & no & NT & R & no & no & no & NT\\ \hline \textbf{evd} & R & no & no & no & 0.0 & R & no & no & no & 0.0\\ \hline \textbf{ismev} & R$^\star$ & no & no & no & NT & R$^\star$ & no & no & no & NT\\ \hline \textbf{Renext} & & & & & & R & no & yes & yes & $\epsilon$/S \\ \hline \textbf{POT} & & & & & & R$^\star$ & yes & no & no & 0.0 \\ \hline \textbf{SpatialExtremes} & R & yes & no & no & $0.0$ & R & yes & no & no & $0.0$ \\ \hline \textbf{revdbayes} & R & yes & no & no & $\epsilon$/AI & R & yes & no & no & $\epsilon$/AI \\ \hline \textbf{mev} & R & yes & yes$^\star$ & yes$^\star$& NT & R & yes & yes$^\star$ & yes$^\star$ & NT \\ \hline \Gre{\textbf{nieve}} & \Gre{C} & \Gre{yes} & \Gre{yes} & \Gre{yes} & \Gre{$\epsilon$/AI} & \Gre{C} & \Gre{yes} & \Gre{yes} & \Gre{yes} & \Gre{$\epsilon$/AI} \\ \hline \end{tabular} \caption{\label{TabXi}\small \sf Features of some CRAN packages. \textit{Lang.}: the implementation language, \textit{Vec.} $\bs{\theta}$: vectorized w.r.t. the parameters. The columns \textit{Grad.} and the \textit{Hess.} indicate if the gradient and Hessian are provided, and the columns $\epsilon = 0$ indicate the strategy used to cope with a zero or small shape, as described in the text. A star $\star$ means that the functions are not exported.} \end{table} \section*{Acknowledgments} The \pkg{nieve} package was partly funded by the French \href{https://www.irsn.fr/}{Institut de Radioprotection et Sûreté Nucléaire (IRSN)} and some of the code formerly was part of R packages owned by IRSN/Behrig. We are grateful to the authors and contributors of \textbf{Maxima} and to the authors of the \textbf{maxiplot} \LaTeX{} package (J.M. Planas and José Manuel Mira univ. de Murcia, Spain) which helped much for the tedious computations required by the package. \bibliographystyle{apalike} \bibliography{nieve} \end{document}