% \VignetteIndexEntry{Tutorial}
% \VignetteDepends{vsgoftest}
% \VignetteKeyword{vsgoftest}
% \VignetteKeyword{tutorial}
% \VignetteKeyword{manual}
% \VignetteEngine{knitr::knitr}
\documentclass[a4paper, 10pt]{article}
\usepackage[a4paper, total={470pt, 583pt}]{geometry}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{amsmath}
\usepackage{amssymb}
\def\firstMarker{$^*$}
\def\secondMarker{$\dagger$}
\def\code{\texttt}
\def\pkg{\textbf}
\def\proglang{\textsf}
\newcommand{\der}{\mbox{d}}
\newcommand{\KK}{\mathbb{K}}
\newcommand{\HHH}{\mathbb{H}}
\newcommand{\SSS}{\mathbb{S}}
\newcommand{\R}{\mathbb{R}}
\newcommand{\cC}{\mathcal{C}}
\newcommand{\cE}{\mathcal{E}}
\newcommand{\cF}{\mathcal{F}}
\newcommand{\cP}{\mathcal{P}}
\newcommand{\N}{\mathbb{N}}
\newcommand{\mand}{\quad\mbox{and}\quad}
\def\id{\hbox{{\rm\bf 1}\kern-.4em\hbox{{\rm\bf 1}}}\! } %Indicator function
\newcommand{\kpq}{k}
\numberwithin{equation}{section}
\author{Justine Lequesne\firstMarker and Philippe Regnault\secondMarker}
\title{Package \pkg{vsgoftest}: goodness-of-fit tests based on Kullback-Leibler divergence}
\begin{document}
\maketitle
\begin{center}
\firstMarker Unit{\'e} de Recherche Clinique, Centre Henri Becquerel, 76038 Rouen Cedex 1, France. E-mail: justine.lequesne@chb.unicancer.fr \\
\secondMarker Laboratoire de Math{\'e}matiques de Reims, FRE 2011, Université de Reims Champagne-Ardenne, BP 1039, 51687 Reims Cedex 2, France. E-mail: {philippe.regnault@univ-reims.fr}
\end{center}

The \pkg{vsgoftest} package provides functions for estimating the Shannon entropy of absolutely continuous distributions and for testing the goodness-of-fit (GOF) of a theoretical family of distributions to a vector of real numbers. It also provides the density, cumulative distribution and quantile functions of the Pareto and Laplace distributions, as well as functions for generating samples from these distributions.

<<include = FALSE>>=
knitr::opts_chunk$set(comment = '')
@
The \pkg{vsgoftest} package is available on CRAN mirrors and can be installed by executing the command
<<eval = FALSE>>=
install.packages('vsgoftest')
@
Alternatively, the latest (development) version of the \pkg{vsgoftest} package can be installed in \proglang{R} from the GitHub repository of the project as follows:
<<eval = FALSE>>=
#Package devtools must be installed
devtools::install_github('pregnault/vsgoftest')
@
The package is structured around two functions, \code{entropy.estimate} and \code{vs.test}. The first computes the spacing-based estimator~(\ref{MEGOFPEqnVasiEst}) of the differential entropy
\begin{equation}
\SSS(P):=-\int_\R p(x) \log p(x) \der x
\label{shannon}
\end{equation}
of a distribution $P$ on $\R$ with density $p$, from a numeric sample $X_1, \dots, X_n$ drawn from $P$. The second performs the Vasicek-Song GOF test for usual parametric families of distributions. A comprehensive presentation of their usage, with numerous examples, is given in Sections~\ref{MEGOFPSecEntEst} and~\ref{MEGOFPSecVSTest}. An application to environmental data is presented in Section~\ref{TutoAppliData}. The theoretical background of entropy estimation and Vasicek-Song tests is presented in the Appendix. Further details about entropy estimation, Vasicek-Song goodness-of-fit tests and the contents and features of the package can be found in the references listed at the end of this document; see in particular~\cite{GirardinLequesne} and \cite{LequesneRegnault}.
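As a quick illustration of the Pareto and Laplace distribution functions mentioned above, the following chunk sketches calls to the Pareto-related functions; the argument names \code{mu} and \code{c} mirror those of the calls to \code{rpareto} and \code{ppareto} appearing later in this tutorial, and the Laplace counterparts (e.g., \code{dlaplace}) follow the same pattern.
<<eval = FALSE>>=
dpareto(2, mu = 2, c = 1)    #density of the Pareto(mu = 2, c = 1) distribution at x = 2
ppareto(2, mu = 2, c = 1)    #cumulative distribution function at q = 2
qpareto(0.75, mu = 2, c = 1) #quantile of order 0.75
rpareto(5, mu = 2, c = 1)    #sample of size 5
@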
\section{Function entropy.estimate for estimating differential entropy}
\label{MEGOFPSecEntEst}

The function \code{entropy.estimate} computes the spacing-based estimate~(\ref{MEGOFPEqnVasiEst}) of the Shannon entropy~(\ref{shannon}) of a numeric sample. Two arguments have to be provided:
\begin{itemize}
\item \code{x}: the numeric sample;
\item \code{window}: an integer between 1 and half of the sample size, specifying the window size of the spacing-based estimator~(\ref{MEGOFPEqnVasiEst}).
\end{itemize}
It returns the estimate of the Shannon entropy of the sample as a single numeric value. Here is an example for a sample drawn from a normal distribution with parameters $\mu = 0$ and $\sigma^2 = 1$.
<<>>=
library('vsgoftest')
set.seed(2) #set seed of PRNG
samp <- rnorm(n = 100, mean = 0, sd = 1) #sampling from normal distribution
entropy.estimate(x = samp, window = 8) #estimating entropy with window = 8
log(2*pi*exp(1))/2 #the exact value of entropy
@
The estimate returned by \code{entropy.estimate} obviously depends on the window size selected by the user, as illustrated by the following chunk.
<<>>=
sapply(1:10, function(w) entropy.estimate(x = samp, window = w))
@
One may select the window size that maximizes the entropy estimate, as follows.
<<>>=
n <- 100 #sample size
V <- sapply(1:(n/2 - 1), function(w) entropy.estimate(x = samp, window = w))
which.max(V) #choose the window size that maximizes the entropy estimate
@
Let us now consider a sample drawn from a Pareto distribution with density
$$p(x;c,\mu) = \frac{\mu c^\mu}{x^{\mu+1}}, \quad x \geq c,$$
where $c >0$ and $\mu >0$; such a sample can be generated by making use of the function \code{rpareto}, as illustrated below. The Shannon entropy of this distribution is
$$\SSS(p( \, . \, ;c, \mu)) = -\ln \mu + \ln c + \frac{1}{\mu} +1.$$
<<>>=
set.seed(5)
n <- 100 #sample size
samp <- rpareto(n, c = 1, mu = 2) #sampling from Pareto distribution
entropy.estimate(x = samp, window = 3)
-log(2) + 3/2 #exact value of entropy
@

\section{Function vs.test for testing GOF to a specified model}
\label{MEGOFPSecVSTest}

The function \code{vs.test} performs the Vasicek-Song GOF test (VS test) of a numeric sample drawn from an unknown distribution $P$. It tests either the so-called simple null hypothesis, for a prescribed distribution ${P}_{0}(\theta)$,
\begin{equation}\label{arthyps}
H_{0}: P ={P}_{0}(\theta) \quad\mbox{against}\quad H_{1}: P \ne {P}_{0}(\theta),
\end{equation}
or the so-called composite null hypothesis, for a parametric family $\cP_{0}(\Theta)$,
\begin{equation}\label{arthyp}
H_{0}: P \in \cP_{0}(\Theta) \quad\mbox{against}\quad H_{1}: P \notin \cP_{0}(\Theta);
\end{equation}
see the Appendix below for details. Setting two non-optional arguments is required:
\begin{itemize}
\item \code{x}: the numeric sample;
\item \code{densfun}: a character string specifying the theoretical family of distributions of the null hypothesis. Available families of distributions are: uniform, normal, log-normal, exponential, gamma, Weibull, Pareto, Fisher and Laplace distributions. They are referred to by the symbolic name in \proglang{R} of their density function; for example, set \code{densfun = 'dnorm'} to test GOF of the family of normal distributions.
\end{itemize}
It returns an object of class \code{htest}, i.e., a list whose main components are:
\begin{itemize}
\item \code{statistic}: the value of the VS test statistic~(\ref{PKGEqnTestStat}) for the sample, computed with the optimal window size defined by~(\ref{arteqm});
\item \code{parameter}: the optimal window size;
\item \code{estimate}: the maximum likelihood estimate of the parameters of the null distribution;
\item \code{p.value}: the p-value associated with the sample.
\end{itemize}
By default, \code{vs.test} performs the composite VS test of the family of distributions \code{densfun} for the sample~\code{x}. The p-value is estimated by means of Monte-Carlo simulation if the sample size is smaller than 80, or through the asymptotic distribution~(\ref{asymptoticn}) of the VS test statistic otherwise. In the following example, a normally distributed sample is simulated. The VS test rejects the null hypothesis that this sample is drawn from a Laplace distribution, but does not reject the normality hypothesis (for a significance level set to $0.05$).
<<>>=
set.seed(5)
samp <- rnorm(50,2,3)
vs.test(x = samp, densfun = 'dlaplace')
@
<<>>=
set.seed(4)
vs.test(x = samp, densfun = 'dnorm')
@
For performing a simple null hypothesis GOF test, the additional argument \code{param} has to be set as a numeric vector, consistent with the parameter requirements of the null distribution. In that case, the MLE of the parameter(s) of the null distribution need not be computed, and hence the component \code{estimate} is not present in the result.
<<>>=
set.seed(26)
vs.test(x = samp, densfun = 'dnorm', param = c(2,3))
@
If \code{param} is not consistent with the specified distribution -- e.g., the standard deviation for testing a normal distribution is missing or negative -- the execution is stopped and an error message is returned.
{
\raggedright
<<error = TRUE>>=
set.seed(2)
samp <- rnorm(50, -2, 1)
vs.test(samp, densfun = 'dnorm', param = -2)
@
}
One can estimate the p-value of the sample by Monte-Carlo simulation, even when the sample size is larger than 80, by setting the optional argument \code{simulate.p.value} to \code{TRUE} (\code{NULL} by default). The number of Monte-Carlo replicates can be fixed through the optional argument~\code{B} (default is \code{B = 5000}).
<<>>=
set.seed(1)
samp <- rweibull(200, shape = 1.05, scale = 1)
vs.test(samp, densfun = 'dexp')
@
<<>>=
set.seed(2)
vs.test(samp, densfun = 'dexp', simulate.p.value = TRUE, B = 10000)
@
Vasicek's estimates $V_{mn}$ are computed for all $m$ from $1$ to $n^{1/3-\delta}$, where $\delta < 1/3$; the test statistic is $I_{\widehat{m}n}$, where $\widehat{m}$ is the optimal window size defined in~(\ref{arteqm}). The choice of $\delta$ depends on the family of distributions of the null hypothesis: for the Weibull, Pareto, Fisher, Laplace and beta families, $\delta$ is set by default to $2/15$, while for the uniform, normal, log-normal, exponential and gamma families, it is set to $1/12$. These default settings result from extensive numerical experiments. Still, the user can choose another value through the optional argument \code{delta}.
<<>>=
set.seed(63)
vs.test(samp, densfun = 'dexp', delta = 5/30)
@
Note that upper-bounding the window size by $n^{1/3-\delta}$ is only required when the asymptotic normality of $I_{mn}$ is used to compute asymptotic p-values from~(\ref{asymptoticn}). When the p-value is computed by means of Monte-Carlo simulation, this upper bound can be extended to $n/2$ by adding \code{extend = TRUE}, which may lead to a more reliable test, as illustrated below.
<<>>=
set.seed(8)
samp <- rexp(30, rate = 3)
vs.test(x = samp, densfun = "dlnorm")
@
<<>>=
vs.test(x = samp, densfun = "dlnorm", extend = TRUE)
@
Enlarging the range of $m$ is also pertinent if ties are present in the sample. Indeed, the presence of ties is particularly inappropriate for performing VS tests, because some spacings $X_{(i+m)} - X_{(i-m)}$ can be zero. The window size $m$ thus has to be greater than the maximal number of ties in the sample. Hence, if the upper bound $n^{1/3-\delta}$ is smaller than the maximal number of ties, the test statistic cannot be computed. Setting \code{extend} to \code{TRUE} can avoid this behavior, as illustrated below.
{
\raggedright
<<error = TRUE>>=
samp <- c(samp, rep(4,3)) #add ties to the previous sample
vs.test(x = samp, densfun = "dexp")
@
}
{
\raggedright
<<>>=
vs.test(x = samp, densfun = "dexp", extend = TRUE)
@
}
Finally, Vasicek's estimate $V_{mn}$ may exceed the parametric estimate of the entropy of the null distribution for all $m$ between $1$ and $n^{1/3-\delta}$. Then, no window size satisfying~(\ref{arteqm}) exists, as illustrated below.
{
\raggedright
<<error = TRUE>>=
set.seed(84)
ech <- rpareto(20, mu = 1/2, c = 1)
vs.test(x = ech, densfun = 'dpareto', param = c(1/2, 1))
@
}
Enlarging the set of possible window sizes by setting \code{extend} to \code{TRUE} may allow some Vasicek estimate to fall below the parametric estimate, so that constraint~(\ref{PKGEqnConstraint}) can be satisfied. Note that when the p-value is computed by Monte-Carlo simulation, the constraint~(\ref{PKGEqnConstraint}) may fail to be satisfied for some replicates, whatever the window size. These replicates are then ignored and the p-value is computed from the remaining ones. A warning message is added to the output, reporting the number of ignored replicates.
{
\raggedright
<<>>=
data(contaminants)
set.seed(1)
vs.test(x = aluminium2, densfun = 'dpareto')
@
}
A large proportion of ignored replicates may indicate that the original sample is too small or that the null distribution does not fit the data.

\section{Application to real data}
\label{TutoAppliData}

The \pkg{vsgoftest} package contains environmental data originating from a guidance report edited by the Technology Support Center of the United States Environmental Protection Agency; see~\cite{singh}. According to~\cite{singh}, environmental scientists take remediation decisions at suspected sites based on organic and inorganic contaminant concentration measurements. These decisions usually derive from the computation of confidence upper bounds for contaminant concentrations, so that testing the goodness-of-fit of specified models is of prior interest. \cite{singh} also points out that contaminant concentration data from sites often appear to follow a skewed probability distribution, making the log-normal family a frequently used model. The authors illustrate their purpose by applying the Shapiro-Wilk test to the log-transformed versions of the samples \code{aluminium1}, \code{manganese}, \code{aluminium2} and \code{toluene} (stored in the present package)\footnote{A succinct description of these data is available by executing the following \proglang{R} command: \code{?contaminants}.}. The following code chunks illustrate the use and behavior of the function \code{vs.test} on these environmental data. The significance level is fixed to $0.1$, as in \cite{singh}. Note that warning messages notifying that there are ties in the samples have been dropped from the outputs.
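Before testing, the samples can be attached and briefly inspected; the chunk below is a minimal sketch, relying only on \code{data(contaminants)} (already used above) and the help page mentioned in the footnote.
<<eval = FALSE>>=
data(contaminants)  #attach the four samples
summary(aluminium1) #basic summary of the concentration measurements
?contaminants       #short description of the data
@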
<<include = FALSE>>=
knitr::opts_chunk$set(warning = FALSE)
@
<<>>=
set.seed(1)
vs.test(x = aluminium1, densfun = 'dlnorm')
@
The log-normal hypothesis is not rejected for \code{aluminium1}. Similar results are obtained for \code{manganese}. Log-normality is rejected for \code{aluminium2}.
<<>>=
set.seed(1)
vs.test(x = aluminium2, densfun = 'dlnorm')
@
Due to numerous ties in \code{toluene}, \code{vs.test} cannot compute Vasicek's entropy estimate unless \code{extend} is set to \code{TRUE}. Still, \code{vs.test} notifies that the constraint~(\ref{PKGEqnConstraint}) is violated for all window sizes, which suggests that the data are not likely to be drawn from a log-normal distribution. Setting the optional argument \code{relax} to \code{TRUE}, which drops constraint~(\ref{PKGEqnConstraint}), yields the following result.
<<>>=
set.seed(1)
vs.test(x = toluene, densfun = 'dlnorm', extend = TRUE, relax = TRUE)
@
Again, this last result looks spurious, because the test statistic is negative -- a consequence of (\ref{PKGEqnConstraint}) not being satisfied when \code{relax = TRUE}. An alternative is to test normality of the log-transformed sample, as follows.
<<>>=
set.seed(1)
vs.test(x = log(toluene), densfun = 'dnorm', extend = TRUE)
@
To summarize, the log-normal hypothesis is not rejected for \code{aluminium1} and \code{manganese}, while it is rejected for \code{aluminium2} and \code{toluene}. These results are consistent with those obtained by \cite{singh}. Further, the goodness-of-fit of the Pareto family is tested for \code{aluminium2} and \code{toluene}; log-normal and Pareto distributions are indeed usual competitors, with closely related generating processes and tail properties that are hard to distinguish. Goodness-of-fit of the Pareto distribution is rejected for \code{aluminium2}.
<<>>=
set.seed(1)
vs.test(x = aluminium2, densfun = 'dpareto')
@
Applying \code{vs.test} to \code{toluene} with default settings yields no result, because of the numerous ties and the violation of~(\ref{PKGEqnConstraint}). Instead, uniformity of the sample transformed by the cumulative distribution function of the fitted Pareto distribution can be tested, as follows. Goodness-of-fit of the Pareto distribution is not rejected for \code{toluene}.
<<>>=
#Compute the MLE of the parameters of the Pareto dist.
res.test <- vs.test(x = toluene, densfun = 'dpareto', extend = TRUE, relax = TRUE)
#Test uniformity of the transformed data
set.seed(5)
vs.test(x = ppareto(toluene, mu = res.test$estimate[1], c = res.test$estimate[2]),
        densfun = 'dunif', param = c(0,1), extend = TRUE)
@

\appendix
\renewcommand{\thesection}{A}
\section{Appendix: Vasicek-Song tests, theoretical background}
\label{SecTheorBackgr}

\cite{song} proposes a goodness-of-fit test based on Kullback-Leibler divergence, for either simple (\ref{arthyps}) or composite (\ref{arthyp}) null hypotheses.
Precisely, the test statistic $I_{mn}$ is an estimator of the Kullback-Leibler divergence
$$\KK(P|P_0(\theta)) = -\SSS(P) - \int_\R \log p_0(x;\theta)\, p(x) \der x$$
of the sampled distribution $P$ with respect to the null distribution~$P_0(\theta)$ (with respective densities $p$ and $p_0(.;\theta)$) in the case of a simple null hypothesis, or with respect to some estimate $P_0(\widehat{\theta}_n)$ of it otherwise:
\begin{equation*}% \label{artqimn}
I_{mn} := -V_{mn}-\frac{1}{n} \sum_{i=1}^{n} \log p_{0}(X_{i};{\widehat{\theta}}_{n}),
\end{equation*}
where
\begin{equation}
\label{MEGOFPEqnVasiEst}
V_{mn} :=\frac{1}{n} \sum_{i=1}^{n} \log \left( \frac{n}{2m} \left[ X_{(i+m)}-X_{(i-m)} \right] \right)
\end{equation}
estimates $\SSS(P)$, while $- \frac{1}{n}\sum_{i=1}^{n} \log p_{0}(X_{i};{\widehat{\theta}}_{n})$ estimates $-\int_{\R} \log p_0(x;\theta)\, p(x) \der x$. For the test~(\ref{arthyps}) with simple null hypothesis, set $\widehat{\theta}_n = \theta$, where $\theta$ is the null parameter. Otherwise, $\widehat{\theta}_n$ is the maximum likelihood estimator (MLE) of $\theta$, i.e., it satisfies
$$\frac{1}{n}\sum_{i=1}^n \log p_{0}(X_{i};{\widehat{\theta}}_{n}) = \max_{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^{n} \log p_{0}(X_{i};\theta).$$
\cite{song} establishes the asymptotic behavior of $I_{mn}$, independently of the null hypothesis. Precisely, $I_{mn}$ is consistent and asymptotically normally distributed, provided the distribution of the sample belongs to the following class of distributions:
\begin{equation}
\cF=\left\{ P\in\mathcal{D}: \sup_{x:\; 0<F(x)<1} F(x)\left[1-F(x)\right]\frac{|p'(x)|}{p^{2}(x)} \leq \gamma \right\}
\end{equation}
for some $\gamma>0$, where $F$ is the cumulative distribution function of $P$, with density $p$ whose derivative is ${p'}$ (almost everywhere). The class $\cF$ contains the most classical distributions, such as the uniform ($\gamma=0$), normal, exponential and gamma ($\gamma=1$), Fisher ($\gamma=(2+\nu_2)/\nu_2$, where $\nu_2$ is the second degrees-of-freedom parameter) and Pareto ($\gamma=(\mu+1)/\mu$, where $\mu$ is the shape parameter) distributions. If $\cP_0(\Theta)\subset\cF$, and if
\begin{equation}
\label{PKGEqnConditionsWindow}
m/\log n \xrightarrow[n \rightarrow \infty]{} 0 \quad \textrm{and} \quad m(\log n)^{2/3}/n^{1/3} \xrightarrow[n \rightarrow \infty]{} 0,
\end{equation}
then
\begin{equation}
\label{asymptoticn}
\sqrt{6mn}[I_{mn}-\log(2m)+\psi(2m)] \xrightarrow{\cal D} \mathcal{N}(0,1),
\end{equation}
where $\psi$ denotes the digamma function. The asymptotic bias $\log (2m)-\psi(2m)$ of $I_{mn}$ is that of $-V_{mn}$. \cite{song} points out that $I_{mn}$ may have an additional substantial bias for small samples and suggests the following bias correction in the asymptotic distribution~(\ref{asymptoticn}), from which a decision rule can be consistently derived for moderate and large sample sizes:
\begin{equation}
\label{asymptoticn2}
\sqrt{6mn}\left[ I_{mn}-b_{mn} \right] \xrightarrow{\cal D} \mathcal{N}(0,1),
\end{equation}
where
$$b_{mn} = \log(2m) - \log(n) -\psi(2m) + \psi(n+1) +\frac{2m}{n} R_{2m-1} - \frac{2}{n} \sum_{i=1}^m R_{i+m-2},$$
with $R_{m} = \sum_{j=1}^m 1/j$. Through (\ref{asymptoticn2}), an asymptotic p-value for the related VS test is derived, given by
\begin{equation}
\label{asymppvalue}
p=1-\Phi\left( \sqrt{6mn}\left[I_{mn}(x_1^n)-b_{mn}\right]\right),
\end{equation}
where $I_{mn} (x_1^n)$ denotes the value of the statistic $I_{mn}$ for the observations $x_1^n = (x_1, \dots, x_n)$ and $\Phi$ denotes the cumulative distribution function of the standard normal distribution. According to~\cite{song}, the asymptotic p-value (\ref{asymppvalue}) provides accurate results when the sample size $n$ is at least 80.
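For concreteness, the bias term $b_{mn}$ and the asymptotic p-value~(\ref{asymppvalue}) translate into a few lines of \proglang{R}. The following chunk is a plain transcription of these formulas for illustration purposes only; the function names \code{R.harm}, \code{b.mn} and \code{asymp.p.value} are ours and are not part of the package API.
<<>>=
#Partial harmonic sum R_m = sum_{j=1}^m 1/j, with the empty sum R_0 = 0
R.harm <- function(m) if (m == 0) 0 else sum(1/(1:m))
#Bias term b_mn of Song's corrected asymptotic distribution
b.mn <- function(m, n) {
  log(2*m) - log(n) - digamma(2*m) + digamma(n + 1) +
    (2*m/n) * R.harm(2*m - 1) -
    (2/n) * sum(sapply(1:m, function(i) R.harm(i + m - 2)))
}
#Asymptotic p-value for an observed value I.mn of the test statistic
asymp.p.value <- function(I.mn, m, n) {
  1 - pnorm(sqrt(6*m*n) * (I.mn - b.mn(m, n)))
}
asymp.p.value(I.mn = 0.05, m = 4, n = 100) #arbitrary illustrative values
@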
For small sample sizes, Monte-Carlo simulation may be preferred for computing p-values, as follows. A large number $N$ of replicates of the sample $X_1^n$ are drawn from the distribution $P_{0}(\widehat{\theta}_{n})$ (or $P_{0}(\theta)$ in the case of a simple null hypothesis). The test statistic $I_{mn}^{i}$ is computed for each replicate $i$, $1\leq i\leq N$. The p-value is then given by the empirical proportion $\frac{1}{N}\sum_{i=1}^{N}\id_{\{I_{mn}^{i}>I_{mn}(x_1^n)\}}.$

For choosing $m$, \cite{song} proposes to minimize $I_{mn}$ -- that is, to maximize $V_{mn}$ -- with respect to~$m$, yielding the most conservative test. The author also notes that the values of $m$ for which $I_{mn}$ is negative have to be excluded; indeed, such negative values of $I_{mn}$ constitute poor estimates of the non-negative divergence $\KK(P |P_0(\theta))$. Hence, $m$ has to be chosen subject to the constraint
\begin{equation}
\label{PKGEqnConstraint}
V_{mn} \leq - \frac{1}{n} \sum_{i=1}^n \log p_{0}(X_{i}; \widehat{\theta}_n).
\end{equation}
Finally, the window size selected by Song -- say the optimal window size -- is
\begin{equation}\label{arteqm}
\widehat{m}= \min \left\{ m^* \in \arg\!\!\max_{m\in\N^*} \left\{ V_{mn} : \; V_{mn}\leq -\frac{1}{n} \sum_{i=1}^{n} \log p_{0}(X_{i}; \widehat{\theta}_{n}) \right\} : \; 1 \leq m^{*} < \lfloor n^{1/3-\delta} \rfloor \right\},
\end{equation}
for some $\delta \in \R$ such that $1/3-\delta>0$, and the VS test statistic is
\begin{equation}
\label{PKGEqnTestStat}
I_{\widehat{m}n} = - V_{\widehat{m}n} - \frac{1}{n} \sum_{i=1}^n \log p_0(X_i;\widehat{\theta}_n).
\end{equation}
The upper bound $n^{1/3 - \delta}$ for the window size $m$ is chosen so that conditions~(\ref{PKGEqnConditionsWindow}) are fulfilled and the asymptotic normality~(\ref{asymptoticn}) holds. No systematic optimal choice for $\delta$ exists; it may depend on the family of distributions to which GOF is tested.

\begin{thebibliography}{4}
\bibitem{GirardinLequesne} Girardin V, Lequesne J (2017). \textit{Entropy-based goodness-of-fit test, a unifying framework. Application to DNA replication.} Communications in Statistics -- Theory and Methods. doi:10.1080/03610926.2017.1401084
\bibitem{LequesneRegnault} Lequesne J, Regnault P (2018). \textit{vsgoftest: An R Package for Goodness-of-Fit Testing Based on Kullback-Leibler Divergence.} https://hal.archives-ouvertes.fr/hal-01816063
\bibitem{singh} Singh AK, Singh A, Engelhardt M (1997). \textit{The lognormal distribution in environmental applications.} Technology Support Center Issue Paper, United States Environmental Protection Agency.
\bibitem{song} Song KS (2002). \textit{Goodness-of-fit tests based on Kullback-Leibler discrimination information.} IEEE Transactions on Information Theory, \textbf{48}(5), 1103-1117.
\end{thebibliography}

\end{document}