\documentclass[nojss]{jss}

%\VignetteIndexEntry{tscount: An R Package for Analysis of Count Time Series Following Generalized Linear Models}
%\VignetteDepends{xtable}
%\VignettePackage{tscount}
%\VignetteEngine{utils::Sweave}

\usepackage[latin9]{inputenc}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{bbm}
\usepackage{mathabx}
\usepackage{microtype}
\usepackage{booktabs}
\usepackage{setspace}
\usepackage{enumitem}
\usepackage{Sweave}

%% specify the standard width of figures
\newcommand*{\figurewidth}{0.9\textwidth}

\clubpenalty=10000
\widowpenalty=10000
\displaywidowpenalty=10000

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% declarations for jss.cls %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%% almost as usual
\author{Tobias Liboschik\\TU Dortmund University \And
				Konstantinos Fokianos\\University of Cyprus\And 
        Roland Fried\\TU Dortmund University}
\title{\pkg{tscount}: An \proglang{R} Package for Analysis of Count Time Series Following Generalized Linear Models}

%% for pretty printing and a nice hypersummary also set:
\Plainauthor{Tobias Liboschik, Konstantinos Fokianos, Roland Fried} %% comma-separated
\Plaintitle{tscount: An R Package for Analysis of Count Time Series Following Generalized Linear Models} %% without formatting
\Shorttitle{\pkg{tscount}: An \proglang{R} Package for Analysis of Count Time Series Following GLMs} %% a short title (if necessary)

%% an abstract and keywords
\Abstract{The \proglang{R} package \pkg{tscount} provides likelihood-based estimation methods for analysis and modeling of count time series following generalized linear models. This is a flexible class of models which can describe serial correlation in a parsimonious way. The conditional mean of the process is linked to its past values, to past observations and to potential covariate effects. The package allows for models with the identity and with the logarithmic link function. The conditional distribution can be Poisson or Negative Binomial. An important special case of this class is the so-called INGARCH model and its log-linear extension.
The package includes methods for model fitting and assessment, prediction and intervention analysis.
This paper summarizes the theoretical background of these methods. It gives details on the implementation of the package and provides simulation results for models which have not been studied theoretically before. The usage of the package is illustrated by two data examples.
Additionally, we provide a review of \proglang{R} packages which can be used for count time series analysis. This includes a detailed comparison of \pkg{tscount} to those packages.
}
\Keywords{aberration detection, autoregressive models, intervention analysis, likelihood, mixed Poisson, model selection, prediction, \proglang{R}, regression model, serial correlation}
\Plainkeywords{aberration detection, autoregressive models, intervention analysis, likelihood, mixed Poisson, model selection, prediction, R, regression model, serial correlation} %% without formatting
%% at least one keyword must be supplied

%% publication information
%% NOTE: Typically, this can be left commented and will be filled out by the technical editor
%% \Volume{50}
%% \Issue{9}
%% \Month{June}
%% \Year{2012}
%% \Submitdate{2012-06-04}
%% \Acceptdate{2012-06-04}

%% The address of (at least) one author should be given
%% in the following format:
\Address{
  Tobias Liboschik\\
  Department of Statistics\\
  TU Dortmund University\\
  44221 Dortmund, Germany\\
  E-mail: \email{liboschik@statistik.tu-dortmund.de}\\
  URL: \url{http://www.statistik.tu-dortmund.de/liboschik-en.html}\\[1em]
  Konstantinos Fokianos\\
  Department of Mathematics \& Statistics\\
  University of Cyprus\\
  Nicosia 1678, Cyprus\\
  E-mail: \email{fokianos@ucy.ac.cy}\\
  URL: \url{http://www.mas.ucy.ac.cy/~fokianos}\\[1em]
  Roland Fried \emph{(Corresponding author)}\\
  Department of Statistics\\
  TU Dortmund University\\
  44221 Dortmund, Germany\\
  E-mail: \email{fried@statistik.tu-dortmund.de}\\
  URL: \url{http://www.statistik.tu-dortmund.de/fried-en.html}
}

%% for those who use Sweave please include the following line (with % symbols):
%% need no \usepackage{Sweave.sty}

%% end of declarations %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


\begin{document}
\SweaveOpts{concordance=FALSE, prefix.string=tscount, keep.source=TRUE}
<<options, echo=FALSE>>=
options(prompt="R> ", continue="+  ", width=76, useFancyQuotes=FALSE, digits=4)
@
%% include your article here, just as usual
%% Note that you should use the \pkg{}, \proglang{} and \code{} commands.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%Options for working on the manuscript, remove this before submission!
%\large %increased text size
%\doublespacing %increased line spread
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\setlength{\fboxsep}{2mm}
\newlength{\parboxwidth}
\setlength{\parboxwidth}{\textwidth}
\advance\parboxwidth by -2\fboxsep
\fbox{\parbox{\parboxwidth}{
\small
This document is published as a vignette of the \proglang{R} package \pkg{tscount} with minor updates since its publication in \emph{Journal of Statistical Software}. Its first version has been published as a discussion paper in February 2015 (\doi{10.17877/DE290R-7239}). Please cite this manuscript as:\\[0.5em]
Liboschik, T., Fokianos, K. and Fried, R. (2017). tscount: An R package for analysis of count time series following generalized linear models. \emph{Journal of Statistical Software} \textbf{82(5)}, 1--51, \doi{10.18637/jss.v082.i05}.
}}


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Introduction}

Recently, there has been an increasing interest in regression models for time series of counts and a considerable number of publications on this subject has appeared in the literature. However, most of the proposed methods are not yet available in a statistical software package and hence they cannot be applied easily. We aim at filling this gap and publish a package, named \pkg{tscount}, for the popular free and open source software environment \proglang{R} \citep{r_core_team_r_2016}. In fact, our main goal is to develop software for models whose conditional mean depends on previous observations and on its own previous values. These models are quite analogous to the generalized autoregressive conditional heteroscedasticity (GARCH) models \citep{bollerslev_generalized_1986} which were proposed for describing the conditional variance.

%%Why are count data time series relevant?
Count time series appear naturally in various areas whenever a number of events per time period is observed over time. Examples showing the wide range of applications are
%the monthly number of cases of some disease, from epidemiology,
%the monthly number of patients recruited for a clinical trial,
the daily number of hospital admissions, from public health,
the number of stock market transactions per minute, from finance
or the hourly number of defect items, from industrial quality control.
%or the number of photon arrivals per microsecond measured in a biological experiment.

Models for count time series should take into account that the observations are nonnegative integers and they should capture suitably the dependence among observations.
A convenient and flexible approach is to employ the generalized linear model (GLM) methodology \citep{nelder_generalized_1972} for modeling the observations conditionally on the past information. This methodology is implemented by choosing a suitable distribution for count data and an appropriate link function. Such an approach is pursued by \citet[Chapter 6]{fahrmeir_multivariate_2001} and \citet[Chapters 1--4]{kedem_regression_2002}, among others.
Another important class of models for time series of counts is based on the thinning operator, like the integer autoregressive moving average (INARMA) models, which, in a way, imitate the structure of the common autoregressive moving average (ARMA) models \citep[see the review article by][]{weis_thinning_2008}.
A different type of count time series models are the so-called state space models. We refer to the reviews of \citet{fokianos_recent_2011}, \citet{jung_useful_2011}, \citet{fokianos_count_2012}, \citet{tjostheim_recent_2012} and \citet{fokianos_statistical_2015} for an in-depth overview of models for count time series.
Advantages of GLM-based models compared to the models which are based on the thinning operator are the following:
\begin{enumerate}[nosep]
\item[(a)] They can describe covariate effects and negative correlations in a straightforward way.
\item[(b)] There is a rich toolkit available for this class of models.
\end{enumerate}
State space models allow to describe even more flexible data generating processes than GLM models but at the cost of a more complicated model specification. On the other hand, GLM-based models yield predictions in a convenient matter due to their explicit formulation.

In the first version of the \pkg{tscount} package we provide likelihood-based methods for the framework of count time series following GLMs.
Some simple autoregressive models can be fitted with standard software by treating the observations as if they were independent (see Section~\ref{sec:comparison} and Appendix~\ref{app:startestimation}), for example, using the \proglang{R} function \code{glm}. However, these procedures are in general not tailored for dependent data and may yield invalid model fits.
The implementation in the package \pkg{tscount} allows for a more general dependence structure which is specified conveniently by the user. We consider general time series models whose conditional mean may depend on time-varying covariates, previous observations and, similar to the conditional variance of a GARCH model, on its own previous values.
The usage and output of our functions is in parts inspired by the \proglang{R} functions \code{arima} and \code{glm} in order to provide a familiar user experience. Furthermore \pkg{tscount} is object-oriented and provides many standard \code{S3} methods for well-known generic functions.
There are several other \proglang{R} functions available which can be employed for analyzing count time series. Many of those are related to GLMs and have been developed for independent observations but are, with some limitations, also capable to describe simple forms of serial dependence. There are also some functions available for extending such models to time series. Another group of functions handles state space models for count time series. We briefly review these functions and the corresponding model classes in Section~\ref{sec:comparison} and compare them to \pkg{tscount}. As it turns out, there are special cases for which our model corresponds to existing ones. In these cases we obtain quite similar results with functions from some other packages, thus confirming the reliability of our package. However, many features of \pkg{tscount}, like the flexible dependence structure, outreach the capability of other packages. Admittedly, some packages provide features like zero-inflation or more general forms of the linear predictor which cannot be accommodated yet by \pkg{tscount} but could possibly be included in future versions. As a conclusion, this package is a valuable addition to the \proglang{R} environment which fills some significant gaps associated with time series fitting.

The functionality of \pkg{tscount} partly goes beyond the theory available in the literature since theoretical investigation of these models is still an ongoing research theme. For instance the problem of accommodating covariates in such GLM-type count time series models or fitting a mixed Poisson log-linear model have not been studied theoretically. We have checked their appropriateness by simulations reported in Appendix~\ref{app:simulations}. However, some care should be taken when applying the package's programs to situations which are not covered by existing theory.

This paper is organized as follows.
At first the theoretical background of the methods included in the package is briefly summarized with references to the literature for more details. Section~\ref{sec:models} introduces the models we consider. Section~\ref{sec:inference} describes quasi maximum likelihood estimation of the unknown model parameters and gives some details regarding its implementation. Section~\ref{sec:prediction} treats prediction with such models. Section~\ref{sec:modelassessment} sums up tools for model assessment. Section~\ref{sec:interventions} discusses procedures for the detection of interventions.
Section~\ref{sec:usage} demonstrates the usage of the package with two data examples.
Section~\ref{sec:comparison} reviews other \proglang{R} packages which are capable to model count time series and compares them with our package.
Finally, Section~\ref{sec:outlook} gives an outlook on possible future extensions of our package. In the Appendix we give further details and we confirm empirically some of the new methods that we discuss but which have not been studied, as of yet.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Models}
\label{sec:models}

Denote a count time series by $\{Y_t:t\in\mathbb{N}\}$. We will denote by $\{\boldsymbol{X}_t:t\in\mathbb{N}\}$ a time-varying $r$-dimensional covariate vector, say $\boldsymbol{X}_t=(X_{t,1},\dots,X_{t,r})^\top$. We model the conditional mean $\E\left(Y_{t}|{\cal F}_{t-1}\right)$ of the count time series by a process, say $\{\lambda_t:t\in\mathbb{N}\}$, such that $\E\left(Y_{t}|{\cal F}_{t-1}\right)=\lambda_t$.
Denote by ${\cal F}_t$ the history of the joint process $\{Y_t,\lambda_t,\boldsymbol{X}_{t+1}:t\in\mathbb{N}\}$ up to time $t$ including the covariate information at time $t+1$.
The distributional assumption for $Y_t$ given ${\cal F}_{t-1}$ is discussed later.
We are interested in models of the general form
\begin{align}
g(\lambda_t) = \beta_0
    + \sum_{k=1}^p \beta_k \,\widetilde{g}(Y_{t-i_k})
    + \sum_{\ell=1}^q \alpha_{\ell} g(\lambda_{t-j_{\ell}})
    + \boldsymbol{\eta}^\top \boldsymbol{X}_t,
\label{eq:linpred}
\end{align}
where $g:\mathbb{R}^+\rightarrow\mathbb{R}$ is a link function and $\widetilde{g}:\mathbb{N}_0\rightarrow\mathbb{R}$ is a transformation function.
The parameter vector $\boldsymbol{\eta}=(\eta_1,\dots,\eta_r)^\top$ corresponds to the effects of covariates.
In the terminology of GLMs we call $\nu_t=g(\lambda_t)$ the linear predictor.
To allow for regression on arbitrary past observations of the response, define a set $P=\{i_1,i_2,\dots,i_p\}$ and integers $0<i_1<i_2\ldots<i_p<\infty$, with $p\in\mathbb{N}_0$. This enables us to regress on the lagged observations $Y_{t-i_1},Y_{t-i_2},\dots,Y_{t-i_p}$. Analogously, define a set $Q=\{j_1,j_2,\dots,j_q\}$, $q\in\mathbb{N}_0$ and integers $0<j_1<j_2\ldots<j_q<\infty$, for regression on lagged conditional means $\lambda_{t-j_1},\lambda_{t-j_2},\dots,\lambda_{t-j_q}$.
This case is covered by the theory for models with $P=\{1,\dots,p\}$ and $Q=\{1,\dots,q\}$ by choosing $p$ and $q$ suitably and setting some model parameters to zero. Our formulation is useful particularly when dealing with modeling stochastic seasonality (see Section~\ref{sec:campy}, for an example).
Specification of the model order, i.e., of the sets $P$ and $Q$, are guided by considering the empirical autocorrelation functions of the observed data. This approach is described for ARMA models in many time series analysis textbooks and transfers to the above model by employing its ARMA representation (see \eqref{eq:armarepr} in Appendix~\ref{app:startestimation}).
Parameter constraints which ensure stationarity and ergodicity of two important special cases of \eqref{eq:linpred} are given in Section~\ref{sec:inference}.

We give several examples of model~\eqref{eq:linpred}. Consider the situation where $g$ and $\widetilde{g}$ equal the identity, i.e., $g(x)=\widetilde{g}(x)=x$. Furthermore, let $P=\{1,\dots,p\}$, $Q=\{1,\dots,q\}$ and $\boldsymbol{\eta}=\boldsymbol{0}$. Then model \eqref{eq:linpred} becomes
\begin{align}
\lambda_t = \beta_0
    + \sum_{k=1}^p \beta_k \,Y_{t-k}
    + \sum_{\ell=1}^q \alpha_{\ell} \lambda_{t-{\ell}}. \label{eq:linear}
\end{align}
Assuming further that $Y_t$ given the past is Poisson distributed, then we obtain an \emph{integer-valued GARCH model} of order $p$ and $q$, abbreviated as INGARCH($p$,$q$). These models are also known as \emph{autoregressive conditional Poisson (ACP) models}. They have been discussed by \citet{heinen_modelling_2003}, \citet{ferland_integer-valued_2006} and \citet{fokianos_poisson_2009}, among others.
When $\boldsymbol{\eta}\neq\boldsymbol{0}$, then our package fits INGARCH models with nonnegative covariates; this is so because we need to ensure that the resulting mean process is positive. An example of an INGARCH model with covariates is given in Section~\ref{sec:interventions}, where we fit a count time series model which includes intervention effects.

Consider again model~\eqref{eq:linpred} but now with the logarithmic link function $g(x)=\log(x)$, $\widetilde{g}(x)=\log(x+1)$ and $P$, $Q$ as before. Then, we obtain a \emph{log-linear model} of order $p$ and $q$ for the analysis of count time series. Indeed, set $\nu_t=\log(\lambda_t)$ to obtain from \eqref{eq:linpred} that
\begin{align}
\nu_t = \beta_0
    + \sum_{k=1}^p \beta_k \,\log(Y_{t-k}+1)
    + \sum_{\ell=1}^q \alpha_{\ell} \nu_{t-{\ell}}. \label{eq:loglin}
\end{align}
This log-linear model is studied by \citet{fokianos_log-linear_2011}, \citet{woodard_stationarity_2011} and \citet{douc_ergodicity_2013}. We follow \citet{fokianos_log-linear_2011} in transforming past observations by employing the function $\widetilde{g}(x)=\log(x+1)$, such that they are on the same scale as the linear predictor $\nu_t$. These authors show that the addition of a constant $c$ to each observation for avoiding zero values does not affect inference; in addition they argue that a reasonable choice for $c$ is 1.
Note that model~\eqref{eq:loglin} allows modeling of negative serial correlation, whereas \eqref{eq:linear} accommodates positive serial correlation only.
Additionally, \eqref{eq:loglin} accommodates covariates easier than \eqref{eq:linear} since the log-linear model implies positivity of the conditional mean process $\{\lambda_t\}$. The linear model \eqref{eq:linear} with covariates should be fitted with some care because it is limited to positive effects on $\{\lambda_t\}$. The effects of covariates on the response is multiplicative for model~\eqref{eq:loglin}; it is additive though for model~\eqref{eq:linear}. For a discussion on the inclusion of time-dependent covariates see \citet[Section 4.3]{fokianos_log-linear_2011}.

Model~\eqref{eq:linpred} together with the \emph{Poisson} assumption, i.e., $Y_{t}|{\cal F}_{t-1} \sim \mathrm{Poisson}(\lambda_t)$, implies that 
\begin{align}
\Prob\left(Y_{t}=y|{\cal F}_{t-1}\right) = \frac{\lambda_t^y\exp(-\lambda_t)}{y!}, \quad y=0,1,\dots. 
\label{eq:poisson}
\end{align}
It holds $\VAR\left(Y_{t}|{\cal F}_{t-1}\right)=\E\left(Y_{t}|{\cal F}_{t-1}\right)=\lambda_t$. Hence in the case of a conditional Poisson response model the conditional mean is identical to the conditional variance of the observed process.

The \emph{Negative Binomial} distribution allows for a conditional variance to be larger than the mean $\lambda_t$, which is often referred to as overdispersion. Following \citet{christou_quasi-likelihood_2014}, it is assumed that
$Y_{t}|{\cal F}_{t-1} \sim \mathrm{NegBin}(\lambda_t,\phi)$, where the Negative Binomial distribution is parametrized in terms of its mean with an additional dispersion parameter $\phi\in(0,\infty)$, i.e.,
\begin{align}
\Prob\left(Y_{t}=y|{\cal F}_{t-1}\right) = \frac{\Gamma(\phi+y)}{\Gamma(y+1)\Gamma(\phi)} \left(\frac{\phi}{\phi +\lambda_t}\right)^{\phi} \left(\frac{\lambda_t}{\phi +\lambda_t}\right)^y, \quad y=0,1,\dots.
\label{eq:nbinom} 
\end{align}
In this case, $\VAR\left(Y_{t}|{\cal F}_{t-1}\right)=\lambda_t+\lambda_t^2/\phi$, i.e., the conditional variance increases quadratically with $\lambda_t$. The Poisson distribution is a limiting case of the Negative Binomial when $\phi\to\infty$. 

Note that the Negative Binomial distribution belongs to the class of mixed Poisson processes.
A mixed Poisson process is specified by setting $Y_t=N_t(0,Z_t\lambda_t]$, where $\{N_t\}$ are i.i.d. Poisson processes with unit intensity and $\{Z_t\}$ are i.i.d. random variables with mean 1 and variance $\sigma^2$, independent of $\{Y_t\}$ \citep{christou_quasi-likelihood_2014}.
When $\{Z_t\}$ is an i.i.d. process of Gamma random variables, then we obtain the Negative Binomial process with $\sigma^2=1/\phi$.
We refer to $\sigma^2$ as the overdispersion coefficient because it is proportional to the extent of overdispersion of the conditional distribution. The limiting case of $\sigma^2=0$ corresponds to the Poisson distribution, i.e., no overdispersion.
The estimation procedure we study is not confined to the Negative Binomial case but to any mixed Poisson distribution. However, the Negative Binomial assumption is required for prediction intervals and model assessment; these topics are discussed in Sections~\ref{sec:prediction} and \ref{sec:modelassessment}.

In model~\eqref{eq:linpred} the effect of a covariate fully enters the dynamics of the process and propagates to future observations both by the regression on past observations and by the regression on past conditional means. The effect of such covariates can be seen as an internal influence on the data-generating process, which is why we refer to it as an \emph{internal} covariate effect. We also allow to include covariates in a way that their effect only propagates to future observations by the regression on past observations but not directly by the regression on past conditional means. Following \citet{liboschik_modelling_2016}, who make this distinction for the case of intervention effects described by deterministic covariates, we refer to the effect of such covariates as an \emph{external} covariate effect. 
Let $\boldsymbol{e}=(e_1,\dots,e_r)^\top$ be a vector specified by the user with $e_i=1$ if the $i$-th component of the covariate vector has an external effect and $e_i=0$ otherwise, $i=1,\dots,r$. Denote by $\text{diag}(\boldsymbol{e})$ a diagonal matrix with  diagonal elements given by $\boldsymbol{e}$.
The generalization of \eqref{eq:linpred} allowing for both internal and external covariate effects is given by
\begin{align}
g(\lambda_t) = \beta_0
    + \sum_{k=1}^p \beta_k \,\widetilde{g}(Y_{t-i_k})
    + \sum_{\ell=1}^q \alpha_{\ell} \left( g(\lambda_{t-j_{\ell}}) - \boldsymbol{\eta}^\top \text{diag}(\boldsymbol{e}) \boldsymbol{X}_{t-j_{\ell}} \right)
    + \boldsymbol{\eta}^\top \boldsymbol{X}_t.
\label{eq:linpredextint}
\end{align}
Basically, the effect of all covariates with an external effect is subtracted in the feedback terms such that their effect enters the dynamics of the process only via the observations. We refer to \citet{liboschik_modelling_2016} for an extensive discussion and comparison of internal and external effects. It is our experience with these models that an empirical discrimination between internal and external covariate effects is difficult and that it is not crucial which type of covariate effect is chosen for applications.


\newpage
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Estimation and inference}
\label{sec:inference}

The \pkg{tscount} package fits models of the form \eqref{eq:linpred} by quasi conditional maximum likelihood (ML) estimation (function \code{tsglm}). If the Poisson assumption holds true, then we obtain an ordinary ML estimator. However, under the mixed Poisson assumption we obtain a quasi-ML estimator.
Denote by $\boldsymbol{\theta}=(\beta_0,\beta_1,\dots,\beta_p,\alpha_1,\dots,\alpha_q,\eta_1,\dots,\eta_r)^\top$ the vector of regression parameters.
Regardless of the distributional assumption, the parameter space for the INGARCH model~\eqref{eq:linear} with covariates is given by
\begin{align*}
\Theta = \Biggl\{
{\boldsymbol{\theta}\in\mathbb{R}^{p+q+r+1}}:\ 
{\beta_0>0,\ \beta_1,\dots,\beta_p, \alpha_1,\dots,\alpha_q, \eta_1,\dots,\eta_r \geq 0,\ 
\sum\limits_{k=1}^p \beta_k + \sum\limits_{\ell=1}^q \alpha_{\ell} < 1}
\Biggr\}.
\end{align*}
The intercept $\beta_0$ must be positive and all other parameters must be nonnegative to ensure positivity of the conditional mean $\lambda_t$. The other condition ensures that the fitted model has a stationary and ergodic solution with moments of any order \citep{ferland_integer-valued_2006, fokianos_poisson_2009, doukhan_weak_2012}; see also \citet{tjostheim_count_2015} for a recent review.
For the log-linear model~\eqref{eq:loglin} with covariates the parameter space is taken to be
\begin{align*}
\Theta = \Biggl\{
{\boldsymbol{\theta}\in\mathbb{R}^{p+q+r+1}}:\ 
{|\beta_1|,\dots,|\beta_p|, |\alpha_1|,\dots,|\alpha_q|<1,\ 
\left|\sum\limits_{k=1}^p \beta_k + \sum\limits_{\ell=1}^q \alpha_{\ell} \right| < 1}
\Biggr\},
\end{align*}
see Appendix~\ref{app:paramspace} for a discussion.
\citet{christou_quasi-likelihood_2014} point out that with the parametrization \eqref{eq:nbinom} of the Negative Binomial distribution the estimation of the regression parameters $\boldsymbol{\theta}$ does not depend on the additional dispersion parameter $\phi$.
This allows to employ a quasi maximum likelihood approach based on the Poisson likelihood to estimate the regression parameters $\boldsymbol{\theta}$, which is described below. The nuisance parameter $\phi$ is then estimated separately in a second step.
This approach is different from a full maximum likelihood estimation based on the Negative Binomial distribution, which for example has been implemented in the function \code{glm.nb} in the \proglang{R} package \pkg{MASS} \citep{venables_modern_2002}. In that algorithm, maximization of the Negative Binomial likelihood for an estimated dispersion parameter $\phi$ and estimation of $\phi$ given the estimated regression parameters $\boldsymbol{\theta}$ are iterated until convergence.
The quasi negative binomial approach has been chosen for simplicity and its usefulness on deriving consistent estimators when the model for $\lambda_t$ has been correctly specified \citep[see also][]{ahmad_poisson_2016}.

The log-likelihood, score vector and information matrix are derived conditionally on pre-sample values of the time series and the conditional mean process $\{\lambda_t\}$, precisely on ${\cal F}_0$. An appropriate initialization is needed for their evaluation, which is discussed in the next subsection.
For a vector of observations $\boldsymbol{y}=(y_1,\dots,y_n)^\top$, the conditional quasi log-likelihood function,
%$\ell:\Theta\to\mathbb{R}$
up to a constant, is given by
\begin{align}
\ell(\boldsymbol{\theta})
= \sum_{t=1}^n \log p_t(y_t;\boldsymbol{\theta})
= \sum_{t=1}^n \Bigl( y_t \ln(\lambda_t(\boldsymbol{\theta}))-\lambda_t(\boldsymbol{\theta}) \Bigr),
\label{eq:loglik}
\end{align}
where $p_t(y;\boldsymbol{\theta})=\Prob(Y_{t}=y|{\cal F}_{t-1})$ is the probability density function of a Poisson distribution as defined in \eqref{eq:poisson}.
The conditional mean is regarded as a function $\lambda_t:\Theta\to\mathbb{R}^+$ and thus it is denoted by $\lambda_t(\boldsymbol{\theta})$ for all $t$.
The conditional score function
%$S_{n\tau}:\Theta\to\mathbb{R}^{p+q+2}$
is the $(p+q+r+1)$-dimensional vector given by
\begin{align}
S_{n}(\boldsymbol{\theta})
= \frac{\partial \ell(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}}
= \sum_{t=1}^n \left( \frac{y_t}{\lambda_t(\boldsymbol{\theta})} - 1 \right) \frac{\partial \lambda_t(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}}.
\label{eq:score} 
\end{align}
The vector of partial derivatives $\partial \lambda_t(\boldsymbol{\theta}) / \partial \boldsymbol{\theta}$ can be computed recursively by the recursions given in Appendix~\ref{app:recursions}.
Finally, the conditional information matrix is given by
\begin{align*}
G_n(\boldsymbol{\theta}; \sigma^2)
&= \sum_{t=1}^n \COV\left(\left.\frac{\partial \ell(\boldsymbol{\theta}; Y_t)}{\partial \boldsymbol{\theta}}\right|{\cal F}_{t-1}\right)
%= \sum_{t=1}^n \condCov{\left( \frac{Z_t}{\lambda_t(\boldsymbol{\theta})} - 1 \right) \frac{\partial \lambda_t(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}}}{{\cal F}_{t-1}} \notag \\
%&= \sum_{t=1}^n \left(\frac{1}{\lambda_t(\boldsymbol{\theta})}\right)^2 \left(\frac{\partial \lambda_t(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\right) \underbrace{\condCov{Z_t}{{\cal F}_{t-1}}}_{=\lambda_t(\boldsymbol{\theta})} \left(\frac{\partial \lambda_t(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\right)^\top
= \sum_{t=1}^n \biggl( \frac{1}{\lambda_t(\boldsymbol{\theta})} + \ \sigma^2 \biggr) \left(\frac{\partial \lambda_t(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\right) \left(\frac{\partial \lambda_t(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\right)^\top.
%\label{eq:info}
\end{align*}
In the case of the Poisson assumption it holds $\sigma^2=0$ and in the case of the Negative Binomial assumption $\sigma^2=1/\phi$. For the ease of notation let $G_n^*(\boldsymbol{\theta})=G_n(\boldsymbol{\theta}; 0)$, which is the conditional information matrix in case of a Poisson distribution.

The quasi maximum likelihood estimator (QMLE) $\widehat{\boldsymbol{\theta}}_n$ of $\boldsymbol{\theta}$, assuming that it exists, is the solution of the non-linear constrained optimization problem
\begin{align}
\widehat{\boldsymbol{\theta}}:=\widehat{\boldsymbol{\theta}}_n = \text{arg\,max}_{\boldsymbol{\theta} \in \Theta} \ell(\boldsymbol{\theta}).
\label{eq:mle}
\end{align}

Denote the fitted values by $\widehat{\lambda}_t=\lambda_t(\widehat{\theta})$. Following \citet{christou_quasi-likelihood_2014}, the dispersion parameter $\phi$ of the Negative Binomial distribution is estimated by solving the equation
\begin{align}
\sum_{t=1}^n \frac{(Y_t-\widehat{\lambda}_t)^2}{\widehat{\lambda}_t+\widehat{\lambda}_t^2/\widehat{\phi})} = n-(p+q+r+1),
\label{eq:dispestim}
\end{align}
which is based on Pearson's $\chi^2$ statistic.
The variance parameter $\sigma^2$ is estimated by $\widehat{\sigma}^2=1/\widehat{\phi}$.
For the Poisson distribution we set $\widehat{\sigma}^2=0$.
Strictly speaking, the log-linear model~\eqref{eq:loglin} does not fall into the class of models considered by \citet{christou_quasi-likelihood_2014}.
%, because it does not fulfill the contraction property assumed in their first theorem.
However, results obtained by \citet{douc_ergodicity_2013} (for $p=q=1$) and \citet{sim_maximum_2016} (for $p=q$) allow us to use this estimator also for the log-linear model.
This issue is addressed by simulations in Appendix~\ref{app:distrcoef}, which support that the estimator obtained by \eqref{eq:dispestim} provides good results also for models with the logarithmic link function.

Inference for the regression parameters is based on the asymptotic normality of the QMLE, which has been studied by \citet{fokianos_poisson_2009} and \citet{christou_quasi-likelihood_2014} for models without covariates. For a well behaved covariate process $\{\boldsymbol{X}_t\}$ we conjecture that
\begin{align}
\sqrt{n} \left(\widehat{\boldsymbol{\theta}}_n - \boldsymbol{\theta}_0\right) \stackrel{d}{\longrightarrow} N_{p+q+r+1}\left(\boldsymbol{0}, G_n^{-1}(\widehat{\boldsymbol{\theta}}_n;\widehat{\sigma}^2) G_n^*(\widehat{\boldsymbol{\theta}}_n) G_n^{-1}(\widehat{\boldsymbol{\theta}}_n;\widehat{\sigma}^2) \right),
\label{eq:asymptoticnormality}
\end{align}
as $n\to\infty$, where $\boldsymbol{\theta}_0$ denotes the true parameter value and $\widehat{\sigma}^2$ is a consistent estimator of $\sigma^2$.
%which has to be in the interior of the parameter space $\Theta$.
We suppose that this applies under the same assumptions usually made for the ordinary linear regression model \citep[see for example][p. 140 ff.]{demidenko_mixed_2013}.
For deterministic covariates these assumptions are $||\boldsymbol{X}_t||<c$, where $||\cdot||$ denotes the usual Euclidean norm, i.e., the covariate process is bounded, and $\lim_{n\to\infty} n^{-1}\sum_{t=1}^n \boldsymbol{X}_t \boldsymbol{X}_t^\top = A$, where $c$ is a constant and $A$ is a nonsingular matrix.
For stochastic covariates it is assumed that the expectations $\E\left(\boldsymbol{X}_t\right)$ and $\E\left(\boldsymbol{X}_t \boldsymbol{X}_t^\top\right)$ exist and that $\E\left(\boldsymbol{X}_t \boldsymbol{X}_t^\top\right)$ is nonsingular. The assumptions imply that the information on each covariate grows linearly with the sample size and that the covariates are not linearly dependent.
\citet[Theorem 9.1.1]{fuller_introduction_1996} shows asymptotic normality of the least squares estimator for a regression model with time series errors under even more general conditions which allow the presence of certain types of trends in the covariates.
For the special case of a Poisson model with the identity link, \citet{agosto_modeling_2015} show asymptotic normality of the MLE for a model with covariates that are functions of Markov processes with finite second moments and that are not collinearly related to the response.
The asymptotic normality of the QMLE in our context is supported by the simulations presented in Appendix~\ref{app:covariates}. A formal proof requires further research. To avoid numerical instabilities when inverting $G_n(\widehat{\boldsymbol{\theta}}_n;\widehat{\sigma}^2)$ we apply an algorithm which makes use of the fact that it is a real symmetric and positive definite matrix; see Appendix~\ref{app:invertinfo}.

As an alternative method to the normal approximation \eqref{eq:asymptoticnormality} for obtaining standard errors and confidence intervals (function \code{se}) we include a parametric bootstrap procedure (argument \code{B}), for which computation time is many times higher. Accordingly, $B$ time series are simulated from the model fitted to the original data. The empirical standard errors of the parameter estimates for these $B$ time series are the bootstrap standard errors. Confidence intervals are based on quantiles of the bootstrap sample, see \citet[][Chapter 13]{efron_introduction_1993}.
This procedure can compute standard errors and confidence intervals both for $\widehat{\boldsymbol{\theta}}$ and $\widehat{\sigma}^2$. In our experience $B=500$ yields stable results.




% % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % %
\subsection*{Implementation}

This section and Appendix~\ref{app:implementation} provide some details on the implementation of the function \code{tsglm} and explain its technical arguments. The default settings of this arguments are chosen wisely based on plenty of experiments and should be sufficient for most situations.

The parameter restrictions which are imposed by the condition $\boldsymbol{\theta}\in\Theta$ can be formulated as $d$ linear inequalities.
This means that there exists a matrix $\boldsymbol{U}$ of dimension $d \times (p+q+r+1)$ and a vector $\boldsymbol{c}$ of length $d$, such that $\Theta = \{ \boldsymbol{\theta} \ | \ \boldsymbol{U} \boldsymbol{\theta} \geq \boldsymbol{c} \}$.
For the linear model~\eqref{eq:linear} one needs $d=p+q+r+2$ constraints to ensure nonnegativity of the conditional mean $\lambda_t$ and stationarity of the resulting process.
For the log-linear model~\eqref{eq:loglin} there are not any constraints on the intercept term and on the covariate coefficients; hence $d=2(p+q+1)$.
In order to enforce strict inequalities the respective constraints are tightened by an arbitrarily small constant $\xi>0$; this constant is set to $\xi=10^{-6}$ by default (argument \code{slackvar}).

For solving numerically the maximization problem~\eqref{eq:mle} we employ by default the function \code{constrOptim}. This function applies an algorithm described by \citet[Chapter 14]{lange_numerical_1999}, which essentially enforces the constraints by adding a barrier value to the objective function and then employs an algorithm for unconstrained optimization of this new objective function, iterating these two steps if necessary. By default the quasi-Newton Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm is employed for the latter task of unconstrained optimization, which additionally makes use of the score vector \eqref{eq:score}. 
It is possible to tune the optimization algorithm and even to employ an unconstrained optimization (argument \code{final.control}).

Note that the log-likelihood \eqref{eq:loglik} and the score \eqref{eq:score} are given conditional on unobserved pre-sample values. They depend on the linear predictor and its partial derivatives, which can be computed recursively using any initialization. We give the recursions and present several strategies for their initialization in Appendix~\ref{app:recursions} (arguments \code{init.method} and \code{init.drop}).
\citet[Remark~3.1]{christou_quasi-likelihood_2014} show that the effect of the initialization vanishes asymptotically. Nevertheless, from a practical point of view the initialization of the recursions is crucial. Especially in the presence of strong serial dependence, the resulting estimates can differ substantially even for long time series with 1000 observations; see the simulated example in Table~\ref{tab:recursioninit} in Appendix~\ref{app:recursions}.

Solving the non-linear optimization problem \eqref{eq:mle} requires a starting value for the parameter vector $\boldsymbol{\theta}$. This starting value can be obtained from fitting a simpler model for which an estimation procedure is readily available.
We consider either to fit a GLM or to fit an ARMA model. 
A third possibility is to fit a naive i.i.d. model without covariates.
Furthermore, the user can assign fixed values.
All these possibilities are available by the argument \code{start.control}.
It turns out that the optimization algorithm converges very reliably even if the starting values are not close to the global optimum of the likelihood. A starting value which is closer to the global optimum usually requires fewer iterations until convergence. However, we have encountered some examples where starting values close to a local optimum, obtained by one of the first two aforementioned methods, do not yield the global optimum. Consequently, we recommend fitting the naive i.i.d. model without covariates to obtain starting values.
More details on these approaches are given in Appendix~\ref{app:startestimation}.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Prediction}
\label{sec:prediction}

In terms of the mean square error, the optimal 1-step-ahead predictor $\widehat{Y}_{n+1}$ for $Y_{n+1}$, given ${\cal F}_n$, i.e., the past of the process up to time $n$ and potential covariates at time $n+1$, is the conditional expectation $\lambda_{n+1}$ given in \eqref{eq:linpred} (\code{S3} method of function \code{predict}). By construction of the model the conditional distribution of $\widehat{Y}_{n+1}$ is a Poisson~\eqref{eq:poisson} respectively Negative Binomial~\eqref{eq:nbinom} distribution with mean $\lambda_{n+1}$.
An $h$-step-ahead prediction $\widehat{Y}_{n+h}$ for $Y_{n+h}$ is obtained by recursive 1-step-ahead predictions, where unobserved values $Y_{n+1},\dots,Y_{n+h-1}$ are replaced by their respective 1-step-ahead prediction, $h\in\mathbb{N}$. The distribution of this $h$-step-ahead prediction $\widehat{Y}_{n+h}$ is not known analytically but can be approximated numerically by a parametric bootstrap procedure, which is described below.

In applications, $\lambda_{n+1}$ is substituted by its estimator $\widehat{\lambda}_{n+1}=\lambda_{n+1}(\widehat{\boldsymbol{\theta}})$, which depends on the estimated regression parameters $\widehat{\boldsymbol{\theta}}$. The dispersion parameter $\phi$ of the Negative Binomial distribution is replaced by its estimator $\widehat{\phi}$. Note that plugging in the estimated parameters induces additional uncertainty to the predictive distribution. This estimation uncertainty is not taken into account for the construction of prediction intervals described in the following paragraphs.

Prediction intervals for $Y_{n+h}$ with a given coverage rate $1-\alpha$ (argument \code{level}) are designed to cover the true observation $Y_{n+h}$ with a probability of $1-\alpha$. Simultaneous prediction intervals achieving a global coverage rate for $Y_{n+1},\dots,Y_{n+h}$ can be obtained by a Bonferroni adjustment of the individual coverage rates to $1-\alpha/h$ each (argument \code{global = TRUE}).

There are two different principles for constructing predictions intervals available which in practice often yield identical intervals. Firstly, the limits can be the $(\alpha/2)$- and $(1-\alpha/2)$-quantile of the (approximated) predictive distribution (argument \code{type = "quantiles"}). Secondly, the limits can be chosen such that the interval has minimal length given that, according to the (approximated) predictive distribution, the probability that a value falls into this interval is at least as large as the desired coverage rate $1-\alpha$ (argument \code{type = "shortest"}).

One-step-ahead prediction intervals can be straightforwardly obtained from the conditional distribution (argument \code{method = "conddistr"}). Prediction intervals obtained by a parametric bootstrap procedure (argument \code{method = "bootstrap"}) are based on $B$ simulations of realizations $y_{n+1}^{(b)},\dots,y_{n+h}^{(b)}$ from the fitted model, $b=1,\dots,B$ (argument \code{B}). To obtain an approximative prediction interval for $Y_{n+h}$ one can either use the empirical $(\alpha/2)$- and $(1-\alpha/2)$-quantile of $y_{n+h}^{(1)},\dots,y_{n+h}^{(B)}$ (if \code{type = "quantiles"}) or find the shortest interval which contains at least $\lceil (1-\alpha) \cdot B \rceil$ of these observations (if \code{type = "shortest"}). This bootstrap procedure can be accelerated by distributing it to multiple cores simultaneously (argument \code{parallel = TRUE}), which requires a computing cluster registered by the \proglang{R} package \pkg{parallel} (see the help page of the function \code{setDefaultCluster}).


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Model assessment}
\label{sec:modelassessment}

Tools originally developed for generalized linear models as well as for time series can be utilized to asses the model fit and its predictive performance.
Within the class of count time series following generalized linear models it is desirable to asses the specification of the linear predictor as well as the choice of the link function and of the conditional distribution.
The tools presented in this section facilitate the selection of an adequate model for a given data set.
Note that all tools are introduced as in-sample versions, meaning that the observations $y_1\dots,y_n$ are used for fitting the model as well as for assessing the obtained fit. However, it is straightforward to apply such tools as out-of-sample criteria. 

Recall that the fitted values are denoted by $\widehat{\lambda}_t=\lambda_t(\widehat{\theta})$. Note that these do not depend on the chosen distribution, because the mean is the same regardless of the response distribution.
There are various types of \emph{residuals} available (\code{S3} method of function \code{residuals}).
Response (or raw) residuals (argument \code{type = "response"}) are given by
\begin{align}
r_t=y_t-\widehat{\lambda}_t,
\label{eq:resid_raw}
\end{align}
whereas a standardized alternative are Pearson residuals (argument \code{type = "pearson"})
\begin{align}
r_t^P
=(y_t-\widehat{\lambda}_t) / \sqrt{\widehat{\lambda}_t+\widehat{\lambda}_t^2\widehat{\sigma}^2},
%=\frac{y_t-\widehat{\lambda}_t}{\sqrt{\text{V}(\widehat{\lambda}_t)}}
%=
%\begin{cases}
%(y_t-\widehat{\lambda}_t) / \sqrt{\widehat{\lambda}_t}, & \text{for Poisson distribution}\\
%(y_t-\widehat{\lambda}_t) / \sqrt{\widehat{\lambda}_t+\widehat{\lambda}_t^2/\widehat{\phi}}, & \text{for Negative Binomial distribution}
%\end{cases},
\label{eq:resid_pearson}
\end{align}
or the more symmetrically distributed standardized Anscombe residuals (argument \code{type = "anscombe"})
\begin{align}
r_t^A
%=\frac{A(y_t)-A(\widehat{\lambda}_t)}{\sqrt{\text{Var}(A(\widehat{\lambda}_t))}}
= 
\frac{3/\widehat{\sigma}^2\bigl(\bigl(1+y_t\widehat{\sigma}^2\bigr)^{2/3} - \bigl(1+\widehat{\lambda}_t\widehat{\sigma}^2\bigr)^{2/3}\bigr) + 3\bigl(y_t^{2/3}-\widehat{\lambda}_t^{2/3}\bigr)}{2\bigl(\widehat{\lambda}_t+\widehat{\lambda}_t^2\widehat{\sigma}^2\bigr)^{1/6}},
%\begin{cases}
%\frac{3\bigl(y_t^{2/3}-\widehat{\lambda}_t^{2/3}\bigr)}{2\widehat{\lambda}_t^{1/6}}, & \text{for Poisson distribution}\\
%\frac{3\widehat{\phi}^{-1}\bigl(\bigl(1+\widehat{\phi} y_t\bigr)^{2/3} - \bigl(1+\widehat{\phi}\widehat{\lambda}_t\bigr)^{2/3}\bigr) + 3\bigl(y_t^{2/3}-\widehat{\lambda}_t^{2/3}\bigr)}{2\bigl(\widehat{\phi}\widehat{\lambda}_t^2+\widehat{\lambda}_t\bigr)^{1/6}}, & \text{for Negative Binomial distribution}
%\end{cases},
\label{eq:resid_anscombe}
\end{align}
for $t=1,\dots,n$ \citep[see for example][Section~5.1]{hilbe_negative_2011}. The empirical autocorrelation function of these residuals is useful for diagnosing serial dependence which has not been explained by the fitted model. A plot of the residuals against time can reveal changes of the data generating process over time. Furthermore, a plot of squared residuals $r_t^2$ against the corresponding fitted values $\widehat{\lambda}_t$ exhibits the relation of mean and variance and might point to the Poisson distribution if the points scatter around the identity function or to the Negative Binomial distribution if there exists a quadratic relation \citep[see][]{ver_hoef_quasi-poisson_2007}. 

\citet{christou_count_2015} and \citet{jung_useful_2011} extend tools for assessing the predictive performance to count time series, which were originally proposed by \citet{gneiting_probabilistic_2007} and others for continuous data and transferred to independent but not identically distributed count data by \citet{czado_predictive_2009}. These tools follow the \emph{prequential principle} formulated by \citet{dawid_statistical_1984}, depending only on the realized observations and their respective forecast distributions.
Denote by $P_t(y)=P\bigl(Y_t \leq y | {\cal F}_{t-1}\bigr)$ the cumulative distribution function (c.d.f.), by $p_t(y)=P\bigl(Y_t = y | {\cal F}_{t-1}\bigr)$ the probability density function, $y\in\mathbb{N}_0$, and by $\upsilon_t=\sqrt{\VAR\left(Y_{t}|{\cal F}_{t-1}\right)}$ the standard deviation of the predictive distribution, which is either a Poisson distribution with mean $\widehat{\lambda}_t$ or a Negative Binomial distribution with mean $\widehat{\lambda}_t$ and overdispersion coefficient $\widehat{\sigma}^2$ (recall Section~\ref{sec:prediction} on 1-step-ahead prediction).

A tool for assessing the probabilistic calibration of the predictive distribution \citep[see][]{gneiting_probabilistic_2007} is the \emph{probability integral transform} (PIT), which will follow a uniform distribution if the predictive distribution is correct.
For count data \citet{czado_predictive_2009} define a non-randomized PIT value for the observed value $y_t$ and the predictive distribution $P_t(y)$ by
\begin{align*}
F_t(u|y) =
\begin{cases}
0, & u \leq P_t(y-1) \\
\dfrac{u-P_t(y-1)}{P_t(y)-P_t(y-1)}, & P_t(y-1) < u < P_t(y) \\
1, & u \geq P_t(y)
\end{cases}.
\end{align*}
The mean PIT is then given by
\begin{align*}
\overline{F}(u)=\frac{1}{n} \sum_{t=1}^n F_t(u|y_t), \quad 0\leq u \leq 1.
\end{align*}
To check whether $\overline{F}(u)$ is the c.d.f. of a uniform distribution \citet{czado_predictive_2009} propose plotting a histogram with $H$ bins, where bin $h$ has the height $f_j=\overline{F}(h/H)-\overline{F}((h-1)/H)$, $h=1,\dots,H$ (function \code{pit}).
By default $H$ is chosen to be 10.
%Deviations from the density of the uniform distribution on the interval $[0,1]$ can be assessed by a confidence band based on the assumption of an ideal predictive distribution.
%%Details? This is not so easy to write down properly: For the continuous-valued case $n$ times the height of each bin has a binomial distribution and the bins are independent. In our case $nf_j$ is not necessarily an integer. But I think in principle the same should hold. I use the normal approximation with a Bonferroni correction as a confidence band, i.e., $(n/H \pm u_{1-\alpha/(2H)}\sqrt{n(1/H)(1-1/H)})/(n/H)$.
A U-shape indicates underdispersion of the predictive distribution, whereas an upside down U-shape indicates overdispersion. \citet{gneiting_probabilistic_2007} point out that the empirical coverage of central, e.g., 90\% prediction intervals can be read off the PIT histogram as the area under the 90\% central bins.

\emph{Marginal calibration} is defined as the difference of the average predictive c.d.f.\ and the empirical c.d.f.\ of the observations, i.e.,
\begin{align}
\frac{1}{n}\sum_{t=1}^n P_t(y) - \frac{1}{n}\sum_{t=1}^n \mathbbm{1}(y_t \leq y)
\label{eq:marcal}
\end{align}
for all $y\in\mathbb{R}$. In practice we plot the marginal calibration for values $y$ in the range of the original observations \citep{christou_count_2015} (function \code{marcal}).
If the predictions from a model are appropriate the marginal distribution of the predictions resembles the marginal distribution of the observations and \eqref{eq:marcal} should be close to zero. Major deviations from zero point to model deficiencies.

\citet{gneiting_probabilistic_2007} show that the calibration assessed by a PIT histogram or a marginal calibration plot is a necessary but not sufficient condition for a forecaster to be ideal. They advocate to favor the model with the maximal sharpness among all sufficiently calibrated models. Sharpness is the concentration of the predictive distribution and can be measured by the width of prediction intervals.
A simultaneous assessment of calibration and sharpness summarized in a single numerical score can be accomplished by \emph{proper scoring rules} \citep{gneiting_probabilistic_2007}. Denote a score for the predictive distribution $P_t$ and the observation $y_t$ by $s(P_t, y_t)$. A number of possible proper scoring rules is given in Table~\ref{tab:scoring}. The mean score for each corresponding model is given by $\sum_{t=1}^n s(P_t,y_t)/n$. Each of the different proper scoring rules captures different characteristics of the predictive distribution and its distance to the observed data (function \code{scoring}).
Except for the normalized error score, the model with the lowest score is preferable. The mean squared error score is the only one which does not depend on the distribution and is also known as mean squared prediction error. The mean normalized squared error score measures the variance of the Pearson residuals and is close to one if the model is adequate. The Dawid-Sebastini score is a variant of this with an extra term to penalize overerstimation of the standard deviation.

\begin{table}[tbp]
	\centering
		\begin{tabular}{llc}
			\toprule
			\textbf{Scoring rule} & \textbf{Abbreviation} & \textbf{Definition} \\
			\midrule
			squared error score & \code{sqerror} & $(y_t-\lambda_t)^2$ \\
			normalized squared error score & \code{normsq} & $(y_t-\lambda_t)^2/\upsilon_t^2$ \\
			Dawid-Sebastiani score & \code{dawseb} & $(y_t-\lambda_t)^2/\upsilon_t^2 + 2\log(\upsilon_t)$ \\
			logarithmic score	& \code{logarithmic} & $-\log(p_t(y_t))$ \\
			quadratic (or Brier) score & \code{quadratic} & $-2p_t(y_t) + \left\|p_t\right\|^2$ \\
			spherical score & \code{spherical} & $-p_t(y_t)/\left\|p_t\right\|$ \\
			ranked probability score & \code{rankprob} & $\sum_{y=0}^{\infty} \left(P_t(y) - \mathbbm{1}(y_t \leq y)\right)^2$ \\
			\bottomrule
		\end{tabular}
	\caption{Definitions of proper scoring rules $s(P_t,y_t)$ \citep[cf.][]{czado_predictive_2009} and their abbreviations in the package; $\left\|p_t\right\|^2=\sum_{y=0}^{\infty} p_t^2(y)$.}
	\label{tab:scoring}
\end{table}

Other popular tools are model selection criteria like Akaike's information criterion (AIC) and the Bayesian information criterion (BIC) (functions \code{AIC} and \code{BIC}). The model with the lowest value of the respective information criterion is preferable.
Denote the log-likelihood by $\widetilde{\ell}(\widehat{\boldsymbol{\theta}},\widehat{\sigma}^2) = \sum_{t=1}^n \log\left(p_t(y_t)\right)$. Note that this is the true and not the quasi log-likelihood given in \eqref{eq:loglik}.
Furthermore, $\widetilde{\ell}(\widehat{\boldsymbol{\theta}},\widehat{\sigma}^2)$ includes all constant terms which have been omitted on the right hand side of \eqref{eq:loglik}.
The AIC and BIC are given by $\text{AIC}=-2 \widetilde{\ell}(\widehat{\boldsymbol{\theta}},\widehat{\sigma}^2) + 2 df$ and $\text{BIC}=-2 \widetilde{\ell}(\widehat{\boldsymbol{\theta}},\widehat{\sigma}^2) + \log(n_{\text{eff}}) df$, respectively. Here $df$ is the total number of parameters (including the dispersion coefficient) and $n_{\text{eff}}$ the number of effective observations (excluding those only used for initialization when argument \code{init.drop = TRUE}). The BIC generally yields more parsimonious models than the AIC.
Note that for other distributions than the Poisson, $\widehat{\boldsymbol{\theta}}$ maximizes the quasi log-likelihood \eqref{eq:loglik} but not $\widetilde{\ell}(\boldsymbol{\theta},\sigma^2)$. In such cases the quasi information criterion (QIC), proposed by \citet{pan_akaikes_2001} for regression analysis based on the generalized estimating equations, is a properly adjusted alternative to the AIC (function \code{QIC}).
%In our context the QIC is given by
%\begin{align*}
%\text{QIC} = -2 \widetilde{\ell}(\widehat{\boldsymbol{\theta}},0) + \text{trace}\left( 
%matrix product of covariance matrix and Hessian matrix
%\right),
%\end{align*}
%where $G_n(\boldsymbol{\theta}; \sigma^2)$ is the conditional information matrix given by~\eqref{eq:info} and $G_n^*(\boldsymbol{\theta})= G_n(\boldsymbol{\theta}; 0)$; recall Section~\ref{sec:inference}.
We have verified by a simulation reported in Appendix~\ref{app:qic} that in case of a Poisson distribution the QIC approximates the AIC quite satisfactory.



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Intervention analysis}
\label{sec:interventions}

In many applications sudden changes or extraordinary events occur. \citet{box_intervention_1975} refer to such special events as interventions. This could be for example the outbreak of an epidemic in a time series which counts the weekly number of patients infected with a particular disease. It is of interest to examine the effect of known interventions, for example to judge whether a policy change had the intended impact, or to search for unknown intervention effects and find explanations for them \emph{a posteriori}.

\citet{fokianos_interventions_2010, fokianos_interventions_2012} model interventions affecting the location by including a deterministic covariate of the form $\delta^{t-\tau} \mathbbm{1}(t \geq \tau)$, where $\tau$ is the time of occurrence and the decay rate $\delta$ is a known constant (function \code{interv\_covariate}). This covers various types of interventions for different choices of the constant $\delta$: a singular effect for $\delta=0$ (spiky outlier), an exponentially decaying change in location for $\delta\in(0,1)$ (transient shift) and a permanent change of location for $\delta=1$ (level shift). Similar to the case of covariates, the effect of an intervention is essentially additive for the linear model and multiplicative for the log-linear model. However, the intervention enters the dynamics of the process and therefore its effect on the linear predictor is not purely additive. Our package includes methods to test for such intervention effects developed by \citet{fokianos_interventions_2010, fokianos_interventions_2012}, suitably adapted to the more general model class described in Section~\ref{sec:models}.
The linear predictor of a model with $s$ types of interventions according to parameters $\delta_1,\dots,\delta_s$ occurring at time points $\tau_1,\dots,\tau_s$ reads
\begin{align}
g(\lambda_t) = \beta_0
    + \sum_{k=1}^p \beta_k \,\widetilde{g}(Y_{t-i_k})
    + \sum_{\ell=1}^q \alpha_{\ell} g(\lambda_{t-j_{\ell}})
    + \boldsymbol{\eta}^\top \boldsymbol{X}_t
		+ \sum_{m=1}^{s} \omega_{m} \delta_{m}^{t-\tau_{m}}\mathbbm{1}(t \geq \tau_{m}),
\label{eq:intervention}
\end{align}
where $\omega_m$, $m=1,\dots,s$ are the intervention sizes. At the time of its occurrence an intervention changes the level of the time series by adding the magnitude $\omega_m$, for a linear model like \eqref{eq:linear}, or by multiplying the factor $\exp(\omega_m)$, for a log-linear model like \eqref{eq:loglin}. In the following paragraphs we briefly outline the proposed intervention detection procedures and refer to the original articles for details.

Our package allows to test whether $s$ interventions of certain types occurring at given time points, according to model~\eqref{eq:intervention}, have an effect on the observed time series, i.e., to test the hypothesis $H_0: \omega_1=\ldots=\omega_s=0$ against the alternative $H_1: \omega_{\ell}\neq0$ for some $\ell\in\{1,\dots,s\}$. This is accomplished by employing an approximate score test (function \code{interv\_test}). Under the null hypothesis the score test statistic $T_n(\tau_1,\dots,\tau_s)$ has asymptotically a $\chi^2$-distribution with $s$ degrees of freedom, assuming some regularity conditions \citep[][Lemma~1]{fokianos_interventions_2010}.

For testing whether a single intervention of a certain type occurring at an unknown time point $\tau$ has an effect, the package employs the maximum of the score test statistics $T_n(\tau)$ and determines a $p$~value by a parametric bootstrap procedure (function \code{interv\_detect}). If we consider a set $D$ of time points at which the intervention might occur, e.g., $D=\{2,\dots,n\}$, this test statistic is given by $\widetilde{T}_n=\max_{\tau \in D} T_n(\tau)$.
The bootstrap procedure can be computed on multiple cores simultaneously (argument \code{parallel = TRUE}).
The time point of the intervention is estimated to be the value $\tau$ which maximizes this test statistic. Our empirical observation is that such an estimator usually has a large variability. It is possible to speed up the computation of the bootstrap test statistics by using the model parameters used for generation of the bootstrap samples instead of estimating them for each bootstrap sample (argument \code{final.control\_bootstrap = NULL}). This results in a conservative procedure, as noted by \citet{fokianos_interventions_2012}.

If more than one intervention is suspected in the data, but neither their types nor the time points of its occurrences are known, an iterative detection procedure is used (function \code{interv\_multiple}). Consider the set of possible intervention times $D$ as before and a set of possible intervention types $\Delta$, e.g., $\Delta=\{0,0.8,1\}$.
In a first step the time series is tested for an intervention of each type $\delta\in\Delta$ as described in the previous paragraph and the $p$~values are corrected to account for the multiple testing by the Bonferroni method. If none of the $p$~values is below a previously specified significance level, the procedure stops and does not identify an intervention effect. Otherwise the procedure detects an intervention of the type corresponding to the lowest $p$~value. In case of equal $p$~values preference is given to interventions with $\delta=1$, that is level shifts, and then to those with the largest test statistic. In a second step, the effect of the detected intervention is eliminated from the time series and the procedures starts anew and continues until no further intervention effects are detected. Finally, model~\eqref{eq:intervention} with all detected intervention effects can be fitted to the data to estimate the intervention sizes and the other parameters jointly (which are in general different than when estimated in separate steps). Note that statistical inference for this final model fit has to be done with care.

In practical applications, the decay rate $\delta$ of a particular intervention effect is often unknown and needs to be estimated. Since the parameter $\delta$ is not identifiable when the corresponding intervention size $\omega$ is zero, its estimation is nonstandard. As suggested by a reviewer, estimation could be carried out by profiling the likelihood over this parameter. For a single intervention effect this could be done by computing the (quasi) ML estimator of all other parameters for a given decay rate $\delta$. This is repeated for all $\delta\in\Delta$, where $\Delta$ is a set of possible decay rates, and the value which results in the maximum value of the log-likelihood is chosen (apply the function \code{tsglm} repeatedly). Note that this approach affects the validity of the usual statistical inference for the other parameters.

\citet{liboschik_modelling_2016} study a model for external intervention effects (modeled by external covariate effects, recall \eqref{eq:linpredextint} and the related discussion) and compare it to internal intervention effects studied in the two aforementioned publications (argument \code{external}).



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Usage of the package}
\label{sec:usage}

The most recent stable version of the \pkg{tscount} package is distributed via the Comprehensive \proglang{R} Archive Network (CRAN). A current development version is available from the project's website \url{http://tscount.r-forge.r-project.org} on the development platform R-Forge. After installation of the package it can be loaded in \proglang{R} by typing \code{library("tscount")}.

The central function for fitting a GLM for count time series is \code{tsglm}, whose help page (accessible by \code{?tsglm}) is a good starting point to become familiar with the usage of the package. The most relevant functions of the package are summarized in Table~\ref{tab:tscount}. There are many standard \code{S3} methods available for well-known generic functions. A detailed description of the functions' usage including examples can be found on the accompanying help pages.
The package provides some data sets which are also listed in Table~\ref{tab:tscount}.

In the following sections we demonstrate typical applications of the package by two data examples.

\begin{table}[tbp]
	\centering
		\begin{tabular}{llp{0.6\textwidth}}
\toprule
 & \textbf{Name} & \textbf{Description} \\
\midrule
\textbf{Functions}
  & \code{tsglm} & Fitting a model to given data (class \code{"tsglm"}) \\
  & \code{tsglm.sim} & Simulating from the model \\[0.5em]
  & \multicolumn{2}{l}{\textit{Generic functions with methods for class} \code{"tsglm"}:} \\
  & \code{plot} & Diagnostic plots \\
  & \code{se} & Standard errors and confidence intervals \\
  & \code{summary} & Summary of the fitted model \\
  & \code{fitted} & Fitted values \\
  & \code{residuals} & Residuals \\
  & \code{AIC} & Akaike's information criterion \\
  & \code{BIC} & Bayesian information criterion \\
  & \code{QIC} & Quasi information criterion \\
  & \code{pit} & Probability integral transform histogram \\
  & \code{marcal} & Marginal calibration plot \\
  & \code{scoring} & Proper scoring rules \\  
  & \code{predict} & Prediction \\  
  & \code{interv\_test} & Test for intervention effects \\
  & \code{interv\_detect} & Detection of single intervention effects \\
  &\code{interv\_multiple} & Iterative detection of multiple intervention effects \\
\midrule
\textbf{Data sets}
  & \code{campy} & Campylobacter infections in Qu\'{e}bec \\
  & \code{ecoli} & E. coli infections in North Rhine-Westphalia (NRW) \\
  & \code{ehec} & EHEC/HUS infections in NRW \\
  & \code{influenza} & Influenza infections in NRW \\
  & \code{measles} & Measles infections in NRW \\
\bottomrule
		\end{tabular}
	\caption{Most important functions of the \proglang{R} package \pkg{tscount} and the included data sets.}
	\label{tab:tscount}
\end{table}


% % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % %
\subsection{Campylobacter infections in Canada}
\label{sec:campy}

<<campy1, echo=FALSE, fig=TRUE, include=FALSE, height=3, width=7>>=
library("tscount")
par(mar=c(3, 3, 0.5, 0.5), mgp=c(1.8, 0.6, 0))
plot(campy, ylab="Number of cases", type="o")
@
%
\begin{figure}[tbp]
\centering
\includegraphics[width=\figurewidth]{tscount-campy1}
\caption{Number of campylobacterosis cases (reported every 28 days) in the North of Qu\'{e}bec in Canada.}
\label{fig:campyplot}
\end{figure}
%
We first analyze the number of campylobacterosis cases (reported every 28 days) in the North of Qu\'{e}bec in Canada. The data are shown in Figure~\ref{fig:campyplot} and were first reported by \citet{ferland_integer-valued_2006}. These data are made available in the package (object \code{campy}).
We fit a model to this time series using the function \code{tsglm}.
Following the analysis of \citet{ferland_integer-valued_2006} we fit model~\eqref{eq:linear} with the identity link function, defined by the argument \code{link}.
For taking into account serial dependence we include a regression on the previous observation. Seasonality is captured by regressing on $\lambda_{t-13}$, the unobserved conditional mean 13 time units (which is about one year) back in time. The aforementioned specification of the model for the linear predictor is assigned by the argument \code{model}, which has to be a list.
We also include the two intervention effects detected by \citet{fokianos_interventions_2010} in the model by suitably chosen covariates provided by the argument \code{xreg}.
We compare a fit of a Poisson with that of a Negative Binomial conditional distribution, specified by the argument \code{distr}. The call for both model fits is then given by:
<<campy2>>=
interventions <- interv_covariate(n = length(campy), tau = c(84, 100),
                  delta = c(1, 0))
campyfit_pois <- tsglm(campy, model = list(past_obs = 1, past_mean = 13),
                  xreg = interventions, distr = "poisson")
campyfit_nbin <- tsglm(campy, model = list(past_obs = 1, past_mean = 13),
                  xreg = interventions, distr = "nbinom")
@
The resulting fitted models \code{campyfit\_pois} and \code{campyfit\_nbin} have class \code{"tsglm"}, for which a number of methods is provided (see help page), including \code{summary} for a detailed model summary and \code{plot} for diagnostic plots. 
%
The diagnostic plots like in Figure~\ref{fig:campydiagnostic} can be produced by:
<<campy3a, fig=TRUE, include=FALSE, echo=FALSE, height=5, width=7>>=
par(mfrow = c(2, 2), mar=c(3, 4, 2, 0.5), mgp=c(1.8, 0.6, 0))
acf(residuals(campyfit_pois), main="", xlab="Lag (in years)")
title(main = "ACF of response residuals")
marcal(campyfit_pois, main = "Marginal calibration")
  lines(marcal(campyfit_nbin, plot = FALSE), lty = "dashed")
  legend("bottomright", legend = c("Pois", "NegBin"), lwd=1,
         lty=c("solid", "dashed"))
pit(campyfit_pois, ylim = c(0, 1.5), main = "PIT Poisson")
pit(campyfit_nbin, ylim = c(0, 1.5), main = "PIT Negative Binomial")
@
<<campy3b, eval=FALSE>>=
acf(residuals(campyfit_pois), main = "ACF of response residuals")
marcal(campyfit_pois, main = "Marginal calibration")
  lines(marcal(campyfit_nbin, plot = FALSE), lty = "dashed")
  legend("bottomright", legend = c("Pois", "NegBin"), lwd = 1,
         lty = c("solid", "dashed"))
pit(campyfit_pois, ylim = c(0, 1.5), main = "PIT Poisson")
pit(campyfit_nbin, ylim = c(0, 1.5), main = "PIT Negative Binomial")
@
%
\begin{figure}[tbh]
\centering
\includegraphics[width=\figurewidth]{tscount-campy3a}
\caption{Diagnostic plots after model fitting to the campylobacterosis data.}
\label{fig:campydiagnostic}
\end{figure}
%
The response residuals are identical for the two conditional distributions. Their empirical autocorrelation function, shown in Figure~\ref{fig:campydiagnostic} (top left), does not exhibit any serial correlation or seasonality which has not been taken into account by the models.
Figure~\ref{fig:campydiagnostic} (bottom left) points to an approximately U-shaped PIT histogram indicating that the Poisson distribution is not adequate for model fitting. As opposed to this, the PIT histogram which corresponds to the Negative Binomial distribution appears to approach uniformity better. Hence the probabilistic calibration of the Negative Binomial model is satisfactory. The marginal calibration plot, shown in Figure~\ref{fig:campydiagnostic} (top right), is inconclusive.
As a last tool we consider the scoring rules for the two distributions:
<<campy4>>=
rbind(Poisson = scoring(campyfit_pois), NegBin = scoring(campyfit_nbin))
@
All considered scoring rules are in favor of the Negative Binomial distribution.
Based on the PIT histograms and the results obtained by the scoring rules, we decide for the Negative Binomial model. The degree of overdispersion seems to be small, as the estimated overdispersion coefficient \code{sigmasq} of \Sexpr{round(campyfit_nbin$sigmasq,4)} given in the output below is close to zero.
<<campy5>>=
summary(campyfit_nbin)
@
The coefficient \code{beta\_1} corresponds to regression on the previous observation, \code{alpha\_13} corresponds to regression on values of the conditional mean thirteen units back in time. The output reports the estimation of the overdispersion coefficient $\sigma^2$, which is related to the dispersion parameter $\phi$ of the Negative Binomial distribution by $\phi=1/\sigma^2$.
Accordingly, the fitted model for the number of new infections $Y_t$ in time period $t$ is given by $Y_t|{\cal F}_{t-1} \sim \text{NegBin}(\lambda_t,\Sexpr{round(campyfit_nbin$distrcoefs[1],2)})$ with
\begin{align*}
\lambda_t = \Sexpr{round(coef(campyfit_nbin)[1],2)} + \Sexpr{round(coef(campyfit_nbin)[2],2)} Y_{t-1} + \Sexpr{round(coef(campyfit_nbin)[3],2)} \lambda_{t-13} + \Sexpr{round(coef(campyfit_nbin)[4],2)} \mathbbm{1}(t = 84)  + \Sexpr{round(coef(campyfit_nbin)[5],2)} \mathbbm{1}(t \geq 100), \quad t=1,\dots,\Sexpr{campyfit_nbin$n_obs}.
\end{align*}
The standard errors of the estimated regression parameters and the corresponding confidence intervals in the summary above are based on the normal approximation given in \eqref{eq:asymptoticnormality}.
For the additional overdispersion coefficient \code{sigmasq} of the Negative Binomial distribution there is no analytical approximation available for its standard error. Alternatively, standard errors (and confidence intervals, not shown here) of the regression parameters and the overdispersion coefficient can be obtained by a parametric bootstrap (which takes about 15 minutes computation time on a single 3.2 GHz processor for 500 replications):
<<campy6a, echo=FALSE>>=
load("campy.RData")
@
<<campy6b, eval=FALSE>>=
se(campyfit_nbin, B = 500)$se
@
<<campy6c, echo=FALSE>>=
campyse$se
warningse[length(warningse)]
@
Estimation problems for the dispersion parameter (see warning message) occur occasionally for models where the true overdispersion coefficient $\sigma^2$ is small, i.e., which are close to a Poisson model; see Appendix~\ref{app:distrcoef}.
The bootstrap standard errors of the regression parameters are slightly larger than those based on the normal approximation. Note that neither of the approaches reflects the additional uncertainty induced by the model selection.


% % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % %
\subsection{Road casualties in Great Britain}

Next we study the monthly number of killed drivers of light goods vehicles in Great Britain between January 1969 and December 1984 shown in Figure~\ref{fig:seatbelts1}. This time series is part of a dataset which was first considered by \citet{harvey_effects_1986} for studying the effect of compulsory wearing of seatbelts introduced on 31 January 1983. The dataset, including additional covariates, is available in \proglang{R} in the object \code{Seatbelts}. In their paper \citet{harvey_effects_1986} analyze the numbers of casualties for drivers and passengers of cars, which are so large that they can be treated with methods for continuous-valued data. The monthly number of killed drivers of vans analyzed here is much smaller (its minimum is 2 and its maximum 17) and therefore methods for count data are to be preferred.
%
<<seatbelts1, echo=FALSE, fig=TRUE, include=FALSE, height=3, width=7>>=
par(mar=c(3, 3, 0.5, 0.5), mgp=c(1.8, 0.6, 0))
plot(Seatbelts[, "VanKilled"], ylab = "Number of casualties", type = "o",
     xaxt = "n", ylim = c(0, 18))
axis(side = 1, at = 1969:1985)
abline(v = 1983, col = "darkgrey", lwd=2)
@
%
\begin{figure}[tbp]
\centering
\includegraphics[width=\figurewidth]{tscount-seatbelts1}
\caption{Monthly number of killed van drivers in Great Britain. The introduction of compulsory wearing of seatbelts on 31 January 1983 is marked by a vertical line.}
\label{fig:seatbelts1}
\end{figure}

For model selection we only use the data until December 1981.
We choose the log-linear model with the logarithmic link because it allows for negative covariate effects.
We aim at capturing the short range serial dependence by a first order autoregressive term and the yearly seasonality by a 12th order autoregressive term. Both of these terms are declared by the list element named \code{past\_obs} of the argument \code{model}.
Following \citet{harvey_effects_1986} we use the real price of petrol as an explanatory variable. We also include a deterministic covariate describing a linear trend. Both covariates are provided by the argument \code{xreg}.
Based on PIT histograms, a marginal calibration plot and the scoring rules (not shown here) we find that the Poisson distribution is sufficient for modeling.
The model is fitted by the call:
<<seatbelts2>>=
timeseries <- Seatbelts[, "VanKilled"]
regressors <- cbind(PetrolPrice = Seatbelts[, c("PetrolPrice")],
                    linearTrend = seq(along = timeseries)/12)
timeseries_until1981 <- window(timeseries, end = 1981 + 11/12)
regressors_until1981 <- window(regressors, end = 1981 + 11/12)
seatbeltsfit <- tsglm(timeseries_until1981,
  model = list(past_obs = c(1, 12)), link = "log", distr = "poisson",
  xreg = regressors_until1981)
@
%
<<seatbelts3a, echo=FALSE>>=
load("seatbelts.RData")
@
<<seatbelts3b, eval=FALSE>>=
summary(seatbeltsfit, B = 500)
@
<<seatbelts3c, echo=FALSE>>=
seatbeltssummary
#warningse[length(warningse)]
@
%(computation time is about three minutes on a single 3.2 GHz processor)
Accordingly, the fitted model for the number of van drivers $Y_t$ killed in month $t$ is given by $Y_t|{\cal F}_{t-1} \sim \text{Poisson}(\lambda_t)$ with
\begin{align*}
\log(\lambda_t) = \Sexpr{round(coef(seatbeltsfit)[1],2)} + \Sexpr{round(coef(seatbeltsfit)[2],2)} Y_{t-1} + \Sexpr{round(coef(seatbeltsfit)[3],2)} Y_{t-12} + \Sexpr{round(coef(seatbeltsfit)[4],2)} X_t  - \Sexpr{round(-coef(seatbeltsfit)[5],2)} t/12, \quad t=1,\dots,\Sexpr{seatbeltsfit$n_obs},
\end{align*}
where $X_t$ denotes the real price of petrol at time $t$.
The estimated coefficient \code{beta\_1} corresponding to the first order autocorrelation is very small and even slightly below the size of its approximative standard error, indicating that there is no notable dependence on the number of killed van drivers of the preceding week. We find a seasonal effect captured by the twelfth order autocorrelation coefficient \code{beta\_12}. Unlike in the model for the car drivers by \citet{harvey_effects_1986}, the petrol price does not seem to influence the number of killed van drivers. An explanation might be that vans are much more often used for commercial purposes than cars and that commercial traffic is less influenced by the price of fuel. The linear trend can be interpreted as a yearly reduction of the number of casualties by a factor of \Sexpr{round(exp(coef(seatbeltsfit)["linearTrend"]), 2)} (obtained by exponentiating the corresponding estimated coefficient), i.e., on average we expect \Sexpr{round((1-exp(coef(seatbeltsfit)["linearTrend"]))*100, 1)}\% fewer killed van drivers per year (which is below one in absolute numbers).

Based on the model fitted to the training data until December 1981, we can predict the number of road casualties in 1982 given the respective petrol price. Coherent, i.e. integer-valued forecasts could be obtained by rounding the predictions. A graphical representation of the following predictions is given in Figure~\ref{fig:seatbeltspred}.
<<seatbelts4a, echo=FALSE>>=
opts <- options(digits=3)
@
<<seatbelts4b>>=
timeseries_1982 <- window(timeseries, start = 1982, end = 1982 + 11/12)
regressors_1982 <- window(regressors, start = 1982, end = 1982 + 11/12) 
predict(seatbeltsfit, n.ahead = 12, level = 0.9, global = TRUE,
        B = 2000, newxreg = regressors_1982)$pred
@
<<seatbelts4c, echo=FALSE>>=
options(opts)
@
%
<<seatbelts5, echo=FALSE, fig=TRUE, include=FALSE, height=3, width=7>>=
par(mar=c(3, 3, 0.5, 0.5), mgp=c(1.8, 0.6, 0))
predictions_1982 <- predict(seatbeltsfit, n.ahead = 12,
                            level = 0.9, global = TRUE,
                            B = 2000, newxreg = regressors_1982)
plot(window(timeseries, end = 1982.917), type = "o",
     xlim = c(1978.7, 1982.9), ylim = c(0, 20),
     ylab = "Number of casualities")
lines(fitted(seatbeltsfit), col = "black", lty = "longdash", lwd = 2)
arrows(x0 = time(predictions_1982$interval),
       y0 = predictions_1982$interval[, "lower"],
       y1 = predictions_1982$interval[, "upper"],
       angle = 90, code = 3, length = 0.04, col = "darkgrey", lwd = 2)
points(timeseries_1982, pch = 16, type = "o") 
lines(x = c(1981.917, time(predictions_1982$pred)), c(fitted(seatbeltsfit)[156], predictions_1982$pred), col = "black", lty = "solid", lwd = 2)
@
%
\begin{figure}[tbp]
\centering
\includegraphics[width=\figurewidth]{tscount-seatbelts5}
\caption{Fitted values (dashed line) and predicted values (solid line) according to the model with the Poisson distribution. Prediction intervals (grey bars) are designed to ensure a global coverage rate of 90\%. They are chosen to have minimal length and are based on a simulation with 2000 replications.}
\label{fig:seatbeltspred}
\end{figure}

Finally, we test whether there was an abrupt shift in the number of casualties occurring when the compulsory wearing of seatbelts is introduced on 31 January 1983. The approximative score test described in Section~\ref{sec:interventions} is applied: 
<<seatbelts6a>>=
seatbeltsfit_alldata <- tsglm(timeseries, link = "log",
                              model = list(past_obs = c(1, 12)),
                              xreg = regressors, distr = "poisson")
@
<<seatbelts6b, echo=FALSE>>=
seatbelts_test <- interv_test(seatbeltsfit_alldata, tau = 170,
                              delta = 1, est_interv = TRUE)
@
<<seatbelts6c, eval=FALSE>>=
interv_test(seatbeltsfit_alldata, tau = 170, delta = 1, est_interv = TRUE)
@
<<seatbelts6d, echo=FALSE>>=
seatbelts_test
@
With a $p$~value of \Sexpr{round(seatbelts_test$p_value,2)} the null hypothesis of no intervention cannot be rejected at a 5\% significance level. Note that this result does not rule out that there is an effect of the seatbelts law which is either too small for being significant or of a different type than it is tested for. For illustration we fit the model under the alternative of a level shift after the introduction of the seatbelts law (see the output above). The multiplicative effect size of the intervention is found to be \Sexpr{round(exp(coef(seatbelts_test$fit_interv)["interv_1"]), 3)}. This indicates that according to this model fit \Sexpr{round(-(1-exp(coef(seatbelts_test$fit_interv)["interv_1"]))*100, 1)}\% less van drivers are killed after the law enforcement. For comparison, \citet{harvey_effects_1986} estimate a reduction of 18\% for the number of killed car drivers.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\newpage
\section{Comparison with other software packages}
\label{sec:comparison}

%  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %

In this section we review functions (and the corresponding models) from other \proglang{R} packages which can be employed for count time series analysis. Many of them have been published only very recently, a fact that demonstrates the raising interest in count time series analysis. We discuss how these packages differ from our package \pkg{tscount}. For illustration we use the time series of Campylobacter infections analyzed in Section~\ref{sec:campy} ignoring the intervention effects.
For the presentation of other models we use a notation parallel to the one used in the previous sections to highlight similarities. Interpretation of the final model should be done carefully, though.

We consider a large number of somehow related packages which makes this comparison quite extensive yet interesting for those readers who want guidance on choosing the most appropriate package for their data. In the first subsection we present packages for independent data and in the second subsection we discuss packages for dependent data.

<<tsglm-comparison, echo=FALSE>>=
campyfit_tsglm <- tsglm(campy, model = list(past_obs = 1, past_mean = 13),
                        distr = "nbinom", link = "identity")
#model like in the Section 'Usage' but without considering the intervention effects
@

% % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % %
\subsection{Packages for independent data}
\label{sec:comparison_indep}

We start reviewing functions which have been introduced for independent observations but can, with certain limitations, be employed for time series whose temporal dependence is rather simple. This is exemplarily discussed in the following paragraph.

%  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %

The function \code{glm} in the package \pkg{stats} and, for the Negative Binomial distribution, the function \code{glm.nb} in the package \pkg{MASS} \citep{venables_modern_2002} can fit standard GLMs to count time series with the iteratively reweighted least squares (IRLS) algorithm. Just like with our \code{tsglm} function, one can choose the identity or logarithmic link in combination with a Poisson or Negative Binomial conditional distribution. Standard GLMs have been introduced to model independent but not identically distributed observations. In principle, one could also fit simple models for time series by including lagged values of the time series, i.e., $Y_{t-i_1}, \dots, Y_{t-i_p}$, as covariates. However, the \code{glm} function has several limitations; the most important being that it does not allow for regression on past values of the conditional mean. For example, the \code{glm} function cannot be used to fit the model which included stochastic seasonality; recall Section~\ref{sec:campy}.
Furthermore, the \code{glm} function does not induce the constraints on the vector of parameters given in Section~\ref{sec:inference}, which are necessary to ensure stationarity of the fitted process. Models which are violating these parameter constraints are generally not suitable for prediction. We have also experienced that \code{glm} occasionally does not find good starting values for its optimization procedure such that it returns an error and requests the user to provide starting values.
At least for the very simple case of a Poisson INGARCH(1,0) model fitted to the Campylobacterosis data the \code{glm} function performs well and we obtain very similar parameter estimates like with the \code{tsglm} function:
<<glm-function>>=
campydata <- data.frame(ts = campy[-1], lag1 = campy[-length(campy)])
coef(glm(ts ~ lag1, family = poisson(link = "identity"),
         data = campydata))
coef(tsglm(campy, model = list(past_obs = 1), link = "identity")) 
@
As described in more detail in Section~\ref{app:startestimation}, a fit by the \code{glm} function can be used as a starting value to the function \code{tsglm}.

%  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %

The class of generalized additive models for location, scale and shape (GAMLSS) has been introduced by \citet{rigby_generalized_2005} as an extension of a GLM and a generalized additive model (GAM). In addition to the location parameter further parameters of the conditional distribution can be modeled as functions of the explanatory variables.
In the following example we use the package \pkg{gamlss} \citep[authored by][]{rigby_generalized_2005} to fit an INGARCH(1,0) model to the Campylobacterosis data. The overdispersion coefficient $\sigma^2_t$ of the Negative Binomial distribution is not constant but changes with time according to the equation
\begin{align*}
\sigma^2_t = \exp\left( \beta_0^* + \beta_1^* \log(Y_{t-1}+1) \right).
\end{align*}
% <<gamlss-package1>>=
% library("gamlss")
% gamlss(ts ~ lag1, sigma.formula = ~ log(lag1+1), data = campydata,
%        family = NBI(mu.link = "identity", sigma.link = "log"))[c(25, 43)]
% @
% <<gamlss-package2, echo=FALSE, results=hide>>=
% campyaic_variablesigma <- AIC(gamlss(ts ~ lag1, sigma.formula = ~ log(lag1+1), data = campydata, family = NBI(mu.link = "identity", sigma.link = "log")))
% campyaic_constantsigma <- AIC(gamlss(ts ~ lag1, data = campydata, family = NBI(mu.link = "identity")))
% @
\begin{Schunk}
\begin{Sinput}
R> library("gamlss")
R> gamlss(ts ~ lag1, sigma.formula = ~ log(lag1+1), data = campydata,
+         family = NBI(mu.link = "identity", sigma.link = "log"))[c(25, 43)]
\end{Sinput}
\begin{Soutput}
GAMLSS-RS iteration 1: Global Deviance = 803.7 
GAMLSS-RS iteration 2: Global Deviance = 803.7 
GAMLSS-RS iteration 3: Global Deviance = 803.7 
$mu.coefficients
(Intercept)        lag1 
     3.8409      0.6768 

$sigma.coefficients
  (Intercept) log(lag1 + 1) 
      -4.2986        0.7167 
\end{Soutput}
\end{Schunk}
The possibility of a time dependent dispersion coefficient does not improve the fit for this data example (according to the AIC, which is
% \Sexpr{#round(campyaic_variablesigma,2)}
811.72
compared to
% \Sexpr{#round(campyaic_constantsigma,2)}
811.64
for a model with constant overdispersion coefficient) but might be quite useful for other data examples. However, it is clear that such a complex model yields more uncertainty of the parameter estimations (i.e., larger standard errors, which are not shown here).

%  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %

The package \pkg{ZIM} \citep{yang_zim:_2014} fits zero-inflated models (ZIM) for count time series with excess zeros. These models are suitable for data where the value zero occurs more frequently than it would be expected when assuming other count time series models. The main idea of these models is to replace the ordinary Poisson or Negative Binomial distribution by its respective zero-inflated version, which is a mixture of a singular distribution in zero (with probability $\omega_t$) and a Poisson or Negative Binomial distribution (with probability $1-\omega_t$), respectively. 
The model proposed by \citet{yang_markov_2013} allows both, the probability $\omega_t$ and the conditional mean $\lambda_t$ of the ordinary count data distribution, to vary over time. The conditional mean $\lambda_t$ is modeled by using a logistic regression model. The probability $\omega_t$ is modeled by a GLM with the logistic link.
%Like with the \code{glm} function, temporal dependence can be included by manually including lagged values of the time series as covariates (with the same limitations as described before).
Other methods for count data with excess zeros, which also have these limitations, are provided by the well-established functions \code{zeroinfl} and \code{hurdle} from the package \pkg{pscl} \citep{zeileis_regression_2008}.
However, the package \pkg{ZIM} includes an extension of ZIM to state space models, which is treated in the next section.
The parameters of a ZIM are fitted by the function \code{zim} employing an EM algorithm.
Zero-inflation models are definitely appealing for count time series which occasionally exhibit excess zeros. For our data example of Campylobacter infections, which does not include any zero observations, ZIM are not applicable.
% <<ZIM-package1, warning=TRUE>>=
% library("ZIM")
% zim(ts ~ lag1, control = zim.control(dist = "zinb", type= "ginv"),
%     data = campydata) #data has no zeros!
% @

%  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %

The current version of the \pkg{tscount} package considered in this paper is limited to modeling univariate data. A possible extension to models for vectors of counts is provided by the package \pkg{VGAM} \citep{yee_vgam:_2016} introduced by \citet{yee_vector_2015}. The function \code{vglm} in this package fits a vector GLM (VGLM) \citep[see][]{yee_vector_1996} where the conditional density function of a $d$-dimensional response vector $\boldsymbol{Y}_t$ given an $r$-dimensional covariate vector $\boldsymbol{X}_t$ is assumed to be of the form
\begin{align*}
f(\boldsymbol{Y}_t|\boldsymbol{X}_t; \boldsymbol{H}) = h(\boldsymbol{Y}_t, \nu_1, \dots, \nu_s),
\end{align*}
where $\nu_j=\boldsymbol{\eta}_j^\top\boldsymbol{X}_t$, $j=1,\dots,s$ and $h(\cdot)$ is a suitably defined function. The model parameters are given by the $(r \times s)$-dimensional parameter matrix $\boldsymbol{H}=(\boldsymbol{\eta}_1^\top,\dots,\boldsymbol{\eta}_s^\top)^\top$.
Choosing $d=s=1$ results in the special case of an ordinary, univariate GLM.
%The VGLM has been introduced for independent observations but can be employed for dependent time series data by including past observations as covariates like it is described at the beginning of this section for ordinary GLMs. Naturally, this has the same problems and limitations which have been discussed in that context.
We demonstrate a fit of an INGARCH(1,0) model to the Campylobacterosis data by the following code:
% <<VGAM-package>>=
% library("VGAM")
% coef(vglm(ts ~ lag1, family = poissonff(link = "identitylink"),
%           data = campydata))
% @
\begin{Schunk}
\begin{Sinput}
R> library("VGAM")
R> coef(vglm(ts ~ lag1, family = poissonff(link = "identitylink"),
+            data = campydata))
\end{Sinput}
\begin{Soutput}
(Intercept)        lag1 
     4.0322      0.6556 
\end{Soutput}
\end{Schunk}
We note that the function \code{vglm} produces exactly the same output as the function \code{glm} for this special case.
The function \code{vgam} from the same package would allow to fit an even more general vector generalized additive model (VGAM), which is a multivariate generalization of a generalized additive model (GAM), see \citet{yee_vector_2015} for more details.

%  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %

Due to the aforementioned limitations of the procedures developed for independent data we would generally suggest the use of the function \code{tsglm} for modeling count time series.
However, in certain situations, where features of the data are currently not supported by \code{tsglm}, the aforementioned packages can be employed with care; recall the second paragraph of this section. For count time series with many zeros one might want to consider using, for example, the package \pkg{ZIM}. If there are reasons to assume a time-varying overdispersion coefficient, the package \pkg{gamlss} is a good choice. Multivariate count time series could be analyzed with the package \pkg{VGAM}.


% % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % %
\subsection{Packages for time series data}
\label{sec:comparison_timser}

In this section we present \proglang{R} packages developed for count time series data.

%  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %

The package \pkg{acp} \citep{siakoulis_acp:_2015} has been published recently and provides maximum likelihood fitting of autoregressive conditional Poisson (ACP) regression models. These are the INGARCH models given by \eqref{eq:linear}; see Section~\ref{sec:models}. The \pkg{acp} package also allows to include covariate effects. In its latest version 2.1, which has been published in December 2015, the package has been extended to fit models of general order $p$ and $q$. The \code{tsglm} function of our package includes these models as special cases and is more general in the following aspects:
\begin{itemize}[nosep]
\item The \pkg{acp} package is different in many technical details. Notably, it does not allow to incorporate the parameter constraints given in Section~\ref{sec:inference}.
\item Quasi maximum likelihood fitting allows to choose a more flexible Negative Binomial model instead of a Poisson model (argument \code{distr = "nbinom"}).
\item The \code{tsglm} function additionally comprises a log-linear model (argument \code{link = "log"}), which is more adequate for many count time series.
\item The \code{tsglm} function allows for more flexible dependence modeling by allowing arbitrary specification of dynamics. This flexibility is missing by the \code{acp} function for model fitting because it requires all variables up to a given order to be included (e.g., $\lambda_{t-1},\dots,\lambda_{t-12}$ and not just $\lambda_{t-12}$). For instance, \code{tsglm} allows for stochastic seasonality (see Section~\ref{sec:campy}).
\item The \code{tsglm} function differentiates between covariates with so-called external and internal effect (see Equation~\eqref{eq:linpredextint} and the accompanying discussion).
\end{itemize}
In the following example, an INGARCH(1,1) model (ignoring the seasonal effect) is fitted to the Campylobacterosis data analyzed in Section~\ref{sec:campy}:
%v<<acp-package>>=
% library("acp")
% coef(acp(campy ~ -1, p = 1, q = 1))
% @
\begin{Schunk}
\begin{Sinput}
R> library("acp")
R> coef(acp(campy ~ -1, p = 1, q = 1))
\end{Sinput}
\begin{Soutput}
[1] 2.5320 0.5562 0.2295
\end{Soutput}
\end{Schunk}
<<acp-package2>>=
coef(tsglm(campy, model = list(past_obs = 1, past_mean = 1)))
@
The parameter estimations obtained by the \code{acp} function are very similar to those obtained by the \code{tsglm} function when fitting the same model.

%  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %

The class of generalized linear autoregressive moving average (GLARMA) models combines GLM with ARMA processes. A software implementation is available in the package \pkg{glarma} \citep{dunsmuir_glarma_2015}.
The GLARMA model assumes the conditional distribution of $Y_t$ given the past ${\cal F}_{t-1}$ to be Poisson or Negative Binomial with mean $\lambda_t$ and density $f(Y_t|\lambda_t)$, with $\lambda_t$ given by
\begin{align*}
g(\lambda_t) = \boldsymbol{\eta}^\top\boldsymbol{X}_t + O_t + Z_t,
\end{align*}
where $O_t$ is an offset term. An intercept is included by choosing the first column of the time-varying covariate matrix $\boldsymbol{X}_t$ to be the vector $(1,\dots,1)^\top$. Serial correlation is induced by an autoregressive moving average (ARMA) structure of $Z_t$, which is given by
\begin{align*}
Z_t = \sum_{k=1}^p \phi_k (Z_{t-i_k} + e_{t-i_k}) + \sum_{\ell=1}^q \psi_{\ell} e_{t-j_{\ell}}.
\end{align*}
Hereby the process $\{Z_t:t\in\mathbb{N}\}$ is defined by means of residuals $e_t$ which can be possibly rescaled, see \eqref{eq:resid_raw} and \eqref{eq:resid_pearson}. In the example below we choose Pearson residuals.
For the link function $g(\cdot)$, the \pkg{glarma} package currently supports only the logarithm but not the identity, which is available in our function \code{tsglm}. Like in our package, the user can specify the model order by considering the sets $P=\{i_1,\dots,i_p\}$ and $Q=\{j_1,\dots,j_q\}$.
The formulation of the GLARMA model we consider describes the modeling possibilities provided by the \pkg{glarma} package. In fact, this formulation is more general than the accompanying article by \citet{dunsmuir_glarma_2015}, where the authors consider the case $Q=\{1,\dots,q\}$ and $P=\{1,\dots,p\}$. Choosing $P$ and $Q$, in the context of GLARMA modeling, should be done cautiously \citep[see][Seection 3.4]{dunsmuir_glarma_2015}. Our limited experience shows that the minimum element of the set $Q$ should be chosen in such a way that it is larger than the maximum element of $P$ for avoiding errors.
Unlike ordinary ARMA models, GLARMA models are not driven by random innovations but by residuals $e_t$.
Note that the model fitted by the function \code{tsglm} is also related to ARMA processes, see \eqref{eq:armarepr} in the Appendix.
The function \code{glarma} implements maximum likelihood fitting of a GLARMA model. We compare the following model fitted to the Campylobacterosis data by the function \code{glarma} with a fit by \code{tsglm} (see Section~\ref{sec:campy}, but without the intervention effects):
% <<glarma-package1>>=
% library("glarma")
% glarmaModelEstimates(glarma(campy, phiLags = 1:3, thetaLags = 13,
%     residuals = "Pearson", X = cbind(intercept=rep(1, length(campy))),
%     type = "NegBin"))[c("Estimate", "Std.Error")]
% @
% %
% <<glarma-package2, include=FALSE, echo=FALSE, results=hide>>=
% campyfit_glarma <- glarma(campy, phiLags = 1:3, thetaLags = 13, 
%                           X = cbind(intercept=rep(1, length(campy))),
%                           type = "NegBin", residuals = "Pearson")
% @
\begin{Schunk}
\begin{Sinput}
R> library("glarma")
R> glarmaModelEstimates(glarma(campy, phiLags = 1:3, thetaLags = 13,
+      residuals = "Pearson", X = cbind(intercept=rep(1, length(campy))),
+      type = "NegBin"))[c("Estimate", "Std.Error")]
\end{Sinput}
\begin{Soutput}
          Estimate Std.Error
intercept  2.34110   0.10757
phi_1      0.22101   0.03940
phi_2      0.05978   0.04555
phi_3      0.09784   0.04298
theta_13   0.08602   0.03736
alpha     10.50823   1.91232
\end{Soutput}
\end{Schunk}
With the notation introduced above the fitted model for the number of new infections $Y_t$ in time period $t$ is given by $Y_t|{\cal F}_{t-1} \sim \text{NegBin}(\lambda_t,
%\Sexpr{#round(campyfit_glarma$delta[["alpha"]],2)}
10.51
)$ with $\log(\lambda_t) = 
% \Sexpr{#round(campyfit_glarma$delta[["intercept"]],2)}
2.34
+ Z_t$
and
\begin{align*}
Z_t = 
% \Sexpr{#round(campyfit_glarma$delta[["phi_1"]],2)}
0.22
(Z_{t-1} + e_{t-1}) + 
% \Sexpr{#round(campyfit_glarma$delta[["phi_2"]],2)}
0.06
(Z_{t-2} + e_{t-2}) + 
% \Sexpr{#round(campyfit_glarma$delta[["phi_3"]],2)}
0.1
(Z_{t-3} + e_{t-3}) + 
% \Sexpr{#round(campyfit_glarma$delta[["theta_13"]],2)}
0.09
e_{t-13}.
\end{align*}
We focus on the models' capability to explain the serial correlation which is present in the data. Considering the GLARMA model, a choice of $P=\{1,2,3\}$ and $Q=\{13\}$ leaves approximately uncorrelated residuals (see the autocorrelation function in Figure~\ref{fig:comparison-acf} (top right)). For the model class fitted by our function \code{tsglm} we have chosen the more parsimonious model with $P=\{1\}$ and $Q=\{13\}$ to obtain a fit with approximately uncorrelated residuals. 
Figure~\ref{fig:comparison-fit} shows that both models seem to provide an adequate fit to given data. 
The package \pkg{glarma} provides a collection of functions which can be applied to a fitted GLARMA model. For example it provides a function for testing whether there exists serial dependence and it offers tools for model diagnostics. 
To conclude, both models are able to explain quite general forms of serial correlation but the role of the dependence parameters is quite different and any results should be interpreted carefully. A more detailed comparison would be interesting but is beyond the scope of this thesis.

%  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %

Another class of models, which is closely related to the GLARMA models, are the so-called generalized autoregressive moving average (GARMA) models developed by \citet{benjamin_generalized_2003}. \citet[][Section 3]{dunsmuir_glarma_2015} remark that both model classes are similar in their structure but they have some important differences.
The GARMA model is formulated by
\begin{align*}
g(\lambda_t) = \boldsymbol{\eta}^\top\boldsymbol{X}_t + 
\sum_{k=1}^p \phi_k \left( g(Y_{t-k}) - \boldsymbol{\eta}^\top\boldsymbol{X}_{t-k} \right) + \sum_{\ell=1}^q \psi_{\ell} \left( g(Y_{t-k}) - g(\lambda_{t-\ell}) \right),
\end{align*}
where the notation follows the GLARMA notation.
Compared to the GLARMA model, the GARMA model does not include an offset and the ARMA structure applies to values which are transformed by the link function $g$, i.e., on the scale of the linear predictor. In case of a logarithmic link, the observations $Y_t$ are replaced by $\max(Y_t,c)$ for a threshold $c\in(0,1)$, such that $g(Y_t)$ is well-defined. In our package this problem is handled replacing $Y_t$ by $Y_t+1$.
The package \pkg{gamlss.util} \citep{stasinopoulos_gamlss.util:_2015} contains the function \code{garmaFit} for fitting such GARMA models. Like ordinary GLMs, these models are fitted by maximum likelihood employing the IRLS algorithm. As pointed out on the accompanying help page, the function \code{garmaFit} does not guarantee stationarity of the fitted model. Additionally, the function \code{garmaFit} does not allow to specify serial dependence of higher order without including all lower orders, which would be necessary for parsimoniously describing stochastic seasonality. The following example shows a fit of a Negative Binomial GARMA model of order $p=1$ and $q=1$ with link $g(\cdot)=\log(\cdot)$ to the Campybacterosis data:
% <<gamlss.util-package>>=
% library("gamlss.util")
% coef(garmaFit(campy ~ 1, order = c(1, 1), family = NBI(mu.link = "log")))
% @
\begin{Schunk}
\begin{Sinput}
R> library("gamlss.util")
R> coef(garmaFit(campy ~ 1, order = c(1, 1), family = NBI(mu.link = "log")))
\end{Sinput}
\begin{Soutput}
deviance of linear model=  891.1
deviance of  garma model=  803.3
beta.(Intercept)             phi            theta
          2.6216          0.7763          -0.2917
\end{Soutput}
\end{Schunk}\begin{verbatim}
deviance of linear model=  891.1
deviance of  garma model=  803.3
beta.(Intercept)             phi            theta
          2.6216          0.7763          -0.2917
\end{verbatim}
In the above output the AR coefficient $\phi_1$ is named \code{phi} and the MA coefficient $\psi_1$ \code{theta}.
The function \code{garma} from the package \pkg{VGAM} \citep{yee_vgam:_2016} is an alternative implementation for fitting GARMA models. However, the accompanying help page warns that this function is still in premature stage and points to potential problems with the initialization (in version 1.0-1 of the package). In addition, \code{garma} allows only for autoregressive modeling (i.e. $q=0$) and the Negative Binomial distribution is not supported. Hence our example can only show a fit of a Poisson GARMA model of order $p=1$ and $q=0$ to the Campylobacterosis data:
% <<VGAM-package2>>=
% coef(vglm(campy ~ 1, family = garma(link="loge", p.ar.lag = 1,
%                                q.ma.lag = 0, coefstart = c(0.1, 0.1))))
% @
\begin{Schunk}
\begin{Sinput}
R> coef(vglm(campy ~ 1, family = garma(link="loge", p.ar.lag = 1,
+                                  q.ma.lag = 0, coefstart = c(0.1, 0.1))))
\end{Sinput}
\begin{Soutput}
(Intercept)      (lag1) 
   55952455   -71039968 
\end{Soutput}
\end{Schunk}
In this example the estimated coefficient for the autoregressive term is larger than one, which suggests that the fitted process is not stationary.
We could not find settings for which the functions \code{garmaFit} and \code{garma} fit the same model and give identical or at least similar results.
Due to the close relationship of GARMA and GLARMA models we refrain from presenting a comparison to a fit with our package and refer to the comparison in the previous paragraph made for GLARMA models.

%  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %

The models presented so far are determined by a single source of randomness, i.e., given all past observations, uncertainty is only induced by the Poisson or Negative Binomial distribution from which the observations are assumed to be drawn. These models belong to the class of observation-driven models according to the classification of \citet{cox_statistical_1981}.
In the following paragraphs we present parameter-driven models. These models are determined by multiple sources of randomness introduced by one or more innovation processes. \citet{helske_kfas:_2016} comment on the merits of both approaches. He argues that parameter-driven models are appealing because they allow to introduce even multiple latent structures in a flexible way.
On the other hand, he observes that observation-driven models like the ones we consider are of advantage for prediction because of their explicit dependence on past observations and covariates.

%  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %

The package \pkg{surveillance} \citep{salmon_monitoring_2016} includes methods for online change point detection in count time series.
The included function \code{hhh4} fits the model proposed by \citet{held_statistical_2005}, where it is assumed that $Y_t \sim \text{NegBin}(\lambda_t,\phi)$. The conditional mean $\lambda_t$ is given by
\begin{align*}
\lambda_t = \exp\left(\beta_0 + \delta t + \gamma_t \right) + \beta_1 Y_{t-1},
\end{align*}
where $\gamma_t$ is a periodic function describing a seasonal effect. The exponential function is applied to the linear trend and the seasonal effect but not to the autoregressive component such that this model is not a GLM (since the linear predictor is not linear in the parameters) but could instead be regarded as a generalized additive model (GAM).
Another type of model considered by the \pkg{surveillance} package are hierarchical time series (HTS), as proposed by \citet{manitz_bayesian_2013} based on the work by \citet{heisterkamp_automated_2006}. This particular state space model accounts for serial dependence by a time-varying intercept. More precisely, it is assumed that the conditional mean $\lambda_t$ is given by
\begin{align*}
\lambda_t = \exp\left(\beta_{0,t} + \delta t + \gamma_t + \boldsymbol{\eta}^{\top}\boldsymbol{X}_t \right).
\end{align*}
The time-varying intercept $\beta_{0,t}$ is assumed to depend on its previous values according to
\begin{align*}
\Delta_d\beta_{0,t}|\beta_{0,t-1},\dots,\beta_{0,t-d} \sim \text{N}(0,\kappa_{\beta_0}^{-1}),
\end{align*}
for $d=1,2,3$, respectively. For $d>0$ this induces dependence between successive observations. The other parameters, $\delta$ for the linear trend, $\gamma_t$ for a seasonal effect, and the vector $\boldsymbol{\eta}$ for the effect of a covariate vector $\boldsymbol{X}_t$, are also assumed to be normally distributed with certain priors. Inference is done in a Bayesian framework and utilizes an efficient integrated nested Laplace approximation (INLA) provided by the package \pkg{INLA} \citep{lindgren_bayesian_2015} (available from \url{http://www.r-inla.org}).
In the following example we fit a Negative Binomial model without trend, seasonality or covariate effects but with a time-varying intercept of order $d=1$ to the Campylobacterosis data:
% <<INLA-package-install, include=FALSE, echo=FALSE, results=hide, eval=FALSE>>=
% #The INLA package is not available on CRAN and needs to installed from another repository if it is not yet available.
% if(!require("INLA")) install.packages("INLA", repos="https://www.math.ntnu.no/inla/R/stable")
% @
% <<INLA-package, eval=FALSE>>=
% library("INLA")
% campyfit_INLA <- inla(ts ~ f(time, model = "rw1", cyclic = FALSE),
%                 data = data.frame(time = seq(along = campy), ts = campy),
%                 family = "nbinomial", E = mean(campy),
%                 control.predictor = list(compute = TRUE, link = 1),
%                 control.compute = list(cpo = FALSE, config = TRUE),
%                 control.inla = list(int.strategy = "grid", dz = 1,
%                                     diff.logdens = 10))
% posterior <- inla.posterior.sample(1000, campyfit_INLA)
% rowMeans(sapply(posterior, function(x) (unname(x$hyperpar))))
% @
% <<INLA-package2, include=FALSE, echo=FALSE, results=hide, eval=FALSE>>=
% mu <- rowMeans(sapply(posterior,
%                       function(x) exp(unname(x$latent[seq(along=campy), 1]))))
% campyfitted_INLA <- mu*mean(campy)
% @
% <<seatbelts3a, echo=FALSE>>=
% #param_INLA <- rowMeans(sapply(posterior, function(x) (unname(x$hyperpar))))
% #save(campyfitted_INLA, param_INLA, file="INLA.RData")
% load("INLA.RData")
% param_INLA
% @
\begin{Schunk}
\begin{Sinput}
R> library("INLA")
R> campyfit_INLA <- inla(ts ~ f(time, model = "rw1", cyclic = FALSE),
+                  data = data.frame(time = seq(along = campy), ts = campy),
+                  family = "nbinomial", E = mean(campy),
+                  control.predictor = list(compute = TRUE, link = 1),
+                  control.compute = list(cpo = FALSE, config = TRUE),
+                  control.inla = list(int.strategy = "grid", dz = 1,
+                                      diff.logdens = 10))
R> posterior <- inla.posterior.sample(1000, campyfit_INLA)
R> rowMeans(sapply(posterior, function(x) (unname(x$hyperpar))))
\end{Sinput}
\end{Schunk}
\begin{Schunk}
\begin{Soutput}
[1] 329.86  33.79
\end{Soutput}
\end{Schunk}
The estimates for the parameters $\phi$ (the former) and $\kappa_{\beta_0}$ (the latter) in the output above are based on means of a sample of size 1000 from the posterior distribution. Fitted values $\widehat{\lambda}_1,\dots,\widehat{\lambda}_{140}$ are obtained in the same way (not shown in the above code).
With the Bayesian approach it is very natural to obtain prediction intervals for future observations from the posterior distribution which account for the estimation and observation uncertainties. This is a clear advantage over the classical likelihood-based approach pursued in our package (cf. Section~\ref{sec:prediction}). A disadvantage of the Bayesian approach is its much higher computational effort which could be an obstacle for real-time applications and simultaneous analysis of several time series. The above example runs more than eight seconds on a standard office computer (Intel Xeon CPU with 2.83~GHz); this is seven times longer than \code{tsglm} takes to fit the model.
An additional difference between our approach and that taken by INLA is the specification of temporal dependence. The comparison of the final fitted values, shown in Figure~\ref{fig:comparison-fit}, illustrates that the model with a time-varying intercept fitted by \code{inla} possess a much smoother line through the observed values when compared to the model fitted by \code{tsglm}.
For this example, the empirical autocorrelation function of the response residuals in Figure~\ref{fig:comparison-acf} (bottom left) is significantly different from zero at lag one; hence short term temporal correlation is not explained sufficiently by the hierarchical model. It also becomes clear by this plot that we should have included seasonality by employing the term $\gamma_t$. The residuals of the GLM-based fit by the function \code{tsglm} do not exhibit any serial correlation which has not been explained by the model (see Figure~\ref{fig:comparison-acf} (top left)).
In general, the GLM-based model is expected to provide more accurate 1-step-ahead predictions whilst the hierarchical model prediction obtained by \code{inla} is more stable. Either of these two features could be preferable depending upon the specific application. It would be interesting to study these two ways of modeling temporal dependence in a future work.

%  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %

The package \pkg{KFAS} \citep{helske_kfas:_2016-1} treats state space models for multivariate time series where the distribution of the observations belongs to the exponential family (and also includes the Negative Binomial distribution). Its name refers to Kalman filtering and smoothing, which are the two key algorithms employed by the package. This package is able to cope with very general state space models at the cost of a rather big effort for its correct specification. However, some auxiliary functions and the use of symbolic model description reduces this effort. In contrast to the package \pkg{INLA}, which is also capable of fitting state space models, \pkg{KFAS} implements maximum likelihood estimation \citep[for a comparison of these two packages see][]{helske_kfas:_2016}.
One possible univariate model for the Campylobacterosis data could be the state space model $Y_t|{\cal F}_{t-1} \sim \text{NegBin}(\lambda_t,\phi)$ where $\lambda_t=\exp(\nu_t)$ and the state equation is
\begin{align*}
\nu_t = \nu_{t-1} + \varepsilon_t.
\end{align*}
The initialization $\nu_1$ is specified by assuming $\nu_1 \sim \text{N}(\lambda, \sigma_{\nu}^2)$. The degree of serial dependence is induced by independently distributed innovations $\varepsilon_t$ for which it is assumed that $\varepsilon_t \sim \text{N}(0, \sigma_{\varepsilon}^2)$. This model has unknown parameters $\lambda\in\mathbb{R}$, and $\phi,\sigma_{\nu_1}^2,\sigma_{\varepsilon}^2\in[0,\infty)$ and can be fitted as follows:
% <<KFAS-package>>=
% library("KFAS")
% model <- SSModel(campy ~ SSMcustom(Z = 1, T = 1, R = 1, Q = 0,
%                                    a1 = NA, P1 = NA) - 1, 
%                  distribution = "negative binomial", u = NA)
% updatefn <- function(pars, model, ...){
%   model$a1[1, 1] <- pars[1]
%   model$u[, 1] <- exp(pars[2])
%   model$P1[1, 1] <- exp(pars[3])
%   model$Q[1,1,1] <- exp(pars[4])
%   return(model)
% }
% campyfit_KFAS <- fitSSM(model = model, inits = c(mean(campy), 0, 0, 0),
%                 updatefn = updatefn)
% exp(campyfit_KFAS$optim.out$par)
% @
\begin{Schunk}
\begin{Sinput}
R> library("KFAS")
R> model <- SSModel(campy ~ SSMcustom(Z = 1, T = 1, R = 1, Q = 0,
+                                     a1 = NA, P1 = NA) - 1, 
+                   distribution = "negative binomial", u = NA)
R> updatefn <- function(pars, model, ...){
+    model$a1[1, 1] <- pars[1]
+    model$u[, 1] <- exp(pars[2])
+    model$P1[1, 1] <- exp(pars[3])
+    model$Q[1,1,1] <- exp(pars[4])
+    return(model)
+  }
R> campyfit_KFAS <- fitSSM(model = model, inits = c(mean(campy), 0, 0, 0),
+                  updatefn = updatefn)
R> exp(campyfit_KFAS$optim.out$par)
\end{Sinput}
\begin{Soutput}
[1] 3.427e+00 9.148e+01 2.775e-16 4.334e-02
\end{Soutput}
\end{Schunk}
%
% <<comparison-acf, fig=TRUE, include=FALSE, echo=FALSE, results=hide, height=3.7, width=6>>=
% par(mar=c(3, 2.3, 0.5, 0.5), mgp=c(1.4, 0.5, 0))
% layout(matrix(1:4, ncol=2))
% acf(campy - fitted(campyfit_tsglm), main="", ylim=c(-0.26,1))
% legend("top", bty="n", legend="", title="tsglm", cex=1.3) 
% acf(campy - fitted(campyfit_glarma), main="", ylim=c(-0.26,1))
% legend("top", bty="n", legend="", title="glarma", cex=1.3) 
% acf(campy - campyfitted_INLA, main="", ylim=c(-0.26,1))
% legend("top", bty="n", legend="", title="INLA", cex=1.3)
% acf(campy - predict(campyfit_KFAS$model), main="", ylim=c(-0.26,1))
% legend("top", bty="n", legend="", title="KFAS", cex=1.3)
% @
% <<comparison-fit, fig=TRUE, include=FALSE, echo=FALSE, results=hide, height=2.8, width=7>>=
% par(mar=c(3, 3, 0.5, 0.5), mgp=c(1.8, 0.6, 0))
% plot(campy, type="p", xlim=c(1996, 2000.6), ylab="Number of cases", main="")
% lines(fitted(campyfit_tsglm), lwd=2, lty="solid")
% lines(as.numeric(time(campy)), fitted(campyfit_glarma), lwd=2, lty="dashed", col="darkorange")
% lines(as.numeric(time(campy)), campyfitted_INLA, lwd=2, lty="longdash", col="blue")
% lines(as.numeric(time(campy)), predict(campyfit_KFAS$model), lwd=2, lty="dotdash", col="red")
% legend("topright", legend=c("tsglm", "glarma", "INLA", "KFAS"), lwd=2, lty=c("solid", "dashed", "longdash", "dotdash"), col=c("black", "darkorange", "blue", "red"), seg.len=5)
% @
%
\begin{figure}[tbp]
\centering
\includegraphics[width=\figurewidth]{tscount-comparison-acf}
\caption{Empirical autocorrelation function of the response residuals for a model fit of the campylobacterosis data by our function \code{tsglm} (see Section~\ref{sec:campy}, but without the intervention effects) and the packages \pkg{glarma}, \pkg{INLA} and \pkg{KFAS} (see Section~\ref{sec:comparison}).}
\label{fig:comparison-acf}
\end{figure}
%
\begin{figure}[bt]
\centering
\includegraphics[width=\figurewidth]{tscount-comparison-fit}
\caption{Comparison of a model fit of the campylobacterosis data by our function \code{tsglm} (see Section~\ref{sec:campy}, but without the intervention effects) and the packages \pkg{glarma}, \pkg{INLA} and \pkg{KFAS} (see Section~\ref{sec:comparison}).}
\label{fig:comparison-fit}
\end{figure}
%
The output above corresponds to the estimated parameters $\lambda$, $\phi$, $\sigma_{\nu_1}^2$ and $\sigma_{\varepsilon}^2$, respectively. We observe that the estimation procedure is quite sensitive to given starting values; this fact has been pointed out by \citet{helske_kfas:_2016}. As shown by the empirical autocorrelation function of the residuals in Figure~\ref{fig:comparison-acf} (bottom right), the fitted model explains the temporal dependence of the data quite adequately.
The values fitted by this model (see Figure~\ref{fig:comparison-fit}) do not show any delay when compared to the fit obtained by \code{tsglm}. The algorithm used by \code{tsglm} yields fitted values by 1-step-ahead forecasts based on previous observations; note that only the model parameters are fitted using all available observations. The algorithm of the \pkg{KFAS} package for obtaining fitted values includes future observations which naturally lead to a more accurate fit. However, this methodology does not guarantee better out-of-sample forecasting performance since future observations will not be available in general. Further empirical comparison between \pkg{KFAS} and \pkg{tscount} is required to compare the accuracy of predictions obtained by both models.

%  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %

Another state space model which could be used to describe count time series is a partially observed Markov process (POMP). The package \pkg{pomp} \citep{king_statistical_2016} provides a general and abstract representation of such models. One example by \citet[][Sections 4.5 and 4.6]{king_statistical_2016} is the so-called Ricker model for describing the size $N_t$ of a population which is assumed to fulfill
\begin{align*}
N_{t+1} = r N_t \exp(-N_t + \varepsilon_t)
\end{align*}
with innovations $\varepsilon_t \sim \text{N}(0,\sigma_{\varepsilon}^2)$. The actual observations $Y_t$ are noisy measurements of the population size $N_t$ and it is assumed to hold $Y_t \sim \text{Poisson}(\phi N_t)$, where $\phi$ is an unknown dispersion parameter. Specification and fitting of this rather simple model with the package \pkg{pomp} requires more than thirty lines of code. This complexity is an obstacle for using the package in standard situations but might prove beneficial in very special scenarios.

%  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %

The package \pkg{gcmr} provides methodology for Gaussian copula marginal regression \citep{masarotto_gaussian_2012}, a framework which is also capable to model count time series. The marginal distribution of a time series $Y_t$ given a covariate vector $\boldsymbol{X}_t$ can be modeled by a Poisson or Negative Binomial distribution with mean $\lambda_t$ using that $g(\lambda_t) = \beta_0  + \boldsymbol{\eta}^\top \boldsymbol{X}_t$. This is similar to model \eqref{eq:linpred} but it does not include the terms for regression on past values of $Y_t$ and $\lambda_t$.
Furthermore, randomness is introduced through an unobserved error process $\{\varepsilon_t:t\in\mathbb{N}\}$ by assuming that $Y_t=F_t^{-1}\left(\Phi(\varepsilon_t)\right)$, where $F_t$ is the cumulative distribution function of the Poisson or Negative Binomial distribution with mean $\lambda_t$ and $\Phi$ is the cumulative distribution function of the standard normal distribution. Hence the actual value of $Y_t$ is the $\Phi(\varepsilon_t)$-quantile of the Poisson or Negative Binomial distribution with mean $\lambda_t$.
As pointed out by \citet{masarotto_gaussian_2012}, copulas with discrete marginals might not be unique \citep[see also][]{genest_primer_2007}.
Temporal dependence of $\{Y_t\}$ is modeled through the error process $\{\varepsilon_t\}$ by assuming an autoregressive moving average (ARMA) model of order $p$ and $q$. Note that although this model accounts for serial dependence it does not model the conditional distribution of $Y_t$ given the past but only its time-varying marginal distribution. The mean $\lambda_t$ of this marginal distribution is not influenced by the actual (unobserved) value of the error $\varepsilon_t$. Hence this model is not suitable for accurate 1-step-ahead predictions but rather for quantifying the additional uncertainty induced by the serial dependence.
The following example presents a fit of a Negative Binomial model with an ARMA error process of order $p=1$ and $q=1$:
% <<gcmr-package-pre, include=FALSE, echo=FALSE, results=hide>>=
% width <- getOption("width")
% options(width=50)
% @
% <<gcmr-package, warning=TRUE>>=
% library("gcmr")
% gcmr(ts ~ 1, marginal = negbin.marg(link = "identity"),
%      cormat = arma.cormat(p=1, q=1), data = data.frame(ts = campy))
% @
% <<gcmr-package-post, include=FALSE, echo=FALSE, results=hide>>=
% options(width=width)
% @
\begin{Schunk}
\begin{Sinput}
R> library("gcmr")
R> gcmr(ts ~ 1, marginal = negbin.marg(link = "identity"),
+       cormat = arma.cormat(p=1, q=1), data = data.frame(ts = campy))
\end{Sinput}
\begin{Soutput}
Call:
gcmr(formula = ts ~ 1, data = data.frame(ts = campy), 
    marginal = negbin.marg(link = "identity"), 
    cormat = arma.cormat(p = 1, q = 1))

Marginal model parameters:
(Intercept)   dispersion  
     11.315        0.255  

Gaussian copula parameters:
   ar1     ma1  
 0.824  -0.270  
\end{Soutput}
\end{Schunk}
The estimated ARMA parameters of the error process indicate that there is a considerable amount of serial correlation in the data. Some of the observed variability is explained by this serial dependence. As discussed above, the fitted values are based solely on the estimated marginal distribution and are therefore constant over time (equal to the estimated intercept in the above output).

%  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %

An extension of the zero-inflated models of the package \pkg{ZIM}, which was presented in the previous section, are the so-called dynamic zero-inflated models (DZIM) as proposed by \citet{yang_state-space_2015}. These fall within the framework of state space models and introduce serial dependence by an unobserved autoregressive process of order $p$. Fitting is based on the EM algorithm and MCMC simulation.
% <<ZIM-package2, warning=TRUE>>=
% dzim(campy ~ 1, data = campydata, control = dzim.control(dist = "poisson", order=1)) #data has no zeros!
% @

%  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %  %

The package \pkg{tsintermittent} \citep{kourentzes_tsintermittent:_2016} provides methods for so-called intermittent demand time series \citep[see for example][]{kourentzes_intermittent_2014}. These are time series giving the number of requested items of a particular product (e.g., rarely needed spare parts) which is demanded in a sporadic fashion with periods of zero demand. Reliable forecasts of such time series are important for companies to efficiently plan stocking the respective items. In principle, these are count time series which could be analyzed with the methods in our package. However, it is known that classical forecasting methods perform unsatisfactorily for this kind of data \citep{kourentzes_intermittent_2014}. More successful approaches, which are included in the \pkg{tsintermittent} package, are based on the idea of separately modeling the non-zero demand size and the inter-demand intervals. These methods do not take into account that the observations are integers and do not include covariate effects, as the methods in our package do. Temporal dependence is considered implicitly, by assuming that there are periods where subsequent observations are zero. Our functions consider explicit time dependence. The methods in the \pkg{tsintermittent} package are possibly more appropriate for the specific context they are tailored for (which would need further examination), but not suitable for count time series in general. They compete with other types of models for zero excess time series data, for example with DZIM.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Outlook}
\label{sec:outlook}

In its current version, the \proglang{R} package \pkg{tscount} allows for the analysis of count time series with a quite broad class of models. It will hopefully prove to be useful for a wide range of applications. There is a number of desirable extensions of the package which could be included in future releases. We invite other researchers and developers to contribute to this package.

As an alternative to the Negative Binomial distribution, one could consider the so-called Quasi-Poisson model. It allows for a conditional variance of $\phi\lambda_t$ (instead of $\lambda_t+\phi\lambda_t^2$, as for the Negative Binomial distribution), which is linearly and not quadratically increasing in the conditional mean $\lambda_t$ \citep[for the case of independent data see][]{ver_hoef_quasi-poisson_2007}. A scatterplot of the squared residuals against the fitted values could reveal whether a linear relation between conditional mean and variance is more adequate for a given time series. A generalization of the test for overdispersion in INGARCH(1,0) processes proposed by \citet{weis_detecting_2015} could provide guidance for choosing an appropriate conditional distribution.

The common regression models for count data are often not capable to describe an exceptionally large number of observations with the value zero. In the literature so-called zero-inflated and hurdle regression models have become popular for zero excess count data \citep[for an introduction and comparison see][]{loeys_analysis_2012}. A first attempt to utilize zero-inflation for INGARCH time series models is made by \citet{zhu_zero-inflated_2012}.

In some applications the variable of interest is not the number of events but the rate, which expresses the number of events per unit. For example the number of infected people per 10\,000 inhabitants, where the population size is a so-called exposure variable which varies over time. For models with a logarithmic link function such a rate could be described by a model where the number of events is the response variable and the logarithm of the exposure variable is a so-called offset. An offset is supported by many standard functions for GLMs and could be part of a future release of our package.

Alternative nonlinear models are for example the threshold model suggested by \citet{woodard_stationarity_2011} or the models studied by \citet{fokianos_nonlinear_2012}. \citet{fokianos_goodness--fit_2013} propose a class of goodness-of-fit tests for the specification of the linear predictor, which are based on the smoothed empirical process of Pearson residuals. \citet{christou_estimation_2015} develop suitably adjusted score tests for parameters which are identifiable as well as non-identifiable under the null hypothesis. These tests can be employed to test for linearity of an assumed model.

In practical applications one is often faced with outliers. \citet{elsaied_robust_2014} and \citet{kitromilidou_robust_2016} develop M-estimators for the linear and the log-linear model, respectively.
\citet{fried_outliers_2014} compare robust estimators of the (partial) autocorrelation \citep[see also][]{durre_robust_2015} for time series of counts, which can be useful for identifying the correct model order.
%cite R package robts once it is published on CRAN

In the long term, related models for binary or categorical time series \citep{moysiadis_binary_2014} or potential multivariate extensions of count time series following GLMs could be included as well.

The models which are so far included in the package or mentioned above fall into the class of time series following GLMs.
There is also quite a lot of literature on thinning-based time series models but we are not aware of any publicly available software implementations. To name just a few of many publications, \citet{weis_thinning_2008} reviews univariate time series models based on the thinning operation, \citet{pedeli_composite_2013} study a multivariate extension and \citet{scotto_bivariate_2014} consider models for time series with a finite range of counts.
For the wide class of state space models there are the \proglang{R} packages \pkg{INLA}, \pkg{KFAS} and \pkg{pomp} available, although it is quite complex to apply these to count time series. A future version of our package could provide simple interfaces to such packages specifically for fitting certain count time series models.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section*{Acknowledgements}

Part of this work was done while K.~Fokianos was a Gambrinus Fellow at TU Dortmund University.
The research of R.~Fried and T.~Liboschik was supported by the German Research Foundation (DFG, SFB~823 ``Statistical modelling of nonlinear dynamic processes'').
The authors thank 
Philipp Probst for his considerable contribution to the development of the package and Jonathan Rathjens for carefully checking the package.
The fruitful discussions with Ma\"{e}lle Salmon (formerly at Robert Koch Institute, Berlin, Germany; currently at Center for Research in Environmental Epidemiology, Barcelona, Spain) improved both the package and this manuscript.
The authors also thank the handling editor and two referees for their helpful and constructive comments which have improved the final version of this manuscript.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\bibliography{bibliography}
%\bibliography{../../../../../Literatur/bibliography}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\newpage
\appendix

\section{Implementation details}
\label{app:implementation}

\subsection{Parameter space for the log-linear model}
\label{app:paramspace}

The parameter space $\Theta$ for the log-linear model~\eqref{eq:loglin} which guarantees a stationary and ergodic solution of the process is subject of current research. 
In the current implementation of \code{tsglm} the parameters need to fulfill the condition
\begin{align}
\max\left\{
|\beta_1|,\dots,|\beta_p|,
|\alpha_1|,\dots,|\alpha_q|,
\left|\sum\limits_{k=1}^p \beta_k + \sum\limits_{\ell=1}^q \alpha_{\ell} \right|
\right\} < 1.
\label{eq:paramspace_tsglm}
\end{align}
At the time we started developing \pkg{tscount}, \eqref{eq:paramspace_tsglm} appeared as a reasonable extension of the condition
\begin{align*}
\max\left\{ |\beta_1|, |\alpha_1|, |\beta_1+\alpha_1| \right\} < 1,
\end{align*}
which \citet[][Lemma 14]{douc_ergodicity_2013} derive for $p=q=1$.
However, in a recent work, \citet[][Proposition~5.4.7]{sim_maximum_2016} derives sufficient conditions for a model of order $p=q$.
For the first order model he obtains the weaker condition
\begin{align*}
\max\left\{ |\alpha_1|, |\beta_1+\alpha_1| \right\} < 1.
\end{align*}
For $p=q=2$ the required condition is
\begin{align}
\begin{split}
\max \big\{ \ &
\left| \alpha_1 \right| + \left| \alpha_2 \right| + \left| \beta_2 \right|,
\left| \alpha_1  \alpha_2 \right| + \left| \alpha_2 \alpha_1^2 \right| + \left| \alpha_1 + \beta_2 \right|, %
\left| \alpha_2 \right| + \left| \alpha_1 + \beta_1 \right| + \left| \beta_2 \right|, \\
& \left| \alpha_2  (\alpha_1 + \beta_1) \right| + \left| \alpha_2 + \alpha_1  (\alpha_1 + \beta_1) \right| + \left| (\alpha_1 + \beta_1)  \beta_2 \right|, \\
& \left| \alpha_1 \alpha_2 \right| + \left| \alpha_2 + \alpha_1 (\alpha_1 + \beta_1) + \beta_2 \right| + \left| \alpha_1 \beta_2 \right|, \\
& \left| (\alpha_1 + \beta_1)  \alpha_2 \right| + \left| \alpha_2 + (\alpha_1 + \beta_1)^2 + \beta_2 \right| + \left| (\alpha_1 + \beta_1)  \beta_2 \right|
\big\} \ < 1. \label{eq:paramspace_sim22}
\end{split}
\end{align}
There are parameters which fulfill \eqref{eq:paramspace_tsglm} but not \eqref{eq:paramspace_sim22} (e.g., $\beta_1=-0.9$, $\beta_2=0.9$, $\alpha_1=0$, $\alpha_2=0$) and vice versa (e.g., $\beta_1=-1.8$, $\beta_2=0$, $\alpha_1=0.9$, $\alpha_2=0$). However, there exists a large intersection between values which fulfill \eqref{eq:paramspace_tsglm} and \eqref{eq:paramspace_sim22}.
For the general case $p=q$ the condition can be obtained by considering the maximum among $p$ elements of the norms of matrix products with $p$ factors, where each factor corresponds to a $(2p-1)\times(2p-1)$ matrix. The implementation of this condition is a challenging problem and therefore we have decided in favor of \eqref{eq:paramspace_tsglm}. Alternatively, we can obtain unconstrained estimates (argument \code{final.control = list(constrained = NULL)}), which should be examined carefully.



\subsection{Recursions for inference and their initialization}
\label{app:recursions}

Let $h$ be the inverse of the link function $g$ and let $h'(x)=\partial h(x) / \partial x$ be its derivative.
In the case of the identity link $g(x)=x$ it holds $h(x)=x$ and $h'(x)=1$ and in the case of the logarithmic link $g(x)=\log(x)$ it holds $h(x)=h'(x)=\exp(x)$.
The partial derivative of the conditional mean $\lambda_t(\boldsymbol{\theta})$ is given by
\begin{align*}
\frac{\partial \lambda_t(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}}
= h'\left(\nu_t(\boldsymbol{\theta})\right) \frac{\partial \nu_t(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}},
\end{align*}
where the vector of partial derivatives of the linear predictor $\nu_t(\boldsymbol{\theta})$,
\begin{align*}
\frac{\partial \nu_t(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} = \left(
\frac{\partial \nu_t(\boldsymbol{\theta})}{\partial \beta_0},
\frac{\partial \nu_t(\boldsymbol{\theta})}{\partial \beta_1},
\dots,
\frac{\partial \nu_t(\boldsymbol{\theta})}{\partial \beta_p},
\frac{\partial \nu_t(\boldsymbol{\theta})}{\partial \alpha_1},
\dots,
\frac{\partial \nu_t(\boldsymbol{\theta})}{\partial \alpha_q},
\frac{\partial \nu_t(\boldsymbol{\theta})}{\partial \eta_1},
\dots,
\frac{\partial \nu_t(\boldsymbol{\theta})}{\partial \eta_r}
\right)^\top,
\end{align*}
can be computed recursively.
The recursions are given by
\begin{align*}
\frac{\partial \nu_t(\boldsymbol{\theta})}{\partial \beta_0} &= 1 + \sum_{\ell=1}^q \alpha_{\ell} \frac{\partial \nu_{t-j_{\ell}}(\boldsymbol{\theta})}{\partial \beta_0}, \\
\frac{\partial \nu_t(\boldsymbol{\theta})}{\partial \beta_s} &= \widetilde{g}(Y_{t-i_s}) + \sum_{\ell=1}^q \alpha_{\ell} \frac{\partial \nu_{t-j_{\ell}}(\boldsymbol{\theta})}{\partial \beta_s}, \quad s=1,\dots,p, \\
\frac{\partial \nu_t(\boldsymbol{\theta})}{\partial \alpha_s} &= \sum_{\ell=1}^q \alpha_{\ell} \frac{\partial \nu_{t-j_{\ell}}(\boldsymbol{\theta})}{\partial \alpha_s} + \nu_{t-j_s}(\boldsymbol{\theta}), \quad s=1,\dots,q, \\
\frac{\partial \nu_t(\boldsymbol{\theta})}{\partial \eta_s} &= \sum_{\ell=1}^q \alpha_{\ell} \frac{\partial \nu_{t-j_{\ell}}(\boldsymbol{\theta})}{\partial \eta_s} + X_{t,s}, \quad s=1,\dots,r.
\end{align*}

The recursions for the linear predictor $\nu_t=g(\lambda_t)$ and its partial derivatives depend on past values of the linear predictor and its derivatives, which are generally not observable. We implemented three possibilities for initialization of these values.
The default and preferable choice is to initialize by the respective marginal expectations, assuming a model without covariate effects, such that the process is stationary (argument \code{init.method = "marginal"}). For the linear model~\eqref{eq:linear} it holds \citep{ferland_integer-valued_2006}
\begin{align}
\E(Y_t) = \E(\nu_t) = \frac{\beta_0}{1-\sum_{k=1}^p\beta_k-\sum_{\ell=1}^q\alpha_{\ell}} =: \mu(\boldsymbol{\theta}).
\end{align}
For the log-linear model~\eqref{eq:loglin} we instead consider the transformed time series $Z_t:=\log(Y_t+1)$, which has approximately the same second order properties as a time series from the linear model~\eqref{eq:linear}. It approximately holds $\E(Z_t)\approx\E(\nu_t)\approx\mu(\boldsymbol{\theta})$.
Specifically, we initialize past values of $\nu_t$ by $\mu(\boldsymbol{\theta})$ and past values of $\partial \nu_t(\boldsymbol{\theta}) / \partial \boldsymbol{\theta}$ by
\begin{align*}
\frac{\partial \mu(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} = \left(
\frac{\partial \mu(\boldsymbol{\theta})}{\partial \beta_0},
\frac{\partial \mu(\boldsymbol{\theta})}{\partial \beta_1},
\dots,
\frac{\partial \mu(\boldsymbol{\theta})}{\partial \beta_p},
\frac{\partial \mu(\boldsymbol{\theta})}{\partial \alpha_1},
\dots,
\frac{\partial \mu(\boldsymbol{\theta})}{\partial \alpha_q},
\frac{\partial \mu(\boldsymbol{\theta})}{\partial \eta_1},
\dots,
\frac{\partial \mu(\boldsymbol{\theta})}{\partial \eta_r}
\right)^\top,
\end{align*}
which is explicitly given by
\begin{align*}
& \frac{\partial \mu(\boldsymbol{\theta})}{\partial \beta_0} = \frac{1}{1-\sum_{k=1}^p\beta_k-\sum_{\ell=1}^q\alpha_{\ell}}, \\
& \frac{\partial \mu(\boldsymbol{\theta})}{\partial \beta_k} = \frac{\partial \mu(\boldsymbol{\theta})}{\partial \alpha_{\ell}} = \frac{\beta_0}{\bigl(1-\sum_{k=1}^p\beta_k-\sum_{\ell=1}^q\alpha_{\ell}\bigr)^2}, \quad k=1,\dots,p,\ \ell=1,\dots,q, \quad \text{and}\\
& \frac{\partial \mu(\boldsymbol{\theta})}{\partial \eta_m} = 0, \quad m=1,\dots,r.
\end{align*}
Another possibility is to initialize $\nu_t$ by $\beta_0$ and $\partial \nu_t(\boldsymbol{\theta}) / \partial \boldsymbol{\theta}$ by zero. In this case the model corresponds to standard i.i.d. Poisson random variables (argument \code{init.method = "iid"}).
A third possibility would be a data-dependent initialization of $\nu_t$, for example by $\widetilde{g}(y_1)$. In this case, the partial derivatives of $\nu_t$ are initialized by zero (argument \code{init.method = "firstobs"}).

The recursions also depend on unavailable past observations of the time series, prior to the sample which is used for the likelihood computation. The package allows to choose between two strategies to cope with that.
The default choice is to replace these pre-sample observations by the same initializations as used for the linear predictor $\nu_t$ (see above), transformed by the inverse link function $h$ (argument \code{init.drop = FALSE}).
An alternative is to use the first $i_p$ observations for initialization and to compute the log-likelihood on the remaining observations $y_{i_p+1},\dots,y_n$ (argument \code{init.drop = TRUE}). Recall that $i_p$ is the highest order for regression on past observations.

Particularly in the presence of strong serial dependence, the different methods for initialization can affect the estimation substantially even for quite long time series with 1000 observations. We illustrate this by the simulated example presented in Table~\ref{tab:recursioninit}.

<<recursioninit, echo=FALSE, results=tex>>=
set.seed(1246)
timser <- tsglm.sim(n=1000, param=list(intercept=0.5, past_obs=0.77, past_mean=0.22), model=list(past_obs=1, past_mean=1), link="identity")$ts
fit_iid <- tsglm(ts=timser, model=list(past_obs=1, past_mean=1), link="identity", distr="poisson", init.method="iid", init.drop=FALSE)
fit_marginal <- tsglm(ts=timser, model=list(past_obs=1, past_mean=1), link="identity", distr="poisson", init.method="marginal", init.drop=FALSE)
fit_firstobs <- tsglm(ts=timser, model=list(past_obs=1, past_mean=1), link="identity", distr="poisson", init.method="firstobs", init.drop=FALSE)
fit_iid.drop <- tsglm(ts=timser, model=list(past_obs=1, past_mean=1), link="identity", distr="poisson", init.method="iid", init.drop=TRUE)
fit_marginal.drop <- tsglm(ts=timser, model=list(past_obs=1, past_mean=1), link="identity", distr="poisson", init.method="marginal", init.drop=TRUE)
fit_firstobs.drop <- tsglm(ts=timser, model=list(past_obs=1, past_mean=1), link="identity", distr="poisson", init.method="firstobs", init.drop=TRUE)
comparison <- rbind(
  c(fit_marginal$coefficients, fit_marginal$logLik),
  c(fit_marginal.drop$coefficients, fit_marginal.drop$logLik),
  c(fit_iid$coefficients, fit_iid$logLik),
  c(fit_iid.drop$coefficients, fit_iid.drop$logLik),
  c(fit_firstobs$coefficients, fit_firstobs$logLik),
  c(fit_firstobs.drop$coefficients, fit_firstobs.drop$logLik)
)
colnames(comparison) <- c("$\\widehat{\\beta}_0$", "$\\widehat{\\beta}_1$", "$\\widehat{\\alpha}_1$", "$\\ell(\\widehat{\\boldsymbol{\\theta}})$")
rownames(comparison) <- c("\\code{init.method = \"marginal\", init.drop = FALSE}", "\\code{init.method = \"marginal\", init.drop = TRUE}", "\\code{init.method = \"iid\", \\hspace{2em} init.drop = FALSE}", "\\code{init.method = \"iid\", \\hspace{2em} init.drop = TRUE}", "\\code{init.method = \"firstobs\", init.drop = FALSE}", "\\code{init.method = \"firstobs\", init.drop = TRUE}")

library("xtable")
print(xtable(comparison, caption="Estimated parameters and log-likelihood of a time series of length 1000 simulated from model \\eqref{eq:linear} for different initialization strategies. The true parameters are $\\beta_0=0.5$, $\\beta_1=0.77$ and $\\alpha_1=0.22$. Likelihood values are included for completeness of the presentation. There are not comparable as they are based on a different number of observations.", label="tab:recursioninit", align="lcccc", digits=c(0,3,3,3,1)), table.placement="tbp", caption.placement="bottom", booktabs=TRUE, comment=FALSE, sanitize.text.function=function(x){x})
@


% % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % %
\subsection{Starting value for optimization}
\label{app:startestimation}

The numerical optimization of the log-likelihood function requires a starting value for the parameter vector $\boldsymbol{\theta}$, which can be obtained by initial estimation based on a simpler model. Different strategies for this (controlled by the argument \code{start.control}) are discussed in this section. We call this start estimation (and not initial estimation) to avoid confusion with the initialization of the recursions described in the previous section.

The start estimation by the \proglang{R} function \code{glm} utilizes the fact that a time series following a GLM without feedback \citep[as in][]{kedem_regression_2002} can be fitted by employing standard software. Neglecting the feedback mechanism, the parameters of the GLM
\begin{align*}
& Y_{t}|{\cal F}^*_{t-1} \sim \mathrm{Poi}(\lambda_t^*),
\ \text{with} \ \nu_t^*=g(\lambda_t^*) \ \text{and}\\
& \nu_t^* = \beta_0^* + \beta_1^* \,\widetilde{g}(Y_{t-i_1}) + \ldots + \beta_p^* \,\widetilde{g}(Y_{t-i_p})
    + \eta_1^* X_{t,1} + \ldots + \eta_r^* X_{t,r}, \ t=i_p+1,\dots,n,
\end{align*}
with ${\cal F}^*_t$ the history of the joint process $\{Y_t, \boldsymbol{X}_t\}$,
are estimated using the \proglang{R} function \code{glm}.
Denote the estimated parameters by $\widehat{\beta}_0^*$, $\widehat{\beta}_1^*, \dots, \widehat{\beta}_p^*$, $\widehat{\eta}_1^*, \dots, \widehat{\eta}_r^*$ and set $\widehat{\alpha}_1^*, \dots, \widehat{\alpha}_q^*$ to zero (argument \code{start.control\$method = "GLM"}).

\citet{fokianos_poisson_2009} suggest start estimation of $\boldsymbol{\theta}$, for the first order linear model~\eqref{eq:linear} without covariates, by employing its representation as an ARMA(1,1) process with identical second-order properties, see \cite{ferland_integer-valued_2006}. For arbitrary orders $P$ and $Q$ with $s:=\max(P, Q)$ and the general model from Section~\ref{sec:models} this representation, after straightforward calculations, is given by
\begin{align}
(\widetilde{g}(Y_t) - \underbrace{\mu(\boldsymbol{\theta})}_{=:\zeta}) - \sum_{i=1}^{s} \underbrace{(\beta_i+\alpha_i)}_{=:\varphi_i} \left(\widetilde{g}(Y_{t-i})-\mu(\boldsymbol{\theta})\right) = \varepsilon_t + \sum_{i=1}^q \underbrace{(-\alpha_i)}_{=:\psi_i} \varepsilon_{t-i},
\label{eq:armarepr}
\end{align}
where $\beta_i:=0$ for $i \notin P$, $\alpha_i:=0$ for $i \notin Q$ and $\{\varepsilon_t\}$ is a white noise process. Recall that $\widetilde{g}$ is defined by $\widetilde{g}(x)=x$ for the linear model and $\widetilde{g}(x)=\log(x+1)$ for the log-linear model. Given the autoregressive parameters $\varphi_i$ and the moving average parameters $\psi_i$ of the ARMA representation of $\{Y_t\}$, the parameters of the original process are obtained by $\alpha_i=-\psi_i$ and $\beta_i=\varphi_i+\psi_i$. We get $\beta_0$ from $\beta_0=\zeta\bigl(1-\sum_{k=1}^p\beta_k-\sum_{\ell=1}^q\alpha_{\ell}\bigr)$ using the formula for the marginal mean of $\{Y_t\}$.
With these formulas estimates $\widehat{\beta}_0^*$, $\widehat{\beta}_i^*$ and $\widehat{\alpha}_i^*$ are obtained from ARMA estimates $\widehat{\zeta}$, $\widehat{\varphi}_i$ and $\widehat{\psi}_i$. Estimation of the ARMA parameters is implemented by conditional least squares (argument \code{start.control\$method = "CSS"}), maximum likelihood assuming normally distributed errors (argument \code{start.control\$method = "ML"}), or, for models up to first order, the method of moments (argument \code{start.control\$method = "MM"}).
If covariates are included, a linear regression is fitted to $\widetilde{g}(Y_t)$, whose errors follow an ARMA model like~\eqref{eq:armarepr}. Consequently, the covariate effects do not enter the dynamics of the process, as it is the case in the actual model~\eqref{eq:linpred}. It would be preferable to fit an ARMAX model, in which covariate effects are included on the right hand side of \eqref{eq:armarepr}, but this is currently not readily available in \proglang{R}. 

We compare both approaches to obtain start estimates. The GLM approach apparently disregards the feedback mechanism, i.e., the dependence on past values of the conditional mean. As opposed to this, the ARMA approach does not treat covariate effects in an appropriate way. From extensive simulations we note that the final estimation results are almost equally good for both approaches.

However, we also found out that in some situations (especially in the presence of certain types of covariates) both approaches occasionally provoke the likelihood optimization algorithms to run into a local optimum. This happens more often for increasing sample size. To overcome this problem we recommend a naive start estimation assuming an i.i.d. model without covariates, which only estimates the intercept and sets all other parameters to zero (argument \code{start.control\$method = "iid"}). This starting value is usually not close to any local optimum of the likelihood function. Hence we expect, possibly, a larger number of steps for the optimization algorithm to converge. This is the default method of start estimation as we do not guarantee a global optimum with the other two methods, in some special cases.

Particularly for the linear model, some of the aforementioned approaches do not yield a starting value $\widehat{\boldsymbol{\theta}}^*=(\widehat{\beta}_0^*, \widehat{\beta}_1^*,\dots,\widehat{\beta}_p^*, \widehat{\alpha}_1^*,\dots,\widehat{\alpha}_q^*, \widehat{\eta}_1^*,\dots,\widehat{\eta}_r^*)^\top$ for $\boldsymbol{\theta}$ which lays in the interior of the parameter space $\Theta$. To overcome this problem, $\widehat{\boldsymbol{\theta}}^*$ is suitably transformed to be used as a starting value.
For the linear model~\eqref{eq:linear} this transformation is done according to the following procedure \citep{liboschik_modelling_2016}:
\begin{enumerate}
\item[1a.] Set $\widehat{\beta}_k^* := \min{\bigl\{ \widehat{\beta}_k^*, \varepsilon \bigr\}}$ and $\widehat{\alpha}_{\ell}^* := \min{\bigl\{ \widehat{\alpha}_{\ell}^*, \varepsilon \bigr\}}$.
\item[1b.] If $c := \sum_{k=1}^p \widehat{\beta}_k^* + \sum_{\ell=1}^q \widehat{\alpha}_{\ell}^* > 1-\xi-\varepsilon$, then shrink each $\widehat{\beta}_k^*$ and $\widehat{\alpha}_{\ell}^*$ by multiplication with the factor $(1-\xi-\varepsilon)/c$.
\item[2a.] Set $\widehat{\beta}_0^* := \widehat{\beta}_0^* \cdot \left( 1-\sum_{k=1}^p \widehat{\beta}_k^* + \sum_{\ell=1}^q \widehat{\alpha}_{\ell}^* \right) / c$.
\item[2b.] Set $\widehat{\beta}_0^* := \max{\bigl\{ \widehat{\beta}_0^*, \xi+\varepsilon \bigr\}}$. 
\item[3.] Set $\widehat{\eta}_m^* := \max{\bigl\{ \widehat{\eta}_m^*, \varepsilon \bigr\}}$.
\end{enumerate}
A small constant $\varepsilon>0$ ensures that the initial value lies inside the parameter space $\Theta$ and not on its boundaries. It is chosen to be $\varepsilon=10^{-6}$ by default (argument \code{epsilon}).
Another small constant $\xi>0$ enforces the inequalities to be strict (i.e. $<$ instead of $\leq$). This constant is set to $\xi=10^{-6}$ by default (argument \code{slackvar}); recall Section~\ref{sec:inference}.
The shrinkage factor in step~1b is chosen such that the sum of the parameters equals $1-\xi-\varepsilon$ after possible shrinkage in this step.
The choice of $\widehat{\beta}_0^*$ in step~2a ensures that the marginal mean remains unchanged after possible shrinkage in step~1b.
For the log-linear model~\eqref{eq:loglin} it is not necessary to ensure positivity of the parameters. A valid starting value $\widehat{\boldsymbol{\theta}}^*$ is transformed with the following procedure:
\begin{enumerate}
\item[1a.] Set $\widehat{\beta}_k^* := \text{sign}\bigl(\widehat{\beta}_k^*\bigr) \cdot \min{\bigl\{ \bigl|\widehat{\beta}_k^*\bigr|, \varepsilon \bigr\}}$ and $\widehat{\alpha}_{\ell}^* := \text{sign}\bigl(\widehat{\alpha}_{\ell}^*\bigr) \cdot \min{\bigl\{ \bigl|\widehat{\alpha}_{\ell}^*\bigr|, \varepsilon \bigr\}}$.
\item[1b.] If $c := \left| \sum_{k=1}^p \widehat{\beta}_k^* + \sum_{\ell=1}^q \widehat{\alpha}_{\ell}^* \right| > 1-\xi-\varepsilon$, then shrink each $\widehat{\beta}_k^*$ and $\widehat{\alpha}_{\ell}^*$ by multiplication with the factor $(1-\xi-\varepsilon)/c$.
\end{enumerate}


% % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % %
\subsection{Stable inversion of the information matrix}
\label{app:invertinfo}

In order to obtain standard errors from the normal approximation \eqref{eq:asymptoticnormality} one needs to invert the information matrix $G_n(\widehat{\boldsymbol{\theta}};\widehat{\sigma}^2)$. To avoid numerical instabilities we make use of the fact that an information matrix is a real symmetric and positive definite matrix. We first compute a Choleski factorization of the information matrix. Then we apply an efficient algorithm to invert the matrix employing the upper triangular factor of the Choleski decomposition (see \proglang{R} functions \code{chol} and \code{chol2inv}). This procedure is implemented in the function \code{invertinfo} in our package.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Simulations}
\label{app:simulations}

In this section we present simulations supporting that the methods that have not yet been treated thoroughly in the literature work reliably.


% % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % %
\subsection{Covariates}
\label{app:covariates}

We present some limited simulation results for the problem of including covariates, in both linear and log-linear models. For simplicity we employ first order models with one covariate and a conditional Poisson distribution, that is, we consider the linear model with the identity link function
\begin{align*}
Y_t|{\cal F}_{t-1} \sim \text{Poisson}(\lambda_t), \quad
\lambda_t = \beta_0
    + \beta_1 \,Y_{t-1}
    + \alpha_1 \lambda_{t-1}
    + \eta_1 X_t,
\quad t=1,\dots,n,
\end{align*}
and the log-linear model with the logarithmic link function
\begin{align*}
Y_t|{\cal F}_{t-1} \sim \text{Poisson}(\lambda_t), \quad
\log(\lambda_t) = \beta_0
    + \beta_1 \,\log(Y_{t-1}+1)
    + \alpha_1 \log(\lambda_{t-1})
    + \eta_1 X_t,
\quad t=1,\dots,n.
\end{align*}
The parameters are chosen to be $\beta_1=0.3$ and $\alpha_1=0.2$. The intercept parameter is $\beta_0=4 \cdot 0.5$ for the linear and $\beta_0=\log(4) \cdot 0.5$ for the log-linear model in order to obtain a marginal mean (without the covariate effect) of about 4 in both cases.
%
\begin{table}
\begin{tabular}{lp{0.7\textwidth}}
\toprule
\textbf{Abbreviation} & \textbf{Definition} \\
\midrule
Linear & $t/n$ \\
Sine & $(\sin(2\pi \cdot 5 \cdot t/n) + 1)/2$ \\
Spiky outlier & $\mathbbm{1}(t = \tau)$ \\
Transient shift & $0.8^{t-\tau} \mathbbm{1}(t \geq \tau)$ \\
Level shift & $\mathbbm{1}(t \geq \tau)$ \\
GARCH(1,1) & $\sqrt{h_t}\varepsilon_t$ with $\varepsilon_t \sim \text{N}(0.5,1)$ and $h_t=0.002 + 0.1 X_{t-1}^2 + 0.8 h_{t-1}$\\
Exponential & i.i.d. Exponential with mean 0.5 \\
Normal & i.i.d. Normal with mean 0.5 and variance 0.04 \\
\bottomrule
\end{tabular}
\caption{Covariates $\{X_t:t=1,\dots,n\}$ considered in the simulation study. The interventions occur at time $\tau=n/2$. The GARCH model is defined recursively \citep[see][]{bollerslev_generalized_1986}.}
\label{tab:covariates}
\end{table}
%
We consider the covariates listed in Table~\ref{tab:covariates}, covering a simple linear trend, seasonality, intervention effects, i.i.d. observations from different distributions and a stochastic process.
The covariates are chosen to be nonnegative, which is necessary for the linear model but not for the log-linear model. All covariates have a mean of about 0.5, such that their effect sizes are somewhat comparable. The regression coefficient is chosen to be $\eta_1=2 \cdot \beta_0$ for the linear and $\eta_1=1.5 \cdot \beta_0$ for the log-linear model.

<<covariates_load, echo=FALSE>>=
load("covariates.RData")
estimates_list_id <- list(covariate_n100_id, covariate_n500_id, covariate_n1000_id, covariate_n2000_id)
estimates_list_log <- list(covariate_n100_log, covariate_n500_log, covariate_n1000_log, covariate_n2000_log)
@
%
<<covariates_scatterplots, echo=FALSE>>=
covariate_scatterplots <- function(x, main="", truevalue, show=1:12){
  #will only show the first eight types of covariates in vector 'show'
  par(mfrow=c(4,2), mar=c(0.25,0.25,0,0), mgp=c(1.8,0.6,0), oma=c(2.5,2.5,2.5,1))
  estimates_cov <- sapply(x$estimates[show], function(x) x[4, ])
  estimates_dep <- sapply(x$estimates[show], function(x) x[2, ]) + sapply(x$estimates[show], function(x) x[3, ])
  minmax_cov <- c(min(apply(estimates_cov, 2, quantile, probs=0.0055, na.rm=TRUE)), max(apply(estimates_cov, 2, quantile, probs=0.9994, na.rm=TRUE)))
  minmax_cov[2] <- minmax_cov[2]+0.15*(diff(minmax_cov)) #enlarge range to have space for the plot title placed within the plot region
  minmax_dep <- c(min(apply(estimates_dep, 2, quantile, probs=0.0055, na.rm=TRUE)), max(apply(estimates_dep, 2, quantile, probs=0.9994, na.rm=TRUE)))
covariate_labels <- c("Linear", "Quadratic", "Sine", "Sine (fixed width)", "Spiky outlier", "Transient shift", "Level shift", "GARCH(1,1)", "Poisson", "Exponential", "Normal", "Chi^2")
  for(j in seq(along=show)){
  i <- show[j]
  plot(estimates_dep[, j], estimates_cov[, j], main="", pch=20, xaxt="n", yaxt="n", cex=0.5, las=0, cex.axis=0.8, xlim=minmax_dep, ylim=minmax_cov)
  abline(v=0.5, col="darkgrey")
  abline(h=truevalue, col="darkgrey")
  if(j %in% c(7,8)){
    axis(side=1, cex.axis=0.8, line=0)
    mtext(text=expression(hat(alpha)[1]+hat(beta)[1]), side=1, line=1.9, cex=0.7)
  }
  if(j %in% c(1,3,5,7)){
    axis(side=2, cex.axis=0.8, line=0)
    mtext(text=expression(hat(eta)[1]), side=2, line=1.3, cex=0.7, las=0)
  }
  legend("top", bty="n", legend="", title=covariate_labels[i], cex=1.3)
  }
  title(main=main, outer=TRUE, cex.main=1.6)
}

pdf("tscount-covariates_scatterplots.pdf", width=3.5, height=5)
covariate_scatterplots(x=covariate_n100_id, main="Linear model", truevalue=2*2, show=c(1,3,5,6,7,8,10,11))
covariate_scatterplots(x=covariate_n100_log, main="Log-linear model", truevalue=0.65*1.5, show=c(1,3,5,6,7,8,10,11))
invisible(dev.off())
@
%
<<covariates_boxplots, echo=FALSE>>=
covariate_boxplots <- function(estimates_list, index, truevalue, main="", label="", show=1:12){
  number_covariates <- length(show)
  estimates <- cbind(
    sapply(estimates_list[[1]]$estimates[show], function(x) x[index, ]),
    sapply(estimates_list[[2]]$estimates[show], function(x) x[index, ]),
    sapply(estimates_list[[3]]$estimates[show], function(x) x[index, ]),
    sapply(estimates_list[[4]]$estimates[show], function(x) x[index, ])
)[, rev(seq(from=1, by=number_covariates, length.out=4)+rep(1:number_covariates-1, each=4))]
  minmax <- c(min(apply(estimates, 2, quantile, probs=0.0055, na.rm=TRUE)), max(apply(estimates, 2, quantile, probs=0.9994, na.rm=TRUE)))
  distance <- 1
  boxplot(estimates, horizontal=TRUE, main=main, at=c((1:4)+rep(seq(from=0, by=4+distance, length.out=number_covariates), each=4)), yaxt="n", xlab=label, ylim=minmax, cex=0.5)
  abline(h=(1:(number_covariates-1))*4+(1:(number_covariates-1))*distance)
  abline(v=truevalue, col="darkgrey")
  covariate_labels <- rev(c("Linear", "Quadratic", "Sine", "Sine (fixed width)", "Spiky outlier", "Transient shift", "Level shift", "GARCH(1,1)", "Poisson", "Exponential", "Normal", "Chi^2")[show])
  text(x=minmax[2]+diff(minmax)*0.03, y=1.5+seq(from=0, by=5, length.out=number_covariates), labels=covariate_labels, pos=2, font=1, cex=1)
}

###Look at all four parameters:
#Linear model:
pdf("tscount-covariates_boxplotslinear.pdf", width=7, height=7)
par(mfrow=c(2,2), mar=c(2,0.5,2,0.5), mgp=c(2,0.5,0), cex.main=1.5)
covariate_boxplots(estimates_list=estimates_list_id, index=1, truevalue=2, main=expression(hat(beta)[0]), show=c(1,3,5,6,7,8,10,11))
covariate_boxplots(estimates_list=estimates_list_id, index=2, truevalue=0.3, main=expression(hat(beta)[1]), show=c(1,3,5,6,7,8,10,11))
covariate_boxplots(estimates_list=estimates_list_id, index=3, truevalue=0.2, main=expression(hat(alpha)[1]), show=c(1,3,5,6,7,8,10,11))
covariate_boxplots(estimates_list=estimates_list_id, index=4, truevalue=2*2, main=expression(hat(eta)[1]), show=c(1,3,5,6,7,8,10,11))
invisible(dev.off())
#Log-Linear model:
pdf("tscount-covariates_boxplotsloglin.pdf", width=7, height=7)
par(mfrow=c(2,2), mar=c(2,0.5,2,0.5), mgp=c(2,0.5,0), cex.main=1.5)
covariate_boxplots(estimates_list=estimates_list_log, index=1, truevalue=0.65, main=expression(hat(beta)[0]), show=c(1,3,5,6,7,8,10,11))
covariate_boxplots(estimates_list=estimates_list_log, index=2, truevalue=0.3, main=expression(hat(beta)[1]), show=c(1,3,5,6,7,8,10,11))
covariate_boxplots(estimates_list=estimates_list_log, index=3, truevalue=0.2, main=expression(hat(alpha)[1]), show=c(1,3,5,6,7,8,10,11))
covariate_boxplots(estimates_list=estimates_list_log, index=4, truevalue=0.65*1.5, main=expression(hat(eta)[1]), show=c(1,3,5,6,7,8,10,11))
invisible(dev.off())

###Look only at the parameter of the covariate:
# pdf("tscount-covariates_boxplots.pdf", width=7, height=5)
# par(mfrow=c(1,2), mar=c(3,0.5,1.8,0.5), mgp=c(2,0.5,0))
# covariate_boxplots(estimates_list=estimates_list_id, index=4, truevalue2*2, label=expression(hat(eta)[1]), main="Linear model", show=c(1,3,5,6,7,8,10,11))
# covariate_boxplots(estimates_list=estimates_list_log, index=4, truevalue=0.65*1.5, label=expression(hat(eta)[1]), main="Log-linear model", show=c(1,3,5,6,7,8,10,11))
# invisible(dev.off())
@
%
<<covariates_qqplots, echo=FALSE>>=
covariate_qqplots <- function(x, main="", truevalue, show=1:12){
  #will only show the first eight types of covariates in vector 'show'
  par(mfrow=c(4,2), mar=c(0.25,0.25,0,0), mgp=c(1.8,0.6,0), oma=c(2.5,2.5,2.5,1))
  estimates <- sapply(x$estimates[show], function(x) x[4, ])
  minmax <- c(min(apply(estimates, 2, quantile, probs=0.0055, na.rm=TRUE)), max(apply(estimates, 2, quantile, probs=0.9994, na.rm=TRUE)))
covariate_labels <- c("Linear", "Quadratic", "Sine", "Sine (fixed width)", "Spiky outlier", "Transient shift", "Level shift", "GARCH(1,1)", "Poisson", "Exponential", "Normal", "Chi^2")
  for(j in seq(along=show)){
  i <- show[j]
  qqnorm(estimates[, j], main="", xlab="Theoretical quantiles", ylab="Sample quantiles", pch=20, xaxt="n", yaxt="n", cex=0.5, las=0, cex.axis=0.8, ylim=minmax, xlim=c(-3.3,3.3))
  abline(h=truevalue, col="darkgrey")
  if(j %in% c(7,8)){
    axis(side=1, cex.axis=0.8, line=0)
    mtext(text="Theoretical quantiles", side=1, line=1.5, cex=0.7)
  }
  if(j %in% c(1,3,5,7)){
    axis(side=2, cex.axis=0.8, line=0)
    mtext(text="Sample quantiles", side=2, line=1.5, cex=0.7, las=0)
  }
  legend("top", bty="n", legend="", title=covariate_labels[i], cex=1.3)
  qqline(estimates[, j])
  }
  title(main=main, outer=TRUE, cex.main=1.6)
}

pdf("tscount-covariates_qqplots.pdf", width=3.5, height=5)
covariate_qqplots(x=covariate_n100_id, main="Linear model", truevalue=2*2, show=c(1,3,5,6,7,8,10,11))
covariate_qqplots(x=covariate_n100_log, main="Log-linear model", truevalue=0.65*1.5, show=c(1,3,5,6,7,8,10,11))
invisible(dev.off())
@
%
\begin{figure}[bt]
\centering
\includegraphics[width=0.45\textwidth, page=1]{tscount-covariates_scatterplots}
\includegraphics[width=0.45\textwidth, page=2]{tscount-covariates_scatterplots}
\caption{Scatterplots of the estimated covariate parameter $\widehat{\eta}_1$ against the sum $\widehat{\beta}_1+\widehat{\alpha}_1$ of the estimated dependence parameters in a linear (left) respectively log-linear (right) model with an additional covariate of the given type. The time series of length $n=100$ are simulated from the respective model with the true values marked by grey lines. Each dot represents one of 200 replications.
} 
\label{fig:covariates_scatterplots}
\end{figure}
%
\begin{figure}[bt]
\centering
\includegraphics[width=\figurewidth]{tscount-covariates_boxplotslinear}
\caption{Estimated coefficients for a linear model of order $p=q=1$ with an additional covariate of the given type. The time series of length $n=100,500,1000,2000$ (from top to bottom in each panel) are simulated from the respective model with the true coefficients marked by a grey vertical line. Each boxplot is based on 200 replications.
}
\label{fig:covariates_boxplotslinear}
\end{figure}
%
\begin{figure}[tbp]
\centering
\includegraphics[width=\figurewidth]{tscount-covariates_boxplotsloglin}
\caption{Identical simulation results as those shown in Figure~\ref{fig:covariates_boxplotslinear} but for the log-linear model.}
\label{fig:covariates_boxplotsloglin}
\end{figure}
%
% \begin{figure}[tbp]
% \centering
% \includegraphics[width=\figurewidth]{tscount-covariates_boxplots}
% \caption{Estimated coefficient $\widehat{\eta}_1$ of the covariate for a linear (left) respectively log-linear (right) model of order $p=q=1$ with an additional covariate of the given type. The time series of length $100,500,1000,2000$ (from top to bottom in each panel) are simulated from the respective model with the true coefficient marked by a grey vertical line. Each boxplot is based on 200 replications.
% }
% \label{fig:covariates_boxplots}
% \end{figure}
%
\begin{figure}[tbp]
\centering
\includegraphics[width=0.45\textwidth, page=1]{tscount-covariates_qqplots}
\includegraphics[width=0.45\textwidth, page=2]{tscount-covariates_qqplots}
\caption{Normal QQ-plots for the estimated covariate coefficient $\widehat{\eta}_1$ in a linear (left) respectively log-linear (right) model of order $p=q=1$ with an additional covariate of the given type. The time series of length $n=100$ are simulated from the respective model with the true coefficient marked by a grey horizontal line. Each plot is based on 200 replications. 
} 
\label{fig:covariates_qqplots}
\end{figure}
%
Apparently, certain types of covariates can to some extent be confused with serial dependence. This is the case for the linear trend and the level shift, but also for the sinusoidal term, since these lead to data patterns which resemble positive serial correlation; see Figure~\ref{fig:covariates_scatterplots}.

A second finding is that the effect of covariates, like a transient shift or a spiky outlier, is hard to be estimated precisely.
Note that both covariates have most of their values values different from zero only at very few time points (especially the spiky outlier) which explains this behavior of the estimation procedure.
%Such covariates contain information on the covariate effect almost completely indirectly via the serial dependence, and this indirect gain of information is rather limited.
The estimators for the coefficients of such covariates have a large variance which decreases only very slowly with growing sample size; see the bottom right plots in Figures~\ref{fig:covariates_boxplotslinear} and \ref{fig:covariates_boxplotsloglin} for the linear and the log-linear model, respectively. This does not affect the estimation of the other parameters, see the other three plots in the same figures. For all other types of covariates the variance of the estimator for the regression parameter decreases with growing sample size, which indicates consistency of the estimator.

The conjectured approximative normality of the model parameters stated in \eqref{eq:asymptoticnormality} seems to hold for most of the covariates considered here even in case of a rather moderate sample size of 100, as indicated by the QQ~plots shown in Figure~\ref{fig:covariates_qqplots}. The only serious deviation from normality happens for the spiky outlier in the linear model, a case where many estimates of the covariate coefficient $\eta_1$ lie close to zero. This value is the lower boundary of the parameter space for this model. Due to the consistency problem for this covariate (discussed in the previous paragraph) the observed deviation from normality is still present even for a much larger sample size of 2000 (not shown here). Note that for the spiky outlier the conditions for asymptotic normality in linear regression models stated in Section~\ref{sec:inference} are not fulfilled.
QQ plots for the other model parameters $\beta_0$, $\beta_1$ and $\alpha_1$ look satisfactory for all types of covariates and are not shown here.


% % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % %
\subsection{Negative Binomial distribution}
\label{app:distrcoef}

As mentioned before, the model with the logarithmic link function is not covered by the theory derived by \citet{christou_quasi-likelihood_2014}. Consequently, we confirm by simulations that estimating the additional dispersion parameter $\phi$ of the Negative Binomial distribution by equation \eqref{eq:dispestim} yields good results.
We consider both the linear model with the identity link
\begin{align*}
Y_t|{\cal F}_{t-1} \sim \text{NegBin}(\lambda_t, \phi), \quad
\lambda_t = \beta_0
    + \beta_1 \,Y_{t-1}
    + \alpha_1 \lambda_{t-1},
\quad t=1,\dots,n,
\end{align*}
and the log-linear model with the logarithmic link
\begin{align*}
Y_t|{\cal F}_{t-1} \sim \text{NegBin}(\lambda_t, \phi), \quad
\log(\lambda_t) = \beta_0
    + \beta_1 \,\log(Y_{t-1}+1)
    + \alpha_1 \log(\lambda_{t-1}),
\quad t=1,\dots,n.
\end{align*}
The parameters $\beta_0$, $\beta_1$ and $\alpha_1$ are chosen like in Section~\ref{app:covariates}. For the dispersion parameter $\phi$ we employ the values 1, 5, 10, 20 and $\infty$, which are corresponding to overdispersion coefficients $\sigma^2$ of 1, 0.2, 0.1, 0.05 and 0, respectively.

<<distrcoef_load, echo=FALSE>>=
load("distrcoef_size1.RData")
estimates_distrcoef_size1_id <- sapply(list(distrcoef_n100_size1_id, distrcoef_n500_size1_id, distrcoef_n1000_size1_id, distrcoef_n2000_size1_id), function(x) x$estimates[4, ])
estimates_distrcoef_size1_log <- sapply(list(distrcoef_n100_size1_log, distrcoef_n500_size1_log, distrcoef_n1000_size1_log, distrcoef_n2000_size1_log), function(x) x$estimates[4, ])

load("distrcoef_n200.RData")
@
%
<<distrcoef_summary, echo=FALSE, results=tex>>=
distrcoef_nu <- function(estimates) c(mean=mean(estimates, na.rm=TRUE), median=median(estimates, na.rm=TRUE), sd=sd(estimates, na.rm=TRUE), mad=mad(estimates, na.rm=TRUE), propNA=mean(is.na(estimates))*100)
# distrcoef_id_summary <- rbind(
#   size1=distrcoef_nu(1/distrcoef_n200_size1_id$estimates["size", ]),
#   size5=distrcoef_nu(1/distrcoef_n200_size5_id$estimates["size", ]),
#   size10=distrcoef_nu(1/distrcoef_n200_size10_id$estimates["size", ]),
#   size20=distrcoef_nu(1/distrcoef_n200_size20_id$estimates["size", ]),
#   sizeInf=distrcoef_nu(1/distrcoef_n200_sizeInf_id$estimates["size", ])
# )
distrcoef_log_summary <- rbind(
  size1=distrcoef_nu(1/distrcoef_n200_size1_log$estimates["size", ]),
  size5=distrcoef_nu(1/distrcoef_n200_size5_log$estimates["size", ]),
  size10=distrcoef_nu(1/distrcoef_n200_size10_log$estimates["size", ]),
  size20=distrcoef_nu(1/distrcoef_n200_size20_log$estimates["size", ]),
  sizeInf=distrcoef_nu(1/distrcoef_n200_sizeInf_log$estimates["size", ])
)
colnames(distrcoef_log_summary) <- c("\\textbf{Mean}", "\\textbf{Median}", "\\textbf{Std.dev.}", "\\textbf{MAD}", "\\textbf{Failures (in \\%)}")
rownames(distrcoef_log_summary) <- c("$\\sigma^2=\ $ 1.00", "0.20", "0.10", "0.05", "0.00")

library("xtable")
print(xtable(distrcoef_log_summary, caption="Summary statistics for the estimated overdispersion coefficient $\\widehat{\\sigma}^2$ of the Negative Binomial distribution. The time series are simulated from a log-linear model with the true overdispersion coefficient given in the rows. Each statistic is based on 200 replications.", label="tab:distrcoef_summary", align="rrrrrr", digits=c(0,2,2,2,2,2)), table.placement="tbp", caption.placement="bottom", booktabs=TRUE, comment=FALSE, sanitize.text.function=function(x){x})
@
%
<<distrcoef_boxplots, echo=FALSE>>=
##RMSE:
#apply(estimates_distrcoef_size1_id, 2, function(x) sqrt(mean((x-1)^2)))
#apply(estimates_distrcoef_size1_id, 2, function(x) sqrt(mean((x-1)^2)))

pdf("tscount-distrcoef_boxplots.pdf", width=7, height=2.5)
par(mfrow=c(1,2), mar=c(3,0.25,1.8,0.25), mgp=c(1.7,0.5,0), oma=c(0,3.5,0,0.5))
boxplot(estimates_distrcoef_size1_id[, 4:1], horizontal=TRUE, main="Linear model", names=rev(c(100, 500, 1000, 2000)), ylab="", xlab=expression(hat(sigma)^2), las=1, ylim=c(0.5,2.5))
abline(v=1, col="darkgrey")
mtext(text="Length of time series", side=2, line=2.5)
boxplot(estimates_distrcoef_size1_log[, 4:1], horizontal=TRUE, main="Log-linear model", names=rev(c(100, 500, 1000, 2000)), ylab="Sample size", xlab=expression(hat(sigma)^2), cex=0.8, yaxt="n", ylim=c(0.5,2.5))
abline(v=1, col="darkgrey")
invisible(dev.off())
@
%
\begin{figure}[tbp]
\centering
\includegraphics[width=0.9\textwidth]{tscount-distrcoef_boxplots}
\caption{Estimated overdispersion coefficient $\widehat{\sigma}^2$ of the Negative Binomial distribution for a linear (left) respectively log-linear (right) model of order $p=q=1$. The time series are simulated from the respective model with the true overdispersion coefficient marked by a grey vertical line. Each boxplot is based on 200 replications.
} 
\label{fig:distrcoef_boxplots}
\end{figure}

The estimator of the dispersion parameter $\phi$ has a positively skewed distribution. It is thus preferable to consider the distribution of its inverse $\widehat{\sigma}^2=1/\widehat{\phi}$, which is only slightly negatively skewed; see Table~\ref{tab:distrcoef_summary}.
%We see from Table~\ref{tab:distrcoef_summary} that the median and not the mean of the estimations is close to the true value. Accordingly, it is about equally likely to under- and to overestimate the true parameter but on average we overestimate it. %This refers to the dispersion coefficient phi and not to its inverse sigma^2.
In certain cases it is numerically not possible to solve \eqref{eq:dispestim} and the estimation fails. This happens when the true value of $\phi$ is large and we are close to the limiting case of a Poisson distribution (see the proportion of failures in the last column of the table). In such a case our fitting function gives an error and recommends fitting a model with a Poisson distribution instead.
These results are very similar for the linear model and thus not shown here.

We check the consistency of the estimator by a simulation for a true value of $\sigma^2=1/\phi=1$. Our results shown in Figure~\ref{fig:distrcoef_boxplots} indicate that on average the deviation of the estimation from the true value decreases with increasing sample size for both, the linear and the log-linear model. The boxplots also confirm our above finding that the estimator has a clearly asymmetric distribution for sample sizes up to several hundred.


% % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % %
\subsection{Quasi information criterion}
\label{app:qic}

We confirm by simulation that the quasi information criterion (QIC) approximates Akaike's information criterion (AIC) in case of a Poisson distribution.
Like in Section~\ref{app:distrcoef}, we consider both the linear model with the identity link
\begin{align*}
Y_t|{\cal F}_{t-1} \sim \text{Poisson}(\lambda_t), \quad
\lambda_t = \beta_0
    + \beta_1 \,Y_{t-1}
    + \alpha_1 \lambda_{t-1},
\quad t=1,\dots,n,
\end{align*}
and the log-linear model with the logarithmic link
\begin{align*}
Y_t|{\cal F}_{t-1} \sim \text{Poisson}(\lambda_t), \quad
\log(\lambda_t) = \beta_0
    + \beta_1 \,\log(Y_{t-1}+1)
    + \alpha_1 \log(\lambda_{t-1}),
\quad t=1,\dots,n,
\end{align*}
but now with a Poisson distribution. Again, the parameters $\beta_0$, $\beta_1$ and $\alpha_1$ are chosen like in Section~\ref{app:covariates}.

<<qic, fig=TRUE, include=FALSE, echo=FALSE, results=hide, width=7, height=3.5>>=
load("qic.RData")
par(mar=c(3,3,2,1), mgp=c(1.8,0.5,0), mfrow=c(1,2))
xylim <- c(375,465)
plot(QIC ~ AIC, data=aicqic_id_n100, xlim=xylim, ylim=xylim, main="Linear model", cex=0.8)
abline(a=0, b=1)
plot(QIC ~ AIC, data=aicqic_log_n100, xlim=xylim, ylim=xylim, main="Log-linear model", cex=0.8)
abline(a=0, b=1)
@
%
\begin{figure}[tbp]
\centering
\includegraphics[width=\textwidth]{tscount-qic}
\caption{Relationship of QIC and AIC for a linear (left) respectively log-linear (right) model of order $p=q=1$. Each of the 200 points represents the QIC and AIC of a fit to a time series of length $n=100$ simulated from the respective model. The diagonal line is the identity, i.e. it represents values for which the QIC equals the AIC.
} 
\label{fig:qic}
\end{figure}

From each of the two models we simulate 200 time series of length $n=100$ and compute the QIC and AIC of the fitted model. Figure~\ref{fig:qic} shows that the relationship between QIC and AIC is very close to the identity, i.e. the QIC is approximately equal to the AIC. There is only one out of 200 cases (for the linear model) where the QIC deviates largely from the AIC.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\end{document}