\documentclass[12pt,a4paper]{article} \usepackage[greek, british, english]{babel} \usepackage{natbib} %control font: \usepackage[T1]{fontenc} %\usepackage{palatino} \usepackage{mathpazo} \usepackage{amsmath} \usepackage{amssymb} %enable \mathbb, \subsetneq commands \usepackage{graphicx} %needed for [scale..] argument in figures %\usepackage[formats]{listings} %enable centred tilde ~ in verbatim environments \usepackage{hyperref} %see http://en.wikibooks.org/wiki/LaTeX/Hyperlinks and http://www.tug.org/applications/hyperref/manual.html#x1-120003.8 \hypersetup{ % bookmarks=true, % show bookmarks bar? % unicode=false, % non-Latin characters in Acrobat's bookmarks % pdftoolbar=true, % show Acrobat's toolbar? % pdfmenubar=true, % show Acrobat's menu? % pdffitwindow=false, % window fit to page when opened pdfstartview={FitH}, % fits the width of the page to the window % pdftitle={My title}, % title % pdfauthor={Author}, % author % pdfsubject={Subject}, % subject of the document % pdfcreator={Creator}, % creator of the document % pdfproducer={Producer}, % producer of the document % pdfkeywords={keyword1} {key2} {key3}, % list of keywords % pdfnewwindow=true, % links in new window colorlinks, % false: boxed links; true: colored links linkcolor=blue, %red, % color of internal links (change box color with linkbordercolor) citecolor=blue, %green, % color of links to bibliography % filecolor=magenta, % color of file links urlcolor=blue %cyan % color of external links } %%%%%% BOLD MATH-SYMBOLS: \newcommand{\bx}{\boldsymbol{x}} \newcommand{\bbeta}{\boldsymbol{\beta}} %\newcommand{\code}{\tt} \newcommand{\code}[1]{{\tt #1}} %%%%%% ENABLE CENTRED TILDE ~ IN VERBATIM ENVIRONMENTS: %%%%%%%%%%%%% %\lstdefineformat{R}{~=\( \sim \)} %\lstset{basicstyle=\tt,format=R} %%%%%%% MARGIN RELATED COMMANDS: \setlength\textheight{23cm} %height of the body without header and footer \setlength\textwidth{17cm} %width of the body \addtolength{\voffset}{-2.5cm} \addtolength{\hoffset}{-1.5cm} %move text horisontally 0.5cm %%%%%% ORDINARY FONT-SIZES: %\Huge \huge \LARGE \Large \large \normalsize \small \footnotesize \scriptsize \tiny \begin{document} %\SweaveOpts{engine=R,eps=FALSE} %\VignetteIndexEntry{User-Specified General-to-Specific (GETS) and Indicator Saturation (ISAT) Methods} %\VignetteDepends{gets} %\VignetteKeywords{general-to-specific, indicator saturation, user-defined methods} %\VignettePackage{gets} \begin{center} {\bf\Large User-Specified General-to-Specific (GETS) and Indicator Saturation (ISAT) Methods}% \footnote{ This vignette is a revised and updated version of the article ``User-Specified General-to-Specific and Indicator Saturation Methods'', which was published in \href{https://journal.r-project.org}{The R Journal} (2020) 12:2, pages 388-401: \url{https://journal.r-project.org/archive/2021/RJ-2021-024/}. I am grateful to the Editor Catherine Hurley, an anonymous reviewer, Damian Clarke, {\'E}ric Dubois, Jonas Kurle, Felix Pretis, James Reade, Moritz Schwarz, Gareth Thomas, participants at the UseR! 2019 conference (Toulouse, July), the Statistics Norway seminar (May, 2018) and the Norges Bank seminar (April, 2018) for helpful comments, suggestions and questions. } \medskip Genaro Sucarrat\footnote{Department of Economics, BI Norwegian Business School, Nydalsveien 37, 0484 Oslo, Norway. 
Webpage: \url{http://www.sucarrat.net/}.} \\
\smallskip
{\small This version: \selectlanguage{british}\today}\\
%table of contents:
\parbox[t]{0.9\linewidth}{
% \renewcommand{\contentsname}{Contents:}
\renewcommand{\contentsname}{} %{} means no name
\tableofcontents
}
\end{center}

\clearpage

\section*{Summary}
\addcontentsline{toc}{section}{Summary}

General-to-Specific (GETS) modelling provides a comprehensive, systematic and cumulative approach to modelling that is ideally suited for conditional forecasting and counterfactual analysis, whereas Indicator Saturation (ISAT) is a powerful and flexible approach to the detection and estimation of structural breaks (e.g.\ changes in parameters), and to the detection of outliers. In both methods, multi-path backwards elimination, single and multiple hypothesis tests on the coefficients, diagnostic tests and goodness-of-fit measures are combined to produce a parsimonious final model. In many situations a specific model or estimator is needed, a specific set of diagnostic tests may be required, or a specific fit criterion is preferred. In these situations, if the combination of estimator/model, diagnostic tests and fit criterion is not offered in a pre-programmed way by publicly available software, then the implementation of user-specified GETS and ISAT methods puts a large programming burden on the user.
%Generic functions and procedures that facilitate the implementation of user-specified GETS and ISAT methods for specific problems can therefore be of great benefit.
To reduce this burden, the R package \textbf{gets} provides a complete set of facilities and generic functions for user-specified GETS and ISAT methods: User-specified model/estimator, user-specified diagnostics and user-specified goodness-of-fit criteria. This vignette explains and illustrates how user-specified GETS and ISAT methods can be created with the R package \textbf{gets}.

\clearpage

\section{Introduction }

General-to-Specific (GETS) modelling provides a comprehensive, systematic and cumulative approach to modelling that is ideally suited for scenario analysis, e.g.\ conditional forecasting and counterfactual analysis. To this end, well-known ingredients (tests of coefficients, multi-path backwards elimination, diagnostic tests and fit criteria) are combined to produce a parsimonious final model that passes the chosen diagnostics. GETS modelling originated at the London School of Economics (LSE) during the 1960s, and gained widespread acceptance and usage in economics during the 1980s and 1990s. The two-volume article collection by \citet{CamposEricssonHendry2005} provides a comprehensive historical overview of key developments in GETS modelling. Software-wise, a milestone was reached in 1999, when the data-mining experiment of \citet{Lovell83} was revisited by \citet{Hooveretal99}. They showed that automated GETS modelling could improve substantially upon the then prevalent modelling approaches. The study spurred numerous new studies and developments, including Indicator Saturation (ISAT) methods, see \citet{HendryJohansenSantos2008} and \citet{CastleDoornikHendryPretis2015}. ISAT methods provide a powerful and flexible approach to the detection and estimation of structural breaks (e.g.\ changes in parameters), and to the detection of outliers.

On CRAN, there are two packages that provide GETS methods. The second, named \textbf{gets}, is simply the successor of the first, which is named \textbf{AutoSEARCH}.\footnote{Both packages were created by me.
Originally, I simply wanted to rename the first to the name of the second. This, however, I was told would be inconvenient in practice, so I was instead asked by CRAN to publish a ``new'' package with the new name.} Since October 2014 the development of \textbf{AutoSEARCH} has been frozen, and all development efforts have been directed towards the \textbf{gets} package.\footnote{See the GitHub page for the current development version of the package: \url{https://github.com/gsucarrat/gets/}.} An introduction to the \textbf{gets} package is provided by \cite{PretisReadeSucarrat2018}. However, it does not cover the user-specification capabilities of the package, some of which were not available at the time.

At the time of writing (August 2021), the publicly available software packages that provide GETS and ISAT methods are listed in Table \ref{table:software:comparison}. Although they offer GETS and ISAT methods for some of the most popular models in applications, in many situations a specific model or estimator will be needed, a specific set of diagnostic tests may be required, or a specific fit criterion is preferred. In these situations, if the combination of estimator/model, diagnostic tests and fit criterion is not offered in a pre-programmed way by the publicly available software packages, then the implementation of user-specified GETS and ISAT methods puts a large programming burden on the user. Generic functions and procedures that facilitate the implementation of user-specified GETS and ISAT methods for specific problems can therefore be of great benefit. The R package \textbf{gets}, since version 0.20 (September 2019), is the first software -- both inside and outside the R universe -- to provide a complete set of facilities and generic functions for user-specified GETS and ISAT methods: User-specified model/estimator, user-specified diagnostics and user-specified goodness-of-fit criteria. The aim of this vignette is to illustrate how user-specified GETS and ISAT methods can be implemented.

\begin{table}
\centering
\setlength{\tabcolsep}{3pt} %needed to ensure table fits on page
\begin{tabular}{lccccccc}
\hline
 & & $\underset{(\text{MATLAB})}{\text{HP1999}}$ & $\underset{(\text{OxMetrics})}{\text{Autometrics}}$ & $\underset{(\text{Scilab})}{\text{Grocer}}$ & $\underset{(\text{STATA})}{ \text{genspec} }$ & EViews & $\underset{(\text{R})}{ \text{ \textbf{gets} } }$ \\
\hline
More than 10 paths & & & Yes & Yes & Yes & Yes & Yes \\[1mm]
GETS of linear regression & & Yes & Yes & Yes & Yes & Yes & Yes \\[1mm]
GETS of variance models & & & & & & & Yes \\[1mm]
GETS of logit models & & & Yes & & & & Yes \\[1mm]
GETS of count models & & & Yes & & & & \\[1mm]
GETS of probit models & & & Yes & Yes & & & \\[1mm]
GETS of panel models & & & & Yes & Yes & & \\[1mm]
GETS of MIDAS models & & & & & & Yes & \\[1mm]
ISAT of linear regression & & & Yes & Yes & & Yes & Yes \\[1mm]
User-specified GETS & & & & Yes & & & Yes \\[1mm]
User-specified ISAT & & & & & & & Yes \\[1mm]
User-specified diagnostics & & & & Yes & & & Yes \\[1mm]
User-specified goodness-of-fit & & & & & & & Yes \\[1mm]
Menu-based GUI & & & Yes & & & Yes & \\[1mm]
Free and open source & & Yes$^{*}$ & & Yes & Yes$^{*}$ & & Yes \\[1mm]
\hline
\end{tabular}\\
\caption{A comparison of publicly available GETS and ISAT software packages with emphasis on user-specification capabilities. HP1999, the MATLAB code of \citet{Hooveretal99}. Autometrics, OxMetrics version 15, see \citet{DoornikHendry2018}. Grocer, version 1.8, see \citet{DuboisMichaux2019}.
genspec, version 1.2.2, see \citet{Clarke2014}. EViews, version 12, see \citet{EViews2020}. \textbf{gets}, version 0.29, see \cite{SucarratKurlePretisReadeSchwarz2021getsV029}, and \citet{PretisReadeSucarrat2018}.\\
$^{*}$The modules in themselves are free and open source, but they run in non-free and closed source software environments (MATLAB and STATA, respectively).}
\label{table:software:comparison}
\end{table}

The rest of this vignette contains three sections. In the next section, the model selection properties of GETS and ISAT methods are summarised. This is followed by a section that outlines the general principles of how user-specified estimation, user-specified diagnostics and user-specified goodness-of-fit measures are implemented. The final section contains four illustrations.

\section{Model selection properties of GETS and ISAT methods}

It is useful to denote a generic model for observation $t$ as
%
\begin{equation}
  m\left(y_t, \bx_t, \bbeta \right), \qquad t=1,2,\ldots,n,
\end{equation}
%
where $y_t$ is the dependent variable, $\bx_t = (x_{1t}, x_{2t}, \ldots)'$ is a vector of covariates, $\bbeta = (\beta_1, \beta_2, \ldots)'$ is a vector of parameters to be estimated and $n$ is the sample size. Two examples are the linear regression model and the logit-model:
%
\begin{eqnarray}
  y_t &=& \beta_1 x_{1t} + \cdots + \beta_k x_{kt} + \epsilon_t, \label{eq:linear:regression:model} \\
%
  Pr\left(y_t=1|\bx_t\right) &=& \frac{1}{1+\exp\left(-h_t\right)}
  \qquad\text{with}\qquad h_t = \beta_1 x_{1t} + \cdots + \beta_k x_{kt}. \label{eq:linear:logit:model}
\end{eqnarray}
%
Note that, in a generic model $m(y_t, \bx_t, \bbeta)$, the dimension of $\bbeta$ is usually -- but not necessarily -- equal to the dimension of $\bx_t$. Here, unless otherwise stated, they will both have dimension $k$. In \eqref{eq:linear:regression:model}--\eqref{eq:linear:logit:model}, a variable $x_{jt} \in \bx_t$ is said to be relevant if $\beta_j \neq 0$ and irrelevant if $\beta_j = 0$. Let $k_{\text{rel}} \geq 0$ and $k_{\text{irr}} \geq 0$ denote the number of relevant and irrelevant variables, respectively, such that $k_{\text{rel}} + k_{\text{irr}} = k$. GETS modelling aims at finding a specification that contains as many relevant variables as possible, and a proportion of irrelevant variables that on average equals the significance level $\alpha$ chosen by the investigator. Put differently, if $\widehat{k}_{\text{rel}}$ and $\widehat{k}_{\text{irr}}$ are the retained number of relevant and irrelevant variables in an empirical application, respectively, then GETS modelling aims at satisfying
%
\begin{equation}\label{eq:asymptotic:gauge:and:potency}
  E\left(\widehat{k}_{\text{rel}}/k_{\text{rel}}\right) \rightarrow 1
  \quad \textnormal{and} \quad
  E\left(\widehat{k}_{\text{irr}}/k_{\text{irr}}\right) \rightarrow \alpha
  \quad \textnormal{as} \quad n\rightarrow\infty,
\end{equation}
%
when $k_{\text{rel}},k_{\text{irr}}>0$. If either $k_{\text{rel}}=0$ or $k_{\text{irr}}=0$, then the targets are modified in natural ways: If $k_{\text{rel}}=0$, then the first target is $E(\widehat{k}_{\text{rel}}) = 0$, and if $k_{\text{irr}}=0$, then the second target is $E(\widehat{k}_{\text{irr}}) = 0$. Sometimes, the irrelevance proportion $\widehat{k}_{\text{irr}}/k_{\text{irr}}$ is also referred to as \textit{gauge}, whereas the relevance proportion $\widehat{k}_{\text{rel}}/k_{\text{rel}}$ is also referred to as \textit{potency}.
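To make the two proportions concrete, the following minimal sketch (plain base R, with hypothetical index vectors of truly relevant and retained regressors) computes the relevance and irrelevance proportions of a single selection:
%
\begin{verbatim}
#a minimal sketch with hypothetical index vectors (not part of the package):
k        <- 20            #total number of regressors
relevant <- 1:5           #indices of the k_rel truly relevant regressors
retained <- c(1, 2, 3, 9) #indices retained by a specification search
k.rel <- length(relevant)
k.irr <- k - k.rel
length(intersect(retained, relevant))/k.rel #relevance proportion (potency)
length(setdiff(retained, relevant))/k.irr   #irrelevance proportion (gauge)
\end{verbatim}
%
Averaging such proportions across the replications of a simulation experiment provides estimates of the expectations in \eqref{eq:asymptotic:gauge:and:potency}.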
In targeting a relevance proportion equal to 1 and an irrelevance proportion equal to $\alpha$, GETS modelling combines well-known ingredients: Multi-path backwards elimination, tests on the $\beta_j$'s (both single and multiple hypothesis tests), diagnostic tests and fit measures (e.g.\ information criteria). Let $V(\widehat{\bbeta})$ denote the estimated coefficient-covariance matrix. GETS modelling in the package \textbf{gets} can be described as proceeding in three steps:\footnote{The exact way GETS modelling is implemented varies across software packages.}
\begin{enumerate}
\item Formulate a General Unrestricted Model (GUM), i.e.\ a starting model, that passes a set of chosen diagnostic tests. A regressor $x_j$ in the GUM is non-significant if the $p$-value of a two-sided $t$-test is larger than the chosen significance level $\alpha$, and each non-significant regressor constitutes the starting point of a backwards elimination path. The test-statistics of the $t$-tests are computed as $\widehat{\beta}_j/se(\widehat{\beta}_j)$, where $se(\widehat{\beta}_j)$ is the square root of the $j$th element of the diagonal of $V(\widehat{\bbeta})$.
\item Undertake backwards elimination along multiple paths by removing, one-by-one, non-significant regressors as determined by the chosen significance level $\alpha$. Each removal is checked for validity against the chosen set of diagnostic tests, and for parsimonious encompassing (i.e.\ a multiple hypothesis test) against the GUM. These multiple hypothesis tests on subsets of $\bbeta$ are implemented as Wald-tests.
\item Multi-path backwards elimination can result in multiple terminal models. The last step of GETS modelling consists of selecting, among the terminal models, the specification with the best fit according to a fit-criterion, e.g.\ the \cite{Schwarz1978} information criterion.
\end{enumerate}

In ISAT methods, the vector $\bx_t$ contains at least $n-1$ indicators in addition to other covariates that are considered. Accordingly, standard estimation methods are infeasible, since the number of variables in $\bx_t$ is usually larger than the number of observations $n$. The solution to this problem provided by ISAT methods is to first organise $\bx_t$ into $B$ blocks: $\bx_t^{\left(1\right)}, \ldots, \bx_t^{\left(B\right)}$. These blocks need not be mutually exclusive, so a variable or subset of variables can appear in more than one block. Next, GETS modelling is applied to each block, which leads to $B$ final models. Finally, a new round of GETS modelling is undertaken with the union of the retained variables from the $B$ blocks as covariates in a new starting model (i.e.\ a new GUM). The model selection properties targeted by ISAT methods are the same as those of GETS methods. Note, however, that since the starting model (the GUM) contains at least $n-1$ regressors, a tiny significance level -- e.g.\ $\alpha=0.001$ or smaller -- is usually recommended in ISAT methods.

\section{General principles \label{sec:general:principles} }

In the current version of the package \textbf{gets}, version 0.29, the functions that admit user-specified estimation are \code{arx()}, \code{getsFun()}, \code{blocksFun()} and \code{isat()}. The user-specification principles are the same in all four.
However, if the result (i.e.\ a \code{list}) returned from the user-specified estimator does not have the same structure as that returned from the default estimator \code{ols()} (part of the \textbf{gets} package), then \code{arx()} and \code{isat()} may not always work as expected. This is particularly the case with respect to their extraction functions, e.g.\ \code{print()}, \code{coef()}, \code{residuals()} and \code{predict()}.

User-specified diagnostics and goodness-of-fit functions are optional. By default, \code{getsFun()}, \code{blocksFun()} and \code{isat()} do not perform any diagnostic tests, whereas the default in \code{arx()}, \code{getsm()} and \code{getsv()} is to test the standardised residuals for autocorrelation and Autoregressive Conditional Heteroscedasticity (ARCH). This is implemented via the \code{diagnostics()} function (part of the \textbf{gets} package). Also by default, all four functions use the \cite{Schwarz1978} information criterion as goodness-of-fit measure, which favours parsimony, via the \code{infocrit()} function (part of the \textbf{gets} package).

\subsection{The \code{getsFun()} function \label{subsec:the:getsFun:function} }

The recommended, most flexible and computationally most efficient approach to user-specified GETS modelling is via the \code{getsFun()} function. Currently, it accepts up to twenty-five arguments. For the details of all these arguments, the reader is referred to the discussion of the \code{getsm()} function (Section 5) in \cite{PretisReadeSucarrat2018}, and the help pages of \code{getsFun()} (type \code{?getsFun}). For the purpose of user-specified estimation, user-specified diagnostics and user-specified goodness-of-fit measures, the most important arguments are:
%
\begin{verbatim}
getsFun(y, x, user.estimator = list(name = "ols"),
  user.diagnostics = NULL,
  gof.function = list(name = "infocrit", method = "sc"),
  gof.method = c("min", "max"), ...)
\end{verbatim}
%
Here, \code{y} is the left-hand side variable (the regressand), \code{x} is the regressor or design matrix, \code{user.estimator} controls which estimator or model to use and which further arguments -- if any -- are passed on to the estimator, \code{user.diagnostics} controls the user-specified diagnostics if any, and \code{gof.function} and \code{gof.method} control the goodness-of-fit measure used. Note that \code{y} and \code{x} should satisfy \code{is.vector(y) == TRUE} and \code{is.matrix(x) == TRUE}, respectively, and enter in ``clean'' ways: If either \code{y} or \code{x} are objects of class, say, \code{"ts"} or \code{"zoo"}, then \code{getsFun()} may not behave as expected. By default, the estimator \code{ols()} is used with its default arguments, which implements OLS estimation via the \code{qr()} function. The value \code{NULL} on \code{user.diagnostics} means no diagnostic checks are undertaken by default. The following code illustrates \code{getsFun()} in linear regression (the default), and reproduces the information printed while searching:
%
\begin{verbatim}
n <- 40                        #number of observations
k <- 20                        #number of Xs
set.seed(123)                  #for reproducibility
y <- rnorm(n)                  #generate Y
x <- matrix(rnorm(n*k), n, k)  #create matrix of Xs

#do gets w/default estimator ols(), store output in 'result':
result <- getsFun(y, x)

#the information printed during searching:
18 path(s) to search
Searching: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
\end{verbatim}
%
The object named \code{result} is a list, and the code \code{summary(result)} returns a summary of its contents.
The most important items are:
%
\begin{itemize}

\item \code{paths}: A list of vectors containing the searched paths. Each vector (i.e.\ path) indicates the sequence of deletion of the regressors. In the example above the first path is
%
\begin{verbatim}
$paths[[1]]
[1] 1 15 6 7 3 14 11 16 4 2 8 12 5 9 20 19 13
\end{verbatim}
%
That is, regressor no.\ 1 was the first to be deleted, regressor no.\ 15 was the second, regressor no.\ 6 was the third, and so on. If the regressors in \code{x} were named, then a name-representation of the first deletion path can be obtained with \code{colnames(x)[ result\$paths[[1]] ]}.

\item \code{terminals}: A list of vectors with the distinct terminal models of the specification search. In the example above it is equal to
%
\begin{verbatim}
$terminals
$terminals[[1]]
[1] 10 17 18

$terminals[[2]]
[1] 10 18
\end{verbatim}
%
That is, two terminal models. The first contains regressors 10, 17 and 18, whereas the second contains regressors 10 and 18.

\item \code{terminals.results}: A data frame with the goodness-of-fit information of the terminal models. In the above example the entry is equal to:
%
\begin{verbatim}
$terminals.results
         info(sc)      logl  n k
spec 1:  2.514707 -44.76081 40 3
spec 2:  2.529923 -46.90958 40 2
\end{verbatim}
%
\code{spec 1} is short for specification 1, i.e.\ terminal model 1, and \code{spec 2} is short for specification 2, i.e.\ terminal model 2. \code{info(sc)} indicates that the \cite{Schwarz1978} criterion (the default) is used as goodness-of-fit measure, whereas \code{n} and \code{k} denote the number of observations and parameters, respectively.

\item \code{best.terminal}: An integer that indicates which terminal model is the best according to the goodness-of-fit criterion used. In the example above the value is 1.

\item \code{specific.spec}: A vector of integers that indicates which regressors are contained in the best terminal model. In the above example it is
%
\begin{verbatim}
$specific.spec
[1] 10 17 18
\end{verbatim}
%
That is, the best terminal model contains regressors no.\ 10, 17 and 18.

\end{itemize}

\subsection{User-specified estimation \label{subsec:user-specified:estimation} }

User-specified estimation is carried out via the \code{user.estimator} argument. It must be a list containing at least one entry -- a character -- named \code{name} with the name of the estimator to be called. Optionally, the list can also contain an item named \code{envir}, a character, which indicates the environment in which the user-specified estimator resides. If unspecified, then the function is looked for in the usual way by R. Additional entries in the list, if any, are passed on to the estimator as arguments. The user-specified estimator must also satisfy the following:
%
\begin{enumerate}

\item It should be of the form \code{myEstimator(y, x)} or \code{myEstimator(y, x, ...)}, where \code{y} is a vector, \code{x} is a matrix and \code{...} means the user-estimator can accept further arguments (optional) that are passed on to the estimator. In other words, while the name of the function is arbitrary, the first argument should be the regressand and the second a matrix (e.g.\ of covariates).
\item The user-defined estimator should return a \code{list} with a minimum of six items:
\begin{itemize}
  \item \code{n} (the number of observations)
  \item \code{k} (the number of coefficients)
  \item \code{df} (degrees of freedom, used in the $t$-tests)
  \item \code{coefficients} (a vector with the coefficient estimates)
  \item \code{vcov} (the coefficient covariance matrix)
  \item \code{logl} (a goodness-of-fit value, e.g.\ the log-likelihood)
\end{itemize}
%
The items need not appear in this order. However, the naming should be exactly as indicated. If also the diagnostics and/or the goodness-of-fit criterion is user-specified, then additional objects may be required, see the subsections below on user-specified diagnostics and goodness-of-fit criteria. Note also that, if the goodness-of-fit criterion is user-specified, then \code{logl} can in certain situations be replaced by another item that need not be named \code{logl}.

\item The user-defined estimator must be able to handle \code{NULL} regressor-matrices, i.e.\ situations where either \code{NCOL(x)} is 0 or \code{is.null(x)} is \code{TRUE}. This is needed in situations where a terminal model is empty (i.e.\ no regressors are retained).

\end{enumerate}
%
To illustrate how the requirements above can be met in practice, suppose -- as an example -- that we would like to use the function \code{lm()} for estimation rather than \code{ols()}. The first step is then to make an ``interface'' function that calls \code{lm()} while satisfying requirements 1 to 3:
%
\begin{verbatim}
lmFun <- function(y, x, ...){

  ##create list:
  result <- list()

  ##n, k and df:
  result$n <- length(y)
  if( is.null(x) || NCOL(x) == 0 ){
    result$k <- 0
  }else{
    result$k <- NCOL(x)
  }
  result$df <- result$n - result$k

  ##call lm if k > 0:
  if( result$k > 0){
    tmp <- lm(y ~ x - 1)
    result$coefficients <- coef(tmp)
    result$vcov <- vcov(tmp)
    result$logl <- as.numeric(logLik(tmp))
  }else{
    result$coefficients <- NULL
    result$vcov <- NULL
    result$logl <- sum(dnorm(y, sd = sqrt(var(y)), log = TRUE))
  }

  ##return result:
  return(result)

}
\end{verbatim}
%
Next, the code
%
\begin{verbatim}
getsFun(y, x, user.estimator = list(name = "lmFun"))
\end{verbatim}
%
undertakes the same specification search as earlier, but uses \code{lmFun()} rather than \code{ols()}.

\subsection{User-specified diagnostics \label{subsec:user-specified:diagnostics} }

User-specified diagnostics is carried out via the \code{user.diagnostics} argument. By default, the argument is \code{NULL}, so no diagnostic tests are undertaken. To carry out user-specified diagnostic tests, the argument must be a list containing at least two entries: A character named \code{name} containing the name of the diagnostics function to be called, and an entry named \code{pval} that contains a vector with values between 0 and 1, i.e.\ the chosen significance level(s) for the diagnostic test(s).\footnote{A word of caution is required. Let $R^{\left(i\right)}$ denote the event of rejecting the null, under the null, for a significance level $\alpha^{\left(i\right)}$ in diagnostic test $i$. For example, if only a single test is undertaken so that $i=1$, then $Pr(R^{\left(1\right)}) = \alpha^{\left(1\right)}$. If two tests are undertaken, however, then the probability of rejecting in one or both tests is $Pr\left(R^{\left(1\right)} \cup R^{\left(2\right)}\right)$.
As a rule of thumb, therefore, to control the overall error, it is recommended that the significance level of each diagnostic test is set equal to $\alpha/m$, where $\alpha$ is the overall or total significance level targeted by the user, and $m$ is the number of diagnostic tests. This is sometimes referred to as a Bonferroni correction.} If only a single test is undertaken by the diagnostics function, then \code{pval} should be of length one. If two tests are undertaken, then \code{pval} should be of length two. And so on. An example of the argument when only a single test is undertaken is:
%
\begin{verbatim}
user.diagnostics = list(name = "myDiagnostics", pval = 0.05)
\end{verbatim}
%
That is, the name of the function is \code{myDiagnostics}, and the chosen significance level for the single test that is carried out is 5\%. Optionally, just as when the estimator is user-specified, the list can contain an item named \code{envir}, a character, which indicates the environment in which the user-specified diagnostics function resides. Additional items in the list, if any, are passed on to the user-specified function as arguments. The user-specified diagnostics function must satisfy the following:
%
\begin{enumerate}

\item It should be of the form \code{myDiagnostics(result)} or \code{myDiagnostics(result, ...)}, where \code{result} is the list returned from the estimator in question, e.g.\ that of the user-specified estimator (recall requirement 2 in the previous section above), and \code{...} means the function can accept further arguments (optional) that are passed on.

\item It should return an $m \times 3$ matrix that contains the $p$-value(s) of the test(s) in the third column, where $m\geq 1$ is the number of tests carried out. So if only a single test is carried out, then $m=1$ and the $p$-value should be contained in the third column. An example could look like:
%
\begin{verbatim}
          statistic df   pval
normality        NA NA 0.0734
\end{verbatim}
%
Note that the row-names and column-names in the example are not required. However, they do indicate what kind of information you may wish to put there for reporting purposes, e.g.\ by using the function \code{diagnostics()} (part of the \textbf{gets} package).

\end{enumerate}
%
To illustrate how the requirements can be met in practice, suppose we would like to ensure the residuals are normal by testing for non-normality with the Shapiro-Wilk test function \code{shapiro.test()}. In this context, its main argument is the residuals of the estimated model. The list returned by the user-defined estimator named \code{lmFun()} above, however, does not contain an item with the residuals. The first step, therefore, is to modify the estimator \code{lmFun()} so that the returned list also contains the residuals:
%
\begin{verbatim}
lmFun <- function(y, x, ...){

  ##info needed for estimation:
  result <- list()
  result$n <- length(y)
  if( is.null(x) || NCOL(x)==0 ){
    result$k <- 0
  }else{
    result$k <- NCOL(x)
  }
  result$df <- result$n - result$k
  if( result$k > 0){
    tmp <- lm(y ~ x - 1)
    result$coefficients <- coef(tmp)
    result$vcov <- vcov(tmp)
    result$logl <- as.numeric(logLik(tmp))
  }else{
    result$coefficients <- NULL
    result$vcov <- NULL
    result$logl <- sum(dnorm(y, sd=sqrt(var(y)), log=TRUE))
  }

  ##residuals:
  if( result$k > 0){
    result$residuals <- residuals(tmp)
  }else{
    result$residuals <- y
  }

  ##return result:
  return(result)

}
\end{verbatim}
%
Computationally, the only modification appears under \verb|##residuals|.
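Before wiring the modified estimator into \code{getsFun()}, it can be useful to check its output directly. A quick sketch, reusing the \code{y} and \code{x} generated earlier:
%
\begin{verbatim}
#a quick check of the modified estimator (a sketch, not required):
tmp <- lmFun(y, x)
names(tmp)                      #should now also contain "residuals"
length(tmp$residuals) == tmp$n  #should be TRUE
\end{verbatim}
%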
We can now make the user-specified diagnostics function:
%
\begin{verbatim}
myDiagnostics <- function(x, ...){
  tmp <- shapiro.test(x$residuals) #do the test
  result <- rbind( c(tmp$statistic, NA, tmp$p.value) )
  return(result)
}
\end{verbatim}
%
The following code undertakes GETS modelling with the user-specified estimator defined above, and the user-specified diagnostics function using a 5\% significance level for the latter:
%
\begin{verbatim}
getsFun(y, x, user.estimator = list(name = "lmFun"),
  user.diagnostics = list(name = "myDiagnostics", pval = 0.05))
\end{verbatim}
%
Note that if the chosen significance level for the diagnostics is sufficiently high, then no specification search is undertaken because the starting model does not pass the non-normality test. With the current data, for example, a little bit of trial and error reveals this is the case for a level of about \code{pval = 0.35}.

\subsection{User-specified goodness-of-fit \label{subsec:user-specified:gof} }

User-specified goodness-of-fit is carried out with the \code{gof.function} and \code{gof.method} arguments. The former indicates which Goodness-of-Fit (GOF) function to use, and the latter is a character that indicates whether the best model maximises (\code{"max"}) or minimises (\code{"min"}) the GOF criterion in question. The argument \code{gof.function} is a list with a structure similar to earlier: It must contain at least one entry, a character named \code{name}, with the name of the GOF function to call. An example is:
%
\begin{verbatim}
gof.function = list(name = "myGof")
\end{verbatim}
%
Optionally, also here the list can contain an item named \code{envir}, a character, which indicates the environment in which the user-specified GOF function resides. Also as earlier, additional items in the list are passed on to the user-specified GOF function as arguments. The default value, for example, \code{gof.function = list(name = "infocrit", method = "sc")}, means the argument \code{method = "sc"} is passed on to the function \code{infocrit()}; see the help pages of \code{infocrit()} (type \code{?infocrit}) for information on the methods available via this function. The user-specified GOF function must satisfy the following:
%
\begin{enumerate}

\item It should be of the form \code{myGof(result)} or \code{myGof(result, ...)}, where \code{result} is the list returned from the estimator in question, e.g.\ that of the user-specified estimator, and \code{...} means the function can accept further arguments (optional) that are passed on.

\item It should return a single numeric value, i.e.\ the value of the GOF measure in question.

\end{enumerate}
%
To illustrate how the requirements can be met in practice, suppose we would like to use the adjusted $R^2$ as our GOF measure in combination with our user-defined estimator. For the moment, the user-defined estimator \code{lmFun()} does not contain the information necessary to compute the adjusted $R^2$. In particular, it lacks the regressand \code{y}.
However, this is readily added:
%
\begin{verbatim}
lmFun <- function(y, x, ...){

  ##info needed for estimation:
  result <- list()
  result$n <- length(y)
  if( is.null(x) || NCOL(x)==0 ){
    result$k <- 0
  }else{
    result$k <- NCOL(x)
  }
  result$df <- result$n - result$k
  if( result$k > 0){
    tmp <- lm(y ~ x - 1)
    result$coefficients <- coef(tmp)
    result$vcov <- vcov(tmp)
    result$logl <- as.numeric(logLik(tmp))
  }else{
    result$coefficients <- NULL
    result$vcov <- NULL
    result$logl <- sum(dnorm(y, sd=sqrt(var(y)), log=TRUE))
  }

  ##residuals:
  if( result$k > 0){
    result$residuals <- residuals(tmp)
  }else{
    result$residuals <- y
  }

  ##info needed for r-squared:
  result$y <- y

  ##return result:
  return(result)

}
\end{verbatim}
%
The added part appears under \verb|##info needed for r-squared|. A GOF function that returns the adjusted $R^2$ is:
%
\begin{verbatim}
myGof <- function(object, ...){
  TSS <- sum((object$y - mean(object$y))^2)
  RSS <- sum(object$residuals^2)
  Rsquared <- 1 - RSS/TSS
  result <- 1 - (1 - Rsquared) * (object$n - 1)/(object$n - object$k)
  return(result)
}
\end{verbatim}
%
The following code undertakes GETS modelling with all three user-specified functions defined so far:
%
\begin{verbatim}
getsFun(y, x, user.estimator = list(name = "lmFun"),
  user.diagnostics = list(name = "myDiagnostics", pval = 0.05),
  gof.function = list(name = "myGof"), gof.method = "max")
\end{verbatim}
%
Incidentally, it leads to the same final model as when the default GOF function is used.

\subsection{The \code{blocksFun()} function \label{subsec:the:blocksFun:function} }

The \code{blocksFun()} function was added to the package \textbf{gets} in version 0.24 (July 2020). It enables block-based GETS modelling with user-specified estimator, diagnostics and goodness-of-fit criteria, and one of its main attractions is that it can handle more variables than observations $n$. This is not possible in \code{getsFun()}. Currently, \code{blocksFun()} accepts up to 31 arguments, most of which are the same as those of \code{getsFun()} (type \code{?blocksFun} for the details). In particular, the main principles outlined above in Sections \ref{subsec:the:getsFun:function} to \ref{subsec:user-specified:gof} also apply to \code{blocksFun()}. The following code illustrates the basic usage of \code{blocksFun()} in a situation where the number of regressors is larger than the number of observations:
%
\begin{verbatim}
n <- 40                        #number of observations
k <- 60                        #number of Xs
set.seed(123)                  #for replicability
y <- rnorm(n)                  #generate Y
x <- matrix(rnorm(n*k), n, k)  #create matrix of Xs

#do block-based gets w/default estimator ols(), store output in 'result':
result <- blocksFun(y, x)

#the information printed during searching:
x block 1 of 2:
30 path(s) to search
Searching: 1 2 3 ... 29 30
x block 2 of 2:
30 path(s) to search
Searching: 1 2 3 ... 29 30
\end{verbatim}
%
The printed information during search is from the first round of the block-based GETS modelling. The information states that the original regressor matrix $\bx_t$ was divided into 2 blocks, and then GETS modelling was applied to each of the blocks. An internal algorithm is used to split $\bx_t$, and the algorithm can be tweaked via the arguments \code{blocks}, \code{no.of.blocks}, \code{max.block.size} and \code{ratio.threshold}; type \code{?blocksFun} for the details. For complete control of the block composition, use the argument \code{blocks}.
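For instance, to ask the internal algorithm for a different number of blocks than the one it chooses itself, the \code{no.of.blocks} argument mentioned above can be set directly. A usage sketch with the data from the example (see \code{?blocksFun} for the exact behaviour of the block-related arguments):
%
\begin{verbatim}
#a sketch: request (approximately) four blocks from the internal algorithm
result4 <- blocksFun(y, x, no.of.blocks = 4)
\end{verbatim}
%
The block composition that was actually used is returned together with the other results, as described next.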
In the list returned by \code{blocksFun()}, i.e.\ the object named \code{result} in the example above, the exact composition of the blocks produced by the internal algorithm is contained in the entry named \code{blocks}. Just as for \code{getsFun()}, the final specification is contained in the entry named \code{specific.spec}. Note, however, that here the entry is a list rather than a vector. The argument \code{x} can also be specified as a list of matrices, for example:
%
\begin{verbatim}
xlist <- list(x1=x[,5:30], x2=x[,26:50])
blocksFun(y, xlist)
\end{verbatim}
%
This is useful when it makes sense to group subsets of variables together. Note that, as the example indicates, one or more regressors can enter more than one matrix or ``group''.

\subsection{More speed: \code{turbo}, \code{max.paths} and parallel computing}

In multi-path backwards elimination search, one may frequently arrive at a specification that has already been estimated and tested. As an example, consider the following two paths:
%
\begin{verbatim}
$paths[[1]]
[1] 2 4 3 1 5

$paths[[2]]
[1] 4 2 3 1 5
\end{verbatim}
%
In path 1, i.e.\ \code{paths[[1]]}, regressor no.\ 2 is the first to be deleted, regressor no.\ 4 is the second, and so on. In path 2, regressor no.\ 4 is the first to be deleted, regressor no.\ 2 is the second, and so on. In other words, after the deletion of the first two variables, the set of remaining variables (i.e.\ 3, 1 and 5) in the two paths is identical. Accordingly, knowing the result from the first path, in path 2 it is unnecessary to proceed further after having deleted the first two regressors. Setting the argument \code{turbo} equal to \code{TRUE} turns such a check on, and thus skips searches, estimations and tests that are unnecessary. The turbo comes at a small computational cost (often less than 1 second), since the check is undertaken at each deletion. This is why the default is \code{turbo = FALSE} in \code{getsFun()}. However, if the estimation time is noticeable, then turning the turbo on can reduce the search time substantially. As a rule of thumb, if each estimation takes 1 second or more, then turning the turbo on will (almost) always reduce the total search time.

Searching more paths may increase the relevance proportion or potency. Whether and to what extent this happens depends on the sample size $n$, and on the degree of multicollinearity among the regressors $\bx_t$. If $n$ is sufficiently large, or if the regressors are sufficiently uncorrelated, then searching fewer paths will not reduce the relevance proportion. In many situations, therefore, one may consider reducing the number of paths to increase the speed. This is achieved via the \code{max.paths} argument. Setting \code{max.paths = 10}, for example, means a maximum of 10 paths is searched. The paths that are searched are those of the 10 most insignificant variables (i.e.\ those with the highest $p$-values) in the starting model.

The functions \code{blocksFun()} and \code{isat()} implement a two-round version of block-based GETS modelling. In the first round the regressors $\bx_t$ are split into $B$ blocks, and then GETS modelling is undertaken on each block. This is a so-called ``embarrassingly parallel'' problem. To make \code{isat()} search in parallel during the first round, simply set the argument \code{parallel.options} equal to an integer greater than 1.
The integer determines how many cores/threads to use, and the command \code{detectCores()} can be used to find out how many cores/threads are available on the current machine. Remember, it is not recommended to use all the cores/threads available. Within \code{blocksFun()} and \code{isat()}, parallel computing is implemented with the \code{makeCluster()} and \code{parApply()} functions from the package \textbf{parallel}. If the time required by \code{makeCluster()} to set up parallel computing is negligible relative to the total computing time (on an average computer the setup time is about 1 second), then the total computing time may -- in optimal situations -- be reduced by a factor of about $m-0.8$, where $m>1$ is the number of cores/threads used for parallel computing.

\section{Illustrations}

\subsection{GETS modelling of Generalised Linear Models (GLMs)}

The function \code{glm()} enables the estimation of a large number of specifications within the class of Generalised Linear Models (GLMs). Here, it is illustrated how GETS modelling can be implemented with GLMs. To fix ideas, the illustration is in terms of the logit-model. Let $y_t \in \left\{0,1\right\}$ denote the regressand of the logit-model given by
%
\begin{equation}
  Pr\left(y_t=1|\bx_t\right) = \frac{1}{1+\exp\left(-h_t\right)},
  \qquad h_t = \bbeta'\bx_t.
\end{equation}
%
Consider the following set of data:
%
\begin{verbatim}
n <- 40                        #number of observations
k <- 20                        #number of Xs
set.seed(123)                  #for reproducibility
y <- round(runif(40))          #generate Y
x <- matrix(rnorm(n*k), n, k)  #create matrix of Xs
\end{verbatim}
%
In other words, the regressand $y_t \in \left\{0,1\right\}$ is entirely independent of the 20 regressors in $\bx_t$. The following interface function enables GETS modelling of logit-models:
%
\begin{verbatim}
logitFun <- function(y, x, ...){

  ##create list:
  result <- list()

  ##n, k and df:
  result$n <- length(y)
  if( is.null(x) || NCOL(x)==0 ){
    result$k <- 0
  }else{
    result$k <- NCOL(x)
  }
  result$df <- result$n - result$k

  ##call glm if k > 0:
  if( result$k > 0){
    tmp <- glm(y ~ x - 1, family = binomial(link="logit"))
    result$coefficients <- coef(tmp)
    result$vcov <- vcov(tmp)
    result$logl <- as.numeric(logLik(tmp))
  }else{
    result$coefficients <- NULL
    result$vcov <- NULL
    result$logl <- result$n*log(0.5)
  }

  ##return result:
  return(result)

}
\end{verbatim}
%
To undertake GETS modelling:
%
\begin{verbatim}
getsFun(y, x, user.estimator=list(name="logitFun"))
\end{verbatim}
%
Two variables are retained, namely $x_{5t}$ and $x_{11t}$, at the default significance level of 5\% (i.e.\ \code{t.pval = 0.05}). To reduce the chance of retaining irrelevant variables, the significance level can be lowered to, say, 1\% by setting \code{t.pval = 0.01}. To implement GETS modelling for a different GLM, only two lines of code must be modified in the user-defined function above. The first is the line that specifies the \code{family}, and the second is the one that contains the log-likelihood associated with the empty model (i.e.\ the line \code{result\$logl <- result\$n*log(0.5)}).

\subsection{Creating a \code{gets()} method (S3) for a model of class \code{"lm"}}

The package \textbf{gets} provides the generic function \code{gets()}. This enables the creation of GETS methods (S3) for models of arbitrary classes (type \code{?S3Methods} for more info on S3 methods). Here, this is illustrated for models of class \code{"lm"}.
Since version 0.28 (August 2021), the \textbf{gets} package provides a method \code{gets()} for models of class \code{"lm"} (type \code{?gets.lm} for the details). So the example outlined here is purely for illustration. Suppose that a method \code{gets()} for models of class \code{"lm"} does not exist, and that our aim is to be able to do the following:
%
\begin{verbatim}
mymodel <- lm(y ~ x)
gets(mymodel)
\end{verbatim}
%
That is, to first estimate a model of class \code{"lm"} where \code{x} is a matrix of regressors, and then to conveniently undertake GETS modelling by simply applying the code \code{gets(.)} to the named object \code{mymodel}. To this end, a function named \code{gets.lm()} that relies on \code{getsFun()} must be created. In doing so, a practical aspect is how to deal with the intercept in the code. Indeed, as we will see, a notable part of the user-defined function is devoted to the intercept. The reason is that \code{lm()} includes an intercept by default. Another practical aspect is whether to use \code{lm()} or \code{ols()} whenever a model is estimated by OLS (both employ the QR decomposition). The latter leads to simpler code, so we opt for \code{ols()} here.\footnote{The function \code{gets.lm()} that ships with \textbf{gets} since version 0.28 employs \code{lm()} only, not \code{ols()}.} The function is:
%
\begin{verbatim}
gets.lm <- function(object, ...){

  ##make y:
  y <- as.vector(object$model[, 1])
  yName <- names(object$model)[1]

  ##make x:
  x <- as.matrix(object$model[, -1])
  xNames <- colnames(x)
  if(NCOL(x) == 0){
    x <- NULL
    xNames <- NULL
  }else{
    if(is.null(xNames)){
      xNames <- paste0("X", 1:NCOL(x))
      colnames(x) <- xNames
    }
  }

  ##is there an intercept?:
  if(length(coef(object)) > 0){
    cTRUE <- names(coef(object))[1] == "(Intercept)"
    if(cTRUE){
      x <- cbind(rep(1, NROW(y)), x)
      xNames <- c("(Intercept)", xNames)
      colnames(x) <- xNames
    }
  }

  ##do gets:
  myspecific <- getsFun(y, x, ...)

  ##which are the retained regressors?:
  retainedXs <- xNames[myspecific$specific.spec]
  cat("Retained regressors:\n ", retainedXs, "\n")

  ##return result
  return(myspecific)

}
\end{verbatim}
%
Next, recall the Data Generation Process (DGP) of the first experiment:
%
\begin{verbatim}
n <- 40                        #number of observations
k <- 20                        #number of Xs
set.seed(123)                  #for reproducibility
y <- rnorm(n)                  #generate Y
x <- matrix(rnorm(n*k), n, k)  #create matrix of Xs
\end{verbatim}
%
We can now do GETS modelling on models of class \code{"lm"} by simply applying \code{gets()} to the object in question. As an example, the following code first stores an estimated model of class \code{"lm"} in an object named \code{startmodel}, and then applies the newly defined function \code{gets.lm()} to it:
%
\begin{verbatim}
startmodel <- lm(y ~ x)
finallm <- gets(startmodel)
\end{verbatim}
%
The information from the specification search is stored in the object called \code{finallm}, and during the search the following is printed:
%
\begin{verbatim}
18 path(s) to search
Searching: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
Retained regressors:
  X10 X17 X18
\end{verbatim}
%
In other words, the retained regressors are no.\ 10, 17 and 18. Note that, since no \code{keep} argument is passed on to \code{getsFun()}, the intercept is not excluded from deletion; in this example it is not retained. For a comparison with the function \code{gets.lm()} available in the \textbf{gets} package since version 0.28, type \code{gets::gets.lm(startmodel)}.
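If a final estimated model of class \code{"lm"} is desired, the entry \code{specific.spec} in the returned list can be used to re-estimate the retained specification with \code{lm()}. A sketch, reusing the objects from the example (the indices in \code{specific.spec} refer to the regressor matrix built inside \code{gets.lm()}, whose first column is the intercept):
%
\begin{verbatim}
#a sketch: re-estimate the final model with lm() using the retained regressors;
#the matrix X below is numerically the same as the one built inside gets.lm():
X <- cbind("(Intercept)" = 1, x)
finalfit <- lm(y ~ X[, finallm$specific.spec] - 1)
summary(finalfit)
\end{verbatim}
%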
\subsection{Regression with ARMA error}

The function \code{arima()} can be used to estimate a linear regression with deterministic regressors and an error term that follows an ARMA process. An example is
%
\begin{equation*}
  y_t = \bbeta'\bx_{t} + \epsilon_t, \qquad
  \epsilon_t = \phi_1\epsilon_{t-1} + \theta_1 u_{t-1} + u_t, \qquad
  u_t\sim WN\left(0,\sigma_u^2\right),
\end{equation*}
%
where $\bx_t$ is a vector of deterministic regressors and $WN$ is short for White Noise. The error $\epsilon_t$ is thus governed by an ARMA(1,1). Let $x_t$ denote a (deterministic) step-shift variable in which the step-shift occurs at observation 30, i.e.\ $x_t = 1\left(t\geq30\right)$. Next, consider the DGP given by
%
\begin{equation}\label{eq:dgp:of:location-shift}
  y_t = 4x_t + \epsilon_t, \qquad
  \epsilon_t = 0.4\epsilon_{t-1} + 0.1 u_{t-1} + u_t, \qquad
  u_t \sim N\left(0,1\right), \qquad t=1,\ldots,n
\end{equation}
%
with $n=60$. In other words, the series $y_t$ is non-stationary and characterised by a large location shift at $t=30$. Figure \ref{figure:step-shift} illustrates the evolution of $y_t$, which is generated with the following code:
%
\begin{verbatim}
set.seed(123)                                  #for reproducibility
eps <- arima.sim(list(ar = 0.4, ma = 0.1), 60) #epsilon
x <- coredata(sim(eps, which.ones = 30))       #step-dummy at t = 30
y <- 4*x + eps                                 #the dgp
plot(y, ylab="y", xlab="t", lwd = 2)
\end{verbatim}
%
\begin{figure}[htbp]
\centering
\includegraphics[scale=0.5]{step-shift}
\caption{The graph of $y_t$ as given in \eqref{eq:dgp:of:location-shift}.}
\label{figure:step-shift}
\end{figure}
%
By just looking at the graph, it seems clear that there is a location shift, but it is not so clear that it occurs exactly at $t=30$. I now illustrate how the \code{arima()} function can be used in combination with \code{getsFun()} to automatically search for where the break occurs. The idea is to do GETS modelling over a set or block of step-indicators that cover the period in which the break appears to occur. Specifically, the aim is to apply GETS modelling to the following starting model with 11 regressors:
%
\begin{equation*}
  y_t = \sum_{i=1}^{11} \beta_i \cdot 1_{ \left\{ t \geq 24+i \right\} } + \epsilon_t,
  \qquad \epsilon_t = \phi_1 \epsilon_{t-1} + \theta_1 u_{t-1} + u_t.
\end{equation*}
%
To this end, we first need to make the user-specified estimator:
%
\begin{verbatim}
myEstimator <- function(y, x){

  ##create list:
  result <- list()

  ##estimate model:
  if( is.null(x) || NCOL(x)==0 ){
    result$k <- 0
    tmp <- arima(y, order = c(1,0,1)) #empty model
  }else{
    result$k <- NCOL(x)
    tmp <- arima(y, order = c(1,0,1), xreg = x)
    result$coefficients <- tmp$coef[-c(1:3)]
    result$vcov <- tmp$var.coef
    result$vcov <- result$vcov[-c(1:3),-c(1:3)]
  }

  ##rename and re-organise things:
  result$n <- tmp$nobs
  result$df <- result$n - result$k
  result$logl <- tmp$loglik

  #return result:
  return(result)

}
\end{verbatim}
%
Note that the estimator has been put together such that the ARMA(1,1) specification of the error $\epsilon_t$ is fixed. As a consequence, the specification search is only over the regressors. The following code first creates the 11 step dummies, and then undertakes the GETS modelling:
%
\begin{verbatim}
xregs <- coredata(sim(eps, which.ones = 25:35)) #11 step-dummies
getsFun(y, xregs, user.estimator = list(name = "myEstimator"))
\end{verbatim}
%
Two step-dummies are retained, namely those of $t=30$ and $t=35$.

\subsection{Faster ISAT with large datasets}

Block-based GETS modelling and ISAT methods in particular are computationally intensive.
This is particularly the case when the number of observations $n$ is large in ISAT methods, since at least $n-1$ indicators are included as regressors. Accordingly, as $n$ grows large, purpose-specific estimators can greatly reduce the computing time. One way of building such an estimator is by using tools from the package \textbf{Matrix}, see \citet{BatesMaechler2018}. The code below illustrates this. First it loads the library, and then it creates a function named \code{olsFaster()} that reproduces the structure of the estimation result returned by the function \code{ols()} with \code{method = 3} (i.e.\ OLS with the ordinary coefficient-covariance matrix), but with functions from \textbf{Matrix}. The code is:
%
\begin{verbatim}
library(Matrix)
olsFaster <- function(y, x){
  out <- list()
  out$n <- length(y)
  if (is.null(x)){ out$k <- 0 }else{ out$k <- NCOL(x) }
  out$df <- out$n - out$k
  if (out$k > 0) {
    x <- as(x, "dgeMatrix")
    out$xpy <- crossprod(x, y)
    out$xtx <- crossprod(x)
    out$coefficients <- as.numeric(solve(out$xtx,out$xpy))
    out$xtxinv <- solve(out$xtx)
    out$fit <- as.vector(x %*% out$coefficients)
  }else{
    out$fit <- rep(0, out$n)
  }
  out$residuals <- y - out$fit
  out$residuals2 <- out$residuals^2
  out$rss <- sum(out$residuals2)
  out$sigma2 <- out$rss/out$df
  if (out$k > 0) { out$vcov <- as.matrix(out$sigma2 * out$xtxinv) }
  out$logl <- -out$n * log(2 * out$sigma2 * pi)/2 - out$rss/(2 * out$sigma2)
  return(out)
}
\end{verbatim}
%
Depending on the data and hardware/software configuration, the estimator may lead to a considerable speed improvement. In the following example, the function \code{system.time()} suggests a speed improvement of about 20\% on the current hardware/software configuration:
%
\begin{verbatim}
set.seed(123) #for reproducibility
y <- rnorm(1000)
x <- matrix(rnorm(length(y)*20), length(y), 20)

#with ols():
system.time( finalmodel <- isat(y, mxreg = x, max.paths = 5) )

#with olsFaster():
system.time( finalmodel <- isat(y, mxreg = x, max.paths = 5,
  user.estimator = list(name = "olsFaster")) )
\end{verbatim}

\clearpage

%\addcontentsline{toc}{section}{References}
%%\renewcommand\refname{References}
%\bibliographystyle{chicago}
%\bibliography{C:/tex/gs041004}

\begin{thebibliography}{}
\addcontentsline{toc}{section}{References}

\bibitem[\protect\citeauthoryear{Bates and Maechler}{Bates and Maechler}{2018}]{BatesMaechler2018}
Bates, D. and M.~Maechler (2018).
\newblock {\em {M}atrix: {S}parse and {D}ense {M}atrix {C}lasses and {M}ethods}.
\newblock R package version 1.2-15.

\bibitem[\protect\citeauthoryear{Campos, Hendry, and Ericsson}{Campos et~al.}{2005}]{CamposEricssonHendry2005}
Campos, J., D.~F. Hendry, and N.~R. Ericsson (Eds.) (2005).
\newblock {\em General-to-{S}pecific {M}odeling. {V}olumes 1 and 2}.
\newblock Cheltenham: Edward Elgar Publishing.

\bibitem[\protect\citeauthoryear{Castle, Doornik, Hendry, and Pretis}{Castle et~al.}{2015}]{CastleDoornikHendryPretis2015}
Castle, J., J.~Doornik, D.~F. Hendry, and F.~Pretis (2015).
\newblock {D}etecting {L}ocation {S}hifts {D}uring {M}odel {S}election by {S}tep-{I}ndicator {S}aturation.
\newblock {\em Econometrics\/}~{\em 3}, 240--264.
\newblock DOI: https://doi.org/10.3390/econometrics3020240.

\bibitem[\protect\citeauthoryear{Clarke}{Clarke}{2014}]{Clarke2014}
Clarke, D. (2014).
\newblock {G}eneral-to-specific modeling in {S}tata.
\newblock {\em The Stata Journal\/}~{\em 14}, 895--908.
\newblock \url{https://www.stata-journal.com/article.html?article=st0365}.
\bibitem[\protect\citeauthoryear{Doornik and Hendry}{Doornik and Hendry}{2018}]{DoornikHendry2018} Doornik, J.~A. and D.~F. Hendry (2018). \newblock {\em {E}mpirical {E}conometric {M}odelling - {P}c{G}ive 15}. \newblock London: Timberlake Consultants Ltd. \bibitem[\protect\citeauthoryear{Dubois and Michaux}{Dubois and Michaux}{2019}]{DuboisMichaux2019} Dubois, {\'E}. and E.~Michaux (2019). \newblock Grocer 1.8: an econometric toolbox for {S}cilab. \newblock \url{http://dubois.ensae.net/grocer.html}. \bibitem[\protect\citeauthoryear{Hendry, Johansen, and Santos}{Hendry et~al.}{2008}]{HendryJohansenSantos2008} Hendry, D.~F., S.~Johansen, and C.~Santos (2008). \newblock Automatic selection of indicators in a fully saturated regression. \newblock {\em Computational Statistics\/}~{\em 23}, 317--335. \newblock DOI 10.1007/s00180-007-0054-z. \bibitem[\protect\citeauthoryear{Hoover and Perez}{Hoover and Perez}{1999}]{Hooveretal99} Hoover, K.~D. and S.~J. Perez (1999). \newblock Data {M}ining {R}econsidered: {E}ncompassing and the {G}eneral-to-{S}pecific {A}pproach to {S}pecification {S}earch. \newblock {\em Econometrics Journal\/}~{\em 2}, 167--191. \newblock Dataset and code: \url{http://www.csus.edu/indiv/p/perezs/Data/data.htm}. \bibitem[\protect\citeauthoryear{{IHS Markit}}{{IHS Markit}}{2020}]{EViews2020} {IHS Markit} (2020). \newblock {\em EViews Version 11}. \newblock Irvine: IHS Markit. \bibitem[\protect\citeauthoryear{Lovell}{Lovell}{1983}]{Lovell83} Lovell, M.~C. (1983). \newblock Data {M}ining. \newblock {\em The {R}eview of {E}conomics and {S}tatistics\/}~{\em 65}, 1--12. \bibitem[\protect\citeauthoryear{Pretis, Reade, and Sucarrat}{Pretis et~al.}{2018}]{PretisReadeSucarrat2018} Pretis, F., J.~Reade, and G.~Sucarrat (2018). \newblock {A}utomated {G}eneral-to-{S}pecific ({GETS}) {R}egression {M}odeling and {I}ndicator {S}aturation for {O}utliers and {S}tructural {B}reaks. \newblock {\em Journal of Statistical Software\/}~{\em 86}, 1--44. \bibitem[\protect\citeauthoryear{Schwarz}{Schwarz}{1978}]{Schwarz1978} Schwarz, G. (1978). \newblock {E}stimating the {D}imension of a {M}odel. \newblock {\em The Annals of Statistics\/}~{\em 6}, 461--464. \bibitem[\protect\citeauthoryear{Sucarrat, Kurle, Pretis, Reade, and Schwarz}{Sucarrat et~al.}{2021}]{SucarratKurlePretisReadeSchwarz2021getsV029} Sucarrat, G., J.~Kurle, F.~Pretis, J.~Reade, and M.~Schwarz (2021). \newblock {\em {gets}: {G}eneral-to-{S}pecific ({GETS}) {M}odelling and {I}ndicator {S}aturation ({ISAT}) {M}ethods}. \newblock R package version 0.29. \end{thebibliography} \end{document}