\documentclass[article,nojss]{jss}
\usepackage[utf8]{inputenc}
\usepackage{amsmath,amssymb,bbm}
\usepackage{Sweave}
\newcommand{\dnorm}{\phi}% normal density
\newcommand{\loglik}{\ell}% log likelihood
\newcommand*{\mat}[1]{\mathbf{#1}}% Matrix
\newcommand{\pnorm}{\Phi}% normal distribution function
\renewcommand*{\vec}[1]{\boldsymbol{#1}}% vector
\author{Ott Toomet\\Tartu University \And Arne Henningsen\\University of Copenhagen}
\Plainauthor{Ott Toomet, Arne Henningsen}
\title{Sample Selection Models in \proglang{R}: Package \pkg{sampleSelection}}
\Plaintitle{Sample Selection Models in R: Package sampleSelection}
\Abstract{
This introduction to the \proglang{R} package \pkg{sampleSelection} is a slightly modified version of \cite{toomet08}, published in the \emph{Journal of Statistical Software}.
This paper describes the implementation of Heckman-type sample selection models in \proglang{R}. We discuss the sample selection problem as well as the Heckman solution to it, and argue that although modern econometrics has non- and semiparametric estimation methods in its toolbox, Heckman models are an integral part of the modern applied analysis and econometrics syllabus. We describe the implementation of these models in the package \pkg{sampleSelection} and illustrate the usage of the package on several simulation and real data examples. Our examples demonstrate the effect of exclusion restrictions, identification at infinity and misspecification. We argue that the package can be used both in applied research and teaching.
}
\Keywords{sample selection models, Heckman selection models, econometrics, \proglang{R}}
\Plainkeywords{sample selection models, Heckman selection models, econometrics, R}
\Address{
  Ott Toomet\\
  Department of Economics\\
  Tartu University\\
  Narva 4-A123\\
  Tartu 51009, Estonia\\
  Telephone: +372/737.6348\\
  E-mail: \email{otoomet@ut.ee}\\
  URL: \url{http://www.obs.ee/~siim/}\\
  Arne Henningsen\\
  Institute of Food and Resource Economics\\
  University of Copenhagen\\
  Rolighedsvej 25\\
  1958 Frederiksberg C, Denmark\\
  E-mail: \email{arne.henningsen@gmail.com}\\
  URL: \url{http://www.arne-henningsen.name/}
}
\begin{document}
\SweaveOpts{concordance=TRUE}
% initialisation stuff
\SweaveOpts{engine=R}
%\VignetteIndexEntry{Sample Selection Models}
%\VignetteDepends{sampleSelection,mvtnorm}
%\VignetteKeywords{sample selection models, Heckman selection models, econometrics, R}
%\VignettePackage{sampleSelection}
<<echo=FALSE>>=
options( prompt = "R> ", continue = "+ " )
@

\section{Introduction}
\label{sec:intro}

Social scientists are often interested in causal effects---what is the impact of a new drug, a certain type of school, or being born as a twin. Many of these cases are not under the researcher's control. Often, the subjects can decide themselves whether they take a drug or which school they attend. They cannot control whether they are twins, but neither can the researcher---twins may tend to be born into different types of families than singletons. All these cases are similar from the statistical point of view. Whatever the sampling mechanism, from an initial ``random'' sample we extract a sample of interest, which may not be representative of the population as a whole \citep[see][p.~1937, for a discussion]{heckman+macurdy1986}. This problem---people who are ``treated'' may differ from the rest of the population---is usually referred to as a \emph{sample selection} or \emph{self-selection} problem.
We cannot estimate the causal effect unless we solve the selection problem\footnote{Correcting for selectivity is necessary but not always sufficient for estimating the causal effect. Another common problem is the lack of common support between the treated and untreated population. We are grateful to a referee for pointing this out.}. Otherwise, we will never know which part of the observable outcome is related to the causal relationship and which part is due to the fact that different people were selected for the treatment and control groups.

Solving sample selection problems requires additional information. This information may come in different forms, each of which may or may not be feasible or useful in a particular case. Here we list a few popular choices:
\begin{itemize}
\item A random experiment, a situation where the participants do not have control over their treatment status but the researcher does. Randomisation is often the best possible method as it is easy to analyse and understand. However, this method is seldom feasible for practical and ethical reasons. Moreover, the experimental environment may add interference that complicates the analysis.
\item Instruments (exclusion restrictions) are in many ways similar to randomisation. These are variables that are observable to the researcher and that determine the treatment status but not the outcome. Unfortunately, these two requirements tend to contradict each other, and only rarely do we have instruments of reasonable quality.
\item Information about the functional form of the selection and outcome processes, such as the distribution of the disturbance terms. Heckman's original solution belongs to this group. However, the functional form assumptions are usually hard to justify.
\end{itemize}
During recent decades, randomisation and pseudo-randomisation (natural experiments) have become the state of the art for estimating causal effects. However, methods relying on distributional assumptions are still widely used. The reason is obvious---these methods are simple, widely available in software packages, and part of the common econometrics syllabus. This is true even though reasonable instruments and parametric assumptions can only seldom be justified, and therefore, it may be hard to disentangle real causal effects from (artificial) effects of parametric assumptions.

Heckman-type selection models also serve as excellent teaching tools. They are extensively explained in many recent econometric text books \citep[e.g.][]{johnston97, verbeek00, greene2002, wooldridge03, cameron05} and they are standard procedures in popular software packages like \pkg{Limdep} \citep{Limdep} and \proglang{Stata} \citep{Stata}. These models easily allow us to experiment with selection bias, misspecification, exclusion restrictions etc. They are easy to implement, to visualize, and to understand.

The aim of this paper is to describe the \proglang{R} \citep{r2007} package \pkg{sampleSelection} (version~0.6-0), which is available on the Comprehensive \proglang{R} Archive Network at \url{http://CRAN.R-project.org/package=sampleSelection}. The package implements two of the more popular types of Heckman selection models which, as far as we know, were not previously available for \proglang{R}. Our presentation is geared toward teaching because we believe that one of the advantages of these types of models lies in econometrics training.
The paper is organized as follows: In the next section we introduce the \citet{heckman1976} solution to the sample selection problem. Section~\ref{sec:implementation} briefly describes the current implementation of the model in \pkg{sampleSelection} and its possible generalisations. In Section~\ref{sec:using} we illustrate the usage of the package on various simulated data sets. Section~\ref{sec:reliability} is devoted to replication exercises where we compare our results to examples in the literature. Section~\ref{sec:problems} describes robustness issues of the method and of our implementation; the last section concludes.

\section{Heckman's solution}
\label{sec:heckman}

The most popular solutions for sample selection problems are based on \citet{heckman1976}. A variety of generalisations of Heckman's standard sample selection model can be found in the literature. These models are also called ``generalized Tobit models''; a comprehensive classification of them has been proposed by \citet{amemiya84,amemiya1985}.

\subsection{Tobit-2 models}
\label{sec:theory-tobit2}

Heckman's standard sample selection model is also called the ``Tobit-2'' model \citep{amemiya84,amemiya1985}. It consists of the following (unobserved) structural process:
\begin{align}
y_i^{S*} &= {\vec{\beta}^S}' \vec{x}_i^S + \varepsilon_i^S
\label{eq:probit*}
\\
y_i^{O*} &= {\vec{\beta}^O}' \vec{x}_i^O + \varepsilon_i^O,
\label{eq:outcome*}
\end{align}
where $y_i^{S*}$ is the realisation of the latent value of the selection ``tendency'' for individual $i$, and $y_i^{O*}$ is the latent outcome. $\vec{x}_i^S$ and $\vec{x}_i^O$ are explanatory variables for the selection and outcome equations, respectively. $\vec{x}^S$ and $\vec{x}^O$ may or may not be equal. We observe
\begin{align}
y_i^S &=
\begin{cases}
0 & \quad \text{if } y_i^{S*} < 0 \label{eq:probit}
\\
1 & \quad \text{otherwise}
\end{cases}
\\
y_i^O &=
\begin{cases}
0 & \quad \text{if } y_i^{S} = 0\\
y_i^{O*} & \quad \text{otherwise},
\end{cases}
\end{align}
i.e.\ we observe the outcome only if the latent selection variable $y^{S*}$ is positive. The observed dependence between $y^O$ and $x^O$ can now be written as
\begin{equation}
\E [y^O | \vec{x}^O = \vec{x}_i^O , \vec{x}^S = \vec{x}_i^S , y^S = 1]
= {\vec{\beta}^O}' \vec{x}_i^O
+ \E [ \varepsilon^O|\varepsilon^S \ge -{\vec{\beta}^S}' \vec{x}_i^S ].
\end{equation}
Estimating the model above by OLS in general gives biased results, as $\E [ \varepsilon^O|\varepsilon^S \ge -{\vec{\beta}^S}' \vec{x}_i^S ] \not = 0$, unless $\varepsilon^O$ and $\varepsilon^S$ are mean independent (in this case $\varrho = 0$ in equation~\eqref{eq:correlation} below).

Assuming the error terms follow a bivariate normal distribution:
\begin{equation}
\begin{pmatrix}
\varepsilon^S\\
\varepsilon^O
\end{pmatrix}
\sim
N\left(
\begin{pmatrix}
0\\
0
\end{pmatrix},
\begin{pmatrix}
1 & \varrho\\
\varrho & \sigma^2
\end{pmatrix}
\right),
\label{eq:correlation}
\end{equation}
we may employ the following simple strategy: find the expectations $\E [ \varepsilon^O|\varepsilon^S \ge -{\vec{\beta}^S}' \vec{x}_i^S ]$, also called the \emph{control function}, by estimating the selection equations \eqref{eq:probit*} and \eqref{eq:probit} by probit, and thereafter insert these expectations into equation \eqref{eq:outcome*} as additional covariates (see \citealp{greene2002} for details).
Accordingly, we may write:
\begin{equation}
y_i^O = {\vec{\beta}^O}' \vec{x}_i^O
+ \E [ \varepsilon^O|\varepsilon^S \ge -{\vec{\beta}^S}' \vec{x}_i^S ]
+ \eta_i
\equiv {\vec{\beta}^O}' \vec{x}_i^O
+ \varrho \sigma \lambda({\vec{\beta}^S}' \vec{x}_i^S)
+ \eta_i
\label{eq:tobit2_2step}
\end{equation}
where $\lambda(\cdot) = \phi(\cdot)/\Phi(\cdot)$ is commonly referred to as the inverse Mill's ratio, $\phi(\cdot)$ and $\Phi(\cdot)$ are the standard normal density and cumulative distribution functions, respectively, and $\eta$ is a new disturbance term, independent of $\vec{x}^{O}$ and $\vec{x}^{S}$. The unknown multiplier $\varrho \sigma$ can be estimated by OLS; we denote this estimate by $\hat{\beta}^\lambda$. Essentially, we describe the selection problem as an omitted variable problem, with $\lambda(\cdot)$ as the omitted variable. Since the true $\lambda(\cdot)$s in equation \eqref{eq:tobit2_2step} are generally unknown, they are replaced by estimated values based on the probit estimation in the first step.

The relations \eqref{eq:correlation} and \eqref{eq:tobit2_2step} also reveal the interpretation of $\varrho$. If $\varrho > 0$, the term $\varrho \sigma \lambda({\vec{\beta}^S}' \vec{x}_i^S)$ on the right hand side of \eqref{eq:tobit2_2step} is positive, as the observed outcomes tend to have above-average realizations of $\varepsilon^{O}$. This is usually referred to as ``positive selection'' in the sense that the observed outcomes are ``better'' than the average. In this case, the OLS estimates are upward biased.

An estimator of the variance of $\varepsilon^O$ can be obtained by
\begin{equation}
\hat{\sigma}^2 = \frac{\hat{\vec{\eta}}' \hat{\vec{\eta}}}{n^O}
+ \frac{\sum_i \hat{\delta}_i}{n^O} \left. \hat{\beta}^\lambda \right.^2
\end{equation}
where $\hat{\vec{\eta}}$ is the vector of residuals from the OLS estimation of~\eqref{eq:tobit2_2step}, $n^O$ is the number of observations in this estimation, and $\hat{\delta}_i = \hat{\lambda}_i ( \hat{\lambda}_i + \left. \hat{\vec{\beta}}^S \right. ' \vec{x}_i^S )$. Finally, an estimator of the correlation between $\varepsilon^S$ and $\varepsilon^O$ can be obtained by $\hat{\varrho} = \hat{\beta}^\lambda / \hat{\sigma}$. Note that $\hat{\varrho}$ can be outside of the $[-1, 1]$ interval.

Since the estimation of \eqref{eq:tobit2_2step} is not based on the true but on estimated values of $\lambda(\cdot)$, the standard OLS formula for the coefficient variance-covariance matrix is not appropriate \citep[p.~157]{heckman79}. A consistent estimate of the variance-covariance matrix can be obtained by
\begin{equation}
\widehat{\VAR} \left[ \hat{\vec{\beta}}^O, \hat{\vec{\beta}}^\lambda \right]
= \hat{\sigma}^2
\left[ {\mat{X}_\lambda^O}' \mat{X}_\lambda^O \right]^{-1}
\left[ {\mat{X}_\lambda^O}'
\left( \mat{I} - \hat{\varrho}^2 \hat{\mat{\Delta}} \right) \mat{X}_\lambda^{O}
+ \mat{Q} \right]
\left[ {\mat{X}_\lambda^O}' \mat{X}_\lambda^O \right]^{-1}
\end{equation}
where
\begin{equation}
\mat{Q} =
\hat{\varrho}^2
\left( {\mat{X}_\lambda^O}' \hat{\mat{\Delta}} \mat{X}^S \right)
\widehat{\VAR} \left[ \hat{\vec{\beta}}^S \right]
\left( {\mat{X}^S}' \hat{\mat{\Delta}} \mat{X}_\lambda^O \right),
\end{equation}
$\mat{X}^S$ is the matrix of all observations of $\vec{x}^S$, $\mat{X}_\lambda^O$ is the matrix of all observations of $\vec{x}^O$ and $\hat{\lambda}$, $\mat{I}$ is an identity matrix, $\hat{\mat{\Delta}}$ is a diagonal matrix with all $\hat{\delta}_i$ on its diagonal, and $\widehat{\VAR} \left[ \hat{\vec{\beta}}^S \right]$ is the estimated variance-covariance matrix of the probit estimate \citep{greene81,greene2002}.
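To make the two-step formulas above concrete, the following sketch implements them by hand on simulated data: a probit fit for the selection equation, an OLS fit with the inverse Mill's ratio as an additional regressor, and the resulting estimates of $\sigma$ and $\varrho$. This is only an illustration with arbitrarily chosen parameter values and variable names, not the code used internally by \pkg{sampleSelection}:
<<eval=FALSE>>=
## illustrative sketch of the two-step estimator (not run); the internal
## implementation in heckit2fit() differs in details
library("mvtnorm")
set.seed(1)
eps <- rmvnorm(500, c(0, 0), matrix(c(1, 0.7, 0.7, 1), 2, 2))
xs <- runif(500)
xo <- runif(500)
ys <- xs + eps[, 1] > 0                  # selection indicator
yo <- (xo + eps[, 2]) * ys               # outcome, observed only if ys is TRUE
## step 1: probit for the selection equation
probit <- glm(ys ~ xs, family = binomial(link = "probit"))
lp <- predict(probit, type = "link")     # estimate of beta^S' x^S
imr <- dnorm(lp) / pnorm(lp)             # inverse Mill's ratio lambda(.)
## step 2: OLS with the inverse Mill's ratio as an additional regressor
ols <- lm(yo ~ xo + imr, subset = ys)
betaLambda <- unname(coef(ols)["imr"])   # estimate of rho * sigma
## estimates of sigma and rho as in the formulas above
delta <- imr[ys] * (imr[ys] + lp[ys])
sigmaSq <- sum(residuals(ols)^2) / sum(ys) + mean(delta) * betaLambda^2
c(sigma = sqrt(sigmaSq), rho = betaLambda / sqrt(sigmaSq))
@
Applying \code{selection} with \code{method = "2step"} to the same data (see Section~\ref{sec:using}) should reproduce these estimates.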
This is the original idea by \citet{heckman1976}. As the model is fully parametric, it is straightforward to construct a more efficient maximum likelihood (ML) estimator. Using the properties of a bivariate normal distribution, it is easy to show that the log-likelihood can be written as
\begin{align}
\loglik & =
\sum_{\{i:y_i^S = 0\}} \log \pnorm(-{\vec{\beta}^S}' \vec{x}_i^S )
\\
& + \sum_{\{i:y_i^S = 1\}}
\left[
\log \pnorm
\left(\frac{ {\vec{\beta}^S}' \vec{x}_i^S
+ \displaystyle\frac{\varrho}{\sigma}(y_i^O - {\vec{\beta}^O}' \vec{x}_i^O)}
{\sqrt{1 - \varrho^2}}
\right)
-\frac{1}{2} \log 2\pi - \log \sigma
- \frac{1}{2} \frac{(y_i^O - {\vec{\beta}^O}' \vec{x}_i^O)^2}{\sigma^2}
\right].
\end{align}
The original article suggests using the two-step solution for exploratory work and as initial values for ML estimation, since in those days the cost of the two-step solution was \$15 while that of the maximum-likelihood solution was \$700 \citep[p.~490]{heckman1976}. %footnote 18, page 490
Nowadays, computational costs are no longer an issue; however, the two-step solution allows certain generalisations more easily than ML, and is more robust in certain circumstances (see Section~\ref{sec:problems} below).

This model and its derivations were introduced in the 1970s and 1980s. The model is well identified if the exclusion restriction is fulfilled, i.e.\ if $\vec{x}^S$ includes a component with substantial explanatory power that is not present in $\vec{x}^O$. This means essentially that we have a valid instrument. If this is not the case, identification relies on the non-linearity of the inverse Mill's ratio $\lambda(\cdot)$, whose exact form stems from the distributional assumptions. During recent decades, various semiparametric estimation techniques have been increasingly used in addition to the Heckman model (see \citealp{powell1994}, \citealp{pagan+ullah1999}, and \citealp{li07} for a review).

\subsection{Tobit-5 models}
\label{sec:theory-tobit5}

A straightforward generalisation of the standard sample selection model (Tobit-2) is the switching regression (Tobit-5) model. In this case, we have two outcome variables, of which only one is observable, depending on the selection process. Switching regression problems arise in a wide variety of contexts, e.g.\ in treatment effect, migration or schooling choice analysis. This type of model consists of a system of three simultaneous latent equations:
\begin{align}
y_i^{S*} &= {\vec{\beta}^S}' \vec{x}_i^S + \varepsilon_i^S
\\
y_i^{O1*} &= {\vec{\beta}^{O1}}' \vec{x}_i^{O1} + \varepsilon_i^{O1}
\\
y_i^{O2*} &= {\vec{\beta}^{O2}}' \vec{x}_i^{O2} + \varepsilon_i^{O2},
\end{align}
where $y^{S*}$ is the selection ``tendency'' as in the case of Tobit-2 models, and $y^{O1*}$ and $y^{O2*}$ are the latent outcomes, only one of which is observable, depending on the sign of $y^{S*}$. Hence we observe
\begin{align}
y_i^S &=
\begin{cases}
0 & \quad \text{if } y_i^{S*} < 0
\\
1 & \quad \text{otherwise}
\end{cases}
\label{eq:t5_selection_obs}
\\
y_i^O &=
\begin{cases}
y_i^{O1*} & \quad \text{if } y_i^{S} = 0\\
y_i^{O2*} & \quad \text{otherwise}.
\end{cases}
\end{align}
Assuming that the disturbance terms have a 3-dimensional normal distribution,
\begin{equation}
\begin{pmatrix}
\varepsilon^S\\
\varepsilon^{O1}\\
\varepsilon^{O2}
\end{pmatrix}
\sim
N\left(
\begin{pmatrix}
0\\
0\\
0
\end{pmatrix},
\begin{pmatrix}
1 & \varrho_1\sigma_1 & \varrho_2\sigma_2 \\
\varrho_1\sigma_1 & \sigma_1^2 & \varrho_{12}\sigma_1\sigma_2\\
\varrho_2\sigma_2 & \varrho_{12}\sigma_1\sigma_2 & \sigma_2^2
\end{pmatrix}
\right),
\end{equation}
it is straightforward to construct two-step estimators analogous to~\eqref{eq:tobit2_2step}. We may write
\begin{align}
\E [y^O | \vec{x}^{O1} = \vec{x}_i^{O1} , \vec{x}^S = \vec{x}_i^S , y^S = 0]
&= {\vec{\beta}^{O1}}' \vec{x}_i^{O1}
+ \E [ \varepsilon^{O1}|\varepsilon^S < -{\vec{\beta}^S}' \vec{x}_i^S ]
\\
\E [y^O | \vec{x}^{O2} = \vec{x}_i^{O2} , \vec{x}^S = \vec{x}_i^S , y^S = 1]
&= {\vec{\beta}^{O2}}' \vec{x}_i^{O2}
+ \E [ \varepsilon^{O2}|\varepsilon^S \ge -{\vec{\beta}^S}' \vec{x}_i^S ]
\end{align}
and hence
\begin{equation}
y_i^O =
\begin{cases}
{\vec{\beta}^{O1}}' \vec{x}_i^{O1} - \varrho_1 \sigma_1 \lambda(-{\vec{\beta}^S}' \vec{x}_i^S) + \eta_i^1
& \text{if } y_i^{S} = 0
\\
{\vec{\beta}^{O2}}' \vec{x}_i^{O2} + \varrho_2 \sigma_2 \lambda({\vec{\beta}^S}' \vec{x}_i^S) + \eta_i^2
& \text{otherwise},
\end{cases}
\label{eq:tobit5_2step}
\end{equation}
where $\E [ \eta^1 ] = \E [ \eta^2 ] = 0$ and $\lambda(\cdot) = \phi(\cdot)/\Phi(\cdot)$ is the inverse Mill's ratio, which can be calculated using the probit estimate of~\eqref{eq:t5_selection_obs}. This system can be estimated as two independent linear models, one for the case $y^S = 0$ and another for $y^S = 1$.

Note that the inverse Mill's ratio enters the two branches of \eqref{eq:tobit5_2step} with opposite signs. If $\varrho_{2} > 0$, those for whom we observe outcome~2 have above-average realizations of $\varepsilon^{O2}$. Since outcome~1 is observed at the opposite end of the latent $y^{S*}$ scale, an upward bias in $y^{O1}$ occurs if $\varrho_{1} < 0$. This is what we expect to observe in the case of endogenous selection, where individuals choose the ``better'' of the two options based on some private information about $y^{O1*}$ and $y^{O2*}$.

The log-likelihood for this problem can be written as
\begin{align}
\loglik &= -\frac{N}{2}\log 2\pi
\notag\\
&+ \sum_{\{i:y_i^S = 0\}}
\left\{
-\log\sigma_1
-\frac{1}{2}\left(
\frac{y_i^O - {\vec{\beta}^{O1}}' \vec{x}_i^{O1}}
{\sigma_1}
\right)^2
+\log \pnorm \left[
- \frac
{{\vec{\beta}^S}' \vec{x}_i^S
+ \displaystyle \frac{\varrho_1}{\sigma_1}
\left({y_i^O - {\vec{\beta}^{O1}}' \vec{x}_i^{O1}} \right)}
{\sqrt{ 1 - \varrho_1^2}}
\right]
\right\}
\notag\\
&+ \sum_{\{i: y_i^S = 1\}}
\left\{
-\log\sigma_2
-\frac{1}{2}\left(
\frac{y_i^O - {\vec{\beta}^{O2}}' \vec{x}_i^{O2}}
{\sigma_2}
\right)^2
+\log \pnorm \left[
\phantom{-} \frac
{{\vec{\beta}^S}' \vec{x}_i^S
+ \displaystyle \frac{\varrho_2}{\sigma_2}
\left({y_i^O - {\vec{\beta}^{O2}}' \vec{x}_i^{O2}} \right)}
{\sqrt{ 1 - \varrho_2^2}}
\right]
\right\}
\end{align}
where $N$ is the total number of observations. Note that $\varrho_{12}$ plays no role in this model; the observable distributions are determined by the correlations $\varrho_1$ and $\varrho_2$ between the disturbances of the selection equation ($\varepsilon^S$) and those of the corresponding outcome equations ($\varepsilon^{O1}$ and $\varepsilon^{O2}$).
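The following sketch illustrates the two-step formulas in equation~\eqref{eq:tobit5_2step} on simulated data: the two outcome equations are estimated separately, each augmented by its own inverse Mill's ratio term. All parameter values and variable names are chosen for illustration only; this is not the code used internally by \pkg{sampleSelection}:
<<eval=FALSE>>=
## illustrative sketch of the two-step estimator for the Tobit-5 model (not run)
library("mvtnorm")
set.seed(2)
vc <- matrix(c(1, 0.6, 0.4,  0.6, 1, 0.1,  0.4, 0.1, 1), 3, 3)
eps <- rmvnorm(1000, rep(0, 3), vc)
xs <- runif(1000); xo1 <- runif(1000); xo2 <- runif(1000)
ys <- xs + eps[, 1] > 0
yo1 <- xo1 + eps[, 2]                 # observed if ys is FALSE
yo2 <- xo2 + eps[, 3]                 # observed if ys is TRUE
## step 1: probit for the selection equation
lp <- predict(glm(ys ~ xs, family = binomial(link = "probit")), type = "link")
imr1 <- dnorm(lp) / pnorm(-lp)        # lambda(-b^S'x^S), coefficient -rho_1*sigma_1
imr2 <- dnorm(lp) / pnorm(lp)         # lambda( b^S'x^S), coefficient  rho_2*sigma_2
## step 2: two separate OLS regressions, one per regime
fit1 <- lm(yo1 ~ xo1 + imr1, subset = !ys)
fit2 <- lm(yo2 ~ xo2 + imr2, subset =  ys)
@
In \pkg{sampleSelection}, the same two-step estimation is carried out by \code{selection} with \code{method = "2step"} and a list of two outcome formulas.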
\section[Implementation]{Implementation in \pkg{sampleSelection}}
\label{sec:implementation}

\subsection{Current implementation}
\label{sec:current_implementation}

The main frontend for the estimation of selection models in \pkg{sampleSelection} is the command \code{selection}. It requires a formula for the selection equation (argument \code{selection}) and a formula (or a list of two formulas for switching regression models) for the outcome equation (\code{outcome}). One can choose the method (\code{method}) to be either \code{"ml"} for ML estimation or \code{"2step"} for the two-step method. If the user does not provide initial values (\code{start}) for the ML estimation, \code{selection} calculates consistent initial values by the corresponding two-step method. While the actual two-step estimation is done by the function \code{heckit2fit} or \code{heckit5fit} (depending on the model), the ML estimation is done by \code{tobit2fit} or \code{tobit5fit}. Note that the log-likelihood functions of selection models are in general not globally concave, and hence a good choice of initial values is important (see the example in Section~\ref{sec:convergence}).

The likelihood maximisation is performed by the \pkg{maxLik} package \citep{r-maxLik}; by default, the Newton-Raphson algorithm (implemented in the function \code{maxNR}) is used with an analytic score vector and an analytic Hessian matrix. This results in reasonably fast computation even with tens of thousands of observations. A well-defined model should converge in fewer than 15 iterations; in the case of weak identification this number may be considerably larger. Convergence issues may appear at the boundary of the parameter space (if $|\varrho| \to 1$, see Section~\ref{sec:convergence}). Other maximisation algorithms can be chosen by the argument \code{maxMethod}, e.g.\ \code{maxMethod="SANN"} uses a variant of the robust ``simulated annealing'' stochastic global optimization algorithm proposed by \citet{belisle92} and \code{maxMethod="BHHH"} uses the Berndt-Hall-Hall-Hausman algorithm \citep{berndt74}.

The command \code{selection} returns an object of class \code{selection}. The \pkg{sampleSelection} package provides several methods for objects of this class: a \code{print} method prints the estimation results, a \code{summary} method (and an associated \code{print} method) calculates and prints summary results, a \code{coef} method extracts the estimated coefficients, a \code{vcov} method extracts their covariance matrix, a \code{fitted} method extracts the fitted values, a \code{residuals} method extracts the residuals, a \code{model.frame} method extracts the model frame, and a \code{model.matrix} method extracts the model matrix. The \code{coef} and \code{vcov} methods for \code{selection} objects, as well as the \code{print} method for \code{summary.selection} objects, include an extra argument \code{part}, which specifies which part of the model is returned or printed. One may choose either the full model (\code{part="full"}, the default) or the outcome equation only (\code{part="outcome"}). The \code{fitted}, \code{residuals}, and \code{model.matrix} methods also include a \code{part} argument. However, for these functions the \code{part} argument specifies whether the objects of the outcome part (\code{part="outcome"}, the default) or of the selection part (\code{part="selection"}) should be returned.
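For example, assuming \code{m} is an object returned by \code{selection} (as in the examples in Section~\ref{sec:using}), typical post-estimation calls look as follows; this is only a sketch of the usage of the \code{part} argument:
<<eval=FALSE>>=
## typical post-estimation calls for a fitted "selection" object m (not run)
summary(m)                           # summary of the full model
coef(m, part = "outcome")            # coefficients of the outcome equation only
vcov(m, part = "outcome")            # covariance matrix of these coefficients
fitted(m, part = "selection")        # fitted values of the selection part
residuals(m, part = "outcome")       # residuals of the outcome part
model.matrix(m, part = "selection")  # model matrix of the selection part
@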
Currently, the variance-covariance matrix of the two-step estimators is only partially implemented, with \code{NA}s in place of the unimplemented components.

The package is written completely in \proglang{R}, which should minimize portability issues. It depends on the packages \pkg{maxLik} \citep{r-maxLik}, \pkg{systemfit} \citep{henningsen07a,r-systemfit}, and \pkg{mvtnorm} \citep{r-mvtnorm}, where \pkg{mvtnorm} is used for the examples only. All these packages are available from the Comprehensive \proglang{R} Archive Network at \url{http://CRAN.R-project.org/}.

\subsection{Current API and a wider range of selection models}

We now briefly discuss possible ways of introducing more general selection models using a slightly generalized version of the current API. First, the current argument \code{selection} can be used for specifying more general selection conditions. \code{selection} might contain an interval for interval censoring (for instance \code{selection = c(0, Inf)} in case of the standard Tobit model), more than one formula (for multiple treatment models), or an indicator for the selection mechanism (like \code{"max"} or \code{"min"} for switching regression with unobserved separator). In this way, all generalized Tobit models listed by \cite{amemiya84,amemiya1985} can be specified. Another possible generalisation is allowing for discrete dependent variable models. While the binary-choice case can easily be distinguished from a continuous response, an additional parameter is needed in the multiple-choice case. This parameter (possibly a vector whose components correspond to the individual equations) would allow the user to specify the exact type of the response (e.g.\ multinomial, ordered or Poisson). Third, different distributions of the disturbance terms can be specified in a similar way using an additional parameter. It may be a vector if different marginal distributions for different outcome equations are necessary. Finally, as \code{selection} currently supports only two easily distinguishable models, we have not provided a way to specify the model type explicitly. However, an explicit specification would allow users to locate programming problems more easily and would lessen the complications related to automatically guessing the correct model type.

\section[Usage]{Using the \code{selection} function}
\label{sec:using}

This section provides selected simulation experiments that illustrate both the strengths and the weaknesses of the method, as well as the typical usage of \code{selection}.

\subsection{Tobit-2 models}
\label{sec:using-tobit2}

First, we estimate a correctly specified Tobit-2 model with an exclusion restriction:
<<>>=
set.seed(0)
library("sampleSelection")
library("mvtnorm")
eps <- rmvnorm(500, c(0,0), matrix(c(1,-0.7,-0.7,1), 2, 2))
xs <- runif(500)
ys <- xs + eps[,1] > 0
xo <- runif(500)
yoX <- xo + eps[,2]
yo <- yoX*(ys > 0)
@
% -- please keep these comments.  These are necessary for auctex
We use \pkg{mvtnorm} in order to create bivariate normal disturbances with correlation $-0.7$. Next, we generate a uniformly distributed explanatory variable for the selection equation, \code{xs}, the selection outcome \code{ys} by a probit data-generating process, and the explanatory variable for the outcome equation, \code{xo}. All our true intercepts are equal to 0 and our true slopes are equal to 1, both in this and the following examples. We retain the latent outcome variable (\code{yoX}) for the figure below, and calculate the observable outcome \code{yo}.
Note that the vectors of explanatory variables for the selection equation (\code{xs}) and the outcome equation (\code{xo}) are independent, and hence the exclusion restriction is fulfilled. As a consequence, the estimates are reasonably precise:
<<>>=
summary( selection(ys~xs, yo ~xo))
@
<<echo=FALSE>>=
m <- selection(ys~xs, yo ~xo)
@
%
One can see that all the true values are within the 95\% confidence intervals of the corresponding estimates.

Now look at the graphical representation of the data (Figure~\ref{fig:t2ex}). We can see that the unobserved values (empty circles) tend to have higher $y^{O*}$ realisations than the observed ones (filled circles). This is because $\varrho < 0$. The OLS estimate (dotted line) is substantially downward biased---it does not take into account the fact that we tend to observe the observations with low realisations of $\varepsilon^O$. The slope, however, remains unbiased because $\E [ \varepsilon^O|\varepsilon^S \ge -{\vec{\beta}^S}' \vec{x}_i^S ]$ does not depend on $\vec{x}^O$.
\begin{figure}[tp]
\begin{center}
<<fig=TRUE, echo=FALSE>>=
par(mar=c(3,3,0,0) + 0.1, mgp=c(2,1,0))
pch <- c(1, 16)
plot(xo, yoX, pch=pch[1 + ys], cex=0.5, lwd=0.3)
abline(a=0, b=1, lty=1)                    # True dependence
abline(a=coef(m)[3], b=coef(m)[4], lty=2)  # Heckman's model
cf <- coef(lm(yo ~ xo, subset=ys==1))
abline(a=cf[1], b=cf[2], lty=3)            # OLS
@
%
\caption{\label{fig:t2ex}
Sample selection example with exclusion restriction (filled circles = observable outcomes, empty circles = unobservable outcomes, solid line = true dependence, dashed line (partly overlapping the solid line) = ML estimate, dotted line = OLS estimate based on observable outcomes only).}
\end{center}
\end{figure}

Now we repeat the same exercise, but without the exclusion restriction, generating \code{yo} using \code{xs} instead of \code{xo}.
<<>>=
yoX <- xs + eps[,2]
yo <- yoX*(ys > 0)
summary(m <- selection(ys ~ xs, yo ~ xs))
@
%
The estimates are still unbiased, but the standard errors are substantially larger in this case. The exclusion restriction---independent information about the selection process---has a certain identifying power that we have now lost; we now rely solely on identification through the functional form.

We illustrate this case with an analogous figure (Figure~\ref{fig:t2}). The selection model uncovers the true dependence very well. The OLS estimate is downward biased because of $\varrho < 0$, as in the previous case. However, now the slope is upward biased because $\E [ \varepsilon^O|\varepsilon^S \ge -{\vec{\beta}^S}' \vec{x}_i^S ]$ is increasing in the single explanatory variable in the outcome model, $\vec{x}^S$.
\begin{figure}[tp]
\begin{center}
<<fig=TRUE, echo=FALSE>>=
par(mar=c(3,3,0,0) + 0.1, mgp=c(2,1,0))
pch <- c(1, 16)
plot(xs, yoX, pch=pch[1 + ys], cex=0.5, lwd=0.3)
abline(a=0, b=1, lty=1)                    # True dependence
abline(a=coef(m)[3], b=coef(m)[4], lty=2)  # Heckman's model
cf <- coef(lm(yo ~ xs, subset=ys==1))
abline(a=cf[1], b=cf[2], lty=3)            # OLS
@
%
\caption{\label{fig:t2}
Sample selection example without exclusion restriction (for symbols see Figure~\ref{fig:t2ex}).}
\end{center}
\end{figure}

In order to identify $\beta^\lambda$ and $\vec{\beta}^O$ without the exclusion restriction, $\lambda(\cdot)$ must differ from a linear combination of the components of $\vec{x}^O$ \citep[see][]{leung+yu1996}. The degree of non-linearity in $\lambda(\cdot)$ depends on the variability in ${\vec{\beta}^S}' \vec{x}^S$, as $\lambda(\cdot)$ is a smooth convex function (see Figure~\ref{fig:cfunctions}).
Hence the standard errors of the estimates depend on the variation in the latent selection equation \eqref{eq:probit*}, even if the exclusion restriction is not fulfilled. More variation leads to smaller standard errors\footnote{The exact shape of the function $\lambda(\cdot)$ depends on the distribution of the disturbances. However, $\E[\varepsilon^{O}|\varepsilon^{S} \ge -{\vec{\beta}^{S}}' \vec{x}^{S}] \to 0 \text{ as } {\vec{\beta}^{S}}' \vec{x}^{S} \to \infty$ under a wide set of distribution functions. Hence the estimator is less dependent on functional form assumptions if the variability in the latent selection equation is larger. This is related to identification at infinity \citep{chamberlain1986}.}. We demonstrate this below by changing the support of $\vec{x}^S$ from $[0,1]$ to $[-5,5]$:
<<>>=
xs <- runif(500, -5, 5)
ys <- xs + eps[,1] > 0
yoX <- xs + eps[,2]
yo <- yoX*(ys > 0)
summary( selection(ys ~ xs, yo ~ xs))
@
<<echo=FALSE>>=
m <- selection(ys ~ xs, yo ~ xs)
@
%
Now all the parameters are precisely estimated, with even higher precision than in the first example where the exclusion restriction is fulfilled. The reason is simple: as one can see from Figure~\ref{fig:lambda()}, selection is not an issue if $x^S > 2$, while virtually nothing is observed if $x^{S} < -2$. Here $\lambda(\beta^S x^S)$ differs sufficiently from a linear function.
\begin{figure}[tp]
\begin{center}
<<fig=TRUE, echo=FALSE>>=
par(mar=c(3,3,0,0) + 0.1, mgp=c(2,1,0))
pch <- c(1, 16)
plot(xs, yoX, pch=pch[1 + ys], cex=0.5, lwd=0.3)
abline(a=0, b=1, lty=1)                    # True dependence
abline(a=coef(m)[3], b=coef(m)[4], lty=2)  # Heckman's model
cf <- coef(lm(yo ~ xs, subset=ys==1))
abline(a=cf[1], b=cf[2], lty=3)            # OLS
@
%
\caption{\label{fig:lambda()}
Sample selection example with more variation in $x^S$ (for symbols see Figure~\ref{fig:t2ex}).}
\end{center}
\end{figure}

\subsection{Switching regression models}
\label{sec:using-tobit5}

Now let us focus on the Tobit-5 examples. We create the following simple switching regression problem:
<<>>=
set.seed(0)
vc <- diag(3)
vc[lower.tri(vc)] <- c(0.9, 0.5, 0.1)
vc[upper.tri(vc)] <- vc[lower.tri(vc)]
eps <- rmvnorm(500, c(0,0,0), vc)
xs <- runif(500)
ys <- xs + eps[,1] > 0
xo1 <- runif(500)
yo1 <- xo1 + eps[,2]
xo2 <- runif(500)
yo2 <- xo2 + eps[,3]
@
%
We generate three disturbance vectors from a 3-dimensional normal distribution using \code{rmvnorm}. We set the correlation between $\varepsilon^S$ and $\varepsilon^{O1}$ equal to 0.9 and the correlation between $\varepsilon^S$ and $\varepsilon^{O2}$ to 0.5. The third correlation, 0.1, takes care of the positive definiteness of the covariance matrix and does not affect the results. Further, we create three independent explanatory variables (\code{xs}, \code{xo1} and \code{xo2}, uniformly distributed on $[0,1]$), and hence the exclusion restriction is fulfilled. \code{selection} now expects three formulas: one for the selection equation, as before, and a list of two formulas for the outcome equations. Note that we do not have to set the unobserved outcome values to zero; they are simply ignored. The results look as follows:
<<>>=
summary(selection(ys~xs, list(yo1 ~ xo1, yo2 ~ xo2)))
@
We can see that the parameters are estimated fairly well: all the estimates are close to the true values. Next, we take an example of functional form misspecification.
We create the disturbances as 3-variate $\chi_1^2$ random variables (we subtract 1 in order to obtain disturbances with mean zero), and generate \code{xs} in the interval $[-1,0]$ in order to get an asymmetric distribution over the observed choices:
<<>>=
set.seed(5)
eps <- rmvnorm(1000, rep(0, 3), vc)
eps <- eps^2 - 1
xs <- runif(1000, -1, 0)
ys <- xs + eps[,1] > 0
xo1 <- runif(1000)
yo1 <- xo1 + eps[,2]
xo2 <- runif(1000)
yo2 <- xo2 + eps[,3]
summary(selection(ys~xs, list(yo1 ~ xo1, yo2 ~ xo2), iterlim=20))
@
%
Although we still have an exclusion restriction, serious problems now appear---most intercepts are statistically significantly different from their true value of zero. This model has serious convergence problems and often does not converge at all (this is why we increased \code{iterlim} and used \code{set.seed(5)}).

As the last Tobit-5 example, we repeat the previous exercise without the exclusion restriction and with a slightly larger variance of \code{xs}:
<<>>=
set.seed(6)
xs <- runif(1000, -1, 1)
ys <- xs + eps[,1] > 0
yo1 <- xs + eps[,2]
yo2 <- xs + eps[,3]
summary(tmp <- selection(ys~xs, list(yo1 ~ xs, yo2 ~ xs), iterlim=20))
@
%
In most cases, this model does not converge. However, if it does (as in this case, where we use \code{set.seed(6)}), the results may be seriously biased. Note that the parameters of the first outcome equation have low standard errors but a substantial bias.

Figure~\ref{fig:cfunctions} presents the correct control functions, based on the $\chi^2$ distribution, together with the control functions based on the assumption of normally distributed disturbance terms. We use the estimated parameters for constructing the latter; however, we scale the normal control functions (inverse Mill's ratios) to roughly the same scale as the correct ones.
\begin{figure}[tp]
\begin{center}
<<fig=TRUE, echo=FALSE>>=
EUlower <- function(alpha) {
   alpha[alpha >= 1] <- NA
   alpha <- sqrt(-alpha + 1)
   EUalpha <- ss^2 - 2*ss*alpha*dnorm(alpha/ss)/(1 - 2*pnorm(-alpha/ss))
   s1s^2/ss^2*EUalpha + s1^2 - s1s^2/ss^2 - 1
}
EUupper <- function(alpha) {
   alpha[alpha >= 1] <- 1
   alpha <- sqrt(-alpha + 1)
   EUalpha <- (ss*alpha*dnorm(alpha/ss) + ss^2*(1 - pnorm(alpha/ss)))/(1 - pnorm(alpha/ss))
   s2s^2/ss^2*EUalpha + s2^2 - s2s^2/ss^2 - 1
}
Nlower <- function(alpha) {
   alpha <- -alpha
   -Ns1s*dnorm(alpha)/pnorm(alpha)
}
Nupper <- function(alpha) {
   alpha <- -alpha
   Ns2s*dnorm(-alpha)/pnorm(-alpha)
}
ss <- sqrt(vc[1,1])
s1 <- sqrt(vc[2,2])
s2 <- sqrt(vc[3,3])
s1s <- vc[1,2]
s2s <- vc[1,3]
Ns1s <- coef(tmp)["sigma1"]*coef(tmp)["rho1"]
Ns2s <- coef(tmp)["sigma2"]*coef(tmp)["rho2"]
hatb1O <- coef(tmp)[c("XO1(Intercept)", "XO1xs")]
hatb2O <- coef(tmp)[c("XO2(Intercept)", "XO2xs")]
es <- eps[,1]
e1 <- eps[,2]
e2 <- eps[,3]
ex <- seq(-5, 5, length=200)
ey <- cbind(EUlower(ex), EUupper(ex),
##          -31*Nupper(cbind(1, ex)%*%hatb1O), 5.5*Nlower(cbind(1,ex)%*%hatb2O),
            -s1s*dnorm(-ex)/pnorm(-ex),
            s2s*dnorm(ex)/pnorm(ex)
            )
mcy <- matrix(0, length(ex), 2)
for(i in seq(length=nrow(mcy))) {
   mcy[i,1] <- mean(e1[es < -ex[i]])
   mcy[i,2] <- mean(e2[es > -ex[i]])
}
par(cex=0.8, mar=c(3,3,0,0) + 0.1, mgp=c(2,1,0))
matplot(ex, ey, type="l", lty=c(1,1,2,2,3,3), col=1,
        xlab=expression(x^S), ylab="", ylim=c(-1.5,2.5))
abline(v=-1,lty=3)
abline(v=1, lty=3)
# matpoints(ex, mcy, pch=c(1,2), cex=0.5, col=1)
axis(4)
text(-3.5, 0.8,
     expression(paste("correct ")*E*group("[", epsilon^{O2}*group("|", epsilon^S > -bold(beta)^S*minute*bold(x)^S, ""), "]")))
text(-3.6, -0.4,
     expression(paste("correct ")*E*group("[", epsilon^{O1}*group("|", epsilon^S < -bold(beta)^S*minute*bold(x)^S, ""), "]")))
text(-2.5, 1.5,
     expression(paste("assumed ")*E*group("[", epsilon^{O2}*group("|", epsilon^S > -bold(beta)^S*minute*bold(x)^S, ""), "]")),
     pos = 4 )
text(1, -1.4,
     expression(paste("assumed ")*E*group("[", epsilon^{O1}*group("|", epsilon^S < -bold(beta)^S*minute*bold(x)^S, ""), "]")),
     pos = 2 )
@
%
\caption{\label{fig:cfunctions}
Correct and assumed control functions. Dotted vertical lines denote the support of $x^S$ in the simulation experiment; correct control functions are based on the $\chi^2(1)$ distribution; assumed control functions are based on the normal distribution.
}
\end{center}
\end{figure}
One can see that the functions differ substantially in the relevant range of $x^S \in [-1, 1]$. In particular, the true $\E [ \varepsilon^{O2}|\varepsilon^S \ge -{\vec{\beta}^S}' \vec{x}^S ]$ decreases substantially faster close to $x^S = 1$ than the normal approximation, while the correct $\E [ \varepsilon^{O1}|\varepsilon^S < -{\vec{\beta}^S}' \vec{x}^S ]$ decreases more slowly than the approximation.

It is instructive to estimate the same model as two independent OLS equations:
<<>>=
coef(summary(lm(yo1~xs, subset=ys==0)))
coef(summary(lm(yo2~xs, subset=ys==1)))
@
%
One can see that the OLS estimates are very close to the ML ones. This is related to the fact that none of the $\varrho$s is statistically significantly different from zero.

\section{Two replication exercises}
\label{sec:reliability}

In this section we test the reliability of the results from \code{selection} by applying the two-step and the ML estimation methods to re-estimate selected models already published in the literature.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Greene (2002): Example 22.8, page 786}
\label{sec:Greene22.8}

The first test is Example 22.8 from \citet[p.~786]{greene2002}. The data set used in this example is included in \pkg{sampleSelection}; it is called \code{Mroz87}. This data set was used by \citet{mroz1987} for analysing female labour supply. In this example, labour force participation (described by the dummy variable \code{lfp}) is modelled by a quadratic polynomial in age (\code{age}), family income (\code{faminc}, in 1975 dollars), presence of children (\code{kids}), and education in years (\code{educ}). The wage equation includes a quadratic polynomial in experience (\code{exper}), education in years (\code{educ}), and residence in a big city (\code{city}). First, we have to create a dummy variable for the presence of children.
<<>>=
data( "Mroz87" )
Mroz87$kids <- ( Mroz87$kids5 + Mroz87$kids618 > 0 )
@
%
Now, we estimate the model by the two-step method.
<<>>=
greeneTS <- selection( lfp ~ age + I( age^2 ) + faminc + kids + educ,
   wage ~ exper + I( exper^2 ) + educ + city, data = Mroz87, method = "2step" )
@
%
Most results are identical to the values reported by \citet[p.~786]{greene2002}. Only the coefficient of the inverse Mill's ratio ($\beta^\lambda = \varrho \sigma$), its standard error, and $\varrho$ deviate from the published results, but all differences are less than one percent.%
\footnote{
Note that the standard error of the coefficient of the inverse Mill's ratio ($\beta^\lambda = \varrho \sigma$) is wrong in \citet[p.~786]{greene2002} \citep[see][]{greene06}.
}
Finally, we repeat the analysis with the ML estimation method:
<<>>=
greeneML <- selection( lfp ~ age + I( age^2 ) + faminc + kids + educ,
   wage ~ exper + I( exper^2 ) + educ + city, data = Mroz87,
   maxMethod = "BHHH", iterlim = 500 )
@
%
Again, the estimated coefficients and standard errors are almost identical to the values published in \citet{greene06}. While we can obtain the same coefficients with the Newton-Raphson (NR) maximisation method, we have to use the Berndt-Hall-Hall-Hausman (BHHH) method to obtain the published standard errors\footnote{We are grateful to William Greene for pointing this out.}. This is because different ways of calculating the Hessian matrix may result in substantially different standard errors \citep{calzolari+fiorentini1993}: the NR algorithm uses the exact analytic Hessian, whereas BHHH approximates the Hessian by the outer product of the gradients.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Cameron and Trivedi (2005): Section 16.6, page 553}
\label{sec:Cameron16.6}

The data set used in this example is based on the ``RAND Health Insurance Experiment'' \citep{newhouse99}. It is included in \pkg{sampleSelection}, where it is called \code{RandHIE}. \citet[p.~553]{cameron05} use these data to analyse health expenditures. The endogenous variable of the outcome equation measures the log of the medical expenses of the individual (\code{lnmeddol}), and the endogenous variable of the selection equation indicates whether the medical expenses are positive (\code{binexp}). The regressors are the log of the coinsurance rate plus 1 (\code{logc = log(coins+1)}), a dummy for individual deductible plans (\code{idp}), the log of the participation incentive payment (\code{lpi}), an artificial variable (\code{fmde}) that is 0 if \code{idp}\,=\,1 and \code{ln(max(1,mde/(0.01*coins)))} otherwise (where \code{mde} is the maximum expenditure offer), physical limitations (\code{physlm}), the number of chronic diseases (\code{disea}), dummy variables for good (\code{hlthg}), fair (\code{hlthf}), and poor (\code{hlthp}) self-rated health (where the baseline is excellent self-rated health), the log of family income (\code{linc}), the log of family size (\code{lfam}), education of the household head in years (\code{educdec}), age of the individual in years (\code{xage}), a dummy variable for female individuals (\code{female}), a dummy variable for individuals younger than 18 years (\code{child}), a dummy variable for female individuals younger than 18 years (\code{fchild}), and a dummy variable for black household heads (\code{black}). First, we select the subsample (study year \code{year} equal to 2 and education information present) that is used by \citet{cameron05} and specify the selection as well as the outcome equation.
%
<<>>=
data( "RandHIE" )
subsample <- RandHIE$year == 2 & !is.na( RandHIE$educdec )
selectEq <- binexp ~ logc + idp + lpi + fmde + physlm + disea +
   hlthg + hlthf + hlthp + linc + lfam + educdec + xage + female +
   child + fchild + black
outcomeEq <- lnmeddol ~ logc + idp + lpi + fmde + physlm + disea +
   hlthg + hlthf + hlthp + linc + lfam + educdec + xage + female +
   child + fchild + black
@
%
Now, we estimate the model by the two-step method (reporting only the coefficients):
<<>>=
rhieTS <- selection( selectEq, outcomeEq, data = RandHIE[ subsample, ],
   method = "2step" )
@
%
All coefficients and standard errors are identical to the results reported by \citet{cameron05}---even when they are compared with the seven-digit values in their \proglang{Stata} output that is available at \url{http://cameron.econ.ucdavis.edu/mmabook/mma16p3selection.txt}.%
\footnote{
The coefficient and t-value of \code{idp} in column \code{lnmed} of \citeauthor{cameron05}'s Table~16.1 seem to be wrong, because they differ from their \proglang{Stata} output as well as from our results.
}
Again, we repeat the analysis with the ML estimation method:
<<>>=
rhieML <- selection( selectEq, outcomeEq, data = RandHIE[ subsample, ] )
@
%
All coefficients and standard errors of the ML estimation are nearly identical to the values reported in Table~16.1 of \citet{cameron05} as well as to the seven-digit values in their \proglang{Stata} output. Only a few coefficients deviate at the seventh decimal place.
% outcome equation: hlthp, female, child, black

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Robustness issues}
\label{sec:problems}

\subsection{Convergence}
\label{sec:convergence}

The log-likelihood function of the models above is not globally concave. The estimation may not converge, or it may converge to a local maximum, if the initial values are not chosen well. This may easily happen, as we illustrate below.

Recall Example 22.8 from \citet[p.~786]{greene2002} (Section~\ref{sec:Greene22.8}). This model gives reasonable results, but these are sensitive to the start values. Now we re-estimate the model, specifying start values ``by hand'' (note that one has to supply a positive initial value for the variance):
<<>>=
greeneStart <- selection( lfp ~ age + I( age^2 ) + faminc + kids + educ,
   wage ~ exper + I( exper^2 ) + educ + city, data = Mroz87,
   start = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.5, 0.9))
cat( greeneStart$message )
coef( summary( greeneStart ) )[ "rho", ]
@
%
The process did not converge. In this case, the problem stems from numerical problems at the boundary of the parameter space (note that $\hat{\varrho}$ is close to 1). A workaround is to use a more robust maximisation method.
For instance, one may choose to run the SANN maximizer for 10\,000 iterations, and then use the returned coefficients as start values for the Newton-Raphson algorithm.%
\footnote{One has to choose a suitable value for \code{parscale}; \code{parscale=0.001} worked well for this example.}
<<>>=
set.seed(0)
greeneSANN <- selection( lfp ~ age + I( age^2 ) + faminc + kids + educ,
   wage ~ exper + I( exper^2 ) + educ + city, data = Mroz87,
   start = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.5, 0.9),
   maxMethod="SANN", parscale = 0.001 )
greeneStartSANN <- selection( lfp ~ age + I( age^2 ) + faminc + kids + educ,
   wage ~ exper + I( exper^2 ) + educ + city, data = Mroz87,
   start = coef( greeneSANN ) )
cat( greeneStartSANN$message )
@
%
The new Newton-Raphson estimate converged to another maximum with a log-likelihood value even higher than that of the original estimate published in \citet[p.~786]{greene2002} (see Section~\ref{sec:Greene22.8}):
<<>>=
logLik( greeneML )
logLik( greeneStartSANN )
@
%
However, in most cases the two-step method does a good job of calculating the initial values.

\subsection{Boundary of the parameter space}
\label{sec:boundary}

In general, one should prefer \code{method="ml"} over \code{method="2step"}. However, ML estimators may have problems at the boundary of the parameter space. Take the textbook Tobit example:
\begin{equation}
\label{eq:tobit}
y_i^* = \vec{\beta}' \vec{x}_i + \varepsilon_i;
\quad
y_i =
\begin{cases}
y_i^* \quad & \text{if } y_i^* > 0\\
0 & \text{otherwise.}
\end{cases}
\end{equation}
This model can be written as a Tobit-2 model where the error terms of the selection and outcome equations are perfectly correlated. In this case, the ML estimator may not converge:
<<>>=
set.seed(0)
x <- runif(1000)
y <- x + rnorm(1000)
ys <- y > 0
tobitML <- selection(ys~x, y~x)
cat( tobitML$message )
coef( summary( tobitML ) )[ "rho", ]
@
%
The reason, as in the previous example, is that $\varrho = 1$ lies at the boundary of the parameter space. However, the two-step method still works, although the standard errors are large and $\hat{\varrho} \notin [-1,1]$:
<<>>=
tobitTS <- selection(ys~x, y~x, method="2step")
coef( summary( tobitTS ) )[ "rho", ]
@
%

\section{Conclusions}
\label{sec:conclusions}

This paper describes Heckman-type selection models and their implementation in the package \pkg{sampleSelection} for the programming language \proglang{R}. These models are popular for estimating the impact of various factors in economics and other social sciences. We argue that they also serve as useful teaching tools because they are easy to implement and understand.

We describe the implementation and usage of standard sample selection (Tobit-2) and switching regression (Tobit-5) models in \pkg{sampleSelection}, and possible generalisations of our \code{selection} function. We demonstrate the usage of the package with a number of simulated and real data examples. The examples illustrate several important issues related to exclusion restrictions, identification at infinity, and functional form specification. Our implementation works well for correctly specified cases with the exclusion restriction fulfilled. The problems that appear in the case of misspecification or weak identification are related to the model itself. In these problematic cases, the user may gain from a more robust maximisation algorithm. In some cases, the two-step estimator is preferable.

\section*{Acknowledgments}

The authors thank Roger Koenker, Achim Zeileis, and two anonymous referees for helpful comments and suggestions.
Ott Toomet is grateful to the project TMJRI 0525 2003-2007 of the Estonian Ministry of Education and Science. \bibliography{selection} \end{document}