---
title: "Using summclust"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{summclust}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(summclust)
```

## Introduction 

The `summclust` package allows to compute leverage statistics for clustered errors and fast `CRV3(J)` variance-covariance matrices as described in [MacKinnon, J.G., Nielsen, M.Ø., Webb, M.D., 2022. Leverage, influence, and the jackknife in clustered regression models: Reliable inference using summclust](https://arxiv.org/abs/2205.03288). 

It is a post-estimation command and currently supports methods for objects of type `lm` (from `stats`) and `fixest` (from the `fixest` package). 

## The `summclust` function

```{r example, warning = FALSE, message = FALSE, out.width="50%", out.height="50%"}
library(summclust)
library(lmtest)
library(haven)

url <- "http://www.stata-press.com/data/r9/nlswork.dta"
if (httr::http_error(url)) {
  stop("No internet connection or data source broken. Sorry about that!")
  return(NULL)
} else {
  message("downloading the 'nlswork' dataset.")
  nlswork <- read_dta(url)
}


# drop NAs at the moment
nlswork <- nlswork[, c("ln_wage", "grade", "age", "birth_yr", "union", "race", "msp", "ind_code")]
nlswork <- na.omit(nlswork)

lm_fit <- lm(
  ln_wage ~ as.factor(grade) + as.factor(age) + as.factor(birth_yr) + union +  race + msp,
  data = nlswork)

summclust_res <- summclust(
  obj = lm_fit,
  cluster = ~ind_code,
  params = c("msp", "union")
)

# CRV3-based inference - exactly matches output of summclust-stata
tidy(summclust_res)

summary(summclust_res)
```

To visually inspect the leverage statistics, use the `plot` method

```{r, warning = FALSE, message = FALSE, out.width="50%"}
plot(summclust_res)
```

## Using `summclust` with `coefplot` and `fixest`

Note that you can also use CVR3 and CRV3J covariance matrices computed via `summclust` or its `vcov_CR3J` method with the `lmtest()` and `fixest` packages.

```{r, warning = FALSE, message=FALSE}
library(lmtest)

vcov3J <- 
  vcov_CR3J(
    lm_fit, 
    cluster = ~ ind_code
  )

all.equal(
  vcov3J, 
  summclust_res$vcov
)

df <- length(summclust_res$cluster) - 1

# with lmtest
CRV1 <- lmtest::coeftest(lm_fit, sandwich::vcovCL(lm_fit, ~ind_code), df = df)
CRV3 <- lmtest::coeftest(lm_fit, vcov3J, df = df)

CRV1[c("union", "race", "msp"),]
CRV3[c("union", "race", "msp"),]

stats::confint(CRV1)[c("union", "race", "msp"),]
stats::confint(CRV3)[c("union", "race", "msp"),]
```

```{r}
library(fixest)

feols_fit <- feols(
  ln_wage ~ i(grade) + i(age) + i(birth_yr) + union +  race + msp,
  data = nlswork
)

# Store vcov into the fixest object
feols_fit <- summary(
  feols_fit, 
  vcov = vcov_CR3J(feols_fit, cluster = ~ ind_code)
)

# Now it just works with fixest functions
fixest::coeftable(feols_fit, keep = c("msp", "union", "race"))
```

The p-value and confidence intervals for `fixest::coeftable()` differ from `lmtest::coeftest()` and `summclust::coeftable()`. This is due to the fact that `fixest::coeftable()` uses a different degree of freedom for the t-distribution used in these calculation (I believe it uses t(N-1)).