---
title: "Get Started"
description: "A brief introduction to kcmeans."
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Get Started}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

This article is a brief introduction to ``kcmeans``.

```r
library(kcmeans)
set.seed(51944)
```

To illustrate ``kcmeans``, consider simulating a small dataset with a continuous outcome variable ``y``, two observed predictors -- a categorical variable ``Z`` and a continuous variable ``X`` -- and an (unobserved) Gaussian error. As in Wiemann (2023), the reduced form has a lower-dimensional representation that depends on the unobserved latent categorical variable ``Z0``, which takes only four values.

```r
# Sample parameters
nobs <- 800 # sample size

# Sample data
X <- rnorm(nobs)
Z <- sample(1:20, nobs, replace = TRUE)
Z0 <- Z %% 4 # lower-dimensional latent categorical variable
y <- Z0 + X + rnorm(nobs)
```

``kcmeans`` is then fitted on the matrix that combines the categorical feature with the continuous feature. By default, the first column is treated as the categorical feature; alternatively, the column corresponding to the categorical feature can be set via the ``which_is_cat`` argument (see the sketch at the end of this article). Computation is _very_ quick: the dynamic programming algorithm of the leveraged ``Ckmeans.1d.dp`` package is polynomial in the number of values taken by the categorical feature ``Z``. See also ``?kcmeans`` for details.

```r
system.time({
    kcmeans_fit <- kcmeans(y = y, X = cbind(Z, X), K = 4)
})
```

```
##    user  system elapsed 
##   0.784   0.027   0.668
```

We may now use the ``predict.kcmeans`` method to construct fitted values for the outcome and to predict the lower-dimensional latent categorical feature ``Z0``. See also ``?predict.kcmeans`` for details.

```r
# Predicted values for the outcome + R^2
y_hat <- predict(kcmeans_fit, cbind(Z, X))
round(1 - mean((y - y_hat)^2) / mean((y - mean(y))^2), 3)
```

```
## [1] 0.695
```

```r
# Predicted values for the latent categorical feature + misclassification rate
# (cluster labels are 1-based, so shift them to match the 0-based coding of Z0)
Z0_hat <- predict(kcmeans_fit, cbind(Z, X), clusters = TRUE) - 1
mean((Z0 - Z0_hat) != 0)
```

```
## [1] 0
```

Finally, it is straightforward to compute standard errors for the coefficients on the estimated clusters, e.g., by passing them to ``lm`` and calling ``summary.lm``:

```r
# Compute the linear regression object and call summary.lm
lm_fit <- lm(y ~ as.factor(Z0_hat) + X)
summary(lm_fit)
```

```
## 
## Call:
## lm(formula = y ~ as.factor(Z0_hat) + X)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1205 -0.6916  0.0544  0.6700  3.4201 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         0.03897    0.07434   0.524      0.6    
## as.factor(Z0_hat)1  0.88393    0.10265   8.611   <2e-16 ***
## as.factor(Z0_hat)2  1.88314    0.10271  18.334   <2e-16 ***
## as.factor(Z0_hat)3  3.01094    0.10636  28.310   <2e-16 ***
## X                   1.04636    0.03541  29.549   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.03 on 795 degrees of freedom
## Multiple R-squared:  0.6954, Adjusted R-squared:  0.6939 
## F-statistic: 453.7 on 4 and 795 DF,  p-value: < 2.2e-16
```
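As noted above, the categorical feature does not have to sit in the first column of the feature matrix. Below is a minimal sketch of this usage, under the assumption that ``which_is_cat`` takes the integer index of the categorical column; consult ``?kcmeans`` for the exact semantics.

```r
# Minimal sketch: categorical feature in the second column.
# Assumption: which_is_cat is the column index of the categorical feature.
kcmeans_fit2 <- kcmeans(y = y, X = cbind(X, Z), K = 4, which_is_cat = 2)
```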
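Since each fit is cheap, it is also easy to gauge how sensitive the results are to the choice of ``K``. The snippet below is an illustrative sketch, not a formal selection rule: it refits ``kcmeans`` for a few candidate values of ``K`` and reports the in-sample R^2 of the fitted values, using only the functions introduced above.

```r
# Sketch: in-sample R^2 across candidate numbers of clusters K.
sapply(2:6, function(K) {
  fit <- kcmeans(y = y, X = cbind(Z, X), K = K)
  y_hat_K <- predict(fit, cbind(Z, X))
  1 - mean((y - y_hat_K)^2) / mean((y - mean(y))^2)
})
```

Note that in-sample R^2 will generally increase with ``K``, so a held-out sample or an information criterion would be needed for a principled choice of ``K``.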
# References

Wiemann, T. (2023). "Optimal Categorical Instruments." arXiv preprint arXiv:2311.17021. https://arxiv.org/abs/2311.17021