---
title: "SMMAL_vignette"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{SMMAL_vignette}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```


## Introduction

This vignette demonstrates how to use the SMMAL package to estimate the Average Treatment Effect (ATE) using semi-supervised machine learning. We provide an example dataset and walk through the required input format and function usage.


## Import Sample Data

The sample data contain 1000 observations, with 60% of Y and A missing at random.

- Y is the outcome.
- A is the treatment indicator.
- X are the covariates.
- S are the surrogates.

In the sample data, missing values of Y and A are encoded as NA.
The package can handle a high proportion of missing values, provided the sample is large enough that each cross-validation fold contains at least 20 labeled observations.
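
As a rough check of this sample-size requirement, the sketch below simulates an outcome with 60% missingness, assigns folds at random, and counts the labeled observations in each fold. The variable names and the random fold assignment are illustrative assumptions; SMMAL may assign folds differently internally.

```{r}
# Illustrative check only; SMMAL's internal fold assignment may differ.
set.seed(1)
n_demo <- 1000
nfold_demo <- 5

# Simulated binary outcome with 60% of values set to NA
y_demo <- rbinom(n_demo, size = 1, prob = 0.5)
y_demo[sample(n_demo, size = 600)] <- NA

# Random, balanced fold assignment for all observations
fold_demo <- sample(rep(seq_len(nfold_demo), length.out = n_demo))

# Number of labeled (non-missing) outcomes in each fold
labeled_per_fold <- table(
  factor(fold_demo[!is.na(y_demo)], levels = seq_len(nfold_demo))
)
labeled_per_fold

# TRUE when every fold meets the 20-observation minimum
all(labeled_per_fold >= 20)
```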

```{r}
library(SMMAL)

file_path <- system.file("extdata", "sample_data_withmissing.rds", package = "SMMAL")
dat <- readRDS(file_path)


file_path2 <- system.file("extdata", "semi_supervised_data.rds", package = "SMMAL")
data_loaded <- readRDS(file_path2)
```

## Prepare Inputs

The inputs S and X must be data frames, even if each holds only a single variable.


```{r}
# Y and A are numeric vectors
Y <- dat$Y
A <- dat$A

# S and X must be data frames
S <- data.frame(dat$S)
X <- data.frame(dat$X)
``` 
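
If a surrogate or covariate is stored as a plain vector, wrapping it in data.frame() is one way to meet this requirement. The values and the names s_vec and S_demo below are made up for illustration.

```{r}
# Made-up single surrogate stored as a numeric vector
s_vec <- c(0.2, 1.5, 0.7, 2.1)

# Wrap it so it becomes a one-column data frame
S_demo <- data.frame(S1 = s_vec)

is.data.frame(S_demo)
dim(S_demo)
```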


## Estimate ATE with SMMAL & Output

Users can choose the model used for the nuisance functions by setting the cf_model parameter.
If cf_model is not specified, it defaults to "bspline".

After cross-validation and prediction, the best-performing model is selected based on the lowest cross-entropy (log loss).
Users can control the number of cross-validation folds by setting the nfold parameter.
If nfold is not specified, it defaults to 5.
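
To illustrate the selection rule, the sketch below computes the log loss for two made-up sets of predicted probabilities and keeps the one with the lower value. The function log_loss_demo() and its clipping constant are assumptions for illustration; the package's internal implementation may differ.

```{r}
# Hypothetical cross-entropy helper; not the package's internal function.
log_loss_demo <- function(true_labels, pred_probs, eps = 1e-15) {
  # Clip probabilities away from 0 and 1 so log() stays finite
  p <- pmin(pmax(pred_probs, eps), 1 - eps)
  -mean(true_labels * log(p) + (1 - true_labels) * log(1 - p))
}

y_true <- c(1, 0, 1, 1, 0)

# Made-up predictions from two candidate models
candidates <- list(
  model_a = c(0.9, 0.2, 0.8, 0.7, 0.1),
  model_b = c(0.6, 0.5, 0.5, 0.6, 0.4)
)

# Log loss of each candidate against the true labels
losses <- sapply(candidates, log_loss_demo, true_labels = y_true)
losses

# Name of the candidate with the lowest log loss
names(which.min(losses))
```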
  
```{r}
SMMAL_output1 <- SMMAL(Y = Y, A = A, S = S, X = X)
print(SMMAL_output1)
```

Other options for cf_model include "xgboost":
  
```{r}
SMMAL_output2 <- SMMAL(Y = Y, A = A, S = S, X = X, cf_model = "xgboost")
print(SMMAL_output2)
```

or "randomforest":

```{r}
SMMAL_output3 <- SMMAL(Y = Y, A = A, S = S, X = X, cf_model = "randomforest")
print(SMMAL_output3)
```

or "glm":

```{r}
SMMAL_output4 <- SMMAL(Y = Y, A = A, S = S, X = X, cf_model = "glm")
print(SMMAL_output4)
```

## Using Your Own custom_model_fun

Users may customize the feature‐selection or penalization strategy by supplying their own function through the custom_model_fun argument. To do so, pass a function that meets these requirements:

1. Function Signature
It must accept exactly these arguments (in this order):
X, Y, X_full, foldid, foldid_labelled, sub_set, labeled_indices, nfold, log_loss

(X, Y, X_full, foldid, foldid_labelled, sub_set, labeled_indices, and nfold are supplied internally by SMMAL to partition and fit the data.)

(log_loss is a function for computing cross‐entropy (log‐loss). Your function should call log_loss(true_labels, predicted_probs) to evaluate each tuning parameter.)

2. Return Value
It must return a list of length equal to the number of “ridge” penalty values defined in param_fun(). Each element of that list should be a numeric vector of length n containing out‐of‐fold predicted probabilities for all observations—i.e., it should stack together predictions from every held‐out fold (no NA values, except where Y is genuinely missing).
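
The return structure can be hard to picture, so here is a base-R sketch of stacking out-of-fold predictions for several penalty values. It is not a drop-in custom_model_fun: the helper names (fit_ridge_logit, oof_predictions), the simulated data, and the penalty grid are all assumptions, and the ridge-penalized logistic regression fitted with optim() merely stands in for whatever model you choose.

```{r}
# Ridge-penalized logistic regression fitted by direct optimization
# (hypothetical helper; the intercept is not penalized)
fit_ridge_logit <- function(X, y, lambda) {
  Xd <- cbind(1, X)
  neg_loglik <- function(b) {
    eta <- drop(Xd %*% b)
    -sum(y * eta - log1p(exp(eta))) + lambda * sum(b[-1]^2)
  }
  optim(rep(0, ncol(Xd)), neg_loglik, method = "BFGS")$par
}

# Returns one vector of out-of-fold probabilities per penalty value
oof_predictions <- function(X, Y, foldid, nfold, ridge_grid) {
  labeled <- !is.na(Y)
  lapply(ridge_grid, function(lambda) {
    preds <- rep(NA_real_, length(Y))
    for (k in seq_len(nfold)) {
      # Fit on labeled rows outside fold k, predict labeled rows inside it
      train <- labeled & foldid != k
      test <- labeled & foldid == k
      b <- fit_ridge_logit(X[train, , drop = FALSE], Y[train], lambda)
      preds[test] <- plogis(drop(cbind(1, X[test, , drop = FALSE]) %*% b))
    }
    preds
  })
}

# Simulated data with 40% of outcomes missing
set.seed(2)
n_demo <- 200
X_demo <- matrix(rnorm(n_demo * 3), ncol = 3)
Y_demo <- rbinom(n_demo, 1, plogis(X_demo[, 1]))
Y_demo[sample(n_demo, 80)] <- NA
fold_demo <- sample(rep(1:5, length.out = n_demo))

fold_preds <- oof_predictions(
  X_demo, Y_demo, fold_demo,
  nfold = 5, ridge_grid = c(0.1, 1, 10)
)

# One list element per penalty value, each of length n_demo
length(fold_preds)
sapply(fold_preds, length)
```

In this sketch, entries are NA only where the outcome itself is missing, which matches the return-value requirement described above.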

Below is an example showing how to plug in the packaged SMMAL_ada_lasso() as custom_model_fun. In practice, you could substitute any function with the same signature and return type:

```{r}
SMMAL_output5 <- SMMAL(Y = Y, A = A, S = S, X = X, custom_model_fun = SMMAL_ada_lasso)
print(SMMAL_output5)
```

### Understanding SMMAL_ada_lasso

```{r}
SMMAL_ada_lasso
```


### Input of SMMAL_ada_lasso

```{r}
str(data_loaded)
```

Input: X, Y, X_full, foldid, foldid_labelled, sub_set, labeled_indices, nfold, log_loss

X: Matrix of predictors for the labeled observations.

Y: Outcome vector of length n, binary, may contain NA for unlabeled rows.

X_full: The full matrix of predictors for all observations.

foldid: A vector assigning each observation (labeled or unlabeled) to a fold.

foldid_labelled: Integer vector assigning labeled rows to CV folds (1 to nfold); NA for unlabeled.

sub_set: Logical or integer vector indicating rows included in supervised CV.

labeled_indices: Indices of labeled observations (where Y is not missing).

nfold: Number of cross-validation folds (e.g., 5 or 10).

log_loss: Function that computes log-loss: log_loss(true_labels, pred_probs) returns a single numeric.
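
To make the shapes of the fold-related inputs concrete, the sketch below builds objects matching the descriptions above from a toy outcome vector. How SMMAL constructs these objects internally is not shown in this vignette, so the random fold assignment and all names ending in _toy are assumptions.

```{r}
# Toy outcome with missing labels; values are made up
set.seed(3)
Y_toy <- c(1, NA, 0, 1, NA, 0, 1, 0, NA, 1)
nfold_toy <- 2

# Indices of rows whose outcome is observed
labeled_idx_toy <- which(!is.na(Y_toy))

# Fold for every row, labeled or not
foldid_toy <- sample(rep(seq_len(nfold_toy), length.out = length(Y_toy)))

# Fold for labeled rows only; NA marks unlabeled rows
foldid_labelled_toy <- ifelse(is.na(Y_toy), NA_integer_, foldid_toy)

labeled_idx_toy
foldid_labelled_toy
```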


### Running SMMAL_ada_lasso and Its Output

Output: fold_predictions

When used as a custom_model_fun, SMMAL_ada_lasso() returns a list with one element per ridge value. Each element is a numeric vector whose length equals the total number of observations and which contains the cross-validated predicted probabilities for that ridge value.

Below is a sample run of SMMAL_ada_lasso and its output.


```{r}
SMMAL_fold_predictions <- SMMAL_ada_lasso(
  X = data_loaded$X,
  Y = data_loaded$Y,
  X_full = data_loaded$X_full,
  foldid = data_loaded$foldid,
  foldid_labelled = data_loaded$foldid_labelled,
  sub_set = data_loaded$sub_set,
  labeled_indices = data_loaded$labeled_indices,
  nfold = data_loaded$nfold,
  log_loss = data_loaded$log_loss
)

str(SMMAL_fold_predictions)
```