---
title: "SMMAL_vignette"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{SMMAL_vignette}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Introduction

This vignette demonstrates how to use the SMMAL package to estimate the Average Treatment Effect (ATE) using semi-supervised machine learning. We provide an example dataset and walk through the required input format and function usage.

## Import Sample Data

The sample data contain 1000 observations, with 60% of Y and A missing at random; missing values are encoded as NA. Y is the outcome, A is the treatment indicator, X contains the covariates, and S contains the surrogates.

The package can handle datasets with a high proportion of missing values, but it requires a sample large enough that each cross-validation fold contains at least 20 labeled observations.

```{r}
library(SMMAL)

file_path <- system.file("extdata", "sample_data_withmissing.rds", package = "SMMAL")
dat <- readRDS(file_path)

file_path2 <- system.file("extdata", "semi_supervised_data.rds", package = "SMMAL")
data_loaded <- readRDS(file_path2)
```

## Prepare Inputs

The inputs S and X must be data frames, even if each is a single vector.

```{r}
# Y and A are numeric vectors
Y <- dat$Y
A <- dat$A

# S and X must be data frames
S <- data.frame(dat$S)
X <- data.frame(dat$X)
```

## Estimate ATE with SMMAL & Output

Users choose the model for the nuisance functions by setting the cf_model parameter; if cf_model is not specified, the default is "bspline". After cross-validation and prediction, the best-performing model is selected based on the lowest cross-entropy (log loss).

The nfold parameter controls how many folds are used in cross-validation; if nfold is not specified, the default is 5.

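
The requirement stated earlier, that each cross-validation fold contain at least 20 labeled observations, can be checked before choosing nfold. The chunk below is a minimal sketch and not part of SMMAL: min_labeled_per_fold() and Y_toy are hypothetical names, and it assumes labeled observations are split as evenly as possible across folds, which may not match how SMMAL assigns folds internally.

```{r}
# Hypothetical helper (not part of SMMAL): size of the smallest fold when
# labeled observations are split as evenly as possible across nfold folds.
min_labeled_per_fold <- function(Y, nfold = 5) {
  n_labeled <- sum(!is.na(Y))  # labeled rows are those with an observed outcome
  floor(n_labeled / nfold)
}

# Toy outcome mirroring the shape of the sample data: 1000 rows, 60% missing.
Y_toy <- c(rep(c(0, 1), times = 200), rep(NA, 600))

# Smallest labeled fold size with the default of 5 folds
min_labeled_per_fold(Y_toy, nfold = 5)

# Does that satisfy the 20-labeled-observations guideline?
min_labeled_per_fold(Y_toy, nfold = 5) >= 20
```

With 400 labeled rows, 5 folds leave 80 labeled rows per fold. Under the same even-split assumption, this toy outcome would drop below 20 labeled rows per fold only with more than 20 folds.
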
```{r}
SMMAL_output1 <- SMMAL(Y = Y, A = A, S = S, X = X)
print(SMMAL_output1)
```

Other options for cf_model are "xgboost"

```{r}
SMMAL_output2 <- SMMAL(Y = Y, A = A, S = S, X = X, cf_model = "xgboost")
print(SMMAL_output2)
```

or "randomforest"

```{r}
SMMAL_output3 <- SMMAL(Y = Y, A = A, S = S, X = X, cf_model = "randomforest")
print(SMMAL_output3)
```

or "glm"

```{r}
SMMAL_output4 <- SMMAL(Y = Y, A = A, S = S, X = X, cf_model = "glm")
print(SMMAL_output4)
```

## Using Your Own custom_model_fun

Users may customize the feature-selection or penalization strategy by supplying their own function through the custom_model_fun argument. To do so, pass a function that meets these requirements:

1. Function Signature

   It must accept exactly these arguments (in this order), matching the packaged SMMAL_ada_lasso() shown later in this vignette:

   X, Y, X_full, foldid, foldid_labelled, sub_set, labeled_indices, nfold, log_loss

   - X, Y, X_full, foldid, foldid_labelled, sub_set, labeled_indices, and nfold are used internally by SMMAL to partition and fit the data.
   - log_loss is a function for computing cross-entropy (log loss). Your function should call log_loss(true_labels, predicted_probs) to evaluate each tuning parameter.

2. Return Value

   It must return a list of length equal to the number of "ridge" penalty values defined in param_fun(). Each element of that list should be a numeric vector of length n containing out-of-fold predicted probabilities for all observations; that is, it should stack together the predictions from every held-out fold (no NA values, except where Y is genuinely missing).

Below is an example showing how to plug in the packaged SMMAL_ada_lasso() as custom_model_fun.
In practice, you could substitute any function with the same signature and return type:

```{r}
SMMAL_output5 <- SMMAL(Y = Y, A = A, S = S, X = X, custom_model_fun = SMMAL_ada_lasso)
print(SMMAL_output5)
```

### Understanding SMMAL_ada_lasso

```{r}
SMMAL_ada_lasso
```

### Input of SMMAL_ada_lasso

```{r}
str(data_loaded)
```

Input: X, Y, X_full, foldid, foldid_labelled, sub_set, labeled_indices, nfold, log_loss

- X: Matrix of predictors for the labeled observations.
- Y: Binary outcome vector of length n; may contain NA for unlabeled rows.
- X_full: Matrix of predictors for all observations.
- foldid: Vector assigning each observation (labeled or unlabeled) to a fold.
- foldid_labelled: Integer vector assigning labeled rows to CV folds (1 to nfold); NA for unlabeled rows.
- sub_set: Logical or integer vector indicating the rows included in supervised CV.
- labeled_indices: Indices of labeled observations (where Y is not missing).
- nfold: Number of cross-validation folds (e.g., 5 or 10).
- log_loss: Function that computes log loss; log_loss(true_labels, pred_probs) returns a single numeric value.

### Demonstration of how to run SMMAL_ada_lasso & Output of SMMAL_ada_lasso

Output: fold_predictions

When you use SMMAL_ada_lasso() as a custom_model_fun, it returns a list in which each element is a numeric vector of length equal to the total number of observations, containing the cross-validated predicted probabilities for the corresponding ridge value.

Below is a sample run and the output of SMMAL_ada_lasso:

```{r}
SMMAL_fold_predictions <- SMMAL_ada_lasso(
  X = data_loaded$X,
  Y = data_loaded$Y,
  X_full = data_loaded$X_full,
  foldid = data_loaded$foldid,
  foldid_labelled = data_loaded$foldid_labelled,
  sub_set = data_loaded$sub_set,
  labeled_indices = data_loaded$labeled_indices,
  nfold = data_loaded$nfold,
  log_loss = data_loaded$log_loss
)
str(SMMAL_fold_predictions)
```
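
To illustrate how a list of prediction vectors like this can be compared by cross-entropy, the chunk below scores each candidate on the labeled rows and picks the one with the lowest log loss. It is a minimal sketch on toy data, not the package's internal selection code: toy_log_loss(), Y_toy_small, and toy_predictions are hypothetical names, and the clipping constant eps is an assumption made here to keep the logarithm finite.

```{r}
# Hypothetical stand-in for the log_loss argument described in this vignette;
# the packaged implementation may differ (for example, in how it clips).
toy_log_loss <- function(true_labels, pred_probs, eps = 1e-15) {
  p <- pmin(pmax(pred_probs, eps), 1 - eps)  # keep log() finite at 0 and 1
  -mean(true_labels * log(p) + (1 - true_labels) * log(1 - p))
}

# Toy outcome: four labeled rows followed by two unlabeled rows (NA).
Y_toy_small <- c(1, 0, 1, 0, NA, NA)
labeled_toy <- which(!is.na(Y_toy_small))

# Toy list in the shape described in this section: one full-length vector of
# predicted probabilities per candidate penalty value.
toy_predictions <- list(
  c(0.9, 0.1, 0.8, 0.2, 0.5, 0.5),  # sharper predictions
  c(0.6, 0.4, 0.6, 0.4, 0.5, 0.5)   # weaker predictions
)

# Score each candidate on the labeled rows only, then pick the lowest loss.
toy_losses <- vapply(
  toy_predictions,
  function(p) toy_log_loss(Y_toy_small[labeled_toy], p[labeled_toy]),
  numeric(1)
)
best_toy <- which.min(toy_losses)

toy_losses
best_toy
```

Here the first candidate has the lower log loss, so it is the one selected. Unlabeled rows are excluded from scoring because their true outcome is unknown.
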