--- title: "CAEN Tutorial" author: "Yan Zhou" package: CAEN date: "`r Sys.Date()`" output: BiocStyle::html_document vignette: > %\VignetteIndexEntry{CAEN Tutorial} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- # Introduction This package provides a feature selection method for single-cell RNA-seq data. It encodes the category number and calculate the spearman correlation coefficient. The encoding the category number (CAEN) method is developed for selecting feature genes, which is differentially expressed between classes. We have implemented the CAEN method via a set of R functions. We make a R package named CAEN, and give a tutorial for the package. The method consist three steps. Step 1: Data Pre-processing; Step 2: Encoding the category number; Calculating the spearman correlation coefficient between the gene and category number; Step 3: Calculate the classification error using the genes selected. We employ a simulation dataset to illustrate the usage of the CAEN package. The programs can run under the Linux system and windows 10 system. The R versions should larger than 4.0. # Preparations To install the CAEN package into your R environment, start R and enter: ```{r, eval=FALSE} install.packages("BiocManager") BiocManager::install("CAEN") ``` Then, the CAEN package is ready to load. ```{r,message=FALSE,warning=FALSE} library(CAEN) library(SummarizedExperiment) ``` # Data format In order to show the CAEN workflow, the package includes the example data sets, which is a real dataset from *Woo, Y., Kim, D., Koo, N., Kim, Y., Lee, S., Ko, J., et al. (2017). Profiling of mirnas and target genes related to cystogenesis in adpkd mouse models. scientific reports 7, 14151.* The next we will show the dataset in the package. The dataset is in the data subdirectory of CAEN package, to consistent with standard Bioconductor representations, we transform the format of dataset as *SummarizedExperiment*, please refer R package * SummarizedExperiment* for details. In the dataset, gene expression is compared between two mouse models, Pkd1f/f: HoxB7-cre mice and Pkd2f/f: HoxB7-cre mice. Each group includes 18 samples, and there are total 29996 transcripts in this dataset. The next we will load the data and use CAEN function to find the most differentially expressed genes. ```{r} data("realData") realData ``` **readData** includes 2 groups, each group includes 18 samples, and there are total 29996 transcripts. * the columns are samples of dataset; * the rows are transcripts of dataset. ```{r, message=FALSE, warning=FALSE} x <- realData y <- c(rep(1,18),rep(2,18)) xte <- realData prob <- estimatep(x = x, y = y, xte = x, beta = 1, type = "mle", prior = NULL) prob0 <- estimatep(x = x, y = y, xte = xte, beta = 1, type = "mle", prior = NULL) myscore <- CAEN(dataTable = x, y = y, K = 2, gene_no_list = 100) ``` The output of the function *CAEN* is: A list of computed correlation coefficient and the first some differentially expressed genes , where "r" represents correlation coefficient between gene and category number, and "np" represents the top differential feature label. # Calculate classification error rate using genes selected with CAEN method Getting the important gene, we Calculate classification error rate using genes selected. The step is as follows: ```{r,message=FALSE,warning=FALSE} ddd <- myscore$np datx <- t(assay(x)[ddd,]) datxte <- t(assay(xte)[ddd,]) probb <- prob[ddd,] probb0 <- prob0[ddd,] zipldacv.out <- ZIPLDA.cv(x = datx, y = y, prob0 = t(probb)) ZIPLDA.out <- ZIPLDA(x = datx, y = y, xte = datxte, transform = FALSE, prob0 = t(probb0),rho = zipldacv.out$bestrho) classResult <- ZIPLDA.out$ytehat ``` In order to reproduce the presented CAEN workflow, the package also includes the simulated data sets, which is generated by function *newCountDataSet()*. The next we will give an example for how to generate simulation dataset: ```{r} dat <- newCountDataSet(n = 100, p = 500, K = 4, param = 10, sdsignal = 2, drate = 0.2) ``` The output of the function *newCountDataSet()* includes: * "sim_train_data" represents training data of q*n data matrix. * "sim_test_data" represents test data of q*n data matrix. The colnames of this two matrix are class labels for the n observations. May have q

0 total counts in dataset. So q <= p. * "truesf" denotes size factors for training observations. * "isDE" represnts the differential gene label. *Calculate the spearman correlation coefficient for data* For the category number, we need to consider not only the difference between class but also the Intra-category difference. Therefore, we propose CAEN method, by encoding the category number within class, it get the optimal category number and select the most important genes used for classification. ```{r, message=FALSE, warning=FALSE} x <- t(assay(dat$sim_train_data)) y <- as.numeric(colnames(dat$sim_train_data)) xte <- t(assay(dat$sim_test_data)) prob <- estimatep(x = x, y = y, xte = x, beta = 1, type = c("mle","deseq","quantile"), prior = NULL) prob0 <- estimatep(x = x, y = y, xte = xte, beta = 1, type = c("mle","deseq","quantile"), prior = NULL) myscore <- CAEN(dataTable = assay(dat$sim_train_data), y = as.numeric(colnames(dat$sim_train_data)), K = 4, gene_no_list = 100) ``` The output of the function *CAEN* is: A list of computed correlation coefficient and the first some differentially expressed genes , where "r" represents correlation coefficient between gene and category number, and "np" represents the top differential feature label. *Calculate classification error rate using genes selected with CAEN method* Getting the important gene, we Calculate classification error rate using genes selected. The step is as follows: ```{r,message=FALSE,warning=FALSE} ddd <- myscore$np datx <- x[,ddd] datxte <- xte[,ddd] probb <- prob[ddd,] probb0 <- prob0[ddd,] zipldacv.out <- ZIPLDA.cv(x = datx, y = y, prob0 = t(probb)) ZIPLDA.out <- ZIPLDA(x = datx, y = y, xte = datxte, transform = FALSE, prob0 = t(probb0), rho = zipldacv.out$bestrho) classResult <- ZIPLDA.out$ytehat ``` ```{r} sessionInfo() ```