---
title: "CAEN Tutorial"
author: "Yan Zhou"
package: CAEN
date: "`r Sys.Date()`"
output: BiocStyle::html_document
vignette: >
    %\VignetteIndexEntry{CAEN Tutorial}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
---


# Introduction
This package provides a feature selection method for single-cell RNA-seq 
data. It encodes the category number and calculate the spearman 
correlation coefficient. 

The encoding the category number (CAEN) method is developed for selecting 
feature genes, which is differentially expressed between classes. 
We have implemented the CAEN method via a set of R functions. 
We make a R package named CAEN, and give a tutorial for the package. 
The method consist three steps.

Step 1: Data Pre-processing;  

Step 2: Encoding the category number; Calculating the spearman 
correlation coefficient between the gene and category number;

Step 3: Calculate the classification error using the genes selected.

We employ a simulation dataset to illustrate the usage of the CAEN package.
The programs can run under the Linux system and windows 10 system. The
R versions should larger than 4.0.

# Preparations
To install the CAEN package into your R environment, start R and
enter:
```{r, eval=FALSE}
install.packages("BiocManager")
BiocManager::install("CAEN")
```


Then, the CAEN package is ready to load.
```{r,message=FALSE,warning=FALSE}
library(CAEN)
library(SummarizedExperiment)
```



# Data format
In order to show the CAEN workflow, the package includes the example data sets, 
which is a real dataset from *Woo, Y., Kim, D., Koo, N., Kim, Y., Lee, S., 
Ko, J., et al. (2017). Profiling of mirnas and target genes related to 
cystogenesis in adpkd mouse models. scientific reports 7, 14151.*  
The next we will show the dataset in the package.

The dataset is in the data subdirectory of CAEN package, to consistent 
with standard Bioconductor representations, we transform the format of dataset
as *SummarizedExperiment*, please refer R package * SummarizedExperiment* for 
details. In the dataset, gene expression is compared between two mouse models, 
Pkd1f/f: HoxB7-cre mice and Pkd2f/f: HoxB7-cre mice. Each group includes 18 
samples, and there are total 29996 transcripts in this dataset. The next we 
will load the data and use CAEN function to find the most differentially 
expressed genes. 



```{r}
data("realData")
realData
```

**readData** includes 2 groups, each group includes 18 samples, and there are 
total 29996 transcripts. 
* the columns are samples of dataset;
* the rows are transcripts of dataset.


```{r, message=FALSE, warning=FALSE}
x <- realData                  
y <- c(rep(1,18),rep(2,18))      
xte <- realData                

prob <- estimatep(x = x, y = y, xte = x, beta = 1, type = "mle", prior = NULL)     
prob0 <- estimatep(x = x, y = y, xte = xte, beta = 1, type = "mle", 
                   prior = NULL)   
myscore <- CAEN(dataTable = x, y = y, K = 2, gene_no_list = 100)
```

The output of the function *CAEN* is:
A list of computed correlation coefficient and the first some differentially 
expressed genes , where "r" represents correlation coefficient between gene 
and category number, and "np" represents the top differential feature label.


# Calculate classification error rate using genes selected with CAEN method
Getting the important gene, we Calculate classification error rate 
using genes selected. The step is as follows:
```{r,message=FALSE,warning=FALSE}
ddd <- myscore$np
datx <- t(assay(x)[ddd,])
datxte <- t(assay(xte)[ddd,])
probb <- prob[ddd,]
probb0 <- prob0[ddd,]

zipldacv.out <- ZIPLDA.cv(x = datx, y = y, prob0 = t(probb))
ZIPLDA.out <- ZIPLDA(x = datx, y = y, xte = datxte, transform = FALSE, 
                     prob0 = t(probb0),rho = zipldacv.out$bestrho)
classResult <- ZIPLDA.out$ytehat
```






In order to reproduce the presented CAEN workflow, the package also includes 
the simulated data sets, which is generated by function *newCountDataSet()*. 
The next we will give an example for how to generate simulation dataset:

```{r}
dat <- newCountDataSet(n = 100, p = 500, K = 4, param = 10, sdsignal = 2, 
                       drate = 0.2)
```

The output of the function *newCountDataSet()* includes:
* "sim_train_data" represents training data of q*n data matrix.   
* "sim_test_data" represents test data of q*n data matrix.   
The colnames of this two matrix are class labels for the n observations. May 
have q<p because features with 0 total counts are removed. The q features are 
those with >0 total counts in dataset. So q <= p.   
* "truesf" denotes size factors for training observations.   
* "isDE" represnts the differential gene label.   




 *Calculate the spearman correlation coefficient for data*
For the category number, we need to consider not only the difference 
between class but also the Intra-category difference. Therefore, we 
propose CAEN method, by encoding the category number within class, 
it get the optimal category number and select the most important genes 
used for classification.

```{r, message=FALSE, warning=FALSE}
x <- t(assay(dat$sim_train_data))                  
y <- as.numeric(colnames(dat$sim_train_data))      
xte <- t(assay(dat$sim_test_data))                 

prob <- estimatep(x = x, y = y, xte = x, beta = 1, 
                  type = c("mle","deseq","quantile"),
                  prior = NULL)      
prob0 <- estimatep(x = x, y = y, xte = xte, beta = 1, 
                   type = c("mle","deseq","quantile"),
                   prior = NULL)   
myscore <- CAEN(dataTable = assay(dat$sim_train_data), 
                y = as.numeric(colnames(dat$sim_train_data)), K = 4,
                gene_no_list = 100)
```

The output of the function *CAEN* is:
A list of computed correlation coefficient and the first some differentially 
expressed genes , where "r" represents correlation coefficient between gene 
and category number, and "np" represents the top differential feature label.



 *Calculate classification error rate using genes selected with CAEN method*
Getting the important gene, we Calculate classification error rate 
using genes selected. The step is as follows:
```{r,message=FALSE,warning=FALSE}
ddd <- myscore$np
datx <- x[,ddd]
datxte <- xte[,ddd]
probb <- prob[ddd,]
probb0 <- prob0[ddd,]

zipldacv.out <- ZIPLDA.cv(x = datx, y = y, prob0 = t(probb))
ZIPLDA.out <- ZIPLDA(x = datx, y = y,
                     xte = datxte, transform = FALSE, prob0 = t(probb0),
                     rho = zipldacv.out$bestrho)
classResult <- ZIPLDA.out$ytehat
```


```{r}
sessionInfo()
```