---
title: "Apparent Membrane Permeability (Caco-2)"
author: "US EPA's Center for Computational Toxicology and Exposure ccte@epa.gov"
output:
  rmarkdown::html_vignette
bibliography: '`r system.file("REFERENCES.bib", package="invitroTKstats")`'
vignette: >
  %\VignetteIndexEntry{Caco-2}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.align = 'center'
)
```

## Introduction 

This vignette guides users on how to estimate apparent membrane permeability (P~app~) from mass spectrometry data. Apparent membrane permeability is a chemical specific parameter that describes the general transport of the chemical through a membrane (@honda2025impact, @hubatsch2007determination). 

The mass spectrometry data should be collected from a Caco-2 assay in which the test compound was added to either the apical or basolateral side of a confluent and differentiated Caco-2 cell monolayer as seen in Figure 1 (@honda2025impact, @hubatsch2007determination). 

```{r, echo = FALSE, out.width = "60%", fig.cap = "Fig 1: Caco-2 experimental set up", fig.topcaption = TRUE, fig.align = "center"}
knitr::include_graphics("img/Caco2_diagram.png")
```

### Suggested packages for use with this vignette 

```{r setup, message = FALSE, warning = FALSE}
# Primary package 
library(invitroTKstats)
# Data manipulation package 
library(dplyr)
# Table formatting package
library(flextable)
```

## Load Data 

First, we load in the example dataset from `invitroTKstats`. 

```{r Load example data}
# Load example caco2 data 
data("Caco2-example")
```

Five datasets are loaded in: `caco2_cheminfo`, `caco2_L0`, `caco2_L1`, `caco2_L2`, and `caco2_L3`. The first, `caco2_cheminfo`, contains chemical information necessary for identification mapping; it is used to create Level 0 data. The latter datasets contain Caco-2 data at Level 0, 1, 2, and 3 respectively. For the purpose of this vignette, we'll start with `caco2_L0`, the Level 0 data, to demonstrate the complete pipelining process. 

`caco2_L0` is the output from the `merge_level0` function which compiles raw lab data from specified Excel files into a singular data frame. The data frame contains exactly one row per sample with information obtained from the mass spectrometer. For more details on curating raw lab data to a singular Level 0 data frame, see the "Data Guide Creation and Level-0 Data Compilation" vignette. 

The following table displays the first three rows of `caco2_L0`, our Level 0 data.

```{r, echo = FALSE, tab = 1}
head(caco2_L0, n = 3) %>% 
  flextable() %>% 
  bg(bg = "#DDDDDD", part = "header") %>% 
  autofit() %>% 
  set_table_properties(
    opts_html = list(
      scroll = list(
      )
      )
  ) %>% 
  set_caption(caption = "Table 1: Level 0 data",
              align_with_table = FALSE) %>% 
  fontsize(size = 10, part = "body") %>% 
  theme_vanilla()
```

## Level 1 processing

`format_caco2` is the Level 1 function used to create a standardized data frame. This level of processing is necessary because naming conventions or formatting can differ across data sets. 

If the Level 0 data already contains the required column, then the existing column name can be specified. For example, `caco2_L0` already contains a column specifying the sample name called "Sample". However, the default column name for sample name is "Lab.Sample.Name". Therefore, we specify the correct column with `sample.col = "Sample"`. In general, to specify an already existing column that differs from the default, the user must use the parameter with the `.col` suffix. 

If the Level 0 data does not already contain the required column, then the entire column can be populated with a single value. For example, `caco2_L0` does not contain a column specifying biological replicates. Therefore, we populate the required column with `biological.replicates = 1`. In general, to specify a single value for an entire column, the user must use the parameter without the `.col` suffix.

Users should be mindful if they choose to specify a single value for all of their samples; they should verify this action is one they wish to take. 

Some columns must be present in the Level 0 data while others can be filled with a single value. At minimum, the following columns must be present in the Level 0 data and specification with a single entry is not permitted: `sample.col`, `lab.compound.col`, `dtxsid.col`, `compound.col`, `area.col`, `istd.col`, `type.col`, `direction.col`, `receiver.vol.col`, and `donor.vol.col`.

If there is no additional `note.col` in the Level 0 data, users should use `note.col = NULL` to fill the column with "Note". 

The rest of the following columns may either be specified from the Level 0 data or filled with a single value: `date.col` or `date`, `membrane.area.col` or `membrane.area`, `test.conc.col` or `test.conc`, `cal.col` or `cal`, `dilution.col` or `dilution`, `time.col` or `time`, `istd.name.col` or `istd.name`, `istd.conc.col` or `istd.conc`, `test.nominal.conc.col` or `test.nominal.conc`, `biological.replicates.col` or `biological.replicates`, `technical.replicates.col` or `technical.replicates`, `analysis.method.col` or `analysis.method`, `analysis.instrument.col` or `analysis.instrument`, `analysis.parameters.col` or `analysis.parameters`, `level0.file.col` or `level0.file`, and `level0.sheet.col` or `level0.sheet`.

```{r required cols, echo = FALSE}
# Create a table of required arguments for Level 1

req_cols <- data.frame(matrix(nrow = 34, ncol = 5))
vars <- c("Argument", "Default", "Required in L0?", "Corresp. single-entry Argument", "Descr.")
colnames(req_cols) <- vars

# Argument names 
arguments <- c("FILENAME", "data.in", "sample.col", "lab.compound.col", "dtxsid.col", "date.col", 
                "compound.col", "area.col", "istd.col", "type.col", "direction.col", "membrane.area.col",
                "receiver.vol.col", "donor.vol.col", "test.conc.col", "cal.col", "dilution.col",
                "time.col", "istd.name.col", "istd.conc.col", "test.nominal.conc.col", "biological.replicates.col",
                "technical.replicates.col", "analysis.method.col", "analysis.instrument.col", "analysis.parameters.col",
                "note.col", "level0.file.col", "level0.sheet.col", "output.res", "save.bad.types", "sig.figs", "INPUT.DIR", "OUTPUT.DIR")
req_cols[, "Argument"] <- arguments

# Default arguments 
defaults <- c("MYDATA", NA, "Lab.Sample.Name", "Lab.Compound.Name",
              "DTXSID", "Date", "Compound.Name", "Area", 
              "ISTD.Area", "Type", "Direction", "Membrane.Area",
              "Vol.Receiver", "Vol.Donor", "Test.Compound.Conc", 
              "Cal", "Dilution.Factor", "Time", "ISTD.Name", 
              "ISTD.Conc", "Test.Target.Conc",
              "Biological.Replicates", "Technical.Replicates",
              "Analysis.Method", "Analysis.Instrument", 
              "Analysis.Parameters", "Note", "Level0.File",
              "Level0.Sheet", "FALSE", "FALSE", "5", "NULL", "NULL")
req_cols[, "Default"] <- defaults 

# Argument required in L0?
req_cols <- req_cols %>% 
  mutate("Required in L0?" = case_when(
    Argument %in% c("sample.col", "lab.compound.col", "dtxsid.col", "compound.col", "area.col", "istd.col", 
  "type.col", "direction.col", "receiver.vol.col", "donor.vol.col") ~ "Y",
  Argument %in% c("FILENAME", "data.in", "output.res", "save.bad.types", "sig.figs", "INPUT.DIR", "OUTPUT.DIR") ~ "N/A",
  .default = "N"
  ))

# Corresponding single-entry Argument 
req_cols <- req_cols %>% 
  mutate("Corresp. single-entry Argument" = ifelse(.data[[vars[[3]]]] == "N" & .data[[vars[[1]]]] != "note.col", 
                                                    gsub(".col" ,"", Argument), NA)) 

# Brief description 
description <- c("Output and input filename",
                 "Level 0 data frame", 
                 "Lab sample name", 
                 "Lab test compound name (abbr.)",
                 "EPA's DSSTox Structure ID",
                 "Lab measurement date",
                 "Formal test compound name",
                 "Target analyte peak area",
                 "Internal standard peak area",
                 "Sample type (Blank/D0/D2/R2)",
                 "Experiment direction",
                 "Membrane area",
                 "Receiver compartment volume",
                 "Donor compartment volume",
                 "Standard test chemical concentration",
                 "MS calibration",
                 "Number of times sample was diluted",
                 "Time before compartments were measured",
                 "Internal standard name",
                 "Internal standard concentration",
                 "Expected initial chemical concentration added to donor",
                 "Replicates with the same analyte",
                 "Repeated measurements from one sample",
                 "Analytical chemistry analysis method",
                 "Analytical chemistry analysis instrument",
                 "Analytical chemistry analysis parameters",
                 "Additional notes",
                 "Raw data filename",
                 "Raw data sheet name",
                 "Export results (TSV)?",
                 "Export bad data (TSV)?",
                 "Number of significant figures",
                 "Input directory of Level 0 file",
                 "Export directory to save Level 1 files")
req_cols[, "Descr."] <- description
```

```{r, echo = FALSE}
req_cols %>% 
  flextable() %>% 
  bg(bg = "#DDDDDD", part = "header") %>% 
  autofit() %>% 
  set_table_properties(
    opts_html = list(
      scroll = list(height = 200)
      )
  ) %>% 
  set_caption("Table 2: Level 1 'format_caco2' function arguments", align_with_table = FALSE) %>% 
  fontsize(size = 10, part = "body") %>% 
  theme_vanilla()
```

A TSV file containing the level-1 data can be exported to the user's per-session temporary directory. This temporary directory is a per-session directory whose path can be found with the following code: `tempdir()`. For more details, see 
[https://www.collinberke.com/til/posts/2023-10-24-temp-directories/].

To avoid exporting to this temporary directory, an `OUTPUT.DIR` must be specified. We have omitted this export entirely with `output.res = FALSE` (the default). The option to omit exporting a TSV file is also available at levels 2 and 3 and will be used from this point forward. 

```{r L1 processing}
caco2_L1_curated <- format_caco2(FILENAME = "Caco2_vignette",
                                 data.in = caco2_L0,
                                 # columns present in L0 data 
                                 sample.col = "Sample",
                                 lab.compound.col = "Lab.Compound.ID",
                                 compound.col = "Compound",
                                 area.col = "Peak.Area",
                                 istd.col = "ISTD.Peak.Area",
                                 test.conc.col = "Compound.Conc",
                                 analysis.method.col = "Analysis.Params",
                                 # columns not present in L0 data
                                 biological.replicates = 1,
                                 technical.replicates = 1,
                                 membrane.area = 0.11,
                                 cal = 1,
                                 time = 2, 
                                 istd.conc = 1,
                                 test.nominal.conc = 10,
                                 analysis.instrument = "Agilent.GCMS",
                                 analysis.parameters = "Unknown",
                                 note.col = NULL,
                                 # don't export output TSV file
                                 output.res = FALSE
                                 )
```

We receive a message that "Responses of samples with a "Blank" sample type and a NA response have been reassigned to 0." Currently, when the `ISTD.Area` measurement for a sample is 0, the calculated `Response` is NaN, or not a number, because of the division by 0. In this particular case, all of our samples that have a NaN `Response` are "Blank" sample types. But, these "Blank" sample types are needed for point estimation. (Note: NA responses of other sample types are appropriately removed.) Users can verify this with the following check. 

```{r, eval = TRUE}
all(caco2_L1_curated[caco2_L1_curated$ISTD.Area == 0, "Sample.Type"] == "Blank")
```

Therefore, the NaNs are replaced with 0 so that the "Blank" sample types are kept in the dataset and our estimates are accurate. Users can verify that all "Blank" samples with `ISTD.Area` = 0 have their `Response` values reassigned to 0 with the following check.  

```{r}
# Verify Blank samples with ISTD.Area = 0 also have Response = 0
resp <- caco2_L1_curated %>% 
  dplyr::filter(Sample.Type == "Blank" & ISTD.Area == 0) %>% 
  dplyr::select(Response) %>% 
  unlist()

all(resp == 0)
```

Now, all of our samples are successfully formatted and returned in `caco2_L1_curated`, our Level 1 data produced from `format_caco2`. Each sample has one of the following sample types indicated in **bold**. 

1. Blank with no chemical added **(Blank)**
2. Target concentration added to donor compartment at time 0 (can be referred to as C0) **(D0)**
3. Donor compartment at the end of the experiment **(D2)**
4. Receiver compartment at the end of the experiment **(R2)**

If any samples had a different sample type, they would have been removed and reported to the user. If the user wants to export the removed samples as a TSV, the user should set the parameter `save.bad.types = TRUE`.  

The following table displays the first three samples of `caco2_L1_curated` with a non-Blank sample type. In addition to the columns specified by the user, there is an additional column called `Response`. This column is the test compound concentration and is calculated as $\textrm{Response} = \frac{\textrm{Analyte Area}}{\textrm{ISTD Area}}*\textrm{ISTD Conc}$ where $\textrm{Analyte Area}$ is defined by the `Area` column, $\textrm{ISTD Area}$ is defined by the `ISTD.Area` column, and $\textrm{ISTD Conc}$ is defined by the `ISTD.Conc` column.

```{r, echo = FALSE}
# Select non-Blank sample type to display Response col 
caco2_L1_curated %>% 
  filter(!Sample.Type %in% c("Blank", "CC")) %>% 
head(n = 3) %>% 
  flextable() %>% 
  bg(bg = "#DDDDDD", part = "header") %>% 
  autofit() %>% 
  set_table_properties(
    opts_html = list(
      scroll = list()
      )
  ) %>% 
  set_caption("Table 3: Level 1 data", align_with_table = FALSE) %>% 
  fontsize(size = 10, part = "body") %>% 
  theme_vanilla()
```


## Level 2 processing 

`sample_verification` is the Level 2 function used to add a verification column. The verification column indicates whether a sample should be included in the point estimation (Level 3) processing. This column allows users to keep all samples in their data but only utilize the reliable samples for P~app~ estimation. All of the data in Level 2 is identical to the data in Level 1 with the exception of the additional `Verified` column. 

To determine whether a sample should be included, the user should consult the wet-lab scientists from where their data originates or a chemist who may be able to provide reliable rationale for samples that should not be verified. This level of processing allows the user to receive feedback, exclude erroneous or unreliable samples, and produce new P~app~ estimates. Thus, there is always an open channel of communication between the user and the wet-lab scientists or chemists. 

We will use the already processed Level 2 data frame, `caco2_L2`, to regenerate our exclusion data. Note, all of our samples are verified but we are explaining how to create an exclusion list for learning purposes. In general, the user would not have access to the exclusion information *a priori*. 

The exclusion data frame must include the following columns: `Variables`, `Values`, and `Message`. The `Variables` column contains the variable names used to filter the excluded rows. Here, we are using `Lab.Sample.Name` and `DTXSID` to identify the excluded rows separated by a "|". The `Values` column contains the values of the variables, as a character, also separated by a "|". The `Message` column contains the reason for exclusion. Here, we are using the reasons listed in the `Verified` column in `caco2_L2`. The user should refrain from using "|" in any of their descriptions to avoid conflicts with the `sample_verification` function.

```{r L2 processing exclusion}
# Use verification data from loaded in `caco2_L2` data frame 
exclusion <- caco2_L2 %>% 
  filter(Verified != "Y") %>% 
  mutate("Variables" = "Lab.Sample.Name|DTXSID") %>% 
  mutate("Values" = paste(Lab.Sample.Name, DTXSID, sep = "|")) %>% 
  mutate("Message" = Verified) %>% 
  select(Variables, Values, Message)
```

```{r, echo = FALSE }
exclusion %>% 
  flextable() %>% 
  bg(bg = "#DDDDDD", part = "header") %>% 
  autofit() %>% 
  set_table_properties(
    opts_html = list(
      scroll = list()
      )
  ) %>% 
  set_caption("Table 4: Exclusion data frame", align_with_table = FALSE) %>% 
  fontsize(size = 10, part = "body") %>% 
  theme_vanilla()
```

As expected, our exclusion data frame is empty because all of our samples are verified. If all of the user's samples are verified, they simply do not provide an `exclusion.info` data frame in `sample_verification`. 

```{r}
caco2_L2_curated <- sample_verification(FILENAME = "Caco2_vignette",
                                        data.in = caco2_L1_curated,
                                        assay = "Caco-2",
                                        # don't export output TSV file
                                        output.res = FALSE)
```

Our Level 2 data now contains a `Verified` column. If the sample should be included, the column contains a "Y" for yes. If the sample should be excluded, the column contains the reason for exclusion. 

The following table displays some rows of the Level 2 data. 

```{r, echo = FALSE}
caco2_L2_curated %>% 
  head(n = 3) %>% 
  flextable() %>% 
  bg(bg = "#DDDDDD", part = "header") %>% 
  autofit() %>% 
  set_table_properties(
    opts_html = list(
      scroll = list()
      )
  ) %>% 
  set_caption("Table 5: Level 2 data", align_with_table = FALSE) %>% 
  fontsize(size = 10, part = "body") %>% 
  theme_vanilla()
```

## Level 3 processing 

`calc_caco2_point` is the Level 3 function used to calculate the P~app~ point estimate for each test compound using a Frequentist framework.

Mathematically, P~app~ is the amount of compound transported per unit time and is defined as $$P_{app} = \frac{dQ/dt}{c_0*A}$$ where $dQ/dt$ is the rate of permeation, $c_0$ is the initial concentration in the donor compartment, and $A$ is the surface area of the cell monolayer. 

First, we define the rate of permeation as the amount of compound passing through the monolayer per unit time. It is expressed as 
$$\frac{dQ}{dt} = \frac{c_{\textrm{receiver}}V_{\textrm{receiver}}}{\Delta t}$$ where $c_{\textrm{receiver}}$ is the concentration of the test compound in the receiver compartment, $V_{\textrm{receiver}}$ is the volume of the receiver compartment, and $\Delta t$ is the total elapsed time. This rate is a type of flux and therefore has units of $\mu \textrm{mol}/s$. The dimensional analysis is described below. 
$$\left[\frac{dQ}{dt}\right] = \frac{[c_{\textrm{receiver}}][V_{\textrm{receiver}}]}{[\Delta t]} = \frac{[\mu \textrm{mol/L}][\textrm{L]}}{[s]} = \frac{[\mu \textrm{mol}]}{[s]} $$

Next, to estimate $c_\textrm{receiver}$, we first multiply each R2 response with the corresponding R2 dilution factor and estimate the mean. To account for background noise, we follow a similar procedure with the blank response and dilution factor. We then subtract the two such that $$c_\textrm{receiver} = \textrm{mean}(\textrm{dilution factor}_{R2} * \textrm{Response}_{R2}) - \textrm{mean}(\textrm{dilution factor}_{Blank} * \textrm{Response}_{Blank})$$ 

The initial concentration in the donor compartment, $c_0$, is calculated similarly but with the D0 response instead of the R2 response. 

The last remaining arguments are estimated with columns in our data: $V_\textrm{receiver}$ is defined by our `Vol.Receiver` column, $\Delta t$ is defined by our `Time` column, and $A$ is defined by our `Membrane.Area` column. 

With this information, we calculate P~app~ in the donor to receiver direction. The estimate's direction is annotated according to Figure 1. For example, results annotated with the "A2B" suffix denote the apical side as the donor and the basolateral side as the receiver. Contrastingly, results annotated with the "B2A" suffix denote the basolateral side as the donor and the apical side as the receiver.

```{r}
caco2_L3_curated <- calc_caco2_point(FILENAME = "Caco2_vignette",
                                     data.in = caco2_L2_curated, 
                                     # don't export output TSV file 
                                     output.res = FALSE
                                     )
```

```{r, echo = FALSE}
caco2_L3_curated %>% 
  flextable() %>% 
  bg(bg = "#DDDDDD", part = "header") %>% 
  autofit() %>% 
  set_table_properties(
    opts_html = list(
      scroll = list()
      )
  ) %>% 
  set_caption("Table 6: Level 3 data", align_with_table = FALSE) %>% 
  fontsize(size = 10, part = "body") %>% 
  theme_vanilla()
```

In addition to returning an estimate for `Papp` and its components (`C0`, `dQdt`) in both directions, our Level 3 data contains estimates for the efflux ratio (`Refflux`), the fraction recovered (`Frec`) in both directions, and the qualitative category of each fraction recovered value (`Recovery_Class`).  

The efflux ratio is the ratio between the apparent membrane permeabilities and is expressed as $$R_{\textrm{efflux}} = \frac{P_{app}^{B2A}}{P_{app}^{A2B}}$$

The fraction recovered is the fraction of the initial donor amount recovered in the receiver compartment. It assesses the accuracy of the analytical chemistry method used to determine concentration values. It is expressed as $$ \frac{V_{D2}*\textrm{DF}_{D2}*(Responses_{D2} - \overline{Response}_{Blank}) + V_{R2}*\textrm{DF}_{R2}*(Responses_{R2} - \overline{Response}_{Blank})}{V_{D0}*\textrm{DF}_{D0}*(Responses_{D0} - \overline{Response}_{Blank})} $$ where $V$ is the corresponding volume, $\textrm{DF}$ is the corresponding dilution factor, $Responses$ are the corresponding responses, and $\overline{Response}_{Blank}$ is the mean blank response. 

Fraction recovered values are then classified and provided as the `Recovery_Class` value. Values $<0.4$ are classified as "Low Recovery" and values $>2$ are classified as "High Recovery". Values within the range, $0.4 \leq \textrm{Frec} \leq 2$, are considered normal and are not given an explicit classification. 

If there are multiple samples per sample type per chemical, (i.e. biological replicates), a vector of fraction recovered values will be returned with a "|" separating each value. Similarly, a vector of recovery classifications will be returned for each value in the vector of fraction recovered values with a "|" separating each value. 

The following columns are always returned involving the fraction recovered in the A to B direction: 

1. `Frec_A2B.vec` - vector of fraction recovered values 
2. `Recovery_Class_A2B.vec` - vector of recovery classifications 
3. `Frec_A2B.mean` - mean of fraction recovered values 
4. `Recovery_Class_A2B.mean` - recovery classification of mean fraction recovered value

Note that even if there are no biological replicates, the above four columns are still returned; all of the `Frec` estimates will just be equivalent. Estimates are also returned in the B to A direction. 

Looking at our results, some of our `Frec_A2B.vec` and `Frec_B2A.vec` values are less than $0.4$ and are therefore classified as "Low Recovery". Others are within the $[0.4, 2]$ range and are therefore considered normal.  

## Best Practices: Food for thought 

Generally, data processing pipelines should include minimal to no manual coding. It is best to keep clean code that is easily reproducible and transferable. The user should aim to have all the required data and meta-data files properly formatted to avoid further modifications throughout the pipeline. 

## References 

\references{
\insertRef{honda2025impact}{invitroTKstats}
\insertRef{hubatsch2007determination}{invitroTKstats}
}