---
title: "CrcBiomeScreen: Colorectal Cancer Microbiome Screening and Analysis"
author: "Chengxin Li"
date: "`r Sys.Date()`"
output:
  BiocStyle::html_document:
    toc: true
    toc_float: true
vignette: >
  %\VignetteIndexEntry{CrcBiomeScreen}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  message = FALSE,
  warning = FALSE,
  fig.align = "center",
  crop = NULL
)
```

# Introduction

Microbiome research has become an increasingly important field for understanding the complex interactions between host health, disease, and environmental factors. Numerous studies have demonstrated that microbial community composition and function can provide predictive insights into a wide range of conditions, including cancer and metabolic and inflammatory diseases. Translating these findings into robust and reproducible computational workflows remains a key challenge for the microbiome research community.

`CrcBiomeScreen` is a Bioconductor package that provides a reproducible and modular workflow for microbiome-based disease screening and classification. Although originally designed for colorectal cancer (CRC) applications, the framework is disease-agnostic and can be applied to any case–control microbiome dataset. The package integrates data preprocessing, normalization, quality control, and machine learning-based classification into a streamlined and extensible pipeline.

The Bioconductor ecosystem provides a standardized environment for reproducible omics analysis and emphasizes interoperability between data structures and statistical workflows. `CrcBiomeScreen` extends this principle to microbiome classification by supporting Bioconductor core data containers such as `TreeSummarizedExperiment` and by providing accessor methods that facilitate integration with other packages (e.g., `phyloseq` and `curatedMetagenomicData`).
This design ensures compatibility with existing Bioconductor workflows and promotes transparency, reproducibility, and reusability of microbiome-based models.

Existing Bioconductor packages such as `microbiome`, `microbiomeMarker`, and `curatedMetagenomicData` provide powerful tools for microbiome data access, visualization, and differential abundance testing. However, there remains a gap in standardized machine learning workflows tailored for disease screening and cross-cohort validation. `CrcBiomeScreen` addresses this need by providing:

- **Integrated preprocessing**: automated taxonomic splitting, filtering, and normalization (TSS and GMPR);
- **Modeling interface**: built-in machine learning modules for Random Forest and XGBoost classification with class weighting for imbalanced data;
- **Evaluation framework**: functions for cross-validation, ROC/F1-based performance metrics, and external validation;
- **Quality control utilities**: MDS-based sample outlier detection and contamination-aware filtering.

By bridging microbiome features with reproducible machine learning pipelines, `CrcBiomeScreen` complements existing Bioconductor tools and provides a flexible platform for developing, benchmarking, and validating microbiome-based predictive models.

> Note: This vignette is organized in two parts:
> (1) a fully executable toy workflow for reproducibility during package checking, and
> (2) a full-scale workflow using real cohort data for reference.
> The toy workflow mirrors the complete analysis pipeline, while the full workflow illustrates how to apply the same steps to a real data set.
## Installation

Install the release version from the Bioconductor repository:

```{r, eval = FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("CrcBiomeScreen")
```

Alternatively, install the development version from GitHub:

```{r, eval = FALSE}
install.packages("remotes")
remotes::install_github("iChronostasis/CrcBiomeScreen", force = TRUE)
```

Load `CrcBiomeScreen` into the R session:

```{r, eval = TRUE}
library(CrcBiomeScreen)
```

# Getting started

Full model training is disabled during vignette build because it exceeds the Bioconductor Single Package Builder (SPB) time limits. All other steps are fully executable to illustrate real package behavior.

The core workflow of `CrcBiomeScreen` involves several key steps: data loading and preprocessing, normalization, model training, evaluation, and validation. All functions are designed to work with the `CrcBiomeScreenObject`, which is the main data structure used throughout the package.

## Exploring the data structure

### Data preparation

First, we load the example data and create a `CrcBiomeScreenObject`. This object is the primary input for all downstream functions in the package.

We offer two methods for preparing your data; use whichever matches the format of your input. The first example below, based on the `Thomas_2018_RelativeAbundance` dataset, is the one used for the subsequent steps in this vignette and represents a typical use case with a more complex dataset. The second example, based on `NHSBCSP_screeningData`, shows an alternative input format for simpler datasets, demonstrating that the function can handle various data structures.
```{r, eval = TRUE}
# List available datasets in the package
# data(package = "CrcBiomeScreen")

# Load the toy dataset derived from curatedMetagenomicData
data(Thomas_2018_RelativeAbundance, package = "CrcBiomeScreen")

# Create the CrcBiomeScreenObject
## Direct creation from a TreeSummarizedExperiment
# library(curatedMetagenomicData)
# tse <- curatedMetagenomicData("ThomasAM_2018a.relative_abundance",
#                               dryrun = FALSE, rownames = "short")[[1]]
tse <- Thomas_2018_RelativeAbundance$`2021-03-31.ThomasAM_2018a.relative_abundance`
CrcBiomeScreenObject <- CreateCrcBiomeScreenObjectFromTSE(tse)

# Load the simple toy dataset
data(NHSBCSP_screeningData, package = "CrcBiomeScreen")

# Create the CrcBiomeScreenObject for the simple toy data
CrcBiomeScreenObject_NHSBCSP_screeningData <- CreateCrcBiomeScreenObject(
  AbsoluteAbundance = NHSBCSP_screeningData[, -c(1:2, 646:647)],
  TaxaData = colnames(NHSBCSP_screeningData[, -c(1:2, 646:647)]),
  SampleData = NHSBCSP_screeningData[, c(1:2, 646:647)]
)

## Data exploration
head(getTaxaData(CrcBiomeScreenObject))
dim(getRelativeAbundance(CrcBiomeScreenObject))
summary(getSampleData(CrcBiomeScreenObject)$age)
```

### Taxonomic data processing

The package offers robust functions to process raw taxonomic strings, splitting them into multiple hierarchical levels and allowing you to filter to a specific taxonomic rank. This ensures your data is clean and ready for downstream analysis.

#### Supported formats

The function currently supports multiple formats of taxonomic strings, including **MetaPhlAn**, **QIIME**, and **SILVA**. Users do not need to specify the delimiter (`;`, `.`, `|`, `_`) manually, since the function automatically detects the correct separator.

#### Original taxa column

For reproducibility and traceability, the output contains an additional column, `OriginalTaxa`, that preserves the original input string.
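For intuition, the separator auto-detection described above can be approximated in a few lines of base R: count each candidate delimiter in the string and split on the most frequent one. This is only an illustrative sketch (`detect_delimiter` and `split_taxa_string` are hypothetical helpers, not package functions); `SplitTaxas()` implements the actual parsing.

```r
# Count literal occurrences of a delimiter in a string (0 if absent)
count_hits <- function(d, x) {
  m <- gregexpr(d, x, fixed = TRUE)[[1]]
  if (m[1] == -1) 0L else length(m)
}

# Hypothetical helper: pick the candidate separator that occurs most often
detect_delimiter <- function(x, candidates = c("|", ";", ".")) {
  counts <- vapply(candidates, count_hits, integer(1), x = x)
  candidates[which.max(counts)]
}

# Split a taxonomic string on the detected delimiter
split_taxa_string <- function(x) {
  delim <- detect_delimiter(x)
  trimws(strsplit(x, delim, fixed = TRUE)[[1]])
}

split_taxa_string("k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales")
#> [1] "k__Bacteria"      "p__Firmicutes"    "c__Clostridia"    "o__Clostridiales"
```

A real implementation also has to treat `_` with care, since it appears inside rank prefixes such as `D_0__`; the sketch above deliberately omits it from the candidate list.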
#### Special handling of *uncultured* and *unclassified*

If a taxonomic entry at a given rank is `"uncultured"` or `"unclassified"`, the function appends this label to the higher-level taxon. For example:

| Input | Processed output (Genus) |
|-------|--------------------------|
| `D_0__Bacteria.D_1__Firmicutes.D_2__Clostridia.D_3__Ruminococcaceae.D_4__uncultured` | `Ruminococcaceae_uncultured` |
| `k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Enterobacteriales; f__Enterobacteriaceae; g__unclassified` | `Enterobacteriaceae_unclassified` |

#### Choosing a taxonomic level

You are not limited to the genus level. Use the `KeepTaxonomicLevel()` function to select the rank you need, whether **family**, **genus**, or **species**; the function automatically groups taxa at the specified level. If two consecutive ranks are unclassified, only a single `unclassified` label is retained.

#### Handling data without taxonomic strings

For data such as **ASVs** or **OTUs** whose names carry no taxonomic information, use the `LoadTaxaTable()` function to import a separate table that maps each sequence ID to its full taxonomic lineage. This provides maximum flexibility for any type of input data; see the function documentation for a usage example.
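The relabeling rule in the table above can be mimicked in a few lines of base R. This is an illustrative sketch only (`resolve_rank` is a hypothetical helper; the package applies this logic internally during taxa splitting):

```r
# Hypothetical helper: if the label at the requested rank is
# "uncultured" or "unclassified", append it to the parent rank's name.
resolve_rank <- function(ranks, level) {
  label <- ranks[level]
  if (label %in% c("uncultured", "unclassified") && level > 1) {
    label <- paste(ranks[level - 1], label, sep = "_")
  }
  label
}

ranks <- c("Bacteria", "Firmicutes", "Clostridia",
           "Ruminococcaceae", "uncultured")
resolve_rank(ranks, level = 5)
#> [1] "Ruminococcaceae_uncultured"
```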
#### Example

```{r, eval = TRUE}
# Split the taxa into multiple levels (if not done already)
# CrcBiomeScreenObject <- SplitTaxas(CrcBiomeScreenObject)
head(getTaxaData(CrcBiomeScreenObject))

taxa <- getTaxaData(CrcBiomeScreenObject)
colnames(taxa) <- c("Kingdom", "Phylum", "Class", "Order",
                    "Family", "Genus", "Species")
setTaxaData(CrcBiomeScreenObject) <- taxa
colnames(getTaxaData(CrcBiomeScreenObject))

# Keep only the genus-level data
CrcBiomeScreenObject <- KeepTaxonomicLevel(CrcBiomeScreenObject, level = "Genus")

# Inspect the processed data
head(getTaxaData(CrcBiomeScreenObject))
dim(getTaxaData(CrcBiomeScreenObject))
```

### Data normalization

`CrcBiomeScreen` supports multiple normalization methods. Here we demonstrate both GMPR and TSS normalization:

```{r, eval = TRUE}
# Normalize using GMPR (Geometric Mean of Pairwise Ratios)
CrcBiomeScreenObject <- NormalizeData(CrcBiomeScreenObject, method = "GMPR", level = "Genus")

# Normalize using TSS (Total Sum Scaling)
CrcBiomeScreenObject <- NormalizeData(CrcBiomeScreenObject, method = "TSS", level = "Genus")
```

**Note**: GMPR normalization is generally recommended for microbiome data, as it is more robust to compositional effects and outliers than TSS normalization.

# Runnable toy workflow setup

For more explanation and details, see the full-scale modeling workflow with real cohort data later in this vignette.
## Setup the toy dataset

```{r, eval = TRUE}
set.seed(123)

# -------------------------
# Toy taxonomy
# -------------------------
toy_taxa_strings <- c(
  "D_0__Bacteria|D_1__Firmicutes|D_2__Clostridia|D_3__OrderA|D_4__FamilyA|D_5__GenusA",
  "D_0__Bacteria|D_1__Firmicutes|D_2__Clostridia|D_3__OrderB|D_4__FamilyB|D_5__GenusB",
  "D_0__Bacteria|D_1__Firmicutes|D_2__Clostridia|D_3__OrderC|D_4__FamilyC|D_5__GenusC",
  "D_0__Bacteria|D_1__Bacteroidetes|D_2__Bacteroidia|D_3__OrderD|D_4__FamilyD|D_5__GenusD",
  "D_0__Bacteria|D_1__Bacteroidetes|D_2__Bacteroidia|D_3__OrderE|D_4__FamilyE|D_5__GenusE",
  "D_0__Bacteria|D_1__Proteobacteria|D_2__Gammaproteobacteria|D_3__OrderF|D_4__FamilyF|D_5__GenusF"
)

toy_taxa <- data.frame(
  Taxa = toy_taxa_strings,
  stringsAsFactors = FALSE
)

# -------------------------
# Toy training abundance
# 6 taxa x 12 samples
# -------------------------
train_samples <- paste0("S", 1:12)

toy_abs <- matrix(
  c(
    rpois(6 * 6, lambda = 54.8887777),  # controls
    rpois(6 * 6, lambda = 55)           # CRC
  ),
  nrow = 6, ncol = 12
)
rownames(toy_abs) <- toy_taxa_strings
colnames(toy_abs) <- train_samples
toy_abs <- as.data.frame(toy_abs)

toy_sample <- data.frame(
  number_reads = rep(10000, 12),
  study_condition = c(rep("control", 6), rep("CRC", 6)),
  row.names = train_samples,
  stringsAsFactors = FALSE
)

# -------------------------
# 1. Create toy training data
# -------------------------
toy_obj <- CreateCrcBiomeScreenObject(
  AbsoluteAbundance = toy_abs,
  TaxaData = toy_taxa,
  SampleData = toy_sample
)

# -------------------------
# 2. Keep one taxonomic level and normalize
# -------------------------
toy_obj <- SplitTaxas(toy_obj)
toy_obj <- KeepTaxonomicLevel(toy_obj, level = "Genus")
toy_obj <- NormalizeData(toy_obj, method = "TSS", level = "Genus")

# -------------------------
# 3. Split dataset
# -------------------------
toy_obj <- SplitDataSet(
  toy_obj,
  label = c("control", "CRC"),
  partition = 0.7
)

# Perform quality control using cmdscale
toy_obj <- qcByCmdscale(
  toy_obj,
  TaskName = "toy_obj",
  normalize_method = "TSS",
  plot = FALSE
)

# Check class balance in the training data
checkClassBalance(getModelData(toy_obj)$TrainLabel)
```

## Model training and evaluation

```{r, eval = TRUE}
# -------------------------------
# 4. Train and evaluate the RF model
# -------------------------------
toy_obj <- TrainModels(
  toy_obj,
  model_type = "RF",
  TaskName = "toy_rf",
  ClassWeights = FALSE,
  TrueLabel = "CRC",
  num_cores = 1,
  n_cv = 2
)

toy_obj <- EvaluateModel(
  toy_obj,
  model_type = "RF",
  TaskName = "ToyData_RF_Test",
  TrueLabel = "CRC",
  PlotAUC = FALSE
)
```

## External validation

```{r, eval = TRUE}
# -------------------------
# 5. Create toy validation data
# -------------------------
val_taxa_strings <- c(
  "D_0__Bacteria|D_1__Firmicutes|D_2__Clostridia|D_3__OrderA|D_4__FamilyA|D_5__GenusA",
  "D_0__Bacteria|D_1__Firmicutes|D_2__Clostridia|D_3__OrderB|D_4__FamilyB|D_5__GenusB",
  "D_0__Bacteria|D_1__Firmicutes|D_2__Clostridia|D_3__OrderC|D_4__FamilyC|D_5__GenusC",
  "D_0__Bacteria|D_1__Bacteroidetes|D_2__Bacteroidia|D_3__OrderD|D_4__FamilyD|D_5__GenusD",
  "D_0__Bacteria|D_1__Bacteroidetes|D_2__Bacteroidia|D_3__OrderE|D_4__FamilyE|D_5__GenusE",
  "D_0__Bacteria|D_1__Proteobacteria|D_2__Gammaproteobacteria|D_3__OrderF|D_4__FamilyF|D_5__GenusF"
)

val_taxa <- data.frame(
  Taxa = val_taxa_strings,
  stringsAsFactors = FALSE
)

val_samples <- paste0("V", 1:8)

val_abund <- matrix(
  c(
    rpois(6 * 4, lambda = 38),
    rpois(6 * 4, lambda = 48)
  ),
  nrow = 6, ncol = 8
)
rownames(val_abund) <- val_taxa_strings
colnames(val_abund) <- val_samples
val_abund <- as.data.frame(val_abund)

val_sample <- data.frame(
  number_reads = rep(10000, 8),
  study_condition = c(rep("control", 4), rep("CRC", 4)),
  condition = c(rep("control", 4), rep("CRC", 4)),
  row.names = val_samples,
  stringsAsFactors = FALSE
)

val_obj <- CreateCrcBiomeScreenObject(
  AbsoluteAbundance = val_abund,
  TaxaData = val_taxa,
  SampleData = val_sample
)

val_obj <- SplitTaxas(val_obj)
val_obj <- KeepTaxonomicLevel(val_obj, level = "Genus")
val_obj <- NormalizeData(val_obj, method = "TSS", level = "Genus")

# -------------------------
# 6. Align features
# -------------------------
train_norm <- getNormalizedData(toy_obj)
val_norm <- getNormalizedData(val_obj)

common_features <- intersect(colnames(train_norm), colnames(val_norm))
setNormalizedData(toy_obj) <- train_norm[, common_features, drop = FALSE]
setNormalizedData(val_obj) <- val_norm[, common_features, drop = FALSE]

# -------------------------
# 7. Validate model
# -------------------------
validated_obj <- ValidateModelOnData(
  toy_obj,
  ValidationData = val_obj,
  model_type = "RF",
  TaskName = "toy_validation",
  TrueLabel = "CRC",
  PlotAUC = FALSE
)
```

## Full screening workflow in one step

```{r, eval = TRUE}
# ----------------------------------
# 8. Minimal end-to-end RunScreening
# ----------------------------------
toy_obj2 <- CreateCrcBiomeScreenObject(
  AbsoluteAbundance = toy_abs,
  TaxaData = toy_taxa,
  SampleData = toy_sample
)

toy_obj2 <- SplitTaxas(toy_obj2)
toy_obj2 <- KeepTaxonomicLevel(toy_obj2, level = "Genus")
toy_obj2 <- NormalizeData(toy_obj2, method = "TSS", level = "Genus")

toy_obj2 <- RunScreening(
  obj = toy_obj2,
  model_type = "RF",
  partition = 0.7,
  split.requirement = list(
    label = c("control", "CRC"),
    condition_col = "study_condition"
  ),
  ClassWeights = FALSE,
  n_cv = 2,
  num_cores = 1,
  TaskName = "RF_TSS_toydata",
  ValidationData = val_obj,
  TrueLabel = "CRC"
)
```

## Get the raw prediction scores for each sample

```{r, eval = TRUE}
# -------------------------
# 9. Prediction on new data
# -------------------------
toy_obj <- PredictCrcBiomeScreen(
  toy_obj,
  newdata = getNormalizedData(val_obj),
  model_type = "RF"
)

# Inspect the predictions
head(getPredictResult(toy_obj)$RF)
#>      control       CRC
#> V1 0.5122222 0.4877778
#> V2 0.6066667 0.3933333
#> V3 0.4205556 0.5794444
#> V4 0.5255556 0.4744444
#> V5 0.4772222 0.5227778
#> V6 0.4677778 0.5322222

predictions <- getPredictResult(toy_obj)$RF

# -------------------------
# 10. Evaluate predictions
# -------------------------
evaluation_results <- EvaluateCrcBiomeScreen(
  predictions = predictions,
  true_labels = getSampleData(val_obj)$study_condition,
  TrueLabel = "CRC",
  TaskName = "Toy_Prediction_Eval"
)
```

# Full-scale modeling workflow using real cohort data for reference

### Data filtering and splitting

Before model training, we filter the data to include only CRC and control samples, then split it into training and test sets:

```{r, eval = TRUE}
# Check the available labels in the data
table(getSampleData(CrcBiomeScreenObject)$study_condition)

# Filter data to keep only CRC and control samples
CrcBiomeScreenObject <- FilterDataSet(
  CrcBiomeScreenObject,
  label = c("CRC", "control"),
  condition_col = "study_condition"
)

# Verify the filtering results
table(getSampleData(CrcBiomeScreenObject)$study_condition)

# Split data into training (70%) and test (30%) sets
CrcBiomeScreenObject <- SplitDataSet(
  CrcBiomeScreenObject,
  label = c("control", "CRC"),
  partition = 0.7
)
```

### Quality control

Quality control using classical multidimensional scaling can help identify potential batch effects or outliers:

```{r, eval = TRUE}
# Perform quality control using cmdscale
CrcBiomeScreenObject <- qcByCmdscale(
  CrcBiomeScreenObject,
  TaskName = "Normalize_ToyData_filtered_qc",
  normalize_method = "GMPR",
  plot = FALSE
)

# Check class balance in the training data
checkClassBalance(getModelData(CrcBiomeScreenObject)$TrainLabel)
```

### Model training and evaluation

Training machine-learning models for microbiome classification often requires repeated cross-validation, hyperparameter search, and ensemble methods. These steps are computationally demanding and cannot be executed within the 15-minute limit of the Bioconductor build system. For this reason, the full training code is provided for users but is not executed when building the vignette. Users may run the pipeline locally to reproduce the complete analysis.

`CrcBiomeScreen` supports two machine learning algorithms: **Random Forest** and **XGBoost**. Both models can be trained with class balancing to handle imbalanced datasets.

## External validation

To ensure the robustness and generalizability of the trained models, external validation on independent datasets is crucial.

## Prediction and evaluation

Once you have trained a model, you will want to apply it to new data and assess its performance. The package offers a streamlined, two-step process: first, generate predictions on your new dataset; then, if true labels are available, evaluate the model's accuracy. This modular approach allows for flexibility, whether you are predicting outcomes for unknown samples or measuring model quality on a validation set.

### Step 1: Predict on new data

Use the `PredictCrcBiomeScreen()` function to generate prediction probabilities for your new dataset. This function takes your trained `CrcBiomeScreenObject` and the new data, returning the raw prediction scores for each sample.

```{r, eval = FALSE}
# Assuming 'CrcBiomeScreenObject' contains a trained model and
# 'ValidationData_filtered_qc' is your new dataset (e.g., another
# CrcBiomeScreenObject or a data frame).

# Generate predictions for the new dataset
ValidationData_filtered_qc <- PredictCrcBiomeScreen(
  CrcBiomeScreenObject,
  newdata = getNormalizedData(ValidationData_filtered_qc),  # use the appropriate data slot
  model_type = "RF"
)

# Inspect the predictions
head(getPredictResult(ValidationData_filtered_qc)$RF)
predictions <- getPredictResult(ValidationData_filtered_qc)$RF
```

### Step 2: Evaluate model performance

If your new dataset includes known true labels, you can evaluate the model's performance using the `EvaluateCrcBiomeScreen()` function. It takes the predictions generated in the previous step and the true labels to calculate key metrics such as the AUC and to plot the ROC curve.

```{r, eval = FALSE}
# Assuming the new dataset has a 'study_condition' column in its SampleData
# and the true positive label is "CRC".

# Evaluate the predictions
evaluation_results <- EvaluateCrcBiomeScreen(
  predictions = predictions,
  true_labels = getSampleData(ValidationData_filtered_qc)$study_condition,
  TrueLabel = "CRC",                   # the positive class label
  TaskName = "MyNewDatasetEvaluation"  # a name for the output plot/files
)

# View the evaluation results
print(evaluation_results$AUC)
```

## Streamlined screening workflow

For users who want to run the entire screening workflow in one step, `CrcBiomeScreen` provides the `RunScreening()` function. This function encapsulates the complete process from data preparation to model validation, offering a comprehensive and easy-to-use solution.
```{r full_pipeline, eval=FALSE}
# Load the toy dataset from curatedMetagenomicData
data(Thomas_2018_RelativeAbundance, package = "CrcBiomeScreen")

# Create the CrcBiomeScreenObject
tse <- Thomas_2018_RelativeAbundance$`2021-03-31.ThomasAM_2018a.relative_abundance`
CrcBiomeScreenObject <- CreateCrcBiomeScreenObjectFromTSE(tse)

taxa <- getTaxaData(CrcBiomeScreenObject)
colnames(taxa) <- c("Kingdom", "Phylum", "Class", "Order",
                    "Family", "Genus", "Species")
setTaxaData(CrcBiomeScreenObject) <- taxa

# Keep only the genus-level data
CrcBiomeScreenObject <- KeepTaxonomicLevel(CrcBiomeScreenObject, level = "Genus")

# Inspect the processed data
head(getTaxaData(CrcBiomeScreenObject))

# Normalize using GMPR (Geometric Mean of Pairwise Ratios)
CrcBiomeScreenObject <- NormalizeData(CrcBiomeScreenObject, method = "GMPR", level = "Genus")

# Filter data to keep only CRC and control samples
CrcBiomeScreenObject <- FilterDataSet(
  CrcBiomeScreenObject,
  label = c("CRC", "control"),
  condition_col = "study_condition"
)

# Perform quality control using cmdscale
CrcBiomeScreenObject <- qcByCmdscale(
  CrcBiomeScreenObject,
  TaskName = "Normalize_ToyData_filtered_qc",
  normalize_method = "GMPR",
  plot = FALSE
)

# Verify the filtering results
table(getSampleData(CrcBiomeScreenObject)$study_condition)

# Split data into training (70%) and test (30%) sets
CrcBiomeScreenObject <- SplitDataSet(
  CrcBiomeScreenObject,
  label = c("control", "CRC"),
  partition = 0.7
)

# Check class balance in the training data
checkClassBalance(getModelData(CrcBiomeScreenObject)$TrainLabel)

# Load validation data
data("ZellerG_2014_RelativeAbundance")
ValidationData_tse <- ZellerG_2014_RelativeAbundance$`2021-03-31.ZellerG_2014.relative_abundance`

# Create a CrcBiomeScreenObject for the validation data
ValidationData <- CreateCrcBiomeScreenObjectFromTSE(ValidationData_tse)

# Process validation data similarly to the training data
# ValidationData <- SplitTaxas(ValidationData)
taxa <- getTaxaData(ValidationData)
colnames(taxa) <- c("Kingdom", "Phylum", "Class", "Order",
                    "Family", "Genus", "Species")
setTaxaData(ValidationData) <- taxa

ValidationData <- KeepTaxonomicLevel(ValidationData, level = "Genus")
ValidationData <- NormalizeData(ValidationData, method = "GMPR", level = "Genus")

# Align features between training and validation data
val_norm <- getNormalizedData(ValidationData)
crc_norm <- getNormalizedData(CrcBiomeScreenObject)

# Ensure consistent features between training and validation data
val_norm <- val_norm[, colnames(val_norm) %in% colnames(crc_norm)]
crc_norm <- crc_norm[, colnames(crc_norm) %in% colnames(val_norm)]

# Set back the normalized data
setNormalizedData(ValidationData) <- val_norm
setNormalizedData(CrcBiomeScreenObject) <- crc_norm

# Filter validation data
ValidationData_filtered <- FilterDataSet(
  ValidationData,
  label = c("CRC", "control"),
  condition_col = "study_condition"
)

# Quality control for validation data
ValidationData_filtered_qc <- qcByCmdscale(
  ValidationData_filtered,
  TaskName = "Normalize_ValidationData_filtered_qc",
  normalize_method = "GMPR",
  plot = FALSE
)

# Run the complete screening workflow
ValidationData_filtered_qc <- RunScreening(
  # Input data
  obj = CrcBiomeScreenObject,
  # Model and data splitting
  model_type = "RF",
  partition = 0.7,
  split.requirement = list(
    label = c("control", "CRC"),
    condition_col = "study_condition"
  ),
  ClassWeights = FALSE,
  # Cross-validation and parallelization
  n_cv = 10,
  num_cores = 10,
  # Task and output naming
  TaskName = "RF_GMPR_toydata",
  # External validation
  ValidationData = ValidationData_filtered_qc,
  TrueLabel = "CRC"
)

# Example with the XGBoost model and class weighting
ValidationData_filtered_qc <- RunScreening(
  CrcBiomeScreenObject,
  model_type = "XGBoost",
  partition = 0.7,
  split.requirement = list(
    label = c("control", "CRC"),
    condition_col = "study_condition"
  ),
  ClassWeights = TRUE,
  n_cv = 10,
  num_cores = 10,
  TaskName = "XGB_GMPR_toydata",
  ValidationData = ValidationData_filtered_qc,
  TrueLabel = "CRC"
)
```

This function performs the complete workflow, including:

- **Data normalization**: standardizes data to a common scale.
- **Data filtering and splitting**: prepares the data for modeling by selecting and partitioning it into training and testing sets.
- **Model training with cross-validation**: trains the specified model (`RF` or `XGBoost`) using k-fold cross-validation to ensure robustness.
- **Model evaluation on the test set**: assesses the model's performance on a held-out test set for an unbiased estimate of its predictive power.
- **External validation** (if `ValidationData` is provided): applies the trained model to an entirely independent dataset to confirm its generalizability and real-world applicability. This step also stores the individual sample predictions and evaluation metrics in the `CrcBiomeScreenObject`.

# Best practices and recommendations

## Normalization method selection

- **GMPR**: recommended for most microbiome datasets, especially when dealing with compositional data and potential outliers.
- **TSS**: a simpler approach, suitable for well-balanced datasets without extreme outliers.

## Model selection

- **Random Forest**: generally robust and interpretable; a good baseline choice.
- **XGBoost**: often provides better performance but may require more careful hyperparameter tuning.

## Validation strategies

- Always use external validation datasets when available.
- Consider cross-validation for model selection and hyperparameter tuning.
- Ensure consistent preprocessing between training and validation data.

## Class imbalance

- Enable `ClassWeights = TRUE` when dealing with imbalanced datasets.
- Monitor both sensitivity and specificity, not just overall accuracy.
- Consider stratified sampling when splitting datasets.

# Troubleshooting

## Common issues and solutions

### Feature mismatch between datasets

Ensure that both training and validation datasets have been processed identically and contain overlapping features.
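The fix is the same `intersect()`-based alignment used in the toy workflow earlier in this vignette. A minimal base-R sketch with made-up feature tables (`train`, `valid`, and the genus names are illustrative):

```r
# Two abundance tables with partially overlapping features (columns)
train <- data.frame(GenusA = 1:3, GenusB = 4:6, GenusC = 7:9)
valid <- data.frame(GenusB = 1:2, GenusC = 3:4, GenusD = 5:6)

# Keep only the features present in both data sets, in the same order
common <- intersect(colnames(train), colnames(valid))
train_aligned <- train[, common, drop = FALSE]
valid_aligned <- valid[, common, drop = FALSE]

colnames(train_aligned)
#> [1] "GenusB" "GenusC"
```

Subsetting both tables with the same `common` vector guarantees that the columns match in both name and order, which is what downstream prediction functions expect.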
### Memory issues with large datasets

- Reduce the number of features by applying more stringent filtering.
- Use fewer CPU cores (`num_cores` parameter) to reduce memory usage.

# Session information

This vignette was created under the following conditions:

```{r sessioninfo}
sessionInfo()
```

# References

If you use `CrcBiomeScreen` in your research, please cite:

[Package citation information]

For questions, bug reports, or feature requests, please visit the GitHub repository at `iChronostasis/CrcBiomeScreen`.

**Note**: This vignette demonstrates the basic usage of `CrcBiomeScreen`. For more advanced usage and customization options, please refer to the function documentation and additional tutorials.