---
title: "Getting started"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting started}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>"
)
```

```{r setup, message = FALSE}
library("tidyverse")
library("rstatix")
library("magrittr")
library("GimmeMyStats")
```

## Clinical Dataset

```{r}
set.seed(123)
n <- 150 # Number of patients
clinical_data <- tibble(
    Country = sample(c("France", "Germany", "UK", "Italy", "Spain"), n, replace = TRUE),
    Age = rnorm(n, mean = 60, sd = 10),
    Sex = sample(c("Male", "Female"), n, replace = TRUE),
    Cancer_Type = sample(c("Lung", "Breast", "Colorectal", "Healthy"), n, replace = TRUE),
    Cancer_Stage = sample(1:4, n, replace = TRUE),
    Weight = rnorm(n, mean = 75, sd = 15),
    Height = rnorm(n, mean = 170, sd = 10),
    Fatigue_Score = sample(0:10, n, replace = TRUE),
    Physician_Score = sample(0:10, n, replace = TRUE),
    CRP = rnorm(n, mean = 5, sd = 2),
    IL6 = rnorm(n, mean = 10, sd = 5),
    Leukocytes = rnorm(n, mean = 6.5, sd = 2),
    Neutrophils = rnorm(n, mean = 55, sd = 10),
    Lymphocytes = rnorm(n, mean = 35, sd = 8),
    KRAS_Mutation = sample(c("Mutated", "Wild-type"), n, replace = TRUE),
    Treatment_Response = sample(c("Complete", "Partial", "None"), n, replace = TRUE)
)
head(clinical_data)
```

## Descriptive Statistics

We summarize categorical variables with more than two levels using `print_multinomial`.

```{r}
print_multinomial(select(clinical_data, "Cancer_Type"))
```
*Here, we check whether cancer types are evenly represented in the dataset. If a category is underrepresented, statistical comparisons involving it may lack power.*
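
As a rough cross-check of the balance described above, category shares can also be computed with base R. This is a minimal sketch on a hypothetical vector, `cancer_type_example`, rather than on the output of `print_multinomial`, and the 10% cut-off is an arbitrary choice made only for illustration.

```{r}
# Hypothetical example vector (not taken from clinical_data)
cancer_type_example <- c(
    rep("Lung", 8), rep("Breast", 7), rep("Colorectal", 4), rep("Healthy", 1)
)
# Absolute counts and relative shares per category
category_counts <- table(cancer_type_example)
category_shares <- prop.table(category_counts)
round(category_shares, 2)
# Categories below an arbitrary 10% share (illustrative threshold)
rare_categories <- names(category_shares)[as.vector(category_shares) < 0.10]
rare_categories
```
*Categories flagged this way are candidates for merging or for cautious interpretation in later tests.*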

Binary variables can be summarized using `summary_binomial`.

```{r}
summary_binomial(select(clinical_data, c("KRAS_Mutation", "Sex")))
```
*Checking for imbalances in binary variables is crucial. If `KRAS_Mutation` is highly imbalanced, conclusions regarding its association with outcomes should be interpreted cautiously.*

For continuous variables, `print_numeric` prints a summary of one or more columns and `summary_numeric` summarizes a single vector.

```{r}
print_numeric(select(clinical_data, c("Age", "Weight", "CRP")))
summary_numeric(clinical_data$Age)
```
*We check whether the distributions are symmetric or skewed. A strong skew might indicate outliers or a non-normal distribution that requires transformation before parametric testing.*
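
One simple way to quantify skew is to compare the mean with the median and to compute a moment-based skewness coefficient. The sketch below uses base R only and a hypothetical vector, `skewed_example`; it is not part of `summary_numeric`.

```{r}
# Hypothetical right-skewed values (not taken from clinical_data)
skewed_example <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, 20)
centered <- skewed_example - mean(skewed_example)
# Moment-based skewness: third central moment divided by variance^(3/2)
skewness <- mean(centered^3) / mean(centered^2)^1.5
c(
    mean = mean(skewed_example),
    median = median(skewed_example),
    skewness = skewness
)
```
*A mean well above the median together with a positive skewness coefficient points to a right-skewed distribution.*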

### **Identifying Outliers**

Outliers in continuous variables can affect statistical analyses. We use different methods to detect them:

```{r}
identify_outliers(clinical_data$CRP, method = "iqr")
```
```{r}
identify_outliers(clinical_data$CRP, method = "percentiles")
```
```{r}
identify_outliers(clinical_data$CRP, method = "hampel")
```
```{r}
identify_outliers(clinical_data$CRP, method = "mad")
```
```{r}
identify_outliers(select(clinical_data, CRP), method = "sd")
```
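*The methods differ in sensitivity. The `iqr` method flags values far outside the quartiles, while `mad` and `hampel` rely on the median and the median absolute deviation, so their cut-offs are barely affected by extreme values. The `sd` method relies on the mean and standard deviation, which outliers themselves inflate, and it implicitly assumes approximate normality, so it is a poor fit for skewed data.*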
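
To make the difference between these rules concrete, the sketch below applies three of them by hand in base R to a hypothetical vector, `crp_example`. The cut-offs (1.5 times the IQR, 3 scaled MADs, 3 standard deviations) are common conventions and are assumptions here; `identify_outliers` may use different defaults.

```{r}
# Hypothetical values with one extreme observation (not from clinical_data)
crp_example <- c(4.1, 4.8, 5.0, 5.2, 5.3, 5.5, 5.9, 6.1, 6.4, 15.0)

# IQR rule: outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
quartiles <- quantile(crp_example, c(0.25, 0.75))
iqr_width <- quartiles[2] - quartiles[1]
iqr_flag <- crp_example < quartiles[1] - 1.5 * iqr_width |
    crp_example > quartiles[2] + 1.5 * iqr_width

# MAD rule: more than 3 scaled MADs away from the median
mad_flag <- abs(crp_example - median(crp_example)) > 3 * mad(crp_example)

# SD rule: more than 3 standard deviations away from the mean
sd_flag <- abs(crp_example - mean(crp_example)) > 3 * sd(crp_example)

list(
    iqr = crp_example[iqr_flag],
    mad = crp_example[mad_flag],
    sd = crp_example[sd_flag]
)
```
*In this toy vector the quartile-based and median-based rules flag the value 15, whereas the 3 SD rule does not, because the extreme value inflates the standard deviation it is compared against.*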

## Correlation Analysis

```{r}
mcor_test(clinical_data[, c("CRP", "IL6", "Leukocytes")], method = "pearson")
```
```{r}
mcor_test(
    clinical_data[, c("CRP", "IL6", "Leukocytes")],
    clinical_data[, c("Physician_Score", "Fatigue_Score")],
    method = "spearman",
    p.value = TRUE,
    method_adjust = "bonferroni"
)
```
*Pearson's correlation measures linear association, whereas Spearman's is rank-based and better suited to skewed data or monotonic non-linear associations. If Pearson's r differs substantially from Spearman's rho, the relationship may be non-linear or driven by a few extreme values.*
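
The contrast between the two coefficients can be illustrated with base R's `cor()` on a relationship that is perfectly monotonic but far from linear. The vectors `dose_example` and `response_example` are hypothetical and unrelated to `clinical_data`.

```{r}
# Hypothetical monotonic but non-linear relationship
dose_example <- 1:10
response_example <- exp(dose_example)
pearson_r <- cor(dose_example, response_example, method = "pearson")
spearman_rho <- cor(dose_example, response_example, method = "spearman")
c(pearson = pearson_r, spearman = spearman_rho)
```
*Spearman's rho equals 1 because the ranks agree perfectly, while Pearson's r is clearly lower because the points do not lie on a straight line.*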

## Group Comparisons

### **ANOVA (Parametric)**

```{r}
anova_res <- anova_test(data = clinical_data, Age ~ Country)
print_test(anova_res)
```
*A significant ANOVA result suggests at least one group mean differs. If non-significant, we fail to reject the null hypothesis that all means are equal.*

### **Kruskal-Wallis (Non-Parametric)**

```{r}
kruskal_res <- kruskal_test(data = clinical_data, CRP ~ Cancer_Type)
print_test(kruskal_res)
```
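*This test is a rank-based alternative to ANOVA when normality is violated. A significant result means at least one group tends to have larger or smaller values than the others; it can be read as a difference in medians only if the group distributions have similar shapes.*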
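
The choice between the parametric and the rank-based test matters most when extreme values are present. The sketch below runs both tests with base R (`aov()` and `kruskal.test()`) on a hypothetical data frame, `toy_groups`, in which one group contains a single extreme value.

```{r}
# Hypothetical data: three groups of six, one extreme value in group A
toy_groups <- data.frame(
    group = factor(rep(c("A", "B", "C"), each = 6)),
    value = c(
        1.1, 1.3, 1.2, 1.6, 1.4, 9.0,
        2.1, 2.4, 2.2, 2.8, 2.6, 2.5,
        3.3, 3.1, 3.6, 3.4, 3.8, 3.5
    )
)
# Parametric: one-way ANOVA, p-value taken from the summary table
anova_p <- summary(aov(value ~ group, data = toy_groups))[[1]][["Pr(>F)"]][1]
# Non-parametric: Kruskal-Wallis rank sum test
kruskal_p <- kruskal.test(value ~ group, data = toy_groups)$p.value
c(anova = anova_p, kruskal = kruskal_p)
```
*Here the extreme value inflates the within-group variance, so the ANOVA is not significant at the 5% level while the Kruskal-Wallis test is.*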

### **Wilcoxon Test (Two Groups)**

```{r}
wilcox_res <- wilcox_test(data = clinical_data, IL6 ~ KRAS_Mutation)
print_test(wilcox_res)
```
*A significant result suggests that IL6 levels differ between the two `KRAS_Mutation` groups.*

## Chi-Square and Fisher’s Exact Test

```{r}
chi2_res <- chisq_test(table(clinical_data$Cancer_Type, clinical_data$Treatment_Response))
print_chi2_test(chi2_res)
```
*The chi-square test assesses independence between categorical variables. A significant result suggests an association between cancer type and treatment response.*
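
When expected cell counts are small, the chi-square approximation becomes unreliable and Fisher's exact test is the usual alternative. The sketch below uses base R's `chisq.test()` and `fisher.test()` on a hypothetical 2 x 2 table, `small_table`; the threshold of 5 expected observations per cell is a common rule of thumb, not a setting of this package.

```{r}
# Hypothetical 2 x 2 table with few observations
small_table <- matrix(
    c(8, 2, 1, 9),
    nrow = 2,
    dimnames = list(
        Mutation = c("Mutated", "Wild-type"),
        Response = c("Yes", "No")
    )
)
# Expected counts under independence (warning suppressed: small counts)
expected_counts <- suppressWarnings(chisq.test(small_table))$expected
expected_counts
# Rule of thumb: prefer Fisher's exact test if any expected count is below 5
use_fisher <- any(expected_counts < 5)
fisher_p <- fisher.test(small_table)$p.value
c(use_fisher = use_fisher, fisher_p = fisher_p)
```
*Fisher's exact test computes the p-value from the hypergeometric distribution and therefore does not depend on a large-sample approximation.*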

```{r}
post_hoc_chi2(clinical_data$Cancer_Type, method = "chisq")
```
*Post-hoc tests determine which specific categories differ, useful when the chi-square test is significant.*
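
One generic way to locate the cells that drive a significant chi-square result is to inspect standardized residuals and adjust the corresponding p-values for multiple testing. The sketch below uses base R only on a hypothetical table, `response_table`; it illustrates the general idea and is not a description of how `post_hoc_chi2` works internally.

```{r}
# Hypothetical 2 x 3 contingency table
response_table <- matrix(
    c(30, 10, 20, 20, 10, 30),
    nrow = 2,
    dimnames = list(
        Response = c("Yes", "No"),
        Group = c("G1", "G2", "G3")
    )
)
# Standardized residuals: roughly standard normal under independence
std_residuals <- chisq.test(response_table)$stdres
# Two-sided p-value per cell, Bonferroni-adjusted across all cells
cell_p <- 2 * pnorm(-abs(std_residuals))
adjusted_p <- matrix(
    p.adjust(cell_p, method = "bonferroni"),
    nrow = nrow(response_table),
    dimnames = dimnames(response_table)
)
round(std_residuals, 2)
signif(adjusted_p, 2)
```
*Cells with large absolute residuals and small adjusted p-values, here those in groups G1 and G3, deviate most from what independence would predict.*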

## Session Information
```{r end, echo = FALSE}
sessionInfo()
```