---
title: "riemstats"
author: "Nicolas Escobar, Jaroslaw Harezlak"
date: "2025-02-12"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{riemstats}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

# Introduction

The `riemstats` package provides specialized statistical methods for analyzing brain connectivity matrices (connectomes) from multi-site neuroimaging studies. When working with functional or structural connectivity data represented as symmetric positive definite (SPD) matrices, this package offers both harmonization techniques to address site-specific effects and ANOVA methods that respect the Riemannian geometry of SPD matrices.

## The Multi-Site Connectome Challenge

Modern neuroimaging studies often collect brain connectivity data across multiple sites to increase sample size and generalizability. However, this introduces significant challenges associated to the fact that different scanners, protocols, and populations introduce systematic biases.

Traditional statistical methods fail to address these challenges simultaneously, either ignoring the geometric structure of SPD matrices or lacking appropriate harmonization capabilities.

## Core Capabilities

`riemstats` addresses these challenges through two complementary sets of tools:

1. **Harmonization Methods**: Remove unwanted site/scanner effects while preserving biological signals
  - ComBat harmonization adapted for SPD matrices (see Honorro et al.)
  - Rigid geometric alignment methods (see Simeon et al.)

2. **ANOVA Methods**: Test for group differences while respecting SPD geometry
  - Fréchet ANOVA using Riemannian distances (see Muller et al.)
  - Riemannian ANOVA (closer to classical MANOVA, see Escobar, Harezlak.)
    - Wilks' lambda and Pillai's trace statistics on the tangent space  
    - Bootstrap inference for robust significance testing

By integrating harmonization and ANOVA within a Riemannian framework, `riemstats` enables rigorous statistical analysis of multi-site connectome data while preserving the geometric properties that make these matrices meaningful representations of brain connectivity.

# Why Standard Methods Fail for Connectome Data

Brain connectivity matrices (connectomes) are symmetric positive definite (SPD) matrices - they have special mathematical properties that make standard statistical methods inappropriate. This section explains why ordinary ComBat and ANOVA fail for SPD data and why specialized Riemannian methods are necessary.

## The Problem with Treating SPD Matrices as Vectors

Standard methods like ordinary ComBat and ANOVA treat matrices as if they were simple vectors in Euclidean space. This approach has critical flaws:

### 1. Distance Distortions

Standard Euclidean distance between matrices ignores their geometric structure:
- Two very similar connectivity patterns might appear far apart
- Two different patterns might appear close
- Statistical tests based on these distances give misleading results

### 2. Site Effect Removal Violations

Ordinary ComBat removes site effects by:
1. Treating each matrix element independently
2. Applying location/scale adjustments element-wise
3. Ignoring the constraint that results must be SPD

This can produce "harmonized" data that:
- No longer represents valid connectivity matrices
- Has lost important geometric relationships
- Contains negative eigenvalues (mathematically invalid for correlations)

## Why Ordinary ANOVA Fails

Standard ANOVA assumes data lies in Euclidean space where:
- The mean is computed by simple averaging
- Distances are measured with straight lines
- Variance has the same meaning everywhere

For SPD matrices, these assumptions break down.

## The Riemannian Solution

The key insight is that SPD matrices form a Riemannian manifold - a curved geometric space. By respecting this geometry, `riemstats` ensures:
- All operations produce valid SPD matrices
- Statistical tests have correct Type I error rates
- Biological signals are preserved during harmonization
- Results are interpretable as actual brain connectivity patterns

Indeed, Riemannian methods in `riemstats` solve these problems by:

1. **Proper Averaging**: Using the Fréchet mean, which always produces valid SPD matrices
2. **Correct Distances**: Measuring distances along the manifold (geodesics) rather than through Euclidean space
3. **Geometry-Preserving Harmonization**: Removing site effects while maintaining SPD structure
4. **Valid Statistical Tests**: ANOVA methods that account for manifold curvature

# Typical Workflow

This section demonstrates a complete analysis pipeline for multi-site connectome data using `riemstats`. We'll walk through loading data, applying harmonization, and performing statistical tests.

## Loading the Package and Data

```{r load-package}
library(riemstats)
library(riemtan) # Required for CSuperSample and CSample classes
data("airm")  # Load the AIRM metric

# Create example SPD matrices for demonstration
test_pd_mats <- list(
  Matrix::Matrix(c(2.0, 0.5, 0.5, 3.0), nrow = 2) |>
    Matrix::nearPD() |> _$mat |> Matrix::pack(),
  Matrix::Matrix(c(1.5, 0.3, 0.3, 2.5), nrow = 2) |>
    Matrix::nearPD() |> _$mat |> Matrix::pack()
)

# Create samples for two sites/groups
sample1 <- test_pd_mats |>
  purrr::map(\(x) (2 * x) |> Matrix::unpack() |> as("dpoMatrix") |> Matrix::pack()) |>
  CSample$new(metric_obj = airm)
sample2 <- test_pd_mats |> CSample$new(metric_obj = airm)

# Combine into a supersample for analysis
super_sample <- list(sample1, sample2) |> CSuperSample$new()
```

## Data Harmonization

When working with multi-site data, harmonization is crucial to remove site-specific effects while preserving biological signals:

```{r harmonization}
# Apply ComBat harmonization for SPD matrices
# Input must be a CSuperSample object
harmonized_super_sample <- combat_harmonization(super_sample)

# Alternative: Rigid geometric alignment
# Also works with CSuperSample objects
aligned_super_sample <- rigid_harmonization(super_sample)
```

## Statistical Analysis with Riemannian ANOVA

After harmonization, test for group differences using methods that respect the SPD geometry:

```{r anova-analysis, eval=FALSE}
# Fréchet ANOVA for overall group differences
# Returns p-value directly
frechet_p_value <- frechet_anova(super_sample)

# Riemannian ANOVA with permutation test inference
# Can use log_wilks_lambda (default) or pillais_trace as stat_fun
riemann_p_value_wilks <- riem_anova(
  ss = super_sample,
  stat_fun = log_wilks_lambda,
  nperm = 100 # Number of permutations (use 1000+ in practice)
)

# Using Pillai's trace
riemann_p_value_pillai <- riem_anova(
  ss = super_sample,
  stat_fun = pillais_trace,
  nperm = 100 # Number of permutations (use 1000+ in practice)
)

# Display results
cat("Fréchet ANOVA p-value:", frechet_p_value, "\n")
cat("Riemannian ANOVA (Wilks) p-value:", riemann_p_value_wilks, "\n")
cat("Riemannian ANOVA (Pillai) p-value:", riemann_p_value_pillai, "\n")
```

## Interpreting Results

The analysis provides several key outputs:

- **Test statistics**: Quantify the magnitude of group differences
- **P-values**: Assess statistical significance (adjusted for multiple comparisons when applicable)
- **Effect sizes**: Measure the practical importance of observed differences
- **Bootstrap confidence intervals**: Provide robust inference without parametric assumptions