---
title: "Multistage Phenotypic Selection Indices"
author: "Zankrut Goyani"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Multistage Phenotypic Selection Indices}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

# Introduction

In crop and animal breeding programs, selection is rarely performed in a single stage. Breeders usually evaluate multiple traits across different stages of testing, successively discarding inferior genotypes and advancing superior ones. This multistage approach requires selection indices to account for the changes in variances and covariances induced by selection at prior stages.

Chapter 9 of the `selection.index` package focuses on the mathematical formulation and practical application of Multistage Linear Selection Indices. These indices adjust for prior selection using the Cochran (1951) and Cunningham (1975) method, which corrects the covariance matrices so that selection responses at later stages are predicted from post-selection variances. This vignette demonstrates six multistage indices: three based on phenotypic values and three based on genomic estimated breeding values (GEBVs).

We will use the synthetic maize phenotypic and genotypic datasets (`maize_pheno` and `maize_geno`) to illustrate these complex functions. Let us first prepare the covariance matrices.

## Data Preparation

For our examples, we will evaluate 3 quantitative traits from the dataset: Yield, PlantHeight, and DaysToMaturity. We assume that Stage 1 selection evaluates the first two traits (Yield, PlantHeight), and Stage 2 evaluates all 3 traits (adding DaysToMaturity).

```{r setup_data}
library(selection.index)

# Estimate phenotypic and genotypic covariance matrices for the 3 traits
# The traits are Yield, PlantHeight, DaysToMaturity
traits <- c("Yield", "PlantHeight", "DaysToMaturity")
pmat <- phen_varcov(maize_pheno[, traits], maize_pheno$Environment, maize_pheno$Genotype)
gmat <- gen_varcov(maize_pheno[, traits], maize_pheno$Environment, maize_pheno$Genotype)

# Submatrices for Stage 1 (traits 1 and 2: Yield, PlantHeight)
P1 <- pmat[1:2, 1:2]
G1 <- gmat[1:2, 1:2]

# Complete Matrices for Stage 2
P <- pmat
C <- gmat

# Economic weights for the 3 traits
weights <- c(10, -5, -2)
```

---

# 1. Multistage Linear Phenotypic Selection Index (MLPSI)

The Multistage Linear Phenotypic Selection Index accounts for changes in phenotypic ($\mathbf{P}$) and genotypic ($\mathbf{C}$) covariance matrices due to previous selection cycles.

At Stage 1, the index coefficients are computed just as in the standard Smith-Hazel index:
$$ \mathbf{b}_1 = \mathbf{P}_1^{-1}\mathbf{G}_1\mathbf{w}_1 $$
where $\mathbf{w}_1$ contains the economic weights for Stage 1 traits.
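
As a minimal sketch of the Stage 1 formula, the coefficients can be computed by hand in base R. The matrices and weights below are made-up toy values (the `_toy` names are hypothetical and not part of the package); the only purpose is to show the linear algebra behind $\mathbf{b}_1 = \mathbf{P}_1^{-1}\mathbf{G}_1\mathbf{w}_1$.

```{r sketch_stage1_coefficients}
# Toy Stage 1 matrices (illustrative values only, not taken from maize_pheno)
P1_toy <- matrix(c(10, 2,
                    2, 8), nrow = 2, byrow = TRUE) # phenotypic covariance
G1_toy <- matrix(c(4, 1,
                   1, 3), nrow = 2, byrow = TRUE)  # genotypic covariance
w1_toy <- c(10, -5)                                # economic weights

# b1 = P1^{-1} G1 w1, solved as a linear system instead of forming the inverse
b1_toy <- solve(P1_toy, G1_toy %*% w1_toy)
drop(b1_toy)
```

Using `solve(P, rhs)` rather than `solve(P) %*% rhs` avoids an explicit inverse and is usually the more numerically stable choice.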

At Stage 2, the coefficients for the entire set of traits are:
$$ \mathbf{b}_2 = \mathbf{P}^{-1}\mathbf{C}\mathbf{w} $$

However, due to selection at stage 1, the covariances $\mathbf{P}$ and $\mathbf{C}$ are adjusted for stage 2 evaluation:
$$ \mathbf{P}^* = \mathbf{P} - u \frac{Cov(\mathbf{y},\mathbf{x}_1)\mathbf{b}_1\mathbf{b}_1'Cov(\mathbf{x}_1,\mathbf{y})}{\mathbf{b}_1'\mathbf{P}_1\mathbf{b}_1} $$
$$ \mathbf{C}^* = \mathbf{C} - u \frac{\mathbf{G}_1'\mathbf{b}_1\mathbf{b}_1'\mathbf{G}_1}{\mathbf{b}_1'\mathbf{P}_1\mathbf{b}_1} $$
where $u = k_1(k_1 - \tau)$ measures how strongly Stage 1 selection reduces the variances and covariances, $\tau$ is the standardized truncation point, and $k_1$ is the Stage 1 selection intensity.
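
The adjustment can be sketched in base R under a few stated assumptions. First, $\tau$ and $k_1$ are derived from the selected proportion assuming a normally distributed index under truncation selection; how `mlpsi` obtains them internally is not shown here. Second, $Cov(\mathbf{y},\mathbf{x}_1)$ is taken as the first two columns of the full phenotypic matrix, and the genotypic term is taken as the matching columns of the full genotypic matrix so that the product is conformable with $\mathbf{C}$. All `_toy` objects are hypothetical illustration values.

```{r sketch_adjusted_covariance}
# Toy 3-trait matrices (illustrative values only)
P_toy <- matrix(c(10, 2,   1,
                   2, 8,   1.5,
                   1, 1.5, 6), nrow = 3, byrow = TRUE)
C_toy <- matrix(c(4,   1,   0.5,
                  1,   3,   0.8,
                  0.5, 0.8, 2), nrow = 3, byrow = TRUE)
w1_toy <- c(10, -5)

# Stage 1 index on traits 1 and 2
P1_toy <- P_toy[1:2, 1:2]
G1_toy <- C_toy[1:2, 1:2]
b1_toy <- solve(P1_toy, G1_toy %*% w1_toy)

# Truncation point and intensity for a selected proportion p (normality assumed)
p_toy   <- 0.10
tau_toy <- qnorm(1 - p_toy)
k1_toy  <- dnorm(tau_toy) / p_toy
u_toy   <- k1_toy * (k1_toy - tau_toy)

# Variance of the Stage 1 index before selection
var_I1 <- drop(t(b1_toy) %*% P1_toy %*% b1_toy)

# Covariances of all traits with the Stage 1 traits (assumed block structure)
cov_y_x1 <- P_toy[, 1:2, drop = FALSE]
cov_g_x1 <- C_toy[, 1:2, drop = FALSE]

P_star_toy <- P_toy -
  u_toy * (cov_y_x1 %*% b1_toy %*% t(b1_toy) %*% t(cov_y_x1)) / var_I1
C_star_toy <- C_toy -
  u_toy * (cov_g_x1 %*% b1_toy %*% t(b1_toy) %*% t(cov_g_x1)) / var_I1

round(P_star_toy, 3)
```

A quick sanity check on this sketch: the variance of the Stage 1 index computed from the adjusted matrix equals $(1 - u)$ times its unadjusted variance, which is the expected effect of truncation on the selected variable.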

The function `mlpsi` computes the adjusted matrices and the summary metrics for both stages in a single call.

```{r mlpsi_example}
# We apply a selection proportion of 10% (0.10) per stage.
mlpsi_res <- mlpsi(
  P1 = P1, P = P, G1 = G1, C = C,
  wmat = weights,
  selection_proportion = 0.1
)

# Stage 1 metrics
mlpsi_res$summary_stage1

# Stage 2 metrics
mlpsi_res$summary_stage2
```

---

# 2. Multistage Restricted Linear Phenotypic Selection Index (MRLPSI)

The MRLPSI method applies when the breeder aims to maintain one or more quantitative traits without change over the multistage evaluation (e.g., maintaining constant `PlantHeight` while optimizing other variables).

The restricted coefficient vectors for Stage 1 and Stage 2 are defined as:
$$ \mathbf{b}_{R_1} = \mathbf{K}_1 \mathbf{b}_1 $$
$$ \mathbf{b}_{R_2} = \mathbf{K}_2 \mathbf{b}_2 $$

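where $\mathbf{K}_1 = \mathbf{I}_1 - \mathbf{Q}_1$ and $\mathbf{K}_2 = \mathbf{I}_2 - \mathbf{Q}_2$ are matrices that force the expected genetic gain of the restricted traits to zero. They are built from the constraint matrices $\mathbf{C}_1$ and $\mathbf{C}_2$ (passed as `C1` and `C2`), which are distinct from the genotypic covariance matrix $\mathbf{C}$.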
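
To make the role of $\mathbf{K}$ concrete, here is a single-stage sketch in base R. It assumes the usual restricted-index form $\mathbf{Q} = \mathbf{P}^{-1}\mathbf{\Phi}(\mathbf{\Phi}'\mathbf{P}^{-1}\mathbf{\Phi})^{-1}\mathbf{\Phi}'$ with $\mathbf{\Phi}$ equal to the genotypic covariance matrix times the constraint matrix; whether `mrlpsi` builds $\mathbf{Q}$ exactly this way is an assumption. The constraint matrix is named `U_toy` here only to avoid a clash with the genotypic matrix, and all values are made up.

```{r sketch_restricted_coefficients}
# Toy 3-trait matrices and weights (illustrative values only)
P_toy <- matrix(c(10, 2,   1,
                   2, 8,   1.5,
                   1, 1.5, 6), nrow = 3, byrow = TRUE)
C_toy <- matrix(c(4,   1,   0.5,
                  1,   3,   0.8,
                  0.5, 0.8, 2), nrow = 3, byrow = TRUE)
w_toy <- c(10, -5, -2)

# Constraint matrix: restrict trait 2 only
U_toy <- matrix(c(0, 1, 0), ncol = 1)

# Unrestricted coefficients b = P^{-1} C w
b_toy <- solve(P_toy, C_toy %*% w_toy)

# Build Q and K = I - Q (assumed standard restricted-index form)
Phi_toy   <- C_toy %*% U_toy
P_inv_toy <- solve(P_toy)
Q_toy <- P_inv_toy %*% Phi_toy %*%
  solve(t(Phi_toy) %*% P_inv_toy %*% Phi_toy) %*% t(Phi_toy)
K_toy <- diag(3) - Q_toy

# Restricted coefficients and the resulting gain direction C b_R
bR_toy <- K_toy %*% b_toy
gain_direction_toy <- drop(C_toy %*% bR_toy)
round(gain_direction_toy, 6)
```

The second element of the gain direction is zero up to rounding, which is what holding trait 2 constant means; the other traits remain free to respond.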

```{r mrlpsi_example}
# We constrain PlantHeight (Trait 2) at Stage 1
C1 <- matrix(0, nrow = 2, ncol = 1)
C1[2, 1] <- 1

# We constrain PlantHeight (Trait 2) at Stage 2
C2 <- matrix(0, nrow = 3, ncol = 1)
C2[2, 1] <- 1

mrlpsi_res <- mrlpsi(
  P1 = P1, P = P, G1 = G1, C = C,
  wmat = weights,
  C1 = C1, C2 = C2,
  selection_proportion = 0.1
)

# Observe that Expected Gain (E) for PlantHeight is approximately 0
mrlpsi_res$summary_stage1
```

---

# 3. Multistage Predetermined Proportional Gain LPSI (MPPG-LPSI)

Unlike the MRLPSI, which holds the expected genetic gain of the restricted traits at zero, the MPPG-LPSI makes the expected gains proportional to values the breeder fixes in advance through the vectors $\mathbf{d}_1$ (Stage 1) and $\mathbf{d}_2$ (Stage 2). At each stage, the corresponding vector sets the target ratios among the trait gains.
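
The idea of gains proportional to a target vector can be illustrated with a deliberately simplified single-stage construction. This is not the algorithm inside `mppg_lpsi`: it constrains every trait, ignores the economic weights, and uses $\mathbf{b} = \mathbf{P}^{-1}\mathbf{C}(\mathbf{C}\mathbf{P}^{-1}\mathbf{C})^{-1}\mathbf{d}$, chosen only because the resulting gain direction $\mathbf{C}\mathbf{b}$ equals $\mathbf{d}$ by construction. All `_toy` values are made up.

```{r sketch_proportional_gains}
# Toy 3-trait matrices (illustrative values only)
P_toy <- matrix(c(10, 2,   1,
                   2, 8,   1.5,
                   1, 1.5, 6), nrow = 3, byrow = TRUE)
C_toy <- matrix(c(4,   1,   0.5,
                  1,   3,   0.8,
                  0.5, 0.8, 2), nrow = 3, byrow = TRUE)

# Desired proportions among the three trait gains
d_toy <- c(3, 1, 0.5)

# b = P^{-1} C (C P^{-1} C)^{-1} d  (simplified all-traits-constrained case)
CPC_toy   <- C_toy %*% solve(P_toy, C_toy)
b_ppg_toy <- solve(P_toy, C_toy) %*% solve(CPC_toy, d_toy)

# Gain direction divided by d: a constant ratio means proportional gains
ratio_toy <- drop(C_toy %*% b_ppg_toy) / d_toy
round(ratio_toy, 6)
```

When reading the `E` column of a summary table, the same check applies: dividing the expected gains by the target vector should give roughly the same number for every constrained trait.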

```{r mppg_lpsi_example}
# Target specific proportional gains
d1 <- c(2, 1) # Yield gains twice as much as PlantHeight at stage 1
d2 <- c(3, 1, 0.5) # Desired proportions at stage 2

mppg_res <- mppg_lpsi(
  P1 = P1, P = P, G1 = G1, C = C,
  wmat = weights,
  d1 = d1, d2 = d2,
  selection_proportion = 0.1
)

# The expected gains (E) in the summary should be proportional to d1
mppg_res$summary_stage1
```

---

# 4. Multistage Linear Genomic Selection Index (MLGSI)

In modern breeding, Genomic Estimated Breeding Values (GEBVs) computed from whole-genome markers speed up each selection cycle. In the multistage genomic framework, GEBV-based matrices take the place of the phenotypic ones: $\mathbf{\Gamma}$, the variance-covariance matrix of the GEBVs, replaces $\mathbf{P}$, and $\mathbf{A}$, the covariance matrix between the GEBVs and the true breeding values, links the genomic index to the breeding values.

For illustrative purposes only, we mock these matrices by scaling the genetic covariance matrices with an assumed prediction reliability:
```{r setup_genomic}
set.seed(42)
reliability <- 0.7 # Simulated genomic prediction reliability

Gamma1 <- reliability * G1
Gamma <- reliability * C
A1 <- reliability * G1
A <- C[, 1:2] # n x n1 block: all traits (rows) by Stage 1 traits (columns)
```

```{r mlgsi_example}
mlgsi_res <- mlgsi(
  Gamma1 = Gamma1, Gamma = Gamma, A1 = A1, A = A,
  C = C, G1 = G1, P1 = P1,
  wmat = weights,
  selection_proportion = 0.1
)

mlgsi_res$summary_stage1
```

---

# 5. Multistage Restricted Genomic Selection Index (MRLGSI)

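The restriction used in the MRLPSI carries over to the genomic setting: the MRLGSI holds the expected gain of the chosen traits at zero at each stage while selecting on GEBVs. The example reuses the constraint matrices `C1` and `C2` defined for the MRLPSI.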

```{r mrlgsi_example}
mrlgsi_res <- mrlgsi(
  Gamma1 = Gamma1, Gamma = Gamma, A1 = A1, A = A,
  C = C, G1 = G1, P1 = P1,
  wmat = weights,
  C1 = C1, C2 = C2,
  selection_proportion = 0.1
)

mrlgsi_res$summary_stage2
```

---

# 6. Multistage PPG Genomic Selection Index (MPPG-LGSI)

The MPPG-LGSI is the genomic counterpart of the MPPG-LPSI: it targets the predetermined proportional gains in `d1` and `d2` at each stage, using GEBVs instead of phenotypic values.

```{r mppg_lgsi_example}
mppg_lgsi_res <- mppg_lgsi(
  Gamma1 = Gamma1, Gamma = Gamma, A1 = A1, A = A,
  C = C, G1 = G1, P1 = P1,
  wmat = weights,
  d1 = d1, d2 = d2,
  selection_proportion = 0.1
)

mppg_lgsi_res$summary_stage1
```

# Statistical Properties

For all six multistage indices above (phenotypic and genomic), the same three statistics are used to compare efficiency.

### Accuracy
The accuracy is the correlation between the index $I$ and the net genetic merit $H$; it measures how efficiently the index ranks candidates:
$$ \rho_{H} = \frac{\sigma_{H, I}}{\sigma_H \sigma_I} = \sqrt{ \frac{\mathbf{b}'\mathbf{P}\mathbf{b}}{\mathbf{w}'\mathbf{C}\mathbf{w}} } $$
where $\mathbf{P}$ and $\mathbf{C}$ are substituted by their adjusted equivalents at Stage 2 ($\mathbf{P}^*$ and $\mathbf{C}^*$ or $\mathbf{\Gamma}^*$).
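
A small numerical sketch shows that the two expressions for the accuracy agree when $\mathbf{b} = \mathbf{P}^{-1}\mathbf{C}\mathbf{w}$. It uses unadjusted, made-up toy matrices (the `_toy` names are hypothetical); at Stage 2 the adjusted matrices would be plugged in the same way.

```{r sketch_accuracy}
# Toy 3-trait matrices and weights (illustrative values only)
P_toy <- matrix(c(10, 2,   1,
                   2, 8,   1.5,
                   1, 1.5, 6), nrow = 3, byrow = TRUE)
C_toy <- matrix(c(4,   1,   0.5,
                  1,   3,   0.8,
                  0.5, 0.8, 2), nrow = 3, byrow = TRUE)
w_toy <- c(10, -5, -2)

b_toy <- solve(P_toy, C_toy %*% w_toy)

var_I_toy  <- drop(t(b_toy) %*% P_toy %*% b_toy)  # variance of the index
var_H_toy  <- drop(t(w_toy) %*% C_toy %*% w_toy)  # variance of net genetic merit
cov_HI_toy <- drop(t(w_toy) %*% C_toy %*% b_toy)  # covariance between H and I

# Correlation form and square-root form of the accuracy
rho_cor_toy  <- cov_HI_toy / sqrt(var_H_toy * var_I_toy)
rho_sqrt_toy <- sqrt(var_I_toy / var_H_toy)
c(correlation = rho_cor_toy, sqrt_ratio = rho_sqrt_toy)
```

The two values coincide because, for these optimal coefficients, the covariance between the index and the net genetic merit equals the index variance.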

### Selection Response
The selection response is the expected change in net genetic merit obtained by selecting on the index:
$$ R = k \sigma_{I} = k \sqrt{\mathbf{b}'\mathbf{P}\mathbf{b}} $$
where $k$ is the selection intensity.

### Expected Genetic Gain
The expected genetic gain per individual trait is given by the vector:
$$ \mathbf{E} = k \frac{\mathbf{G}'\mathbf{b}}{\sigma_I} = k \frac{\mathbf{G}'\mathbf{b}}{\sqrt{\mathbf{b}'\mathbf{P}\mathbf{b}}} $$
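Here $\mathbf{G}$ denotes the genotypic covariance matrix of the stage being evaluated ($\mathbf{G}_1$ at Stage 1, $\mathbf{C}$ at Stage 2). For the multistage genomic indices, $\mathbf{G}'$ is replaced by $\mathbf{A}'$, the covariance matrix between the GEBVs and the true breeding values.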
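
The response and the per-trait gains can be sketched together in base R. The example assumes a selected proportion of 10%, derives $k$ under normality, and uses made-up toy matrices with the unrestricted coefficients $\mathbf{b} = \mathbf{P}^{-1}\mathbf{C}\mathbf{w}$; none of the `_toy` objects come from the package.

```{r sketch_response_and_gain}
# Toy 3-trait matrices and weights (illustrative values only)
P_toy <- matrix(c(10, 2,   1,
                   2, 8,   1.5,
                   1, 1.5, 6), nrow = 3, byrow = TRUE)
C_toy <- matrix(c(4,   1,   0.5,
                  1,   3,   0.8,
                  0.5, 0.8, 2), nrow = 3, byrow = TRUE)
w_toy <- c(10, -5, -2)

b_toy <- solve(P_toy, C_toy %*% w_toy)

# Selection intensity for a selected proportion of 10% (normality assumed)
p_toy <- 0.10
k_toy <- dnorm(qnorm(1 - p_toy)) / p_toy

# Standard deviation of the index
sigma_I_toy <- sqrt(drop(t(b_toy) %*% P_toy %*% b_toy))

R_toy <- k_toy * sigma_I_toy                               # selection response
E_toy <- k_toy * drop(t(C_toy) %*% b_toy) / sigma_I_toy    # gain per trait

R_toy
E_toy
```

For the unrestricted index the two quantities are linked: weighting the per-trait gains by the economic weights, `sum(w_toy * E_toy)`, reproduces `R_toy`. That identity does not generally hold for the restricted or proportional-gain indices.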
 
---

# Summary

Multistage selection indices let breeders discard inferior genotypes sequentially while adjusting the phenotypic, genotypic, and genomic covariance matrices for the selection already applied at earlier stages. Without this adjustment, responses at later stages would be predicted from pre-selection variances that no longer describe the selected population.