---
title: "The Linear Phenotypic Selection Index Theory"
Author: "Zankrut Goyani"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{The Linear Phenotypic Selection Index Theory}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Introduction

In plant and animal breeding, quantitative traits (QTs) are expressions of genes distributed across the genome interacting with the environment. The phenotypic value of QTs ($y$) can be systematically partitioned into a genotypic component ($g$) and an environmental component ($e$):

$$ y = g + e $$

The primary goal in breeding is to maximize an individual's **net genetic merit**. The net genetic merit ($H$) is a linear combination of the unobservable true breeding values ($\mathbf{g}$) weighted by their respective economic values ($\mathbf{w}$):

$$ H = {\mathbf{w}}^{\prime}\mathbf{g} $$

Because the net genetic merit is unobservable in field trials, breeders construct a **Linear Phenotypic Selection Index (LPSI)** to predict it. The LPSI ($I$) is a linear combination of the observable and optimally weighted phenotypic trait values ($\mathbf{y}$) adjusted by index coefficients ($\mathbf{b}$):

$$ I = {\mathbf{b}}^{\prime}\mathbf{y} $$

The objective of the LPSI is to predict the net genetic merit and maximize the multi-trait selection response.

## Optimizing the LPSI

To identify the optimal parents for the next selection cycle, the correlation between the net genetic merit ($H$) and the LPSI ($I$) must be maximized. The vector $\mathbf{b}$ that simultaneously minimizes the mean squared difference between $I$ and $H$ and perfectly maximizes this correlation is mathematically derived as:

$$ \mathbf{b} = {\mathbf{P}}^{-1}\mathbf{Gw} $$

where:
* $\mathbf{P}$ is the phenotypic variance-covariance matrix.
* $\mathbf{G}$ is the genotypic variance-covariance matrix.
* $\mathbf{w}$ is the vector of economic weights defining relative trait importance.

Once these optimal coefficients are derived, we can evaluate two fundamental parameters:

1. **The Maximized Selection Response ($R_I$)**: The expected mean improvement in the net genetic merit due to indirect selection on the index.
   $$ {R}_I = {k}_I\sqrt{{\mathbf{b}}^{\prime}\mathbf{Pb}} $$

2. **The Expected Genetic Gain Per Trait ($\mathbf{E}$)**: The multi-trait selection response broken down per individual trait.
   $$ \mathbf{E} = {k}_I\frac{\mathbf{Gb}}{\sigma_I} $$
   
where $k_I$ is the standardized selection intensity and $\sigma_I$ is the standard deviation of the index score variance.

## Practical Implementation in R

We can seamlessly translate this text theory into rigorous statistical practice using the `selection.index` package. We will utilize the built-in synthetic datasets: `maize_pheno` (containing multi-environment phenotypic records for 100 genotypes) and `maize_geno` (500 SNP markers).

### 1. Estimating Covariance Matrices

First, we estimate the genotypic ($\mathbf{G}$) and phenotypic ($\mathbf{P}$) variance-covariance matrices from our raw phenotypic dataset.

```{r matrices}
library(selection.index)

# Load the synthetic phenotypic multi-environment dataset
data("maize_pheno")

# In maize_pheno: Traits are columns 4:6.
# Genotypes are in column 1, and Block/Replication is in column 3.
gmat <- gen_varcov(data = maize_pheno[, 4:6], genotypes = maize_pheno[, 1], replication = maize_pheno[, 3])
pmat <- phen_varcov(data = maize_pheno[, 4:6], genotypes = maize_pheno[, 1], replication = maize_pheno[, 3])
```

### 2. Defining Economic Weights

Next, we establish the relative economic priority of each trait. Economic weights ($\mathbf{w}$) explicitly define our strategic breeding objectives.

```{r weights}
# Define the economic weights for the 3 continuous traits
# (e.g., Yield, PlantHeight, DaysToMaturity)
weights <- c(10, -5, -5)
```

### 3. Calculating the LPSI

With the covariance matrices and economic weights specified, we integrate them into the primary `lpsi()` function, which evaluates the combinatorial multi-trait selection indices efficiently.

```{r lpsi}
# Calculate the Optimal Combinatorial Linear Phenotypic Selection Index (LPSI)
index_results <- lpsi(
  ncomb = 3,
  pmat = pmat,
  gmat = gmat,
  wmat = as.matrix(weights),
  wcol = 1
)
```

### 4. Evaluating Outcomes and Selecting Genotypes

Finally, we evaluate the theoretical gains. The `lpsi()` function returns a structured data frame containing the theoretical selection response ($R_I$) and other parameter estimates for all requested trait combinations.

```{r gains}
# View the top combinatorial indices, including their selection response (R_A)
head(index_results)

# Extract the phenotypic selection scores to strategically rank the parental candidates
# using the top evaluated combinatorial index
scores <- predict_selection_score(
  index_results,
  data = maize_pheno[, 4:6],
  genotypes = maize_pheno[, 1]
)

# View the top performing candidates designated for the next breeding cycle
head(scores)
```

### 5. Extension: Linear Marker Selection Index

The classical linear selection index theories seamlessly extend to marker-assisted genomic selection. If you have genome-wide marker profiles for your genotypes, you can incorporate them to estimate the Linear Marker Selection Index (LMSI). 

```{r marker_data, eval=FALSE}
# Load the associated synthetic genomic dataset (500 SNPs for the 100 genotypes)
data("maize_geno")

# Calculate the marker-assisted index combining our matrices and raw SNP profiles
marker_index_results <- lmsi(
  pmat = pmat,
  gmat = gmat,
  marker_scores = maize_geno,
  wmat = weights
)

summary(marker_index_results)
```

### 6. The Base Index and Index Efficiency

In scenarios where the phenotypic ($\mathbf{P}$) and genotypic ($\mathbf{G}$) matrices are poorly estimated (e.g., due to limited data), the true optimal coefficients ($\mathbf{b}$) can be systematically biased. The **Base Index** provides a robust, non-optimized alternative where coefficients are set strictly equal to the fixed economic weights ($I_B = \mathbf{w}'\mathbf{y}$).

```{r base_index}
# Calculate the Base Index and automatically compare its efficiency to the LPSI
base_results <- base_index(
  pmat = pmat,
  gmat = gmat,
  wmat = weights,
  compare_to_lpsi = TRUE
)

# Observe the expected genetic gains and efficiency comparison
base_results$summary
```

### 7. Heritability of the LPSI

The theory demonstrates that the correlation between the net genetic merit ($H$) and the expected index ($I$) differs from the traditional index heritability mathematically ($h^2_I \neq \rho^2_{HI}$). The `lpsi()` function intrinsically estimates both of these fundamental statistics:

```{r heritability}
# Extract the top combinatorial index results
top_index <- index_results[1, ]

# h^2_I: Heritability of the optimal index
top_index$hI2

# \rho_HI: Correlation between the LPSI and the true underlying Net Genetic Merit
top_index$rHI
```