\documentclass{article} \usepackage[dvipsnames,usenames]{color} \definecolor{darkgreen}{rgb}{0,.6,.2} \usepackage{amsmath} % \VignetteIndexEntry{An R Package for moving Grid Adjustment} \SweaveOpts{keep.source=TRUE} \usepackage{graphicx} \usepackage{setspace} \usepackage{varioref} \onehalfspacing \usepackage{url} \usepackage{Sweave} \usepackage[font=small,labelfont=bf]{caption} \usepackage{hyperref} \usepackage{ifthen} \usepackage{keyval} \hypersetup{pdfborder={0 0 0 0}, linkcolor ={blue}, colorlinks = {true}, citecolor = {darkgreen}, urlcolor = {VioletRed}} \begin{document} \title{R Package mvngGrAd: Moving Grid Adjustment In Plant Breeding Field Trials} \author{Frank Technow} \date{package version 0.1.6} \maketitle \section{Moving grid adjustment} Moving grid adjustment is a spatial method to adjust for environmental variation in field trials. It is most common in unreplicated plant breeding field trials. The most extreme form of unreplicated trials is single plant evaluation, for example for recurrent mass selection in populations. The trial is arranged in a spatial row-column like design. Each entry is then associated with a cell, respectively with a row and column number. A grid is constructed around each cell (= entry) and the mean of the cells included in the grid is calculated. The moving mean of the ith entry, denoted as $x_{i}$, is calculated as: \begin{equation} \label{eq:MvG} x_{i} = \frac{\sum_j p_{j,obs} \cdot I(p_{j,obs} \in G_{i})}{\sum_j I(p_{j, obs} \in G_{i})} \end{equation} The grid of entry $i$ is denoted by $G_{i}$ and $I(\cdot)$ is a indicator function that takes the value 1 if the condition is satisfied and 0 if not. The observed phenotypic values of all entries which could potentially be included in $G_{i}$ are denoted by $p_{j,obs}$. The value $x_{i}$ is taken as a measure of the growing conditions for the entry $i$ and is used as a covariate to calculate an adjusted phenotypic value ($p_{i, adj}$) according to the following formula: \begin{equation} \label{eq:adj} p_{i, adj} = p_{i, obs} - b (x_{i} - \bar{x}) \end{equation} Where $p_{i, obs}$ denotes the observed phenotypic value of the ith entry, $\bar{x}$ is the mean of all $x_{i}$ and $b$ the regression coefficient in the general linear model: \begin{equation} \label{eq:model} p_{i, obs} = a + b x_{i} \end{equation} with intercept $a$. The phenotypic value is a function of the genotypic value $g_i$ and the environmental error $e_i$. Before the adjustment this is \begin{equation} \label{eq:obsP} p_{i, obs} = g_{i} + e_{i} , \end{equation} and after \begin{equation} p_{i, adj} = g_{i} + e'_{i} \end{equation} with $e'_{i}$ being the $e_{i}$ after adjustment. When the adjustment was successful, \begin{equation} \label{eq:vare} var(e'_{i}) < var(e_{i}) \end{equation} and $p_{i, adj}$ is a better estimator of $g_{i}$ than $p_{i,obs}$. A direct consequence of this is that the heritability ($h^{2}$) will improve. The $h^{2}$ gives the proportion of variance observed that was attributable to genetic effects. The closer to one the value is, the better, because only genetic variance can be exploited by the breeder. The heritability before adjustment is calculated as \begin{equation} \label{eq:h2} h^2_{obs} = \frac{var(g)}{var(p_{obs})} \end{equation} where $var(g)$ is the \emph{exploitable} genetic variance and $var(p_{obs})$ is the variance of $p_{obs}$ (i.e. the phenotypic variance). The heritability after adjustment can then be calculated as \begin{equation} \label{eq:h2adj} h^2_{adj} = \frac{var(g)}{var(p_{adj})} \end{equation} and because of (\ref{eq:vare}) \begin{equation} \label{eq:hvsh} h^{2}_{adj} > h^{2}_{obs} \end{equation} The adjustment procedure can only be successful, and in fact valid, if two important conditions are met:\enlargethispage{2ex} \begin{enumerate} \item \label{firstCond} the entries must not influence the values of their covariates ($x_{i}$) and \item the entries must have been randomly assigned to positions in the field. \end{enumerate} \renewcommand{\thefootnote}{\fnsymbol{footnote}} If either one or both are violated, the adjustment can lead to erroneous results. If for example $p_{i,obs}$ is included in the calculation of $x_{i}$, the first condition is violated. The result is that the $p_{adj}$ of superior genotypes are downwardly biased and the $p_{adj}$ of inferior ones upwardly. When, as is the case many times, entries are not randomized but grouped according to family, the $p_{adj}$ of superior families are downwardly biased and the $p_{adj}$ of inferior ones upwardly. In the best case this leads to a reduction in $var(g)$\footnote[2]{The adjustment is of course something that is done to the data, and as such can not alter the population variance. To distinguish $var(g)$ from the true genetic variance in the population, I termed it \emph{exploitable genetic variance}.} which might reverse (\ref{eq:hvsh}) or at least reduces the superiority of $h^{2}_{adj}$. In the worst case, selection might be directed to the opposite of the intended direction! Please refer to \cite{Falconer} for details on quantitative genetics, to \cite{cox} for details on experimental design and covariate adjustment and to \cite{selMeth} for plant breeding designs and moving grids in particular. \begin{thebibliography}{8} \bibitem[a]{Falconer}D. S. Falconer: {\sl Introduction to Quantitative Genetics}. Oliver and Boyd., London, 1967. \bibitem[b]{cox}W. G. Cochran and G. M. Cox: {\sl Experimental Designs}. John Wiley \& Sons, Inc., New York, 1964. \bibitem[c]{selMeth} I. Bos and P. Caligari: {\sl Selection Methods in Plant Breeding}. Springer New York, 2008. \end{thebibliography} \section{The package} \subsection{Overview} The function performing the adjustment is called \texttt{movingGrid( )}. The function \texttt{sketchGrid( )} helps with designing the grid by plotting its shape. The functions \texttt{fitted( )}, \texttt{movingMean( )} and \texttt{entryData( )} are convenience functions to extract the most relevant information from the object created by \mbox{\texttt{movingGrid( )}}. The package defines a new class, \texttt{movG}, and provides methods for the functions \texttt{entryData( )}, \texttt{fitted( )}, \texttt{movingMean( )}, \texttt{summary( )}, \mbox{\texttt{show( )}} and \texttt{residuals( )}. What follows is a detailed tutorial on the usage of \mbox{\texttt{movingGrid( )}}. \pagebreak As a first step, the package needs to be loaded. <>= op <- options(); utils::str(op) options(width = 60) options(prompt = "R> ") @ <>= library(mvngGrAd) @ \subsection{Setting up the grid} The shape of the grid must be designed by the user. This is done via three arguments to \texttt{movingGrid( )}, \texttt{shapeCross}, \texttt{layers} and \texttt{excludeCenter}. With the help of \texttt{shapeCross} one specifies the cells that are to be included in a grid that extends from the center in 0, 90, 180 and 270 degree direction, with \texttt{layers} the cells that extend the grid in all other directions. The two can be used together or alone. With the argument \texttt{excludeCenter} one can decide whether or not to include $p_{i,obs}$ in the calculation of $x_{i}$. The usage is best described with examples. For designing the grid, function \texttt{sketchGrid( )} takes the same arguments as \texttt{movingGrid( )} and plots a visualization of the grid. This function should always be used to verify that the actual arguments to \texttt{shapeCross} and \texttt{layers} do create the intended grid. The argument to \texttt{shapeCross} is given in form of a list with four components, each representing one of the directions. \begin{enumerate} \item DOWN (180\textdegree) \item UP (0\textdegree) \item LEFT (270\textdegree) \item RIGHT (90\textdegree). \end{enumerate} Other arguments to \texttt{sketchGrid( )} are \texttt{i}, the row of the central cell; \texttt{j} the column of the central cell; \texttt{rowLimit}, the number of rows in the field layout; \mbox{\texttt{colLimit}}, the number of columns; and \texttt{excludeCenter}. If only one of \texttt{shapeCross} or \texttt{layers} is to be used, the other must be \texttt{NULL}. A grid that includes one cell above and below the center and four to its right and left and excludes the central cell can be created as follows\footnote[3]{The calls to \texttt{sketchGrid( )} produce always one plot. To save space, these plots are not included in this document, but instead figures are shown that combine several of them. These figures are produced from code that is included in the \texttt{sweave} file but is not echoed. However, the calls to \texttt{sketchGrid( )} are identical.} \footnote[4]{This is the grid used by a stand alone software called ``plabstat'' (\url{https://www.uni-hohenheim.de/plantbreeding/software/}) that is around since the 70's. With these settings, the results of ``plabstat'' can be reproduced.}(Figure \ref{fig:atod} [a]): <>= sketchGrid(i = 10, j = 10, rowLimit = 20, colLimit = 20, shapeCross = list(1, ## down 1, ## up 1:4, ## left 1:4),## right layers = NULL, excludeCenter = TRUE) @ \begin{figure}[hbt] \centering \includegraphics[width = 12cm]{mvngGrAd-aToD}\\ \caption{Grids for different settings of \texttt{shapeCross} ,\texttt{layers}, \texttt{excludeCenter} and \mbox{\texttt{i, j}}} \label{fig:atod} \end{figure} Using \texttt{shapeCross} as follows, creates a grid that also includes one cell above and below the center and four to its right and left but excludes the cells right next to the center (Figure \ref{fig:atod} [b]). \SweaveOpts{keep.source = FALSE} <>= sketchGrid(i = 10, j = 10, rowLimit = 20, colLimit = 20, shapeCross = list(2, 2, ## up 2:5, ## left 2:5),## right layers = NULL, excludeCenter = TRUE) @ Using \texttt{shapeCross} on its own, without \texttt{layers}, can often make sense, using \texttt{layers} on its own usually does not. However, it is introduced independently from \texttt{shapeCross} to make its usage clear. The actual argument to \texttt{layers} is an integer vector. Each integer represents a ``layer'' of cells included in the grid. Integer \texttt{1} would include the cells that are right next to the center, apart from the ones in 0, 90, 180 and 270 degree direction (Figure \ref{fig:atod} [c]): <>= sketchGrid(i = 10, j = 10, rowLimit = 20, colLimit = 20, shapeCross = list(NULL, ## down NULL, ## up NULL, ## left NULL),## right layers = 1, excludeCenter = TRUE) @ and integer \texttt{2} the ones on top of that (Figure \ref{fig:atod} [d]): <>= sketchGrid(i = 10, j = 10, rowLimit = 20, colLimit = 20, shapeCross = list(NULL, ## down NULL, ## up NULL, ## left NULL),## right layers = 1:2, excludeCenter = TRUE) @ and so on. By using \texttt{shapeCross} and \texttt{layers} jointly, more complex grids can be created. For example a honeycomb shape, which might be most suited for single plant evaluation (Figure \ref{fig:etoh} [e]): <>= sketchGrid(i = 10, j = 10, rowLimit = 20, colLimit = 20, shapeCross = list(1:4, 1:4, 1:4, 1:4), layers = c(1:4), excludeCenter = TRUE) @ However, one should keep in mind that \texttt{sketchGrid( )} can not be made to display any scale differences between row and column widths. So for unequal row and column width, the grid will look different on the field than the output of \texttt{sketchGrid( )}. In this case \texttt{sketchGrid( )} will only help to see which cells are included. Setting the argument \texttt{excludeCenter} to \texttt{FALSE} will include the central cell, the one with the entry whose $p_{obs}$ is to be adjusted, in the calculation\enlargethispage{2ex} of $x_{i}$ (Figure \ref{fig:etoh} [f]): <>= sketchGrid(i = 10, j = 10, rowLimit = 20, colLimit = 20, shapeCross = list(1:4, 1:4, 1:4, 1:4), layers = c(1:4), excludeCenter = FALSE) @ \begin{figure}[hbt] \centering \includegraphics[width = 12cm]{mvngGrAd-eToH} \caption{Grids for different settings of \texttt{shapeCross} ,\texttt{layers}, \texttt{excludeCenter} and \mbox{\texttt{i, j}}} \label{fig:etoh} \end{figure} This is of course in clear violation of the first condition for a valid adjustment (see Section \vref{firstCond}). In fact one may consider to exclude the cells right next to the center as well. This way the effects of possible interactions between the entry in the central cell and its neighbors, do not influence $x_i$ either (Figure \ref{fig:etoh} [g]): <>= sketchGrid(i = 10, j = 10, rowLimit = 20, colLimit = 20, shapeCross = list(2:4, 2:4, 2:4, 2:4), layers = c(2:4), excludeCenter = TRUE) @ For cells close to the border of the experimental area, only incomplete grids can be constructed of course. This is correctly displayed by \texttt{sketchGrid( )} (Figure \ref{fig:etoh} [h]): <>= sketchGrid(i = 2, j = 2, rowLimit = 20, colLimit = 20, shapeCross = list(2:4, 2:4, 2:4, 2:4), layers = c(2:4), excludeCenter = TRUE) @ \subsection{Using \texttt{movingGrid( )}} To demonstrate the function \texttt{movingGrid ( )} itself, data needs to be generated first. The experimental area of the simulated field trial consists of 50 rows and 50 columns. The row and column subscripts of each cell are stored because they are part of the input to \texttt{movingGrid( )}. The two vectors must of course correspond to each other. <>= ## row vector rows <- rep(1 : 50, each = 50) ## column vector cols <- rep(1 : 50, 50) @ The population parameters are as follows: \begin{itemize} \item population mean $\mu$ = 30 \item genotypic variation $var(g)$ = 25 \item environmental (error) variation $var(e)$ = 45 \end{itemize} The environmental errors $e$ are simulated in such a way that there is a strong systematic horizontal gradient in growing conditions present in the experimental area, together with some unsystematic small scale noise. <>= set.seed(13) envError <- rep(c(seq(from = -12.5 , to = -0.5, by = 0.5), seq(from = 0.5 , to = 12.5, by = 0.5)), each = 50) + rnorm(2500) @ \pagebreak The following code will scale this vector to have a variance of 45. <>= scaleFactE <- (sd(envError) / sqrt(45)) envError <- scale(envError, center = FALSE, scale = scaleFactE) envError <- as.vector(envError) @ The genotypic effects of the 2500 entries are simulated with zero mean and variance of 25 (because of some sampling variation the vector is additionally scaled to assure a variance of 25). <>= gEffects <- rnorm(2500, mean = 0, sd = 5) scaleFactG <- (sd(gEffects)/sqrt(25)) gEffects <- scale(gEffects, center = FALSE, scale = scaleFactG) gEffects <- as.vector(gEffects) @ To arrive at the genotypic values $g$, $\mu$ is added to \texttt{gEffects} <>= ## population mean mu <- 30 gValues <- mu + gEffects @ The observed phenotypic values ($p_{obs}$) are then obtained by adding \texttt{gValues} to \texttt{envError} according to (\ref{eq:obsP}). <>= obsP <- gValues + envError @ To observe the handling of \texttt{NA} values, the third element of \texttt{obsP} is set to \texttt{NA}. <>= obsP[3] <- NA @ Finally the phenotypic and genetic variance are stored for later use. Note that $var(p_{obs})$ will not be exactly 70 as it should be, but slightly different due to minimal, ``random'' covariance between \texttt{gEffects} and \texttt{envError}. <>= ## phenotypic variance varP <- var(obsP,na.rm = TRUE) ## genotypic variance varG <- var(gEffects) @ The row and column vectors are given to the arguments \texttt{rows} and \texttt{columns} and the $p_{obs}$ to \texttt{obsPhe} of \texttt{movingGrid( )} The grid is designed by \texttt{shapeCross} and \texttt{layers} as demonstrated above. The argument \texttt{excludeCenter} is \texttt{TRUE} by default. \pagebreak \SweaveOpts{keep.source = TRUE} <>= ## creates object of class movG resMG <- movingGrid(rows = rows, columns = cols, obsPhe = obsP, shapeCross = list(1:2, 1:2, 1:2, 1:2), layers = 1) @ A summary of the adjustment process can be obtained by the function \mbox{\texttt{summary( )}}. \SweaveOpts{keep.source = FALSE} \enlargethispage{2ex} <>= summary(resMG) @ The first line names the function used. In the second line the number of values (cells) in a complete grid is given. The mean number of values in the third line is always smaller, because for entries close to the edge of the experimental area only partial grids can be constructed. NA values also have this effect. The fourth line gives the number of observations\footnote[4]{An ``observation'' is everything mentioned in the vectors given to \texttt{rows} and \texttt{columns}, irrespective of whether the corresponding data entry in the vector to \texttt{obsPhe} is \texttt{NA} or not.} and how many of these were NA. The Pearson coefficient of correlation between $x_i$ and $p_{i, obs}$ gives information on the strength of the influence of the covariate $x_i$. If this value is too low, an adjustment will not yield better estimates of $g$ than by using $p_{obs}$ directly. In rare cases, for example under strong inter plant competition, the correlation coefficient can be below zero. In such a case one should not perform an adjustment. The next line gives $b$, the regression coefficient used for the adjustment [see (\ref{eq:adj})]. Finally, the dimensions of the experimental \enlargethispage{2ex} area are given.\footnote[5]{The function \texttt{movingGrid( )} automatically assumes a rectangular experimental area with the given dimensions. Therefore, the number of cells in the ``virtual'' experimental. area is $rows \times columns$. If the number of observations is smaller than this, the remaining cells are filled with \texttt{NA} values. These are not ``observations'' and are not included in the number of NA values given in the summary. They also do not influence the results of \texttt{movingGrid( )} and are ignored by the various extractor functions.} The function \texttt{entryData( )} extracts all information on an entry available, its row, column, $p_{i,adj}$, $p_{i,obs}$, $x_{i}$ and the number of values used for calculating its $x_{i}$. <>= ## only first 10 head(entryData(resMG)) @ For the third element, which is \texttt{NA}, no $p_{adj}$ was calculated of course. The moving mean $x_{i}$ is, however, available. The $p_{adj}$ can be obtained by using the function \texttt{fitted( )}. <>= ## only first 10 fitted(resMG)[1:10] @ The $x$ are extracted by \texttt{movingMean( )}. <>= ## only first 10 movingMean(resMG)[1:10] @ The residuals for the model in (\ref{eq:model}) are returned by the function \texttt{residuals( )}. <>= residuals(resMG)[1:10] @ \vspace{4ex} The $h^2_{obs}$ is \Sexpr{round(varG/varP,3)} <>= varG/varP @ while the $h^2_{adj}$ is with \Sexpr{round(varG/var(fitted(resMG),na.rm=TRUE),3)} considerably higher. <>= ## variance of the adj. phenot. values varPadj <- var(fitted(resMG),na.rm = TRUE) varG/varPadj @ This demonstrates a tremendous improvement due to the adjustment procedure. In reality the improvement will usually not be this much because the contribution of random noise, which can not be picked up by moving grid adjustment, is much larger than in this example. <>= layout(rbind(c(1,2), c(3,4))) sketchGrid(i = 10, j = 10, rowLimit = 20, colLimit = 20, shapeCross = list(1, ## down 1, ## up 1:4, ## left 1:4),## right layers = NULL, excludeCenter = TRUE) title(main = "[a]") sketchGrid(i = 10, j = 10, rowLimit = 20, colLimit = 20, shapeCross = list(2, ## down 2, ## up 2:5, ## left 2:5),## right layers = NULL, excludeCenter = TRUE) title(main = "[b]") sketchGrid(i = 10, j = 10, rowLimit = 20, colLimit = 20, shapeCross = list(NULL, ## down NULL, ## up NULL, ## left NULL),## right layers = 1, excludeCenter = TRUE) title(main = "[c]") sketchGrid(i = 10, j = 10, rowLimit = 20, colLimit = 20, shapeCross = list(NULL, ## down NULL, ## up NULL, ## left NULL),## right layers = 1:2, excludeCenter = TRUE) title(main = "[d]") @ <>= layout(rbind(c(1,2), c(3,4))) sketchGrid(i = 10, j = 10, rowLimit = 20, colLimit = 20, shapeCross = list(1:4, 1:4, 1:4, 1:4), layers = c(1:4), excludeCenter = TRUE) title(main = "[e]") sketchGrid(i = 10, j = 10, rowLimit = 20, colLimit = 20, shapeCross = list(1:4, 1:4, 1:4, 1:4), layers = c(1:4), excludeCenter = FALSE) title(main = "[f]") sketchGrid(i = 10, j = 10, rowLimit = 20, colLimit = 20, shapeCross = list(2:4, 2:4, 2:4, 2:4), layers = c(2:4), excludeCenter = TRUE) title(main = "[g]") sketchGrid(i = 2, j = 2, rowLimit = 20, colLimit = 20, shapeCross = list(2:4, 2:4, 2:4, 2:4), layers = c(2:4), excludeCenter = TRUE) title(main = "[h]") @ \end{document}