% \VignetteIndexEntry{Assigning alleles to isoloci in polysat} % \VignetteDepends{ape, polysat} \documentclass{article} \usepackage{hyperref} \title{Assigning alleles to isoloci in \textsc{polysat}} \date{\today} \author{Lindsay V. Clark \\ University of Illinois, Urbana-Champaign} \hyphenation{processDatasetAllo} \hyphenation{alleleCorrelations} \hyphenation{testAlGroups} \begin{document} \SweaveOpts{concordance=TRUE} \maketitle \section{Introduction} This tutorial accompanies the R package \textsc{polysat} versions 1.6 and later, and demonstrates how to use functions in \textsc{polysat} for recoding data from allopolyploid or diploidized autopolyploid organisms such that each microsatellite marker is split into multiple isoloci. The data can then be analyzed under the model of random, Mendelian segregation; for example an allotetraploid organism can be treated as diploid, giving the user access to a much greater number of analyses both with \textsc{polysat} and with other software. If you are not sure whether your organism is fully polysomic or whether it has two or more subgenomes that segregate independently from each other, keep reading. In the ``Quality of clustering'' section I illustrate what the results look like if a locus (Loc7 in the example dataset) segratates in a polysomic (autopolyploid) manner. The methods described in this tutorial are designed for natural populations of one hundred or more individuals. They will also work for certain types of mapping populations, such as an F2 population derived from two inbred grandparents. They will be much less effective for small sample sizes ($<50$ for tetraploids, $<200$ for higher ploidies), and for datasets with multiple species or highly-diverged ecotypes. The methods in this tutorial are also inappropriate for species that reproduce asexually. If you have duplicate genotypes in your dataset (either as a result of asexual reproduction, vegatative spread, or sampling the same individual multiple times) you should remove them before starting. It is assumed that the reader is already familiar with R and with \textsc{polysat}. If you aren't, please spend some time with ``An Introduction to R'' and with the ``polysat version 1.7 Tutorial Manual''. Additionally, there is a manuscript published in \emph{Molecular Ecology Resources} at \url{http://dx.doi.org/10.1111/1755-0998.12639} that describes in detail the rationale and limitations of the allele assignment tools described in this manual. You may also find the documentation for individual functions to be helpful (\emph{e.g.} see \texttt{?alleleCorrelations}, \texttt{?processDatasetAllo}, and \texttt{?recodeAllopoly}). I'm always happy to answer questions or help debug errors over email, although I appreciate it if you first take the time to read the documentation carefully and check your work. \section{Data hygiene} Below I have some recommendations for generating and cleaning up datasets so that they will have the greatest success with the allele assignment algorithm. \subsection{Before you begin genotyping} Choose your markers well. If your research group has previously run microsatellite markers on your species of interest, choose markers that have given clear, consistent, and easy-to-interpret patterns of amplification in the past. If a linkage map has been published for your species, use markers that are on the map; low quality markers would have given segregation distortion and would not have been mappable. Lastly, dinucleotide repeat markers should be used with caution, as their high mutation rate increases the chance of homoplasy occuring, and their high degree of stutter can make amplification patterns difficult to interpret in polyploids. Markers with trinucleotide or larger repeats will give cleaner results, despite not having as many alleles. Avoid multiplexing several markers in one PCR reaction. Multiplexing increases the probability of scoring error due to allelic dropout. Run each marker in a separate reaction, then, if desired for your electrophoresis method, pool the reactions post-PCR. Use the highest resolution electrophoresis method available to you. Agarose gels may lack the resolution to distinguish all alleles from each other. If using acrylamide gels, make an allelic ladder by pooling PCR products from several diverse individuals, then run the same ladder on each gel to ensure consistency of scoring. If using a capillary sequencer, always use the same size standard, and be consistent in terms of which fluorescent dye goes with which marker. ``Oh no! I've already run all of my microsatellite markers, and I don't have time to re-do them!'' It's okay. Everything above was just a suggestion to improve the probability that the method described in this vignette will work on your dataset. The method may still work, and if it doesn't, you can consider whether failing to meet one of the above suggestions could have caused the problem. \subsection{Scoring your microsatellite alleles} If you know someone in your lab or at your institution who has a lot of experience scoring microsatellites (and if you are not very experienced), get them to sit down with you for a couple hours and demonstrate how they would score the markers in your dataset. Understanding the additivity of overlapping stutter and allele peaks is important, as is knowing how to distiguish true alleles from PCR artifacts and dye blobs, and knowing how to check that the software interpreted the size standard correctly. Don't trust any piece of software to score your markers. Most of them were optimized for diploid species, and even then they have a lot of problems. After the software (e.g. GeneMapper or \href{http://www.vgl.ucdavis.edu/informatics/strand.php/}{STRand}) has called the alleles, you need to manually inspect every genotype, and you will probably correct a lot of them. Consistency is crucial. One allele may give a pattern of multiple peaks, so you need to decide which peak to score, and whether to round up or down to get the size in nucleotides. If using software that performs ``binning'', go through and correct all of the allele calls before using them to make bins. Using STRand, I like to take screenshots to indicate how I score each marker, then I can easily look at them months later when I genotype additional individuals. \subsection{Preliminary analysis of the data} In this section I'll use a simulated dataset to demonstrate how to clean up your data and split it into subpopulations if necessary. The same simulated dataset will be used throughout the rest of this manual. <<>>= library(polysat) data(AllopolyTutorialData) summary(AllopolyTutorialData) # make a copy of the dataset to modify mydata <- AllopolyTutorialData @ Note that datasets can be imported using any of the normal import functions for \textsc{polysat} (like \texttt{read.GeneMapper}). The \texttt{data} function here is used only because this is an example dataset installed with the package. First, any questionable genotypes should be removed. If an electropherogram or banding pattern was unclear, it is best to replace that genotype with missing data. Any duplicate or highly similar genotypes that likely represent the same individual (or a group of asexually derived individuals) should be removed, such that each genotype is only represented once in the dataset. (The \texttt{assignClones} function might be useful for identifying duplicates if you haven't already done so.) Since this is an allotetraploid, let's also make sure that no genotypes have more than four alleles, and eliminate any that do. <<>>= # Calculate the length of each genotype vector (= the number of alleles) and # construct a TRUE/FALSE matrix of whether that number is greater than four. tooManyAlleles <- apply(Genotypes(mydata), c(1,2), function(x) length(x[[1]])) > 4 # Find position(s) in the matrix that are TRUE. which(tooManyAlleles, arr.ind=TRUE) # 43rd sample, second locus # Look at the identified genotype, then replace it with missing data. Genotype(mydata, 43, 2) Genotype(mydata, 43, 2) <- Missing(mydata) Genotype(mydata, 43, 2) @ Next, we'll want to look at population structure in the dataset. We'll make a square matrix of genotype dissimilarities using a simple band-sharing metric, then make a neighbor-joining tree. <>= mydist <- meandistance.matrix(mydata, distmetric=Lynch.distance, progress=FALSE) @ <>= load("vignettebuild/AllopolyTutorialDist.RData") @ <>= require(ape) mynj <- nj(mydist) plot(mynj, type="unrooted") @ You can see that individuals 301, 302, and 303 are highly dissimilar from the rest. This is what it looks like when some individuals are a different species. Because of the fast rate at which microsatellites mutate, allele assignments that we make in one species are very unlikely to apply to another species. We will remove these three individuals from the dataset. <<>>= mydata <- deleteSamples(mydata, c("301","302","303")) @ Now let's examine the rest of the dataset for population structure using principal coordinates analysis. <>= par(mfrow=c(2,1)) mypca <- cmdscale(mydist[Samples(mydata), Samples(mydata)]) plot(mypca[,1], mypca[,2]) hist(mypca[,1], breaks=30) @ We can see a slightly bimodal distribution of individuals, indicating moderate population structure. Since population structure can interfere with preliminary allele assignment in the \texttt{alleleCorrelations} step, we will assign individuals to two populations that we can analyze separately. This will also help us to evaluate reproducibility of allele assignments. <<>>= pop1ind <- Samples(mydata)[mypca[,1] <= 0] pop2ind <- Samples(mydata)[mypca[,1] > 0] PopInfo(mydata)[pop1ind] <- 1 PopInfo(mydata)[pop2ind] <- 2 @ Of course, principal coordinates analysis is not the only method of assigning individuals to populations. If you have already analyzed your data with other software (for example, Structure) before importing it into \textsc{polysat}, you could instead use population assignments from that software, and construct the \texttt{PopInfo} vector accordingly to indicate population assignments. \section{The \textsc{polysat} algorithm for allele assignment} \subsection{General considerations and parameters} To assign alleles to isoloci, you will primarily be using the function \texttt{processDatasetAllo}. This function internally calls two other functions from \textsc{polysat}: \texttt{alleleCorrelations} looks for negative correlations between alleles and uses those correlations to make preliminary assignments, then \texttt{testAlGroups} adjusts those assignments if necessary after checking them against individual genotypes. \texttt{testAlGroups} has some parameters that affect its accuracy depending on the rate of meiotic error (pairing between homeologs or paralogs during meiosis), the presence of null alleles, and homoplasy (different alleles with identical amplicon size) between isoloci. Below are some recommendations for adjusting arguments from the defaults depending on particular issues in the dataset. \vspace{5mm} \begin{tabular}{| l | l |} \hline Many alleles ($>10$) & Increase \texttt{R} (the number of allele swaps attempted) \\ Meiotic error & Increase \texttt{tolerance} \\ Null alleles & \texttt{null.weight = 0} \\ Null alleles at high frequency & \texttt{swap = FALSE} \\ Homoplasy & \texttt{swap = FALSE} \\ \hline \end{tabular} \vspace{5mm} Because you may not know whether the last four issues are present in your dataset, I recomment trying several parameter sets. The function \texttt{processDatasetAllo} tests several parameter combinations across all loci in the dataset. \subsection{Running the algorithm} Now we will use \texttt{processDatasetAllo} to make allele assignments. Because this is an allotetraploid we will use \texttt{n.subgen = 2} (two subgenomes) and \texttt{SGploidy = 2} (each subgenome is diploid). Because we divided our dataset into two subpopulations that appear to have different allele frequencies, we will set \texttt{usePops = TRUE} to analyze the two populations separately according to \texttt{PopInfo(mydata)}. We will leave the \texttt{parameters} argument at the default, which includes one parameter set optimized for no null alleles and no homoplasy, one optimized for homoplasy, and two optimized for null alleles. This command will take a few minutes to process (especially since the data set includes deliberately problematic loci, which take longer). Here I have set \texttt{R} to 500, which also increases processing time but makes it more likely that your results will look similar or identical to mine. <>= myassign <- processDatasetAllo(mydata, n.subgen = 2, SGploidy = 2, usePops = TRUE, R = 500) @ <>= load("vignettebuild/AllopolyTutorialAssign.RData") @ \subsection{Inspecting the results} \subsubsection{Warnings about positive correlations} We got a warning about Loc6 for both the populations. If population structure were a serious problem, we would have gotten warnings about most or all loci. Let's take a look at which alleles had positive correlations for Loc6, to see if there may have been a scoring problem. The \texttt{myassign} object is a list, and it has an element called \texttt{AlCorrArray} that contains the results of \texttt{alleleCorrelations} for each locus and population. Each set of results is also a list. The list element called \texttt{significant.pos} indicates any significantly positive associations between alleles. We will look at the Loc6 results for both populations. <<>>= myassign$AlCorrArray[["Loc6", 1]]$significant.pos myassign$AlCorrArray[["Loc6", 2]]$significant.pos @ We see positive correlations between alleles 330 and 327, 339 and 336, and 348 and 345. Since these are trinucleotide repeats, it looks like some of the larger alleles (which would tend to have more stutter) had stutter peaks mis-called as alleles. If this were a real dataset, I would say to go back to the gels or elecropherograms and call the alleles more carefully. Since this is a simulated dataset, we will simply exclude Loc6 from further analysis. <<>>= mydata <- deleteLoci(mydata, loci="Loc6") @ Occasionally, a locus with no population structure or scoring error will produce a warning about positive correlations between alleles simply due to sampling error in the dataset. If you only get the warning for one locus and population in your dataset, and it does not look like there was scoring error (\emph{i.e.} the alleles that are positively correlated are not close in amplicon size), you can safely ignore the warning. \subsubsection{Quality of clustering} In your working directory, you should now find a file called ``alleleAssignmentPlots.pdf'' containing several plots to help you evaluate clustering quality. We will recreate some of those plots here and discuss their interpretation. <>= par(mfrow = c(1,1)) plotSSAllo(myassign$AlCorrArray) @ This plot gives an indication of clustering quality using the K-means method (from \texttt{alleleCorrelations}) alone, and is intended to help identify problematic loci or populations. The x-axis essentially shows how much of the variation in \textit{P}-values for correlations between alleles could be explained by clustering alleles into groups. The y-axis indicates how similar in size the groups of alleles are, \textit{e.g} two groups of five alleles each would give a high value, and one group with one allele and one group with nine alleles would give a low value. So, loci and populations in the upper-right quadrant performed well in allele assignment, but others should be subject to scrutiny. In the last section we identified Loc6 as a locus that we should exclude. In this plot, for both populations, Loc6 has a relatively low value on the x-axis. It is also marked in italics to indicate that there were positive correlations between alleles, in case we hadn't noticed this problem when the algorithm ran. It looks like Loc7 may also have a problem, given that it got a low value on the y-axis in both populations. We can also make heatmaps of the \textit{P}-values for correlations between alleles to get a sense of how well the clustering went. Let's compare a locus that seemed to work very well (Loc5) to one that we expect to cluster poorly (Loc6), then compare them to Loc7. <>= heatmap(myassign$AlCorrArray[["Loc5", "Pop1"]]$heatmap.dist, main = "Loc5, Pop1") @ The big red blocks neatly divide the alleles into two groups. <>= # A plot to show how the colors correspond to p-values in the # heat map; you can repeat this for the other heat maps in this # tutorial if you wish. plot(x=seq(min(myassign$AlCorrArray[["Loc5", "Pop1"]]$heatmap.dist), max(myassign$AlCorrArray[["Loc5", "Pop1"]]$heatmap.dist), length.out=12), y=rep(1,12), xlab="P-values", ylab="", bg=heat.colors(12), pch=22, cex=3) @ <>= heatmap(myassign$AlCorrArray[["Loc6", "Pop1"]]$heatmap.dist, main = "Loc6, Pop1") @ For Loc6, it is not nearly as obvious how to divide the alleles into two groups. <>= heatmap(myassign$AlCorrArray[["Loc7", "Pop1"]]$heatmap.dist, main = "Loc7, Pop1") @ <>= heatmap(myassign$AlCorrArray[["Loc7", "Pop2"]]$heatmap.dist, main = "Loc7, Pop2") @ For Loc7, the alleles also don't seem to go into two neat groups. In fact, the dendrograms on both heatmaps first split off a single allele from the rest, but that allele is different for Pop1 (125) versus Pop2 (150). When I simulated Loc7, I actually simulated it as being a single tetrasomic locus instead of a pair of disomic isoloci. If you have a large sample size and no positive correlations between alleles, but the results for all of your loci look like this, you can probably treat the data as being autopolyploid (polysomic). It is also quite possible for an organism to have some disomic loci and some polysomic loci. We will leave Loc7 in the dataset but won't attempt to do allele assignment or genotype recoding with it. \subsection{Evaluating the parameter sets} Now we will take a look at the results of \texttt{testAlGroups}. As you may recall, there are several parameters that can be adjusted for that function, and we tried out the default four sets of parameters with \texttt{processDatasetAllo}. One way to evaluate the quality of the assignments produced by \texttt{testAlGroups} is by seeing how many alleles it decided to make homoplasious, \emph{i.e.} how many alleles were assigned to multiple isoloci. Let's plot those proportions separately for the two populations. <>= plotParamHeatmap(myassign$propHomoplasious, popname = "Pop1", main = "Proportion homoplasious loci:") @ Lighter colors indicate fewer homoplasious alleles, which in turn indicates better quality allele assignment. We will ignore Loc6 and Loc7 since we already decided to exclude them. In Pop1, Loc4 worked best with the third and fourth parameter sets (allowing null alleles) and Loc1 worked best with the second, third, or fourth parameter sets (allowing homoplasy or null alleles), but the other loci worked equally well regardless of parameter set. <>= plotParamHeatmap(myassign$propHomoplasious, popname = "Pop2", main = "Proportion homoplasious loci:") @ For all loci in Pop2 except for Loc 1, we see the same proportion of homoplasious loci regardless of parameter set. \texttt{processDatasetAllo} also runs the \texttt{mergeAlleleAssignments} function to combine allele assignments across populations, within loci and parameter sets. We take a look at how many more homoplasious alleles we get after merging the assignments (since the same allele may have been assigned to different isoloci in different populations). <>= plotParamHeatmap(myassign$propHomoplMerged, popname = "Merged across populations", main = "Proportion homoplasious loci:") @ Now we see that Loc6 and Loc7 look a lot worse, since their allele assignments were different between populations. For Loc1, the second parameter set (optimized for homoplasy) was clearly the best in terms of minimizing homoplasious alleles. For Loc2, the first or third parameter set (using allele swapping) works best. For Loc3, Loc4, and Loc5, we have the same number of homoplasious alleles regardless of parameter set. We can also examine, for each locus and parameter set, what proportion of genotypes would be replaced with missing data if we used that set of allele assignments to recode the data. Genotypes are recoded as missing if homoplasy prevents them from being unambiguously determined. \texttt{processDatasetAllo} runs \texttt{recodeAllopoly} in order to determine these missing data rates. When it does so, it uses the \texttt{allowAneuploidy=FALSE} option so that genotypes that have too many alleles ($>$ \texttt{SGploidy}) for an isolocus will also be recoded as missing. <>= plotParamHeatmap(myassign$missRate, popname = "All Individuals", main = "Missing data after recoding:") @ This plot mirrors the plot reporting the proportion of loci that are homoplasious, since homoplasy tends to lead to genotypes that cannot be unambiguously recoded. \texttt{processDatasetAllo} suggests a best set of allele assignments based on missing data after recoding and the proportion of homoplasious alleles. For each locus, it picks the parameter set that results in the lowest missing data rate, then in the case of a tie the parameter set with the least amount of homoplasy (after merging across populations), then in the case of a tie the lowest numbered parameter set. So in this case, parameter set 2 would be chosen for Loc1, and parameter set 1 for all other loci. Let's extract that list of optimal assignments from the results. We will also eliminate Loc6 and Loc7 from the list since we discarded Loc6 and don't want to recode Loc7. <<>>= myBestAssign <- myassign$bestAssign myBestAssign myBestAssign <- myBestAssign[1:5] @ We can see a large amount of homoplasy for Loc7 and some for Loc6, since the allele assignment algorithm was not appropriate for those two loci. We can also see that Loc1 and Loc4 each have one homoplasious allele. This accounts for the much higher proportion of missing data when recoding Loc1 and Loc4 as opposed to Loc2, Loc3, and Loc5. We can also extract from the results the proportion of missing data that is introduced by recoding genotypes using the best allele assignments. <<>>= apply(myassign$missRate, 1, min) @ Loc1 would have 52\% missing data after being recoded using the best set of allele assignments, Loc2 would have 0.7\% missing data, \emph{etc}. It is possible to take a deeper look at the allele assignments that were produced before the best set was chosen. Let's look at Loc1, since it has such a high missing data rate after recoding. <<>>= myassign$AlCorrArray[["Loc1", "Pop1"]]$Kmeans.groups myassign$AlCorrArray[["Loc1", "Pop2"]]$Kmeans.groups myassign$AlCorrArray[["Loc1", "Pop1"]]$UPGMA.groups myassign$AlCorrArray[["Loc1", "Pop2"]]$UPGMA.groups @ Both K-means clustering and UPGMA placed allele 236 in a different isolocus depending on population. This is the same allele that was ultimately assigned to be homoplasious in the ``best'' assignment set. \texttt{testAlGroups} results for Loc1: <<>>= myassign$TAGarray[["Loc1", "Pop1", 1]]$assignments myassign$TAGarray[["Loc1", "Pop2", 1]]$assignments myassign$mergedAssignments[["Loc1", 1]]$assignments myassign$TAGarray[["Loc1", "Pop1", 2]]$assignments myassign$TAGarray[["Loc1", "Pop2", 2]]$assignments myassign$mergedAssignments[["Loc1", 2]]$assignments myassign$TAGarray[["Loc1", "Pop1", 3]]$assignments myassign$TAGarray[["Loc1", "Pop2", 3]]$assignments myassign$mergedAssignments[["Loc1", 3]]$assignments myassign$TAGarray[["Loc1", "Pop1", 4]]$assignments myassign$TAGarray[["Loc1", "Pop2", 4]]$assignments myassign$mergedAssignments[["Loc1", 4]]$assignments @ We can see that with parameter set 1, allele 236 was made homoplasious in Pop1 and allele 227 was made homoplasious in both populations, resulting in both being homoplasious in the merged set. With parameter set 2, however, assignments were consistent between the two populations. Parameter set 3 produced allele assignments that were very different from those produced by parameter sets 1 and 2. The parameter set 4 assignments matched the ``best'' assignment set for Pop2, but not Pop1. \subsection{Re-analyzing individual loci} For problematic loci, you may wish to try different population splits or parameter sets. It is not necessary to re-run \texttt{processDatasetAllo} on the entire dataset in order to accomplish this; we can run \texttt{alleleCorrelations} and \texttt{testAlGroups} on individual loci instead. Continuing to examine Loc1, what assignments do we get from K-means clustering and UPGMA if we don't split the datset into two populations? <<>>= corrLoc1AllInd <- alleleCorrelations(mydata, locus = "Loc1", n.subgen = 2) corrLoc1AllInd$Kmeans.groups corrLoc1AllInd$UPGMA.groups @ Analyzing both populations together gives us clustering identical to the previous clustering from Pop1. We'll try \texttt{testAlGroups} with a parameter set identical to parameter set 2 from our previous analysis. We'll also try increasing \texttt{tolerance} a little bit to see if it eliminates homoplasy. Increasing \texttt{tolerance} could be useful if Loc1 does not in fact segregate in a completely disomic manner. <<>>= TaLoc1.param2 <- testAlGroups(mydata, corrLoc1AllInd, SGploidy = 2, null.weight = 0.5, tolerance = 0.05, swap = FALSE) TaLoc1.param5 <- testAlGroups(mydata, corrLoc1AllInd, SGploidy = 2, null.weight = 0.5, tolerance = 0.1, swap = FALSE) TaLoc1.param2$assignments TaLoc1.param5$assignments @ In both cases we get the same allele assignments as before, so we will leave it as is. Hypothetically though, if we liked the new assignments better and wanted to use them to replace the old assignments, here is how it would be done: <<>>= myBestAssign[[1]] <- TaLoc1.param5 @ \subsection{Recoding the data} Now that we've chosen sets of allele assignments to use and thrown away loci that had problems, we can recode the dataset. <<>>= recodedData <- recodeAllopoly(mydata, myBestAssign) summary(recodedData) @ You'll notice that we have more loci now that each marker (except Loc7) has been split into two isoloci. We also have a lot of missing data, since homoplasy can lead to uncertainty about what the true genotype is. We had homoplasy for Loc1 and Loc 4. <<>>= for(L in Loci(recodedData)){ proportionmissing <- mean(isMissing(recodedData, loci=L)) cat(paste(L,":",proportionmissing,"missing"),sep="\n") } @ You'll also notice that not the entire dataset is diploid. That is because there is some meiotic error in the dataset, and we used \texttt{allowAneuoploidy = TRUE} in \texttt{recodeAllopoly}. Most genotypes are diploid though. <<>>= table(Ploidies(recodedData)) @ We can manually add in the ploidy for Loc7, since it was not recoded. <<>>= Ploidies(recodedData)[,"Loc7"] <- 4 @ The recoded data may now be analyzed with any \textsc{polysat} function, or exported to other software using any of the various \texttt{write} functions in polysat. For example, it is now appropriate to estimate allele frequencies from the data and use those to estimate $G_{ST}$. <<>>= myfreq <- simpleFreq(recodedData) myGst <- calcPopDiff(myfreq, metric = "Gst") myGst @ We can also export the data if we want to do analysis using other software. <<>>= write.GeneMapper(recodedData, file = "tutorialRecodedData.txt") @ \section{The Catal\'{a}n method of allele assignment} An alternative method of allele assignment available in \textsc{polysat} is that by Catal\'{a}n \emph{et al.} (2006; \url{http://dx.doi.org/10.1534/genetics.105.042788}). The Catal\'{a}n method does not allow for homoplasy, null alleles, or meiotic error, but may perform equally to the \textsc{polysat} method in cases of strong population structure, with less processing time. Let's try it on our example dataset. <<>>= catResults <- list() length(catResults) <- length(Loci(mydata)) names(catResults) <- Loci(mydata) for(L in Loci(mydata)){ cat(L, sep="\n") catResults[[L]] <- catalanAlleles(mydata, locus=L, verbose=TRUE) } @ Assignments are returned for Loc5 only. For Loc2 and Loc3, the algorithm found the correct allele assignments, but did not return them since some genotypes were inconsistent with those assignments due to meiotic error. The results of \texttt{catalanAlleles} can be passed to \texttt{recodeAllopoly} in the same way as the results of \texttt{testAlGroups}. \section{Testing assignment accuracy using simulated datasets} The simulated data in this tutorial, as well as the simulations for the manuscript, were created using the \texttt{simAllopoly} function. If you have a different ploidy, number of individuals, or number of alleles from the datasets simulated in the manuscript, you can run your own simulations to estimate the accuracy of allele assignment. By default, the alleles are given names that start with A, B, \emph{etc.} to indicate to which isolocus they belong, so that it is easy to see whether the output of \texttt{testAlGroups} is correct. See \texttt{?simAllopoly} for more information. The \texttt{\detokenize{tables_figs.R}} file that is included as supplementary information for the manuscript can serve as a guide for how to run a large number of simulations and test their accuracy. \end{document}