---
title: "PARATI: Parental Allele Transmission Inference"
author: "Jinyi Che"
date: "`r Sys.Date()`"
output:
  BiocStyle::html_document:
    toc: true
vignette: >
  %\VignetteIndexEntry{parati Workflow}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

## Abstract

PARATI is an R package for inferring parental transmitted and non-transmitted alleles in trio genotype data. It fills a gap in current Bioconductor workflows for analyzing genetic nurture and transgenerational effects, complementing packages such as `VariantAnnotation` by providing SNP-specific transmission inference for trio data.

## Introduction

PARATI infers maternal and paternal transmitted and non-transmitted alleles from phased trio genotype data. While Bioconductor packages such as `VariantAnnotation` provide robust infrastructure for reading and representing VCF data, they do not directly implement trio-specific transmission inference. PARATI builds on that infrastructure by accepting `VariantAnnotation::VCF` objects or VCF file paths and returning R objects suitable for downstream analysis.

## Installation

```{r eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}
BiocManager::install("parati")
```

## Load packages and example data

```{r}
library(parati)
library(VariantAnnotation)

vcf_file <- system.file("extdata", "Toy_TrioGenotype.vcf.gz", package = "parati")
fam_file <- system.file("extdata", "Toy_FamilyIndexTable.xlsx", package = "parati")
```

## Run PARATI from a VCF file path

```{r}
res <- parati_run(
    vcf = vcf_file,
    fam = fam_file,
    chr = 1,
    hap_length = 500000
)
names(res)
```

## Explore returned results

```{r}
head(res$vcf_trans, 3)
head(res$vcf_nontrans, 3)
head(res$sim_perc_summary, 3)
```

## Export result files

The returned result object can be saved as tabular files or standard VCF files.

```{r eval=FALSE}
library(vcfR)

# 1. Save tabular outputs
data.table::fwrite(res$vcf_trans, "transmitted_chr1.csv.gz")
data.table::fwrite(res$vcf_nontrans, "nontransmitted_chr1.csv.gz")
data.table::fwrite(res$sim_perc_summary, "sim_perc_summary_chr1.csv.gz")

# 2. Read original VCF metadata
orig_vcf <- vcfR::read.vcfR(vcf_file, verbose = FALSE)
meta_lines <- orig_vcf@meta

# 3. Convert returned data.tables to standard VCF objects
trans_obj <- vcf_dt_to_vcfR(res$vcf_trans, meta = meta_lines)
nontrans_obj <- vcf_dt_to_vcfR(res$vcf_nontrans, meta = meta_lines)

# 4. Write standard gzipped VCF files
write_vcf_obj(trans_obj, "transmitted_chr1.vcf.gz")
write_vcf_obj(nontrans_obj, "nontransmitted_chr1.vcf.gz")
```

This produces five output files:

- `transmitted_chr1.csv.gz`
- `nontransmitted_chr1.csv.gz`
- `sim_perc_summary_chr1.csv.gz`
- `transmitted_chr1.vcf.gz`
- `nontransmitted_chr1.vcf.gz`

## Output files and summary columns

The updated PARATI workflow separates output into three logical components:

1. `vcf_trans`: Final parental transmitted alleles.
2. `vcf_nontrans`: Final parental non-transmitted alleles.
3. `sim_perc_summary`: Haplotype-matching evidence and inference diagnostics for sites where mother, father, and child are all heterozygous.

Compared with the original implementation, the transmitted VCF output corresponds most closely to the original single-file VCF result. The updated workflow additionally provides non-transmitted alleles and a separate summary table describing haplotype-based inference evidence.

### Meaning of `sim_perc_summary` columns

The `sim_perc_summary` table contains one row per haplotype comparison for each target variant and family. Its columns are:

- `#CHROM`: Chromosome of the target variant.
- `POS`: Genomic position of the target variant.
- `ID`: Variant identifier of the target SNP.
- `pair`: The specific haplotype pair being compared.
  Each target variant has up to eight comparisons: `B_hap1_vs_M_hap1`, `B_hap1_vs_M_hap2`, `B_hap1_vs_P_hap1`, `B_hap1_vs_P_hap2`, `B_hap2_vs_M_hap1`, `B_hap2_vs_M_hap2`, `B_hap2_vs_P_hap1`, and `B_hap2_vs_P_hap2`.
- `B_hap`: The child haplotype involved in the comparison (`B_hap1` or `B_hap2`).
- `PM_hap`: The parental haplotype involved in the comparison (`M_hap1`, `M_hap2`, `P_hap1`, or `P_hap2`).
- `bpwindow`: The local haplotype-matching window size in base pairs, corresponding to the `hap_length` argument.
- `nSNP_haplotype`: Number of nearby SNPs within the matching window that had complete phased haplotype information and were used in similarity calculation.
- `sim_perc`: Similarity proportion between the child haplotype and the corresponding parental haplotype across the local window.
- `order`: Output row order for the eight pairwise haplotype comparisons. In the current implementation this is the fixed listing order of the eight comparisons, not a rank by similarity.
- `status`: Final inference status for the target variant in that family. Typical values are: `Inferred based on haplotype`, `Ambiguous`, and `Low similarity`.
- `FamilyIndex`: Family identifier from the family index table.

In practice, `vcf_trans` and `vcf_nontrans` are the primary final outputs, whereas `sim_perc_summary` is intended for interpretation, quality control, and tracing the basis of haplotype-based inference.

## Integration with Bioconductor VCF workflows

```{r eval=FALSE}
vcf_obj <- readVcf(vcf_file, genome = "unknown")

res_from_vcf <- parati_run(
    vcf = vcf_obj,
    fam = fam_file,
    chr = 1,
    hap_length = 500000
)
names(res_from_vcf)
```

## Inputs

### 1. Trio genotype VCF

- Standard phased VCF input
- Autosomal biallelic SNPs are recommended
- Sample IDs must match `IndividualID` in the family table

### 2. Family index table

Columns required:

- `FamilyIndex`
- `IndividualID`
- `Role`

## Notes

The primary interface returns R objects rather than writing files to disk.
This design is intended to align with Bioconductor workflows and facilitate downstream analyses. Users can save returned results manually as needed.

## Session Info

```{r}
sessionInfo()
```
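
## Example: inspecting `sim_perc_summary`

As a complement to the column descriptions in "Meaning of `sim_perc_summary` columns", the sketch below shows one way a user might summarise that table for quality control. It is an illustration only. `toy_summary` is a hand-made data frame with invented values (not PARATI output) that borrows the documented column names, and `best_match_per_child_hap()` is a hypothetical helper that is not part of PARATI. The helper simply keeps the parental haplotype with the highest `sim_perc` for each child haplotype; it does not reproduce PARATI's internal rule for assigning `status`.

```{r}
# Toy stand-in for res$sim_perc_summary: one target variant in one family.
# All values are invented for illustration; they are not PARATI output.
toy_summary <- data.frame(
    POS = 1000L,
    ID = "rs_toy",
    FamilyIndex = "F1",
    B_hap = rep(c("B_hap1", "B_hap2"), each = 4),
    PM_hap = rep(c("M_hap1", "M_hap2", "P_hap1", "P_hap2"), times = 2),
    nSNP_haplotype = 40L,
    sim_perc = c(0.98, 0.55, 0.60, 0.52, 0.50, 0.58, 0.97, 0.61),
    stringsAsFactors = FALSE
)
toy_summary$pair <- paste0(toy_summary$B_hap, "_vs_", toy_summary$PM_hap)

# Hypothetical helper (not a PARATI function): for each family, variant, and
# child haplotype, keep the comparison row with the highest sim_perc.
best_match_per_child_hap <- function(summary_df) {
    groups <- split(
        summary_df,
        list(summary_df$FamilyIndex, summary_df$ID, summary_df$B_hap),
        drop = TRUE
    )
    best_rows <- lapply(groups, function(g) {
        g[which.max(g$sim_perc), , drop = FALSE]
    })
    out <- do.call(rbind, best_rows)
    rownames(out) <- NULL
    out
}

best <- best_match_per_child_hap(toy_summary)
best[, c("FamilyIndex", "ID", "B_hap", "PM_hap", "sim_perc")]
```

With the toy values above, `B_hap1` is paired with `M_hap1` and `B_hap2` with `P_hap1`. The same helper could in principle be applied to `res$sim_perc_summary`, assuming the returned table carries the columns listed earlier.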