---
title: "VCF Preprocessing User Guide"
author:
- name: "Marta Sevilla Porras"
  affiliation:
  - "Universitat Pompeu Fabra (UPF)"
  - "Centro de Investigación Biomédica en Red (CIBERER)"
  email: "marta.sevilla@upf.edu"
- name: "Carlos Ruiz Arenas"
  affiliation:
  - "Universidad de Navarra (UNAV)"
  email: "cruizarenas@unav.es"
output:
  BiocStyle::html_document:
    number_sections: false
    toc: true
    fig_caption: true
    toc_float: true
vignette: >
  %\VignetteIndexEntry{VCF Preprocessing User Guide}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

This vignette shows how to preprocess the **individual VCFs of a trio** (proband, father, mother) into a single, clean **trio VCF** suitable for UPDhmm.

## Inputs

You should have one **VCF** (e.g. output from GATK) per family member:

- `proband.vcf.gz`
- `mother.vcf.gz`
- `father.vcf.gz`

Each file should be bgzipped (`.vcf.gz`) and indexed with tabix (`.tbi`).

## 0. Normalize and Left-Align Variants (optional)

For this step you will need a **reference genome FASTA** indexed with `samtools faidx`. Splitting multiallelic sites (`-m-any`) and left-aligning indels against the reference (`-f`) gives every variant a consistent representation across the three files.

```{bash, eval=FALSE}
# Split multiallelic sites and left-align indels against the reference
bcftools norm -m-any -f reference.fa proband.vcf.gz -Oz -o proband.norm.vcf.gz
bcftools norm -m-any -f reference.fa mother.vcf.gz -Oz -o mother.norm.vcf.gz
bcftools norm -m-any -f reference.fa father.vcf.gz -Oz -o father.norm.vcf.gz

# Index the normalized VCFs
tabix -p vcf proband.norm.vcf.gz
tabix -p vcf mother.norm.vcf.gz
tabix -p vcf father.norm.vcf.gz
```

## 1. Remove Extra Annotations

Keep only the essential fields and drop the INFO and FORMAT annotations that are not required downstream. The resulting VCFs are smaller and faster to process.
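Before stripping annotations, it can be useful to see which INFO and FORMAT tags a file declares. The following is a minimal sketch that uses only standard shell tools; `list_header_tags` is a hypothetical helper introduced here for illustration (it is not part of bcftools or UPDhmm), and it assumes the header declares its tags in the usual `##INFO=<ID=...>` and `##FORMAT=<ID=...>` form.

```bash
# Hypothetical helper (not part of bcftools or UPDhmm): print the INFO and
# FORMAT tag IDs declared in the header of a gzipped or bgzipped VCF.
list_header_tags() {
    gzip -dc "$1" \
        | awk '/^#CHROM/ { exit } /^##(INFO|FORMAT)=<ID=/ { print }' \
        | sed -E 's/^##(INFO|FORMAT)=<ID=([^,>]+).*/\1\/\2/'
}

# Usage: inspect the proband file from step 0, if it is present
if [ -f proband.norm.vcf.gz ]; then
    list_header_tags proband.norm.vcf.gz
fi
```

Each output line has the form `INFO/<tag>` or `FORMAT/<tag>`, which is the same notation used by `bcftools annotate -x`. The helper stops reading at the `#CHROM` line, so it does not scan the variant records.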
```{bash, eval=FALSE}
# Remove all INFO annotations and keep only the GT, AD, DP, and GQ FORMAT fields
bcftools annotate -x INFO,^FORMAT/GT,FORMAT/AD,FORMAT/DP,FORMAT/GQ \
  proband.norm.vcf.gz -Oz -o proband.clean.vcf.gz
bcftools annotate -x INFO,^FORMAT/GT,FORMAT/AD,FORMAT/DP,FORMAT/GQ \
  mother.norm.vcf.gz -Oz -o mother.clean.vcf.gz
bcftools annotate -x INFO,^FORMAT/GT,FORMAT/AD,FORMAT/DP,FORMAT/GQ \
  father.norm.vcf.gz -Oz -o father.clean.vcf.gz

# Index the cleaned VCFs
tabix -p vcf proband.clean.vcf.gz
tabix -p vcf mother.clean.vcf.gz
tabix -p vcf father.clean.vcf.gz
```

## 2. Merge Trio into a Single VCF

**Goal:**

- Combine the proband, mother, and father into a single VCF.
- Retain only biallelic and informative variants.
- Remove positions where all trio members are homozygous for the reference (0/0).

If the input files are **gVCFs**, you can merge them directly, as they already include all genomic positions. If they are **standard VCFs**, merge the files first and then remove sites with missing genotypes (`./.`) to keep only fully called variants. Run only one of the two options below; both write `trio_merged_clean.vcf.gz`.

```{bash, eval=FALSE}
#--------------------------------------------------------------
# Option 1: For standard VCFs
#--------------------------------------------------------------
# Merge the three individuals
bcftools merge \
  proband.clean.vcf.gz \
  mother.clean.vcf.gz \
  father.clean.vcf.gz \
  -Oz -o trio_merged_raw.vcf.gz

# Retain only biallelic variants
bcftools view -m2 -M2 trio_merged_raw.vcf.gz -Oz -o trio_merged_biallelic.vcf.gz

# Remove sites where all genotypes are homozygous reference (0/0)
bcftools view \
  -i 'COUNT(FORMAT/GT="0/0") != 3' \
  trio_merged_biallelic.vcf.gz -Oz -o trio_merged_nohom.vcf.gz

# Remove sites with missing genotypes (./.) to keep only fully called variants
bcftools view \
  -e 'GT="./."' \
  trio_merged_nohom.vcf.gz -Oz -o trio_merged_clean.vcf.gz

#--------------------------------------------------------------
# Option 2: For gVCFs
#--------------------------------------------------------------
# gVCFs already include all genomic sites, so the missing-genotype
# filter from Option 1 is not needed
bcftools merge \
  proband.clean.vcf.gz \
  mother.clean.vcf.gz \
  father.clean.vcf.gz \
  -Oz -o trio_merged_raw.vcf.gz

# Retain only biallelic variants
bcftools view -m2 -M2 trio_merged_raw.vcf.gz -Oz -o trio_merged_biallelic.vcf.gz

# Remove sites where all are 0/0 (non-informative for UPD analysis)
bcftools view \
  -i 'COUNT(FORMAT/GT="0/0") != 3' \
  trio_merged_biallelic.vcf.gz -Oz -o trio_merged_clean.vcf.gz

#--------------------------------------------------------------
# Normalize and index the merged VCF (both options)
#--------------------------------------------------------------
# Normalize (split multi-allelics if any remain and remove duplicates)
bcftools norm -m -both -d all -Oz -o trio_merged_norm.vcf.gz trio_merged_clean.vcf.gz

# Index for downstream tools
tabix -p vcf trio_merged_norm.vcf.gz
```

## 3. Mask Structural Variant Regions

Before detecting UPD events, it is recommended to **exclude genomic regions** prone to alignment artifacts or abnormal variant density, such as centromeres, segmental duplications, and immune complex regions (e.g., HLA and KIR). Mapping ambiguity and high polymorphism in these regions can produce false-positive signals.
To simplify this step, a curated set of BED masks is available in the following Zenodo repository:

🔗 [Zenodo – UPDhmm Excluded Regions](https://zenodo.org/records/17193905)

The repository provides one BED file per reference genome build:

- `hg19_excluded_regions.bed`
- `hg38_excluded_regions.bed`

These files contain merged genomic intervals covering:

- Centromeric and telomeric regions
- Segmental duplications
- HLA and KIR loci
- Low-mappability or highly repetitive regions

Download the BED file that matches your genome build and use it to remove variants in the excluded regions from the merged trio VCF:

```{bash, eval=FALSE}
# Exclude variants in problematic regions (the ^ prefix inverts the selection).
# Use hg19_excluded_regions.bed instead if your data are aligned to hg19.
bcftools view -T ^hg38_excluded_regions.bed \
  trio_merged_norm.vcf.gz -Oz -o trio_masked.vcf.gz

# Index the masked VCF
tabix -p vcf trio_masked.vcf.gz
```

# Session Info

```{r}
sessionInfo()
```