Background

Uniparental disomy (UPD) is a rare genetic condition where an individual inherits both copies of a chromosome from one parent, as opposed to the typical inheritance of one copy from each parent. The extent of its contribution as a causative factor in rare genetic diseases remains largely unexplored.

UPDs can lead to disease either by inheriting a carrier pathogenic mutation as homozygous from a carrier parent or by causing errors in genetic imprinting. Nevertheless, there are currently no standardized methods available for the detection and characterization of these events.

We have developed UPDhmm, an R/Bioconductor package that provides a tool to detect, classify, and locate uniparental disomy events in trio genetic data.

Method overview

UPDhmm relies on a Hidden Markov Model (HMM) to identify regions with UPD.

In our HMM model, the observations are the genotype combinations of the father, mother, and proband at each genomic variant. The hidden states represent five inheritance patterns:

  • Normal (Mendelian inheritance)

  • Maternal isodisomy

  • Paternal isodisomy

  • Maternal heterodisomy

  • Paternal heterodisomy

Emission probabilities were defined based on the expected inheritance patterns, and the Viterbi algorithm was used to infer the most likely sequence of hidden states along the genome.

The final output consists of genomic segments whose inferred state deviates from Mendelian inheritance — i.e., likely UPD regions.

Set-up the packages

if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install("UPDhmm")
BiocManager::install("regioneR")
BiocManager::install("karyoploteR")

Loading libraries

library(UPDhmm)
library(dplyr)

Quick start

The input required by the package is a multisample VCF file containing genotypes for a trio (proband, mother, and father).

Requirements:

Only biallelic variants are supported.

Accepted genotype (GT) formats: 0/0, 0/1, 1/0, 1/1 (or their phased equivalents 0|0, 0|1, 1|0, 1|1).

For details on how to preprocess VCF files and prepare them in the appropriate format, please refer to the “Preprocessing and Input Preparation” vignette included with the package.

Detecting and postprocessing UPD events

In this tutorial, we will perform a complete analysis using two trio VCF files included with the UPDhmm package.

By following the steps below, you will learn how to:

  1. Load and preprocess the VCF files.

  2. Detect UPD events using the HMM-based approach.

  3. Apply quality filters to the detected events.

  4. Collapse and summarize results per individual.

  5. Perform postprocessing analyses to identify potential true UPDs by detecting recurrent or artifact-prone genomic regions across samples.

1. Load and preprocess the VCFs

library(VariantAnnotation)
library(regioneR)
library(karyoploteR)

#Example files
files <- c(
  system.file(package = "UPDhmm", "extdata", "sample1.vcf.gz"),
  system.file(package = "UPDhmm", "extdata", "sample2.vcf.gz")
)


# Define the trio structure
trio_list <- list(
  list(
    proband = "NA19675",
    mother  = "NA19678",
    father  = "NA19679"
  ),
  list(
    proband = "NA19685",
    mother  = "NA19688",
    father  = "NA19689"
  )
)

# Read and preprocess each VCF
vcf_list <- mapply(function(file, trio) {
  vcf <- readVcf(file)
  vcfCheck(vcf,
           proband = trio$proband,
           mother  = trio$mother,
           father  = trio$father)
}, 
files, trio_list, SIMPLIFY = FALSE)

Each processed VCF is now ready for UPD detection.

2. Detect UPD events using the HMM-based approach

The principal function of UPDhmm package, calculateEvents(), is the central function for identifying Uniparental Disomy (UPD) events. It takes the output from the previous vcfCheck() function and systematically analyzes genomic data, splitting the VCF into chromosomes and applying the Viterbi algorithm.

calculateEvents() function encompasses a serie of subprocesses for identifying Uniparental disomies (UPDs):

  1. Split vcf into chromosomes
  2. Apply Viterbi algorithm to every chromosome
  3. Convert Viterbi results into contiguous genomic blocks by collapsing consecutive variants sharing the same inferred state and compute block-level statistics. Optionally, calculate depth-based ratios for each block.
  4. Calculate block-level statistics including Mendelian error counts.

In this example, we run the function in serial mode, but calculateEvents() now includes an optional argument, BPPARAM, which controls parallel execution settings and is passed to bplapply. By default, it uses BiocParallel::SerialParam() (serial execution). To enable parallelization, provide a BiocParallel backend, for example:

events_list <- lapply(vcf_list, calculateEvents, add_ratios = TRUE)

head(events_list[[1]])
       ID chromosome     start       end   group n_snps ratio_father
1 NA19675          3 100374740 197574936 het_mat     47    0.9735006
2 NA19675          6  32489853  32490000 iso_mat      6    0.9802723
3 NA19675          6  32609207  32609341 iso_mat     11    0.8356040
4 NA19675         11  55322539  55587117 iso_mat      7    1.1930587
5 NA19675         14 105408955 105945891 het_fat     30    0.9293491
6 NA19675         15  22382655  23000272 iso_mat     12    0.9463235
  ratio_mother ratio_proband n_mendelian_error
1    1.0247049     1.0102723                 2
2    1.0878888     1.0770260                 5
3    1.0997261     1.0007866                 7
4    1.0440852     1.1676397                 3
5    0.9795199     1.0370100                 2
6    1.0857886     0.9450584                 3

Note: when running under R CMD check or Bioconductor build systems, the number of workers may be automatically limited (usually less than 2).

Results description

The calculateEvents function returns a data.frame containing all
detected events in the provided trio. If no events are found, the function
will return an empty data.frame.

By default, this function only calculates the number of Mendelian errors as a block-level statistic. Sequencing depth ratios can be included by enabling the optional parameter add_ratios = TRUE, which defaults to FALSE.

Column name Description
ID Identifier of the proband (child sample)
chromosome Chromosome name
start Start position of the block (genomic coordinate)
end End position of the block
group Predicted inheritance state (iso_mat, iso_fat, het_mat, het_fat)
n_snps Number of informative SNPs within the event
ratio_proband Optional. Ratio between the average sequencing depth inside the event and the depth outside it for the proband. A value close to 1 indicates balanced coverage; significant deviations may suggest copy-number changes.
ratio_mother Optional. Same ratio computed for the maternal sample
ratio_father Optional. Same ratio computed for the paternal sample
n_mendelian_error Number of Mendelian inconsistencies detected within the region

In this example, we run the function in serial mode, but calculateEvents() now includes an optional argument BPPARAM that controls parallel execution.

Argument Description
BPPARAM Parallelization settings, passed to BiocParallel::bplapply(). By default, the function uses BiocParallel::SerialParam() (serial execution). To enable parallel computation, a BiocParallel backend can be specified — for example:
library(BiocParallel)
BPPARAM <- MulticoreParam(workers = min(2, parallel::detectCores()))
events_list <- lapply(vcf_list, calculateEvents, BPPARAM = BPPARAM)

3. Collapse and summarize results per individual

The output of calculateEvents needs some pre-processing before the interpretation of the results. On one hand, calculateEvents can detect small regions which are likely to be artifacts, instead of real UPD events. On the other, large UPD events, i.e. spanning across a whole chromosome, can be detected as multiple contiguous events.

The function collapseEvents() handles both situations. First,collapseEvents() filters the detected events using the following parameters:

  • n_mendelian_error (default > 2): minimum number of Mendelian errors supporting the UPD event. Requiring multiple Mendelian inconsistencies reduces the likelihood that the signal is driven by sporadic genotyping errors.

  • size (default = 500,000): minimum length (in base pairs) of the event to be considered, aimed at excluding short spurious segments frequently observed as false positives.

These thresholds were defined based on the evaluation conducted in the publication in which the package was introduced and validated, including simulations of UPD events of different sizes and application to a real trio-based cohort, with the aim of removing small events and excluding those likely driven by technical noise.

Second, collapseEvents() merges UPD events with the same predicted type occurring on the same chromosome into a single collapsed event. This step ensures that extended UPD regions initially detected as multiple adjacent segments are represented as a single biologically meaningful event per individual and chromosome.

collapsed_list <- lapply(events_list, collapseEvents)
collapsed_all <- do.call(rbind.data.frame, c(collapsed_list, make.row.names = FALSE))
head(collapsed_all)
       ID chromosome    start      end   group n_events total_mendelian_error
1 NA19675         15 22382655 23000272 iso_mat        1                     3
2 NA19675         19 42315281 43522911 het_mat        1                     3
3 NA19685         15 22368862 42109975 iso_mat        1                     6
4 NA19685          6 32489853 33499925 het_mat        1                     3
  total_size total_snps prop_covered ratio_father ratio_mother ratio_proband
1     617617         12            1    0.9463235    1.0857886     0.9450584
2    1207630          8            1    1.0248302    0.9671289     0.9801788
3   19741113         10            1    1.0077605    1.0158014     1.0105088
4    1010072          5            1    0.9844789    0.9683973     0.9789823
      collapsed_events
1 15:22382655-23000272
2 19:42315281-43522911
3 15:22368862-42109975
4  6:32489853-33499925

The resulting table contains one row per individual, chromosome, and UPD type, with the following columns:

Column name Description
ID Identifier of the proband
chromosome Chromosome
start / end Genomic span of collapsed block
group Predicted state
n_events Number of events collapsed
total_mendelian_error Sum of Mendelian errors
total_size Total genomic span covered
total_snps Total SNPs in the overlapping events
prop_covered Proportion of the region covered by events
collapsed_events Comma-separated list of merged events
ratio_proband Weighted mean ratio of the proband across the collapsed events
ratio_mother Weighted mean ratio of the mother across the collapsed events
ratio_father Weighted mean ratio of the father across the collapsed events

The read-depth ratios are reported as supportive metrics to facilitate downstream interpretation. Values close to one indicate balanced coverage across trio members, whereas strong deviations may suggest copy-number gains or losses rather than true UPD events. Importantly, these ratios are not used as hard filtering criteria at this stage, but are provided to aid quality control and biological interpretation.

4. Perform postprocessing analyses

In this final step, we perform postprocessing analyses to refine the detected UPD events and highlight potentially true or recurrent regions across samples. This helps to distinguish genuine biological UPDs from artifact-prone genomic regions (e.g., repetitive or complex loci that may systematically produce false positives).

To explore whether similar events recur across individuals, UPDhmm provides two complementary functions:

  • identifyRecurrentRegions() — Identifies recurrent genomic regions shared across multiple samples by clustering overlapping genomic intervals.
    The function first filters events based on a Mendelian error threshold, then groups overlapping regions using an agglomerative hierarchical clustering approach applied chromosome-wise.
    Clusters supported by a minimum number of distinct individuals are collapsed into recurrent genomic regions.

    It returns a GRanges object containing the recurrent regions, along with metadata describing sample support and the individual events contributing to each region.

    Arguments:

    • df — The input data.frame containing the genomic regions to evaluate.
      This table must include columns describing chromosome (chromosome), genomic coordinates (start, end), sample identifiers, and a column with Mendelian error counts (n_mendelian_error or total_mendelian_error).

    Filtering arguments:

    • ID_col — The name of the column containing sample identifiers.
      It allows the function to determine how many unique individuals support each overlapping genomic region.
    • error_threshold (default = 100) — Maximum number of Mendelian errors allowed for a region to be considered in the recurrence analysis.
      Regions exceeding this threshold are excluded from recurrence counting. These regions, which often span broad genomic segments, are not considered in the recurrence calculation since they could overlap with many smaller events and bias the results. Instead, they are retained in the output as non-recurrent events, under the assumption that they may represent genuine large-scale alterations rather than noise.
    • min_support (default = 3) — Minimum number of distinct samples required for a cluster of overlapping events to be considered recurrent.
      This parameter can be adjusted depending on the cohort size — smaller values detect more shared regions, while higher thresholds focus on the most consistently recurrent events.
    • max_dist (default = 0.3) — Maximum distance threshold used to cut the hierarchical clustering dendrogram.
      Distance between two events is defined as:

      Smaller values enforce stricter overlap similarity.
    • linkage (default = “complete”) — Linkage method used in agglomerative hierarchical clustering.
      With complete linkage, the distance between two clusters is defined as the maximum distance between all pairs of events across clusters.
  • markRecurrentRegions() — Annotates each event in a results table (from calculateEvents() or collapseEvents()) as recurrent or non-recurrent, depending on whether it sufficiently overlaps any region defined as recurrent.
    Recurrence is determined by evaluating the fraction of the event length that overlaps a recurrent genomic region. Events are classified as recurrent only if this overlap fraction exceeds a user-defined threshold.

    The function returns the same data.frame provided as input, but with two additional columns:

    • Recurrent — Categorical variable indicating whether the event overlaps a recurrent region (“Yes” or “No”).
    • n_samples — Number of unique samples supporting the recurrent region (as determined in identifyRecurrentRegions()).

    These annotations facilitate downstream filtering of events, allowing users to retain only non-recurrent regions, which are more likely to represent true UPD candidates rather than technical artifacts or common genomic polymorphisms.

In summary, these two functions can be combined to systematically identify and flag recurrent genomic regions, facilitating the filtering of potential technical artifacts or recurrent UPDs.

recurrentRegions <- identifyRecurrentRegions(
  df = collapsed_all,
  ID_col = "ID",
  error_threshold = 100,
  min_support = 2,
  max_dist = 0.97
)

head(recurrentRegions)
NULL
annotatedEvents <- markRecurrentRegions(
  df = collapsed_all,
  recurrent_regions = recurrentRegions
)

head(annotatedEvents)
       ID chromosome    start      end   group n_events total_mendelian_error
1 NA19675         15 22382655 23000272 iso_mat        1                     3
2 NA19675         19 42315281 43522911 het_mat        1                     3
3 NA19685         15 22368862 42109975 iso_mat        1                     6
4 NA19685          6 32489853 33499925 het_mat        1                     3
  total_size total_snps prop_covered ratio_father ratio_mother ratio_proband
1     617617         12            1    0.9463235    1.0857886     0.9450584
2    1207630          8            1    1.0248302    0.9671289     0.9801788
3   19741113         10            1    1.0077605    1.0158014     1.0105088
4    1010072          5            1    0.9844789    0.9683973     0.9789823
      collapsed_events Recurrent n_samples
1 15:22382655-23000272        No         1
2 19:42315281-43522911        No         1
3 15:22368862-42109975        No         1
4  6:32489853-33499925        No         1

Alternatively, recurrent regions can be provided directly as an external BED file (e.g., SFARI.bed), converted into a GRanges object, and used in the same way.

The SFARI.bed file was generated from our analysis of the SFARI cohort, where recurrent genomic intervals were identified across trios using UPDhmm. These regions represent loci frequently showing artifactual or low-confidence UPD signals in the SFARI dataset and can be used as a mask to improve specificity in other analyses.

library(regioneR)
bed <- system.file(package = "UPDhmm", "extdata", "SFARI.bed")
bed_df <- read.table(
    bed,
    header = TRUE
)

bed_gr <- toGRanges(bed_df)

annotatedEvents <- markRecurrentRegions(
  df = collapsed_all,
  recurrent_regions = bed_gr
)

head(annotatedEvents)
       ID chromosome    start      end   group n_events total_mendelian_error
1 NA19675         15 22382655 23000272 iso_mat        1                     3
2 NA19675         19 42315281 43522911 het_mat        1                     3
3 NA19685         15 22368862 42109975 iso_mat        1                     6
4 NA19685          6 32489853 33499925 het_mat        1                     3
  total_size total_snps prop_covered ratio_father ratio_mother ratio_proband
1     617617         12            1    0.9463235    1.0857886     0.9450584
2    1207630          8            1    1.0248302    0.9671289     0.9801788
3   19741113         10            1    1.0077605    1.0158014     1.0105088
4    1010072          5            1    0.9844789    0.9683973     0.9789823
      collapsed_events Recurrent n_samples
1 15:22382655-23000272        No         1
2 19:42315281-43522911        No         1
3 15:22368862-42109975        No         1
4  6:32489853-33499925        No         1

Regions annotated as non-recurrent and passing all quality filters (low Mendelian error count and consistent read depth ratios) are the most likely true UPD candidates. Conversely, highly recurrent or error-prone regions should be interpreted with caution, as they may reflect systematic technical artifacts or structural genomic complexity.

5. Visualizing UPD events across the genome

After identifying and postprocessing UPD events, a final and highly informative step is to visualize their genomic distribution. Visualization helps to quickly assess the size, chromosomal location, and type (maternal or paternal; hetero- or isodisomy) of the detected events.

The karyoploteR package provides elegant genome-wide visualization tools that can be integrated directly with the UPDhmm results. The following custom plotting function, plotUPDKp(), enables the visualization of UPD segments detected in a trio across all autosomes. To visualize events across the genome, you can use the karyoploteR package:

library(karyoploteR)
library(regioneR)

# Function to visualize UPD events across the genome
plotUPDKp <- function(updEvents) {
  #--------------------------------------------------------------
  # 1. Detect coordinate columns
  #--------------------------------------------------------------
  if (all(c("start", "end") %in% colnames(updEvents))) {
    start_col <- "start"
    end_col <- "end"
  } else if (all(c("min_start", "max_end") %in% colnames(updEvents))) {
    start_col <- "min_start"
    end_col <- "max_end"
  } else {
    stop("Input must contain either (start, end) or (min_start, max_end) columns.")
  }

  #--------------------------------------------------------------
  # 2. Ensure chromosome naming format (e.g. "chr3")
  #--------------------------------------------------------------
  updEvents$seqnames <- ifelse(grepl("^chr", updEvents$chromosome),
                               updEvents$seqnames,
                               paste0("chr", updEvents$chromosome))

  #--------------------------------------------------------------
  # 3. Helper function to safely create GRanges only if events exist
  #--------------------------------------------------------------
  to_gr_safe <- function(df, grp) {
    subset_df <- subset(df, group == grp)
    if (nrow(subset_df) > 0) {
      toGRanges(subset_df[, c("seqnames", start_col, end_col)])
    } else {
      NULL
    }
  }

  #--------------------------------------------------------------
  # 4. Separate UPD event types
  #--------------------------------------------------------------
  het_fat <- to_gr_safe(updEvents, "het_fat")
  het_mat <- to_gr_safe(updEvents, "het_mat")
  iso_fat <- to_gr_safe(updEvents, "iso_fat")
  iso_mat <- to_gr_safe(updEvents, "iso_mat")

  #--------------------------------------------------------------
  # 5. Create the genome ideogram
  #--------------------------------------------------------------
  kp <- plotKaryotype(genome = "hg19")

  #--------------------------------------------------------------
  # 6. Plot regions by inheritance type (if any exist)
  #--------------------------------------------------------------
  if (!is.null(het_fat)) kpPlotRegions(kp, het_fat, col = "#AAF593")
  if (!is.null(het_mat)) kpPlotRegions(kp, het_mat, col = "#FFB6C1")
  if (!is.null(iso_fat)) kpPlotRegions(kp, iso_fat, col = "#A6E5FC")
  if (!is.null(iso_mat)) kpPlotRegions(kp, iso_mat, col = "#E9B864")

  #--------------------------------------------------------------
  # 7. Add legend
  #--------------------------------------------------------------
  legend("topright",
         legend = c("Het_Fat", "Het_Mat", "Iso_Fat", "Iso_Mat"),
         fill = c("#AAF593", "#FFB6C1", "#A6E5FC", "#E9B864"))
}
# Example: visualize UPD events for the first trio
plotUPDKp(annotatedEvents)

Summary

In summary, the UPDhmm package provides a complete workflow for the detection, characterization, and visualization of uniparental disomy (UPD) events from trio-based sequencing data. Starting from raw VCF files, users can preprocess, model inheritance patterns through a Hidden Markov Model, and identify genomic regions consistent with UPD. Postprocessing steps, including the collapsing of events and the identification of recurrent or artifact-prone regions, enable a refined interpretation of results and the prioritization of true UPD candidates. Finally, genome-wide visualization with karyoploteR offers an intuitive way to inspect and report detected UPDs.

Altogether, `UPDhmm facilitates a standardized and reproducible approach for the study of UPD across large cohorts, integrating statistical detection, quality control, and interpretative visualization into a single coherent framework.

Session Info

sessionInfo()
R Under development (unstable) (2026-01-15 r89304)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.3 LTS

Matrix products: default
BLAS:   /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB              LC_COLLATE=C              
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/New_York
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] karyoploteR_1.37.0          regioneR_1.43.0            
 [3] VariantAnnotation_1.57.1    Rsamtools_2.27.0           
 [5] Biostrings_2.79.4           XVector_0.51.0             
 [7] SummarizedExperiment_1.41.1 Biobase_2.71.0             
 [9] GenomicRanges_1.63.1        IRanges_2.45.0             
[11] S4Vectors_0.49.0            Seqinfo_1.1.0              
[13] MatrixGenerics_1.23.0       matrixStats_1.5.0          
[15] BiocGenerics_0.57.0         generics_0.1.4             
[17] dplyr_1.2.0                 UPDhmm_1.7.1               
[19] BiocStyle_2.39.0           

loaded via a namespace (and not attached):
 [1] DBI_1.2.3                bitops_1.0-9             gridExtra_2.3           
 [4] rlang_1.1.7              magrittr_2.0.4           biovizBase_1.59.0       
 [7] otel_0.2.0               compiler_4.6.0           RSQLite_2.4.6           
[10] GenomicFeatures_1.63.1   png_0.1-8                vctrs_0.7.1             
[13] ProtGenerics_1.43.0      stringr_1.6.0            pkgconfig_2.0.3         
[16] crayon_1.5.3             fastmap_1.2.0            magick_2.9.0            
[19] backports_1.5.0          rmarkdown_2.30           UCSC.utils_1.7.1        
[22] tinytex_0.58             bit_4.6.0                xfun_0.56               
[25] cachem_1.1.0             cigarillo_1.1.0          GenomeInfoDb_1.47.2     
[28] jsonlite_2.0.0           blob_1.3.0               DelayedArray_0.37.0     
[31] BiocParallel_1.45.0      parallel_4.6.0           cluster_2.1.8.2         
[34] R6_2.6.1                 bslib_0.10.0             stringi_1.8.7           
[37] RColorBrewer_1.1-3       bezier_1.1.2             rtracklayer_1.71.3      
[40] rpart_4.1.24             jquerylib_0.1.4          Rcpp_1.1.1              
[43] bookdown_0.46            knitr_1.51               base64enc_0.1-6         
[46] Matrix_1.7-4             nnet_7.3-20              tidyselect_1.2.1        
[49] rstudioapi_0.18.0        dichromat_2.0-0.1        abind_1.4-8             
[52] yaml_2.3.12              codetools_0.2-20         curl_7.0.0              
[55] lattice_0.22-9           tibble_3.3.1             KEGGREST_1.51.1         
[58] S7_0.2.1                 evaluate_1.0.5           foreign_0.8-91          
[61] pillar_1.11.1            BiocManager_1.30.27      checkmate_2.3.4         
[64] RCurl_1.98-1.17          ensembldb_2.35.0         ggplot2_4.0.2           
[67] scales_1.4.0             glue_1.8.0               lazyeval_0.2.2          
[70] Hmisc_5.2-5              tools_4.6.0              BiocIO_1.21.0           
[73] data.table_1.18.2.1      BSgenome_1.79.1          GenomicAlignments_1.47.0
[76] XML_3.99-0.22            grid_4.6.0               colorspace_2.1-2        
[79] AnnotationDbi_1.73.0     htmlTable_2.4.3          restfulr_0.0.16         
[82] Formula_1.2-5            cli_3.6.5                HMM_1.0.2               
[85] S4Arrays_1.11.1          AnnotationFilter_1.35.0  gtable_0.3.6            
[88] sass_0.4.10              digest_0.6.39            SparseArray_1.11.10     
[91] rjson_0.2.23             htmlwidgets_1.6.4        farver_2.1.2            
[94] memoise_2.0.1            htmltools_0.5.9          lifecycle_1.0.5         
[97] httr_1.4.7               bit64_4.6.0-1            bamsignals_1.43.0