In this vignette we cover the generation of a qseaSet using the functions provided in mesa, rather than those in qsea. Please see the qsea tutorial for more detail, but we will cover much of that here.
The first thing to prepare is the sampleTable; a dataframe containing metadata for the samples used in the package. This has three required columns:
sample_name : a name for each sample, which must be unique. This is the primary identifier used to describe each sample, so it is good to be descriptive.group : a column to specify a group for each sample, for use inside certain functions for averaging over samples of the same type. We suggest only using this for replicates of the same (or very similar) samples, and adding extra columns for other comparisons that may be of interest.file_name : A path to a bam file, containing the aligned, deduplicated, reads mapped to the relevant reference genome, from material which has undergone the methylation enrichment/capture process (e.g. MEDIP-seq or MBD-seq). Ideally samtools fixmate has been ran to supply mate information onto the read tags, but filtering for quality is not required.There are also two more optional fields that may be used during generation:
input_file : This is a path to an additional bam file containing aligned, deduplicated reads for material from the same sample, without performing enrichment. This is whole genome sequencing, but may be sequenced at a much lower depth (e.g. ultra-shallow WGS, ~1-2 million fragments is sufficient for calling copy number variation at a window size of 1Mb). If present, this will be used on a sample-by-sample basis to calculate copy number variation, at a chromosome scale, in order to correct for this in the expected read depths. This is most relevant when studying cancer, as particular cancers often show recurrent patterns of copy number changes. Note that below ~0.5 million fragments the calculated copy number patterns will become especially noisy (for human samples).
gender or sex : If one of these columns are included, then any chromosome called X,Y, chrX or chrY will be set with appropriate zygosity for samples with M, m or male in this column. See the qsea code for details; within the National Biomarker Centre we only consider the autosomes.
In addition, any further metadata columns may be included as required, and we recommend that anything you might want to query or compare the data on is included here. This includes experimental conditions (tissue, treatment, tumour status) etc.
One thing to note is that we highly recommend the use of the tool such as NGSCheckMate for verifying that (human) samples come from the same individual or not. This is useful both for ensuring that the enriched and non-enriched samples do indeed come from the same individual, but also to check that paired samples are actually paired (for instance when comparing pre- and post-treatment).
For the example here, we use the example data in the MEDIPSData package. This contains pairs of tumour and normal tissue from the same patients, with MEDIP-seq performed as the enrichment step. These bam files are downsampled to only contain chromosomes 20-22, and are aligned to hg19. Note, we highly recommend using the hg38/GRCh38 reference genome rather than hg19 as used here!
sampleTable <- tribble(~sample_name, ~group, ~tumour, ~patient, ~file_name,
"Normal1", "Normal1", "Normal", "Patient1", system.file("extdata", "NSCLC_MeDIP_1N_fst_chr_20_21_22.bam", package = "MEDIPSData", mustWork = TRUE),
"Normal2", "Normal1", "Normal", "Patient2", system.file("extdata", "NSCLC_MeDIP_2N_fst_chr_20_21_22.bam", package = "MEDIPSData", mustWork = TRUE),
"Tumour1", "Tumour1", "Tumour", "Patient1", system.file("extdata", "NSCLC_MeDIP_1T_fst_chr_20_21_22.bam", package = "MEDIPSData", mustWork = TRUE),
"Tumour2", "Tumour2", "Tumour", "Patient2", system.file("extdata", "NSCLC_MeDIP_2T_fst_chr_20_21_22.bam", package = "MEDIPSData", mustWork = TRUE)
)The qsea tutorial details the steps involved in generating a qseaSet. We provide a function, makeQset, which wraps these up into one function, although you may still use the rest of the mesa functionality with or without using this to create your qseaSet.
Before creating the qseaSet, you need to have:
sampleTable of metadata, including the file paths to the bam files.BSgenome package name, and the package to be installed. You can check what BSgenome packages are available on BioConductor via BSgenome::available.genomes(), or make your own if necessary (vignette("BSgenomeForge", package = "BSgenome")).qseaSet, the genome is split into fixed windows with equal size, and all analysis is performed on these windows as independent units. We generally use 300bp windows, as that is a little bigger than our average DNA fragment sizes, which are in the 150-250bp range. If the DNA is particularly long fragments, a larger window size may make sense. This should also be informed by your depth of sequencing, as a smaller window size will be more prone to noise.For our hg19 example data, we set BSgenome = "BSgenome.Hsapiens.UCSC.hg19", specify the three chromosomes included in the test dataset and give some approximate fragment length and SD values.
makeQset(sampleTable,
BSgenome = "BSgenome.Hsapiens.UCSC.hg19",
chrSelect = paste0("chr",20:22),
windowSize = 300,
fragmentLength = 200,
fragmentSD = 50,
CNVmethod = "none")
#> Detected parallel setup with 2 workers.
#>
#> Attaching package: 'Biostrings'
#> The following object is masked from 'package:base':
#>
#> strsplit
#> Considering 541532 regions with total size 162459600
#> ==== Creating qsea set ====
#> restricting analysis on chr20, chr21, chr22
#> Dividing provided regions in 300nt windows
#> Scanning up to 2 files in parallel.
#> No methods found in package 'IRanges' for request: 'values' when loading 'MEDIPS'
#> Searching in chr20 ...[ 717722 ] found.
#> Searching in chr21 ...[ 380444 ] found.
#> Searching in chr22 ...[ 578097 ] found.
#> Number of identified CG pattern: 1676263
#> Generated objects
#> Calculated coverage
#> Get genomic positions of "CG" ...
#> searching chr20 for "CG"...
#> found 717722 occurances of CG in chr20
#> estimating expected number of CGs per fragment for windows of chr20...
#> ...done
#> searching chr21 for "CG"...
#> found 380444 occurances of CG in chr21
#> estimating expected number of CGs per fragment for windows of chr21...
#> ...done
#> searching chr22 for "CG"...
#> found 578097 occurances of CG in chr22
#> estimating expected number of CGs per fragment for windows of chr22...
#> ...done
#> selecting windows with low CpG density for background read estimation
#> 4.396% of the windows have enrichment pattern density of at most 0.05 per fragment
#> and are used for background reads estimation
#> qseaSet object generated successfully
#> qseaSet
#> =======================================
#> 4 Samples:
#> Normal1, Normal2, Tumour1, Tumour2
#> 541532 Regions in 3 chromosomes of BSgenome.Hsapiens.UCSC.hg19A range of messages are produced as output during this process. One thing to note is that by default the process is performed without parallelisation, this can be controlled via the setMesaParallel function and will vastly speed up the process when multiple cores are available (but use more RAM).
This function combines several steps:
BSgenome, chrSelect, windowSize and badRegions parameters.fragmentLength and fragmentSD parameters to define the expected contribution of fragments to the likelihood that they overlap the window (see the Calculating CpG Density section below).The chosen chromosomes (via chrSelect) of the BSgenome object will be tiled in windowSize windows. These can be filtered for regions which have poor mappability or are overrepresented in non-enriched sequencing, by providing a GRanges object to the badRegions option. If provided, this will remove any window which overlaps with the GRanges object; ensuring that the windows remain fixed when this filtering is changed. The 2019 ENCODE exclusion regions are provided as an data object called ENCODEbadRegions.
The copy number variation is calculated over a set of larger genomic windows as set by CNVwindowSize; this is typically much larger than the windowSize due to coverage limitations. This will be performed using the same options as the enriched samples. Once the coverage is determined, the R package HMMcopy will be used, with default parameters, to calculate the copy number variation. The non-enriched (also known as “Input”) bam files should be specified by adding a column called input_file to the sampleTable. To use this you need to provide HMMcopy with a pair of data frames containing the GC content and mapability per region you wish to calculate over, as detailed in vignette("HMMcopy", package = "HMMcopy"). There is an exception if you are using the human genome hg38 (or GRCh38), in which case files are provided for window sizes of 50kb, 500kb or 1000kb. These are stored in objects called e.g. gc_hg38_1000kb and map_hg38_1000kb; similar objects need to be generated for other genomes or window sizes using the tools gcCounter or mapCounter from the hmmCopy suite.
Alternatively, copy number variation may be calculated using an external method, for instance using the ichorCNA package. This may then be added to the qseaSet object using the qsea function addCNV.
As a further option, qsea provides a method for determining the copy number from the methylation enriched samples, using only reads which do not contain a CG, this may be selected using CNVmethod = "MeCap". We do note that for our MBD-seq experiments we found there was reasonable overall agreement, but this method did give spurious CNV compared to the non-enriched reads, although that may be due to the selectivity of our MBD-seq enrichment step being relatively high.
No copy number variation will be calculated if CNVmethod = "none".
One essential part of the qseaSet is the CpG_density for each region. This is calculated using the genome sequence, alongside an expected fragment size distribution via the qsea function qsea:::estimatePatternDensity().
This process works by first locating all CG dinucleotides across the genome, before modelling a distribution around each, representing the likelihood for a fragment including that dinucleotide to reach adjacent positions. This distribution is assumed to be normally distributed, with a mean and standard deviation. Thus each CG implies a possibility to impact upon the adjacent base pairs in its surroundings. Finally, these values are averaged across the windows to give a CpG_density column on the regions of the qseaSet. This method is designed to capture the fact that some captured fragments of DNA may be heavily methylated on one end of the fragment, but if the fragment is long it may be counted in an adjacent window.
This calculation therefore requires the fragmentLength and fragmentSD parameters to be specified, to define these normal distributions, and is associated with the whole qseaSet. This is therefore not sample dependent, and so if samples have vastly different fragment distributions then there will be some expected variation not captured in this process. This does have a significant effect on the calculated beta values for each sample, so it is important to set the fragmentLength and fragmentSD values in an appropriate way. We also provide a fragmentType option, which hardcodes these two parameters to values developed on our initial Sheared or cfDNA datasets. When fragmentType = "Sheared" is used this sets fragmentLength = 213, fragmentSD = 60, while fragmentType = "cfDNA" sets fragmentLength = 167, fragmentSD = 38, but we do emphasise that you should check your own data.
Please note that these fragmentLength and fragmentSD parameters are only used for calculating this CG density metric, they are not used to control what reads are included in the coverage calculations.
In any ME-seq experiment, there will be off-target fragments sequenced due to the imperfect nature of the enrichment process. These are used to calculate the offset; a measure of the background level of fragments in windows that do not contain CGs. The maximum CpG_density value for these windows is controlled by maxPatternDensity. If this is set too low, then qsea will give a warning saying that the offset can not be calculated. In particular, this could be an issue if using a targeted enrichment step in your library preparation method.
Finally, the qsea beta value calculation is applied. This requires the CpG_density and the coverage, as well as some assumption on methylated regions or at least the expected number of them. The qsea manuscript details three methods for this assumption, two of which require external data in the form of methylation arrays or bisulphite/enzymatic sequencing. The third method, known as “blind calibration” in the qsea manuscript, is the only one implemented here. This works on a global assumption on the typical methylation levels across the human genome - that the majority of the isolated CGs are methylated while the majority of CpG islands are unmethylated. This blind calibration method therefore assumes that 76% of regions with a CpG_density of 1 are expected to be fully methylated (\(\beta = 1\)), while 25% of regions with a CpG_density of 15 are fully unmethylated (\(\beta = 0\)). A straight line is then fit between these two densities. This method is the only one implemented here, as it does not require additional data.
We have found this blind calibration sufficient for the human genome, including being able to build classifiers using methylation arrays as our data source.
We do note that if you wish to use array data for normalisation, then this Biostars post may be a good source of information, but we have only used the blind calibration.method.
One thing to note is that this function does not set the library_factor for the qseaSet, which suggested to be done in qsea via the Trimmed Means of M-values (TMM) method. The way qsea does this is to choose one sample (by default the first sample), and use this as the reference sample to calculating a scaling factor to adjust the total fragments by when calculating the NRPM (normalised reads per million) values.
TMM normalisation is a method used in RNA-seq to compensate for genes which are excessively expressed in a particular condition. In RNA-seq, the distribution of read counts varies over ~4 orders of magnitude, with some highly expressed genes having tens of thousands of reads while others have tens or hundreds. If in a particular tissue there are a group of genes which are extremely highly expressed that are not expressed in another tissue, then the total pool of available reads will be dominated by those genes in the first tissue, comparatively depressing the other genes. So just using the total size ends up biasing to these larger genes, and making it look like the other genes are lower expressed, ending up with many more differentially expressed genes than might actually be the case.
e.g. if there are 100 reads in a sample, and 3 genes, with values:
then we want to be careful not to call the last two genes as differentially expressed, when they are really just affected by the lack of gene A. One option could be to normalise with the median of the read distribution, although due to many genes being non-expressed this often is implemented as an upper quartile.
TMM normalisation however takes some subset of the “genes” (windows in our case), having trimmed off the very high/very low expressed “genes”, to calculate this scaling factor for total number of fragments.
However, we find that for methylation, after filtering for excluded regions, extremely few windows in any sample have more than 100 reads. So the distribution of read counts is only actually between 0-100ish, making the main necessity for TMM normalisation rather unnecessary here.
There are a couple of reasons why we therefore choose not to apply TMM:
calculateDMRs calculation), there is no effect of the library_factor, as it is normalised out in the beta value fitting process.qsea. This therefore means that the normalisation of all the samples depends on which other samples they were normalised with, meaning that comparing between batches would need renormalising every time. This would also mean that filtering samples pre- or post-normalisation would give different results.However, we note that these considerations might not apply for everyone; the qsea::addLibraryFactors() function may be used to apply this normalisation if required.
#> R version 4.6.0 RC (2026-04-17 r89917)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] BSgenome.Hsapiens.UCSC.hg19_1.4.3 BSgenome_1.80.0
#> [3] rtracklayer_1.72.0 BiocIO_1.22.0
#> [5] Biostrings_2.80.0 XVector_0.52.0
#> [7] GenomicRanges_1.64.0 Seqinfo_1.2.0
#> [9] tibble_3.3.1 org.Mm.eg.db_3.23.0
#> [11] org.Hs.eg.db_3.23.1 AnnotationDbi_1.74.0
#> [13] IRanges_2.46.0 S4Vectors_0.50.0
#> [15] Biobase_2.72.0 BiocGenerics_0.58.0
#> [17] generics_0.1.4 stringr_1.6.0
#> [19] tidyr_1.3.2 ggplot2_4.0.3
#> [21] mesa_0.99.1 dplyr_1.2.1
#> [23] qsea_1.38.0
#>
#> loaded via a namespace (and not attached):
#> [1] filelock_1.0.3
#> [2] bitops_1.0-9
#> [3] ggplotify_0.1.3
#> [4] polyclip_1.10-7
#> [5] preprocessCore_1.74.0
#> [6] janitor_2.2.1
#> [7] enrichit_0.1.4
#> [8] XML_3.99-0.23
#> [9] httr2_1.2.2
#> [10] lifecycle_1.0.5
#> [11] doParallel_1.0.17
#> [12] lattice_0.22-9
#> [13] MASS_7.3-65
#> [14] magrittr_2.0.5
#> [15] limma_3.68.1
#> [16] sass_0.4.10
#> [17] rmarkdown_2.31
#> [18] jquerylib_0.1.4
#> [19] yaml_2.3.12
#> [20] plotrix_3.8-14
#> [21] otel_0.2.0
#> [22] ggtangle_0.1.2
#> [23] DBI_1.3.0
#> [24] RColorBrewer_1.1-3
#> [25] lubridate_1.9.5
#> [26] abind_1.4-8
#> [27] purrr_1.2.2
#> [28] RCurl_1.98-1.18
#> [29] yulab.utils_0.2.4
#> [30] tweenr_2.0.3
#> [31] rappdirs_0.3.4
#> [32] gdtools_0.5.0
#> [33] circlize_0.4.18
#> [34] enrichplot_1.32.0
#> [35] ggrepel_0.9.8
#> [36] tidytree_0.4.7
#> [37] MEDIPS_1.64.0
#> [38] ChIPseeker_1.48.0
#> [39] codetools_0.2-20
#> [40] DelayedArray_0.38.1
#> [41] DNAcopy_1.86.0
#> [42] DOSE_4.6.0
#> [43] ggforce_0.5.0
#> [44] tidyselect_1.2.1
#> [45] shape_1.4.6.1
#> [46] aplot_0.2.9
#> [47] UCSC.utils_1.8.0
#> [48] HMMcopy_1.54.0
#> [49] farver_2.1.2
#> [50] BiocFileCache_3.2.0
#> [51] matrixStats_1.5.0
#> [52] GenomicAlignments_1.48.0
#> [53] jsonlite_2.0.0
#> [54] GetoptLong_1.1.1
#> [55] iterators_1.0.14
#> [56] systemfonts_1.3.2
#> [57] foreach_1.5.2
#> [58] tools_4.6.0
#> [59] ggnewscale_0.5.2
#> [60] progress_1.2.3
#> [61] treeio_1.36.1
#> [62] TxDb.Hsapiens.UCSC.hg19.knownGene_3.22.1
#> [63] Rcpp_1.1.1-1.1
#> [64] glue_1.8.1
#> [65] gridExtra_2.3
#> [66] SparseArray_1.12.2
#> [67] xfun_0.57
#> [68] MatrixGenerics_1.24.0
#> [69] GenomeInfoDb_1.48.0
#> [70] withr_3.0.2
#> [71] fastmap_1.2.0
#> [72] boot_1.3-32
#> [73] caTools_1.18.3
#> [74] digest_0.6.39
#> [75] timechange_0.4.0
#> [76] R6_2.6.1
#> [77] gridGraphics_0.5-1
#> [78] colorspace_2.1-2
#> [79] Cairo_1.7-0
#> [80] GO.db_3.23.1
#> [81] gtools_3.9.5
#> [82] biomaRt_2.68.0
#> [83] dichromat_2.0-0.1
#> [84] RSQLite_2.4.6
#> [85] cigarillo_1.2.0
#> [86] UpSetR_1.4.0
#> [87] utf8_1.2.6
#> [88] fontLiberation_0.1.0
#> [89] data.table_1.18.2.1
#> [90] prettyunits_1.2.0
#> [91] httr_1.4.8
#> [92] htmlwidgets_1.6.4
#> [93] S4Arrays_1.12.0
#> [94] scatterpie_0.2.6
#> [95] pkgconfig_2.0.3
#> [96] gtable_0.3.6
#> [97] MEDIPSData_1.48.0
#> [98] blob_1.3.0
#> [99] ComplexHeatmap_2.28.0
#> [100] S7_0.2.2
#> [101] htmltools_0.5.9
#> [102] fontBitstreamVera_0.1.1
#> [103] clue_0.3-68
#> [104] plyranges_1.32.0
#> [105] scales_1.4.0
#> [106] hues_0.2.0
#> [107] TxDb.Hsapiens.UCSC.hg38.knownGene_3.22.0
#> [108] png_0.1-9
#> [109] snakecase_0.11.1
#> [110] ggfun_0.2.0
#> [111] knitr_1.51
#> [112] reshape2_1.4.5
#> [113] rjson_0.2.23
#> [114] nlme_3.1-169
#> [115] curl_7.1.0
#> [116] cachem_1.1.0
#> [117] zoo_1.8-15
#> [118] GlobalOptions_0.1.4
#> [119] KernSmooth_2.23-26
#> [120] parallel_4.6.0
#> [121] restfulr_0.0.16
#> [122] pillar_1.11.1
#> [123] grid_4.6.0
#> [124] vctrs_0.7.3
#> [125] gplots_3.3.0
#> [126] tidydr_0.0.6
#> [127] dbplyr_2.5.2
#> [128] cluster_2.1.8.2
#> [129] evaluate_1.0.5
#> [130] magick_2.9.1
#> [131] GenomicFeatures_1.64.0
#> [132] cli_3.6.6
#> [133] compiler_4.6.0
#> [134] Rsamtools_2.28.0
#> [135] rlang_1.2.0
#> [136] crayon_1.5.3
#> [137] labeling_0.4.3
#> [138] plyr_1.8.9
#> [139] fs_2.1.0
#> [140] ggiraph_0.9.6
#> [141] stringi_1.8.7
#> [142] TxDb.Mmusculus.UCSC.mm10.knownGene_3.10.0
#> [143] BiocParallel_1.46.0
#> [144] lazyeval_0.2.3
#> [145] GOSemSim_2.38.0
#> [146] fontquiver_0.2.1
#> [147] Matrix_1.7-5
#> [148] hms_1.1.4
#> [149] patchwork_1.3.2
#> [150] bit64_4.8.0
#> [151] KEGGREST_1.52.0
#> [152] statmod_1.5.1
#> [153] SummarizedExperiment_1.42.0
#> [154] igraph_2.3.0
#> [155] memoise_2.0.1
#> [156] bslib_0.10.0
#> [157] ggtree_4.2.0
#> [158] bit_4.6.0
#> [159] ape_5.8-1