This vignettes shows how to use single, a package to improve consensus calling on nanopore sequencing of gene libraries.
single 1.7.0
SINGLe computes consensus sequence of DNA reads by (noisy) nanopore sequencing. It is focused on long amplicons sequencing, and it aims to the reads of gene libraries, typically used in directed evolution experiments.
SINGLe takes advantage that gene libraries are created from an original wild type or reference sequence, and it characterizes the systematic errors made by nanopore sequencing. Then, uses that information to correct the confidence values (QUAL) assigned to each nucleotide read in the mutants library.
Finally, given that you can identify which variant was read in each case (for example by the use of unique molecular identifiers or DNA barcodes), SINGLe groups them and computes the consensus sequence by weighting the frequencies with the corrected confidence values.
For more details, please refer to our pre-print “Accurate gene consensus at low nanopore coverage” doi: https://doi.org/10.1101/2020.03.25.007146 for more information.
Using bioconductor:
if (!require(“BiocManager”, quietly = TRUE)) install.packages(“BiocManager”)
BiocManager::install(“single”)
To use SINGLe you must have the following data:
Align nanopore reads to the reference sequence and create a sorted bam file.
For the reads of the reference:
minimap2 -ax map-ont --sam-hit-only REF.fasta REF_READS.fastq > REF_READS.sam
samtools view -S -b REF_READS.sam > REF_READS.bam
samtools sort REF_READS.bam -o REF_READS.sorted.bamAnd for the reads of the library:
minimap2 -ax map-ont --sam-hit-only  REF.fasta LIB_READS.fastq >LIB_READS.sam
samtools view -S -b LIB_READS.sam > LIB_READS.bam
samtools sort LIB_READS.bam -o LIB_READS.sorted.bamsSINGLe consists on three steps: train model, evaluate model, compute consensus. As it can be time consuming, I will only analyze a subset of positions
library(single)
pos_start <- 1
pos_end <- 100
refseq_fasta <- system.file("extdata", "ref_seq.fasta", package = "single")
ref_seq <- Biostrings::subseq(Biostrings::readDNAStringSet(refseq_fasta), pos_start,pos_end)First, train the model using nanopore reads of the reference (wild type).
REF_READS <- system.file("extdata", "train_seqs_500.sorted.bam",package = "single")
train <- single_train(bamfile=REF_READS,
                      output="train",
                      refseq_fasta=refseq_fasta,
                      rates.matrix=mutation_rate,
                      mean.n.mutations=5.4,
                      pos_start=pos_start,
                      pos_end=pos_end,
                      verbose=FALSE,
                      save_partial=FALSE,
                      save_final= FALSE)## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not convergeprint(head(train))##   pos strand nucleotide QUAL  p_SINGLe isWT
## 1   1      +          A    1 0.2056718 TRUE
## 2   1      -          A    1 0.2056718 TRUE
## 3   1      +          A    2 0.3690427 TRUE
## 4   1      -          A    2 0.3690427 TRUE
## 5   1      +          A    3 0.4988128 TRUE
## 6   1      -          A    3 0.4988128 TRUESecond, evaluate model: use the fitted model to evaluate the reads of your library, and re-weight the QUAL (quality scores).
LIB_READS <- system.file("extdata","test_sequences.sorted.bam",package ="single")
corrected_reads <- single_evaluate(bamfile = LIB_READS,
                single_fits = train,
                refseq_fasta=refseq_fasta,
                pos_start=pos_start,pos_end=pos_end,
                verbose=FALSE,
                gaps_weights = "minimum",
                save = FALSE,
                save_original_scores = FALSE)
corrected_reads##   A QualityScaledDNAStringSet instance containing:
## 
## DNAStringSet object of length 40:
##      width seq                                              names               
##  [1]   100 ATGCGTCTGCTGCATGAATTTGG...GCATTTGTTGGTTTTGTTCTGA 1944838d-7824-496...
##  [2]   100 ATGCGTCTGCTGCATGAATTTGG...GCATTTGTTGGTTTTGTTCTGA 100a992c-60c8-495...
##  [3]   100 ATGCGTCTGCTGCATGAATTTGG...GCATTTGTTGG-TTTGTTCTGA 82027877-02d8-40f...
##  [4]   100 ATGCGTCTGCTGCATGAATTTGG...GCATTTGTTGGTTTTGTTCTGA e4e1a99b-e8ea-4bc...
##  [5]   100 ATGCGTCTGCTGCCTGAATTTGA...GCATTTGTTGGTTTTGTTCTGA dacd1e41-d654-432...
##  ...   ... ...
## [36]   100 ATGCGTCTGCTGCATGAATTTGG...GCATTTGTTGGTTTTTTTCTGA 2b9cc86d-275f-493...
## [37]   100 ATGCGTCTGCTGCATGAATTTGG...GCATTTGTTGGTTTTCTTCTGA 158ddec8-702c-451...
## [38]   100 ATGCGTCTGCTGCATGAATTTGG...GCATTTGTTGGTTTTCTTCTGA 7f1a9ec1-688a-4ba...
## [39]   100 ATGCGTCTGCTGCATGAATTTGG...GCATTTGTTGGTTTT------- 865c645d-6b61-4ed...
## [40]   100 ATGCGTCTGCTGCATGAATTTGG...GCATTTGTTGGTTTT------- 72627b8a-6b8f-4de...
## 
## PhredQuality object of length 40:
##      width seq                                              names               
##  [1]   100 667;8888:>==9(''''79:<9.....:>AB?>>=::;:7<>>>>?? 1944838d-7824-496...
##  [2]   100 &&'+-23::<::;<<=>?BAB?:...'(6:;=<>>;:==@2:><=9:: 100a992c-60c8-495...
##  [3]   100 455656789=;<<::::;?BB>0...568:=><<<87":;60467210 82027877-02d8-40f...
##  [4]   100 )),0/430/+*&&&&)--4;?;;...55578C=@?>:::;24::-,11 e4e1a99b-e8ea-4bc...
##  [5]   100 63>A?@>@;975+),./00./00...<<;;==<A@=<>??+0764-.. dacd1e41-d654-432...
##  ...   ... ...
## [36]   100 **)*6666>==??>@ID@=;::<...AA;/....981-*)(*2//077 2b9cc86d-275f-493...
## [37]   100 9:?>???A@??@?@DBB>CCA?A...>>>@????C@;=41//1))*?= 158ddec8-702c-451...
## [38]   100 ****:75510,,,,,?<:9:::7...@@?><=<;=;7102++4...71 7f1a9ec1-688a-4ba...
## [39]   100 7666<:982111177ACA@2111...>>066/.'''&'*+******** 865c645d-6b61-4ed...
## [40]   100 --..=:896011188I<867555...?@?;;;;:?=:86766666666 72627b8a-6b8f-4de...Finally, use the reads of the library with the corrected QUAL scores to compute a weighted consensus sequences in subsets of reads. The sets of reads corresponding to each variant are indicated in a table (here BC_TABLE) of two columns: SeqID (name of the read) and BCid (barcode or group identity).
BC_TABLE <- system.file("extdata", "Barcodes_table.txt",package = "single")
consensus <- single_consensus_byBarcode(barcodes_table = BC_TABLE,
                           sequences = corrected_reads,
                           verbose = FALSE)
consensus## DNAStringSet object of length 3:
##     width seq                                               names               
## [1]   100 ATGCGTCTGCTGCATGAATTTGG...TGCATTTGTTGGTTTTGTTCTGA 5
## [2]   100 ATGCGTCTGCTGCATGAATTTGA...TGCATTTGTTGGTTTTGTTCTGA 6
## [3]   100 ATGCGTCTGCTGCATGAATTTGG...TGCATTTGTTGGTTTTCTTCTGA 7Use pileup to create a data.frame with counts by position nucleotide and quality score
counts_pnq <- pileup_by_QUAL(bam_file=REF_READS,
                    pos_start=pos_start,
                    pos_end=pos_end)
head(counts_pnq)##   pos strand nucleotide count QUAL
## 1   1      +          A     1    2
## 2   1      +          A     4    3
## 3   1      +          A     6    4
## 4   1      +          A     6    5
## 5   1      +          A    15    6
## 6   1      +          A    13    7Compute a priori probability of making errors
p_prior_errors <- p_prior_errors(counts_pnq=counts_pnq,
                                  save=FALSE)
p_prior_errors## # A tibble: 570 × 4
##    strand   pos nucleotide p_prior_error
##    <fct>  <dbl> <fct>              <dbl>
##  1 +          1 A                 1     
##  2 -          1 A                 1     
##  3 +          2 T                 1     
##  4 -          2 T                 1     
##  5 +          3 G                 1     
##  6 -          3 G                 1     
##  7 +          4 C                 1     
##  8 -          4 C                 0.972 
##  9 -          4 T                 0.0278
## 10 +          5 A                 0.0235
## # ℹ 560 more rowsCompute a priori probability of having a mutation
p_prior_mutations <- p_prior_mutations(rates.matrix = mutation_rate,
                        mean.n.mut = 5,ref_seq = ref_seq,save = FALSE)
head(p_prior_mutations)##   wt.base nucleotide  p_mutation
## 2       C          A 0.015190037
## 3       G          A 0.027471344
## 4       T          A 0.030703266
## 5       A          C 0.005063346
## 7       G          C 0.004416961
## 8       T          C 0.018852883Fit SINGLe logistic regression using the prior probabilities and the counts
fits <- fit_logregr(counts_pnq = counts_pnq,ref_seq=ref_seq,
                    p_prior_errors = p_prior_errors,
                    p_prior_mutations = p_prior_mutations,
                    save=FALSE)## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not convergehead(fits)## # A tibble: 6 × 5
##   strand   pos nucleotide prior_slope prior_intercept
##   <fct>  <dbl> <fct>            <dbl>           <dbl>
## 1 +          1 C                   NA              NA
## 2 +          1 G                   NA              NA
## 3 +          1 T                   NA              NA
## 4 +          1 -                   NA              NA
## 5 +          2 A                   NA              NA
## 6 +          2 C                   NA              NAUse the fits to obtain the replacement Qscores after SINGLe fit, for all possible QUAL, nucleotide and position values
evaluated_fits <- evaluate_fits(pos_range = c(1,5),q_range = c(0,10),
                                data_fits = fits,ref_seq = ref_seq,
                                save=FALSE,verbose = FALSE)
head(evaluated_fits)##   pos strand nucleotide QUAL  p_SINGLe isWT
## 1   1      +          A    0 0.0000000 TRUE
## 2   1      -          A    0 0.0000000 TRUE
## 3   1      +          A    1 0.2056718 TRUE
## 4   1      -          A    1 0.2056718 TRUE
## 5   1      +          A    2 0.3690427 TRUE
## 6   1      -          A    2 0.3690427 TRUE## R version 4.4.0 beta (2024-04-15 r86425)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] single_1.7.0     BiocStyle_2.31.0
## 
## loaded via a namespace (and not attached):
##  [1] SummarizedExperiment_1.33.3 xfun_0.43                  
##  [3] bslib_0.7.0                 Biobase_2.63.1             
##  [5] lattice_0.22-6              vctrs_0.6.5                
##  [7] tools_4.4.0                 bitops_1.0-7               
##  [9] generics_0.1.3              stats4_4.4.0               
## [11] parallel_4.4.0              tibble_3.2.1               
## [13] fansi_1.0.6                 pkgconfig_2.0.3            
## [15] Matrix_1.7-0                S4Vectors_0.41.6           
## [17] lifecycle_1.0.4             GenomeInfoDbData_1.2.12    
## [19] compiler_4.4.0              stringr_1.5.1              
## [21] Rsamtools_2.19.4            Biostrings_2.71.5          
## [23] codetools_0.2-20            GenomeInfoDb_1.39.14       
## [25] htmltools_0.5.8.1           sass_0.4.9                 
## [27] yaml_2.3.8                  pillar_1.9.0               
## [29] crayon_1.5.2                jquerylib_0.1.4            
## [31] tidyr_1.3.1                 BiocParallel_1.37.1        
## [33] cachem_1.0.8                DelayedArray_0.29.9        
## [35] abind_1.4-5                 tidyselect_1.2.1           
## [37] digest_0.6.35               stringi_1.8.3              
## [39] dplyr_1.1.4                 reshape2_1.4.4             
## [41] purrr_1.0.2                 bookdown_0.39              
## [43] fastmap_1.1.1               grid_4.4.0                 
## [45] cli_3.6.2                   SparseArray_1.3.5          
## [47] magrittr_2.0.3              S4Arrays_1.3.7             
## [49] utf8_1.2.4                  withr_3.0.0                
## [51] UCSC.utils_0.99.7           rmarkdown_2.26             
## [53] XVector_0.43.1              httr_1.4.7                 
## [55] matrixStats_1.3.0           evaluate_0.23              
## [57] knitr_1.46                  GenomicRanges_1.55.4       
## [59] IRanges_2.37.1              rlang_1.1.3                
## [61] Rcpp_1.0.12                 glue_1.7.0                 
## [63] BiocManager_1.30.22         BiocGenerics_0.49.1        
## [65] jsonlite_1.8.8              R6_2.5.1                   
## [67] plyr_1.8.9                  MatrixGenerics_1.15.1      
## [69] GenomicAlignments_1.39.5    zlibbioc_1.49.3