1 Introduction

The wavFeatExt package implements a wavelet-based methodology for feature extraction from copy-number alteration (CNA) profiles using the non-decimated Haar wavelet transform (NHWT). The main goal is to represent genome-wide CNA signals by multiscale wavelet coefficients that are better suited to prediction than the raw, highly correlated CNA values.

In typical CNA studies, each sample is measured at thousands of genomic windows (or genes), and neighbouring windows are strongly correlated because CNAs occur in contiguous segments. Directly fitting supervised learning models on these correlated predictors can lead to unstable variable selection and suboptimal classification performance. Wavelet-based feature extraction addresses this by decomposing each CNA profile into:

  • detail coefficients (local differences, capturing gains/losses at multiple scales), and

  • scaling coefficients (local averages at multiple scales),

which together provide a compact, multiresolution representation of the genome-wide CNA landscape.
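To make the two coefficient types concrete, the following base R sketch computes scale-1 Haar averages and differences for a toy profile. It is an illustration only: the vector x, the helper shift(), the circular boundary handling, and the division by 2 are assumptions made here, and the normalisation used inside wavFeatExt may differ.

```r
# Toy piecewise-constant profile: a gain spanning windows 5 to 8
x <- c(0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0)

# Hypothetical helper: element i of the result is v[i + k], wrapping around
shift <- function(v, k) v[((seq_along(v) - 1 + k) %% length(v)) + 1]

# Scale-1 non-decimated Haar coefficients (one coefficient per window)
sca1 <- (shift(x, 1) + x) / 2   # local average of neighbouring windows
det1 <- (shift(x, 1) - x) / 2   # local difference of neighbouring windows

which(det1 != 0)                # windows where the profile changes level
```

The detail coefficients are non-zero only at the two breakpoints of the gain, while the scaling coefficients are a smoothed copy of the profile; both vectors have the same length as x because the transform is not decimated.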

The package provides:

  • simulation of block-correlated CNA data (simulateCNA),

  • segmentation using circular binary segmentation (seg, CBS),

  • wavelet-based feature extraction (wavFeatExt),

  • PCA/ICA feature extraction for comparison (getPca, getIca), and

  • classification wrappers for several machine-learning methods (classifyWavFeatExt, classifyPcaIca).

This vignette gives a short, reproducible example of the wavFeatExt workflow on simulated CNA data, closely mirroring the analyses described in the original paper.

For any query about the package, please contact the maintainer, Maharani A. Ummi.

1.1 Installation

To install wavFeatExt, use the Bioconductor package manager:

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("wavFeatExt")

1.2 Getting Started

After installation, load the package to begin using wavFeatExt:

library(wavFeatExt)

1.3 Input data structure

Most functions in wavFeatExt expect CNA data in the form of a numeric matrix with

  • rows = samples (patients),

  • columns = genomic locations (windows or genes),

or a list of such matrices when multiple data sets (e.g. repeated simulations) are analysed.
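As a minimal sketch of this layout, the base R snippet below builds a small matrix and wraps it in a list. The object names (raw, cna, cna_list) and the random values are hypothetical; the transpose step assumes your source data stores windows in rows, which is common for genomic tables.

```r
# Hypothetical input stored as windows x samples
set.seed(1)
raw <- matrix(rnorm(8 * 6), nrow = 8, ncol = 6,
              dimnames = list(paste0("win", 1:8), paste0("sample", 1:6)))

# Transpose so that rows are samples and columns are genomic locations
cna <- t(raw)
stopifnot(is.matrix(cna), is.numeric(cna))
dim(cna)
# [1] 6 8

# Several data sets are passed as a plain list of such matrices
cna_list <- list(cna)
```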

For real data, a typical preprocessing pipeline is:

  1. Align reads and compute depth-of-coverage per genomic window.
  2. Normalise coverage (e.g. for GC content and tumour purity).
  3. Segment each profile using circular binary segmentation (CBS).

In this vignette we will work with simulated, already segmented data from simulateCNA, which mimics these characteristics.

2 Simulation

2.1 Simulating CNA data

The function simulateCNA() generates segmented CNA profiles for two tumour subtypes under a multivariate normal model with a block correlation structure, similar to the simulation study in Ummi et al. (2022).

The following call simulates a single data set:

# One simulated data set with moderate dimension

sim_dat <- simulateCNA(
  n.obs   = 40,
  p       = 64,
  n.sim   = 1,
  n.block = 8,
  verbose = FALSE
)

sim_dat is a list of length n.sim; each element is an n.obs x p matrix of segmented CNA values. By construction, the first half of the rows correspond to subtype 1, and the second half to subtype 2.
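For intuition about what a block correlation structure means, here is a base R sketch that draws multivariate normal data with equicorrelated blocks and shifts the mean of one block in the second half of the samples. This is not the implementation of simulateCNA(): the within-block correlation rho, the mean shift shift_size, the choice of the first block as the differential region, and all object names are assumptions for illustration, and no segmentation step is applied.

```r
set.seed(42)
n_obs <- 40; p <- 64; n_block <- 8
rho        <- 0.7   # assumed within-block correlation
shift_size <- 1     # assumed mean difference between subtypes

# Block-diagonal correlation matrix: correlation rho within a block, 0 between
block_size <- p / n_block
one_block  <- matrix(rho, block_size, block_size)
diag(one_block) <- 1
Sigma <- kronecker(diag(n_block), one_block)

# Correlated normal data via the Cholesky factor (chol() returns R with R'R = Sigma)
Z     <- matrix(rnorm(n_obs * p), n_obs, p)
X_toy <- Z %*% chol(Sigma)

# Shift the first block for the second half of the samples ("subtype 2")
second_half <- (n_obs / 2 + 1):n_obs
X_toy[second_half, 1:block_size] <- X_toy[second_half, 1:block_size] + shift_size
```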

We now create a binary response factor:

X <- sim_dat[[1]]
n <- nrow(X)

y <- factor(rep(c("Subtype1", "Subtype2"), each = n / 2))
table(y)
# y
# Subtype1 Subtype2 
#       20       20

2.2 Wavelet-based feature extraction

The core function wavFeatExt() applies the non-decimated Haar wavelet transform across genomic locations for each sample and returns either detail or scaling coefficients at all available scales.

# Extract wavelet detail coefficients (differences)

det_coef <- wavFeatExt(sim_dat, type = "detail")

# Extract wavelet scaling coefficients (averages)

sca_coef <- wavFeatExt(sim_dat, type = "scaling")

length(det_coef)          # number of simulated data sets
# [1] 1
length(det_coef[[1]])     # number of scales for detail coefficients
# [1] 5
dim(det_coef[[1]][[1]])   # samples x windows at the first scale
# [1] 40 64

For each simulated data set i and scale m:

  • det_coef[[i]][[m]] is the matrix of detail coefficients at scale m, and

  • sca_coef[[i]][[m]] is the corresponding matrix of scaling coefficients.

Each matrix has the same dimension as the original CNA data (n.obs x p). At fine scales, detail coefficients represent differences between nearby windows; at coarser scales, they summarise changes over longer genomic regions.
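The per-scale structure described above can be sketched in base R. The function nd_haar() below is a hypothetical helper, not the package's code: it assumes a circular boundary and unnormalised averages and differences, so its values need not match wavFeatExt() numerically. It does show why every scale yields a matrix with the same dimension as the input.

```r
# Non-decimated Haar transform of one profile (assumed: circular, unnormalised)
nd_haar <- function(x, n_scales) {
  p <- length(x)
  shift <- function(v, k) v[((seq_len(p) - 1 + k) %% p) + 1]
  smooth <- x
  detail <- scaling <- vector("list", n_scales)
  for (m in seq_len(n_scales)) {
    lag   <- 2^(m - 1)                  # distance between compared positions
    ahead <- shift(smooth, lag)
    detail[[m]]  <- (ahead - smooth) / 2  # differences over 2^m windows
    smooth       <- (ahead + smooth) / 2  # averages over 2^m windows
    scaling[[m]] <- smooth
  }
  list(detail = detail, scaling = scaling)
}

set.seed(1)
toy      <- matrix(rnorm(4 * 16), nrow = 4, ncol = 16)  # 4 samples, 16 windows
n_scales <- log2(ncol(toy))

# Transform each sample, then stack the results into one matrix per scale
per_sample  <- lapply(seq_len(nrow(toy)), function(i) nd_haar(toy[i, ], n_scales))
toy_detail  <- lapply(seq_len(n_scales), function(m)
  do.call(rbind, lapply(per_sample, function(w) w$detail[[m]])))
toy_scaling <- lapply(seq_len(n_scales), function(m)
  do.call(rbind, lapply(per_sample, function(w) w$scaling[[m]])))

sapply(toy_detail, dim)   # every scale keeps the samples x windows shape
```

In this sketch the coarsest scaling coefficients reduce to each sample's overall mean, which illustrates how scaling coefficients move from local to global averages as the scale increases.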

2.3 Visualising wavelet coefficients for a single profile

For illustrative purposes, we can inspect the NHWT coefficients of a single CNA profile using the legacy helpers nhwt() and plot.nhwt(). This is a convenient way to see how the transform behaves on one sample.


# Take one sample (row) from the simulated data

x1 <- X[1, ]

# Non-decimated Haar transform (detail and scaling coefficients)

nh_detail <- nhwt(x1, type = "detail")
nh_scaling <- nhwt(x1, type = "scaling")

# Plot the raw profile, then the detail and scaling coefficients by scale
plot(x1, type="l", xlab="", ylab="", main="Simulated CNA data")

plot(nh_detail, coef = "detail", type = "by.level", scale = "all")

plot(nh_scaling, coef = "scaling", type = "by.level", scale = "all")

2.4 Classification using wavelet features

After extracting wavelet coefficients, the classifyWavFeatExt() function provides a unified interface to several machine-learning classifiers implemented in glmnet, randomForest, nnet, pls, and class. The function accepts:

  • data – a list of CNA matrices or simulated CNA output

  • y – a factor of class labels

  • det – list of wavelet detail coefficients

  • sca – list of wavelet scaling coefficients

  • method – one of "lasso", "elnet", "RF", "NN", "PLS", or "KNN"

  • k – number of folds for cross-validation

  • ite – number of datasets to evaluate (defaults to all)

Below is an example classification based on wavelet detail and scaling coefficients:


# Binary response (for example, first 20 vs last 20 samples)
y <- factor(c(rep("Group1", 20), rep("Group2", 20)))

# Perform classification using k-nearest neighbours (KNN)
res_KNN <- classifyWavFeatExt(sim_dat, y, det=det_coef, sca=sca_coef,
                          method = "KNN", k = 5, ite = 2)

The output typically contains:

  • a numeric matrix of cross-validated misclassification errors,

  • a numeric matrix of cross-validated areas under the ROC curve (AUC), and

  • the classification method used.
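To show what a cross-validated misclassification error is, the sketch below runs stratified 5-fold cross-validation with class::knn() on synthetic features. It is not the internals of classifyWavFeatExt(): the feature matrix feat, the fold assignment, and the number of neighbours (n_neighbours = 3) are assumptions. Note that the k argument of classifyWavFeatExt() is the number of folds, which corresponds to n_folds here.

```r
library(class)   # provides knn(); a recommended package shipped with R

set.seed(1)
n <- 40; p <- 10
y_toy <- factor(rep(c("Group1", "Group2"), each = n / 2))

# Synthetic features: the first three columns separate the two groups
feat <- matrix(rnorm(n * p), n, p)
feat[y_toy == "Group2", 1:3] <- feat[y_toy == "Group2", 1:3] + 2

# Stratified fold assignment: each fold gets the same number of samples per class
n_folds <- 5
fold <- integer(n)
for (lev in levels(y_toy)) {
  idx <- which(y_toy == lev)
  fold[idx] <- sample(rep(seq_len(n_folds), length.out = length(idx)))
}

# Hold out each fold in turn, predict it, and record the error rate
n_neighbours <- 3
ce <- sapply(seq_len(n_folds), function(f) {
  test <- fold == f
  pred <- knn(train = feat[!test, , drop = FALSE],
              test  = feat[test, , drop = FALSE],
              cl    = y_toy[!test], k = n_neighbours)
  mean(pred != y_toy[test])
})
mean(ce)   # cross-validated misclassification error
```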

Depending on the simulation settings (effect.diff, correlation strength), wavelet features often outperform PCA/ICA because they preserve local, multi-scale genomic changes that drive subtype differences.

3 Comparison with PCA/ICA

For comparison with the wavelet features, we extract PCA and ICA features and classify them with the same method:

# Obtain PCA/ICA features

pca_feat <- getPca(sim_dat, k = 5)
ica_feat <- getIca(sim_dat, k = 5)

# Classification using PCA and ICA features

res_pcaica <- classifyPcaIca(
  sim_dat,
  y,
  pca_feat,
  ica_feat,
  method = "KNN",
  k = 5,
  ite = 1
)
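For readers unfamiliar with PCA feature extraction, the following sketch uses stats::prcomp() to reduce a samples x windows matrix to a handful of component scores. It is a stand-in, not the body of getPca(): the random matrix toy, the centring and scaling choices, and the reading of k = 5 as "keep five components" are assumptions. ICA is omitted to keep the example in base R.

```r
set.seed(1)
toy <- matrix(rnorm(40 * 64), nrow = 40, ncol = 64)   # 40 samples, 64 windows

n_comp  <- 5                                           # assumed number of components
pca_fit <- prcomp(toy, center = TRUE, scale. = FALSE)

# Scores on the first n_comp components: one row per sample, one column per component
pca_scores <- pca_fit$x[, seq_len(n_comp)]
dim(pca_scores)

# Proportion of variance captured by each retained component
var_explained <- pca_fit$sdev^2 / sum(pca_fit$sdev^2)
round(var_explained[seq_len(n_comp)], 3)
```

Unlike wavelet coefficients, each principal component is a global linear combination of all windows, so the scores do not retain the genomic position of a change.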

4 Plotting classification error and AUC

The object returned by classifyWavFeatExt() contains two matrices:

  • CE – misclassification error for each feature set (detail and scaling coefficients at each scale, plus the segmented data), and

  • AUC – the corresponding area under the ROC curve.
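As a reminder of how the two metrics are defined, the sketch below computes both for four hypothetical predictions. The helper auc_rank() uses the rank-sum (Mann-Whitney) formulation of the AUC and is an assumption for illustration; the package may compute AUC with a different routine, and the scores and the 0.5 threshold are made up.

```r
# AUC via the rank-sum identity: the probability that a randomly chosen
# positive sample scores higher than a randomly chosen negative one
auc_rank <- function(score, label, positive) {
  is_pos <- label == positive
  n_pos  <- sum(is_pos)
  n_neg  <- sum(!is_pos)
  r <- rank(score)                      # midranks handle tied scores
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

label <- c("Subtype1", "Subtype1", "Subtype2", "Subtype2")
score <- c(0.10, 0.40, 0.35, 0.80)      # hypothetical predicted P(Subtype2)

# Misclassification error after thresholding the scores at 0.5
pred <- ifelse(score > 0.5, "Subtype2", "Subtype1")
ce   <- mean(pred != label)
ce
# [1] 0.25

auc <- auc_rank(score, label, positive = "Subtype2")
auc
# [1] 0.75
```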

The package provides an S3 method plot.classifyWavFeatExt() to visualise these as boxplots across replications (ite).


## Misclassification error per scale (and segmented baseline)

plot(res_KNN, type = "CE", ylab = "Misclassification error")


## AUC per scale (and segmented baseline)

plot(res_KNN, type = "AUC", ylab = "Area under ROC curve")

Interpretation:

  • Each box corresponds to one feature set: D1, D2, … = detail scales, S1, S2, … = scaling scales, seg = original segmented data.

  • The boxes summarise the distribution of CE/AUC over ite repetitions.

  • The red dashed line is the median performance of the segmented/original data (seg).

  • The blue dotted line marks the best median performance among all feature sets (highest AUC or lowest CE).
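The two reference lines can be reproduced with base graphics, which may help when customising the figure. The sketch below is not the code of plot.classifyWavFeatExt(): the matrix ce_toy holds synthetic error rates generated purely for illustration, and the column names are placeholders for feature sets.

```r
set.seed(1)
# Synthetic misclassification errors: 10 replications x 4 feature sets
ce_toy <- cbind(D1  = runif(10, 0.10, 0.30),
                D2  = runif(10, 0.05, 0.20),
                S1  = runif(10, 0.15, 0.35),
                seg = runif(10, 0.20, 0.40))

med <- apply(ce_toy, 2, median)   # median error per feature set

boxplot(ce_toy, ylab = "Misclassification error")
abline(h = med["seg"], col = "red",  lty = 2)  # baseline: segmented data
abline(h = min(med),   col = "blue", lty = 3)  # best (lowest) median error
```

For an AUC matrix the best feature set is the one with the highest median, so max(med) would replace min(med).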

If you also use PCA/ICA, you can similarly call:

plot(res_pcaica, type = "CE")   # or type = "AUC"

5 Session Information

sessionInfo()
# R version 4.6.0 RC (2026-04-17 r89917)
# Platform: x86_64-pc-linux-gnu
# Running under: Ubuntu 24.04.4 LTS
# 
# Matrix products: default
# BLAS:   /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so 
# LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
# 
# locale:
#  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#  [3] LC_TIME=en_GB              LC_COLLATE=C              
#  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
# [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
# 
# time zone: America/New_York
# tzcode source: system (glibc)
# 
# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base     
# 
# other attached packages:
# [1] wavFeatExt_0.99.21 BiocStyle_2.39.0  
# 
# loaded via a namespace (and not attached):
#  [1] tidyselect_1.2.1     neuralnet_1.44.2     timeDate_4052.112   
#  [4] dplyr_1.2.1          farver_2.1.2         S7_0.2.2            
#  [7] fastmap_1.2.0        pROC_1.19.0.1        caret_7.0-1         
# [10] digest_0.6.39        rpart_4.1.27         timechange_0.4.0    
# [13] lifecycle_1.0.5      survival_3.8-6       magrittr_2.0.5      
# [16] compiler_4.6.0       rlang_1.2.0          sass_0.4.10         
# [19] tools_4.6.0          yaml_2.3.12          data.table_1.18.2.1 
# [22] knitr_1.51           plyr_1.8.9           RColorBrewer_1.1-3  
# [25] withr_3.0.2          purrr_1.2.2          nnet_7.3-20         
# [28] grid_4.6.0           stats4_4.6.0         e1071_1.7-17        
# [31] future_1.70.0        ggplot2_4.0.3        globals_0.19.1      
# [34] scales_1.4.0         iterators_1.0.14     MASS_7.3-65         
# [37] tinytex_0.59         dichromat_2.0-0.1    cli_3.6.6           
# [40] rmarkdown_2.31       generics_0.1.4       otel_0.2.0          
# [43] ica_1.0-3            future.apply_1.20.2  reshape2_1.4.5      
# [46] DNAcopy_1.85.0       cachem_1.1.0         proxy_0.4-29        
# [49] stringr_1.6.0        splines_4.6.0        parallel_4.6.0      
# [52] BiocManager_1.30.27  matrixStats_1.5.0    vctrs_0.7.3         
# [55] hardhat_1.4.3        glmnet_4.1-10        Matrix_1.7-5        
# [58] jsonlite_2.0.0       bookdown_0.46        listenv_0.10.1      
# [61] magick_2.9.1         wavethresh_4.7.3     foreach_1.5.2       
# [64] gower_1.0.2          jquerylib_0.1.4      pls_2.9-0           
# [67] recipes_1.3.2        glue_1.8.1           parallelly_1.47.0   
# [70] codetools_0.2-20     lubridate_1.9.5      stringi_1.8.7       
# [73] gtable_0.3.6         shape_1.4.6.1        tibble_3.3.1        
# [76] pillar_1.11.1        htmltools_0.5.9      ipred_0.9-15        
# [79] randomForest_4.7-1.2 lava_1.9.0           R6_2.6.1            
# [82] evaluate_1.0.5       lattice_0.22-9       bslib_0.10.0        
# [85] class_7.3-23         Rcpp_1.1.1-1.1       nlme_3.1-169        
# [88] prodlim_2026.03.11   xfun_0.57            ModelMetrics_1.2.2.2
# [91] pkgconfig_2.0.3