The wavFeatExt package implements a
wavelet-based methodology for feature extraction from copy-number
alteration (CNA) profiles using the non-decimated Haar wavelet transform
(NHWT). The main goal is to represent genome-wide CNA signals by
multiscale wavelet coefficients that are more suitable for prediction
than the raw, highly correlated CNA values.
In typical CNA studies, each sample is measured at thousands of genomic windows (or genes), and neighbouring windows are strongly correlated because CNAs occur in contiguous segments. Directly fitting supervised learning models on these correlated predictors can lead to unstable variable selection and suboptimal classification performance. Wavelet-based feature extraction addresses this by decomposing each CNA profile into detail coefficients (local differences, capturing gains and losses at multiple scales) and scaling coefficients (local averages at multiple scales), which together provide a compact, multiresolution representation of the genome-wide CNA landscape.
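To make the idea of detail and scaling coefficients concrete, here is a minimal base-R sketch of a non-decimated Haar decomposition of a single profile. The helper haar_nd() is hypothetical and not part of wavFeatExt; it assumes circular (periodic) boundary handling and a divide-by-two normalisation, either of which may differ from what the package does internally.

```r
# Hypothetical helper (not part of wavFeatExt): non-decimated Haar transform
# of one profile, assuming periodic boundaries and averaging normalisation.
haar_nd <- function(x, n_scales) {
  p <- length(x)
  smooth <- x
  detail <- vector("list", n_scales)
  scaling <- vector("list", n_scales)
  for (j in seq_len(n_scales)) {
    shift <- 2^(j - 1)
    idx <- ((seq_len(p) - 1 + shift) %% p) + 1  # circular shift by 2^(j-1)
    shifted <- smooth[idx]
    detail[[j]] <- (smooth - shifted) / 2       # local difference at scale j
    smooth <- (smooth + shifted) / 2            # local average at scale j
    scaling[[j]] <- smooth
  }
  list(detail = detail, scaling = scaling)
}

# Toy profile: a single step from 0 to 1 (a "gain" in the second half)
x <- c(rep(0, 8), rep(1, 8))
out <- haar_nd(x, n_scales = 3)

which(out$detail[[1]] != 0)  # finest-scale details are non-zero only at the breakpoints
out$scaling[[3]][5]          # average of x[5:12]
```

Because nothing is down-sampled, every scale keeps one coefficient per genomic window, so the coefficients at each scale can be used directly as a feature vector of the same length as the original profile.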
The package provides:
simulation of block-correlated CNA data (simulateCNA),
segmentation using circular binary segmentation (seg, CBS),
wavelet-based feature extraction (wavFeatExt),
PCA/ICA feature extraction for comparison (getPca, getIca), and
classification wrappers for several machine-learning methods (classifyWavFeatExt, classifyPcaIca).
This vignette gives a short, reproducible example of the wavFeatExt workflow on simulated CNA data, closely mirroring the analyses described in the original paper.
For any query about the package, please contact the maintainer, Maharani A. Ummi (maharaniahsani@itb.ac.id).
To install wavFeatExt, use the Bioconductor package manager:
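A typical installation looks like the following, assuming the package is distributed through Bioconductor under the name wavFeatExt (the package name in the call is taken from this vignette; availability in a given Bioconductor release is an assumption):

```r
# Install BiocManager from CRAN if it is not already available
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}

# Install the package (assumes it is available in your Bioconductor release)
BiocManager::install("wavFeatExt")
```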
Most functions in wavFeatExt expect CNA data in the form of a numeric matrix with
rows = samples (patients),
columns = genomic locations (windows or genes),
or a list of such matrices when multiple data sets (e.g. repeated simulations) are analysed.
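As a small illustration of this layout, the sketch below builds a toy matrix and a list of matrices in base R. The object names (cna_mat, cna_list) and the random values are placeholders for illustration only.

```r
set.seed(1)

# Toy CNA matrix: 6 samples (rows) by 8 genomic windows (columns)
cna_mat <- matrix(
  rnorm(6 * 8), nrow = 6, ncol = 8,
  dimnames = list(paste0("sample", 1:6), paste0("win", 1:8))
)

# Several data sets (e.g. repeated simulations) are passed as a list of matrices
cna_list <- list(cna_mat, cna_mat + 0.1)

dim(cna_mat)                             # samples x windows
all(vapply(cna_list, is.matrix, logical(1)))
```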
For real data, a typical preprocessing pipeline is:
In this vignette we will work with simulated, already segmented data from simulateCNA, which mimics these characteristics.
The function simulateCNA() generates
segmented CNA profiles for two tumour subtypes under a multivariate
normal model with a block correlation structure, similar to the
simulation study in Ummi et al. (2022).
# One simulated data set with moderate dimension
sim_dat <- simulateCNA(
n.obs = 40,
p = 64,
n.sim = 1,
n.block = 8,
verbose = FALSE
)
sim_dat is a list of length n.sim; each element is an n.obs x p matrix of segmented CNA values.
By construction, the first half of the rows correspond to subtype 1, and
the second half to subtype 2.
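For intuition about what a block correlation structure looks like, the following sketch draws two groups from a multivariate normal distribution with a block-diagonal covariance matrix using MASS::mvrnorm(). It is not the implementation of simulateCNA(); the correlation rho, the mean shift and all object names are assumptions chosen for illustration, and no segmentation step is applied.

```r
library(MASS)  # for mvrnorm()
set.seed(42)

p <- 16          # number of genomic windows
n_block <- 4     # number of correlated blocks
rho <- 0.8       # assumed within-block correlation
block_size <- p / n_block

# Within a block every pair of windows has correlation rho; blocks are independent
block <- matrix(rho, block_size, block_size)
diag(block) <- 1
Sigma <- kronecker(diag(n_block), block)

# Subtype 2 differs from subtype 1 by a mean shift in the first block
mu1 <- rep(0, p)
mu2 <- mu1
mu2[seq_len(block_size)] <- 1

# First half of the rows: subtype 1; second half: subtype 2
X_sketch <- rbind(mvrnorm(10, mu1, Sigma), mvrnorm(10, mu2, Sigma))
dim(X_sketch)
```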
We now create a binary response factor:
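A minimal sketch of this step, assuming 40 samples split evenly between the two subtypes (the labels "Group1" and "Group2" match those used in the classification example later in this vignette):

```r
n_obs <- 40  # must match the n.obs used in the simulation

# First half of the samples: Group1; second half: Group2
y <- factor(rep(c("Group1", "Group2"), each = n_obs / 2))
table(y)
```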
The core function wavFeatExt() applies
the non-decimated Haar wavelet transform across genomic locations for
each sample and returns either detail or scaling coefficients at all
available scales.
# Extract wavelet detail coefficients (differences)
det_coef <- wavFeatExt(sim_dat, type = "detail")
# Extract wavelet scaling coefficients (averages)
sca_coef <- wavFeatExt(sim_dat, type = "scaling")
length(det_coef) # number of simulated data sets
# [1] 1
length(det_coef[[1]]) # number of scales for detail coefficients
# [1] 5
dim(det_coef[[1]][[1]]) # samples x windows at the first scale
# [1] 40 64
For each simulated data set:
det_coef[[i]][[m]] is the matrix of detail coefficients at scale m,
sca_coef[[i]][[m]] is the corresponding matrix of scaling coefficients.
Each matrix has the same dimension as the original CNA data
(n.obs x p). At fine scales, detail
coefficients represent differences between nearby windows; at coarser
scales, they summarise changes over longer genomic regions.
For illustrative purposes, we can inspect the NHWT coefficients of a
single CNA profile using the legacy helpers nhwt() and plot.nhwt().
This is a convenient way to see how the transform behaves on one sample.
# Take one sample (row) from the first simulated data set
X <- sim_dat[[1]]
x1 <- X[1, ]
# Non-decimated Haar transform: detail and scaling coefficients
nh_detail <- nhwt(x1, type = "detail")
nh_scaling <- nhwt(x1, type = "scaling")
# Plot the original CNA profile
plot(x1, type = "l", xlab = "", ylab = "", main = "Simulated CNA data")
After extracting wavelet coefficients, the classifyWavFeatExt() function provides a
unified interface to several machine-learning classifiers implemented in
glmnet, randomForest, nnet, pls, and class. The function accepts:
data – a list of CNA matrices or simulated CNA output
y – a factor of class labels
det – list of wavelet detail coefficients
sca – list of wavelet scaling coefficients
method – "lasso", "elnet", "RF", "NN", "PLS", or "KNN"
k – number of folds for cross-validation
ite – number of datasets to evaluate (defaults to all)
Below is an example classification based on wavelet detail coefficients:
# Binary response: first 20 samples vs last 20 samples
y <- factor(c(rep("Group1", 20), rep("Group2", 20)))
# Perform classification using k-nearest neighbours (KNN)
res_KNN <- classifyWavFeatExt(sim_dat, y, det = det_coef, sca = sca_coef,
                              method = "KNN", k = 5, ite = 2)
The output typically contains:
a numeric matrix of cross-validated misclassification errors,
a numeric matrix of cross-validated areas under the ROC curve, and
the classification method used.
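To show what a cross-validated misclassification error means in practice, here is a self-contained sketch of k-fold cross-validation with class::knn() on a toy feature matrix. It is not the internal code of classifyWavFeatExt(); fold assignment, tuning and any stratification in the package may differ, and the toy data and names (feat, k_folds, n_neighbours) are assumptions. Note that k in classifyWavFeatExt() is the number of folds, whereas the k argument of knn() is the number of neighbours.

```r
library(class)  # for knn()
set.seed(1)

# Toy feature matrix: 40 samples, 10 features, with a shift in 3 features for Group2
n <- 40
p <- 10
feat <- matrix(rnorm(n * p), n, p)
y <- factor(rep(c("Group1", "Group2"), each = n / 2))
feat[y == "Group2", 1:3] <- feat[y == "Group2", 1:3] + 2

k_folds <- 5        # number of cross-validation folds
n_neighbours <- 3   # number of neighbours used by KNN
fold <- sample(rep(seq_len(k_folds), length.out = n))

err <- numeric(k_folds)
for (f in seq_len(k_folds)) {
  held_out <- fold == f
  pred <- knn(
    train = feat[!held_out, , drop = FALSE],
    test  = feat[held_out, , drop = FALSE],
    cl    = y[!held_out],
    k     = n_neighbours
  )
  err[f] <- mean(pred != y[held_out])  # misclassification rate in this fold
}

cv_error <- mean(err)  # cross-validated misclassification error
cv_error
```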
Depending on the simulation settings
(effect.diff, correlation strength),
wavelet features often outperform PCA/ICA because they preserve local,
multi-scale genomic changes that drive subtype differences.
To demonstrate this, we extract PCA and ICA features using:
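As a rough illustration of what PCA-based feature extraction produces, the sketch below uses base R's prcomp() on a toy matrix. It is independent of the package's getPca() and getIca() functions, whose arguments are not shown here; the toy data, the number of retained components and the object names are assumptions.

```r
set.seed(3)

# Toy CNA-like matrix: 20 samples by 16 windows
X_toy <- matrix(rnorm(20 * 16), nrow = 20, ncol = 16)

# Principal component analysis on centred data
pc <- prcomp(X_toy, center = TRUE, scale. = FALSE)

# Keep the scores on the first 3 components as features (one row per sample)
n_comp <- 3
pca_features <- pc$x[, seq_len(n_comp), drop = FALSE]
dim(pca_features)
```

Unlike wavelet coefficients, each principal component mixes information from all windows, so the resulting features are global rather than tied to a genomic location and scale.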
The object returned by
classifyWavFeatExt() contains two
matrices:
CE – misclassification error for each feature set (detail/scaling scales and segmented data),
AUC – corresponding area under the ROC curve.
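For readers unfamiliar with the AUC, it can be computed from predicted scores with the rank-based (Mann-Whitney) formula. The helper auc_rank() below is a hypothetical illustration in base R and is not necessarily how the package computes its AUC values.

```r
# Hypothetical helper: rank-based AUC for a binary factor;
# the second factor level is treated as the positive class.
auc_rank <- function(score, label) {
  pos <- label == levels(label)[2]
  n_pos <- sum(pos)
  n_neg <- sum(!pos)
  r <- rank(score)  # average ranks are used for ties
  (sum(r[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

label <- factor(c("Group1", "Group1", "Group2", "Group2"))
score <- c(0.1, 0.4, 0.35, 0.8)  # toy predicted scores for Group2

auc_rank(score, label)
# [1] 0.75
```

Here three of the four (Group2, Group1) pairs are ordered correctly by the score, which gives an AUC of 3/4.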
The package provides an S3 method
plot.classifyWavFeatExt() to visualise
these as boxplots across replications
(ite).
## Misclassification error per scale (and segmented baseline)
plot(res_KNN, type = "CE", ylab = "Misclassification error")
## AUC per scale (and segmented baseline)
plot(res_KNN, type = "AUC", ylab = "Area under ROC curve")
Interpretation:
Each box corresponds to one feature set: D1, D2, … = detail scales; S1, S2, … = scaling scales; seg = original segmented data.
The boxes summarise the distribution of CE/AUC over ite repetitions.
The red dashed line is the median performance of the segmented/original data (seg).
The blue dotted line marks the best median performance among all feature sets (highest AUC or lowest CE).
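The layout described above can be reproduced with base graphics. The sketch below uses randomly generated placeholder values, not results from the package, and the matrix ce_toy and its column names are assumptions that only mimic the shape of a CE matrix.

```r
set.seed(2)

# Placeholder CE values: 10 repetitions (rows) by 4 feature sets (columns)
ce_toy <- cbind(
  D1  = runif(10, 0.10, 0.30),
  D2  = runif(10, 0.15, 0.35),
  S1  = runif(10, 0.20, 0.40),
  seg = runif(10, 0.25, 0.45)
)

# One box per feature set
boxplot(ce_toy, ylab = "Misclassification error")

med <- apply(ce_toy, 2, median)
abline(h = med["seg"], col = "red", lty = 2)   # median of the segmented baseline
abline(h = min(med), col = "blue", lty = 3)    # best (lowest) median error
```

For an AUC matrix the best feature set would be the one with the highest median, so max(med) would replace min(med).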
If you also use PCA/ICA, you can similarly call:
sessionInfo()
# R version 4.6.0 (2026-04-24)
# Platform: x86_64-pc-linux-gnu
# Running under: Ubuntu 24.04.4 LTS
#
# Matrix products: default
# BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
# LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#
# locale:
# [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
# [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
# [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
# [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
# [9] LC_ADDRESS=C LC_TELEPHONE=C
# [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#
# time zone: Etc/UTC
# tzcode source: system (glibc)
#
# attached base packages:
# [1] stats graphics grDevices utils datasets methods base
#
# other attached packages:
# [1] wavFeatExt_0.99.21 BiocStyle_2.39.0
#
# loaded via a namespace (and not attached):
# [1] tidyselect_1.2.1 neuralnet_1.44.2 timeDate_4052.112
# [4] dplyr_1.2.1 farver_2.1.2 S7_0.2.2
# [7] fastmap_1.2.0 pROC_1.19.0.1 caret_7.0-1
# [10] digest_0.6.39 rpart_4.1.27 timechange_0.4.0
# [13] lifecycle_1.0.5 survival_3.8-6 magrittr_2.0.5
# [16] compiler_4.6.0 rlang_1.2.0 sass_0.4.10
# [19] tools_4.6.0 yaml_2.3.12 data.table_1.18.2.1
# [22] knitr_1.51 plyr_1.8.9 RColorBrewer_1.1-3
# [25] withr_3.0.2 purrr_1.2.2 sys_3.4.3
# [28] nnet_7.3-20 grid_4.6.0 stats4_4.6.0
# [31] e1071_1.7-17 future_1.70.0 ggplot2_4.0.3
# [34] globals_0.19.1 scales_1.4.0 iterators_1.0.14
# [37] MASS_7.3-65 cli_3.6.6 rmarkdown_2.31
# [40] generics_0.1.4 ica_1.0-3 future.apply_1.20.2
# [43] reshape2_1.4.5 DNAcopy_1.85.0 cachem_1.1.0
# [46] proxy_0.4-29 stringr_1.6.0 splines_4.6.0
# [49] parallel_4.6.0 BiocManager_1.30.27 matrixStats_1.5.0
# [52] vctrs_0.7.3 hardhat_1.4.3 glmnet_4.1-10
# [55] Matrix_1.7-5 jsonlite_2.0.0 listenv_0.10.1
# [58] maketools_1.3.2 wavethresh_4.7.3 foreach_1.5.2
# [61] gower_1.0.2 jquerylib_0.1.4 pls_2.9-0
# [64] recipes_1.3.2 glue_1.8.1 parallelly_1.47.0
# [67] codetools_0.2-20 lubridate_1.9.5 stringi_1.8.7
# [70] gtable_0.3.6 shape_1.4.6.1 tibble_3.3.1
# [73] pillar_1.11.1 htmltools_0.5.9 ipred_0.9-15
# [76] randomForest_4.7-1.2 lava_1.9.0 R6_2.6.1
# [79] evaluate_1.0.5 lattice_0.22-9 bslib_0.10.0
# [82] class_7.3-23 Rcpp_1.1.1-1 nlme_3.1-169
# [85] prodlim_2026.03.11 xfun_0.57 buildtools_1.0.0
# [88] ModelMetrics_1.2.2.2 pkgconfig_2.0.3