Help for package iSubGen

Title:

Integrative Subtype Generation

Version:

1.0.1

Date:

2021-04-15

Author:

Natalie Fox

Maintainer:

Paul C Boutros <pboutros@mednet.ucla.edu>

Description:

Multi-data type subtyping, which is data type agnostic and accepts missing data. Subtyping is performed using intermediary assessments created with autoencoders and similarity calculations.

Depends:

R (≥ 3.2.3)

Imports:

ConsensusClusterPlus, cluster (≥ 1.14.4), keras, tensorflow, philentropy

Suggests:

knitr, rmarkdown

VignetteBuilder:

knitr

License:

GPL-2

NeedsCompilation:

Packaged:

2021-04-16 16:57:23 UTC; nfox

Repository:

CRAN

Date/Publication:

2021-04-22 14:50:10 UTC

Apply scaling factors

Description

Apply scaling factors prior to autoencoder

Usage

apply.scaling(data.matrices, scaling.factors);

Arguments

data.matrices

list, where each element is a matrix. The list has one matrix for each data type to be scaled

scaling.factors

list with two elements named: \"center\" and \"scale\", and each element is a named numerical vector or a list of named numerical vectors. If scaling.factors$center or scaling.factors$scale are a list then each element needs to correspond to a one of the data matrices. Finally, the named numerical vectors should match the row and rownames from the corresponding data matrix.

Details

The names for the data matrices and the center and scale lists all must match.

Value

A list of matrices of the same format as the data.matrices

Author(s)

Natalie Fox

Examples


# Load molecular profiles for three data types and calculate scaling for each
example.molecular.data.dir <- paste0(path.package('iSubGen'),'/exdata/');
molecular.data <- list();
scaling.factors <- list();
for(i in c('cna','snv','methy')) {
  # Load molecular profiles from example files saved 
  # in the package as <data type>_profiles.txt
  molecular.data[[i]] <- load.molecular.aberration.data(
    paste0(example.molecular.data.dir,i,'_profiles.txt'),
    patients = c(paste0('EP00',1:9), paste0('EP0',10:30))
    );

  scaling.factors[[i]] <- list();

  scaling.factors[[i]]$center <- apply(molecular.data[[i]], 1, mean);
  scaling.factors[[i]]$scale <- apply(molecular.data[[i]], 1, sd);
  }

# Example 1: Transform the molecular profiles by the scaling factors
scaled.molecular.data <- apply.scaling(molecular.data, scaling.factors);

# Example 2: Transform one of the data types based on the scaling factors
scaled.molecular.data2 <- apply.scaling(
  molecular.data[[1]],
  scaling.factors[[1]]
  );

Calculate consensus integrative correlation matrix

Description

Calculate consensus pairwise correlations between patient distances

Usage

calculate.cis.matrix(data.types, data.matrices, dist.metrics,
correlation.method = "spearman", filter.to.common.patients = FALSE, 
patients.to.return = NULL, patients.for.correlations = NULL, 
patient.proportion = 0.8, feature.proportion = 1, num.iterations = 10, 
print.intermediary.similarity.matrices.to.file = TRUE, print.dir = '.',
patient.proportion.seeds = seq(1,num.iterations), 
feature.proportion.seeds = seq(1,num.iterations))

Arguments

data.types

vector of the IDs for the different data types that are the names of the lists for the data.matrices and dist.metrics

data.matrices

list of the matrices with features (rows) by patients (columns)

dist.metrics

list of the distance metrics for comparing patient profiles. ex. euclidean. Options are from philentropy::distance

correlation.method

specifies the type of correlation for similarity comparison. Options are pearson, spearman or kendall.

filter.to.common.patients

logical, where TRUE indicates to filter out patients that don't have all data types

patients.to.return

vector of patients to calculate CIS for. For example, this is the testing cohort patients when calculating CIS for the testing cohort using the training cohort patients. If NULL all patients/columns will be used.

patients.for.correlations

vector of patients to use to calculate the similarities. For example, this would be the training cohort patients when calculating CIS for the testing cohort. If NULL all patients/columns will be used.

patient.proportion

proportion of patients.for.correlations to sample for each iteration (sampled without replacement).

feature.proportion

proportion of the features to sample for each iteration (sampled without replacement).

num.iterations

number of iterations to take the median from

print.intermediary.similarity.matrices.to.file

logical, where TRUE indicates that created intermediary integrative similarity matrix from each iteration should be printed to file

print.dir

directory for where to print the intermediary similarity matrices to file

patient.proportion.seeds

vector of scalars of the length num.iterations specifying the seeds used for random sampling for selecting the patient subsets at each iteration

feature.proportion.seeds

vector of scalars of the length num.iterations specifying the seeds used for random sampling for selecting the feature subsets at each iteration

Value

CIS matrix where rows are patients and columns are pairs of data types

Author(s)

Natalie Fox

Examples


# Load molecular profiles for three data types from example files saved 
# in the package as <data type>_profiles.txt
example.molecular.data.dir <- paste0(path.package('iSubGen'),'/exdata/');
molecular.data <- list();
for(i in c('cna','snv','methy')) {
  molecular.data[[i]] <- load.molecular.aberration.data(
    paste0(example.molecular.data.dir,i,'_profiles.txt'),
    patients = c(paste0('EP00',1:9), paste0('EP0',10:30))
    );
  }

# Example 1: calculate the consensus integrative similarity (CIS) matrix
corr.matrix <- calculate.cis.matrix(
  data.types = names(molecular.data),
  data.matrices = molecular.data,
  dist.metrics = list(
    cna = 'euclidean',
    snv = 'euclidean',
    methy = 'euclidean'
    ),
  print.intermediary.similarity.matrices.to.file = FALSE
  );

# Example 2: calculate the CIS matrix for patients EP001 through EP009 in relation 
# to patients EP010 through EP030 meaning the profile of EP001 is correlated to 
# the profiles of EP010 through EP030 so when assessing new patients, they can be 
# compared to the training profiles
corr.matrix2 <- calculate.cis.matrix(
  data.types = names(molecular.data),
  data.matrices = molecular.data,
  dist.metrics = list(
    cna = 'euclidean',
    snv = 'euclidean',
    methy = 'euclidean'
    ),
  patients.to.return = paste0('EP00',1:9),
  patients.for.correlations = paste0('EP0',10:30),
  print.intermediary.similarity.matrices.to.file = FALSE
  );

# Example 3: Adjusting the proportion of the features that will be used to correlate 
# the patient profiles
corr.matrix3 <- calculate.cis.matrix(
  data.types = names(molecular.data),
  data.matrices = molecular.data,
  dist.metrics = list(
    cna = 'euclidean',
    snv = 'euclidean',
    methy = 'euclidean'
    ),
  patients.to.return = paste0('EP00',1:9),
  patients.for.correlations = paste0('EP0',10:30),
  feature.proportion = 0.6,
  print.intermediary.similarity.matrices.to.file = FALSE
  );

Calculate integrative similarity matrix

Description

Calculate pairwise correlations between patient distances

Usage

calculate.integrative.similarity.matrix(data.types, data.matrices, dist.metrics,
correlation.method = "spearman", filter.to.common.patients = FALSE, 
patients.to.return = NULL, patients.for.correlations = NULL)

Arguments

data.types

vector, where each element is a data type ID matching the names in data.matrices and dist.metrics

data.matrices

list, where each element is a matrix with features as rows and patients as columns

dist.metrics

list, where each element is the distance metric to use for comparing patient profiles. ex. euclidean. Options are from philentropy::distance

correlation.method

specifies the type of correlation. Options are pearson, spearman or kendall.

filter.to.common.patients

logical, where TRUE indicates to filter out patients that don't have all data types

patients.to.return

vector, where each element a patient ID specifying the patients to calculate integrative similarity for. For example, this is the testing cohort patients when calculating integrative similarity for the testing cohort using the training cohort patients. If NULL all patients/columns will be used.

patients.for.correlations

vector, where each element a patient ID specifying the patients to use to calculate the similarities. For example, this would be the training cohort patients when calculating integrative similarity for the testing cohort. If NULL all patients/columns will be used.

Value

matrix where rows are patients and columns are pairs of data types

Author(s)

Natalie Fox

Examples


# Load molecular profiles for three data types from example files saved 
# in the package as <data type>_profiles.txt
example.molecular.data.dir <- paste0(path.package('iSubGen'),'/exdata/');
molecular.data <- list();
for(i in c('cna','snv','methy')) {
  molecular.data[[i]] <- load.molecular.aberration.data(
    paste0(example.molecular.data.dir,i,'_profiles.txt'),
    patients = c(paste0('EP00',1:9), paste0('EP0',10:30))
    );
  }

# Example 1: calculate integrative similarity between pairs of CNA, coding SNVs, methylation data
corr.matrix <- calculate.integrative.similarity.matrix(
  data.types = names(molecular.data),
  data.matrices = molecular.data,
  dist.metrics = list(
    cna = 'euclidean',
    snv = 'euclidean',
    methy = 'euclidean'
    )
  );

# Example 2: calculate the integrative similarity for patients EP001 through EP009 
# in relation to patients EP010 through EP030 meaning the profile of EP001 is 
# correlated to the profiles of EP010 through EP030 so when assessing new patients,
# they can be compared to the training profiles
corr.matrix2 <- calculate.integrative.similarity.matrix(
  data.types = names(molecular.data),
  data.matrices = molecular.data,
  dist.metrics = list(
    cna = 'euclidean',
    snv = 'euclidean',
    methy = 'euclidean'
    ),
  patients.to.return = paste0('EP00',1:9),
  patients.for.correlations = paste0('EP0',10:30)
  );

# Example 3: Calculate integrative similarity between CNA and methylation data
corr.matrix3 <- calculate.integrative.similarity.matrix(
  data.types=names(molecular.data)[c(1,3)],
  data.matrices=molecular.data[c(1,3)],
  dist.metrics=list(
    cna='euclidean',
    snv='euclidean',
    methy='euclidean'
    )[c(1,3)],
  patients.to.return=paste0('EP00',1:9),
  patients.for.correlations=paste0('EP0',10:30)
  );

Calculate scaling factors

Description

Calculate scaling factors

Usage

calculate.scaling(data.matrices);

Arguments

data.matrices

list, where each element is a matrix. The list has one matrix for each data type to be scaled

Details

The names for the data matrices and the center and scale lists all must match.

Value

a list with two elements named: \"center\" and \"scale\", and each of these element is a named numerical vector or a list of named numerical vectors. If scaling.factors$center or scaling.factors$scale are a list then each element will correspond to a one of the data matrices. Finally, the named numerical vectors will match the row and rownames from the data matrices.

Author(s)

Natalie Fox

Examples


# Load molecular profiles for three data types from example files saved 
# in the package as <data type>_profiles.txt
example.molecular.data.dir <- paste0(path.package('iSubGen'),'/exdata/');
molecular.data <- list();
for(i in c('cna','snv','methy')) {
  molecular.data[[i]] <- load.molecular.aberration.data(
    paste0(example.molecular.data.dir,i,'_profiles.txt'),
    patients = c(paste0('EP00',1:9), paste0('EP0',10:30))
    );
  }

# Example 1: Calculate scaling factors for all three data types
scaling.factors <- calculate.scaling(molecular.data);

# Example 2: Calculate scaling factors for only the methylation data
scaling.factors2 <- calculate.scaling(molecular.data[['methy']]);

Clustering to find patient subtypes

Description

A wrapper function for using consensus clustering to subtype patients

Usage

cluster.patients(data.matrix, distance.metric, parent.output.dir,
new.result.dir, subtype.table.file = NULL, max.num.subtypes = 12, 
clustering.reps = 1000, proportion.features = 0.8, proportion.patients = 0.8, 
verbose = FALSE, consensus.cluster.write.table = TRUE);

Arguments

data.matrix

matrix with patients as rows and features as columns

distance.metric

distance metric for comparing patient profiles. ex. euclidean

parent.output.dir

directory where the consensus clustering function will create a directory of results

new.result.dir

directory name for consensus clustering results

subtype.table.file

filename for subtype assignment table for different number of clusters

max.num.subtypes

maximum number of clusters to separate patients into

clustering.reps

number of subsamples for consensus clustering function

proportion.features

proportion of features to sample for each clustering iteration

proportion.patients

proportion of patients to sample for each clustering iteration

verbose

logical, where TRUE indicates to print messages to the screen to indicate progress

consensus.cluster.write.table

logical, where TRUE indicates for the ConsensusClusterPlus function to writeTable

Value

consensus_cluster_result

consensus clustering function return value

subtype_table

the table written to subtype.table.file

Author(s)

Natalie Fox

Examples

## Not run: 

# For this example instead of clustering CIS and IRF matrices,
# create a data matrix to see how the function works without
# running through the whole iSubGen process.
# This example is created with to have 4 distinct clusters
set.seed(5);
ex.matrix <- matrix(
  c(
    sample(c(0,1), 30, replace = TRUE), rep(1,75), rep(0,25),
    sample(c(0,1), 30, replace = TRUE), rep(1,75), rep(0,25),
    sample(c(0,1), 30, replace = TRUE), rep(1,75), rep(0,25),
    sample(c(0,1), 30, replace = TRUE), rep(1,100),
    sample(c(0,1), 30, replace = TRUE), rep(1,100),
    sample(c(0,1), 30, replace = TRUE), rep(1,100),
    sample(c(0,1), 30, replace = TRUE), rep(0,100),
    sample(c(0,1), 30, replace = TRUE), rep(0,100),
    sample(c(0,1), 30, replace = TRUE), rep(0,100),
    sample(c(0,1), 30, replace = TRUE), rep(0,75), rep(1,25),
    sample(c(0,1), 30, replace = TRUE), rep(0,75), rep(1,25),
    sample(c(0,1), 30, replace = TRUE), rep(0,75), rep(1,25)
    ),
  nrow=130);
rownames(ex.matrix) <- paste0('gene',1:130);
colnames(ex.matrix) <- paste0('patient',LETTERS[1:12]); 

# Use Consensus clustering to subtype the patient profiles
subtyping.results <- cluster.patients(
  data.matrix = ex.matrix,
  distance.metric = 'euclidean',
  parent.output.dir = './',
  new.result.dir = 'example_subtyping',
  max.num.subtypes = 6,
  clustering.reps = 50,
  consensus.cluster.write.table = FALSE
  );	

## End(Not run)

Combine iSubGen integrative features

Description

Combine a independent reduced features matrix (ex. from autoencoders) and pairwise integrative similarity matrices into one integrative feature matrix.

Usage

combine.integrative.features(irf.matrix, cis.matrix,
irf.rescale.recenter = NA, cis.rescale.recenter = NA, 
irf.rescale.denominator = NA, cis.rescale.denominator = NA,
irf.weights = rep(1, ncol(irf.matrix)), 
cis.weights = rep(1, ncol(cis.matrix)))

Arguments

irf.matrix

matrix of independent reduced features with patients as rows and features as columns

cis.matrix

matrix of consensus integrative similarity or integrative similarity features with patients as rows and features as columns

irf.rescale.recenter

either NA, "mean", a single number or a vector of numbers of length equal to the number of columns of irf

cis.rescale.recenter

either NA, "mean", a single number or a vector of numbers of length equal to the number of columns of cis

irf.rescale.denominator

either NA, "sd", a single number or a vector of numbers of length equal to the number of columns of irf

cis.rescale.denominator

either NA, "sd", a single number or a vector of numbers of length equal to the number of columns of cis

irf.weights

single number or vector of numbers of length equal to the number of columns of irf

cis.weights

single number or vector of numbers of length equal to the number of columns of cis

Details

The recenter values determine the how column centering is performed. If NA, no recentering is done. If the values equal "mean", then the mean of each column will be used. Otherwise, the numeric values specified will be used. The denominator values determine how column scaling is performed. If NA, no recentering is done. If the denominator values equal "sd", then the standard deviation of each column will be used. Otherwise, the numeric values specified will be used. The values used are returned by the function along with the compressed feature matrix to be recorded for reproducibility purposes.

Value

integrative.feature.matrix

a matrix of compressed features with patients as rows and features as columns

irf.rescale.recenter

a numeric vector with length equal to the number of columns of irf

cis.rescale.recenter

a numeric vector with length equal to the number of columns of cis

irf.rescale.denominator

a numeric vector with length equal to the number of columns of irf

cis.rescale.denominator

a numeric vector with length equal to the number of columns of cis

irf.weights

a numeric vector with length equal to the number of columns of irf

cis.weights

a numeric vector with length equal to the number of columns of cis

Author(s)

Natalie Fox

Examples


# Create matrices for combining
irf.matrix <- matrix(runif(25*4), ncol = 4);
rownames(irf.matrix) <- c(paste0('EP00',1:9), paste0('EP0',10:25));
cis.matrix <- matrix(runif(25*6), ncol=6);
rownames(cis.matrix) <- c(paste0('EP00',1:9), paste0('EP0',10:25));

# Example 1: Join the matrices without any weighting adjustments
isubgen.feature.matrix <- combine.integrative.features(
  irf.matrix,
  cis.matrix
  )$integrative.feature.matrix;

# Example 2: Combine matrices after scaling each column by subtracting the mean
# and dividing by the standard devation of the column
isubgen.feature.matrix.rescaled.result <- combine.integrative.features(
  irf.matrix,
  cis.matrix,
  irf.rescale.recenter = 'mean',
  cis.rescale.recenter = 'mean',
  irf.rescale.denominator = 'sd',
  cis.rescale.denominator = 'sd'
  );
isubgen.feature.matrix.2 <- isubgen.feature.matrix.rescaled.result$integrative.feature.matrix;

# Example 3: Combine matrices 
isubgen.feature.matrix.reweighted.result <- combine.integrative.features(
  irf.matrix,
  cis.matrix,
  irf.weights = 1/4,
  cis.weights = 1/6
  );
isubgen.feature.matrix.3 <- isubgen.feature.matrix.reweighted.result$integrative.feature.matrix;

Create an autoencoder for dimensionality reduction

Description

Create an autoencoder for dimensionality reduction using keras and tensorflow packages

Usage

create.autoencoder(data.type, data.matrix, encoder.layers.node.nums = c(15,2),
autoencoder.activation = 'tanh', optimization.loss.function = 'mean_squared_error', 
model.file.output.dir = '.')

Arguments

data.type

data type ID. The ID will be used for naming the output file

data.matrix

matrix with data features as rows and patients as columns

encoder.layers.node.nums

vector with the number of nodes for each layer when the reducing the feature dimensions within the autoencoder. The autoencoder will be made symmetrically so the number of nodes in each layer will be used in reverse, not repeating the last layer to re encode the features in the autoencoder

autoencoder.activation

activation function to use in the autoencoder

optimization.loss.function

loss function used for optimization while fitting the autoencoder

model.file.output.dir

file location for the autoencoder file

Value

autoencoder

the autoencoder created by the keras package

autoencoder.file

the hdf5 file that the model was saved in and can be loaded from

Author(s)

Natalie Fox

Examples

## Not run: 

example.molecular.data.dir <- paste0(path.package('iSubGen'),'/exdata/');

ae.result <- create.autoencoder(
  data.type = 'cna',
  data.matrix = load.molecular.aberration.data(
    paste0(example.molecular.data.dir,'cna_profiles.txt'),
    patients = c(paste0('EP00',1:9), paste0('EP0',10:30))
    ),
  encoder.layers.node.nums = c(15,5,2)
  );

## End(Not run)

Create matrix of independent reduced features

Description

Create matrix of independent reduced features using autoencoders

Usage

create.autoencoder.irf.matrix(data.types, data.matrices,
autoencoders, filter.to.common.patients = FALSE,
patients.to.return = NULL)

Arguments

data.types

vector, where each element is a data type ID matching the names in data.matrices and dist.metrics

data.matrices

list, where each element is a matrix with features as rows and patients as columns

autoencoders

list, where each element is an autoencoder corresponding to each data type. Can be either an keras autoencoder object or the file where the autoencoder was saved.

filter.to.common.patients

logical, where TRUE indicates to filter out patients that don't have all data types.

patients.to.return

vector of patients to return correlations for. If NULL all patients/columns will be used.

Value

matrix where rows are patients and columns are pairs of data types

Author(s)

Natalie Fox

Examples

## Not run: 

# Load three data types and create an autoencder for each
example.molecular.data.dir <- paste0(path.package('iSubGen'),'/exdata/');
molecular.data <- list();
ae.result <- list();
for(i in c('cna','snv','methy')) {
  molecular.data[[i]] <- load.molecular.aberration.data(
    paste0(example.molecular.data.dir,i,'_profiles.txt'),
    patients = c(paste0('EP00',1:9), paste0('EP0',10:30))
    );
  ae.result[[i]] <- create.autoencoder(
    data.type = i,
    data.matrix = molecular.data[[i]],
    encoder.layers.node.nums = c(10,2)
    )$autoencoder;
  }

# Create a matrix of the bottleneck layers 
irf.matrix <- create.autoencoder.irf.matrix(
  data.types = names(molecular.data),
  data.matrices = molecular.data,
  autoencoders = ae.result
  );

## End(Not run)

Load molecular aberration data

Description

Load the molecular aberration profiles/feature annotation

Usage

load.molecular.aberration.data(file, patients = NULL, annotation.fields = NULL);

Arguments

file

file name of the matrix containing molecular and annotation data. If it does not contain an _absolute_ path, the file name is _relative_ to the current working directory, 'getwd()' as in read.table.

patients

vector of patients IDs. Must match colnames from aberration file

annotation.fields

vector referencing the column names for the feature annotation columns

Details

The annotation.fields argument will look for any colnames which contain the values specified in annotation.fields and then the column will be renamed to the value that matched from annotation.fields.

Value

If the patients argument is specified then the patient molecular aberration profiles are returned. If the annotation.fields argument is specified then the feature annotation is returned. If both are specified then the two matrices are returned in a list. If neither is specified then the entire matrix with the mix of patients and annotation is returned.

Author(s)

Natalie Fox

Examples


example.aberration.data <- paste0(
  path.package('iSubGen'),
  '/exdata/cna_profiles.txt'
  );

# Load the CNA profiles for patients EP001 through EP030
cna.profiles <- load.molecular.aberration.data(
  example.aberration.data,
  patients = c(paste0('EP00',1:9), paste0('EP0',10:30))
  );

# Load feature annotation for the CNA data
cna.annotation <- load.molecular.aberration.data(
  example.aberration.data,
  annotation.fields = c('gene','start','end')
  );

Read scaling factors from file

Description

Read scaling factors from file

Usage

read.scaling.factors(scaling.factor.files.dir,data.types);

Arguments

scaling.factor.files.dir

the directory where the files were saved

data.types

a vector of the data types with saved scaling factors

Details

One scale and one center file is saved per data type

Value

a list with a key \"center\" list and a key \"scale\" list. The center and scale list keys match the data.matrices list keys

Author(s)

Natalie Fox

Examples


# Get the path for the scaling provided in this R package
example.molecular.data.dir <- paste0(path.package('iSubGen'),'/exdata/');

# Example #1: reading scaling factors for a single data type
scaling.factors <- read.scaling.factors(example.molecular.data.dir, 'cna');

# Example #2: reading scaling factors for multiple data types
scaling.factors <- read.scaling.factors(example.molecular.data.dir, c('cna','snv','methy'));

Write scaling factors to file

Description

Write scaling factors to file

Usage

write.scaling.factors(scaling.factors, scaling.factor.files.dir=NULL)

Arguments

scaling.factors

list with the scaling factors created by calculate.scaling

scaling.factor.files.dir

directory to output scaling factor files

Details

Creates two files for each data type key. One file for the recentering values and one file for the rescaling values. Files have the names <data type>_gene_recenter.txt or <data type>_gene_rescale.txt

Value

No return value, called for side effects

Author(s)

Natalie Fox

Examples

## Not run: 

# load the aberration profiles for three data types 
example.molecular.data.dir <- paste0(path.package('iSubGen'),'/exdata/');
molecular.data <- list();
for(i in c('cna','snv','methy')) {
  molecular.data[[i]] <- load.molecular.aberration.data(
    paste0(example.molecular.data.dir,i,'_profiles.txt'),
    patients = c(paste0('EP00',1:9), paste0('EP0',10:30))
    );
  }

# calculate scaling factors for all three data types
scaling.factors <- calculate.scaling(molecular.data);

# save the scaling factors to file
write.scaling.factors(scaling.factors);

## End(Not run)