| Title: | Model Selection Based on Machine Learning (ML) | 
| Version: | 1.0.0.1 | 
| Description: | Model evaluation based on a modified version of the recursive feature elimination algorithm. This package is designed to determine the optimal model(s) by leveraging all available features. | 
| License: | GPL (≥ 3) | 
| URL: | https://github.com/mommy003/MSML | 
| Encoding: | UTF-8 | 
| RoxygenNote: | 7.3.1 | 
| Depends: | R (≥ 2.10) | 
| Imports: | r2redux, R2ROC | 
| LazyData: | true | 
| NeedsCompilation: | no | 
| Packaged: | 2024-03-04 05:04:48 UTC; cvasu | 
| Author: | Hong Lee [aut, cph], Moksedul Momin [aut, cre, cph] | 
| Maintainer: | Moksedul Momin <cvasu.momin@gmail.com> | 
| Repository: | CRAN | 
| Date/Publication: | 2024-03-04 05:20:02 UTC | 
3 sets of covariates for training data set
Description
A dataset containing N sets of covariates (N=3 as an example here) intended for constant use across all model configurations (refer to the 'model_configuration2' function) when using a training dataset. Please note that if constant covariates are not required, this file is unnecessary (refer to the 'model_configuration' function).
Usage
cov_train
Format
A data frame for training dataset:
- V1
- covariate 1 
- V2
- covariate 2 
- V3
- covariate 3 
3 sets of covariates for validation data set
Description
A dataset containing N sets of covariates (N=3 as an example here) intended for constant use across all model configurations (refer to the 'model_configuration2' function) when using a validation dataset. Please note that if constant covariates are not required, this file is unnecessary (refer to the 'model_configuration' function).
Usage
cov_valid
Format
A data frame for validation dataset:
- V1
- covariate 1 
- V2
- covariate 2 
- V3
- covariate 3 
7 sets of PRSs for test dataset and target phenotype
Description
A dataset containing 7 sets of polygenic risk scores (PRSs) for the test dataset, together with the target phenotype
Usage
data_test
Format
A data frame for test dataset:
- V1
- Feature 1 (or PRS1 constructed using the first subset of SNPs from GWAS summary statistics) 
- V2
- Feature 2 (or PRS2 constructed using the second subset of SNPs from GWAS summary statistics) 
- V3
- Feature 3 (or PRS3 constructed using the third subset of SNPs from GWAS summary statistics) 
- V4
- Feature 4 (or PRS4 constructed using the fourth subset of SNPs from GWAS summary statistics) 
- V5
- Feature 5 (or PRS5 constructed using the fifth subset of SNPs from GWAS summary statistics) 
- V6
- Feature 6 (or PRS6 constructed using the sixth subset of SNPs from GWAS summary statistics) 
- V7
- Feature 7 (or PRS7 constructed using the seventh subset of SNPs from GWAS summary statistics) 
- phenotype
- Phenotypic values 
7 sets of PRSs for training data set and target phenotype
Description
A dataset containing 7 sets of PRSs for training data set and target phenotype
Usage
data_train
Format
A data frame for training dataset:
- V1
- Feature 1 (or PRS1 constructed using the first subset of SNPs from GWAS summary statistics) 
- V2
- Feature 2 (or PRS2 constructed using the second subset of SNPs from GWAS summary statistics) 
- V3
- Feature 3 (or PRS3 constructed using the third subset of SNPs from GWAS summary statistics) 
- V4
- Feature 4 (or PRS4 constructed using the fourth subset of SNPs from GWAS summary statistics) 
- V5
- Feature 5 (or PRS5 constructed using the fifth subset of SNPs from GWAS summary statistics) 
- V6
- Feature 6 (or PRS6 constructed using the sixth subset of SNPs from GWAS summary statistics) 
- V7
- Feature 7 (or PRS7 constructed using the seventh subset of SNPs from GWAS summary statistics) 
- phenotype
- Phenotypic values 
7 sets of PRSs for validation dataset and target phenotype
Description
A dataset containing 7 sets of PRSs for validation dataset and target phenotype
Usage
data_valid
Format
A data frame for validation dataset:
- V1
- Feature 1 (or PRS1 constructed using the first subset of SNPs from GWAS summary statistics) 
- V2
- Feature 2 (or PRS2 constructed using the second subset of SNPs from GWAS summary statistics) 
- V3
- Feature 3 (or PRS3 constructed using the third subset of SNPs from GWAS summary statistics) 
- V4
- Feature 4 (or PRS4 constructed using the fourth subset of SNPs from GWAS summary statistics) 
- V5
- Feature 5 (or PRS5 constructed using the fifth subset of SNPs from GWAS summary statistics) 
- V6
- Feature 6 (or PRS6 constructed using the sixth subset of SNPs from GWAS summary statistics) 
- V7
- Feature 7 (or PRS7 constructed using the seventh subset of SNPs from GWAS summary statistics) 
- phenotype
- Phenotypic values 
model_configuration function
Description
This function generates predicted values for the validation dataset by applying optimal weights to features, which were estimated in the training dataset for each model configuration. The total number of model configurations is the sum of C(n, k) over k = 1, ..., n, where C(n, k) ('n choose k') is the binomial coefficient, 'n' is the total number of features, and 'k' is the number of features included in a model. For example, with n=7, the total number of model configurations is 2^7 - 1 = 127.
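The count of model configurations can be checked directly in base R. The sketch below is illustrative only; the names total_by_sum, total_by_formula and configs are hypothetical and not part of the package.

```r
# Number of features (n = 7 matches the example datasets in this manual).
n <- 7

# Sum C(n, k) over k = 1..n.
total_by_sum <- sum(choose(n, 1:n))

# Closed form: every non-empty subset of the n features is one configuration.
total_by_formula <- 2^n - 1

# Enumerate the feature subsets explicitly with combn().
configs <- unlist(
  lapply(1:n, function(k) combn(n, k, simplify = FALSE)),
  recursive = FALSE
)

total_by_sum       # 127
total_by_formula   # 127
length(configs)    # 127
```

Each element of configs is an integer vector of feature indices, for example c(1, 4, 7) for a model using features 1, 4 and 7.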
Usage
model_configuration(data_train, data_valid, mv, model = "lm")
Arguments
| data_train | Data frame containing the training dataset, in matrix format | 
| data_valid | Data frame containing the validation dataset, in matrix format | 
| mv | The total number of columns in data_train/data_valid (features plus the phenotype column; 8 for the example data) | 
| model | The type of model to fit: lm (default) or glm | 
Value
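This function returns the predicted values for the validation dataset under every model configuration ('predict_validation'), together with the total number of model configurations ('total_model_configurations').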
Examples
data_train <- data_train
data_valid <- data_valid
mv <- 8
out <- model_configuration(data_train, data_valid, mv, model = "lm")
# This produces predicted values for the validation dataset,
# one set per model configuration trained on the training dataset.
# The result contains the variables 'predict_validation'
# and 'total_model_configurations'.
# To print them, run out$predict_validation and out$total_model_configurations.
# For details, see https://github.com/mommy003/MSML.
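The idea of fitting every feature subset on the training data and predicting the validation data can be sketched in base R. This is a minimal illustration under simplifying assumptions, not the package's implementation: the data are simulated with only 3 features, and make_data, toy_train, toy_valid, subsets and toy_predictions are hypothetical names.

```r
set.seed(1)

# Simulated stand-ins for data_train / data_valid: 3 features plus a phenotype.
make_data <- function(n_obs) {
  x <- as.data.frame(matrix(rnorm(n_obs * 3), ncol = 3))  # columns V1..V3
  x$phenotype <- 0.5 * x$V1 + 0.3 * x$V2 + rnorm(n_obs)
  x
}
toy_train <- make_data(200)
toy_valid <- make_data(100)

features <- setdiff(names(toy_train), "phenotype")

# All non-empty feature subsets: 2^3 - 1 = 7 configurations.
subsets <- unlist(
  lapply(seq_along(features), function(k) combn(features, k, simplify = FALSE)),
  recursive = FALSE
)

# Fit each configuration on the training data, then predict the validation data.
toy_predictions <- sapply(subsets, function(vars) {
  fit <- lm(reformulate(vars, response = "phenotype"), data = toy_train)
  predict(fit, newdata = toy_valid)
})

dim(toy_predictions)  # 100 validation samples x 7 configurations
```

Each column of toy_predictions plays the role of one predicted-value column; the package's own output additionally carries the phenotype, as described for the predict_validation dataset.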
model_configuration2 function
Description
This function is similar to the model_configuration function, with the added capability to keep a set of covariates constant across all models during training and prediction (see the cov_train and cov_valid datasets). Additionally, users can choose between linear and logistic regression models.
Usage
model_configuration2(
  data_train,
  data_valid,
  mv,
  cov_train,
  cov_valid,
  model = "lm"
)
Arguments
| data_train | Data frame containing the training dataset, in matrix format | 
| data_valid | Data frame containing the validation dataset, in matrix format | 
| mv | The total number of columns in data_train/data_valid (features plus the phenotype column; 8 for the example data) | 
| cov_train | Data frame containing the covariates for the training dataset, in matrix format | 
| cov_valid | Data frame containing the covariates for the validation dataset, in matrix format | 
| model | The type of model to fit: lm (default) or glm (logistic regression) | 
Value
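This function returns the predicted values for the validation dataset under every model configuration ('predict_validation'), together with the total number of model configurations ('total_model_configurations').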
Examples
data_train <- data_train
data_valid <- data_valid
mv <- 8
cov_train <- cov_train
cov_valid <- cov_valid
out <- model_configuration2(data_train, data_valid, mv, cov_train, cov_valid, model = "lm")
# This produces predicted values for the validation dataset,
# one set per model configuration trained on the training dataset.
# The result contains the variables 'predict_validation'
# and 'total_model_configurations'.
# To print them, run out$predict_validation and out$total_model_configurations.
# For details, see https://github.com/mommy003/MSML.
# If a user intends to employ logistic regression without constant covariates,
# we advise preparing a covariate file where all values are set to 1.
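A covariate file with all values set to 1, as advised above, could be prepared along these lines. This is a sketch only: the use of a single column named V1 is an assumption, and toy_train, toy_valid, ones_train and ones_valid are hypothetical names standing in for the user's own data.

```r
# Stand-ins for the training and validation data (row counts are illustrative).
toy_train <- data.frame(V1 = rnorm(50), phenotype = rbinom(50, 1, 0.3))
toy_valid <- data.frame(V1 = rnorm(20), phenotype = rbinom(20, 1, 0.3))

# One placeholder covariate column (assumed layout), every value set to 1,
# with one row per sample in the matching dataset.
ones_train <- data.frame(V1 = rep(1, nrow(toy_train)))
ones_valid <- data.frame(V1 = rep(1, nrow(toy_valid)))
```

These placeholder data frames would then be supplied as the cov_train and cov_valid arguments of model_configuration2 with model = "glm".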
model_evaluation function
Description
This function identifies the best model(s) in the validation and test datasets.
Usage
model_evaluation(dat, mv, tn, prev, pthreshold = 0.05, method = "R2ROC")
Arguments
| dat | Data frame of predicted values for all model configurations, in matrix format (e.g. predict_validation) | 
| mv | The total number of columns in data_train/data_valid | 
| tn | The total number of best models to be identified | 
| prev | The prevalence of disease in the data | 
| pthreshold | The p value significance threshold used when comparing models (default 0.05) | 
| method | The method used to evaluate models: R2ROC (default) or r2redux | 
Value
This function returns evaluation results (AUC, R2 and their p values) for all models ('out_all'), for the top tn models ('out_start'), and for the selected best models ('out_selected').
Examples
dat <- predict_validation
mv <- 8
tn <- 15
prev <- 0.047
out <- model_evaluation(dat, mv, tn, prev)
# This produces three outputs.
# out$out_all contains AUC, p values for AUC, R2, and p values for R2,
# respectively, for all models.
# out$out_start contains AUC, p values for AUC, R2, and p values for R2,
# respectively, for the top tn models.
# out$out_selected contains AUC, p values for AUC, R2, and p values for R2,
# respectively, for the best models. It also lists the features selected for each model.
# For details, see https://github.com/mommy003/MSML.
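For readers unfamiliar with the two reported metrics, the sketch below computes an R2 and an AUC for one column of predicted values in base R. It is not the R2ROC or r2redux implementation and does not reproduce their p values; the data are simulated and y, pred, r2 and auc are hypothetical names.

```r
set.seed(2)

# Simulated binary phenotype and one column of predicted values.
y <- rbinom(500, 1, 0.3)
pred <- y * 0.8 + rnorm(500)

# R2 taken here as the squared correlation between phenotype and prediction.
r2 <- cor(y, pred)^2

# AUC via the rank-sum (Mann-Whitney) identity:
# sum of ranks of cases, minus its minimum, divided by the number of case-control pairs.
n1 <- sum(y == 1)
n0 <- sum(y == 0)
auc <- (sum(rank(pred)[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

r2
auc
```

An AUC of 0.5 corresponds to predictions that do not separate cases from controls, and values closer to 1 indicate better discrimination.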
target phenotype and 127 sets of model configurations based on validation dataset
Description
A dataset containing target phenotype and 127 sets of model configurations based on validation dataset
Usage
predict_validation
Format
A data frame containing the target phenotype and the predicted values for the target dataset from each model configuration:
- V1
- Phenotypic values in target dataset 
- V2
- predicted values for target dataset from model configuration1 
- V3
- predicted values for target dataset from model configuration2 
- V4
- predicted values for target dataset from model configuration3 
- V5
- predicted values for target dataset from model configuration4 
- V6
- predicted values for target dataset from model configuration5 
- V7
- predicted values for target dataset from model configuration6 
- V8
- predicted values for target dataset from model configuration7 
- V9
- predicted values for target dataset from model configuration8 
- V10
- predicted values for target dataset from model configuration9 
- V11
- predicted values for target dataset from model configuration10 
- V12
- predicted values for target dataset from model configuration11 
- V13
- predicted values for target dataset from model configuration12 
- V14
- predicted values for target dataset from model configuration13 
- V15
- predicted values for target dataset from model configuration14 
- V16
- predicted values for target dataset from model configuration15 
- V17
- predicted values for target dataset from model configuration16 
- V18
- predicted values for target dataset from model configuration17 
- V19
- predicted values for target dataset from model configuration18 
- V20
- predicted values for target dataset from model configuration19 
- V21
- predicted values for target dataset from model configuration20 
- V22
- predicted values for target dataset from model configuration21 
- V23
- predicted values for target dataset from model configuration22 
- V24
- predicted values for target dataset from model configuration23 
- V25
- predicted values for target dataset from model configuration24 
- V26
- predicted values for target dataset from model configuration25 
- V27
- predicted values for target dataset from model configuration26 
- V28
- predicted values for target dataset from model configuration27 
- V29
- predicted values for target dataset from model configuration28 
- V30
- predicted values for target dataset from model configuration29 
- V31
- predicted values for target dataset from model configuration30 
- V32
- predicted values for target dataset from model configuration31 
- V33
- predicted values for target dataset from model configuration32 
- V34
- predicted values for target dataset from model configuration33 
- V35
- predicted values for target dataset from model configuration34 
- V36
- predicted values for target dataset from model configuration35 
- V37
- predicted values for target dataset from model configuration36 
- V38
- predicted values for target dataset from model configuration37 
- V39
- predicted values for target dataset from model configuration38 
- V40
- predicted values for target dataset from model configuration39 
- V41
- predicted values for target dataset from model configuration40 
- V42
- predicted values for target dataset from model configuration41 
- V43
- predicted values for target dataset from model configuration42 
- V44
- predicted values for target dataset from model configuration43 
- V45
- predicted values for target dataset from model configuration44 
- V46
- predicted values for target dataset from model configuration45 
- V47
- predicted values for target dataset from model configuration46 
- V48
- predicted values for target dataset from model configuration47 
- V49
- predicted values for target dataset from model configuration48 
- V50
- predicted values for target dataset from model configuration49 
- V51
- predicted values for target dataset from model configuration50 
- V52
- predicted values for target dataset from model configuration51 
- V53
- predicted values for target dataset from model configuration52 
- V54
- predicted values for target dataset from model configuration53 
- V55
- predicted values for target dataset from model configuration54 
- V56
- predicted values for target dataset from model configuration55 
- V57
- predicted values for target dataset from model configuration56 
- V58
- predicted values for target dataset from model configuration57 
- V59
- predicted values for target dataset from model configuration58 
- V60
- predicted values for target dataset from model configuration59 
- V61
- predicted values for target dataset from model configuration60 
- V62
- predicted values for target dataset from model configuration61 
- V63
- predicted values for target dataset from model configuration62 
- V64
- predicted values for target dataset from model configuration63 
- V65
- predicted values for target dataset from model configuration64 
- V66
- predicted values for target dataset from model configuration65 
- V67
- predicted values for target dataset from model configuration66 
- V68
- predicted values for target dataset from model configuration67 
- V69
- predicted values for target dataset from model configuration68 
- V70
- predicted values for target dataset from model configuration69 
- V71
- predicted values for target dataset from model configuration70 
- V72
- predicted values for target dataset from model configuration71 
- V73
- predicted values for target dataset from model configuration72 
- V74
- predicted values for target dataset from model configuration73 
- V75
- predicted values for target dataset from model configuration74 
- V76
- predicted values for target dataset from model configuration75 
- V77
- predicted values for target dataset from model configuration76 
- V78
- predicted values for target dataset from model configuration77 
- V79
- predicted values for target dataset from model configuration78 
- V80
- predicted values for target dataset from model configuration79 
- V81
- predicted values for target dataset from model configuration80 
- V82
- predicted values for target dataset from model configuration81 
- V83
- predicted values for target dataset from model configuration82 
- V84
- predicted values for target dataset from model configuration83 
- V85
- predicted values for target dataset from model configuration84 
- V86
- predicted values for target dataset from model configuration85 
- V87
- predicted values for target dataset from model configuration86 
- V88
- predicted values for target dataset from model configuration87 
- V89
- predicted values for target dataset from model configuration88 
- V90
- predicted values for target dataset from model configuration89 
- V91
- predicted values for target dataset from model configuration90 
- V92
- predicted values for target dataset from model configuration91 
- V93
- predicted values for target dataset from model configuration92 
- V94
- predicted values for target dataset from model configuration93 
- V95
- predicted values for target dataset from model configuration94 
- V96
- predicted values for target dataset from model configuration95 
- V97
- predicted values for target dataset from model configuration96 
- V98
- predicted values for target dataset from model configuration97 
- V99
- predicted values for target dataset from model configuration98 
- V100
- predicted values for target dataset from model configuration99 
- V101
- predicted values for target dataset from model configuration100 
- V102
- predicted values for target dataset from model configuration101 
- V103
- predicted values for target dataset from model configuration102 
- V104
- predicted values for target dataset from model configuration103 
- V105
- predicted values for target dataset from model configuration104 
- V106
- predicted values for target dataset from model configuration105 
- V107
- predicted values for target dataset from model configuration106 
- V108
- predicted values for target dataset from model configuration107 
- V109
- predicted values for target dataset from model configuration108 
- V110
- predicted values for target dataset from model configuration109 
- V111
- predicted values for target dataset from model configuration110 
- V112
- predicted values for target dataset from model configuration111 
- V113
- predicted values for target dataset from model configuration112 
- V114
- predicted values for target dataset from model configuration113 
- V115
- predicted values for target dataset from model configuration114 
- V116
- predicted values for target dataset from model configuration115 
- V117
- predicted values for target dataset from model configuration116 
- V118
- predicted values for target dataset from model configuration117 
- V119
- predicted values for target dataset from model configuration118 
- V120
- predicted values for target dataset from model configuration119 
- V121
- predicted values for target dataset from model configuration120 
- V122
- predicted values for target dataset from model configuration121 
- V123
- predicted values for target dataset from model configuration122 
- V124
- predicted values for target dataset from model configuration123 
- V125
- predicted values for target dataset from model configuration124 
- V126
- predicted values for target dataset from model configuration125 
- V127
- predicted values for target dataset from model configuration126 
- V128
- predicted values for target dataset from model configuration127