holodeckTo simulate co-varying multivariate normal data one can use the
mvrnorm() function from the MASS package.
However, this requires input of a covariance matrix and returns a
matrix. The holodeck package provides functions that are
“tidy” in the sense that they work with dataframes and the pipe operator
(%>%). It also includes some functions that provide an
interface between the Bioconductor package ropls and the
tidyverse.
sim_*()holodeck provides functions to simulate different kinds
of data as columns in a tibble:
sim_cat() for categorical variablessim_covar() for multivariate normal numeric datasim_discr() for multivariate normal data with different
means for levels of some grouping variablesim_missing() for randomly introducing
NAsTo simulate multivariate data you need to start with a dataframe or a
tibble. Once you have a dataframe or tibble, the sim_*()
functions add columns onto it.
df <- tibble(Y = rep(c("a", "b"), each = 5))
df %>% sim_covar(n_vars = 5, var = 1, cov = 0.5)
#> # A tibble: 10 × 6
#>    Y         V1     V2      V3      V4      V5
#>    <chr>  <dbl>  <dbl>   <dbl>   <dbl>   <dbl>
#>  1 a      1.32   0.338 -0.281   0.0491  1.49  
#>  2 a     -0.305  1.18  -0.405   1.25    1.67  
#>  3 a      0.498  0.510  1.06    0.215   0.169 
#>  4 a      0.155 -0.946  0.569  -0.814  -0.808 
#>  5 a      2.16   0.752  0.811   1.48    0.964 
#>  6 b     -0.710 -1.30  -0.705  -0.264  -1.51  
#>  7 b     -0.290 -0.618  0.597   1.11   -1.11  
#>  8 b     -0.276 -0.611 -0.0108 -0.295  -0.226 
#>  9 b     -0.697 -0.360 -1.72    0.0576 -1.24  
#> 10 b     -2.19  -0.926 -0.945  -0.305   0.0470Optionally you can create a tibble with the sim_covar()
or sim_cat() functions by providing them with the
N argument instead of .data.
sim_covar(n_obs = 10, n_vars = 5, var = 1, cov = 0.5)
#> # A tibble: 10 × 5
#>        V1      V2     V3      V4      V5
#>     <dbl>   <dbl>  <dbl>   <dbl>   <dbl>
#>  1 -0.629 -0.801   0.631 -0.0779  0.0657
#>  2  0.310  2.31    1.00  -0.150  -1.18  
#>  3 -0.979 -0.312  -1.85   0.232  -1.68  
#>  4  1.59   0.670   0.123 -0.290  -0.397 
#>  5  1.86   0.888   1.75   0.546   1.15  
#>  6 -0.687 -0.826  -0.884 -1.28   -0.427 
#>  7 -0.191  0.0295 -1.09  -1.43   -1.09  
#>  8  0.620  0.680   1.98   1.53   -0.189 
#>  9 -1.34  -0.477  -1.68  -0.531   0.451 
#> 10 -0.505  0.194   0.235  0.0377  0.742sim_cat() is a rather simple wrapper that just creates a
column of categorical data. Eventually, it will be expanded to allow
creation of crossed and nested factors.
sim_cat(n_obs = 10, n_groups = 2)
#> # A tibble: 10 × 1
#>    group
#>    <chr>
#>  1 a    
#>  2 a    
#>  3 a    
#>  4 a    
#>  5 a    
#>  6 b    
#>  7 b    
#>  8 b    
#>  9 b    
#> 10 bsim_discr() simulates covarying data that differs in
means between levels of some grouping variable.
df %>%
  group_by(Y) %>% 
  sim_discr(n_vars = 5, var = 1, cov = 0.1, group_means = c(1, -1))
#> # A tibble: 10 × 6
#> # Groups:   Y [2]
#>    Y         V1      V2     V3       V4     V5
#>    <chr>  <dbl>   <dbl>  <dbl>    <dbl>  <dbl>
#>  1 a      3.32   2.51    1.04   1.31     1.21 
#>  2 a      2.01   0.0511  2.08   1.44     0.143
#>  3 a      0.607  1.94    1.73   0.182    0.967
#>  4 a      1.35   1.56    1.84   0.119    1.71 
#>  5 a      1.82   0.0868  1.05  -0.146    1.55 
#>  6 b     -0.783 -0.696  -1.33  -1.09    -1.20 
#>  7 b     -3.17  -1.35   -1.99  -1.82    -2.53 
#>  8 b     -1.20  -0.340   0.157  0.00756 -1.00 
#>  9 b     -0.873 -0.626   0.468 -1.69    -2.20 
#> 10 b      0.381 -1.17   -0.823 -0.461   -0.879%>%One advantage of the holodeck package is the ability to
chain functions together to create complex data covariance structures.
You can chain functions together in any order, although the
sim_discr() function requires a grouping variable.
All of the sim_* functions (besides
sim_missing()) take an optional name argument which names
the variables created.
df <-
  sim_covar(n_obs = 20, n_vars = 5, var = 1, cov = 0.1, name = "low") %>% #5 variables with low covariance
  sim_covar(n_vars = 5, var = 1, cov = 0.8, name = "high")            #5 variables with high covarianceNow we could add a categorical variable, and some variables that discriminate between levels of our categorical variable
df1 <-
  df %>% 
  sim_cat(n_groups = 2, name = "factor") %>% 
  group_by(factor) %>% 
  sim_discr(n_vars = 5, var = 1, cov = 0.1, group_means = c(-1, 1), name = "discr") %>% 
  ungroup()Finally, if you want to simulate missing values, you can use
sim_missing() to randomly introduce NAs.
df2 <-
  df1 %>% 
  sim_missing(prop = 0.1)
df2
#> # A tibble: 20 × 16
#>    factor  low_1   low_2    low_3   low_4    low_5  high_1  high_2 high_3
#>    <chr>   <dbl>   <dbl>    <dbl>   <dbl>    <dbl>   <dbl>   <dbl>  <dbl>
#>  1 a       1.15  -0.179  -0.380   -0.844  -0.0127   1.35    1.34    1.66 
#>  2 a       1.23  -0.0830 -0.747   -0.330  -0.760   -1.76   -1.45   -0.765
#>  3 a      -0.761  0.931   0.0640  -0.334  NA       -0.0859  0.679  NA    
#>  4 a       1.52   1.12    0.0114  -0.948   0.809   NA      -0.0167  0.961
#>  5 a      -1.73  -0.452   0.230   -0.968  NA       -0.847  -1.30   NA    
#>  6 a       0.227 -0.510  -0.159    0.381  -0.160    0.607   0.367  -0.209
#>  7 a       0.308  0.324  -0.469   -0.0127  0.260   -1.01    0.245   0.218
#>  8 a       1.91   0.882   0.525   -0.213  -0.414    0.167  -0.305  -0.701
#>  9 a       1.54  NA       1.66    -0.597  -0.315   -0.732   0.564   0.561
#> 10 a      -1.35  -0.333  -1.78     0.402   0.996   -0.807  NA       0.269
#> 11 b      -0.807  1.07   -1.01    NA       0.519    0.944   1.37   NA    
#> 12 b       0.624  0.443   1.16     0.959  -0.520   -0.359  -0.214   1.06 
#> 13 b      -1.16   0.937  -1.80     0.893   1.25    -0.887  -0.202  -1.05 
#> 14 b       0.644 -0.449   0.518    0.930   0.783   -1.36   -0.314   0.272
#> 15 b      -1.35   0.412  -1.11    -0.225  -0.00819  0.0182  0.372  -0.457
#> 16 b      -2.73   0.648   0.560    1.01   -0.228   NA      -0.245  NA    
#> 17 b      -2.04   0.476  -0.334   -0.495   0.303   -0.888  -1.15   -0.704
#> 18 b       0.318 -0.292   1.36    NA      -0.130   -0.634   0.0614 -0.740
#> 19 b      -1.38   0.207  -0.468   NA      -1.04     0.765   0.495   0.931
#> 20 b       0.584  1.21    0.00916  0.982  -1.10     1.40    0.979   0.934
#> # ℹ 7 more variables: high_4 <dbl>, high_5 <dbl>, discr_1 <dbl>, discr_2 <dbl>,
#> #   discr_3 <dbl>, discr_4 <dbl>, discr_5 <dbl>cov() creates a covariance matrix with variance on the
diagonal. We can visualize it as a heatmap.
Values are higher for the discriminating variables because the
cov and var arguments to
sim_discr() only control the covariance and variance
within groups.
One reason to simulate multivariate data is to test the effects of
different properties of datasets on analysis results. For example,
what’s the effect of missing data on a statistical analysis? The
sim_missing() function replaces a proportion of values with
NA. Let’s see how it affects a PLS-DA analysis.
We can chain several sim_* functions to quickly create a
dataframe.
df2 <- 
  sim_cat(n_obs = 40, n_groups = 3, name = "factor") %>% 
  sim_covar(n_vars = 3, var = 1, cov = 0.0, name = "noise") %>% 
  group_by(factor) %>% 
  sim_discr(n_vars = 5, var = 1, cov = 0, group_means = c(-1, 0, 1), name  = "signal") %>%
  sim_discr(n_vars = 5, var = 1, cov = 0, group_means = c(0, 0.5, 1), name = "signal2") %>% 
  ungroup()
df2
#> # A tibble: 40 × 14
#>    factor noise_1 noise_2  noise_3 signal_1 signal_2 signal_3 signal_4 signal_5
#>    <chr>    <dbl>   <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
#>  1 a       0.0121  -2.17   0.275    -1.27     -1.76   -0.0391   -2.64    -1.80 
#>  2 a      -0.376    0.895 -0.714    -1.16     -1.44   -1.48     -2.49    -2.44 
#>  3 a       2.01    -0.446 -0.00426  -0.406    -1.43   -1.67     -3.04    -0.804
#>  4 a       0.607   -2.79  -0.388    -1.08     -1.28   -2.34     -0.681    0.565
#>  5 a       0.693    1.42  -0.796    -0.153    -1.21    0.377    -1.59    -0.968
#>  6 a      -1.72    -1.51   0.968    -1.34     -0.510  -1.29      0.347   -1.65 
#>  7 a      -0.430   -1.53   0.940    -1.94     -1.19   -0.589    -3.32    -0.333
#>  8 a      -0.569    0.329 -1.16     -2.06     -1.24    0.0146   -2.18    -0.673
#>  9 a       0.925   -0.507 -0.834     0.0797   -0.132   0.344    -1.56    -1.72 
#> 10 a      -0.399   -0.676 -0.518    -1.09      0.486   0.337    -0.560   -2.02 
#> # ℹ 30 more rows
#> # ℹ 5 more variables: signal2_1 <dbl>, signal2_2 <dbl>, signal2_3 <dbl>,
#> #   signal2_4 <dbl>, signal2_5 <dbl>We can then use map() from the purrr
package to create many randomly generated datasets using the same
specifications, with and without missing values.
set.seed(100)
dfs <-
  map(1:20, 
      ~sim_cat(n_obs = 40, n_groups = 3, name = "factor") %>% 
        sim_covar(n_vars = 3, var = 1, cov = 0.0, name = "noise") %>% 
        group_by(factor) %>% 
        sim_discr(n_vars = 5, var = 1, cov = 0, group_means = c(-1, 0, 1), name  = "signal") %>%
        sim_discr(n_vars = 5, var = 1, cov = 0, group_means = c(0, 0.5, 1), name = "signal2") %>% 
        ungroup())Alternatively, you could generate one large dataframe (many rows) and take subsets. Either way, you know the “true” properties of the data and can compare to the results of the analyses you test.
We can now map the sim_missing() function to randomly
introduce NAs to the datasets.
And finally, deal with those NAs with multiple imputation with the
mice package.
# this might take a few seconds
dfs.imputed <-
  map(dfs.missing, ~mice(., printFlag = FALSE) %>% complete())Here, we can compare an example dataset as original, with NAs, and imputed:
head(dfs[[1]])
#> # A tibble: 6 × 14
#>   factor noise_1  noise_2 noise_3 signal_1 signal_2 signal_3 signal_4 signal_5
#>   <chr>    <dbl>    <dbl>   <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
#> 1 a       1.07   -0.00916 -0.493    -0.953    -1.29    0.952   -0.651   -1.23 
#> 2 a      -0.0134 -1.85    -0.292    -0.849    -1.87    0.311    0.856   -0.785
#> 3 a      -0.401  -1.28    -1.48     -0.432    -1.66   -0.458    0.873   -0.748
#> 4 a      -1.82    2.01    -1.12      0.298    -1.18   -1.05    -0.331   -0.514
#> 5 a       1.71    0.597   -0.0753   -1.00      1.04   -2.65    -0.442   -0.836
#> 6 a       0.848   1.13     0.0428   -1.60     -1.37   -3.24    -0.888   -1.41 
#> # ℹ 5 more variables: signal2_1 <dbl>, signal2_2 <dbl>, signal2_3 <dbl>,
#> #   signal2_4 <dbl>, signal2_5 <dbl>
head(dfs.missing[[1]])
#> # A tibble: 6 × 14
#>   factor noise_1  noise_2 noise_3 signal_1 signal_2 signal_3 signal_4 signal_5
#>   <chr>    <dbl>    <dbl>   <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
#> 1 a       1.07   -0.00916 NA        -0.953    -1.29    0.952   -0.651   -1.23 
#> 2 a      -0.0134 -1.85    -0.292    NA        -1.87    0.311    0.856   -0.785
#> 3 a      NA      -1.28    -1.48     -0.432    -1.66   -0.458    0.873   -0.748
#> 4 a      -1.82    2.01    -1.12      0.298    NA      -1.05    -0.331   -0.514
#> 5 a       1.71    0.597   -0.0753   -1.00      1.04   -2.65    -0.442   -0.836
#> 6 a       0.848   1.13     0.0428   -1.60     -1.37   -3.24    -0.888   -1.41 
#> # ℹ 5 more variables: signal2_1 <dbl>, signal2_2 <dbl>, signal2_3 <dbl>,
#> #   signal2_4 <dbl>, signal2_5 <dbl>
head(dfs.imputed[[1]])
#>   factor    noise_1      noise_2     noise_3   signal_1  signal_2   signal_3
#> 1      a  1.0738532 -0.009162203 -0.07529851 -0.9525532 -1.289870  0.9523219
#> 2      a -0.0134100 -1.851241745 -0.29246857  0.4403128 -1.873263  0.3114562
#> 3      a -0.7215366 -1.284047854 -1.48162821 -0.4322049 -1.656319 -0.4583891
#> 4      a -1.8176185  2.014018985 -1.11967562  0.2977722 -1.394962 -1.0478207
#> 5      a  1.7095029  0.596707289 -0.07529851 -0.9996540  1.039059 -2.6513107
#> 6      a  0.8475474  1.130402793  0.04280330 -1.6014609 -1.370313 -3.2403584
#>     signal_4   signal_5  signal2_1  signal2_2  signal2_3   signal2_4  signal2_5
#> 1 -0.6506898 -1.2266449 -0.9398164  0.7444634 -1.4554263 -0.06208422 -1.0852542
#> 2  0.8561365 -0.7850617  0.2655353 -0.1112986  0.9284334  1.70032445 -0.9680104
#> 3  0.8725502 -0.7483738  2.4863183  0.8013671  0.0386806 -1.23665052  0.9490844
#> 4 -0.3310259 -0.5142761 -0.2309898  0.7444634  0.3386661  1.58683283 -2.8872108
#> 5 -0.4418016 -0.8359875  1.0516597 -0.2346468  0.9491426 -0.14501237 -2.8872108
#> 6 -0.8882406 -1.4122971  1.5272074  1.7493532  2.0554579 -0.13655839 -1.8517622