SimMultiCorrData generates continuous (normal or non-normal), binary, ordinal, and count (Poisson or Negative Binomial) variables with a specified correlation matrix. It can also produce a single continuous variable. This package can be used to simulate data sets that mimic real-world situations (i.e. clinical data sets, plasmodes, as in Vaughan et al. (2009)). All variables are generated from standard normal variables with an imposed intermediate correlation matrix. Continuous variables are simulated by specifying mean, variance, skewness, standardized kurtosis, and fifth and sixth standardized cumulants using either Fleishman’s Third-Order (1978) or Headrick’s Fifth-Order (2002) Polynomial Transformation. Binary and ordinal variables are simulated using a modification of GenOrd::ordsample Barbiero and Ferrari (2015a). Count variables are simulated using the inverse cdf method. There are two simulation pathways which differ primarily according to the calculation of the intermediate correlation matrix Sigma. In Correlation Method 1, the intercorrelations involving count variables are determined using a simulation based, logarithmic correlation correction (adapting Yahav and Shmueli (2012)’s method). In Correlation Method 2, the count variables are treated as ordinal (adapting Barbiero and Ferrari (2015b)’s modification of GenOrd). There is an optional error loop that corrects the final correlation matrix to be within a user-specified precision value. The package also includes functions to calculate standardized cumulants for theoretical distributions or from real data sets, check if a target correlation matrix is within the possible correlation bounds (given the distributions of the simulated variables), summarize results (numerically or graphically), verify valid power method pdfs, and calculate lower standardized kurtosis bounds.
The main strengths of SimMultiCorrData are:
The user may generate correlated continuous (normal or non-normal), ordinal (r >= 2 categories), Poisson and/or Negative Binomial variables simultaneously, based on either theoretical distributions or empirical data.
Two distinct methods for generating non-normal continuous variables: Fleishman’s third-order or Headrick’s fifth-order polynomial transformation.
Two distinct methods for generating count variables (see Comparison of Correlation Method 1 and Correlation Method 2 vignette). The user may test each to see which yields greater simulation accuracy.
Calculation of the precise lower kurtosis boundary using the Lagrangean constraint equations, instead of an approximation (see calc_lower_skurt).
Valid power method pdf checks during the calculation of the constants for continuous variables, and optional use of a sixth cumulant correction value to enable the discovery of valid pdf constants.
Computation of feasible correlation bounds based on data simulation method (see valid_corr for correlation method 1 or valid_corr2 for correlation method 2).
Numerous attempts to reproduce the desired correlation matrix, including correcting for non-positive-definite intermediate correlation matrices and an optional final error loop (see Overview of Error Loop vignette). This error loop enables reproduction of many correlation structures that can not be achieved through other methods.
Function arguments (i.e. seed, n, maxit, epsilon) that allow the user to have greater control over simulation accuracy, speed, and reproducibility.
Detailed simulation results, including the simulation time (in minutes) and descriptions of the generated variables and the correlation structure.
Additional functions to supplement the simulation process:
calc_theory) or a vector of data by the method of moments (calc_moments) or based on Fisher’s k-statistics (calc_fisherk). Additional summary functions compute important statistics for the generated continuous variables.ggplot2 objects so that the user may save them or further adapt the graphs as necessary.There are several other simulation packages. For example, Barbiero & Ferrari’s (2015a) GenOrd, Amatya & Demirtas’ (2016a) MultiOrd, Leisch, Kaiser, & Hornik’s (2010) orddata, and Demirtas, Nordgren, & Allozi’s (2017) PoisBinOrdNonNor. The first three generate only binary and ordinal data, while the last generates Poisson, binary, ordinal, and non-normal variables.
GenOrdGenOrd generates discrete random variables (i.e. binary or ordinal) with given correlation matrix and marginal distributions. The method used to determine the intermediate MVN correlation matrix in GenOrd::ordcont has been modified in SimMultiCorrData’s ordnorm function. It works by setting the intermediate correlation equal to the target correlation of the discrete variables. Each intermediate pairwise correlation is updated until the final pairwise correlation is within a user-specified precision value (epsilon) of the target correlation or the maximum number of iterations (maxit) has been reached. GenOrd::ordcont has been modified in the following ways:
SimMultiCorrData::valid_corr or valid_corr2.Sigma for all variable types, and if necessary, Sigma is converted to the nearest positive-definite matrix using Higham’s (2002) algorithm in Matrix::nearPD.SimMultiCorrData::ordnorm uses GenOrd::contord to calculate the ordinal correlation obtained from discretizing the normal variables generated from the intermediate correlation matrix Sigma. The reason is because the function does not require random generation of the normal variables, which ensures greater reproducibility.
SimMultiCorrData also improves the way ordinal variables are generated, as compared to GenOrd::ordsample:
SimMultiCorrData::rcorrvar and rcorrvar2 allow a user-specified seed, maximum number of iterations, and epsilon value.GenOrd::ordsample stops if the intermediate correlation matrix Sigma is not positive-definite. As described above, SimMultiCorrData attempts to correct the problem and a warning is given that it may not be possible to produce the desired correlation matrix.MultiOrdMultiOrd generates multivariate ordinal data with given correlation matrix and marginal distributions via the binary conversion method of Demirtas (2006). This method computes the binary marginals by collapsing the marginal distributions of the ordinal variables. The intermediate correlation matrix is also computed through an iterative process based on matching the target matrix. Binary data are then converted to ordinal data through a randomization step. This procedure requires the simulation of large samples of binary data in order to maximize accuracy, which requires greater computational time and resources than the methods used in SimMultiCorrData.
orddataorddata generates binary and ordinal data through 4 available methods:
PoisBinOrdNonNorPoisBinOrdNonNor is one in an extensive series of simulation packages created by Demirtas with additional authors. Other packages include OrdNor (Amatya and Demirtas 2015), BinNonNor (Inan and Demirtas 2016a), BinOrdNonNor (Demirtas, Wang, and Allozi 2017), PoisBinOrd (Inan and Demirtas 2016b), PoisNor (Amatya and Demirtas 2016b), and PoisBinOrdNor (Demirtas, Hu, and Allozi 2017). PoisBinOrdNonNor generates Poisson, binary, ordinal, and non-normal variables. It differs from SimMultiCorrData in the following ways:
SimMultiCorrData’s simulation functions rcorrvar and rcorrvar2 allow the user to either provide an intermediate matrix or the matrix is calculated during the simulation.SimMultiCorrData. However, PoisBinOrdNonNor does not produce Negative Binomial variables.SimMultiCorrData. However, those for ordinal variables are found using ordcont, which, as previously mentioned, will stop if the intermediate matrix is not positive-definite.SimMultiCorrData contains the functions power_norm_corr and pdf_check. The function that solves for the constants (SimMultiCorrData::find_constants) executes these checks when finding the constants and attempts to produce valid pdf constants. In the case of Headrick’s fifth-order method, the user may specify a sixth cumulant correction value to help find these constants.PoisBinOrdNonNor is a simple approximation: \(\Large standardized\ kurtosis \ge skew^2 - 2\). SimMultiCorrData::calc_lower_skurt solves the Lagrangean expressions (as described in Headrick (2002) and Headrick and Sawilowsky (2002)) that determine the precise lower kurtosis boundary. Examination of the boundaries computed in PoisBinOrdNonNor demonstrates that the approximate boundaries are much lower than the actual Fleishman boundaries, indicating that the guideline is not accurate (see calc_lower_skurt for examples).PoisBinOrdNonNor does not allow the user to specify a seed for random number generation, or an epsilon value and maximum number of iterations to use when determining the intermediate ordinal correlations. These specifications, as found in SimMultiCorr’s simulation functions rcorrvar and rcorrvar2, are essential for reproducibility and controlling accuracy.SimMultiCorr’s simulation functions produce detailed summaries of the variables, the final correlation matrix, the maximum error between the final and target correlation matrices, and the simulation time.Amatya, A, and H Demirtas. 2015. OrdNor: An R Package for Concurrent Generation of Correlated Ordinal and Normal Data. Journal of Statistical Software Code Snippets. Vol. 68. doi:10.18637/jss.v068.c02.
———. 2016a. MultiOrd: Generation of Multivariate Ordinal Variates. https://CRAN.R-project.org/package=MultiOrd.
———. 2016b. PoisNor: Simultaneous Generation of Multivariate Data with Poisson and Normal Marginals. https://CRAN.R-project.org/package=PoisNor.
Barbiero, A, and P A Ferrari. 2015a. GenOrd: Simulation of Discrete Random Variables with Given Correlation Matrix and Marginal Distributions. https://CRAN.R-project.org/package=GenOrd.
———. 2015b. “Simulation of Correlated Poisson Variables.” Applied Stochastic Models in Business and Industry 31: 669–80. doi:10.1002/asmb.2072.
Demirtas, H. 2006. “A Method for Multivariate Ordinal Data Generation Given Marginal Distributions and Correlations.” Journal of Statistical Computation and Simulation 76 (11): 1017–25. doi:10.1080/10629360600569246.
Demirtas, H, D Hedeker, and R J Mermelstein. 2012. “Simulation of Massive Public Health Data by Power Polynomials.” Statistics in Medicine 31 (27): 3337–46. doi:10.1002/sim.5362.
Demirtas, H, Y Hu, and R Allozi. 2017. PoisBinOrdNor: Data Generation with Poisson, Binary, Ordinal and Normal Components. https://CRAN.R-project.org/package=PoisBinOrdNor.
Demirtas, H, R Nordgren, and R Allozi. 2017. PoisBinOrdNonNor: Generation of up to Four Different Types of Variables. https://CRAN.R-project.org/package=PoisBinOrdNonNor.
Demirtas, H, Y Wang, and R Allozi. 2017. BinOrdNonNor: Concurrent Generation of Binary, Ordinal and Continuous Data. https://CRAN.R-project.org/package=BinOrdNonNor.
Fleishman, A I. 1978. “A Method for Simulating Non-Normal Distributions.” Psychometrika 43: 521–32. doi:10.1007/BF02293811.
Headrick, T C. 2002. “Fast Fifth-Order Polynomial Transforms for Generating Univariate and Multivariate Non-Normal Distributions.” Computational Statistics and Data Analysis 40 (4): 685–711. doi:10.1016/S0167-9473(02)00072-5.
Headrick, T C, and S S Sawilowsky. 2002. “Weighted Simplex Procedures for Determining Boundary Points and Constants for the Univariate and Multivariate Power Methods.” Journal of Educational and Behavioral Statistics 25: 417–36. doi:10.3102/10769986025004417.
Inan, G, and H Demirtas. 2016a. BinNonNor: Data Generation with Binary and Continuous Non-Normal Components. https://CRAN.R-project.org/package=BinNonNor.
———. 2016b. PoisBinOrd: Data Generation with Poisson, Binary and Ordinal Components. https://CRAN.R-project.org/package=PoisBinOrd.
Kaiser, S, D Traeger, and F Leisch. 2011. “Generating Correlated Ordinal Random Values.” Technical Report Number 94. Department of Statistics at University of Munich. https://epub.ub.uni-muenchen.de/12157/1/kaiser-tr-94-ordinal.pdf.
Leisch, F, A W S Kaiser, and K Hornik. 2010. Orddata: Generation of Artificial Ordinal and Binary Data.
Vaughan, L K, J Divers, M Padilla, D T Redden, H K Tiwari, D Pomp, and D B Allison. 2009. “The Use of Plasmodes as a Supplement to Simulations: A Simple Example Evaluating Individual Admixture Estimation Methodologies.” Computational Statistics and Data Analysis 53 (5): 1755–66. doi:10.1016/j.csda.2008.02.032.
Yahav, I, and G Shmueli. 2012. “On Generating Multivariate Poisson Data in Management Science Applications.” Applied Stochastic Models in Business and Industry 28 (1): 91–102. doi:10.1002/asmb.901.