Note that all examples are run without parallel processing and a small number of simulation runs B to satisfy CRAN submission rules.
This package brings together a number of routines for the two sample problem for multivariate data. There are two data sets x and y and we want to test whether they were generated by the same probability distribution.
The highlights of this package are:
We generate two-dimensional data sets of size 100 and 120 respectively from multivariate normal distributions and run the test:
x1 = mvtnorm::rmvnorm(100, c(0,0))
y1 = mvtnorm::rmvnorm(120, c(0,0))
twosample_test(x1, y1, B=B, maxProcessor = 1)
#> Data is assumed to be continuous
#> $statistics
#>        KS         K       CvM        AD       NN1       NN5        AZ        BF 
#>  0.108300  0.176600  0.075930  0.562000  1.108300  2.036700  0.006952  0.250400 
#>        BG        FR       NN0       CF1       CF2       CF3       CF4      Ball 
#>  0.763600 -1.102000  0.227200 -1.102000  1.441500  1.127300  1.285200  0.006210 
#>        ES        EP 
#> 15.452000 32.531000 
#> 
#> $p.values
#>     KS      K    CvM     AD    NN1    NN5     AZ     BF     BG     FR    NN0 
#> 0.7650 0.7600 0.7500 0.8700 0.1200 0.3450 0.4500 0.5500 0.3450 0.1353 0.0000 
#>    CF1    CF2    CF3    CF4   Ball     ES     EP 
#> 0.1353 0.4864 0.1298 0.3027 0.3761 0.3479 0.1144Arguments of twosample_test
The routine chiTS.cont (included in the package) finds either the test statistic or the p value of a chi square test in two dimensions. First we need to set
this list is passed to chiTS.cont and tells the routine to
twosample_test(x1, y1, TS=chiTS.cont, TSextra=TSextra, 
               B=B, maxProcessor = 1)
#> Data is assumed to be continuous
#> $statistics
#> Chisq Stat 3,3 Chisq Stat 3,4 
#>         4.2376        16.3280 
#> 
#> $p.values
#> Chisq Stat 3,3 Chisq Stat 3,4 
#>          0.840          0.145Arguments and output of new test routine for continuous data
The arguments have to be x and y for the two data sets and (optionally) a list called TSextra for any additional input needed to find test statistic.
Note that the output vector of the routine has to be a named vector. If the routine is written in Rcpp parallel programming is not available.
If several tests are run one can use the routine twosample_test_adjusted_pvalue to find a p value that is adjusted for simultaneous testing:
The routine twosample_power allows us to estimate the power of the tests.
Let’s say we want to estimate the power when one data set comes from a multivariate standard normal distribution and the other from a normal distribution with a different covariance. We first need a function that generates these data sets:
f=function(a=0) {
   S=diag(2) 
   x=mvtnorm::rmvnorm(100, sigma = S)
   S[1,2]=a
   S[2,1]=a
   y=mvtnorm::rmvnorm(120, sigma = S)
   list(x=x, y=y)
}   Now we can run
twosample_power(f, c(0,0.5), B=B, maxProcessor=1)
#>        KS    K   CvM    AD   NN1  NN5    AZ    BF   BG   FR NN0  CF1  CF2  CF3
#> 0   0.080 0.11 0.050 0.035 0.055 0.04 0.045 0.055 0.04 0.05   1 0.05 0.05 0.04
#> 0.5 0.135 0.21 0.255 0.280 0.255 0.29 0.240 0.245 0.04 0.26   1 0.26 0.11 0.24
#>       CF4  Ball    ES    EP
#> 0   0.055 0.055 0.045 0.040
#> 0.5 0.185 0.115 0.435 0.455Arguments of twosample_power
Again the user can provide their own routine:
twosample_power(f, c(0,0.5), TS=chiTS.cont, 
                TSextra=TSextra, B=B, maxProcessor=1)
#>     Chisq Stat 3,3 Chisq Stat 3,4
#> 0            0.045           0.07
#> 0.5          0.645           0.58The routine that generates data can also have two arguments:
f1=function(a=0, n=100) {
   S=diag(2) 
   x=mvtnorm::rmvnorm(100, sigma = S)
   S[1,2]=a
   S[2,1]=a
   y=mvtnorm::rmvnorm(n, sigma = S)
   list(x=x, y=y)
}   If the user routine returns p values run
First note that tests for discrete data are implemented only in two dimensions.
We consider the case of data from binomial distributions:
vals_x = 0:5 #possible values of first random variable
vals_y = 0:6 #possible values of second random variable
a1x = rbinom(1000, 5, 0.5) 
a2x = rbinom(1000, 6, 0.5)
a1y = rbinom(1200, 5, 0.5) 
a2y = rbinom(1200, 6, 0.5)
x2=matrix(0, 6*7, 4)
colnames(x2)=c("vals_x", "vals_y", "x", "y")
x2[, 1]=rep(vals_x, length(vals_y))
x2[, 2]=rep(vals_y, each=length(vals_x))
for(i in 0:5) {
  for(j in 0:6) {
    x2[x2[,1]==i&x2[,2]==j, 3]=sum(a1x==i&a2x==j)
    x2[x2[,1]==i&x2[,2]==j, 4]=sum(a1y==i&a2y==j)
  }
}
twosample_test(x2, B=B, maxProcessor = 1)
#> Data is assumed to be discrete
#> $statistics
#>        KS         K       CvM        AD        NN        AZ        BF ChiSquare 
#>   0.03333   0.05716   0.17180   1.11780   3.53210   0.04232   0.06514  28.20800 
#> 
#> $p.values
#>        KS         K       CvM        AD        NN        AZ        BF ChiSquare 
#>    0.4700    0.3400    0.3650    0.3350    0.5650    0.3350    0.3050    0.7469Arguments of twosample_test
As in the case of continuous data the arguments include TS, TSextra, minexpcount, rnull, maxProcessor and doMethods. In addition we now have
Again one can find a p value adjusted for simultaneous testing:
The routine chiTS.disc (included in package) does a chi-square test for discrete data:
Again we need a routine that generates data sets. In the discrete case this has to be a routine whose output is a matrix with columns named vals_x, vals_y, x and y
g=function(a=0) {
   x=cbind(rbinom(1000, 5, 0.5), rpois(1000, 1))
   x[,2][x[,2]>5]=5
   lambda=1+a*x[,1]/5
   y=cbind(rbinom(1200, 5, 0.5), rpois(1200, lambda))
   y[,2][y[,2]>5]=5
   vx=0:5
   vy=0:5
   A=matrix(0,length(vx)*length(vy),4)                  
   k=0
   for(i in seq_along(vx))
     for(j in seq_along(vy)) {
       k=k+1
       A[k,1]=vx[i]
       A[k,2]=vy[j]
       A[k,3]=sum(x[,1]==vx[i] & x[,2]==vy[j])
       A[k,4]=sum(y[,1]==vx[i] & y[,2]==vy[j])
     }
   colnames(A)=c("vals_x", "vals_y", "x", "y")
   A
}twosample_power(g, c(0, 0.25, 0.5), B=200, maxProcessor=1)
#>         KS     K   CvM    AD    NN    AZ    BF Chisquare
#> 0    0.085 0.040 0.075 0.080 0.035 0.045 0.045     0.035
#> 0.25 0.415 0.315 0.495 0.495 0.090 0.115 0.115     0.240
#> 0.5  0.990 0.965 0.990 0.990 0.670 0.140 0.200     0.910or using a user supplied test:
TSextra=list(which="statistic")
twosample_power(g, c(0, 0.25, 0.5), B=200, 
                TS=chiTS.disc, TSextra=TSextra, maxProcessor=1)
#>      Chisq Stat
#> 0         0.015
#> 0.25      0.100
#> 0.5       0.845
TSextra=list(which="pvalue")
twosample_power(g, c(0, 0.25, 0.5), B=200, 
    TS=chiTS.disc, TSextra=TSextra, With.p.value = TRUE,
    maxProcessor=1)
#>      Chisq P
#> 0      0.030
#> 0.25   0.195
#> 0.5    0.830We have a data set x and we want to test whether it comes from a bivariate normal distribution. Instead of a goodness-of-fit test, however, we generate a Monte Carlo data set y from a bivariate normal rv with mean and covariance estimated from the real data set x, and then run a two-sample test.
In this scenario the two data sets are not independent, and the permutation approach to finding p values is extremely conservative, that is, the true type I error probability is much smaller than the nominal one. This in turn makes the power of the test much lower as well. Instead one can supply a routine that generates new data, just as one would in a goodness-of-fit test. Moreover, all the methods who find their own p values will now fail completely and so sshould not be run.
# generate real and MC data sets:
f=function(mu) {
    x=mvtnorm::rmvnorm(100, c(mu, mu))
    y=mvtnorm::rmvnorm(100, apply(x, 2, mean), cor(x))
    list(x=x, y=y)
}
#True data is a mixture of normal and uniform
g=function(alpha=0) {
    x=rbind(mvtnorm::rmvnorm((1-alpha)*100, c(0, 0)),
            matrix(runif(200*alpha),ncol=2))
    y=mvtnorm::rmvnorm(100, apply(x, 2, mean), cor(x))
    list(x=x, y=y)
}
# generate two-sample data set
rnull=function(dta) {
   x=mvtnorm::rmvnorm(nrow(dta$x), apply(dta$x, 2, mean), cor(dta$x))
   y=mvtnorm::rmvnorm(nrow(x), apply(x, 2, mean), cor(x))
   list(x=x, y=y)
}# Only run these methods for hypbrid problem
mt=c("KS", "K", "CvM", "AD", "NN1", "NN5", "AZ", "BF", "BG")
# Null hypothesis is true:
twosample_power(f, c(0, 1), doMethods = mt, B=200, maxProcessor = 1)
#>      KS     K  CvM    AD   NN1  NN5    AZ   BF   BG
#> 0 0.025 0.010 0.02 0.010 0.015 0.04 0.015 0.00 0.05
#> 1 0.025 0.045 0.01 0.015 0.045 0.02 0.010 0.01 0.03
twosample_power(f, c(0, 1), rnull=rnull, B=200, maxProcessor = 1)
#>      KS     K  CvM   AD   NN1  NN5    AZ    BF    BG
#> 0 0.025 0.020 0.08 0.07 0.050 0.06 0.050 0.050 0.050
#> 1 0.070 0.075 0.05 0.04 0.035 0.06 0.035 0.055 0.035
# Null hypothesis is false:
twosample_power(g, c(0, 0.5), doMethods = mt, B=200, maxProcessor = 1)
#>       KS     K   CvM    AD  NN1  NN5    AZ    BF    BG
#> 0   0.00 0.035 0.000 0.005 0.02 0.02 0.005 0.000 0.060
#> 0.5 0.87 1.000 0.505 0.500 0.72 0.87 1.000 0.995 0.995
twosample_power(g, c(0, 0.5), rnull=rnull, B=200, maxProcessor = 1)
#>        KS     K   CvM   AD  NN1  NN5    AZ   BF    BG
#> 0   0.045 0.035 0.045 0.05 0.05 0.08 0.045 0.06 0.035
#> 0.5 0.940 1.000 0.765 0.79 0.81 0.93 0.995 1.00 1.000As we can see, the true type I error using the permutation method is much smaller than the nominal \(\alpha=0.05\), and so the power is lower as well.
The routine run.studies allows the user to quickly compare the power of a new test with the power of the included tests for 25 different combinations of null hypothesis vs alternative for continuous data and 20 for discrete data. It also allows the user to find the power for those case studies for different sample size and type I error probabilities.
As an example, let’s compare the power of the test based on differences in means to the included ones for two of the studies.
Note that these examples are not run so the package can be submitted to CRAN:
Arguments of run.studies
The data set power_studies_results included in MD2sample has the results for the included tests using a sample size of 200, a true type I error probability of 0.05 and two values of the parameter under the alternative, on which makes the null hypothesis true and one which makes it false. If the user wants different numbers he can run:
run.studies(Continuous=TRUE, 
            study=c("NormalD2", "tD2"), 
            param_alt=cbind(c(0.4, 0.4), c(0.7, 0.7)), 
            alpha=0.1, B=100)Say the new method can find p values without simulation:
Note that the routine should return a named vector with the p values.
TSextra=list(which="pvalue", nbins=cbind(c(3,3), c(4,4)))
run.studies(Continuous=TRUE, 
                study=c("NormalD2", "tD2"),
                TS=chiTS.cont, 
                TSextra=TSextra,
                With.p.value = TRUE, 
                B=500,
                SuppressMessages = TRUE,
                maxProcessor=1)
#> Average number of times a test is close to best:
#>             BG           Ball            CF2            NN0             KS 
#>            1.0            2.0            3.0            4.0            6.0 
#>              K            NN1            CF4             FR            CF1 
#>            6.0            6.0            8.0           10.5           10.5 
#>            CF3            CvM            NN5             AD             BF 
#>           10.5           11.0           13.0           13.5           15.0 
#>             AZ             ES             EP Chisq Pval 3,4 Chisq Pval 3,4 
#>           16.0           17.0           18.0           19.5           19.5
#>          Chisq Pval 3,4 Chisq Pval 3,4    KS     K   CvM    AD   NN1   NN5
#> NormalD2          0.986          0.986 0.418 0.430 0.552 0.642 0.484 0.664
#> tD2               0.960          0.960 0.396 0.393 0.539 0.635 0.357 0.502
#>             AZ    BF    BG    FR   NN0   CF1   CF2   CF3   CF4  Ball    ES
#> NormalD2 0.884 0.831 0.050 0.601 0.395 0.601 0.377 0.601 0.502 0.302 0.973
#> tD2      0.817 0.781 0.067 0.491 0.312 0.491 0.307 0.491 0.401 0.200 0.863
#>             EP
#> NormalD2 0.978
#> tD2      0.933Consider the following situation. We have a data set \(x\) and want to test whether it comes from some probability distribution F, that is, we have a goodness-of-fit problem. However, it is not possible to calculate probabilities from \(F\), probably because this would require integration in high dimensions. We are, though, able to generate a second data set \(y\) under \(F\), and so can run a twosample test.
It can be shown that if the model \(F\) is not fully specified but includes parameters that have to be estimated from \(x\), finding p values using the permutation method leads to an extremely conservative tests, that is, the true type I error probability is much smaller than the desired one. Instead one can use a parametric bootstrap approach, that is a new data set \(y\) can be generated in each simulation run.
Let’s say we wish to test whether the data set \(x\) comes from a multivariate normal distribution with unknown mean and covariance:
f=function(a) {
    x=mvtnorm::rmvnorm(500, c(0.5, 0.5))
    y=rbind(matrix(runif(a*1000), ncol=2),
            mvtnorm::rmvnorm((1-a)*500, c(0.5,0.5)))
    list(x=x, y=y)
}    
rnull=function(dta) {
  muhat=apply(dta$x, 2, mean)
  sigmahat=cor(dta$x)
  list(x=mvtnorm::rmvnorm(nrow(dta$x), muhat, sigmahat),  
       y=mvtnorm::rmvnorm(nrow(dta$y), muhat, sigmahat))
}dta=f(0) # Null hypothesis is true
twosample_test(dta, rnull=rnull, B=B, maxProcessor = 1)
#> Data is assumed to be continuous
#> $statistics
#>       KS        K      CvM       AD      NN1      NN5       AZ       BF 
#> 0.076000 0.116000 0.176700 1.321800 1.064000 2.038000 0.001345 0.293900 
#>       BG 
#> 0.888500 
#> 
#> $p.values
#>    KS     K   CvM    AD   NN1   NN5    AZ    BF    BG 
#> 0.245 0.255 0.205 0.170 0.055 0.250 0.170 0.200 0.035dta=f(0.2) # Null hypothesis is false
twosample_test(dta, rnull=rnull, B=B, maxProcessor = 1)
#> Data is assumed to be continuous
#> $statistics
#>      KS       K     CvM      AD     NN1     NN5      AZ      BF      BG 
#> 0.12400 0.20000 0.49640 3.25860 1.10000 2.22200 0.01947 0.97180 4.03100 
#> 
#> $p.values
#>    KS     K   CvM    AD   NN1   NN5    AZ    BF    BG 
#> 0.000 0.000 0.000 0.000 0.005 0.000 0.020 0.000 0.000Again we can find adjusted p values:
and find the power of the tests:
Denote by \(\mathbf{z}\) the combined data set \(x_1,..,x_n,y_1,..y_m\). Let \(\hat{F}\) and \(\hat{G}\) be the empirical distribution functions of the two data sets, and let \(\hat{H}\) be the empirical distribution function of \(\mathbf{z}\)
Kolmogorov-Smirnov test
This classic test uses the test statistic
\[\max\{\vert \hat{F}(z_i)-\hat{G}(z_i)\vert;z_1,..,z_{n+m}\}\]
It was first proposed in (Kolmogorov 1933) and (Smirnov 1939).
Kiuper’s test
A variant of Kolmorogov-Smirnov:
\[\max\{\hat{F}(z_i)-\hat{G}(z_i);z_1,..,z_{n+m}\}-\min\{ \hat{F}(z_i)-\hat{G}(z_i);z_1,..,z_{n+m}\}\] This test was first discussed in (Kuiper 1960).
Cramer-vonMises test
\[\sum_{i=1}^{n+m} \left(\hat{F}(z_i)-\hat{G}(z_i)\right)^2\] the extension to the two-sample problem of the Cramer-vonMises criterion is discussed in (T. W. Anderson 1962).
Anderson-Darling test
\[\sum_{i=1}^{n+m} \frac{\left(\hat{F}(z_i)-\hat{G}(z_i)\right)^2}{\hat{H}(z_i)(1-\hat{H}(z_i))}\] It was first proposed in (Theodore W. Anderson, Darling, et al. 1952).
The test statistics are the average number of nearest neighbors of the \(\mathbf{x}\) data set that are also from \(\mathbf{x}\) plus the average number of nearest neighbors of the \(\mathbf{y}\) data set that are also from \(\mathbf{y}\). NN1 uses one nearest neighbor and \(NN5\) uses 5.
We denote by \(||.||\) Euclidean distance
Aslan-Zech test
This test discussed in (Aslan and and 2005) uses the test statistic
\[ \begin{aligned} &\frac{1}{nm}\sum_{i=1}^n \sum_{j=1}^m \log(||x_i-y_j||) -\\ &\frac{1}{n^2}\sum_{i=1}^n \sum_{i<j} \log(||x_i-x_j||) - \\ &\frac{1}{m^2}\sum_{i=1}^m \sum_{i<j} \log(||y_i-y_j||) \end{aligned} \] Baringhaus-Franz test
Similar to the Aslan-Zech test, it uses the test statistic
\[ \begin{aligned} &\frac{nm}{n+m}\left[\frac{1}{nm}\sum_{i=1}^n \sum_{j=1}^m \sqrt{||x_i-y_j||} + \right.\\ &\frac{1}{n^2}\sum_{i=1}^n \sum_{i<j} \sqrt{||x_i-x_j||} -\\ &\left. \frac{1}{m^2}\sum_{i=1}^m \sum_{i<j} \sqrt{||y_i-y_j||} \right]\\ \end{aligned} \] and was first proposed in (Baringhaus and Franz 2004).
Biswas-Ghosh test
Another variation of test based on Euclidean distance was discussed in (Biswas and Ghosh 2014).
\[ \begin{aligned} &B_{xy} = \frac{1}{nm}\sum_{i=1}^n \sum_{j=1}^m \sqrt{||x_i-y_j||} \\ &B_{xx}= \frac{2}{n(n-1)}\sum_{i=1}^n \sum_{i<j} \sqrt{||x_i-x_j||} \\ &B_{yy}=\frac{2}{m(m-1)}\sum_{i=1}^m \sum_{i<j} \sqrt{||y_i-y_j||}\\ &\left(B_{xx}-B_{xy}\right)^2+\left(B_{yy}-B_{xy}\right)^2 \end{aligned} \]
Friedman-Rafski test
This test is a multi-dimensional extension of the classic Wald-Wolfowitz test bases on minimum spanning trees. It was discussed in (Friedman and Rafsky 1979).
Simple Nearest Neighboor test
Similar to the nearest neigboor tests described earlier, it uses only the number of nearest neighbors of the first data set that are also from the first data set. This number has a binomial distribution, and this can be used to find p values.
Chen-Friedman tests
These tests, discussed in (Chen and Friedman, n.d.), are implemented in the gTests (Chen and Zhang 2017) package.
Ball Divergence test
A test described in (Pan et al. 2018) and implemented in the R package (Zhu et al. 2021).
Implemented for discrete data are versions of the Kolmogorov-Smirnov, Kuiper, Cramer-vonMises, Anderson-Darling, nearest neighboor, Aslan-Zech, Baringhaus-Franz tests as well as a chisquare test.