Title: Robust Graph-Based Two-Sample Test
Version: 0.1
Description: Useful tools for determining whether two samples are from the same distribution. Utilizes a robust method to address the problematic structure of the similarity graph constructed from high-dimensional data. The method is provided in Yichuan Bai and Lynna Chu (2023) <doi:10.48550/arXiv.2307.12325>.
License: MIT + file LICENSE
Encoding: UTF-8
RoxygenNote: 7.2.3
Imports: ade4, stats
Suggests: knitr, rmarkdown, testthat (≥ 3.0.0)
Config/testthat/edition: 3
VignetteBuilder: knitr
Depends: R (≥ 3.0.1)
LazyData: true
NeedsCompilation: no
Packaged: 2023-08-11 21:07:22 UTC; tutu
Author: Yichuan Bai [aut, cre], Lynna Chu [aut]
Maintainer: Yichuan Bai <ycbai@iastate.edu>
Repository: CRAN
Date/Publication: 2023-08-14 11:40:02 UTC
Get the approximate test statistic and p-value based on asymptotic theory using the robust generalized edge-count test
Description
Computes the approximate test statistic and p-value based on asymptotic theory using the robust generalized edge-count test.
Usage
asy_gen(asy_res, R1_test, R2_test)
Arguments
asy_res |
analytic expressions of expectations, variances and covariances |
R1_test |
weighted within-sample edge-counts of sample 1 |
R2_test |
weighted within-sample edge-counts of sample 2 |
Value
A list containing the following components:
test_statistic |
the asymptotic test statistic using robust generalized graph-based test. |
p_value |
the asymptotic p-value using robust generalized graph-based test. |
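As a rough illustration of what a generalized edge-count statistic looks like, the sketch below forms the quadratic form of the centered within-sample counts and compares it with a chi-square distribution with 2 degrees of freedom. The helper `gen_stat_sketch` is hypothetical and written only for this note; its moment arguments borrow the names `mu1`, `mu2`, `sig11`, `sig22` and `sig12` from the output of `theo_mu_sig`, and it is an assumption that `asy_gen` follows the same construction.

```r
# Hypothetical helper, not part of the package: generalized edge-count
# statistic built from observed counts and their null moments.
gen_stat_sketch <- function(R1, R2, mu1, mu2, sig11, sig22, sig12) {
  dev <- c(R1 - mu1, R2 - mu2)                       # centered counts
  Sigma <- matrix(c(sig11, sig12, sig12, sig22), nrow = 2)
  S <- drop(t(dev) %*% solve(Sigma) %*% dev)         # quadratic form
  list(test_statistic = S,
       p_value = pchisq(S, df = 2, lower.tail = FALSE))
}

# Toy numbers chosen for illustration only.
res_gen <- gen_stat_sketch(R1 = 3, R2 = 4, mu1 = 2, mu2 = 3,
                           sig11 = 1, sig22 = 1, sig12 = 0)
```

With these toy inputs the statistic is 2, so the p-value equals `exp(-1)`, the upper tail of a chi-square with 2 degrees of freedom at 2.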
Get the approximate test statistic and p-value based on asymptotic theory using the robust max-type edge-count test
Description
Computes the approximate test statistic and p-value based on asymptotic theory using the robust max-type edge-count test.
Usage
asy_max(asy_res, R1_test, R2_test, n1, n2)
Arguments
asy_res |
analytic expressions of expectations, variances and covariances |
R1_test |
weighted within-sample edge-counts of sample 1 |
R2_test |
weighted within-sample edge-counts of sample 2 |
n1 |
number of observations in sample 1 |
n2 |
number of observations in sample 2 |
Value
A list containing the following components:
test_statistic |
the asymptotic test statistic using robust max-type graph-based test. |
p_value |
the asymptotic p-value using robust max-type graph-based test. |
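A max-type edge-count statistic is commonly built from two standardized quantities: a weighted statistic `Zw` and a difference statistic `Zd` (the standardized value of R1 - R2). The sketch below assumes both have already been standardized and are treated as asymptotically independent standard normals under the null. The helper `max_type_sketch` and the default `kappa = 1.14` are assumptions made for illustration; the package may use a different constant or construction.

```r
# Hypothetical helper, not part of the package: combine a standardized
# weighted statistic (Zw) and a standardized difference statistic (Zd).
max_type_sketch <- function(Zw, Zd, kappa = 1.14) {
  M <- max(kappa * Zw, abs(Zd))
  # P(M <= m) = P(Zw <= m / kappa) * P(|Zd| <= m) under independence
  p <- 1 - pnorm(M / kappa) * (2 * pnorm(M) - 1)
  list(test_statistic = M, p_value = p)
}

res_max <- max_type_sketch(Zw = 0, Zd = 0)   # no evidence against the null
```

When both inputs are zero the statistic is 0 and the approximate p-value is 1, which is the least extreme outcome this construction can produce for non-negative `M`.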
Get the approximate test statistic and p-value based on asymptotic theory using the robust weighted edge-count test
Description
Computes the approximate test statistic and p-value based on asymptotic theory using the robust weighted edge-count test.
Usage
asy_wei(asy_res, R1_test, R2_test, n1, n2)
Arguments
asy_res |
analytic expressions of expectations, variances and covariances |
R1_test |
weighted within-sample edge-counts of sample 1 |
R2_test |
weighted within-sample edge-counts of sample 2 |
n1 |
number of observations in sample 1 |
n2 |
number of observations in sample 2 |
Value
A list containing the following components:
test_statistic |
the asymptotic test statistic using robust weighted graph-based test. |
p_value |
the asymptotic p-value using robust weighted graph-based test. |
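The role of `n1` and `n2` can be illustrated with the usual weighted edge-count construction, in which the two within-sample counts are combined with sample-size-dependent coefficients and then standardized. The helper `wei_stat_sketch` below is hypothetical; the coefficients `p = (n1 - 1) / (N - 2)` and `q = 1 - p` and the one-sided normal p-value are assumptions taken from the standard weighted edge-count test, not confirmed details of `asy_wei`.

```r
# Hypothetical helper, not part of the package: standardized weighted
# combination of the two within-sample edge-counts.
wei_stat_sketch <- function(R1, R2, n1, n2, mu1, mu2, sig11, sig22, sig12) {
  N <- n1 + n2
  p <- (n1 - 1) / (N - 2)
  q <- 1 - p
  Rw     <- q * R1 + p * R2                          # combined count
  mean_w <- q * mu1 + p * mu2                        # its null mean
  var_w  <- q^2 * sig11 + p^2 * sig22 + 2 * p * q * sig12
  Zw <- (Rw - mean_w) / sqrt(var_w)
  list(test_statistic = Zw,
       p_value = pnorm(Zw, lower.tail = FALSE))      # large Zw is evidence
}

# Toy numbers: observed counts equal their null means.
res_wei <- wei_stat_sketch(R1 = 2, R2 = 3, n1 = 10, n2 = 10,
                           mu1 = 2, mu2 = 3,
                           sig11 = 1, sig22 = 1, sig12 = 0)
```

Here the standardized statistic is 0, so the one-sided p-value is 0.5.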
Example
Description
This example contains a dataset, the labels of the observations in the dataset, the distance matrix of the dataset using L2 distance, and the edge matrix generated by 5-MST.
Usage
example0
Format
An object of class list of length 4.
Details
data
pooled dataset of two samples sampling from two different t-distributions.
label
label of the observations. 'sample 1' denotes the observations in sample 1. 'sample 2' denotes the observations in sample 2.
distance
the distance matrix of the pooled dataset using L2 distance.
edge
edge matrix generated by 5-MST.
Get distance matrix
Description
This function returns the distance matrix using L2 distance.
Usage
getdis(y)
Arguments
y |
dataset of the pooled data |
Value
A distance matrix based on the L2 distance.
Examples
data(example0)
data = as.matrix(example0$data) # pooled dataset
getdis(data)
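For readers who want to see what an L2 distance matrix looks like without loading the package data, the following base R sketch builds one with `stats::dist`. It is an assumption that this matches the output of `getdis` up to attributes such as dimnames; the toy matrix `y_toy` is made up for illustration.

```r
# Base R sketch: pairwise Euclidean (L2) distances between rows.
set.seed(1)
y_toy <- matrix(rnorm(12), nrow = 4)                 # 4 observations, 3 variables
d_toy <- as.matrix(dist(y_toy, method = "euclidean"))

# Entry [i, j] is the L2 distance between rows i and j.
d12 <- sqrt(sum((y_toy[1, ] - y_toy[2, ])^2))
```

The result is a symmetric 4 x 4 matrix with zeros on the diagonal, and `d_toy[1, 2]` agrees with the directly computed `d12`.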
Construct K-MST
Description
Constructs a K-MST similarity graph from the data or from a distance matrix.
Usage
kmst(y = NULL, dis = NULL, k = 1)
Arguments
y |
data |
dis |
distance matrix |
k |
parameter in K-MST, with default 1 |
Value
An edge matrix representing a similarity graph. Each row represents an edge and records the indices of two ends of an edge in two columns.
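A K-MST is the union of the 1st through K-th minimum spanning trees, where each later tree avoids edges already used by earlier ones. The sketch below shows one way to build such an edge matrix in base R with Prim's algorithm. The helpers `prim_mst` and `kmst_sketch` are hypothetical and are not the package's implementation (which may rely on ade4); the sketch also assumes K is small enough that unused edges remain available.

```r
# Hypothetical helper: one minimum spanning tree by Prim's algorithm.
prim_mst <- function(d) {
  n <- nrow(d)
  in_tree <- c(TRUE, rep(FALSE, n - 1))              # start from node 1
  best_dist <- d[1, ]                                # cheapest link to the tree
  best_from <- rep(1L, n)                            # tree node giving that link
  edges <- matrix(NA_integer_, nrow = n - 1, ncol = 2)
  for (s in seq_len(n - 1)) {
    cand <- which(!in_tree)
    j <- cand[which.min(best_dist[cand])]            # closest outside node
    edges[s, ] <- c(best_from[j], j)
    in_tree[j] <- TRUE
    closer <- !in_tree & d[j, ] < best_dist
    best_dist[closer] <- d[j, closer]
    best_from[closer] <- j
  }
  edges
}

# Hypothetical helper: stack k edge-disjoint spanning trees.
kmst_sketch <- function(d, k = 1) {
  d <- as.matrix(d)
  out <- NULL
  for (i in seq_len(k)) {
    e <- prim_mst(d)
    out <- rbind(out, e)
    d[e] <- Inf                                      # forbid reuse of these edges
    d[e[, 2:1, drop = FALSE]] <- Inf
  }
  out
}

pts <- c(1, 2, 4, 7, 11)                             # toy points on a line
d_line <- as.matrix(dist(pts))
E1 <- kmst_sketch(d_line, k = 1)
E2 <- kmst_sketch(d_line, k = 2)
```

For the toy points, the 1-MST links consecutive points, and the 2-MST adds four further edges that do not repeat any of the first four. Each row of the result records the two end indices of an edge, matching the edge-matrix layout described above.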
Get lists of permuted weighted within-sample and between-sample edge-counts
Description
Computes lists of permuted weighted within-sample edge-counts and between-sample edge-counts.
Usage
permu_edge(n_per, E, n1, n2, wei, progress_bar = FALSE)
Arguments
n_per |
number of permutations. |
E |
an edge matrix representing a similarity graph. Each row represents an edge and records the indices of two ends of an edge in two columns. The indices of observations in sample 1 are from 1 to n1 and indices of observations in sample 2 are from 1+n1 to n1+n2. |
n1 |
number of observations in sample 1. |
n2 |
number of observations in sample 2. |
wei |
a vector of weights of each edge. |
progress_bar |
a logical (TRUE or FALSE) indicating whether a progress bar for the permutations should be printed. |
Value
R1 |
the permuted weighted within-sample edge-counts for sample 1. |
R2 |
the permuted weighted within-sample edge-counts for sample 2. |
R |
the permuted weighted between-sample edge-counts. |
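The permutation step can be pictured as repeatedly reassigning the sample labels at random while keeping the graph and its edge weights fixed, then recomputing the three weighted counts. The helper `perm_counts_sketch` below is a hypothetical base R illustration under that assumption; it is not the package's code and omits the progress bar.

```r
# Hypothetical helper, not part of the package: permuted weighted edge-counts.
perm_counts_sketch <- function(n_per, E, n1, n2, wei) {
  N <- n1 + n2
  R1 <- R2 <- R <- numeric(n_per)
  for (b in seq_len(n_per)) {
    in_s1 <- logical(N)
    in_s1[sample.int(N, n1)] <- TRUE                 # random relabeling
    end1 <- in_s1[E[, 1]]
    end2 <- in_s1[E[, 2]]
    R1[b] <- sum(wei[end1 & end2])                   # both ends in sample 1
    R2[b] <- sum(wei[!end1 & !end2])                 # both ends in sample 2
    R[b]  <- sum(wei[end1 != end2])                  # ends in different samples
  }
  list(R1 = R1, R2 = R2, R = R)
}

set.seed(1)
E_toy   <- rbind(c(1, 2), c(2, 3), c(3, 4), c(4, 5), c(5, 6))
wei_toy <- rep(0.5, nrow(E_toy))
perm <- perm_counts_sketch(n_per = 200, E = E_toy, n1 = 3, n2 = 3,
                           wei = wei_toy)
```

In every permutation the three counts add up to the total edge weight, since each edge falls into exactly one of the three categories. A permutation p-value is then typically the proportion of permuted statistics at least as extreme as the observed one.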
Get the test statistic and p-value based on permutation using the robust generalized edge-count test
Description
Computes the test statistic and p-value based on permutation using the robust generalized edge-count test.
Usage
permu_gen(R1_list, R2_list, R1_test, R2_test, n_per)
Arguments
R1_list |
list of permuted weighted within-sample edge-counts of sample 1 |
R2_list |
list of permuted weighted within-sample edge-counts of sample 2 |
R1_test |
weighted within-sample edge-counts of sample 1 |
R2_test |
weighted within-sample edge-counts of sample 2 |
n_per |
number of permutations |
Value
The p-value based on permutation distribution using robust generalized graph-based test.
Get the test statistic and p-value based on permutation using the robust max-type edge-count test
Description
Computes the test statistic and p-value based on permutation using the robust max-type edge-count test.
Usage
permu_max(R1_list, R2_list, R1_test, R2_test, n1, n2, n_per)
Arguments
R1_list |
list of permuted weighted within-sample edge-counts of sample 1 |
R2_list |
list of permuted weighted within-sample edge-counts of sample 2 |
R1_test |
weighted within-sample edge-counts of sample 1 |
R2_test |
weighted within-sample edge-counts of sample 2 |
n1 |
number of observations in sample 1 |
n2 |
number of observations in sample 2 |
n_per |
number of permutations |
Value
The p-value based on permutation distribution using robust max-type graph-based test.
Get the test statistic and p-value based on permutation using the robust weighted edge-count test
Description
Computes the test statistic and p-value based on permutation using the robust weighted edge-count test.
Usage
permu_wei(R1_list, R2_list, R1_test, R2_test, n1, n2, n_per)
Arguments
R1_list |
list of permuted weighted within-sample edge-counts of sample 1 |
R2_list |
list of permuted weighted within-sample edge-counts of sample 2 |
R1_test |
weighted within-sample edge-counts of sample 1 |
R2_test |
weighted within-sample edge-counts of sample 2 |
n1 |
number of observations in sample 1 |
n2 |
number of observations in sample 2 |
n_per |
number of permutations |
Value
The p-value based on permutation distribution using robust weighted graph-based test.
Robust graph-based two-sample test
Description
Performs the robust graph-based two-sample test.
Usage
rg.test(data.X, data.Y, dis = NULL, E = NULL, n1, n2, k = 5, weigh.fun, perm.num = 0,
test.type = list("ori", "gen", "wei", "max"), progress_bar = FALSE)
Arguments
data.X |
a numeric matrix for observations in sample 1. |
data.Y |
a numeric matrix for observations in sample 2. |
dis |
a distance matrix of the pooled dataset of sample 1 and sample 2. The indices of observations in sample 1 are from 1 to n1 and indices of observations in sample 2 are from 1+n1 to n1+n2 in the pooled dataset. |
E |
an edge matrix representing a similarity graph. Each row represents an edge and records the indices of two ends of an edge in two columns. The indices of observations in sample 1 are from 1 to n1 and indices of observations in sample 2 are from 1+n1 to n1+n2. |
n1 |
number of observations in sample 1. |
n2 |
number of observations in sample 2. |
k |
parameter in K-MST, with default 5. |
weigh.fun |
weight function that returns the weight of each edge as a function of the node degrees of its two ends. |
perm.num |
number of permutations used to calculate the p-value (default = 0). Use 0 to get only the approximate p-value based on asymptotic theory. |
test.type |
type of graph-based test. This must be a list containing elements chosen from "ori", "gen", "wei", and "max", with default 'list("ori", "gen", "wei", "max")'. "ori" refers to the robust original edge-count test, "gen" refers to the robust generalized edge-count test, "wei" refers to the robust weighted edge-count test and "max" refers to the robust max-type edge-count test. |
progress_bar |
a logical (TRUE or FALSE) indicating whether a progress bar for the permutations should be printed. |
Details
The input should be one of the following:
datasets of the two samples;
the distance matrix of the pooled dataset;
the edge matrix generated from a similarity graph.
Typical usages are:
rg.test(data.X = data.X, data.Y = data.Y, n1 = n1, n2 = n2, weigh.fun = weigh.fun, ...)
rg.test(dis = dis, n1 = n1, n2 = n2, weigh.fun = weigh.fun, ...)
rg.test(E = E, n1 = n1, n2 = n2, weigh.fun = weigh.fun, ...)
If the data matrices or the distance matrix are used, the similarity graph is generated using K-MST.
Value
A list containing the following components:
asy.ori.statistic |
the asymptotic test statistic using robust original graph-based test. |
asy.ori.pval |
the asymptotic p-value using robust original graph-based test. |
asy.gen.statistic |
the asymptotic test statistic using robust generalized graph-based test. |
asy.gen.pval |
the asymptotic p-value using robust generalized graph-based test. |
asy.wei.statistic |
the asymptotic test statistic using robust weighted graph-based test. |
asy.wei.pval |
the asymptotic p-value using robust weighted graph-based test. |
asy.max.statistic |
the asymptotic test statistic using robust max-type graph-based test. |
asy.max.pval |
the asymptotic p-value using robust max-type graph-based test. |
perm.ori.pval |
the p-value based on permutation using robust original graph-based test. |
perm.gen.pval |
the p-value based on permutation using robust generalized graph-based test. |
perm.wei.pval |
the p-value based on permutation using robust weighted graph-based test. |
perm.max.pval |
the p-value based on permutation using robust max-type graph-based test. |
Examples
## Simulated from Student's t-distribution.
## Observations for the two samples are from different distributions.
data(example0)
data = as.matrix(example0$data) # pooled dataset
label = example0$label # label of observations
s1 = data[label == 'sample 1', ] # sample 1
s2 = data[label == 'sample 2', ] # sample 2
num1 = nrow(s1) # number of observations in sample 1
num2 = nrow(s2) # number of observations in sample 2
## Graph-based two sample test using data as input
rg.test(data.X = s1, data.Y = s2, n1 = num1, n2 = num2, k = 5, weigh.fun = weiMax, perm.num = 0)
## Graph-based two sample test using distance matrix as input
dist = example0$distance
rg.test(dis = dist, n1 = num1, n2 = num2, k = 5, weigh.fun = weiMax, perm.num = 0)
## Graph-based two sample test using edge matrix of the similarity graph as input
E = example0$edge
rg.test(E = E, n1 = num1, n2 = num2, weigh.fun = weiMax, perm.num = 0)
Get analytic expressions of expectations, variances and covariances
Description
Computes the analytic expressions of the expectations, variances and covariances of the edge-counts.
Usage
theo_mu_sig(E, n1, n2, weights)
Arguments
E |
an edge matrix representing a similarity graph. Each row represents an edge and records the indices of two ends of an edge in two columns. The indices of observations in sample 1 are from 1 to n1 and indices of observations in sample 2 are from 1+n1 to n1+n2. |
n1 |
number of observations in sample 1 |
n2 |
number of observations in sample 2 |
weights |
weights assigned to each edge |
Value
mu |
the expectation of the between-sample edge-count. |
mu1 |
the expectation of the within-sample edge-count for sample 1. |
mu2 |
the expectation of the within-sample edge-count for sample 2. |
sig |
the variance of the between-sample edge-count. |
sig11 |
the variance of the within-sample edge-count for sample 1. |
sig22 |
the variance of the within-sample edge-count for sample 2. |
sig12 |
the covariance of the within-sample edge-counts. |
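The expectations have a simple closed form when the sample labels are assigned uniformly at random: by linearity, each edge contributes its weight times the probability that its two ends fall in the relevant samples. The sketch below computes only these three means; the helper `null_means_sketch` is hypothetical, it assumes a graph without self-loops, and it does not attempt the variance and covariance terms that `theo_mu_sig` also returns.

```r
# Hypothetical helper, not part of the package: null expectations of the
# weighted edge-counts under random relabeling of the N = n1 + n2 nodes.
null_means_sketch <- function(n1, n2, weights) {
  N <- n1 + n2
  W <- sum(weights)                                  # total edge weight
  list(mu1 = W * n1 * (n1 - 1) / (N * (N - 1)),      # both ends in sample 1
       mu2 = W * n2 * (n2 - 1) / (N * (N - 1)),      # both ends in sample 2
       mu  = W * 2 * n1 * n2 / (N * (N - 1)))        # one end in each sample
}

m_null <- null_means_sketch(n1 = 3, n2 = 3, weights = rep(0.5, 5))
```

With a total weight of 2.5 and two samples of size 3, the within-sample means are 0.5 each and the between-sample mean is 1.5, which together recover the total weight.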
Weight function (arithmetic average)
Description
This weight function returns the inverse of the arithmetic average of the node degrees of an edge.
Usage
weiArith(a, b)
Arguments
a |
node degree of one end of an edge |
b |
node degree of another end of an edge |
Value
The weight of the edge, equal to the inverse of the arithmetic average of the node degrees of its two ends.
Examples
# For an edge where one end has a node degree of 5
# another end has a node degree of 6
weiArith(6, 5)
Weight function (geometric average)
Description
This weight function returns the inverse of the geometric average of the node degrees of an edge.
Usage
weiGeo(a, b)
Arguments
a |
node degree of one end of an edge |
b |
node degree of another end of an edge |
Value
The weight of the edge, equal to the inverse of the geometric average of the node degrees of its two ends.
Examples
# For an edge where one end has a node degree of 5
# another end has a node degree of 6
weiGeo(6, 5)
Weight function (maximum)
Description
This weight function returns the inverse of the max node degree of an edge.
Usage
weiMax(a, b)
Arguments
a |
node degree of one end of an edge |
b |
node degree of another end of an edge |
Value
The weight of the edge, equal to the inverse of the larger of the node degrees of its two ends.
Examples
# For an edge where one end has a node degree of 5
# another end has a node degree of 6
weiMax(6, 5)
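The three weight functions differ only in how they summarize the two node degrees before inverting. The base R sketch below writes out the three formulas as described above and applies one of them to every edge of a small graph. The names `arith_w`, `geo_w` and `max_w` are hypothetical stand-ins, and deriving degrees with `tabulate` is an assumption about how such weights might be fed to a test, not the package's internal code.

```r
# Hypothetical stand-ins for the three weight formulas.
arith_w <- function(a, b) 1 / ((a + b) / 2)          # inverse arithmetic mean
geo_w   <- function(a, b) 1 / sqrt(a * b)            # inverse geometric mean
max_w   <- function(a, b) 1 / pmax(a, b)             # inverse maximum

# Toy edge matrix: node 1 is a hub connected to nodes 2, 3 and 4.
E_hub <- rbind(c(1, 2), c(1, 3), c(1, 4), c(2, 3))
deg   <- tabulate(c(E_hub), nbins = max(E_hub))      # node degrees
wei_hub <- max_w(deg[E_hub[, 1]], deg[E_hub[, 2]])   # one weight per edge
```

Edges touching the hub (degree 3) receive weight 1/3, while the edge between nodes 2 and 3 (both degree 2) receives weight 1/2, so edges attached to high-degree nodes are down-weighted.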
Get weighted within-sample edge-counts and between-sample edge-counts
Description
Computes the weighted within-sample edge-counts and the weighted between-sample edge-count.
Usage
weighted_R1R2(E, n1, wei)
Arguments
E |
an edge matrix representing a similarity graph. Each row represents an edge and records the indices of two ends of an edge in two columns. The indices of observations in sample 1 are from 1 to n1 and indices of observations in sample 2 are from 1+n1 to n1+n2. |
n1 |
number of observations in sample 1. |
wei |
a vector of weights of each edge. |
Value
R1 |
the weighted within-sample edge-count for sample 1. |
R2 |
the weighted within-sample edge-count for sample 2. |
R |
the weighted between-sample edge-count. |
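Given the indexing convention above (sample 1 occupies indices 1 to n1), the three counts amount to summing edge weights according to where the two ends of each edge fall. The helper `weighted_counts_sketch` below is a hypothetical base R illustration of that bookkeeping, with made-up edges and weights.

```r
# Hypothetical helper, not part of the package: weighted edge-counts.
weighted_counts_sketch <- function(E, n1, wei) {
  end1_s1 <- E[, 1] <= n1                            # first end in sample 1?
  end2_s1 <- E[, 2] <= n1                            # second end in sample 1?
  list(R1 = sum(wei[end1_s1 & end2_s1]),             # within sample 1
       R2 = sum(wei[!end1_s1 & !end2_s1]),           # within sample 2
       R  = sum(wei[end1_s1 != end2_s1]))            # between samples
}

E_path   <- rbind(c(1, 2), c(2, 3), c(3, 4), c(4, 5), c(5, 6))
wei_path <- c(1, 0.5, 0.5, 0.25, 0.25)
cnt <- weighted_counts_sketch(E_path, n1 = 3, wei = wei_path)
```

For this path graph with n1 = 3, the first two edges lie within sample 1 (R1 = 1.5), the last two within sample 2 (R2 = 0.5), and the edge joining nodes 3 and 4 crosses the samples (R = 0.5).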