Help for package rgTest

Title:

Robust Graph-Based Two-Sample Test

Version:

0.1

Description:

Useful tools for determining whether two samples are from the same distribution. Utilizes a robust method to address the problematic structure of the similarity graph constructed from high-dimensional data. The method is provided in Yichuan Bai and Lynna Chu (2023) <doi:10.48550/arXiv.2307.12325>.

License:

MIT + file LICENSE

Encoding:

UTF-8

RoxygenNote:

7.2.3

Imports:

ade4, stats

Suggests:

knitr, rmarkdown, testthat (≥ 3.0.0)

Config/testthat/edition:

VignetteBuilder:

knitr

Depends:

R (≥ 3.0.1)

LazyData:

true

NeedsCompilation:

Packaged:

2023-08-11 21:07:22 UTC; tutu

Author:

Yichuan Bai [aut, cre], Lynna Chu [aut]

Maintainer:

Yichuan Bai <ycbai@iastate.edu>

Repository:

CRAN

Date/Publication:

2023-08-14 11:40:02 UTC

get the approximate test statistic and p-value based on asymptotic theory using robust generalized edge-count test

Description

get the approximate test statistic and p-value based on asymptotic theory using robust generalized edge-count test

Usage

asy_gen(asy_res, R1_test, R2_test)

Arguments

asy_res

analytic expressions of expectations, variances and covariances

R1_test

weighted within-sample edge-counts of sample 1

R2_test

weighted within-sample edge-counts of sample 2

Value

A list containing the following components:

test_statistic

the asymptotic test statistic using robust generalized graph-based test.

p_value

the asymptotic p-value using robust generalized graph-based test.
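A minimal sketch of how such a statistic and p-value could be computed. It assumes that the robust generalized statistic is the quadratic form of the centered within-sample counts and that it is compared with a chi-squared distribution on 2 degrees of freedom, as in the generalized edge-count test. The helper name gen_stat_sketch and the numeric inputs are hypothetical; the component names mu1, mu2, sig11, sig22 and sig12 follow the values documented for theo_mu_sig. This is not the package's source code.

```r
# Hypothetical helper: quadratic-form statistic with a chi-squared(2) p-value.
gen_stat_sketch <- function(asy_res, R1_test, R2_test) {
  # Centered within-sample edge-counts
  d <- c(R1_test - asy_res$mu1, R2_test - asy_res$mu2)
  # 2 x 2 covariance matrix of (R1, R2)
  Sigma <- matrix(c(asy_res$sig11, asy_res$sig12,
                    asy_res$sig12, asy_res$sig22), nrow = 2)
  stat <- as.numeric(t(d) %*% solve(Sigma) %*% d)
  list(test_statistic = stat,
       p_value = stats::pchisq(stat, df = 2, lower.tail = FALSE))
}

# Made-up moments and counts, for illustration only
toy <- list(mu1 = 10, mu2 = 10, sig11 = 4, sig22 = 4, sig12 = 1)
res_gen <- gen_stat_sketch(toy, R1_test = 14, R2_test = 12)
res_gen
```

In practice asy_res would come from theo_mu_sig and the counts from weighted_R1R2.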

get the approximate test statistic and p-value based on asymptotic theory using robust max-type edge-count test

Description

get the approximate test statistic and p-value based on asymptotic theory using robust max-type edge-count test

Usage

asy_max(asy_res, R1_test, R2_test, n1, n2)

Arguments

asy_res

analytic expressions of expectations, variances and covariances

R1_test

weighted within-sample edge-counts of sample 1

R2_test

weighted within-sample edge-counts of sample 2

n1

number of observations in sample 1

n2

number of observations in sample 2

Value

A list containing the following components:

test_statistic

the asymptotic test statistic using robust max-type graph-based test.

p_value

the asymptotic p-value using robust max-type graph-based test.

get the approximate test statistic and p-value based on asymptotic theory using robust weighted edge-count test

Description

get the approximate test statistic and p-value based on asymptotic theory using robust weighted edge-count test

Usage

asy_wei(asy_res, R1_test, R2_test, n1, n2)

Arguments

asy_res

analytic expressions of expectations, variances and covariances

R1_test

weighted within-sample edge-counts of sample 1

R2_test

weighted within-sample edge-counts of sample 2

n1

number of observations in sample 1

n2

number of observations in sample 2

Value

A list containing the following components:

test_statistic

the asymptotic test statistic using robust weighted graph-based test.

p_value

the asymptotic p-value using robust weighted graph-based test.

Example

Description

This example contains a dataset, the labels of the observations in the dataset, the distance matrix of the dataset using L2 distance, and the edge matrix generated by 5-MST.

Usage

example0

Format

An object of class list of length 4.

Details

data: pooled dataset of two samples drawn from two different t-distributions.
label: label of the observations. 'sample 1' denotes the observations in sample 1. 'sample 2' denotes the observations in sample 2.
distance: the distance matrix of the pooled dataset using L2 distance.
edge: edge matrix generated by 5-MST.

Get distance matrix

Description

This function returns the distance matrix using L2 distance.

Usage

getdis(y)

Arguments

y

dataset of the pooled data

Value

A distance matrix based on the L2 distance.

Examples

data(example0)
data = as.matrix(example0$data)     # pooled dataset
getdis(data)
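For readers without the package at hand, a pairwise L2 (Euclidean) distance matrix can be sketched in base R as follows. The helper name l2_dist_sketch and the toy points are hypothetical, and the sketch is only assumed to match what getdis returns; it is not the package's implementation.

```r
# Hypothetical helper: pairwise Euclidean distances between the rows of y.
l2_dist_sketch <- function(y) {
  as.matrix(stats::dist(y, method = "euclidean"))
}

# Three toy observations in two dimensions
pts <- rbind(c(0, 0), c(3, 4), c(6, 8))
D <- l2_dist_sketch(pts)
D
```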

construct k-mst

Description

construct k-mst

Usage

kmst(y = NULL, dis = NULL, k = 1)

Arguments

y

data

dis

distance matrix

k

parameter in K-MST, with default 1

Value

An edge matrix representing a similarity graph. Each row represents an edge and records the indices of two ends of an edge in two columns.
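The sketch below illustrates the K-MST idea in base R: the k-MST is taken to be the union of the 1st through k-th minimum spanning trees, where each later tree is built without the edges already used. It uses Prim's algorithm on a dense distance matrix. The names prim_mst and kmst_sketch are hypothetical, the sketch assumes k is small enough that a spanning tree still exists at every round, and it is not the package's implementation.

```r
# Prim's algorithm on a dense distance matrix; returns an (n - 1) x 2 edge matrix.
prim_mst <- function(d) {
  n <- nrow(d)
  in_tree <- c(TRUE, rep(FALSE, n - 1))
  edges <- matrix(NA_integer_, nrow = n - 1, ncol = 2)
  for (step in seq_len(n - 1)) {
    # Distances from nodes in the tree (rows) to nodes outside it (columns)
    sub <- d[in_tree, !in_tree, drop = FALSE]
    pos <- which(sub == min(sub), arr.ind = TRUE)[1, ]
    from <- which(in_tree)[pos[1]]
    to <- which(!in_tree)[pos[2]]
    edges[step, ] <- c(from, to)
    in_tree[to] <- TRUE
  }
  edges
}

# Union of k edge-disjoint spanning trees, built greedily one after another.
kmst_sketch <- function(d, k = 1) {
  d <- as.matrix(d)
  diag(d) <- Inf
  out <- NULL
  for (i in seq_len(k)) {
    e <- prim_mst(d)
    out <- rbind(out, e)
    # Forbid the edges just used, in both orientations
    d[e] <- Inf
    d[e[, 2:1, drop = FALSE]] <- Inf
  }
  out
}

# Four points on a line at positions 0, 1, 3 and 6
d_toy <- as.matrix(stats::dist(c(0, 1, 3, 6)))
E1 <- kmst_sketch(d_toy, k = 1)
E2 <- kmst_sketch(d_toy, k = 2)
```

Each row of the result holds the indices of the two ends of one edge, matching the edge-matrix layout described above.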

get lists of permuted weighted within-sample edge-counts and between-sample edge-counts

Description

get lists of permuted weighted within-sample edge-counts and between-sample edge-counts

Usage

permu_edge(n_per, E, n1, n2, wei, progress_bar = FALSE)

Arguments

n_per

number of permutations.

E

an edge matrix representing a similarity graph. Each row represents an edge and records the indices of two ends of an edge in two columns. The indices of observations in sample 1 are from 1 to n1 and indices of observations in sample 2 are from 1+n1 to n1+n2.

n1

number of observations in sample 1.

n2

number of observations in sample 2.

wei

a vector of weights of each edge.

progress_bar

a logical evaluating to TRUE or FALSE indicating whether a progress bar of the permutation should be printed.

Value

R1

the permuted weighted within-sample edge-counts for sample 1.

R2

the permuted weighted within-sample edge-counts for sample 2.

R

the permuted weighted between-sample edge-counts.
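A rough sketch of the permutation step, under the assumption that each permutation relabels the observations at random while keeping the graph and its edge weights fixed, and then recomputes the three weighted counts. The helper names edge_counts and permu_edge_sketch and the toy graph are hypothetical, and the results are returned as numeric vectors for simplicity; the package's internals and return types may differ.

```r
# Weighted counts for a given labelling: nodes 1..n1 belong to sample 1.
edge_counts <- function(E, n1, wei) {
  s1 <- E[, 1] <= n1
  s2 <- E[, 2] <= n1
  list(R1 = sum(wei[s1 & s2]),     # both ends in sample 1
       R2 = sum(wei[!s1 & !s2]),   # both ends in sample 2
       R  = sum(wei[s1 != s2]))    # ends in different samples
}

permu_edge_sketch <- function(n_per, E, n1, n2, wei) {
  n <- n1 + n2
  R1 <- R2 <- R <- numeric(n_per)
  for (i in seq_len(n_per)) {
    perm <- sample(n)                        # random relabelling of the nodes
    Ep <- cbind(perm[E[, 1]], perm[E[, 2]])  # same graph, permuted labels
    cnt <- edge_counts(Ep, n1, wei)
    R1[i] <- cnt$R1
    R2[i] <- cnt$R2
    R[i] <- cnt$R
  }
  list(R1 = R1, R2 = R2, R = R)
}

set.seed(1)
E_toy <- rbind(c(1, 2), c(2, 3), c(3, 4), c(4, 5), c(5, 6))
wei_toy <- c(0.5, 0.5, 0.5, 0.5, 1)
perm_res <- permu_edge_sketch(n_per = 200, E = E_toy, n1 = 3, n2 = 3, wei = wei_toy)
```

Because every edge falls into exactly one of the three groups, R1 + R2 + R equals the total edge weight in every permutation.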

get the test statistic and p-value based on permutation using robust generalized edge-count test

Description

get the test statistic and p-value based on permutation using robust generalized edge-count test

Usage

permu_gen(R1_list, R2_list, R1_test, R2_test, n_per)

Arguments

R1_list

list of permuted weighted within-sample edge-counts of sample 1

R2_list

list of permuted weighted within-sample edge-counts of sample 2

R1_test

weighted within-sample edge-counts of sample 1

R2_test

weighted within-sample edge-counts of sample 2

n_per

number of permutations

Value

The p-value based on permutation distribution using robust generalized graph-based test.
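One way a permutation p-value of this kind could be formed is sketched below. It assumes that the mean and covariance of (R1, R2) are estimated from the permuted counts, that the same quadratic-form statistic is evaluated for every permutation and for the observed counts, and that the p-value is the proportion of permuted statistics at least as large as the observed one. The helper name perm_gen_pvalue_sketch, these conventions and the simulated inputs are all assumptions made for illustration, not the package's code.

```r
# Hypothetical helper: permutation p-value for a quadratic-form statistic.
perm_gen_pvalue_sketch <- function(R1_list, R2_list, R1_test, R2_test) {
  perm <- cbind(unlist(R1_list), unlist(R2_list))
  mu <- colMeans(perm)              # permutation estimate of the means
  Sinv <- solve(stats::cov(perm))   # inverse of the estimated covariance
  quad <- function(x) as.numeric(t(x - mu) %*% Sinv %*% (x - mu))
  perm_stats <- apply(perm, 1, quad)
  obs <- quad(c(R1_test, R2_test))
  mean(perm_stats >= obs)
}

# Simulated stand-ins for permuted counts, for illustration only
set.seed(42)
r1 <- as.list(stats::rnorm(300, mean = 10))
r2 <- as.list(stats::rnorm(300, mean = 12))
p_mid <- perm_gen_pvalue_sketch(r1, r2, R1_test = 10.5, R2_test = 12.5)
p_far <- perm_gen_pvalue_sketch(r1, r2, R1_test = 100, R2_test = 100)
```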

get the test statistic and p-value based on permutation using robust max-type edge-count test

Description

get the test statistic and p-value based on permutation using robust max-type edge-count test

Usage

permu_max(R1_list, R2_list, R1_test, R2_test, n1, n2, n_per)

Arguments

R1_list

list of permuted weighted within-sample edge-counts of sample 1

R2_list

list of permuted weighted within-sample edge-counts of sample 2

R1_test

weighted within-sample edge-counts of sample 1

R2_test

weighted within-sample edge-counts of sample 2

n1

number of observations in sample 1

n2

number of observations in sample 2

n_per

number of permutations

Value

The p-value based on permutation distribution using robust max-type graph-based test.

get the test statistic and p-value based on permutation using robust weighted edge-count test

Description

get the test statistic and p-value based on permutation using robust weighted edge-count test

Usage

permu_wei(R1_list, R2_list, R1_test, R2_test, n1, n2, n_per)

Arguments

R1_list

list of permuted weighted within-sample edge-counts of sample 1

R2_list

list of permuted weighted within-sample edge-counts of sample 2

R1_test

weighted within-sample edge-counts of sample 1

R2_test

weighted within-sample edge-counts of sample 2

n1

number of observations in sample 1

n2

number of observations in sample 2

n_per

number of permutations

Value

The p-value based on permutation distribution using robust weighted graph-based test.

Robust graph-based two sample test

Description

Performs the robust graph-based two-sample test.

Usage

rg.test(data.X, data.Y, dis = NULL, E = NULL, n1, n2, k = 5, weigh.fun, perm.num = 0, 
test.type = list("ori", "gen", "wei", "max"), progress_bar = FALSE)

Arguments

data.X

a numeric matrix for observations in sample 1.

data.Y

a numeric matrix for observations in sample 2.

dis

a distance matrix of the pooled dataset of sample 1 and sample 2. The indices of observations in sample 1 are from 1 to n1 and indices of observations in sample 2 are from 1+n1 to n1+n2 in the pooled dataset.

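E

an edge matrix representing a similarity graph. Each row represents an edge and records the indices of two ends of an edge in two columns. The indices of observations in sample 1 are from 1 to n1 and indices of observations in sample 2 are from 1+n1 to n1+n2.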

n1

number of observations in sample 1.

n2

number of observations in sample 2.

k

parameter in K-MST, with default 5.

weigh.fun

weight function that returns the weight of each edge as a function of the node degrees of its two ends.

perm.num

number of permutations used to calculate the p-value (default = 0). Use 0 to obtain only the approximate p-value based on asymptotic theory.

test.type

type of graph-based test. This must be a list containing elements chosen from "ori", "gen", "wei", and "max", with default 'list("ori", "gen", "wei", "max")'. "ori" refers to the robust original edge-count test, "gen" refers to the robust generalized edge-count test, "wei" refers to the robust weighted edge-count test and "max" refers to the robust max-type edge-count test.

progress_bar

a logical evaluating to TRUE or FALSE indicating whether a progress bar of the permutation should be printed.

Details

The input should be one of the following:

datasets of the two samples;
the distance matrix of the pooled dataset;
the edge matrix generated from a similarity graph.

Typical usages are:

rg.test(data.X, data.Y, n1, n2, weigh.fun, ...)

rg.test(dis, n1, n2, weigh.fun, ...)

rg.test(E, n1, n2, weigh.fun, ...)

If the data matrices or the distance matrix are used, the similarity graph is generated using K-MST.

Value

A list containing the following components:

asy.ori.statistic

the asymptotic test statistic using robust original graph-based test.

asy.ori.pval

the asymptotic p-value using robust original graph-based test.

asy.gen.statistic

the asymptotic test statistic using robust generalized graph-based test.

asy.gen.pval

the asymptotic p-value using robust generalized graph-based test.

asy.wei.statistic

the asymptotic test statistic using robust weighted graph-based test.

asy.wei.pval

the asymptotic p-value using robust weighted graph-based test.

asy.max.statistic

the asymptotic test statistic using robust max-type graph-based test.

asy.max.pval

the asymptotic p-value using robust max-type graph-based test.

perm.ori.pval

the p-value based on permutation using robust original graph-based test.

perm.gen.pval

the p-value based on permutation using robust generalized graph-based test.

perm.wei.pval

the p-value based on permutation using robust weighted graph-based test.

perm.max.pval

the p-value based on permutation using robust max-type graph-based test.

Examples

## Simulated from Student's t-distribution. 
## Observations for the two samples are from different distributions.
data(example0)
data = as.matrix(example0$data)     # pooled dataset
label = example0$label              # label of observations
s1 = data[label == 'sample 1', ]    # sample 1
s2 = data[label == 'sample 2', ]    # sample 2
num1 = nrow(s1)                     # number of observations in sample 1
num2 = nrow(s2)                     # number of observations in sample 2

## Graph-based two sample test using data as input
rg.test(data.X = s1, data.Y = s2, n1 = num1, n2 = num2, k = 5, weigh.fun = weiMax, perm.num = 0)

## Graph-based two sample test using distance matrix as input
dist = example0$distance
rg.test(dis = dist, n1 = num1, n2 = num2, k = 5, weigh.fun = weiMax, perm.num = 0)

## Graph-based two sample test using edge matrix of the similarity graph as input
E = example0$edge
rg.test(E = E, n1 = num1, n2 = num2, weigh.fun = weiMax, perm.num = 0)

get analytic expressions of expectations, variances and covariances

Description

get analytic expressions of expectations, variances and covariances

Usage

theo_mu_sig(E, n1, n2, weights)

Arguments

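E

an edge matrix representing a similarity graph. Each row represents an edge and records the indices of two ends of an edge in two columns. The indices of observations in sample 1 are from 1 to n1 and indices of observations in sample 2 are from 1+n1 to n1+n2.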

n1

number of observations in sample 1

n2

number of observations in sample 2

weights

weights assigned to each edge

Value

mu

the expectation of the between-sample edge-count.

mu1

the expectation of the within-sample edge-count for sample 1.

mu2

the expectation of the within-sample edge-count for sample 2.

sig

the variance of the between-sample edge-count.

sig11

the variance of the within-sample edge-count for sample 1.

sig22

the variance of the within-sample edge-count for sample 2.

sig12

the covariance of the within-sample edge-counts.

Weighted function

Description

This weight function returns the inverse of the arithmetic average of the node degrees of an edge.

Usage

weiArith(a, b)

Arguments

a

node degree of one end of an edge

b

node degree of another end of an edge

Value

The weight of the edge, computed as the inverse of the arithmetic average of the node degrees of its two ends.

Examples

# For an edge where one end has a node degree of 5
# another end has a node degree of 6
weiArith(6, 5)

Weighted function

Description

This weight function returns the inverse of the geometric average of the node degrees of an edge.

Usage

weiGeo(a, b)

Arguments

a

node degree of one end of an edge

b

node degree of another end of an edge

Value

The weight of the edge, computed as the inverse of the geometric average of the node degrees of its two ends.

Examples

# For an edge where one end has a node degree of 5
# another end has a node degree of 6
weiGeo(6, 5)

Weighted function

Description

This weight function returns the inverse of the max node degree of an edge.

Usage

weiMax(a, b)

Arguments

a

node degree of one end of an edge

b

node degree of another end of an edge

Value

The weight of the edge, computed as the inverse of the maximum node degree of its two ends.

Examples

# For an edge where one end has a node degree of 5
# another end has a node degree of 6
weiMax(6, 5)
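To compare the three weighting schemes side by side without loading the package, the descriptions above can be written out directly in base R. The names inv_arith, inv_geo and inv_max are hypothetical stand-ins that follow the documented definitions of weiArith, weiGeo and weiMax; they are not the package's own code.

```r
# Inverse of the arithmetic average of the two node degrees
inv_arith <- function(a, b) 1 / ((a + b) / 2)
# Inverse of the geometric average of the two node degrees
inv_geo <- function(a, b) 1 / sqrt(a * b)
# Inverse of the larger of the two node degrees
inv_max <- function(a, b) 1 / pmax(a, b)

# An edge whose ends have node degrees 6 and 5
w <- c(arith = inv_arith(6, 5), geo = inv_geo(6, 5), max = inv_max(6, 5))
w
```

Since the maximum is never smaller than the arithmetic average, which in turn is never smaller than the geometric average, the max-based weight is the smallest of the three and the geometric one the largest.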

get weighted within-sample edge-counts and between-sample edge-counts

Description

get weighted within-sample edge-counts and between-sample edge-counts

Usage

weighted_R1R2(E, n1, wei)

Arguments

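E

an edge matrix representing a similarity graph. Each row represents an edge and records the indices of two ends of an edge in two columns. The indices of observations in sample 1 are from 1 to n1 and indices of observations in sample 2 are from 1+n1 to n1+n2.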

n1

number of observations in sample 1.

wei

a vector of weights of each edge.

Value

R1

the weighted within-sample edge-count for sample 1.

R2

the weighted within-sample edge-count for sample 2.

R

the weighted between-sample edge-count.
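The following base R sketch walks through the quantities above on a toy graph: node degrees are tabulated from the edge matrix, each edge receives the inverse of the larger degree of its two ends as its weight, and the weights are then summed within sample 1, within sample 2, and between samples. The toy edge matrix and all variable names are hypothetical, and the code is an illustration of the definitions rather than the package's implementation.

```r
# Toy graph on 6 nodes; nodes 1..3 are sample 1 and nodes 4..6 are sample 2
E_toy <- rbind(c(1, 2), c(1, 3), c(2, 3), c(3, 4), c(4, 5), c(5, 6), c(4, 6))
n1 <- 3

# Node degree = number of edges touching each node
deg <- tabulate(as.vector(E_toy), nbins = 6)

# Edge weight = inverse of the larger node degree of its two ends
wei <- 1 / pmax(deg[E_toy[, 1]], deg[E_toy[, 2]])

# Classify each edge by the samples of its two ends
in1 <- E_toy <= n1
both1 <- in1[, 1] & in1[, 2]
both2 <- !in1[, 1] & !in1[, 2]

counts <- list(R1 = sum(wei[both1]),             # within sample 1
               R2 = sum(wei[both2]),             # within sample 2
               R  = sum(wei[!both1 & !both2]))   # between samples
counts
```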