Title: | Selection of Statistically Similar Research Groups |
Version: | 1.0.3 |
Description: | Select statistically similar research groups by backward selection using various robust algorithms, including a heuristic based on linear discriminant analysis, multiple heuristics based on the test statistic, and parallelized exhaustive search. |
Depends: | R (≥ 3.0.0) |
License: | MIT + file LICENSE |
VignetteBuilder: | knitr |
Suggests: | knitr, markdown, rmarkdown, testthat, roxygen2, doParallel |
Imports: | RUnit, data.table, entropy, foreach, iterators, iterpc, kSamples, stats, car, gmp, utils, methods |
RoxygenNote: | 7.3.1 |
Encoding: | UTF-8 |
NeedsCompilation: | no |
Packaged: | 2024-04-14 17:12:48 UTC; Kiss család |
Author: | Kyle Gorman [aut, cre], Géza Kiss [aut] |
Maintainer: | Kyle Gorman <kylebgorman@gmail.com> |
Repository: | CRAN |
Date/Publication: | 2024-04-14 17:50:05 UTC |
ldamatch: Selection of Statistically Similar Research Groups.
Description
Select statistically similar research groups by backward selection
using various robust algorithms,
including a heuristic based on linear discriminant analysis,
multiple heuristics based on the test statistic,
and parallelized exhaustive search.
See the help for function match_groups.
Author(s)
Maintainer: Kyle Gorman kylebgorman@gmail.com
Authors:
Géza Kiss kiss2017@alumni.ohsu.edu
Criterion function for U_halt.
Description
Criterion function for U_halt.
Usage
.U_crit(covariate, condition)
Arguments
covariate |
A vector containing a covariate to match the conditions on. |
condition |
A factor vector containing condition labels. |
Value
The p-value.
Criterion function for ad_halt.
Description
Criterion function for ad_halt.
Usage
.ad_crit(covariate, condition)
Arguments
covariate |
A vector containing a covariate to match the conditions on. |
condition |
A factor vector containing condition labels. |
Value
The p-value.
Returns smallest halting_test-threshold ratio, or 0 if less than 1.
Description
Returns smallest halting_test-threshold ratio, or 0 if less than 1.
Usage
.apply_crit(covariates, crit, condition, thresh)
Arguments
covariates |
A columnwise matrix containing covariates to match the conditions on. |
crit |
The criterion function to use, such as |
condition |
A factor vector containing condition labels. |
thresh |
The return value of halting_test has to be greater than or equal to thresh for the matched groups. |
Value
The ratio of the p-value and the threshold, or 0 if the p-value is less than the threshold.
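The ratio described above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the package's implementation: apply_crit_sketch and wilcox_crit are hypothetical names, and the Wilcoxon rank-sum test from stats merely stands in for whichever criterion function is passed as crit.

```r
# Hypothetical criterion: p-value of a two-sample Wilcoxon rank-sum test.
wilcox_crit <- function(covariate, condition) {
  suppressWarnings(stats::wilcox.test(covariate ~ condition)$p.value)
}

# Smallest p-value / threshold ratio over all covariate columns,
# or 0 if that ratio is below 1 (i.e. some p-value is below thresh).
apply_crit_sketch <- function(covariates, crit, condition, thresh) {
  covariates <- as.matrix(covariates)
  ratios <- apply(covariates, 2, crit, condition = condition) / thresh
  min_ratio <- min(ratios)
  if (min_ratio < 1) 0 else min_ratio
}

condition <- factor(rep(c("A", "B"), each = 5))
similar <- cbind(x = c(1, 3, 5, 7, 9, 2, 4, 6, 8, 10))
different <- cbind(x = c(1, 2, 3, 4, 5, 11, 12, 13, 14, 15))
r_similar <- apply_crit_sketch(similar, wilcox_crit, condition, thresh = 0.2)
r_different <- apply_crit_sketch(different, wilcox_crit, condition, thresh = 0.2)
```

With interleaved groups the ratio is at least 1 (the groups count as matched); with fully separated groups it collapses to 0.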
Returns smallest value from .apply_crit for all condition pairs.
Description
Returns smallest value from .apply_crit for all condition pairs.
Usage
.apply_crit_to_condition_pairs(covariates, crit, condition, thresh)
Arguments
covariates |
A columnwise matrix containing covariates to match the conditions on. |
crit |
The criterion function to use, such as |
condition |
A factor vector containing condition labels. |
thresh |
The return value of halting_test has to be greater than or equal to thresh for the matched groups. |
Value
The ratio of the p-value and the threshold, or 0 if the p-value is less than the threshold.
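A minimal base R sketch of the pairwise variant, assuming the ratio is computed separately for every pair of condition levels and the smallest value is returned. The names apply_crit_to_pairs_sketch and t_crit are hypothetical, and Welch's t-test is only a stand-in criterion.

```r
# Hypothetical criterion: p-value of Welch's two-sample t-test.
t_crit <- function(covariate, condition) {
  stats::t.test(covariate ~ condition)$p.value
}

# Compute the p-value / threshold ratio for every pair of condition levels
# and return the smallest one (0 as soon as any pair fails the threshold).
apply_crit_to_pairs_sketch <- function(covariates, crit, condition, thresh) {
  covariates <- as.matrix(covariates)
  pairs <- utils::combn(levels(condition), 2, simplify = FALSE)
  ratios <- vapply(pairs, function(pair) {
    keep <- condition %in% pair
    cond <- droplevels(condition[keep])
    p <- apply(covariates[keep, , drop = FALSE], 2, crit, condition = cond)
    ratio <- min(p) / thresh
    if (ratio < 1) 0 else ratio
  }, numeric(1))
  min(ratios)
}

condition <- factor(rep(c("A", "B", "C"), each = 4))
covariates <- cbind(x = c(1, 2, 3, 4, 1.5, 2.5, 3.5, 4.5, 11, 12, 13, 14))
all_pairs <- apply_crit_to_pairs_sketch(covariates, t_crit, condition, 0.2)

# Restricting the data to A and B leaves only the well-matched pair.
ab <- condition %in% c("A", "B")
ab_only <- apply_crit_to_pairs_sketch(
  covariates[ab, , drop = FALSE], t_crit, droplevels(condition[ab]), 0.2
)
```

Group C differs strongly from A and B, so the result over all pairs is 0 even though A and B on their own are matched.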
Calculates multipliers used in search_random.
Description
Derives multiplier for rcounts (the number of subjects that can be removed) such that the proportion of the expected sizes of groups will be props. The returned multipliers will be in the range of 0 to 1.
Usage
.calc_multipliers(counts, rcounts, props)
Arguments
counts |
The number of subjects for each group. |
rcounts |
The number of subjects that can be removed for each group. |
props |
The expected proportion of subjects for each group. |
Calculates p-value-threshold ratio.
Description
Calculates p-value-threshold ratio.
Usage
.calc_p_thresh_ratio(
condition,
covariates,
halting_test,
thresh,
silent = TRUE
)
Arguments
condition |
A factor vector containing condition labels. |
covariates |
A columnwise matrix containing covariates to match the conditions on. |
halting_test |
A function to apply to 'covariates' (in matrix form)
which is TRUE iff the conditions are matched.
Signature: halting_test(condition, covariates, thresh).
The following halting tests are part of this package:
|
thresh |
The return value of halting_test has to be greater than or equal to thresh for the matched groups. |
silent |
If FALSE, prints warning when the test statistic cannot be calculated; if TRUE (the default) they are not printed. |
Value
The p-value / thresh ratio, or NA if the p-value could not be calculated.
Characterizes closeness of actual group sizes to what is expected.
Description
Characterizes closeness of actual group sizes to what is expected.
Usage
.calc_subject_balance_divergence(table_condition, props)
Arguments
table_condition |
The number of subjects for each condition value, usually created by calling table(condition). |
props |
Either the desired proportions (percentage) of the sample for each condition as a named vector, or the names of the conditions for which we prefer to preserve the subjects, in decreasing order of preference. If not specified, the (full) sample proportions are used. This is preferred among configurations with the same number of subjects, and is taken into account by the other methods to some extent. For example, c(A = 0.4, B = 0.4, C = 0.2) means that we would like the number of subjects in groups A, B, and C to be around 40%, 40%, and 20% of the total number of subjects, respectively. In contrast, c("A", "B", "C") means that, if possible, we would like to keep all subjects in group A, and prefer keeping subjects in B, even if it results in losing more subjects from C. |
Value
KL divergence of the actual group size proportions from the expected ones.
See Also
match_groups for the meaning of the condition parameter.
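For illustration, the KL divergence between actual and expected group proportions can be sketched in base R. The helper kl_divergence is hypothetical, and the direction of the divergence (actual proportions relative to expected ones) is an assumption; the package may compute it differently, for example via the entropy package it imports.

```r
# KL divergence D(p || q) = sum(p * log(p / q)) between two discrete
# distributions; terms with p == 0 contribute 0 by convention.
kl_divergence <- function(p, q) {
  p <- p / sum(p)
  q <- q / sum(q)
  nonzero <- p > 0
  sum(p[nonzero] * log(p[nonzero] / q[nonzero]))
}

props <- c(A = 0.4, B = 0.4, C = 0.2)

# Group sizes that match the expected proportions exactly.
balanced <- table(factor(c(rep("A", 8), rep("B", 8), rep("C", 4))))
d_balanced <- kl_divergence(as.numeric(balanced), props[names(balanced)])

# Group sizes that deviate from the expected proportions.
skewed <- table(factor(c(rep("A", 12), rep("B", 4), rep("C", 4))))
d_skewed <- kl_divergence(as.numeric(skewed), props[names(skewed)])
```

The divergence is approximately 0 when the group sizes follow props and grows as they drift away from it.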
Searches over all possible subspaces for specified group size setup.
Description
Results are optimized for the following, in decreasing order of preference: number of subjects; proportion of group sizes close to props; p-value as large as possible.
Usage
.check_subspaces_for_group_size_setup(
best,
grpsize_setup,
sspace,
condition,
covariates,
halting_test,
thresh,
print_info
)
Arguments
best |
The best matched groups so far together with its p-value / thresh ratio; a list containing ratio and sets (a list of subject index vectors). |
grpsize_setup |
A set of group sizes as a data.table row (also a list). |
sspace |
An ordered subject subspace: a list of vectors, with one vector per group containing the corresponding subject indices. |
condition |
A factor vector containing condition labels. |
covariates |
A columnwise matrix containing covariates to match the conditions on. |
halting_test |
A function to apply to 'covariates' (in matrix form)
which is TRUE iff the conditions are matched.
Signature: halting_test(condition, covariates, thresh).
The following halting tests are part of this package:
|
thresh |
The return value of halting_test has to be greater than or equal to thresh for the matched groups. |
print_info |
If TRUE, prints summary information on the input and the
results, as well as progress information for the
exhaustive search and random algorithms. Default: TRUE;
can be changed using
|
Value
A list of logical vectors for the best matched groups.
Chooses best set of subjects in a set.
Description
Chooses best set of subjects in a set.
Usage
.choose_best_subjects(
candidates,
is.in,
condition,
covariates,
halting_test,
thresh,
tiebreaker,
props,
prefer_test,
max_removed_per_cond,
max_removed_in_next_step,
ratio_for_slowdown,
remove_best_only
)
Arguments
candidates |
An iterator returning (or a list containing) indices for the is.in logical vector whose in / out status is to be changed. |
is.in |
A logical vector showing which items are preserved currently; versions resulting by changing indices for each candidate are then compared. |
condition |
A factor vector containing condition labels. |
covariates |
A columnwise matrix containing covariates to match the conditions on. |
halting_test |
A function to apply to 'covariates' (in matrix form)
which is TRUE iff the conditions are matched.
Signature: halting_test(condition, covariates, thresh).
The following halting tests are part of this package:
|
thresh |
The return value of halting_test has to be greater than or equal to thresh for the matched groups. |
tiebreaker |
NULL, or a function similar to halting_test, used to decide between cases for which halting_test yields equal values. |
props |
Either the desired proportions (percentage) of the sample for each condition as a named vector, or the names of the conditions for which we prefer to preserve the subjects, in decreasing order of preference. If not specified, the (full) sample proportions are used. This is preferred among configurations with the same number of subjects, and is taken into account by the other methods to some extent. For example, c(A = 0.4, B = 0.4, C = 0.2) means that we would like the number of subjects in groups A, B, and C to be around 40%, 40%, and 20% of the total number of subjects, respectively. In contrast, c("A", "B", "C") means that, if possible, we would like to keep all subjects in group A, and prefer keeping subjects in B, even if it results in losing more subjects from C. |
prefer_test |
If TRUE, prefers higher test statistic more than the expected group size proportion; default is TRUE. Used by all algorithms except exhaustive, which always |
max_removed_per_cond |
A named integer vector, containing the maximum number of subjects that can be removed from each group. Specify 0 for groups if you want to preserve all of their subjects. If you do not specify a value for a group, it defaults to 2 less than the group size. Values outside the valid range of 0..(N-1) (where N is the number of subjects in the group) are corrected without a warning. |
ratio_for_slowdown |
The p-value / threshold ratio at which it starts removing subjects one by one. Used when max_removed_per_step > 1, with a default value of 0.5. |
Value
list(inds): A list containing the best index vectors indicating the positions to flip in is.in.
Chooses rows with best test statistic.
Description
Chooses rows with best test statistic.
Usage
.choose_best_test_statistic(
dat,
condition,
covariates,
halting_test,
thresh,
tiebreaker
)
Arguments
dat |
A data.table with an ind column with indices for items to consider dropping. |
condition |
A factor vector containing condition labels. |
covariates |
A columnwise matrix containing covariates to match the conditions on. |
halting_test |
A function to apply to 'covariates' (in matrix form)
which is TRUE iff the conditions are matched.
Signature: halting_test(condition, covariates, thresh).
The following halting tests are part of this package:
|
thresh |
The return value of halting_test has to be greater than or equal to thresh for the matched groups. |
tiebreaker |
NULL, or a function similar to halting_test, used to decide between cases for which halting_test yields equal values. |
Value
A data.table containing only the rows of dat with the best test statistic values (decided primarily by halting_test, then by tiebreaker).
Chooses best one(s) of a set of subjects having the best p-value(s).
Description
Used as first parameter of .search_heuristic_with_lookahead. It chooses subject(s) for removal: the most frequently removed one(s); among those, it prefers the one with the lowest p-value, then chooses randomly among the remaining ones.
Usage
.choose_most_frequently_chosen_subject_from_subject_tuples(
is.in,
best_sets,
look,
condition,
covariates,
halting_test,
thresh,
tiebreaker,
props,
prefer_test,
max_removed_per_cond,
max_removed_in_next_step,
ratio_for_slowdown,
remove_best_only,
print_info
)
Value
The table of counts for the chosen indices within is.in.
Chooses best one(s) of a set of subjects having the best p-value(s).
Description
Used as first parameter of .search_heuristic_with_lookahead. It traces the best parents for a set of subjects to be removed back to the first level. Note that the subject indices may not be unique: ones that occur in more configurations may be listed multiple times.
Usage
.choose_subject_with_best_p_value_from_subject_tuples(
is.in,
best_sets,
look,
condition,
covariates,
halting_test,
thresh,
tiebreaker,
props,
prefer_test,
max_removed_per_cond,
max_removed_in_next_step,
ratio_for_slowdown,
remove_best_only,
print_info
)
Value
The table of counts for the chosen indices within is.in.
Combines current best and candidate sets, keeping the highest metric value.
Description
Combines current best and candidate sets, keeping the highest metric value.
Usage
.combine_sets(best, candidate)
Arguments
best |
A list(metric, sets); metric is a number, sets is a list of vectors. |
candidate |
A list(metric, set); metric is a number, set is a vector. Candidate is only considered if metric is not zero. |
Value
A list containing the highest metric and a list of set values (sets).
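A minimal sketch of this combination step, assuming that a strictly better candidate replaces the current best sets and that a tie appends the candidate to them. The name combine_sets_sketch is hypothetical, and the exact equality test is a simplification (the package may compare with a tolerance).

```r
combine_sets_sketch <- function(best, candidate) {
  # Candidates with a metric of zero are ignored.
  if (candidate$metric == 0) return(best)
  if (candidate$metric > best$metric) {
    # Strictly better: start a new list of best sets.
    list(metric = candidate$metric, sets = list(candidate$set))
  } else if (candidate$metric == best$metric) {
    # Tie: keep the candidate alongside the existing best sets.
    list(metric = best$metric, sets = c(best$sets, list(candidate$set)))
  } else {
    best
  }
}

best <- list(metric = 0, sets = list())
best <- combine_sets_sketch(best, list(metric = 1.5, set = c(1L, 4L)))
best <- combine_sets_sketch(best, list(metric = 1.5, set = c(2L, 3L)))
best <- combine_sets_sketch(best, list(metric = 1.2, set = c(5L, 6L)))
```

After these calls, best holds the metric 1.5 together with the two tied sets; the weaker third candidate is discarded.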
Creates Cartesian product of iterators.
Description
Creates Cartesian product of iterators.
Usage
.create_Cartesian_iterable(initializers, get_next, sspace)
Arguments
initializers |
A list of initializer functions (with no arguments) for iterators. |
get_next |
A function for retrieving next item for an iterator argument; it assumes that the iterator returns NULL when finished. |
sspace |
Elements to be used (a list of vectors). |
Value
A function that returns a list of values, and stops with a "StopIteration" message when finished, so that it can be used with the iterators::iter() function to create an iterator that works with foreach.
Creates all group sizes by reducing one group in all rows of grpsizes.
Description
Used for generating all group size combinations for one specific total size iteratively, starting from grpsizes with one row containing original group sizes.
Usage
.decrease_group_sizes(grpsizes, grpnames, minpergrp)
Arguments
grpsizes |
A data.table with the columns containing the group names, and the rows containing a particular setup of group sizes. All rows are expected to have the same sum (not checked). |
grpnames |
The group names (specified because the table can have other columns as well). |
minpergrp |
The minimum number of subjects to be preserved per group. |
Value
A data.table with the same format as grpsizes, containing all possible group setups totaling to one less than the total in grpsizes.
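The iterative generation of group size setups can be sketched as follows. This is an illustration only: decrease_group_sizes_sketch is a hypothetical name, and it uses a plain data.frame rather than a data.table to stay within base R.

```r
decrease_group_sizes_sketch <- function(grpsizes, grpnames, minpergrp) {
  reduced <- list()
  for (i in seq_len(nrow(grpsizes))) {
    for (g in grpnames) {
      # Only shrink a group that stays at or above the minimum size.
      if (grpsizes[i, g] > minpergrp) {
        row <- grpsizes[i, , drop = FALSE]
        row[1, g] <- row[1, g] - 1
        reduced[[length(reduced) + 1]] <- row
      }
    }
  }
  # No group can be reduced any further: return an empty table.
  if (length(reduced) == 0) return(grpsizes[0, , drop = FALSE])
  # Different rows can lead to the same setup, so drop duplicates.
  result <- unique(do.call(rbind, reduced))
  rownames(result) <- NULL
  result
}

grpsizes <- data.frame(A = 3, B = 2)
step1 <- decrease_group_sizes_sketch(grpsizes, c("A", "B"), minpergrp = 1)
step2 <- decrease_group_sizes_sketch(step1, c("A", "B"), minpergrp = 1)
```

Starting from sizes (3, 2), the first call yields the setups (2, 2) and (3, 1); the second call yields (1, 2) and (2, 1), with the duplicate (2, 1) removed.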
Criterion function for f_halt.
Description
Criterion function for f_halt.
Usage
.f_crit(covariate, condition)
Arguments
covariate |
A vector containing a covariate to match the conditions on. |
condition |
A factor vector containing condition labels. |
Value
The p-value.
Flips a logical vector at the specified indices.
Description
Flips a logical vector at the specified indices.
Usage
.flip_ind(ind, is.in)
Arguments
ind |
Integer indices for the is.in logical vector. |
is.in |
A logical vector showing which items are preserved. |
Value
A logical vector identical to is.in except at the indices in ind, where the values of is.in are negated.
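The described behavior amounts to a one-line negation at the given positions. The sketch below uses the hypothetical name flip_ind_sketch and is not taken from the package source.

```r
flip_ind_sketch <- function(ind, is.in) {
  # Negate only the selected positions; all others are left unchanged.
  is.in[ind] <- !is.in[ind]
  is.in
}

is.in <- c(TRUE, TRUE, FALSE, TRUE)
flipped <- flip_ind_sketch(c(1, 3), is.in)
```

Here subject 1 is dropped and subject 3 is put back, giving FALSE TRUE TRUE TRUE.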
Wrapper to foreach::foreach called from .choose_best_subjects.
Description
Wrapper to foreach::foreach called from .choose_best_subjects.
Usage
.foreach(
input,
operation,
preprocess_input,
.init,
.combine,
max_chunk_size = get("PROCESSED_CHUNK_SIZE", .ldamatch_globals),
print_progress = get("PRINT_PROGRESS", .ldamatch_globals)
)
Arguments
input |
An iterator created using either the iterpc or the iterators package, or anything else foreach::foreach can interpret (esp. a list). |
operation |
The operation to be performed for each item in input (possibly after preprocessing it; see preprocess_input). |
preprocess_input |
Processes each value retrieved from the input iterator. |
.init , .combine |
The same as the parameters of foreach::foreach with identical names. |
max_chunk_size |
The maximum number of items to be retrieved from input if it is an iterator. |
print_progress |
If TRUE, prints messages about the progress. Note: this function formerly used iterpc::iter_wrapper() on iterpc iterators, but foreach does not handle iterators gracefully (it converts them to a list, which may be huge, instead of gradually retrieving the contents), so segments of the iterators are fed to foreach instead. |
Returns halting tests for names, or checks if passed functions are suitable.
Description
Returns halting tests for names, or checks if passed functions are suitable.
Usage
.get_halting_test(halting_test)
Arguments
halting_test |
The name of one halting test, or a halting test function. |
Value
A vector of halting test functions.
Returns human readable format for number of seconds.
Description
Returns human readable format for number of seconds.
Usage
.get_human_readable(seconds, num_decimals = 1)
Arguments
seconds |
The number of seconds to convert to human-readable form. |
num_decimals |
The number of decimals to print in the output. |
Value
A string containing "<number> seconds/minutes/hours/days/years".
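A possible way to produce such a string in base R is sketched below. The name get_human_readable_sketch is hypothetical, and both the unit boundaries and the 365-day year are assumptions rather than the package's actual rules.

```r
get_human_readable_sketch <- function(seconds, num_decimals = 1) {
  # Unit sizes in seconds, from largest to smallest (365-day year assumed).
  units <- c(years = 365 * 24 * 3600, days = 24 * 3600,
             hours = 3600, minutes = 60, seconds = 1)
  # Use the largest unit that the duration reaches; fall back to seconds.
  fits <- which(seconds >= units)
  unit <- if (length(fits) > 0) names(units)[fits[1]] else "seconds"
  value <- formatC(seconds / units[[unit]], format = "f", digits = num_decimals)
  paste(value, unit)
}

get_human_readable_sketch(90)    # "1.5 minutes"
get_human_readable_sketch(7200)  # "2.0 hours"
get_human_readable_sketch(0.5)   # "0.5 seconds"
```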
Determines which arguments are missing for a function, which is its caller by default.
Description
Determines which arguments are missing for a function, which is its caller by default.
Usage
.get_if_args_are_missing(fun = sys.function(-1), ncall = 3)
Arguments
fun |
A function; default: the caller. |
ncall |
The parent frame index; default: 3 (the great-grandparent). |
Value
A named boolean vector that contains whether each argument is missing.
Compares outputs of ldamatch runs using internally normalized parameters.
Description
Compares outputs of ldamatch runs using internally normalized parameters.
Usage
.internally_compare_ldamatch_outputs(
is.in1,
is.in2,
condition,
covariates,
halting_test,
props,
prefer_test,
tiebreaker
)
Arguments
is.in1 |
A logical vector for output 1, TRUE iff row is in the match. |
is.in2 |
A logical vector for output 2, TRUE iff row is in the match. |
condition |
A factor vector containing condition labels. |
covariates |
A columnwise matrix containing covariates to match the conditions on. |
halting_test |
A function to apply to 'covariates' (in matrix form)
which is TRUE iff the conditions are matched.
Signature: halting_test(condition, covariates, thresh).
The following halting tests are part of this package:
|
props |
Either the desired proportions (percentage) of the sample for each condition as a named vector, or the names of the conditions for which we prefer to preserve the subjects, in decreasing order of preference. If not specified, the (full) sample proportions are used. This is preferred among configurations with the same number of subjects, and is taken into account by the other methods to some extent. For example, c(A = 0.4, B = 0.4, C = 0.2) means that we would like the number of subjects in groups A, B, and C to be around 40%, 40%, and 20% of the total number of subjects, respectively. In contrast, c("A", "B", "C") means that, if possible, we would like to keep all subjects in group A, and prefer keeping subjects in B, even if it results in losing more subjects from C. |
prefer_test |
If TRUE, it prioritizes the test statistic more than the group size proportion. |
tiebreaker |
NULL, or a function similar to halting_test, used to decide between cases for which halting_test yields equal values. |
Value
A number that is > 0 if is.in1 is a better solution than is.in2, < 0 if is.in1 is a worse solution than is.in2, or 0 if the two solutions are equivalent (not necessarily identical).
See Also
compare_ldamatch_outputs for operation and meaning of parameters.
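The sign convention of the return value can be illustrated with a small lexicographic comparison. This is a sketch only: compare_metrics_sketch is a hypothetical helper that works on already computed metrics (using the field names num_excluded, balance_divergence, and p_matched as they appear in the output of calc_metrics), whereas the actual function derives them from is.in1 and is.in2.

```r
compare_metrics_sketch <- function(m1, m2, prefer_test = FALSE) {
  # Later criteria are consulted only when the earlier ones tie.
  keys <- if (prefer_test) {
    c("num_excluded", "p_matched", "balance_divergence")
  } else {
    c("num_excluded", "balance_divergence", "p_matched")
  }
  # Fewer exclusions and lower divergence are better; a larger p is better.
  smaller_is_better <- c(num_excluded = TRUE, balance_divergence = TRUE,
                         p_matched = FALSE)
  for (key in keys) {
    diff <- m1[[key]] - m2[[key]]
    if (diff != 0) {
      return(if (smaller_is_better[[key]]) -sign(diff) else sign(diff))
    }
  }
  0
}

m1 <- list(num_excluded = 3, balance_divergence = 0.02, p_matched = 0.6)
m2 <- list(num_excluded = 3, balance_divergence = 0.01, p_matched = 0.4)
by_balance <- compare_metrics_sketch(m1, m2)                   # -1
by_test <- compare_metrics_sketch(m1, m2, prefer_test = TRUE)  #  1
```

Both solutions exclude three subjects, so the outcome depends on the second criterion: m2 wins on balance, while m1 wins when the test statistic is preferred.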
Criterion function for ks_halt.
Description
Warnings such as "cannot compute exact p-value with ties" are suppressed.
Usage
.ks_crit(covariate, condition)
Arguments
covariate |
A vector containing a covariate to match the conditions on. |
condition |
A factor vector containing condition labels. |
Value
The p-value.
Criterion function for l_halt.
Description
Warnings such as "ANOVA F-tests on an essentially perfect fit are unreliable" are suppressed.
Usage
.l_crit(covariate, condition)
Arguments
covariate |
A vector containing a covariate to match the conditions on. |
condition |
A factor vector containing condition labels. |
Value
The p-value.
Normalizes max_removed_per_cond parameter for match_groups() and estimate_exhaustive().
Description
Normalizes max_removed_per_cond parameter for match_groups() and estimate_exhaustive().
Usage
.normalize_max_removed_per_cond(max_removed_per_cond, condition)
Arguments
max_removed_per_cond |
A named integer vector, containing the maximum number of subjects that can be removed from each group. Specify 0 for groups if you want to preserve all of their subjects. If you do not specify a value for a group, it defaults to 2 less than the group size. Values outside the valid range of 0..(N-1) (where N is the number of subjects in the group) are corrected without a warning. |
condition |
A factor vector containing condition labels. |
Normalizes the props parameter for match_groups().
Description
Normalizes the props parameter for match_groups().
Usage
.normalize_props(props, condition, keep_last_item = FALSE)
Arguments
props |
Either the desired proportions (percentage) of the sample for each condition as a named vector, or the names of the conditions for which we prefer to preserve the subjects, in decreasing order of preference. If not specified, the (full) sample proportions are used. This is preferred among configurations with the same number of subjects, and is taken into account by the other methods to some extent. For example, c(A = 0.4, B = 0.4, C = 0.2) means that we would like the number of subjects in groups A, B, and C to be around 40%, 40%, and 20% of the total number of subjects, respectively. In contrast, c("A", "B", "C") means that, if possible, we would like to keep all subjects in group A, and prefer keeping subjects in B, even if it results in losing more subjects from C. |
condition |
A factor vector containing condition labels. |
keep_last_item |
If TRUE and props is a character vector, last item is not dropped. |
Value
A named vector: if props contains proportions, it is the same, but ordered to follow the levels of condition; if props contains names of conditions, the total number of subjects for the condition names in props.
Recycles threshold values for halting tests.
Description
Recycles threshold values for halting tests.
Usage
.recycle(ts, hs)
Arguments
ts |
Threshold value(s). |
hs |
Halting tests. |
Value
A vector with one threshold value per halting test.
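Assuming the usual R recycling rules apply, the behavior can be sketched with rep_len. The name recycle_sketch and the placeholder halting tests are hypothetical; the package may additionally validate the lengths.

```r
recycle_sketch <- function(ts, hs) {
  # Repeat the thresholds as often as needed to cover all halting tests.
  rep_len(ts, length(hs))
}

# Placeholder halting tests; only their number matters here.
hs <- list(first = function(...) NULL,
           second = function(...) NULL,
           third = function(...) NULL)

single <- recycle_sketch(0.2, hs)          # 0.2 0.2 0.2
cycled <- recycle_sketch(c(0.2, 0.1), hs)  # 0.2 0.1 0.2
```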
Finds matching using depth-first search, looking ahead n steps.
Description
In each step, it removes one subject from the set of subjects with the smallest associated p-value after "lookahead" steps.
Usage
.search_heuristic_with_lookahead(
choose_from_subject_tuples,
condition,
covariates,
halting_test,
thresh,
props,
max_removed_per_cond,
tiebreaker = NULL,
min_preserved = length(levels(condition)),
lookahead = 2,
prefer_test = TRUE,
print_info = TRUE,
max_removed_per_step = 1,
max_removed_percent_per_step = 0.5,
ratio_for_slowdown = 0.5,
given_args = NULL,
...
)
Arguments
condition |
A factor vector containing condition labels. |
covariates |
A columnwise matrix containing covariates to match the conditions on. |
halting_test |
A function to apply to 'covariates' (in matrix form)
which is TRUE iff the conditions are matched.
Signature: halting_test(condition, covariates, thresh).
The following halting tests are part of this package:
|
thresh |
The return value of halting_test has to be greater than or equal to thresh for the matched groups. |
props |
Either the desired proportions (percentage) of the sample for each condition as a named vector, or the names of the conditions for which we prefer to preserve the subjects, in decreasing order of preference. If not specified, the (full) sample proportions are used. This is preferred among configurations with the same number of subjects, and is taken into account by the other methods to some extent. For example, c(A = 0.4, B = 0.4, C = 0.2) means that we would like the number of subjects in groups A, B, and C to be around 40%, 40%, and 20% of the total number of subjects, respectively. In contrast, c("A", "B", "C") means that, if possible, we would like to keep all subjects in group A, and prefer keeping subjects in B, even if it results in losing more subjects from C. |
max_removed_per_cond |
The maximum number of subjects that can be removed from each group. It must have a valid number for each group. |
tiebreaker |
NULL, or a function similar to halting_test, used to decide between cases for which halting_test yields equal values. |
min_preserved |
The minimum number of preserved subjects. It can be used to ensure that the search will not take forever to run, but instead fail when a solution is not found when preserving this number of subjects. |
lookahead |
The lookahead to use: a positive integer. It is used by the heuristic3 and heuristic4 algorithms, with a default of 2. The running time is O(N ^ lookahead), where N is the number of subjects. |
prefer_test |
If TRUE, prefers higher test statistic more than the expected group size proportion; default is TRUE. Used by all algorithms except exhaustive, which always |
print_info |
If TRUE, prints summary information on the input and the
results, as well as progress information for the
exhaustive search and random algorithms. Default: TRUE;
can be changed using
|
max_removed_per_step |
The number of equivalent subjects that can be removed in each step. (The actual allowed number may be less depending on the p-value / threshold ratio.) This parameter is used by the heuristic3 and heuristic4 algorithms, with a default value of 1. |
max_removed_percent_per_step |
The percentage of remaining subjects that can be removed in each step. Used when max_removed_per_step > 1, with a default value of 0.5. |
ratio_for_slowdown |
The p-value / threshold ratio at which it starts removing subjects one by one. Used when max_removed_per_step > 1, with a default value of 0.5. |
given_args |
The names of arguments given to the search function. |
... |
Consumes extra parameters that are not used by the search algorithm at hand; the function warns that any such parameter with a non-NULL value is not used. |
Value
All results found by search method in a list. It raises a "Convergence failure" error if it cannot find a matched set.
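To make the idea of backward selection concrete, here is a much simplified greedy sketch with a lookahead of one step and a single covariate. It is not the package's algorithm: greedy_backward_sketch and p_value are hypothetical names, Welch's t-test stands in for the halting test, and props, tiebreaker, and multi-subject removal are ignored.

```r
# Hypothetical halting criterion: p-value of Welch's t-test on one covariate.
p_value <- function(x, condition) stats::t.test(x ~ condition)$p.value

greedy_backward_sketch <- function(x, condition, thresh, min_per_group = 2) {
  is.in <- rep(TRUE, length(x))
  while (p_value(x[is.in], condition[is.in]) < thresh) {
    # Current size of the group that each subject belongs to.
    sizes <- tabulate(as.integer(condition[is.in]), nbins = nlevels(condition))
    group_size <- sizes[as.integer(condition)]
    # Subjects that may still be removed without shrinking a group too far.
    removable <- which(is.in & group_size > min_per_group)
    if (length(removable) == 0) stop("Convergence failure")
    # Try each single removal and keep the one with the largest p-value.
    p_after <- vapply(removable, function(i) {
      trial <- is.in
      trial[i] <- FALSE
      p_value(x[trial], condition[trial])
    }, numeric(1))
    is.in[removable[which.max(p_after)]] <- FALSE
  }
  is.in
}

condition <- factor(c(rep("A", 6), rep("B", 7)))
x <- c(1:6, 4:10)
is.in <- greedy_backward_sketch(x, condition, thresh = 0.2)
p_final <- p_value(x[is.in], condition[is.in])
```

The returned logical vector marks the retained subjects. A larger lookahead would evaluate tuples of removals instead of single ones, at the cost of the O(N ^ lookahead) running time noted for the lookahead argument.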
Orders rows by similarity to expected group size proportions.
Description
Orders rows by similarity to expected group size proportions.
Usage
.sort_group_sizes(grpsizes, grpnames, props)
Arguments
grpsizes |
A data.table with the columns containing the group names, and the rows containing a particular setup of group sizes. All rows are expected to have the same sum (not checked). |
grpnames |
The group names (specified because the table can have other columns as well). |
props |
Either the desired proportions (percentage) of the sample for each condition as a named vector, or the names of the conditions for which we prefer to preserve the subjects, in decreasing order of preference. If not specified, the (full) sample proportions are used. This is preferred among configurations with the same number of subjects, and is taken into account by the other methods to some extent. For example, c(A = 0.4, B = 0.4, C = 0.2) means that we would like the number of subjects in groups A, B, and C to be around 40%, 40%, and 20% of the total number of subjects, respectively. In contrast, c("A", "B", "C") means that, if possible, we would like to keep all subjects in group A, and prefer keeping subjects in B, even if it results in losing more subjects from C. |
Criterion function for t_halt.
Description
Criterion function for t_halt.
Usage
.t_crit(covariate, condition)
Arguments
covariate |
A vector containing a covariate to match the conditions on. |
condition |
A factor vector containing condition labels. |
Value
The p-value.
An infinitesimally small amount, used to check if values are approximately the same.
Description
An infinitesimally small amount, used to check if values are approximately the same.
Usage
.tolerance
Format
An object of class numeric of length 1.
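A tolerance of this kind is typically used to compare floating-point values that should be equal but differ by rounding error. The sketch below assumes a value of sqrt(.Machine$double.eps); the actual value of .tolerance is not stated here, and approx_equal is a hypothetical helper.

```r
# Assumed tolerance value; the package constant may differ.
tolerance_sketch <- sqrt(.Machine$double.eps)

approx_equal <- function(a, b, tol = tolerance_sketch) {
  abs(a - b) < tol
}

exact <- (0.1 + 0.2) == 0.3             # FALSE because of rounding error
approx <- approx_equal(0.1 + 0.2, 0.3)  # TRUE within the tolerance
```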
Uniquifies a list.
Description
Uniquifies a list.
Usage
.unique_list(l)
Arguments
l |
A list. |
Value
The unique list items.
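In base R, unique() already works on lists, so the effect can be illustrated without any helper. Whether the package function relies on unique() internally is not stated here; the example only shows the expected result on a list of subject index vectors.

```r
subject_sets <- list(c(1L, 4L), c(2L, 3L), c(1L, 4L))

# Duplicated elements are dropped; the first occurrence is kept.
unique_sets <- unique(subject_sets)
length(unique_sets)  # 2
```

Note that the comparison is order-sensitive: c(4L, 1L) would count as different from c(1L, 4L).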
Creates string from list of vectors.
Description
Creates string from list of vectors.
Usage
.vector_list_to_string(lv, sep = "")
Arguments
lv |
A list of vectors. |
sep |
A string to be inserted between the name of a vector item and its value. |
Value
A character string.
Warns about extra (i.e. unused) parameters.
Description
Warns about extra (i.e. unused) parameters.
Usage
.warn_about_extra_params(given_args = NULL, ...)
Arguments
given_args |
The names of arguments given to the search function. |
... |
Consumes extra parameters that are not used by the search algorithm at hand; the function warns that any such parameter with a non-NULL value is not used. |
A univariate halting test using the Wilcoxon test, which must be satisfied for all condition pairs.
Description
A univariate halting test using the Wilcoxon test, which must be satisfied for all condition pairs.
Usage
U_halt(condition, covariates, thresh)
Arguments
condition |
A factor vector containing condition labels. |
covariates |
A columnwise matrix containing covariates to match the conditions on. |
thresh |
The return value of halting_test has to be greater than or equal to thresh for the matched groups. |
Value
The ratio of the p-value and the threshold, or 0 if the p-value is less than the threshold. If there are more than two conditions, it returns the smallest value found for any condition pair.
A univariate halting test using the Anderson-Darling test.
Description
A univariate halting test using the Anderson-Darling test.
Usage
ad_halt(condition, covariates, thresh)
Arguments
condition |
A factor vector containing condition labels. |
covariates |
A columnwise matrix containing covariates to match the conditions on. |
thresh |
The return value of halting_test has to be greater than or equal to thresh for the matched groups. |
Value
The ratio of the p-value and the threshold, or 0 if the p-value is less than the threshold.
Calculates basic metrics about ldamatch search result.
Description
Calculates basic metrics about ldamatch search result.
Usage
calc_metrics(
is.in,
condition,
covariates,
halting_test,
props = prop.table(table(condition)),
tiebreaker = NULL
)
Arguments
is.in |
The output of |
condition |
A factor vector containing condition labels. |
covariates |
A columnwise matrix containing covariates to match the conditions on. |
halting_test |
A function to apply to 'covariates' (in matrix form)
which is TRUE iff the conditions are matched.
Signature: halting_test(condition, covariates, thresh).
The following halting tests are part of this package:
|
props |
Either the desired proportions (percentage) of the sample for each condition as a named vector, or the names of the conditions for which we prefer to preserve the subjects, in decreasing order of preference. If not specified, the (full) sample proportions are used. This is preferred among configurations with the same number of subjects, and is taken into account by the other methods to some extent. For example, c(A = 0.4, B = 0.4, C = 0.2) means that we would like the number of subjects in groups A, B, and C to be around 40%, 40%, and 20% of the total number of subjects, respectively. In contrast, c("A", "B", "C") means that, if possible, we would like to keep all subjects in group A, and prefer keeping subjects in B, even if it results in losing more subjects from C. |
tiebreaker |
NULL, or a function similar to halting_test, used to decide between cases for which halting_test yields equal values. |
Value
A list containing:
- all.is.in
all results as a list;
- is.in
simply the first item in all.is.in or the error contained in is.in if there was an error running
match_groups
;- num_excluded
the number of excluded subjects;
- p_matched
the test statistic from halting_test for the matched groups;
- p_tiebreaker
the test statistic from tiebreaker for the matched groups; and
- balance_divergence
a value characterizing the deviation from the expected group size proportions specified in props.
If the value for a field cannot be calculated, it will still be present with a value of NA.
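As a rough illustration of two of these fields, the base R sketch below derives the number of excluded subjects and a simple imbalance score from an is.in vector. The sum of absolute differences used here is only a stand-in chosen for this sketch; it is an assumption, not the divergence measure that calc_metrics actually reports.

```r
# Toy data: six subjects in group A, four in group B.
condition <- factor(c(rep("A", 6), rep("B", 4)))
# Hypothetical search result: two subjects from group A were dropped.
is.in <- c(rep(TRUE, 4), FALSE, FALSE, rep(TRUE, 4))

# Expected proportions default to the full-sample proportions.
props <- prop.table(table(condition))

# Number of excluded subjects.
num_excluded <- sum(!is.in)

# Observed proportions among the preserved subjects.
observed <- prop.table(table(condition[is.in]))

# Illustrative imbalance score (assumption: sum of absolute differences).
divergence <- sum(abs(as.numeric(observed) - as.numeric(props[names(observed)])))
```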
Calculates p-value using specified halting test.
Description
Calculates p-value using specified halting test.
Usage
calc_p_value(condition, covariates, halting_test)
Arguments
condition |
A factor vector containing condition labels. |
covariates |
A columnwise matrix containing covariates to match the conditions on. |
halting_test |
A function to apply to 'covariates' (in matrix form)
which is TRUE iff the conditions are matched.
Signature: halting_test(condition, covariates, thresh).
The following halting tests are part of this package: t_halt, U_halt, l_halt, ad_halt, ks_halt, wilks_halt, f_halt.
|
Value
The p-value.
Compares outputs of ldamatch runs.
Description
It favors, in decreasing order of priority, fewer excluded subjects, better balance (i.e. subsamples that diverge less from the expected proportions, which are by default the proportions of the input groups), and better (i.e. larger) test statistic for the matched groups. The preference order for the last two items can be reversed by specifying prefer_test = TRUE.
Usage
compare_ldamatch_outputs(
is.in1,
is.in2,
condition,
covariates = matrix(),
halting_test = NA,
props = prop.table(table(condition)),
prefer_test = is.null(props),
tiebreaker = NULL
)
Arguments
is.in1 |
A logical vector for output 1, TRUE iff row is in the match. |
is.in2 |
A logical vector for output 2, TRUE iff row is in the match. |
condition |
A factor vector containing condition labels. |
covariates |
A columnwise matrix containing covariates to match the conditions on. |
halting_test |
A function to apply to 'covariates' (in matrix form)
which is TRUE iff the conditions are matched.
Signature: halting_test(condition, covariates, thresh).
The following halting tests are part of this package: t_halt, U_halt, l_halt, ad_halt, ks_halt, wilks_halt, f_halt.
|
props |
Either the desired proportions (percentage) of the sample for each condition as a named vector, or the names of the conditions for which we prefer to preserve the subjects, in decreasing order of preference. If not specified, the (full) sample proportions are used. Among configurations that preserve the same number of subjects, the exhaustive search prefers the one closest to these proportions; the other methods take them into account to some extent. For example, c(A = 0.4, B = 0.4, C = 0.2) means that we would like the number of subjects in groups A, B, and C to be around 40%, 40%, and 20% of the total number of subjects, respectively, whereas c("A", "B", "C") means that, if possible, we would like to keep all subjects in group A, and prefer keeping subjects in B even if it results in losing more subjects from C. |
prefer_test |
If TRUE, it prioritizes the test statistic more than the group size proportion. |
tiebreaker |
NULL, or a function similar to halting_test, used to decide between cases for which halting_test yields equal values. |
Value
A number that is > 0 if is.in1 is a better solution than is.in2, < 0 if is.in1 is a worse solution than is.in2, or 0 if the two solutions are equivalent (not necessarily identical).
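The priority order described above can be pictured as a lexicographic comparison. The sketch below is a simplified stand-in written for illustration: compare_sketch is a hypothetical name, it ignores the test statistic and tiebreaker entirely, and its imbalance score (sum of absolute differences) is an assumption rather than the package's measure.

```r
# Returns 1 if is.in1 is better, -1 if worse, 0 if equivalent (sketch only).
compare_sketch <- function(is.in1, is.in2, condition,
                           props = prop.table(table(condition))) {
  # First priority: fewer excluded subjects.
  excluded_diff <- sum(!is.in2) - sum(!is.in1)
  if (excluded_diff != 0) return(sign(excluded_diff))
  # Second priority: smaller deviation from the expected proportions.
  imbalance <- function(is.in) {
    observed <- prop.table(table(condition[is.in]))
    sum(abs(as.numeric(observed) - as.numeric(props)))
  }
  sign(round(imbalance(is.in2) - imbalance(is.in1), 10))
}

condition <- factor(c(rep("A", 6), rep("B", 4)))
all_in <- rep(TRUE, 10)
drop_a <- all_in; drop_a[1] <- FALSE   # drops one subject from A
drop_b <- all_in; drop_b[10] <- FALSE  # drops one subject from B

compare_sketch(all_in, drop_a, condition)  # fewer exclusions wins
compare_sketch(drop_a, drop_b, condition)  # same size, balance decides
```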
Creates halting test from multiple tests.
Description
The created halting test function returns the smallest p-value-to-threshold ratio of the values produced by the supplied tests, or zero if any of the p-values does not exceed the threshold. The resulting function expects one threshold per halting test in a vector or it recycles the given value(s) to get a threshold for each one.
Usage
create_halting_test(halting_tests)
Arguments
halting_tests |
Either a vector of halting test functions (or function names) with the signature halting_test(condition, covariates, thresh) (for the meaning of the parameters see match_groups)
|
Value
A function that returns the minimum of all halting test values; the threshold value supplied to it is recycled for the individual functions.
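The combination rule can be sketched in base R as follows. Both combine_sketch and make_constant_test are hypothetical helpers invented for this illustration; the toy tests return a fixed p-value so the arithmetic is easy to follow, and the sketch does not reproduce the package's own implementation.

```r
# Toy halting test with a fixed p-value (illustration only).
make_constant_test <- function(p) {
  force(p)
  function(condition, covariates, thresh) if (p < thresh) 0 else p / thresh
}

# Combine tests: recycle thresholds, then take the smallest ratio.
combine_sketch <- function(halting_tests) {
  function(condition, covariates, thresh) {
    thresh <- rep_len(thresh, length(halting_tests))
    ratios <- mapply(function(test, th) test(condition, covariates, th),
                     halting_tests, thresh)
    # Any failing test contributes 0, so the minimum is 0 in that case.
    min(ratios)
  }
}

combined <- combine_sketch(list(make_constant_test(0.5), make_constant_test(0.8)))
both_pass <- combined(NULL, NULL, 0.25)          # one threshold, recycled
one_fails <- combined(NULL, NULL, c(0.25, 0.9))  # second test falls below 0.9
```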
Estimates the maximum number of cases to be checked during exhaustive search.
Description
Estimates the maximum number of cases to be checked during exhaustive search.
Usage
estimate_exhaustive(
min_preserved = sum(group_sizes),
condition,
cases_per_second = 100,
print_info = TRUE,
max_removed_per_cond = NULL,
group_sizes = NULL,
props = prop.table(table(condition)),
max_cases = Inf
)
Arguments
min_preserved |
Assumes that at least a total of this many subjects will be preserved. |
condition |
A factor vector containing condition labels. |
cases_per_second |
Assumes that this many cases are checked per second when estimating the time it takes to run the exhaustive search; default: 100. |
print_info |
If TRUE, prints partial calculations as well for the number of cases and estimated time when removing 1, 2, ... subjects. |
max_removed_per_cond |
A named integer vector, containing the maximum number of subjects that can be removed from each group. Specify 0 for groups if you want to preserve all of their subjects. If you do not specify a value for a group, it defaults to 2 less than the group size. Values outside the valid range of 0..(N-1) (where N is the number of subjects in the group) are corrected without a warning. |
group_sizes |
A particular set of group sizes that we know a matched solution for; min_preserved need not be specified if this one is. |
props |
Either the desired proportions (percentage) of the sample for each condition as a named vector, or the names of the conditions for which we prefer to preserve the subjects, in decreasing order of preference. If not specified, the (full) sample proportions are used. |
max_cases |
Once it is certain that the number of cases exceeds this number, the calculation stops. In this case, the returned number is guaranteed to be larger than max_cases, but it is not the exact number of exhaustive cases. The default is Inf, i.e. the exact number of cases is calculated. |
Value
The maximum number of cases: an integer if not greater than the maximum integer size (.Machine$integer.max), otherwise a Big Integer (see the gmp package).
Examples
estimate_exhaustive(58, as.factor(c(rep("ALN", 25), rep("TD", 44))))
estimate_exhaustive(84, as.factor(c(rep("ASD", 51), rep("TD", 44))))
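To get a feel for why these numbers grow so quickly, the base R sketch below computes a crude upper bound: the number of ways to remove up to (total - min_preserved) subjects from the pooled sample. The helper count_cases_upper_bound is hypothetical, and the bound ignores max_removed_per_cond and props, so it should not be expected to match the output of estimate_exhaustive.

```r
# Hypothetical helper: number of subsets obtained by removing 0, 1, ..., K
# subjects, where K = n_total - min_preserved (no per-group constraints).
count_cases_upper_bound <- function(n_total, min_preserved) {
  max_removed <- n_total - min_preserved
  sum(choose(n_total, 0:max_removed))
}

# Sizes from the first example above: 25 + 44 subjects, at least 58 preserved.
cases <- count_cases_upper_bound(25 + 44, 58)
# Rough running time in seconds at 100 cases per second.
seconds <- cases / 100
```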
A univariate halting test using Fisher's exact test.
Description
A univariate halting test using Fisher's exact test.
Usage
f_halt(condition, covariates, thresh)
Arguments
condition |
A factor vector containing condition labels. |
covariates |
A columnwise matrix containing covariates to match the conditions on. |
thresh |
The return value of halting_test has to be greater than or equal to thresh for the matched groups. |
Value
The ratio of the p-value and the threshold, or 0 if the p-value is less than the threshold.
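The return convention shared by the halting tests can be illustrated with a small stand-in. The function fisher_ratio_sketch below is hypothetical: it assumes a single categorical covariate in the first column and uses stats::fisher.test directly, which may differ from how f_halt is implemented.

```r
# Sketch of a halting test with the signature
# halting_test(condition, covariates, thresh).
fisher_ratio_sketch <- function(condition, covariates, thresh) {
  p <- stats::fisher.test(table(condition, covariates[, 1]))$p.value
  # Ratio of p-value to threshold, or 0 when the p-value is below it.
  if (p < thresh) 0 else p / thresh
}

condition <- factor(rep(c("A", "B"), each = 4))
# Same category counts in both groups.
balanced <- matrix(c(0, 0, 1, 1, 0, 0, 1, 1), ncol = 1)
# Categories perfectly separate the groups.
separated <- matrix(c(0, 0, 0, 0, 1, 1, 1, 1), ncol = 1)

ratio_balanced <- fisher_ratio_sketch(condition, balanced, 0.2)
ratio_separated <- fisher_ratio_sketch(condition, separated, 0.2)
```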
Gets value for ldamatch global parameter.
Description
Gets value for ldamatch global parameter.
Usage
get_param(name)
Arguments
name |
The name of the global parameter. |
Value
The value of the global parameter.
See Also
set_param
for parameter names.
A univariate halting test using the Kolmogorov-Smirnov Test, which must be satisfied for all condition pairs.
Description
The condition must have two levels.
Usage
ks_halt(condition, covariates, thresh)
Arguments
condition |
A factor vector containing condition labels. |
covariates |
A columnwise matrix containing covariates to match the conditions on. |
thresh |
The return value of halting_test has to be greater than or equal to thresh for the matched groups. |
Details
Note that, unlike many tests, the null hypothesis is that the two samples are drawn from the same distribution.
Warnings such as "cannot compute exact p-value with ties" are suppressed.
Value
The ratio of the p-value and the threshold, or 0 if the p-value is less than the threshold. If there are more than two conditions, it returns the smallest value found for any condition pair.
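The "smallest value found for any condition pair" rule can be sketched with utils::combn and stats::ks.test. The function pairwise_ks_sketch is a hypothetical illustration that assumes one numeric covariate in the first column; it is not the package's ks_halt.

```r
pairwise_ks_sketch <- function(condition, covariates, thresh) {
  # All unordered pairs of condition levels.
  pairs <- utils::combn(levels(condition), 2, simplify = FALSE)
  ratios <- vapply(pairs, function(pair) {
    x <- covariates[condition == pair[1], 1]
    y <- covariates[condition == pair[2], 1]
    # Warnings about ties are suppressed, as described above.
    p <- suppressWarnings(stats::ks.test(x, y)$p.value)
    if (p < thresh) 0 else p / thresh
  }, numeric(1))
  # The weakest pair determines the result.
  min(ratios)
}

condition <- factor(rep(c("A", "B", "C"), each = 10))
covariates <- matrix(c(1:10, 1:10 + 0.5, 101:110), ncol = 1)

# Group C is far from A and B, so the three-group result is 0.
three_groups <- pairwise_ks_sketch(condition, covariates, 0.2)

# Restricting to the two similar groups gives a positive ratio.
keep <- condition != "C"
two_groups <- pairwise_ks_sketch(droplevels(condition[keep]),
                                 covariates[keep, , drop = FALSE], 0.2)
```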
A univariate halting test using Levene's test.
Description
Warnings such as "ANOVA F-tests on an essentially perfect fit are unreliable" are suppressed.
Usage
l_halt(condition, covariates, thresh)
Arguments
condition |
A factor vector containing condition labels. |
covariates |
A columnwise matrix containing covariates to match the conditions on. |
thresh |
The return value of halting_test has to be greater than or equal to thresh for the matched groups. |
Value
The ratio of the p-value and the threshold, or 0 if the p-value is less than the threshold.
Creates a matched group via backward selection.
Description
Creates a matched group via backward selection.
Usage
match_groups(
condition,
covariates,
halting_test,
thresh = 0.2,
method = ldamatch::matching_methods,
props = prop.table(table(condition)),
replicates = get("RND_DEFAULT_REPLICATES", .ldamatch_globals),
min_preserved = length(levels(condition)),
print_info = get("PRINT_INFO", .ldamatch_globals),
max_removed_per_cond = NULL,
tiebreaker = NULL,
lookahead = 2,
all_results = FALSE,
prefer_test = TRUE,
max_removed_per_step = 1,
max_removed_percent_per_step = 0.5,
ratio_for_slowdown = 0.5
)
Arguments
condition |
A factor vector containing condition labels. |
covariates |
A columnwise matrix containing covariates to match the conditions on. |
halting_test |
A function to apply to 'covariates' (in matrix form)
which is TRUE iff the conditions are matched.
Signature: halting_test(condition, covariates, thresh).
The following halting tests are part of this package: t_halt, U_halt, l_halt, ad_halt, ks_halt, wilks_halt, f_halt.
|
thresh |
The return value of halting_test has to be greater than or equal to thresh for the matched groups. |
method |
The choice of search method, one of "random", "heuristic2", "heuristic3", "heuristic4", or "exhaustive".
You can get more information about each method on the help page for "search_<method_name>" (e.g. "search_exhaustive").
|
props |
Either the desired proportions (percentage) of the sample for each condition as a named vector, or the names of the conditions for which we prefer to preserve the subjects, in decreasing order of preference. If not specified, the (full) sample proportions are used. Among configurations that preserve the same number of subjects, the exhaustive search prefers the one closest to these proportions; the other methods take them into account to some extent. For example, c(A = 0.4, B = 0.4, C = 0.2) means that we would like the number of subjects in groups A, B, and C to be around 40%, 40%, and 20% of the total number of subjects, respectively, whereas c("A", "B", "C") means that, if possible, we would like to keep all subjects in group A, and prefer keeping subjects in B even if it results in losing more subjects from C. |
replicates |
The maximum number of random replications to be performed. This is only used for the "random" method. |
min_preserved |
The minimum number of preserved subjects. It can be used to ensure that the search will not take forever to run, but instead fail when a solution is not found when preserving this number of subjects. |
print_info |
If TRUE, prints summary information on the input and the results, as well as progress information for the exhaustive search and random algorithms. Default: TRUE; can be changed using set_param("PRINT_INFO", value).
|
max_removed_per_cond |
A named integer vector, containing the maximum number of subjects that can be removed from each group. Specify 0 for groups if you want to preserve all of their subjects. If you do not specify a value for a group, it defaults to 2 less than the group size. Values outside the valid range of 0..(N-1) (where N is the number of subjects in the group) are corrected without a warning. |
tiebreaker |
NULL, or a function similar to halting_test, used to decide between cases for which halting_test yields equal values. |
lookahead |
The lookahead to use: a positive integer. It is used by the heuristic3 and heuristic4 algorithms, with a default of 2. The running time is O(N ^ lookahead), where N is the number of subjects. |
all_results |
If TRUE, returns all results found by method in a list. (A list is returned even if there is only one result.) If FALSE (the default), it returns the first result (a logical vector). |
prefer_test |
If TRUE, prefers higher test statistic more than the expected group size proportion; default is TRUE. Used by all algorithms except exhaustive, which always |
max_removed_per_step |
The number of equivalent subjects that can be removed in each step. (The actual allowed number may be less depending on the p-value / threshold ratio.) This parameter is used by the heuristic3 and heuristic4 algorithms, with a default value of 1. |
max_removed_percent_per_step |
The percentage of remaining subjects that can be removed in each step. Used when max_removed_per_step > 1, with a default value of 0.5. |
ratio_for_slowdown |
The p-value / threshold ratio at which it starts removing subjects one by one. Used when max_removed_per_step > 1, with a default value of 0.5. |
Details
The exhaustive, heuristic3, and heuristic4 search methods use the foreach
package to parallelize computation.
To take advantage of this, you must register a cluster.
For example, to use all but one of the CPU cores, run:
doParallel::registerDoParallel(cores = max(1, parallel::detectCores() - 1))
To use sequential processing without getting a warning, run:
foreach::registerDoSEQ()
Value
A logical vector that contains TRUE for the subjects that are in the matched groups; or if all_results = TRUE, a list of such vectors.
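A minimal usage sketch follows, based only on the signature shown above. The simulated data, the seed, and the choice of method = "heuristic3" with t_halt are assumptions made for illustration; the call is guarded so that the snippet does nothing when ldamatch is not installed, and a convergence failure is reported as a message instead of stopping the script.

```r
result <- NULL
if (requireNamespace("ldamatch", quietly = TRUE)) {
  # Sequential processing, to avoid the warning about a missing backend.
  foreach::registerDoSEQ()
  set.seed(1)
  # Simulated data: two groups whose covariate means differ slightly.
  condition <- factor(c(rep("A", 20), rep("B", 30)))
  covariates <- matrix(c(rnorm(20, mean = 0), rnorm(30, mean = 0.5)), ncol = 1)
  result <- tryCatch(
    ldamatch::match_groups(condition, covariates, ldamatch::t_halt,
                           thresh = 0.2, method = "heuristic3",
                           print_info = FALSE),
    error = function(e) {
      message("Matching failed: ", conditionMessage(e))
      NULL
    }
  )
  # Cross-tabulate group membership against inclusion in the match.
  if (!is.null(result)) print(table(condition, result))
}
```

With all_results = FALSE (the default), the result is a single logical vector, so it can be used directly to subset the original data.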
See Also
calc_p_value for calculating the test statistic for a group setup.
calc_metrics for calculating multiple metrics about the goodness of the result.
compare_ldamatch_outputs for comparing multiple different results from this function.
search_heuristic2, search_heuristic3, search_heuristic4, search_random, search_exhaustive for more information about each search method.
The available methods for matching.
Description
The available methods for matching.
Usage
matching_methods
Format
An object of class character of length 5.
The available nondeterministic methods for matching.
Description
The available nondeterministic methods for matching.
Usage
nondeterministic_matching_methods
Format
An object of class character of length 3.
The available parallelized methods for matching.
Description
The available parallelized methods for matching.
Usage
parallelized_matching_methods
Format
An object of class character of length 3.
Searches the space backwards, preferring more subjects and certain group size proportions.
Description
Searches the space backwards, preferring more subjects and certain group size proportions.
Usage
search_exhaustive(
condition,
covariates,
halting_test,
thresh,
props,
max_removed_per_cond,
tiebreaker = NULL,
min_preserved = length(levels(condition)),
print_info = TRUE,
given_args = NULL,
...
)
Arguments
condition |
A factor vector containing condition labels. |
covariates |
A columnwise matrix containing covariates to match the conditions on. |
halting_test |
A function to apply to 'covariates' (in matrix form)
which is TRUE iff the conditions are matched.
Signature: halting_test(condition, covariates, thresh).
The following halting tests are part of this package: t_halt, U_halt, l_halt, ad_halt, ks_halt, wilks_halt, f_halt.
|
thresh |
The return value of halting_test has to be greater than or equal to thresh for the matched groups. |
props |
Either the desired proportions (percentage) of the sample for each condition as a named vector, or the names of the conditions for which we prefer to preserve the subjects, in decreasing order of preference. If not specified, the (full) sample proportions are used. Among configurations that preserve the same number of subjects, the exhaustive search prefers the one closest to these proportions; the other methods take them into account to some extent. For example, c(A = 0.4, B = 0.4, C = 0.2) means that we would like the number of subjects in groups A, B, and C to be around 40%, 40%, and 20% of the total number of subjects, respectively, whereas c("A", "B", "C") means that, if possible, we would like to keep all subjects in group A, and prefer keeping subjects in B even if it results in losing more subjects from C. |
max_removed_per_cond |
The maximum number of subjects that can be removed from each group. It must have a valid number for each group. |
tiebreaker |
NULL, or a function similar to halting_test, used to decide between cases for which halting_test yields equal values. |
min_preserved |
The minimum number of preserved subjects. It can be used to ensure that the search will not take forever to run, but instead fail when a solution is not found when preserving this number of subjects. |
print_info |
If TRUE, prints summary information on the input and the results, as well as progress information for the exhaustive search and random algorithms. Default: TRUE; can be changed using set_param("PRINT_INFO", value).
|
given_args |
The names of arguments given to the search function. |
... |
Consumes extra parameters that are not used by the search algorithm at hand; this function gives a warning about the ones whose value is not NULL that their value is not used. |
Details
While the search is done in parallel, the search space is enormous and so it can be very slow in the worst case. It is perhaps most useful as a tool to study other matching procedures.
You can calculate the maximum possible number of cases to evaluate by calling estimate_exhaustive().
Value
All results found by search method in a list. It raises a "Convergence failure" error if it cannot find a matched set.
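The backward search order (try removing nobody, then every single subject, then every pair, and so on) can be sketched sequentially in base R. Both t_ratio and exhaustive_sketch are hypothetical helpers written for this illustration; the sketch ignores props, max_removed_per_cond, tiebreakers, and parallelization, so it only conveys the general idea.

```r
# Hypothetical halting test: Welch t-test p-value relative to the threshold.
t_ratio <- function(condition, covariates, thresh) {
  p <- stats::t.test(covariates[, 1] ~ condition)$p.value
  if (p < thresh) 0 else p / thresh
}

exhaustive_sketch <- function(condition, covariates, halting_test,
                              thresh, max_removed) {
  n <- length(condition)
  for (k in 0:max_removed) {
    # All ways of removing exactly k subjects.
    removal_sets <- if (k == 0) list(integer(0)) else
      utils::combn(n, k, simplify = FALSE)
    hits <- list()
    for (removed in removal_sets) {
      is.in <- !(seq_len(n) %in% removed)
      ratio <- halting_test(condition[is.in],
                            covariates[is.in, , drop = FALSE], thresh)
      if (ratio > 0) hits[[length(hits) + 1]] <- is.in
    }
    # Stop at the smallest k for which any matched set exists.
    if (length(hits) > 0) return(hits)
  }
  stop("Convergence failure")
}

condition <- factor(rep(c("A", "B"), each = 6))
covariates <- matrix(c(1:6, 3:8), ncol = 1)
results <- exhaustive_sketch(condition, covariates, t_ratio,
                             thresh = 0.2, max_removed = 2)
```

Every element of results preserves the same number of subjects, which is why a further criterion such as group size proportions is needed to choose among them.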
OBSOLETE: Finds matching using depth-first search recursively.
Description
Please use the heuristic3 search algorithm with lookahead=1 instead for nearly equivalent results. Note that heuristic3 is parallelized, more memory efficient, and chooses the subject to remove randomly from among equivalent choices instead of deterministically choosing the first one. This function is implemented recursively, so it may run out of memory when applied to many subjects.
Usage
search_heuristic2(
condition,
covariates,
halting_test,
thresh,
props,
max_removed_per_cond,
tiebreaker = NULL,
prefer_test = TRUE,
print_info = TRUE,
given_args = NULL,
...
)
Arguments
condition |
A factor vector containing condition labels. |
covariates |
A columnwise matrix containing covariates to match the conditions on. |
halting_test |
A function to apply to 'covariates' (in matrix form)
which is TRUE iff the conditions are matched.
Signature: halting_test(condition, covariates, thresh).
The following halting tests are part of this package: t_halt, U_halt, l_halt, ad_halt, ks_halt, wilks_halt, f_halt.
|
thresh |
The return value of halting_test has to be greater than or equal to thresh for the matched groups. |
props |
Either the desired proportions (percentage) of the sample for each condition as a named vector, or the names of the conditions for which we prefer to preserve the subjects, in decreasing order of preference. If not specified, the (full) sample proportions are used. Among configurations that preserve the same number of subjects, the exhaustive search prefers the one closest to these proportions; the other methods take them into account to some extent. For example, c(A = 0.4, B = 0.4, C = 0.2) means that we would like the number of subjects in groups A, B, and C to be around 40%, 40%, and 20% of the total number of subjects, respectively, whereas c("A", "B", "C") means that, if possible, we would like to keep all subjects in group A, and prefer keeping subjects in B even if it results in losing more subjects from C. |
max_removed_per_cond |
The maximum number of subjects that can be removed from each group. It must have a valid number for each group. |
tiebreaker |
NULL, or a function similar to halting_test, used to decide between cases for which halting_test yields equal values. |
prefer_test |
If TRUE, prefers higher test statistic more than the expected group size proportion; default is TRUE. Used by all algorithms except exhaustive, which always |
print_info |
If TRUE, prints summary information on the input and the results, as well as progress information for the exhaustive search and random algorithms. Default: TRUE; can be changed using set_param("PRINT_INFO", value).
|
given_args |
The names of arguments given to the search function. |
... |
Consumes extra parameters that are not used by the search algorithm at hand; this function gives a warning about the ones whose value is not NULL that their value is not used. |
Details
In each step, it removes one subject from the set of subjects with the smallest p-value recursively.
Value
All results found by search method in a list. It raises a "Convergence failure" error if it cannot find a matched set.
Finds matching using depth-first search, looking ahead n steps.
Description
In each step, it removes one subject from the set of subjects with the smallest associated p-value after "lookahead" steps.
Usage
search_heuristic3(
condition,
covariates,
halting_test,
thresh,
props,
max_removed_per_cond,
tiebreaker = NULL,
min_preserved = length(levels(condition)),
lookahead = 2,
prefer_test = TRUE,
print_info = TRUE,
max_removed_per_step = 1,
max_removed_percent_per_step = 0.5,
ratio_for_slowdown = 0.5,
given_args = NULL,
...
)
Arguments
condition |
A factor vector containing condition labels. |
covariates |
A columnwise matrix containing covariates to match the conditions on. |
halting_test |
A function to apply to 'covariates' (in matrix form)
which is TRUE iff the conditions are matched.
Signature: halting_test(condition, covariates, thresh).
The following halting tests are part of this package: t_halt, U_halt, l_halt, ad_halt, ks_halt, wilks_halt, f_halt.
|
thresh |
The return value of halting_test has to be greater than or equal to thresh for the matched groups. |
props |
Either the desired proportions (percentage) of the sample for each condition as a named vector, or the names of the conditions for which we prefer to preserve the subjects, in decreasing order of preference. If not specified, the (full) sample proportions are used. Among configurations that preserve the same number of subjects, the exhaustive search prefers the one closest to these proportions; the other methods take them into account to some extent. For example, c(A = 0.4, B = 0.4, C = 0.2) means that we would like the number of subjects in groups A, B, and C to be around 40%, 40%, and 20% of the total number of subjects, respectively, whereas c("A", "B", "C") means that, if possible, we would like to keep all subjects in group A, and prefer keeping subjects in B even if it results in losing more subjects from C. |
max_removed_per_cond |
The maximum number of subjects that can be removed from each group. It must have a valid number for each group. |
tiebreaker |
NULL, or a function similar to halting_test, used to decide between cases for which halting_test yields equal values. |
min_preserved |
The minimum number of preserved subjects. It can be used to ensure that the search will not take forever to run, but instead fail when a solution is not found when preserving this number of subjects. |
lookahead |
The lookahead to use: a positive integer. It is used by the heuristic3 and heuristic4 algorithms, with a default of 2. The running time is O(N ^ lookahead), where N is the number of subjects. |
prefer_test |
If TRUE, prefers higher test statistic more than the expected group size proportion; default is TRUE. Used by all algorithms except exhaustive, which always |
print_info |
If TRUE, prints summary information on the input and the results, as well as progress information for the exhaustive search and random algorithms. Default: TRUE; can be changed using set_param("PRINT_INFO", value).
|
max_removed_per_step |
The number of equivalent subjects that can be removed in each step. (The actual allowed number may be less depending on the p-value / threshold ratio.) This parameter is used by the heuristic3 and heuristic4 algorithms, with a default value of 1. |
max_removed_percent_per_step |
The percentage of remaining subjects that can be removed in each step. Used when max_removed_per_step > 1, with a default value of 0.5. |
ratio_for_slowdown |
The p-value / threshold ratio at which it starts removing subjects one by one. Used when max_removed_per_step > 1, with a default value of 0.5. |
given_args |
The names of arguments given to the search function. |
... |
Consumes extra parameters that are not used by the search algorithm at hand; this function gives a warning about the ones whose value is not NULL that their value is not used. |
Details
Note that this algorithm is not deterministic, as it chooses one possible path randomly when there are multiple apparently equivalent ones. In practice this means that it may return different results on different runs (including the case that it fails to converge to a solution in one run, but converges in another run). If print_info = TRUE (the default), you will see a message about "Random choices" if the algorithm needed to make random path choices.
Value
All results found by search method in a list. It raises a "Convergence failure" error if it cannot find a matched set.
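For intuition, a greedy backward selection with a lookahead of one step can be sketched as below. This is an interpretation made for illustration, not the package's algorithm: welch_p and greedy_sketch are hypothetical names, the sketch removes whichever subject's removal yields the highest p-value, and it breaks ties deterministically (first candidate) where the package is described as choosing randomly.

```r
# Hypothetical p-value function: Welch t-test on the first covariate.
welch_p <- function(condition, covariates) {
  stats::t.test(covariates[, 1] ~ condition)$p.value
}

greedy_sketch <- function(condition, covariates, p_value, thresh, max_steps) {
  is.in <- rep(TRUE, length(condition))
  matched <- function(keep) {
    p_value(condition[keep], covariates[keep, , drop = FALSE]) >= thresh
  }
  for (step in seq_len(max_steps)) {
    if (matched(is.in)) return(is.in)
    candidates <- which(is.in)
    # p-value obtained after removing each remaining subject in turn.
    p_after <- vapply(candidates, function(i) {
      trial <- is.in
      trial[i] <- FALSE
      p_value(condition[trial], covariates[trial, , drop = FALSE])
    }, numeric(1))
    # Remove the subject whose removal helps the most.
    is.in[candidates[which.max(p_after)]] <- FALSE
  }
  if (matched(is.in)) return(is.in)
  stop("Convergence failure")
}

condition <- factor(rep(c("A", "B"), each = 6))
covariates <- matrix(c(1:6, 3:8), ncol = 1)
greedy_is_in <- greedy_sketch(condition, covariates, welch_p,
                              thresh = 0.2, max_steps = 3)
```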
Finds matching using depth-first search, looking ahead n steps.
Description
In each step, it removes one subject from the set of subjects that were removed on most paths after "lookahead" steps, preferring one with the smallest associated p-value.
Usage
search_heuristic4(
condition,
covariates,
halting_test,
thresh,
props,
max_removed_per_cond,
tiebreaker = NULL,
min_preserved = length(levels(condition)),
lookahead = 2,
prefer_test = TRUE,
print_info = TRUE,
max_removed_per_step = 1,
max_removed_percent_per_step = 0.5,
ratio_for_slowdown = 0.5,
given_args = NULL,
...
)
Arguments
condition |
A factor vector containing condition labels. |
covariates |
A columnwise matrix containing covariates to match the conditions on. |
halting_test |
A function to apply to 'covariates' (in matrix form)
which is TRUE iff the conditions are matched.
Signature: halting_test(condition, covariates, thresh).
The following halting tests are part of this package: t_halt, U_halt, l_halt, ad_halt, ks_halt, wilks_halt, f_halt.
|
thresh |
The return value of halting_test has to be greater than or equal to thresh for the matched groups. |
props |
Either the desired proportions (percentage) of the sample for each condition as a named vector, or the names of the conditions for which we prefer to preserve the subjects, in decreasing order of preference. If not specified, the (full) sample proportions are used. Among configurations that preserve the same number of subjects, the exhaustive search prefers the one closest to these proportions; the other methods take them into account to some extent. For example, c(A = 0.4, B = 0.4, C = 0.2) means that we would like the number of subjects in groups A, B, and C to be around 40%, 40%, and 20% of the total number of subjects, respectively, whereas c("A", "B", "C") means that, if possible, we would like to keep all subjects in group A, and prefer keeping subjects in B even if it results in losing more subjects from C. |
max_removed_per_cond |
The maximum number of subjects that can be removed from each group. It must have a valid number for each group. |
tiebreaker |
NULL, or a function similar to halting_test, used to decide between cases for which halting_test yields equal values. |
min_preserved |
The minimum number of preserved subjects. It can be used to ensure that the search will not take forever to run, but instead fail when a solution is not found when preserving this number of subjects. |
lookahead |
The lookahead to use: a positive integer. It is used by the heuristic3 and heuristic4 algorithms, with a default of 2. The running time is O(N ^ lookahead), where N is the number of subjects. |
prefer_test |
If TRUE, prefers higher test statistic more than the expected group size proportion; default is TRUE. Used by all algorithms except exhaustive, which always |
print_info |
If TRUE, prints summary information on the input and the results, as well as progress information for the exhaustive search and random algorithms. Default: TRUE; can be changed using set_param("PRINT_INFO", value).
|
max_removed_per_step |
The number of equivalent subjects that can be removed in each step. (The actual allowed number may be less depending on the p-value / threshold ratio.) This parameter is used by the heuristic3 and heuristic4 algorithms, with a default value of 1. |
max_removed_percent_per_step |
The percentage of remaining subjects that can be removed in each step. Used when max_removed_per_step > 1, with a default value of 0.5. |
ratio_for_slowdown |
The p-value / threshold ratio at which it starts removing subjects one by one. Used when max_removed_per_step > 1, with a default value of 0.5. |
given_args |
The names of arguments given to the search function. |
... |
Consumes extra parameters that are not used by the search algorithm at hand; this function gives a warning about the ones whose value is not NULL that their value is not used. |
Details
Note that this algorithm is not deterministic, as it chooses one possible subject for removal randomly when there are multiple apparently equivalent ones. In practice it means that it may return different results on different runs (including the case that it fails to converge to a solution in one run, but converges in another run). If print_info = TRUE (the default), you will see a message about "Random choices" if the algorithm needed to make such random decisions.
Value
All results found by search method in a list. It raises a "Convergence failure" error if it cannot find a matched set.
Searches by randomly selecting subspaces with decreasing expected size.
Description
Searches by randomly selecting subspaces with decreasing expected size.
Usage
search_random(
condition,
covariates,
halting_test,
thresh,
props,
max_removed_per_cond,
tiebreaker = NULL,
replicates,
prefer_test = TRUE,
print_info = TRUE,
given_args = NULL,
...
)
Arguments
condition |
A factor vector containing condition labels. |
covariates |
A columnwise matrix containing covariates to match the conditions on. |
halting_test |
A function to apply to 'covariates' (in matrix form)
which is TRUE iff the conditions are matched.
Signature: halting_test(condition, covariates, thresh).
The following halting tests are part of this package: t_halt, U_halt, l_halt, ad_halt, ks_halt, wilks_halt, f_halt.
|
thresh |
The return value of halting_test has to be greater than or equal to thresh for the matched groups. |
props |
Either the desired proportions (percentage) of the sample for each condition as a named vector, or the names of the conditions for which we prefer to preserve the subjects, in decreasing order of preference. If not specified, the (full) sample proportions are used. Among configurations that preserve the same number of subjects, the exhaustive search prefers the one closest to these proportions; the other methods take them into account to some extent. For example, c(A = 0.4, B = 0.4, C = 0.2) means that we would like the number of subjects in groups A, B, and C to be around 40%, 40%, and 20% of the total number of subjects, respectively, whereas c("A", "B", "C") means that, if possible, we would like to keep all subjects in group A, and prefer keeping subjects in B even if it results in losing more subjects from C. |
max_removed_per_cond |
The maximum number of subjects that can be removed from each group. It must have a valid number for each group. |
tiebreaker |
NULL, or a function similar to halting_test, used to decide between cases for which halting_test yields equal values. |
replicates |
The maximum number of random replications to be performed. This is only used for the "random" method. |
prefer_test |
If TRUE, prefers higher test statistic more than the expected group size proportion; default is TRUE. Used by all algorithms except exhaustive, which always |
print_info |
If TRUE, prints summary information on the input and the results, as well as progress information for the exhaustive search and random algorithms. Default: TRUE; can be changed using set_param("PRINT_INFO", value).
|
given_args |
The names of arguments given to the search function. |
... |
Consumes extra parameters that are not used by the search algorithm at hand; this function gives a warning about the ones whose value is not NULL that their value is not used. |
Value
All results found by search method in a list. It raises a "Convergence failure" error if it cannot find a matched set.
Sets value for ldamatch global parameter.
Description
Sets value for ldamatch global parameter.
Usage
set_param(name, value)
Arguments
name |
The name of the global parameter. |
value |
The new value of the global parameter. |
Details
The names of the available parameters:
- RND_DEFAULT_REPLICATES
random search: default number of replicates
- Anderson-Darling test parameters (see kSamples::ad.test for an explanation):
- AD_METHOD
the method parameter for ad.test; default: asymptotic
- AD_NSIM
the Nsim parameter for ad.test, used when AD_METHOD is 'simulated'; default: 10000
- AD_VERSION
1 or 2 for the two versions of the test statistic; default: 1
- PRINT_INFO
print summary information, and progress information for the exhaustive search algorithm
- PRINT_PROGRESS
whether to print progress information about parallel processing of cases
- PROCESSED_CHUNK_SIZE
the number of cases to be retrieved at a time from iterators for parallel processing
Value
The previous value of the global parameter.
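Returning the previous value makes it easy to change a setting temporarily and restore it afterwards. The base R sketch below mimics that pattern with a private environment; sketch_globals, get_param_sketch, and set_param_sketch are hypothetical stand-ins, not the package's internals.

```r
# Private environment holding the settings (stand-in for package globals).
sketch_globals <- new.env()
assign("PRINT_INFO", TRUE, envir = sketch_globals)

get_param_sketch <- function(name) get(name, envir = sketch_globals)

set_param_sketch <- function(name, value) {
  old <- get(name, envir = sketch_globals)
  assign(name, value, envir = sketch_globals)
  # Return the previous value so that the caller can restore it later.
  invisible(old)
}

old_value <- set_param_sketch("PRINT_INFO", FALSE)  # switch off temporarily
during <- get_param_sketch("PRINT_INFO")
set_param_sketch("PRINT_INFO", old_value)           # restore
after <- get_param_sketch("PRINT_INFO")
```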
See Also
get_param
for retrieving the current value of a
parameter.
A univariate halting test using the t-test, which must be satisfied for all condition pairs.
Description
A univariate halting test using the t-test, which must be satisfied for all condition pairs.
Usage
t_halt(condition, covariates, thresh)
Arguments
condition |
A factor vector containing condition labels. |
covariates |
A columnwise matrix containing covariates to match the conditions on. |
thresh |
The return value of halting_test has to be greater than or equal to thresh for the matched groups. |
Value
The ratio of the p-value and the threshold, or 0 if the p-value is less than the threshold. If there are more than two conditions, it returns the smallest value found for any condition pair.
A multivariate halting test appropriate for more than two condition levels.
Description
A multivariate halting test appropriate for more than two condition levels.
Usage
wilks_halt(condition, covariates, thresh)
Arguments
condition |
A factor vector containing condition labels. |
covariates |
A columnwise matrix containing covariates to match the conditions on. |
thresh |
The return value of halting_test has to be greater than or equal to thresh for the matched groups. |
Value
The ratio of the p-value and the threshold, or 0 if the p-value is less than the threshold.
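A multivariate test of this kind can be sketched with stats::manova and the Wilks statistic. The function wilks_ratio_sketch is a hypothetical illustration under the assumption that the p-value comes from summary(..., test = "Wilks"); the package's wilks_halt may compute it differently. The data are simulated.

```r
wilks_ratio_sketch <- function(condition, covariates, thresh) {
  fit <- stats::manova(covariates ~ condition)
  # p-value of the Wilks' lambda test for the condition term.
  p <- summary(fit, test = "Wilks")$stats["condition", "Pr(>F)"]
  if (p < thresh) 0 else p / thresh
}

set.seed(42)
# Three groups of ten subjects with two covariates each.
condition <- factor(rep(c("A", "B", "C"), each = 10))
covariates <- matrix(rnorm(60), ncol = 2)
wilks_ratio <- wilks_ratio_sketch(condition, covariates, thresh = 0.2)
```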