Measures of Homogeneity

Count Data

For count data, \(\chi^2\) can be estimated using the Cressie-Read power-divergence statistic for each surface-subsurface pair:

CR(x[,1:2])
#> [1] 4.165364
CR(x[,3:4])
#> [1] 111.5364

Then, to obtain a measure of effect size, the bias-corrected estimation of Cramér’s \(V\), denoted \(V_B\), allows for cross-comparability of the degree of homogeneity:

VB(x[,1:2])
#> [1] 0.03878439
VB(x[,3:4])
#> [1] 0.8551924

The VB() function by default uses CR() with a parameter of lambda = 2/3 to estimate \(\chi^2\); if one wants to use Pearson’s method, one sets lambda = 1 in the input:

# Pearson's chi-squared
VB(x[,1:2], lambda = 1)
#> [1] 0.03588807
VB(x[,3:4], lambda = 1)
#> [1] 0.8394991

It is tedious to re-run effect sizes between selected columns, and so the VB_pair() function returns estimates of \(V_B\) pairwise between all columns:

x_effect_sizes <- VB_pair(x)
x_effect_sizes
#>               surface1 subsurface1  surface2 subsurface2
#> surface1    0.00000000  0.03878439 0.5152189   0.6407815
#> subsurface1 0.03878439  0.00000000 0.4872673   0.7486347
#> surface2    0.51521892  0.48726730 0.0000000   0.8551924
#> subsurface2 0.64078148  0.74863466 0.8551924   0.0000000
#> attr(,"class")
#> [1] "effect_sizes" "matrix"

To assess whether the effect sizes between pairs of related contexts are lower (i.e., more homogenous) than those between pairs of unrelated contexts (e.g, between surface1 and surface2, or subsurface1 and surface2), the function homogeneity() is used. The related pairs must be specified, in either a matrix or data frame. The unrelated pairs may be similarly specified, or, not entered into the function, are automatically determined as all pairs which are not contained in the related dataset.

Note that the homogeneity() function takes the results of the pairwise function VB_pair(), not the original matrix x (effect sizes can be estimated via other means, as will be shown below for presence/absence data):

# related pairs
W_contexts <- matrix(c("surface1", "surface2", "subsurface1", "subsurface2"), ncol = 2)
W_contexts
#>      [,1]       [,2]         
#> [1,] "surface1" "subsurface1"
#> [2,] "surface2" "subsurface2"

# homogeneity
x_homogeneity <- homogeneity(x_effect_sizes, related = W_contexts)

x_homogeneity
#> 
#>  Homogeneity between Related Contexts:
#>      Number of related pairs: 4 
#>      Number of unrelated pairs: 8 
#>      Mean quantile: 0.5

The results of homogeneity() comprise a list object containing:

n: a vector containing the number of related and unrelated pairs of contexts
U: the effect sizes between unrelated pairs of contexts
W: the effect sizes between related pairs of contexts
Q: for each related pair, the quantile indicating the proportion of unrelated pairs which are less homogenous. Higher values of Q indicate that the pair is more homogenous in comparison.
D: the distribution of differences among all U and W. The proportion of D greater than zero is equivalent to the mean of all quantile estimates Q.

The results of this example show that the surface1/subsurface1 pairing has the most homogeneous distribution compared to unrelated pairs, whereas the surface2/subsurface2 pairing is less homogeneous than even unrelated pairs of contexts. By taking the mean quantile (or proportion of differences \(D\) greater than zero), one can see that the matter of representativeness of surface/subsurface pairings is evenly split, with the related surface/subsurface pairs in their totality being no more or less homogenous than those of unrelated pairs of contexts.

Presence/Absence Data

In many situations one may not have counts, but only information on the presence/absence of an artifact type. In such a case, one can draft a \(2 \times 2\) contingency table for two contexts, giving the count of how many types are present in both, missing in one or the other, or missing in both:

	Present	Absent
Present	(present in both)	(present in 1 but not 2)
Absent	(present in 2 but not 1)	(absent from both)

Homogeneity in such a situation can be evaluated using the conventional log odds ratio, or also by giving the trace of the \(2 \times 2\) presence/absence contingency table. The log odds ratio will be affected by the degree to which types are both present or both absent (i.e., comparing contexts with themselves will yield different log odds ratios, if they happen to be missing more or less types – see the diagonal of the pairwise log odds ratio matrix below). The trace coefficient will not abe affected by this tendency. arkhaia has functionality for evaluating both log odds ratios and the trace, pairwise, for a contingency table:

x_logOR <- log_OR_pair(x)
x_logOR
#>               surface1 subsurface1   surface2 subsurface2
#> surface1     3.2958369   1.0986123  1.9459101  -0.2513144
#> subsurface1  1.0986123   2.3978953  0.3364722   1.0986123
#> surface2     1.9459101   0.3364722  3.5553481  -1.0986123
#> subsurface2 -0.2513144   1.0986123 -1.0986123   3.2958369
#> attr(,"class")
#> [1] "effect_sizes" "matrix"

x_trace <- trace_pair(x)
x_trace
#>             surface1 subsurface1 surface2 subsurface2
#> surface1           5           4        4           3
#> subsurface1        4           5        3           4
#> surface2           4           3        5           2
#> subsurface2        3           4        2           5
#> attr(,"class")
#> [1] "effect_sizes" "matrix"

The log_OR_pair() function adds 0.5 to all cells to avoid zero entries (see documentation).

For presence/absence data, it can be emphasized that a higher effect size indicates greater homogeneity (the opposite of Cramér’s \(V\) for count data). Hence, the homogeneity() has a direction argument, such that the quantiles and differences are reversed:

# homogeneity
x_homogeneity_logOR <- homogeneity(x_logOR, related = W_contexts, direction = "WU")
x_homogeneity_logOR
#> 
#>  Homogeneity between Related Contexts:
#>      Number of related pairs: 4 
#>      Number of unrelated pairs: 8 
#>      Mean quantile: 0.25

x_homogeneity_trace <- homogeneity(x_trace, related = W_contexts, direction = "WU")
x_homogeneity_trace
#> 
#>  Homogeneity between Related Contexts:
#>      Number of related pairs: 4 
#>      Number of unrelated pairs: 8 
#>      Mean quantile: 0.25

The results of measuring homogeneity on the basis of presence/absence data can be at odds with those based in count data: consider deposits which may have greater diversity of types and more sporadic finds, but in exceedingly low counts, such as with residual material; by treating the presence of a type as a binary variate, there is no room for quantifying the magnitude of the presence of that type. Evaluating both count and presence/absence statistics will bear out such a distinction, which may not be immediately clear in the case of larger datasets.

Measures of Homogeneity for Depositional Contexts

Introduction

Measures of Homogeneity

Count Data

Presence/Absence Data