This vignette illustrates functionality for measuring homogeneity between depositional contexts on the basis of artifact assemblages, either as counts of types or the presence absence of types. For more details, see the paper “Evaluating the Relationship between Surface, Subsurface, and Stratigraphic Assemblages” (under review).
Say one has two pairs of assemblages, each comprising a surface and subsurface context, of counts of artifacts of type A, B, C, D, and E:
x1 <- c(2, 0, 10, 11, 5)
x2 <- c(1, 1, 17, 23, 3)
x3 <- c(0, 0, 2, 81, 11)
x4 <- c(3, 18, 9, 0, 23)
x <- matrix(c(x1, x2, x3, x4), ncol = 4)
colnames(x) <- c("surface1", "subsurface1", "surface2", "subsurface2")
rownames(x) <- LETTERS[1:5]
x
#> surface1 subsurface1 surface2 subsurface2
#> A 2 1 0 3
#> B 0 1 0 18
#> C 10 17 2 9
#> D 11 23 81 0
#> E 5 3 11 23One would like to know if the surface assemblages reflect their
respective subsurface assemblages. This problem can be approached by
asking if the pairs of assemblages have a homogenous distribution, which
is typically approaches using chi-squared testing. arkaia
goes beyond null-hypothesis statistical test by grounding inference in
practical, archaeological terms, to assess whether such homogeneity is
greater for the pairs of related contexts (the surface-subsurface pair)
than any unrelated pair. Such determination can be made using either
counts of types or the presence/absence of types (as count data may not
be available).
For this example, it is already apparent that the distribution of counts for surface1/subsurface1 are a lot more similar than surface2/subsurface2; this will be reflected in their measures of homogeneity.
For count data, \(\chi^2\) can be estimated using the Cressie-Read power-divergence statistic for each surface-subsurface pair:
Then, to obtain a measure of effect size, the bias-corrected estimation of Cramér’s \(V\), denoted \(V_B\), allows for cross-comparability of the degree of homogeneity:
The VB() function by default uses CR() with
a parameter of lambda = 2/3 to estimate \(\chi^2\); if one wants to use Pearson’s
method, one sets lambda = 1 in the input:
# Pearson's chi-squared
VB(x[,1:2], lambda = 1)
#> [1] 0.03588807
VB(x[,3:4], lambda = 1)
#> [1] 0.8394991It is tedious to re-run effect sizes between selected columns, and so
the VB_pair() function returns estimates of \(V_B\) pairwise between all columns:
x_effect_sizes <- VB_pair(x)
x_effect_sizes
#> surface1 subsurface1 surface2 subsurface2
#> surface1 0.00000000 0.03878439 0.5152189 0.6407815
#> subsurface1 0.03878439 0.00000000 0.4872673 0.7486347
#> surface2 0.51521892 0.48726730 0.0000000 0.8551924
#> subsurface2 0.64078148 0.74863466 0.8551924 0.0000000
#> attr(,"class")
#> [1] "effect_sizes" "matrix"To assess whether the effect sizes between pairs of related contexts
are lower (i.e., more homogenous) than those between pairs of unrelated
contexts (e.g, between surface1 and surface2, or subsurface1 and
surface2), the function homogeneity() is used. The
related pairs must be specified, in either a matrix or data
frame. The unrelated pairs may be similarly specified, or,
not entered into the function, are automatically determined as all pairs
which are not contained in the related
dataset.
Note that the homogeneity() function takes the results
of the pairwise function VB_pair(), not the original matrix
x (effect sizes can be estimated via other means, as will
be shown below for presence/absence data):
# related pairs
W_contexts <- matrix(c("surface1", "surface2", "subsurface1", "subsurface2"), ncol = 2)
W_contexts
#> [,1] [,2]
#> [1,] "surface1" "subsurface1"
#> [2,] "surface2" "subsurface2"
# homogeneity
x_homogeneity <- homogeneity(x_effect_sizes, related = W_contexts)
x_homogeneity
#>
#> Homogeneity between Related Contexts:
#> Number of related pairs: 4
#> Number of unrelated pairs: 8
#> Mean quantile: 0.5The results of homogeneity() comprise a
list object containing:
n: a vector containing the number of related and
unrelated pairs of contextsU: the effect sizes between unrelated pairs of
contextsW: the effect sizes between related pairs of
contextsQ: for each related pair, the quantile indicating the
proportion of unrelated pairs which are less homogenous. Higher values
of Q indicate that the pair is more homogenous in
comparison.D: the distribution of differences among all
U and W. The proportion of D
greater than zero is equivalent to the mean of all quantile estimates
Q.The results of this example show that the surface1/subsurface1 pairing has the most homogeneous distribution compared to unrelated pairs, whereas the surface2/subsurface2 pairing is less homogeneous than even unrelated pairs of contexts. By taking the mean quantile (or proportion of differences \(D\) greater than zero), one can see that the matter of representativeness of surface/subsurface pairings is evenly split, with the related surface/subsurface pairs in their totality being no more or less homogenous than those of unrelated pairs of contexts.
In many situations one may not have counts, but only information on the presence/absence of an artifact type. In such a case, one can draft a \(2 \times 2\) contingency table for two contexts, giving the count of how many types are present in both, missing in one or the other, or missing in both:
| Present | Absent | |
|---|---|---|
| Present | (present in both) | (present in 1 but not 2) |
| Absent | (present in 2 but not 1) | (absent from both) |
Homogeneity in such a situation can be evaluated using the
conventional log odds ratio, or also by giving the trace of the \(2 \times 2\) presence/absence contingency
table. The log odds ratio will be affected by the degree to which types
are both present or both absent (i.e., comparing contexts with
themselves will yield different log odds ratios, if they happen to be
missing more or less types – see the diagonal of the pairwise log odds
ratio matrix below). The trace coefficient will not abe affected by this
tendency. arkhaia has functionality for evaluating both log
odds ratios and the trace, pairwise, for a contingency table:
x_logOR <- log_OR_pair(x)
x_logOR
#> surface1 subsurface1 surface2 subsurface2
#> surface1 3.2958369 1.0986123 1.9459101 -0.2513144
#> subsurface1 1.0986123 2.3978953 0.3364722 1.0986123
#> surface2 1.9459101 0.3364722 3.5553481 -1.0986123
#> subsurface2 -0.2513144 1.0986123 -1.0986123 3.2958369
#> attr(,"class")
#> [1] "effect_sizes" "matrix"
x_trace <- trace_pair(x)
x_trace
#> surface1 subsurface1 surface2 subsurface2
#> surface1 5 4 4 3
#> subsurface1 4 5 3 4
#> surface2 4 3 5 2
#> subsurface2 3 4 2 5
#> attr(,"class")
#> [1] "effect_sizes" "matrix"The log_OR_pair() function adds 0.5 to all cells to
avoid zero entries (see documentation).
For presence/absence data, it can be emphasized that a higher effect
size indicates greater homogeneity (the opposite of Cramér’s \(V\) for count data). Hence, the
homogeneity() has a direction argument, such
that the quantiles and differences are reversed:
# homogeneity
x_homogeneity_logOR <- homogeneity(x_logOR, related = W_contexts, direction = "WU")
x_homogeneity_logOR
#>
#> Homogeneity between Related Contexts:
#> Number of related pairs: 4
#> Number of unrelated pairs: 8
#> Mean quantile: 0.25
x_homogeneity_trace <- homogeneity(x_trace, related = W_contexts, direction = "WU")
x_homogeneity_trace
#>
#> Homogeneity between Related Contexts:
#> Number of related pairs: 4
#> Number of unrelated pairs: 8
#> Mean quantile: 0.25The results of measuring homogeneity on the basis of presence/absence data can be at odds with those based in count data: consider deposits which may have greater diversity of types and more sporadic finds, but in exceedingly low counts, such as with residual material; by treating the presence of a type as a binary variate, there is no room for quantifying the magnitude of the presence of that type. Evaluating both count and presence/absence statistics will bear out such a distinction, which may not be immediately clear in the case of larger datasets.