extract()
is an R function which implements the dyad
ratios algorithm described in Stimson (2018) for constructing a latent
time series from the raw materials of survey research marginals. It
inputs data in archival form, as dated survey results, aggregates them
into regular periods selected at run time, and extracts the latent
dimension(s) which best account for the raw series.
Although its central mathematics is entirely different - based upon ratios of comparable observed series rather than covariance of “variables” - its behavior is strongly analogous to extraction of a “factor score” in standard (iterative) principal components analysis.
The routine is called from R by the command:
output <- extract(varname, date, index, ncases=NULL,
                  unit="A", mult=1, begindt=NA, enddt=NA, npass=1,
                  smoothing=TRUE, endmonth=12)
The first three arguments must be specified by the user; the others are optional:
varname
is an \(n\)-length vector containing names given to
input series. The routine assumes that any two observations with the
same name are comparable and that any change in name signals
noncomparability. Ratios are computed only between comparable
observations.
date
is an \(n\)-length vector of dates, typically one
of the dates of survey field work (e.g., first, last, or median day).
This should be recorded in a date format. For example, if you had the
string “2025-01-01”, you could use
lubridate::ymd("2025-01-01")
to convert it to a date
format.
index
is an \(n\)-length vector of the numeric summary
value of the result. It might be a percent or proportion responding in a
single category, (e.g., the “approve” response in presidential approval)
or some multi-response summary, for example,
\[\text{Index} = \frac{\text{Percent Agree}}{\text{Percent Agree} + \text{Percent Disagree}}\]
Interpretation of the derived latent dimension is easier if the index is coded so that polarity is the same across items (for example, if the concept being measured is liberalism, high values always stand for the more liberal response), but as in principal components analysis, the routine deals appropriately with polarity switches.
ncases
is an \(n\)-length vector giving the number of cases for each value in
index, typically the sample size. It is used during aggregation to
produce a weighted average when multiple readings fall in one
aggregation period. If that does not occur, or if the user prefers an
unweighted average, then ncases=NULL (or simply omitting the ncases
argument) turns off case weighting. Where a mix of available and
missing ncases values is supplied, 0 or NA values are reset to 1000.
unit
is the aggregation period: “D” (daily), “M” (monthly), “Q” (quarterly),
“A” (annual, default), or “O” (other, for multi-year aggregation).
mult
is the number of years, used only with unit option “O”.
begindt
is the beginning date for analysis. Default
(NA
) determines beginning date from the earliest date in
the data. Entries for beginning and ending date use the ISOdate
function. For example, to start an analysis in January, 1993, enter
ISOdate(1993,1,1)
. (As always, R is case sensitive. So
“ISO” must be caps and “date” lower case.)
enddt
is the ending date for analysis. Default
(NA
) determines ending date from the latest date in the
data.
Warning: The routine cannot determine the earliest or latest dates from items that are not actually used in the analysis. The criterion for usage is that an item must appear in more than one period after aggregation. So if the beginning or ending date is determined by an item that is discarded because it does not meet this criterion, the routine will fail.
smoothing
specifies whether or not exponential
smoothing is applied to intermediate estimates during the iterative
solution process. Default is TRUE.
npass
is the number of dimensions to be extracted; it can be 1 or 2 and
defaults to 1.
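Putting several of these arguments together, here is a hedged sketch of a call that computes an index, aggregates quarterly with case weighting, and fixes the analysis window with ISOdate. The data frame polls and its columns (item, fielddate, agree, disagree, n) are hypothetical stand-ins for your own archival survey marginals, and the toy data are far too sparse for a meaningful solution; only the extract() arguments themselves come from the documentation above.
library(DyadRatios)

# Hypothetical archival survey marginals (replace with your own data).
polls <- data.frame(
  item      = c("trust_q1", "trust_q1", "trust_q1", "trust_q2", "trust_q2", "trust_q2"),
  fielddate = lubridate::ymd(c("1993-02-10", "1993-09-14", "1994-06-05",
                               "1993-03-01", "1994-01-20", "1995-03-15")),
  agree     = c(55, 51, 48, 40, 38, 37),
  disagree  = c(38, 41, 44, 52, 54, 55),
  n         = c(1012, 1030, 998, 1504, 1498, 1489)
)

# Index coded with common polarity across items, as described under index above.
polls$index <- 100 * polls$agree / (polls$agree + polls$disagree)

result <- extract(
  varname = polls$item,
  date    = polls$fielddate,
  index   = polls$index,
  ncases  = polls$n,    # case weighting during aggregation
  unit    = "Q",        # quarterly aggregation
  begindt = ISOdate(1993, 1, 1),
  enddt   = ISOdate(1995, 12, 31),
  npass   = 1
)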
The routine expects a date variable of R class Date.
Generally, import packages like foreign
,
readr
, haven
, rio
will try to
turn date-looking variables into date objects. The str()
function will identify if your variable is a date or character. We can
see from the output below that the date
variable in the
jennings
dataset is indeed in a date
format.
library(DyadRatios)
data(jennings)
str(jennings$date)
#> Date[1:295], format: "1973-12-01" "1986-05-24" "1987-05-05" "1991-06-08" "1994-07-26" ...
If your variable is not already in a date format, you can transform
it as such with the functions from the lubridate
package.
For example, if your date variable is in the format “2025-01-01”, you
can use the ymd()
function to convert it to a date format
as follows:
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
temp <- data.frame(survey_date = c("2025-01-01", "2024-10-13", "2020-05-13"))
str(temp)
#> 'data.frame': 3 obs. of 1 variable:
#> $ survey_date: chr "2025-01-01" "2024-10-13" "2020-05-13"
temp$survey_date <- ymd(temp$survey_date)
str(temp)
#> 'data.frame': 3 obs. of 1 variable:
#> $ survey_date: Date, format: "2025-01-01" "2024-10-13" ...
See how the class changed from chr
to
Date
.
If you happened to have three variables for month, day and year, you
could paste them together and use the lubridate
functions:
temp <- data.frame(year = c(2025, 2024, 2020),
                   month = c(1, 10, 5),
                   day = c(1, 13, 13))
temp$date <- lubridate::ymd(paste(temp$year, temp$month, temp$day, sep="-"))
str(temp)
#> 'data.frame': 3 obs. of 4 variables:
#> $ year : num 2025 2024 2020
#> $ month: num 1 10 5
#> $ day : num 1 13 13
#> $ date : Date, format: "2025-01-01" "2024-10-13" ...
Warning: The lubridate functions will not handle fake dates (for example, 1-32-05). They decode only dates that actually existed on past calendars or will exist on future ones (e.g., no February 29 unless the year is actually a leap year).
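For example, an impossible date simply fails to parse and comes back as NA (with a warning):
ymd("2005-01-32")   # day 32 does not exist: NA, with a parse warning
ymd("2023-02-29")   # 2023 is not a leap year: NA, with a parse warning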
Ideally, the dates will be in a structured format. The lubridate functions will even parse string dates with words, e.g., “January 1, 1993” or “Aug 2, 2020” so long as the strings have month, day and year in the same position in the date.
temp <- data.frame(date = c("January 1, 1993", "Jan 1, 1993", "Aug 2, 2020"))
temp$date <- mdy(temp$date)
The formats produced by importing Excel or CSV documents will not be
identical to those produced by the lubridate functions, but they will
still work with the extract() function because their components can
still be extracted by the lubridate functions month(), day() and
year().
temp <- data.frame(date = ymd("2025-01-01", "1990-01-01", "1970-01-01", "1950-01-01"))
xl_temp <- tempfile(fileext = ".xlsx")
csv_temp <- tempfile(fileext = ".csv")
rio::export(temp, xl_temp)
rio::export(temp, csv_temp)
temp_xl <- rio::import(xl_temp)
temp_csv <- rio::import(csv_temp)
str(temp_xl)
#> 'data.frame': 4 obs. of 1 variable:
#> $ date: POSIXct, format: "2025-01-01" "1990-01-01" ...
str(temp_csv)
#> 'data.frame': 4 obs. of 1 variable:
#> $ date: IDate, format: "2025-01-01" "1990-01-01" ...
Notice that from Excel you get a POSIXct variable and from CSV you get
an IDate object. They still represent the same underlying dates as the
lubridate version, and all values pass the equivalence test.
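A quick sketch of that check, using the lubridate accessors just mentioned (temp, temp_xl and temp_csv come from the chunk above):
# The date components agree regardless of storage class (Date, POSIXct, IDate).
all(year(temp_xl$date)   == year(temp$date))
all(month(temp_csv$date) == month(temp$date))
all(day(temp_xl$date)    == day(temp$date))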
The extract function produces as output 8 categories of information:
formula
reproduces the user call.
setup
supplies basic information about options and the iterative solution.
period
is a list of the aggregation periods, for example, 2005.02 for February, 2005.
varname
is a list, in order, of the variables actually used in the analysis, a subset of all those in the data.
loadings
are the item-scale correlations from the final solution. Their square is the validity estimate used in weighting.
means and std.deviations
document the item descriptive information.
latent
is the estimated time series, the purpose of everything. The variable latent1 is for the first dimension and latent2 for the second, if applicable.
hist
is the iteration history.
totalvar
is the total variance to be explained.
var_exp
is the variance explained by each dimension.
dropped
lists any series dropped from the analysis.
smoothing
records whether smoothing was applied during the iterative solution process.
The raw output object created at run time contains everything of interest, but in an inconvenient format. There are several functions that can be used to display results in different ways.
plot
displays a time series ggplot of the estimated dimension(s) on the y
axis against time units on the x axis. It is called with
plot(object).
print
displays the output of the iterative solution process. This is what was
previously printed with verbose=TRUE, which has been removed in favor
of a print method for the function output.
summary
displays information about the raw series of
the analysis. Under the heading: “Variable Name Cases Loading Mean Std
Dev” it lists as many series as are used in the solution, giving
variable name, number of cases (after aggregation), dimension loadings,
and means and standard deviations of the raw series.
get_mood
retrieves the period and latent dimension
estimate(s) from the model object and returns them in a data
frame.
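As a quick usage sketch, assume output holds the result of the extract() call shown at the top of this page. The last line is only a suggestion for inspecting the raw object directly and assumes it is returned as a named list whose element names match the labels above.
print(output)             # iteration report and variance explained
summary(output)           # loadings, means, and standard deviations per item
plot(output)              # time series ggplot of the latent dimension(s)
mood <- get_mood(output)  # data frame of periods and latent estimate(s)

# Raw access (assumes a named-list return value; check the names first).
names(output)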
Correlations, in the case of time series, measure whether two series vary in or out of phase. Thus the cross-sectional interpretation of a negative correlation, that two items are inversely related, does not hold. It is not unusual to observe negative “loadings” in extract analyses. They mean only that items move out of phase, not that they are opposites.
Assume \(N\) survey results, each coded into a meaningful single indicator. These results consist of \(n\) subsets of comparable items measured at different times, \(t = 1, \ldots, T\). Designate each result \(x_{it}\), where the subscript \(i\) indicates item and \(t\) indicates aggregation period, \(1, \ldots, T\).
The starting assumption is that ratios, \(r_{i,t+k} = \frac{x_{i,t+k}}{x_{it}}\), of the comparable item \(i\) at different times will be meaningful indicators of the latent concept to be extracted. Metric information is thus lost, which is desirable because, absent a science of question wording, we have no ability to compare different items. If there were no missing observations, then for each item \(i\), we could let \(r_{i1} = 1.0\) and observe the complete set of ratios, \(r_{i2}, r_{i3}, \ldots, r_{iT}\). Then an average across the \(n\) items forms an excellent estimate of the latent concept \(\theta_{t}\):
\[\theta_{t} = \frac{\sum_{i=1}^{n}r_{it}}{n}\]
But we do have missing \(x\)'s, and in the typical case it is most of them. We would still be in good shape if we had a common period, say period 1, which was available for all series. We could then form the sum from the \(k\) available items, \(k \leq n\), and divide by \(k\). But we also lack such a common period. That motivates a recursive approach.
Forward Recursion: Begin by selecting that subset of items which are available at time 1. For them we can form \(\hat{\theta}\) for \(t=1,\ldots, T\) setting \(\hat{\theta}_{1} = 1.0\) and calculating \(\hat{\theta}_{2}, \ldots, \hat{\theta}_{T}\) from whatever subsets of items are available. Now proceed to period 2, setting \(\hat{\theta}_{2}\) to that value estimated from period 1 and now, using the subset of items which include period 2, estimating \(\hat{\theta}_{3}, \ldots, \hat{\theta}_{T}\) from the assumption that \(\theta_{2} = \hat{\theta}_{2}\). By projecting \(\hat{\theta}_{2}\) forward in this manner, the estimates for periods 3 through \(T\) become comparable to what they would have been had period 1 information been available. This procedure is repeated one period at a time through \(T-1\), giving from 1 to \(T-1\) different estimates of each of the \(\theta_{t}\). An average of all of them becomes \(\hat{\theta}_{t}\).
Backward Recursion: It will be seen that forward recursion very heavily weights early information relative to later information. Period 1 contributes to all subsequent estimates, whereas period \(T-1\) contributes only to \(T\), and period \(T\) only to itself. Thus the direction of recursion matters. Employing the same procedure backward puts a different weight on the items and gives a comparable, but not identical, set of estimates. Thus a more efficient set of estimates, one weighting all information equally, can be gained by averaging the two recursive estimates. (And the correlation between forward and backward series becomes a reliability estimate.)
As in iterated principal components we both make assumptions about item validity and then, post hoc, have the ability to observe empirical estimates of validities (the square of item/scale correlations). At the beginning validities are assumed to be 1.0 for all items. Then the empirically estimated validities become assumed validities for the next iteration. This procedure is repeated until the difference between assumed and estimated validities is effectively zero for all items, the maximum item discrepancy less than .001.
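Before turning to a real example, here is a toy sketch of the complete-data case described above: no missing observations, ratios taken against period 1, and the latent estimate formed as the average ratio per period. It illustrates only the ratio-averaging idea, not the package's recursive, validity-weighted estimator.
# Three comparable items observed in all six periods (toy numbers).
x <- rbind(
  item1 = c(50, 52, 55, 53, 58, 60),
  item2 = c(30, 31, 33, 32, 35, 36),
  item3 = c(70, 73, 77, 74, 81, 84)
)

# Ratios r_it = x_it / x_i1, so r_i1 = 1 for every item.
r <- x / x[, 1]

# Estimate of the latent series: the average ratio across items in each period.
theta_hat <- colMeans(r)
round(theta_hat, 3)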
We illustrate the use of extract with the data from Jennings et al.
(2017), who graciously shared their data through Dataverse. The first
few observations show the question names, dates, percentage distrusting
government, and sample size.
library(DyadRatios)
data(jennings)
head(jennings)
#> variable date value n
#> 1 govtrust_bsa 1973-12-01 15.2 1857
#> 2 govtrust_bsa 1986-05-24 11.8 1548
#> 3 govtrust_bsa 1987-05-05 11.2 1410
#> 4 govtrust_bsa 1991-06-08 14.1 1445
#> 5 govtrust_bsa 1994-07-26 21.2 1137
#> 6 govtrust_bsa 1996-06-01 23.5 1180
We can see how many questions there are in the data:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
jennings %>%
group_by(variable) %>%
tally() %>%
arrange(desc(n))
#> # A tibble: 37 × 2
#> variable n
#> <chr> <int>
#> 1 eb_trustprl 30
#> 2 eb_trustgov 29
#> 3 govtrust_bsa 20
#> 4 h_govsys 20
#> 5 trust_mori2 18
#> 6 trust_mori1 17
#> 7 bsa_poltrust 15
#> 8 bsa_MPs 14
#> 9 bsa_votes 14
#> 10 bes_mpstrust 8
#> # ℹ 27 more rows
We can now estimate the latent series with the dyad ratios algorithm:
jennings_out <- extract(
  varname = jennings$variable,
  date = jennings$date,
  index = jennings$value,
  ncases = jennings$n,
  begindt = min(jennings$date),
  enddt = max(jennings$date),
  npass = 1
)
We can print the information about the estimates with:
print(jennings_out)
#>
#> Estimation report:
#> Period: 1944 to 2016 , 73 time points
#> Number of series: 37
#> Number of usable series: 37
#> Exponential smoothing: Yes
#>
#> Iteration history:
#>
#> Dimension Iter Convergence Criterion Reliability Alphaf Alphab
#> Dimension 1 1 0.1364 0.001 0.923 0.500 0.5
#> Dimension 1 2 0.0159 0.001 0.920 0.507 0.5
#> Dimension 1 3 0.0041 0.001 0.918 0.510 0.5
#> Dimension 1 4 0.0011 0.001 0.917 0.511 0.5
#> Dimension 1 5 3e-04 0.001 0.917 0.511 0.5
#>
#> Total Variance to be Explained = 3.49
#>
#> Percent Variance Explained:
#> Dimension Variance Proportion
#> Dimension 1 1.77 0.506
#>
#> Final Weighted Average Metric: Mean: 51.72 St. Dev: 4.54
Our latent dimension explains about \(50\%\) of the variance in the raw series. We could also look at the loadings:
summary(jennings_out)
#> Variable Loadings and Descriptive Information: Dimension 1
#> Variable Name Cases Loading Mean Std Dev
#> bes_govtrust 2 1.0000 10.9000 1.700000
#> bes_mpstrust 3 0.9987 52.7362 3.119561
#> bes_nosay 4 0.9902 51.4250 5.053897
#> bes_parties 2 1.0000 23.1822 5.119276
#> bes_polmoney 3 0.6010 58.5526 1.032933
#> bes_polstrust 3 0.5115 52.0000 6.531973
#> bes_wmtrust 2 1.0000 35.5000 5.500000
#> bsa_MPs 14 0.3353 74.0429 2.244585
#> bsa_govtrust2 3 -0.7752 61.6667 0.713364
#> bsa_parties 7 0.6127 68.2143 3.015876
#> bsa_pols 2 1.0000 43.5000 2.500000
#> bsa_poltrust 15 0.3153 91.9200 1.534579
#> bsa_votes 14 0.6426 72.9071 3.774627
#> cspl_pubstd 5 0.9347 18.6000 6.468385
#> eb_polcor 2 -1.0000 59.7100 1.670000
#> eb_trustgov 16 0.8686 64.9280 7.005034
#> eb_trustprl 17 0.9451 59.3270 7.114013
#> efficacy_g 2 1.0000 69.5000 1.500000
#> ess_trustparl 7 0.4183 48.5786 2.430828
#> ess_trustpol 7 0.0927 61.8500 2.343636
#> govtrust2_m 2 1.0000 45.5000 2.500000
#> govtrust_bsa 20 0.8655 24.7150 7.980932
#> govtrust_m 2 1.0000 69.0063 1.993737
#> h_govsys 19 0.5766 62.6721 7.635968
#> h_mpssat 5 0.6464 38.0000 3.098387
#> h_parlsat 6 0.6778 34.3333 1.885618
#> improp_g 5 0.9095 57.6000 8.114185
#> mpgain_m 4 0.9627 57.0000 9.617692
#> pollies_g 4 0.9532 82.2500 3.832427
#> polmor_g 4 0.7075 54.2500 11.431863
#> pols_g 3 0.9913 40.3333 5.557777
#> pols_mori 6 0.4465 53.6667 5.120764
#> spint_g 2 1.0000 72.0000 5.000000
#> trust_mori1 17 0.3702 73.4118 3.465100
#> trust_mori2 18 0.6150 75.6111 3.111607
#> trustmps_m 4 0.8029 68.5006 5.852612
#> trustown_m 4 -0.4189 40.7500 4.023369
Some of these have much higher loadings than others. That means they
are more reliable indicators of government (dis)trust than others. For
example, govtrust_bsa (actual question from the BSA: “Do not trust
British governments to place needs of the nation above interests of
their own party?”) is a very reliable indicator with observations in 20
different time periods. On the other hand, bsa_MPs (actual question
from the BSA: “Those we elect as MPs lose touch with people pretty
quickly”) has a much lower loading, indicating it is a much less
reliable indicator of (dis)trust. We can also make a plot of mood with
the plot method:
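# Time series plot of the latent dimension, via the plot method described above
plot(jennings_out)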
Finally, to retrieve the latent dimension and the periods, we can use
the get_mood
function:
ests <- get_mood(jennings_out)
ests
#> period latent1
#> 1 1944 49.08698
#> 2 1945 47.69880
#> 3 1946 47.00470
#> 4 1947 46.65765
#> 5 1948 46.48413
#> 6 1949 46.39737
#> 7 1950 46.35399
#> 8 1951 46.33230
#> 9 1952 46.32145
#> 10 1953 46.31603
#> 11 1954 46.31332
#> 12 1955 46.31196
#> 13 1956 46.31128
#> 14 1957 46.31095
#> 15 1958 46.31078
#> 16 1959 46.31069
#> 17 1960 46.31065
#> 18 1961 46.31063
#> 19 1962 46.31062
#> 20 1963 46.31061
#> 21 1964 46.31061
#> 22 1965 46.31061
#> 23 1966 46.31061
#> 24 1967 46.31061
#> 25 1968 46.31061
#> 26 1969 48.13153
#> 27 1970 49.04199
#> 28 1971 49.49722
#> 29 1972 50.11777
#> 30 1973 50.97421
#> 31 1974 52.69475
#> 32 1975 53.54804
#> 33 1976 53.97128
#> 34 1977 54.18123
#> 35 1978 54.28539
#> 36 1979 54.33708
#> 37 1980 54.36273
#> 38 1981 54.37545
#> 39 1982 54.38177
#> 40 1983 54.38491
#> 41 1984 52.97968
#> 42 1985 52.27706
#> 43 1986 51.47441
#> 44 1987 50.83300
#> 45 1988 50.92375
#> 46 1989 50.97084
#> 47 1990 51.66014
#> 48 1991 52.05198
#> 49 1992 49.92513
#> 50 1993 54.62718
#> 51 1994 55.22873
#> 52 1995 58.85351
#> 53 1996 56.35075
#> 54 1997 56.99313
#> 55 1998 55.04247
#> 56 1999 52.54010
#> 57 2000 53.95874
#> 58 2001 54.32868
#> 59 2002 54.54125
#> 60 2003 55.65126
#> 61 2004 55.84265
#> 62 2005 55.27056
#> 63 2006 55.86224
#> 64 2007 55.52446
#> 65 2008 56.40909
#> 66 2009 59.04136
#> 67 2010 59.21175
#> 68 2011 59.26123
#> 69 2012 59.80284
#> 70 2013 59.73838
#> 71 2014 59.50999
#> 72 2015 57.28657
#> 73 2016 60.30479
Jennings, W., N. Clarke, J. Moss, and G. Stoker (2017). “The Decline in Diffuse Support for National Politics: The Long View on Political Discontent in Britain.” Public Opinion Quarterly 81(3): 748-758.
Stimson, James A. (2018). “The Dyad Ratios Algorithm for Estimating Latent Public Opinion: Estimation, Testing, and Comparison to Other Approaches.” Bulletin of Sociological Methodology 137-138: 201-218.