Documentation for the Dyad Ratios Algorithm

James Stimson with light editing and updating by Dave Armstrong

General

extract() is an R function which implements the dyad ratios algorithm described in Stimson (2018) for constructing a latent time series from the raw materials of survey research marginals. It inputs data in archival form, as dated survey results, aggregates them into regular periods selected at run time, and extracts the latent dimension(s) which best account for the raw series.

Although its central mathematics is entirely different - based upon ratios of comparable observed series rather than covariance of “variables” - its behavior is strongly analogous to extraction of a “factor score” in standard (iterative) principal components analysis.

Usage

The routine is called from R by the command:

output <- extract(varname, date, index, ncases=NULL, 
  unit="A", mult=1, begindt=NA, enddt=NA, npass=1,
  smoothing=TRUE, endmonth=12)

Arguments

The first three arguments must be specified by the user; the others are optional (a sketch using several of the optional arguments follows this list):

  1. varname is an \(n\)-length vector containing names given to input series. The routine assumes that any two observations with the same name are comparable and that any change in name signals noncomparability. Ratios are computed only between comparable observations.

  2. date is an \(n\)-length vector of dates, typically one of the dates of survey field work (e.g., first, last, or median day). This should be recorded in a date format. For example, if you had the string “2025-01-01”, you could use lubridate::ymd("2025-01-01") to convert it to a date format.

  3. index is an \(n\)-length vector of the numeric summary value of the result. It might be a percent or proportion responding in a single category (e.g., the “approve” response in presidential approval) or some multi-response summary, for example,

\[\text{Index} = \frac{\text{Percent Agree}}{\text{Percent Agree} + \text{Percent Disagree}}\]

Interpretation of the derived latent dimension is eased by having the index coded such that polarity is the same across items (for example, if the concept being measured is liberalism, then high values always stand for the more liberal response), but as in principal components analysis, the routine deals appropriately with polarity switches.

  4. ncases is an \(n\)-length vector giving the number of cases for each value in index, typically the sample size. It is used during aggregation to produce a weighted average when multiple readings fall in one aggregation period. If this issue does not arise, or if the user prefers an unweighted average, then ncases=NULL (or simply omitting ncases) will ignore case weighting. In the case of a mix of available and missing ncases indicators, 0 or NA values are reset to 1000.

  5. unit is the aggregation period: “D” (daily), “M” (monthly), “Q” (quarterly), “A” (annual, the default), or “O” (other, for multi-year aggregation). mult is the number of years per period and is used only with unit option “O”.

  6. begindt is the beginning date for analysis. The default (NA) determines the beginning date from the earliest date in the data. Entries for the beginning and ending dates use the ISOdate function. For example, to start an analysis in January, 1993, enter ISOdate(1993,1,1). (As always, R is case sensitive, so “ISO” must be caps and “date” lower case.)

  7. enddt is the ending date for analysis. The default (NA) determines the ending date from the latest date in the data.

Warning: The routine cannot determine the earliest or latest dates from items which are not actually used in the analysis. The criterion for usage is that items must appear in more than one period after aggregation. So if the beginning or ending date is determined by an item which is discarded because it does not meet this criterion, the routine will fail.

  8. smoothing specifies whether or not exponential smoothing is applied to intermediate estimates during the iterative solution process. The default is TRUE.

  9. npass is the number of dimensions to be extracted; it can be 1 or 2 and defaults to 1.
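
To make the optional arguments concrete, here is a hedged sketch using the jennings data that ships with the package (analyzed in full in the Example section below); the restriction to 1990 onward and the annual unit are purely illustrative.

library(DyadRatios)
data(jennings)

# annual aggregation, restricted to January 1990 onward; enddt is left at its
# default so the ending date is taken from the latest date in the data
out90 <- extract(
  varname = jennings$variable,
  date    = jennings$date,
  index   = jennings$value,
  ncases  = jennings$n,
  unit    = "A",                   # annual periods (the default)
  begindt = ISOdate(1990, 1, 1),   # start the analysis in January 1990
  npass   = 1
)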

Dating Considerations

The routine expects a date variable of R class Date. Import packages like foreign, readr, haven, and rio will generally try to turn date-looking variables into date objects. The str() function will show whether your variable is a Date or a character. We can see from the output below that the date variable in the jennings dataset is indeed in a date format.

library(DyadRatios)
data(jennings)
str(jennings$date)
#>  Date[1:295], format: "1973-12-01" "1986-05-24" "1987-05-05" "1991-06-08" "1994-07-26" ...

If your variable is not already in a date format, you can convert it with functions from the lubridate package. For example, if your date variable is in the format “2025-01-01”, you can use the ymd() function to convert it to a date format as follows:

library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union
temp <- data.frame(survey_date = c("2025-01-01", "2024-10-13", "2020-05-13"))
str(temp)
#> 'data.frame':    3 obs. of  1 variable:
#>  $ survey_date: chr  "2025-01-01" "2024-10-13" "2020-05-13"
temp$survey_date <- ymd(temp$survey_date)
str(temp)
#> 'data.frame':    3 obs. of  1 variable:
#>  $ survey_date: Date, format: "2025-01-01" "2024-10-13" ...

See how the class changed from chr to Date.

If you happened to have three variables for month, day and year, you could paste them together and use the lubridate functions:

temp <- data.frame(year= c(2025, 2024, 2020), 
                   month = c(1, 10, 5), 
                   day = c(1, 13, 13))
temp$date <- lubridate::ymd(paste(temp$year, temp$month, temp$day, sep="-"))
str(temp)
#> 'data.frame':    3 obs. of  4 variables:
#>  $ year : num  2025 2024 2020
#>  $ month: num  1 10 5
#>  $ day  : num  1 13 13
#>  $ date : Date, format: "2025-01-01" "2024-10-13" ...

Warning: The lubridate functions will not accept impossible dates (for example, 1-32-05). They only decode dates that actually existed on past calendars or will exist on future ones (e.g., no February 29 unless the year is actually a leap year).
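
For example, an impossible date simply fails to parse and comes back as NA (with a parsing warning):

library(lubridate)
# impossible dates fail to parse and return NA, with a "failed to parse" warning
ymd("2025-01-32")   # day 32 does not exist
ymd("2023-02-29")   # 2023 is not a leap year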

Ideally, the dates will be in a structured format. The lubridate functions will even parse string dates with words, e.g., “January 1, 1993” or “Aug 2, 2020” so long as the strings have month, day and year in the same position in the date.

temp <- data.frame(date = c("January 1, 1993", "Jan 1, 1993", "Aug 2, 2020"))
temp$date <- mdy(temp$date)

The formats produced by importing Excel or csv documents will not be identical to those produced by the lubridate functions, but they will still work with the extract() function because their components can still be extracted by the lubridate functions month(), day(), and year().

temp <- data.frame(date = ymd("2025-01-01", "1990-01-01", "1970-01-01", "1950-01-01"))
xl_temp <- tempfile(fileext = ".xlsx")
csv_temp <- tempfile(fileext = ".csv")

rio::export(temp, xl_temp)
rio::export(temp, csv_temp)


temp_xl <- rio::import(xl_temp)
temp_csv <- rio::import(csv_temp)

str(temp_xl)
#> 'data.frame':    4 obs. of  1 variable:
#>  $ date: POSIXct, format: "2025-01-01" "1990-01-01" ...
str(temp_csv)
#> 'data.frame':    4 obs. of  1 variable:
#>  $ date: IDate, format: "2025-01-01" "1990-01-01" ...

Notice that from Excel you get a POSIXct variable and from csv you get an IDate object. They are still equal, in their underlying numeric form, to the dates produced by lubridate; all values pass the equivalence test.

all(temp$date == temp_xl$date)
#> [1] TRUE
all(temp$date == temp_csv$date)
#> [1] TRUE
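
If you prefer to work with plain Date objects throughout, the POSIXct column returned by the Excel import can be coerced with as.Date():

# optional: coerce the POSIXct column from the Excel import to a plain Date
temp_xl$date <- as.Date(temp_xl$date)
class(temp_xl$date)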

Output

The extract() function produces as output 8 categories of information (a sketch of accessing them directly follows the list):

  1. formula reproduces the user call
  2. setup supplies basic information about options and the iterative solution
  3. period is a list of the aggregation periods, for example, 2005.02 for February, 2005
  4. varname is a list, in order, of the variables actually used in the analysis, a subset of all those in the data.
  5. loadings are the item-scale correlations from the final solution. Their square is the validity estimate used in weighting.
  6. means and std.deviations document the item descriptive information, and
  7. latent is the estimated time series, the purpose of everything. The variable latent1 is for the first dimension and latent2 for the second, if applicable.
  8. Diagnostic Information: there are several additional objects returned in the output, including hist - the iteration history, totalvar - the total variance to be explained, var_exp - the variance explained by each dimension, dropped - any series dropped from the analysis, and smoothing - whether smoothing was applied during the iterative solution process.
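
A minimal sketch of working with these components directly, assuming the object output from the call shown under Usage and that the components are stored under the names listed above:

# assumes `output` holds the result of the extract() call shown under Usage
names(output)        # everything that is returned
output$loadings      # item-scale correlations; squared, these are the validity weights
output$varname       # the series actually used after aggregation
head(output$latent)  # the estimated latent series (latent1, plus latent2 if npass = 2)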

Output Functions

The raw output object created at run time contains everything of interest, but in an inconvenient format. There are several functions that can be used to display the results in different ways (a brief usage sketch follows this list):

  1. plot displays a ggplot time series of the estimated dimension(s) on the y axis against time units on the x axis; it is called with plot(object).

  2. print displays the output of the iterative solution process. This is what was previously printed with verbose=TRUE, an argument that has been removed in favor of a print method for the function output.

  3. summary displays information about the raw series of the analysis. Under the heading: “Variable Name Cases Loading Mean Std Dev” it lists as many series as are used in the solution, giving variable name, number of cases (after aggregation), dimension loadings, and means and standard deviations of the raw series.

  4. get_mood retrieves the period and latent dimension estimate(s) from the model object and returns them in a data frame.
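
Continuing with the output object from the call shown under Usage, a quick sketch of these helpers:

print(output)             # estimation report and iteration history
summary(output)           # per-item cases, loadings, means, and standard deviations
plot(output)              # ggplot time series of the latent dimension(s)
mood <- get_mood(output)  # data frame of period and latent estimate(s)
head(mood)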

Negative Correlations?

Correlations, in the case of time series, measure whether two series vary in or out of phase. Thus the cross-sectional interpretation of a negative correlation (that two items are inversely related) does not hold. It is not unusual to observe negative “loadings” in extract analyses. They mean only that items move out of phase, not that they are opposites.

Model

Assume \(N\) survey results, each coded into a meaningful single indicator. These results consist of \(n\) subsets of comparable items measured at different times, \(t = 1, \ldots, T\). Designate each result \(x_{it}\), where the \(i\) subscript indicates item and \(t\) indicates aggregation period, \(1, \ldots, T\).

The starting assumption is that ratios, \(r_{i,t+k} = \frac{x_{i,t+k}}{x_{it}}\), of the comparable item \(i\) at different times will be meaningful indicators of the latent concept to be extracted. Metric information is thus lost, which is desirable because absent a science of question wording, we have no ability to compare different items. If there were no missing observations, then for each item \(i\), we could let \(r_{i1}=1.0\) and observe the complete set of ratios, \(r_{i2}, r_{i3}, \ldots, r_{iT}\). Then an average across the \(n\) items forms an excellent estimate of the latent concept \(\theta_{t}\):

\[\theta_{t} = \frac{\sum_{i=1}^{n}r_{it}}{n}\]
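
As a toy illustration of this complete-data case (invented numbers, two items, three periods), the ratios and their per-period average can be computed directly:

# two comparable items observed in every one of T = 3 periods (toy numbers)
x <- rbind(item1 = c(40, 44, 50),
           item2 = c(20, 21, 26))

# ratios of each item to its own period-1 value, so the first column is 1.0
r <- x / x[, 1]

# with no missing data, the latent estimate is the average ratio in each period
theta_hat <- colMeans(r)
theta_hat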

But we do have missing x’s - and in the typical case it is most of them. We would still be in good shape if we had a common period, say period 1, which was available for all series. We could then form the sum from the \(k\) available items, \(k \leq n\), and divide by \(k\). But we also lack such a common period. That motivates a recursive approach.

Forward Recursion: Begin by selecting that subset of items which are available at time 1. For them we can form \(\hat{\theta}\) for \(t=1,\ldots, T\) setting \(\hat{\theta}_{1} = 1.0\) and calculating \(\hat{\theta}_{2}, \ldots, \hat{\theta}_{T}\) from whatever subsets of items are available. Now proceed to period 2, setting \(\hat{\theta}_{2}\) to that value estimated from period 1 and now, using the subset of items which include period 2, estimating \(\hat{\theta}_{3}, \ldots, \hat{\theta}_{T}\) from the assumption that \(\theta_{2} = \hat{\theta}_{2}\). By projecting \(\hat{\theta}_{2}\) forward in this manner, the estimates for periods 3 through \(T\) become comparable to what they would have been had period 1 information been available. This procedure is repeated one period at a time through \(T-1\), giving from 1 to \(T-1\) different estimates of each of the \(\theta_{t}\). An average of all of them becomes \(\hat{\theta}_{t}\).

Backward Recursion: It will be seen that forward recursion weights early information very heavily relative to later information. Period 1 contributes to all subsequent estimates, whereas period \(T-1\) contributes only to \(T\), and period \(T\) only to itself. Thus the direction of recursion matters. Employing the same procedure backward puts a different weight on the items and gives a comparable, but not identical, set of estimates. Thus a more efficient set of estimates, one weighting all information equally, can be gained by averaging the two recursive estimates. (And the correlation between the forward and backward series becomes a reliability estimate.)
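
The sketch below is a simplified illustration of the forward recursion just described; it is not the package’s own implementation. Here x is an items-by-periods matrix with NA for unobserved item-period combinations, and the backward recursion can be obtained by applying the same function to the time-reversed matrix and reversing the result before the two sets of estimates are averaged.

# illustrative forward recursion (a sketch, not the package's code)
# x: items-by-periods matrix, NA where an item was not observed
forward_recursion <- function(x) {
  n_per <- ncol(x)
  # est[p, t] holds the estimate of theta_t anchored on base period p
  est <- matrix(NA_real_, nrow = n_per - 1, ncol = n_per)
  theta <- rep(NA_real_, n_per)
  theta[1] <- 1.0
  for (p in 1:(n_per - 1)) {
    base_items <- which(!is.na(x[, p]))     # items observed in base period p
    for (t in (p + 1):n_per) {
      ok <- base_items[!is.na(x[base_items, t])]
      if (length(ok) > 0) {
        # average ratio of period t to period p, projected forward from theta[p]
        est[p, t] <- mean(x[ok, t] / x[ok, p]) * theta[p]
      }
    }
    # the next base value is the average of the estimates of it made so far
    theta[p + 1] <- mean(est[1:p, p + 1], na.rm = TRUE)
  }
  # final estimates average over every base period that produced one
  c(theta[1], colMeans(est[, -1, drop = FALSE], na.rm = TRUE))
}

# toy data: three items, four periods, with some missing readings
x <- rbind(a = c(40, 44, NA, 50),
           b = c(20, 21, 26, NA),
           c = c(NA, 30, 33, 36))
forward_recursion(x)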

Iterative Validity Estimation

As in iterated principal components, we make assumptions about item validity and then, post hoc, observe empirical estimates of the validities (the square of the item-scale correlations). At the beginning, validities are assumed to be 1.0 for all items. Then the empirically estimated validities become the assumed validities for the next iteration. This procedure is repeated until the difference between assumed and estimated validities is effectively zero for all items, with the maximum item discrepancy less than .001.
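
Schematically, the iteration looks like the sketch below, where estimate_latent() and item_scale_correlations() are hypothetical placeholders standing in for the dyad-ratios estimation and the item-scale correlation step described above (they are not functions in the package), and x and n_items stand for the aggregated item data and the number of items:

# schematic only: estimate_latent() and item_scale_correlations() are
# hypothetical placeholders, not package functions; x and n_items are assumed inputs
validity <- rep(1.0, n_items)                    # begin assuming full validity
repeat {
  theta    <- estimate_latent(x, weights = validity)
  loadings <- item_scale_correlations(x, theta)
  new_validity <- loadings^2                     # validity = squared loading
  if (max(abs(new_validity - validity)) < 0.001) break
  validity <- new_validity                       # estimated becomes assumed
}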

Example

We illustrate the use of extract with the data from Jennings et al. (2017), who graciously shared their data through Dataverse. The first few observations show the question names, dates, percentage distrusting government, and sample sizes.

library(DyadRatios)
data(jennings)
head(jennings)
#>       variable       date value    n
#> 1 govtrust_bsa 1973-12-01  15.2 1857
#> 2 govtrust_bsa 1986-05-24  11.8 1548
#> 3 govtrust_bsa 1987-05-05  11.2 1410
#> 4 govtrust_bsa 1991-06-08  14.1 1445
#> 5 govtrust_bsa 1994-07-26  21.2 1137
#> 6 govtrust_bsa 1996-06-01  23.5 1180

We can see how many questions there are in the data and how often each appears:

library(dplyr) 
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
jennings %>% 
  group_by(variable) %>% 
  tally() %>% 
  arrange(desc(n))
#> # A tibble: 37 × 2
#>    variable         n
#>    <chr>        <int>
#>  1 eb_trustprl     30
#>  2 eb_trustgov     29
#>  3 govtrust_bsa    20
#>  4 h_govsys        20
#>  5 trust_mori2     18
#>  6 trust_mori1     17
#>  7 bsa_poltrust    15
#>  8 bsa_MPs         14
#>  9 bsa_votes       14
#> 10 bes_mpstrust     8
#> # ℹ 27 more rows

We can now estimate the dyad ratios algorithm:

jennings_out <- extract(
  varname = jennings$variable, 
  date = jennings$date, 
  index = jennings$value, 
  ncases = jennings$n, 
  begindt = min(jennings$date), 
  enddt = max(jennings$date), 
  npass=1
)

We can print the information about the estimates with:

print(jennings_out)
#> 
#> Estimation report:
#> Period: 1944  to 2016 ,  73  time points
#> Number of series:  37 
#> Number of usable series:  37 
#> Exponential smoothing:  Yes 
#> 
#> Iteration history: 
#>  
#>    Dimension Iter Convergence Criterion Reliability Alphaf Alphab
#>  Dimension 1    1      0.1364     0.001       0.923  0.500    0.5
#>  Dimension 1    2      0.0159     0.001       0.920  0.507    0.5
#>  Dimension 1    3      0.0041     0.001       0.918  0.510    0.5
#>  Dimension 1    4      0.0011     0.001       0.917  0.511    0.5
#>  Dimension 1    5       3e-04     0.001       0.917  0.511    0.5
#>  
#> Total Variance to be Explained =  3.49 
#> 
#> Percent Variance Explained: 
#>    Dimension Variance Proportion
#>  Dimension 1     1.77      0.506
#>  
#> Final Weighted Average Metric:  Mean:  51.72  St. Dev:  4.54

Our latent dimension explains about \(50\%\) of the variance in the raw series. We could also look at the loadings:

summary(jennings_out)
#> Variable Loadings and Descriptive Information: Dimension 1
#> Variable Name Cases Loading    Mean  Std Dev 
#>  bes_govtrust     2    1.0000 10.9000  1.700000 
#>  bes_mpstrust     3    0.9987 52.7362  3.119561 
#>     bes_nosay     4    0.9902 51.4250  5.053897 
#>   bes_parties     2    1.0000 23.1822  5.119276 
#>  bes_polmoney     3    0.6010 58.5526  1.032933 
#> bes_polstrust     3    0.5115 52.0000  6.531973 
#>   bes_wmtrust     2    1.0000 35.5000  5.500000 
#>       bsa_MPs    14    0.3353 74.0429  2.244585 
#> bsa_govtrust2     3   -0.7752 61.6667  0.713364 
#>   bsa_parties     7    0.6127 68.2143  3.015876 
#>      bsa_pols     2    1.0000 43.5000  2.500000 
#>  bsa_poltrust    15    0.3153 91.9200  1.534579 
#>     bsa_votes    14    0.6426 72.9071  3.774627 
#>   cspl_pubstd     5    0.9347 18.6000  6.468385 
#>     eb_polcor     2   -1.0000 59.7100  1.670000 
#>   eb_trustgov    16    0.8686 64.9280  7.005034 
#>   eb_trustprl    17    0.9451 59.3270  7.114013 
#>    efficacy_g     2    1.0000 69.5000  1.500000 
#> ess_trustparl     7    0.4183 48.5786  2.430828 
#>  ess_trustpol     7    0.0927 61.8500  2.343636 
#>   govtrust2_m     2    1.0000 45.5000  2.500000 
#>  govtrust_bsa    20    0.8655 24.7150  7.980932 
#>    govtrust_m     2    1.0000 69.0063  1.993737 
#>      h_govsys    19    0.5766 62.6721  7.635968 
#>      h_mpssat     5    0.6464 38.0000  3.098387 
#>     h_parlsat     6    0.6778 34.3333  1.885618 
#>      improp_g     5    0.9095 57.6000  8.114185 
#>      mpgain_m     4    0.9627 57.0000  9.617692 
#>     pollies_g     4    0.9532 82.2500  3.832427 
#>      polmor_g     4    0.7075 54.2500 11.431863 
#>        pols_g     3    0.9913 40.3333  5.557777 
#>     pols_mori     6    0.4465 53.6667  5.120764 
#>       spint_g     2    1.0000 72.0000  5.000000 
#>   trust_mori1    17    0.3702 73.4118  3.465100 
#>   trust_mori2    18    0.6150 75.6111  3.111607 
#>    trustmps_m     4    0.8029 68.5006  5.852612 
#>    trustown_m     4   -0.4189 40.7500  4.023369

Some of these have much higher loadings than others, which means they are more reliable indicators of government (dis)trust. For example, govtrust_bsa (actual question from the BSA: “Do not trust British governments to place needs of the nation above interests of their own party?”) is a very reliable indicator with observations in 20 different time periods. On the other hand, bsa_MPs (actual question from the BSA: “Those we elect as MPs lose touch with people pretty quickly”) has a much lower loading, indicating it is a much less reliable indicator of (dis)trust. We can also make a plot of mood:

plot(jennings_out)

Finally, to retrieve the latent dimension and the periods, we can use the get_mood function:

ests <- get_mood(jennings_out)
ests
#>    period  latent1
#> 1    1944 49.08698
#> 2    1945 47.69880
#> 3    1946 47.00470
#> 4    1947 46.65765
#> 5    1948 46.48413
#> 6    1949 46.39737
#> 7    1950 46.35399
#> 8    1951 46.33230
#> 9    1952 46.32145
#> 10   1953 46.31603
#> 11   1954 46.31332
#> 12   1955 46.31196
#> 13   1956 46.31128
#> 14   1957 46.31095
#> 15   1958 46.31078
#> 16   1959 46.31069
#> 17   1960 46.31065
#> 18   1961 46.31063
#> 19   1962 46.31062
#> 20   1963 46.31061
#> 21   1964 46.31061
#> 22   1965 46.31061
#> 23   1966 46.31061
#> 24   1967 46.31061
#> 25   1968 46.31061
#> 26   1969 48.13153
#> 27   1970 49.04199
#> 28   1971 49.49722
#> 29   1972 50.11777
#> 30   1973 50.97421
#> 31   1974 52.69475
#> 32   1975 53.54804
#> 33   1976 53.97128
#> 34   1977 54.18123
#> 35   1978 54.28539
#> 36   1979 54.33708
#> 37   1980 54.36273
#> 38   1981 54.37545
#> 39   1982 54.38177
#> 40   1983 54.38491
#> 41   1984 52.97968
#> 42   1985 52.27706
#> 43   1986 51.47441
#> 44   1987 50.83300
#> 45   1988 50.92375
#> 46   1989 50.97084
#> 47   1990 51.66014
#> 48   1991 52.05198
#> 49   1992 49.92513
#> 50   1993 54.62718
#> 51   1994 55.22873
#> 52   1995 58.85351
#> 53   1996 56.35075
#> 54   1997 56.99313
#> 55   1998 55.04247
#> 56   1999 52.54010
#> 57   2000 53.95874
#> 58   2001 54.32868
#> 59   2002 54.54125
#> 60   2003 55.65126
#> 61   2004 55.84265
#> 62   2005 55.27056
#> 63   2006 55.86224
#> 64   2007 55.52446
#> 65   2008 56.40909
#> 66   2009 59.04136
#> 67   2010 59.21175
#> 68   2011 59.26123
#> 69   2012 59.80284
#> 70   2013 59.73838
#> 71   2014 59.50999
#> 72   2015 57.28657
#> 73   2016 60.30479
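
The returned data frame can be handed to any plotting or modeling tool. For instance, a quick ggplot2 sketch of the series (assuming ggplot2 is installed; period is numeric here because of the annual aggregation):

library(ggplot2)

# plot the latent series returned by get_mood()
ggplot(ests, aes(x = period, y = latent1)) +
  geom_line() +
  labs(x = "Year", y = "Political distrust (dimension 1)")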

References

Jennings, W., N. Clarke, J. Moss, and G. Stoker (2017). “The Decline in Diffuse Support for National Politics: The Long View on Political Discontent in Britain.” Public Opinion Quarterly 81(3): 748-758.

Stimson, James A. 2018. “The Dyad Ratios Algorithm for Estimating Latent Public Opinion: Estimation, Testing, and Comparison to Other Approaches.” Bulletin of Sociological Methodology 137-138: 201-218.