extract()
is an R function which implements the dyad
ratios algorithm described in Stimson (2018) for constructing a latent
time series from the raw materials of survey research marginals. It
inputs data in archival form, as dated survey results, aggregates them
into regular periods selected at run time, and extracts the latent
dimension(s) which best account for the raw series.
Although its central mathematics is entirely different - based upon ratios of comparable observed series rather than covariance of “variables” - its behavior is strongly analogous to extraction of a “factor score” in standard (iterative) principal components analysis.
The routine is called from R by the command:
output <- extract(varname, date, index, ncases=NULL,
                  unit="A", mult=1, begindt=NA, enddt=NA, npass=1,
                  smoothing=TRUE, endmonth=12)
The first three arguments must be specified by the user; the others are optional:
varname
is an \(n\)-length vector containing names given to
input series. The routine assumes that any two observations with the
same name are comparable and that any change in name signals
noncomparability. Ratios are computed only between comparable
observations.
date
is an \(n\)-length vector of dates, typically one
of the dates of survey field work (e.g., first, last, or median day).
This should be recorded in a date format. For example, if you had the
string “2025-01-01”, you could use
lubridate::ymd("2025-01-01")
to convert it to a date
format.
index
is an \(n\)-length vector of the numeric summary
value of the result. It might be a percent or proportion responding in a
single category, (e.g., the “approve” response in presidential approval)
or some multi-response summary, for example,
\[\text{Index} = \frac{\text{Percent Agree}}{\text{Percent Agree} + \text{Percent Disagree}}\]
Interpretation of the derived latent dimension is easier if the index is coded so that polarity is the same across items (for example, if the concept being measured is liberalism, high values always stand for the more liberal response), but as in principal components analysis, the routine deals appropriately with polarity switches.
ncases
is an \(n\)-length vector giving the number of cases for each value in
index, typically the sample size. It is used during aggregation to
produce a weighted average when multiple readings fall in one
aggregation period. If that does not occur, or if the user prefers an
unweighted average, then ncases=NULL (or simply omitting the ncases
argument) turns off case weighting. Where a mix of available and
missing ncases values is supplied, 0 or NA values are reset to 1000.
unit
is the aggregation period: “D” (daily), “M” (monthly), “Q” (quarterly),
“A” (annual, default), or “O” (other, for multi-year aggregation).
mult
is the number of years, used only with unit option “O”.
begindt
is the beginning date for analysis. Default
(NA
) determines beginning date from the earliest date in
the data. Entries for beginning and ending date use the ISOdate
function. For example, to start an analysis in January, 1993, enter
ISOdate(1993,1,1)
. (As always, R is case sensitive. So
“ISO” must be caps and “date” lower case.)
enddt
is the ending date for analysis. Default
(NA
) determines ending date from the latest date in the
data.
Warning: The routine cannot determine the earliest or latest dates from items that are not actually used in the analysis. The criterion for usage is that an item must appear in more than one period after aggregation. So if the beginning or ending date is determined by an item that is discarded because it does not meet this criterion, the routine will fail.
smoothing
specifies whether or not exponential
smoothing is applied to intermediate estimates during the iterative
solution process. Default is TRUE.
npass
is the number of dimensions to be extracted; it can be 1 or 2 and
defaults to 1.
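Putting several of these arguments together, here is a hedged sketch of a call that computes an index, aggregates quarterly with case weighting, and fixes the analysis window with ISOdate. The data frame polls and its columns (item, fielddate, agree, disagree, n) are hypothetical stand-ins for your own archival survey marginals, and the toy data are far too sparse for a meaningful solution; only the extract() arguments themselves come from the documentation above.
library(DyadRatios)

# Hypothetical archival survey marginals (replace with your own data).
polls <- data.frame(
  item      = c("trust_q1", "trust_q1", "trust_q1", "trust_q2", "trust_q2", "trust_q2"),
  fielddate = lubridate::ymd(c("1993-02-10", "1993-09-14", "1994-06-05",
                               "1993-03-01", "1994-01-20", "1995-03-15")),
  agree     = c(55, 51, 48, 40, 38, 37),
  disagree  = c(38, 41, 44, 52, 54, 55),
  n         = c(1012, 1030, 998, 1504, 1498, 1489)
)

# Index coded with common polarity across items, as described under index above.
polls$index <- 100 * polls$agree / (polls$agree + polls$disagree)

result <- extract(
  varname = polls$item,
  date    = polls$fielddate,
  index   = polls$index,
  ncases  = polls$n,    # case weighting during aggregation
  unit    = "Q",        # quarterly aggregation
  begindt = ISOdate(1993, 1, 1),
  enddt   = ISOdate(1995, 12, 31),
  npass   = 1
)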
The routine expects a date variable of R class Date.
Generally, import packages like foreign
,
readr
, haven
, rio
will try to
turn date-looking variables into date objects. The str()
function will identify if your variable is a date or character. We can
see from the output below that the date
variable in the
jennings
dataset is indeed in a date
format.
library(DyadRatios)
data(jennings)
str(jennings$date)
#> Date[1:295], format: "1973-12-01" "1986-05-24" "1987-05-05" "1991-06-08" "1994-07-26" ...
If your variable is not already in a date format, you can transform
it as such with the functions from the lubridate
package.
For example, if your date variable is in the format “2025-01-01”, you
can use the ymd()
function to convert it to a date format
as follows:
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
temp <- data.frame(survey_date = c("2025-01-01", "2024-10-13", "2020-05-13"))
str(temp)
#> 'data.frame': 3 obs. of 1 variable:
#> $ survey_date: chr "2025-01-01" "2024-10-13" "2020-05-13"
temp$survey_date <- ymd(temp$survey_date)
str(temp)
#> 'data.frame': 3 obs. of 1 variable:
#> $ survey_date: Date, format: "2025-01-01" "2024-10-13" ...
See how the class changed from chr
to
Date
.
If you happened to have three variables for month, day and year, you
could paste them together and use the lubridate
functions:
temp <- data.frame(year = c(2025, 2024, 2020),
                   month = c(1, 10, 5),
                   day = c(1, 13, 13))
temp$date <- lubridate::ymd(paste(temp$year, temp$month, temp$day, sep="-"))
str(temp)
#> 'data.frame': 3 obs. of 4 variables:
#> $ year : num 2025 2024 2020
#> $ month: num 1 10 5
#> $ day : num 1 13 13
#> $ date : Date, format: "2025-01-01" "2024-10-13" ...
Warning: The lubridate functions will not handle fake dates (for example, 1-32-05). They decode only dates that actually existed on past calendars or will exist on future ones (e.g., no February 29 unless the year is actually a leap year).
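For example, an impossible date simply fails to parse and comes back as NA (with a warning):
ymd("2005-01-32")   # day 32 does not exist: NA, with a parse warning
ymd("2023-02-29")   # 2023 is not a leap year: NA, with a parse warning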
Ideally, the dates will be in a structured format. The lubridate functions will even parse string dates with words, e.g., “January 1, 1993” or “Aug 2, 2020” so long as the strings have month, day and year in the same position in the date.
temp <- data.frame(date = c("January 1, 1993", "Jan 1, 1993", "Aug 2, 2020"))
temp$date <- mdy(temp$date)
The formats produced by importing Excel or CSV documents will not be
identical to those produced by the lubridate functions, but they will
still work with the extract() function because their components can
still be extracted by the lubridate functions month(), day() and
year().
temp <- data.frame(date = ymd("2025-01-01", "1990-01-01", "1970-01-01", "1950-01-01"))
xl_temp <- tempfile(fileext = ".xlsx")
csv_temp <- tempfile(fileext = ".csv")
rio::export(temp, xl_temp)
rio::export(temp, csv_temp)
temp_xl <- rio::import(xl_temp)
temp_csv <- rio::import(csv_temp)
str(temp_xl)
#> 'data.frame': 4 obs. of 1 variable:
#> $ date: POSIXct, format: "2025-01-01" "1990-01-01" ...
str(temp_csv)
#> 'data.frame': 4 obs. of 1 variable:
#> $ date: IDate, format: "2025-01-01" "1990-01-01" ...
Notice that from Excel you get a POSIXct variable and from CSV you get
an IDate object. They still represent the same underlying dates as the
lubridate version, and all values pass the equivalence test.
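A quick sketch of that check, using the lubridate accessors just mentioned (temp, temp_xl and temp_csv come from the chunk above):
# The date components agree regardless of storage class (Date, POSIXct, IDate).
all(year(temp_xl$date)   == year(temp$date))
all(month(temp_csv$date) == month(temp$date))
all(day(temp_xl$date)    == day(temp$date))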
The extract function produces as output 8 categories of information:
formula
reproduces the user call.
setup
supplies basic information about options and the iterative solution.
period
is a list of the aggregation periods, for example, 2005.02 for February, 2005.
varname
is a list, in order, of the variables actually used in the analysis, a subset of all those in the data.
loadings
are the item-scale correlations from the final solution. Their square is the validity estimate used in weighting.
means and std.deviations
document the item descriptive information.
latent
is the estimated time series, the purpose of everything. The variable latent1 is for the first dimension and latent2 for the second, if applicable.
hist
is the iteration history.
totalvar
is the total variance to be explained.
var_exp
is the variance explained by each dimension.
dropped
lists any series dropped from the analysis.
smoothing
records whether smoothing was applied during the iterative solution process.
The raw output object created at run time contains everything of interest, but in an inconvenient format. There are several functions that can be used to display results in different ways.
plot
displays a time series ggplot of the estimated dimension(s) on the y
axis against time units on the x axis. It is called with
plot(object).
print
displays the output of the iterative solution process. This is what was
previously printed with verbose=TRUE, which has been removed in favor
of a print method for the function output.
summary
displays information about the raw series of
the analysis. Under the heading: “Variable Name Cases Loading Mean Std
Dev” it lists as many series as are used in the solution, giving
variable name, number of cases (after aggregation), dimension loadings,
and means and standard deviations of the raw series.
get_mood
retrieves the period and latent dimension
estimate(s) from the model object and returns them in a data
frame.
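As a quick usage sketch, assume output holds the result of the extract() call shown at the top of this page. The last line is only a suggestion for inspecting the raw object directly and assumes it is returned as a named list whose element names match the labels above.
print(output)             # iteration report and variance explained
summary(output)           # loadings, means, and standard deviations per item
plot(output)              # time series ggplot of the latent dimension(s)
mood <- get_mood(output)  # data frame of periods and latent estimate(s)

# Raw access (assumes a named-list return value; check the names first).
names(output)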
Correlations, in the case of time series, measure whether two series vary in or out of phase. Thus the cross-sectional interpretation of a negative correlation, that two items are inversely related, does not hold. It is not unusual to observe negative “loadings” in extract analyses. They mean only that items move out of phase, not that they are opposites.
Assume \(N\) survey results, each coded into a meaningful single indicator. These results consist of \(n\) subsets of comparable items measured at different times, \(t = 1, \ldots, T\). Designate each result \(x_{it}\), where the subscript \(i\) indicates item and \(t\) indicates aggregation period, \(1, \ldots, T\).
The starting assumption is that ratios, \(r_{i,t+k} = \frac{x_{i,t+k}}{x_{it}}\), of the comparable item \(i\) at different times will be meaningful indicators of the latent concept to be extracted. Metric information is thus lost, which is desirable because, absent a science of question wording, we have no ability to compare different items. If there were no missing observations, then for each item \(i\), we could let \(r_{i1} = 1.0\) and observe the complete set of ratios, \(r_{i2}, r_{i3}, \ldots, r_{iT}\). Then an average across the \(n\) items forms an excellent estimate of the latent concept \(\theta_{t}\):
\[\theta_{t} = \frac{\sum_{i=1}^{n}r_{it}}{n}\]
But we do have missing \(x\)'s, and in the typical case it is most of them. We would still be in good shape if we had a common period, say period 1, which was available for all series. We could then form the sum from the \(k\) available items, \(k \leq n\), and divide by \(k\). But we also lack such a common period. That motivates a recursive approach.
Forward Recursion: Begin by selecting that subset of items which are available at time 1. For them we can form \(\hat{\theta}\) for \(t=1,\ldots, T\) setting \(\hat{\theta}_{1} = 1.0\) and calculating \(\hat{\theta}_{2}, \ldots, \hat{\theta}_{T}\) from whatever subsets of items are available. Now proceed to period 2, setting \(\hat{\theta}_{2}\) to that value estimated from period 1 and now, using the subset of items which include period 2, estimating \(\hat{\theta}_{3}, \ldots, \hat{\theta}_{T}\) from the assumption that \(\theta_{2} = \hat{\theta}_{2}\). By projecting \(\hat{\theta}_{2}\) forward in this manner, the estimates for periods 3 through \(T\) become comparable to what they would have been had period 1 information been available. This procedure is repeated one period at a time through \(T-1\), giving from 1 to \(T-1\) different estimates of each of the \(\theta_{t}\). An average of all of them becomes \(\hat{\theta}_{t}\).
Backward Recursion: It will be seen that forward recursion very heavily weights early information relative to later information. Period 1 contributes to all subsequent estimates, whereas period \(T-1\) contributes only to \(T\), and period \(T\) only to itself. Thus the direction of recursion matters. Employing the same procedure backward puts a different weight on the items and gives a comparable, but not identical, set of estimates. Thus a more efficient set of estimates, one weighting all information equally, can be gained by averaging the two recursive estimates. (And the correlation between forward and backward series becomes a reliability estimate.)
As in iterated principal components we both make assumptions about item validity and then, post hoc, have the ability to observe empirical estimates of validities (the square of item/scale correlations). At the beginning validities are assumed to be 1.0 for all items. Then the empirically estimated validities become assumed validities for the next iteration. This procedure is repeated until the difference between assumed and estimated validities is effectively zero for all items, the maximum item discrepancy less than .001.
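Before turning to a real example, here is a toy sketch of the complete-data case described above: no missing observations, ratios taken against period 1, and the latent estimate formed as the average ratio per period. It illustrates only the ratio-averaging idea, not the package's recursive, validity-weighted estimator.
# Three comparable items observed in all six periods (toy numbers).
x <- rbind(
  item1 = c(50, 52, 55, 53, 58, 60),
  item2 = c(30, 31, 33, 32, 35, 36),
  item3 = c(70, 73, 77, 74, 81, 84)
)

# Ratios r_it = x_it / x_i1, so r_i1 = 1 for every item.
r <- x / x[, 1]

# Estimate of the latent series: the average ratio across items in each period.
theta_hat <- colMeans(r)
round(theta_hat, 3)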
We illustrate the use of extract with the data from Jennings et al.
(2017), who graciously shared their data through Dataverse. The first
few observations show the question names, dates, percentage distrusting
government, and sample size.
library(DyadRatios)
data(jennings)
head(jennings)
#> variable date value n
#> 1 govtrust_bsa 1973-12-01 15.2 1857
#> 2 govtrust_bsa 1986-05-24 11.8 1548
#> 3 govtrust_bsa 1987-05-05 11.2 1410
#> 4 govtrust_bsa 1991-06-08 14.1 1445
#> 5 govtrust_bsa 1994-07-26 21.2 1137
#> 6 govtrust_bsa 1996-06-01 23.5 1180
We can see how many questions there are in the data:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
jennings %>%
group_by(variable) %>%
tally() %>%
arrange(desc(n))
#> # A tibble: 37 × 2
#> variable n
#> <chr> <int>
#> 1 eb_trustprl 30
#> 2 eb_trustgov 29
#> 3 govtrust_bsa 20
#> 4 h_govsys 20
#> 5 trust_mori2 18
#> 6 trust_mori1 17
#> 7 bsa_poltrust 15
#> 8 bsa_MPs 14
#> 9 bsa_votes 14
#> 10 bes_mpstrust 8
#> # ℹ 27 more rows
We can now estimate the latent series with the dyad ratios algorithm:
jennings_out <- extract(
  varname = jennings$variable,
  date = jennings$date,
  index = jennings$value,
  ncases = jennings$n,
  begindt = min(jennings$date),
  enddt = max(jennings$date),
  npass = 1
)
We can print the information about the estimates with:
print(jennings_out)
#>
#> Estimation report:
#> Period: 1944 to 2016 , 73 time points
#> Number of series: 37
#> Number of usable series: 37
#> Exponential smoothing: Yes
#>
#> Iteration history:
#>
#> Dimension Iter Convergence Criterion Reliability Alphaf Alphab
#> Dimension 1 1 0.1364 0.001 0.923 0.500 0.5
#> Dimension 1 2 0.0159 0.001 0.920 0.507 0.5
#> Dimension 1 3 0.0041 0.001 0.918 0.510 0.5
#> Dimension 1 4 0.0011 0.001 0.917 0.511 0.5
#> Dimension 1 5 3e-04 0.001 0.917 0.511 0.5
#>
#> Total Variance to be Explained = 3.49
#>
#> Percent Variance Explained:
#> Dimension Variance Proportion
#> Dimension 1 1.77 0.506
#>
#> Final Weighted Average Metric: Mean: 51.72 St. Dev: 4.54
Our latent dimension explains about \(50\%\) of the variance in the raw series. We could also look at the loadings:
summary(jennings_out)
#> Variable Loadings and Descriptive Information: Dimension 1
#> Variable Name Cases Loading Mean Std Dev
#> bes_govtrust 2 1.0000 10.9000 1.700000
#> bes_mpstrust 3 0.9987 52.7362 3.119561
#> bes_nosay 4 0.9902 51.4250 5.053897
#> bes_parties 2 1.0000 23.1822 5.119276
#> bes_polmoney 3 0.6010 58.5526 1.032933
#> bes_polstrust 3 0.5115 52.0000 6.531973
#> bes_wmtrust 2 1.0000 35.5000 5.500000
#> bsa_MPs 14 0.3353 74.0429 2.244585
#> bsa_govtrust2 3 -0.7752 61.6667 0.713364
#> bsa_parties 7 0.6127 68.2143 3.015876
#> bsa_pols 2 1.0000 43.5000 2.500000
#> bsa_poltrust 15 0.3153 91.9200 1.534579
#> bsa_votes 14 0.6426 72.9071 3.774627
#> cspl_pubstd 5 0.9347 18.6000 6.468385
#> eb_polcor 2 -1.0000 59.7100 1.670000
#> eb_trustgov 16 0.8686 64.9280 7.005034
#> eb_trustprl 17 0.9451 59.3270 7.114013
#> efficacy_g 2 1.0000 69.5000 1.500000
#> ess_trustparl 7 0.4183 48.5786 2.430828
#> ess_trustpol 7 0.0927 61.8500 2.343636
#> govtrust2_m 2 1.0000 45.5000 2.500000
#> govtrust_bsa 20 0.8655 24.7150 7.980932
#> govtrust_m 2 1.0000 69.0063 1.993737
#> h_govsys 19 0.5766 62.6721 7.635968
#> h_mpssat 5 0.6464 38.0000 3.098387
#> h_parlsat 6 0.6778 34.3333 1.885618
#> improp_g 5 0.9095 57.6000 8.114185
#> mpgain_m 4 0.9627 57.0000 9.617692
#> pollies_g 4 0.9532 82.2500 3.832427
#> polmor_g 4 0.7075 54.2500 11.431863
#> pols_g 3 0.9913 40.3333 5.557777
#> pols_mori 6 0.4465 53.6667 5.120764
#> spint_g 2 1.0000 72.0000 5.000000
#> trust_mori1 17 0.3702 73.4118 3.465100
#> trust_mori2 18 0.6150 75.6111 3.111607
#> trustmps_m 4 0.8029 68.5006 5.852612
#> trustown_m 4 -0.4189 40.7500 4.023369
Some of these have much higher loadings than others. That means they
are more reliable indicators of government (dis)trust than others. For
example, govtrust_bsa (actual question from the BSA: “Do not trust
British governments to place needs of the nation above interests of
their own party?”) is a very reliable indicator with observations in 20
different time periods. On the other hand, bsa_MPs (actual question
from the BSA: “Those we elect as MPs lose touch with people pretty
quickly”) has a much lower loading, indicating it is a much less
reliable indicator of (dis)trust. We can also make a plot of mood with
the plot method:
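# Time series plot of the latent dimension, via the plot method described above
plot(jennings_out)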
Finally, to retrieve the latent dimension and the periods, we can use
the get_mood
function:
ests <- get_mood(jennings_out)
ests
#> period latent1
#> 1 1944 49.08698
#> 2 1945 47.69880
#> 3 1946 47.00470
#> 4 1947 46.65765
#> 5 1948 46.48413
#> 6 1949 46.39737
#> 7 1950 46.35399
#> 8 1951 46.33230
#> 9 1952 46.32145
#> 10 1953 46.31603
#> 11 1954 46.31332
#> 12 1955 46.31196
#> 13 1956 46.31128
#> 14 1957 46.31095
#> 15 1958 46.31078
#> 16 1959 46.31069
#> 17 1960 46.31065
#> 18 1961 46.31063
#> 19 1962 46.31062
#> 20 1963 46.31061
#> 21 1964 46.31061
#> 22 1965 46.31061
#> 23 1966 46.31061
#> 24 1967 46.31061
#> 25 1968 46.31061
#> 26 1969 48.13153
#> 27 1970 49.04199
#> 28 1971 49.49722
#> 29 1972 50.11777
#> 30 1973 50.97421
#> 31 1974 52.69475
#> 32 1975 53.54804
#> 33 1976 53.97128
#> 34 1977 54.18123
#> 35 1978 54.28539
#> 36 1979 54.33708
#> 37 1980 54.36273
#> 38 1981 54.37545
#> 39 1982 54.38177
#> 40 1983 54.38491
#> 41 1984 52.97968
#> 42 1985 52.27706
#> 43 1986 51.47441
#> 44 1987 50.83300
#> 45 1988 50.92375
#> 46 1989 50.97084
#> 47 1990 51.66014
#> 48 1991 52.05198
#> 49 1992 49.92513
#> 50 1993 54.62718
#> 51 1994 55.22873
#> 52 1995 58.85351
#> 53 1996 56.35075
#> 54 1997 56.99313
#> 55 1998 55.04247
#> 56 1999 52.54010
#> 57 2000 53.95874
#> 58 2001 54.32868
#> 59 2002 54.54125
#> 60 2003 55.65126
#> 61 2004 55.84265
#> 62 2005 55.27056
#> 63 2006 55.86224
#> 64 2007 55.52446
#> 65 2008 56.40909
#> 66 2009 59.04136
#> 67 2010 59.21175
#> 68 2011 59.26123
#> 69 2012 59.80284
#> 70 2013 59.73838
#> 71 2014 59.50999
#> 72 2015 57.28657
#> 73 2016 60.30479
Jennings, W., N. Clarke, J. Moss, and G. Stoker (2017). “The Decline in Diffuse Support for National Politics: The Long View on Political Discontent in Britain.” Public Opinion Quarterly 81(3): 748-758.
Stimson, James A. (2018). “The Dyad Ratios Algorithm for Estimating Latent Public Opinion: Estimation, Testing, and Comparison to Other Approaches.” Bulletin of Sociological Methodology 137-138: 201-218.