---
title: "Case Study: Working with Eurobarometer surveys"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Case Study: Working with Eurobarometer surveys}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(retroharmonize)
load(file = system.file(
  "eurob", "eurob_vignette.rda", package = "retroharmonize"))
```

The goal in this case study is to analyze trust in the national and European parliaments, and in the European Commission, in Europe, with data from the Eurobarometer. 

The [Eurobarometer](https://ec.europa.eu/commfrontoffice/publicopinion/index.cfm) is a biannual survey conducted by the European Commission with the goal of monitoring the public opinion of populations of EU member states and -- occasionally -- also in candidate countries. Each EB wave is devoted to a particular topic, but most waves ask some "trend questions", i.e. questions that are repeated frequently in the same form. Trust in institutions is among such trend questions.

The [Eurobarometer](https://ec.europa.eu/commfrontoffice/publicopinion/index.cfm) data
[Eurobarometer](https://ec.europa.eu/commfrontoffice/publicopinion/index.cfm) raw data and related documentation (questionnaires, codebooks, etc.) are made available by *GESIS*, *ICPSR* and through the *Social Science Data Archive* networks. You should cite your source, in our examples, we rely on the [GESIS](https://www.gesis.org/en/eurobarometer-data-service/search-data-access/data-access) data files. In this case study we use nine waves of the Eurobarometer between 1996 and 2019: 
[44.2bis (January-March 1996)](https://search.gesis.org/research_data/ZA2828), [51.0 (March-April 1999)](https://search.gesis.org/research_data/ZA3171), [57.1 (March-May 2002)](https://search.gesis.org/research_data/ZA3639), [64.2 (October-November 2005)](https://search.gesis.org/research_data/ZA4414), [69.2 (Mar-May 2008)](https://search.gesis.org/research_data/ZA4744)
[75.3 (May 2011)](https://search.gesis.org/research_data/ZA5481), [81.2 (March 2014)](https://search.gesis.org/research_data/ZA5913), [87.3 (May 2017)](https://search.gesis.org/research_data/ZA6863), and [91.2 (March 2019)](https://search.gesis.org/research_data/ZA7562).  

In the [Afrobaromter Case Study](https://retroharmonize.dataobservatory.eu/articles/afrobarometer.html) we have shown how to merge two waves of a survey with a limited number of variables.  This workflow is not feasible with Eurobarometer on a PC or laptop, because there are too many large files to handle. 


```{r setup, message=FALSE}
library(retroharmonize)
library(dplyr)
```


```{r import-display, eval=FALSE}
eurobarometer_waves <- file.path("working", dir("working"))
eb_waves <- read_surveys(eurobarometer_waves, .f='read_rds')
```

```{r import-here, echo=FALSE}
working_dir  <- here::here("working")
eurobarometer_waves <- file.path(working_dir, dir(working_dir))
eb_waves <- read_surveys(eurobarometer_waves, .f='read_rds')
```

We can review if the main descriptive metadata is correctly present with `document_waves()`.

```{r document-waves, eval=FALSE}
documented_eb_waves <- document_waves(eb_waves) 
```

## Metadata map

We start by extracting metadata from the survey data files and storing them in a tidy table, where each row contains information about a variable from the survey data file. To keep the size manageable, we keep only a few variables: the row ID, the weighting variable, the country code, and variables that contain "parliament" or "commission" in their labels.

```{r metadata, eval=FALSE}
eb_trust_metadata <- lapply ( X = eb_waves, FUN = metadata_create )
eb_trust_metadata <- do.call(rbind, eb_trust_metadata)
#let's keep the example manageable:
eb_trust_metadata  <- eb_trust_metadata %>%
  filter ( grepl("parliament|commission|rowid|weight_poststrat|country_id", var_name_orig) )
```

```{r head}
head(eb_trust_metadata)
```

The value labels in this example are not too numerous.  The only variable that stands out is the one with `Can rely on` and `Cannot rely on` labels. 

```{r valid-labels}
collect_val_labels(eb_trust_metadata)
```

The following labels were marked by GESIS as missing values:

```{r missing-labels}
collect_na_labels(eb_trust_metadata)
```

We have created a helper function `subset_save_survey()` that programmatically reads in SPSS files, makes the necessary type conversion to `labelled_spss_survey()` without harmonization, and saves a small, subsetted `rds` file. Because this is a native R file, it is far more efficient to handle in the actual workflow.

```{r tempdir}
## You will likely use your own local working directory, or
## tempdir() that will create a temporary directory for your 
## session only. 
working_directory <- tempdir()
```


```{r subsetting, eval=FALSE}
# This code is for illustration only, it is not evaluated.
# To replicate the worklist, you need to have the SPSS file names 
# as a list, and you have to set up your own import and export path.

selected_eb_metadata <- readRDS(
  system.file("eurob", "selected_eb_waves.rds", package = "retroharmonize")
  ) %>%
  mutate ( id = substr(filename,1,6) ) %>%
  rename ( var_label = var_label_std ) %>%
  mutate ( var_name = var_label )

## This code is not evaluated, it is only an example. You are likely 
## to have a directory where you have already downloaded the data
## from GESIS after accepting their term use.

subset_save_surveys ( 
  var_harmonization = selected_eb_metadata, 
  selection_name = "trust",
  import_path = gesis_dir, 
  export_path = working_directory )
```

## Harmonize the labels

For easier looping we adopt the `harmonize_values()` function with new default settings. It would be tempting to preserve the `rely` labels as distinct from the `trust` labels, but if we use the same numeric coding, it will lead to confusion. If you want to keep the difference of the two type of category labels, than the harmonization should be done in a two-step process. 

```{r specificfunction}
harmonize_eb_trust <- function(x) {
  label_list <- list(
    from = c("^tend\\snot", "^cannot", "^tend\\sto", "^can\\srely",
             "^dk", "^inap", "na"), 
    to = c("not_trust", "not_trust", "trust", "trust",
           "do_not_know", "inap", "inap"), 
    numeric_values = c(0,0,1,1, 99997,99999,99999)
  )

  harmonize_values(x, 
                   harmonize_labels = label_list, 
                   na_values = c("do_not_know"= 99997,
                                 "declined"   = 99998,
                                 "inap"       = 99999 )
  )
}
```

Let's see if things did work out fine:

```{r, eval=FALSE}
document_waves(eb_waves)
```
```{r, echo=FALSE}
documented_eb_waves
```

To review the harmonization on a single survey use `pull_survey()`. 

```{r, test-trust, eval=FALSE}
test_trust <- pull_survey(eb_waves, filename = "ZA4414_trust.rds")
```

Before running our adapted harmonization function, we have this:

```{r, pulled-check}
test_trust$trust_european_commission[1:16]
```

After performing harmonization, it would look like this:

```{r}
harmonize_eb_trust(x=test_trust$trust_european_commission[1:16])
```

If you are satisfied with the results, run `harmonize_eb_trust()` through the 9 survey waves. Whenever a variable is missing from a wave, it is filled up with `inapproriate` missing values.

## Harmonize waves

We define a selection of countries: Belgium, Hungary, Italy, Malta, the Netherlands, Poland, Slovakia, and variables.

```{r ebwavesselected}
eb_waves_selected <- lapply ( eb_waves, function(x) x %>% select ( 
  any_of (c("rowid", "country_id", "weight_poststrat", 
            "trust_national_parliament", "trust_european_commission", 
            "trust_european_parliament"))) %>%
    filter ( country_id %in% c("NL", "PL", "HU", "SK", "BE", 
                               "MT", "IT")))

```


```{r harmonizewaves, eval=FALSE}
harmonized_eb_waves <- harmonize_waves ( 
  waves = eb_waves_selected, 
  .f = harmonize_eb_trust )
```

We cannot rely on `document_waves()` anymore, because the result is a single data frame. Let's have a look at the descriptive metadata.

```{r wavesattributes}
wave_attributes <- attributes(harmonized_eb_waves)
wave_attributes$id
wave_attributes$filename
wave_attributes$names
```

## Analyze the data

The harmonized data can be analyzed in R.  The labelled survey data is stored in `labelled_spss_survey()` vectors, which is a complex class that retains much metadata for reproducibility. Most statistical R packages do not know it. To them, the data should be presented either as numeric data with `as_numeric()` or as categorical with `as_factor()`.  (See more why you should not fall back on the more generic `as.factor()` or `as.numeric()` methods in [The labelled_spss_survey class vignette.](https://retroharmonize.dataobservatory.eu/articles/labelled_spss_survey.html))

First, let's treat the trust variables as factors. A summary of the resulting data allows us to screen for values that are outside of the expected range. In the trust variables, any values other than "trust" and "not trust" that are not defined as missing, are unacceptable. In our example, this is not the case. In our example, this is not the case. 

We also see some basic information about the weighting factors, which in the selected Eurobarometer subset range from below 0.01 to almost 7. The range of these values is pretty large, which needs to be taken into account when analyzing the data.

```{r factor}
harmonized_eb_waves %>%
  mutate_at ( vars(contains("trust")), as_factor ) %>%
  summary()
```

Now we convert the trust variables to numeric format, and look at the summary. Following the conversion, we lost information about the type of the missing values - now they are all lumped together as `NA`. What we gained is the proportion of positive responses (which ranges between 0.41 for trust in the national parliament and 0.63 for trust in the European Parliament), and the ability to, e.g., construct scales of the binary variables.

```{r numericrepr}
numeric_harmonization <- harmonized_eb_waves %>%
  mutate_at ( vars(contains("trust")), as_numeric )
summary(numeric_harmonization)
```

Finally, let's calculate weighted means of trust in the national parliament, the European Parliament, and the European Commission, for the selected countries, across all EB waves.

```{r numericharmonization}
numeric_harmonization %>%
  group_by(country_id) %>%
  summarize_at ( vars(contains("trust")), 
                 list(~mean(.*weight_poststrat, na.rm=TRUE))) 
```