---
title: "Introduction to summarytabl"
output: rmarkdown::html_vignette
description: >
  This document introduces you to some of summarytabl's most frequently
  used functions, and demonstrates how you can use them with data frames.
vignette: >
  %\VignetteIndexEntry{Introduction to summarytabl}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

Welcome to the `summarytabl` package! This package makes it easy to create simple tables for summarizing continuous, ordinal, and categorical data. This document introduces you to some of `summarytabl`'s most frequently used functions, along with examples of how to apply them to your data frames.

To begin, load `summarytabl`:

```{r setup}
library(summarytabl)
```

## Types of functions

This package provides three types of functions to help you summarize your data, one for each of the following:

1. Categorical data, such as binary (e.g., Unselected/Selected) or nominal (e.g., woman/man/non-binary) variables
2. Multiple response data, including binary (e.g., Unselected/Selected), multiple-response (e.g., never, sometimes, often), or ordinal-scale (e.g., strongly disagree to strongly agree) variables
3. Continuous data, like interval (e.g., test scores) and ratio-level (e.g., age) variables

Functions for categorical data start with `cat_`, those for multiple response data start with `select_`, and functions for continuous data start with `mean_`. To learn more about how these functions work, read the next few sections.

## Categorical variables

### Summarize a single categorical variable

The `cat_tbl()` function can be used to generate a frequency table for a categorical variable.

```{r}
cat_tbl(data = nlsy, var = "race")
```

You can exclude certain values and eliminate missing values from the data using the `ignore` and `na.rm` arguments, respectively.

```{r}
cat_tbl(data = nlsy,
        var = "race",
        ignore = "Black",
        na.rm = TRUE)
```

Finally, you can choose what information to return using the `only` argument.

```{r}
# Default: counts and percentages
cat_tbl(data = nlsy,
        var = "race",
        ignore = "Black",
        na.rm = TRUE)

# Counts only
cat_tbl(data = nlsy,
        var = "race",
        ignore = "Black",
        na.rm = TRUE,
        only = "count")

# Percents only
cat_tbl(data = nlsy,
        var = "race",
        ignore = "Black",
        na.rm = TRUE,
        only = "percent")
```

### Summarize a categorical variable grouped by another variable

To create a grouped frequency table for two categorical variables, use the `cat_group_tbl()` function.

```{r}
cat_group_tbl(data = nlsy,
              row_var = "gender",
              col_var = "bthwht")
```

As with `cat_tbl()`, you can exclude certain values and omit missing values (by row and/or column).

```{r}
cat_group_tbl(data = nlsy,
              row_var = "race",
              col_var = "bthwht",
              na.rm.row_var = TRUE,
              ignore = c(race = "Non-Black,Non-Hispanic"),
              pivot = "wider")
```

If you want to ignore more than one value per row or column, provide them in a named list:

```{r}
cat_group_tbl(data = nlsy,
              row_var = "race",
              col_var = "bthwht",
              na.rm.row_var = TRUE,
              ignore = list(race = c("Non-Black,Non-Hispanic", "Hispanic")),
              pivot = "wider")
```

Finally, you can choose what information to return using the `only` argument.

```{r}
# Default: counts and percentages
cat_group_tbl(data = nlsy,
              row_var = "race",
              col_var = "bthwht",
              na.rm.row_var = TRUE)

# Counts only
cat_group_tbl(data = nlsy,
              row_var = "race",
              col_var = "bthwht",
              na.rm.row_var = TRUE,
              only = "count")

# Percents only
cat_group_tbl(data = nlsy,
              row_var = "race",
              col_var = "bthwht",
              na.rm.row_var = TRUE,
              only = "percent")
```

## Multiple response variables

### Summarize a series of multiple response variables

With `select_tbl()`, you can produce a summary table for multiple response variables with the same variable stem. A variable stem is a common prefix found in related variable names, often corresponding to similar survey items, that represents a shared concept before unique identifiers.

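
To make the idea of a variable stem concrete, here is a minimal base R sketch that picks out the columns sharing a prefix. The column names are hypothetical, and treating the stem as a literal prefix is an assumption made for illustration; it is not a description of how `summarytabl` matches variables internally.

```r
# Hypothetical column names, not taken from any dataset in the package
col_names <- c("id", "dep_1", "dep_2", "dep_3", "race")

# Assumption: the stem is treated as a literal prefix of the variable name
stem <- "dep"
stem_vars <- col_names[startsWith(col_names, stem)]

# Variables that share the stem
stem_vars
```

Only the names beginning with the stem are kept; `id` and `race` are left out.
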
For example, the `depressive` dataset contains eight variables that share the same stem, with each one representing a different item (such as a statement, question, or indicator) used to measure depression:

```{r}
names(depressive)
```

With the `select_tbl()` function, you can summarize these responses to see how many survey respondents chose each answer option for every variable:

```{r}
select_tbl(data = depressive, var_stem = "dep")
```

You can also use the `ignore` and `na_removal` arguments to exclude values from the data and specify how missing values should be handled. By default, missing values are removed listwise, but you can set `na_removal` to `pairwise` for pairwise removal instead:

```{r}
# Default listwise removal, value '3' removed from data
select_tbl(data = depressive,
           var_stem = "dep",
           ignore = 3)

# Pairwise removal, value '3' removed from data
select_tbl(data = depressive,
           var_stem = "dep",
           ignore = 3,
           na_removal = "pairwise")
```

Set the `pivot` argument to `wider` to reshape the resulting table into a wide format. By default, the summary table is presented in the long format.

```{r}
# Default longer format
select_tbl(data = depressive, var_stem = "dep")

# Wider format
select_tbl(data = depressive,
           var_stem = "dep",
           pivot = "wider")
```

You can use the `var_labels` argument to include variable labels in your summary table to make the variable names easier to interpret:

```{r}
select_tbl(data = depressive,
           var_stem = "dep",
           pivot = "wider",
           var_labels = c(
             dep_1 = "how often child feels sad and blue",
             dep_2 = "how often child feels nervous, tense, or on edge",
             dep_3 = "how often child feels happy",
             dep_4 = "how often child feels bored",
             dep_5 = "how often child feels lonely",
             dep_6 = "how often child feels tired or worn out",
             dep_7 = "how often child feels excited about something",
             dep_8 = "how often child feels too busy to get everything"
           )
)
```

Finally, you can choose what information to return using the `only` argument.

```{r}
# Default: counts and percentages
select_tbl(data = depressive,
           var_stem = "dep",
           pivot = "wider")

# Counts only
select_tbl(data = depressive,
           var_stem = "dep",
           pivot = "wider",
           only = "count")

# Percents only
select_tbl(data = depressive,
           var_stem = "dep",
           pivot = "wider",
           only = "percent")
```

### Summarize a series of multiple response variables grouped by a variable or matching pattern

With `select_group_tbl()`, you can create a summary table for multiple response variables with the same variable stem, grouped either by another variable in your dataset or by matching a pattern in the variable names. For example, we often want to summarize survey responses by demographic variables like gender, age, or race:

```{r}
# Recode race and the dep_ items from numeric codes to text labels
dep_recoded <-
  depressive |>
  dplyr::mutate(
    race = dplyr::case_match(.x = race,
                             1 ~ "Hispanic",
                             2 ~ "Black",
                             3 ~ "Non-Black/Non-Hispanic",
                             .default = NA)
  ) |>
  dplyr::mutate(
    dplyr::across(
      .cols = dplyr::starts_with("dep"),
      .fns = ~ dplyr::case_when(.x == 1 ~ "often",
                                .x == 2 ~ "sometimes",
                                .x == 3 ~ "hardly ever")
    )
  )

# longer format
select_group_tbl(data = dep_recoded,
                 var_stem = "dep",
                 group = "race",
                 pivot = "longer")

# wider format
select_group_tbl(data = dep_recoded,
                 var_stem = "dep",
                 group = "race",
                 pivot = "wider")
```

As with `cat_group_tbl()`, you can specify which values to exclude and how to remove missing values. However, when specifying values to exclude, name them with the `var_stem` value you provide; those values are then excluded from all variables sharing that stem.

```{r}
# Default listwise removal: 'often' value removed from all
# dep_ variables, and 'Non-Black/Non-Hispanic' value removed
# from race variable
select_group_tbl(data = dep_recoded,
                 var_stem = "dep",
                 group = "race",
                 pivot = "longer",
                 ignore = c(dep = "often", race = "Non-Black/Non-Hispanic"))

# Pairwise removal: 'often' value removed from all
# dep_ variables, and 'Non-Black/Non-Hispanic' value removed
# from race variable
select_group_tbl(data = dep_recoded,
                 var_stem = "dep",
                 group = "race",
                 pivot = "longer",
                 ignore = c(dep = "often", race = "Non-Black/Non-Hispanic"),
                 na_removal = "pairwise")
```

Use a list if you want to exclude several values from the same `var_stem` or `group` variable:

```{r}
select_group_tbl(data = dep_recoded,
                 var_stem = "dep",
                 group = "race",
                 pivot = "longer",
                 ignore = list(race = c("Hispanic", "Non-Black/Non-Hispanic")))
```

Another application of `select_group_tbl()` is summarizing responses based on a matching pattern, such as survey time points (e.g., waves). To use this feature, set `group_type` to `pattern` and enter the pattern to search for in the `group` argument. For example, the `stem_social_psych` dataset includes a set of variables responded to by students at two different time points ("w1" and "w2"). You can summarize the responses for one set of these variables using the following approach:

```{r}
select_group_tbl(data = stem_social_psych,
                 var_stem = "belong_belong",
                 group = "_w\\d",
                 group_type = "pattern",
                 pivot = "longer")
```

Use the `group_name` argument to assign a descriptive name to the column containing the matched pattern values.

```{r}
select_group_tbl(data = stem_social_psych,
                 var_stem = "belong_belong",
                 group = "_w\\d",
                 group_type = "pattern",
                 group_name = "wave",
                 pivot = "longer")
```

You can use the `var_labels` argument to include variable labels in your summary table to make the variable names easier to interpret:

```{r}
select_group_tbl(data = stem_social_psych,
                 var_stem = "belong_belong",
                 group = "_w\\d",
                 group_type = "pattern",
                 group_name = "wave",
                 pivot = "longer",
                 var_labels = c(
                   belong_belongStem_w1 = "I feel like I belong in STEM (wave 1)",
                   belong_belongStem_w2 = "I feel like I belong in STEM (wave 2)"
                 ))
```

Finally, you can choose what information to return using the `only` argument.

```{r}
# Default: counts and percentages
select_group_tbl(data = stem_social_psych,
                 var_stem = "belong_belong",
                 group = "_w\\d",
                 group_type = "pattern",
                 group_name = "wave",
                 pivot = "longer")

# Counts only
select_group_tbl(data = stem_social_psych,
                 var_stem = "belong_belong",
                 group = "_w\\d",
                 group_type = "pattern",
                 group_name = "wave",
                 pivot = "longer",
                 only = "count")

# Percents only
select_group_tbl(data = stem_social_psych,
                 var_stem = "belong_belong",
                 group = "_w\\d",
                 group_type = "pattern",
                 group_name = "wave",
                 pivot = "longer",
                 only = "percent")
```

## Continuous variables

### Summarize a series of continuous variables

With the `mean_tbl()` function, you can summarize a group of continuous variables that share the same variable stem. The resulting table provides descriptive statistics for each variable, including the mean, standard deviation, minimum, maximum, and the number of non-missing values.

```{r}
mean_tbl(data = social_psy_data, var_stem = "belong")
```

Like the other functions in this package, you can use the `ignore` argument to specify which values to exclude from all variables associated with the provided variable stem.

```{r}
mean_tbl(data = social_psy_data,
         var_stem = "belong",
         ignore = 5)
```

You can also specify how missing values are removed:

```{r}
# Default listwise removal
mean_tbl(data = social_psy_data,
         var_stem = "belong",
         ignore = 5)

# Pairwise removal
mean_tbl(data = social_psy_data,
         var_stem = "belong",
         na_removal = "pairwise",
         ignore = 5)
```

Including variable labels in your summary table can help make the variable names easier to interpret.

```{r}
mean_tbl(data = social_psy_data,
         var_stem = "belong",
         na_removal = "pairwise",
         var_labels = c(
           belong_1 = "I feel like I belong at this institution",
           belong_2 = "I feel like part of the community",
           belong_3 = "I feel valued by this institution")
)
```

Finally, you can choose what information to return using the `only` argument.

```{r}
# Default: all summary statistics returned
# (mean, sd, min, max, nobs)
mean_tbl(data = social_psy_data,
         var_stem = "belong",
         na_removal = "pairwise")

# Means and non-missing observations returned
mean_tbl(data = social_psy_data,
         var_stem = "belong",
         na_removal = "pairwise",
         only = c("mean", "nobs"))

# Means and standard deviations returned
mean_tbl(data = social_psy_data,
         var_stem = "belong",
         na_removal = "pairwise",
         only = c("mean", "sd"))
```

### Summarize a series of continuous variables grouped by a variable or matching pattern

With the `mean_group_tbl()` function, you can produce a summary table for a series of continuous variables sharing the same variable stem, grouped by another variable in your dataset or by matching a pattern in variable names. For example, we often want to present summary statistics for responses by demographic variables like gender, age, or race:

```{r}
mean_group_tbl(data = stem_social_psych,
               var_stem = "belong_belong",
               group = "urm",
               group_type = "variable")
```

As with `select_group_tbl()`, you can specify which values to exclude and whether to omit missing values.
When specifying values to exclude, name them with the `var_stem` value you provide; those values are then excluded from every variable sharing that stem.

```{r}
# Default listwise removal
mean_group_tbl(data = stem_social_psych,
               var_stem = "belong_belong",
               group = "urm",
               ignore = c(belong_belong = 5, urm = 0)
)

# Pairwise removal
mean_group_tbl(data = stem_social_psych,
               var_stem = "belong_belong",
               group = "urm",
               na_removal = "pairwise",
               ignore = c(belong_belong = 5, urm = 0)
)
```

Use a list if you want to exclude several values from the same `var_stem` or `group` variable:

```{r}
mean_group_tbl(data = stem_social_psych,
               var_stem = "belong_belong",
               group = "urm",
               ignore = list(belong_belong = c(4, 5), urm = 0)
)
```

Another application of `mean_group_tbl()` is summarizing responses based on a matching pattern, such as survey time points (e.g., waves). To use this feature, set `group_type` to `pattern` and enter the pattern to search for in the `group` argument.

```{r}
mean_group_tbl(data = stem_social_psych,
               var_stem = "belong_belong",
               group = "_w\\d",
               group_type = "pattern")
```

Use the `group_name` argument to give a descriptive label to the column with matched patterns or grouping variable values, and the `var_labels` argument to add labels to the variables in the summary table.

```{r}
mean_group_tbl(data = stem_social_psych,
               var_stem = "belong_belong",
               group = "_w\\d",
               group_type = "pattern",
               group_name = "wave",
               var_labels = c(
                 belong_belongStem_w1 = "I feel like I belong in computing",
                 belong_belongStem_w2 = "I feel like I belong in computing")
)
```

Finally, you can choose what information to return using the `only` argument.

```{r}
# Default: all summary statistics returned
# (mean, sd, min, max, nobs)
mean_group_tbl(data = stem_social_psych,
               var_stem = "belong_belong",
               group = "_w\\d",
               group_type = "pattern",
               group_name = "wave",
               var_labels = c(
                 belong_belongStem_w1 = "I feel like I belong in computing",
                 belong_belongStem_w2 = "I feel like I belong in computing")
)

# Means and non-missing observations only
mean_group_tbl(data = stem_social_psych,
               var_stem = "belong_belong",
               group = "_w\\d",
               group_type = "pattern",
               group_name = "wave",
               var_labels = c(
                 belong_belongStem_w1 = "I feel like I belong in computing",
                 belong_belongStem_w2 = "I feel like I belong in computing"),
               only = c("mean", "nobs")
)

# Means and standard deviations only
mean_group_tbl(data = stem_social_psych,
               var_stem = "belong_belong",
               group = "_w\\d",
               group_type = "pattern",
               group_name = "wave",
               var_labels = c(
                 belong_belongStem_w1 = "I feel like I belong in computing",
                 belong_belongStem_w2 = "I feel like I belong in computing"),
               only = c("mean", "sd")
)
```
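
If you are unsure what a grouping pattern such as `"_w\\d"` will pick up, it can help to try it on the variable names first. The base R sketch below uses the two variable names from the `var_labels` examples above and shows which substring of each name the pattern matches. It only illustrates the regular expression and makes no claim about how `summarytabl` extracts groups internally.

```r
# Variable names taken from the var_labels examples above
var_names <- c("belong_belongStem_w1", "belong_belongStem_w2")

# Pull out the first substring of each name that matches the pattern
pattern <- "_w\\d"
matched <- regmatches(var_names, regexpr(pattern, var_names))

# Pair each variable with the value its name matched
data.frame(variable = var_names, wave = matched)
```

Each name yields its own wave suffix, so the two variables would fall into separate groups.
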