--- title: "scipub vignette" author: "David Pagliaccio" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{scipub vignette} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) library(knitr) library(ggplot2) library(htmlTable) ``` The scipub package contains functions for summarizing data for scientific publication. This includes making a "Table 1" to summarize demographics across groups, correlation tables with significance indicated by stars, and extracting formatted statistical summarizes from simple tests for in-text notation. The package also includes functions for Winsorizing data based on a Z-statistic cutoff. The sample dataset of demographic and clinical data from 5,000 children is used for examples. *** We'll start by loading scipub: ```{r, message = FALSE,include = TRUE} library(scipub) data(psydat) ``` *** ## `apastat` The `apastat` function summarizes simple statistical tests to include in the text of an article, typically in a parenthetical. This is built for t-tests, correlations, ANOVA, and regression. Regressions can be summarized by their overall model fit or the parameter estimates for one predictor variable. Effect sizes are calculated where possible (default: `es=TRUE`). For example: There is a significant positive correlation between age and height. 95% confidence intervals are requested. ```{r} apastat(stats::cor.test(psydat$Age, psydat$Height), ci = TRUE) ``` There is no significant sex difference in height in the sample. ```{r} apastat(stats::t.test(Height ~ Sex, data = psydat)) ``` A linear regression model predicting height was highly significant, with the predictors (age and sex) accounting for about 17% of the variance in height. ```{r} apastat(stats::lm(data = psydat, Height ~ Age + Sex)) ``` In this linear regression model, age was a highly significant predictor of height, controlling for sex. ```{r} apastat(stats::lm(data = psydat, Height ~ Age + Sex), var = "Age") ``` *** ## `correltable` The `correltable` function creates a summary correlation table with asterisks to indicate significance. Variables can be renamed as part of the function call. The full matrix or upper/lower triangle can be selected for output. For the selected triangle, the empty row/column can be kept or deleted as needed. The caption provides information on the statistics included, any missing data, and the * indications. For example: The lower triangle of inter-correlation among the age, height, and iq variables are shown. ```{r results="asis"} correltable(data = psydat, vars = c("Age", "Height", "iq"),tri="lower",html=TRUE) ``` These same variables can be relabeled in the output and, for conciseness, the columns can be indicated by corresponding number rather than variable name. ```{r results="asis"} correltable(data = psydat, vars = c("Age", "Height", "iq"), var_names = c("Age (months)", "Height (inches)", "IQ"), tri = "upper", colnum = TRUE,html=TRUE) ``` This can also be done with Spearman correlation. As well as using only complete data (list-wise deletion). And, the empty row/column can be removed if desired. ```{r results="asis"} correltable(data = psydat, vars = c("Age", "Height", "iq"), var_names = c("Age (months)", "Height (inches)", "IQ"), tri = "upper", method = "spearman", use = "complete", cutempty = TRUE, colnum = TRUE,html=TRUE) ``` The inter-correlation between two sets of variables can also be shown. ```{r results="asis"} correltable(data = psydat, vars = c("Age", "Height", "iq"), var_names = c("Age (months)", "Height (inches)", "IQ"), vars2 = c("depressT", "anxT"), var_names2 = c("Depression T", "Anxiety T"),html=TRUE) ``` The simplest call just correlates all variables in a dataset. Any non-numeric variables will be tested by t-test, chi-squared, or ANOVA as appropriate. ```{r results="asis"} correltable(data = psydat, html=TRUE) ``` *** ## `partial_correltable` The `partial_correltable` function provides similar functionality to `correltable` but allows for covariates to be partialled out of all correlations. This function will allow for binary/factor covariates to be partialled out but only numeric variables can be correlated. This involves residualizing all `vars` by all `partialvars` via linear regression (`lm`). For example: The lower triangle of partial correlations among the age, height, and iq variables are shown, residualizing for sex and income as factor variables. ```{r results="asis"} partial_correltable(data = psydat, vars = c("Age", "Height", "iq"), partialvars = c("Sex", "Income"), tri = "lower", html = TRUE) ``` These same variables can be relabeled in the output and, for conciseness, the columns can be indicated by corresponding number rather than variable name and shown in the supper triangle. ```{r results="asis"} partial_correltable(data = psydat, vars = c("Age", "Height", "iq"), var_names = c("Age (months)", "Height (inches)", "IQ"), partialvars = c("Sex", "Income"),tri = "upper", colnum = TRUE, html = TRUE) ``` This can also be done with Spearman correlation. As well as using only complete data (list-wise deletion). And, the empty row/column can be removed if desired. ```{r results="asis"} partial_correltable(data = psydat, vars = c("Age", "Height", "iq"), var_names = c("Age (months)", "Height (inches)", "IQ"), partialvars = c("Sex", "Income"),tri = "upper", method = "spearman", use = "complete", cutempty = TRUE, colnum = TRUE, html = TRUE) ``` *** ## `FullTable1` A "Table 1" can be created to summarize data, i.e. the typical first table in a paper that describes the sample characteristics. This can display information for a single group for the declared variables . ```{r results="asis"} FullTable1(data = psydat, vars = c("Age", "Sex","Height", "depressT"), html=TRUE) ``` Or commonly this can be shown for two groups if interest including the tests of group difference for all variables. ```{r results="asis"} FullTable1(data = psydat, vars = c("Age", "Height", "depressT"), strata = "Sex", html=TRUE) ``` This can also be created for more than two groups. As with `correltable` variables can be renamed in the call. Also the significance stars can be moved to the statistic column or variable name (or removed). The p-value column can be removed as well (same for the effect size column, but why would you want to remove that?). ```{r results="asis"} FullTable1(data = psydat, vars = c("Age", "Sex","Height", "depressT"), var_names = c("Age (months)", "Sex","Height (inches)", "Depression T"), strata = "Income", stars = "stat",p_col = FALSE, html=TRUE) ``` All variables will be summarized if none are declared Shown with significance stars on variable names. ```{r results="asis"} FullTable1(data = psydat, strata = "Sex",stars = "name",p_col = FALSE, html=TRUE) ``` You can also replace the caption with your own input and re-output to HTML. ```{r results="asis"} tmp <- FullTable1(data = psydat, vars = c("Age", "Height", "depressT"), strata = "Sex") tmp$caption <- "Write your own caption" print(htmlTable::htmlTable(tmp$table, useViewer=T, rnames=F,caption=tmp$caption, pos.caption="bottom")) ``` *** ## Winsorizing The `winsorZ` function allows for Winsorizing outliers based on a Z-score cutoff, i.e. replacing extreme values with the next most extreme outlier value. This is an alternative to other function, e.g. `DescTools::Winsorize` which identifies outlier based on quantile limits. The `winsorZ` function can be used in combination with the `winsorZ_find` function, which identifies the outlier values (1=outlier, 0=non-outlier). For example, in the example data, psydat$iq has 17 outliers that exceed a default |Z|>3 limit test. Here, we create a temporary data frame with the original iq scores, the Z-score winsorized iq values, and an indication of which scores were winsorized. We can see the change in mix/max iq values in the summary and the winsorized outliers are shown in blue in the plot. ```{r results="asis"} temp <- data.frame(iq=psydat$iq, iq_winsor=winsorZ(psydat$iq), iq_outlier = winsorZ_find(psydat$iq)) summary(temp) ggplot(temp[!is.na(temp$iq),], aes(x=iq, y=iq_winsor)) + geom_point(aes(color=iq_outlier),alpha=.7) + geom_line() + theme_bw() ``` *** ## Group Difference Plot The `gg_groupplot` function creates can be used to create group difference plots for scientific publication. This is intended to summarize a continuous outcome (`y`) based on a factor ('x') from an input dataset (`data`). The plot will include standard ggplot2::geom_boxplot indicating 25th, median, and 75th percentile for the box and 1.5 * IQR for the whiskers. Outliers are not highlighted. Raw data is displayed with standard ggplot2::geom_point and lateral but not vertical jittering. Histograms are shown with gghalves::geom_half_violin to the right of each boxplot.If meanline = = TRUE (default), gray dots will indicate the mean for each variable (vs. median in boxplot) connected by a gray line. This function will drop any NA values. ```{r results="asis"} gg_groupplot(data=psydat, x=Sex, y=depressT) ``` This can be combined with other `ggplot` graphics functions, e.g. facetting. ```{r results="asis"} gg_groupplot(data=psydat, x=Income, y=depressT) + facet_wrap(~Sex) + theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) ``` This function is a simpler wrapper for: ggplot(data = data[!is.na(data[, x]) & !is.na(data[, y]), ], aes(x = get(x), y = get(y), color = get(x), fill = get(x), shape = get(x))) + geom_half_violin(position = position_nudge(x = .3, y = 0), alpha = .8, width = .5, side = "r", color = NA) + geom_point(position = position_jitterdodge(jitter.width = .5), alpha = .8, size = 1.5) + geom_boxplot(outlier.alpha = 0, width = .5, fill = NA, color = "black") + xlab("") + ylab("") + theme_bw(base_size = 12) + theme(legend.position = "none", panel.grid.minor = element_line(linetype = "dashed", size = .5), axis.title.x = element_text(face = "bold"), axis.title.y = element_text(face = "bold")) + scale_shape(solid = FALSE) + stat_summary(fun = mean, geom = "line", color = "darkgray", size = 1, aes(group = 1)) + stat_summary(fun = mean, geom = "point", color = "darkgray", size = 2, shape = 16, aes(group = 1))