%\VignetteIndexEntry{Using stm}
%\documentclass[nojss]{jss}
\documentclass[article,shortnames]{jss}
\author{\hspace{1.1in}Margaret E. Roberts\\\hspace{1.1in}UCSD \And
\hspace{1.5in}Brandon M. Stewart\\\hspace{1.5in}Princeton \And
\hspace{1.5in}Dustin Tingley\\\hspace{1.5in}Harvard}
\title{\pkg{stm}: \proglang{R} Package for Structural Topic Models}
\Plainauthor{Margaret E. Roberts, Brandon M. Stewart, Dustin Tingley} %% comma-separated
\Plaintitle{stm: R Package for Structural Topic Models} %% without formatting
\Shorttitle{Structural Topic Models} %% a short title (if necessary)
\Abstract{ This paper demonstrates how to use the Structural Topic Model \pkg{stm} \texttt{R} package. The Structural Topic Model allows researchers to flexibly estimate a topic model that includes document-level metadata. Estimation is accomplished through a fast variational approximation. The \pkg{stm} package provides many useful features, including rich ways to explore topics, estimate uncertainty, and visualize quantities of interest. }
\Keywords{structural topic model, text analysis, LDA, \pkg{stm}, \proglang{R}}
\Plainkeywords{structural topic model, text analysis, LDA, stm, R} %% without formatting
\Address{ Margaret E. Roberts\\ Department of Political Science\\ University of California, San Diego\\ Social Sciences Building 301\\ 9500 Gillman Drive, 0521, La Jolla, CA, 92093-0521 \\ E-mail: \email{meroberts@ucsd.edu}\\ URL: \url{http://www.margaretroberts.net}\\ \\ Brandon M. Stewart\\ Department of Sociology\\ Princeton University\\ 149 Wallace Hall, Princeton, NJ 08544 \\ E-mail: \email{bms4@princeton.edu}\\ URL: \url{brandonstewart.org}\\ \\ Dustin Tingley\\ Department of Government\\ Harvard University\\ 1737 Cambridge St, Cambridge, MA, USA\\ E-mail: \email{dtingley@gov.harvard.edu}\\ URL: \url{http://scholar.harvard.edu/dtingley}\\ }
% == BibTeX packages
\usepackage{natbib}
% == Other Packages
\usepackage{amsmath, amsfonts, amssymb, url, bm}
\usepackage{rotating}
\usepackage{latexsym}
\usepackage{graphicx}
\usepackage{ulem}
\usepackage{placeins}
% == New Commands
% = For general typesetting
\newcommand\spacingset[1]{\renewcommand{\baselinestretch}{#1}\small\normalsize}
\setkeys{Gin}{width=0.5\textwidth}
\def\citepos#1{\citeauthor{#1}'s (\citeyear{#1})}
\newcommand\citea{\citeauthor}
\begin{document}
\SweaveOpts{concordance=TRUE}
\spacingset{1}
<>=
options(prompt = "R> ", continue = "+", width = 70, useFancyQuotes = FALSE)
@
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Introduction}
Text data is ubiquitous in social science research: traditional media, social media, survey data, and numerous other sources contribute to the massive quantity of text in the modern information age. The mounting availability of, and interest in, text data has been met by the development of a variety of statistical approaches for analyzing this data. We focus on the Structural Topic Model (STM), and its implementation in the \pkg{stm} \texttt{R} package, which provides tools for machine-assisted reading of text corpora.\footnote{We thank Antonio Coppola, Jetson Leder-Luis, Christopher Lucas, and Alex Storer for various assistance in the construction of this package. We also thank the many package users who have written in with bug reports and feature requests.
We extend particular gratitude to users on Github who have contributed code via pull requests including Ken Benoit, Patrick Perry, Chris Baker, Jeffrey Arnold, Thomas Leeper, Vincent Arel-Bundock, Santiago Castro, Rose Hartman, Vineet Bansal and Github user sw1. We gratefully acknowledge grant support from the Spencer Foundation's New Civics initiative, the Hewlett Foundation, a National Science Foundation grant under the Resource Implementations for Data Intensive Research program, Princeton's Center for Statistics and Machine Learning, and The Eunice Kennedy Shriver National Institute of Child Health \& Human Development of the National Institutes of Health under Award Number P2CHD047879. The content is solely the responsibility of the authors and does not necessarily represent the official views of any of the funding institutions. Additional details and development version at \url{structuraltopicmodel.com}.} Building off of the tradition of probabilistic topic models, such as the Latent Dirichlet Allocation (LDA) \citep{blei2003latent}, the Correlated Topic Model (CTM) \citep{blei2007correlated}, and other topic models that have extended these \citep{mimno2008dmr, socher2009bayesian, eisenstein2010latent, rosen2010learning, quinn2010analyze, ahmed2010timeline, grimmer2010bayesian,eisenstein2011sparse, gerrish2012they, foulds2015latent, paul2015sprite}, the Structural Topic Model's key innovation is that it permits users to incorporate arbitrary metadata, defined as information about each document, into the topic model. With the STM, users can model the framing of international newspapers \citep{stmjasa}, open-ended survey responses in the American National Election Study \citep{ajps}, online class forums \citep{StudentText}, Twitter feeds and religious statements \citep{TextComparative}, lobbying reports \citep{sailing} and much more.\footnote{A fast growing range of other papers also utilize the model in a variety of ways. We maintain a list of papers using the software that we are aware of on our website \url{structuraltopicmodel.com}.} The goal of the Structural Topic Model is to allow researchers to discover topics and estimate their relationship to document metadata. Outputs of the model can be used to conduct hypothesis testing about these relationships. This of course mirrors the type of analysis that social scientists perform with other types of data, where the goal is to discover relationships between variables and test hypotheses. Thus the methods implemented in this paper help to build a bridge between statistical techniques and research goals. The \pkg{stm} package also provides tools to facilitate the work flow associated with analyzing textual data. The design of the package is such that users have a broad array of options to process raw text data, explore and analyze the data, and present findings using a variety of plotting tools. The outline of this paper is as follows. In Section~\ref{sec:est} we introduce the technical aspects of the STM, including the data generating process and an overview of estimation. In Section \ref{sec:use} we provide examples of how to use the model and the package \texttt{stm}, including implementing the model and plots to visualize model output. Sections~\ref{sec:defaults} and~\ref{sec:priors} cover settings to control details of estimation. Section~\ref{sec:performance} demonstrates the superior performance of our implementation over related alternatives. 
Section~\ref{sec:conclusion} concludes and the Appendix discusses several more advanced features.
\section{The Structural Topic Model}
\label{sec:est}
We begin by providing a technical overview of the STM. Like other topic models, the STM is a generative model of word counts. That means we define a data generating process for each document and then use the data to find the most likely values for the parameters within the model. Figure~\ref{fig:stmoverview} provides a graphical representation of the model. The generative model begins at the top, with document-topic and topic-word distributions generating documents that have metadata associated with them ($X_d$, where $d$ indexes the documents). Within this framework (which is the same as other topic models like LDA absent the metadata), a topic is defined as a mixture over words where each word has a probability of belonging to a topic. And a document is a mixture over topics, meaning that a single document can be composed of multiple topics. As such, the sum of the topic proportions across all topics for a document is one, and the sum of the word probabilities for a given topic is one.
Figure~\ref{fig:stmoverview} and the statement below of the document generative process highlight the case where topical prevalence and topical content can be a function of document metadata. Topical prevalence refers to how much of a document is associated with a topic (described on the left hand side) and topical content refers to the words used within a topic (described on the right hand side). Hence metadata that explain topical prevalence are referred to as topical prevalence covariates, and variables that explain topical content are referred to as topical content covariates. It is important to note, however, that the model allows using topical prevalence covariates, a topical content covariate, both, or neither. In the case of no covariates and point estimates for $\beta$, the model reduces to a (fast) implementation of the Correlated Topic Model \citep{blei2007correlated}.
The generative process for each document (indexed by $d$) with vocabulary of size $V$ for an STM model with $K$ topics can be summarized as:
\begin{enumerate}
\item Draw the document-level attention to each topic from a logistic-normal generalized linear model based on a vector of document covariates $X_d$. \\
\begin{equation}\label{eqn:dgp1}
\vec{\theta}_d | X_d\gamma, \Sigma \sim \text{ LogisticNormal}(\mu = X_d\gamma, \Sigma)
\end{equation}
where $X_d$ is a 1-by-$p$ vector, $\gamma$ is a $p$-by-$(K-1)$ matrix of coefficients and $\Sigma$ is a $(K-1)$-by-$(K-1)$ covariance matrix.
\item Given a document-level content covariate $y_d$, form the document-specific distribution over words representing each topic ($k$) using the baseline word distribution ($m$), the topic-specific deviation $\kappa_{k}^{(\text{t})}$, the covariate group deviation $\kappa_{y_d}^{(\text{c})}$ and the interaction between the two $\kappa_{y_d,k}^{(\text{i})}$.\\
\begin{equation}\label{eqn:dgp2}
\beta_{d,k} \propto \text{exp}(m + \kappa_{k}^{(\text{t})} + \kappa_{y_d}^{(\text{c})} + \kappa_{y_d,k}^{(\text{i})})
\end{equation}
$m$, and each $\kappa_k^{(\text{t})}$, $\kappa_{y_d}^{(\text{c})}$ and $\kappa_{y_d,k}^{(\text{i})}$ are $V$-length vectors containing one entry per word in the vocabulary.
When no content covariate is present $\beta$ can be formed as $\beta_{d,k} \propto \text{exp}(m + \kappa_{k}^{(\text{t})})$ or simply point estimated (this latter behavior is the default).
\item For each word in the document, ($n \in 1, \dots, N_d$):
\begin{itemize}
\item Draw the word's topic assignment based on the document-specific distribution over topics.\\
\begin{equation}\label{eqn:dgp3}
z_{d, n} | \vec{\theta}_d \sim \text{ Multinomial}(\vec{\theta}_d)
\end{equation}
\item Conditional on the topic chosen, draw an observed word from that topic.\\
\begin{equation}\label{eqn:dpg4}
w_{d, n} | z_{d, n}, \beta_{d, k= z_{d,n}} \sim \text{ Multinomial}(\beta_{d, k = z_{d,n}})
\end{equation}
\end{itemize}
\end{enumerate}
Our approach to estimation builds on prior work in variational inference for topic models \citep{blei2007correlated, ahmed2007tight, wang2013variational}. We develop a partially-collapsed variational Expectation-Maximization algorithm which upon convergence gives us estimates of the model parameters. Regularizing prior distributions are used for $\gamma$, $\kappa$, and (optionally) $\Sigma$, which help enhance interpretation and prevent overfitting. We note that within the object that \code{stm} produces, the names correspond with the notation used here (i.e., the list element named \code{theta} contains $\theta$, the $N$ by $K$ matrix where $\theta_{i,j}$ denotes the proportion of document $i$ allocated to topic $j$). Further technical details are provided in \cite{stmjasa}. In this paper, we provide brief interpretations of results, and we direct readers to the companion papers \citep{nips2013, ajps, TextComparative} for complete applications.
\begin{figure}
\centering
\includegraphics[scale=.45]{STMdiagram.pdf}
\caption{Heuristic description of generative process and estimation of the STM.}\label{fig:stmoverview}
\end{figure}
\FloatBarrier
\section{Using the Structural Topic Model}
\label{sec:use}
In this section we demonstrate the basics of using the package.\footnote{The \pkg{stm} package leverages functions from a variety of other packages. Key computations in the main function use \pkg{Rcpp} \citep{eddelbuettel2011rcpp}, \pkg{RcppArmadillo} \citep{RcppArmadillo}, \pkg{MatrixStats} \citep{MatrixStats}, \pkg{slam} \citep{slam}, \pkg{lda} \citep{lda}, \pkg{splines} \citep{Rcore} and \pkg{glmnet} \citep{friedman2010regularization}. Supporting functions use a wide array of other packages including: \pkg{quanteda} \citep{quanteda}, \pkg{tm} \citep{meyer2008text}, \pkg{stringr} \citep{stringr}, \pkg{SnowballC} \citep{SnowballC}, \pkg{igraph} \citep{igraph}, \pkg{huge} \citep{huge}, \pkg{clue} \citep{hornik2005clue}, \pkg{wordcloud} \citep{wordcloud}, \pkg{KernSmooth} \citep{KernSmooth}, \pkg{geometry} \citep{geometry}, \pkg{Rtsne} \citep{Rtsne} and others.} Figure~\ref{fig:stmreview} presents a heuristic overview of the package, which parallels a typical workflow. For each step we list different functions in the \pkg{stm} package that accomplish each task. First, users ingest the data and prepare it for analysis. Next, a structural topic model is estimated. As we discuss below, the ability to estimate the structural topic model quickly allows for the evaluation, understanding, and visualization of results. For each of these steps we provide examples of some of the functions that are in the package and discussed in this paper.
All of the functions come with help files and examples that can be accessed by typing \code{?} and then the function's name.\footnote{We sometimes refer to generic functions such as \code{plot.estimateEffect} using their full name. We emphasize for those new to \code{R} that these functions can be accessed by using the generic function (in this case \code{plot}) on an object of the type \code{estimateEffect}.}
\begin{figure}
\centering
\includegraphics{Flowchart}
\caption{Heuristic description of selected \pkg{stm} package features.}\label{fig:stmreview}
\end{figure}
<>=
library("stm")
set.seed(2138)
@
\subsection{Ingest: Reading and processing text data}
The first step is to load data into \proglang{R}. The \pkg{stm} package represents a text corpus in three parts: a \code{documents} list containing word indices and their associated counts,\footnote{A full description of the sparse list format can be found in the help file for \code{stm}.} a \code{vocab} character vector containing the words associated with the word indices, and a \code{metadata} matrix containing document covariates. For illustration purposes, we print an example of two short documents from a larger corpus below. The first document contains five words (which would appear at positions 21, 23, 87, 98 and 112 of the vocab vector) and each word appears once except the 21st word, which appears twice.
<>=
list(poliblog5k.docs[[1]][,1:5], poliblog5k.docs[[2]][,1:3])
@
In this section, we describe utility functions for reading text data into \proglang{R} or converting from a variety of common formats that text data may come in. Note that particular care must be taken to ensure that documents and their vocabularies are properly aligned with the metadata. In the sections below we describe some tools internal and external to the package that can be useful in this process.
\subsubsection{Reading in data from a ``spreadsheet''}
The \pkg{stm} package provides a simple tool to quickly get started for a common use case: the researcher has stored text data alongside covariates related to the text in a spreadsheet, such as a .csv file that contains textual data and associated metadata. If the researcher reads this data into an \proglang{R} data frame, then the \pkg{stm} package's function \code{textProcessor} can conveniently convert and process the data to ready it for analysis in the \pkg{stm} package. For example, users would first read in a .csv file that contains the textual data and associated metadata using native \proglang{R} functions, or load a pre-prepared data frame as we do below.\footnote{Note that the model does not permit estimation when there are variables used in the model that have missing values. As such, it can be helpful to subset data to observations that do not have missing values for metadata that will be used in the STM model.} Next, they would pass this object through the \code{textProcessor} function.
To illustrate how to use the \pkg{stm} package, we will use a collection of blogposts about American politics that were written in 2008, from the CMU 2008 Political Blog Corpus \citep{poliblog}.\footnote{The set of blogs is available at \url{http://sailing.cs.cmu.edu/socialmedia/blog2008.html} and documentation on the blogs is available at \url{http://www.sailing.cs.cmu.edu/socialmedia/blog2008.pdf}. You can find the cleaned version of the data we used for this vignette here: \url{http://scholar.princeton.edu/sites/default/files/bstewart/files/poliblogs2008.csv}.
A link to a saved workspace containing the models we use here is provided below.} The blogposts were gathered from six different blogs: American Thinker, Digby, Hot Air, Michelle Malkin, Think Progress, and Talking Points Memo. Each blog has its own particular political bent. The day within 2008 on which each blog post was written was also recorded. Thus for each blogpost, there is metadata on the day it was written and the political ideology of the blog for which it was written. In this case, each blog post is a row in a .csv file, with the text contained in a variable called \code{documents}.
<>=
data <- read.csv("poliblogs2008.csv")
processed <- textProcessor(data$documents, metadata = data)
out <- prepDocuments(processed$documents, processed$vocab, processed$meta)
docs <- out$documents
vocab <- out$vocab
meta <- out$meta
@
In Section \ref{sec:estimate} we provide a link to a workspace with an estimated model that can be used to follow the examples in this vignette.
\subsubsection{Using the quanteda package}
A useful external tool for handling text processing is the \pkg{quanteda} package \citep{quanteda}, which makes it easy to import text and associated meta-data, prepare the texts for processing, and convert the documents into a document-term matrix. In \pkg{quanteda}, documents can be added to a corpus object using the \code{corpus} constructor function, which holds both the texts and the associated covariates in the form of document-level meta-data. The \pkg{readtext} package \citep{readtext} contains very flexible tools for reading many formats of texts, such as plain text, XML, and JSON formats, and for easily creating a corpus from them. The function \code{dfm} from the \pkg{quanteda} package creates a document term matrix that can be supplied directly to the \code{stm} model fitting function. \pkg{quanteda} offers a much richer library of functions than the built-in \code{textProcessor} function and so can be particularly useful when more customization is required.
\subsubsection{Reading in data from other text processing programs}
Sometimes researchers will encounter data that is not in a spreadsheet format. The \code{readCorpus} function is capable of loading data from a wide variety of formats, including the standard \proglang{R} matrix class along with the sparse matrices from the packages \pkg{slam} and \pkg{Matrix}. Document term matrices created by the popular package \pkg{tm} inherit from the \pkg{slam} sparse matrix format and thus are included as a special case. Another program that is helpful for setting up and processing text data with document metadata is \href{www.txtorg.org}{txtorg} \citep{TextComparative}. The \pkg{txtorg} program generates three separate files: a metadata file, a vocabulary file, and a file with the original documents. The default export format for \pkg{txtorg} is the \texttt{ldac} sparse matrix format popularized by David Blei's implementation of LDA. The \code{readCorpus()} function can read in data of this type using the ``ldac'' option. Finally, the \pkg{corpus} package \citep{corpus} provides another fast method for creating a corpus from raw text inputs.
\subsubsection{Pre-processing text content}
It is often useful to engage in some processing of the text data before modeling it. The most common processing steps are stemming (reducing words to their root form), dropping punctuation and stop word removal (e.g., the, is, at).
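The \code{textProcessor} function, described next, exposes a switch for each of these steps. A minimal sketch, assuming the argument names used in recent versions of the package (see \code{?textProcessor} for the full set of options and their defaults):
<>=
# Sketch: the processing call from above with the common switches written out
# explicitly (these values mirror what we believe are the current defaults).
processed <- textProcessor(data$documents, metadata = data,
                           lowercase = TRUE,          # fold case
                           removestopwords = TRUE,    # drop stop words (the, is, at, ...)
                           removenumbers = TRUE,      # drop numerals
                           removepunctuation = TRUE,  # drop punctuation
                           stem = TRUE,               # reduce words to their root form
                           language = "en")
@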
The \code{textProcessor} function implements each of these steps across multiple languages by using the \pkg{tm} package.\footnote{The \pkg{tm} package has numerous additional features that are not included in \code{textProcessor}, which is intended only to wrap a useful set of common defaults.}
\subsection{Prepare: Associating text with metadata}
After reading in the data, we suggest using the utility function \code{prepDocuments} to process the loaded data to make sure it is in the right format. \code{prepDocuments} also removes infrequent terms depending on the user-set parameter \code{lower.thresh}. The utility function \code{plotRemoved} will plot the number of words and documents removed for different thresholds. For example, the user can use:
<>=
plotRemoved(processed$documents, lower.thresh = seq(1, 200, by = 100))
out <- prepDocuments(processed$documents, processed$vocab,
                     processed$meta, lower.thresh = 15)
@
to evaluate how many words and documents would be removed from the dataset at each word threshold, which is the minimum number of documents in which a word needs to appear in order to be kept within the vocabulary. Then the user can select their preferred threshold within \code{prepDocuments}. Importantly, \code{prepDocuments} will also re-index all metadata/document relationships if any changes occur due to processing. If a document is completely removed due to pre-processing (for example because it contained only rare words), then \code{prepDocuments} will drop the corresponding row in the metadata as well. After reading in and processing the text data, it is important to inspect features of the documents and the associated vocabulary list to make sure they have been correctly preprocessed. From here, researchers are ready to estimate a structural topic model.
\subsection{Estimate: Estimating the structural topic model}
\label{sec:estimate}
The data import process will output documents, vocabulary and metadata that can be used for an analysis. In this section we illustrate how to estimate the STM. Next we move to a range of functions to evaluate, understand, and visualize the fitted model object.
%\subsubsection{Ways to use metadata}
The key innovation of the STM is that it incorporates metadata into the topic modeling framework. In STM, metadata can be entered in the topic model in two ways: topical prevalence and topical content. Metadata covariates for topical prevalence allow the observed metadata to affect the frequency with which a topic is discussed. Covariates in topical content allow the observed metadata to affect the rate of word use within a given topic, that is, how a particular topic is discussed. Estimation for both topical prevalence and content proceeds via the workhorse \code{stm} function.
\subsubsection{Estimation with topical prevalence parameter}
In this example, we use the ratings variable (blog ideology) as a covariate in the topic prevalence portion of the model with the CMU Poliblog data described above. Each document is modeled as a mixture of multiple topics. Topical prevalence captures how much each topic contributes to a document. Because different documents come from different sources, it is natural to want to allow this prevalence to vary with metadata that we have about document sources. We will let prevalence be a function of the ``rating'' variable, which is coded as either ``Liberal'' or ``Conservative,'' and the variable ``day,'' which is an integer measure of days running from the first to the last day of 2008.
To illustrate, we now estimate a 20 topic STM model. The user can then pass the output from the model, \code{poliblogPrevFit}, through the various functions we discuss below (e.g., \code{plot.STM}) to inspect the results. If a user wishes to specify additional prevalence covariates, she would do so using the standard formula notation in \proglang{R}, which we discuss at greater length below. A feature of the \code{stm} function is that ``prevalence'' can be expressed as a formula that can include multiple covariates and factor or continuous covariates. For example, by using the formula setup we can enter other covariates additively. Additionally, users can include more flexible functional forms of continuous covariates, including standard transforms like \code{log()}, as well as \code{ns()} or \code{bs()} from the \pkg{splines} package. The \pkg{stm} package also includes a convenience function \code{s()}, which selects a fairly flexible b-spline basis. In the current example we allow for the variable ``day'' to be estimated with a spline. As we show later in the paper, interactions between covariates can also be added using the standard notation for \texttt{R} formulas. In the example below, we enter the variables additively and allow the day variable, an integer measuring the day on which the blog post was written, to have a non-linear relationship in the topic estimation stage.
<>=
poliblogPrevFit <- stm(documents = out$documents, vocab = out$vocab,
                       K = 20, prevalence =~ rating + s(day),
                       max.em.its = 75, data = out$meta,
                       init.type = "Spectral")
@
The model is set to run for a maximum of 75 EM iterations (controlled by \code{max.em.its}). Convergence is monitored by the change in the approximate variational lower bound. Once the bound has a small enough change between iterations, the model is considered converged. To reduce compiling time, in this paper we do not run the models and instead load a workspace with the models already estimated.\footnote{We note that due to differences across versions of the software the code above may not exactly replicate the models we use below. We have analyzed the poliblog data in other papers including \cite{chaptermulti} and \cite{qmmr}.}
<>=
load(url("https://github.com/bstewart/stm/blob/gh-pages/files/VignetteObjects.RData?raw=true"))
@
\subsection{Evaluate: Model selection and search}
\subsubsection{Model initialization for a fixed number of topics}
As with all mixed-membership topic models, the posterior is intractable and non-convex, which creates a multimodal estimation problem that can be sensitive to initialization. Put differently, the answers the estimation procedure comes up with may depend on starting values of the parameters (e.g., the distribution over words for a particular topic). There are two approaches to dealing with this that the \pkg{stm} package facilitates. The first is to use an initialization based on the method of moments, which is deterministic and globally consistent under reasonable conditions \citep{arora2012practical, chaptermulti}. This is known as a spectral initialization because it uses a spectral decomposition (non-negative matrix factorization) of the word co-occurrence matrix. In practice we have found this initialization to be very helpful. This can be chosen by setting \code{init.type = "Spectral"} in the \code{stm} function. We use this option in the above example.
This means that no matter the seed that is set, the same results will be generated.\footnote{While the initialization is deterministic, we have observed in some circumstances it may not produce exactly the same results across machines due to differences in numerical precision. It will, however, always produce the same result within a machine for a given version of \pkg{stm}.} When the vocabulary is larger than 10,000 words, the function will temporarily subset the vocabulary size for the duration of the initialization.
The second approach is to initialize the model with a short run of a collapsed Gibbs sampler for LDA. For completeness, researchers can also initialize the model randomly, but this is generally not recommended. In practice, we generally recommend using the spectral initialization as we have found it to produce the best results consistently \citep{chaptermulti}.\footnote{Starting with version \code{1.3.0} the default initialization strategy was changed to \code{Spectral} from \code{LDA}.}
\subsubsection{Model selection for a fixed number of topics}
When not using the spectral initialization, the analyst should estimate many models, each from different initializations, and then evaluate each model according to some separate standard (we provide several below). The function \code{selectModel} automates this process to facilitate finding a model with desirable properties. Users specify the number of ``runs,'' which in the example below is set to 20. \code{selectModel} first casts a net in which all of the requested runs (here, 20) are run for two EM steps, and then models with low likelihoods are discarded. Next, the default returns the 20\% of models with the highest likelihoods, which are then run until convergence or the EM iteration maximum is reached. Notice that options for the \code{stm} function can be passed to \code{selectModel}, such as \code{max.em.its}. If users would like to select a larger number of models to be run completely, this can also be set with an option specified in the help file for this function.
<>=
poliblogSelect <- selectModel(out$documents, out$vocab, K = 20,
                              prevalence =~ rating + s(day),
                              max.em.its = 75, data = out$meta,
                              runs = 20, seed = 8458159)
@
In order to select a model for further investigation, users must choose one of the candidate models' outputs from \code{selectModel}. To do this, \code{plotModels} can be used to plot two scores: semantic coherence and exclusivity for each model and topic.
Semantic coherence is a criterion developed by \cite{mimno2011optimizing} and is closely related to pointwise mutual information \citep{newman2010automatic}: it is maximized when the most probable words in a given topic frequently co-occur together. \cite{mimno2011optimizing} show that the metric correlates well with human judgment of topic quality. Formally, let $D(v,v')$ be the number of times that words $v$ and $v'$ appear together in a document. Then for a list of the $M$ most probable words in topic $k$, the semantic coherence for topic $k$ is given as
\begin{align}
C_k = \sum_{i=2}^M \sum_{j=1}^{i-1} \text{log}\left(\frac{D(v_i, v_j) + 1}{D(v_j)}\right)
\end{align}
In \citet{ajps} we noted that high semantic coherence is relatively easy to attain by having a few topics dominated by very common words. We thus propose to measure topic quality through a combination of semantic coherence and exclusivity of words to topics.
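Both of these quantities can also be computed directly for a fitted model. A minimal sketch using the package's \code{semanticCoherence} and \code{exclusivity} functions (exclusivity is defined via the FREX score introduced next):
<>=
# Sketch: per-topic scores for one of the candidate models returned by selectModel().
semanticCoherence(poliblogSelect$runout[[1]], documents = out$documents, M = 10)
exclusivity(poliblogSelect$runout[[1]], M = 10)
@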
We use the FREX metric \citep{bischof2012summarizing,airoldi2016improving} to measure exclusivity in a way that balances word frequency.\footnote{We use the FREX framework from \cite{airoldi2016improving} to develop a score based on our model's parameters, but do not run the more complicated hierarchical Poisson convolution model developed in that work.} FREX is the weighted harmonic mean of the word's rank in terms of exclusivity and frequency.
\begin{align}
\label{eq:frex}
\text{FREX}_{k,v} = \left( \frac{\omega}{\text{ECDF}(\beta_{k,v}/\sum_{j=1}^K \beta_{j,v})} + \frac{1-\omega}{\text{ECDF}(\beta_{k,v})} \right)^{-1}
\end{align}
where ECDF is the empirical CDF and $\omega$ is the weight which we set to .7 here to favor exclusivity.\footnote{Both functions are documented with keyword \code{internal} and can be accessed by \code{?semanticCoherence} and \code{?exclusivity}.}
Each of these criteria is calculated for each topic within a model run. The \code{plotModels} function calculates the average across all topics for each run of the model and plots these by labeling the model run with a numeral. Often, users will select a model with desirable properties in both dimensions (i.e., models with average scores towards the upper right side of the plot). As shown in Figure~\ref{fig:select}, the \code{plotModels} function also plots each topic's values, which helps give a sense of the variation in these parameters. For a given model, the user can plot the semantic coherence and exclusivity scores with the \code{topicQuality} function.
\begin{figure}
\begin{center}
<>=
plotModels(poliblogSelect, pch = c(1, 2, 3, 4), legend.position = "bottomright")
@
\caption{Plot of \code{selectModel} results. Numerals represent the average for each model, and dots represent topic-specific scores.}
\label{fig:select}
\end{center}
\end{figure}
Next, the user would want to select one of these models to work with. For example, the third model could be extracted from the object returned by \code{selectModel}.
<>=
selectedmodel <- poliblogSelect$runout[[3]]
@
Alternatively, as discussed in Appendix~\ref{sec:eval}, the user can evaluate the stability of particular topics across models.
\subsubsection{Model search across numbers of topics}
STM assumes a fixed user-specified number of topics. There is not a ``right'' answer to the number of topics that are appropriate for a given corpus \citep{grimmer2013text}, but the function \code{searchK} uses a data-driven approach to selecting the number of topics. The function will perform several automated tests to help choose the number of topics including calculating the held-out likelihood \citep{wallach2009evaluation} and performing a residual analysis \citep{taddyestimation}. For example, one could estimate an STM model with 7 and 10 topics and compare the results along each of the criteria. The default initialization is the spectral initialization due to its stability. This function will also calculate a range of quantities of interest, including the average exclusivity and semantic coherence.
<>=
storage <- searchK(out$documents, out$vocab, K = c(7, 10),
                   prevalence =~ rating + s(day), data = meta)
@
There is another more preliminary selection strategy based on work by \cite{lee2014low}. When initialization type is set to \code{"Spectral"} the user can specify \code{K = 0} to use the algorithm of \cite{lee2014low} to select the number of topics.
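A hedged sketch of this usage (the object name is illustrative):
<>=
# Sketch: let the Lee and Mimno (2014) procedure choose the number of topics (K = 0).
poliblogK0 <- stm(documents = out$documents, vocab = out$vocab, K = 0,
                  prevalence =~ rating + s(day), max.em.its = 75,
                  data = out$meta, init.type = "Spectral", seed = 8458159)
@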
The core idea of the spectral initialization is to approximately find the vertices of the convex hull of the word co-occurrences. The algorithm of \cite{lee2014low} projects the matrix into a low dimensional space using $t$-distributed stochastic neighbor embedding \citep{van2014accelerating} and then exactly solves for the convex hull. This has the advantage of automatically selecting the number of topics. The added randomness from the projection means that the algorithm is not deterministic like the standard \code{"Spectral"} initialization type. Running it with a different seed can result in not only different results but a different number of topics. We emphasize that this procedure has no particular statistical guarantees and should not be seen as estimating the ``true'' number of topics. However, it can be a useful place to start and has the computational advantage that it only needs to be run once.
There are several other functions for evaluation shown in Figure~\ref{fig:stmreview}; we discuss these in more detail in Appendix~\ref{sec:eval} so that we can proceed here to how to understand and visualize STM results.
\subsection{Understand: Interpreting the STM by plotting and inspecting results}
After choosing a model, the user must next interpret the model results. There are many ways to investigate the output, such as inspecting the words associated with topics or the relationship between metadata and topics. To investigate the output of the model, the \pkg{stm} package provides a number of options.
\begin{enumerate}
\item Displaying words associated with topics (\code{labelTopics}, \code{plot.STM(,type = "labels")}, \code{sageLabels}, \code{plot.STM(,type = "perspectives")}) or documents highly associated with particular topics (\code{findThoughts}, \code{plotQuote}).
\item Estimating relationships between metadata and topics/topical content (\code{estimateEffect}).
\item Calculating topic correlations (\code{topicCorr}).
\end{enumerate}
\subsubsection{Understanding topics through words and example documents}
We next describe two approaches for users to explore the topics that have been estimated. The first approach is to look at collections of words that are associated with topics. The second approach is to examine actual documents that are estimated to be highly associated with each topic. Both of these approaches should be used. Below, we use the 20 topic model estimated with the spectral initialization.
To explore the words associated with each topic we can use the \code{labelTopics} function. For models where a content covariate is included, \code{sageLabels} can also be used. Both these functions will print words associated with each topic to the console. The function by default prints several different types of word profiles, including highest probability words and FREX words. FREX weights words by their overall frequency and how exclusive they are to the topic (calculated as given in Equation~\ref{eq:frex}).\footnote{In practice when using FREX for labeling we regularize our estimate of the topic-specific distribution over words using a James-Stein shrinkage estimator \citep{hausser2009entropy}. More details are available in the documentation.} Lift weights words by dividing by their frequency in other topics, therefore giving higher weight to words that appear less frequently in other topics. For more information on lift, see \citet{taddy2012multinomial}. Similar to lift, score divides the log frequency of the word in the topic by the log frequency of the word in other topics.
For more information on score, see the \pkg{lda} \proglang{R} package, \url{https://cran.r-project.org/package=lda}. In order to translate these results to a format that can easily be used within a paper, the \code{plot.STM(,type = "labels")} function will print topic words to a graphic device. Notice that in this case, the labels option is specified because the \code{plot.STM} function has several functionalities that we describe below (the options for \code{"perspectives"} and \code{"summary"}).
<>=
labelTopics(poliblogPrevFit, c(3, 7, 20))
@
To examine documents that are highly associated with topics, the \code{findThoughts} function can be used. This function will print the documents highly associated with each topic.\footnote{The \code{theta} object in the \code{stm} model output contains the posterior probability of a topic given a document, which this function uses.} Reading these documents is helpful for understanding the content of a topic and interpreting its meaning. Using syntax from \pkg{data.table} \citep{data.table} the user can also use \code{findThoughts} to make complex queries from the documents on the basis of topic proportions.
In our example, for expositional purposes, we restrict the length of the document to the first 200 characters.\footnote{This uses the object \code{shortdoc} contained in the workspace loaded above, which is the first 200 characters of original text.} We see that Topic 3 describes discussions of the Obama campaign during the 2008 presidential election. Topic 20 discusses the Bush administration.
To print example documents to a graphics device, \code{plotQuote} can be used. The results are displayed in Figure~\ref{fig:example}.
<>=
thoughts3 <- findThoughts(poliblogPrevFit, texts = shortdoc,
                          n = 2, topics = 3)$docs[[1]]
thoughts20 <- findThoughts(poliblogPrevFit, texts = shortdoc,
                           n = 2, topics = 20)$docs[[1]]
@
%manually correcting some unicode errors
<>=
thoughts3 <- c("Here's video of the ad we reported on below that the Obama campaign is running in Ohio responding to the earlier Swift-Boating spot tying Obama to former Weatherman Bill Ayers... With all our pr",
"As noted here and elsewhere, the words 'William Ayers' appeared nowhere in yesterday's debate, despite the fact that the McCain campaign hinted for days that McCain would go hard at Obama's association")
thoughts20 <- c("Waxman calls for release of FBI interviews with Bush and Cheney. In a letter to Attorney General Michael Mukasey today, Rep. Henry Waxman (D-CA), the Chairman of the House Committee on Oversight",
"Report: Bush 'Personally Directed' Gonzales To Strong-Arm Ashcroft At His Bedside In his May 2007 testimony, describing the infamous strong-arming of John Ashcroft done by Andy Card and Alberto")
@
\begin{figure}
\begin{center}
<>=
par(mfrow = c(1, 2), mar = c(.5, .5, 1, .5))
plotQuote(thoughts3, width = 30, main = "Topic 3")
plotQuote(thoughts20, width = 30, main = "Topic 20")
@
\caption{Example documents highly associated with topics 3 and 20.}
\label{fig:example}
\end{center}
\end{figure}
\subsubsection{Estimating metadata/topic relationships}
Estimating the relationship between metadata and topics is a core feature of the \pkg{stm} package. These relationships can also play a key role in validating the topic model's usefulness \citep{grimmer2010bayesian,grimmer2013text}. While \code{stm} estimates the relationship for the $K-1$ simplex, the workhorse function for extracting the relationships and associated uncertainty on all $K$ topics is \code{estimateEffect}.
This function simulates a set of parameters which can then be plotted (which we discuss in great detail below). Typically, users will pass the same model of topical prevalence used in estimating the STM to the \code{estimateEffect} function. The syntax of the \code{estimateEffect} function is designed so users specify the set of topics they wish to use for estimation, and then a formula for metadata of interest. Different estimation strategies and standard plot design features can be used by calling the \code{plot.estimateEffect} function.
\code{estimateEffect} can calculate uncertainty in several ways. The default is \code{"Global"}, which will incorporate estimation uncertainty of the topic proportions into the uncertainty estimates using the method of composition. If users do not propagate the full amount of uncertainty, e.g., in order to speed up computational time, they can choose \code{uncertainty = "None"}, which will generally result in narrower confidence intervals because it will not include the additional estimation uncertainty. Calling \code{summary} on the \code{estimateEffect} object will generate a regression table.
<>=
out$meta$rating <- as.factor(out$meta$rating)
prep <- estimateEffect(1:20 ~ rating + s(day), poliblogPrevFit,
                       metadata = out$meta, uncertainty = "Global")
summary(prep, topics = 1)
@
\subsection{Visualize: Presenting STM results}
The functions we described previously to understand STM results can be leveraged to visualize results for formal presentation. In this section we focus on several of these visualization tools.
\subsubsection{Summary visualization}
Corpus-level visualization can be done in several different ways. The first relates to the expected proportion of the corpus that belongs to each topic. This can be plotted using \code{plot.STM(,type = "summary")}. An example from the political blogs data is given in Figure~\ref{fig:summary}. We see, for example, that the Sarah Palin/Vice President topic (7) is actually a relatively minor proportion of the discourse. The most common topic is a general topic full of words that bloggers commonly use, and therefore is not very interpretable. The words listed in the figure are the top three words associated with the topic.
\begin{figure}
\begin{center}
<>=
plot(poliblogPrevFit, type = "summary", xlim = c(0, .3))
@
\caption{Graphical display of estimated topic proportions.}
\label{fig:summary}
\end{center}
\end{figure}
In order to plot features of topics in greater detail, there are a number of options in the \code{plot.STM} function, such as plotting larger sets of words highly associated with a topic or words that are exclusive to the topic. Furthermore, the \code{cloud} function will plot a standard word cloud of words in a topic\footnote{This uses the \pkg{wordcloud} package \citep{wordcloud}.} and the \code{plotQuote} function provides an easy-to-use graphical wrapper such that complete examples of specific documents can easily be included in the presentation of results.
\subsubsection{Metadata/topic relationship visualization}
We now discuss plotting metadata/topic relationships, as the ability to estimate these relationships is a core advantage of the STM model. The core plotting function is \code{plot.estimateEffect}, which handles the output of \code{estimateEffect}.
First, users must specify the variable that they wish to use for calculating an effect. If there are multiple variables specified in \code{estimateEffect}, then all other variables are held at their sample median.
These parameters include the expected proportion of a document that belongs to a topic as a function of a covariate, or a first difference type estimate, where topic prevalence for a particular topic is contrasted for two groups (e.g., liberal versus conservative). \code{estimateEffect} should be run and the output saved before plotting when it is time intensive to calculate uncertainty estimates and/or because users might wish to plot different quantities of interest using the same simulated parameters from \code{estimateEffect}.\footnote{The help file for this function describes several different ways for uncertainty estimate calculation, some of which are much faster than others.} The output can then be plotted.
%In this example we use a string variable and it is required that for this function that we first convert it into a factor variable.
When the covariate of interest is binary, or users are interested in a particular contrast, the \code{method = "difference"} option will plot the change in topic proportion shifting from one specific value to another. Figure~\ref{fig:difference} gives an example. For factor variables, users may wish to plot the marginal topic proportion for each of the levels (\code{"pointestimate"}).
\begin{figure}
\begin{center}
<>=
plot(prep, covariate = "rating", topics = c(3, 7, 20),
     model = poliblogPrevFit, method = "difference",
     cov.value1 = "Liberal", cov.value2 = "Conservative",
     xlab = "More Conservative ... More Liberal",
     main = "Effect of Liberal vs. Conservative",
     xlim = c(-.1, .1), labeltype = "custom",
     custom.labels = c('Obama', 'Sarah Palin', 'Bush Presidency'))
@
\caption{Graphical display of topical prevalence contrast.}
\label{fig:difference}
\end{center}
\end{figure}
We see Topic 3 is strongly used by conservatives compared to liberals, while Topic 7 is close to the middle but still conservative-leaning. Topic 20, the discussion of Bush, was largely associated with liberal writers, which is in line with the observed trend of conservatives distancing themselves from Bush at the end of his presidency.
Notice how the function makes use of standard labeling options available in the native \code{plot()} function. This allows the user to customize labels and other features of their plots. We note that in the package we leverage generics for the plot functions. As such, one can simply use \code{plot} instead of writing out the full extension.
When users have variables that they want to treat continuously, users can choose between assuming a linear fit or using splines. In the previous example, we allowed for the day variable to have a non-linear relationship in the topic estimation stage. We can then plot its effect on topics. In Figure~\ref{fig:spline}, we plot the relationship between time and the vice presidential topic, topic 7. The topic peaks when Sarah Palin became John McCain's running mate at the end of August in 2008.
\begin{figure}
\begin{center}
<>=
plot(prep, "day", method = "continuous", topics = 7,
     model = poliblogPrevFit, printlegend = FALSE,
     xaxt = "n", xlab = "Time (2008)")
monthseq <- seq(from = as.Date("2008-01-01"),
                to = as.Date("2008-12-01"), by = "month")
monthnames <- months(monthseq)
axis(1, at = as.numeric(monthseq) - min(as.numeric(monthseq)),
     labels = monthnames)
@
\caption{Graphical display of topic prevalence.
Topic 7 prevalence is plotted as a smooth function of day, holding rating at sample median, with 95\% confidence intervals.}
\label{fig:spline}
\end{center}
\end{figure}
\subsubsection{Topical content}
We can also plot the influence of a topical content covariate. A topical content variable allows for the vocabulary used to talk about a particular topic to vary. First, the STM must be fit with a variable specified in the content option. In the example below, the rating variable serves this purpose. It is important to note that this is a completely new model, and so the actual topics may differ in both content and numbering compared to the previous example where no content covariate was used.
<>=
poliblogContent <- stm(out$documents, out$vocab, K = 20,
                       prevalence =~ rating + s(day), content =~ rating,
                       max.em.its = 75, data = out$meta,
                       init.type = "Spectral")
@
Next, the results can be plotted using the \code{plot.STM(,type = "perspectives")} function. This function shows which words within a topic are more associated with one covariate value versus another. In Figure~\ref{fig:perp}, vocabulary differences by rating are plotted for topic 11. Topic 11 is related to Guantanamo. Its top FREX words were ``tortur, detaine, court, justic, interrog, prosecut, legal''. However, Figure~\ref{fig:perp} lets us see how liberals and conservatives talk about this topic differently. In particular, liberals emphasized ``torture'' whereas conservatives emphasized typical court language such as ``illegal'' and ``law.''\footnote{At this point you can only have a single variable as a content covariate, although that variable can have any number of groups. It cannot be continuous. Note that the computational cost of this type of model rises quickly with the number of groups and so it may be advisable to keep it small.}
\begin{figure}[ht!]
\begin{center}
<>=
plot(poliblogContent, type = "perspectives", topics = 11)
@
\caption{Graphical display of topical perspectives.}
\label{fig:perp}
\end{center}
\end{figure}
This function can also be used to plot the contrast in words across two topics.\footnote{This plot calculates the difference in probability of a word for the two topics, normalized by the maximum difference in probability of any word between the two topics.} To show this we go back to our original model that did not include a content covariate and we contrast topic 12 (Iraq war) and 20 (Bush presidency). We plot the results in Figure~\ref{fig:perp2}.
\begin{figure}[ht!]
\begin{center}
<>=
plot(poliblogPrevFit, type = "perspectives", topics = c(12, 20))
@
\caption{Graphical display of topical contrast between topics 12 and 20.}
\label{fig:perp2}
\end{center}
\end{figure}
\FloatBarrier
\subsubsection{Plotting covariate interactions}
Another modification that is possible in this framework is to allow for interactions between covariates such that one variable may ``moderate'' the effect of another variable. In this example, we re-estimated the STM to allow for an interaction between day (entered linearly) and rating. Then in \code{estimateEffect()} we include the same interaction. This allows us to plot the interaction with \code{plot.estimateEffect}. We display the results in Figure~\ref{fig:spline2} for topic 20 (Bush administration). We observe that conservatives never wrote much about this topic, whereas liberals discussed this topic a great deal, but over time the topic diminished in salience.
<>=
poliblogInteraction <- stm(out$documents, out$vocab, K = 20,
                           prevalence =~ rating * day, max.em.its = 75,
                           data = out$meta, init.type = "Spectral")
@
We have chosen to enter the day variable here linearly for simplicity; however, we note that you can use the software to estimate interactions with non-linear variables such as splines. Note that \code{plot.estimateEffect} only supports interactions with at least one binary effect modification covariate.
\begin{figure}
\begin{center}
<>=
prep <- estimateEffect(c(20) ~ rating * day, poliblogInteraction,
                       metadata = out$meta, uncertainty = "None")
plot(prep, covariate = "day", model = poliblogInteraction,
     method = "continuous", xlab = "Days", moderator = "rating",
     moderator.value = "Liberal", linecol = "blue", ylim = c(0, .12),
     printlegend = FALSE)
plot(prep, covariate = "day", model = poliblogInteraction,
     method = "continuous", xlab = "Days", moderator = "rating",
     moderator.value = "Conservative", linecol = "red", add = TRUE,
     printlegend = FALSE)
legend(0, .08, c("Liberal", "Conservative"), lwd = 2, col = c("blue", "red"))
@
\caption{Graphical display of a covariate interaction effect on topical prevalence. This plots the interaction between time (day of blog post) and rating (liberal versus conservative). Topic 20 prevalence is plotted as a linear function of time, holding the rating at either 0 (Liberal) or 1 (Conservative). Were other variables included in the model, they would be held at their sample medians.}
\label{fig:spline2}
\end{center}
\end{figure}
More details are available in the help file for this function.\footnote{An additional option is the use of local regression (loess). In this case, because multiple covariates are not possible, a separate function is required, \code{plotTopicLoess}, which contains a help file for interested users.}
\FloatBarrier
\subsection{Extend: Additional tools for interpretation and visualization}
There are multiple other ways to visualize results from an STM model. Topics themselves may be nicely presented as a word cloud. For example, Figure~\ref{fig:cloud} uses the \code{cloud} function to plot a word cloud of the words most likely to occur in blog posts related to the vice presidential candidates topic in the 2008 election.
\begin{figure}
\vspace{-20pt}
\begin{center}
<>=
cloud(poliblogPrevFit, topic = 7, scale = c(2, .25))
@
\caption{Word cloud display of the vice presidential topic.}
\label{fig:cloud}
\end{center}
\vspace{-20pt}
\end{figure}
In addition, the Structural Topic Model permits correlations between topics. Positive correlations between topics indicate that both topics are likely to be discussed within a document. These can be visualized using \code{plot.topicCorr()}. The user can specify a correlation threshold. If two topics are correlated above that threshold, then those two topics are considered to be linked. After calculating which topics are correlated with one another, \code{plot.topicCorr} produces a layout of topic correlations using a force-directed layout algorithm, which we present in Figure~\ref{fig:correlations}. We can use the correlation graph to observe the connection between topics, such as topics 12 (Iraq War) and 20 (Bush administration). \code{plot.topicCorr} has several options that are described in the help file.
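For example, the threshold can be raised so that only strongly correlated topics are linked; a brief sketch, assuming the \code{cutoff} argument of \code{topicCorr} (see \code{?topicCorr}). The figure below simply uses the default settings.
<>=
# Sketch: only draw an edge between topics whose correlation exceeds 0.05.
plot(topicCorr(poliblogPrevFit, method = "simple", cutoff = 0.05))
@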
<>=
mod.out.corr <- topicCorr(poliblogPrevFit)
@
\begin{figure}
\vspace{-20pt}
\begin{center}
<>=
plot(mod.out.corr)
@
\caption{Graphical display of topic correlations.}
\label{fig:correlations}
\end{center}
\vspace{-20pt}
\end{figure}
Finally, there are several add-on packages that take output from a structural topic model and produce additional visualizations. In particular, the \pkg{stmBrowser} package contains functions to write out the results of a structural topic model to a d3-based web visualization.\footnote{Available at \href{https://github.com/mroberts/stmBrowser}{https://github.com/mroberts/stmBrowser}.} The browser facilitates comparing topics, analyzing relationships between metadata and topics, and reading example documents. The \pkg{stmCorrViz} package provides a different d3 visualization environment that focuses on visualizing topic correlations using a hierarchical clustering approach that groups topics together.\footnote{Available at \href{https://cran.r-project.org/package=stmCorrViz}{https://cran.r-project.org/package=stmCorrViz}.} The function \code{toLDAvis} enables export to the \pkg{LDAvis} package \citep{LDAvis}, which helps view topic-word distributions. Additional packages have been developed by the community to support use of STM, including \pkg{tidystm} \citep{tidystm}, a package for making \pkg{ggplot2} graphics with STM output, \pkg{stminsights} \citep{stminsights}, a graphical user interface for exploring a fitted model, and \pkg{stmprinter} \citep{stmprinter}, a way to create automated reports of multiple stm runs.
\FloatBarrier
\subsubsection{Customizing Visualizations}
The plotting functions invisibly return the calculations necessary to produce the plots. Thus by saving the result of a plot function to an object, the user can gain access to the necessary data to make plots of their own. We also provide the \code{thetaPosterior} function which allows the user to simulate draws of the document-topic proportions from the variational posterior. This can be used to include uncertainty in any calculation that the user might want to perform on the topics.
\section{Changing basic estimation defaults}
\label{sec:defaults}
In this section, we explain how to change default settings in the \pkg{stm} package's estimation commands. We start by discussing how to choose among different methods for initializing model parameters. We then discuss how to set and evaluate convergence criteria. Next we describe a method for accelerating convergence when the analysis includes tens of thousands of documents or more. Finally we discuss some variations on content covariate models which allow the user to control model complexity.
\subsection{Initialization}
\label{sec:init}
As with most topic models, the objective function maximized by STM is multimodal. This means that the way we choose the starting values for the variational EM algorithm can affect our final solution. We provide four methods of initialization that are accessed using the argument \code{init.type}: Latent Dirichlet Allocation via collapsed Gibbs sampling (\code{init.type = "LDA"}); a Spectral algorithm for Latent Dirichlet Allocation (\code{init.type = "Spectral"}); random starting values (\code{init.type = "Random"}); and user-specified values (\code{init.type = "Custom"}). Spectral is the default option and initializes parameters using a moment-based estimator for LDA due to \citet{arora2012practical}. LDA uses several passes of collapsed Gibbs sampling to initialize the algorithm.
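As a brief sketch, the initialization method is chosen when the model is fit; for instance, the running example could be refit with the LDA initialization (the object name here is illustrative):
<>=
# Sketch: unlike "Spectral", the LDA initialization is stochastic, so a seed
# is set to make the starting values reproducible.
poliblogLDAInit <- stm(documents = out$documents, vocab = out$vocab, K = 20,
                       prevalence =~ rating + s(day), max.em.its = 75,
                       data = out$meta, init.type = "LDA", seed = 8458159)
@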
The exact parameters for this initialization can be set using the argument \code{control}. Finally, the random algorithm draws the initial state from a Dirichlet distribution. The random initialization strategy is included primarily for completeness; in general, the other two strategies should be preferred. \citet{chaptermulti} provides details on these initialization methods and a study of their performance. In general, the spectral initialization outperforms LDA, which in turn outperforms random initialization.

Each time the model is run, the random seed is saved in the output object under \code{settings$seed}. This can be passed to the \code{seed} argument of \code{stm} to replicate the same starting values.
\subsection{Convergence criteria}
Estimation in the STM proceeds by variational EM. Convergence is controlled by the relative change in the variational objective. Denoting by $\ell_t$ the approximate variational objective at time $t$, convergence is declared when the quantity $(\ell_t - \ell_{t-1})/|\ell_{t-1}|$ drops below tolerance. The default tolerance is 1e-5 and can be changed using the \code{emtol} argument.

The argument \code{max.em.its} sets the maximum number of iterations. If this threshold is reached before convergence, a message will be printed to the screen. The default of 500 iterations is simply a general guideline. A model that fails to converge can be restarted using the \code{model} argument in \code{stm}. See the documentation for \code{stm} for more information.

The default is to have the status of iterations print to the screen. The \code{verbose} option turns printing to the screen on and off. During the E-step, the algorithm prints one dot for every $1\%$ of the corpus it completes and announces completion along with timing information. Printing for the M-step depends on the algorithm being used. For models without content covariates or other changes to the topic-word distribution, M-step estimation should be nearly instantaneous. For models with content covariates, the algorithm is set to print dots to indicate progress. The exact interpretation of the dots differs with the choice of model (see the help file for more details). By default every 5th iteration will print a report of top topic and covariate words. The \code{reportevery} option sets how often these reports are printed.

Once a model has been fit, convergence can easily be assessed by plotting the variational bound as in Figure~\ref{fig:converge}.
\begin{figure}
\begin{center}
<>=
plot(poliblogPrevFit$convergence$bound, type = "l",
  ylab = "Approximate Objective", main = "Convergence")
@
\caption{Graphical display of convergence.}
\label{fig:converge}
\end{center}
\end{figure}
\subsection{Accelerating convergence}
When the number of documents is large, convergence in topic models can be slow. This is because each iteration requires a complete pass over all the documents before updating the global parameters. To accelerate convergence we can split the documents into several equal-sized blocks and update the global parameters after each block.\footnote{This is related to and inspired by the work of \cite{hughes2013memoized} on memoized online variational inference for the Dirichlet Process mixture model.
This is still a batch algorithm because every document in the corpus is used in the global parameter updates, but we update the global parameters several times between the update of any particular document.} The option \code{ngroups} specifies the number of blocks, and setting it equal to an integer greater than one turns on this functionality. Note that increasing the \code{ngroups} value can dramatically increase the memory requirements of the function because a sufficient statistics matrix for $\beta$ must be maintained for every block. Thus as the number of blocks is increased we are trading off memory for computational efficiency. While theoretically this approach should increase the speed of convergence, it needn't do so in any particular case. Also, because the updates occur in a different order, the model is likely to converge to a different solution.
\subsection{SAGE}
The Sparse Additive Generative (SAGE) model conceptualizes topics as sparse deviations from a corpus-wide baseline \citep{eisenstein2011sparse}. While computationally more expensive, this can sometimes produce higher quality topics. Whereas LDA tends to assign many rare words (words that appear only a few times in the corpus) to a topic, the regularization of the SAGE model ensures that words load onto topics only when they have sufficient counts to overwhelm the prior. In general, this means that SAGE topics have fewer unique words that distinguish one topic from another, but those words are more likely to be meaningful. Importantly for our purposes, the SAGE framework makes it straightforward to add covariate effects into the content portion of the model.
\paragraph{Covariate-Free SAGE} While SAGE topics are enabled automatically when using a covariate in the content model, they can also be used even without covariates. To activate SAGE topics simply set the option \code{LDAbeta = FALSE}.
\paragraph{Covariate-Topic Interactions} By default when a content covariate is included in the model, we also include covariate-topic interactions. In our political blog corpus, for example, this means that the probability of observing a word from a Conservative blog in Topic 1 is formed by combining the baseline probability, the Topic 1 component, the Conservative component and the Topic 1 - Conservative interaction component. Users can turn off interactions by specifying the option \code{interactions = FALSE}. This can be helpful in settings where there isn't sufficient data to make reasonable inferences about all the interaction parameters. It also reduces the computational intensity of the model.
\section{Alternate priors}
\label{sec:priors}
In this section we review options for altering the prior structure in the \code{stm} function. We highlight the alternatives and provide intuition for the properties of each option. We chose default settings that we expect will perform best in the majority of cases, and thus changing these settings should only be necessary if the defaults are not performing well.
\subsection{Changing estimation of prevalence covariate coefficients}
The user can choose between two options: \code{"Pooled"} and \code{"L1"}. The difference between these two is that the \code{"L1"} option can induce sparsity in the coefficients (i.e., many are set exactly to zero) while the \code{"Pooled"} estimator is computationally more efficient. \code{"Pooled"} is the default option and estimates a model where the coefficients on topic prevalence have a zero-mean Normal prior with variance given a Half-Cauchy(1,1) prior.
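Writing a generic prevalence coefficient as $\gamma$ and suppressing its indexing over covariates and topics, this pooled prior structure can be sketched schematically as
\[
\gamma \sim \text{Normal}(0, \sigma^2), \qquad \sigma^2 \sim \text{Half-Cauchy}(1, 1).
\]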
This provides moderate shrinkage towards zero but does not induce sparsity. In practice we recommend the default \code{"Pooled"} estimator unless the prevalence covariates are very high dimensional (such as a factor with hundreds of categories).

You can also choose \code{gamma.prior = "L1"}, which uses the \pkg{glmnet} package \citep{friedman2010regularization} to allow for grouped penalties between the L1 and L2 norm. In these settings we estimate a regularization path and then select the optimal shrinkage parameter using a user-tunable information criterion. By default, selecting the L1 option will apply the L1 penalty and choose the shrinkage parameter using AIC. The defaults have been specifically tuned for the STM, but almost all the relevant arguments can be changed through the \code{control} argument. Changing the \code{gamma.enet} parameter by specifying \code{control = list(gamma.enet = .5)} allows the user to choose a mix between the L1 and L2 norms. When set to 1 (the default) this is the lasso penalty; when set to 0 it is the ridge penalty. Any value in between is a mixture called the elastic net. Using some version of \code{gamma.prior = "L1"} is particularly computationally efficient when the prevalence covariate design matrix is highly sparse, for example because there is a factor variable with hundreds or thousands of levels.
\subsection{Changing the covariance matrix prior}
The \code{sigma.prior} argument is a value between 0 and 1; by default, it is set to 0. The update for the covariance matrix is formed by taking the convex combination of the diagonalized covariance and the MLE with weight given by the prior \citep{stmjasa}. Thus by default we are simply maximizing the likelihood. When \code{sigma.prior = 1} this amounts to setting a diagonal covariance matrix. This argument can be useful in settings where topics are at risk of becoming too highly correlated. However, in extensive testing we have come across very few cases where this was needed.
\subsection{Changing the content covariate prior}
The \code{kappa.prior} option provides two sparsity-promoting priors for the content covariates. The default is \code{kappa.prior = "L1"} and uses \pkg{glmnet} and the distributed multinomial formulation of \citet{taddy2013distributed}. The core idea is to decouple the update into a sequence of independent L1-regularized Poisson models with plugin estimators for the document-level shared effects. See \citet{stmjasa} for more details on the estimation procedure. The regularization parameter is set automatically as detailed in the \code{stm} help file.

To maintain backwards compatibility we also provide estimation using a scale mixture of Normals where the precisions $\tau$ are given improper Jeffreys priors $1/\tau$. This option can be accessed by setting \code{kappa.prior = "Jeffreys"}. We caution that this can be much slower than the default option.

There are over twenty additional options, accessed through the \code{control} argument and documented in \code{stm}, for altering additional components of the prior. Essentially every aspect of the priors for the content covariate and prevalence covariate specifications can be adjusted.
\section{Performance and design}
\label{sec:performance}
The primary reason to use the \pkg{stm} package is the rich feature set summarized in Figure~\ref{fig:stmreview}. However, a key part of making the tool practical for everyday use is increasing the speed of estimation.
Due to the non-conjugate model structure, Bayesian inference for the Structural Topic Model is challenging and computationally intensive. Over the course of developing the \pkg{stm} package we have continually introduced new methods to make estimating the model faster. In this section, we demonstrate large performance gains over the closest analog accessible through \proglang{R} and then detail some of the approaches that make those gains possible.
\subsection{Benchmarks}
Without the inclusion of covariates, STM reduces to a logistic-normal topic model, often called the Correlated Topic Model (CTM) \citep{blei2007correlated}. The \pkg{topicmodels} package in \proglang{R} provides an interface to David Blei's original \proglang{C} code to estimate the CTM \citep{hornik2011topicmodels}. This provides us with the opportunity to produce a direct comparison to a comparable model. While the generative models are the same, the variational approximations to the posterior are distinct: the Blei code and \pkg{stm} take different approaches to the nonconjugacy problem, with ours building off of later work by Blei's group \citep{wang2013variational}.

In order to provide a direct comparison we use a set of 5000 randomly sampled documents from the poliblog2008 corpus described above. This set of documents is included as \code{poliblog5k} in the \pkg{stm} package. We want to evaluate both the speed with which the model estimates and the quality of the solution. Due to the differences in the variational approximation, the objective functions are not directly comparable, so we use an estimate of the expected per-document held-out log-likelihood. With the built-in function \code{make.heldout} we construct a dataset in which 10\% of documents have half of their words removed. We can then evaluate the quality of inferred topics on the same evaluation set across all models.
<>=
set.seed(2138)
heldout <- make.heldout(poliblog5k.docs, poliblog5k.voc)
@
We arbitrarily decided to evaluate performance using \code{K = 100} topics. The function \code{CTM} in the \pkg{topicmodels} package uses different default settings than the original Blei code. Thus we present results using both sets of settings. We start by converting our document format to the simple triplet matrix format used by the \pkg{topicmodels} package:
<>=
slam <- convertCorpus(heldout$documents, heldout$vocab, type = "slam")
@
We then estimate the model with both sets of defaults:
<>=
mod1 <- CTM(slam, k = 100)
control_CTM_VEM <- list(estimate.beta = TRUE, verbose = 1,
  seed = as.integer(2138), nstart = 1L, best = TRUE,
  var = list(iter.max = 20, tol = 1e-6),
  em = list(iter.max = 1000, tol = 1e-3),
  initialize = "random", cg = list(iter.max = -1, tol = 1e-6))
mod2 <- CTM(slam, k = 100, control = control_CTM_VEM)
@
For the STM we estimate two versions, one with default convergence settings and one with \code{emtol = 1e-3} to match the Blei convergence tolerance. In both cases we use the spectral initialization method, which we generally recommend.
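After the STM runs below have been estimated, held-out log-likelihoods of the kind reported in Table~\ref{tab:perform} can be computed with the companion function \code{eval.heldout} (discussed further in the appendix); a minimal sketch for one of the models is
<>=
# Hypothetical post-estimation step: expected per-document held-out
# log-likelihood for the words removed by make.heldout
eval.heldout(stm.mod1, heldout$missing)$expected.heldout
@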
<>=
stm.mod1 <- stm(heldout$documents, heldout$vocab, K = 100,
  init.type = "Spectral")
stm.mod2 <- stm(heldout$documents, heldout$vocab, K = 100,
  init.type = "Spectral", emtol = 1e-3)
@
\begin{table}[]
\centering
\begin{tabular}{r|cccc}
 & CTM & CTM (alt) & STM & STM (alt) \\
\hline
\# of Iterations & 38 & 18 & 30 & 7 \\
Total Time & 4527.1 min & 142.6 min & 14.3 min & 5.3 min \\
Time Per Iteration* & 119.1 min & 7.9 min & 0.5 min & 0.8 min \\
Heldout Log-Likelihood & -7.11 & -7.10 & -6.81 & -6.86
\end{tabular}
\caption{Performance Benchmarks. Models marked with ``(alt)'' are alternate specifications with different convergence thresholds as defined in the text. *Time per iteration was calculated by dividing the total run time by the number of iterations. For CTM this is a good estimate of the average time per iteration, whereas for STM this distributes the cost of initialization across the iterations.}
\label{tab:perform}
\end{table}
We report the results in Table~\ref{tab:perform}. The results clearly demonstrate the superior performance of the \pkg{stm} implementation of the correlated topic model. Better solutions (as measured by higher heldout log-likelihood) are found with fewer iterations and a faster runtime per iteration. In fact, at comparable convergence thresholds the \pkg{stm} is able to run completely to convergence before the CTM has made a single pass through the data.\footnote{To even more closely mirror the CTM code we ran STM using a random initialization (which we don't typically recommend). Under otherwise default settings the expected heldout log-likelihood was -6.86, better than the CTM implementation and consistent with the spectral initialization under the alternative convergence threshold. It took 56 minutes and 180 iterations to converge, averaging less than 19 seconds per iteration. Despite taking many more iterations, this is still dramatically faster in total time than the CTM.}

These results are particularly surprising given that the variational approximation used by STM is more involved than the one used in \cite{blei2007correlated} and implemented in \pkg{topicmodels}. Rather than use a set of univariate Normals to represent the variational distribution, STM uses a Laplace approximation to the variational objective as in \cite{wang2013variational}, which requires a full covariance matrix for each document. Nevertheless, through a series of design decisions, which we highlight next, we have been able to speed up the code considerably.
\subsection{Design}
In \cite{blei2007correlated} and \pkg{topicmodels}, the inference procedure for each document involves iterating over four blocks of variational parameters: those that control the mean of the topic proportions, the variance of the topic proportions, the individual token assignments, and an ancillary parameter that handles the nonconjugacy. Two of these parameter blocks have no closed-form updates and require numerical optimization. This in turn makes the sequence of updates very slow. By contrast, in STM we combine the Laplace approximate variational inference scheme of \cite{wang2013variational} with a partially collapsed strategy inspired by \cite{khan2009variational} in which we analytically integrate out the token-level parameters. This allows us to perform one numerical optimization which optimizes the variational mean of the topic proportions ($\lambda$) and then solves in closed form for the variance and the implied token assignments.
This removes iteration between blocks of parameters within each document, dramatically speeding convergence. Details can be found in \cite{stmjasa}. We use quasi-Newton methods to optimize $\lambda$, initializing at the previous iteration's estimate. This warm-starting of the optimization means that the cost of inference per iteration often drops considerably as the model approaches convergence (e.g., in the example above the first iteration takes close to 45 seconds but quickly drops down to 30 seconds). Because this optimization step is the main bottleneck for performance, we code the objective function and gradient in the fast \proglang{C++} library \code{Armadillo} using the \pkg{RcppArmadillo} package \citep{RcppArmadillo}. After computing the optimal $\lambda$ we calculate the variance of the variational posterior by analytically computing the Hessian (also implemented in \proglang{C++}) and efficiently inverting it via the Cholesky decomposition.

The \pkg{stm} implementation also benefits from better model initialization strategies. \pkg{topicmodels} only allows for a model to be initialized randomly or by a pre-existing model. By contrast, \pkg{stm} provides two powerful and fast initialization strategies as described above in Section~\ref{sec:init}.

Numerous optimizations have been made to address models with covariates as well. Of particular note is the use of the distributed multinomial regression framework \citep{taddy2013distributed} in settings with content covariates and an $L_1$ penalty. This approach can often be orders of magnitude faster than the alternative.
\subsection{Access to Underlying Functions}
We also provide the function \code{optimizeDocument}, which performs the E-step optimization for a single document, so that users who may wish to extend the package can access the core optimization functionality.\footnote{For example, this function would be helpful in developing new approaches to modeling the metadata or implementations that parallelize over documents.} For users who simply want to find topic proportions for previously unseen documents we have the easier-to-use \code{fitNewDocuments} function. In order to use \code{fitNewDocuments} the vocabularies of the new and old documents must be aligned, which can be accomplished through the helper function \code{alignCorpus}.
\section{Conclusion}
\label{sec:conclusion}
The \pkg{stm} package provides a powerful and flexible environment to perform text analysis that integrates both document metadata and topic modeling. In doing so it allows researchers to understand which variables are linked with different features of text within the topic modeling framework. This paper provides an overview of how to use some of the features of the \pkg{stm} package, starting with ingestion, preparation, and estimation, and leading up to evaluation, understanding, and visualization. We encourage users to consult the extensive help files for more details and additional functionality. On our website \url{structuraltopicmodel.com} we also provide the companion papers that illustrate the application of this method. We invite users to write their own add-on packages, such as the \pkg{tidystm} \citep{tidystm} and \pkg{stminsights} \citep{stminsights} packages. Furthermore, there are always gains in efficiency to be had, both in theoretical optimality and in applied programming practice. The STM is undergoing constant streamlining and revision towards faster, more optimal computation.
This includes an ongoing project on parallel computation of the STM. As corpus sizes increase, the STM will also grow in its capacity to handle more documents and more varied metadata.
\section{Appendix: Additional evaluation tools}
\label{sec:eval}
In this appendix we discuss several more advanced features of the \pkg{stm} package. Topic estimation is fundamentally imprecise, as estimation in topic model space requires both an a priori number of topics input by the user and an optimization in a space with multiple solutions. Due to the intractability underlying the computation of topic models, we rely on external analytics of our model to understand its unique tradeoffs between competing parameters. The \pkg{stm} package contains a variety of tools that can be used to evaluate the quality of the model as well as the user's choice of number of topics and of metadata selected for inclusion.
\subsection{Held-out likelihood estimation}
Sometimes users will want to compare model specifications to see how well each model does in predicting words within the document. The \pkg{stm} package contains two different functions to aid with held-out likelihood estimation. We use the document-completion held-out likelihood method, which estimates the probability of words appearing within a document when those words have been removed from the document in the estimation step \citep{asuncion2009smoothing, wallach2009evaluation, hoffman2013stochastic}. Similar to cross-validation, when some of the data is removed from estimation and then later used for validation, the held-out likelihood helps the user assess the model's prediction performance.

The first of the two functions, \code{make.heldout}, produces a document set where some of the words within the documents have been removed. The user can then run an STM model on the documents with missing words. The second, \code{eval.heldout}, evaluates the held-out likelihood for missing words based on the model run on the held-out documents. If users want to fit a previously unseen document that was not part of a held-out set created using \code{make.heldout}, they can use the \code{fitNewDocuments} function.
\subsection{Residuals checks}
Users can also test the assumptions of the model within the package through the function \code{residuals}. This function implements residual checks described in Section 4.2 of \citet{taddyestimation}, testing for overdispersion of the variance of the multinomial within the data generating process of STM. As described in \citet{taddyestimation}, if the residuals are overdispersed, it could be that more topics are needed to soak up some of the extra variance. While no foolproof method has been developed to choose the number of topics, both the residual checks and held-out likelihood estimation are useful indicators of the number of topics that should be chosen.\footnote{In addition to these functions, one can also explore whether there are words that are extremely highly associated with a single topic via the \code{checkBeta} function.}
\subsection{Checks for multi-modality}
Another diagnostic that should be completed while running the STM, when not using the spectral initialization, is checking to see how multi-modal the model of interest is. We provide a suite of methods to assess multi-modality, and we refer the reader to \citet{chaptermulti} for an explanation of all of them. The function \code{multiSTM} aligns topics across models.
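As a hedged sketch of how a call might look (here \code{poliblogSelect} is assumed to hold the output of a \code{selectModel} run, and the argument values are patterned on the package documentation rather than taken from this paper):
<>=
# Hypothetical multi-modality check across the candidate runs kept by selectModel
mod.align <- multiSTM(poliblogSelect, mass.threshold = 0.75,
  reg.formula = ~ rating, metadata = out$meta)
plot(mod.align)
@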
The function \code{plot.MultimodDiagnostic} plots the effects across topics and models. Both enable the researcher to investigate how different solutions lead to different inferences about the data.
\subsection{Post-estimation permutation checks}
Any statistical procedure can be abused and STM is no different. One concern is that users will simply search out the strongest covariate-topic relationships. A related concern is that by combining the measurement model with the estimation of an effect, the user is ``baking in'' the conclusion. In the appendix of \cite{ajps} we address this concern using both a simulation and a permutation test approach. We have built in a function for conducting permutation tests using binary prevalence covariates.\footnote{Future work could extend this to other types of covariates.} The \code{permutationTest} function takes a formula containing a single binary covariate (and optionally other controls) and runs a permutation test where, rather than using the true assignment, the covariate is randomly assigned to a document with probability equal to its empirical probability in the data.

After each shuffle of the covariate the same STM model is estimated at different starting values using the same initialization procedure as the original model, and the effect of the covariate across topics is calculated. Next the function records two quantities of interest across this set of ``runs'' of the model. The first quantity reports the absolute maximum effect of the permuted covariate across all topics. The second quantity reports the effect of the (permuted) covariate on the topic, in each additional STM run, that is estimated to be closest to the topic of interest. The object returned from \code{permutationTest} can then be passed to \code{plot.STMpermute} for plotting.
\clearpage
\pdfbookmark[1]{References}{References}
\bibliography{extracted}
\end{document}