---
title: "Network Estimation and Analysis with Nestimate"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Network Estimation and Analysis with Nestimate}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  out.width = "100%",
  fig.width = 7,
  fig.height = 5,
  dpi = 96,
  warning = FALSE,
  message = FALSE,
  output.lines = 40
)
local({
  hook_output <- knitr::knit_hooks$get("output")
  knitr::knit_hooks$set(output = function(x, options) {
    n <- options$output.lines
    if (!is.null(n)) {
      lines <- strsplit(x, "\n", fixed = TRUE)[[1]]
      if (length(lines) > n) {
        x <- paste(
          c(utils::head(lines, n),
            sprintf("#> [... %d more lines ...]", length(lines) - n)),
          collapse = "\n"
        )
      }
    }
    hook_output(x, options)
  })
})
options(max.print = 100)
```

`Nestimate` is a unified framework for estimating, validating, and comparing networks from sequential and cross-sectional data. It implements two complementary paradigms: **Transition Network Analysis (TNA)**, which models the relational dynamics of temporal processes as weighted directed networks using stochastic Markov models; and **Psychological Network Analysis (PNA)**, which estimates the conditional dependency structure among variables using regularized partial correlations and graphical models.

Both paradigms share the same `build_network()` interface, the same validation engine (bootstrap, permutation, centrality stability), and the same output format --- enabling researchers to apply a consistent analytic workflow across fundamentally different data types. This vignette demonstrates both paradigms, covering network estimation, statistical validation, data-driven clustering, and group comparison.

# Part I: Transition Network Analysis

## Theoretical Grounding

TNA uses stochastic process modeling to capture the dynamics of temporal processes via Markov models.
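Before the formal treatment, the core estimation idea can be sketched in a few lines of base R (a made-up toy sequence, independent of `Nestimate`): count transitions between consecutive events, then row-normalize the counts into conditional probabilities.

```r
# Toy event sequence (made up for illustration)
seq_events <- c("Plan", "Code", "Test", "Code", "Test", "Plan", "Code")

# Pair each event with its successor and count the transitions
from <- seq_events[-length(seq_events)]
to   <- seq_events[-1]
counts <- table(factor(from, levels = unique(seq_events)),
                factor(to,   levels = unique(seq_events)))

# Row-normalize counts into transition probabilities P(to | from)
trans_probs <- prop.table(counts, margin = 1)
round(trans_probs, 2)
```

Each row of `trans_probs` sums to 1 and gives the conditional distribution over next states; such a matrix is what TNA then treats as a weighted directed network.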
Markov models align with the view that a temporal process is the outcome of a stochastic data-generating process that produces various network configurations or patterns based on rules, constraints, or guiding principles. Because the process is stochastic, the specific ways in which the system changes or evolves cannot be strictly determined; instead, each transition is probabilistically dependent on the preceding states.

The main principle of TNA is to represent the transition matrix between events as a graph, taking full advantage of graph theory and the wealth of network analysis methods. TNA provides network measures at the node, edge, and graph levels; pattern mining through dyads, triads, and communities; clustering of sub-networks into typical behavioral strategies; and rigorous statistical validation of each edge through bootstrapping, permutation, and case-dropping techniques. This statistical rigor --- validation and hypothesis testing at every step of the analysis --- gives researchers a robust scientific basis for building, verifying, and advancing theory.

For a comprehensive introduction to TNA in learning analytics, see Saqr et al. (2025a) and Saqr (2024). For detailed tutorials on TNA clustering and heterogeneity analysis, see López-Pernas et al. (2025).

## Data

The `human_long` dataset contains 10,796 coded human interaction turns from 429 human-AI pair programming sessions across 34 projects. Each row is a single turn, with `code` recording the interaction type, `session_id` identifying the session, and `timestamp` providing temporal ordering. For a detailed description of the dataset and coding scheme, see [Saqr (2026)](https://saqr.me/blog/2026/human-ai-interaction-cograph/).
```{r data}
library(Nestimate)

# Subsample for vignette speed (CRAN build-time limit)
set.seed(1)
keep <- sample(unique(human_long$session_id), 100)
human_sub <- human_long[human_long$session_id %in% keep, ]
head(human_sub)
```

The dataset is in long format: `code` records what happened, `session_id` who did it, and `timestamp` when. Additional columns like `project` and `cluster` are automatically preserved as metadata for downstream covariate analysis.

## Building Networks

Building networks in Nestimate is a single step: `build_network()` is the universal entry point for all network estimation. It accepts long-format event data directly with three key parameters:

- **`action`**: the column containing state labels
- **`actor`**: the column identifying sequences (one sequence per actor)
- **`time`**: the column providing temporal ordering

Under the hood, `build_network()` automatically converts the long-format event log into wide-format sequences, handling chronological ordering, session detection, and metadata preservation. You can also call `prepare()` directly to inspect or reuse the processed data before passing it to `build_network()`.

### Transition Network (TNA)

The standard TNA method estimates a first-order Markov model from sequence data. Given a sequence of events, the transition probability $P(v_j | v_i)$ is estimated as the ratio of observed transitions from state $v_i$ to state $v_j$ to the total number of outgoing transitions from $v_i$: that is, $\hat{P}(v_j | v_i) = n_{ij} / \sum_k n_{ik}$, where $n_{ij}$ is the number of observed $v_i \to v_j$ transitions. These probabilities are assembled into a **transition matrix** $T$, where each element $T_{ij}$ represents the estimated probability of transitioning from $v_i$ to $v_j$ (Saqr et al., 2025a).

```{r tna}
net_tna <- build_network(human_sub, method = "tna", action = "code",
                         actor = "session_id", time = "timestamp")
print(net_tna)
```

### Frequency Network (FTNA)

The frequency method preserves raw transition counts rather than normalizing to conditional probabilities.
This is useful when absolute frequencies matter --- a transition occurring 500 times from a common state may be more practically important than one occurring 5 times from a rare state, even if the latter has a higher conditional probability.

```{r ftna}
net_ftna <- build_network(human_sub, method = "ftna", action = "code",
                          actor = "session_id", time = "timestamp")
print(net_ftna)
```

### Attention Network (ATNA)

The attention method applies temporal decay weighting, giving more importance to recent transitions within each sequence. The `lambda` parameter controls the decay rate: higher values produce faster decay. This captures the idea that later events in a process may be more indicative of the underlying dynamics than early ones.

```{r atna}
net_atna <- build_network(human_sub, method = "atna", action = "code",
                          actor = "session_id", time = "timestamp")
print(net_atna)
```

### Co-occurrence Network from Binary Data

When the data is binary (0/1) --- as is common in learning analytics where activities are coded as present or absent within time windows --- `build_network()` automatically detects the format and uses co-occurrence analysis. The resulting undirected network captures which events tend to co-occur.

```{r onehot}
data(learning_activities)
net <- build_network(learning_activities, method = "cna", actor = "student")
print(net)
```

### Window-based TNA (WTNA)

The `wtna()` function computes networks from one-hot encoded (binary) data using temporal windowing.
It supports three modes:

- **`"transition"`**: directed transitions between consecutive windows
- **`"cooccurrence"`**: undirected co-occurrence within windows
- **`"both"`**: a mixed network combining transitions and co-occurrences

```{r wtna-freq}
net_wtna <- wtna(learning_activities, actor = "student",
                 method = "transition", type = "frequency")
print(net_wtna)
```

```{r wtna-relative}
net_wtna_rel <- wtna(learning_activities, method = "transition", type = "relative")
print(net_wtna_rel)
```

### Mixed Network (Transitions + Co-occurrences)

Since states can co-occur within the same window *and* follow each other across windows, a mixed network captures both relationships simultaneously --- modeling co-activity and temporal succession in a single structure.

```{r wtna-mixed}
net_wtna_mixed <- wtna(learning_activities, method = "both", type = "relative")
print(net_wtna_mixed)
```

## Validation

Most research on networks or process mining relies on descriptive methods; validation and statistical significance testing of such models are almost absent from the literature. Validated models allow us to assess robustness and reproducibility, ensuring that the insights we obtain are not merely a product of chance and are therefore generalizable.

### Reliability

Split-half reliability assesses whether the network structure is stable when the data is randomly divided into two halves. High reliability means the network structure is a consistent property of the data, not driven by a small number of idiosyncratic sequences.

```{r network-reliability}
network_reliability(net_tna)
```

### Bootstrap Analysis

Bootstrapping repeatedly draws samples from the original dataset with replacement and re-estimates the model on each sample. When edges consistently appear across the majority of the estimated models, they are considered stable and significant.
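To make the resampling logic concrete, here is a package-independent base-R sketch (toy sequences; `Nestimate`'s `bootstrap_network()` operates on the fitted network object and on all edges at once): resample whole sequences with replacement and re-estimate a single transition probability to obtain a percentile confidence interval.

```r
set.seed(1)
# Toy data: 50 short sequences over three states (made up for illustration)
states <- c("A", "B", "C")
seqs <- replicate(50, sample(states, 6, replace = TRUE), simplify = FALSE)

# Probability of the transition A -> B, pooled over a set of sequences
p_ab <- function(sequences) {
  from <- unlist(lapply(sequences, function(s) s[-length(s)]))
  to   <- unlist(lapply(sequences, function(s) s[-1]))
  sum(from == "A" & to == "B") / sum(from == "A")
}

# Resample whole sequences with replacement and re-estimate each time
boot_est <- replicate(500, p_ab(sample(seqs, replace = TRUE)))
quantile(boot_est, c(0.025, 0.975))  # percentile 95% CI for P(B | A)
```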
The bootstrap also provides confidence intervals and p-values for each edge weight, offering a quantifiable measure of uncertainty and robustness for each transition in the network.

```{r bootstrap}
set.seed(42)
boot <- bootstrap_network(net_tna, iter = 100)
boot
```

### Centrality Stability

Centrality measures quantify the role or importance of each state in the process. Centrality stability analysis quantifies how robust centrality rankings are to case-dropping: the CS-coefficient is the maximum proportion of cases that can be dropped while maintaining a correlation of at least 0.7 with the original centrality values. A CS-coefficient above 0.5 indicates stable rankings; below 0.25 indicates instability.

```{r cs}
centrality_stability(net_tna, iter = 100)
```

## Clustering

Clusters represent typical transition networks that recur across different instances. Unlike communities, clusters involve the entire network: groups of sequences that are similarly interconnected, each exhibiting a distinct transition pattern with its own set of transition probabilities. Identifying clusters captures this heterogeneity, revealing typical behavioral strategies that learners frequently adopt.

`build_clusters()` computes pairwise dissimilarities between sequences and partitions them into `k` groups, then builds a separate network for each cluster (López-Pernas et al., 2025).

```{r clustering}
Cls <- build_clusters(net_tna, k = 3)
Clusters <- build_network(Cls, method = "tna")
Clusters
```

### Permutation Test for Clusters

Permutation testing is particularly important for data-driven clusters: because clustering algorithms partition sequences to maximize between-group separation, some degree of apparent difference is guaranteed by construction. The permutation test provides the necessary corrective --- by randomly reassigning sequences to groups while preserving internal sequential structure, it constructs null distributions for edge-level differences.
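The mechanics can be sketched in base R (toy numbers: one hypothetical per-sequence edge weight in each of two groups, not `Nestimate`'s internal implementation): shuffle the group labels over sequences, recompute the group difference each time, and compare the observed difference against the resulting null distribution.

```r
set.seed(1)
# Hypothetical per-sequence statistic (e.g., one edge weight per sequence);
# the two groups differ by construction -- values are made up for illustration
g1 <- rnorm(40, mean = 0.30, sd = 0.10)
g2 <- rnorm(40, mean = 0.22, sd = 0.10)
obs_diff <- mean(g1) - mean(g2)

# Null distribution: shuffle group labels, recompute the difference
pooled <- c(g1, g2)
null_diff <- replicate(1000, {
  idx <- sample(length(pooled), length(g1))
  mean(pooled[idx]) - mean(pooled[-idx])
})

# Two-sided permutation p-value
mean(abs(null_diff) >= abs(obs_diff))
```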
Only differences that exceed this null distribution constitute evidence of genuine structural divergence rather than algorithmic artifacts.

```{r perm-clusters}
perm <- permutation(Clusters$`Cluster 1`, Clusters$`Cluster 2`, iter = 100)
perm
```

## Post-hoc Covariate Analysis

`build_clusters()` supports post-hoc covariate analysis: covariates do not influence the clustering but are analyzed after the fact to characterize who ends up in which cluster. This is the appropriate approach when the clustering should reflect behavioral patterns alone, and the researcher then asks whether those patterns are associated with external variables.

The `group_regulation_long` dataset contains self-regulated learning sequences annotated with an `Achiever` covariate distinguishing high and low achievers.

```{r posthoc-data}
data("group_regulation_long")
net_GR <- build_network(group_regulation_long, method = "tna",
                        action = "Action", actor = "Actor", time = "Time")
```

```{r posthoc}
Post <- build_clusters(net_GR, k = 2, covariates = c("Achiever"))
summary(Post)
```

```{r posthoc-networks}
Postgr <- build_network(Post)
Postgr
```

# Part II: Psychological Network Analysis

## Theoretical Grounding

Psychological network analysis estimates the conditional dependency structure among a set of variables. Variables (e.g., symptoms, traits, behaviors) are represented as nodes, and edges represent partial correlations --- the association between two variables after controlling for all others. This approach reveals which variables are directly connected versus those whose association is mediated through other variables (Saqr et al., 2024).

`Nestimate` supports three estimation methods for psychological networks, all accessed through the same `build_network()` interface:

- **Correlation networks** (`method = "cor"`) estimate pairwise Pearson correlations and produce fully connected undirected networks.
  While informative as a starting point, they do not distinguish direct from indirect associations.
- **Partial correlation networks** (`method = "pcor"`) control for all other variables, revealing only direct associations. They provide a more accurate picture of the dependency structure, though results can be noisy in small samples.
- **Regularized networks via EBICglasso** (`method = "glasso"`) apply L1 regularization to the precision matrix, shrinking weak or unreliable edges to exactly zero. This is the recommended approach for psychological network analysis, as it balances model fit against complexity and produces sparse, interpretable, and replicable structures.

## Data

The `chatgpt_srl` dataset contains scale scores on five self-regulated learning (SRL) constructs --- Comprehension and Study Understanding (CSU), Intrinsic Value (IV), Self-Efficacy (SE), Self-Regulation (SR), and Task Avoidance (TA) --- for 1,000 responses generated by ChatGPT to a validated SRL questionnaire (Vogelsmeier et al., 2025).

```{r pna-data}
data(chatgpt_srl)
head(chatgpt_srl)
```

## Regularized Network (EBICglasso)

The graphical lasso applies L1 regularization to the precision matrix (the inverse of the covariance matrix), producing a sparse network where weak or unreliable edges are shrunk to exactly zero. The `gamma` parameter controls sparsity through EBIC model selection --- higher values yield sparser networks.

```{r glasso}
net_glasso <- build_network(chatgpt_srl, method = "glasso", params = list(gamma = 0.5))
net_glasso
```

# References

Saqr, M. (2024). Temporal Network Analysis: Introduction, Methods and Analysis with R. In M. Saqr & S. López-Pernas (Eds.), *Learning Analytics Methods and Tutorials: A Practical Guide Using R*. Springer.

Saqr, M., Beck, E., & López-Pernas, S. (2024). Psychological Networks: A Modern Approach to Analysis of Learning and Complex Learning Processes. In M. Saqr & S.
López-Pernas (Eds.), *Learning Analytics Methods and Tutorials: A Practical Guide Using R*. Springer.

Saqr, M., López-Pernas, S., & Tikka, S. (2025a). Mapping Relational Dynamics with Transition Network Analysis: A Primer and Tutorial. In M. Saqr & S. López-Pernas (Eds.), *Advanced Learning Analytics Methods: AI, Precision and Complexity*. Springer Nature Switzerland.

Saqr, M., López-Pernas, S., Törmänen, T., Kaliisa, R., Misiejuk, K., & Tikka, S. (2025b). Transition network analysis: A novel framework for modeling, visualizing, and identifying the temporal patterns of learners and learning processes. *Proceedings of the 15th International Learning Analytics and Knowledge Conference (LAK25)*. ACM.

López-Pernas, S., Tikka, S., & Saqr, M. (2025). Mining Patterns and Clusters with Transition Network Analysis: A Heterogeneity Approach. In M. Saqr & S. López-Pernas (Eds.), *Advanced Learning Analytics Methods: AI, Precision and Complexity*. Springer Nature Switzerland.

Vogelsmeier, L. V. D. E., Oliveira, E., Misiejuk, K., López-Pernas, S., & Saqr, M. (2025). Delving into the psychology of machines: Exploring the structure of self-regulated learning via LLM-generated survey responses. *Computers in Human Behavior*, 173, 108769.