---
title: "Introduction to tipitaka.critical"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to tipitaka.critical}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

The **tipitaka.critical** package provides a lemmatized critical edition of the
complete Pali Canon (Tipitaka), the canonical scripture of Theravada Buddhism.
The text is based on a five-witness collation and is lemmatized using the
Digital Pali Dictionary.

## The texts dataset

The package ships a single dataset, `texts`, containing 5,777 text units
spanning all three pitakas:

```{r texts-overview}
library(tipitaka.critical)

dim(texts)
names(texts)
```

Each row is a text unit (a sutta, a chapter, or a standalone text) with both
the surface-form Pali text and a lemmatized version in which every word is
replaced by its dictionary headword:

```{r texts-example}
# The Brahmajala Sutta (DN 1)
dn1 <- texts[texts$id == "dn1", ]
dn1$title

# First 120 characters of the surface text
cat(substr(dn1$text, 1, 120), "...\n")

# First 120 characters of the lemmatized text
cat(substr(dn1$text_lemmatized, 1, 120), "...\n")
```

The three pitakas and seven collections are:

```{r texts-collections}
table(texts$pitaka)
table(texts$collection)
```

## Lemma frequencies

The `lemmas` dataset is a frequency table computed from the lemmatized text.
It is not shipped with the package but is computed automatically on first
access (about 5 seconds):

```{r lemmas-overview}
dim(lemmas)
head(lemmas)
```

Each row gives the count and frequency of one lemma in one text unit. This
makes it easy to find the most common words across the entire canon:

```{r lemmas-top}
# Sum each lemma's count over all text units
totals <- tapply(lemmas$n, lemmas$word, sum)
head(sort(totals, decreasing = TRUE), 15)
```

The most frequent lemmas are function words: *ta* (that/it), *ti* (quotative
marker), *ca* (and), *na* (not). The first content word is *dhamma* (teaching,
truth, phenomenon), the central concept of the entire canon. Further down,
*bhikkhave* (O monks, vocative) and *bhikkhu* (monk) both appear in the top 20,
reflecting that the primary audience for these teachings was the monastic
community.

The same tabulation can be restricted to a single collection:

```{r lemmas-by-collection}
dn_lemmas <- lemmas[lemmas$collection == "dn", ]
dn_totals <- tapply(dn_lemmas$n, dn_lemmas$word, sum)
head(sort(dn_totals, decreasing = TRUE), 10)
```

## Searching for a lemma

The `search_lemma()` function finds all text units containing a given lemma,
sorted by frequency:

```{r search}
# Where does "nibbana" appear most frequently?
nibbana <- search_lemma("nibbana")
head(nibbana[, c("id", "collection", "n", "freq")])
```

```{r search-dhamma}
# "dhamma" across collections
dhamma <- search_lemma("dhamma")
tapply(dhamma$n, dhamma$collection, sum)
```

## Document-term matrix

The `dtm` dataset is a sparse matrix (from the **Matrix** package) with text
units as rows and lemmas as columns. Values are within-document frequencies.
Like `lemmas`, it is computed on first access:

```{r dtm-overview}
dim(dtm)
class(dtm)

# Sparsity (proportion of zero entries)
1 - length(dtm@x) / prod(dim(dtm))
```

## Visualizing the Canon

The DTM enables standard text-analysis workflows. We can start with a simple
example: hierarchical clustering of the 34 Digha Nikaya suttas.

```{r dn-cluster, fig.width=7, fig.height=4}
dn_ids <- texts$id[texts$collection == "dn"]
dn_dtm <- dtm[dn_ids, ]

# Drop empty columns; Matrix::colSums() works on the sparse matrix
# whether or not the Matrix package is attached
dn_dtm <- dn_dtm[, Matrix::colSums(dn_dtm) > 0]

# Euclidean distances between suttas, then Ward clustering
d <- dist(as.matrix(dn_dtm))
hc <- hclust(d, method = "ward.D2")

plot(hc, main = "Digha Nikaya: Hierarchical Clustering",
     xlab = "", sub = "", cex = 0.7)
```

### PCA of the entire Canon

To see how all 5,777 text units relate to each other, we can project the DTM
into two dimensions using principal component analysis. We use the 500 most
frequent lemmas to keep the computation fast:

```{r pca, fig.width=7, fig.height=6}
# Select top 500 lemmas by total frequency
col_sums <- Matrix::colSums(dtm)
top_terms <- head(names(sort(col_sums, decreasing = TRUE)), 500)
dtm_sub <- as.matrix(dtm[, top_terms])

# PCA on centered, unscaled frequencies
pca <- prcomp(dtm_sub, center = TRUE, scale. = FALSE)
pct_var <- summary(pca)$importance[2, 1:2] * 100

# Color by collection
coll_colors <- c(
  abhidhamma = "#E41A1C", an = "#377EB8", dn = "#4DAF4A",
  kn = "#FF7F00", mn = "#984EA3", sn = "#A65628", vinaya = "#F781BF"
)
pt_col <- coll_colors[texts$collection]

plot(pca$x[, 1], pca$x[, 2],
     col = adjustcolor(pt_col, alpha.f = 0.5),
     pch = 16, cex = 0.6,
     xlab = paste0("PC1 (", round(pct_var[1], 1), "%)"),
     ylab = paste0("PC2 (", round(pct_var[2], 1), "%)"),
     main = "PCA of All Tipitaka Texts")
legend("topright",
       c("Abhidhamma", "AN", "DN", "KN", "MN", "SN", "Vinaya"),
       col = coll_colors, pch = 16, cex = 0.8)
```

The Abhidhamma texts cluster distinctly from the Sutta Pitaka, reflecting
their specialized technical vocabulary. Within the Sutta Pitaka, the five
nikayas overlap substantially but show characteristic tendencies.

### Canon-wide hierarchical clustering

For a dendrogram of the whole canon, we aggregate texts to an intermediate
level: individual suttas for DN and MN, samyuttas for SN, nipatas for AN, and
individual texts for KN, Vinaya, and Abhidhamma.

```{r canon-cluster, fig.width=7, fig.height=10}
# Reuses dtm_sub, top_terms, and coll_colors from the PCA chunk above

# Create group IDs at an intermediate level
group_id <- texts$id

# SN: sn1.1 -> sn1 (by samyutta)
sn_mask <- texts$collection == "sn"
group_id[sn_mask] <- sub("\\..*", "", group_id[sn_mask])

# AN: an1.1 -> an1 (by nipata)
an_mask <- texts$collection == "an"
group_id[an_mask] <- sub("\\..*", "", group_id[an_mask])

# KN: dhp1-20 -> dhp, snp1.1 -> snp, etc. (by text)
kn_mask <- texts$collection == "kn"
group_id[kn_mask] <- sub("[0-9].*", "", group_id[kn_mask])

# Aggregate DTM by group (mean of member frequencies)
groups <- unique(group_id)
group_dtm <- matrix(0, length(groups), length(top_terms))
group_coll <- character(length(groups))
for (i in seq_along(groups)) {
  rows <- which(group_id == groups[i])
  # drop = FALSE keeps a one-row matrix, so colMeans() also handles
  # groups with a single member
  group_dtm[i, ] <- colMeans(dtm_sub[rows, , drop = FALSE])
  group_coll[i] <- texts$collection[rows[1]]
}
rownames(group_dtm) <- groups

# Cluster
d <- dist(group_dtm)
hc <- hclust(d, method = "ward.D2")

# Color labels by collection, in dendrogram leaf order
label_col <- coll_colors[group_coll[hc$order]]
dend <- as.dendrogram(hc)

# Apply colors to leaf labels
color_labels <- function(n, col_vec) {
  if (is.leaf(n)) {
    i <- match(attr(n, "label"), groups[hc$order])
    attr(n, "nodePar") <- list(pch = NA, lab.col = col_vec[i], lab.cex = 0.45)
  }
  n
}
dend <- dendrapply(dend, color_labels, col_vec = label_col)

oldpar <- par(mar = c(2, 1, 2, 8))
plot(dend, horiz = TRUE, main = "Tipitaka: Hierarchical Clustering",
     xlab = "")
legend("topleft",
       c("Abhidhamma", "AN", "DN", "KN", "MN", "SN", "Vinaya"),
       text.col = coll_colors, cex = 0.7, bty = "n")
par(oldpar)
```

The dendrogram reveals how texts cluster by vocabulary: Abhidhamma and Vinaya
texts form their own branches, while within the Sutta Pitaka, texts with
similar subject matter cluster together regardless of which nikaya they
belong to.

## Further resources

The companion package
[tipitaka](https://CRAN.R-project.org/package=tipitaka) provides the original
VRI edition text and Pali text tools, including Pali-alphabet sorting.
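
## Appendix: a toy document-term matrix

The document-term matrix section describes `dtm` as a sparse matrix with text units as rows, lemmas as columns, and within-document frequencies as values. The sketch below illustrates that shape on made-up data; it is not the package's internal code. The table `toy`, its ids and counts, and the definition of frequency as a lemma's count divided by the unit's total count are all assumptions made for illustration.

```r
library(Matrix)

# Hypothetical long-format lemma table (invented ids and counts,
# not taken from the package data)
toy <- data.frame(
  id   = c("t1", "t1", "t2", "t2", "t2"),
  word = c("dhamma", "ca", "dhamma", "na", "ca"),
  n    = c(2, 2, 1, 1, 2)
)

# Assumed definition: within-document frequency is the lemma's count
# divided by the total count of its text unit
toy$freq <- toy$n / ave(toy$n, toy$id, FUN = sum)

ids   <- unique(toy$id)
words <- sort(unique(toy$word))

# One row per text unit, one column per lemma; only the observed
# (id, word) pairs are stored
toy_dtm <- sparseMatrix(
  i = match(toy$id, ids),
  j = match(toy$word, words),
  x = toy$freq,
  dims = c(length(ids), length(words)),
  dimnames = list(ids, words)
)

toy_dtm
rowSums(toy_dtm)  # each row sums to 1 under the assumed definition
```

Storing only the observed pairs is what keeps a matrix of this kind compact when most lemmas are absent from most text units.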