---
title: "Know your nodes"
format: html
vignette: >
  %\VignetteIndexEntry{Know your nodes}
  %\VignetteEngine{quarto::html}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, collapse = TRUE, comment = "#>")
library(parsermd)
```

# Introduction

The parsermd package parses R Markdown and Quarto documents into an Abstract Syntax Tree (AST) representation. This vignette introduces the different types of AST nodes and their properties, helping you understand how parsermd represents document structure.

## AST Container - `rmd_ast`

The `rmd_ast` object serves as the container for all parsed document nodes. It holds a linear sequence of nodes representing different document elements, where each node type corresponds to a specific R Markdown or Quarto construct (headings, code chunks, text, etc.).

**Important**: The AST represents documents as a linear sequence of nodes, not a nested tree structure. This means that structural elements like fenced divs are represented as separate opening and closing nodes in the sequence, rather than as nodes with children.

The default print method for `rmd_ast` objects (`flat = FALSE`) presents an implicit tree structure based on heading levels. This provides a hierarchical view that reflects the document's logical organization, where content is grouped under headings based on their level.

**Properties:**

- `nodes`: A list containing all the parsed nodes in document order

**Example:**

Raw text that would be parsed:

````markdown
---
title: "Example Document"
---

# Introduction

This is some text.

```{{r}}
x <- 1:5
mean(x)
```
````

This would create an `rmd_ast` object containing:

1. `rmd_yaml` node with the title
2. `rmd_heading` node with "Introduction"
3. `rmd_markdown` node with "This is some text."
4. `rmd_chunk` node with the R code

Programmatic creation:

```{r}
ast = rmd_ast(list(
  rmd_yaml(list(title = "Example Document")),
  rmd_heading(name = "Introduction", level = 1L),
  rmd_markdown(lines = "This is some text."),
  rmd_chunk(
    engine = "r",
    code = c("x <- 1:5", "mean(x)")
  )
))
```

:::::: {.columns}

::: {.column width="50%"}
**Hierarchical view (`flat = FALSE`):**

```{r}
print(ast, flat = FALSE)
```
:::

::: {.column width="50%"}
**Linear view (`flat = TRUE`):**

```{r}
print(ast, flat = TRUE)
```
:::

::::::

---

# S7 Class System

parsermd uses the S7 object system for all AST node types. S7 provides a modern, robust class system with:

- **Type safety**: Properties are validated when objects are created or modified
- **Performance**: Efficient method dispatch and memory usage
- **Consistency**: Uniform interface across all node types

**Key S7 Features in parsermd:**

- All node types inherit from the base `rmd_node` class
- Properties are accessed using `@` syntax (e.g., `heading@name`)
- Validation ensures data integrity (proper types, lengths, etc.)
- Method dispatch works seamlessly with generic functions

**Property Access:**

```{r}
# Create a heading node
heading = rmd_heading(name = "Section Title", level = 2L)

# Access properties with @
heading@name
heading@level
```

---

# Core Node Types

## Document Structure Nodes

### YAML Header - `rmd_yaml`

The `rmd_yaml` node represents YAML front matter at the beginning of documents.

**Properties:**

- `yaml`: List containing the parsed YAML content

**Example:**

Raw text that would be parsed:

```yaml
---
title: "My Document"
author: "John Doe"
date: "2023-01-01"
---
```

Programmatic creation:

```{r}
yaml_node = rmd_yaml(list(
  title = "My Document",
  author = "John Doe",
  date = "2023-01-01"
))
yaml_node
```

---

### Markdown Headings - `rmd_heading`

The `rmd_heading` node represents section headings in markdown.

**Properties:**

- `name`: Character string containing the heading text
- `level`: Integer from 1 to 6 indicating the heading level (# = 1, ## = 2, etc.)

**Example:**

Raw text that would be parsed:

```markdown
# Introduction
```

Programmatic creation:

```{r}
heading_node = rmd_heading(
  name = "Introduction",
  level = 1L
)
heading_node
```

---

### Markdown Text - `rmd_markdown`

The `rmd_markdown` node represents plain markdown text content.

**Properties:**

- `lines`: Character vector containing the markdown text lines

**Example:**

Raw text that would be parsed:

```markdown
This is a paragraph.
With multiple lines.
```

Programmatic creation:

```{r}
markdown_node = rmd_markdown(
  lines = c("This is a paragraph.", "With multiple lines.")
)
markdown_node
```

---

## Code and Execution Nodes

### Executable Code Chunks - `rmd_chunk`

The `rmd_chunk` node represents executable code chunks with options and metadata.

**Properties:**

- `engine`: The code engine (default: "r")
- `label`: Optional chunk name/label
- `options`: List of chunk options containing both traditional and YAML options
- `code`: Character vector containing the code lines
- `indent`: Indentation string
- `n_ticks`: Number of backticks used (default: 3)

**Chunk Option Formats:**

Chunks support two option formats that can be used independently or together:

1. **Traditional format**: Options specified in the chunk header after the engine and label

   ```{{r chunk-label, eval=TRUE, echo=FALSE}}
   ```

2. **YAML format**: Options specified as YAML comments within the chunk

   ```{{r chunk-label}}
   #| eval: true
   #| echo: false
   ```

**Option Conflict Resolution:**

When the same option is specified in both formats, YAML options take precedence over traditional options.
A warning is emitted when conflicts occur:

```{{r eval=TRUE}}
#| eval: false
```

In this case, `eval: false` (YAML) wins over `eval=TRUE` (traditional), and the parser emits: "YAML options override traditional options for: eval"

**Type Handling:**

- **Traditional options**: Always stored as strings (e.g., `"TRUE"`, `"5"`)
- **YAML options**: Preserve proper R types (e.g., `TRUE`, `5L`, `3.14`)

**Examples:**

**Traditional format chunk:**

````markdown
```{{r example, eval=TRUE, echo=FALSE}}
x <- 1:10
mean(x)
```
````

**YAML format chunk:**

````markdown
```{{r example}}
#| eval: true
#| echo: false
x <- 1:10
mean(x)
```
````

**Mixed format chunk (with conflict):**

````markdown
```{{r example, eval=TRUE}}
#| eval: false
#| message: false
x <- 1:10
mean(x)
```
````

In this case, `eval: false` (YAML) overrides `eval=TRUE` (traditional).

**Programmatic creation:**

```{r}
# Traditional-style options
chunk_node_traditional = rmd_chunk(
  engine = "r",
  label = "example",
  options = list(eval = "TRUE", echo = "FALSE"),
  code = c("x <- 1:10", "mean(x)")
)

# YAML-style options with proper types
chunk_node_yaml = rmd_chunk(
  engine = "r",
  label = "example",
  options = list(eval = TRUE, echo = FALSE),
  code = c("x <- 1:10", "mean(x)")
)
chunk_node_yaml
```

---

### Raw Output Chunks - `rmd_raw_chunk`

The `rmd_raw_chunk` node represents raw output chunks for specific formats.

**Properties:**

- `format`: The output format (e.g., "html", "latex")
- `code`: Character vector containing the raw content
- `indent`: Indentation string
- `n_ticks`: Number of backticks used

**Example:**

Raw text that would be parsed:

````markdown
```{=html}
<div>
  <p>Custom HTML content</p>
</div>
```
````

Programmatic creation:

```{r}
raw_chunk_node = rmd_raw_chunk(
  format = "html",
  code = c(
    "<div>",
    "  <p>Custom HTML content</p>",
    "</div>"
  )
)
raw_chunk_node
```

---

### Fenced Code Blocks - `rmd_code_block`

The `rmd_code_block` node represents non-executable fenced code blocks.

**Properties:**

- `id`: Optional HTML ID attribute
- `classes`: Character vector of CSS classes (e.g., language names like "python", "r")
- `attr`: Named character vector for key-value attributes (e.g., `c(style="color:blue")`)
- `code`: Character vector containing the code lines
- `indent`: Indentation string
- `n_ticks`: Number of backticks used

**Example:**

Raw text that would be parsed:

````markdown
```python
def hello():
    print('Hello, World!')
```
````

Programmatic creation:

```{r}
code_block_node = rmd_code_block(
  classes = c("python"),
  code = c(
    "def hello():",
    "    print('Hello, World!')"
  )
)
code_block_node
```

---

### Code Block Literals - `rmd_code_block_literal`

The `rmd_code_block_literal` node represents code blocks with literal attribute capture using the `{{...}}` syntax. This format preserves the raw attribute content exactly as written, making it ideal for displaying code chunk examples.

**Properties:**

- `attr`: Raw attribute string (exactly as written between `{{` and `}}`)
- `code`: Character vector containing the code lines
- `indent`: Indentation string
- `n_ticks`: Number of backticks used

**Example:**

Raw text that would be parsed:

```{{r, echo=TRUE, eval=FALSE}}
x <- 1:10
mean(x)
```

Programmatic creation:

```{r}
code_block_literal_node = rmd_code_block_literal(
  attr = "r, echo=TRUE, eval=FALSE",
  code = c(
    "x <- 1:10",
    "mean(x)"
  )
)
code_block_literal_node
```

**Nested Braces Support:**

The literal format can handle nested braces in attributes:

```{{r, code='function() { return(1) }'}}
```

This captures the attribute as: `"r, code='function() { return(1) }'"`

---

## Structural Elements

### Fenced Divs - `rmd_fenced_div_open` & `rmd_fenced_div_close`

Fenced divs are represented as pairs of nodes in the linear AST sequence.
The `rmd_fenced_div_open` node marks the beginning of a fenced div block, and the `rmd_fenced_div_close` node marks the end. Any content between these nodes is considered to be inside the fenced div.

**rmd_fenced_div_open Properties:**

- `id`: Optional HTML ID attribute
- `classes`: Character vector of CSS classes
- `attr`: Named character vector for key-value attributes

**rmd_fenced_div_close Properties:**

None (just a marker)

**Example:**

Raw text that would be parsed:

```markdown
::: {.warning #important}
This content is inside the fenced div.

More content here.
:::
```

This would create a sequence of nodes:

1. `rmd_fenced_div_open` with attributes
2. `rmd_markdown` with "This content is inside the fenced div."
3. `rmd_markdown` with "More content here."
4. `rmd_fenced_div_close`

Programmatic creation:

```{r}
# Create the opening node
fenced_div_open_node = rmd_fenced_div_open(
  classes = c(".warning"),
  attr = c(id = "important")
)

# Create the closing node
fenced_div_close_node = rmd_fenced_div_close()

# These would typically be combined with content nodes in an rmd_ast
ast_with_div = rmd_ast(list(
  fenced_div_open_node,
  rmd_markdown(
    lines = "This content is inside the fenced div."
  ),
  rmd_markdown(
    lines = "More content here."
  ),
  fenced_div_close_node
))
```

---

# Extracted Elements

The following classes represent elements that can be extracted from AST nodes through secondary parsing, rather than being direct nodes in the AST structure. These elements are found within markdown text and code content.

## Inline Code - `rmd_inline_code`

The `rmd_inline_code` class represents inline code expressions extracted from markdown text.

**Properties:**

- `engine`: The code engine (empty string for static code)
- `code`: The inline code content
- `braced`: Whether the code uses braced syntax
- `start`: Starting position in the source text
- `length`: Length of the inline code

**Example:**

Raw text containing inline code:

```markdown
The result is `r 2 + 2`.
```

Programmatic creation:

```{r}
# Create directly
inline_code_obj = rmd_inline_code(
  engine = "r",
  code = "2 + 2",
  braced = FALSE
)
inline_code_obj
```

---

## Shortcode Function Calls - `rmd_shortcode`

The `rmd_shortcode` class represents Quarto shortcode function calls extracted from markdown content.

**Properties:**

- `func`: The shortcode function name
- `args`: Character vector of arguments
- `start`: Starting position in the source text
- `length`: Length of the shortcode

**Example:**

Raw text containing a shortcode:

```markdown
{{< embed type=video src=example.mp4 >}}
```

Programmatic creation:

```{r}
# Create directly
shortcode_obj = rmd_shortcode(
  func = "embed",
  args = c("type=video", "src=example.mp4")
)
shortcode_obj
```

---

## Spans - `rmd_span`

The `rmd_span` class represents inline span elements with attributes extracted from markdown text.

**Properties:**

- `text`: The text content of the span
- `id`: Optional HTML ID (must start with '#' if present)
- `classes`: Character vector of CSS classes (must start with '.' if present)
- `attr`: Named character vector of additional attributes

**Example:**

Raw text containing a span:

```markdown
[Important text]{.highlight #key}
```

Programmatic creation:

```{r}
# Create directly
span_obj = rmd_span(
  text = "Important text",
  id = c("#key"),
  classes = c(".highlight")
)
span_obj
```

---

## Extraction Functions

These utility functions extract the above elements from AST nodes:

- `rmd_extract_inline_code()`: Extract inline code from text
- `rmd_extract_shortcodes()`: Extract shortcodes from text
- `rmd_extract_spans()`: Extract spans from text
- `rmd_has_inline_code()`: Check if text contains inline code
- `rmd_has_shortcode()`: Check if text contains shortcodes
- `rmd_has_span()`: Check if text contains spans

---
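
To tie the node types together, the following is a minimal sketch that uses only the constructors and `@` property access shown in this vignette. The node contents and the variable names (`doc`, `heading`, `chunk`) are illustrative, and the sketch assumes that `nodes` keeps the nodes in the order they were supplied, as described for `rmd_ast`.

```{r}
library(parsermd)

# Build a small AST from constructors introduced above (contents are illustrative)
doc = rmd_ast(list(
  rmd_yaml(list(title = "Example Document")),
  rmd_heading(name = "Results", level = 2L),
  rmd_markdown(lines = "Some text."),
  rmd_chunk(engine = "r", label = "calc", code = c("x <- 1:5", "mean(x)"))
))

# Nodes are stored as a flat list, assumed to be in document order
length(doc@nodes)

# Pull out individual nodes by position and read their properties with @
heading = doc@nodes[[2]]
heading@name
heading@level

chunk = doc@nodes[[4]]
chunk@label
chunk@code
```

Because the AST is a flat sequence, positional indexing into `nodes` is enough to reach any element; no tree traversal is needed.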