---
title: "Using mdbplyr"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Using mdbplyr}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

# Introduction

This repository contains `mdbplyr`, an R package that provides a disciplined,
lazy `dplyr`-style interface for MongoDB aggregation pipelines.

`mdbplyr` sits between raw `mongolite` usage and broader compatibility layers.
Compared with writing aggregation JSON by hand, it lets you express supported
queries with tidy verbs and inspect the generated pipeline before execution.
Compared with approaches that try to hide MongoDB behind table-like semantics,
it stays explicit about scope, keeps translation conservative, and fails clearly
when a verb or expression is outside the supported subset.

The practical advantages are:

- lazy query construction until `collect()`,
- cursor-based streaming with `cursor()`,
- direct translation to MongoDB aggregation stages,
- `show_query()` for inspectable pipeline output,
- explicit unsupported-feature errors instead of silent fallback,
- a smaller, testable semantic surface aligned with MongoDB rather than SQL.

# Usage examples

## Loading data

Suppose there is a running MongoDB instance on `localhost` with default port
and no authentication. The code below loads `dplyr::starwars` into a collection
named `starwars`.

```{r load-starwars, eval = FALSE}
library(dplyr)
library(mongolite)
library(mdbplyr)

starwars_collection <- mongolite::mongo(
  collection = "starwars",
  db = "mdbplyr"
)

starwars_collection$drop()
starwars_collection$insert(dplyr::starwars)

starwars_tbl <- tbl_mongo(
  starwars_collection,
  schema = names(dplyr::starwars)
)
```

Once the collection is loaded, `starwars_tbl` is the lazy table used in the
examples below.

```{r create-lazy-table, eval = FALSE}
library(dplyr)
library(mdbplyr)

starwars_collection <- mongolite::mongo(
  collection = "starwars",
  db = "mdbplyr"
)

starwars_tbl <- tbl_mongo(starwars_collection) %>% 
  infer_schema()
```

## Schema

`mdbplyr` uses a schema to know which fields are available in a collection.
This matters especially for:

- dotted paths in nested documents such as `` `message.timestamp` `` or `` `message.measurements.Fx` ``;
- schema-first expression translation in verbs such as `filter()`, `mutate()`, and `summarise()`;
- explicit operations on nested data such as `select()`, `flatten_fields()`, and `unwind_array()`.

The most reliable approach is to pass `schema = ...` explicitly when creating
the lazy table:

```{r schema-explicit, eval = FALSE}
starwars_tbl <- tbl_mongo(
  starwars_collection,
  schema = c("name", "species", "height", "mass", "homeworld")
)
```

When writing the schema by hand is inconvenient, `infer_schema()` can populate
it from the first document in the collection:

```{r schema-infer, eval = FALSE}
starwars_tbl <- tbl_mongo(starwars_collection) |>
  infer_schema()
```

This is convenient for exploratory work, but it has an important limitation:
it only sees one document. If the collection is heterogeneous, fields that do
not appear in the first document may still need to be added manually.

You can inspect the currently known fields with:

```{r schema-fields, eval = FALSE}
schema_fields(starwars_tbl)
```

## Basic inspection

Inspect the known schema and the generated pipeline without executing the
query.

```{r inspect, eval = FALSE}
schema_fields(starwars_tbl)

starwars_tbl |>
  filter(species == "Human", height > 180) |>
  select(name, height, mass) |>
  show_query()
```

## Streaming with `cursor()`

When you want a MongoDB iterator instead of an eagerly materialized tibble,
open the lazy query with `cursor()`.

```{r inspect-cursor, eval = FALSE}
iter <- starwars_tbl |>
  filter(species == "Human", height > 180) |>
  select(name, height, mass) |>
  cursor()

iter$page(10)
```

# Supported verbs

Each subsection below shows one of the supported `dplyr`-like verbs on the
`starwars` collection.

## `filter()`

```{r verb-filter, eval = FALSE}
starwars_tbl |>
  filter(species == "Droid", height > 100) |>
  collect()
```

## `select()`

```{r verb-select, eval = FALSE}
starwars_tbl |>
  select(name, species, homeworld) |>
  collect()
```

Selecting dotted paths preserves nested MongoDB structure by default. It does
not flatten nested fields unless you explicitly ask for that:

```{r verb-select-nested, eval = FALSE}
sensor_tbl |>
  select(`message.timestamp`, `message.measurements`) |>
  collect()
```

## `rename()`

```{r verb-rename, eval = FALSE}
starwars_tbl |>
  rename(character_name = name, planet = homeworld) |>
  collect()
```

## `mutate()`

```{r verb-mutate, eval = FALSE}
starwars_tbl |>
  mutate(height_m = height / 100, bmi_like = mass / (height_m * height_m)) |>
  select(name, height, mass, height_m, bmi_like) |>
  collect()
```

## `transmute()`

```{r verb-transmute, eval = FALSE}
starwars_tbl |>
  transmute(name = name, height_m = height / 100) |>
  collect()
```

## `arrange()`

```{r verb-arrange, eval = FALSE}
starwars_tbl |>
  arrange(desc(height), name) |>
  select(name, height) |>
  slice_head(n = 10) |>
  collect()
```

## `group_by()`

```{r verb-group-by, eval = FALSE}
starwars_tbl |>
  group_by(species)
```

## `summarise()`

```{r verb-summarise, eval = FALSE}
starwars_tbl |>
  group_by(species) |>
  summarise(
    n = n(),
    avg_height = mean(height),
    max_mass = max(mass)
  ) |>
  arrange(desc(n)) |>
  collect()
```

## `slice_head()`

```{r verb-slice-head, eval = FALSE}
starwars_tbl |>
  select(name, species) |>
  slice_head(n = 5) |>
  collect()
```

## `slice_tail()`

```{r verb-slice-tail, eval = FALSE}
starwars_tbl |>
  select(name, species) |>
  slice_tail(n = 5) |>
  collect()
```

## `head()`

```{r verb-head, eval = FALSE}
head(starwars_tbl, 5) |>
  collect()
```

## `flatten_fields()`

Use `flatten_fields()` when you explicitly want nested object leaves to become
flat tibble columns. By default the output names are the schema dot paths.

```{r verb-flatten-fields, eval = FALSE}
sensor_tbl |>
  select(`message.timestamp`, `message.measurements`) |>
  flatten_fields() |>
  collect()
```

You can also target a specific nested root and optionally rename the flattened
output columns:

```{r verb-flatten-fields-names, eval = FALSE}
sensor_tbl |>
  flatten_fields(
    `message.measurements`,
    names_fn = function(x) gsub(".", "_", x, fixed = TRUE)
  ) |>
  collect()
```

## `unwind_array()`

Use `unwind_array()` when a document field contains an array and you want one
output row per array element.

```{r verb-unwind-array, eval = FALSE}
orders_tbl |>
  unwind_array(items) |>
  collect()
```

If array elements are nested objects, `unwind_array()` and `flatten_fields()`
can be chained:

```{r verb-unwind-array-flatten, eval = FALSE}
orders_tbl |>
  unwind_array(items) |>
  flatten_fields(items) |>
  collect()
```

# Notes

The examples above stay within the currently supported subset:

- explicit field references,
- schema-first expression resolution, with `.data$...` for explicit field references and `.env$...` for local values,
- scalar expressions for `mutate()` and `transmute()`, plus the special `1:n()` row-numbering case,
- conservative grouped summaries,
- explicit flattening of nested object fields with `flatten_fields()`,
- array row expansion with `unwind_array()`,
- no joins, window functions, or automatic client-side fallback.

If a query falls outside that subset, `mdbplyr` is designed to fail
explicitly rather than guess or silently change execution semantics.