---
title: "Getting started with scrutr"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting started with scrutr}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

`scrutr` helps you inspect, profile, and convert **collections of structured datasets**. This vignette walks through the main workflows.

## Setup

```{r setup}
library(scrutr)
```

## Inspecting a single data frame

`inspect()` produces a one-row-per-variable summary: class, distinct count, missing values, void strings, character lengths, and sample modalities.

```{r}
result <- inspect(CO2)
result
```

Use `nrow = TRUE` to also print the number of observations:

```{r}
inspect(CO2, nrow = TRUE)
```

## Comparing variables across datasets

When working with several related tables, you often need to know which variables appear where and whether their types are consistent.

```{r}
data_list <- list(
  cars   = cars,
  mtcars = mtcars[, c("mpg", "hp", "wt")],
  iris   = iris
)

# Which variables are in which datasets?
vars_detect(data_list)
```

`vars_compclasses()` goes further and compares the class of each shared variable:

```{r}
# Use two datasets that share some columns
shared_list <- list(
  df1 = data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE),
  df2 = data.frame(x = c(1.1, 2.2, 3.3), y = c("d", "e", "f"), stringsAsFactors = FALSE)
)
vars_compclasses(shared_list)
```

## Inspecting a whole folder of datasets

`inspect_vars()` is the main collection-level function. Point it at a folder and it reads all matching files, inspects each one, then writes a comprehensive Excel report.
```{r, eval = FALSE}
# Create a temporary folder with example datasets
mydir <- file.path(tempdir(), "scrutr_demo")
dir.create(mydir, showWarnings = FALSE)
saveRDS(cars, file.path(mydir, "cars.rds"))
saveRDS(mtcars, file.path(mydir, "mtcars.rds"))
saveRDS(iris, file.path(mydir, "iris.rds"))

# Run the full inspection pipeline
inspect_vars(
  input_path = mydir,
  output_path = mydir,
  output_label = "demo",
  considered_extensions = "rds"
)

# The output Excel file contains multiple sheets:
# dims, inspect_tot, one sheet per dataset, vars_detect, vars_compclasses, etc.
list.files(mydir, pattern = "\\.xlsx$")
```

## Batch format conversion

### Simple folder conversion with `convert_all()`

Convert all matching files in a folder to another format:

```{r, eval = FALSE}
convert_all(
  input_folderpath = mydir,
  considered_extensions = "rds",
  to = "csv",
  output_folderpath = file.path(mydir, "csv")
)
```

### Mask-driven conversion with `convert_r()`

For more control, use an Excel mask that specifies exactly which files to convert and how to name the outputs:

```{r, eval = FALSE}
# 1. Generate a mask template
mask_convert_r(output_path = mydir)

# 2. Edit the mask in Excel, then:
convert_r(
  mask_filepath = file.path(mydir, "mask_convert_r.xlsx"),
  output_path = mydir
)
```

## Data hygiene helpers

`scrutr` includes several utilities for common data quality checks:

```{r}
# Find duplicates in a data frame
df <- data.frame(id = c(1, 2, 2, 3, 3, 3), value = letters[1:6])
dupl_show(df, "id")
```

```{r}
# Check a left join for key issues
left_df  <- data.frame(key = c("a", "b", "c"))
right_df <- data.frame(key = c("a", "b", "b", "d"), val = 1:4)
ljoin_checks(left_df, right_df, "key")
```

## Path utilities

```{r}
paths <- c("data/raw/2024/file1.csv", "data/raw/2024/file2.csv")

# Keep only the first 2 levels
path_move(paths, "/", 2)

# Remove the last level (filename)
path_move(paths, "/", -1)
```
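
For intuition about what the positive-count case is assumed to do (keep the first `n` levels of each path), here is an equivalent base-R sketch. The `keep_levels()` helper below is purely illustrative and not part of `scrutr`:

```{r}
# Illustrative base-R sketch (keep_levels() is NOT a scrutr function):
# split each path on sep and keep only its first n components.
keep_levels <- function(paths, sep, n) {
  vapply(
    strsplit(paths, sep, fixed = TRUE),
    function(parts) paste(parts[seq_len(min(n, length(parts)))], collapse = sep),
    character(1)
  )
}

keep_levels(c("data/raw/2024/file1.csv", "data/raw/2024/file2.csv"), "/", 2)
#> [1] "data/raw" "data/raw"
```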