---
title: "An Introduction to the dataset Package"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{An Introduction to the dataset Package}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
if (!requireNamespace("rdflib", quietly = TRUE)) {
  stop("Please install 'rdflib' to run this vignette.")
}
```

## Overview

The `dataset` package enriches R’s native data structures with machine-readable metadata. It allows variables and datasets to carry semantic definitions, such as URIs, labels, units, and provenance, which makes them suitable for long-term reuse, FAIR-compliant publishing, and integration into semantic web systems.

Unlike most metadata packages, which attach metadata after the fact, `dataset` follows a **semantic early-binding** approach: metadata is embedded as soon as the data is created.

This vignette provides a high-level introduction. For details on key components, see the following:

- `vignette("defined", package = "dataset")`: Semantic vectors with `defined()`
- `vignette("dataset_df", package = "dataset")`: Structuring and metadata with `dataset_df()`
- `vignette("rdf", package = "dataset")`: Exporting to RDF and Linked Data
- `vignette("bibrecord", package = "dataset")`: Creating rich citation metadata using `bibrecord()`

## Why extend tidy data?

Hadley Wickham (2014) defines tidy data with three principles:

- Each variable forms a column
- Each observation forms a row
- Each type of observational unit forms a table

This structure is ideal for analysis but lacks **semantic clarity**. The gap shows in realistic, less-than-ideal scenarios where an analyst combines several datasets received from different internet services.

For example, two datasets might both contain a column named `gdp`, but one might be in euros and the other in dollars. Without metadata, tools cannot detect this mismatch.
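
To make the mismatch concrete, here is a minimal base R sketch. The figures and the names `gdp_eur`, `gdp_usd`, and `combined_gdp` are invented for illustration and do not come from any real source.

```{r}
# Hypothetical figures for illustration only: the same indicator reported
# by two sources, one in million euros and one in million US dollars.
gdp_eur <- c(2355, 2592, 2884)
gdp_usd <- c(2550, 2810, 3120)

# Plain numeric vectors carry no unit, so base R combines them silently.
combined_gdp <- c(gdp_eur, gdp_usd)

# Neither the inputs nor the result record a unit or a definition.
attributes(gdp_eur)
attributes(combined_gdp)
```

Both calls return `NULL`: nothing in either object says which currency the numbers are in, so the mixed vector looks perfectly valid.
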

The `dataset` package addresses this by letting you define variables explicitly and store dataset-level metadata within a tidy tibble.

## Example: defining semantically rich vectors

Semantically rich vectors are data frame columns that carry more than a column name: a long-form, human-readable label; a machine- and human-readable variable definition; and, if needed, a link to an external resource that contains the codebook.

```{r}
library(dataset)

gdp <- defined(
  c(2355, 2592, 2884),
  label = "Gross Domestic Product",
  unit = "CP_MEUR",
  concept = "http://data.europa.eu/83i/aa/GDP"
)

geo <- defined(
  rep("AD", 3),
  label = "Geopolitical Entity",
  concept = "http://purl.org/linked-data/sdmx/2009/dimension#refArea",
  namespace = "https://www.geonames.org/countries/$1/"
)

gdp
geo
```

In this case, we define `geo` as the geopolitical entity identified by the concept URI <http://purl.org/linked-data/sdmx/2009/dimension#refArea>, and the namespace template lets the `AD` value resolve to Andorra: <https://www.geonames.org/countries/AD/>.

These vectors now carry metadata that you can inspect directly, including their label, unit, and concept URI, and that is preserved through transformation and storage.

## Example: creating a dataset from a metadata-enriched data frame

```{r smalldatasetexample}
small_dataset <- dataset_df(
  geo = geo,
  gdp = gdp,
  identifier = c(gdp = "http://example.com/dataset#gdp"),
  dataset_bibentry = dublincore(
    title = "Small GDP Dataset",
    creator = person("Jane", "Doe", role = "aut"),
    publisher = "Small Repository",
    subject = "Gross Domestic Product"
  )
)

small_dataset
```

This dataset not only stores the variables and values, but also includes embedded metadata that supports precise interpretation and repository-level publication.
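
The general R mechanism that lets embedded metadata travel with the data can be sketched in base R. This is only an illustration with plain attributes: the attribute names `title` and `unit` and the objects `plain_df`, `rds_path`, and `restored_df` are hypothetical, and the sketch does not describe how `dataset_df()` stores its metadata internally.

```{r}
# Base R illustration with hypothetical attribute names; this is not the
# internal representation used by dataset_df().
plain_df <- data.frame(geo = rep("AD", 3), gdp = c(2355, 2592, 2884))
attr(plain_df, "title") <- "Small GDP Dataset"
attr(plain_df$gdp, "unit") <- "CP_MEUR"

# Serialise to a temporary .rds file and read it back.
rds_path <- tempfile(fileext = ".rds")
saveRDS(plain_df, rds_path)
restored_df <- readRDS(rds_path)

# Attributes on the data frame and on its columns survive the round trip.
attr(restored_df, "title")
attr(restored_df$gdp, "unit")
```

Because R serialises attributes together with the object, metadata attached this way does not need a separate sidecar file.
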

The bibliographic metadata can be retrieved as a Dublin Core record:

```{r}
as_dublincore(small_dataset)
```

## Exporting to RDF

As Carl Boettiger has shown in the vignettes accompanying [rdflib](https://CRAN.R-project.org/package=rdflib), an R interface to the Redland RDF libraries (see [A tidyverse lover's intro to RDF](https://docs.ropensci.org/rdflib/articles/rdf_intro.html)), tidy datasets can be retrofitted with rich metadata if they are pivoted to a strictly three-column long format. Our package tries to lower the burden of such retrofitting. For those who are not familiar with RDF, it relies on early binding and sensible defaults to serialise the dataset's contents and its bibliographic data to this format.

You can convert any `dataset_df` object into a tidy 3-column representation (subject–predicate–object) using `dataset_to_triples()`:

```{r triplesexample}
triples <- dataset_to_triples(small_dataset, format = "nt")

triples
```

This 3-column format can be used with semantic web technologies such as SPARQL queries, `rdflib`, and triple stores.

`describe()` writes statements about the dataset to a file, here a temporary `.nt` file:

```{r ntexample}
mycon <- tempfile("my_dataset", fileext = ".nt")
my_description <- describe(x = small_dataset, con = mycon)

# Only three statements are shown:
readLines(mycon)[c(4, 8, 12)]
```

```{r provenancexample}
## Show two lines of provenance:
provenance(small_dataset)[c(6, 7)]
```

## Summary

The `dataset` package enriches tidy data by attaching metadata from the start of the workflow. It helps avoid semantic mismatches, supports RDF publication, and aligns with metadata standards such as SDMX, DataCite, and Dublin Core.

Use it when you need:

- Meaningful variable descriptions and URIs
- Dataset-level metadata embedded directly in `.rds` or `.rda` files
- Easy export to RDF and semantic web formats

For deeper examples, see:

- `vignette("defined", package = "dataset")`: Working with semantic vectors
- `vignette("dataset_df", package = "dataset")`: Dataset-level metadata and structure
- `vignette("rdf", package = "dataset")`: Linked Data and export
- `vignette("bibrecord", package = "dataset")`: Creating rich citation metadata using `bibrecord()`