The dataset package extends R’s native data structures with
machine-readable metadata. It follows a semantic early-binding
approach, which means metadata is embedded as soon as the data is
created, making datasets suitable for long-term reuse, FAIR-compliant
publishing, and integration into semantic web systems.
defined works naturally with data structured according to tidy data
principles (Wickham, 2014), where each variable is a column, each
observation is a row, and each type of observational unit forms a
table. It adds a semantic layer to individual vectors so their meaning
is explicit, consistent, and machine-readable.
This vignette focuses specifically on the defined() function, which you
can use to create a semantically enriched vector. For details on
semantically enriched data frames, see
vignette("dataset_df", package = "dataset").
The defined() function helps you create semantically rich labelled
vectors.
By attaching metadata at creation time, defined() prevents the loss of
context and meaning that often occurs when data is exchanged or
archived. This approach supports the FAIR data principles (Findable,
Accessible, Interoperable, Reusable) and facilitates integration into
semantic web systems.
We’ll start by wrapping a numeric GDP vector using defined().
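library(dataset)
# gdp is assumed to be the example data frame available with the package,
# with the numeric values stored in its gdp column
gdp_1 <- defined(
  gdp$gdp,
  label = "Gross Domestic Product",
  unit = "CP_MEUR",
  concept = "http://data.europa.eu/83i/aa/GDP"
)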
The defined class builds on labelled vectors by adding rich metadata: a
variable label, a unit of measurement (for example, CP_MEUR), a concept
definition, and an optional namespace.
This is particularly useful for reproducible research,
standards-compliant data, or long-term interoperability. The metadata is
stored as ordinary R attributes, which you can inspect with
attributes(). This keeps the class widely compatible: a defined vector
can be used even in base R.
attributes(gdp_1)
#> $label
#> [1] "Gross Domestic Product"
#>
#> $class
#> [1] "haven_labelled_defined" "haven_labelled" "vctrs_vctr"
#> [4] "double"
#>
#> $unit
#> [1] "CP_MEUR"
#>
#> $concept
#> [1] "http://data.europa.eu/83i/aa/GDP"
This output shows that the actual S3 class is called
haven_labelled_defined, which indicates inheritance from haven_labelled
(see labelled::labelled). In the dataset summary headers, the
abbreviation <defined> is used.
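Because the metadata lives in ordinary attributes, it can be read without any package-specific tooling. The following is a minimal base-R sketch, not the package's implementation: plain_gdp is a hypothetical stand-in that only mimics the attribute names shown in the output above.

```r
# Hypothetical stand-in: a plain numeric vector carrying the same attribute
# names as the defined vector above (no package and no class involved).
plain_gdp <- structure(
  c(2354.8, 2593.9),
  label = "Gross Domestic Product",
  unit = "CP_MEUR",
  concept = "http://data.europa.eu/83i/aa/GDP"
)

# Any R session can read the metadata with attr().
unit_value <- attr(plain_gdp, "unit")
print(unit_value)

# Base subsetting drops custom attributes from a classless vector, so
# keeping them across operations requires class-aware methods.
subset_attributes <- attributes(plain_gdp[1])
print(is.null(subset_attributes))
```

The last check illustrates why a dedicated class helps: without one, a plain subset would lose the label, unit, and concept.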
Use the var_label(), var_unit(), and var_concept() helper functions to
set or retrieve metadata individually.
cat("Get the label only: ", var_label(gdp_1), "\n")
#> Get the label only: Gross Domestic Product
cat("Get the unit only: ", var_unit(gdp_1), "\n")
#> Get the unit only: CP_MEUR
cat("Get the concept definition only: ", var_concept(gdp_1), "\n")
#> Get the concept definition only: http://data.europa.eu/83i/aa/GDP
cat("All attributes:\n")
#> All attributes:
The most frequently used vector methods, such as print() and summary(), are implemented as expected.
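As a brief sketch, assuming the dataset package is installed, both methods can be called on a small defined vector. The values are the first three entries of gdp_1 as printed elsewhere in this vignette. The name gdp_small is only illustrative, and no output is reproduced here because it may differ between package versions.

```r
library(dataset)

# Illustrative vector built from three of the GDP values used in this vignette
gdp_small <- defined(
  c(2354.8, 2593.9, 2883.7),
  label = "Gross Domestic Product",
  unit = "CP_MEUR",
  concept = "http://data.europa.eu/83i/aa/GDP"
)

# Both calls dispatch to the methods provided for defined vectors
print(gdp_small)
summary(gdp_small)
```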
If you try to concatenate a semantically under-specified new vector to
an existing defined vector, you will get an intentional error indicating
that some attributes are not compatible. This prevents combining values
that differ in meaning, such as GDP figures expressed in different
currencies.
In the following example, gdp_1 and gdp_2 are not defined with the same
level of precision.
Error in vec_c():
! Can't combine ..1 <haven_labelled_defined> and ..2 <haven_labelled_defined>.
✖ Some attributes are incompatible.
To resolve this, add the missing attributes so that the vectors are semantically compatible, for example by defining the GDP of the Faroe Islands more precisely.
Once the metadata matches, you can combine the vectors.
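The whole sequence can be sketched as follows, assuming the dataset package is installed. The numeric values are taken from the printed output elsewhere in this vignette. How gdp_2 was originally under-specified is not shown here, so the sketch assumes it carried only a label. The fix re-creates gdp_2 with defined() rather than relying on setter functions.

```r
library(dataset)

gdp_1 <- defined(
  c(2354.8, 2593.9, 2883.7, 3119.5, 5430.5,
    6423.7, 6758.6, 1265.1, 1461.4, 1612.3),
  label = "Gross Domestic Product",
  unit = "CP_MEUR",
  concept = "http://data.europa.eu/83i/aa/GDP"
)

# Assumption: the under-specified vector has a label but no unit or concept
gdp_2 <- defined(
  c(2523.6, 2725.8, 3013.2),
  label = "Gross Domestic Product"
)

# Capture the condition instead of stopping the script
combine_attempt <- tryCatch(c(gdp_1, gdp_2), error = function(e) e)
print(inherits(combine_attempt, "error"))

# Re-create gdp_2 with metadata that matches gdp_1
gdp_2 <- defined(
  c(2523.6, 2725.8, 3013.2),
  label = "Gross Domestic Product",
  unit = "CP_MEUR",
  concept = "http://data.europa.eu/83i/aa/GDP"
)

# With matching metadata the vectors can be concatenated
gdp_all <- c(gdp_1, gdp_2)
print(length(gdp_all))
```

Capturing the error with tryCatch() keeps the script running, which is convenient when the compatibility check is part of a larger pipeline.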
You can also define variables that store codes (like country codes) with a namespace that points to a human- and machine-readable definition of those codes. In statistical datasets, such attribute columns describe characteristics of the observations or the measured variables.
country <- defined(
  c("AD", "LI", "SM"),
  label = "Country name",
  concept = "http://purl.org/linked-data/sdmx/2009/dimension#refArea",
  namespace = "https://www.geonames.org/countries/$1/"
)
For example, the namespace definition above points to:
You can get or set the namespace of a defined vector with var_namespace().
A URI such as http://publications.europa.eu/resource/authority/bna/c_6c2bb82d resolves to a machine-readable definition of geographical names.
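To make the namespace template used for the country codes concrete, the sketch below expands it in base R. It assumes that $1 is a placeholder for the code value, which the template suggests but this vignette does not state. expand_namespace() is a hypothetical helper, not a function of the dataset package.

```r
# Hypothetical helper: substitute each code for the literal "$1" placeholder
expand_namespace <- function(codes, namespace) {
  vapply(
    codes,
    function(code) sub("$1", code, namespace, fixed = TRUE),
    character(1),
    USE.NAMES = FALSE
  )
}

country_urls <- expand_namespace(
  c("AD", "LI", "SM"),
  "https://www.geonames.org/countries/$1/"
)
print(country_urls)
```

Using fixed = TRUE treats "$1" as a literal string, so the dollar sign is not interpreted as a regular-expression anchor.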
The use of several defined vectors in a dataset_df object is explained in a separate vignette.
You can create defined vectors from character values as well as numeric
values. Methods like as_character() and as_numeric() let you coerce back
to base R types while controlling what happens to the metadata.
countries <- defined(
  c("AD", "LI"),
  label = "Country code",
  namespace = "https://www.geonames.org/countries/$1/"
)
countries
#> x: Country code
#> Defined vector
#> [1] "AD" "LI"
as_character(countries)
#> [1] "AD" "LI"
Subsetting a defined vector works like subsetting any other vector.
gdp_1[1:2]
#> x: Gross Domestic Product
#> Defined as http://data.europa.eu/83i/aa/GDP, measured in CP_MEUR
#> [1] 2354.8 2593.9
gdp_1[gdp_1 > 5000]
#> x: Gross Domestic Product
#> Defined as http://data.europa.eu/83i/aa/GDP, measured in CP_MEUR
#> [1] 5430.5 6423.7 6758.6
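The printed headers above indicate that the metadata survives subsetting. As a hedged sketch, assuming the dataset package is installed, the same can be checked programmatically with the helper functions introduced earlier. The names gdp_small and gdp_subset are illustrative.

```r
library(dataset)

# Illustrative vector built from three of the GDP values used in this vignette
gdp_small <- defined(
  c(2354.8, 2593.9, 2883.7),
  label = "Gross Domestic Product",
  unit = "CP_MEUR",
  concept = "http://data.europa.eu/83i/aa/GDP"
)

# Logical subsetting, as in the example above
gdp_subset <- gdp_small[gdp_small > 2500]

# The helpers should report the same unit and concept as the full vector
subset_unit <- as.character(var_unit(gdp_subset))
subset_concept <- as.character(var_concept(gdp_subset))
print(subset_unit)
print(subset_concept)
```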
as.vector() removes the metadata entirely.
as.list() retains the metadata for each element (definitions are
repeated for each entry).
as.vector(gdp_1)
#> [1] 2354.8 2593.9 2883.7 3119.5 5430.5 6423.7 6758.6 1265.1 1461.4 1612.3
as.list(gdp_1)
#> [[1]]
#> x: Gross Domestic Product
#> Defined as http://data.europa.eu/83i/aa/GDP, measured in CP_MEUR
#> [1] 2354.8
#>
#> [[2]]
#> x: Gross Domestic Product
#> Defined as http://data.europa.eu/83i/aa/GDP, measured in CP_MEUR
#> [1] 2593.9
#>
#> [[3]]
#> x: Gross Domestic Product
#> Defined as http://data.europa.eu/83i/aa/GDP, measured in CP_MEUR
#> [1] 2883.7
#>
#> [[4]]
#> x: Gross Domestic Product
#> Defined as http://data.europa.eu/83i/aa/GDP, measured in CP_MEUR
#> [1] 3119.5
#>
#> [[5]]
#> x: Gross Domestic Product
#> Defined as http://data.europa.eu/83i/aa/GDP, measured in CP_MEUR
#> [1] 5430.5
#>
#> [[6]]
#> x: Gross Domestic Product
#> Defined as http://data.europa.eu/83i/aa/GDP, measured in CP_MEUR
#> [1] 6423.7
#>
#> [[7]]
#> x: Gross Domestic Product
#> Defined as http://data.europa.eu/83i/aa/GDP, measured in CP_MEUR
#> [1] 6758.6
#>
#> [[8]]
#> x: Gross Domestic Product
#> Defined as http://data.europa.eu/83i/aa/GDP, measured in CP_MEUR
#> [1] 1265.1
#>
#> [[9]]
#> x: Gross Domestic Product
#> Defined as http://data.europa.eu/83i/aa/GDP, measured in CP_MEUR
#> [1] 1461.4
#>
#> [[10]]
#> x: Gross Domestic Product
#> Defined as http://data.europa.eu/83i/aa/GDP, measured in CP_MEUR
#> [1] 1612.3
Use as_character() to convert to a character vector.
as_character(country)
#> [1] "AD" "LI" "SM"
as_character(c(gdp_1, gdp_2))
#> [1] "2354.8" "2593.9" "2883.7" "3119.5" "5430.5" "6423.7" "6758.6" "1265.1"
#> [9] "1461.4" "1612.3" "2523.6" "2725.8" "3013.2"
Use as_factor() to convert a categorical variable to a factor.
Use as_numeric() to convert to a numeric vector.
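A minimal sketch of the numeric conversion, assuming the dataset package is installed. Only the default call is shown, because the arguments that control how metadata is handled are not described in this vignette. The names gdp_small and gdp_numeric are illustrative.

```r
library(dataset)

# Illustrative vector built from three of the GDP values used in this vignette
gdp_small <- defined(
  c(2354.8, 2593.9, 2883.7),
  label = "Gross Domestic Product",
  unit = "CP_MEUR",
  concept = "http://data.europa.eu/83i/aa/GDP"
)

# Coerce to numeric with the default settings
gdp_numeric <- as_numeric(gdp_small)

# The result can be passed to ordinary numeric functions
print(is.numeric(gdp_numeric))
print(mean(gdp_numeric))
```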
The defined() function provides a lightweight yet powerful way to make
vectors self-descriptive by attaching semantic metadata directly to
them. By combining a variable label, unit of measurement, concept
definition, and optional namespace, defined() ensures that each vector’s
meaning is explicit, consistent, and machine-readable.
Because the metadata is embedded at creation time, it travels with
the vector throughout your workflow, whether you are analysing,
transforming, or exporting data.
This prevents context loss, supports the FAIR data principles (Findable,
Accessible, Interoperable, Reusable), and facilitates integration with
semantic web technologies.
defined vectors work seamlessly with the dataset_df class to create
semantically enriched data frames where both datasets and their
constituent variables carry rich, standardised metadata.
For more on creating semantically enriched datasets, see the
dataset_df vignette.
For guidance on recording bibliographic metadata and citations, see the bibrecord vignette.