--- title: "Uncertainty estimation" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Uncertainty estimation} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` This vignette shows how to estimate the uncertainty of the efflux and production estimates using `bootstrap_error()`. We will first generate a dataset of 'measurement uncertainty' in the input parameters and then use `bootstrap_error()` to estimate the resulting uncertainty in the models. ```{r setup} library(ConFluxPro) ``` # Uncertainty of input variables ## General workflow and uncertainty from `gasdata` The most relevant source of uncertainty in FGM-based models is that of the input parameters. Especially parameters related to the estimation of the calculation of the diffusion coefficient `DS` are hard to measure and soil heterogeneity introduces more uncertainty. We'll use the example dataset to demonstrate this effect. ```{r load base_dat} data("base_dat", package = "ConFluxPro") mod_pf <- pro_flux(base_dat) ``` In this dataset, the molar fraction of CO~2~ is measured in three replicates at each depth within the soil. In a normal model, we use all replicates to fit a single profile. By randomly sampling from these replicates, we can estimate the uncertainty the concentration profiles measurement introduces into the model. This can be done in `bootstrap_error()` ```{r bootstrap-gasdata} set.seed(42) # to get exactly same result every time mod_pf_bs_gasdata <- bootstrap_error(mod_pf, n_samples = 25, sample_from = "gasdata") ``` With `n_samples = 25` we created 25 new models. In each of these, a random selection of all measured `x_ppm` values was created per profile. This is done by resampling the same number of observations at each depth while allowing replacing (meaning that a single measurement may be sampled multiple times!). If we extract the efflux, we now get a second column `DELTA_efflux` that gives the uncertainty. Notice, that the value of `efflux` has changed slightly from the original model. This is because it is now estimated as the mean value of all 25 models, while the standard deviation is the estimate of `DELTA_efflux`. ```{r} ## after bootstrapping mod_pf_bs_gasdata %>% filter(prof_id == 1) %>% efflux() ## originial model mod_pf %>% filter(prof_id == 1) %>% efflux() ``` Similarly, we can extract the `production()`, where respective columns have also been added. ```{r} mod_pf_bs_gasdata %>% filter(prof_id == 1) %>% production() ``` ## Uncertainty from `soilphys` However, not only the gas concentration data carries uncertainty. Indeed, many parameters contained within the `soilphys` dataset are subject to significant uncertainty due to sampling error and soil heterogeneity. In the present dataset this is not represented at all. So, let's pretend that we measured `TPS` in three replicates identified by `replicate_id` that have some spread around the mean. We will do this by randomly sampling from a normal distribution with a standard deviation of 0.01. ```{r} library(dplyr) soilphys <- cfp_soilphys(base_dat) set.seed(42) soilphys_replicate_TPS <- soilphys %>% # get base TPS info select(site, upper, lower, TPS) %>% distinct() %>% rename(TPS_mean = TPS) %>% # repeat each row 3 times cross_join(data.frame(replicate_id = 1:3)) %>% # generate new, random TPS values mutate(TPS = rnorm(n(), TPS_mean, 0.01)) %>% # join with rest of dataset left_join(soilphys %>% select(!TPS), by = c("site", "upper", "lower"), relationship = "many-to-many") %>% # recaluclate DS and c_air complete_soilphys(DSD0_formula = "a*AFPS^b", quiet = TRUE) %>% cfp_soilphys(id_cols = c("site", "Date", "replicate_id")) ``` Now we can create a new dataset that includes these new values of TPS and create a `cfp_pfmod` model. Of course you could use real measurements instead and include the variability of other parameters (`SWC`, `t`, ...) as well. We don't call `pro_flux()`, because we don't want to calulcate the result from each replication individual. ```{r} replicate_dat <- cfp_dat(cfp_gasdata(base_dat), soilphys_replicate_TPS, cfp_layers_map(base_dat)) mod_pf_replicate <- cfp_pfmod(replicate_dat) ``` Now we can run `bootstrap_error()` again. This time, we will tell the function to resample from the `soilphys` dataset when creating the bootstrapping runs. To do this, we need to tell it which `id_cols` define(s) the replications of the dataset. In our case, this is `replicate_id`. Here, only a single value for each profile and depth is sampled for each run. The new call looks like this: ```{r} set.seed(42) mod_pf_bs_soilphys <- bootstrap_error(mod_pf_replicate, n_samples = 25, sample_from = "soilphys", rep_cols = "replicate_id") ``` If we take a look into the result of `efflux()` again, we can see that the uncertainty of the efflux estimate has increased compared to using the variability in `gasdata`. This is both because `TPS`, from which `DS` is derived, has a strong impact on the flux calculation and because the variability within the `gasdata` dataset is probably unrealistically low and more representative of an instrument than of a measurement error. ```{r} ## boostrapping from soilphys mod_pf_bs_soilphys %>% filter(prof_id < 10) %>% efflux() ## boostrapping from gasdata mod_pf_bs_gasdata %>% filter(prof_id < 10) %>% efflux() ``` Finally, we can also use the variability from both datasets at the same time. This runs both sampling strategies independent from another. ```{r} set.seed(42) mod_pf_bs_both <- bootstrap_error(mod_pf_replicate, n_samples = 25, sample_from = "both", rep_cols = "replicate_id") mod_pf_bs_both %>% filter(prof_id < 10) %>% efflux() ```