---
vignette: >
  %\VignetteIndexEntry{1 Demonstration of SOAKED on simulations}
  %\VignetteEngine{litedown::vignette}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
litedown::reactor(
  print=NA,
  collapse = TRUE,
  comment = "#>",
  fig.width=10,
  fig.height=3)
data.table::setDTthreads(1)
```

The goal of this vignette is to explain how SOAKED (Same/Other/All K-fold cross-validation Extension Downsampling) can be used with `ResamplingSameOtherSizesCV` to determine why there is a difference between Same/Other/All models.

# Simulations

We begin by simulating regression data from a single signal (a sine wave) with two different noise levels.

```{r}
library(data.table)
N <- 2400
abs.x <- 3*pi
set.seed(2)
## Regular grid of x values, used to draw the ideal prediction function.
grid.dt <- data.table(
  x=seq(-abs.x,abs.x, l=201),
  y=0)
## Random feature values, shared by both difficulty levels.
x.vec <- runif(N, -abs.x, abs.x)
standard.deviation.vec <- c(
  easy=0.1,
  hard=1.7)
```

There are two standard deviation parameters for the simulation: easy (small noise) and hard (large noise). Below we simulate and plot the data.

```{r}
reg.data.list <- list()
grid.signal.dt.list <- list()
sim_fun <- sin
for(difficulty in names(standard.deviation.vec)){
  standard.deviation <- standard.deviation.vec[[difficulty]]
  ## Output is the sine signal plus Gaussian noise.
  signal.vec <- sim_fun(x.vec)
  y <- signal.vec+rnorm(N,sd=standard.deviation)
  task.dt <- data.table(x=x.vec, y)
  reg.data.list[[difficulty]] <- data.table(difficulty, task.dt)
  grid.signal.dt.list[[difficulty]] <- data.table(
    difficulty,
    algorithm="ideal",
    x=grid.dt$x,
    y=sim_fun(grid.dt$x))
}
reg.data <- rbindlist(reg.data.list)
grid.signal.dt <- rbindlist(grid.signal.dt.list)
algo.colors <- c(
  featureless="blue",
  rpart="red",
  ideal="black")
if(require(ggplot2)){
  my_theme <- theme_bw(15)
  ggplot()+
    my_theme+
    theme(panel.spacing=grid::unit(1, "cm"))+
    geom_point(aes(
      x, y),
      fill="white",
      color="grey",
      data=reg.data)+
    geom_line(aes(
      x, y, color=algorithm),
      linewidth=2,
      data=grid.signal.dt)+
    scale_color_manual(values=algo.colors)+
    facet_grid(. ~ difficulty, labeller=label_both)
}
```

Above we see the simulated data, which represent regression problems in 1D. There is a panel for each difficulty level, with a grey dot for each training data point, and a black curve that represents the ideal prediction function (same in both difficulty levels).

# mlr3 benchmark

In this section, we define a benchmark using these simulated data. First, we create the SOAKED instance by setting `sizes=0`, and we use 10-fold CV.

```{r}
SOAKED <- mlr3resampling::ResamplingSameOtherSizesCV$new()
SOAKED$param_set$values$sizes <- 0
SOAKED$param_set$values$folds <- 10
```

Next, we create and visualize two Tasks:

```{r fig.height=4}
set.seed(1)
sim.meta.list <- list(
  ## different: 400 easy rows (large subset) and 200 hard rows (small subset).
  different=rbind(
    reg.data[difficulty=="easy"][sample(.N, 400)],
    reg.data[difficulty=="hard"][sample(.N, 200)]
  )[, .(x,y,Subset=ifelse(difficulty=="easy", "large", "small"))],
  ## iid_easy: 120 easy rows, two thirds labeled large and one third small.
  iid_easy=reg.data[
    difficulty=="easy"
  ][sample(.N, 120)][
  , Subset := rep(c("large","large","small"), l=.N)
  ][, .(x,y,Subset)])
d_task_list <- list()
gg_list <- list()
for(sim.name in names(sim.meta.list)){
  sim.i.dt <- sim.meta.list[[sim.name]]
  sub_task <- mlr3::TaskRegr$new(
    sim.name, sim.i.dt, target="y")
  sub_task$col_roles$subset <- "Subset"
  sub_task$col_roles$feature <- "x"
  d_task_list[[sim.name]] <- sub_task
  if(require("ggplot2")){
    gg_list[[sim.name]] <- ggplot()+
      my_theme+
      ggtitle(paste("Task:", sim.name))+
      geom_point(aes(
        x, y),
        shape=21,
        color="black",
        fill="white",
        data=sim.i.dt)+
      geom_line(aes(
        x, y, color=algorithm),
        data=grid.signal.dt)+
      scale_color_manual(values=algo.colors)+
      facet_grid(Subset~., labeller=label_both)
  }
}
gg_list
```

The figures above show the two Tasks. Each Task has two subsets: large and small.

* The two subsets in `different` have different noise levels, so we expect to see significant differences when training using either the same or different numbers of samples.
* The two subsets in `iid_easy` have the same noise level, so we expect to see significant test error differences only when training using different numbers of samples.

Below we create the benchmark grid.

```{r}
reg.learner.list <- list(
  if(requireNamespace("rpart"))mlr3::LearnerRegrRpart$new(),
  mlr3::LearnerRegrFeatureless$new())
(reg.bench.grid <- mlr3::benchmark_grid(
  d_task_list,
  reg.learner.list,
  SOAKED))
```

The benchmark includes both data sets, and two learners: rpart decision tree and featureless baseline.

```{r fig.height=6}
if(require(future))plan("multisession")
if(require(lgr))get_logger("mlr3")$set_threshold("warn")
(reg.bench.result <- mlr3::benchmark(reg.bench.grid))
score_dt <- mlr3resampling::score(
  reg.bench.result, mlr3::msr("regr.rmse"))
plot(score_dt)+my_theme
```

The figure above shows one dot for every train/test split. There are five rows of data per panel, because we train same/other/all at full sample size, and two downsampled models (all and same or other). Below we compute P-values using the full sample sizes.

```{r fig.height=5}
plist <- mlr3resampling::pvalue(score_dt)
plot(plist)+my_theme
```

Above we see that for the rpart learner, there are significant differences between same and other/all. Comparing same and other, we consistently see better predictions (smaller error values) for models with more training data. There are two possible explanations:

* sample size effect: there is no distributional difference between subsets. The predictions are more accurate because the learning algorithm has seen more relevant training data.
* subset effect: there is a distributional difference between subsets that makes learning and prediction easier in the larger subset.

In the next section, we show how downsampling can be used to determine which of these interpretations is consistent with the data.
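
The two explanations above can be illustrated with a minimal base R sketch, which is not the implementation used by `mlr3resampling`. It assumes a smoothing spline learner (`stats::smooth.spline`) instead of rpart, a single train/test split instead of 10-fold CV, and hypothetical helper names (`simulate_subset`, `test_rmse`) and sample sizes chosen only for illustration. Both subsets are drawn from the same distribution, so any advantage of the larger training set should be a sample size effect.

```{r}
set.seed(1)
## Hypothetical helper (not part of mlr3resampling):
## n points from the sine pattern with Gaussian noise.
simulate_subset <- function(n, noise_sd){
  x <- runif(n, -3*pi, 3*pi)
  data.frame(x=x, y=sin(x)+rnorm(n, sd=noise_sd))
}
## Hypothetical helper: fit a smoothing spline on train,
## then return the root mean squared error on test.
test_rmse <- function(train, test){
  fit <- smooth.spline(train$x, train$y)
  pred <- predict(fit, test$x)$y
  sqrt(mean((pred-test$y)^2))
}
same <- simulate_subset(40, noise_sd=0.1)   # small subset
other <- simulate_subset(400, noise_sd=0.1) # larger subset, same distribution
test <- simulate_subset(1000, noise_sd=0.1) # held-out data for evaluation
## Downsample: keep as many rows of other as there are in same.
other_down <- other[sample(nrow(other), nrow(same)), ]
rmse_vec <- c(
  same=test_rmse(same, test),
  other_full=test_rmse(other, test),
  other_down=test_rmse(other_down, test))
print(rmse_vec)
```

If `other_full` has smaller error than `same`, but `other_down` is close to `same`, that pattern is what we would expect from a sample size effect. A gap that remains after downsampling would instead suggest a subset effect. A single split is noisy, which is why the analysis in this vignette uses several folds and P-values.
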
# Downsample analysis

In this section, we do downsample analysis to determine if the differences observed at full sample size are due to the different sample sizes, or distributional differences between subsets.

## iid easy task

In this simulation, we want to verify that SOAKED can detect that the two subsets are iid from the same distribution.

```{r}
dlist <- mlr3resampling::pvalue_downsample(score_dt[
  algorithm=="rpart" & task_id=="iid_easy" & test.subset=="large"])
plot(dlist)+my_theme
```

In both test subsets (above and below), we see significant differences at full sample size (left) that disappear at the smallest sample size (right). This is a clear sample size effect (no distributional difference between subsets), as expected for the `iid_easy` task.

```{r}
dlist <- mlr3resampling::pvalue_downsample(score_dt[
  algorithm=="rpart" & task_id=="iid_easy" & test.subset=="small"])
plot(dlist)+my_theme
```

## different task

In this simulation, we want to verify that SOAKED can detect that the two subsets have a real distributional difference that makes it easier to learn using data from the larger subset.

```{r}
dlist <- mlr3resampling::pvalue_downsample(score_dt[
  algorithm=="rpart" & task_id=="different" & test.subset=="large"])
plot(dlist)+my_theme
```

Both above and below (but especially above), we see significant differences at full sample size (left) that persist after downsampling (right), which is a clear indication of a distributional difference between subsets, as expected for the `different` task.

```{r}
dlist <- mlr3resampling::pvalue_downsample(score_dt[
  algorithm=="rpart" & task_id=="different" & test.subset=="small"])
plot(dlist)+my_theme
```

# Conclusion

We have shown how SOAKED (cross-validation with subsets and downsampling) can be used to determine if there are differences in learnable and predictable patterns between subsets.

# Stop future background workers

This code is needed to avoid an R CMD check NOTE about detritus in the temp directory.

```{r}
if(require(future))plan("sequential")
```