---
title: "Piggyback Data atop your GitHub Repository!"
author: "Carl Boettiger"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{piggyback}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  results="hide",
  eval = Sys.getenv("CBOETTIG_TOKEN", FALSE)
)

Sys.setenv(piggyback_cache_duration=0)
```

# Why `piggyback`?

`piggyback` grew out of the needs of students both in my classroom and in my research group, who frequently need to work with data files somewhat larger than one can conveniently manage by committing directly to GitHub. As we frequently want to share and run code that depends on >50MB data files on each of our own machines, on continuous integration, and on larger computational servers, data sharing quickly becomes a bottleneck.

[GitHub allows](https://docs.github.com/en/github/managing-large-files/distributing-large-binaries) repositories to attach files of up to 2 GB each to releases as a way to distribute large files associated with the project source code. There is no limit on the number of files or on the bandwidth used to deliver them.

## Installation

Install the latest release from CRAN using:

``` r
install.packages("piggyback")
```

You can install the development version from [GitHub](https://github.com/) with:

``` r
# install.packages("devtools")
devtools::install_github("ropensci/piggyback")
```

## Authentication

No authentication is required to download data from *public* GitHub repositories using `piggyback`. Nevertheless, `piggyback` recommends setting a token when possible to avoid rate limits. To upload data to any repository, or to download data from *private* repositories, you will need to authenticate first.

To do so, add your [GitHub Token](https://github.com/settings/tokens/new?scopes=repo,gist&description=R:GITHUB_PAT) to an environment variable, e.g.
in a `.Renviron` file in your home directory or project directory (any private place you won't upload); see `usethis::edit_r_environ()`. For one-off use you can also set your token from the R console using:

```r
Sys.setenv(GITHUB_PAT="xxxxxx")
```

But try to avoid putting `Sys.setenv()` in any R scripts -- remember, the goal here is to avoid writing your private token in any file that might be shared, even privately.

For more information, please see the [usethis guide to GitHub credentials](https://usethis.r-lib.org/articles/git-credentials.html).

## Downloading data

Download the latest version or a specific version of the data:

```r
library(piggyback)
```

```r
pb_download("iris2.tsv.gz", 
            repo = "cboettig/piggyback-tests",
            tag = "v0.0.1",
            dest = tempdir())
```

**Note**: Whenever you are working from a location inside a git repository corresponding to your GitHub repo, you can simply omit the `repo` argument and it will be detected automatically. Likewise, if you omit the release `tag`, `pb_download()` will simply pull data from the most recent release (`latest`). Third, you can omit `tempdir()` if you are using an RStudio Project (`.Rproj` file) in your repository, and then the download location will be relative to the Project root. `tempdir()` is used throughout the examples only to meet CRAN policies and is unlikely to be the choice you actually want here.

Lastly, simply omit the file name to download all assets connected with a given release.

```r
pb_download(repo = "cboettig/piggyback-tests",
            tag = "v0.0.1",
            dest = tempdir())
```

These defaults mean that in most cases, it is sufficient to simply call `pb_download()` without additional arguments to pull in any data associated with a project on a GitHub repo that is too large to commit to git directly.

`pb_download()` will skip the download of any file that already exists locally if the timestamp on the local copy is more recent than the timestamp on the GitHub copy.
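The timestamp rule is easy to picture with a small offline sketch. The helper `needs_download()` below is hypothetical: it is not part of `piggyback`, and the package's real logic may differ in detail. It only assumes that we know when the remote asset was uploaded, and compares that time with the local file's modification time from `file.mtime()`.

```r
# Hypothetical helper (not a piggyback function): decide whether a file
# should be fetched, given the upload time of the remote copy.
needs_download <- function(local_path, remote_time) {
  # Nothing on disk yet: always download.
  if (!file.exists(local_path)) {
    return(TRUE)
  }
  # Skip only when the local copy is strictly more recent than the remote one.
  local_time <- file.mtime(local_path)
  !(local_time > remote_time)
}

# A freshly written local file, compared against made-up remote timestamps.
local_file <- tempfile(fileext = ".tsv")
writeLines("x\ty", local_file)

needs_download(local_file, Sys.time() - 3600)  # remote copy is older: FALSE
needs_download(local_file, Sys.time() + 3600)  # remote copy is newer: TRUE
needs_download(tempfile(), Sys.time())         # no local copy: TRUE
```
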
`pb_download()` also includes arguments to control the timestamp behavior, the progress bar, whether existing files should be overwritten, and whether any particular files should be excluded from the download. See the function documentation for details.

Sometimes it is preferable to have a URL from which the data can be read in directly, rather than downloading the data to a local file. For example, such a URL can be embedded directly into another R script, avoiding any dependence on `piggyback` (provided the repository is already public). To get a list of URLs rather than actually downloading the files, use `pb_download_url()`:

```r
pb_download_url("data/mtcars.tsv.gz", 
                repo = "cboettig/piggyback-tests",
                tag = "v0.0.1") 
```

## Uploading data

If your GitHub repository doesn't have any [releases](https://docs.github.com/en/github/administering-a-repository/managing-releases-in-a-repository) yet, `piggyback` will help you quickly create one. Create new releases to manage multiple versions of a given data file. While you can create releases as often as you like, making a new release is by no means necessary each time you upload a file. If maintaining old versions of the data is not useful, you can stick with a single release and upload all of your data there.

```r
pb_new_release("cboettig/piggyback-tests", "v0.0.2")
```

Once we have at least one release available, we are ready to upload. By default, `pb_upload()` will attach data to the latest release.

```r
## We'll need some example data first.
## Pro tip: compress your tabular data to save space & speed upload/downloads
readr::write_tsv(mtcars, "mtcars.tsv.gz")

pb_upload("mtcars.tsv.gz", 
          repo = "cboettig/piggyback-tests", 
          tag = "v0.0.1")
```

Like `pb_download()`, `pb_upload()` compares timestamps: by default it will overwrite any file of the same name already attached to the release, unless the timestamp of the previously uploaded version is more recent. You can toggle these settings with `overwrite=FALSE` and `use_timestamps=FALSE`.
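To see what the compression tip in the upload example buys you, here is a small base-R sketch that needs no GitHub access. It writes `mtcars` once as plain TSV and once through a `gzfile()` connection, then compares the file sizes. The temporary file names are placeholders, and the savings you see will vary with your own data.

```r
# Write the same table as plain and as gzip-compressed TSV (base R only).
plain_file <- tempfile(fileext = ".tsv")
gz_file <- tempfile(fileext = ".tsv.gz")

write.table(mtcars, plain_file, sep = "\t", quote = FALSE, row.names = FALSE)

con <- gzfile(gz_file, "w")
write.table(mtcars, con, sep = "\t", quote = FALSE, row.names = FALSE)
close(con)

# Compare the size on disk of the two versions, in bytes.
file.size(plain_file)
file.size(gz_file)

# The compressed file reads back directly, with no manual decompression.
roundtrip <- read.delim(gz_file)
dim(roundtrip)
```

Because `read.delim()` reads gzip-compressed files transparently, the compressed file can be used as-is after download.
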
## Additional convenience functions

List all files currently piggybacking on a given release. Omit the `tag` to see files on all releases.

```r
pb_list(repo = "cboettig/piggyback-tests", 
        tag = "v0.0.1")
```

Delete a file from a release:

```r
pb_delete(file = "mtcars.tsv.gz", 
          repo = "cboettig/piggyback-tests", 
          tag = "v0.0.1")
```

Note that this is irreversible unless you have a copy of the data elsewhere.

## Multiple files

You can pass a vector of file paths, such as the output of `list.files()`, to the `file` argument of `pb_upload()` in order to upload multiple files. Some common patterns:

```r
library(magrittr)

## upload a folder of data
## (full.names = TRUE keeps the "data/" prefix so the files can be found)
list.files("data", full.names = TRUE) %>% 
  pb_upload(repo = "cboettig/piggyback-tests", tag = "v0.0.1")

## upload certain file extensions
## (`pattern` takes a single regular expression, not a vector of globs)
list.files(pattern = "\\.(tsv\\.gz|tif|zip)$") %>% 
  pb_upload(repo = "cboettig/piggyback-tests", tag = "v0.0.1")
```

Similarly, you can download all current data assets of the latest or a specified release by calling `pb_download()` without a `file` argument.

## Caching

To reduce API calls to GitHub, piggyback caches most calls with a timeout of 1 second by default. This avoids repeating identical requests to update its internal record of the repository data (releases, assets, timestamps, etc) during programmatic use. You can increase or decrease this delay by setting the environment variable `piggyback_cache_duration` to a value in seconds, e.g. `Sys.setenv("piggyback_cache_duration"=10)` for a longer delay or `Sys.setenv("piggyback_cache_duration"=0)` to disable caching, and then restarting R.

## Valid file names

GitHub assets attached to a release do not support file paths, and GitHub will convert most special characters (`#`, `%`, etc) to `.` or throw an error (e.g. for file names containing `$`, `@`, `/`). piggyback will default to using the base name of the file only (i.e. it will only use `"mtcars.csv"` if provided a file path like `"data/mtcars.csv"`).

## A Note on GitHub Releases vs Data Archiving

`piggyback` is not intended as a data archiving solution.
Importantly, bear in mind that there is nothing special about multiple "versions" in releases, as far as data assets uploaded by `piggyback` are concerned. The data files `piggyback` attaches to a Release can be deleted or modified at any time -- creating a new release to store data assets is the functional equivalent of just creating new directories `v0.1`, `v0.2` to store your data. (GitHub Releases are always pinned to a particular `git` tag, so the code/git-managed contents associated with the repo are more immutable, but remember our data assets just piggyback on top of the repo.)

Permanent, published data should always be archived in a proper data repository with a DOI, such as [zenodo.org](https://zenodo.org). Zenodo can freely archive public research data files up to 50 GB in size, and data is strictly versioned (once released, a DOI always refers to the same version of the data; new releases are given new DOIs). `piggyback` is meant only to lower the friction of working with data during the research process (e.g. to make data accessible to collaborators or continuous integration systems during the research process, including for private repositories).

## What will GitHub think of this?

[GitHub documentation](https://docs.github.com/en/github/managing-large-files/distributing-large-binaries) at the time of writing endorses the use of attachments to releases as a solution for distributing large files as part of your project:

![](https://github.com/ropensci/piggyback/raw/83776863b34bb1c9962154608a5af41867a0622f/man/figures/github-policy.png)

Of course, it will be up to GitHub to decide if this use of release attachments is acceptable in the long term.