--- title: "Parsing POS Layout Files for Descriptive Names" author: "Robert J. Gambrel" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Parsing POS Layout Files for Descriptive Names} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- This vignette will show how a Provider of Services report layout file can be quickly parsed to extract the descriptive variable names it contains. POS datasets from year 2010 and earlier have generic variable names like `PROV0001, PROV0002, ...` that offer no insight into what the variable actually is. In the Layout file, along with a data dictionary explaining the variable's values, there is also a COBOL descriptive name. The `pos_names_extract()` function will parse this file and return the descriptive names, in the order that matches the variables in the dataset. ## Provider of Services Data I have included a sample of the 2010 Provider of Services data for hospices. The full 2010 file (along with many other years) is available [from the NBER](https://www.nber.org/data/provider-of-services.html) and contains data from other provider types as well. ```{r, eval = T} library(medicare) # load the package data data(pos2010, package = "medicare") names(pos2010)[1:10] ``` These variable names are useless, and with over 500 variables it is impractical to look up each one. Instead, we can parse the layout file to obtain useful names. In this example, I have bundled the Layout 2010 file with this package, but I expect the user to have the downloaded text file that corresponds to each dataset in use. ```{r} # filepath should be changed by user filepath <- system.file("extdata", "layout10.txt", package = "medicare") names_2010 <- pos_names_extract(filepath, pos2010) names_2010[1:10] ``` These are much more descriptive variable names and worth using. ```{r} pos2010_renamed <- pos2010 names(pos2010_renamed) <- names_2010 ``` Note that it is up to the user to make sure that the layout file is appropriate for the chosen data file. Each year's layout file is different, so each year must be parsed separately. The function checks whether the number of variables in the layout file and dataset match and whether the generic variable names are the same in both. It will stop if there's a problem. **If the generic names from dataset 20XX are the same as in layout 20YY, the parsing should work, but won't necessarily be accurate. CMS is not 100% consistent with variable naming across years.** ```{r, error = T} pos2010_short <- pos2010[, 1:500] names_2010_short <- pos_names_extract(filepath, pos2010_short) ``` ```{r, error = T} pos2010_wrong_names <- pos2010 names(pos2010_wrong_names)[1:3] <- c("wrong1", "wrong2", "wrong3") names_2010_wrong_names <- pos_names_extract(filepath, pos2010_wrong_names) ``` ## Pre-compiled dataset names In order to same the user time and headaches of downloading each year's Layout file, I have pre-compiled dataset names for years 2000-2010. These can be accessed via the `pos_names()` function. By looking at inner variables, this also illustrates how the dataset layouts change over time: ```{r} for (year in 2000:2010) { print(year) print(pos_names(year)[200:205]) } ```