--- title: "Messy.Cats Introduction" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Messy.Cats Introduction} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) library(messy.cats) library(dplyr) library(stringdist) library(varhandle) library(rapportools) library(gt) ``` `messy.cats` contains various functions that employ string distance tools in order to make data management easier for users working with categorical data. Categorical data, especially user inputted categorical data that often tends to be plagued by typos and different formatting choices, can be difficult to work with. `messy.cats` aims to provide functions that make cleaning categorical data simple and easy. This introduction will lead you through examples of the functions in use, explain the arguments, and show how to get the most out of these functions. First lets create some example vectors: ```{r car lists} cars_bad = c("teal Mazda RX4", "black Mazda RX4 Wag", "green Datsun 710", "Hornet 4 Drive", "green Hornet Sportabout", "Valiant", "Duster 360", "orange Merc 240D", "Merc 230", "teal Merc 280", "Merc 280C", "green Merc 450SE", "Merc 450SL", "blue Merc 450SLC", "green Cadillac Fleetwood", "Lincoln Continental", "Chrysler Imperial") cars_good = c("Mazda RX4", "Mazda RX4 Wag", "Datsun 710", "Hornet 4 Drive", "Hornet Sportabout", "Valiant", "Duster 360", "Merc 240D", "Merc 230", "Merc 280", "Merc 280C", "Merc 450SE", "Merc 450SL", "Merc 450SLC", "Cadillac Fleetwood", "Lincoln Continental", "Chrysler Imperial") ``` Suppose you have two lists of cars descriptions, one containing information on the make of the car, and the other containing make and color. Instead of string processing and deleting the color descriptors, which can be a fincky and time consuming process, cat_match() can match the contents of the two lists. ```{r first cat_match example} cat_match(cars_bad, cars_good, method = "jw") ``` After making sure that the string distance calculation is not making error with cat_match(), a user can use cat_replace to swap the contents of one list for their closest match in another. ```{r cat_replace ex} cat_replace(cars_bad, cars_good, method = "jw") ``` Alternatively, a user could join together two dataframes that use these lists as id variables with the function cat_join(). ```{r cat_join ex} bad_cars_df = data.frame(car = cars_bad, state_registration = "CA") good_cars_df= data.frame(car = cars_good, insur_comp = "All State") head(bad_cars_df) head(good_cars_df) cat_join(bad_cars_df, good_cars_df, by="car", method="jw", join="left") ``` These are some of the most basic uses of the core functions in the messy.cats package. Each function mentioned has a plethora of additional arguments that users can utilize in order to fine tune their string distance calculations or increase the ease with which they use the functions. What happens when the lists are not the same length, or the matches aren't perfect? ```{r not perfect matches} messy_short <- c("Fiat 128", "red Honda Civic", "Toyota Corolla", "Toyota Corona", "Dodge Challenger", "red AMC Javelin", "Camaro Z28", "Pontiac Firebird", "black Fiat X1-9", "blue Porsche 914-2", "Lotus Europa", "Ford Pantera L", "black Ferrari Dino", "black Maserati Bora", "black Volvo 142E") clean_short <- c( "Honda Civic", "Toyota Corona", "AMC Javelin", "Pontiac Firebird", "Fiat X1-9", "Porsche 914-2", "Lotus Europa", "Ford Pantera L", "Ferrari Dino", "Maserati Bora", "Volvo 142E") cat_match(messy_short,clean_short, method = "jaccard") %>% arrange(desc(dists)) ``` You can see that some of these matches are wrong, you can use the `return_lists` argument to return a list of the top matches. We suggest setting this to 2 or 3, if the top 3 are not still not correct, there may not be a clean match, or you may need to try a different string distance method (see `select_metric()`. When combined with a threshold, `cat_match` will only return a list for values that have a best match above that threshold. ```{r pick_lists} cat_match(messy_short,clean_short, method = "jaccard", return_lists = 3, threshold = 0.2) %>% arrange(desc(dists)) %>% gt::gt() ``` Some of our matches have good choices, but others seem to be lacking a match in the clean vector. You can output this dataframe and manually choose the correct match or that there is no suitable match, or you can use the `pick_lists` argument, which prompts you in the console to choose the correct option or none of the above. ```{r picked_list} #data("picked_list") ``` In this more extensive example we have two datasets of biological data. The first: `messy_caterpillars` contains information about the average weight and length of caterpillars, and as the name suggests, has messy very caterpillar names.`clean_caterpillars` is a dataset containing the species and the number of caterpillars found and clean caterpillar names. ```{r return_lists cat_match example,eval=F} # load in messy_caterpillars and clean_caterpillars data("clean_caterpillars") data("messy_caterpillars") head(messy_caterpillars) str(messy_caterpillars) head(clean_caterpillars) str(clean_caterpillars) ``` To fix these names we can either use `cat_replace()` and change the caterpillar name variables and then use a merging function such as the dplyr join functions, or use `cat_join()`. But first, in order to properly configure our string distance arguments, we will first use `cat_match()` to explore how the messy and clean caterpillar names match up. We input the messy and clean vectors—in this case columns of caterpillar names—and specify no other arguments other than to return the distance between each string pair. ```{r pick_lists cat_match example,eval=F} cat_match(messy_caterpillars$CaterpillarSpecies, clean_caterpillars$species, return_dists = T) ``` The output shows the clean string with the lowest string distance from each messy string, and the distance between the pair is returned as a third column. If we arrange by the distance in descending order, we can see the items of the messy vector with the worst matches. We can observe that the worst match is between "Papilio_glaucus" and "Orgyia leucostigma". Additionally, this is the only incorrect match. This means that if we set a threshold lower than .5, cat_match will return no incorrect matches. ```{r using outputs cat_match example,eval=F} cat_match(messy_caterpillars$CaterpillarSpecies,clean_caterpillars$species,return_dists = T,method="jaccard") %>% arrange(desc(dists)) ``` ```{r using outputs inner_join,eval=F} messy_caterpillars$CaterpillarSpecies = cat_replace(messy_caterpillars$CaterpillarSpecies,clean_caterpillars$species,method="jaccard", threshold = .49) dplyr::left_join(clean_caterpillars,messy_caterpillars, by = c("species"="CaterpillarSpecies")) ``` Alternatively, a user could accomplish this task in one step using `cat_join()`. ```{r} data("clean_caterpillars") data("messy_caterpillars") cat_join(messy_df = messy_caterpillars, clean_df = clean_caterpillars, by = c("CaterpillarSpecies", "species"), method="jaccard", threshold = .49,join="full") ``` ```{r} data("mtcars") mtcars_colnames_messy = mtcars colnames(mtcars_colnames_messy)[1:5] = paste0(colnames(mtcars)[1:5], "_17") colnames(mtcars_colnames_messy)[6:11] = paste0(colnames(mtcars)[6:11], "_2017") ``` Another messy dataset problem that our package hopes to help solve is row binding two datasets with different columns names. `fuzzy_rbind()` allows a user to join columns in dataframes using string distance matching. Any two columns with similar enough names will be bound together, and `fuzzy_rbind()` takes similar arguments as the rest of the functions in `messy.cats` to allow the user to fine tune their string distance matching. ```{r} fuzzy_rbind(df1 = mtcars, df2 = mtcars_colnames_messy, threshold = .5, method = "jw") ``` ```{r} fuzzy_rbind(df1 = mtcars, df2 = mtcars_colnames_messy, threshold = .2, method = "jw") ``` The second `fuzzy_rbind()` call results in fewer bound columns because the user asked for a lower threshold. messy.cats also provides a function to help the user choose a string distance metric in order to get the best matches possible. `select_metric()` gives an approximate measure of "certainty" of matches for some of R's string distance metrics on a given clean and messy vector pair. ```{r} select_metric(c("axxxxx", "bxxxxx", "cxxxxx"), c("apples", "banana", "carrot")) ``` In the above example, `select_metric()` recommends the Jaro-Winkler distance with a p value of 0.1, since this metric weights agreement in the beginning of strings more heavily. The other metrics do not consider the position of discrepancies in strings, and as such are less "certain" of their matches given the nature of the example. ```{r} select_metric(c("ipzza", "rgegplants", "vrem aof wheat"), c("pizza", "eggplants", "cream of wheat")) ``` This example recommends Optimal String Alignment distance, since this metric is more forgiving of adjacent character swaps. Notably, Jaro-Winkler with p of 0.1 is a particularly bad choice for this dataset, since all intended matches disagree on their first character.