Title: Employs String Distance Tools to Help Clean Categorical Data
Version: 1.0
BugReports: https://github.com/hkarp1/messy.cats/issues
Description: Matching with string distance has never been easier! 'messy.cats' contains various functions that employ string distance tools in order to make data management easier for users working with categorical data. Categorical data, especially user inputted categorical data that often tends to be plagued by typos, can be difficult to work with. 'messy.cats' aims to provide functions that make cleaning categorical data simple and easy.
License: MIT + file LICENSE
Encoding: UTF-8
LazyData: true
RoxygenNote: 7.2.2
Depends: R (≥ 3.5.0)
Imports: dplyr, stringdist, varhandle, rapportools, stringr, gt
Suggests: knitr, rmarkdown
VignetteBuilder: knitr
NeedsCompilation: no
Packaged: 2022-11-29 23:21:24 UTC; andrewhennessy
Author: Andrew Hennessy [aut], Duncan Kreutter [aut], Harrison Karp [cre]
Maintainer: Harrison Karp <hkarp@wesleyan.edu>
Repository: CRAN
Date/Publication: 2022-11-30 11:40:02 UTC
messy.cats: Employs String Distance Tools to Help Clean Categorical Data
Description
Matching with string distance has never been easier! 'messy.cats' contains various functions that employ string distance tools in order to make data management easier for users working with categorical data. Categorical data, especially user inputted categorical data that often tends to be plagued by typos, can be difficult to work with. 'messy.cats' aims to provide functions that make cleaning categorical data simple and easy.
Author(s)
Maintainer: Harrison Karp hkarp@wesleyan.edu
Authors:
Andrew Hennessy ahennessy@wesleyan.edu
Duncan Kreutter dkreutter@wesleyan.edu
See Also
Useful links:
Report bugs at https://github.com/hkarp1/messy.cats/issues
cat_join
Description
cat_join() joins two dataframes using the closest match between two specified columns with misspellings or slight format differences. The closest match can be found using a variety of different string distance measurement options.
Usage
cat_join(
messy_df,
clean_df,
by,
threshold = NA,
method = "jw",
q = 1,
p = 0,
bt = 0,
useBytes = FALSE,
weight = c(d = 1, i = 1, t = 1),
join = "left"
)
Arguments
messy_df
The dataframe to be joined using a messy categorical variable.
clean_df
The dataframe to be joined with a clean categorical variable to be used as a reference for the messy column.
by
A vector that specifies the columns to match and join by. If the column names are the same, input "column_name". If the columns have different names, input c("messy_column" = "clean_column").
threshold
The maximum distance that will form a match. If this argument is specified, any element in the messy vector that has no match closer than the threshold distance will be replaced with NA. Default: NA
method
The type of string distance calculation to use. Possible methods are: osa, lv, dl, hamming, lcs, qgram, cosine, jaccard, jw, and soundex. See package stringdist for more information. Default: 'jw'
q
Size of the q-gram used in string distance calculation. Default: 1
p
Only used with method "jw", the Jaro-Winkler penalty size. Default: 0
bt
Only used with method "jw" with p > 0, Winkler's boost threshold. Default: 0
useBytes
Whether or not to perform byte-wise comparison. Default: FALSE
weight
Only used with methods "osa" or "dl", a vector representing the penalty for deletion, insertion, substitution, and transposition, in that order. Default: c(d = 1, i = 1, t = 1)
join
Choose a join function from the dplyr package to use in joining the datasets. Default: 'left'
Details
When dealing with messy categorical string data, string distance matching can be an easy and efficient cleaning tool. A variety of string distance calculation algorithms have been developed for different types of data, and these algorithms can be used to detect and remedy problems with categorical string data.
By providing a correctly spelled and specified vector of categories to be compared against a vector of messy strings, a cleaned vector of categories can be generated by finding the correctly specified string most similar to each messy string. This method works particularly well for messy user-inputted data, which often suffers from transposition or misspelling errors.
cat_join() joins the messy and clean datasets using the closest matching elements from the designated columns. The columns from the datasets are passed to cat_replace() as the messy and clean vectors, and the datasets are joined using a user-supplied dplyr join verb.
Value
Returns a dataframe consisting of the two inputted dataframes joined by their designated columns.
Examples
if(interactive()){
#EXAMPLE1
messy_trees = data.frame()
messy_trees[1:9,1] = c("red oak", "williw", "hemluck", "white elm",
"fir tree", "birch tree", "pone", "dagwood", "mople")
messy_trees[1:9, 2] = c(34, 12, 43, 32, 65, 23, 12, 45, 35)
clean_trees = data.frame()
clean_trees[1:9, 1] = c("oak", "willow", "hemlock", "elm", "fir",
"birch", "pine", "dogwood", "maple")
clean_trees[1:9, 2] = "y"
cat_join(messy_trees, clean_trees, by = "V1", method = "jaccard")
}
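Where the key columns carry different names, both can be named in the by argument; a brief hedged sketch (the column names species_messy and species are hypothetical, not from the package documentation):
if(interactive()){
#EXAMPLE2: illustrative sketch only
messy_trees2 = data.frame(species_messy = c("red oak", "williw", "hemluck"),
count = c(34, 12, 43))
clean_trees2 = data.frame(species = c("oak", "willow", "hemlock"),
native = "y")
# the key columns have different names, so both are named in `by`
cat_join(messy_trees2, clean_trees2, by = c("species_messy" = "species"), method = "jaccard")
}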
cat_match
Description
cat_match() matches the contents of a messy vector with the closest match in a clean vector. The closest match can be found using a variety of different string distance measurement options.
Usage
cat_match(
messy_v,
clean_v,
return_dists = TRUE,
return_lists = NA,
pick_lists = FALSE,
threshold = NA,
method = "jw",
q = 1,
p = 0,
bt = 0,
useBytes = FALSE,
weight = c(d = 1, i = 1, t = 1)
)
Arguments
messy_v
The messy string vector that will be restructured. This can come in the form of a column of a dataframe or a lone vector.
clean_v
The clean string vector that will be referenced to perform the restructuring. Again, this argument can be a dataframe column or vector.
return_dists
If set to TRUE, the distance between the matched strings will be returned as a third column in the output dataframe. Default: TRUE
return_lists
Return a list of the top X matches for each element. Default: NA
pick_lists
Set to TRUE to manually choose matches. Default: FALSE
threshold
The maximum distance that will form a match. If this argument is specified, any element in the messy vector that has no match closer than the threshold distance will be replaced with NA. Default: NA
method
The type of string distance calculation to use. Possible methods are: osa, lv, dl, hamming, lcs, qgram, cosine, jaccard, jw, and soundex. See package stringdist for more information. Default: 'jw'
q
Size of the q-gram used in string distance calculation. Default: 1
p
Only used with method "jw", the Jaro-Winkler penalty size. Default: 0
bt
Only used with method "jw" with p > 0, Winkler's boost threshold. Default: 0
useBytes
Whether or not to perform byte-wise comparison. Default: FALSE
weight
Only used with methods "osa" or "dl", a vector representing the penalty for deletion, insertion, substitution, and transposition, in that order. Default: c(d = 1, i = 1, t = 1)
Details
When dealing with messy categorical string data, string distance matching can be an easy and efficient cleaning tool. A variety of string distance calculation algorithms have been developed for different types of data, and these algorithms can be used to detect and remedy problems with categorical string data.
By providing a correctly spelled and specified vector of categories to be compared against a vector of messy strings, a cleaned vector of categories can be generated by finding the correctly specified string most similar to each messy string. This method works particularly well for messy user-inputted data, which often suffers from transposition or misspelling errors.
cat_match() is meant as an exploratory tool to discover how the elements of two vectors will match using string distance measures, and it has added functionality to resolve issues by hand and create a dataframe that can be used to build custom matches between the clean and messy vectors.
Value
Returns a dataframe with each unique value in the bad vector and its closest match in the good vector. If return_dists is TRUE the distances between the matches are added as a column.
Examples
if(interactive()){
messy_trees = c("red oak", "williw", "hemluck", "white elm",
"fir tree", "birch tree", "pone", "dagwood", "mople")
clean_trees = c("oak", "willow", "hemlock", "elm", "fir", "birch", "pine", "dogwood", "maple")
matched_trees = cat_match(messy_trees, clean_trees)
}
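The threshold argument can be used to drop weak matches; a brief hedged sketch (the cutoff of 0.4 is arbitrary, chosen only for illustration):
if(interactive()){
# illustrative sketch: elements with no match within the threshold distance come back as NA
messy_trees = c("red oak", "williw", "hemluck", "zebra")
clean_trees = c("oak", "willow", "hemlock", "elm")
cat_match(messy_trees, clean_trees, threshold = 0.4)
}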
cat_replace
Description
cat_replace() replaces the contents of a messy vector with the closest match in a clean vector. The closest match can be found using a variety of different string distance measurement options.
Usage
cat_replace(
messy_v,
clean_v,
threshold = NA,
method = "jw",
q = 1,
p = 0,
bt = 0,
useBytes = FALSE,
weight = c(d = 1, i = 1, t = 1)
)
Arguments
messy_v
The messy string vector that will be restructured. This can come in the form of a column of a dataframe or a lone vector.
clean_v
The clean string vector that will be referenced to perform the restructuring. Again, this argument can be a dataframe column or vector.
threshold
The maximum distance that will form a match. If this argument is specified, any element in the messy vector that has no match closer than the threshold distance will be replaced with NA. Default: NA
method
The type of string distance calculation to use. Possible methods are: osa, lv, dl, hamming, lcs, qgram, cosine, jaccard, jw, and soundex. See package stringdist for more information. Default: 'jw'
q
Size of the q-gram used in string distance calculation. Default: 1
p
Only used with method "jw", the Jaro-Winkler penalty size. Default: 0
bt
Only used with method "jw" with p > 0, Winkler's boost threshold. Default: 0
useBytes
Whether or not to perform byte-wise comparison. Default: FALSE
weight
Only used with methods "osa" or "dl", a vector representing the penalty for deletion, insertion, substitution, and transposition, in that order. Default: c(d = 1, i = 1, t = 1)
Details
When dealing with messy categorical string data, string distance matching can be an easy and efficient cleaning tool. A variety of string distance calculation algorithms have been developed for different types of data, and these algorithms can be used to detect and remedy problems with categorical string data.
By providing a correctly spelled and specified vector of categories to be compared against a vector of messy strings, a cleaned vector of categories can be generated by finding the correctly specified string most similar to each messy string. This method works particularly well for messy user-inputted data, which often suffers from transposition or misspelling errors.
cat_replace() replaces the elements of the messy vector with the closest matching element from the clean vector.
Value
cat_replace() returns a cleaned version of the bad vector, with each element replaced by the most similar element of the good vector.
Examples
if(interactive()){
messy_trees = c("red oak", "williw", "hemluck", "white elm", "fir tree",
"birch tree", "pone", "dagwood", "mople")
clean_trees = c("oak", "willow", "hemlock", "elm", "fir", "birch", "pine", "dogwood", "maple")
cleaned_trees = cat_replace(messy_trees, clean_trees)
}
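Because cat_replace() returns a vector, it can also clean a dataframe column in place; a brief hedged sketch (the survey dataframe and its species column are hypothetical):
if(interactive()){
# illustrative sketch: overwrite a messy column with its cleaned version
survey = data.frame(species = c("williw", "hemluck", "mople"), count = c(12, 43, 35))
clean_trees = c("willow", "hemlock", "maple")
survey$species = cat_replace(survey$species, clean_trees)
}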
clean_caterpillars
Description
Dataframe with caterpillar counts from three summers.
Usage
clean_caterpillars
Format
A data frame with 74 rows and 3 variables:
species
character Full Latin names of 29 caterpillar species.
count
integer Randomly generated fake counts of the caterpillars.
year
double Year of caterpillar observations.
Details
An example dataset with clean caterpillar species names.
clean_names.df
Description
Data set of clean names
Usage
clean_names.df
Format
A data frame with 20 rows and 2 variables:
first
character Clean first names
last
character Clean last names
Details
An example data set that can be used in testing messy.cats functions.
country.names
Description
Dataframe with country names as its only variable; contains many popular and official names for countries.
Usage
country.names
Format
A data frame with 203 rows and 1 variable:
name
character Names of countries
Details
This dataframe contains a list of clean country names with many popular and official names for countries.
country_match
Description
A wrapper function for cat_match() that only requires an inputted vector of messy country names. country_match() uses a built-in clean list of country names, country.names, as the reference clean vector.
Usage
country_match(messy_countries, threshold = NA, p = 0)
Arguments
messy_countries
Vector containing the messy country names that will be replaced by the closest match from country.names.
threshold
The maximum distance that will form a match. If this argument is specified, any element in the messy vector that has no match closer than the threshold distance will be replaced with NA. Default: NA
p
Only used with method "jw", the Jaro-Winkler penalty size. Default: 0
Details
Country names are often misspelled or abbreviated in datasets, especially datasets that have been manually digitized or created. country_match() is a wrapper function of cat_match() that quickly solves this common issue of misspellings or differing formats of country names across datasets. This wrapper function uses a built-in clean list of country names, country.names, as the reference clean vector and matches your inputted messy vector of names to their nearest country in country.names.
Value
country_match() returns a cleaned version of the bad vector, with each element replaced by the most similar element of the good vector.
Examples
if(interactive()){
#EXAMPLE1
lst <- c("Conagoa", "Blearaus", "Venzesual", "Uruagsya", "England")
matched <- country_match(lst)
}
country_replace
Description
A wrapper function for cat_replace() that only requires an inputted vector of messy countries. country_replace() uses a built-in clean list of country names, country.names, as the reference clean vector.
Usage
country_replace(messy_countries, threshold = NA, p = 0)
Arguments
messy_countries
Vector containing the messy country names that will be replaced by the closest match from country.names.
threshold
The maximum distance that will form a match. If this argument is specified, any element in the messy vector that has no match closer than the threshold distance will be replaced with NA. Default: NA
p
Only used with method "jw", the Jaro-Winkler penalty size. Default: 0
Details
Country names are often misspelled or abbreviated in datasets, especially datasets that have been manually digitized or created. country_replace() is a wrapper function of cat_replace() that quickly solves this common issue of misspellings or differing formats of country names across datasets. This wrapper function uses a built-in clean list of country names, country.names, as the reference clean vector and replaces your inputted messy vector of names with their nearest match in country.names.
Value
country_replace() returns a cleaned version of the bad vector, with each element replaced by the most similar element of the good vector.
Examples
if(interactive()){
#EXAMPLE1
lst <- c("Conagoa", "Blearaus", "Venzesual", "Uruagsya", "England")
fixed <- country_replace(lst)
}
fix_typos
Description
This function is meant to allow users to fix typos in strings that are not normally found in dictionaries.
Usage
fix_typos(typo_v, thr, occ_ratio)
Arguments
typo_v
Vector of strings that will have its typos cleaned.
thr
The maximum string distance used to determine typos. This argument is specified as the percentage of a typo that should at most be expected to be insertions, additions, deletions, and transpositions.
occ_ratio
The minimum ratio of correctly spelled words to their typo. This argument helps to weed out words that are similar but valid. For example, commonly occurring valid names such as Adam and Amy will not be recognized as typos even though they are similar, because they both appear often. Typos are recognized by their similarity in addition to their infrequent occurrence.
Details
There are great tools like the hunspell package that allow users to fix typos for words found in dictionaries, but these functions struggle with strings like proper nouns and other specific terminology not usually found in common dictionaries. This function uses the text being cleaned as its own dictionary. It finds probable correctly spelled words based on their high occurrence and finds typos based on their low occurrence. This is based on the theory that typos will appear as infrequently used words, since no one uses them on purpose, and that they will be a short string distance from commonly occurring correctly spelled words.
Value
A reformatted vector with typos replaced by correctly spelled words.
Examples
if(interactive()){
#EXAMPLE1
}
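A brief hedged sketch of how fix_typos() might be applied (the values for thr and occ_ratio are illustrative guesses, not documented defaults):
if(interactive()){
# illustrative sketch: frequent strings act as the dictionary, rare similar strings are treated as typos
words = c(rep("maple", 12), "mpale", rep("hemlock", 9), "hemlok")
fix_typos(words, thr = 0.25, occ_ratio = 5)
}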
fuzzy_rbind
Description
fuzzy_rbind() binds dataframes based on columns with slightly different names.
Usage
fuzzy_rbind(
df1,
df2,
threshold,
method = "jw",
q = 1,
p = 0,
bt = 0,
useBytes = FALSE,
weight = c(d = 1, i = 1, t = 1)
)
Arguments
df1
The first dataframe to be bound.
df2
The second dataframe to be bound.
threshold
The maximum string distance between column names; if the distance between columns is greater than this threshold the columns will not be bound.
method
The type of string distance calculation to use. Possible methods are: osa, lv, dl, hamming, lcs, qgram, cosine, jaccard, jw, and soundex. See package stringdist for more information. Default: 'jw'
q
Size of the q-gram used in string distance calculation. Default: 1
p
Only used with method "jw", the Jaro-Winkler penalty size. Default: 0
bt
Only used with method "jw" with p > 0, Winkler's boost threshold. Default: 0
useBytes
Whether or not to perform byte-wise comparison. Default: FALSE
weight
Only used with methods "osa" or "dl", a vector representing the penalty for deletion, insertion, substitution, and transposition, in that order. Default: c(d = 1, i = 1, t = 1)
Details
When working with multiple datasets, column names are often slightly different; fuzzy_rbind() helps to bind dataframes using fuzzy matching of the column names.
Value
fuzzy_rbind() returns a dataframe that binds the two inputted dataframes based on the closest matching columns; column names from dataframe 1 are preserved.
Examples
if(interactive()){
mtcars_colnames_messy = mtcars
colnames(mtcars_colnames_messy)[1:5] = paste0(colnames(mtcars)[1:5], "_17")
colnames(mtcars_colnames_messy)[6:11] = paste0(colnames(mtcars)[6:11], "_2017")
x = fuzzy_rbind(mtcars, mtcars_colnames_messy, .5)
x = fuzzy_rbind(mtcars, mtcars_colnames_messy, .2)
}
messy_caterpillars
Description
Dataframe with messy caterpillar species names and randomly generated measurement data.
Usage
messy_caterpillars
Format
A data frame with 39 rows and 3 variables:
CaterpillarSpecies
character Full Latin names of 39 caterpillar species with spelling and formatting errors.
Avg Weight (mg)
double Randomly generated fake weight data for each caterpillar species.
Avg Length (cm)
double Randomly generated fake length data for each caterpillar species.
Details
An example dataset with messy caterpillar species names.
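A brief hedged sketch of pairing this dataset with clean_caterpillars via cat_join(); the column mapping follows the Format fields above, and the jaccard method is an arbitrary illustrative choice:
if(interactive()){
# illustrative sketch: match the messy species names against the clean reference names
cat_join(messy_caterpillars, clean_caterpillars,
by = c("CaterpillarSpecies" = "species"), method = "jaccard")
}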
messy_names.df
Description
Data set of messy names
Usage
messy_names.df
Format
A data frame with 80 rows and 2 variables:
first
character Messy first names
last
character Messy last names
Details
An example data set of messy names that can be used in testing messy.cats functions.
messy_states1
Description
US State names with 1 character randomly changed.
Usage
messy_states1
Format
A data frame with 50 rows and 1 variable:
messy_states1
character All 50 US states with 1 randomly changed character.
Details
An example dataset with misspelled US state names. The names have had 1 character randomly changed.
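A brief hedged sketch of using this dataset with state_match(); the column name follows the Format field above:
if(interactive()){
# illustrative sketch: match the perturbed names back to the built-in state.name vector
state_match(messy_states1$messy_states1)
}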
messy_states2
Description
US State names with 2 characters randomly changed.
Usage
messy_states2
Format
A data frame with 50 rows and 1 variable:
messy_states2
character All 50 US states with 2 randomly changed characters.
Details
An example dataset with misspelled US state names. The names have had 2 characters randomly changed.
picked_list
Description
Handpicked matches from the datasets in intro.rmd.
Usage
picked_list
Format
A data frame with 15 rows and 3 variables:
bad
character column of bad car names
match
character column of good car names
dists
double string distance between the good and bad car names
Details
An example dataset of matched car names.
select_metric
Description
Uses a heuristic algorithm to suggest a stringdist metric from among hamming, lv, osa, dl, lcs, and jw.
Usage
select_metric(messy_v, clean_v)
Arguments
messy_v |
a messy vector of strings |
clean_v |
a vector of strings for messy_v to be matched against |
Details
For each metric, certainty is measured via the difference between the best match for each word and the average of all matches for that word.
Value
a string representing the suggested stringdist metric
Examples
select_metric(c("aapple", "bamana", "clemtidne"), c("apple", "banana", "clementine"))
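The suggested metric can be fed straight into the method argument of the matching functions; a brief hedged sketch:
# illustrative sketch: let select_metric() pick the metric, then match with it
m <- select_metric(c("aapple", "bamana", "clemtidne"), c("apple", "banana", "clementine"))
cat_match(c("aapple", "bamana", "clemtidne"), c("apple", "banana", "clementine"), method = m)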
state.name
Description
Testing data set of state names
Usage
state.name
Format
A data frame with 50 rows and 1 variable:
states
character State names
Details
Testing data set of state names
state_match
Description
A wrapper function for cat_match() that only requires an inputted vector of messy state names. state_match() uses a built-in clean list of state names, state.name, as the reference clean vector.
Usage
state_match(messy_states, threshold = NA, p = 0)
Arguments
messy_states
Vector containing the messy state names that will be replaced by the closest match from state.name.
threshold
The maximum distance that will form a match. If this argument is specified, any element in the messy vector that has no match closer than the threshold distance will be replaced with NA. Default: NA
p
Only used with method "jw", the Jaro-Winkler penalty size. Default: 0
Details
State names are often misspelled or abbreviated in datasets, especially datasets that have been manually digitized or created. state_match() is a wrapper function of cat_match() that quickly solves this common issue of misspellings or differing formats of state names across datasets. This wrapper function uses a built-in clean list of state names, state.name, as the reference clean vector and matches your inputted messy vector of names to their nearest state in state.name.
Value
state_match() returns a cleaned version of the bad vector, with each element replaced by the most similar element of the good vector.
Examples
if(interactive()){
#EXAMPLE1
lst <- c("Indianaa", "Wisvconsin", "aLaska", "NewJersey", "Claifoarni")
matched <- state_match(lst)
}
state_replace
Description
A wrapper function for cat_replace() that only requires an inputted vector of messy US state names. state_replace() uses the built-in character vector state.name as the reference clean vector.
Usage
state_replace(messy_states, threshold = NA, p = 0)
Arguments
messy_states
Vector containing the messy state names that will be replaced by the closest match from state.name.
threshold
The maximum distance that will form a match. If this argument is specified, any element in the messy vector that has no match closer than the threshold distance will be replaced with NA. Default: NA
p
Only used with method "jw", the Jaro-Winkler penalty size. Default: 0
Details
State names are often misspelled or abbreviated in datasets, especially datasets that have been manually digitized or created. state_replace() is a wrapper function of cat_replace() that quickly solves this common issue of misspellings or differing formats of state names across datasets. This wrapper function uses a built-in clean list of state names, state.name, as the reference clean vector and replaces your inputted messy vector of names with their nearest match in state.name.
Value
state_replace() returns a cleaned version of the bad vector, with each element replaced by the most similar element of the good vector.
Examples
if(interactive()){
#EXAMPLE1
lst <- c("Indianaa", "Wisvconsin", "aLaska", "NewJersey", "Claifoarni")
fixed <- state_replace(lst)
}
typos
Description
Data set of words, some correctly spelled, some typos, with their occurrence in text
Usage
typos
Format
A data frame with 27 rows and 2 variables:
occurrence
double Number of times the word appears in the text
species
character Words in the text
Details
An example data set that can be used in testing fix_typos().