convert your data types
Best practice in data analysis is to fix data types directly after importing data into R. This helps in many ways.
Additionally, if every column is converted to its appropriate data type, you won’t be surprised by data type errors the next time you run the script.
convert(.x, ...) where .x is a data frame.
... is a placeholder for data type specific conversion
functions.
convert must be used in conjunction with data type
conversion functions:
chr converts to character.
num converts to numeric.
int converts to integer.
lgl converts to logical.
fct converts to factor.
dte converts to date.
dtm converts to date time.
Imagine you have a data frame where you want to change columns:
a and b to numeric
c to date
d and e to character
Then you can write:
df %>% convert(num(a, b), dte(c), chr(d, e))
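As a minimal, self-contained sketch of the call above: the data frame df and its values below are made up for illustration and are not part of the original example. The sketch assumes that convert and the conversion functions come from the hablar package, and it loads dplyr for %>% and tibble.

```r
library(dplyr)   # provides %>% and tibble()
library(hablar)  # assumed source of convert(), num(), dte() and chr()

# Hypothetical data: numbers and dates arrived as text,
# and d and e arrived as integer and logical.
df <- tibble(
  a = c("1.5", "2.5"),
  b = c("10", "20"),
  c = c("2020-01-31", "2020-02-29"),
  d = c(1L, 2L),
  e = c(TRUE, FALSE)
)

# One convert() call handles all five columns.
converted <- df %>%
  convert(num(a, b), dte(c), chr(d, e))

# Inspect the resulting column classes.
sapply(converted, class)
```

Under these assumptions, the class listing should report numeric for a and b, Date for c, and character for d and e.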
The easiest way to understand how simple convert is to use is with examples. Have a look at the gapminder dataset from the gapminder package:
library(gapminder)
gapminder
#> # A tibble: 1,704 × 6
#>   country     continent  year lifeExp      pop gdpPercap
#>   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
#> 1 Afghanistan Asia       1952    28.8  8425333      779.
#> 2 Afghanistan Asia       1957    30.3  9240934      821.
#> 3 Afghanistan Asia       1962    32.0 10267083      853.
#> 4 Afghanistan Asia       1967    34.0 11537966      836.
#> # … with 1,700 more rows
#> # ℹ Use `print(n = ...)` to see more rows
We might want to change the country column to character instead of factor. To do this we use chr together with the column name inside convert:
gapminder %>% 
  convert(chr(country))
#> # A tibble: 1,704 × 6
#>   country     continent  year lifeExp      pop gdpPercap
#>   <chr>       <fct>     <int>   <dbl>    <int>     <dbl>
#> 1 Afghanistan Asia       1952    28.8  8425333      779.
#> 2 Afghanistan Asia       1957    30.3  9240934      821.
#> 3 Afghanistan Asia       1962    32.0 10267083      853.
#> 4 Afghanistan Asia       1967    34.0 11537966      836.
#> # … with 1,700 more rows
#> # ℹ Use `print(n = ...)` to see more rows
This converted the country column to the data type character. We do not have to repeat this whole procedure for each column if we want to convert more columns. Let’s say that we also want to convert continent to character, lifeExp to integer, pop to double and gdpPercap to numeric. This is simply done:
gapminder %>% 
  convert(chr(country, continent),
          int(lifeExp),
          dbl(pop),
          num(gdpPercap))
#> # A tibble: 1,704 × 6
#>   country     continent  year lifeExp      pop gdpPercap
#>   <chr>       <chr>     <int>   <int>    <dbl>     <dbl>
#> 1 Afghanistan Asia       1952      28  8425333      779.
#> 2 Afghanistan Asia       1957      30  9240934      821.
#> 3 Afghanistan Asia       1962      31 10267083      853.
#> 4 Afghanistan Asia       1967      34 11537966      836.
#> # … with 1,700 more rows
#> # ℹ Use `print(n = ...)` to see more rows
convert?
You can change a lot of data types with little code. For comparison, consider using mutate from dplyr to do the same operation:
gapminder %>%
  mutate(country = as.character(country),
         continent = as.character(continent),
         lifeExp = as.integer(lifeExp),
         pop = as.double(pop),
         gdpPercap = as.numeric(gdpPercap))
#> # A tibble: 1,704 × 6
#>   country     continent  year lifeExp      pop gdpPercap
#>   <chr>       <chr>     <int>   <int>    <dbl>     <dbl>
#> 1 Afghanistan Asia       1952      28  8425333      779.
#> 2 Afghanistan Asia       1957      30  9240934      821.
#> 3 Afghanistan Asia       1962      31 10267083      853.
#> 4 Afghanistan Asia       1967      34 11537966      836.
#> # … with 1,700 more rows
#> # ℹ Use `print(n = ...)` to see more rows
This gives the same result. However, you need to write each column name twice and repeat the data type conversion function for every column. Imagine the code to convert 20 columns.
dplyr does have another way of applying the same function to multiple columns that could help: mutate_at. The same example would then look like:
gapminder %>% 
  mutate_at(vars(country, continent), funs(as.character)) %>% 
  mutate_at(vars(lifeExp), funs(as.integer)) %>% 
  mutate_at(vars(pop), funs(as.double)) %>% 
  mutate_at(vars(gdpPercap), funs(as.numeric))
#> Warning: `funs()` was deprecated in dplyr 0.8.0.
#> ℹ Please use a list of either functions or lambdas:
#> 
#> # Simple named list: list(mean = mean, median = median)
#> 
#> # Auto named with `tibble::lst()`: tibble::lst(mean, median)
#> 
#> # Using lambdas list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
#> # A tibble: 1,704 × 6
#>   country     continent  year lifeExp      pop gdpPercap
#>   <chr>       <chr>     <int>   <int>    <dbl>     <dbl>
#> 1 Afghanistan Asia       1952      28  8425333      779.
#> 2 Afghanistan Asia       1957      30  9240934      821.
#> 3 Afghanistan Asia       1962      31 10267083      853.
#> 4 Afghanistan Asia       1967      34 11537966      836.
#> # … with 1,700 more rows
#> # ℹ Use `print(n = ...)` to see more rows
This scales more easily to data type conversion of a large number of variables. However, convert does the same job with much less code. In fact, convert uses mutate_at internally. The difference is in syntax and code readability. Compare again with convert:
gapminder %>% 
  convert(chr(country, continent),
          int(lifeExp),
          dbl(pop),
          num(gdpPercap))
#> # A tibble: 1,704 × 6
#>   country     continent  year lifeExp      pop gdpPercap
#>   <chr>       <chr>     <int>   <int>    <dbl>     <dbl>
#> 1 Afghanistan Asia       1952      28  8425333      779.
#> 2 Afghanistan Asia       1957      30  9240934      821.
#> 3 Afghanistan Asia       1962      31 10267083      853.
#> 4 Afghanistan Asia       1967      34 11537966      836.
#> # … with 1,700 more rows
#> # ℹ Use `print(n = ...)` to see more rows
The conversion functions used inside convert also support additional arguments to be passed on. For example, if you want to convert a number to a date and include an origin argument, you can write:
tibble(dates = c(12818, 13891),
       sunny = c("yes", "no")) %>% 
  convert(dte(dates, .args = list(origin = "1900-01-01")))
#> # A tibble: 2 × 2
#>   dates      sunny
#>   <date>     <chr>
#> 1 1935-02-05 yes  
#> 2 1938-01-13 no
convert is built upon dplyr and shares some amazing features of dplyr. For example, tidyselect works with convert, which helps you select multiple columns at the same time. As a simple example, if you want to change all columns whose names include the letter “e” to factors, you can write:
gapminder %>% 
  convert(fct(contains("e")))
#> # A tibble: 1,704 × 6
#>   country     continent year  lifeExp      pop gdpPercap  
#>   <fct>       <fct>     <fct> <fct>      <int> <fct>      
#> 1 Afghanistan Asia      1952  28.801   8425333 779.4453145
#> 2 Afghanistan Asia      1957  30.332   9240934 820.8530296
#> 3 Afghanistan Asia      1962  31.997  10267083 853.10071  
#> 4 Afghanistan Asia      1967  34.02   11537966 836.1971382
#> # … with 1,700 more rows
#> # ℹ Use `print(n = ...)` to see more rows
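For comparison, the same kind of tidyselect selection can be sketched in plain dplyr. This is only an illustration under stated assumptions: it requires dplyr 1.0.0 or later for across, it uses a small made-up tibble named mini rather than gapminder, and as.factor stands in for fct, which may differ in details.

```r
library(dplyr)

# Hypothetical miniature with a few gapminder-like columns.
mini <- tibble(
  country = c("Afghanistan", "Afghanistan"),
  year    = c(1952L, 1957L),
  lifeExp = c(28.801, 30.332),
  pop     = c(8425333L, 9240934L)
)

# contains("e") selects every column whose name includes "e":
# here year and lifeExp, but not country or pop.
as_factors <- mini %>%
  mutate(across(contains("e"), as.factor))

# Inspect the resulting column classes.
sapply(as_factors, class)
```

The selection helper is identical in both spellings. What convert saves is the surrounding mutate and across scaffolding when several different target types are involved.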