The mission of hablar is to give you non-astonishing
results! That means that functions return what you expect. R has some
unintuitive quirks that both beginners and experienced programmers fail to
identify. Some of the first weird features of R that hablar
solves:
Missing values NA and irrational values
Inf, NaN are dominant. For example, in R
sum(c(1, 2, NA)) is NA and not 3. In
hablar, adding an underscore gives
sum_(c(1, 2, NA)), which returns 3, as is often expected.
Factors (categorical variables) that are converted to numeric
return the number of the category rather than the value. In
hablar the convert() function always converts
the values themselves, not the category numbers.
Finding duplicates and rows with NA can be
cumbersome. The functions find_duplicates() and
find_na() make it easy to find where the data frame needs
to be fixed. Once the issues are found, the utility replacement
functions, e.g. if_else_(), if_na(),
zero_if(), easily fix many of the most common problems you
face.
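The first quirk listed above can be reproduced in base R alone. The sketch below uses only base functions and the hypothetical example vectors x and y; it illustrates the behaviour hablar is designed around, not how hablar is implemented.

```r
# Hypothetical example vector; only base R is used here.
x <- c(1, 2, NA)

sum(x)                  # NA: a single missing value dominates the result
sum(x, na.rm = TRUE)    # 3: missing values are dropped explicitly

# NaN and Inf dominate in the same way.
y <- c(1, 2, NaN, Inf)
sum(y)                  # NaN
sum(y[is.finite(y)])    # 3: keep only finite, non-missing values
```

Filtering with is.finite() drops NA, NaN and Inf in one step, which is why it is used for the second vector.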
hablar follows the syntax conventions of tidyverse
and works seamlessly with dplyr and
tidyselect.
A common issue in R is how R treats missing values
(i.e. NA). Sometimes NA in your data frame
means that there are missing values in the sense that you need to
estimate or replace them with values. But often it is not a problem!
Often NA means that there is no value, and there should
not be one. hablar provides useful functions that handle
NA intuitively. Let’s take a simple example:
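The code that builds this example is not shown here. A minimal sketch of one way to construct it, assuming the tibble package is installed and taking the values from the printed output that follows:

```r
library(tibble)

# Values copied from the printed tibble; Maria has no graduation date yet.
df <- tibble(
  name            = c("Fredrik", "Maria", "Astrid"),
  graduation_date = as.Date(c("2016-06-15", NA, "2014-06-15")),
  age             = c(21L, 16L, 23L)
)
df
```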
#> # A tibble: 3 × 3
#>   name    graduation_date   age
#>   <chr>   <date>          <int>
#> 1 Fredrik 2016-06-15         21
#> 2 Maria   NA                 16
#> 3 Astrid  2014-06-15         23
min() to min_()
The graduation_date is missing for Maria. In this case
it is not because we do not know. It is because she has not graduated
yet; she is younger than Fredrik and Astrid. If we would like to know
the first graduation date of the three observations in R with a naive
min() we get NA. But with min_()
from hablar we get the minimum value that is not missing.
See:
df %>% 
  mutate(min_baseR = min(graduation_date),
         min_hablar = min_(graduation_date))
#> # A tibble: 3 × 5
#>   name    graduation_date   age min_baseR min_hablar
#>   <chr>   <date>          <int> <date>    <date>    
#> 1 Fredrik 2016-06-15         21 NA        2014-06-15
#> 2 Maria   NA                 16 NA        2014-06-15
#> 3 Astrid  2014-06-15         23 NA        2014-06-15
The hablar package provides the same functionality
for max_(), mean_(), median_(), sd_(), first_() … and more. For more documentation type help(min_) or
vignette("s") for an in-depth description.
In hablar the function convert provides a
robust, readable and dynamic way to change the type of a column.
mtcars %>% 
  convert(int(cyl, am),
          num(disp:drat))
#>                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
#> Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
#> Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
#> Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
#> Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
#> Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
#> Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
#> Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
#> Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
#> Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
#> Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
#> Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
#> Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
#> Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
#> Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
#> Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
#> Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
#> Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
#> Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
#> Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
#> Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
#> AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
#> Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
#> Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
#> Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
#> Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
#> Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
#> Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
#> Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
#> Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
#> Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
The above chunk converts the columns cyl and
am to integers, and the columns disp through
drat to numeric. If a column is of type factor,
convert always converts it to character before further conversion.
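The factor behaviour described above is easy to reproduce in base R. The following sketch uses a hypothetical factor f and the common base R idiom of going through character first, which mirrors the order of conversion described here; it is not hablar's own code.

```r
# Hypothetical factor whose labels look like numbers.
f <- factor(c("10", "20", "10", "30"))

as.numeric(f)                 # 1 2 1 3: the internal category codes
as.numeric(as.character(f))   # 10 20 10 30: the values themselves
```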
With convert and tidyselect you can easily
change the type of a wide range of columns.
mtcars %>% 
  convert(
    chr(last_col()),       # Last column to character
    int(1:2),              # First two columns to integer
    fct(hp, wt),           # hp and wt to factors
    dte(vs),               # vs to date (if you really want)
    num(contains("car"))   # car as in carb to numeric
  )           
#>                     mpg cyl  disp  hp drat    wt  qsec         vs am gear carb
#> Mazda RX4            21   6 160.0 110 3.90  2.62 16.46 1970-01-01  1    4    4
#> Mazda RX4 Wag        21   6 160.0 110 3.90 2.875 17.02 1970-01-01  1    4    4
#> Datsun 710           22   4 108.0  93 3.85  2.32 18.61 1970-01-02  1    4    1
#> Hornet 4 Drive       21   6 258.0 110 3.08 3.215 19.44 1970-01-02  0    3    1
#> Hornet Sportabout    18   8 360.0 175 3.15  3.44 17.02 1970-01-01  0    3    2
#> Valiant              18   6 225.0 105 2.76  3.46 20.22 1970-01-02  0    3    1
#> Duster 360           14   8 360.0 245 3.21  3.57 15.84 1970-01-01  0    3    4
#> Merc 240D            24   4 146.7  62 3.69  3.19 20.00 1970-01-02  0    4    2
#> Merc 230             22   4 140.8  95 3.92  3.15 22.90 1970-01-02  0    4    2
#> Merc 280             19   6 167.6 123 3.92  3.44 18.30 1970-01-02  0    4    4
#> Merc 280C            17   6 167.6 123 3.92  3.44 18.90 1970-01-02  0    4    4
#> Merc 450SE           16   8 275.8 180 3.07  4.07 17.40 1970-01-01  0    3    3
#> Merc 450SL           17   8 275.8 180 3.07  3.73 17.60 1970-01-01  0    3    3
#> Merc 450SLC          15   8 275.8 180 3.07  3.78 18.00 1970-01-01  0    3    3
#> Cadillac Fleetwood   10   8 472.0 205 2.93  5.25 17.98 1970-01-01  0    3    4
#> Lincoln Continental  10   8 460.0 215 3.00 5.424 17.82 1970-01-01  0    3    4
#> Chrysler Imperial    14   8 440.0 230 3.23 5.345 17.42 1970-01-01  0    3    4
#> Fiat 128             32   4  78.7  66 4.08   2.2 19.47 1970-01-02  1    4    1
#> Honda Civic          30   4  75.7  52 4.93 1.615 18.52 1970-01-02  1    4    2
#> Toyota Corolla       33   4  71.1  65 4.22 1.835 19.90 1970-01-02  1    4    1
#> Toyota Corona        21   4 120.1  97 3.70 2.465 20.01 1970-01-02  0    3    1
#> Dodge Challenger     15   8 318.0 150 2.76  3.52 16.87 1970-01-01  0    3    2
#> AMC Javelin          15   8 304.0 150 3.15 3.435 17.30 1970-01-01  0    3    2
#> Camaro Z28           13   8 350.0 245 3.73  3.84 15.41 1970-01-01  0    3    4
#> Pontiac Firebird     19   8 400.0 175 3.08 3.845 17.05 1970-01-01  0    3    2
#> Fiat X1-9            27   4  79.0  66 4.08 1.935 18.90 1970-01-02  1    4    1
#> Porsche 914-2        26   4 120.3  91 4.43  2.14 16.70 1970-01-01  1    5    2
#> Lotus Europa         30   4  95.1 113 3.77 1.513 16.90 1970-01-02  1    5    2
#> Ford Pantera L       15   8 351.0 264 4.22  3.17 14.50 1970-01-01  1    5    4
#> Ferrari Dino         19   6 145.0 175 3.62  2.77 15.50 1970-01-01  1    5    6
#> Maserati Bora        15   8 301.0 335 3.54  3.57 14.60 1970-01-01  1    5    8
#> Volvo 142E           21   4 121.0 109 4.11  2.78 18.60 1970-01-02  1    4    2
For more information, see help(hablar) or
vignette("convert").
When cleaning data you spend a lot of time understanding your data.
Sometimes you get more rows than you expected when doing a
left_join(). Or you did not know that a certain column
contained missing values NA or irrational values like
Inf or NaN.
In hablar the find_* functions speed up
your search for the problem. To find duplicated rows you simply call
df %>% find_duplicates(). You can also find duplicates
in specific columns, which can be useful before joins.
# Create df with duplicates
df <- mtcars %>% 
  bind_rows(mtcars %>% slice(1, 5, 9))
# Return rows with duplicates in cyl and am
df %>% 
  find_duplicates(cyl, am)
#> # A tibble: 35 × 11
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  21       6   160   110  3.9   2.62  16.5     0     1     4     4
#> 2  21       6   160   110  3.9   2.88  17.0     0     1     4     4
#> 3  22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
#> 4  21.4     6   258   110  3.08  3.22  19.4     1     0     3     1
#> # … with 31 more rows
#> # ℹ Use `print(n = ...)` to see more rows
There are also find functions for other cases. For example
find_na() returns rows with missing values.
starwars %>% 
  find_na(height)
#> # A tibble: 6 × 14
#>   name         height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
#>   <chr>         <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
#> 1 Arvel Crynyd     NA    NA brown   fair    brown        NA male  mascu… <NA>   
#> 2 Finn             NA    NA black   dark    dark         NA male  mascu… <NA>   
#> 3 Rey              NA    NA brown   light   hazel        NA fema… femin… <NA>   
#> 4 Poe Dameron      NA    NA brown   light   brown        NA male  mascu… <NA>   
#> # … with 2 more rows, 4 more variables: species <chr>, films <list>,
#> #   vehicles <list>, starships <list>, and abbreviated variable names
#> #   ¹hair_color, ²skin_color, ³eye_color, ⁴birth_year, ⁵homeworld
#> # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
If you would rather have a Boolean value, then
e.g. check_duplicates() returns TRUE if the
data frame contains duplicates, otherwise it returns
FALSE.
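For comparison, similar checks can be sketched in base R with duplicated() and anyDuplicated(). This is only an approximation for illustration, rebuilt on the same mtcars example; the object names key, dup_rows and has_dups are hypothetical, and this is not how hablar implements its functions.

```r
# Same setup as the earlier example: mtcars plus three repeated rows.
df <- rbind(mtcars, mtcars[c(1, 5, 9), ])

# Rows whose (cyl, am) combination occurs more than once.
key <- df[, c("cyl", "am")]
dup_rows <- df[duplicated(key) | duplicated(key, fromLast = TRUE), ]
nrow(dup_rows)

# TRUE if any complete row is repeated.
has_dups <- anyDuplicated(df) > 0
has_dups
```

Combining duplicated() with its fromLast = TRUE variant keeps every member of a duplicated group, not only the later repeats.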
Let’s say that we have found that a problem is caused by missing values in
the column height and we want to replace all missing
values with the integer 100. hablar comes with
additional ways of doing if-or-else.
starwars %>% 
  find_na(height) %>% 
  mutate(height = if_na(height, 100L))
#> # A tibble: 6 × 14
#>   name         height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
#>   <chr>         <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
#> 1 Arvel Crynyd    100    NA brown   fair    brown        NA male  mascu… <NA>   
#> 2 Finn            100    NA black   dark    dark         NA male  mascu… <NA>   
#> 3 Rey             100    NA brown   light   hazel        NA fema… femin… <NA>   
#> 4 Poe Dameron     100    NA brown   light   brown        NA male  mascu… <NA>   
#> # … with 2 more rows, 4 more variables: species <chr>, films <list>,
#> #   vehicles <list>, starships <list>, and abbreviated variable names
#> #   ¹hair_color, ²skin_color, ³eye_color, ⁴birth_year, ⁵homeworld
#> # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
In the chunk above we successfully replaced all missing heights with
the integer 100. hablar also contains the self-explanatory
if_zero() and zero_if(), if_inf() and inf_if(), and if_nan() and nan_if(), which work in the same way as the examples above.
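As a point of reference, the replacement of missing values shown above can be approximated in base R with replace(). The vectors height and ratio below are hypothetical stand-ins, and the sketch does not reproduce any type checking that the hablar functions may perform.

```r
# Hypothetical integer vector with missing values.
height <- c(172L, NA, 96L, NA)

# Replace every NA with the integer 100.
height_filled <- replace(height, is.na(height), 100L)
height_filled   # 172 100 96 100

# The same pattern works for infinite values.
ratio <- c(0.5, Inf, 2)
ratio_clean <- replace(ratio, is.infinite(ratio), NA)
ratio_clean     # 0.5 NA 2.0
```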
The generic function if_else_() provides the same
rigidity as if_else() in dplyr but adds some
flexibility. In dplyr you need to specify which type
NA should have. In if_else_() you can
write:
starwars %>% 
  mutate(skin_color = if_else_(hair_color == "brown", NA, hair_color))
#> # A tibble: 87 × 14
#>   name         height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
#>   <chr>         <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
#> 1 Luke Skywal…    172    77 blond   blond   blue       19   male  mascu… Tatooi…
#> 2 C-3PO           167    75 <NA>    <NA>    yellow    112   none  mascu… Tatooi…
#> 3 R2-D2            96    32 <NA>    <NA>    red        33   none  mascu… Naboo  
#> 4 Darth Vader     202   136 none    none    yellow     41.9 male  mascu… Tatooi…
#> # … with 83 more rows, 4 more variables: species <chr>, films <list>,
#> #   vehicles <list>, starships <list>, and abbreviated variable names
#> #   ¹hair_color, ²skin_color, ³eye_color, ⁴birth_year, ⁵homeworld
#> # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
In if_else() from dplyr you would have had
to specify NA_character_.
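The need for a typed NA comes from the fact that a bare NA is a logical value in R. The short base R sketch below, using a hypothetical hair_color vector, shows the different NA types and how base ifelse(), which does not check types, handles the same replacement.

```r
typeof(NA)              # "logical"
typeof(NA_character_)   # "character"

# Hypothetical stand-in for the starwars column.
hair_color <- c("blond", NA, "brown", "none")

# Base ifelse() does no type checking; a typed NA keeps the intent explicit.
res <- ifelse(hair_color == "brown", NA_character_, hair_color)
res   # "blond" NA NA "none"
```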