--- title: "Getting Started with unexcel" author: "Hercules Freitas" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Getting Started with unexcel} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- # Introduction Spreadsheets such as Microsoft Excel and LibreOffice Calc often auto-convert entries like 30.3 into dates. When imported into R, these values appear as Excel date serials (integers) instead of the intended numeric values. For example: - In Excel, 30.3 might be stored as 45812, which represents 2025-03-30 in the 1900 date system. - What the user originally wanted was simply 30.3. The unexcel package provides tools to safely detect these cases and reconstruct the originally intended day.month numerics. Non-serial values are preserved, and both the 1900 and 1904 date systems are supported. # Excel date systems Excel represents dates as the number of days since an origin date. - 1900 system: Day 1 is 1900-01-01. Excel mistakenly treats 1900 as a leap year. To compensate, R and most packages use "1899-12-30" as the origin, so calculations align with Excel’s behavior. - 1904 system: Day 0 is 1904-01-01, historically used by older Mac versions of Excel. Example: ```{r} as.Date(45812, origin = "1899-12-30") #> "2025-03-30" ``` Here, 45812 is the serial for March 30, 2025. # The 1900 leap-year bug Excel was originally designed to be compatible with Lotus 1-2-3, which treated 1900 as a leap year (incorrectly). This means Excel includes the non-existent date 1900-02-29 in its calendar. To correct for this, R uses an origin of "1899-12-30" rather than "1900-01-01". This offset ensures that serial numbers map correctly to the dates displayed inside Excel. # Reconstructing day.month numerics The function restore_day_month() converts Excel serials back into numerics of the form day.month. ```{r} library(unexcel) # Mixed vector: two serials + one real number x <- c(45812, 12.5, 44730) restore_day_month(x, origin_mode = "1900") #> [1] 30.3 12.5 15.6 ``` Explanation: - 45812 (2025-03-30) → 30.3 - 44730 (2022-06-15) → 15.6 - 12.5 (not a serial) → preserved as 12.5 # Guardrails against false conversions To avoid changing ordinary numbers, unexcel applies guardrails: - Only integer-like values are considered potential serials. - Values must fall in a plausible range (20000–65000, roughly 1954–2078). - The converted year must lie inside a configurable year_window (default: 1990–2035). Anything outside these conditions is left unchanged. # Difference from formatted strings By design, unexcel reconstructs numeric day.month values, not character strings: - Numeric output: 30.3 (easily compared or plotted). - Character output: "30/03" (typical formatted date). If you prefer a string, you can use R’s base functions: ```{r} format(as.Date(45812, origin = "1899-12-30"), "%d/%m") #> "30/03" ``` But the goal of unexcel is to give you back the numeric decimal you originally typed in Excel. # Working with data frames The helper fix_serial_columns() scans an entire data frame and corrects only the columns that appear dominated by Excel serials. ```{r} df <- data.frame( a = c(45812, 44730, 45900), b = c(1.2, 3.4, 5.6) ) fix_serial_columns(df, origin_mode = "1900") #> a b #> 1 30.3 1.2 #> 2 15.6 3.4 #> 3 19.4 5.6 ``` Column a was fixed; column b was left untouched. # Summary - Excel stores dates as day counts since 1900-01-01 or 1904-01-01. - Due to a leap-year bug, R uses "1899-12-30" as the origin for 1900-system serials. - The package reconstructs numeric day.month values that users originally typed, while leaving unrelated numbers intact. - Guardrails ensure safety and prevent over-conversion. - Use restore_day_month() for vectors and fix_serial_columns() for data frames. # Session information ```{r} sessionInfo() ```