1 Introduction

The dataSDA package (v0.1.8) collects symbolic datasets from a range of research areas and provides functions for reading, writing, converting, and analyzing symbolic data. The package is available on CRAN at https://CRAN.R-project.org/package=dataSDA and on GitHub at https://github.com/hanmingwu1103/dataSDA.

The package provides functions organized into the following categories:

Category	Functions	Count
Format detection & conversion	`int_detect_format`, `int_list_conversions`, `int_convert_format`, `RSDA_to_MM`, `iGAP_to_MM`, `SODAS_to_MM`, `MM_to_iGAP`, `RSDA_to_iGAP`, `SODAS_to_iGAP`, `MM_to_RSDA`, `iGAP_to_RSDA`	11
Core statistics	`int_mean`, `int_var`, `int_cov`, `int_cor`	4
Geometric properties	`int_width`, `int_radius`, `int_center`, `int_midrange`, `int_overlap`, `int_containment`	6
Position & scale	`int_median`, `int_quantile`, `int_range`, `int_iqr`, `int_mad`, `int_mode`	6
Robust statistics	`int_trimmed_mean`, `int_winsorized_mean`, `int_trimmed_var`, `int_winsorized_var`	4
Distribution shape	`int_skewness`, `int_kurtosis`, `int_symmetry`, `int_tailedness`	4
Similarity measures	`int_jaccard`, `int_dice`, `int_cosine`, `int_overlap_coefficient`, `int_tanimoto`, `int_similarity_matrix`	6
Uncertainty & variability	`int_entropy`, `int_cv`, `int_dispersion`, `int_imprecision`, `int_granularity`, `int_uniformity`, `int_information_content`	7
Distance measures	`int_dist`, `int_dist_matrix`, `int_pairwise_dist`, `int_dist_all`	4
Histogram statistics	`hist_mean`, `hist_var`, `hist_cov`, `hist_cor`	4
Utilities	`clean_colnames`, `RSDA_format`, `set_variable_format`, `write_csv_table`	4

2 Data Formats and Conversion

2.1 Interval data formats overview

The dataSDA package works with three primary formats for interval-valued data:

RSDA format: symbolic_tbl objects where intervals are encoded as complex numbers (min + max*i). Used by the RSDA package.
MM format: Standard data frames with paired _min / _max columns for each variable.
iGAP format: Data frames where each interval is a comma-separated string (e.g., "2.5,4.0").
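
To make the three encodings concrete, the following base-R sketch writes the single interval [3, 8] each way. It does not call dataSDA; the object names (rsda_style, mm_style, igap_style) and the column stem x are illustrative assumptions.

```r
# One interval [3, 8] written in each of the three encodings (base R only).
lower <- 3
upper <- 8

# RSDA-style: a complex number, real part = min, imaginary part = max
rsda_style <- complex(real = lower, imaginary = upper)

# MM-style: a pair of numeric columns with _min / _max suffixes
mm_style <- data.frame(x_min = lower, x_max = upper)

# iGAP-style: a single comma-separated string
igap_style <- paste(lower, upper, sep = ",")

print(rsda_style)
print(mm_style)
print(igap_style)
```

The endpoints can be read back from the complex encoding with Re() and Im().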

data(mushroom.int)
head(mushroom.int, 3)
#> # A tibble: 3 × 5
#>   Species Pileus.Cap.Width   Stipe.Length Stipe.Thickness Edibility
#>   <chr>         <symblc_n>     <symblc_n>      <symblc_n> <chr>    
#> 1 arorae     [3.00 : 8.00]  [4.00 : 9.00]   [0.50 : 2.50] U        
#> 2 arvenis   [6.00 : 21.00] [4.00 : 14.00]   [1.00 : 3.50] Y        
#> 3 benesi     [4.00 : 8.00] [5.00 : 11.00]   [1.00 : 2.00] Y
class(mushroom.int)
#> [1] "symbolic_tbl" "tbl_df"       "tbl"          "data.frame"

data(abalone.int)
head(abalone.int, 3)
#>         Length_min Length_max Diameter_min Diameter_max Height_min Height_max
#> F-10-12     0.1275     0.9975        0.075        0.815    -0.0175     0.3125
#> F-13-15     0.1775     1.0275        0.125        0.825      0.025      0.325
#> F-16-18       0.22       0.92       0.1725       0.7425     0.0375     0.3075
#>         Whole_min Whole_max Shucked_min Shucked_max Viscera_min Viscera_max
#> F-10-12    -1.021     3.883     -0.6322      2.1948     -0.2077      0.7712
#> F-13-15   -0.8567    3.6303     -0.4548      1.7942     -0.1905      0.7555
#> F-16-18   -0.5725    3.1235      -0.244       1.206     -0.1037      0.6752
#>         Shell_min Shell_max
#> F-10-12    -0.258     1.054
#> F-13-15    -0.269     1.153
#> F-16-18   -0.3233    1.4477
class(abalone.int)
#> [1] "data.frame"

data(abalone.iGAP)
head(abalone.iGAP, 3)
#>                 Length       Diameter          Height           Whole
#> F-10-12  0.1275,0.9975   0.075, 0.815 -0.0175, 0.3125   -1.021, 3.883
#> F-13-15  0.1775,1.0275    0.125,0.825    0.025, 0.325 -0.8567, 3.6303
#> F-16-18      0.22,0.92 0.1725, 0.7425  0.0375, 0.3075 -0.5725, 3.1235
#>                 Shucked         Viscera           Shell
#> F-10-12 -0.6322, 2.1948 -0.2077, 0.7712   -0.258, 1.054
#> F-13-15 -0.4548, 1.7942 -0.1905, 0.7555   -0.269, 1.153
#> F-16-18   -0.244, 1.206 -0.1037, 0.6752 -0.3233, 1.4477
class(abalone.iGAP)
#> [1] "data.frame"

The int_detect_format() function automatically identifies the format of a dataset:

int_detect_format(mushroom.int)
#> [1] "RSDA"
int_detect_format(abalone.int)
#> [1] "MM"
int_detect_format(abalone.iGAP)
#> [1] "iGAP"
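
How int_detect_format() decides is not documented here. The sketch below is one plausible heuristic, written only to show which features distinguish the formats; guess_interval_format() is a hypothetical helper and is not the package's implementation.

```r
# Hypothetical format guesser based on class, column names, and cell contents.
guess_interval_format <- function(x) {
  # RSDA objects carry the "symbolic_tbl" class
  if (inherits(x, "symbolic_tbl")) return("RSDA")

  # MM data frames have paired _min / _max column names
  nm <- names(x)
  if (any(grepl("_min$", nm)) && any(grepl("_max$", nm))) return("MM")

  # iGAP data frames hold "lower,upper" strings in character columns
  pair_pattern <- "^[[:space:]]*-?[0-9.]+[[:space:]]*,[[:space:]]*-?[0-9.]+[[:space:]]*$"
  is_pair <- vapply(
    x,
    function(col) is.character(col) && all(grepl(pair_pattern, col)),
    logical(1)
  )
  if (any(is_pair)) return("iGAP")

  "unknown"
}

mm_example   <- data.frame(Length_min = 0.1275, Length_max = 0.9975)
igap_example <- data.frame(Length = "0.1275,0.9975", stringsAsFactors = FALSE)

guess_interval_format(mm_example)
guess_interval_format(igap_example)
```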

Use int_list_conversions() to see all available format conversion paths:

int_list_conversions()
#>    from   to function_name
#> 1  RSDA   MM    RSDA_to_MM
#> 2  RSDA iGAP  RSDA_to_iGAP
#> 3  iGAP   MM    iGAP_to_MM
#> 4 SODAS   MM   SODAS_to_MM
#> 5 SODAS iGAP SODAS_to_iGAP
#> 6    MM iGAP    MM_to_iGAP
#> 7    MM RSDA    MM_to_RSDA
#> 8  iGAP RSDA  iGAP_to_RSDA

2.2 Unified format conversion

The int_convert_format() function provides a unified interface for converting between formats. It auto-detects the source format and applies the appropriate conversion:

# RSDA to MM
mushroom.MM <- int_convert_format(mushroom.int, to = "MM")
head(mushroom.MM, 3)
#>   Species Pileus.Cap.Width_min Pileus.Cap.Width_max Stipe.Length_min
#> 1  arorae                    3                    8                4
#> 2 arvenis                    6                   21                4
#> 3  benesi                    4                    8                5
#>   Stipe.Length_max Stipe.Thickness_min Stipe.Thickness_max Edibility
#> 1                9                 0.5                 2.5         U
#> 2               14                 1.0                 3.5         Y
#> 3               11                 1.0                 2.0         Y

# iGAP to MM
abalone.MM <- int_convert_format(abalone.iGAP, to = "MM")
head(abalone.MM, 3)
#>         Length_min Length_max Diameter_min Diameter_max Height_min Height_max
#> F-10-12     0.1275     0.9975        0.075        0.815    -0.0175     0.3125
#> F-13-15     0.1775     1.0275        0.125        0.825      0.025      0.325
#> F-16-18       0.22       0.92       0.1725       0.7425     0.0375     0.3075
#>         Whole_min Whole_max Shucked_min Shucked_max Viscera_min Viscera_max
#> F-10-12    -1.021     3.883     -0.6322      2.1948     -0.2077      0.7712
#> F-13-15   -0.8567    3.6303     -0.4548      1.7942     -0.1905      0.7555
#> F-16-18   -0.5725    3.1235      -0.244       1.206     -0.1037      0.6752
#>         Shell_min Shell_max
#> F-10-12    -0.258     1.054
#> F-13-15    -0.269     1.153
#> F-16-18   -0.3233    1.4477

# iGAP to RSDA
data(face.iGAP)
face.RSDA <- int_convert_format(face.iGAP, to = "RSDA")
head(face.RSDA, 3)
#>               AD        BC             AH             DH           EH
#> 1 155.00+157.00i 58+61.01i 100.45+103.28i 105.00+107.30i 61.40+65.73i
#> 2 154.00+160.01i 57+64.00i 101.98+105.55i 104.35+107.30i 60.88+63.03i
#> 3 154.01+161.00i 57+63.00i  99.36+105.65i 101.04+109.04i 60.95+65.60i
#>             GH
#> 1 64.20+67.80i
#> 2 62.94+66.47i
#> 3 60.42+66.40i

2.3 Direct conversion functions

For explicit control, direct conversion functions are available:

# RSDA to MM
mushroom.MM <- RSDA_to_MM(mushroom.int, RSDA = TRUE)
head(mushroom.MM, 3)
#>   Species Pileus.Cap.Width_min Pileus.Cap.Width_max Stipe.Length_min
#> 1  arorae                    3                    8                4
#> 2 arvenis                    6                   21                4
#> 3  benesi                    4                    8                5
#>   Stipe.Length_max Stipe.Thickness_min Stipe.Thickness_max Edibility
#> 1                9                 0.5                 2.5         U
#> 2               14                 1.0                 3.5         Y
#> 3               11                 1.0                 2.0         Y

# MM to iGAP
mushroom.iGAP <- MM_to_iGAP(mushroom.MM)
head(mushroom.iGAP, 3)
#>   Species Pileus.Cap.Width Stipe.Length Stipe.Thickness Edibility
#> 1  arorae              3,8          4,9         0.5,2.5         U
#> 2 arvenis             6,21         4,14           1,3.5         Y
#> 3  benesi              4,8         5,11             1,2         Y

# iGAP to MM
data(face.iGAP)
face.MM <- iGAP_to_MM(face.iGAP, location = 1:6)
head(face.MM, 3)
#>      AD_min AD_max BC_min BC_max AH_min AH_max DH_min DH_max EH_min EH_max
#> FRA1 155.00 157.00  58.00  61.01 100.45 103.28 105.00 107.30  61.40  65.73
#> FRA2 154.00 160.01  57.00  64.00 101.98 105.55 104.35 107.30  60.88  63.03
#> FRA3 154.01 161.00  57.00  63.00  99.36 105.65 101.04 109.04  60.95  65.60
#>      GH_min GH_max
#> FRA1  64.20  67.80
#> FRA2  62.94  66.47
#> FRA3  60.42  66.40

# MM to RSDA
face.RSDA <- MM_to_RSDA(face.MM)
head(face.RSDA, 3)
#>               AD        BC             AH             DH           EH
#> 1 155.00+157.00i 58+61.01i 100.45+103.28i 105.00+107.30i 61.40+65.73i
#> 2 154.00+160.01i 57+64.00i 101.98+105.55i 104.35+107.30i 60.88+63.03i
#> 3 154.01+161.00i 57+63.00i  99.36+105.65i 101.04+109.04i 60.95+65.60i
#>             GH
#> 1 64.20+67.80i
#> 2 62.94+66.47i
#> 3 60.42+66.40i
class(face.RSDA)
#> [1] "symbolic_tbl" "data.frame"

# iGAP to RSDA (direct, one-step)
abalone.RSDA <- iGAP_to_RSDA(abalone.iGAP, location = 1:7)
head(abalone.RSDA, 3)
#>           Length       Diameter          Height           Whole         Shucked
#> 1 0.1275+0.9975i 0.0750+0.8150i -0.0175+0.3125i -1.0210+3.8830i -0.6322+2.1948i
#> 2 0.1775+1.0275i 0.1250+0.8250i  0.0250+0.3250i -0.8567+3.6303i -0.4548+1.7942i
#> 3 0.2200+0.9200i 0.1725+0.7425i  0.0375+0.3075i -0.5725+3.1235i -0.2440+1.2060i
#>           Viscera           Shell
#> 1 -0.2077+0.7712i -0.2580+1.0540i
#> 2 -0.1905+0.7555i -0.2690+1.1530i
#> 3 -0.1037+0.6752i -0.3233+1.4477i
class(abalone.RSDA)
#> [1] "symbolic_tbl" "data.frame"

# RSDA to iGAP
mushroom.iGAP2 <- RSDA_to_iGAP(mushroom.int)
head(mushroom.iGAP2, 3)
#>   Species Pileus.Cap.Width Stipe.Length Stipe.Thickness Edibility
#> 1  arorae              3,8          4,9         0.5,2.5         U
#> 2 arvenis             6,21         4,14           1,3.5         Y
#> 3  benesi              4,8         5,11             1,2         Y

The SODAS_to_MM() and SODAS_to_iGAP() functions convert SODAS XML files but require an XML file path and are not demonstrated here.
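
The iGAP and MM encodings store the same endpoints, so the round trip between them can be sketched in base R. The helpers igap_to_mm_sketch() and mm_to_igap_sketch() below are hypothetical stand-ins written for illustration; they assume every column is interval-valued and are not the package's iGAP_to_MM() or MM_to_iGAP().

```r
# Two abalone rows in iGAP form, copied from the printed data above
igap <- data.frame(
  Length   = c("0.1275,0.9975", "0.1775,1.0275"),
  Diameter = c("0.075, 0.815", "0.125,0.825"),
  row.names = c("F-10-12", "F-13-15"),
  stringsAsFactors = FALSE
)

# Split each "lower,upper" string into numeric _min / _max columns
igap_to_mm_sketch <- function(df) {
  cols <- lapply(names(df), function(v) {
    parts <- strsplit(df[[v]], ",", fixed = TRUE)
    out <- data.frame(
      as.numeric(vapply(parts, `[`, character(1), 1)),
      as.numeric(vapply(parts, `[`, character(1), 2))
    )
    names(out) <- paste0(v, c("_min", "_max"))
    out
  })
  res <- do.call(cbind, cols)
  rownames(res) <- rownames(df)
  res
}

# Paste each _min / _max pair back into one comma-separated string
mm_to_igap_sketch <- function(df) {
  vars <- unique(sub("_(min|max)$", "", names(df)))
  cols <- lapply(vars, function(v) {
    paste(df[[paste0(v, "_min")]], df[[paste0(v, "_max")]], sep = ",")
  })
  names(cols) <- vars
  res <- as.data.frame(cols, stringsAsFactors = FALSE)
  rownames(res) <- rownames(df)
  res
}

mm   <- igap_to_mm_sketch(igap)
back <- mm_to_igap_sketch(mm)
print(mm)
print(back)
```

The round trip normalizes whitespace, so "0.075, 0.815" comes back as "0.075,0.815".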

2.4 Legacy workflow: creating symbolic_tbl from raw data

The traditional workflow for converting a raw data frame into the symbolic_tbl class used by RSDA involves several steps. We illustrate with the mushroom dataset, which contains 23 species described by 3 interval-valued variables and 2 categorical variables.

data(mushroom)
head(mushroom, 3)
#>   Species Pileus.Cap.Width_min Pileus.Cap.Width_max Stipe.Length_min
#> 1  arorae                    3                    8                4
#> 2 arvenis                    6                   21                4
#> 3  benesi                    4                    8                5
#>   Stipe.Length_max Stipe.Thickness_min Stipe.Thickness_max Edibility
#> 1                9                 0.5                 2.5         U
#> 2               14                 1.0                 3.5         Y
#> 3               11                 1.0                 2.0         Y

First, use set_variable_format() to create pseudo-variables for each category using one-hot encoding:

mushroom_set <- set_variable_format(data = mushroom, location = 8,
                                    var = "Species")
head(mushroom_set, 3)
#>   Species arorae arvenis benesi bernardii bisporus bitorquis califorinus
#> 1      23      1       0      0         0        0         0           0
#> 2      23      0       1      0         0        0         0           0
#> 3      23      0       0      1         0        0         0           0
#>   campestris comtulus cupreo-brunneus diminutives fuseo-fibrillosus
#> 1          0        0               0           0                 0
#> 2          0        0               0           0                 0
#> 3          0        0               0           0                 0
#>   fuscovelatus hondensis lilaceps micromegathus praeclaresquamosus pattersonae
#> 1            0         0        0             0                  0           0
#> 2            0         0        0             0                  0           0
#> 3            0         0        0             0                  0           0
#>   perobscurus semotus silvicola subrutilescens xanthodermus
#> 1           0       0         0              0            0
#> 2           0       0         0              0            0
#> 3           0       0         0              0            0
#>   Pileus.Cap.Width_min Pileus.Cap.Width_max Stipe.Length_min Stipe.Length_max
#> 1                    3                    8                4                9
#> 2                    6                   21                4               14
#> 3                    4                    8                5               11
#>   Stipe.Thickness_min Stipe.Thickness_max Edibility U Y T
#> 1                 0.5                 2.5         3 1 0 0
#> 2                 1.0                 3.5         3 0 1 0
#> 3                 1.0                 2.0         3 0 1 0
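
The indicator columns in this output follow ordinary one-hot encoding. A minimal base-R sketch of that step for three species is shown below; it reproduces only the 0/1 indicator block, not the other columns that set_variable_format() returns, and the object names are illustrative.

```r
# One-hot encode a categorical vector: one 0/1 column per distinct level
species <- c("arorae", "arvenis", "benesi")

indicators <- sapply(
  unique(species),
  function(level) as.integer(species == level)
)

print(indicators)
```

Each row contains exactly one 1, in the column named after that row's species.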

Next, apply RSDA_format() to prefix each variable with $I (interval) or $S (set) tags:

mushroom_tmp <- RSDA_format(data = mushroom_set,
                            sym_type1 = c("I", "I", "I", "S"),
                            location = c(25, 27, 29, 31),
                            sym_type2 = c("S"),
                            var = c("Species"))
head(mushroom_tmp, 3)
#>   $S Species arorae arvenis benesi bernardii bisporus bitorquis califorinus
#> 1 $S      23      1       0      0         0        0         0           0
#> 2 $S      23      0       1      0         0        0         0           0
#> 3 $S      23      0       0      1         0        0         0           0
#>   campestris comtulus cupreo-brunneus diminutives fuseo-fibrillosus
#> 1          0        0               0           0                 0
#> 2          0        0               0           0                 0
#> 3          0        0               0           0                 0
#>   fuscovelatus hondensis lilaceps micromegathus praeclaresquamosus pattersonae
#> 1            0         0        0             0                  0           0
#> 2            0         0        0             0                  0           0
#> 3            0         0        0             0                  0           0
#>   perobscurus semotus silvicola subrutilescens xanthodermus $I
#> 1           0       0         0              0            0 $I
#> 2           0       0         0              0            0 $I
#> 3           0       0         0              0            0 $I
#>   Pileus.Cap.Width_min Pileus.Cap.Width_max $I Stipe.Length_min
#> 1                    3                    8 $I                4
#> 2                    6                   21 $I                4
#> 3                    4                    8 $I                5
#>   Stipe.Length_max $I Stipe.Thickness_min Stipe.Thickness_max $S Edibility U Y
#> 1                9 $I                 0.5                 2.5 $S         3 1 0
#> 2               14 $I                 1.0                 3.5 $S         3 0 1
#> 3               11 $I                 1.0                 2.0 $S         3 0 1
#>   T
#> 1 0
#> 2 0
#> 3 0

Clean up variable names with clean_colnames() and write to CSV with write_csv_table():

mushroom_clean <- clean_colnames(data = mushroom_tmp)
head(mushroom_clean, 3)
#>   $S Species arorae arvenis benesi bernardii bisporus bitorquis califorinus
#> 1 $S      23      1       0      0         0        0         0           0
#> 2 $S      23      0       1      0         0        0         0           0
#> 3 $S      23      0       0      1         0        0         0           0
#>   campestris comtulus cupreo-brunneus dutives fuseo-fibrillosus fuscovelatus
#> 1          0        0               0       0                 0            0
#> 2          0        0               0       0                 0            0
#> 3          0        0               0       0                 0            0
#>   hondensis lilaceps micromegathus praeclaresquamosus pattersonae perobscurus
#> 1         0        0             0                  0           0           0
#> 2         0        0             0                  0           0           0
#> 3         0        0             0                  0           0           0
#>   semotus silvicola subrutilescens xanthodermus $I Pileus.Cap.Width
#> 1       0         0              0            0 $I                3
#> 2       0         0              0            0 $I                6
#> 3       0         0              0            0 $I                4
#>   Pileus.Cap.Width $I Stipe.Length Stipe.Length $I Stipe.Thickness
#> 1                8 $I            4            9 $I             0.5
#> 2               21 $I            4           14 $I             1.0
#> 3                8 $I            5           11 $I             1.0
#>   Stipe.Thickness $S Edibility U Y T
#> 1             2.5 $S         3 1 0 0
#> 2             3.5 $S         3 0 1 0
#> 3             2.0 $S         3 0 1 0

write_csv_table(data = mushroom_clean, file = "mushroom_interval.csv")
mushroom_int <- read.sym.table(file = "mushroom_interval.csv",
                               header = TRUE, sep = ";", dec = ".",
                               row.names = 1)
head(mushroom_int, 3)
#> # A tibble: 3 × 5
#>      Species Pileus.Cap.Width   Stipe.Length Stipe.Thickness  Edibility
#>   <symblc_s>       <symblc_n>     <symblc_n>      <symblc_n> <symblc_s>
#> 1   {arorae}    [3.00 : 8.00]  [4.00 : 9.00]   [0.50 : 2.50]        {U}
#> 2  {arvenis}   [6.00 : 21.00] [4.00 : 14.00]   [1.00 : 3.50]        {Y}
#> 3   {benesi}    [4.00 : 8.00] [5.00 : 11.00]   [1.00 : 2.00]        {Y}
class(mushroom_int)
#> [1] "symbolic_tbl" "tbl_df"       "tbl"          "data.frame"

Note: The MM_to_RSDA() function provides a simpler one-step alternative to this workflow.

2.5 Histogram data: the MatH class

Histogram-valued data uses the MatH class from the HistDAWass package. The built-in BLOOD dataset is a MatH object with 14 patient groups and 3 distributional variables:

BLOOD[1:3, 1:2]
#> a matrix of distributions 
#>  2  variables  3  rows 
#>  each distibution in the cell is represented by the mean and the standard deviation 
#>                  Cholesterol               Hemoglobin        
#> u1: F-20  [m= 150.1  ,s= 26.336 ]  [m= 13.695  ,s= 0.55031 ]
#> u2: F-30 [m= 150.71  ,s= 25.284 ]  [m= 12.158  ,s= 0.52834 ]
#> u3: F-40 [m= 164.96  ,s= 25.334 ]  [m= 12.134  ,s= 0.50739 ]

Below we illustrate constructing a MatH object from raw histogram data:

A1 <- c(50, 60, 70, 80, 90, 100, 110, 120)
B1 <- c(0.00, 0.02, 0.08, 0.32, 0.62, 0.86, 0.92, 1.00)
A2 <- c(50, 60, 70, 80, 90, 100, 110, 120)
B2 <- c(0.00, 0.05, 0.12, 0.42, 0.68, 0.88, 0.94, 1.00)
A3 <- c(50, 60, 70, 80, 90, 100, 110, 120)
B3 <- c(0.00, 0.03, 0.24, 0.36, 0.75, 0.85, 0.98, 1.00)

ListOfWeight <- list(
  distributionH(A1, B1),
  distributionH(A2, B2),
  distributionH(A3, B3)
)

Weight <- methods::new("MatH",
                       nrows = 3, ncols = 1, ListOfDist = ListOfWeight,
                       names.rows = c("20s", "30s", "40s"),
                       names.cols = c("weight"), by.row = FALSE)
Weight
#> a matrix of distributions 
#>  1  variables  3  rows 
#>  each distibution in the cell is represented by the mean and the standard deviation 
#>              weight        
#> 20s [m= 86.8  ,s= 13.824 ]
#> 30s [m= 84.1  ,s= 14.44 ] 
#> 40s [m= 82.9  ,s= 14.385 ]
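
The m and s values printed for each cell can be checked by hand. The sketch below assumes that mass is spread uniformly within each bin, so the histogram is a mixture of uniform distributions; it uses base R only and the object names are illustrative.

```r
# Breakpoints and cumulative probabilities of the first histogram (A1, B1)
breaks <- c(50, 60, 70, 80, 90, 100, 110, 120)
cum_p  <- c(0.00, 0.02, 0.08, 0.32, 0.62, 0.86, 0.92, 1.00)

p      <- diff(cum_p)                                 # probability mass per bin
mids   <- (head(breaks, -1) + tail(breaks, -1)) / 2   # bin midpoints
widths <- diff(breaks)                                # bin widths

# Mean and variance of a mixture of uniforms, one component per bin:
# E[X] = sum p_k * mid_k,  E[X^2] = sum p_k * (mid_k^2 + width_k^2 / 12)
h_mean <- sum(p * mids)
h_var  <- sum(p * (mids^2 + widths^2 / 12)) - h_mean^2
h_sd   <- sqrt(h_var)

round(c(mean = h_mean, sd = h_sd), 3)
```

Under this assumption the result agrees with the 20s row printed above (m = 86.8, s = 13.824).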

3 The Eight Interval Methods

Many dataSDA functions accept a method parameter that determines how interval boundaries are used in computations. The eight available methods (Wu, Kao and Chen, 2020) are:

Method	Name	Description
CM	Center Method	Uses the midpoint (center) of each interval
VM	Vertices Method	Uses both endpoints of the intervals
QM	Quantile Method	Uses a quantile-based representation
SE	Stacked Endpoints Method	Stacks the lower and upper values of an interval
FV	Fitted Values Method	Fits a linear regression model
EJD	Empirical Joint Density Method	Joint distribution of lower and upper bounds
GQ	Symbolic Covariance Method	Alternative expression of the symbolic sample variance
SPT	Total Sum of Products	Decomposition of the SPT

Quick demonstration:

data(mushroom.int)
var_name <- c("Stipe.Length", "Stipe.Thickness")
int_mean(mushroom.int, var_name, method = c("CM", "FV", "EJD"))
#>     Stipe.Length Stipe.Thickness
#> CM      7.391304        1.823913
#> FV     10.304348        2.371739
#> EJD     7.391304        1.823913
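
The methods differ in how each interval is turned into ordinary numbers before a statistic is computed. As a rough illustration, the sketch below builds the center representation (CM) and a stacked-endpoints representation (one reading of the SE description) for the Stipe.Length intervals of the first three observations. It is an assumption-laden sketch in base R, not the package's code, and three rows will not reproduce the full-data values above.

```r
# Stipe.Length endpoints of the first three observations
lower <- c(4, 4, 5)
upper <- c(9, 14, 11)

# Center Method: one midpoint per interval (n values)
cm_repr <- (lower + upper) / 2

# Stacked endpoints: lower and upper values in one vector (2n values)
se_repr <- c(lower, upper)

mean(cm_repr)
mean(se_repr)
```

The two means coincide in this sketch because the average of all endpoints equals the average of the midpoints; spread-based statistics such as the variance generally differ between the two representations.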

4 Descriptive Statistics for Interval-Valued Data

The core statistical functions int_mean, int_var, int_cov, and int_cor compute descriptive statistics for interval-valued data across any combination of the eight methods.

4.1 Mean and variance

data(mushroom.int)

# Mean of a single variable (default method = "CM")
int_mean(mushroom.int, var_name = "Pileus.Cap.Width")
#>    Pileus.Cap.Width
#> CM         7.978261

# Mean with multiple variables and methods
var_name <- c("Stipe.Length", "Stipe.Thickness")
method <- c("CM", "FV", "EJD")
int_mean(mushroom.int, var_name, method)
#>     Stipe.Length Stipe.Thickness
#> CM      7.391304        1.823913
#> FV     10.304348        2.371739
#> EJD     7.391304        1.823913

# Variance
int_var(mushroom.int, var_name, method)
#>     Stipe.Length Stipe.Thickness
#> CM      9.544466       0.9872431
#> FV     13.858573       1.1729910
#> EJD    12.651229       1.0836673

4.2 Covariance and correlation

Note: EJD, GQ, and SPT methods require character variable names (not numeric indices).

var_name1 <- "Pileus.Cap.Width"
var_name2 <- c("Stipe.Length", "Stipe.Thickness")
method <- c("CM", "VM", "QM", "SE", "FV", "EJD", "GQ", "SPT")

int_cov(mushroom.int, var_name1, var_name2, method)
#> $CM
#>                  Stipe.Length Stipe.Thickness
#> Pileus.Cap.Width     8.417984        2.480657
#> 
#> $VM
#>                  Stipe.Length Stipe.Thickness
#> Pileus.Cap.Width     8.095985        2.385769
#> 
#> $QM
#>                  Stipe.Length Stipe.Thickness
#> Pileus.Cap.Width     14.02174        3.523277
#> 
#> $SE
#>                  Stipe.Length Stipe.Thickness
#> Pileus.Cap.Width     20.18647        4.714976
#> 
#> $FV
#>                  Stipe.Length Stipe.Thickness
#> Pileus.Cap.Width     10.37204        2.994745
#> 
#> $EJD
#>                  Stipe.Length Stipe.Thickness
#> Pileus.Cap.Width     8.051985        2.372802
#> 
#> $GQ
#>                  Stipe.Length Stipe.Thickness
#> Pileus.Cap.Width     11.46091        3.243229
#> 
#> $SPT
#>                  Stipe.Length Stipe.Thickness
#> Pileus.Cap.Width     11.95054         3.11936
int_cor(mushroom.int, var_name1, var_name2, method)
#> $CM
#>                  Stipe.Length Stipe.Thickness
#> Pileus.Cap.Width    0.8047063       0.7373264
#> 
#> $VM
#>                  Stipe.Length Stipe.Thickness
#> Pileus.Cap.Width    0.3555993       0.3984261
#> 
#> $QM
#>                  Stipe.Length Stipe.Thickness
#> Pileus.Cap.Width     0.857843       0.7619691
#> 
#> $SE
#>                  Stipe.Length Stipe.Thickness
#> Pileus.Cap.Width    0.8817636       0.7830681
#> 
#> $FV
#>                  Stipe.Length Stipe.Thickness
#> Pileus.Cap.Width    0.7496622       0.7440004
#> 
#> $EJD
#>                  Stipe.Length Stipe.Thickness
#> Pileus.Cap.Width    0.5695142       0.5734316
#> 
#> $GQ
#>                  Stipe.Length Stipe.Thickness
#> Pileus.Cap.Width    0.8106262       0.7837862
#> 
#> $SPT
#>                  Stipe.Length Stipe.Thickness
#> Pileus.Cap.Width    0.8452575       0.7538511
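
For the Center Method, covariance and correlation reduce to the classical formulas applied to interval midpoints. The sketch below shows this for the first three observations in base R. It assumes the usual n - 1 denominator, which may differ from what int_cov() uses, and three rows will not reproduce the full-data values above.

```r
# Endpoints of the first three observations shown in the printed data
pileus_min <- c(3, 6, 4)
pileus_max <- c(8, 21, 8)
stipe_min  <- c(4, 4, 5)
stipe_max  <- c(9, 14, 11)

# Center Method: reduce each interval to its midpoint
pileus_c <- (pileus_min + pileus_max) / 2
stipe_c  <- (stipe_min + stipe_max) / 2

# Classical sample covariance and correlation of the midpoints
cm_cov <- cov(pileus_c, stipe_c)
cm_cor <- cor(pileus_c, stipe_c)

c(cov = cm_cov, cor = cm_cor)
```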

5 Geometric Properties

Geometric functions characterize the shape and spatial properties of individual intervals and relationships between interval variables.

5.1 Width, radius, center, and midrange

data(mushroom.int)

# Width = upper - lower
head(int_width(mushroom.int, "Stipe.Length"))
#>   Stipe.Length
#> 1            5
#> 2           10
#> 3            6
#> 4            3
#> 5            3
#> 6            6

# Radius = width / 2
head(int_radius(mushroom.int, "Stipe.Length"))
#>   Stipe.Length
#> 1          2.5
#> 2          5.0
#> 3          3.0
#> 4          1.5
#> 5          1.5
#> 6          3.0

# Center = (lower + upper) / 2
head(int_center(mushroom.int, "Stipe.Length"))
#>   Stipe.Length
#> 1          6.5
#> 2          9.0
#> 3          8.0
#> 4          5.5
#> 5          3.5
#> 6          7.0

# Midrange
head(int_midrange(mushroom.int, "Stipe.Length"))
#>   Stipe.Length
#> 1          2.5
#> 2          5.0
#> 3          3.0
#> 4          1.5
#> 5          1.5
#> 6          3.0
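
Width, radius, and center follow directly from the endpoints. As a check, the base-R sketch below recomputes them for the Stipe.Length intervals of the first three observations, [4, 9], [4, 14], and [5, 11]; the object names are illustrative.

```r
# Stipe.Length endpoints of the first three observations
lower <- c(4, 4, 5)
upper <- c(9, 14, 11)

width  <- upper - lower          # length of each interval
radius <- width / 2              # half-length
center <- (lower + upper) / 2    # midpoint

data.frame(width, radius, center)
```

These values agree with the first three rows of the int_width(), int_radius(), and int_center() output above.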

5.2 Overlap and containment

These functions measure the degree to which intervals from two variables overlap or contain each other, observation by observation:

# Overlap between two interval variables
head(int_overlap(mushroom.int, "Stipe.Length", "Stipe.Thickness"))
#>   Stipe.Length_Stipe.Thickness
#> 1                    0.0000000
#> 2                    0.0000000
#> 3                    0.0000000
#> 4                    0.1250000
#> 5                    0.1428571
#> 6                    0.0000000

# Containment: is each var_name1 interval contained in the matching var_name2 interval? (logical)
head(int_containment(mushroom.int, "Stipe.Length", "Stipe.Thickness"))
#>   Stipe.Length_in_Stipe.Thickness
#> 1                           FALSE
#> 2                           FALSE
#> 3                           FALSE
#> 4                           FALSE
#> 5                           FALSE
#> 6                           FALSE
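
The logical output suggests an elementwise containment test. A minimal base-R sketch follows, assuming "contained" means that both endpoints of the first interval lie inside the second; interval_within() is a hypothetical helper, not a dataSDA function.

```r
# TRUE where [lo1, hi1] lies entirely inside [lo2, hi2]
interval_within <- function(lo1, hi1, lo2, hi2) {
  lo1 >= lo2 & hi1 <= hi2
}

# Stipe.Length vs. Stipe.Thickness, first three observations
first_three <- interval_within(
  lo1 = c(4, 4, 5),   hi1 = c(9, 14, 11),
  lo2 = c(0.5, 1, 1), hi2 = c(2.5, 3.5, 2)
)
first_three

# A nested toy pair for contrast: [2, 3] inside [1, 5]
nested <- interval_within(2, 3, 1, 5)
nested
```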

6 Position and Scale Measures

6.1 Median and quantiles

data(mushroom.int)

# Median (default method = "CM")
int_median(mushroom.int, "Stipe.Length")
#>    Stipe.Length
#> CM            7

# Quantiles
int_quantile(mushroom.int, "Stipe.Length", probs = c(0.25, 0.5, 0.75))
#> $CM
#>     Stipe.Length
#> 25%         4.75
#> 50%         7.00
#> 75%         9.25

# Compare median across methods
int_median(mushroom.int, "Stipe.Length", method = c("CM", "FV"))
#>    Stipe.Length
#> CM     7.000000
#> FV     9.405092

6.2 Range, IQR, MAD, and mode

# Range (max - min)
int_range(mushroom.int, "Stipe.Length")
#>    Stipe.Length
#> CM         11.5

# Interquartile range (Q3 - Q1)
int_iqr(mushroom.int, "Stipe.Length")
#>     Stipe.Length
#> IQR          4.5

# Median absolute deviation
int_mad(mushroom.int, "Stipe.Length")
#>    Stipe.Length
#> CM          2.5

# Mode (histogram-based estimation)
int_mode(mushroom.int, "Stipe.Length")
#>    Stipe.Length
#> CM         8.75
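
Under the Center Method these measures are the classical ones applied to interval midpoints. The sketch below applies base R to the six Stipe.Length centers printed in Section 5.1. It assumes R's default quantile definition and an unscaled MAD (constant = 1), either of which may differ from the package's choices, and six rows will not reproduce the full-data values above.

```r
# Stipe.Length centers of the first six observations
centers <- c(6.5, 9.0, 8.0, 5.5, 3.5, 7.0)

c_median <- median(centers)
c_quart  <- quantile(centers, probs = c(0.25, 0.75))
c_range  <- diff(range(centers))        # max - min
c_iqr    <- IQR(centers)                # Q3 - Q1
c_mad    <- mad(centers, constant = 1)  # median absolute deviation, unscaled

c(median = c_median, range = c_range, iqr = c_iqr, mad = c_mad)
```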

7 Robust Statistics

Robust statistics reduce the influence of outliers by trimming or winsorizing extreme values.

7.1 Trimmed and winsorized means

data(mushroom.int)

# Compare standard mean vs trimmed mean (10% trim)
int_mean(mushroom.int, "Stipe.Length", method = "CM")
#>    Stipe.Length
#> CM     7.391304
int_trimmed_mean(mushroom.int, "Stipe.Length", trim = 0.1, method = "CM")
#>    Stipe.Length
#> CM     7.289474

# Winsorized mean: extreme values are replaced (not removed)
int_winsorized_mean(mushroom.int, "Stipe.Length", trim = 0.1, method = "CM")
#>    Stipe.Length
#> CM     7.282609

7.2 Trimmed and winsorized variances

int_var(mushroom.int, "Stipe.Length", method = "CM")
#>    Stipe.Length
#> CM     9.544466
int_trimmed_var(mushroom.int, "Stipe.Length", trim = 0.1, method = "CM")
#>    Stipe.Length
#> CM     6.119883
int_winsorized_var(mushroom.int, "Stipe.Length", trim = 0.1, method = "CM")
#>    Stipe.Length
#> CM     7.564229
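
The difference between trimming and winsorizing can be sketched in base R on the six Stipe.Length centers from Section 5.1. The winsorize() helper is hypothetical, and a trim of 0.2 is used so that one value is affected at each end of this small sample; the package's exact cut-off rule may differ.

```r
# Stipe.Length centers of the first six observations
centers <- c(6.5, 9.0, 8.0, 5.5, 3.5, 7.0)

# Replace the k smallest and k largest values by their nearest retained neighbor
winsorize <- function(x, trim) {
  n <- length(x)
  k <- floor(n * trim)
  s <- sort(x)
  pmin(pmax(x, s[k + 1]), s[n - k])
}

plain_mean      <- mean(centers)
trimmed_mean    <- mean(centers, trim = 0.2)       # drops 3.5 and 9.0
winsorized      <- winsorize(centers, trim = 0.2)  # 3.5 -> 5.5, 9.0 -> 8.0
winsorized_mean <- mean(winsorized)

c(plain = plain_mean, trimmed = trimmed_mean, winsorized = winsorized_mean)
```

Trimming shrinks the sample to four values, while winsorizing keeps all six but pulls the extremes inward.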

8 Distribution Shape

Shape functions describe the asymmetry and tail weight of the distribution of interval-valued data.

data(mushroom.int)

# Skewness: asymmetry of the distribution
int_skewness(mushroom.int, "Stipe.Length", method = "CM")
#>    Stipe.Length
#> CM    0.2228348

# Kurtosis: tail heaviness
int_kurtosis(mushroom.int, "Stipe.Length", method = "CM")
#>    Stipe.Length
#> CM    -1.065302

# Symmetry coefficient
int_symmetry(mushroom.int, "Stipe.Length", method = "CM")
#>    Stipe.Length
#> CM     0.800247

# Tailedness (related to kurtosis; here it equals the kurtosis value above)
int_tailedness(mushroom.int, "Stipe.Length", method = "CM")
#>    Stipe.Length
#> CM    -1.065302
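
The negative kurtosis above suggests an excess-kurtosis convention, under which a normal distribution scores 0. The sketch below computes moment-based skewness and excess kurtosis on interval midpoints in base R. The helper names are hypothetical and the package may apply a different small-sample correction, so the values are not expected to match the output above.

```r
# Moment-based skewness: third central moment over variance^(3/2)
moment_skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Moment-based excess kurtosis: fourth central moment over variance^2, minus 3
moment_excess_kurtosis <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

# Stipe.Length centers of the first six observations
centers <- c(6.5, 9.0, 8.0, 5.5, 3.5, 7.0)
moment_skewness(centers)
moment_excess_kurtosis(centers)
```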

9 Similarity Measures

Similarity functions quantify how alike two interval variables are. Jaccard, Dice, and the overlap coefficient return one value per observation, whereas cosine returns a single summary value across all observations.

data(mushroom.int)

int_jaccard(mushroom.int, "Stipe.Length", "Stipe.Thickness")
#>    Stipe.Length_Stipe.Thickness
#> 1                     0.0000000
#> 2                     0.0000000
#> 3                     0.0000000
#> 4                     0.1250000
#> 5                     0.1428571
#> 6                     0.0000000
#> 7                     0.0000000
#> 8                     0.0000000
#> 9                     0.0000000
#> 10                    0.0000000
#> 11                    0.0000000
#> 12                    0.0000000
#> 13                    0.0000000
#> 14                    0.0000000
#> 15                    0.0000000
#> 16                    0.0000000
#> 17                    0.0000000
#> 18                    0.0000000
#> 19                    0.0000000
#> 20                    0.0000000
#> 21                    0.0000000
#> 22                    0.0000000
#> 23                    0.0000000
int_dice(mushroom.int, "Stipe.Length", "Stipe.Thickness")
#>    Stipe.Length_Stipe.Thickness
#> 1                     0.0000000
#> 2                     0.0000000
#> 3                     0.0000000
#> 4                     0.2222222
#> 5                     0.2500000
#> 6                     0.0000000
#> 7                     0.0000000
#> 8                     0.0000000
#> 9                     0.0000000
#> 10                    0.0000000
#> 11                    0.0000000
#> 12                    0.0000000
#> 13                    0.0000000
#> 14                    0.0000000
#> 15                    0.0000000
#> 16                    0.0000000
#> 17                    0.0000000
#> 18                    0.0000000
#> 19                    0.0000000
#> 20                    0.0000000
#> 21                    0.0000000
#> 22                    0.0000000
#> 23                    0.0000000
int_cosine(mushroom.int, "Stipe.Length", "Stipe.Thickness")
#>        Stipe.Length_Stipe.Thickness
#> Cosine                    0.9257023
int_overlap_coefficient(mushroom.int, "Stipe.Length", "Stipe.Thickness")
#>    Stipe.Length_Stipe.Thickness
#> 1                     0.0000000
#> 2                     0.0000000
#> 3                     0.0000000
#> 4                     0.3333333
#> 5                     0.5000000
#> 6                     0.0000000
#> 7                     0.0000000
#> 8                     0.0000000
#> 9                     0.0000000
#> 10                    0.0000000
#> 11                    0.0000000
#> 12                    0.0000000
#> 13                    0.0000000
#> 14                    0.0000000
#> 15                    0.0000000
#> 16                    0.0000000
#> 17                    0.0000000
#> 18                    0.0000000
#> 19                    0.0000000
#> 20                    0.0000000
#> 21                    0.0000000
#> 22                    0.0000000
#> 23                    0.0000000

Note: int_tanimoto() is equivalent to int_jaccard() for interval-valued data:

int_tanimoto(mushroom.int, "Stipe.Length", "Stipe.Thickness")
#>    Stipe.Length_Stipe.Thickness
#> 1                     0.0000000
#> 2                     0.0000000
#> 3                     0.0000000
#> 4                     0.1250000
#> 5                     0.1428571
#> 6                     0.0000000
#> 7                     0.0000000
#> 8                     0.0000000
#> 9                     0.0000000
#> 10                    0.0000000
#> 11                    0.0000000
#> 12                    0.0000000
#> 13                    0.0000000
#> 14                    0.0000000
#> 15                    0.0000000
#> 16                    0.0000000
#> 17                    0.0000000
#> 18                    0.0000000
#> 19                    0.0000000
#> 20                    0.0000000
#> 21                    0.0000000
#> 22                    0.0000000
#> 23                    0.0000000
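
To make the difference between the Jaccard (Tanimoto) and overlap coefficients concrete, the sketch below applies the standard set-based definitions to a single pair of intervals in base R. The interval bounds are hypothetical and are not taken from mushroom.int, and the formulas are the textbook ones, assumed here for illustration rather than confirmed as the package's implementation.

```r
# Hypothetical intervals (assumption: not values from mushroom.int)
a <- c(4, 7)    # [lower, upper], width 3
b <- c(6, 12)   # [lower, upper], width 6

# Length of the intersection; zero when the intervals are disjoint
inter <- max(0, min(a[2], b[2]) - max(a[1], b[1]))
width_a <- a[2] - a[1]
width_b <- b[2] - b[1]

# Length of the union via inclusion-exclusion
union_len <- width_a + width_b - inter

jaccard <- inter / union_len              # 1 / 8 = 0.125
overlap <- inter / min(width_a, width_b)  # 1 / 3

print(c(jaccard = jaccard, overlap = overlap))
```

Because the overlap coefficient divides by the shorter width rather than by the union, it is never smaller than the Jaccard coefficient for the same pair of intervals.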

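The int_similarity_matrix() function computes a pairwise similarity matrix between observations, using all interval variables. For mushroom.int the result is a symmetric 23 x 23 matrix with ones on the diagonal: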

int_similarity_matrix(mushroom.int, method = "jaccard")
#>            1          2          3          4          5          6          7
#> 1  1.0000000 0.37037037 0.62380952 0.20000000 0.32539683 0.40873016 0.41269841
#> 2  0.3703704 1.00000000 0.37254902 0.14761905 0.28611111 0.55416667 0.18894831
#> 3  0.6238095 0.37254902 1.00000000 0.09523810 0.23611111 0.32900433 0.27380952
#> 4  0.2000000 0.14761905 0.09523810 1.00000000 0.06666667 0.30000000 0.25000000
#> 5  0.3253968 0.28611111 0.23611111 0.06666667 1.00000000 0.34166667 0.38333333
#> 6  0.4087302 0.55416667 0.32900433 0.30000000 0.34166667 1.00000000 0.32467532
#> 7  0.4126984 0.18894831 0.27380952 0.25000000 0.38333333 0.32467532 1.00000000
#> 8  0.4206349 0.27727273 0.54166667 0.16666667 0.51587302 0.26190476 0.48809524
#> 9  0.1479076 0.03030303 0.00000000 0.08333333 0.22222222 0.04761905 0.33333333
#> 10 0.2651515 0.06666667 0.28787879 0.00000000 0.17794486 0.02666667 0.10873440
#> 11 0.1111111 0.06060606 0.04166667 0.16666667 0.16666667 0.09523810 0.25000000
#> 12 0.4292929 0.61283422 0.41414141 0.09090909 0.57109557 0.55151515 0.29545455
#> 13 0.7444444 0.37142857 0.86772487 0.16666667 0.27042484 0.42028986 0.32063492
#> 14 0.2303030 0.48888889 0.25555556 0.00000000 0.51851852 0.36666667 0.13333333
#> 15 0.0000000 0.41944444 0.04761905 0.25000000 0.08888889 0.28888889 0.06250000
#> 16 0.1179931 0.01449275 0.00000000 0.03703704 0.22222222 0.02222222 0.27777778
#> 17 0.1066919 0.64848485 0.12222222 0.06666667 0.20238095 0.50108225 0.08888889
#> 18 0.1742424 0.56325758 0.25757576 0.11363636 0.23333333 0.62121212 0.20959596
#> 19 0.2083333 0.35555556 0.40476190 0.04166667 0.35714286 0.30000000 0.16203704
#> 20 0.3809524 0.09090909 0.19444444 0.25000000 0.16666667 0.16849817 0.62962963
#> 21 0.2824074 0.40000000 0.48809524 0.04166667 0.45238095 0.36666667 0.24537037
#> 22 0.3240741 0.48888889 0.56818182 0.02777778 0.39682540 0.31111111 0.23397436
#> 23 0.4047619 0.89583333 0.41025641 0.14761905 0.35555556 0.64444444 0.24475524
#>             8          9         10         11         12         13        14
#> 1  0.42063492 0.14790765 0.26515152 0.11111111 0.42929293 0.74444444 0.2303030
#> 2  0.27727273 0.03030303 0.06666667 0.06060606 0.61283422 0.37142857 0.4888889
#> 3  0.54166667 0.00000000 0.28787879 0.04166667 0.41414141 0.86772487 0.2555556
#> 4  0.16666667 0.08333333 0.00000000 0.16666667 0.09090909 0.16666667 0.0000000
#> 5  0.51587302 0.22222222 0.17794486 0.16666667 0.57109557 0.27042484 0.5185185
#> 6  0.26190476 0.04761905 0.02666667 0.09523810 0.55151515 0.42028986 0.3666667
#> 7  0.48809524 0.33333333 0.10873440 0.25000000 0.29545455 0.32063492 0.1333333
#> 8  1.00000000 0.22222222 0.24814815 0.33333333 0.31818182 0.58241758 0.2222222
#> 9  0.22222222 1.00000000 0.19047619 0.22222222 0.02777778 0.07792208 0.0000000
#> 10 0.24814815 0.19047619 1.00000000 0.03703704 0.05333333 0.31818182 0.0000000
#> 11 0.33333333 0.22222222 0.03703704 1.00000000 0.05555556 0.09523810 0.0000000
#> 12 0.31818182 0.02777778 0.05333333 0.05555556 1.00000000 0.40887132 0.7272727
#> 13 0.58241758 0.07792208 0.31818182 0.09523810 0.40887132 1.00000000 0.2095238
#> 14 0.22222222 0.00000000 0.00000000 0.00000000 0.72727273 0.20952381 1.0000000
#> 15 0.04444444 0.00000000 0.00000000 0.00000000 0.27916667 0.02222222 0.3053613
#> 16 0.14285714 0.86666667 0.25396825 0.14285714 0.01333333 0.05252525 0.0000000
#> 17 0.07142857 0.00000000 0.00000000 0.00000000 0.47323232 0.08211144 0.5634921
#> 18 0.16666667 0.00000000 0.02666667 0.00000000 0.57575758 0.20816864 0.4555556
#> 19 0.26190476 0.00000000 0.00000000 0.00000000 0.46969697 0.33333333 0.5238095
#> 20 0.29166667 0.54166667 0.32196970 0.28703704 0.13461538 0.28174603 0.0000000
#> 21 0.35714286 0.00000000 0.00000000 0.00000000 0.53030303 0.41176471 0.5416667
#> 22 0.52380952 0.00000000 0.16666667 0.00000000 0.54292929 0.52287582 0.5194444
#> 23 0.33282828 0.03030303 0.08965517 0.06060606 0.69277389 0.40740741 0.5277778
#>            15         16         17         18         19         20         21
#> 1  0.00000000 0.11799312 0.10669192 0.17424242 0.20833333 0.38095238 0.28240741
#> 2  0.41944444 0.01449275 0.64848485 0.56325758 0.35555556 0.09090909 0.40000000
#> 3  0.04761905 0.00000000 0.12222222 0.25757576 0.40476190 0.19444444 0.48809524
#> 4  0.25000000 0.03703704 0.06666667 0.11363636 0.04166667 0.25000000 0.04166667
#> 5  0.08888889 0.22222222 0.20238095 0.23333333 0.35714286 0.16666667 0.45238095
#> 6  0.28888889 0.02222222 0.50108225 0.62121212 0.30000000 0.16849817 0.36666667
#> 7  0.06250000 0.27777778 0.08888889 0.20959596 0.16203704 0.62962963 0.24537037
#> 8  0.04444444 0.14285714 0.07142857 0.16666667 0.26190476 0.29166667 0.35714286
#> 9  0.00000000 0.86666667 0.00000000 0.00000000 0.00000000 0.54166667 0.00000000
#> 10 0.00000000 0.25396825 0.00000000 0.02666667 0.00000000 0.32196970 0.00000000
#> 11 0.00000000 0.14285714 0.00000000 0.00000000 0.00000000 0.28703704 0.00000000
#> 12 0.27916667 0.01333333 0.47323232 0.57575758 0.46969697 0.13461538 0.53030303
#> 13 0.02222222 0.05252525 0.08211144 0.20816864 0.33333333 0.28174603 0.41176471
#> 14 0.30536131 0.00000000 0.56349206 0.45555556 0.52380952 0.00000000 0.54166667
#> 15 1.00000000 0.00000000 0.51942502 0.37606838 0.18803419 0.00000000 0.17216117
#> 16 0.00000000 1.00000000 0.00000000 0.00000000 0.00000000 0.48611111 0.00000000
#> 17 0.51942502 0.00000000 1.00000000 0.67195767 0.25925926 0.00000000 0.27635328
#> 18 0.37606838 0.00000000 0.67195767 1.00000000 0.35555556 0.05341880 0.42222222
#> 19 0.18803419 0.00000000 0.25925926 0.35555556 1.00000000 0.03703704 0.88888889
#> 20 0.00000000 0.48611111 0.00000000 0.05341880 0.03703704 1.00000000 0.03703704
#> 21 0.17216117 0.00000000 0.27635328 0.42222222 0.88888889 0.03703704 1.00000000
#> 22 0.27472527 0.00000000 0.36153846 0.50000000 0.58888889 0.02564103 0.70000000
#> 23 0.35277778 0.01449275 0.61991342 0.65353535 0.37777778 0.11313131 0.43333333
#>            22         23
#> 1  0.32407407 0.40476190
#> 2  0.48888889 0.89583333
#> 3  0.56818182 0.41025641
#> 4  0.02777778 0.14761905
#> 5  0.39682540 0.35555556
#> 6  0.31111111 0.64444444
#> 7  0.23397436 0.24475524
#> 8  0.52380952 0.33282828
#> 9  0.00000000 0.03030303
#> 10 0.16666667 0.08965517
#> 11 0.00000000 0.06060606
#> 12 0.54292929 0.69277389
#> 13 0.52287582 0.40740741
#> 14 0.51944444 0.52777778
#> 15 0.27472527 0.35277778
#> 16 0.00000000 0.01449275
#> 17 0.36153846 0.61991342
#> 18 0.50000000 0.65353535
#> 19 0.58888889 0.37777778
#> 20 0.02564103 0.11313131
#> 21 0.70000000 0.43333333
#> 22 1.00000000 0.52222222
#> 23 0.52222222 1.00000000
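
A similarity matrix of this kind is usually post-processed rather than read by eye. The base R sketch below works on the top-left 4 x 4 block of the output above (values copied at the printed precision) and shows two common steps: finding the most similar pair of distinct observations, and converting similarities to dissimilarities. The object names are illustrative, and using 1 - s as the dissimilarity is an assumption, not something the package prescribes.

```r
# Top-left 4 x 4 block of the Jaccard similarity matrix printed above
sim <- matrix(c(
  1.0000000, 0.3703704, 0.6238095, 0.2000000,
  0.3703704, 1.0000000, 0.3725490, 0.1476190,
  0.6238095, 0.3725490, 1.0000000, 0.0952381,
  0.2000000, 0.1476190, 0.0952381, 1.0000000
), nrow = 4, byrow = TRUE,
dimnames = list(as.character(1:4), as.character(1:4)))

# Mask the diagonal so self-similarity is ignored, then find the maximum
off_diag <- sim
diag(off_diag) <- NA
best <- which(off_diag == max(off_diag, na.rm = TRUE), arr.ind = TRUE)[1, ]
print(best)  # observations 3 and 1 are the most similar pair in this block

# Assumed conversion: dissimilarity = 1 - similarity, e.g. as input to hclust()
d <- as.dist(1 - sim)
print(round(d, 4))
```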

10 Uncertainty and Variability

These functions measure the uncertainty, variability, and information content of interval-valued data.

10.1 Entropy, CV, and dispersion

data(mushroom.int)

# Shannon entropy (higher = more uncertainty)
int_entropy(mushroom.int, "Stipe.Length", method = "CM")
#>    Stipe.Length
#> CM     3.740953

# Coefficient of variation (SD / mean)
int_cv(mushroom.int, "Stipe.Length", method = "CM")
#>    Stipe.Length
#> CM    0.4179793

# Dispersion index
int_dispersion(mushroom.int, "Stipe.Length", method = "CM")
#>    Stipe.Length
#> CM     2.608696
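
As a rough illustration of a coefficient of variation for interval data, the sketch below reduces each interval to its midpoint and divides the sample standard deviation by the mean. The bounds are hypothetical, and treating the statistic as a function of midpoints only is an assumption made for this example; the package's "CM" method may be defined differently.

```r
# Hypothetical interval bounds (assumption: not values from mushroom.int)
lower <- c(4, 6, 9, 12)
upper <- c(9, 14, 15, 20)

# Reduce each interval to its midpoint
centers <- (lower + upper) / 2   # 6.5 10.0 12.0 16.0

# Coefficient of variation: sample SD (n - 1 denominator) over the mean
cv <- sd(centers) / mean(centers)
print(cv)
```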

10.2 Imprecision, granularity, uniformity, and information content

# Imprecision: based on interval widths
int_imprecision(mushroom.int, "Stipe.Length")
#>             Stipe.Length
#> Imprecision    0.7882353

# Granularity: variability in interval sizes
int_granularity(mushroom.int, "Stipe.Length")
#>             Stipe.Length
#> Granularity     0.506144

# Uniformity: inversely related to granularity (higher = more uniform)
int_uniformity(mushroom.int, "Stipe.Length")
#>            Stipe.Length
#> Uniformity    0.6639471

# Normalized information content (between 0 and 1)
int_information_content(mushroom.int, "Stipe.Length", method = "CM")
#>    Stipe.Length
#> CM    0.8269928
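
The granularity and uniformity values printed above are numerically consistent with the relationship uniformity = 1 / (1 + granularity). The check below uses only the printed numbers; the relationship is inferred from them and should be read as an assumption, not as the documented definition of int_uniformity().

```r
granularity <- 0.506144    # printed by int_granularity() above
uniformity  <- 0.6639471   # printed by int_uniformity() above

# Candidate relationship, inferred from the printed values only
candidate <- 1 / (1 + granularity)

# Agreement up to the rounding of the printed output
print(abs(candidate - uniformity) < 1e-6)
```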

11 Distance Measures

Distance functions compute dissimilarities between observations in interval-valued datasets. Available methods include euclidean, hausdorff, ichino, and de_carvalho, among others.

We use the interval columns of car.int for distance examples (excluding the character Car column):

data(car.int)
car_num <- car.int[, 2:5]
head(car_num, 3)
#> # A tibble: 3 × 4
#>               Price      Max_Velocity      Accn_Time     Cylinder_Capacity
#>          <symblc_n>        <symblc_n>     <symblc_n>            <symblc_n>
#> 1 [260.50 : 460.00] [298.00 : 306.00]  [4.70 : 5.00] [5,935.00 : 5,935.00]
#> 2  [68.20 : 140.30] [216.00 : 250.00]  [6.70 : 9.70] [1,781.00 : 4,172.00]
#> 3 [123.80 : 171.40] [232.00 : 250.00] [5.40 : 10.10] [2,771.00 : 4,172.00]

11.1 Single distance method

# Euclidean distance between observations
int_dist(car_num, method = "euclidean")
#>            1          2          3          4          5          6          7
#> 2 2970.35864                                                                  
#> 3 2473.41498  496.95918                                                       
#> 4 1849.03463 1121.84802  625.03745                                            
#> 5 1405.70741 1569.15384 1073.25182  456.94909                                 
#> 6 2861.17713  150.18676  399.17381 1017.65776 1456.19073                      
#> 7 3348.56490  378.47417  875.27157 1500.20546 1946.33817  496.67582           
#> 8 2446.96685  528.63346   74.77202  604.37782 1043.31078  416.61899  904.08797
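
The first entries of this output can be reproduced in base R by taking the ordinary Euclidean distance between the vectors of interval midpoints. The sketch below uses the bounds shown in head(car_num, 3); it is an observation about the printed numbers under that midpoint assumption, not a description of how int_dist() is implemented.

```r
# Interval bounds copied from head(car_num, 3) above
# Columns: Price, Max_Velocity, Accn_Time, Cylinder_Capacity
lower <- rbind(c(260.5, 298, 4.7, 5935),
               c( 68.2, 216, 6.7, 1781),
               c(123.8, 232, 5.4, 2771))
upper <- rbind(c(460.0, 306,  5.0, 5935),
               c(140.3, 250,  9.7, 4172),
               c(171.4, 250, 10.1, 4172))

# Reduce each interval to its midpoint, then apply stats::dist()
centers <- (lower + upper) / 2
d <- dist(centers, method = "euclidean")

# Agrees with 2970.35864, 2473.41498 and 496.95918 in the output above
print(round(d, 5))
```

The result is dominated by Cylinder_Capacity, whose values are an order of magnitude larger than the other variables, so standardizing the variables first may be worth considering.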

11.2 Distance matrix

# Return as a full matrix
dm <- int_dist_matrix(car_num, method = "hausdorff")
dm[1:5, 1:5]
#>        1      2      3      4      5
#> 1    0.0 4560.4 3523.7 3398.8 2425.5
#> 2 4560.4    0.0 1062.9 1374.6 2139.9
#> 3 3523.7 1062.9    0.0 1342.0 1590.2
#> 4 3398.8 1374.6 1342.0    0.0  998.8
#> 5 2425.5 2139.9 1590.2  998.8    0.0
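
For the first three observations, the Hausdorff values above match the sum over variables of the larger of the two endpoint differences, max(|a1 - a2|, |b1 - b2|). The base R sketch below reproduces them from the bounds shown in head(car_num, 3). The helper name hausdorff_sum is hypothetical, and the aggregation by summation is inferred from the printed numbers rather than taken from the package documentation.

```r
# Interval bounds copied from head(car_num, 3) above
# Columns: Price, Max_Velocity, Accn_Time, Cylinder_Capacity
lower <- rbind(c(260.5, 298, 4.7, 5935),
               c( 68.2, 216, 6.7, 1781),
               c(123.8, 232, 5.4, 2771))
upper <- rbind(c(460.0, 306,  5.0, 5935),
               c(140.3, 250,  9.7, 4172),
               c(171.4, 250, 10.1, 4172))

# Hypothetical helper: per-variable Hausdorff distance between two intervals,
# summed over all variables
hausdorff_sum <- function(i, j) {
  sum(pmax(abs(lower[i, ] - lower[j, ]), abs(upper[i, ] - upper[j, ])))
}

h12 <- hausdorff_sum(1, 2)  # 4560.4 in the matrix above
h13 <- hausdorff_sum(1, 3)  # 3523.7 in the matrix above
h23 <- hausdorff_sum(2, 3)  # 1062.9 in the matrix above
print(c(h12, h13, h23))
```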

11.3 Pairwise distance between variables

int_pairwise_dist(car_num, "Price", "Max_Velocity", method = "euclidean")
#>      1      2      3      4      5      6      7      8 
#>  58.25 128.75  93.40  43.15  19.50  54.80 144.45  95.45
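
For the first three observations, the values above equal the absolute difference between the midpoints of the two intervals. The sketch below checks this in base R with the Price and Max_Velocity bounds from head(car_num, 3); that int_pairwise_dist() works on midpoints for method = "euclidean" is inferred from the printed output, not confirmed by the package documentation.

```r
# Bounds copied from head(car_num, 3) above
price        <- cbind(lower = c(260.5,  68.2, 123.8),
                      upper = c(460.0, 140.3, 171.4))
max_velocity <- cbind(lower = c(298, 216, 232),
                      upper = c(306, 250, 250))

# The row mean of [lower, upper] is the interval midpoint
center_gap <- abs(rowMeans(price) - rowMeans(max_velocity))

# Agrees with 58.25, 128.75 and 93.40 in the output above
print(center_gap)
```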

11.4 All distance methods at once

all_dists <- int_dist_all(car_num)
names(all_dists)
#>  [1] "GD"    "IY"    "L1"    "L2"    "CB"    "HD"    "EHD"   "nEHD"  "snEHD"
#> [10] "TD"    "WD"

12 Descriptive Statistics for Histogram-Valued Data

The hist_mean, hist_var, hist_cov, and hist_cor functions compute descriptive statistics for histogram-valued data (MatH objects).

# Mean and variance with BG method (default)
hist_mean(BLOOD, "Cholesterol")
#> [1] 180.677
hist_var(BLOOD, "Cholesterol")
#> [1] 1002.339

# L2W method
hist_mean(BLOOD, "Cholesterol", method = "L2W")
#> [1] 180.677
hist_var(BLOOD, "Cholesterol", method = "L2W")
#> [1] 388.1376

# Covariance and correlation
hist_cov(BLOOD, "Cholesterol", "Hemoglobin", method = "B")
#> [1] -4.692686
hist_cor(BLOOD, "Cholesterol", "Hemoglobin", method = "L2W")
#> [1] -0.4794806
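
To show what a mean of histogram-valued data can look like, the sketch below computes each unit's mean as the weighted sum of bin midpoints (assuming mass is spread uniformly within each bin) and then averages over units. The two histograms are hypothetical and are not taken from BLOOD, the helper unit_mean is not a dataSDA function, and the definitions behind the package's methods may differ from this simplified version.

```r
# Two hypothetical histogram-valued observations (assumption: not from BLOOD)
h1 <- list(breaks = c(150, 170, 190, 210), probs = c(0.2, 0.5, 0.3))
h2 <- list(breaks = c(160, 180, 200),      probs = c(0.6, 0.4))

# Hypothetical helper: mean of one histogram, assuming uniform mass per bin
unit_mean <- function(h) {
  n <- length(h$breaks)
  mids <- (h$breaks[-n] + h$breaks[-1]) / 2  # bin midpoints
  sum(h$probs * mids)
}

unit_means <- sapply(list(h1, h2), unit_mean)  # 182 and 178
overall_mean <- mean(unit_means)               # 180
print(overall_mean)
```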

13 Symbolic Dataset Donation/Submission Guidelines

We welcome contributions of high-quality datasets for symbolic data analysis. Submitted datasets will be made publicly available, or shared under the constraints you specify, to support research in machine learning, statistics, and related fields. You can submit the files by email to wuhm@g.nccu.edu.tw or through the Google Form (Symbolic Dataset Submission Form). The submission requirements are as follows.

Dataset Format:
- Preferred formats: .csv, .xlsx, or any symbolic format in plain text.
- Compress the submission (.zip or .gz) if it includes multiple files.

Required Metadata: Contributors must provide the following details:

Field	Description	Example
Dataset Name	A clear, descriptive title.	“face recognition data”
Dataset Short Name	A clear, abbreviated title.	“face data”
Authors	Full names of the donors.	“First name, Last name”
E-mail	Contact email.	“abc123@gmail.com”
Institutes	Affiliated organizations.	“-”
Country	Origin of the dataset.	“France”
Dataset Descriptions	A description of the data.	See ‘README’
Sample Size	Number of instances/rows.	27
Number of Variables	Total features/columns (categorical/numeric).	6 (interval)
Missing Values	Indicate if missing values exist and how they are handled.	“None” / “Yes, marked as NA”
Variable Descriptions	Detailed description of each column (name, type, units, range).	See ‘README’
Source	Original data source (if applicable).	“Leroy et al. (1996)”
References	Citations for prior work using the dataset.	“Douzal-Chouakria, Billard, and Diday (2011)”
Applied Areas	Relevant fields (e.g., biology, finance).	“Machine Learning”
Usage Constraints	Licensing (CC-BY, MIT) or restrictions.	“Public domain”
Data Link	URL to download the dataset (Google Drive, GitHub, etc.).	“(https)”

Quality Assurance:
- Datasets should be clean (no sensitive/private data).

Optional (Recommended):
- A companion README file with:
  - Dataset background.
  - Suggested use cases.
  - Known limitations.

14 Citation

Po-Wei Chen, Chun-houh Chen, Han-Ming Wu (2026), dataSDA: datasets and basic statistics for symbolic data analysis in R (v0.1.8). Technical report.

Introduction to dataSDA

Datasets and Basic Statistics for Symbolic Data Analysis

Po-Wei Chen, Chun-houh Chen and Han-Ming Wu*

February 11, 2026