---
title: "synthetic_data_00"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{synthetic_data_00}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(SmoothPLS)
library(ggplot2)
library(dplyr)
```

This document shows examples of some of the package's functions, organised into sections, each dedicated to one topic.

# Parameters

Several parameters recur throughout this package. We fix them here.

```{r}
nind = 50            # number of individuals (train set)
start = 0            # first time
end = 100            # end time
lambda_0 = 0.2       # exponential law parameter for state 0
lambda_1 = 0.1       # exponential law parameter for state 1
prob_start = 0.55    # probability of starting in state 1
curve_type = 'cat'
TTRatio = 0.2        # train/test ratio
NotS_ratio = 0.2     # noise variance over total variance for Y
beta_0_real = 5.4321 # intercept value for the link between X(t) and Y
nbasis = 15          # number of basis functions
norder = 4           # 4 for a cubic spline basis
```

# Integral evaluation

## evaluate_id_func_integral

This function evaluates the integral $\int_0^T X(t)\, func(t)\, dt$, where $X(t)$ is a categorical functional datum whose states are 0 or 1. WARNING: this function is of no use if $X(t)$ is scalar functional data.

```{r}
id_df = data.frame(id=rep(1,5), time=seq(0, 40, 10), state=c(0, 1, 1, 0, 1))
id_df
```

```{r, fig.alt="Decay plot"}
func = function(t){0.01*t^2 - t}
plot(0:40, func(0:40))
```

```{r}
evaluate_id_func_integral(id_df, func)
```

# Data regularisation

## regularize_time_series

This function regularizes a dataframe onto another time sequence. If the input dataframe contains more than one id, all ids share the same time sequence in the output dataframe.
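The idea behind this regularization can be sketched in base R for a single categorical curve, assuming the curve is piecewise constant between observations (last observed state carried forward). This is only an illustration; the package function may use a different interpolation rule. `regularize_one_id` is a hypothetical helper, not part of the package.

```r
# Minimal base-R sketch (assumption: piecewise-constant categorical curves).
regularize_one_id <- function(times, states, time_seq) {
  # findInterval() returns, for each new time, the index of the last
  # observation time that is <= that new time
  idx <- findInterval(time_seq, times)
  idx[idx == 0] <- 1  # before the first observation, use the first state
  data.frame(time = time_seq, state = states[idx])
}

times  <- seq(0, 40, 10)    # observation times, as in id_df above
states <- c(0, 1, 1, 0, 1)  # observed states
regularize_one_id(times, states, seq(0, 40, 2))
```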
```{r}
id_df_regul = regularize_time_series(id_df, time_seq = seq(0, 40, 2), curve_type = 'cat')
id_df_regul
```

```{r}
print(id_df$time)
print(id_df_regul$time)
```

## convert_to_wide_format

This function converts a regularized dataframe into a wide format by pivoting the data, which is useful for classic multivariate functional PLS, among other things. The input of *convert_to_wide_format* is the output of *regularize_time_series*: after regularisation, all individuals share the same time interval and time sampling, which makes the pivot possible.

```{r}
id_df_long = convert_to_wide_format(id_df_regul)
id_df_long
```

# Other

## from_fd_to_func

This function transforms an fd object into an R function. It is used to ease the integral computations inside the PLS.

```{r}
basis = create_bspline_basis(0, 100, 10, 4)
coef = c(10, 8, 6, 4, 2, 1, 3, 5, 7, 9)
fd_obj = fda::fd(coef = coef, basisobj = basis)
func_from_fd = from_fd_to_func(coef = coef, basis = basis)
```

Here is the fd object:

```{r, fig.alt="fd object"}
plot(fd_obj)
```

Now we can evaluate both; they should give the same values:

```{r}
fda::eval.fd(fd_obj, 13)
func_from_fd(13)  # should match the evaluation above
```

# Synthetic one-state CFD data

This package provides functions to create synthetic data. These data are individuals with state 0 or 1, whose sojourn times follow exponential laws. One important input is: curve_type = 'cat'.

## generate_X_df

This function generates nind individuals over the time interval from start to end, with parameters lambda_0 and lambda_1 for the exponential sojourn laws of state 0 and state 1. The state at time = start has probability prob_start of being 1 (Bernoulli law). Here we generate df with the parameters declared at the top of this notebook.
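The generating mechanism just described can be sketched in base R: draw the initial state from a Bernoulli law, then alternate exponential sojourn times until the end of the observation window. This is an illustration only, not the package's actual implementation.

```r
# Hedged base-R sketch of the generating mechanism (not the package code).
set.seed(1)
start <- 0; end <- 100           # same values as fixed at the top
lambda_0 <- 0.2; lambda_1 <- 0.1
prob_start <- 0.55

times  <- start
states <- rbinom(1, 1, prob_start)   # initial state ~ Bernoulli(prob_start)
while (tail(times, 1) < end) {
  s    <- tail(states, 1)
  rate <- if (s == 0) lambda_0 else lambda_1
  times  <- c(times, tail(times, 1) + rexp(1, rate))  # exponential sojourn in state s
  states <- c(states, 1 - s)                          # then switch state
}
traj <- data.frame(time = times, state = states)
traj <- traj[traj$time <= end, ]     # keep only the observation window
head(traj)
```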
```{r}
df = generate_X_df(nind = nind, start = start, end = end, curve_type = 'cat', lambda_0 = lambda_0, lambda_1 = lambda_1, prob_start = prob_start)
head(df)
```

```{r}
df_2 = generate_X_df(nind=20, start = 13, end = 60, curve_type = 'cat', lambda_0 = 0.21, lambda_1 = 0.13, prob_start = 0.55)
length(unique(df_2$id))
```

## plot_CFD_individuals

This function plots the first selected individuals of the given dataframe.

```{r, fig.alt="Binary CFD individuals"}
plot_CFD_individuals(df)
```

```{r, fig.alt="3 binary CFD individuals"}
plot_CFD_individuals(df_2, n_ind_to_plot = 3)
```

## Create df_test

The function *number_of_test_id* helps generate the test set of X. It uses the same arguments as *generate_X_df*, plus the train/test ratio TTRatio. It computes the number of test individuals to create so that the ratio TTRatio holds over all individuals, train set and test set combined.

```{r}
nind_test = number_of_test_id(TTRatio = TTRatio, nind = nind)
df_test = generate_X_df(nind_test, start, end, curve_type = 'cat', lambda_0, lambda_1, prob_start)
length(unique(df_test$id))
```

```{r}
df_test_2 = generate_X_df(nind=number_of_test_id(TTRatio = TTRatio, nind = 80), start, end, curve_type = 'cat', lambda_0, lambda_1, prob_start)
# Here the number of test individuals is 20 because:
# 20 = 0.2 * (80 + 20), i.e.
# 20 = floor(80*TTRatio/(1-TTRatio))
length(unique(df_test_2$id))
```

## Beta_real

This package provides several beta functions to link $X(t)$ with a scalar $Y$ through the equation $Y = \beta_0 + \int_{0}^{T} X(t)\, \beta(t)\, dt$.
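For a binary step curve $X(t)$, the integral in this equation reduces to integrating $\beta(t)$ over the intervals where $X(t) = 1$. A minimal sketch, assuming a hand-chosen stand-in weight `beta_fun` (not one of the package's betas) and hand-chosen "state 1" intervals:

```r
# Hedged sketch: Y = beta_0 + integral of beta(t) over the X(t) = 1 intervals.
# beta_fun and the intervals are illustrative assumptions.
beta_fun <- function(t) sin(pi * t / 100)
beta_0   <- 5.4321

# X(t) = 1 on [10, 30] and [60, 80], 0 elsewhere
ones_intervals <- list(c(10, 30), c(60, 80))
integral <- sum(sapply(ones_intervals, function(iv)
  integrate(beta_fun, iv[1], iv[2])$value))
Y <- beta_0 + integral
Y
```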
### beta_1_real_func

```{r, fig.alt="beta_1_real_func"}
plot(x=0:100, y=beta_1_real_func(0:100, 100), type='l', main="Beta_1")
```

### beta_2_real_func

```{r, fig.alt="beta_2_real_func"}
plot(x=0:100, y=beta_2_real_func(0:100, 100), type='l', main="Beta_2")
```

### beta_3_real_func

```{r, fig.alt="beta_3_real_func"}
plot(x=0:100, y=beta_3_real_func(0:100, 100), type='l', main="Beta_3")
```

### Other beta functions

```{r, fig.alt="Other beta functions"}
plot(x=0:100, y=beta_4_real_func(0:100, 100), type='l', main="Beta_4")
plot(x=0:100, y=beta_5_real_func(0:100, 100), type='l', main="Beta_5")
plot(x=0:100, y=beta_6_real_func(0:100, 100), type='l', main="Beta_6")
plot(x=0:100, y=beta_7_real_func(0:100, 100), type='l', main="Beta_7")
```

## generate_Y_df

This function generates a Y dataframe based on the relation $Y = \beta_0 + \int_{0}^{T} X(t)\, \beta(t)\, dt$. It also adds Gaussian noise to Y: the ratio NotS_ratio gives the share of the total variance that is due to the noise.

```{r}
Y_df = generate_Y_df(df, curve_type = 'cat', beta_1_real_func, beta_0_real, NotS_ratio)
names(Y_df)
```

```{r}
head(Y_df)
```

We can inspect the variances:

```{r}
var(Y_df$Y_real)
var(Y_df$Y_noised)
var(Y_df$Y_real)/var(Y_df$Y_noised)
(var(Y_df$Y_noised) - var(Y_df$Y_real))/var(Y_df$Y_noised)
```

```{r, fig.alt="Y_df real and noised value histograms"}
oldpar <- par(mfrow=c(1,2))
hist(Y_df$Y_real)
hist(Y_df$Y_noised)
par(oldpar)
```

We can generate Y_df_test the same way, simply replacing df with df_test:

```{r}
Y_df_test = generate_Y_df(df_test, curve_type = 'cat', beta_1_real_func, beta_0_real, NotS_ratio)
head(Y_df_test)
```

# Synthetic SFD data

We can also generate synthetic Scalar Functional Data (SFD). The important input is curve_type = 'num'.

## generate_X_df

For an SFD X_df, two new arguments matter: the noise added to the signal and the seed for reproducibility.
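The role of these two arguments can be sketched in base R: a smooth base curve plus Gaussian noise of standard deviation `noise_sd`, made reproducible with a seed. The cosine signal here is only an illustrative stand-in, not necessarily the package's exact signal.

```r
# Hedged sketch of noise_sd and seed (illustrative signal, not the package's).
noise_sd <- 0.15
t_grid   <- seq(0, 100, by = 1)

set.seed(123)
curve_a <- cos(2 * pi * t_grid / 100) + rnorm(length(t_grid), sd = noise_sd)
set.seed(123)
curve_b <- cos(2 * pi * t_grid / 100) + rnorm(length(t_grid), sd = noise_sd)

identical(curve_a, curve_b)  # TRUE: same seed, same noisy curve
```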
```{r}
df = generate_X_df(nind = nind, start = start, end = end, curve_type = 'num', noise_sd = 0.15, seed = 123)
head(df)
```

```{r, fig.alt="Noised cosine curves"}
# Visualisation
ggplot(df, aes(x = time, y = value, group = id, color = factor(id))) +
  geom_line(alpha = 0.8) +
  labs(title = "Noised cosine curves", x = "Time", y = "Value", color = "Individual") +
  theme_minimal()
```

## generate_Y_df

```{r}
Y_df = generate_Y_df(df = df, curve_type = 'num', beta_real_func_or_list = beta_1_real_func, beta_0_real = beta_0_real, NotS_ratio = NotS_ratio, seed = 123)
head(Y_df)
```

## regularize_time_series

```{r}
id_df_regul = regularize_time_series(df, time_seq = seq(0, 40, 2), curve_type = 'num')
id_df_regul
```

## convert_to_wide_format

```{r}
id_df_long = convert_to_wide_format(id_df_regul)
id_df_long
```

# Synthetic multi-state CFD

```{r}
N_states = 4
```

```{r}
# Initialize the lambda values
lambdas = lambda_determination(N_states)
lambdas
```

```{r}
# Initialize the transition matrix
transition_df = transfer_probabilities(N_states)
transition_df
```

## Data generation

```{r}
df = generate_X_df_multistates(nind = 100, N_states, start=0, end=100, lambdas, transition_df)
head(df)
```

We can plot some individuals with the plot_CFD_individuals() function.

```{r, fig.alt="Multistates individuals"}
plot_CFD_individuals(df)
```

## plotData

We can use the package *cfda* and its functions to analyse the data.

```{r, fig.alt="Multistates individuals plot by cfda"}
cfda::plotData(df)
```

## estimate_pt

We can also use the *cfda* package to estimate the marginal state probabilities:

```{r}
proba = cfda::estimate_pt(df)
```

```{r, fig.alt="Marginal probabilities"}
plot(proba, ribbon = FALSE)
plot(proba, ribbon = TRUE)
```

# Multi-state CFD manipulation

Before performing the FPLS or the smooth-PLS, we have to turn a categorical functional datum with multiple states into multiple one-state categorical functional data.
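The decomposition can be sketched in base R on a toy dataframe: each state gets its own 0/1 indicator column. The package helpers shown below do this, plus duplicate removal, on the real data structures.

```r
# Hedged base-R sketch of the state-indicator decomposition (toy data).
df_small <- data.frame(id    = c(1, 1, 1, 1),
                       time  = c(0, 10, 20, 30),
                       state = c(2, 1, 3, 1))
states <- sort(unique(df_small$state))            # ascending state order
indicators <- sapply(states, function(s) as.integer(df_small$state == s))
colnames(indicators) <- paste0("state_", states)  # one 0/1 column per state
cbind(df_small[c("id", "time")], indicators)
```

At each time, exactly one indicator equals 1, so the row sums of the indicator columns are all 1.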
```{r}
head(df)
```

We need to order the states:

```{r}
str(df$state)
unique(df$state)
order(unique(df$state)) # Warning: this gives the indices of the order!
state_ordered = unique(df$state)[order(unique(df$state))]
state_ordered
```

## state_indicator

This function transforms a categorical functional datum into one indicator column per distinct state. It sorts the states in ascending order (if numeric) and names the column for state 'X' as 'state_X' in the output. It also works with character states. *From now on, the [[i]] element of any of the lists below concerns the i-th state in this order.*

```{r}
si_df = state_indicator(df, id_col='id', time_col='time')
names(si_df)
```

```{r}
head(si_df)
```

## split_in_state_df

This function splits the indicator dataframe into a list with one dataframe per state indicator function.

```{r}
split_df = split_in_state_df(si_df, id_col='id', time_col='time')
names(split_df)
mode(split_df)
```

```{r}
names(split_df)[4]
head(split_df[[4]])
```

## build_df_per_state

This function takes the list with one dataframe per state indicator function and removes the duplicated states of each state indicator with the function *remove_duplicate_states()*.

```{r}
states_df = build_df_per_state(split_df, id_col='id', time_col='time')
names(states_df)
mode(states_df)
```

```{r, fig.alt="Indicator function per state"}
plot_CFD_individuals(states_df[[1]])
```

## cat_data_to_indicator

This function applies all the steps above to go from a categorical functional datum with several states to a list of one dataframe per state indicator function, with duplicated states removed.

```{r}
df_list = cat_data_to_indicator(df)
names(df_list)
head(df_list$state_1)
```

Now the data is ready for the operations needed by the smooth-PLS or the FPLS.
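As a closing illustration: once each state has its indicator curve, scalar quantities of the form $\int_0^T \mathbf{1}_{X(t)=s}\, w(t)\, dt$ can be approximated on a regular grid with the trapezoidal rule. The helper `trapz`, the grid, and the weight `w` below are all illustrative assumptions, not package code.

```r
# Hedged sketch: trapezoidal approximation of the integral of an indicator
# curve times a weight function w(t). All names here are illustrative.
trapz <- function(x, y) sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)

t_grid    <- seq(0, 100, by = 0.1)
indicator <- as.integer(t_grid >= 20 & t_grid <= 60)  # 1_{X(t) = s}
w         <- function(t) rep(1, length(t))            # w(t) = 1: the integral is the time spent in s

trapz(t_grid, indicator * w(t_grid))  # close to 40, the length of [20, 60]
```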