Type: | Package |
Title: | Collinearity Detection using Redefined Variance Inflation Factor and Graphical Methods |
Version: | 2.0 |
Date: | 2024-11-04 |
Author: | R. Salmerón [aut, cre], C.B. García [aut] |
Maintainer: | R. Salmerón <romansg@ugr.es> |
Description: | The detection of troubling approximate collinearity in a multiple linear regression model is a classical problem in Econometrics. The objective of this package is to detect it using the variance inflation factor redefined and the scatterplot between the variance inflation factor and the coefficient of variation. For more details see Salmerón R., García C.B. and García J. (2018) <doi:10.1080/00949655.2018.1463376>, Salmerón, R., Rodríguez, A. and García C. (2020) <doi:10.1007/s00180-019-00922-x>, Salmerón, R., García, C.B, Rodríguez, A. and García, C. (2022) <doi:10.32614/RJ-2023-010>, Salmerón, R., García, C.B. and García, J. (2024) <doi:10.1007/s10614-024-10575-8> and Salmerón, R., García, C.B, García J. (2023, working paper) <doi:10.48550/arXiv.2005.02245>. |
License: | GPL-2 | GPL-3 [expanded from: GPL (≥ 2)] |
Encoding: | UTF-8 |
URL: | http://colldetreat.r-forge.r-project.org/ |
Depends: | R (≥ 3.5.0), multiColl |
NeedsCompilation: | no |
Packaged: | 2024-11-14 10:10:04 UTC; usuario |
Repository: | CRAN |
Date/Publication: | 2024-11-14 11:50:03 UTC |
Multicollinearity detection using RVIF and graphical methods
Description
The detection of troubling near multicollinearity in a multiple linear regression model is a classical problem in Econometrics. The purpose of this package is to detect it by using the Redefined Variance Inflation Factor (RVIF) and the scatterplot between the Variance Inflation Factor (VIF) and the Coefficient of Variation (CV).
In addition, the RVIF is used to determine whether the statistical analysis of the model is affected by the degree of multicollinearity of the model.
Details
This package contains three functions. The first, CV_VIF, returns the values of the Variance Inflation Factor (VIF) and the Coefficient of Variation (CV), as well as their representation in a scatterplot. Taking into account that the VIF is useful for detecting essential multicollinearity and the CV is useful for detecting non-essential multicollinearity, the scatterplot of both measures can provide interesting information for determining whether there is a troubling degree of multicollinearity, what kind of multicollinearity it is and what variables are causing the multicollinearity.
On the other hand, the funcion RVIF calculates the redefined VIF, the percentage of near multicollinearity due to each independent variable and, using the above function, the scatterplot between the CV and VIF.
Finally, Theorem determines whether the degree of multicollinearity in the regression model affects the statistical analysis of the model, i.e., whether the non-rejection of the null hypothesis in the individual significance tests is due to the linear relationships between the independent variables of the model.
Author(s)
Román Salmerón Gómez (University of Granada) and Catalina García García (University of Granada).
Maintainer: Román Salmerón Gómez (romansg@ugr.es)
References
R. Salmerón, C. García, and J. García. Variance inflation factor and condition number in multiple linear regression. Journal of Statistical Computation and Simulation, 88:2365-2384, 2018.
R. Salmerón, A. Rodríguez, and C. García. Diagnosis and quantification of the non-essential collinearity. Computational Statistics, 35:647-666, 2020.
Salmerón, R., García, C.B., Rodríguez, A. and García, C. Limitations in detecting multicollinearity due to scaling issues in the mcvis package. R Journal, 14(4), 264-279, 2022.
Salmerón, R., García, C.B. and García, J. A redefined Variance Inflation Factor: overcoming the limitations of the Variance Inflation Factor. Computational Economics (2024, online), doi: https://doi.org/10.1007/s10614-024-10575-8.
Overcoming the inconsistences of the variance inflation factor: a redefined VIF and a test to detect statistical troubling multicollinearity by Salmerón, R., García, C.B and García, J. (working paper, https://arxiv.org/pdf/2005.02245).
Cobb-Douglas data
Description
Data used in Example 2 of Salmerón, García and García (2024) (sub-section 4.2) on data for the Cobb-Douglas production function.
Usage
data("CDpf")
Format
A data frame with 28 observations on the following 4 variables:
P
Production (dependent variable).
cte
Intercept.
logK
Capital (in logarithm).
logW
Work (in logarithm).
Details
This dataset is originally used by Olva Maldonado (2009).
References
Olva Maldonado, H. Análisis de la función de producción Cobb-Douglas y su aplicación en el sector productivo mexicano. Tesis, Universidad Autónoma de Chapingo, 2009.
Salmerón, R., García, C.B. and García, J. A redefined Variance Inflation Factor: overcoming the limitations of the Variance Inflation Factor. Computational Economics (2024, online), doi: https://doi.org/10.1007/s10614-024-10575-8.
Examples
data(CDpf)
head(CDpf, n=5)
y = CDpf[,1]
X = as.matrix(CDpf[,2:4])
Theorem(y, X)
VIF, CV and a common scatter plot
Description
This function provides the values for the Variance Inflation Factor (VIF) and the Coefficient of Variation (CV), as well as a common representation of both.
Usage
CV_VIF(X, size=NULL, top=82.64, limit=40, dummy=FALSE, pos=NULL, intercept=TRUE)
Arguments
X |
A numerical design matrix that should contain more than one regressor (including the intercept). |
size |
A numerical vector containing the percentage of multicollinearity due to each variable. By default |
top |
A real number that indicates the threshold from which the percentage of multicollinearity due to each variable is considered troubling. By default |
limit |
A real number that indicates the lower limit of the vertical axis. By default |
dummy |
A logical value that indicates if there are dummy variables in the design matrix |
pos |
A numerical vector indicating the position of the dummy variables, if any, in the design matrix |
intercept |
A logical value used only by the function RVIF. By default |
Details
It is interesting to note the distinction between essential (near-linear relationship between at least two independent variables excluding the intercept) and non-essential multicollinearity (near-linear relationship between the intercept and at least one of the remaining independent variables), due to the VIF is not an appropriate measure to detect non-essential collinearity (only detects essential collinearity), while the CV is useful to detect only non-essential collinearity.
Then, this distinction between essential and non-essential multicollinearity and the limitations of each measure for detecting the different kinds of multicollinearity, can be very useful to detect if there is a troubling degree of multicollinearity, what kind of multicollinearity it is and what variables are causing the multicollinearity.
For this purpose, it is important to include in the figures the lines corresponding to the established thresholds for each measure (CV and VIF): dashed vertical line for 0.1002506 (CV) and dotted horizontal line for 10 (VIF). These lines determine four regions (see Example 1) which can be interpreted as follows: A, existence of troubling non-essential and non-troubling essential multicollinearity; B, existence of troubling essential and non-essential multicollinearity; C, existence of non-troubling non-essential and troubling essential multicollinearity; D: non-troubling degree of existing multicollinearity (essential and non-essential).
Value
CV |
Coefficient of Variation of each independent variable. |
VIF |
Variance Inflation Factor of each independent variable. |
Author(s)
R. Salmerón (romansg@ugr.es) and C. García (cbgarcia@ugr.es).
References
R. Salmerón, C. García, and J. García. Variance inflation factor and condition number in multiple linear regression. Journal of Statistical Computation and Simulation, 88:2365-2384, 2018.
R. Salmerón, A. Rodríguez, and C. García. Diagnosis and quantification of the non-essential collinearity. Computational Statistics, 35:647-666, 2020.
Salmerón, R., García, C.B., Rodríguez, A. and García, C. Limitations in detecting multicollinearity due to scaling issues in the mcvis package. R Journal, 14(4), 264-279, 2022.
Examples
## Example 1
plot(-2:20, -2:20, type = "n", xlab="Coefficient of Variation", ylab="Variance Inflation Factor")
abline(h=10, col="black", lwd=3, lty=2)
abline(v=0.1002506, col="black", lwd=3, lty=3)
text(-1.25, 2, "A", pos=3, col="red")
text(-1.25, 12, "B", pos=3, col="red")
text(10, 12, "C", pos=3, col="red")
text(10, 2, "D", pos=3, col="red")
## Example 2
library(multiColl)
set.seed(2022)
obs = 100
cte = rep(1, obs)
x2 = rnorm(obs, 5, 0.01)
x3 = rnorm(obs, 5, 10)
x4 = x3 + rnorm(obs, 5, 1)
x5 = rnorm(obs, -1, 30)
x = cbind(cte, x2, x3, x4, x5)
CV_VIF(x, size = c(1, 1, 1, 1))
RVIF calculation
Description
This function provides the values of the Redefined Variance Inflation Factor (RVIF) and the the percentage of near multicollinearity due to each independent variable.
Usage
RVIF(X, l_u=TRUE, l=40, intercept=TRUE, graf=TRUE)
Arguments
X |
A numerical design matrix that should contain more than one regressor. |
l_u |
A logical value that indicates if the variables in the design matrix |
l |
A real number that indicates the lower limit of the vertical axis of the scatter plot between the Variance Inflation Factor (VIF) and the Coefficient of Variation (CV). By default |
intercept |
A logical value that indicates if the design matrix |
graf |
A logical value that indicates if the scatter plot between the VIF and CV is represented by using the CV_VIF function. By default |
Details
The Redefined Variation Inflation Factor (RVIF) is capable to detect both kind of multicollinearity: the essential (near-linear relationship between at least two independent variables excluding the intercept) and non-essential (near-linear relationship between the intercept and at least one of the remaining independent variables). This measure also quantifies the percentage of near multicollinearity due to each independent variable.
Value
RVIF |
Redefined Variance Inflation Factor of each independent variable. |
% |
Percentage of near multicollinearity due to each independent variable. |
Graph |
Scatter plot of VIF and CV. |
Author(s)
R. Salmerón (romansg@ugr.es) and C. García (cbgarcia@ugr.es).
References
R. Salmerón, C. García, and J. García. Variance inflation factor and condition number in multiple linear regression. Journal of Statistical Computation and Simulation, 88:2365-2384, 2018.
R. Salmerón, A. Rodríguez, and C. García. Diagnosis and quantification of the non-essential collinearity. Computational Statistics, 35:647-666, 2020.
Salmerón, R., García, C.B. y García, J. A redefined Variance Inflation Factor: overcoming the limitations of the Variance Inflation Factor. Computational Economics (2024, online), doi: https://doi.org/10.1007/s10614-024-10575-8.
See Also
Examples
library(multiColl)
set.seed(2022)
obs = 100
cte = rep(1, obs)
x2 = rnorm(obs, 5, 0.01)
x3 = rnorm(obs, 5, 10)
x4 = x3 + rnorm(obs, 5, 1)
x5 = rnorm(obs, -1, 30)
x = cbind(cte, x2, x3, x4, x5)
RVIF(x)
First simulated data for the simple linear regression model
Description
First data used in example 4 of Salmerón, García and García (2024) (subsection 4.4) on special case of simple linear model.
Usage
data("SLM1")
Format
A data frame with 50 observations on the following 3 variables:
y1
Dependent variable simulated as y = 3 + 4*V + u with u is normally distributed with a mean equal to 0 and a variance equal to 2.
cte
Intercept.
V
Simulated from a normal distribution with mean equal to 10 and variance equal to 100.
References
Salmerón, R., García, C.B. and García, J. A redefined Variance Inflation Factor: overcoming the limitations of the Variance Inflation Factor. Computational Economics (2024, online), doi: https://doi.org/10.1007/s10614-024-10575-8.
Examples
data(SLM1)
head(SLM1, n=5)
y = SLM1[,1]
X = as.matrix(SLM1[,2:3])
Theorem(y, X)
Second simulated data for the simple linear regression model
Description
Second data used in example 4 of Salmerón, García and García (2024) (subsection 4.4) on special case of simple linear model.
Usage
data("SLM2")
Format
A data frame with 50 observations on the following 3 variables:
y2
Dependent variable simulated as y = 3 + 4*Z + u where u is normally distributed with a mean equal to 0 and a variance equal to 2.
cte
Intercept.
Z
Simulated from a normal distribution with mean equal to 10 and variance equal to 0.1.
References
Salmerón, R., García, C.B. and García, J. A redefined Variance Inflation Factor: overcoming the limitations of the Variance Inflation Factor. Computational Economics (2024, online), doi: https://doi.org/10.1007/s10614-024-10575-8.
Examples
data(SLM2)
head(SLM2, n=5)
y = SLM2[,1]
X = as.matrix(SLM2[,2:3])
Theorem(y, X)
Theorem
Description
Given a multiple linear regression model with n observations and k independent variables, the degree of near-multicollinearity affects its statistical analysis (with a level of significance of afa%) if there is a variable i, with i = 1,...,k, that verifies that the null hypothesis is not rejected in the original model and is rejected in the orthogonal model of reference.
Usage
Theorem(y, X, alfa = 0.05)
Arguments
y |
A numerical vector representing the dependent variable of the model. |
X |
A numerical design matrix that should contain more than one regressor (intercept included). |
alfa |
Level of significance (by default, 5%). |
Details
This function compares the individual inference of the original model with that of the orthonormal model taken as reference.
Thus, if the null hypothesis is rejected in the individual significance tests in the model where there are no linear relationships between the independent variables (orthonormal) and is not rejected in the original model, the reason for the non-rejection is due to the existing linear relationships between the independent variables (multicollinearity) of the original model.
The second model is obtained from the first model by performing a QR decomposition which allows to eliminate the initial linear relationships.
Value
The function returns the value of the RVIF, the thresholds established as worroying and whether or not the individual significance analysis is affected by multicollinearity (at the significance level used).
Author(s)
Román Salmerón Gómez (University of Granada) and Catalina García García (University of Granada).
Maintainer: Román Salmerón Gómez (romansg@ugr.es)
References
Salmerón, R., García, C.B. and García, J. A redefined Variance Inflation Factor: overcoming the limitations of the Variance Inflation Factor. Computational Economics (2024, online), doi: https://doi.org/10.1007/s10614-024-10575-8.
Overcoming the inconsistences of the variance inflation factor: a redefined VIF and a test to detect statistical troubling multicollinearity by Salmerón, R., García, C.B and García, J. (working paper, https://arxiv.org/pdf/2005.02245).
See Also
Examples
## Example 1
set.seed(2024)
obs = 100
cte = rep(1, obs)
x2 = rnorm(obs, 5, 0.01) # related to intercept: non essential
x3 = rnorm(obs, 5, 10)
x4 = x3 + rnorm(obs, 5, 0.5) # related to x3: essential
x5 = rnorm(obs, -1, 3)
x6 = rnorm(obs, 15, 0.5)
y = 4 + 5*x2 - 9*x3 -2*x4 + 2*x5 + 7*x6 + rnorm(obs, 0, 2)
X = cbind(cte, x2, x3, x4, x5, x6)
Theorem(y, X)
## Example 2
obs = 25 # by decreasing the number of observations affected to x4
cte = rep(1, obs)
x2 = rnorm(obs, 5, 0.01) # related to intercept: non essential
x3 = rnorm(obs, 5, 10)
x4 = x3 + rnorm(obs, 5, 0.5) # related to x3: essential
x5 = rnorm(obs, -1, 3)
x6 = rnorm(obs, 15, 0.5)
y = 4 + 5*x2 - 9*x3 -2*x4 + 2*x5 + 7*x6 + rnorm(obs, 0, 2)
X = cbind(cte, x2, x3, x4, x5, x6)
Theorem(y, X)
## Example 3
y = 4 - 9*x3 - 2*x5 + rnorm(obs, 0, 2)
X = cbind(cte, x3, x5) # independently generated
Theorem(y, X)
Wissel data
Description
Wissel data on outstanding mortgage debt.
Usage
data("Wissel")
Format
A data frame with 17 observations on the following 6 variables:
t
Year.
D
Outstanding mortgage debt (dependent variable).
cte
Intercept.
C
Personal consumption (trillions of dollars).
I
Personal income (trillions of dollars).
CP
Outstanding consumer credit (trillions of dollars).
References
Wissel, J. (2009). A new biased estimator for multivariate regression models with highly collinear variables. Ph.D. thesis, Erlangung des naturwissenschaftlichen Doktorgrades der Bayerischen Julius-Maximilians-Universität Würzburg.
Examples
data(Wissel)
head(Wissel, n=5)
y = Wissel[,2]
X = as.matrix(Wissel[,3:6])
Theorem(y, X)
Spanish company employee data
Description
Data used in example 3 of Salmerón, García and García (2024) (subsection 4.3) on number of employees of Spanish companies.
Usage
data("employees")
Format
A data frame with 15 observations on the following 5 variables:
NE
Number of employees (dependent variable).
cte
Intercept.
FA
Fixed assets (in euros).
OI
Operating income (in euros).
S
Sales (in euros).
Details
This dataset is originally used by Salmerón, Rodríguez, García and García (2020).
References
Salmerón, R., Rodríguez, A., García, C.B. and García, J. The VIF and MSE in raise regression. Mathematics, 8(4), 2020.
Salmerón, R., García, C.B. and García, J. A redefined Variance Inflation Factor: overcoming the limitations of the Variance Inflation Factor. Computational Economics (2024, online), doi: https://doi.org/10.1007/s10614-024-10575-8.
Examples
data(employees)
head(employees, n=5)
y = employees[,1]
X = as.matrix(employees[,3:5])
Theorem(y, X)
Euribor data
Description
Data used in example 1 of Salmerón, García and García (2024) (subsection 4.1) on euribor data.
Usage
data("euribor")
Format
A data frame with 47 observations on the following 5 variables:
E
Euribor (dependent variable, in percentage).
cte
Intercept.
HIPC
Harmonized index of consumer prices (in percentage).
BC
Balance of payments to net current account (millions of euros).
GD
Goverment deficit to net nonfinancial accounts (millions of euros).
Details
This dataset is originally used by Salmerón, Rodríguez and García (2020).
References
Salmerón, R., Rodríguez, A. and García, C.B. Diagnosis and quantification of the non-essential collinearity. Computational Statistics, 35(2), 647-666, 2020.
Salmerón, R., García, C.B. and García, J. A redefined Variance Inflation Factor: overcoming the limitations of the Variance Inflation Factor. Computational Economics (2024, online), doi: https://doi.org/10.1007/s10614-024-10575-8.
Examples
data(euribor)
head(euribor, n=5)
y = euribor[,1]
X = as.matrix(euribor[,2:5])
Theorem(y, X)