\documentclass{ubmanual} \usepackage{amsmath,amssymb} \usepackage{hyperref} \hypersetup{colorlinks=false, pdfborder=0 0 0, pdftitle={RoCoCo - An R Package Implementing a Robust Rank Correlation Coefficient and a Corresponding Test}, pdfauthor={Ulrich Bodenhofer}} \title{{\Huge RoCoCo}\\[5mm] An R Package Implementing a \underline{Ro}bust Rank \underline{Co}rrelation \underline{Co}efficient and a Corresponding Test} \author{Ulrich Bodenhofer\affilmark{1,2} and Martin Krone\affilmark{3}} \affiliation{\affilmark{1} School of Informatics, Communication and Media\\ University of Applied Sciences Upper Austria\\ Softwarepark 11, 4232 Hagenberg, Austria\\[2mm] {\grey\affilmark{2} previously with: Institute of Bioinformatics, Johannes Kepler University\\Altenberger Str.\ 69, 4040 Linz, Austria\\[2mm] \affilmark{3} previously with: Department of Computer Science\\ Ostfalia University of Applied Sciences\\ Salzdahlumer Str.\ 46/48, 38302 Wolfenb\"uttel, Germany}} \newcommand{\RoCoCo}{\texttt{rococo}} \newcommand{\Rcpp}{\texttt{Rcpp}} \newcommand{\R}{R} \newcommand{\Real}{\mathbb{R}} \newcommand{\SUBL}{_{\mathbf{L}}} \newcommand{\SUBP}{_{\mathbf{P}}} \newcommand{\SUBM}{_{\mathbf{M}}} \newcommand{\TL}{\ensuremath{T\SUBL}} \newcommand{\TP}{\ensuremath{T\SUBP}} \newcommand{\TM}{\ensuremath{T\SUBM}} \newcommand{\CINT}[1]{\mbox{$[#1]$}} \newcommand{\OINT}[1]{\mbox{$]#1[$}} \newcommand{\COINT}[1]{\mbox{$[#1[$}} \newcommand{\OCINT}[1]{\mbox{$]#1]$}} \newcommand{\UINT}{\CINT{0,1}} \renewcommand{\vec}[1]{\mathbf{#1}} %\VignetteIndexEntry{RoCoCo - An R Package Implementing a Robust Rank Correlation Coefficient and a Corresponding Test} %\VignetteDepends{Rcpp, methods, stats, graphics, utils, datasets} %\VignetteEngine{knitr::knitr} \begin{document} <>= options(width=72) knitr::opts_knit$set(width=72) set.seed(0) library(rococo, quietly=TRUE) rococoVersion <- packageDescription("rococo")$Version rococoDateRaw <- packageDescription("rococo")$Date rococoDateYear <- as.numeric(substr(rococoDateRaw, 1, 4)) rococoDateMonth <- as.numeric(substr(rococoDateRaw, 6, 7)) rococoDateDay <- as.numeric(substr(rococoDateRaw, 9, 10)) rococoDate <- paste(month.name[rococoDateMonth], " ", rococoDateDay, ", ", rococoDateYear, sep="") @ \newcommand{\RoCoCoVer}{\Sexpr{rococoVersion}} \newcommand{\RoCoCoDate}{\Sexpr{rococoDate}} \manualtitlepage{Version \RoCoCoVer, \RoCoCoDate}{https://github.com/UBod/rococo/} \section*{Scope and Purpose of this Document} This document is a user manual for the \R\ package \RoCoCo. It is only meant as a gentle introduction into how to use the basic functions implemented in this package. Not all features of the \R\ package are described in full detail. Such details can be obtained from the documentation enclosed in the \R\ package. Further note the following: (1) this is not an introduction to robust rank correlation; (2) this is not an introduction to \R. If you lack the background for understanding this manual, you first have to read literature on these subjects. \vspace{1cm} \newlength{\auxparskip} \setlength{\auxparskip}{\parskip} \setlength{\parskip}{0pt} \tableofcontents \clearpage \setlength{\parskip}{\auxparskip} \newlength{\Nboxwidth} \setlength{\Nboxwidth}{\textwidth} \addtolength{\Nboxwidth}{-2\fboxrule} \addtolength{\Nboxwidth}{-2\fboxsep} \newcommand{\notebox}[1]{% \begin{center} \fbox{\begin{minipage}{\Nboxwidth} \noindent{\sffamily\bfseries Note:} #1 \end{minipage}} \end{center}} \section{Introduction} Correlation measures are among the most basic tools in statistical data analysis and machine learning. They are applied to pairs of observations to measure to which extent the two observations comply with a certain model. The most prominent representative is surely {\em Pearson's product moment coefficient}\/ \cite{Abdi07b,Pearson20}, often nonchalantly called {\em correlation coefficient}\/ for short. Pearson's product moment coefficient can be applied to numerical data and assumes a linear relationship as the underlying model; therefore, it can be used to detect linear relationships, but no non-linear ones. Rank correlation measures \cite{GoodmanKruskal54,Kendall62,Kruskal58} are intended to measure to which extent a monotonic function is able to model the inherent relationship between the two observables. They neither assume a specific parametric model nor specific distributions of the observables. They can be applied to ordinal data and, if some ordering relation is given, to numerical data too. Therefore, rank correlation measures are ideally suited for detecting monotonic relationships, in particular, if more specific information about the data is not available. The two most common approaches are {\em Spearman's rank correlation coefficient}\/ (short {\em Spearman's rho}) \cite{Spearman04,Spearman07} and {\em Kendall's tau (rank correlation coefficient)}\/ \cite{Abdi07,Kendall38,Kendall62}. Another simple rank correlation measure is the {\em gamma rank correlation measure}\/ according to Goodman and Kruskal \cite{GoodmanKruskal54}. The rank correlation measures cited above are designed for ordinal data. However, as argued in \cite{BodenhoferKlawonn08}, they are not ideally suited for measuring rank correlation for numerical data that are perturbed by noise. Consequently, \cite{BodenhoferKlawonn08} introduces a family of robust rank correlation measures. The idea is to replace the classical ordering of real numbers used in Goodman's and Kruskal's gamma \cite{GoodmanKruskal54} by some fuzzy ordering \cite{HoehleBlanchard85,Bodenhofer00,BodenhoferDemirci08} with smooth transitions --- thereby ensuring that the correlation measure is continuous with respect to the data. This package provides this family of rank correlation measures along with corresponding statistical tests (as introduced in \cite{BodenhoferKroneKlawonn13}). Moreover, the package also implements the Gaussian rank correlation estimator \cite{BoudtCornelissenCroux12} and a corresponding statistical test. \section{Installation} \subsection{Installation via CRAN} The \R\ package \RoCoCo\ (current version: \RoCoCoVer) is part of the {\em Comprehensive R Archive Network (CRAN)}% \footnote{\url{http://cran.r-project.org/}}. The simplest way to install the package, therefore, is to enter the following command into your \R\ session: <>= install.packages("rococo") @ If you use R on Windows or Mac OS, you can also conveniently use the package installation menu of your R GUI. \subsection{Manual installation from source} Under special circumstances, e.g. if you want to compile the C++ code included in the package with some custom options, you may prefer to install the package manually from source. To this end, open the package's page at CRAN% \footnote{\url{http://cran.r-project.org/web/packages/rococo/index.html}} and then proceed as follows: \begin{enumerate} \item Download \texttt{rococo\_\RoCoCoVer.tar.gz} and save it to your harddisk. \item Open a shell/terminal/command prompt window and change to the directory where you put {\ttfamily rococo\_\RoCoCoVer.tar.gz}. Enter \begin{quote} \ttfamily R CMD INSTALL rococo\_\RoCoCoVer.tar.gz \end{quote} to install the package. \end{enumerate} Note that this might require additional software on some platforms. Windows requires Rtools\footnote{\url{http://cran.r-project.org/bin/windows/Rtools/}} to be installed and to be available in the default search path (environment variable \verb+PATH+). Mac OS X requires Xcode developer tools% \footnote{\url{https://developer.apple.com/technologies/tools/}} (make sure that you have the command line tools installed with Xcode). \subsection{Compatibility issues} All versions downloadable from CRAN have been built using the latest version, \R\ \Sexpr{R.version$major}.\Sexpr{R.version$minor}. However, the package should work without severe problems on \R\ versions $\geq$3.0.0. \section{Getting Started} To load the package, enter the following in your \R\ session: <>= library(rococo) @ You will probably see a message that package \Rcpp\ has been loaded (in case it has not been loaded previously in the current \R\ session). Apart from this, you can be sure that the package has been installed successfully if this command terminates without any further error message or warning. If so, the package is ready for use now. The package includes both a user manual (this document) and a reference manual (help pages for each function). To view the user manual, enter <>= vignette("rococo") @ Help pages can be viewed using the \verb+help+ command. It is recommended to start with <>= help(rococo) @ For demonstration purposes, let us first create an artificial toy data set: \begin{center} <>= x1 <- rnorm(15) y1 <- 2 * x1 + rnorm(length(x1), sd=0.25) plot(x1, y1) @ \end{center} Obviously, these are linearly correlated Gaussian data, so Pearson's product moment correlation coefficient \cite{Abdi07b,Pearson20} would be the optimal choice. We use these data anyway, only for illustration purposes. The function \texttt{rococo()} can be used to compute the robust rank correlation coefficient as follows: <>= rococo(x1, y1) @ To perform a robust rank correlation test, use the function \texttt{rococo.test}: <>= rococo.test(x1, y1, alternative="two.sided") @ The argument \texttt{alternative} works in the same way as for the standard function \texttt{cor.test()}. The function \texttt{rococo.test()} is a generic method that can be called on two numeric vectors (as above) or, alternatively, using a formula to conveniently extract two columns from a data frame: \begin{center} <>= data(iris) plot(~ Sepal.Length + Petal.Length, iris) rococo.test(~ Sepal.Length + Petal.Length, iris, alternative="two.sided") @ \end{center} All examples above use default settings for the fuzzy orderings that are used to define the rank correlation coefficient. In the following section, we introduce the concept behind the robust gamma rank correlation coefficient in greater depth and describe how to adjust the corresponding settings properly. \section{Adjusting Similarities and t-Norms}\label{sec:simt} \subsection{Background} The robust gamma rank correlation coefficient requires the definition of two {\em strict fuzzy orderings} \cite{BodenhoferDemirci08}, $R_X$ and $R_Y$. A strict fuzzy ordering is a two-place function that measures to which degree its second argument is strictly larger than its first argument. Given a data set consisting of $n$ pairs of observations (where $n\geq 2$) \begin{equation}\label{eq:pairs} (x_i,y_i)_{i=1}^n, \end{equation} $R_X$ is used for comparing $x$ observations and $R_Y$ is used for comparing $y$ observations. Given a data set as in Eq.\ \eqref{eq:pairs}, the strict fuzzy orderings $R_X$ and $R_Y$ are used to compute two important numbers, the {\em number of concordant pairs $\tilde C$} and the number of {\em discordant pairs $\tilde D$}: \begin{align*} \tilde C &= \sum_{i=1}^n\sum_{j\not=i} \bar T(R_X(x_i,x_j),R_Y(y_i,y_j))\\ \tilde D &= \sum_{i=1}^n\sum_{j\not=i} \bar T(R_X(x_i,x_j),R_Y(y_j,y_i)) \end{align*} The function $\bar T$ is a triangular (t-norm) \cite{KlementMesiarPap00} that is used to aggregate degrees of relationships between pairs of $x$ observations and the corresponding degrees for $y$ observations (see below). The final {\em robust gamma rank correlation coefficient} is then computed as \[ \tilde\gamma=\frac{\tilde C-\tilde D}{\tilde C+\tilde D} \] in perfect analogy to Goodman's and Kruskal's gamma \cite{GoodmanKruskal54}. \subsection{Choosing the Family of Similarities} It should be clear from the description above that the robust gamma rank correlation coefficient requires the following ingredients: \begin{enumerate} \item A fuzzy ordering $R_X$ for the $x$ observations \item A fuzzy ordering $R_Y$ for the $y$ observations \item A t-norm $\bar T$ for aggregation. \end{enumerate} For $R_X$ and $R_Y$, the \RoCoCo\ package provides five possible choices which are identified by the {\em similarity} that is used to define the strict fuzzy ordering (for more details see \cite{BodenhoferDemirci08,BodenhoferKlawonn08}). Table \ref{tab:sim} provides an overview. \begin{table} \caption{Overview of strict fuzzy orderings implemented in the \RoCoCo\ package:}\label{tab:sim} \begin{center}\footnotesize \begin{tabular}{|c|c|c|} \hline \bf Setting & \bf Similarity & \bf Strict fuzzy ordering \\ \hline \hline \rule[-3mm]{0pt}{9mm}\texttt{"linear"} & $E(x,x')=\max(0,1-\frac{1}{r}|x-x'|)$ & $R(x, x')=\max(0,\min(1,\frac{1}{r}(x'-x)))$ \\ \hline \rule[-3mm]{0pt}{9mm}\texttt{"exp"} & $E(x,x')=\exp(-\frac{1}{r}|x-x'|)$ & $R(x, x')=\max(0,1-\exp(-\frac{1}{r}(x'-x)))$ \\ \hline \rule[-6mm]{0pt}{15mm}\texttt{"gauss"} & $E(x,x')=\exp(-\frac{1}{2r^2}(x-x')^2)$ & $R(x, x')=\begin{cases} 1-\exp(-\frac{1}{2r^2}(x-x')^2) & \text{if } x\leq x'\\ 0 & \text{otherwise} \end{cases}$\\ \hline \rule[-6mm]{0pt}{15mm}\texttt{"epstol"} & $E(x,x')=\begin{cases} 1 & \text{if } |x-x'|\leq r \\ 0 & \text{otherwise} \end{cases}$ & $R(x, x')=\begin{cases} 1 & \text{if } x'>x+r \\ 0 & \text{otherwise} \end{cases}$\\ \hline \rule[-6mm]{0pt}{15mm}\texttt{"classical"} & $E(x,x')=\begin{cases} 1 & \text{if } x=x' \\ 0 & \text{otherwise} \end{cases}$ & $R(x, x')=\begin{cases} 1 & \text{if } x'>x \\ 0 & \text{otherwise} \end{cases}$\\ \hline \end{tabular} \end{center} \end{table} Obviously, in all five cases, the strict fuzzy ordering $R$ is computed from the similarity $E$ in the following way: \[ R(x,x')=\begin{cases} 1-E(x,x') & \text{if } x' > x \\ 0 & \text{otherwise} \end{cases} \] Further note that the choices \texttt{"epstol"} and \texttt{"classical"} are not continuous. The former is the well-known $\varepsilon$-intolerant similarity that has been discussed widely in the context of the Poincar\'e paradox (note that this similarity is not a transitive relation). The latter setting \texttt{"classical"} corresponds to the classical Goodman and Kruskal gamma. The functions \texttt{rococo()} and \texttt{rococo.test()} choose the first variant \texttt{"linear"} by default, both for $x$ and $y$ observations. If one wants to choose a different setting or two different strict fuzzy orderings for $x$ and $y$ observations, the \texttt{similarity} argument can be used. For demonstrating this, we create a noisy data set from a function $f$ that has a large flat area: \begin{center} <>= x2 <- rnorm(15) f2 <- function(x) ifelse(x > 0.8, x - 0.8, ifelse(x < -0.8, x + 0.8, 0)) y2 <- f2(x2) + rnorm(length(x2), sd=0.1) plot(x2, y2) @ \end{center} As said before, \texttt{"linear"} is the default choice: <>= rococo.test(x2, y2, alternative="greater") @ This default setting makes sense in particular if no information about the data, their distribution, or not even the noise distribution is known. Now let us try some different settings: <>= rococo.test(x2, y2, similarity="gauss", alternative="greater") rococo.test(x2, y2, similarity=c("classical", "gauss"), alternative="greater") @ Particularly note the latter of the two examples: different settings for $x$ and $y$ observations can be done by supplying a vector to the \texttt{similarity} argument, where the first element determines the choice for $x$ observations and the second element determines the choice for $y$ observations. \subsection{Parametrizing Similarities} So far, we have neglected that four of the five similarities/fuzzy orderings listed in Table \ref{tab:sim} require an additional parameter $r$. In all four cases, $r$ controls the importance of observations that are close to each other. The smaller $r$, the more similar observations are taken into account when computing the numbers of concordant and discordant pairs. This entails that, the smaller $r$, the easier noise can corrupt the result. The larger $r$, the less similar observations are considered, i.e.\ the more noise-tolerant the result will be. However, that does not mean that the largest possible $r$ is the best choice. An overly large $r$ can result in unspecific and unsignificant results. The \RoCoCo\ package allows for setting one $r$ for both $x$ and $y$ observations: <>= rococo.test(x2, y2, similarity="gauss", r=0.1, alternative="greater") @ It is also possible to specify different values of $r$ for $x$ and $y$ observations. Analogously to the parameter \texttt{similarity}, this can be done by supplying a two-element vector to the parameter \texttt{r}: <>= rococo.test(x2, y2, similarity=c("linear", "gauss"), r=c(0.05, 0.1), alternative="greater") @ It should be clear from the formulas in Table \ref{tab:sim} that $r=0$ is either invalid or does not make sense. The \RoCoCo\ package still admits choosing a zero value. In this case, \RoCoCo\ makes an {\em automatic adjustment} of $r$ to 10\% of the interquartile range (the difference between the 75\% and the 25\% quantile) of the observation under consideration: <>= rococo.test(x2, y2, similarity=c("linear", "gauss"), r=0, alternative="greater") IQR(x2) * 0.1 IQR(y2) * 0.1 @ In the results above, note the values \texttt{rx} and \texttt{ry} in the output of \texttt{rococo.test()}. These are the values that have been specified or that have been determined automatically. The computations of 10\% of the interquartile range should demonstrate that this is what takes place when $r$ is set to 0. Note that this is also done by default if the argument \texttt{r} is not specified. \subsection{Choosing the t-Norm for Aggregation} As mentioned above, the robust gamma rank correlation coefficient further requires to specify a t-norm $\bar T$ that is used for aggregation of the ordering measures from $x$ and $y$ observations. The \RoCoCo\ package offers three built-in t-norms that can be selected by specifying the \texttt{tnorm} argument when calling the functions \texttt{rococo()} and \texttt{rococo.test()}. Table \ref{tab:tnorm} provides an overview. Here is an example that uses the product t-norm for aggregation: <>= rococo.test(x2, y2, similarity=c("linear", "gauss"), tnorm="prod", alternative="greater") @ \begin{table} \caption{Overview of strict fuzzy orderings implemented in the \RoCoCo\ package:}\label{tab:tnorm} \begin{center} \begin{tabular}{|c|c|} \hline \bf Setting & \bf t-Norm \\ \hline \hline \rule[-3mm]{0pt}{9mm}\texttt{"min"} (default) & $\TM(x,y)=\min(x,y)$ \\ \hline \rule[-3mm]{0pt}{9mm}\texttt{"prod"} & $\TP(x,y)=x\cdot y$\\ \hline \rule[-3mm]{0pt}{9mm}\texttt{"lukasiewicz"} & $\TL(x,y)=\max(0,x+y-1)$ \\ \hline \end{tabular} \end{center} \end{table} The choice of the t-norm $\bar T$ is not particularly critical and should not have a strong influence on the significance of results. Most users will suffice with the default setting \texttt{"min"}. The choice \texttt{"prod"} produces similar, but slightly smoother results which may be suitable in combination with the only differentiable fuzzy ordering (\texttt{similarity="gauss"}). Even though the choice is not critical, the \RoCoCo\ package also offers the choice of user-defined t-norms by supplying a two-place function as \texttt{tnorm} argument. Here is an example with some Yager t-norm \cite{KlementMesiarPap00,Yager80b}: <>= DrastictNorm <- function(x, y) { if (x == 1) y else if (y == 1) x else 0 } YagertNorm <- function(lambda) { fun <- function(x, y) { if (lambda == 0) DrastictNorm(x, y) else if (is.infinite(lambda)) min(x, y) else max(0, 1 - ((1 - x)^lambda + (1 - y)^lambda)^(1 / lambda)) } attr(fun, "name") <- paste("Yager t-norm with lambda =", lambda) fun } rococo(x2, y2, tnorm=YagertNorm(0.5)) rococo.test(x2, y2, tnorm=YagertNorm(0.2)) @ Note that the \RoCoCo\ package performs only a few basic checks on a user-defined t-norm. It remains the duty of the user to ensure that the function is actually a valid t-norm. The three built-in t-norms (see Table\ \ref{tab:tnorm}) are efficiently implemented in C++ and called with the help of the \Rcpp\ package \cite{EddelbuettelFrancois11} while user-defined t-norms have to be evaluated in \R\ loops. Even though we use the \texttt{compiler} package (if available) to pre-compile the user-defined t-norm, this may still result in a drastic slowdown of computations, particularly for larger data sets or when using \texttt{rococo.test()} with \texttt{exact=TRUE} (see end of next section). \section{A Note on Permutation Testing} Classical rank correlation measures only depend on the sorting of $x$ and $y$ observations. Thus, for a given number of samples $n$, one can deduce the distribution of the test statistic under the $\mathcal{H}_0$ hypothesis as a simple function of the number of samples $n$. We neither want to make any prior assumption about the distribution of data nor does the complex inner structure of the robust gamma rank correlation coefficient and the variety of possible parameter settings allow for an analytic deduction of the test statistic's distribution. Therefore, we use {\em permutation testing} for estimating the null distribution of the test statistic under the assumption of independence. This is done in the following way: For a given data set $(x_i,y_i)_{i=1}^n$, we first compute the robust rank correlation coefficient according to the specified parameters. Then we create $K$ random shuffles $(y'_i)_{i=1}^n$ of $(y_i)_{i=1}^n$ and compute the robust rank correlation coefficient for $(x_i,y'_i)_{i=1}^n$ according to the specified parameters. Due to the shuffling, $(x_i)_{i=1}^n$ and $(y'_i)_{i=1}^n$ are independent, where $(y'_i)_{i=1}^n$ has the same marginal distribution as $(y_i)_{i=1}^n$. Thus, the robust rank correlation coefficients obtained for a sufficiently large number of shuffles allows for estimating the distribution of the test statistics under the $\mathcal{H}_0$ hypothesis with given marginal distributions and we can estimate the test's $p$-value in the following simple way: \begin{itemize} \item as the relative frequency the test statistics' absolute value exceeded the absolute value of the test statistic for the unshuffled data in case we perform a two-sided test, \item as the relative frequency the test statistic was greater than the test statistic for the unshuffled data in case we perform a one-sided test with alternative hypothesis that there is a positive correlation, and \item as the relative frequency the test statistic was less than the test statistic for the unshuffled data in case we perform a one-sided test with alternative hypothesis that there is a negative correlation. \end{itemize} This $p$-value is contained in the slot \verb+p.value+ of the output object the test returns. This output object further contains a slot \verb+count+ that corresponds to the absolute number of times \begin{itemize} \item the test statistics' absolute value exceeded the absolute value of the test statistic for the unshuffled data in case we perform a two-sided test, \item the test statistic was greater than the test statistic for the unshuffled data in case we perform a one-sided test with alternative hypothesis that there is a positive correlation, and \item the test statistic was less than the test statistic for the unshuffled data in case we perform a one-sided test with alternative hypothesis that there is a negative correlation. \end{itemize} Please note that these $p$-values are only estimates. The smaller the number of shuffles, the higher the variance of the estimates. This number of shuffles performed by \texttt{rococo.test()} is controlled by the parameter \texttt{numtests} (the default is 1000). We concur that 1000 samples are sufficient (at least with high probability) if one wants to test whether the association is significant with a significance threshold of 95\% or 99\%. If the user, however, is interested in much more exact estimates of the $p$-value or if he/she wants to test against much more stringest significance thresholds, it may be necessary to perform much higher numbers of shuffles. Needless to mention, this will also result in an increase of computation times, where the computation time grows linearly with the number of shuffles. Here is an example with 100,000 shuffles, where we use the option \texttt{storeValue=TRUE} to get access to the rank correlation values for all random shuffles: <>= res <- rococo.test(x2, y2, numtests=100000, storeValues=TRUE) res @ For small data sets (not more than 10 samples), the package further provides computations of {\em exact $p$-values}. This can be enforced by passing the argument \texttt{exact=TRUE} to \texttt{rococo.test()}. In this case, {\em all possible permutations} are considered (using the Steinhaus-Johnson-Trotter algorithm; see, e.g., \cite{Sedgewick77}) and the exact $p$-value is computed as the quotient of the above count (slot \texttt{count}) over the number of trials (the factorial of the length of \texttt{x} and \texttt{y}): <>= rococo.test(x2[1:8], y2[1:8], exact=TRUE) @ Using the exact test for 10 samples results in $10!=\Sexpr{factorial(10)}$ permutations that have to be considered. For the built-in t-norms, such a computation should finish within a few seconds. However, in conjunction with a user-defined t-norm, computations may be significantly longer, in the range of several minutes. Kendall's rank correlation test is based on a normal approximation of the null distribution. The following picture suggests that a normal approximation may also be suitable for the robust rank correlation test (data from the above example with 100,000 shuffles): \begin{center} <>= hist(res@perm.gamma, breaks=100, probability=TRUE, xlab="gamma", main="Distribution of gamma for random shuffles") plot(function(x) dnorm(x, mean=res@H0gamma.mu, sd=res@H0gamma.sd), min(res@perm.gamma), max(res@perm.gamma), col="red", lwd=2, add=TRUE) @ \end{center} However, statistical analyses have shown that the null distributions are seldom normal. Anyway, the \RoCoCo\ package also provides $p$-values based on the normal approximation in the slot \verb+p.value.approx+ of the resulting output object: <>= res@p.value.approx @ The user should be warned, however, that these $p$-values are systematically biased and that he/she is recommended to use the standard $p$-values as described above (for a sufficiently large number of shuffles). \section{The Gaussian Rank Correlation Estimator} The Gaussian rank correlation estimator \cite{BoudtCornelissenCroux12} is a simple and well-performing alternative to our proposed robust gamma rank correlation coefficient. Since, to our best knowledge, no R implementation has been available so far, we implemented this rank correlation measure and the corresponding test in the \RoCoCo\ package. Example: <>= gauss.cor(x1, y1) gauss.cor.test(x1, y1, alternative="two.sided") gauss.cor.test(~ Sepal.Length + Petal.Length, iris, alternative="two.sided") @ \section{How to Cite This Package} If you use this package for research that is published later, you are kindly asked to cite it as follows: \begin{quotation} \noindent U. Bodenhofer, M. Krone, and F. Klawonn (2013). Testing noisy numerical data for monotonic association. {\em Inform. Sci.} {\bf 245}:21--37. DOI: 10.1016/j.ins.2012.11.026.\\[5mm] \noindent U.\ Bodenhofer and F.\ Klawonn (2008). Robust rank correlation coefficients on the basis of fuzzy orderings: initial steps. {\em Mathware Soft Comput.} {\bf 15}(1):5--20. \end{quotation} To obtain a Bib\TeX\ entry of the reference, you can enter the following into your R session: <>= toBibtex(citation("rococo")) @ \section*{Acknowledgment} The core of this package was implemented during a short-term scientific mission of Martin Krone at the Institute of Bioinformatics, Johannes Kepler University, within the framework of {\em COST Action IC0702 ``SoftStat --- Combining Soft Computing Techniques and Statistical Methods to Improve Data Analysis Solutions''}. Therefore, the support of this project is gratefully acknowledged. %\bibliographystyle{plain} %\bibliography{BodenhoferPub,Bioinformatics,MathematicsMisc,% %ConnectivesOperators,FuzzySetsRelations,ComputerScienceMisc,Statistics} \begin{thebibliography}{10} \bibitem{Abdi07b} H.~Abdi. \newblock Coefficients of correlation, alienation and determination. \newblock In N.~J. Salkind, editor, {\em Encyclopedia of Measurement and Statistics}. Sage, Thousand Oaks, CA, 2007. \bibitem{Abdi07} H.~Abdi. \newblock The {Kendall} rank correlation coefficient. \newblock In N.~J. Salkind, editor, {\em Encyclopedia of Measurement and Statistics}. Sage, Thousand Oaks, CA, 2007. \bibitem{Bodenhofer00} U.~Bodenhofer. \newblock A similarity-based generalization of fuzzy orderings preserving the classical axioms. \newblock {\em Internat. J. Uncertain. Fuzziness Knowledge-Based Systems}, 8(5):593--610, 2000. \bibitem{BodenhoferDemirci08} U.~Bodenhofer and M.~Demirci. \newblock Strict fuzzy orderings with a given context of similarity. \newblock {\em Internat. J. Uncertain. Fuzziness Knowledge-Based Systems}, 16(2):147--178, 2008. \bibitem{BodenhoferKlawonn08} U.~Bodenhofer and F.~Klawonn. \newblock Robust rank correlation coefficients on the basis of fuzzy orderings: initial steps. \newblock {\em Mathware Soft Comput.}, 15(1):5--20, 2008. \bibitem{BodenhoferKroneKlawonn13} U.~Bodenhofer, M.~Krone, and F.~Klawonn. \newblock Testing noisy numerical data for monotonic association. \newblock {\em Inform. Sci.}, 245:21--37, 2013. \bibitem{BoudtCornelissenCroux12} K.~Boudt, J.~Cornelissen, and C.~Croux. \newblock The {G}aussian rank correlation estimator: robustness properties. \newblock {\em Stat. Comput.}, 22(2):471--483, 2012. \bibitem{EddelbuettelFrancois11} D.~Eddelbuettel and R.~Fran\c{c}ois. \newblock Rcpp: seamless {R} and {C++} integration. \newblock {\em J. Stat. Softw.}, 40(8):1--18, 2011. \bibitem{GoodmanKruskal54} L.~A. Goodman and W.~H. Kruskal. \newblock Measures of association for cross classifications. \newblock {\em J. Amer. Statist. Assoc.}, 49(268):732--764, 1954. \bibitem{HoehleBlanchard85} U.~H\"ohle and N.~Blanchard. \newblock Partial ordering in {$L$}-un\-der\-de\-ter\-mi\-nate sets. \newblock {\em Inform. Sci.}, 35:133--144, 1985. \bibitem{Kendall38} M.~G. Kendall. \newblock A new measure of rank correlation. \newblock {\em Biometrika}, 30:81--93, 1938. \bibitem{Kendall62} M.~G. Kendall. \newblock {\em Rank Correlation Methods}. \newblock Charles Griffin \&\ Co., London, third edition, 1962. \bibitem{KlementMesiarPap00} E.~P. Klement, R.~Mesiar, and E.~Pap. \newblock {\em Triangular Norms}, volume~8 of {\em Trends in Logic}. \newblock Kluwer Academic Publishers, Dordrecht, 2000. \bibitem{Kruskal58} W.~H. Kruskal. \newblock Ordinal measures of association. \newblock {\em J. Amer. Statist. Assoc.}, 53(284):814--861, 1958. \bibitem{Pearson20} K.~Pearson. \newblock Notes on the history of correlation. \newblock {\em Biometrika}, 13:25--45, 1920. \bibitem{Sedgewick77} R.~Sedgewick. \newblock Permutation generation methods. \newblock {\em ACM Comput. Surv.}, 9(2):137--164, 1977. \bibitem{Spearman04} C.~Spearman. \newblock The proof and measurement of association between two things. \newblock {\em Am. J. Psychol.}, 15(1):72--101, 1904. \bibitem{Spearman07} C.~Spearman. \newblock Demonstration of formulae for true measurement of correlation. \newblock {\em Am. J. Psychol.}, 18(2):161--169, 1907. \bibitem{Yager80b} R.~R. Yager. \newblock On a general class of fuzzy connectives. \newblock {\em Fuzzy Sets and Systems}, 4:235--242, 1980. \end{thebibliography} \end{document}