% $Id: random-intro.Rnw,v 1.2 2007/04/26 18:37:01 edd Exp $ \documentclass[11pt]{article} \usepackage{vmargin,url,hyperref,booktabs,graphicx} \usepackage[authoryear,round]{natbib} \setpapersize{USletter} % use USletter paper \setmarginsrb{1in}{1in}{1in}{1in}{0pt}{0pt}{\footheight}{\footskip} \hypersetup{ hyperindex,% colorlinks,% linktocpage,% plainpages=false,% linkcolor=DarkBlue,% citecolor=DarkBlue,% urlcolor=DarkBlue,% pdfstartview=Fit,% pdfview={XYZ null null null}% } \RequirePackage{color} \definecolor{Red}{rgb}{0.7,0,0} \definecolor{Blue}{rgb}{0,0,0.8} \definecolor{DarkBlue}{rgb}{0.1,0.1,0.4} \definecolor{DarkGrey}{rgb}{0.15,0.15,0.15} %% from jss.cls \let\code=\texttt \let\proglang=\textsf \newcommand{\pkg}[1]{{\normalfont\fontseries{b}\selectfont #1}} %% shortcuts \newcommand{\randomorg}{\href{http://random.org}{random.org}} \begin{document} \SweaveOpts{engine=R,eps=FALSE} %\VignetteIndexEntry{random: An R package for true random numbers} %\VignetteDepends{random} %\VignetteKeywords{true random numbers, R} %\VignettePackage{random} <>= library(random) options(SweaveHooks=list(twofig=function() {par(mfrow=c(1,2))}, twofig2=function() {par(mfrow=c(2,1))}, onefig=function() {par(mfrow=c(1,1))})) @ \title{\pkg{random}: An \proglang{R} package for true random numbers} \author{\href{http://dirk.eddelbuettel.com}{Dirk Eddelbuettel}} \date{August 2006} \maketitle \section{Introduction} Simulation techniques are a core component of scientific computing and, more specifically, computational statistics. All simulation methods---Monte Carlo methods, bootstrapping, estimation by simulation to name but a few---rely on `computer-generated randomness' (more on this below). In practice, this means sequences of random numbers. Generating `good' (for a suitable metric) random numbers is therefore of critical importance for scientific software environments. As a concrete example, consider that the current version of the \proglang{R} environment \citep{R:2006} contains no fewer than six random number generation algorithms all of which are available to the user at run-time, and refers to a total of twelve publications as further references. Deterministically generated random numbers---i.e.~numbers obtained by means of executing a computer program that implements a particular algorithm---are commonly referred to as \textsl{pseudo random numbers}. The adjective pseudo stresses the implicit contradiction between the deterministic manner in which `randomness' is being produced. The algorithms and computer programs that produce these random number are generally referred to as pseudo-random number generators, or PRNGs for short. An distinction is sometimes made between pseudo- and quasi-random number generators (QRNGs). The latter are preferable for certain simluation studies due to better convergence properties, and which leads these methods also being called low-discrepancy sequences. For our purposes, the distinction between QRNGs and PRNGs is not as important, and we will use the term PNG to refer to both types. % The % Wikipedia has a few good articles at % \url{http://en.wikipedia.org/wiki/Pseudorandomness}, % \url{http://en.wikipedia.org/wiki/Low-discrepancy_sequence}, % \url{http://en.wikipedia.org/wiki/Pseudorandom_number_generator}, % \url{http://en.wikipedia.org/wiki/Cryptographically_secure_pseudorandom_number_generator}.} The Comprehensive R Archive Network (CRAN) contains numerous packages that provide additional random number generators for \proglang{R}. The \pkg{gsl} package \citep{gsl:R:2005} provides the battery of generators for pseudo- and quasi-random number generators from the GNU Scientific Library, or GSL \citep{gsl:2006}. The \pkg{randaes} package \citep{randaes:2005} provides a cryptographically secure RNG based on Ferguson and Schneier's Fortuna generator. Package \pkg{SuppDists} \citep{SuppDists:2006} provides another RNG. Quasi-random number generators are provided by the Rmetrics component package \pkg{fOptions} \citep{fOptions:2006}. Lastly, RNGs suitable for distributed computing are available via \pkg{rsprng} \citep{rsprng:2006} and \pkg{rstream} \citep{rstream:2006}. All of these packages have one major commonality: they produce pseudo random numbers determined by `software': an algorithm implemented as executable computer code. As such, given the initial conditions of the algorithm and implementation, typically one or more so-called numbers acting as `seeds', parts or the entire sequence can be replicated. In other words, these software-generated random numbers are entirely determinisitic. In many cases, this is in fact a desirable property as it provides replicability, a key feature for reproducing scientific results. On the other hand, so-called \textsl{true} random numbers require actual unpredictability arising from genuinely physical processes such as radioactive decay, photon emissions or atmospheric noise. An overview of hardware random number generators is provided by \cite{wp:hardware_rng}. \cite{Davies:2000} discusses testing hardware generators. There are a number of situations in which it is desirable to use \textsl{non-deterministically} determined random numbers. Examples include \begin{itemize} \item to seed distributed computing on different nodes with truly indepedent seeds; \item to obtain portable initializations for RNGs that do not depend on particular operating system or hardware features; \item to validate simulation results using non-deterministic random numbers; \item to provide indeterministic seeds used for lottery drawings or games: imagine playing \proglang{R}'s \pkg{sudoku} \citep{sudoku:2006} seeded by true randomness! \end{itemize} Until now, obtaining non-deterministic random numbers was not (easily) possible in R.\footnote{Under Linux, access to \texttt{/dev/random} provides hardware entropy, but only in a non-portable manner.} The \pkg{random} package aims to fill this gap by offering to supply random integers with possible replacement, random integer sequences without replacement, and random bytes from the \randomorg\ \citep{Haahr:1999,Haahr:2006} web service. The rest of this paper is organised as followes. The next section introduces \randomorg. This is followed by a section on testing the properties of the generator. Next, we disucss the implementation of the \pkg{random} package, and provide a simple example. Potential problems are briefly discussed in the following section, before a summary concludes. \section{About random.org} As discussed above, generating random number via computer programs is critically important for simulation methodologies. Yet at the same time, it is a contradiction in terms as there is nothing `random' about generating sequences in a deterministic fashion. Given an initial condition, the entire sequence can (at least in therory) be predicted. An alternative to these deterministic software generators are truly non-deterministic hardware based generators. One of the longer-running freely available services is the \randomorg\ service created by Mads Haahr \citep{Haahr:2006}, and described briefly, along with some motivation and background, in \cite{Haahr:1999}.\footnote{This esay is also included in the \pkg{random} package as the vignette \textsl{random-essay}.} According to \cite{Haahr:1999}, the original motivation for \randomorg\ was to generate random numbers for the prototype of an on-line gambling site. The basic idea is to gather atmospheric noise via a radio-receiver card tuned to an unused frequency and connected to a computer where it is sampled and digitized. A skew-correction algorithm, originally developped by John von Neumann in the 1940s and discussed by \cite{Davies:2000}, is then applied to the bit stream. The \randomorg\ site offers random numbers in several different formats: as integer random numbers with possible replication (similar to rolling die), as re-arranged integer sequences (without replication and similar to random reshuffling of a given set) and as random bits in several common formats (hexadecimal, decimal, octal and binary). Lastly, a covenience function is offered to query the state of the generator. It is recommended to delay requests for larger sets of random numbers if the generator is indicating less than 20\% capacity. These random numbers can be retrieved using three different interfaces. The oldest, and original, interface is a standard CGI script accessible via the HTTP protocal used by webservers and browsers. The second is using the (somewhat more complex) Corba networking protocal and infrastructure, and the third is using SOAP, an exchange of files in XML markup that also uses HTTP as a transport. Use of the \randomorg\ service, as reported on the site, is varied. It covers actual lotteries (both for outright prices as well for assignments of, say, students to classrooms), games, arts as well as conventional programming with random numbers. It is not immediately apparent which of these uses are truly improved due to their use of a non-deterministic random number generator. \section{Testing random.org data} Given the critical importance of random numbers for scientific computing via simulation methods, fairly rigorous approaches for measuring and ensuring the quality of random numbers have been developed \citep{Davies:2000}. One of the better-known test suites is the so-called \pkg{diehard} collection of tests by \cite{Marsaglia:1995}, which has been re-implemented and extended by \cite{Brown:dieharder:2006} under the name \pkg{dieharder}. The basic idea of testing a random number generator is as follows (and this follows the discussion by \cite{Brown:dieharder:2006} in the extended documentation by of \pkg{dieharder}). For a given suite of $n$ uniform random numbers \begin{displaymath} u_n = u_1, u_2, \ldots \quad \mbox{where} \quad u_n \sim U(l,h) \end{displaymath} that are distributed uniformly between the bounds $l$ and $h$, we have analytical results for the expected mean and variance given by \begin{displaymath} E(u) = \frac{h - l}{2} \quad \end{displaymath} and \begin{displaymath} V(u) = \frac{(h - l)^2}{12} . \end{displaymath} For a sample of uniformly distributed random numbers from a random number generator we wish to test, as well as knowledge of the theoretical sample mean and variance, we can undertake a classic hypothesis test to compute a marginal probability (or $p$-value) for obtaining the given sample by chance from a (hypothetical) perfect generator. Any one such $p$-value from a single sample is likely to be non-informative. However, we can repeat the test for series of sets of random generators. This provides us with a series of associated $p$-values that satisfy the common assumptions of independent and identically distributed statistics. Now, under the null of a perfect random generator, this series of $p$-values ought to be uniformly distributed over the range of possible probabilities between zero and one. This property is itself testable, using for example using a Kolmogorov-Smirnov or Anderson-Darling test. Test suites such as diehard and dieharder provide a variety of different tests each of which applies this basic principle of going from one $p$-value from a testable expression to testing a suite of such $p$-values from one particular test against the uniform distribution. Dieharder provides 24 tests in its current version 1.4.24. Besides the original diehard test, the set comprises tests from NIST as well as new tests, see \cite{Brown:dieharder:2006}. The aggregate number of random numbers required to run all these tests is considerable. In fact, it is much, much larger than what \randomorg\ can provide as a live service (leading to the author's IP address being blacklisted and blocked by \randomorg\ for a few days). In order to test the generator, we had to rely on a set of fifteen pre-generated binary files of 10 megabytes size each, which we combined into one large file of 150 megabytes. Given the standard (four byte) size of an unsigned integer, and the adjustment factor of $2^{10}/10^3$ stemming from the colloquial `kilobyte' being 1024 bytes, the dieharder tests are provided with $150/4 \times 1024^2 \times 10^{-6} = 39,321,600$ unsigned integers. The output log provided by the program reports that the binary input file was rewound a total of 297 times. It follows that a lower bound for the total number of random numbers consumed by dieharder in testing the \randomorg\ generator is $39.3 \times 10^6 \times 297$ or approximately 11.6 billion random numbers. Table \ref{table1} summarizes the results of running version 1.4.24 of dieharder over the binary data retrieved from \randomorg. With the caveat that we do not actually know whether these were in fact generated by the \randomorg\ generator, we note that the generator passes a large majority of tests. Diehard, and by extension dieharder, are known as rigorous test suites, and passing a large majority of their component tests is a satisfying outcome.\footnote{We should note that at least one of the tests is still considered experimental, and that, strictly speaking, the test suite itself has not been validated.} \begin{table}[htb] \begin{center} \begin{tabular}{ll} \toprule Test & Outcome \\ \midrule RGB Timing Test & 4.45 million rands/second\\ RGB Bit Persistence Test & Passed at 5\% \\ RGB Bit Distributioen Test & Passed at 5\% for n=1,2,3,4,5 \\ & Failed at < 0.01\% for n=6 \\ Diehard Birthdays test (mod.) & Passed at 5\% \\ Diehard Overlapping 5-Permutations & Failed at < 0.01\% \\ Diehard 32x32 Binary Rank Test & Failed at < 0.01\% \\ Diehard 6x8 Binary Rank Test & Passed at 5\% \\ Diehard Bitstream Test & Failed at < 0.01\% \\ Diehard Overlapping Pairs (OPSO) & Passed at 5\% \\ Diehard Overlapping Quadruples (OQSO) & Passed at 5\% \\ Diehard DNA Test & Passed at 5\% \\ Diehard Count the 1s (stream) (mod.) & Passed at 5\% \\ Diehard Count the 1s (byte) (mod.) & Passed at 5\% \\ Diehard Parking Lot Test (mod.) & Passed at 5\% \\ Diehard Minimum Distance Test & Passed at 5\% \\ Diehard 3d Sphere Test & Passed at 5\% \\ Diehard Squeeze Test & Possibly weak \\ Diehard Sums Test (up) & Passed at 5\% \\ Diehard Sums Test (down) & Passed at 5\% \\ Diehard Craps Test (mean) & Passed at 5\% \\ Diehard Craps Test (freq) & Passed at 5\% \\ Marsaglia and Tsang GCD Test & Failed at < 0.01\% \\ Marsaglia and Tsang Gorilla Test (preli& Passed at 5\% \\ STS Monobit Test & Passed at 5\% \\ STS Runs Test & Passed at 5\% \\ Dieharder User Example Lagged Sums & Passed at 5\% \\ \bottomrule \end{tabular} \caption{Results of \pkg{dieharder} for \texttt{random.org}} \label{table1} \end{center} \end{table} \cite{Brown:emails:2006} provides additional information on some of the failed tests. The `RGB Bit Distributioen Test' in known to fail at $n=6$ for every generator. The `Diehard Overlapping 5-Permutations' may be prone to type I errors so the failure may be disregarded. On the other hand, the `Diehard 32x32 Binary Rank Test' result is a genuine test failure: better generators pass this test. The `Diehard Bitstream Test' is also somewhat doubtful as known `good' generators can fail this test yet pass the related OPSO test. `Marsaglia and Tsang GCD Test' is a well-respected that the best generators pass, so failure here in also informative. These comments notwithstanding, the test outcome does not lead us to recommend against use of \randomorg. A comparison from running \pkg{dieharder} over random number streams from the six generators provided by \proglang{R} in its base package is left for future research. \section{Implementation of the \pkg{random} package} The combination of excellent networking support in \proglang{R} and the simple HTTP interface at \randomorg\ allows for a fairly light implementation. The standard data-reading function \code{read.table} in \proglang{R} is able to operate directly on a remote file, or URL. It is hence straightforward to connect \proglang{R} to CGI scripts providing data. In fact, the \pkg{tseries} package \citep{tseries:2006} has had the ability to download data for financial instruements directly off Yahoo! Finance for at a least a half-decade. Consequently, the main requirement for functions to access \randomorg\ are validity checks for the function arguments, construction of the URL query string, fetching the data and possibly reformatting the data before retuning it to the user. This high-level description summarizes the functions provided by the \pkg{random} package. Let us consider a simple example where we illustrate simple data acquisition from \randomorg. To avoid needless stress the server on re-runs of this document, we cache data, once downloaded, in a file and skip the download if the cache file is still present. <>= ## cached data to not depend on a) a network connection ## and b) data at random.org if ( !(file.exists("random.Rdata")) ) { randomOrg <- randomNumbers(n=5000, min=1, max=1e6, col=2)/1e6 save(randomOrg, file="random.Rdata") } else { load("random.Rdata") } # @ This shows that the call to obtain data from \randomorg\ is straightforward: we specify how many draws (with replication) over which interval we desire, and how many columns we would like in the return matrix. Exploratory data analysis using moments and quantiles, not shown here but available on request, reveal nothing suspicous that would indicate that the data is not distributed according to the requested $U(0,1)$ distribution. <<>= summary(randomOrg) apply(randomOrg, 2, function(X) quantile(X,c(0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99), digits=4)) # @ Further analysis using a simple scatter plot of half the uniform random numbers against the other, as well as a histogram of the entire data set, confirm the intial analysis. \setkeys{Gin}{width=0.95\textwidth} \begin{figure}[htbp] \begin{center} <>= plot(randomOrg, ylab="", xlab="", main="5000 random,org U(0,1) draws", pch='.') hist(matrix(randomOrg, ncol=1), xlab = "", ylab = "", main="Histogram") @ \caption{\label{plotunif} Scatter plot and histogram} \end{center} \end{figure} \section{Potential problems} A few potential problems with the non-deterministic random numbers provided by \randomorg\ should be mentioned. First, there is no actual service guarantee for the random numbers. Deterministic random numbers can always be produced requiring `just code': given a suitable compiler and computer, one should always be able to produce deterministic random number. And, given the so-far continued progression of hardware according to Moore's Law, more and more deterministic random numbers can be produced per unit of time. On the other hand, \randomorg\ is a web-based service. While the website has been on the Internet since 1999, and seemingly without any incident or break in the service, nothing guarantees that the site will be available at any given future point in time. This could block serious production-level implementation using \randomorg. Second, the data transfer from the service is entirely \textsl{in the open}. There is no provision for either a simple login or other forms of authentication, or an encryption layer covering the data. This could blocks any serious applications, in particular those involving cryptography. However, it it probably safe to assume that cryptography was never a focus of \randomorg. Third, the \randomorg\ service, like other hardware generators, is comparatively slow---an order of magnitude slower than software-based generators even on modest hardware. In fact, \cite{Davies:2000} already suggested to use files to store the output from hardware random number generators, and to then access these files for access at higher speeds. Fourth, while the result of the \pkg{dieharder} test suite are encouraging, they have been performed using static binary files. And, simply put, we just do not know for sure that the data was in fact generated by the \randomorg\ generator. %All that stands behind it is the academic reputation of its %principal investigator. A replicated study with a better guarantee concerning the origin of a sufficient amount of test data would be welcomed. \section{Summary} The \pkg{random} package complements the existing \textsl{deterministic} random number generators in \proglang{R} by providing a first \textsl{non-deterministic} source based on the \randomorg\ service \citep{Haahr:2006}. The \pkg{random} package is portable and cross-platform. This paper also provides the first rigorous tests of the \randomorg\ service using the \pkg{dieharder} battery of tests \citep{Brown:dieharder:2006} which indicate that the generated random numbers are of sufficiently high quality. We hope that \pkg{random} package presented here will prove to be useful for users of the \proglang{R} enviroment. \bibliographystyle{plainnat} \bibliography{random-intro} \end{document}