\title{Reproducible computational experiments \\ using SCons}
\lefthead{Fomel \& Hennenfent}
\righthead{Reproducible research}
\author{Sergey Fomel\/\footnotemark[1] and Gilles Hennenfent\/\footnotemark[2]}
\footnotetext[1]{University of Texas at Austin, E-mail: sergey.fomel@beg.utexas.edu}
\footnotetext[2]{Earth \& Ocean Sciences, University of British Columbia, E-mail: ghennenfent@eos.ubc.ca}
\maketitle

\begin{abstract}
  SCons (from Software Construction) is a well-known open-source program designed primarily for building software. In this paper, we describe our method of extending SCons for managing data-processing flows and reproducible computational experiments. We demonstrate our usage of SCons with a couple of simple examples.
\end{abstract}

\section{Introduction}

This paper introduces an environment for reproducible computational experiments developed as part of the ``Madagascar'' software package. To reproduce the example experiments in this paper, you can download Madagascar from \url{http://rsf.sourceforge.net/}. At the moment, the main Madagascar interface is the Unix shell command line, so you will need a Unix/POSIX system (Linux, Mac OS X, Solaris, etc.) or a Unix emulation under Windows (Cygwin, SFU, etc.). Our focus, however, is not only on the particular tools we use in our research but also on the general philosophy of reproducible computations.

\subsection{Reproducible research philosophy}

Peer review is the backbone of scientific progress. From the ancient alchemists, who worked in secret on magic solutions to insolvable problems, modern science has come a long way to become a social enterprise, where hypotheses, theories, and experimental results are openly published and verified by the community. By reproducing and verifying previously published research, a researcher can take new steps to advance the progress of science.

Traditionally, scientific disciplines are divided into theoretical and experimental studies.
While reproduction and verification of theoretical results usually require only imagination (apart from pencils and paper), experimental results are verified in laboratories using equipment and materials similar to those described in the publication.

During the last century, computational studies emerged as a new scientific discipline. Computational experiments are carried out on a computer by applying numerical algorithms to digital data. How reproducible are such experiments? On one hand, reproducing the result of a numerical experiment is a difficult undertaking. The reader needs to have access to precisely the same kind of input data, software, and hardware as the author of the publication in order to reproduce the published result. It is often difficult or impossible to provide detailed specifications for these components. On the other hand, basic computational system components such as operating systems and file formats are getting increasingly standardized, and new components can in principle be shared, because they simply represent digital information transferable over the Internet.

The practice of software sharing has fueled the miraculously efficient development of Linux, Apache, and many other open-source software projects. Its proponents often refer to this ideology as an analog of the scientific peer-review tradition. Eric Raymond, a well-known open-source advocate, writes \cite[]{taoup}:
\begin{quote}
  Abandoning the habit of secrecy in favor of process transparency and peer review was the crucial step by which alchemy became chemistry. In the same way, it is beginning to appear that open-source development may signal the long-awaited maturation of software development as a discipline.
\end{quote}
While software development is trying to imitate science, computational science needs to borrow from the open-source model in order to sustain itself as a fully scientific discipline.
In the words of Randy LeVeque, a prominent mathematician \cite[]{randy},
\begin{quote}
  Within the world of science, computation is now rightly seen as a third vertex of a triangle complementing experiment and theory. However, as it is now often practiced, one can make a good case that computing is the last refuge of the scientific scoundrel [...] Where else in science can one get away with publishing observations that are claimed to prove a theory or illustrate the success of a technique without having to give a careful description of the methods used, in sufficient detail that others can attempt to repeat the experiment? [...] Scientific and mathematical journals are filled with pretty pictures these days of computational experiments that the reader has no hope of repeating. Even brilliant and well intentioned computational scientists often do a poor job of presenting their work in a reproducible manner. The methods are often very vaguely defined, and even if they are carefully defined, they would normally have to be implemented from scratch by the reader in order to test them.
\end{quote}

In computer science, the concept of publishing and explaining computer programs goes back to the idea of \emph{literate programming} promoted by \cite{knuth} and extended by many other researchers \cite[]{thimbleby}. In his 2004 lecture on ``better programming'', Harold Thimbleby notes\footnote{\url{http://www.uclic.ucl.ac.uk/harold/}}
\begin{quote}
  We want ideas, and in particular programs, that work in one place to work elsewhere. One form of objectivity is that published science must work elsewhere than just in the author's laboratory or even just in the author's imagination; this requirement is called \emph{reproducibility}.
\end{quote}

\begin{comment}
The quest for peer review and reproducibility is especially important for computational geosciences and computational geophysics in particular.
The very first paper published in \emph{Geophysics} was titled ``Black magic in geophysical prospecting'' \cite[]{GEO01-01-00010008,TLE02-03-00280031} and presented an account of different ``magical'' methods of oil exploration promoted by entrepreneurs in the early days of the geophysical exploration industry. Although none of these methods exist today, it is not a secret that industrial practice is full of nearly magical tricks, often hidden behind a scientific appearance. Only the scrutiny of peer review and result verification can help us distinguish magic from science and advance the latter.
\end{comment}

Nearly ten years ago, the technology of reproducible research in geophysics was pioneered by Jon Claerbout and his students at the Stanford Exploration Project (SEP). SEP's system of reproducible research requires the author of a publication to document the creation of numerical results from the input data and software sources to let others test and verify the result reproducibility \cite[]{SEG-1992-0601,matt}.

The discipline of reproducible research was also adopted and popularized in the statistics and wavelet theory community by \cite{donoho}. It is referenced in several popular wavelet theory books \cite[]{hubbard,mallat}. Pledges for reproducible research appear nowadays in fields as diverse as bioinformatics \cite[]{bioconductor}, geoinformatics \cite[]{geo}, and computational wave propagation \cite[]{randy}. However, the adoption of reproducible research practice by computational scientists has been slow. Partially, this is caused by difficult and inadequate tools.

\subsection{Tools for reproducible research}

The reproducible research system developed at Stanford is based on ``make'' \cite[]{make}, a Unix software construction utility. Originally, SEP used ``cake'', a dialect of ``make'' \cite[]{Nichols.sep.61.341,Claerbout.sep.67.145,Claerbout.sep.73.451,Claerbout.sep.77.427}. The system was converted to ``GNU make'', a more standard dialect, by \cite{Schwab.sep.89.217}.
The ``make'' program keeps track of dependencies between different components of the system and the software construction targets, which, in the case of a reproducible research system, turn into figures and manuscripts. The targets and the commands for their construction are specified by the author in ``makefiles'', which serve as databases for defining source and target dependencies. A dependency-based system leads to rapid development, because when one of the sources changes, only the parts that depend on this source get recomputed. \cite{donoho} based their system on MATLAB, a popular integrated development environment produced by MathWorks \cite[]{matlab}. While MATLAB is an adequate tool for prototyping numerical algorithms, it may not be sufficient for the large-scale computations typical of many applications in computational geophysics.

``Make'' is an extremely useful utility employed by thousands of software development projects. Unfortunately, it is not well designed from the user-experience perspective. ``Make'' employs an obscure and limited special language (a mixture of Unix shell commands and special-purpose commands), which often appears confusing to inexperienced users. According to Peter van der Linden, a software expert from Sun Microsystems \cite[]{linden},
\begin{quote}
  ``Sendmail'' and ``make'' are two well known programs that are pretty widely regarded as originally being debugged into existence. That's why their command languages are so poorly thought out and difficult to learn. It's not just you -- everyone finds them troublesome.
\end{quote}
The inconvenience of the ``make'' command language also lies in its limited capabilities. The reproducible research system developed by \cite{matt} includes not only custom ``make'' rules but also an obscure and hardly portable agglomeration of shell and Perl scripts that extend ``make'' \cite[]{Fomel.sep.94.matt3}.

Several alternative systems for dependency-checking software construction have been developed in recent years.
One of the most promising new tools is SCons, enthusiastically endorsed by \cite{dubois}. The initial SCons design won the Software Carpentry competition sponsored by Los Alamos National Laboratory in 2000 in the category of ``a dependency management tool to replace make''. Some of the main advantages of SCons are:
\begin{itemize}
\item SCons configuration files are Python scripts. Python is a modern programming language praised for its readability, elegance, simplicity, and power \cite[]{python1,python2}. \cite{TLE21-03-02600267} recommend Python as the first programming language for geophysics students.
\item SCons offers reliable, automatic, and extensible dependency analysis and creates a global view of all dependencies -- no more ``make depend'', ``make clean'', or multiple build passes of touching and reordering targets to get all of the dependencies.
\item SCons has built-in support for many programming languages and systems: C, C++, Fortran, Java, LaTeX, and others.
\item While ``make'' relies on timestamps for detecting file changes (creating numerous problems on platforms with different system clocks), SCons by default uses a more reliable detection mechanism employing MD5 signatures. It can detect changes not only in files but also in the commands used to build them.
\item SCons provides integrated support for parallel builds.
\item SCons provides configuration support analogous to the ``autoconf'' utility for testing the environment on different platforms.
\item SCons is designed from the ground up as a cross-platform tool. It is known to work equally well on POSIX systems (Linux, Mac OS X, Solaris, etc.) and on Windows.
\item The stability of SCons is assured by an incremental development methodology utilizing comprehensive regression tests.
\item SCons is publicly released under a liberal open-source license\footnote{As of the time of this writing, SCons is in a beta version 0.96 approaching the 1.0 official release.
See \url{http://www.scons.org/}.}
\end{itemize}

In this paper, we propose to adopt SCons as a new platform for reproducible research in scientific computing.

\subsection{Paper organization}

We first give a brief overview of the ``Madagascar'' software package and define the different levels of user interaction. To demonstrate our adoption of SCons for reproducible research, we then describe a couple of simple examples of computational experiments and finally show how SCons helps us document our computational results.

\section{Madagascar software package overview}
%
\inputdir{.}
\plot{rsf_diag}{width=\textwidth}{caption}
%
``Madagascar'' is a multi-layered software package (Fig.~\ref{fig:rsf_diag}). Users can thus use it in different ways:
%
\begin{itemize}
\item \textbf{command line}: ``Madagascar'' is first of all a collection of command-line programs. Most programs act as filters on input data and can be chained in a Unix pipeline, e.g.,
\begin{verbatim}
sfspike n1=200 n2=50 | sfnoise rep=y >noise.rsf
\end{verbatim}
Although these programs currently focus mainly on geophysical applications, users can also write their own software using the API (application programmer's interface) to manipulate Regularly Sampled Format (RSF) files, the ``Madagascar'' file format. The main
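As a preview of how such a pipeline can be driven from SCons, the command shown above might be encoded in an SConstruct file along the following lines (a sketch only: we assume the \texttt{rsf.proj} module of ``Madagascar'', whose exact name may differ between versions; \texttt{Flow} declares a target, its source files, and the command that builds it, with the \texttt{sf} program prefix added automatically):
\begin{verbatim}
# SConstruct: sketch of a Madagascar data-processing flow
from rsf.proj import *

# Generate a spike model and add random noise to it.
# The target "noise.rsf" has no input file (None) and is
# rebuilt only when the command itself changes.
Flow('noise', None, 'spike n1=200 n2=50 | noise rep=y')

End()
\end{verbatim}
Running \texttt{scons} in the directory containing this file carries out the computation, and \texttt{scons -c} removes the generated targets, so the whole experiment can be rebuilt from scratch by anyone who has the SConstruct file.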