Archive for the 'ideas, comments,...' Category

What is the best reproducible research?

What is best research practice in terms of reproducibility? At the recent workshop in Ås (Norway), I had a discussion with Marc-Oliver Gewaltig, similar to discussions I had earlier with some other colleagues. So I decided to put it up here. All feedback is welcome!

The discussion boils down to the following question: Is it better (in terms of reproducibility) to make code and data available online and allow users to repeat your experiments (or simulations as Marc-Oliver would call them) obtaining the same results, or to describe your theory (model in Marc-Oliver’s terminology) in sufficient detail that people can verify your results by re-implementing your experiments and verifying that they obtain the same thing?

I personally believe both approaches have their pros and cons. With the first one, a reader can download the related code and data, and very easily verify that he/she can obtain the same results as presented in the paper. If he/she wants to analyze things further, there is already a first implementation available to start from, or to test on other data. However, that certainly doesn’t take away the need for a good and clear description in the paper!

With the second approach, one avoids the risk that a bug in the code goes unnoticed because the reader just “double-clicks” to repeat the experiment and obtains the same (buggy) results. The second approach allows a thorough verification of the presented concept/theory, as the reader independently re-implements the work and checks the results. I believe certain standardization bodies like MPEG use this approach to make sure that descriptions are sufficiently precise.

Personally, I think the second approach is the better, more thorough one in an ideal world. Currently, I prefer the first one, because most people won’t go to the trouble of re-implementing things, and the first approach already gives those people something more than just the paper, allowing them to get their hands dirty with it. And “more interested readers” may still re-implement, or start analyzing the code in detail.

On doing research

I was just reading the following two articles/notes. While they are not entirely about reproducible research, I think they reflect well the worries that many researchers have about current “publish or perish” research practices. Not sure I agree with all of it, but they do make a number of good remarks.

D. Geman, Ten Reasons Why Conference Papers Should be Abolished, Johns Hopkins University, Nov. 2007.

Y. Ma, Warning Signs of Bogus Progress in Research in an Age of Rich Computation and Information, ECE, University of Illinois, Nov. 2007.


Climate science

Just like many other domains, climate science is a mixture of theory, models and empirical results. Often this comes with different scientists working on the different parts (theory/model/experiments), each claiming their part to be (by far) the most important of the three. A nice analysis is given on the IEEE Spectrum site. Unlike in many other domains, it seems hard to me (not being a climate scientist) to do a lot of small experiments to validate the models. This makes it even more important to be open about the precise models used, their parameters, and the data used to validate those models.

We’ve only got one planet Earth to validate models on. And it takes soooo long to check whether a model is correct, that we’d better be open about it, collaborate, check each other’s assumptions, and make sure it’s the best model we can make!

For some more discussion on the recent climate study scandal and reproducible research, see also Victoria Stodden’s blog (or also here).

ORCID: on being a number

I just learned about ORCID: the Open Researcher Contributor Identification initiative. Its goal is to provide a unique ID for every researcher, and in that way provide better traceability of all the work by a researcher. It should avoid ambiguity between authors with the same name, as well as errors caused by typos. They even intend to include not only ‘standard’ conference/journal publications, but also more ‘exotic’ research output like data sets, blog posts, etc. The initiative is supported by a large number of major publishers, like Springer, Elsevier and Nature.

A very nice initiative, which should get a few problems out of the world. However, I am not sure how that is supposed to work in practice. Does that mean that we should soon add an ORCID number (without typos) below the title and the author name? And cite works by citing the ORCID and the DOI (digital object identifier)? And will we write these numbers with fewer errors than the author names now?

It does make me think of that other unique number: the DOI, which was introduced to uniquely identify a document (a publication, as far as I have seen). I’ve seen it for some time now when I look up articles, and I have no doubt it uniquely identifies those articles, but what is it used for? Maybe DOIs have their use… but I haven’t seen it yet.

People who do know of practical cases where the DOI is used, feel free to comment! (others too, of course)

Repository server for publications

I think it’s probably a lot easier, and more consistent, if instead of making a web page for each reproducible research (RR) paper we publish (http://lcavwww.epfl.ch/reproducible_research), we have a setup (a bit) like Infoscience, where everyone can enter publications by filling in the required and optional fields. I would like to build such a setup based on EPrints (http://www.eprints.org/software/) and make it public, such that other labs/universities can also easily set up a similar server. We will probably let the people from EPrints develop this system, but for that we need accurate requirements… So your comments on this would be very welcome!

I was thinking about the following fields:
- standard publication fields (title, author, reviewing status, journal, volume, number, pages, year, DOI, abstract, keywords, PDF, publisher, official URL)
- specifically for RR:
  * code and data (in a zip archive, specifying also the type of code), mandatory
  * tested configurations, mandatory
  * contact e-mail address, mandatory
  * figures, optional
- additional features for readers (cf. http://clare.eprints.org/10/ for an example of the last)
  * a check box saying ‘I have tested this code and it runs/does not run’
  * a check box saying ‘I was/was not able to reproduce the results described in this paper’
  * a field where anyone can add comments
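
To make the field list above a bit more concrete, here is a minimal sketch of what one record could look like, with a check for the mandatory RR fields. This is only an illustration in plain Python, not the EPrints data model: the class name RRPublication, all attribute names and the example values are assumptions made up for this sketch, and only a few of the standard publication fields are shown.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RRPublication:
    """Hypothetical record for one reproducible-research paper."""
    # A few of the standard publication fields.
    title: str
    authors: List[str]
    year: int
    doi: Optional[str] = None
    # RR-specific fields proposed as mandatory.
    code_and_data_zip: str = ""   # path or URL of the zip archive
    code_type: str = ""           # type of code in the archive
    tested_configurations: List[str] = field(default_factory=list)
    contact_email: str = ""
    # RR-specific optional field.
    figures: List[str] = field(default_factory=list)

    def missing_mandatory_fields(self) -> List[str]:
        """Return the names of mandatory RR fields that are still empty."""
        mandatory = {
            "code_and_data_zip": self.code_and_data_zip,
            "tested_configurations": self.tested_configurations,
            "contact_email": self.contact_email,
        }
        return [name for name, value in mandatory.items() if not value]


# Example submission that forgets the contact e-mail address.
record = RRPublication(
    title="An example paper",
    authors=["A. Author"],
    year=2009,
    code_and_data_zip="example_paper.zip",
    code_type="Matlab",
    tested_configurations=["Matlab 7 on Windows XP"],
)
print(record.missing_mandatory_fields())  # ['contact_email']
```

A submission form could refuse to save the record as long as this list is not empty.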

Any comments? More/less things needed?
Some specific questions:
- should we make these ‘Additional features’ linked to a name and/or date or so, such that we can avoid the author clicking 10 times? ;-)
- should we separate code and data? Data might get quite large, while code is generally small.
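
On the first question, one possible way to avoid the author clicking 10 times is to store a name and a date with every check box entry, and count only the most recent entry per name. The sketch below assumes exactly that; the names ReaderFeedback, latest_feedback_per_reader and count_reproduced are hypothetical, and a real server would need a stronger notion of identity than a typed-in name.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Optional, Tuple


@dataclass
class ReaderFeedback:
    """Hypothetical feedback entry left by one reader."""
    reader_name: str
    submitted_on: date
    code_runs: Optional[bool] = None           # None means: box not filled in
    results_reproduced: Optional[bool] = None  # None means: box not filled in
    comment: str = ""


def latest_feedback_per_reader(
    entries: List[ReaderFeedback],
) -> Dict[str, ReaderFeedback]:
    """Keep only the most recent entry for each (case-insensitive) name."""
    latest: Dict[str, ReaderFeedback] = {}
    for entry in entries:
        key = entry.reader_name.strip().lower()
        current = latest.get(key)
        if current is None or entry.submitted_on >= current.submitted_on:
            latest[key] = entry
    return latest


def count_reproduced(entries: List[ReaderFeedback]) -> Tuple[int, int]:
    """Return (readers who reproduced the results, readers who could not)."""
    latest = latest_feedback_per_reader(entries).values()
    yes = sum(1 for e in latest if e.results_reproduced is True)
    no = sum(1 for e in latest if e.results_reproduced is False)
    return yes, no
```

With this, ten entries under the same name count as a single vote, and a reader who changes his/her mind simply overrides the earlier entry.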