Welcome to our discussion forum on reproducible research!
The topic of reproducible research raises a lot of very interesting discussions, ranging from the question of whether a change in habits is actually needed to practical issues such as how to make such research available online. We already had some very interesting discussions at the ICASSP 2007 special session on reproducible research, and would like to continue them on this forum.
So if you have any comments on reproducible research in general, our current efforts at LCAV, or personal experiences with reproducible/non-reproducible research, we would be very happy to hear about them! A lively discussion about advantages and disadvantages of reproducible research and various needs will only improve the outcome, and will increase the visibility and impact of our research!
Our personal experiences and interests are mainly in signal processing, but contributions from other research domains are of course very welcome too, and can only enrich the discussions.
The reproducibility of research in signal processing seems highly dependent on the area. In speech, for example, there appears to be a history of sharing software and using benchmarking databases. In radar signal processing there are isolated but significant pockets where high standards of reproducibility exist. In genomic signal processing, too, there is a pre-existing culture (biology) where reproducibility is a given. These areas have been highly subsidized by government agencies that have placed a premium on reproducibility.
Other areas of SP have spotty records in RR. These areas should be the focus of the RR discussion. I think that this is especially true of image processing, theory and methods, and communications SP, but I would like to hear input from others on this. A targeted movement to encourage RR in an especially deserving area would have higher impact.
RR should also take into account experimental parameters, especially in signal processing (and image processing; I agree with Hero). Generally, simulations with synthetic signals are reproducible, but problems arise with real-life signals. When no statistical model is used (for example, a generalized Gaussian model for wavelet image coefficients), one would expect the results to be obtained from a significant amount of data. This is especially true in digital forensics and watermarking. In this field, there is a publication started by Barni et al. which focuses on reproducible results.
For real-life applications there is also the problem of finding a relevant database. I was told the MPEG testbed consists of fewer than a dozen (quite short) videos... Every research domain should publish a standard database to test any algorithm against (but who is going to pay for that?). I work in an image watermarking team, and we try to give results against thousands of images. And we know this is not enough to estimate a (hopefully very low) false alarm probability! However, this database is ours. How relevant would it be for anybody else? This database contains 1 million images. That is surely not enough to estimate probabilities on the order of 1e-9, as standard requirements demand. But so many people give results against the Lena image alone... One point in a figure costs us 3 days of computation on our high-end server.
Does that mean we should give the reviewers access to our database, and hope they will be patient enough to obtain the same figures as ours?
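To put rough numbers on the sample-size point above, here is a minimal sketch. It rests on assumptions that are not in the post: each test image is an independent trial, no false alarm at all is observed, and a 95% confidence level is wanted. The function names are made up for illustration. Under those assumptions the smallest false alarm probability one can bound from n images is about 3/n (the "rule of three").

```python
import math

def zero_event_upper_bound(n_trials, confidence=0.95):
    """Exact upper confidence bound on p when 0 events occur in n_trials.

    Assumes independent trials: solves (1 - p)**n_trials = 1 - confidence.
    """
    alpha = 1.0 - confidence
    # 1 - alpha**(1/n), written with expm1 for numerical accuracy
    return -math.expm1(math.log(alpha) / n_trials)

def trials_needed(p_target, confidence=0.95):
    """Number of event-free trials needed to bound p below p_target."""
    alpha = 1.0 - confidence
    return math.ceil(math.log(alpha) / math.log1p(-p_target))

# Database size and target probability quoted in the post above
bound = zero_event_upper_bound(1_000_000)
needed = trials_needed(1e-9)

print(f"Upper bound with 1e6 images and no false alarm: {bound:.2e}")
print(f"Images needed to support a 1e-9 claim: {needed:.2e}")
```

With 1 million images the bound comes out at roughly 3e-6, and supporting a 1e-9 claim would take on the order of 3 billion event-free images, which fits the remark that even this database is far from enough.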
I totally agree that code sharing practices depend a lot on the community or research field.
Concerning the data, things are indeed difficult to generalize. It seems very difficult to agree on standardized datasets for testing, and real-life data from practical experiments can only be reproduced in a statistical sense, if they can be reproduced at all. In my opinion, full datasets should be made available online to everyone. So Cayre, my proposal is to make the database available online to everyone, not just the reviewers, provided of course that you own the copyright to the data.
I believe people will give credit to the owner of the database in the long term.