Another data competition:
Machine Learning for Signal Processing (MLSP) TC Announces the Winners of the 6th Annual Data Analysis Competition
See here for more info.
Ideas, interesting papers and news items around reproducible research
Another data competition:
Machine Learning for Signal Processing (MLSP) TC Announces the Winners of the 6th Annual Data Analysis Competition
See here for more info.
Earlier this week, I was at the Workshop in Computational Systems Biology: Models, Methods, Meaning at the Norwegian University of Life Sciences in As, Norway. I gave a talk there on reproducible research, and there were some other excellent talks on modeling and simulation, research methods, etc. I liked it a lot, and it was really an excellent workshop! Thanks for the organization, Hans Ekkehard!
As it says on the site, this workshop was on the following topic, which very well described to me both the content and the spirit of the workshop: “Modeling and simulation are essential tools in systems biology and many other branches of science. This workshop is an invitation to step back from the day-to-day struggle with our simulations and to reflect about the nature of modeling and its relation to simulation: How do modeling and simulation contribute to the development of knowledge? Is a simulation per se a valid scientific experiment?”
Both the speakers and the audience consisted of people with a very diverse background, ranging from physicists, chemists and engineers (like me) all the way to philosophers in metaphysics. This resulted in often very enthusiast and interesting discussions. It was also very interesting for me to see how scientists in neuroscience struggle with similar issues as me, and to see how they approach things. I learned a lot of new things, some of which will pop up in separate posts on this blog in the future. Stay tuned!
I just got a pointer to ResearchAssistant, a Java tool to keep better track of research experiments. It stores the exact circumstances under which experiments are performed, with parameter values etc. Looks like a very nice tool for reproducible research. At least, it should make it very easy when writing a paper to trace back how certain results were obtained. If you’re doing research in Java, I’d certainly recommend taking a look at RA!
The tool is also described in the following paper:
D. Ramage and A. J. Oliner, RA: ResearchAssistant for the Computational Sciences, Workshop on Experimental Computer Science (ExpCS), June 2007.
Making publications reproducible is tough…
I recently experienced it again in some of my work. In the stress of preparing a publication for a submission deadline, it is very challenging to take the (precious) time to verify all of the results once more and make sure all the results are perfectly reproducible. A result or figure so easily slips in for which the exact parameter settings have not been checked or written down…
I got a pointer earlier this week to a New York Times article about R. A very interesting article about the use of R in scientific communities and industrial research, mainly for statistical analysis. R is open source software, so it is free and has already taken advantage from contributions made by various authors. And (although I haven’t used it myself yet), it is a great tool for reproducible research. Using the package Sweave, authors can write a single document containing their article and the R code to reproduce the results and put them in place. This ensures that all the material is in a single place.
It also shows something about the amazing power of open source software developed by a community of authors (and typically users at the same time).
I seem to be dwelling quite some time on the web lately… After my post about the lifetime of URLs, here’s one about domain names and reproducibility. I recently noticed when looking around that there are quite some websites and domain names related to reproducible research.
reproducibleresearch.org is an overview website by John D. Cook containing links to reproducible research projects, articles about the topics, and relevant tools. It also contains a blog about reproducible ideas.
reproducibleresearch.com is owned by the people at Blue Reference, who created Inference for Office, a commercial tool to perform reproducible research from within Microsoft Office.
reproducibility.org is used by Sergey Fomel and his colleagues as home for their Madagascar open source package for reproducible research experiments.
reproducible.org is a reproducible research archive maintained by R. Peng at Johns Hopkins School, where the goal is to host a place for reproducible research packages.
Quite a range of domain names containing the word “reproducible” (or a derivative), if you ask me! And then I didn’t even start about the Open Research or Research 2.0 sites. Let’s hope this also means that research itself will soon see a big boost in reproducibility!
I am getting worried these days about the volatility of URLs and web pages. I guess you all know the problem: it is very easy to create a web page, and hence many people do so. Great! However, after some years, only few of those web pages are still available. Common reasons include people retiring, or moving to other places, and therefore their web pages at their employer’s site disappear. Similarly, registering a domain name at some point in time does not mean you will keep on paying the yearly fees forever. Or also, web sites getting an entire re-design often result in broken URLs.
Why does this worry me so much?
Reproducible research, literate programming, open science, and science 2.0. All different namings, and (in my opinion) all covering largely the same topic: sharing code and/or data complementing a publication as a presentation of your research work. While literate programming is more focused on adding documentation to code, and science 2.0 seems to include the assumption that you put work in progress online, there really seems to be a very large intersection between these topics.
This clearly shows that from various sides of the scientific community, in very different fields of science, the same ideas pop up. That is a really exciting thing! And at the same time it also shows that there is a clear need for such open publication of a piece of research. And I think everyone will agree that there would be nothing nicer than being able to really start from the current state-of-the-art when starting to do research in a certain field?
Should all these efforts be merged under a single “label”? It would definitely be exciting. And it would create a huge impact, as a joint effort for “open science”, “reproducible research”, or whatever the name may be, would receive a lot of attention, and cannot be overlooked by anyone anymore. At the same time, every research domain needs other specifics or finetuning, and it is not clear to me now what the “best” setup would be for the type of work I am doing now. So maybe we should let these variations co-exist for some more time, and see later which ones survive, are the simplest to use, and which tools can be combined to create an optimal method for research.
But of course (if anyone is reading these posts), I would be very happy to hear your own opinion on this!
A few months ago, I read in a Belgian newspaper that 9% of the participants in a study among 2.000 American scientists said they had witnessed scientific fraud within the past three years. And it seems they were not talking about those cases where people use Photoshop to crop an image or so, but rather inventing fake results or falsifying articles.
Although I wasn’t able to find this back on the web with Google, I am quite sure the original authors checked the number. Wikipedia reports on another study, where the actual number was 3%. Anyhow, whether it is 3 or 9 percent, this number is much too high. Let us hope it can be taken down by requiring higher reproducibility of our research work. I do realize that there will always be people cheating, and falsifying results (Wikipedia even keeps a list of the most famous cases). But I also strongly believe that in the end, most researchers just want to do good work. And many of them perform non-reproducible work, just because they don’t feel the need for making it reproducible (yet). Or are too busy with their next piece of work to properly finish off the current one…
Climate science
Just like many other domains, climate science is a mixture between theory, models and empirical results. Often this comes with different scientists working on the different parts (theory/model/experiments), and all claiming their part to be the (far) more important one of the three. A nice analysis is given on the IEEE Spectrum site. Unlike many other domains, it seems hard to me (not being a climate scientist) to do a lot of small experiments to validate the models. This makes it even more important to be open about the precise models used, parameters, and the data used to validate those models.
We’ve only got one planet Earth to validate models on. And it takes soooo long to check whether a model is correct, that we’d better be open about it, collaborate, check each other’s assumptions, and make sure it’s the best model we can make!
For some more discussion on the recent climate study scandal and reproducible research, see also Victoria Stodden’s blog (or also here).