We always find it useful when large-circulation journals like Science or Nature tackle the data issue, especially when they devote a special issue to this topic. They typically take a structured approach, providing a brief, understandable overview of the topic at hand, followed by specific examples or case studies as illustration.
Science’s special issue on Data Replication and Reproducibility (December 2, 2011) was the latest in our list of “make note of these helpful special issues”. True to form, the Introduction laid out the issues, noting how “replication — the confirmation of results and conclusions from one study obtained independently in another — is considered the scientific gold standard”, but acknowledging that this concept is complicated by the amounts of data produced, the approaches taken to research, and the complexity of the questions being asked.
The first article, by Roger D. Peng of the Johns Hopkins Bloomberg School of Public Health, on reproducible research in computational science, raised many interesting points about the potential for reproducibility as a minimum standard for assessing the value of scientific claims. While full replication of a study is the gold standard for evaluating published findings, it is often not feasible. He uses the example of environmental epidemiology to illustrate: reproducing a large cohort study designed to examine the health effects of pollution would be difficult if not impossible, because such a study is very expensive and requires a long follow-up time.
Other reproducibility barriers exist, as he notes, related to technical issues with instrumentation, cultural issues within the research community, and the lack of an integrated infrastructure for distributing research results. But progress is being made, through journals like Biostatistics that encourage authors to make their work reproducible by others. Peng suggests small steps can be taken now toward the overall goal of reproducibility. Authors can publish their code, even if it is not “clean or beautiful”; free repositories exist for this purpose. Publishing code together with data sets is another step. His final recommendation is that the science community create a “DataMed Central” and a “CodeMed Central”, similar to PubMed Central — that is, repositories for data, metadata, and code, with links to each other and to the corresponding publications.
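To make the “publish your code, even if it is not clean or beautiful” step concrete, here is a minimal sketch of what a reproducible computational result can look like: a single script bundling a toy data set with the analysis that produced the reported numbers, so anyone can rerun it and check the output. The readings and statistics below are invented for illustration; they are not from Peng’s article or any real study.

```python
# Toy illustration of "code plus data" reproducibility: the input data,
# the processing, and the reported result all live in one rerunnable file.

# Hypothetical daily pollutant readings (µg/m³), bundled with the code.
READINGS = [12.1, 14.8, 9.6, 11.3, 15.2, 13.7, 10.4]

def analyze(readings):
    """Return the summary statistics reported in the (hypothetical) paper."""
    return {
        "n": len(readings),
        "mean": round(sum(readings) / len(readings), 2),
        "max": max(readings),
    }

if __name__ == "__main__":
    # Anyone with this script can regenerate the published numbers exactly.
    print(analyze(READINGS))
```

Even a small, unpolished script like this lets a reader verify a published figure directly, which is the modest standard Peng argues for when full replication is out of reach.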
What do you think of a resource like a “DataMed Central”?