Transparency in Scientific Discovery: Innovation and Knowledge Dissemination
Victoria Stodden, Department of Statistics, Columbia University

I have been researching open science issues broadly for two or three years. My PhD is from Stanford, and my PhD advisor in statistics had us share all of our code and data whenever we published a paper. That is how I became interested in these issues, and why I am something of an advocate for them. This is probably the most fun version of this talk to give, because there is no sense that I need to be an advocate here; everyone in this room is already interested in these issues. So in this talk I thought I would distill some of the principles and ideas I have learned from talking with community members about open science, with that advocacy in mind. My theme is transparency in scientific discovery and its deep impact on knowledge dissemination.

In framing my talk, I thought about why open science is open in the first place. Transparency is such a meme right now that it seems natural to apply it to science. But it is not new: sharing our methodology so that others can verify our results has always been part of science. That is what separates scientific investigation and scientific fact from everything else. I have been thinking about framing this as facilitating science in three ways: reproducibility in publishing, innovation in industry, and access to scientific knowledge. I will touch on all three.

The first issue is reproducibility, an issue close to my heart. On this slide are three examples of the data deluge that is hitting every aspect of the research we do. It is not restricted to the hard sciences; it is happening across fields. The social sciences are being transformed, for instance: a 2009 paper in Science describes the quantitative revolution coming from social networking data and its sheer abundance. The LHC, in just one experiment, the CMS experiment, is pulling off 780 terabytes per year, which is processed into several petabytes per year of data. That is one experiment in one part of the LHC. How do we do this in an open way? How do we replicate the results even within the LHC itself? These are enormous, testing questions for us as advocates. The Sloan Digital Sky Survey: as of 2010 there were about 50 TB of data if you wanted to download images of the sky. That is openly available, and it is massive.

That is one way science is changing through technology. Another is increased computational power: we can now simulate the complete evolution of a system and vary the parameters. This is a new type of investigation made possible by technology. The last point, and I think the salient one, is that for the first time the deep intellectual contributions are embedded in code. This is overlooked in talks about open science. There is plenty of discussion about the amounts of data available, but when you think about what science is really about, it is about communicating what we thought. That thinking ends up in the code, and it is not captured in the published paper. I think this is a key point for us as advocates at the Open Science Summit.

In my own field, statistics, the flagship journal is JASA, the Journal of the American Statistical Association. I asked: which articles are computationally focused? And of those, how many discuss how I can get the code? Do they have magic numbers in the code? Do the functions work? Do the functions do what the algorithms say they do? 1996 is when I first looked. A little under half of the articles were computational.
The other half were mathematics articles showing proofs. None of them said anything about where to get code. Ten years later, 33 of 35 articles were computational, and some of them, about 9%, said where to get the code. By 2009 or 2011, essentially all of the articles are computational: if you are publishing in JASA, you have used a computer somewhere in your work. This is where I have been focusing my research: you are producing your results via a computer, so how does that change how we need to think? A few more people are getting it: about 20% say where to download the code package and how to use it to reproduce the results. That is more than zero, but it is still poor. You know how hard it is to replicate someone's computational work from a four-page description of the result. Our traditional scientific papers are not up to this task. This is a credibility crisis for computational science: most of the computational results being published now are not verifiable, and they are not verified. That is new.

Think about this in the context of the scientific method. Traditionally we had two branches. The first is the deductive branch: mathematics, logic, and so on, things that can be established by deduction alone. Then we started doing empirical work, and that is the second branch: the statistical analysis of controlled experiments, hypothesis testing, and so on. In keynotes and grant applications there is a lot of discussion about a third branch, computation or simulation, or even a fourth branch, the data deluge. What I would like to submit to you is that we are not yet at the stage where we can call this new, technology-driven type of investigation science. That, I think, is what we are about in the open science movement. The central motivation of the scientific method is to root out error. The first two branches have had hundreds of years to think about this; computational science has had maybe two decades. The deductive branch has the proof: if you have a mathematical fact and you want to publish it, you have to show a proof, and everyone knows what the standards for that proof have to be for your knowledge to be accepted. There are analogous standards for the empirical branch: structured communication, methods sections, very detailed conventions for communicating the work so that the knowledge transfers and others can verify it. Computational science as practiced today does not have these standards and does not generate reliable knowledge.

Here is a framing principle for the discussion of open science, and I think it revolves around the core concept of reproducibility. There are different ways to think about what this means, and this comes up in all discussions of reproducibility: the paraphrase by David Donoho, my thesis advisor. "The idea is: An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures." You can see how this could apply to any computational scientific result being published. A result is reproducible if a member of the field can independently verify the published work; that is, without having to contact the author. A side effect of taking this as the core framing issue is that it scopes what types of data get shared, what types of metadata get attached to that data, and what is required when sharing code.
This framing is more concrete than saying "we need to be open," which is scary to a lot of scientists. Reproducibility is something every scientist understands. But it is really hard to implement, and it is worth asking why it has not happened overnight. The barriers are all familiar. There are requirements imposed by funding and grant agencies. There are patents and the ability to patent. There are intellectual property constraints that interfere with the free and open sharing of data. There are institutional expectations: if you are going up for tenure, you may care mostly about publishing as much as possible rather than about reproducibility. Journals have their own requirements; they may or may not require code, and generally they do not, so why would you spend extra time on it if it is not required? And then there are the general demands of scientific integrity.

There are a lot of words on this slide, but there are actual requirements in the grant guidelines of NSF and the National Institutes of Health: encouragements and requirements for sharing data and code. The NSF grant guidelines, since 2005 and earlier, expect investigators to share supporting materials, collections, and so on with other researchers, and encourage grantees to share software so as to make it widely useful and usable. They are singing our song. So what is wrong? That language dates from 2005 and earlier. NIH has been even more progressive about open data, for projects over $500k: NIH endorses the sharing of final research data and expects its timely release. They are not talking about code, but they are talking about data sharing. None of this is enforced. I do not think that is because people do not believe in the principles; the granularity of research problems makes it far from clear what the next step is for each person. There is no one-size-fits-all position.

In January 2011, NSF introduced a required data management plan: you must submit two pages with every grant application describing how the proposal will conform to the policy on dissemination and sharing of research results. This document is peer reviewed as part of the granting process. It is an enormous step, and I think a step in the right direction, in funding agency policy. Here is a little of what the data management plan covers: investigators are expected to share samples, data, and physical collections, and to facilitate such sharing. NSF is taking steps to enforce its pre-existing grant guidelines.

There are other things at the policy level. In 2011, the America COMPETES Act was reauthorized; I will briefly highlight Sections 103 and 104. Section 103 is about coordinating federal agencies on the dissemination and long-term stewardship of research results, including digital data. This matters for open access and open science. Most people are familiar with open access, which is related; at the congressional level they are thinking about what gets shared, about scientific integrity, and about data. Section 104 is where they direct OSTP, the Office of Science and Technology Policy at the White House, to work on improving access, including online access. They are aware that we are becoming fully digital and that this is changing the nature of our communication.

Innovation and tech transfer. This became important for us with the Bayh-Dole Act in 1980. At the time, people, at least the people in Congress, were not aware of the coming revolution in computers and technology.
The act was passed on the eve of computers pervading our practice of science. The first idea of the Bayh-Dole Act was to streamline the requirements federal funding agencies placed on the science they were funding. Its biggest impact, though, was to create incentives to take ideas and inventions from academia and make them more widely available for commercial development, by transferring ownership of those ideas to the institution where the scientist works. When you are hired, you sign a document transferring the rights to your inventions to the university. The university then has an incentive to patent them and to gather money from licensing the patents. This is how they were thinking about transferring technology out to the broader public and making it commercializable, and it is why we have so many tech transfer offices at universities. Now we are in a strange place because of the ability to patent software. Software is pervasive throughout science, and as I said at the beginning, our deep intellectual contributions now live in software, so this is something that can be patented and licensed out. What does this have to do with open science? If we were thinking about reproducibility, we would put our code and data out there so that others could reproduce the results. Well, what if someone wants to patent that and start a company around it? That may be fine, but it is a change in scientific norms that we should be aware of.

There are also a number of barriers facing individual scientists. In 2010 I surveyed people in the machine learning community via a conference, NIPS, asking each respondent about one of their papers: did they share the code and data from that paper or not? The survey paper could fill an entire talk on its own, but here is one table from it (survey of the machine learning community at NIPS, Stodden 2011): why did you share the code, or not? One person wrote in the comments space about how important the first reason here is: it takes a long time to clean up your code and clean up your data, and putting that time in is not rewarded in the community. I think that is changing, but not yet. These scientists did not see themselves as IT people; they did not want to face the wrath of questions about how to work with the code, and maybe answering them is not worth it. They were also worried about not getting attribution. This is something I have been working on: licensing structures that provide attribution for computational code and computational data, and how to work within this IP structure. Attribution was a big deal. The next two surprised me. I knew legal issues arose in computational work, but some people were saying, "I might want to patent this software and I don't want to throw it out there and establish prior art." Copyright was another thing they were worried about. Other reasons: I need to check things with administrators, it is a pain, I could lose future publications and get scooped, why should I advantage my competitors? Why should I take the time and effort to release code and data when the other guy, who does not, will look like the better scientist with more publications? It is a deep collective action problem. The last one, which I feel as a young professor: there is one senior guy in a chair, saying one door is tenure and the other door is flipping burgers. We get so specialized that we… The idea was never just to share the paper; it was to share the methodology, the way those results were originally generated. That is what we are doing.
We can now play with the results, change them, try out the authors' ideas, rather than just reading a paper, where it is all quite opaque. So, in a project I am working on now, I looked at 170 journals, chosen because they have a computational element; these are not journals in general. About 14% require data sharing, and among those computational journals many are biologically focused, which is why we see accession numbers... The proportion requiring code is about 7%. If journals are going to think about code sharing, they think about data first; once data is policy, then they think about code, so it is not surprising that this number is a little smaller. None of this is reviewed. Supplemental materials, where you can put your code and data: about 90% of the journals offer that. And then open access, a sister issue, distinct but related: among the computational journals, 22% were open access, or you could pay to publish openly if you wanted.

Journals face their own set of constraints. If reproducibility is such a core idea for computational scientists, why don't more journals pick it up? How do you establish standards for what counts as acceptably shared code, and for the metadata? Who does the archiving: does the journal do it? What counts as documentation? Should we be using sharing platforms? Should this come from the federal funding level, or the institutional level? Who checks any of this, and should it be reviewed? I fall on the side of the community doing the checking; this does not need to be reviewed beforehand. We have seen high-profile cases, like the Duke clinical trials, where there were assumptions that this material had been checked when it had not. What about less technical authors sharing data and code? We need to be sensitive to that. What about replication in the social sciences, and how will this affect the decision of when to publish? Maybe people publish later, and then the knowledge gets out there later? And what about journals wanting to attract the best papers and not put people off? Well, arguably, the best papers are the ones that share code and data.

This slide is a blizzard of words about grassroots movements; there is a lot afoot. We have workshops and conferences in applied mathematics, geoscience, and biostatistics: AMP / ICIAM 2011, SIAM Geosciences 2011, ENAR International Biometric Society 2011, AAAS 2011, SIAM CSE 2011, Yale 2009, the ACM SIGMOD conferences, the NSF OCI report on Grand Challenge Communities, and the IOM review of omics-based tests for predicting... They are all calling it reproducibility. All scientists have a day job; there is an enormous role here for people to make a difference.

I wanted to conclude with a few principles for advocacy as I see it, and I think most of us here probably qualify as advocates in the open science movement. Work within scientific norms as much as possible. Reproducibility is a long-standing norm in science; scientists understand it even if they have never heard of open science. Arguments that step outside this norm are tougher: arguing that open data is good for the community is a harder sell, because that is a shift. I think reproducibility is the easier argument, and it implies sharing code and data. Scientific integrity leads to openness through the verifiability of results, the need for verification and for establishing scientific facts. We need to remember this. Another thing I think is useful: the data deluge and the changes it is bringing to the research landscape are generating questions for many researchers. How do I share my data? Where do I put it?
What type of structure should I give it? Let's engage researchers on these questions. Let's reach out; the questions are everywhere, and there are people with day jobs who are not thinking about this. Let's engage them at the community level. In the survey results I showed, computational people have lots of ideas about this. I think the argument actually falls the other way: the scientists who are open with code and data are the best scientists. If we can show this empirically, then for the people who think open science is a waste of time, we can use that evidence to put ourselves in a strong position. The slides are on my website, stodden.net, and the papers I talked about, with the deeper results, are also available there. Thank you.

Q & A

Q: I wanted to ask whether you are familiar with the data publishing model of some genomic projects currently underway, where they publish the data as they generate it and then layer a license or agreement on top of it restricting particular uses until the whole project is completed and the primary authors can get the publication. Do you see this as a viable way forward for this or other sorts of data-driven research?

A: I see that as a special case. I think there they have stepped outside the norms; sharing data pre-publication is specialized even for the genomics community. Computational scientists sharing data and results before publication, open lab notebooks: remember that this community is the leader in open science, and certainly in open data. They were pushed by the rush to sequence the human genome; there was that big intellectual war about whether the genome would be in the public domain or commercialized, and they were making declarations about the importance of open data. They are far out ahead, at least in terms of open data. Other communities have whole other sets of norms and standards; the genomics people were pushed out ahead early. Outside that context, for other scientists, this is outside the norms. If you were to approach them and say, put your data online immediately, they would just stop talking to us. I think we are best off working within their norms as much as possible.

Q: I am wondering what your solution is for the attribution problem. One reason scientists do not want to share is that the way to build credit is to publish papers. But what if you could build attribution through the code that you share?

A: As part of my PhD dissertation, around 2005, I put together a sharing platform for papers and code in sparse representation. I recently looked it up on Google Scholar, SparseLab; I never thought about citation when I was a PhD student, and I wondered what people were citing. I had written some documentation and made PDFs of it, and people are citing those as if they were papers, because they are searching for a paper so that they can shoehorn a citation to the code into their own paper. People are coming up with ways to cite data, almost like a paper, and I think we are going in that direction. I think there is natural hesitation around it because there is not yet a sense, to our shame as computational scientists but even more broadly, that the work of collating data and creating usable data sets is as valuable as the ideas and the generation of results. I think that is changing, both for data and for the contributions embedded in code. I think this is something we are moving towards; I mean, we have to, there is no way around it. The community is evaluating the value of these contributions.