My name is Michael .. I am a theoretical physicist and I am about to publish a book on open science.
We're going to up the gain on this microphone. We'll get on that. How is it here? Are we live there? Okay, just a second. My name is Cameron Neylon. I work at a research council in the UK. I am interested in open science, sharing science in process, how to make the web suck less, how to do science with others. I'm Victoria Stodden, I am a statistician at the university; I am going to talk about the credibility crisis in computational science and how the solution is in this room: how open science will help address it, increase visibility, and speed the path of science. I am Jason Hoyt at Mendeley, and I am here to talk about what you want to talk about. You'll hear more from me in a few minutes. I am Mike Gretes from Mind the Health Gap, on how open science and open innovation can be applied to research and development. Thanks. It sounds like it's much better.
We're going to do about 15 minutes of general Q&A and discussion among the panelists here, and then we launch into our first presentation with Mike Gretes, who is hereby on notice: be ready to start in 15 minutes. Why don't we open up the panel for any general questions the audience would like to ask. Can we start with a good definition of open science? I thought that science was open by definition.
The "open" bit. The open bit means that it is available to any one in the world to do whatever they like with it, without any strings attached. I wuold agree with that. It's pretty clear that we don't have that situation right now in science. I would be thinking for some time- open science is not something that you want to define too precisely. It's a great banner, and a great rallying pool for people with quite diferent priorities. There are people who very strongly advocate public access to the published peer-reviewed literature; there are people who are strongly arguing for access to data, and that includes the open government data movements, and then there are others who are interested in improving the efficiencies in the way that we do science. All of these people might not agree on the priorities. We see some different priorities in the open access movement. There are some differences in science.
Have you ever downloaded someone else's scripts without an open license and run them? That's the quiet horror story of IP: it gets in the way, and strictly speaking it's illegal. It's pervasive and a real barrier to sharing and openness in science. There are certainly horror stories in terms of patents, and maybe I'll let other people talk about that. Here's a brief personal story. I was involved in a project we were trying to get funded. It was in nanotech and heart disease. It was important; it concerns people a great deal. It was for treating the degradation of the heart in heart disease. We had to have trials, we had to have patents, and at the same time we're talking about putting nanoparticles into people. We need to have a lengthy, appropriate conversation with a lot of the community about what kind of safeguards there should be, and we can't have that conversation until we've got the patent. The therapy, if it works.. it's not popular. These things feed on each other in different directions, and they are not always good. They can be problems, in personal experience at least. So, I have a few. Hearsay, but it's true. One university in the UK started putting its theses online, and one of the scientists discovered that these theses, horror, were being read by other people, and some of the material might be patentable. People thought that theses were for obscuring science. So now the university has put an embargo on theses so nobody can read any of them. We have to educate people on best practices. I did a survey of the machine learning community on what was preventing them from putting code and data online, and what enabled them if they did it. A number of senior professors said that they wouldn't put the code up because they would patent it and do a startup, and the scientific community doesn't get it. At Mendeley, we've had a couple of people contact us, shocked that their metadata ended up in our search catalog. People wanted it taken out. That was to their own detriment: now people can't discover their work. It was quite weird. I can't even think about the IP issues yet: we have people getting upset about sharing their publications, so we have a long way to go, apparently.
I just wanted to contribute an example of a horror story. For the last 18 months, we've been doing a project on gene therapy. A number of scientists, and these are great scientists, they are bright, the majority of them got sto.. they said nothing. This is a real problem, because we're going to have to break this impasse of intellectual property dominating science. It's going to take courage, it's going to take scientists... evidence of how their research is being impacted. The evidence came from Peter Mac. This is a world-renowned cancer research institute in Melbourne, publicly funded. For two years, research on BRCA1 gene sequences was delayed because of a dispute between the Australian and .. over who actually could give these guys permission to do this work. Eventually, permission was granted, but the research had been delayed and the cost had tripled. Let me also add one more comment. This idea that you need a patent, or that without a patent you aren't going to get a license, that's just nonsense. It really is nonsense. Until 1978, you couldn't even get a patent on this, and yet we have medicines. By the way, the world's first antibiotic, penicillin, was developed without patent protection. What happened before the patent system even existed? Why do we have this innovation? It's not because of the patent system. Scientists have to stop believing the nonsense of patent lawyers. They have done a good job of making sure you believe that you need a patent to successfully commercialize your research. You do not.
A very simple question: why now? Why is openness important now, but not in 1995, or 1865, or so on? The human race is growing up, and this is an important part of its future? Has that ever not been true? We haven't understood the importance of openness until now. When I started 50 years ago it wasn't an issue; it was perhaps implicit then, and we've since been through a period of digital land grabs, so now we're realizing the price we're paying for them. In the 1600s the world, with Henry Oldenburg, went through a similar process of opening up science. What's the crucial issue today? Sorry, I don't want to push you too hard. I can take a stab at that. A main theme in my research is that in the 1660s we had certain standards. So why are we not adhering to them now? Until 20 years ago, when computation became more pervasive, you had strict standards on what you would include so that other scientists could replicate your experiments and verify your results. When you introduce computation, the results are, um, so detailed and complex in terms of the steps that you have taken, the invocation of scripts and code, what parameters you used, that it doesn't fit into a section in a paper. We've not embraced reproducibility... A lot of my work has been around fostering reproducibility. We've had this standard since the 1660s; our tools have changed. We need to recognize and adapt to this, to simplify something that we already have. 400 years ago the methods for disseminating science matched the research output, and they did for 400 years. And now, as Victoria is alluding to, we have so much data, so many IP issues, licensing and so on, that disseminating all of this takes quite a bit more effort than it did 400 years ago. I don't know if the debate over open science is any larger now, but it certainly seems like it. We need more channels. I guess for me, the answer is that the web has reached a stage of maturity where we can share much more of the detail of what the process of science is and what the outputs of science are, and we haven't used it effectively in the past 5 years; we have.. taken traditional print publication and previous forms.. and that has been about the end of it.. we can do a lot better, and we need to do a lot better. We have a lot more scientists, a lot more people and a lot more projects. Actually being able to facilitate effective communication between the right subsets of people: it should have been done 5 years ago, but it should happen over the next 5 years.
This isn't so much a question as a comment to spark further comments from the panel. A lot of the arguments for and against open access to the literature, which probably won't find much sympathy in this room, seem disproved by currently existing counter-examples. If you post pre-publication versions, then supposedly you can't have a real publication because the journals won't accept it, and there can't be peer review, and that's sort of nonsensical. This is the norm in math and physics; my background is in math, and as soon as you have something that is reasonable, you put it up on the arXiv preprint server, you have published it, and everyone can see that you got there first. The peer review happens when you submit it to a journal. So it seems like a historical thing in the life sciences; how is it that it keeps persisting despite clear evidence that these arguments don't make sense?
Very quickly: it varies between domains. In math, computer science and physics, they have led the way. If you post a preprint in chemistry, the American Chemical Society will not publish it. We have to change the way that they look at the world. It will be tough. This is true in various areas. If you let the publishers run things, you will end up with a lot of these problems. If we run things, we can decide what the rules are. We are our own worst enemies; there is incredible conservatism, and we do this to ourselves. You can see this very easily: go into the top-ranking universities and look at the people doing things that are radical and unusual. They are people who are middle-aged, who have some sense of security, who are small in number and not terribly successful, but they got tenure. The people who are most conservative are postdocs and tenure-track professors, who are desperately doing whatever they can to fit into the mold to get the job. There's a whole industry serving their conservatism.. because of the intense pressure. You have to validate, or take account of the value of, the different types.. I think that will change, but I think that's one reason.
I am interested in data that is inconclusive, and in negative results. I am moving from physics to bioinformatics.. er, something. A lot of data has negative results. I can negotiate the process, but it puts up a barrier. Some stuff is kept secret, like bad drugs. Later on, if you put those drugs on the market without explaining all the bad effects, maybe as a society we might want to start proposing some rules on publishing this data. In big data sets you can find random stuff in the data. There are scientists who are advocating the idea of publishing failed experiments. It would be nice to have failed results published, because at least 25% of the work in a lab is stuff that somebody already did in that lab 5 years ago. There's a lot of work wasted. At least get the boring science published; then we move on to stuff that is exciting. I put up 5 years of failed and boring PhD research, so it's up on the web to download (applause). This is an extremely serious problem. Part of the open science debate is this multiple comparisons problem: data mining across data sets, then picking your one significant result, as if you had just found that, and publishing it. When you are going to do a scientific experiment, before you look at the data, write down the hypotheses, post them with a timestamp on a website, and then test those hypotheses. But the problem is extremely pervasive; I see it just from doing statistical consulting in a number of fields, and I'm not going to point fingers. And a lot of, some kind of mechanism to keep track of technical an.. broader issue. I think this is something where openness, in terms of reproducibility, workflow tracking and data provenance software, things that get the actual methods out into visibility and at least out of the lab, could help. So something that could be incorporated into this: recording your experiments and being able to communicate them in a way that lets others figure out what's working and not working.. and p-values. There's one other thing. The problem is that there aren't venues.. no, that's not it.. there are journals of negative results. How many scientists in the room have had a bunch of negative results to publish, but haven't had the time to write the paper? There's about 3 hands. There's a real need to make this work.. where we can record that research in a way that can be published with a very easy press of a button.. writing a paper takes a long time. This is a reason for open notebooks; it's not so much that others can read them, but so that we can build systems that make single-button publishing possible. Maybe not into a journal, but at least it's there. For him, it's not worth his time to write up those negative results. If the incentives were there, maybe you could change his mind.. what incentives could be put in place to make this happen?
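To make the "write your hypotheses down and timestamp them before you look at the data" suggestion above concrete, here is a minimal sketch of that pre-registration idea. The file names and the JSON record format are placeholder assumptions for illustration, not anything a panelist described; the point is just that a public, timestamped fingerprint of your hypotheses is cheap to produce.

```python
# Sketch: fingerprint a hypotheses file and record a UTC timestamp before data analysis.
# File names and record format are illustrative assumptions.
import hashlib
import json
from datetime import datetime, timezone

def register_hypotheses(path="hypotheses.txt", record_path="preregistration.json"):
    text = open(path, "rb").read()
    record = {
        "sha256": hashlib.sha256(text).hexdigest(),          # fingerprint of the exact wording
        "registered_at": datetime.now(timezone.utc).isoformat(),
    }
    with open(record_path, "w") as fh:
        json.dump(record, fh, indent=2)                       # post this record somewhere public
    return record

if __name__ == "__main__":
    print(register_hypotheses())
```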
I haven't heard the distinction between academic research that is publicly funded, where under those circumstances the data should be open, and the derivatives of that science, like technology that is commercialized, where we permit the existence of intellectual property through copyrights or patent rights. The other issue is quality control. Regarding the authenticity of data, all this openness is making a mess of the signal-to-noise ratio. How do you know what you have? I have an example from the business that I am in: the mess made by irresponsible marketing of genetic analysis data, which I think is a very serious setback both to the open science concept and to the commercial development of what can come out of sequencing. I'll leave it to you guys to argue about it.
There's an assumption about the information that is coming in from researchers: that more stuff being published is bad because we'd have to read all of it. That assumption is fundamentally flawed, and you don't need to look very far to see why. You want more stuff published, even more rubbish published, because then someone can build Google for science. We need to make the discovery process better. We're never going back to reading the table of contents. It's filter failure: in science, as elsewhere, we always assumed that we had to filter stuff before publishing. That's crap. There's a moral argument for public availability of science. In a sense, the more interesting question is: what is the business case for open approaches to technology development? In which cases are they going to be successful, how can we build legal structures to support these open innovations, and how do we get the best commercial return on public investment?
I think I have a quick point. Taxpayer rights: like, I am a taxpayer, and I paid to fund you as a scientist, so I want to read what you write. But it isn't really about the benefits to the taxpayers. It's about society as a whole: we've decided to do this and give it back to society as a public good. A problem with the taxpayer argument is that not everyone is a taxpayer.. what we've decided is that as a society we're going to do it for society's benefit. The second point is about the quality of data and analyzing it: we're going to have to use machines to help us. We publish PDFs and GIFs and stuff which machines can't deal with. We're looking at semantic publishing of data so that machines can referee those bits that machines are good at. The data has to be fit for review and search by humans and by machines. In some disciplines this is already true; astronomers do it well. We do not put enough effort into it, and we are going to have to.
I usually agree with everything Cameron says.. but not today. Many younger scientists have become conservative in their behavior, but they don't start out that way. If you look at what 12-year-olds are posting on Facebook, they want to share everything, and within science this is also true. There are many young scientists who really get what open science is about. Our culture has bashed them over the head about why they shouldn't be open.. they learn this as part of the culture of science. What we need to do, as the older leaders of this community, is not just give examples, but also try to solve those existential crises about why these people no longer want to be open. That's the most important thing here: how to keep them open, the way they want to be. I think it's the other way around.
I'll respond to that. I agree with you. That's why I used the term postdocs instead of students. We need to do something radical about the way we train young scientists, and be radical about the way to be radical. A change of policy would definitely help with that. Speaking as someone who left academia to take a career in science as a technologist, I am trying to create the tools not to shame scientists into putting their data out there, but to make it harder for them not to, when they see that their peers are promoting their data, and their peers are getting recognized and they are not. I will hint later at how we are going to do that.
This question about how you get more people into open science: some of it is inertia; it's not malicious, it's just quite hard. In a big genomics institute, making sure that data gets out the door, and making sure there's metadata, there are lots of funding requirements and things. We make the metadata a requirement. Open access publishing requirements are something else we require. We have to force it with an internal incentive, but there's also a wider issue about people keeping secrets. There's the U.S. Bayh-Dole Act. Things that get developed in academia get translated into products; the fix Bayh-Dole designed was to give grantees the rights so they could make money out of it, to incentivize them to push things through. There's been lots of tech transfer, much of it costing more than it brings in, and a lot of patents which industry says it can't actually get access to, because academics have inappropriate views of how much they're worth.. so that's a big block. They will not license anything else.. so it looks like it's just the wrong tool. The objective was to improve translation, and they tried this mechanism, which made things worse. We've come across this idea of openness instead: how about aggressive openness, keep the academics pure, and that might be better for translation than the Bayh-Dole approach. Thank you so much for that; that was not planned. We have slides that we have to get queued up. Cameron, please step down, and Mike Gretes, Morgan, and Martha, come up, and we're going to run through the ten-minute presentations starting with Mike.
Neglected R&D: How can open science bridge the health gap? Mike Gretes, MindTheHealthGap.org
Okay, hi. Good evening. Thanks to the organizers for setting this up. From what I understand, there are two threads: the ethical dimension of why we should do open science, because humanity has a right to the fruits of scientific research, and the fact that we can make science better if we have greater openness. I wanted to touch on both themes, and also kind of give that moral, an .. as well.
Okay, so, my name is Mike Gretes. I am going to talk about neglected disease R&D and how open science can help with that. What is the health gap? There are differences in terms of life expectancy. In the dark green there, where most of us live, you get to live to 70 or 80. You can expect half of that in other parts of the world. Why is this? People haven't always lived to 80. If you made it to the age of 33, like Jesus did around the year 0, you were actually doing pretty well for that time. Much of the world is still trapped in the health situation of hundreds of years ago. So why is that? There are a lot of reasons: violence, and infectious disease is a huge element of it. Hear me out: malaria, HIV, tuberculosis. Look at the top ten killers of people: four of the top seven on this list were infectious diseases, including TB, HIV, and brain infections. And it's not a case of what doesn't kill you makes you stronger. If you look at the amount of suffering and disability caused by infectious diseases, measured in disability-adjusted life years, the same pattern persists: the same countries suffer the disability as well as the death. Parasitic infections contribute to this; they don't kill many people, but they cause a huge amount of disability. It's billions of people. 2.5 billion people affected by TB, and a billion and a half affected by other infections. It becomes not only a technological question, but also one of .. income.. purchasing power. The countries that are sicker and die.. this isn't a surprise to anyone, but the fact that it isn't a surprise shows how profound this question should be. It still persists. So, I wanted to talk about the many diseases that people haven't heard about, but I am going to talk about one. You can argue about whether HIV is a disease or not, but 25 years ago it was definitely a disease. No cure: HIV, you're going to die. For a while it was not even recognized as the cause of AIDS. From 1987 to 1994, you can see the number of people, and HIV just continuing to climb. It's a lot more staggering if you compress the X axis: there's this trend, and nobody knew what was going to happen. And then the turnaround, and what caused that to happen. This is Joseph. I won't tell you where he is right now; he's in a health care facility. The health outcome he's seeing right now is from HIV; it could have been anyone in the 1980s. But what turned this around was antiretroviral drugs and therapies, and these tailed off the infections in the United States and much of the developed world. What does that mean?
This is my neighbor. He has been living with HIV. He was dying of a cancer caused by HIV/AIDS. In the 1990s, he heard about antiretroviral therapy from his friends, and he asked to be put on it. He made a full recovery; he's very healthy. At his worst, he looked a lot like Joseph did. So thanks to these drugs, he turned his life around. Joseph, though, still looked like that in the year 2000 in Haiti. There's this sort of biomedical victory.. the great achievement of antiviral drugs.. but without the social innovation to actually get these drugs to everyone, for most of the world it's an incomplete victory. There's a whole story about that. Need... smuggling drugs down to Haiti.. in 6 months, Joseph had recovered. This matters to us because we're interested in doing research, and at least in part.. drug development is a way of helping people in the immediate term. The burden of disease is in the colored bars here. There's cancer, HIV/AIDS, TB, malaria, and other diseases lined up. The number of stacked pills there represents the number of new drugs.. the rate of new drug development from 1975, over a 30-year period. There were no drugs whatsoever for HIV; now there are many. The rate of drug development for the neglected diseases is far less than that for cardiovascular disease. It falls along the line of rich and poor. We would like to see the same number of drugs for most of these diseases. What are the challenges? We had our PhD team here. What happens for those cancer and cardiovascular disease drugs? We have patients like my neighbor John, and then we have funders, the Heart & Stroke Foundation, and a large profitable industry that can invest a lot of resources and put it into the pipeline; a lot of universities help out, and then there's MTHG. I'll skip over this. This is what the pipeline looks like in detail. It costs a lot. Why can't we do drug development more cheaply? We should. The expense is mostly in phase 3: we're talking about 3,000 patients for several years, and that's hundreds of millions of dollars per drug, and the estimate varies based on who is asking. The public isn't going to pay that (?). If there really isn't money to pay for it, only small industry interest, even with a huge patient population, you're not going to see anything for some time. The situation is changing. Linux. I don't know what the open science mascot is, so I just used the Linux penguins.
thesynapticleap.org
tropicaldisease.org
sandler.ucsf.edu/lnf
rarediseases.
pd2.lilly.com
gsk.com
collaborativedrug.com
Medicines for Neglected Diseases (Boston)
info@mindthehealthgap.org
Two ideas for open science. Victoria Stodden. How to frame open science in a way that it will catch fire across all scientific disciplines: open science as a movement, from my perspective, with reproducibility as a framing principle. I'll also touch on what I believe is a credibility crisis, at least in computational science, based on the fact that we're not sharing data and scripts in the way that we would be if we were following scientific principles. Code needs to be included in this framework of open science, underscored by this concept of reproducibility. My understanding, or one way to think of a movement, is that it's something that is emerging across multiple disciplines. It's not just happening in biology or crystallography; it's writ large. We have changing communication modalities and the pervasiveness of computation, which change the type of knowledge, the questions we can ask, and the different ways of using it. There's also a cultural component. Data standards are being discussed in many circles, along with what should be published with your paper. We also had the opening discussion where Tim mentioned how in the UK there are data release plans, and the NSF has decided that data management plans will accompany all grants starting in October, and they will be peer reviewed. Another dimension of the cultural aspect is standards and expectations. There are journals and so on, but the strongest incentive, as scientists, is what do our peers expect, and what are the standards in my local community? So my thesis for this mini-talk is that our adaptation to the technology and to the new openness and sharing is not happening fast enough, and that is bringing about a credibility crisis. In Climategate, there were many emails, but also some documentation files and other pieces of code, from a university in the UK, one of the premier climate research schools. This was a failure of information sharing: we didn't know how the results were being generated. It's not so much that the scientists were bad; we just wanted to know what was happening. Something this week was .. some groundbreaking work on using genetic data to know which drugs will best treat your cancer, with clinical trials at Duke, and there's been a lot of scrutiny as other scientists found mistakes. The work was award-winning.. and the mistakes shed a lot of light on what may actually be flawed science, so this scandal is ongoing. I don't think that scandal is too strong a word.. what type of review does our work go through? And what could be a foundation for safe clinical trials.. mistakes in publications.. So what's with all these stories? There are lots of risks.. it's a problem. This has actually started to seep into .. an off-hand .. Also, what I think the solution to this is, is getting the code and data out there so that there is a way to reproduce the results, so that they can be shared.. underlying the published results, at the time of publication. Reproducibility. The cooking and cleaning of the data, the write-up of the results, and making the results public. So, um, I would also like to argue that all of these aspects of what a scientist does contain deep intellectual contributions. For example, the knowledge embodied in all of these steps is important for the replication of the results. Data filtering is not trivial. It is not only hard and complex to replicate; it can really impact the outcomes. Leaving out a few observations here and there can dramatically change the results.
Data analysis typically embodies, in many cases, a large amount of intellectual capital in terms of the statistical methods and the modeling, which can constitute deep intellectual contributions to science. So all of this, the filtering and analysis and the software, is necessary for replication, and it would be an oversight to leave it out of the discussion when we're talking about transferring knowledge. Open code is as important a part of this as open data, so it must have an important role to play here. One thing that I have been working on is something I call the Reproducible Research Standard: a licensing framework for code and data and for the published paper, so that scientists can attach a license under which all of the work can be freely shared, consistent with scientific norms and not in violation of copyright. My recommendation, in brief, is to attach an appropriate attribution license to each component: use my work however you like, just attribute me; or put it into the public domain. There is also this notion of the research compendium that we're seeing discussed more and more: a paper, code, data. Tools are developing rapidly in different areas, and it's exciting; they make it easier for scientists to get the code and data into a format that can be shared and that others can use to verify the work. There are many more. Publication is being assisted by Sweave, so that when you compile your published document, it re-runs your code on your data and so on. There's also GenePattern. Sweave. Software-sharing platforms: mloss.org, DANSE, Madagascar, Taverna, Pegasus, Trident Workbench, Galaxy, Sumatra. These allow a community, often very specialized, to understand the data and use the tools for that data on a common platform. Madagascar is a platform for sharing in geophysics, with lots of workflow tracking. We have .. this is her work. Pegasus. Trident. Galaxy. It goes on and on. This will facilitate the openness of code and data in terms of reproducibility. My final slide. Open code and data is a unifying principle which will allow us to do what we talked about at the very beginning: make this a movement that goes across all scientific fields.. we can rely on the notion of reproducibility and reproducible research. This is nothing new in science; it's just something we signed up for when we signed up to be scientists. We are not updating the social contract; what we're doing is returning to the scientific method, which has been around for hundreds of years. (applause)
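To illustrate the compendium idea mentioned above, here is a minimal sketch of a "press-a-button" reproduction script that a paper could ship alongside its code and data. The analysis function, the threshold parameter, and the file paths are placeholder assumptions for illustration; this is a generic sketch of recording everything needed to regenerate a published result, not the Reproducible Research Standard or any of the named platforms.

```python
# Sketch: rerun the analysis end-to-end and write a manifest tying the result
# to the exact input data and parameters. Paths and the analysis are placeholders.
import csv
import hashlib
import json
from datetime import datetime, timezone

def sha256(path):
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def analyze(data_path, threshold=0.5):
    # Placeholder analysis: count rows whose last column exceeds a threshold.
    with open(data_path) as fh:
        rows = list(csv.reader(fh))
    return sum(1 for row in rows if row and float(row[-1]) > threshold)

if __name__ == "__main__":
    data_file = "data/observations.csv"          # raw data shipped with the compendium
    result = analyze(data_file, threshold=0.5)   # the filtering/analysis step under scrutiny
    manifest = {
        "data_sha256": sha256(data_file),        # ties the result to one exact data set
        "parameters": {"threshold": 0.5},        # every choice that could change the outcome
        "result": result,
        "generated_at": datetime.now(timezone.utc).isoformat(),
    }
    with open("manifest.json", "w") as fh:
        json.dump(manifest, fh, indent=2)
```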
Peter Murray-Rust. I am a chemist. I don't know about slides, because I do not do PowerPoint. If any of you have anything, here we are; right, you can type murrayrust blog and you will see it. Click on various things as we go through. My main method of presentation is flowerpoint. I am old enough to have remembered the 60s, and not to have been at Berkeley, but it has made a huge contribution to our culture. The Open Knowledge Foundation has adopted this as a way of making my points. We have many different areas, maybe 50, that come under "open", that relate to knowledge in general. If we can scroll down.. First of all, my petals are going to cover various aspects of openness. So I will cover those things there; if you can go down to the second link, the Open Knowledge Definition. This is the most important thing in this. A piece of knowledge is open if you are free to use, re-use and re-distribute it, subject only to attribution and share-alike. That's a wonderfully powerful algorithm. If you can do that, it's open. If not, it's not open according to this definition. What the OKF has done, another picture, the Panton Principles. It's a place called a pub. It's 200 meters from the chemistry department where I work, and between the pub and the chemistry lab is the Open Knowledge Foundation. Rufus has been successful in getting people to work on this. A lot of this is about government, public relations. How many people have written open source software? What about open access papers? How many of them had a full CC-BY license? If they weren't, they don't work as open objects. CC-NC licenses cause more problems than they solve. How many people have either published, or have people in their group who have published, a digital thesis? Not many, right? How many of those explicitly carry the CC-BY license? That's an area where we have to work. Open Theses are a part of what we're trying to set up in the Open Knowledge Foundation: if people made the semantics available, the LaTeX, Word, whatever they wrote it in, that would be enormously helpful. The digital land grab in theses is starting, and we have to stop it. There are many things we can do. There are two projects, and these have been funded: Open Bibliography and Open Citations. At the moment, we're being governed by non-accountable proprietary organizations who measure our scholarly worth by citations and metrics that they invent, because those are easy to manage, and who retain control of our scholarship. We can reclaim that within a year or two, gather all of our citation data and bibliographic data, and then, if we want to do metrics, and I am not a fan, we should be the ones doing them, not some unaccountable body. Anyone can get involved in Open Bibliography and Open Citations. The next petal is open data, and that one is not so straightforward. Jordan Hatcher, and John Wilbanks from Science Commons, have shown that open data is complex. I think it's going to take 10 years. This is the group involved in the Panton Principles; I can't point to them. Jenny Malone, Jenny is a student. The power of our students.. undergraduates are not held back by fear and conventions. She has done a fantastic job in the Open Knowledge Foundation. Jordan, then Rufus, John Wilbanks, Cameron, and me; anyway, we came up with the Panton Principles, so if you go back a slide, you will see the Panton Principles, and let's just deal with the first one. Data related to public science should be explicitly placed in the public domain. There are four principles to use when you publish data.
What came out of all of this work is that one should use a license that explicitly puts your data in the public domain: CC0, or the PDDL from the Open Knowledge Foundation. So, the motto that I have brought to this, which I've been using and which has been taken up by.. our library in the UK, is on the reverse of the flower: reclaim our scholarship. That's a very simple idea, and it's possible: if enough people in the world look to reclaiming scholarship, we can do it. Many more difficult things have been done by concerted activists. We can bring back our scholarship to where we control it, and not others. I would like to thank the people on these projects, Open Bibliography and Open Citations, and our funders and collaborators: JISC, who funds it, BioMed Central, who also sponsors this, Open .. the Public Library of Science. (applause)
BioTorrents: a file sharing service for scientific data. Morgan Langille. Here's a way to share your data right now. First, I'd like to acknowledge the Moore Foundation and my supervisor, who let me take this tangent. You can send me comments via Twitter. I think we all agree that data is growing; we're drowning in data, though I hate that term. I'm going to throw some terms at you. If we want to continue to share data, and share it more openly, it should be simple, and so on. There are three sort of personal challenges I hit on a day-to-day basis, and this is why I built BioTorrents. I want download speed and reliability: I just want to grab some data, and it's annoying that it takes 3 days before I can get it. I want to share all the data associated with a study; the easiest way was to package it up and share it somewhere. With BioTorrents you can do that. It's not super elegant, but at least it gets out there. With traditional file transfer, you connect to one main server, and another user also downloads that data; the white bar indicates how much of the file has been transferred, and another user doesn't get the data because of the bandwidth. Unfortunately the data has to travel across entire continents, and between the two institutions your actual download speed is sometimes very limited, for whatever reason. If the site goes down, for planned maintenance or by accident, that data doesn't get out there, which is not good if the data is time-sensitive. You also want to check that the data is the original copy; using the traditional method you have to do that yourself.. it's not in the protocol by default. There has to be a better method. Today I can download movies, movies that are legal, much faster than data that is open. In a peer-to-peer file transfer method like BitTorrent, the data set is broken up into small pieces, and each computer holds some of the pieces; you still have that sole provider, but then the other users, as long as they have different pieces, can share them, so bandwidth grows as users increase. The other computers might be geographically dispersed, so some peers might be nearby. As long as there's at least one full copy in the swarm, everyone can eventually get the data. There's also a SHA-1 cryptographic hash so that the data can be guaranteed to be the original. It's really well tested.. at least 25% of all internet traffic is BitTorrent. We can use it to share movies, but also data. So how easy is it to use? Install a BitTorrent client and download a .torrent file. Basically, when a user downloads a .torrent file, there's a tracker/server, like biotorrents.org, and the data is not hosted on that server. What's hosted on biotorrents.org is just the .torrent file, not the giant data set; it's metadata. Behind the scenes, the software connects to the tracker, gets IP addresses, and the peers start communicating with each other and sharing data. There are a few other BitTorrent features.. unique IDs: whenever you create a dataset, there's a hash of the whole data set, so you're guaranteed that another person sitting beside you is using the same dataset. There's also a distributed hash table in the client software, so peers can find each other without connecting to a tracker, even if it is down; data can be posted to different trackers, and the clients can find each other through those other trackers. And there's local peer discovery, so that when lots of people nearby download a data set they can find each other and transfer the data over the local network.
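As an illustration of the integrity point above: the metadata in a .torrent file includes a SHA-1 hash for every fixed-size piece of the data set, so each downloader can verify each piece against the original. Here is a minimal sketch of computing those piece hashes; the piece size and file name are assumptions, and the bencoding that real .torrent files use is omitted.

```python
# Sketch: compute per-piece SHA-1 hashes like those stored in a .torrent's metadata.
# Piece size and file name are illustrative; real clients also bencode this metadata.
import hashlib

def piece_hashes(path, piece_size=256 * 1024):
    hashes = []
    with open(path, "rb") as fh:
        while True:
            piece = fh.read(piece_size)
            if not piece:
                break
            hashes.append(hashlib.sha1(piece).hexdigest())  # one digest per piece
    return hashes

if __name__ == "__main__":
    for i, digest in enumerate(piece_hashes("genome.fasta")):  # hypothetical data file
        print(i, digest)
```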
I found out about this by accident, and I started testing it, and it was blazingly fast, which was nice. If data is hosted via traditional methods like FTP, those sources can be added and they just act as an extra seed. You can upload your favorite genome to any one of the existing torrent sites; lots of these exist already, but a lot of them have illegal copyright file-sharing issues. There are a few other academic trackers, but not very many. On top of that, even if you did upload your data there, it would be hard to find, because the community there isn't oriented to science. So, that's why I made BioTorrents. Of course, all data must be open; no illegal file hosting; and it's for the biological domain. As I mentioned, the data isn't hosted on BioTorrents, but I'm mirroring the data on a separate server; in the long term it's up to the users to provide the seeding of that data. You can search and browse by particular scientific categories, and also by license and username. You have to set some kind of license when you upload; there's a large list there, and an "Other" category, but people usually pick one of the listed licenses. Anyone can download the data without a username, but if you want to interact with the site, you use your own username. Hopefully people will get a reputation for sharing good data down the road. A few cool things about this: there's an RSS feed, where you can automatically download data sets. There are also versions of a data set; if the data gets corrected, or just expands, or whatever, you can handle that through versions with RSS feeds, where you basically subscribe to the versions of a certain data set and from then on you get all new versions of it, which also means updates are handled. Lastly, there's an upload script. So far there are about 1,000 users, and it's pretty limited on the number of data sets, so upload whatever you've been sitting on for years; do we really need to sit on it? Here's an example: GenBank. By FTP it took 6 hours. Right now the only way is to get it from NCBI.. and I can only get 0.5 MB/sec, and that means 5 days. So that sucks. So who uses BioTorrents? Existing large data providers; scientists sharing and publishing data; scientists sharing unpublished data. There are issues with any sort of technology. Metalink. Volunteer computing. That's it. My final message is that data transfer should be fast and easy. Embrace technologies such as BitTorrent. Hopefully.. thanks.
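The "subscribe to a data set's versions" idea above is easy to sketch as an RSS poller. The feed URL below is a hypothetical placeholder, not BioTorrents' actual feed layout, and the date comparison is deliberately naive; it is only meant to show how a client could notice new versions automatically.

```python
# Sketch: poll an RSS feed of data-set versions and report items newer than the
# last one seen. Feed URL is hypothetical; the date handling is simplified.
import urllib.request
import xml.etree.ElementTree as ET

FEED_URL = "https://example.org/biotorrents/rss?dataset=greengenes"  # placeholder

def new_items(feed_url, last_seen=None):
    with urllib.request.urlopen(feed_url) as resp:
        root = ET.fromstring(resp.read())
    items = []
    for item in root.iter("item"):                 # standard RSS 2.0 <item> elements
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        pub = item.findtext("pubDate", default="")
        if last_seen is None or pub > last_seen:   # naive string comparison; a real
            items.append((pub, title, link))       # client would parse the date
    return items

if __name__ == "__main__":
    for pub, title, link in new_items(FEED_URL):
        print(pub, title, link)
```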
It's time we changed how research is done
We are going to change the world of reference management. This is a bold statement, a ridiculous one, I would have said to you a couple of years ago. Here's what we are doing and what we are seeing. To explain the why, we are going to enlist the help of Tim Berners-Lee. What Tim said, to paraphrase, is that we have all this information on cancer, stem cells, diseases, and it's all siloed away on different computers. So Tim has issued a challenge to unlock this data. It's not just about technology; there's this huge social norm, the behavior of people, around open data and open science, and it's actually obstructing human progress in the world. The U.S. National Academy of Engineering issued a few grand challenges, and one of these was engineering the tools of scientific discovery; how can we address that challenge? With Mendeley, we are trying to make science more transparent and open, and we're trying to build the world's largest academic database. Keep these things in mind for the next slides. Helping researchers work: we extract the text from PDF files; there's a PDF viewer.. you can annotate, and cite with Microsoft Word or OpenOffice, and then what we do is take that research data and aggregate it in the cloud. By doing this, we are helping researchers collaborate and we're making that data more transparent. This is a screenshot of what you get when you sign up for Mendeley Web: you start seeing what's going on with people you're collaborating with on different projects. What separates us from other reference managers? We find statistical trends, like the most popular author or paper for the upcoming week, and if you're familiar with Twitter and trending tags, we show some of that. So we take all of that data that was siloed away, and we built a search catalog on top of it. The big difference between our search catalog and something like PubMed is the 27 there: that's the number of readers for that particular article, and you can't get that if you're just doing a search on PubMed. So I clicked through to the landing page: the standard citation information is there, but then we also start digging down into the demographics. Who are these readers? PhD students, professors, where are they from, what discipline are they in? Because most of these papers are multi-disciplinary. And then of course we show some related research, like TIDEF, and also collaborative filtering: the research papers that you should be reading but may have been missing. So we've been in public beta for 18 months, and we have about 450,000 users. These are the top 20 universities so far. In terms of the number of papers we're aggregating, we have 29 million papers whose metadata has been uploaded. For a sense of scale, the Thomson Web of Knowledge database has 40 million papers, and it took them 50 years to get there. We might be able to match that amount in just 2 years. One more thing: we created an open API so that others can access the same metadata and statistics, and there are some mashups that developers are creating: chemical compounds, location-based mashups of Alzheimer's research, SWAN data, grant search engines, Twitter streams; we have people building Google Wave mashups, Microsoft Word mashups, Google Docs mashups with these open APIs. And as far as the future goes at Mendeley, getting back to what Tim Berners-Lee said, all that data is filed away on individual computers. There's a vast amount of knowledge in our heads. How do we re-use and repurpose scientific knowledge? How about semantic markup of papers?
Does this sentence support this paragraph or sentence from this other paper? So we're creating a human-curated, high-throughput, crowdsourced system for semantically linking papers in PDFs that would be impossible to link even if they were machine-readable. And just to get back to this statement here: how do we change the social behavior of scientists who are skeptical of sharing their own publications? One of the things we're experimenting with, and haven't released yet, is reputation metrics. We might show the number of downloads of your publications or the page views, as an incentive, via your reputation metric, for scientists to upload their PDF files and, later on, their own data. So, to end by getting back to what Tim was saying: are we unlocking the data that we have been siloing away for years? I hope we're doing that, and I hope that our project will encourage others to do similar things, maybe with BioTorrents.
Open, candid discussion of science
Martha Bagnall
thirdreviewer.com
This conference is long overdue. Scientists have conversations about the quality and evaluation of the published literature all the time. Every day you walk into the lab and someone has seen a recent paper: what do you think about it? Scientists are constantly using this information from the published literature to design new experiments, build on the published literature, or whatever. But what are the venues for those kinds of communication?