My name is Michael .. I am a theoretical physicist and I am about to publish a
book on open science.

We're going to up the gain on this microphone. We'll get on that. How is it
here? Are we live there? Okay, just a second. Just to get.. Australians, my
name is Cameron Neylon, at a research council in the UK. I am interested in
open science, sharing science, the process of science, how to make the web
suck less, how to do science with others. I'm Victoria Stodden, I am a
statistician at the university; I am going to talk about the credibility
crisis in computational science, how the solution is in this room, and how
open science will help it, increasing the visibility and pace of science. I am
Jason Hoyt at Mendeley, and I am here to talk about what you want to talk
about. You'll hear me in a few minutes. I am Mike Gretes from Mind the Health
Gap: how open science and open innovation can be applied to research and
development. Thanks. It sounds like it's much better.

We're going to do about 15 minutes of general Q&A and discussion among the
panelists here, and then we launch into our first presentation with Mike
Gretes, who is hereby on notice: be ready to start in 15 minutes. Why don't we
have the panelists open up for any general questions that the audience would
like to ask. Can we start with a good definition of open science? I thought
that science was open by definition.

The "open" bit. The open bit means that it is available to any one in the
world to do whatever they like with it, without any strings attached. I wuold
agree with that. It's pretty clear that we don't have that situation right now
in science. I would be thinking for some time- open science is not something
that you want to define too precisely. It's a great banner, and a great
rallying pool for people with quite diferent priorities. There are people who
very strongly advocate public access to the published peer-reviewed
literature; there are people who are strongly arguing for access to data, and
that includes the open government data movements, and then there are others
who are interested in improving the efficiencies in the way that we do
science. All of these people might not agree on the priorities. We see some
different priorities in the open access movement. There are some differences
in science.

Have you ever downloaded someone else's scripts without an open license, and
then run them? There's the horror story of IP- it gets in the way, and it's
actually illegal. This is just one issue that is pervasive and a real barrier
to sharing and openness in science. There are certainly horror stories in
terms of patents, and maybe I'll let other people talk about that. Here's a
brief personal story. I was involved in a project we were trying to get
funded. It was in nanotech and heart disease. It was important, it concerns
people a great deal. It was for treating the degradation of hearts in heart
disease. We had to have trials, we had to have patents; at the same time,
we're talking about putting nanoparticles into people. We need to have a
lengthy, appropriate conversation with a lot of the community about what kind
of safeguards there should be, and we can't have that conversation until we've
got the patent. The therapy- if it works- it's not popular. These things feed
on each other in different directions, and they are not always good. They can
be problems; personal stories at least. So, I have another one. Hearsay, but
it's true. One university in the UK started putting its theses online, and one
of the scientists discovered that these theses- horror- were being read by
other people, and some of it might be patentable. People thought that theses
are for obscuring science. So now the university has put an embargo on theses
so nobody can read any of them. We have to educate people on best practices. I
did a survey of the machine learning community: what was preventing them from
putting code and data online, and what enabled them if they did it? A number
of senior professors said that they wouldn't put the code up, because they
would patent it and do a startup, and the scientific community doesn't get it.
At Mendeley, we've had a couple of people contact us, shocked that their
metadata ended up in our search catalog. People wanted that taken out. That
was to their own detriment; now people can't discover their work. It was quite
weird. I can't even think about the IP issues- we have people getting upset
about sharing publications because of IP, so we have a long way to go
apparently.

I just wanted to contribute an example of a horror story. For the last 18
months, we've been doing a project on gene therapy. A number of scientists-
these are great scientists, they are bright- the majority of scientists get
sto.. they said nothing, and this is a real problem, because we're going to
break this impasse of intellectual property dominating science. It's going to
take courage, it's going to take scientists... evidence of how their research
is being impacted. The evidence came from Peter Mac. This is a world-renowned
cancer research institute in Melbourne. Publicly funded. For 2 years, research
on BRCA1 gene sequences was delayed because of the dispute between the
Australian and .. over who actually could give permission to these guys to do
this work. Eventually, the research was allowed to go ahead, but it had been
delayed, and the cost had tripled. Let me also add one more comment. This idea
that you need a patent, or that without a patent you aren't going to get a
license, is just nonsense. It really is nonsense. Until 1978, you couldn't
even get a patent on this, and yet we have medicines. By the way, the world's
first antibiotic (penicillin) was developed without patent protection. What
happened before the patent system even existed? Why did we have this
innovation? It's not because of the patent system. Scientists have to stop
believing the nonsense of patent lawyers. They have done a good job of making
sure you understand this... that you need a patent to successfully
commercialize your research. You do not.

A very simple question: why now? Why is openness important now, but not in
1995, or 1865, or so on? The human race is growing up and this is an important
part of its future? Has that ever not been true? We haven't understood the
importance of openness until now. When I started 50 years ago, it wasn't an
issue; it was perhaps implicit then, and we've been through a period of
digital land grabs, so now we're realizing the price we're paying for the
digital land grabs. In the 1600s, the world, with Henry Oldenburg, went
through a similar process of openness in science. What's the jugular issue
today? Sorry, I don't want to push you too hard. I can take a stab at that. A
main theme in my research is that in the 1660s we had certain standards. So
why are we not adhering to them now? Until 20 years ago, when computation
became more pervasive, you had strict standards on what you would include so
that other scientists could replicate your experiments and verify your
results. When you introduce computation, the results are, um, so detailed and
complex in terms of the steps that you've taken, the invocation of scripts and
code, what parameters you used, that it doesn't fit into a section in a paper.
We've not embraced reproducibility... A lot of my work has been around
fostering reproducibility. We've had this since the 1660s; our tools have
changed. We need to recognize and adapt to this, to simplify something that we
already have. 400 years ago, the methods for disseminating science matched the
research output, and they did for 400 years. And now, as Victoria is alluding
to, we have so much data, so many IP issues, licensing etc., that to
disseminate all of this takes quite a bit more effort than it did 400 years
ago. I don't know if the debate over open science is any larger now, but it
certainly seems like it. We need more channels. I guess for me, the answer is
that the web has reached a stage of maturity where we can share more details
of what the process of science is and what the outputs of science are, and we
haven't used it effectively; in the past 5 years we have.. taken traditional
print-publication and previous.. and that has been about the end of it.. we
can do a lot better, we need to do a lot better. We have a lot more
scientists, we have a lot more people and a lot more projects. Actually being
able to facilitate effective communication between the right subsets of
people: it should have been done 5 years ago, but it should happen over the
next 5 years.

This isn't so much a question as a comment to spark further comments from the
panel. A lot of the arguments for and against open access to the literature,
which probably won't find much sympathy in this room- some of them seem
disproved by existing counter-examples. If you have pre-publication preprints,
then supposedly you can't have a real publication because the journals won't
accept it, and there can't be peer review, and that's sort of nonsensical.
This is the norm in math and physics- my background is in math- as soon as you
have something that is reasonable, you put it up on the arXiv preprint server,
and you have published it and everyone can see that you got there first. The
peer review process, that's what happens when you submit it to a journal. So
it seems like a historical thing in the life sciences; how is it that it keeps
persisting despite clear evidence that these arguments don't make sense?

Very quickly: it varies between domains. In math, computer science and
physics, preprints have flourished. If you publish a preprint in chemistry,
the American Chemical Society will not publish it. We have to change the way
that they look at the world. It will be tough. This is true in various areas.
If you let the publishers run things, you will end up with a lot of these
problems. If we run things, we can decide what the rules are. We are our own
worst enemies; there is incredible conservatism. We do this to ourselves. You
can see this very easily: go into the top-ranking universities and look at the
people doing things that are radical and unusual. They are people who are
middle-aged, who have some sense of security, who are small in number, not
terribly successful, but they got tenure. The people who are most conservative
are postdocs and tenure-track professors, who are desperately doing whatever
they can to fit into the mold to get the job. There's a whole industry serving
their conservatism.. because of the intense pressure. You have to validate or
take account of the value of the different types.. I think that will change,
but I think that's one reason.

I am interested in data that is inconclusive and in negative results. I am
moving from physics to bioinformatics.. er, something. A lot of data has
negative results. I can negotiate the process, but it puts up a barrier. Some
stuff is kept secret- like bad drugs. Later on, when you put those drugs on
the market without explaining all the bad effects- maybe as a society we might
want to start proposing some rules on publishing this data. With big data sets
you can find random stuff in the data. There are scientists who are advocating
the idea of publishing failed experiments. It would be nice to have failed
results published, because at least 25% of the work in a lab is stuff that
they already did in the lab 5 years ago. There's a lot of work wasted. At
least getting boring science published. Then we move on to stuff that is
exciting. I put up 5 years of failed and boring PhD research, so it's up on
the web to download (applause). This is an extremely serious problem. Part of
the open science debate is this multiple comparisons problem: data mining
across data sets, and then picking your one significant result, as if you had
just gone looking for that, and publishing it (there's a small sketch of this
after this paragraph). When you are going to do a scientific experiment,
before you look at the data, list your hypotheses with a timestamp on a
website, and then verify the hypotheses. But it's extremely pervasive- I know
just from doing statistical consulting in a number of fields, and I'm not
going to point fingers. And a lot of, some kind of mechanism to keep track of
technical an.. broader issue. I think this is something where openness, in
terms of reproducibility through workflow tracking and data provenance
software and things that get the actual methods out into visibility and at
least out of the lab, could help. So something that could be incorporated into
this: recording your experiments and being able to communicate them in a way
that lets others figure out what's working and not working.. and p-values.
There's one other thing. The problem is that there aren't venues.. that's not
it.. there are journals of negative results. How many scientists in the room
have had a bunch of negative results to publish, but haven't had the time to
write the paper? There's about 3 hands. There's a real need to make this
work.. where we can record that research in a way that can be published with a
very easy press of a button.. writing a paper takes a long time. This is a
reason for open notebooks: not so that others can read them, but so that we
can build systems that make single-button publishing possible. Maybe not into
a journal, but at least it's there. For him, it's not worth his time to write
up those negative results. If the incentives were there, maybe you could
change his mind.. what incentives could be put in place to make this happen?
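
To make the multiple comparisons point concrete, here is a minimal,
hypothetical Python sketch (my own illustration, not from the discussion): it
runs many significance tests on pure noise and shows that a handful come out
"significant" by chance, which is exactly the trap of mining a data set and
reporting only the one result that passed.

```python
# Hypothetical illustration of the multiple comparisons problem discussed above:
# test 100 "hypotheses" on pure noise and count how many look significant.
import math
import random
import statistics

random.seed(0)

def two_sample_p(a, b):
    """Rough two-sample z-test p-value (normal approximation)."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    z = (statistics.mean(a) - statistics.mean(b)) / se
    # two-sided p-value from the normal CDF
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

false_positives = 0
for test in range(100):
    group_a = [random.gauss(0, 1) for _ in range(30)]
    group_b = [random.gauss(0, 1) for _ in range(30)]  # same distribution: no real effect
    if two_sample_p(group_a, group_b) < 0.05:
        false_positives += 1

print(f"{false_positives} of 100 tests on pure noise were 'significant' at p < 0.05")
# Expect roughly 5: picking and publishing only those is the trap described above.
```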

I haven't heard the distinction drawn between academic research that is
publicly funded, where under those circumstances the data should be open, and
the derivatives of that science- like technology that is commercialized- where
we permit the existence of intellectual property through copyrights or patent
rights. The other issue is quality control. Regarding the authenticity of
data, all this openness is making a mess of the signal-to-noise ratio. How do
you know what you have? I have an example from the business that I am in: the
mess made by irresponsible marketing of genetic analysis data, which I think
is a very serious setback both to the open science concept and to the
commercial development of what can come out of sequencing. I'll leave it to
you guys to argue about it.

There's an assumption here about the information that is coming in from
researchers: that more stuff being published is bad because we'd have to read
all of it. That assumption is fundamentally flawed, and you don't need to look
very far to see why. You want more stuff published, even more rubbish
published, because then someone can build Google for science. We need to make
the discovery process better. We're never going back to reading the table of
contents. Filter failure- think of science in terms of filter failure. We
always assumed that we had to filter stuff before publishing, or whatever.
That's crap. There's a moral argument for public availability of science. In a
sense, the more interesting question is: what is the business case for open
approaches to technology development? In which cases are they going to be
successful, and how can we build legal structures to support these open
innovations, and get the best commercial return on public investment?

I think I have a quick point about taxpayer rights: like, I am a taxpayer, and
I paid to fund you as a scientist, so I want to read what you write. It isn't
really about the benefits to the taxpayers. It's really about society as a
whole: we've decided to do this and give this back to society as a public
good. A problem with the taxpayer argument is that not everyone is a
taxpayer.. what we've decided is that as a society we're going to do it for
society's benefit. The second point is about the quality of data and analyzing
it: we're going to have to use machines to help us. We publish PDFs and GIFs
and stuff which machines can't deal with. We're looking at semantic publishing
of data so that machines can referee those bits that machines are good at. The
data has to be fit for review by humans, and search by humans. In some other
disciplines, this isn't a problem; astronomers do it well. We do not put
enough effort into it and we are going to have to.

I usually agree with everything Cameron says.. but not today. Many younger
scientists have become conservative in their behavior, but they don't start
out that way. If you look at what 12-year-olds are posting on Facebook, they
want to share everything, and within science this is also true. There are many
young scientists who really get what open science is about. Our culture has
bashed them over the head about why they shouldn't be open.. they learn this
as part of the culture of science, from the "older" crowd. What we need to do
as the older leaders of this community is not just give examples, but also try
to solve those existential crises about why these people no longer want to be
open. That's the most important thing here: how to keep them being open, the
way they want to be. I think it's the other way around.

I'll respond to that. I agree with you. That's why I used the term postdocs
instead of students. I agree with you. We need to do something radical with
the way we train young scientists, and be radical about the way to be radical.
A change of policy would definitely help with that. Speaking as someone who
has left academia to pursue a career in science as a technologist, I am trying
to create the tools- not to shame scientists into putting their data out
there, but to make it harder for them not to, when they see that their peers
are promoting their data, and their peers are getting recognized and they are
not. I will hint later about how we are going to do that.

This question about how you get more people into open science: some of it is
inertia, it's not malicious, it's just quite hard. In a big genomics
institute, making sure that data gets out the door, and making sure there's
metadata- there are lots of funding requirements and stuff. We make the
metadata a requirement. Open access publishing requirements- that's something
else we require. We have to force it with an internal incentive, not just a
wider incentive, against people keeping secrets. There's this U.S. Bayh-Dole
Act. Things that get developed in academia get translated into products. The
fix that was designed was to let grantees make money out of it, to incentivize
them to go through this. There have been lots of tech transfer offices, which
double their costs, and a lot of patents which industry says they can't get
access to, because academics have inappropriate views of how much they're
worth.. so that's a big block. They will not do anything else.. so it looks
like. It's just the wrong term. The objective was to improve translation, and
they tried this mechanism, which made things worse. We've come across this
idea of openness instead: how about aggressive openness, keep the academics
pure, and that might be better for translation, instead of the Bayh-Dole Act.
Thank you so much for that. That was not planned. We have slides that we need
to get queued up. Cameron, please step down, and Mike Gretes, and Morgan, and
Martha, and we're going to run through the ten-minute presentations starting
with Mike.

Neglected R&D: How can open science bridge the health gap? Mike Gretes,
MindTheHealthGap.org

Okay, hi. Good evening. Thanks to the organizers for listening and setting
this up. From what I understand, there are two questions: the ethical
dimension of why we do open science- because humanity has a right to the
fruits of scientific research- and the fact that we can make science better if
we have greater openness. I wanted to touch on these themes, and also kind of
give that moral, an .. as well.

Okay, so, my name is Mike Gretes. I am going to talk about neglected disease
R&D and how open science can help with that. What is the health gap? There are
differences in terms of life expectancy. In the dark green there, where most
of us live, you get to live to 70 or 80. You can expect half of that in other
parts of the world. Why is this? People haven't always lived to 80. If you
made it to the age of 33, like Jesus did in the year 0, you were doing pretty
well for that time. Much of the world is still trapped in the health
conditions of hundreds of years ago. So why is that? There are a lot of
reasons for that- violence, for one. Infectious disease is a huge element of
it. Hear me out: malaria, HIV, tuberculosis. The top ten killers of people:
four of the top seven in this list were infectious diseases, including TB,
HIV, and brain infections. And it's not necessarily true that what doesn't
kill you makes you stronger. If you look at the amount of suffering and
disability caused by infectious diseases- these are disability-adjusted life
years- the same pattern persists, and the same countries suffer from
disability as well as death. Parasitic infections contribute to this; they
don't kill people, but they cause a huge amount of disability. It's billions
of people. 2.5 billion people affected by TB. And a billion and a half
affected by other infections. It becomes not only a technological question,
but also a .. income.. purchasing power. The countries that are sicker and
die.. this isn't a surprise to anyone, but the fact that it is a surprise
shows how profound this question should be. It still persists. So, there are a
lot of diseases that people haven't heard about, but I am going to talk about
one. You can argue whether HIV is neglected or not, but 25 years ago, it was
definitely a disease with no cure. HIV, you're going to die. For a while it
was not even recognized as the cause of AIDS. From 1987 to 1994, you can see
the number of people with HIV just continuing to climb. It's a lot more
staggering if you compress the X axis; there's this trend, and nobody knew
what was going to happen. And then the turn-around- what caused that to
happen? This is Joseph. I won't tell you where he is right now; he's in a
health care facility. The health outcome he's seeing right now is from HIV; it
could have been anyone in the 1980s. But what turned this around was
antiretroviral drugs and therapies, and these tailed off the infections in the
United States and much of the developed world. What does that mean?

This is my neighbor. He has been living with HIV. He was dying of a cancer
caused by HIV/AIDS. In the 1990s, he heard about antiretroviral therapy, and
he asked them to put him on it. He heard about it from his friends. He made a
full recovery; he's very healthy. In his earlier state, he looked a lot like
Joseph did. So thanks to these drugs, he turned his life around. Joseph was
still looking like that in the year 2000 in Haiti, though. There's this sort
of biomedical victory.. the great achievement of antiviral drugs.. without the
social innovation to actually get these drugs to everyone. To most of the
world it's an incomplete victory. There's a whole story about that. Need...
smuggling drugs down to Haiti.. in 6 months, Joseph had recovered. This
matters to us because we're interested in doing research, and at least in
part.. drug development is a way of helping people in the immediate term. The
burden of disease is in the colored bars here. There's cancer, HIV/AIDS, TB,
malaria, and other diseases that line up. The number of stacked pills there
represents the number of countries... the rate of new drug development from
1975, over a 30-year period. There were no drugs whatsoever for HIV; now there
are many. The rate of drug development for these diseases is far less than
that for cardiovascular disease. It falls along the line of rich and poor. We
would like to see the same number of drugs for most of these diseases. What
are the challenges for this? We had our PhD team here. What happens for those
cancer and cardiovascular disease drugs? We have patients like my neighbor
John, and then we have funders like the Heart & Stroke Foundation, and a large
profitable industry that can invest a lot of resources and put it into the
pipeline; a lot of universities help out, and then there's MTHG. I'll skip
over this. This is what the pipeline looks like in detail. It costs a lot. Why
can't we do drug development more cheaply? We should. The big cost is in phase
3: we're talking about 3,000 patients for several years, and that's hundreds
of millions of dollars per drug, and the estimate varies based on who is
asking. The public isn't going to pay that (?). If there really isn't money to
pay for it, even with small industry interest and a huge patient population,
you're not going to see anything for some time. The situation is changing.
Linux. I don't know what the open science mascot is, so I just used the Linux
penguins.

thesynapticleap.org

tropicaldisease.org

sandler.ucsf.edu/lnf

rarediseases.

pd2.lilly.com

gsk.com

collaborativedrug.com

Medicines for Neglected Diseases (Boston)

info@mindthehealthgap.org


Two ideas for open science. Victoria Stodden. How to frame open science in a
way that it will catch fire across all scientific disciplines: open science as
a movement, from my perspective, with reproducibility as a framing principle.
I will also touch on what I believe is a credibility crisis, at least in
computational science, based on the fact that we're not sharing data and
scripts in the way that we would be if we were following scientific
principles. Code needs to be included in this framework of open science,
underscored by this concept of reproducibility. My understanding- or one way
to think of a movement- is that it's something that is emerging across
multiple disciplines. It's not just happening in biology or crystallography;
it's writ large. We have changing communication modalities and pervasiveness
of computation, and that changes the type of knowledge and the questions we
can ask, and the different ways of.. using that. There's also a cultural
component. Data standards are being discussed in many circles, along with the
publication of your paper. We also had the opening discussion, where Tim
mentioned how in the UK there were data release plans, and the NSF has decided
that data management plans will accompany all grants starting in October and
that they will be peer reviewed. Another dimension of the cultural aspect is
standards and expectations. There are journals and so on, but the strongest
incentive, as scientists, is: what do our peers expect? And what are the
standards in my local community? So my thesis for this mini-talk is that our
adaptation to the technology and to the new openness and sharing is not
happening fast enough, and it's bringing about a credibility crisis. In
Climategate, there were many emails, but also some documentation files and
other pieces of code, from a university in the UK, one of the premier climate
research schools. This was a failure of information sharing: we couldn't .. we
didn't know how the results were being generated, and it's not so much that
the scientists were being bad, we just wanted to know what was happening.
Something this week: there was some groundbreaking work on using genetic data
to know what drugs will best treat your cancer, and there are clinical trials
at Duke, and there has been a lot of scrutiny as other scientists found
mistakes. The work was award-winning.. and the mistakes shed a lot of light on
what may actually be flawed science, so this scandal is ongoing. I don't think
that scandal is too strong a word.. what type of review does our work go
under? And what could be a foundation for safe clinical.. mistakes in
publications.. So what's with all these stories? There are lots of risks..
it's a problem. This has actually started to seep into .. an offhand .. So
what I think the solution to this is, is getting the code and data out there,
so that there is a way to reproduce the published results at the time of
publication and they can be shared. Reproducibility. .. cooking, cleaning,
writing up results, and then putting the results out to the public. So, um, I
would like to also argue that all of these aspects of what a scientist does
involve deep intellectual contributions, and knowledge of all of them is
important for the replication of results. Data filtering is not trivial. This
is something that is not only hard and complex to replicate, but it can really
impact the outcomes. Leaving out a few observations here and there can
dramatically change the results. Data analysis: there is, um, typically in
many cases a large amount of intellectual capital in the statistical methods
and the modeling, which can embody many deep intellectual contributions to
science. So all of this- the filtering and analysis and the software necessary
for replication- it would be an oversight to leave these out of the discussion
when we're talking about transferring knowledge and so on. Open code is as
important a part of this as open data, so it must have an important role to
play here. One thing that I have been working on is something I call the
Reproducible Research Standard: a licensing framework for the code and data
and the published paper, so that scientists can attach this license. Here's
one recommendation: all of the work can be freely shared, consistent with
scientific norms and not in violation of copyright, so my recommendation in
brief is to attach an appropriate attribution license to each component. Use
my work however you like, just attribute me; or put it into the public domain.
There's this notion of the research compendium that we're seeing discussed
more and more.. there's a paper, code, data. Tools are developing rapidly in
different areas, and it's exciting: they make it easier for scientists to get
the code and data into a format that can be shared and that others can use to
verify the work. There are many more. Publication is being assisted by Sweave,
so when you are compiling your published document, it re-runs your analysis on
your data and so on (a sketch of this kind of scripted pipeline follows this
section). There's also GenePattern. Sweave. Sharing software platforms:
mloss.org, DANSE, Madagascar, Taverna, Pegasus, Trident Workbench, Galaxy,
Sumatra. They allow a community, very specialized in a platform, to understand
the data and use the tools for the data. Madagascar is a platform for sharing
in geophysics with lots of workflow tracking. We have .. this is her work.
Pegasus. Trident. Galaxy. It goes on and on. This will facilitate the openness
of code and data in terms of reproducibility. My final slide. Open code and
data is a unifying principle which will allow us to do what we talked about at
the very beginning: make this a movement that goes across all scientific
fields.. we can rely on the notion of reproducibility and reproducible
research. This is nothing new in science; it's just something we signed up for
when we signed up to be scientists. We are not updating the social contract;
what we're doing is returning to the scientific method, which has been around
for hundreds of years. (applause)
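
Here is a minimal sketch (my own illustration, not from the talk, with
hypothetical file names) of the kind of scripted, end-to-end pipeline that the
reproducible-research tools above automate: every step from raw data to
reported numbers is recorded in code, so anyone with the compendium (paper,
code, data) can regenerate the results.

```python
# Hypothetical sketch of a "research compendium" pipeline: one script that goes
# from raw data to the published numbers, so the computation can be re-run verbatim.
import csv
import json
import statistics
from pathlib import Path

RAW = Path("data/measurements.csv")   # assumed raw data file shipped with the paper
OUT = Path("results/summary.json")    # derived result regenerated on every run

def clean(rows):
    """Data filtering step: drop incomplete records (documented, not hidden)."""
    return [float(r["value"]) for r in rows if r.get("value") not in (None, "", "NA")]

def analyze(values):
    """Analysis step: the statistics reported in the paper."""
    return {"n": len(values),
            "mean": statistics.mean(values),
            "stdev": statistics.stdev(values)}

def main():
    with RAW.open() as f:
        rows = list(csv.DictReader(f))
    summary = analyze(clean(rows))
    OUT.parent.mkdir(exist_ok=True)
    OUT.write_text(json.dumps(summary, indent=2))
    print("regenerated", OUT, summary)

if __name__ == "__main__":
    main()
```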

Peter Murray-Rust. I am a chemist. I don't know about slides because I do not
do PowerPoint. If any of you have anything- here we are, right- you can type
murrayrust blog and you will see that. Click on various things as we go
through. My main method of presentation is flowerpoint. I am old enough to
have remembered the 60s; I was not at Berkeley, but it has made a huge
contribution to our culture. The Open Knowledge Foundation flower- I will
adopt this as a way of making my points. We have many different areas- maybe
50- that come under "open", that relate to knowledge in general. If we can
scroll down.. First of all, my petals are going to talk about various aspects
of openness. So I will cover those things there. If you can go down to the
second link, the open knowledge definition: this is the most important thing
here. A piece of knowledge is open if you are free to use, re-use and
re-distribute it, subject only to attribution and share-alike. That's a
wonderfully powerful algorithm. If you can do that, it's open. If not, it's
not open according to this definition (see the sketch after this section).
What the OKF has done- another picture- the Panton Principles. They're named
after a pub. It's 200 meters from the chemistry department where I work, and
between the pub and the chemistry lab is the Open Knowledge Foundation. Rufus
has been successful in getting people to work on this. A lot of this is about
government, public relations. How many people have written open source
software? What about open access papers? How many of them had a full CC-BY
license? If they weren't, they didn't work as open objects. CC-NC licenses
cause more problems than they solve. How many people have either published or
have people in their group who have published a digital thesis? Not many,
right? How many of those explicitly carry the CC-BY license? That's an area
where we have to work. Open Theses are a part of what we're trying to set up
in the Open Knowledge Foundation; if people made the semantics available-
LaTeX, Word, whatever they wrote it in- that would be enormously helpful. The
digital land grab in theses is starting and we have to stop it. There are many
things we can do. There are two projects, and these have been funded. Okay.
So, Open Bibliography and Open Citations. At the moment, we're being governed
by non-accountable proprietary organizations who measure our scholarly worth
by citations and metrics that they invent because they are easy to manage, and
who retain control of our scholarship. We can reclaim that within a year or
two, and gather all of our citation data and bibliographic data, and then, if
we want to do metrics- I am not a fan- we should be doing them, and not some
unaccountable body. Anyone can get involved in Open Bibliography and Open
Citations. The next petal is open data, and this one is not so
straightforward. The work of Jordan Hatcher and John Wilbanks from Science
Commons has shown that open data is complex. I think it's going to take 10
years. This is the group involved in the Panton Principles- I can't point to
them. Jenny Molloy- Jenny is a student. The power of our students:
undergraduates are not held back by fear and conventions. She has done a
fantastic job in the Open Knowledge Foundation. Jordan, then Rufus, John
Wilbanks, Cameron, and me; anyway, we came up with the Panton Principles. So
if you go back a slide, you will see the Panton Principles, and let's just
deal with the first one: data related to public science should be explicitly
placed in the public domain. There are four principles to use when you publish
data. What came out of all of this work is that one should use a license that
explicitly puts your data in the public domain- CC0, or the PDDL from the Open
Knowledge Foundation. So, the motto that I have brought to this, which I've
been using and which has been taken up by.. our general library in the UK, is
on the reverse of the flower: reclaim our scholarship. That's a very simple
idea, one that's possible if a large enough number of people in the world look
to reclaiming scholarship; we can do it. There are many more difficult things
that have been done by concerted activists. We can bring our scholarship back
to where we control it, and not others. I would like to thank the people on
these projects, Open Citations and Open Bibliography, and our funders and
collaborators: JISC, who funds it, BioMed Central, who also sponsors this,
Open .. Public Library of Science. (applause)
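
To make the "algorithm" reading of the open knowledge definition concrete,
here is a tiny, hypothetical sketch (my own, not an OKF tool): given a
description of what a license permits and what conditions it imposes, it
applies the test quoted above- free to use, re-use, and re-distribute, with at
most attribution and share-alike conditions.

```python
# Hypothetical checker for the open knowledge definition quoted above.
# The license descriptions below are illustrative, not an official OKF registry.

ALLOWED_CONDITIONS = {"attribution", "share-alike"}

def is_open(license_info):
    """A work is 'open' if use, re-use and re-distribution are all permitted
    and any conditions are limited to attribution and/or share-alike."""
    permits_all = all(license_info.get(k, False) for k in ("use", "reuse", "redistribute"))
    conditions_ok = set(license_info.get("conditions", [])) <= ALLOWED_CONDITIONS
    return permits_all and conditions_ok

cc_by = {"use": True, "reuse": True, "redistribute": True,
         "conditions": ["attribution"]}
cc_by_nc = {"use": True, "reuse": True, "redistribute": True,
            "conditions": ["attribution", "non-commercial"]}

print(is_open(cc_by))     # True: attribution-only conditions satisfy the definition
print(is_open(cc_by_nc))  # False: the non-commercial condition is not permitted by it
```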

BioTorrents: a file sharing service for scientific data. Morgan Langille.
Here's a way to share your data right now. First, I'd like to acknowledge the
Moore Foundation and my supervisor, who let me take this tangent. You can send
me comments via Twitter. I think we all agree that data is growing. We're
drowning in data- I hate that term. I'm going to throw some terms at you. If
we want to continue to share open data, and more openly, it should be simple
and so on. There are three sort of personal challenges I hit on a day-to-day
basis; this is why I built BioTorrents. I want download speed and reliability:
I just want to grab some data, and it's annoying that it takes 3 days before I
can get it. I want to share all the data associated with a study. The easiest
way was to package it up and share it somewhere. With BioTorrents you can do
that. It's not super elegant, but at least it gets out there. With traditional
file transfer, you connect to one main server, and the other computers
basically download that data; the white bar indicates how much of that file
has been transferred, and another computer doesn't get the data because of the
bandwidth. Unfortunately the data has to travel across entire continents, and
between the two institutions your actual download speed is very limited,
sometimes for whatever reason. If the site goes down, for planned maintenance
or by accident, that data doesn't get out there, which is not good if the data
is time-sensitive. You also want to check that the data is the original copy.
Using the traditional method you have to .. it's not in the protocol by
default. There has to be a better method. Today, I can download movies- movies
that are legal- much faster than data that is open. In a p2p file transfer
method like BitTorrent, the data set is broken up into small pieces, and each
computer has some of the pieces. You still have that sole provider, but then
the other users, as long as they have different pieces, can trade them, so
bandwidth grows as users increase. The other computers might be geographically
dispersed, so one might be nearby. As long as there's at least one full copy
among the peers, everyone can get the data. There's also a SHA-1 cryptographic
hash so that the data can be guaranteed to be the original (see the sketch
after this section). It's really well tested.. at least 25% of all internet
traffic is BitTorrent. We can use it to share movies, but also data. So how
easy is it to use? You install a BitTorrent client and you download a
.torrent. Basically what happens is a user downloads a .torrent file, and
there's a tracker/server, like biotorrents.org, and the data is not hosted on
the server. What is hosted on biotorrents.org is just the .torrent file, not
the giant data set: it's metadata. And then it's communicated to other
computers. Behind the scenes, the software is connecting to the tracker,
getting IP addresses, and the peers start communicating with each other and
sharing data. There are a few other BitTorrent features. A lot of people talk
about unique IDs: whenever you create a dataset, there's a hash of the whole
data set, so you're guaranteed that another person sitting beside you is using
the same dataset. There's also a distributed hash table in the client
software, so peers can find each other without connecting to a tracker, even
if it is down, and if data is posted to different trackers, the clients can
find each other through those other trackers. And there's local peer
discovery, so if lots of people at one site download a data set, they can find
someone nearby and transfer the data over the local network. I found out about
this by accident, and I started testing it and it was blazing fast, which was
nice. If there is data hosted by traditional methods like FTP, those sources
can be added and they just act as an existing or extra seed. You can upload
your favorite genome to any one of the existing trackers. Lots of these exist
already, but a lot of them have illegal copyright file-sharing issues. There
are a few other trackers, but not very many. And on top of that, even if you
did upload it there, it would be hard to find, because the community there
isn't oriented to science. So, that's why I made BioTorrents. Of course, all
data must be open. No illegal file hosting. The biological domain. Of course,
as I mentioned, the data is not hosted on BioTorrents, but I'm mirroring the
data on a separate server; in the long term it's up to the users to provide
the seeding of that data. You can search and browse by particular scientific
categories, and also by license and username. You have to set some kind of
license when you upload; there's a large list there. There's an "Other"
category, but people usually pick one of those licenses. Anyone can download
the data without a username, but if you want to interact with the site, you
create your own username. Hopefully people will get a reputation for sharing
good data down the road. A few cool things about this: there's an RSS feed,
where you can automatically download data sets. There are also versions of a
data set- if the data just expands, or whatever, you can publish that through
versioned RSS feeds, where you basically subscribe to the versions of a
certain data set and from then on you get all new versions of it, so it also
handles updates. Lastly, there's an upload script. So far, there are about
1,000 users, and it's pretty limited on the number of data sets- whatever
you've been sitting on for years. And do we really need it? Here's an example
with GenBank. By FTP it took 6 hours. Right now the only way is to get it from
NCBI.. and I can only get 0.5 MB/sec, and that means 5 days. So that sucks. So
who uses BioTorrents? Existing large data providers, scientists sharing and
publishing data, scientists sharing unpublished data. There are issues with
any sort of technology. Metalink. Volunteer computing. That's it. My final
message is that data transfer should be fast and easy. Embrace technologies
such as BitTorrent. Hopefully.. thanks.
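
As a concrete illustration of the integrity check described above (my own
sketch, not part of BioTorrents itself; the file name and piece size are
made up), this is roughly how BitTorrent-style piece hashing works: the file
is split into fixed-size pieces, each piece gets a SHA-1 hash recorded in the
.torrent metadata, and a downloader recomputes the hashes to confirm it
received the original data.

```python
# Minimal sketch of BitTorrent-style piece hashing with Python's standard library.
# The file name and piece size are illustrative, not BioTorrents defaults.
import hashlib
from pathlib import Path

PIECE_SIZE = 256 * 1024  # 256 KiB pieces; real torrents pick a per-file piece size

def piece_hashes(path, piece_size=PIECE_SIZE):
    """Split a file into fixed-size pieces and return the SHA-1 hash of each piece."""
    hashes = []
    with open(path, "rb") as f:
        while True:
            piece = f.read(piece_size)
            if not piece:
                break
            hashes.append(hashlib.sha1(piece).hexdigest())
    return hashes

def verify(path, expected_hashes, piece_size=PIECE_SIZE):
    """Re-hash a downloaded copy and compare against the published piece hashes."""
    return piece_hashes(path, piece_size) == expected_hashes

# Demo with a small stand-in data file (a real torrent would cover gigabytes).
data_file = Path("demo_dataset.bin")
data_file.write_bytes(b"ACGT" * 500_000)

published = piece_hashes(data_file)  # the publisher records these in the .torrent metadata
print("pieces:", len(published))
print("data intact:", verify(data_file, published))  # the downloader confirms the copy
```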

It's time we changed how research is done

We are going to change the world of reference management. This is a bold
statement- a ridiculous one, I would have said to you a couple of years ago.
I'll explain what we are doing and what we are seeing. To explain the why, we
are going to enlist the help of Tim Berners-Lee. What Tim said, to paraphrase,
is that we have all this information on cancer, stem cells, diseases, and it's
all siloed away on different computers. So Tim has issued a challenge to
unlock this data. It's not just about technology. There's this huge social
norm; there's the behavior of people. Open data, open science- the lack of it
is actually obstructing human progress in the world. The U.S. National Academy
of Engineering issued a few grand challenges, and one of these was the tools
of scientific discovery. How can we address this challenge? With Mendeley, we
are trying to make science more transparent and open, and we're trying to
build the world's largest academic database. In these next slides, we are
keeping these things in mind. Helping researchers work: we extract text from
PDF files. There's a PDF.. you can annotate, and cite with Microsoft Word or
OpenOffice, and then what we do is take that research data and aggregate it
into the cloud. By doing this, we are helping researchers collaborate, and
we're making that data more transparent. So then this is a screenshot of what
you get when you sign up for Mendeley Web, and you start seeing what's going
on with people you're collaborating with on different projects. What separates
us from other reference managers? We find statistical trends, like the most
popular author or paper for the upcoming week. And if you're familiar with
Twitter or trending tags, we show some of that. So we take all of that data
that was siloed away, and we build a search catalog on top of it. The big
difference between our search catalog and something like PubMed: the 27 there,
that's the number of readers for that particular article; you can't get that
if you're just doing something on PubMed. So I clicked through to the landing
page, with the standard citation information, but then we also start digging
down into the demographics. Who are these readers? PhD students, professors;
where are they from, what discipline are they in? Because most of these papers
are multi-disciplinary. And then of course we show some related research, like
TIDEF, and also collaborative filtering: the research papers that you should
be reading but may have been missing. So we've been in public beta for 18
months, and we have 450,000 users. These are the top 20 universities so far.
In terms of the number of papers we're aggregating, we have 29M papers for
which metadata has been uploaded. For a sense of the size of this: the Thomson
Web of Knowledge database has 40M papers, and it took them 50 years to do
this. We might be able to match that amount in just 2 years. So one of the
things.. we created an open API so that others can access the same metadata
and statistics (see the sketch after this section), and there are some mashups
that developers are creating: chemical compounds, location-based mashups of
Alzheimer's research, SWAN data, grant search engines, twitter streams; we
have people building Google Wave mashups, some Microsoft Word mashups, Google
Docs mashups with these open APIs. And as far as the future goes at Mendeley,
getting back to what Tim Berners-Lee said, all that data is filed away on
individual computers. There's a vast amount of knowledge in our heads. How do
we re-use and repurpose scientific knowledge? How about semantic markup of
papers? Does this sentence support this paragraph or sentence from this other
paper? So we're creating a human-curated, high-throughput, crowdsourced system
for semantically linking PDF papers that would be impossible to link even if
they were machine-readable. And just to get back to this statement here: how
do we change the social behavior of scientists who are skeptical of sharing
their own publications? So one of the things we're experimenting with, and
haven't released yet, are reputation metrics. We might show the number of
downloads of your publications or the page views, as an incentive in your
reputation metric for scientists to upload their PDF files, and later on their
own data. So, to end, getting back to what Tim was saying: are we unlocking
the data that we have been siloing away for years? I hope we're doing that,
and I hope that our project will encourage others to do similar things, maybe
with BioTorrents.
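
Here is a short sketch of how a mashup might consume an open
metadata/statistics API like the one described above. The URL, parameters, and
response fields are placeholders for illustration only; they are not
Mendeley's actual endpoints.

```python
# Hypothetical sketch of a mashup consuming an open metadata/statistics API
# like the one described above. The host, parameters, and response fields are
# placeholders, not Mendeley's real API.
import requests

API_BASE = "https://api.example.org"   # stand-in for the real API host
API_KEY = "YOUR_KEY"                   # most open APIs still require a key

def top_papers(discipline, limit=5):
    """Fetch the most-read papers in a discipline as (title, readers) pairs."""
    resp = requests.get(
        f"{API_BASE}/stats/papers",
        params={"discipline": discipline, "limit": limit, "key": API_KEY},
        timeout=10,
    )
    resp.raise_for_status()
    return [(p["title"], p["readers"]) for p in resp.json()["papers"]]

if __name__ == "__main__":
    for title, readers in top_papers("biology"):
        print(f"{readers:>6} readers  {title}")
```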

Open, Candid Discussion of Science

Martha Bagnall

thirdreviewer.com

This conference is long overdue. Scientists have conversations about the
quality and evaluation of the published literature all the time. Every day you
walk into the lab and someone asks, have you seen this recent paper, what do
you think about it? Scientists are constantly using this information in the
published literature to design new experiments, build on the published
literature, or whatever. But what are the venues for these kinds of
communication?

ok, I need a break.. uploading.