path: root/transcripts/startup-science-2012/mark-kaganovich-solvebio.mdwn
Mark Kaganovich http://solvebio.com/

My name is Mark. I am a grad student at Stanford doing genetics. I am also
working on startups. So. Yeah. I am a grad student in genetics working on a
startup in bioinformatics. Thanks a lot for letting me talk at you. I want to
spend a few minutes talking about the startup and then another few minutes
talking about the thought process and maybe tie it into why I think we're here.

The startup is called SolveBio. The theme is bioinformatics: helping people
analyze their data better and learn from it, focusing on genomics. As Jeremy
mentioned, we think this is an important opportunity. I am going to move fast,
by the way. Okay, uh. I think the main theme really is that bioinformatics
could become a very important part of medicine, could kind of eat medicine,
the most important part of research, of diagnosis and treatment. The main new
thing that is going on is that we can measure a lot of stuff. In genetics,
it's crazy how inexpensively we can measure gene expression compared to a few
years ago. People predict that as demand grows, more and more people will try
to get into the space and there will be brutal competition, so it's a terrible
business to be in, but it's great to be a consumer. There are some cool new
startups that are also making and measuring lots of other things; mass spec is
an example of a technology that has yet to go through this exponential growth
phase, but it could.

The result is that data becomes very important. The goal of the startup starts
from recognizing that we can measure all of these biological features. We can
sequence millions of genomes, and they have millions of variables and features.
This could be one of the most interesting data problems we have yet had. A lot
of the time you want to work on interesting problems, and this could end up
being one of those problems that is never fully solved. There are millions of
different variables.

SolveBio wants to automate this process of computing and finding patterns in
biomedical data. People in research and R&D, and, if this stuff is successful,
in the clinic, don't have access to the kind of computing and analysis tools
that people in the world of the web do. There have been lots of advances
there, like AWS, that everyone uses now; that kind of stuff has yet to happen
in medicine and biology. We hope to change that.

One of the motivations is that in genomics the cost is rapidly shifting from
the actual measurement of the data to the analysis. The cost to sequence a
genome last year was about $20,000, of which maybe $100 was IT and something
like $3,000 was drawing inferences. If the sequencing cost comes down to $100,
the analysis cost becomes a much greater share of the total. So the IT cost
should start becoming the main important one... we want a product that helps
people handle this.
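As a rough illustration of that cost shift, here is the arithmetic with the
ballpark figures from the talk (these are approximate numbers, not exact
prices):

```python
# Rough illustration of the cost shift, using the ballpark numbers
# from the talk: as sequencing gets cheap, analysis becomes the
# dominant share of the per-genome cost.

def analysis_share(sequencing, it, inference):
    """Fraction of total per-genome cost spent on IT and inference."""
    total = sequencing + it + inference
    return (it + inference) / total

# Last year: ~$20,000 sequencing, ~$100 IT, ~$3,000 inference.
before = analysis_share(20_000, 100, 3_000)

# If sequencing drops to ~$100 and analysis costs stay put:
after = analysis_share(100, 100, 3_000)

print(f"analysis share: {before:.0%} -> {after:.0%}")  # 13% -> 97%
```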

Specifically, the problems we are working on: there are millions of datasets
out there that people want to join and use to annotate their own data, and
then there are all these programs. You have to set everything up from scratch,
both hardware and software, download data, install programs, and people
constantly redo this, both in labs and in companies. Nobody actually wants to
make workflows; it's boring as shit. I just want to press a button and run it,
but at the same time you have to trust what you're running. It has to be easy
to run the same code on different computers and on different data. You have to
do this well or else it will suck. That's what we're doing. We're just getting
started, and I'd be happy to talk to people who are interested, have genomic
data, or are interested in running these tools, like if you're measuring gene
expression on lots of people with pipelines like Cufflinks or something.
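The "press a button" idea could be sketched roughly like this. Everything here
(`run_pipeline`, the placeholder step functions) is hypothetical, not a real
SolveBio API; in practice the steps would be tools like an aligner or
Cufflinks:

```python
# A toy sketch of "declare the pipeline once, run the same code on any
# dataset without re-setup". The step functions below are stand-ins
# for real tools, and run_pipeline is a hypothetical runner.

def align(reads):
    """Stand-in for an alignment step."""
    return [f"aligned:{r}" for r in reads]

def quantify(alignments):
    """Stand-in for an expression-quantification step."""
    return {a: 1.0 for a in alignments}

PIPELINE = [align, quantify]

def run_pipeline(steps, data):
    """Feed each step's output into the next step."""
    for step in steps:
        data = step(data)
    return data

# Same pipeline definition, different input data, no re-setup:
result = run_pipeline(PIPELINE, ["sample1", "sample2"])
```

The point of the sketch is the separation: the pipeline is a declaration, and
running it on new samples means changing only the input, never the setup.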

Those are some examples of computations that people find to be a pain in the
ass. The general theme is that data collection is growing, and so the
importance of using computers is growing. But I also wanted to talk about the
philosophy behind this. This seems like a list of people who all believe in
things strongly, which seems different from other startup gatherings; there's
a reason people are doing this stuff. It seems like it defines you more if
you're doing a science startup rather than a geolocation social network. I'm
not sure why that is, but it's more than just a business opportunity. So I
wanted to talk about where we think the world will be, how we capture the
value, and see if we can tie that into the changes in scientific communication
and the e-commerce side. The shift in the amount of data in bioinformatics can
play into that. That's important to consider because how you see the world can
define you. I like Jay-Z's take on being a businessman: he's not a
businessman, he's a business, man. That really defines who he is. People
define themselves by their own businesses; they have a core belief.

So one of the core beliefs I struggle with is: how complex is the world?
That's very relevant for what we're trying to do, which is learn from data in
biology. Kaggle might have some perspective on this as well. If the world is
too complex, then we can't just figure it out, and there's no value; it
doesn't matter anyway. If it's really simple, then we're not adding value as
programmers. In genetics, the example we keep talking about is how, for
whatever reason, people were surprised that sequencing genomes didn't give all
the answers about diseases. We don't yet know what causes all diseases, so
people in the community have been calling it the missing heritability: there
are lots of complex traits whose genetic cause we can't explain.

There are other examples from Peter Thiel's class at Stanford; he gave an
example of Siri... The world could be too complex: maybe we can't find the
cause of genetic diseases, maybe we can't do AI well, and the best thing we
could build instead won't really make any sense. But if the world is
complicated yet within reach, then my thesis is that bioinformatics starts
mattering a lot, because everything becomes a molecular phenotype. You can
measure DNA, do mass spec, measure small molecules, measure the environment,
see the small molecules in your bloodstream, assay enzymatic activity in the
blood. In that case there are lots of things that matter there, and EMRs don't
matter in that world. People are really into electronic medical records, but
it's like some x-rays and some blood tests, and 90% is shit your doctor wrote
down... in a world where you measure your data, you don't need that stuff.

Maybe the government mandates around EMRs are holding us back, sure... iPhone
lifestyle apps don't matter, it doesn't matter what you eat, just measure the
small molecules in your bloodstream. I'm not saying the world is like this;
I'm just saying, assume that it is. Then molecular biology starts mattering a
lot in that world. Technology to measure stuff matters a lot. And computers
matter.

So then it's all about data. In that world, some of the things people have
already been talking about start mattering even more. I think Science Exchange
and Assay Depot start mattering even more, because we can automate more stuff
and data becomes more important. Transcriptic, I like them, matters a lot,
because right now it's very difficult, almost impossible, to do research
without the human component in the experiments. If we can figure out more
stuff we can measure, we could figure out the molecular mechanisms of things.
When data starts ultimately driving our conclusions, which we're not there
yet, in that world I think separating experiments from analysis makes more
sense.

One of the interesting things is that, I'm a fourth-year PhD student in
genetics, in an awful year. When I joined Mike Snyder's lab at Stanford, he's
my PI, well known for genomics and proteomics work, he was in some ways a
pioneer of that, there were 30 or 50 people in the lab, and maybe one or two
computational people at first. But now it's 50% computational. Almost everyone
who gets hired now is computational. That world has shifted in the last 3
years. Computational labs are hiring experimentalists. Experimental labs hire
computational people.

There's this shift where you need both, and it will be interesting to see
whether things shift entirely to computational. I suspect that might never
fully happen; you still need people to design experiments. But it's a striking
phenomenon, and it's part of what inspired me to be on the informatics side.
In a certain way, the science communication innovations like academia.edu and
others work toward a world where data is easier to measure and where data
drives scientific and medical inference. Papers start mattering less. When I
do research, I read some papers, I download some data and information, I parse
it; it's usually behind paywalls, which some people are really passionate
about. In a world where we can much more easily generate data and compare it,
the actual wording of the papers matters less and the actual datasets start to
matter more. For most papers in genomics that's not yet the case: it's still
"we observed some stuff, here are the conclusions, and trust us that we did
this". And you really do trust them, because it's in a journal and they're
from a good school. It would be interesting to see whether that changes as
this revolution happens. As a result, that might be a way we get to the open
science framework that people think about. I think everyone here supports that
idea.

We as taxpayers pay for research, so we should have access to it, but it's
difficult to deal with. As the data starts mattering more, you can put up
datasets unrelated to papers and give recipes. It's the classic disruptive
model: the publishers are mostly concerned with the actual words in papers,
while researchers are actually concerned with the datasets, matrices, and
files. So I think that's an interesting thing to me.

We're specifically interested in how to distribute algorithms between
different people and how to distribute datasets. Can we make a difference
there? If we can, the people who develop algorithms can have a shared
environment with someone else, who can re-run them on new data, and if we're
successful, that will spur innovation in algorithms and programs. Right now
people tend to rewrite their stuff, and it seems like we could do much better.
That's what we're into. I'm bringing this up because this is a great crowd to
talk with about all these things: our new ability to generate lots of datasets
in biology, how that relates to how science communication will evolve, how
the distributed labor stuff evolves, and how biology will draw inferences from
all that data.
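A minimal sketch of that sharing idea, with an in-memory dict standing in for
any real shared service (the `publish`/`rerun` names are hypothetical, not an
actual API):

```python
# Sketch: publish an analysis once under a stable name, and anyone can
# re-run it on new data instead of rewriting it from scratch.

REGISTRY = {}

def publish(name, fn):
    """Register a reusable analysis under a stable name."""
    REGISTRY[name] = fn

def rerun(name, dataset):
    """Re-run a published analysis on someone else's data."""
    return REGISTRY[name](dataset)

# The original author publishes their method once...
publish("mean-expression", lambda values: sum(values) / len(values))

# ...and another group re-runs it unchanged on their own data.
result = rerun("mean-expression", [2.0, 4.0, 6.0])
print(result)  # 4.0
```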

With that, our goal is to make it easier to run code on different computers,
to make it easy for regular bioinformatics people who aren't systems
programmers to access programs and millions of datasets without re-downloading
and redoing setup. I think that could be relevant to a lot of the ways people
distribute information, do publishing, learn from data, and do experiments, so
we'd love to chat with people about that, so thanks.