Mark Kaganovich

My name is Mark. I am a grad student at Stanford doing genetics. I am also working on startups. So. Yeah. I am a grad student in genetics working on a startup in bioinformatics. Thanks a lot for letting me talk at you. I want to spend a few minutes talking about the startup and then another few minutes talking about the thought process and maybe tie it into why I think we're here.

The startup is called SolveBio. The theme is bioinformatics: helping people analyze data better and learn from their data, focusing on genomics. As Jeremy mentioned, we think this is an important opportunity. I'm going to move fast, by the way. Okay, uh. I think the main theme really is that bioinformatics could become a very important part of medicine, kind of eat medicine, maybe the most important part of research, of diagnosis, and of treatment. The main new thing going on is that we can measure a lot of stuff. In genetics, it's crazy that we can measure gene expression so inexpensively compared to a few years ago. People predict that as demand grows, more and more people will try to get into the space, there will be brutal competition, and it will be a terrible business to be in, but it's great to be a consumer. There are some cool new startups that are also making and measuring lots of other things; mass spec is an example of a technology that has yet to go through this exponential growth phase, but it could.

The result is that data becomes very important. The goal of the startup starts from recognizing that we can measure all of these biological features. We can sequence millions of genomes, and they have millions of variables and features. This could be one of the most interesting data problems we have ever had. If you like working on interesting problems, this could end up being one of those problems that is never fully solvable: there are millions of different variables.

SolveBio wants to automate this process of computing and finding patterns in biomedical data. People in research and R&D, and, if this stuff is successful, in the clinic, don't have access to the kind of computing and analysis tools that people in the web world do. There have been lots of advances there, like AWS, that everyone uses now; that kind of stuff has yet to happen in medicine and biology. We hope to change that.

One of the motivations is that in genomics the cost is rapidly shifting from the actual measurement of the data to the analysis. Last year the cost was about $20,000 to sequence a genome; maybe $100 of that was IT and then about $3,000 was drawing inferences. If the sequencing cost comes down to $100, the analysis becomes the dominant cost by far. So the IT cost should start becoming the main important one... we want a product that helps people handle this.
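To make the shift concrete, here is a tiny sketch using the rough figures from the talk (illustrative numbers only, not exact pricing): as the sequencing cost falls while analysis cost stays flat, analysis goes from a small fraction of the per-genome cost to nearly all of it.

```python
def analysis_share(sequencing_cost, analysis_cost):
    """Fraction of the total per-genome cost spent on analysis/IT."""
    return analysis_cost / (sequencing_cost + analysis_cost)

# Last year: ~$20,000 sequencing vs ~$3,000 analysis
before = analysis_share(20_000, 3_000)

# If sequencing drops to ~$100 while analysis stays flat
after = analysis_share(100, 3_000)

print(f"analysis share before: {before:.0%}")  # ~13%
print(f"analysis share after:  {after:.0%}")   # ~97%
```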

Specifically, the problems we are working on: there are millions of datasets out there, people need to join datasets to annotate their own data, and then there are all these programs. You have to re-set up everything, both hardware and software, download data, install programs, and people constantly redo this, both in labs and in companies. Nobody actually wants to build workflows; it's boring as shit. I just want to press a button and run it, but at the same time you have to trust what you're running. It has to be easy to run the same code on different computers and on different data. You have to do this well or else it will suck. That's what we're doing. We're just getting started, and I'd be happy to show it to people who are interested, have genomic data, or want to run these tools, like if you're measuring gene expression on lots of people with pipelines like Cufflinks or something.
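A minimal sketch of the "press a button" idea described above. The names here (`PipelineSpec`, `run_pipeline`) are invented for illustration and are not a real SolveBio API; the point is that pinning the tool, its version, the reference data, and the inputs in one declaration is what makes the same run reproducible on any machine.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PipelineSpec:
    tool: str          # e.g. "cufflinks"
    version: str       # pinned so reruns are comparable
    reference: str     # reference annotation the tool needs
    inputs: tuple      # sample files

def run_pipeline(spec: PipelineSpec) -> dict:
    # A real runner would fetch the tool and data, then execute the job.
    # Here we just echo the fully specified, reproducible job description.
    return {
        "command": f"{spec.tool}-{spec.version} --ref {spec.reference}",
        "inputs": list(spec.inputs),
    }

job = run_pipeline(PipelineSpec(
    tool="cufflinks",
    version="2.2.1",
    reference="hg19.gtf",
    inputs=("sample1.bam", "sample2.bam"),
))
print(job["command"])  # cufflinks-2.2.1 --ref hg19.gtf
```

Because the spec is frozen and complete, two people running it on different machines are at least describing the same computation, which is the trust piece the talk mentions.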

Those are some examples of computations that people find to be a pain in the ass. The general theme is that data collection is growing, and so the importance of using computers is growing. But I also wanted to talk about the philosophy behind this. This seems like a group of people who all really believe in things strongly, which seems different from other startup gatherings; there's a reason people are doing this stuff. It seems like it defines you more if you're doing a science startup rather than a geolocation social network. Not sure why that is. But on whether it's more than just a business opportunity: I wanted to talk about where we think the world is going and how we capture the value, and see if we can tie it into the changes in scientific communication and the e-commerce side. The shift in the amount of data in bioinformatics can play into that. That's important to consider, because how you see the world can define you. I like Jay-Z's take on being a businessman: he's not a businessman, he's a business, man. That really defines who he is. People define themselves by their own businesses; they have a core belief.

So one of the core beliefs I struggle with is: how complex is the world? That's very relevant for what we're trying to do, learning from data and biology. Kaggle might have some perspective on this as well. If the world is too complex, then we can't just figure it out and there's no value; it doesn't matter anyway. If it's really simple, then we're not adding value as programmers. In genetics, the example we keep talking about is how it's really difficult, for whatever reason: people were surprised that sequencing genomes didn't give all the answers about diseases. We still don't know what causes all diseases, so people in the community have been calling it the missing heritability; there are lots of complex traits whose genetic cause we can't explain.

There are other examples from Peter Thiel's class at Stanford; he gave an example of Siri. The world can be too complex: maybe we can't find the cause of genetic diseases, maybe we can't do AI well, and then the best thing we could do instead won't really make any sense. But if the world is complicated yet within reach, then my thesis is that bioinformatics starts mattering a lot, because everything becomes a molecular phenotype. You can measure DNA, do mass spec on small molecules, measure the environment, see the small molecules in your bloodstream, assay enzymatic activity in the blood. In that case there are lots of things that matter, and EMRs don't matter in that world. People are really into electronic medical records, but it's some x-rays and some blood tests, and 90% is shit your doctor wrote down. In a world where you measure your data, you don't need that stuff.

Maybe the government mandates around EMRs are holding us back, sure. iPhone lifestyle apps don't matter; it doesn't matter what you eat, just measure the small molecules in your bloodstream. I'm not saying the world is like this, I'm just saying: assume that it is. Then molecular biology starts mattering a lot. Technology to measure stuff matters a lot. And computers matter.

So then it's all about data. In that world, some of the things people have been talking about already start mattering even more. I think Science Exchange and Assay Depot matter even more in that world, because we can automate more stuff and data becomes more important. Transcriptic, I like them, matters a lot. Right now it's very difficult, almost impossible, to do research without the human component. If we can figure out more things to measure, we could figure out the molecular mechanisms of things. When data ultimately drives our conclusions, and we're not there yet, separating experiments from analysis makes more sense.

One of the interesting things is that I'm a fourth-year PhD student in genetics, in an awful year. When I joined Mike Snyder's lab at Stanford (he's my PI, well known for genomics and proteomics work; he was a pioneer of that in some ways), the lab had 30 to 50 people, and maybe one or two were computational at first. But now it's 50% computational. Almost everyone who gets hired now is computational. That world has shifted in the last three years. Computational labs are hiring experimentalists; experimental labs hire computational people.

There's this shift where you need both, and it will be interesting to see whether those things shift entirely to computational. I suspect they might never fully, because you still need people to design experiments. That's a striking phenomenon, and it's part of what inspired me to be on the informatics side. In a certain way, the science communication innovations and others are working toward a world where data is easier to measure, and where data drives scientific and medical inference. Papers start mattering less. When I do research, I read some papers, I download some data and information, and I parse it; it's usually behind paywalls, which some people are really passionate about. In a world where we can much more easily generate and compare data, the actual wording of the papers matters less and the actual datasets start to matter more. For most papers in genomics that's not yet the case; it's still "we observed some stuff, here are the conclusions, and trust us that we did this." And you really do trust them: it's in a journal, and they're from a good school. It would be interesting to see if that changes as this revolution happens. As a result, that might be a way we get to the open science framework that people think about. I think everyone here supports that idea.

We as taxpayers pay for research, so we should have access to it, but it's difficult to deal with. As the data starts mattering more, you can put up datasets unrelated to papers and give recipes for them. It's the classic disruptive model: the publishers are concerned with the actual words in papers, while researchers are actually concerned with the datasets, the matrices and files. I think that's an interesting thing.

We're specifically interested in how to distribute algorithms and datasets between different people. Can we make a difference there? If we can, the people who develop algorithms can share an environment with someone else, who can re-run them on new data; and if we're successful, that will spur innovation in algorithms and programs. Right now people tend to rewrite their stuff, and it seems like we could do much better. That's what we're into. I'm bringing this up because this is a great crowd to talk about all these things: our new ability to generate lots of datasets in biology, and how it relates to how science communication will evolve, how the distributed labor stuff evolves, and how big bio will draw inferences from biology.
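A toy sketch of the sharing idea, under my own assumptions (the registry, the decorator, and the algorithm name are all invented for illustration): if an algorithm is published under a stable name in a shared environment, someone else can re-run exactly the same code on their own data instead of rewriting it.

```python
# Shared registry mapping stable names to published analysis functions.
ALGORITHMS = {}

def register(name):
    """Decorator: publish an analysis function under a stable name."""
    def wrap(fn):
        ALGORITHMS[name] = fn
        return fn
    return wrap

@register("mean-expression")
def mean_expression(samples):
    # Toy "analysis": average an expression value across samples.
    return sum(samples) / len(samples)

# Someone else re-runs the published algorithm on their own new data:
result = ALGORITHMS["mean-expression"]([2.0, 4.0, 6.0])
print(result)  # 4.0
```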

With that, our goal is to make it easier to run code on different computers, and to make it easy for regular bioinformatics people who aren't systems programmers to access programs and millions of datasets without re-downloading and setup. I think that could be relevant to a lot of the ways people distribute information and do publishing, how people learn from data, and how people do experiments. We'd love to chat with people about that, so thanks.