working-group-roadmap-bioinformatics

High performance computing

Chris Dawn

High performance computing is important for bioinformatics and sequencing, but it didn't attract a lot of conversation for GP-write. Data standards imply heavy committee work. The idea we had is that sharing tends to be an afterthought: when people do large meta-analyses and try to aggregate knowledge, they end up with garbage in terms of integratable artifacts at the end of a research project.

With proper conversation in this open setting, we could do really neat work on that problem.

We also had a discussion on ontologies. As a newcomer to synthetic biology, there were a lot of words I didn't know. I am sure some of the words have related meanings, but it would be nice to have that defined and published, and it might be helpful to start defining a vocabulary, data types, and data structures. This would probably help the community move faster over the 10-year horizon.

Each of these initiatives, like the BRAIN Initiative, HapMap, Cancer Moonshot, HMP, ENCODE, and the Precision Medicine Initiative, had working groups, and we should build on their work.

NIST has a standard that is in RFC stage about identity and access management. These federated projects need a coherent way to deal with the identities of the participants and administrators who are requesting access to data sets, and a coherent way to deal with authentication as opposed to identity. NIST has this. There's a lot of cool tech, and I would like to see us start from a modern platform. I don't know what we will build with that; we would have to have a concrete idea before we build anything in particular. I suggest we build from strong standards.

We talked a lot about consent. There has been conversation about consent management and about measuring how informed people are when giving consent and receiving it. There was also conversation about revisiting consent without violating anonymity: to inform participants, or to let them opt out and get removed. These are technologically possible.
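
As one illustration of how opting out without naming the participant could work, here is a minimal sketch. It is not anything the working group has designed. It assumes each participant is handed a random token at enrollment and that the registry stores only a hash of that token, so no name is ever recorded. The names `ConsentRegistry`, `enroll`, `opt_out`, and `filter_consented` are hypothetical. The sketch only keeps the registry itself free of identities; it does nothing to make the genomic data anonymous.

```python
import hashlib
import secrets


class ConsentRegistry:
    """Hypothetical registry that tracks consent by pseudonym, not by name."""

    def __init__(self):
        # Maps pseudonym (hash of the participant's token) -> consent active?
        self._status = {}

    @staticmethod
    def pseudonym(token: str) -> str:
        # Only this hash is stored or attached to data records.
        return hashlib.sha256(token.encode("utf-8")).hexdigest()

    def enroll(self) -> str:
        # The random token is given to the participant and not kept here.
        token = secrets.token_urlsafe(32)
        self._status[self.pseudonym(token)] = True
        return token

    def opt_out(self, token: str) -> bool:
        # Presenting the token proves the request comes from the enrollee.
        key = self.pseudonym(token)
        if key not in self._status:
            return False
        self._status[key] = False
        return True

    def is_active(self, pseudonym: str) -> bool:
        return self._status.get(pseudonym, False)


def filter_consented(records, registry):
    # Keep only records whose pseudonym still has active consent.
    return [r for r in records if registry.is_active(r["pseudonym"])]
```

A participant who later presents the token to `opt_out` is dropped by `filter_consented` the next time a data set is assembled, without the registry ever learning who they are.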

Importance of privacy. De-identification of genomes may not be possible.

Data platforms for research have properties like being flexible, collaborative, and searchable. Are the files sensibly named in a consistent way? If you choose to store data in some place, is there a reason for that? Or is it just what happened to be available right there? If we duplicate data, if there's a duplicate set, is there a reason? Do we know when to clean it up? We should focus mostly on data hygiene for the first 6 months or year here. We will have conversations in the working group about what the formats are and what the technological readiness is, and about changing the name away from HPC and Bioinformatics, because I think we would have gotten a lot more people if we had just changed the name to something like "Data".
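
The hygiene questions above (are files named consistently, and are there duplicates nobody knows about?) can be checked mechanically. The following is a minimal sketch, assuming a made-up naming convention of `<project>_<sample>_<YYYYMMDD>.<ext>`. The pattern and the `audit` function are illustrative assumptions, not a format the working group has agreed on.

```python
import hashlib
import re
from collections import defaultdict
from pathlib import Path

# Hypothetical convention: <project>_<sample>_<YYYYMMDD>.<ext>
NAME_PATTERN = re.compile(r"^[a-z0-9]+_[a-z0-9]+_\d{8}\.[a-z0-9.]+$")


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large sequencing files fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit(root: Path):
    """Return (files breaking the naming convention, groups of identical files)."""
    by_hash = defaultdict(list)
    badly_named = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        if not NAME_PATTERN.match(path.name):
            badly_named.append(path)
        by_hash[sha256_of(path)].append(path)
    # Any content hash seen more than once marks a set of duplicates.
    duplicates = [paths for paths in by_hash.values() if len(paths) > 1]
    return badly_named, duplicates
```

Calling `audit(Path("/some/data/dir"))` returns the two lists for review. The script only reports; deciding whether a duplicate exists for a reason, and when to clean it up, stays with the people who own the data.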

All of us think that data technology in particular can be a driver of innovation. I would be very open to collaborative efforts in this space that connect the technology directly with using the data, so that, frankly, we get to play a part in this exciting project.