2018-02-21

IARPA MIST Proposers' Day

Molecular Information Storage

https://www.iarpa.gov/index.php/research-programs/mist

video: https://www.youtube.com/watch?v=OnencB-iIFA

slides: https://www.iarpa.gov/images/files/programs/mist/MIST_proposers_day_briefing.pdf

https://twitter.com/kanzure/status/987126121237016576

Welcome, logistics, proposers' days, goals

David Markowitz

Good morning. Let's begin. I am David Markowitz, the program manager for the molecular information storage program (MIST). I want to thank you all for coming. Many of you have come from far flung places. You have busy schedules, thank you for participating in this today. This is gratifying for me because I have been working with some of you on community building around these efforts. It's exciting to see the representation in this room today. We have a mix of large industry players from semiconductors and storage device industries, representatives from Intel, Micron, Seagate, Western Digital, 12 startup companies, we have a lot of academic and industrial research laboratories and even some venture capital staff floating around. Given that one important goal of this event is to facilitate teaming activities and help people to build the teams that they need to contribute to this program, I think we have just the perfect mix here today. Thank you all very much for your participation.

For the next couple of minutes, I will be introducing some of the logistics of the events. First of all, the bathrooms are just out the door to the left and then make a hard right, there's stairs and a coffee shop in the lobby. Lunch is at 12. The government is not providing you with lunch, but in your packets there's a list of a number of local restaurants where you can go and get lunch.

A couple of important disclaimers at the outset. This presentation is only for information and planning purposes. It is not a formal solicitation for proposals or abstracts. We will be releasing a draft technical portion of the program solicitation, hopefully by tomorrow. I'll endeavor to send out a notification to everyone who has registered for this event. You'll be able to see what the program design looks like, and you should provide feedback regarding anything you think should be changed. There will be a formal mechanism for providing feedback. Nothing said here today will change any requirements in a BAA, and the BAA will supersede anything IARPA says.

We are interested in what the program design will look like. This event, and the feedback you provide both here and on the draft BAA, is your opportunity to influence how the program is designed and executed. Please seize this opportunity.

The goal of this event is to familiarize you guys with IARPA's interest in molecular information storage. Ask questions and provide feedback. You should all have note cards in your packets. Please write down your questions. Bernado will be collecting your question cards. Please provide the notecards to him. In the late morning session at 11:20, we'll do a fireside chat where I answer your questions to the best of my ability.

The other key goal of this event is to foster discussion on complementary capabilities of potential participants, also known as teaming. Many of you have superlative capabilities in one or more areas that are vital to this program. Maybe you have device development experience, but not synthesis. Please endeavor to meet the people in this room with those capabilities. As we will see in the government's presentation of the technical areas, each area will require multi-disciplinary expertise. Seize the opportunity. You all made the trip to come here, do your best to meet everyone.

We're going to have opportunities for you to be aware of other people's capabilities, in the afternoon session. You should record your questions on the notecards. Once the BAA is released, there is a formal way we have to go about answering your questions. You submit them in writing to the program email address, we answer them in tranches, and responses are posted on the program website. There is sometimes a delay of a few days or a week in doing this. Right now is the opportunity to get rapid feedback on your questions.

This is the agenda for today. We're talking about goals for proposers' days and logistics and events. Our chief scientist, Will Vanderlinde, will be giving an overview of IARPA. Then I will speak for 45 minutes, where I will tell you the technical details, what the goals of the program are, why we're doing this, and what the current program design looks like. Everything that is in this program overview is going to be in the draft BAA that should come out tomorrow. Don't feel that you need to feverishly write down everything that you see in this talk, because it will be released publicly. The slides will be public as well.

We'll be back at 11am, and someone from our acquisition team will be talking about doing business with IARPA. There are 12 startup companies in the room today, and I doubt many of them have had a contractual relationship with IARPA before so we will clarify what these relationships look like and how they are managed and what the expectations are for reporting. We will have a Q&A session, which could run over.

In the afternoon, starting at 1:30pm, we will have offerors' capabilities briefings. The government people have to get lost after the morning sessions. I am not allowed to be in the room when you guys are talking about your capabilities and what you're looking for. This is your opportunity to network with each other. After 2:30pm, we have the room next door, to your right when you leave, where there are poster boards set up for those of you who have volunteered to present posters. This is set aside for general networking and teaming activities, which hopefully will be valuable for all of you.

IARPA overview

William Vanderlinde

Thank you, David. We would like to give an overview of IARPA at this proposers' day because sometimes people come in and ask what is IARPA. Our name sounds like DARPA. We were created in the image of DARPA but with a different target audience. IARPA supports the 16 agencies that comprise the intelligence community.

We work on a diverse set of technologies. What we do is high-risk high-reward research. Or to be precise, you do it. We just give you the money to do it. We are a funding agency, we do no research in-house. The problems are complex and multidisciplinary. We solve hard problems that our mission partners have. We try to boil a problem down into technical issues; once we answer those technical issues, we have knocked down the risk and can transfer the result to our partners to develop into a system.

We do a full and open competition (the BAA process and the proposers' day today). We run what we call a program-manager-centric program. The program manager has full authority on whether to do a program and what it looks like. David Markowitz is the man. He has full authority on this program. Give him questions, talk to him.

We have a heavy emphasis on metrics and measurement. It's been very hard to measure the progress of research, but we think it's doable. We have a great deal of support to the program. We have a test and evaluation partner concept. They will be grading you as you go along on your results. We often spend 25-30% of the program budget on test and evaluation.

We bring our partners in at the start and we don't do a program unless we see a way to do transition. These are programs that last 3 to 5 years. We encourage you to publish your data to the literature. Most of what we do is unclassified.

For our convenience, we devised 4 research thrust areas. Analysis, anticipatory intelligence, collection, ... First, I'll talk about analysis.

Big data. We're drowning in data, and making sense of it is difficult. The majority of analysis programs involve machine learning. We have efforts in image analysis, video analysis, and language, but we are also looking at all sorts of other large data sets.

We do "anticipatory intelligence" - this is everything from S&T intelligence, where we're assessing technologies or looking at publications and patents, which are indicators. We're looking at indicators and warnings of things that are of interest in the intelligence field, giving us early warning of crises, disease outbreaks, insider threats, cyber attacks. We also do geopolitical forecasting; we develop prediction markets and use machine learning for these types of forecasting.

Collection covers a wide range of things. It's everything from geolocation, frequency signals, and unconventional polygraphs; we're very interested in synthetic biology and CRISPR-Cas9. We do a lot with biometrics, chemical and explosives detection, anything of interest to our partners. There's often a niche for us to push the technology forward.

Finally, our focus today is in the computing area. We're looking at areas where we can get revolutionary improvements in computing power. What these agencies need is-- a --- quantum computing, we also support, ... neuromorphic computing, which is where David has his expertise, in biologically-inspired components. We put a lot of effort into supply chain assurance for microelectronics, and we also have cybersecurity efforts.

There's a variety of ways to engage with IARPA. We often do a request for information or workshops; we did a molecular information workshop in cooperation with SRC in 2016 with Victor Zhirnov. If you see us doing RFIs and workshops in an area, it's a good predictor that we're going to form a program. We do seedlings- these are small projects, typically 1 year or less in length, which are intended to explore a high-risk area and which might lead to a unique project eventually. It's taking the idea from disbelief to doubt. We do have a BAA open all the time, a rolling submission process. You can send a proposal, and we encourage you to send an abstract first, and then we call up the program managers and see if there's interest. That's always open.

We also do prize challenges. We sometimes get better results from a $50k prize challenge than from spending millions of dollars on our partners. It requires a pretty significant amount of effort on our part to launch challenges, not just money. It lets us leverage people who would never ever sign a FAR-based contract with the US government. We recently completed a prize challenge called... it had to do with analyzing chemical spectra of hazardous substances, with a lot of machine learning involved. The top 5 finishers were all European individuals, grad students not faculty members, in computer science and math, who were specialists in machine learning and knew nothing about chemistry. It gives you some idea of the power of machine learning. All 5 of them were grad students, 3 were from Russia, people who wouldn't be submitting a BAA proposal.

We have full research programs that last 3-5 years, which is what we're here for today. I am happy to take questions. Otherwise, we can just move on and get to the good stuff.

MIST program overview

David Markowitz

This is the meat and potatoes of the technical program. This is what we're all here for.

I am David Markowitz. I am the program manager for the molecular information storage program, or MIST. This will be the technical portion of the government presentation, where I go through the background, the motivation, and the design of the MIST program. I should note at the outset that we're videotaping this briefing and it will be posted on IARPA's youtube page after the fact. Only the government portions of today's event will be videotaped. If anyone wants to go back and see, then it will be posted there. The slides will be posted online.

A couple of key things at the outset. This is going to be a multi-year R&D program. It's slated for 48 months, or 4 years. The program seeks to develop deployable storage technologies that can eventually (not within the timeline of this program) offer a clear path for scalability into the exabyte regime and beyond, with reduced physical footprint, power and cost requirements relative to conventional storage technologies.

We seek to do this using sequence-controlled polymers as a data storage medium, and by building the necessary devices and information systems to interface with this medium.

We want a clear and plausible path to commercialization. We want more than just prototypes. That's why it's great to have good industry participation in the event today.

We're specifically seeking tech to optimize the writing and reading of information to and from polymer media at scale, and to support random access retrieval from archives at scale.

Why are we doing this? The first third of my talk is going to be setting up why this is necessary and why this approach is appropriate. The scale of the world's big data problems is increasing rapidly. Use cases that require access to unstructured data are common in the private-sector internet space and are of increasing relevance in the government space. Exabyte-scale storage demands extraordinary financial resources, including warehouses that draw 100s of megawatts and cost billions of dollars to build, operate, and maintain.

The canonical example of this is this cold storage data center that was stood up late 2016 by a major internet company in Fort Worth, Texas. This is for archival storage and retrieval only, not for analytics. Based on media reports, this has about 1 million sq ft of space. They had to build a 200 megawatt wind farm to support it. It indefinitely requires 100 megawatts of power. This is using repurposed blu ray media, and requires continual replacement every 5 years. Over 10 years, the total cost of ownership is over a billion dollars.

How do we scale to beyond an exabyte in the future? We're faced with exponential data growth. Large data consumers are going to face a choice regarding investing in more data storage, or discarding more data. It's an impossible choice. It's one exponential investment or an exponential discard.

All conventional storage paradigms, whether optical, solid state, etc., write to a planar medium. Once the areal storage density has been maximized in 2d, a planar medium offers limited capabilities for 3d storage. There are scalability limitations in optical, which I will show in a moment. As a result of this limitation, if you want to build a data center with exponentially larger capacity using planar storage, it requires exponentially more read-write hardware, with the cooling, power and cost requirements that come with it.

If we re-imagine large-scale storage from first principles to address this, what would be on our wish list for next generation storage? Any next generation storage medium should have orders of magnitude higher volumetric data storage density than conventional paradigms. This should enable the development of tech that scales with smaller footprints and low power read-write hardware. Also on our wishlist, we want the next generation storage medium to have long-term stability against progressive data degradation, which today requires regular media checks and causes maintenance costs in data centers. For anything that we invest in today, there should already be basic methods in existence for reading from and writing to the storage medium. The engineering optimizations to support real-world commercial deployment in a 10 year horizon should be clear and plausible. This item on the wishlist throws out many potentially promising candidates just because the read-write technologies have yet to be demonstrated or reproduced.

The opportunity space that we're attacking through the MIST program is to use sequence-controlled polymers as the next-generation data storage medium.

Polymers are molecular scale sequences of physical bits. DNA is biology's own long-term data storage medium. I'm showing here the volumetric information density of a variety of different storage technologies in relation to DNA. It's been shown that DNA has a stable lifetime of minimally hundreds of years in less-than-ideal storage circumstances, and an information storage density orders-of-magnitude better than conventional storage. But we're not limiting ourselves to DNA as a storage medium in this program. Synthetic polymers offer an attractive path to the development of novel storage devices.

To give some insight on why the volumetric information density of polymers can be so much higher than traditional tech, look at the bit density in NAND flash. It's 19 nm bit features in 64 stacked layers. On the future roadmap is 100s of layers. If you look at the bit feature sizes in DNA, in relation to flash, the nucleotides in DNA are about 1 nm, and the spacing between them is subnanometer, and that compares to 19 nm in NAND flash nanofabricated out of silicon. DNA is compressible within a volume as well. This is why these numbers for maximum volumetric information density for DNA are so much higher.

Here's an example that was provided by Louis from U of Washington, who has done foundational work on operating systems for molecular storage media. This illustrates the process of going from a sequence of bits to a sequence of bases that you want to encode as DNA. You can then use synthesis technologies to physically instantiate that sequence as an oligonucleotide. You can store it, preserve it, and do bulk sequencing using conventional life science tech; there are opportunities for random access through hybridization reactions. You read back the sequence, you use a decoding algorithm, and then you get your original binary sequence. This is an example of how to use DNA in a physical MIST system.
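To make the bits-to-bases-to-bits round trip concrete, here is a minimal sketch in Python. The fixed two-bits-per-base mapping and the function names are assumptions chosen for illustration; this is not the encoding shown on the slide. Practical schemes typically add constraints such as avoiding long runs of the same base, plus addressing and error correction.

```python
# Illustrative 2-bits-per-base mapping (an assumption, not the slide's scheme).
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode(data: bytes) -> str:
    """Turn bytes into a string of bases, two bits per base."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode(strand: str) -> bytes:
    """Invert encode(): bases back to bits, then bits back to bytes."""
    bits = "".join(BASE_TO_BITS[base] for base in strand)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


strand = encode(b"MIST")   # the sequence one would hand to synthesis
restored = decode(strand)  # what decoding recovers after sequencing
print(strand, restored)
```

In this toy version the synthesis, storage, and sequencing steps are skipped entirely; the string simply stands in for the physical oligonucleotide.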

There has been work from synthetic chemists to digitally encode sequences in commodity polymers, with readout by methods such as mass spec, like Lutz's work.

There has been foundational work on developing encoding schemes and random access schemes using DNA as the substrate. Here are some illustrations of how others have employed Huffman codes, XOR encoding, addressing schemes, and various error correction schemes. There's a lot of background work that has already been done thinking through the logistics of how you can write information to DNA in a way that is findable and decodable in an error-free manner. This is the type of foundation on which we are going to build through technical area 3 in our program.
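As a rough illustration of the XOR idea mentioned above, the sketch below stores two payloads plus their XOR, so that any one of the three can be lost and rebuilt from the other two. The payload values and the function name are hypothetical, and this is a generic parity construction rather than the specific scheme from any of the cited papers.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length payloads."""
    if len(a) != len(b):
        raise ValueError("payloads must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


# Two data payloads and one redundant payload written alongside them.
payload_a = b"hello wo"
payload_b = b"rld!!!!!"
parity = xor_bytes(payload_a, payload_b)

# Suppose payload_a is lost or unreadable; rebuild it from the other two.
recovered_a = xor_bytes(parity, payload_b)
print(recovered_a)
```

The cost of this design is one extra strand for every two data strands, which is the kind of redundancy-versus-density trade-off an encoding scheme has to weigh.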

Just to give some historical context for this program, we've been doing community building activities for the past 2 years. IARPA and SRC and Victor Zhirnov have organized a couple of workshops since 2016 that have assembled international stakeholders from academia, biotech, semiconductors, and infotech industries to roadmap clear and achievable engineering optimizations necessary to develop scalable MIST systems. In the slides, we will have a link to the roadmap from the 2016 workshop so that you can familiarize yourselves with it. This program seeks to put that roadmap into practice, by assembling multi-disciplinary communities around the goal of scalable compact information storage technologies to support real world big data use cases relevant to IARPA, transition partners, and the US government and other ecosystem partners.

It's my job to project a vision for where we're going to take a new set of technologies. The end result of this program will be tech that jointly supports end-to-end storage and retrieval at the terabyte scale. At the end of 4 years, you will have a practical system at the terabyte scale. This will offer a clear and viable path for exabyte scale in the future. It was important to have device manufacturers and storage ... we want to lay the groundwork for deployment beyond the scope of this program.

The ultimate vision for what we would like to achieve within 10 years is to go from the behemoth of a cold storage data center that costs $1 billion today, to something that fits on a tabletop. Something that sits on a tabletop and consumes orders of magnitude less power, that has a media lifetime that doesn't require frequent integrity checks or replacement ever, and where the total cost of ownership can be millions or tens of millions of dollars instead of billions. This at exabyte scale.

What are the approaches we are looking for? We're looking for innovative solutions across chemistry, molecular biology, microfluidics, and semiconductor engineering. A subset of those no doubt resonates with each of you in this room. Some of them resonate with me given my own background. I would be surprised if we have anyone that feels like an expert in all of these areas.

Some examples of writing data to polymer media include massively parallel polymer synthesis on microchips. That's just one example, so chosen because that's what's used for commercial synthetic biology applications today.

Example approaches to reading might include, but are not limited to, sequencing using arrays of nanopore sensors. There are other approaches, like high-throughput mass spectrometry. It's for you to tell us what you think the appropriate approaches are. My intent is to not be prescriptive.

Some example approaches to tailoring random access in a polymer archive might include, key-value stores and some physical compartmentalization of media by data type to make things easier. This has been demonstrated in previous work. It's up to you to tell us what you think an optimal scheme should be like.
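A toy model of a key-value store over a pool of strands can make the idea more tangible. In this sketch every strand carries an address prefix derived from its key, and retrieval selects only the strands with that prefix, standing in for a physical selection step. The hash-based address mapping, the 12-base address length, and all names here are assumptions made for illustration, not a scheme from the talk or from prior work.

```python
import hashlib

BASES = "ACGT"
ADDRESS_LENGTH = 12  # bases per address; an arbitrary illustrative choice


def address_for(key: str) -> str:
    """Derive a fixed-length base sequence from a key (hypothetical mapping)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    bases = []
    for byte in digest:
        for shift in (6, 4, 2, 0):  # two bits of the digest per base
            bases.append(BASES[(byte >> shift) & 0b11])
    return "".join(bases[:ADDRESS_LENGTH])


def write(pool: list, key: str, payloads: list) -> None:
    """Add one strand per payload, each prefixed with the key's address."""
    address = address_for(key)
    pool.extend(address + payload for payload in payloads)


def read(pool: list, key: str) -> list:
    """Select strands by address prefix and strip the address."""
    address = address_for(key)
    return [s[len(address):] for s in pool if s.startswith(address)]


pool = []
write(pool, "cat.jpg", ["ACGTAC", "TTGACA"])
write(pool, "dog.jpg", ["GGGCCC"])
print(read(pool, "cat.jpg"))
```

A real design would also have to pick addresses that are chemically distinguishable from each other and from payload content, which this sketch ignores.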

What's the current state of MIST technology, as I see it?

Most work in this space has been on development of proof-of-concept encoding-decoding schemes for DNA data storage. DNA has been used for convenience because biology already has lots of tools for working with DNA. People have piggybacked on the achievements of the human genome project, and have been using those technologies to synthesize and read out the novel encoding-decoding schemes. We could use peptides or synthetic polymers, which could have larger alphabets and higher information density per unit volume. But the tools are comparatively immature, which is why I think most people have been working with DNA. None of these alternative media are out of scope for this program. We're interested in hearing how you might use other polymers which might have more attractive properties.

Many studies have shown that DNA can support scalable random access and error-free information storage. There have been some recent publications that illustrate this. It's plausible that you could use any type of synthetic polymer. Bornholt et al, ASPLOS 2016.

There are some technical challenges for developing deployable storage devices. These are physical media and the operating system. On the physical media side, we need improvements on cost, speed and scale of polymer synthesis and sequencing technologies. On the operating system side, we need scalable approaches to indexing, random access, and although not in the scope of this program also parallel search capabilities. Many people who have attended the workshops we organized in the past few years, they say no operating system has been demonstrated that plausibly achieves these goals in the exabyte scale. It's important to push on these goals in parallel with developing the devices and physical media.

Related to these physical media challenges is improving cost, speed, and scale of synthesis tech. The key challenge for this program, so far as DNA, is improving the physical measurements beyond the needs of the life sciences industry by several orders of magnitude. Life sciences requires perfect synthesis and sequencing. A gene with 5% errors is not compatible with life. If I sequence a gene in a way that is high error, that is going to make my genome-wide association study useless and not statistically meaningful. The development of synthesis and sequencing tech for life sciences focused on perfection. Scale, throughput and cost have been secondary considerations to perfection. By contrast, data storage can tolerate high read and write error rates. Scale, speed and cost become primary design considerations.

To give you some deeper insight into what current synthesis and sequencing technologies can do, and what the economics look like, and what would allow for practical molecular information storage... Let's say we wanted to write 1 TB per day, at a cost of $1k or less, and read it back just as quickly. You can take numbers from one of the more recent coding papers for DNA storage, the DNA fountain paper, and there's a lot of numbers here, and back out the cost per base in this encoding scheme to achieve this goal. It's 10^-10 $/base, and you would need a reading/writing speed of about 10^7 bytes/sec.
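The two target figures can be sanity-checked with a few lines of arithmetic. The sketch below assumes 1 TB = 10^12 bytes and an idealized 2 bits per base with no redundancy; real encodings, including the one referenced in the talk, store somewhat fewer bits per base, so this is an order-of-magnitude check only. The function and parameter names are made up for this example.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def storage_targets(bytes_per_day=1e12, dollars_per_day=1000.0, bits_per_base=2.0):
    """Back out throughput and cost-per-base targets from a daily write goal.

    bits_per_base=2.0 is the idealized maximum for a 4-letter alphabet
    (an assumption; practical encodings achieve less).
    """
    bytes_per_second = bytes_per_day / SECONDS_PER_DAY
    bases_per_day = bytes_per_day * 8 / bits_per_base
    dollars_per_base = dollars_per_day / bases_per_day
    return bytes_per_second, dollars_per_base


throughput, cost = storage_targets()
print(f"{throughput:.2e} bytes/sec, {cost:.1e} $/base")
```

Under these assumptions the throughput comes out on the order of 10^7 bytes/sec and the cost on the order of 10^-10 $/base, consistent with the figures quoted above.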

How does this compare with current life sciences tech? As of about a year ago, the true cost of polymer synthesis... these are the Carlson curves, so called because they are produced by Rob Carlson to track the pricing and throughput of DNA synthesis tech. They are useful for genomics applications and for scoping our needs for molecular storage as well.

The true cost today for DNA synthesis is on the order of 10^-6 dollars per bp. That factors in a lot of overhead for production and operations; these are commercial operations. Our goal of 10^-10 dollars per bp is still 4 orders of magnitude off. This is the cost for perfect synthesis. On the throughput side, this is what has been communicated-- the current read speed for sequencing technologies. What we need to achieve this information retrieval goal of 1 TB/day is also multiple orders of magnitude off. These are historical data for life science tech with low error. I've been assured by parties that the cost and speed goals for information storage are achievable through optimization, particularly if we're willing to tolerate errors.

Just to give you a sense for what the current workflow looks like if I want to encode information, write to molecular media, read it back, and decode it... This takes weeks. This is a workflow from, I think it was from a 2016 paper, from the University of Washington and Microsoft collaboration, where once you know the oligo sequences that you want to synthesize, it can take weeks from order to receipt of synthesized DNA. It's costly, even if you buy in bulk. Once you know what you want to pull out of your molecular archive, it could take days to weeks to get new primers, and then pull the information out of the archive. This just highlights that to make this practical for real-world storage applications, we need to deploy synthesis and sequencing technologies together in a fully automated end-to-end workflow that allows us to do something in a day rather than making us wait weeks.

Some other practical challenges on the workflow side... again, this is a workflow from the Bornholt et al ASPLOS 2016 paper. Say you had a library of molecular media, and in that particular paper, this was organized into mm scale vials containing DNA in solution, and the act of pulling information out of the archive requires physically pulling out a vial, pipetting DNA out, and running the DNA through a sequencer. If I wanted to scale this up to an exabyte regime, it would still fill a room of this size. By requiring a manual retrieval process, and all the vials, you're sacrificing a lot on the information density side. This highlights that if we want a storage system that plausibly scales to the exabyte regime and fits on a tabletop, it requires miniaturization and automation.

The manual workflows have been instrumental in establishing this field and demonstrating plausible polymer work, but we have to dispense with those and start thinking about miniaturization and automation.

I'll be talking about the specifics of the program design in a few slides. I want to meditate on these challenges now, and I want everyone in the room to have an opportunity to think about these as you formulate your questions for me later in the morning.

We need to optimize all of these performance characteristics together in the context of an end-to-end workflow. It's minimally useful to have a device to write data to polymer media, if it takes forever for me to read it out. We need encoding schemes, physical organization schemes, storage and retrieval approaches, and everything has to work together in an end-to-end workflow.

Some challenges in system integration might include fluidics automation and a reliable interface between electronics and reagents and wet systems. This is a relatively immatu-- we shouldn't take it for granted. There are fundamental R&D challenges that need to be solved here.

There are some challenges in parallelization, like scaling down printheads and reducing feature sizes to reduce the volume of reagents for storage or manipulation of media. We're going to be doing things at the nanometer scale in this program.

One key challenge that we're going to grapple with in this program is the read-out. We need non-destructive read-out, versus the need to re-generate data after reading. If we have to write after reading, then that places more burden on us improving synthesis technologies, which, as you saw in the pricing curves over time in my earlier slide, very much lag sequencing technologies in terms of pricing and throughput. The long pole in the tent in phase 1 of this program is getting polymer synthesis closer to parity with sequencing approaches on the cost and throughput side. In an ideal world, we would be able to read out information without destroying polymers in the process, but this might not be feasible in v1. You need to give careful thought to this, and this has implications for-- whether to use this for cold storage and retrieval, or whether it could be used for analytics as well.

On the operating system side, I've been counseled that it's a major challenge to do scalable indexing and search capabilities in the operating system. Some fundamental questions that we're going to have to solve in this program are, how do we index an exabyte of data in a way that supports fast random access? This is not a solved problem. If we have random access, then what addressing scheme is optimal to make this efficient? Do we need to optimize for random access, which suggests a physical layout of media? Is it by media type, file size? How do you organize this data?

As I mentioned before, this is not within the scope of the program (like pattern matching and search on the media itself), but if you have a medium that could support that, what's the optimal encoding for supporting this? The addresses that are baked into the oligos, could encode information about the content of the oligo itself. So maybe you could do pattern matching in address space, I don't know. This is a "nice to have" for this program which I am open to supporting, but it's not a hard requirement in the BAA.

There are other things that are very important to the government, like managing security policies. You don't want to re-synthesize data every time you change the security policy. Many of these, the solutions are going to be determined by the expected access patterns. Archival storage with uncommon reads will be the focus of this program, not analytics where reads are common. Some of the metrics for the retrieval technical area (TA2) are trying to optimize it so that in the future we have a better shot at using this technology for analytics applications. We're hoping to demonstrate largely bulk storage and retrieval.

What are the technical areas for the MIST program? There are three areas. I compartmentalized things in this way because everyone in this room is basically a best-of-breed in some area, whether synthesis, sequencing, or operating system development. We want to make it as easy as possible for anyone to contribute value to this program. If you have polymer synthesis capabilities but maybe you don't have any background in sequencing or operating systems, then there's an area for you, do some networking and team building. To be a credible offeror, we're concerned about building devices, and you have to fill out all of the necessary capabilities on your team.

TA1 is for storage. The goal for this is to build a tabletop device capable of writing molecular media at a target throughput and resource budget. I'll be more quantitative about this in a few moments. There are a lot of approaches that one could take in this area: DNA, polypeptides, synthetic polymers, other sequence-controlled polymers, etc.

TA2 is the development of retrieval devices. You have to develop a tabletop device capable of random access from molecular media with a target... You could do this with nanopores, mass spectrometry, or other methods for sequencing polymers in a high-throughput manner.

TA3 is the operating system development. The goal here is to develop an OS for storage and retrieval devices developed by the other two technical areas. It coordinates indexing, compression, encoding, and decoding from molecular media, in a way that supports efficient random access at scale. For all of the progress that we've seen on the OS side in the research community in the past few years, this is probably the area where there is the need for the most fundamental R&D work. The metrics for this TA are more open-ended as you'll see in a moment.

We strongly encourage collaboration and teaming. The teams are going to be multidisciplinary. For any of these technical areas, you're likely to see teams with expertise in chemistry, microfluidics, semiconductors, synthetic biology, computer science. You can propose any combination of TA1, TA2, or TA3, or all three. It's up to you. You should plan to be part of an integrated team that is comprised of all three areas. If you propose only to TA1, the government will pair you with others that have compatible approaches for the other areas.

What's out of scope? Media that doesn't use sequence-controlled polymers.

We're going to employ a test and evaluation team to assist us with evaluating progress and the success of the program. There are some parties in the room that represent our probable T&E partners. I encourage you to engage with them and learn about their expertise while they are here today.

The T&E team will measure the devices against some performance milestones specific to each technical area. As you will see in the draft BAA hopefully released tomorrow, we're going to ask you to propose a T&E methodology that is compatible with your proposed technical approach. I don't want to presume that I or our T&E partners know the best way to evaluate your devices. We think we have a pretty good idea, but this is a collaborative undertaking so please tell us your preferred approaches.

Phase 1 will seek to develop storage and retrieval devices and operating systems for gigabyte-scale applications. This is 24 months long. In TA1 and TA2, the fundamental objective is to de-risk scalable synthesis and sequencing approaches for data storage applications. I mentioned that the metrics for DNA synthesis are like the long pole in the tent; there's a lot of fundamental innovation required. In TA3, we want performers to develop a simulator of the hardware developed by TA1 and TA2, guided by the anticipated performance characteristics, and to capture some anticipated failure modes of those devices, and then demonstrate an operating system that supports indexing and random access at scale on top of that simulated hardware. This is a well-established technique in the data storage industry for data storage hardware and we're embracing that here.
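As a rough illustration of what "simulated hardware with anticipated failure modes" could mean for TA3, here is a toy read-channel simulator. It is a sketch under stated assumptions, not anything the program specified: the error model (independent per-base substitution, insertion, and deletion), the DNA alphabet, and the names `simulate_read`, `sub_rate`, `ins_rate`, and `del_rate` are all hypothetical, and real devices would have failure modes this does not capture.

```python
import random

BASES = "ACGT"  # assumption: a DNA-like four-letter alphabet


def simulate_read(seq: str, sub_rate: float, ins_rate: float,
                  del_rate: float, rng: random.Random) -> str:
    """Return a noisy copy of seq under a simple per-base error model.

    Each base is independently deleted with probability del_rate,
    otherwise substituted with probability sub_rate, and may be
    followed by one random inserted base with probability ins_rate.
    """
    out = []
    for base in seq:
        if rng.random() < del_rate:
            continue  # base is lost in this read
        if rng.random() < sub_rate:
            # replace with one of the three other bases
            base = rng.choice([b for b in BASES if b != base])
        out.append(base)
        if rng.random() < ins_rate:
            out.append(rng.choice(BASES))  # spurious extra base
    return "".join(out)


# Example: one noisy read of a short strand, seeded for repeatability.
noisy = simulate_read("ACGTACGTACGT", 0.01, 0.005, 0.005, random.Random(0))
```

An operating system prototype could be exercised against many such noisy reads to check that its encoding and decoding recover the original data, long before physical devices exist.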

In phase 1, key decision points... in month 12 of the program, TA1 is going to have to deliver a decodable polymer data archive to the government. This means that at 9am on a Monday morning, we send you some files, and at 9am on Tuesday morning, you give us a bag of polymers, or, in a manner more aligned with the program, maybe you give us a chip with an array of polymers; that's to be determined. We will then sequence that and determine if we're able to decode the files that we gave you the day prior.

In month 23, just before the end of phase 1, both technical area 1 and 2 will have to demonstrate functional devices. TA3 will have to demonstrate an operating system functioning on simulated hardware. There's explicit decoupling from the devices and the operating system. This is a deliberate choice so that coordination is not as required.

The output is a 10 GB/day workflow.

During phase 2, the goal is to optimize the devices and the operations to support terabyte-scale; this will be over 24 months. At month 36 of the program, all technical areas will develop devices and OS that work together to support 100 GB/day workflows. At the end of the program, they should support TB/day workflows.

We will do T&E at least once per year. Performers will be evaluated on several metrics. These metrics will become progressively more challenging. The information on this slide will be released in the draft BAA.

For TA1, there are example milestones like the resource budget for storage. As I mentioned earlier, we would like to synthesize a terabyte of data to media for $1k in 1 day or less. The goal for phase 1 is to do 10 GB for $1k or less, and then in phase 2, it's a terabyte. There are many ways of doing this evaluation. We don't anticipate that you need to write 10 GB in a day; we can extrapolate this from a smaller demonstration, like if you're doing 100 MB, and then we look at the power envelope and the reagents. ....
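The extrapolation from a smaller demonstration can be sketched as simple arithmetic. This is only an illustration: it assumes throughput and cost scale linearly with data volume, uses decimal gigabytes, and reduces the resource budget to a single dollar figure, none of which the talk committed to. The function names and the example numbers are hypothetical.

```python
GB = 1e9  # assumption: decimal gigabytes (10**9 bytes)

# Phase 1 write targets as stated in the talk: 10 GB in a day for $1k or less.
PHASE1_GB_PER_DAY = 10.0
PHASE1_BUDGET_USD = 1000.0


def extrapolate(bytes_written: float, hours_elapsed: float, cost_usd: float):
    """Linearly extrapolate a small demo to (GB per day, USD per GB)."""
    if bytes_written <= 0 or hours_elapsed <= 0:
        raise ValueError("bytes_written and hours_elapsed must be positive")
    gb_written = bytes_written / GB
    gb_per_day = gb_written * 24.0 / hours_elapsed
    usd_per_gb = cost_usd / gb_written
    return gb_per_day, usd_per_gb


def meets_phase1_write_target(bytes_written: float, hours_elapsed: float,
                              cost_usd: float) -> bool:
    """Check the extrapolated demo against the phase 1 throughput and budget."""
    gb_per_day, usd_per_gb = extrapolate(bytes_written, hours_elapsed, cost_usd)
    cost_for_target = usd_per_gb * PHASE1_GB_PER_DAY
    return gb_per_day >= PHASE1_GB_PER_DAY and cost_for_target <= PHASE1_BUDGET_USD
```

For example, a hypothetical demo that writes 100 MB in 12 minutes for $8 of power and reagents extrapolates to about 12 GB/day at about $80 per GB, which would satisfy both targets under these assumptions; the same 100 MB taking 2 hours extrapolates to about 1.2 GB/day and would not.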

How we expect to-- and let me just check my time... Okay, how do we expect to do test evaluation in TA1? Our T&E partners may require physical access to the devices. We give you a file, we take the polymer media archive that you generate, and we sequence it. This is an unambiguous measurement of whether you have written what we intended you to write. We don't need to get into the guts of your device. It is to be determined what the physical collection of files that we ask you to store will be composed of. It will be both structured and unstructured documents, like spreadsheets, server logs, file sizes from kilobytes to megabytes. There's a lot of diverse stakeholders that have different goals here, and we want to evaluate the capability of these approaches with a wide variety of use cases.

For TA2, we're going to be optimizing some of these metrics for future analytics applications. These are the same metrics that I just described for TA1. This is resource budget for retrieval and read throughput. In phase 1, we want to read the equivalent of 1 TB/day. The reason for this ambitious goal is because we want to lay the groundwork for analytics applications in the future. We want non-destructive reads with repeated reads, and we would like high-throughput reads to support analytics. The T&E methodology for TA2 is likely similar to that for TA1. We are going to require physical access to your devices to instrument them with sensors. We will evaluate the inputs and outputs. For a retrieval-- the government will give you a polymer media archive where you have told us in advance what properties it should have for compatibility, and 24 hours later your device emails us with the data. Since we know what we put into the polymer media archive, it's pretty easy to check whether you have decoded the information correctly.

I already mentioned that you're going to have to specify the requirements regarding polymer chemical composition going into the retrieval device... and file organization details or anything like this. This is not going to be as trivial as here's a bag of polymers and just deal with it; it will be very carefully planned in advance.

I mentioned TA3 is going to be a little bit more open-ended. This is where a lot of fundamental R&D work is required. Here's a couple of example metrics and milestones for TA3, the operating system technical area, which we might pursue. The resource requirements for the simulated storage and retrieval hardware-- you should not require a supercomputer to simulate storage and retrieval hardware. I don't think that molecular dynamics simulations at the level of billions of polymers is the sweet spot for developing these systems. You should make some simplifying assumptions for the sake of this having tractable resource requirements. We'll also look at the resource requirements for the storage and retrieval workflow. It's been communicated clearly to the government that encoding is cheap and decoding can be expensive. What are the resource requirements for encoding information before writing it, and what are the resource requirements for decoding information once you get base calls off of your sequencing hardware? Also metrics like read time, random access, we're totally open to those. In your questions, in your written feedback to the program BAA email address, give us your thoughts and in particular your feedback on the draft BAA. The composition of the files we ask you to work with will be consistent with what I have described for the other TAs.

We're looking for a diversity of approaches for developing deployable storage technologies that can scale to the exabyte scale and beyond. We anticipate teams will include individuals with expertise/experience in chemistry.

The BAA will supersede anything presented or said today at this proposers' day at IARPA. Here's my contact information. Feel free to reach out to me. Once a BAA is released, I have to be careful about giving privileged information to any one party. All answers to questions will be posted publicly to the website.

dni-iarpa-baa-18-03@iarpa.gov include "IARPA-BAA-18-03" in the subject line

We will have a 30 minute break until 11. And then our acquisitions team will give a brief on how to do business with IARPA.

Doing business with IARPA

I am the chief acquisition officer for IARPA. I know you are primarily here to hear about the MIST program. I am going to spend the next 15 minutes talking you through our processes.

Responding to Q&As... BAAs get published. We obviously recommend you read the entire BAA before submitting questions. Pay attention to section 4, which gives instructions and submission information. We use a system called IDEAs. This is where all proposals are submitted.

In addition to the BAA, on the right-hand top corner for IARPA, there is a FAQ link. Please look at that if there's any question the BAA doesn't answer. If the FAQs don't for whatever reason answer your question, there will be the BAA email address. Do not include proprietary information in any of your questions, please.

Eligible applications... we do these proposers' days because we encourage collaboration and teaming efforts. It's the responsibility of the proposers to come up with teaming arrangements and collaborations. The government will not provide any instruction on that, just give you the opportunity. Foreign individuals may participate; they need to comply with export control laws, NDAs, security regulations.

There are some organizations that we consider to be ineligible. Under the IARPA government website, we have an organizational conflict-of-interest policy which spells this out in detail. Some organizations have access to government information that makes them ineligible to submit proposals, like federally funded research institutions.

For intellectual property, the government will take unlimited rights. At a minimum for anything proposed, we ask government purpose rights for any data developed using MIST funding. At the time you are submitting your proposals, there are instructions for right restrictions you're requesting as part of the submission. IARPA requires Government Purpose Rights (GPR). There will be an officer you work with.

Pre-publication review. IARPA encourages publication of information resulting from the research we're doing, and it's unclassified. Prior to release of that work, there will be some sort of review of that information from the program manager. A lot of this is up to the program manager for how to do that... They might prefer a courtesy copy x days prior to release, and this will be spelled out as part of the BAA. Questions can be answered during negotiation as required.

Preparing the proposal. Go into section 4, make sure you understand what that's saying. You'll have a link to IARPA's proposal system, IDEAs. https://iarpa-ideas.gov/ We encourage you to enroll in the system a few days prior to the due date. It's relatively easy. There's a helpdesk link which provides you email and telephone if you have any concerns or issues. It's up to the offeror to ensure that the final versions of the proposals are uploaded prior to the due date.

If you have classified information that you want to submit, you must contact the chief of security and there will be instructions for this.

We encourage you to go to FBO because that's where we will post any updates, like if we respond to any questions. They will be posted to FBO. The onus is on you to be tracking that. Finally, under section 5 of the BAA, we outline our evaluation criteria. The proposal is not evaluated against other proposals, but against the evaluation criteria established in section 5.

In preparing for the proposal, please review our Organizational Conflict of Interest (OCI) policy. E.g., SETA, FFRDC, UARC, etc. There will be instructions for how to contact IARPA for problems. All of these instructions will be outlined in the BAA.

Streamlining the award process. In the proposal, we will be clear on what we're asking for. In the cost proposal, respond to the requirements requested. We don't need additional detail at this time. In order to do a cost reimbursable with the government, a cognizant government auditor needs to go through your system and approve it. If you have not done this before, this does not limit or preclude you from submission. We can establish other kinds of contracts with you. There will be a statement of work from you; it must outline the work. If you are selected for negotiation, there might be clarifications you need to provide to outline deliverables. If you identify key personnel, there's expectations of time provided by those people, and you will outline those percentages. 10% is not key. There needs to be substantive effort from the key personnel and we anticipate seeing that in the proposal response. We understand there's sensitivities about what a subcontractor will and will not provide to a prime. There could be some cost information there, but we understand those restrictions; if you are selected for negotiation then we will work to get those details from the subcontractors.

We fund applied research for the intelligence community. There are always concerns about the Export Administration Regulations (EAR) or ITAR. We don't resolve those concerns for you. IARPA is not DoD.

As a disclaimer, the final BAA is what you're responding to. Today's program is just a broad outline. You're going to use the final BAA published on FBO for what you're responding to and what the requirements are.

As far as budget... yes this program has a budget. We're not going to share this budget with you. We don't have a pre-determined number of awards. We will be looking at the proposals that come in, in correlation with section 5 for determining who we select.

Thank you all very much. I will be around afterwards in case there's anything else that pops up.

How much money is the government giving out? What are the individual awards going to be?

Smallest award amounts?

Q&A session

I am going to go through some of the questions. I want everyone to leave here being as informed as possible.

What is the size of the grants? They will be appropriate for the work proposed.

Co-funding agreements are in scope.

What is the IP policy? Katy addressed the fundamentals of that, the government retains government purpose rights, everything else is negotiable.

Anticipated budget.

Matching contributions.

Are there seedling programs in DNA data storage space? Because we're doing a full program in this area now, which tends to have a substantially larger budget, I do not plan to fund seedlings. This could change, but that's the current plan. That said, you are free to submit seedling proposals and address them to specific PMs at any time. I don't want to discourage you from submitting proposals for things that you think are aligned with the needs of the intelligence community. This program seems to be addressing this need, and it may not be a need for near-term seedlings.

If a team develops tech for both TA1 and TA2, it is conceivable that IARPA might not be able to create the materials for TA1 or TA2. How is this handled? Thank you for addressing this question. It has been communicated to me that there are a couple of parties, over the last year at least, a few organizations developing integrated read-write devices where it's not trivial to pull the media out for the government to look at. The draft BAA that should be released tomorrow addresses this point and says if you're proposing to develop an integrated device, then please propose your own test and evaluation approach for instrumentation so that we can assess your performance. If you can't pull the media out, or the read isn't compatible with us synthesizing on our own, what is an appropriate approach? We're flexible about accommodating other approaches.

Can you provide more information on pairing between the different technical areas? What will the govt do if teams are missing components in their proposal? So I mentioned, you're all best of breed in some way. The purpose of this meeting is to help facilitate teaming activities. If you're looking to propose within a TA, you can fill out your team with the requisite expertise. We can pair you across TAs. If you propose to TA1 and there's a TA2 performer you have never met that has a compatible approach, we can team you up. Within a TA area, we can't fill out your team. Say you have a novel synthesis chemistry: we don't have people that can design that in practice. That burden, unfortunately, is on you. But we can provide opportunities to network with the right people. I hope this addresses this question. If you propose one technical area and you are funded in that area, the government will provide relevant partners for you.

By forming teams, does this restrict a TA3 participant from moving between teams where maybe TA1 or TA2 is demonstrating better approaches? My interpretation of this question is: if we pair a TA3 performer with a TA1/TA2 pair that doesn't achieve the desired program goals, are all members of that team at risk? There are ways of engineering the teaming so that one TA3 performer could conceivably contribute to multiple TA2 teams, as long as they have approaches that are compatible with the particular OS. We want that flexibility and we will see what proposals we get and what makes the most sense. The short answer is that we have flexibility so that a TA3 performer will not necessarily be at risk of being downselected if their partners are not upholding their promises for various reasons, or contractual obligations rather.

Seedling prize program for DNA-based storage tech. I already addressed the seedling issue. Prize programs, I have no plans for this right now, but if you think that would be value-building for the community, in the operating system domain, maybe we should be organizing regular workshops and cultivating a community. Should we be doing prize challenges? I'm open to it. Please give feedback on this.

Error rate goals for the MIST program. These were actually absent from the slides. I was counseled by some information theorists not to be overly prescriptive about what the error rates should be. We can tolerate non-zero error rates. I only care that we can read and write information without loss.

How do you deal with confidentiality during phase 1 and phase 2? Given that the long-term goal of this program is to produce commercially viable tech, there are IP considerations and confidentiality considerations. I believe all the reviewers of the proposals we receive have signed NDAs and financial information disclosures to make sure they are not conflicted, and we have to keep anything source selective confidential. The terms of what you disclose to partners on a team or in public to the program at large during our workshops, we have technical exchange meetings and other events we will put together, that's all negotiable during contract negotiations. Yes. Again, we're sensitive to these concerns and we will work with you. It's not ideal if we have to have a kickoff meeting where each team only presents to the government and other teams aren't allowed in the room, but if that's the only way to make this program happen, then that's a possibility.

Is this $1k/day budget identical for reading? For archival applications it may be more reasonable to require a readback of a small amount of the data encoded. I tried to emphasize in my talk that in the technical presentation, the $1k is the effective cost throughput target. We may allow you to write and read smaller volumes of data and then we extrapolate from that. You can define this for us. It's kind of arbitrary. You could buy all of the world's reagents that are needed for phosphoramidite synthesis of DNA, and then the marginal cost of synthesizing a small amount from that bulk volume is like zero. So have you really synthesized DNA for $0 in that case? Maybe it's really about power and volume of reagents. $1k is just a proxy for these other estimates. So the short answer to "are we looking for the same budget for reading and writing" is no. We have more aggressive budget targets for reading than we do for writing. We're looking for reading 1 TB/day for $1k or less, at the end of phase 1. For TA1, we're looking to synthesize 10 GB/day for $1k/day and similar resource requirements.

How do you anticipate polymers for T&E given that each approach will differ dramatically? This is true. We're going to need to do a lot of planning on the government side to make sure we're able to synthesize appropriate polymers. We have a year-long lead time during phase 1 and even more lead time, like pre-phase during contract negotiations, to prepare for that.

If we make a TA1 and TA3 proposal, would the government make pairing suggestions for a different TA2 team? Are these suggestions or requirements? You are encouraged to propose to all technical areas where you can credibly perform. If the government decides to select you for one TA and not another, it's entirely up to the negotiating process during contract negotiations to determine who your partners are and what your obligations are in other TAs. You are free to say, during contract negotiations, either you fund everything or none of it, but that could be to your disadvantage to do so. We're looking to work with you in partnership to maximize our chances of achieving overall program goals. It doesn't do anybody any good to be inflexible while we're sorting out how these goals are structured. I'm open to being flexible if you are.

"Your timeline is aggressive." Yes, that's the idea. 3-6 mo to hire and buy equipment. What's the time from contract approval to contract start date? A number of these questions use the word "grant". These are not grants. These are contracts with deliverables and a statement of work. The government reserves the right to halt work on a contract if there is not satisfactory progress towards the goals on the schedule we defined. These are not grants. These will be contracts. We have some flexibility in the types of awards we make, we can do other transactions. There are exotic contracts we can do, and Katy is the person to speak to about that. We do not anticipate award grants. But what is the time from approval to start date? Katy? It depends. A lot of this comes down to negotiation and how much negotiation is involved? The goal is to be releasing them, at roughly the same time. There's been cases we have 6 months because we're working on negotiation. After that, a lot of it comes down to the selection process, and then after that, anywhere from 3 to 8 months potentially, just depending again. You submit a proposal, the government does a source selection process, we make a determination with whom we would like to negotiate for contracting, you get a letter saying you have been selected, it will typically say you have been selected with modifications and it wont say what the modifications are. The interpretation of that clause is careful- we might be selecting one of the 3 TAs to which you have proposed. The contract negotiation process can take 6 months. I am committed to it going much faster than that, I want this program to get going. Typically as soon as contract negotiations have concluded, we make awards, we obligate funding, then we have a kickoff meeting. Once contract negotiations have concluded, we move quickly. I appreciate that it takes 3-6 mo to hire people and buy capital equipment.

The first major deliverable at phase 1 is at the 12 month mark. That deliverable requires you to show that you can synthesize, that you can write 10 megabytes of data to a polymer media archive which is not a massive amount of information. There are some very specific reasons for choosing that. If you have anxiety about ramping up a production-scale operation on that timescale, at the 1 year mark, we're looking at just 10 megabytes. Hopefully, that's achievable even if you have hiring delays in getting your equipment setup. Please ask clarifying questions if that didn't help.

In the case of special chemistry for TA1 coupled with a read-head in TA2, how might TA2 testing proceed? If you're developing an integrated read-write device where it's non-trivial to pull things out of the device, then we're committed to working with you to find a test evaluation strategy that works.

Should the TA1 partner provide the government with the new recipe? This is negotiable. The government-- contracts can define the terms that protect your IP. Any partners on the government team will have to adhere closely to those requirements. If it makes the most sense for us to do polymer synthesis using your novel chemistry, then we can put procedures in place to guarantee the confidentiality of this method.

Is this an accurate assessment of the program goal? 1 exabyte/day in year 5? We're going after 1 TB written and read back in a day by the end of phase 2, year 4. We want tech that offers a plausible path to scaling to the exabyte regime. But it's not a goal of this 4 year program.

Terabyte desktop form factor? Yes. I did not define power envelope requirements. I highlighted my long-term vision for an exabyte-scale device, but I was deliberately cagey about reagent volumes in this presentation because I don't want to be overly prescriptive; I don't want to assume I know the best metrics for an appropriate resource utilization budget. We're setting targets on the performance of the devices for how quickly data goes in and out of them, and broad targets for resource requirements.

Do you have to demonstrate an exabyte for T&E purposes? No. An effective read-write throughput of about 1 TB/day. I think it's 10 GB/day for writing, and 1 TB/day for reading, in phase 1. We allow you to do this with smaller volumes.

Is 100 MB at 12 months out the minimum requirement? The exact volumes to demonstrate and do extrapolation, that's negotiable. Tell us what's appropriate, we can negotiate targets accordingly. It's also for our T&E partners to tell us what's reasonable and necessary based on the technical approaches proposed.

Draft BAA feedback by email, are all emails made public? Before the final BAA is posted, we will not be providing any further public feedback in response to your comments. After the final BAA is posted, we will release on the program website a public response to all questions that we receive. You can see every question we receive, and every answer we provide.

When does the government anticipate to pair TA1/TA2 teams with TA3 teams? In phase 1, there's no explicit requirement for the TA3 teams to do anything--- using the devices developed by TA1/TA2. But they do need clear guidance from the device developers regarding how it works and likely failure modes. TA3 needs to build an appropriate simulator. Teaming begins from day 1. We need you to be exchanging technical specs immediately. The close coordination that actually uses the operating system with the physical write and storage/retrieval devices, starts in phase 2.

What do you think is the market size for MIST? I recognize that those of you in the room looking to develop commercially deployable technologies care closely about this. I wish I could predict the future. I think the enterprise data storage market is about $50 billion/year. Is that too low? I am seeing some shrugs. That's the order of magnitude of the global enterprise data storage market, currently. I would expect, under the constraint that people are going to look to store more data, not less, over time, that the market size certainly would not shrink. If you can deliver a storage technology with more attractive scalability and price points, competitive with current solutions, I don't see why it wouldn't achieve widespread adoption. We have to build it, and we have to build it in a way that establishes the competitive value of these technologies against things on the market already. If we do this successfully, we have to trust there's a market.

Who needs it, and who needs this over cloud storage? I think this is a misunderstanding of the economics of large-scale cloud storage. Somebody has to pay to build an exabyte-scale data center. Those costs are passed on to the customer in cloud data storage. If you are the US government, those costs typically carry a mark-up that obviates the need for the government to run its own data centers, but you're paying for the privilege. The government needs more economically scalable storage technologies. If we could deploy these technologies in the cloud and the government would just store exabytes of data or more in the future through a cloud provider, that would be great. The government doesn't want to be doing this itself. It makes economic sense. It's my own personal opinion, though.

Are there any other questions? I have gone through the stack of cards.

Do you plan on funding multiple teams for the different TAs and will they be aware of them? It's common for IARPA to fund multiple approaches to balance the risk. I anticipate that this program is likely to fund more than one team of performers, unless IP considerations-- I don't see circumstances where IP considerations would prohibit the government from disclosing the existence of other teams, but what they're doing in detail might be limited by IP.

How large were prior programs in budgets? The answer is no, can't reveal that. DARPA, our sister agency, announces program budgets at the beginning. This is not how IARPA operates.

Is it within the TA3 scope, to propose the use of traditional storage mediums for error correction, metadata, optimization? Yes. The draft BAA that will be released, I tried to be explicit that we require the use of polymers for long-term storage, but for shorter-term operations in support of things like error correction, storing metadata, indexing, I think there could be a lot of value added by using conventional storage technologies. You shouldn't need to store an exabyte in flash, in order to maintain and access your data banks.
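The split described here, with the index and metadata in conventional storage and the bulk data in polymers, can be sketched as a toy data structure. This is an illustrative assumption only: `HybridArchive`, its fields, and the chunking scheme are hypothetical names invented for this sketch, and the `pool` dictionary merely stands in for a physical polymer archive.

```python
from dataclasses import dataclass, field


@dataclass
class HybridArchive:
    """Toy model: small index in conventional storage, payloads in a polymer pool."""
    # Conventional storage: file name -> ordered list of chunk addresses.
    index: dict = field(default_factory=dict)
    # Stand-in for the polymer archive: address -> payload bytes.
    pool: dict = field(default_factory=dict)

    def write(self, name: str, data: bytes, chunk_size: int = 4) -> None:
        """Split data into chunks, store each under a fresh address, record the index."""
        addresses = []
        for offset in range(0, len(data), chunk_size):
            address = len(self.pool)  # next unused address
            self.pool[address] = data[offset:offset + chunk_size]
            addresses.append(address)
        self.index[name] = addresses

    def read(self, name: str) -> bytes:
        """Random access: look up addresses in the index, fetch only those chunks."""
        return b"".join(self.pool[a] for a in self.index[name])


archive = HybridArchive()
archive.write("server.log", b"boot ok; disk ok")
archive.write("sheet.csv", b"a,b\n1,2\n")
```

The index here grows with the number of chunks rather than with the payload size, which is the sense in which it can live in conventional storage without holding the full archive there.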

How can one group propose two different proposals to the same TA area? There is no prohibition on one research group participating in different proposals. Two ideas, two different proposals? Two distinct proposals, to the same TA? There's no explicit prohibition against doing that. IARPA might consider whether your group has the bandwidth to do this work within the same lab, and there might be things that make it administratively simpler to put both into the same proposal, and then IARPA could select which one would be preferred. You could also submit two different proposals. Can the same offeror offer the same capabilities in different proposals? Yes.

Do I want to review 10,000 proposals? No. Do I want to give you the flexibility needed so that we can extract from proposals the highest-value options for the government? Yes. There's a tradeoff. We'll review as many proposals as we have to review.

We'll have more time for lunch. Thank you all for attending the government portion of this event. We will break for lunch, and you're all on your own. We will be back at 1:30pm. We will have briefings for offeror capabilities for those of you who have provided slides. We have the room next door running until 5pm for networking and posters. I have to get lost at this point. Thank you all very much for your contributions and building this community. I look forward to working with all of you. Thanks again.