Dark matter of the human genome: Synthetic regulatory genomics

Matt Maurano, NYU Langone Medical Center

https://www.youtube.com/watch?v=xvlkdTgqo3A&t=0s&list=PLHpV_30XFQ8RN0v_PIiPKnf8c_QHVztFM&index=13

This is the slide for mammalian gene regulation. It alludes to the complexity of the regulatory genome, which gives us all the interesting biology and medicine that you will hear about. What's going on in this slide is that there are chromatin fibers wrapping up the DNA, and transcription factors binding here and there. The binding of these TFs is specified by sequence, but that sequence plays out in an epigenetic context and has very different effects from one cell type to another. The point of having all this up here is to give you a sense of the genomic content. You can move these regulatory elements around, and people have done this at a small scale in the past, so this complicates the study of gene regulation but also presents an opportunity for this group here.

Originally, mapping regulatory DNA was based on accessibility. These days we can do this on a genome-wide scale. You see tracks like this, where each peak represents a particular regulatory element bound by transcription factors. If you go look in different cell types, there's a great variety in the regulatory landscape across cell types. These are things like promoters, enhancers, and other regulatory elements. They have been mapped in great depth over the last five years by a large national consortium. You can see here a list of a huge variety of cell and tissue types for which data identifying regulatory elements are publicly available on the internet.
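As a rough illustration of how those accessibility landscapes differ between cell types, here is a minimal sketch that compares peak calls from two cell types. It assumes plain three-column BED files of DNase-seq peaks; the file names are hypothetical placeholders, not files from the talk.

```python
# Minimal sketch: compare DNase-seq peak landscapes between two cell types.
# Assumes three-column BED files (chrom, start, end); file names are hypothetical.
from collections import defaultdict

def read_peaks(path):
    peaks = defaultdict(list)
    with open(path) as fh:
        for line in fh:
            fields = line.split()
            if len(fields) < 3:
                continue
            chrom, start, end = fields[:3]
            peaks[chrom].append((int(start), int(end)))
    for chrom in peaks:
        peaks[chrom].sort()
    return peaks

def overlaps_any(interval, sorted_intervals):
    # Linear scan over intervals sorted by start; fine for a sketch,
    # a real pipeline would use bisect or a tool like bedtools.
    start, end = interval
    for s, e in sorted_intervals:
        if s >= end:
            break
        if e > start:
            return True
    return False

cell_a = read_peaks("cell_type_A_dnase_peaks.bed")  # hypothetical input
cell_b = read_peaks("cell_type_B_dnase_peaks.bed")  # hypothetical input

total = shared = 0
for chrom, intervals in cell_a.items():
    for iv in intervals:
        total += 1
        shared += overlaps_any(iv, cell_b.get(chrom, []))
print(f"{shared}/{total} peaks in cell type A are also accessible in cell type B")
```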

In an individual cell type, perhaps 1% of the genome is represented by regulatory elements. Because they are cell-type specific, there might be up to 4 million regulatory elements in the genome altogether. There's a lot of material out there.
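A back-of-the-envelope calculation shows how ~1% accessibility per cell type and a few million elements genome-wide fit together. The genome size and average element footprint below are assumed round numbers of my own choosing, not figures from the talk.

```python
# Back-of-the-envelope: how many regulatory elements does ~1% accessibility imply?
# Genome size and element footprint are assumed round numbers, not measured values.
genome_bp = 3.1e9            # approximate haploid human genome size
accessible_fraction = 0.01   # ~1% of the genome accessible in a given cell type
element_bp = 200             # assumed average regulatory-element footprint

per_cell_type = genome_bp * accessible_fraction / element_bp
print(f"~{per_cell_type:,.0f} elements accessible in one cell type")  # ~155,000

# Summed over many cell types, the union of cell-type-specific elements
# reaches into the millions (the talk cites up to ~4 million genome-wide).
```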

What do these elements do?

We use terms like promoters and enhancers. But that vocabulary doesn't really scale to what we see today in genomics. You can look at the binding of many different TFs, repressors, and so on. There's a lot of stuff going on at each locus. Why hasn't this been answered yet?

Maurano et al., Science 2012: Mapping human disease- and trait-associated variation by genome-wide association studies (GWAS)

For the common diseases and traits that many of us are interested in, the majority of GWAS hits lie in non-coding DNA. Only about 5% of the hits land in protein-coding regions. The rest land in non-coding DNA, and they are highly concentrated in these regulatory elements. So at first glance this presents a rather depressing picture for medicine.
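The concentration of GWAS variants in regulatory DNA can be summarized as a fold enrichment: the fraction of trait-associated variants landing in accessible regions divided by the fraction of the genome those regions cover. A minimal sketch of that calculation, using made-up placeholder numbers rather than the published data:

```python
# Minimal fold-enrichment sketch for GWAS variants in regulatory DNA.
# All input numbers here are placeholders, not the published values.
variants_total = 5000        # hypothetical count of GWAS lead variants
variants_in_dhs = 2000       # hypothetical count falling in accessible regions
dhs_bp = 32e6                # hypothetical total accessible territory (bp)
genome_bp = 3.1e9            # approximate genome size

observed = variants_in_dhs / variants_total   # fraction of variants in DHSs
expected = dhs_bp / genome_bp                 # fraction expected by chance
print(f"fold enrichment: {observed / expected:.1f}x")
```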

Why is this a hard problem to tackle?

There are two prospective approaches. One is to study regulatory variation in its endogenous context by engineering variation. This has been done for decades, but it has been accelerated by programmable nucleases, which give you a really high degree of control. You can extend this to many sites, with some technical complications, but there are limitations on what you are able to do: once you want to start making multiple changes, it gets harder.

The other approach is to look at natural variation. You can do genomic profiling to map gene expression traits, which gives you an efficient yield. We've done some work along these lines; we've been able to build local models and study the effects of sequence variation.
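The natural-variation route typically means something like eQTL mapping: regressing a gene's expression on genotype dosage across individuals. This is a generic sketch of that idea using simulated data; it is not the speaker's specific model.

```python
# Generic eQTL-style sketch: regress expression on genotype dosage (0/1/2).
# Data are simulated; this is an illustration, not the speaker's actual analysis.
import numpy as np

rng = np.random.default_rng(0)
n = 200
genotype = rng.integers(0, 3, size=n)                    # allele dosage per individual
expression = 1.5 * genotype + rng.normal(0, 1, size=n)   # simulated additive effect

# Ordinary least squares with an intercept term
X = np.column_stack([np.ones(n), genotype])
beta, *_ = np.linalg.lstsq(X, expression, rcond=None)
print(f"estimated effect size per allele: {beta[1]:.2f}")
```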

Endogenous approaches to study regulatory variation

We don't have reporter assays able to address all the questions we might like to ask about regulatory variation. We could classify reporters by size: plasmids, which are smaller than 5 kb; BACs, which are 100-300 kb; or YACs, which are 100-1000 kb. You could classify them by mode of integration: in vitro, transient, stable (random), or site-directed single-copy. And you can classify them by scale: traditional single reporters versus multiplexed assays.

So you start going back and looking at a locus, and at many loci, at large scale. We're interested in pushing this forward, a group of us at NYU as well as at other places. This is a quick overview of our strategy. There's a BAC vector into which we can integrate a large segment of DNA, up to a few hundred kilobases, containing a gene of interest, and at the hypersensitive sites of interest we can place modules. That lets us focus our attention on the elements that are important and begin to scale up in-depth analysis.

The problem is delivering these constructs, so we do scarless, single-copy, site-specific integration and use counter-selection. Ultimately the interesting part is going to be what you do for profiling: RNA profiling via RT-PCR, chromatin accessibility via capture DNase-seq, and chromosome conformation via a capture "-C" method.
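For the RNA readout, RT-qPCR data are conventionally summarized with the 2^-ΔΔCt method: the target gene is normalized to a reference gene, then compared to a control construct. A minimal sketch of that standard calculation, with placeholder Ct values rather than real measurements:

```python
# Standard 2^-(delta-delta Ct) relative-expression calculation for RT-qPCR readouts.
# All Ct values below are placeholders, not real measurements.
ct_target_edited = 24.0      # target gene, edited payload
ct_reference_edited = 18.0   # reference (housekeeping) gene, edited payload
ct_target_control = 22.0     # target gene, unedited control payload
ct_reference_control = 18.0  # reference gene, unedited control payload

delta_ct_edited = ct_target_edited - ct_reference_edited
delta_ct_control = ct_target_control - ct_reference_control
delta_delta_ct = delta_ct_edited - delta_ct_control

relative_expression = 2 ** (-delta_delta_ct)
print(f"expression relative to control: {relative_expression:.2f}")  # 0.25 here
```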

This opens up a lot of questions to address: multi-edited synthetic haplotypes, cross-species function, gene fusions, chromosomal rearrangements, and position-effect variation.

You can make single substitutions and pairwise substitutions. We scan through single knockouts rapidly, and based on the results we test double knockouts as well as other types of changes.
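The scanning logic is essentially combinatorial: enumerate every single deletion of a hypersensitive site, then, guided by those results, enumerate pairs. A minimal sketch of that design enumeration; the site names and the "hit" set are hypothetical placeholders.

```python
# Enumerate single and pairwise hypersensitive-site deletions for a payload.
# Site names below are placeholders for the elements in the locus of interest.
from itertools import combinations

sites = ["HS1", "HS2", "HS3", "HS4", "HS5"]

single_deletions = [(s,) for s in sites]
double_deletions = list(combinations(sites, 2))

print(f"{len(single_deletions)} single deletions")    # 5
print(f"{len(double_deletions)} pairwise deletions")  # 10

# In practice, pairs involving sites that showed an effect as singles
# would be prioritized; e.g. keep only pairs touching a hypothetical hit set:
hits = {"HS2", "HS4"}
prioritized = [pair for pair in double_deletions if hits & set(pair)]
print(f"{len(prioritized)} prioritized pairs")
```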

We have been working on the alpha-globin locus in collaboration with the Hay lab; see Hay et al., Nat Genet 2016. They did this in mice, and it has been informative about locus control around these genes. We can go well beyond previous studies by increasing the scale.

Maurano lab: Megan Hogan, Jesper Maag, Nick Vulpescu, Maia Stoicvici; collaborators including the Boeke lab; LIMS/automation: Sergei German, Andrew Martin, Henri Berger, Vincent German; and Doug Higgs, Tim Niewold, Aravinda Chakravarti.

Q&A

Q: When you make heterozygotes, do you have problems analyzing hypersensitive regions?

A: We have done a lot of work looking at natural variation. The advantage of doing it this way is that you can put in point variants as markers. Some of the first constructs are going to be copies, but we can go back in and put in markers so we can distinguish them.

Q: That's a perfect example of the advantage of synthetic over natural variation.

A: That's true.