
Machine learning begins to unlock non-coding variation


Summary: New methods that use machine learning techniques to directly predict regulatory properties of non-coding sequence will likely play a key role in interpreting non-coding genetic variation. QTL analysis could quickly become obsolete.

Understanding the impact of non-coding genetic variation is key to understanding heritable complex traits in humans. We’ve all heard the story: the vast majority of loci (~95%) identified by genome-wide association studies (GWAS) implicate non-coding variation. This suggests that variation in regulatory function, rather than in protein-coding sequences, is largely responsible for common diseases like diabetes and Crohn’s disease. One common hypothesis is that many of these loci share an underlying model: a genetic variant affects binding of some factor to DNA, which affects transcription of one or more nearby genes, which in turn affects some downstream cellular process leading to disease (even though we’re learning more and more that transcriptional changes do not necessarily lead to changes in protein levels). It sounds simple, but interpreting non-coding variation turns out to be extremely challenging.

Learning the regulatory code

So how can we predict which non-coding variants have “causal” effects leading to disease? Here are some options:

New methods provide direct sequence-based predictions of regulatory activity

In the last several months, a handful of new methods have come out for predicting regulatory activity of non-coding regions. All of these take a similar form: train a machine learning model to predict an annotation of interest based on local sequence features. These models seem to mostly capture features related to the sequence specificities of transcription factors, but they can also take into account things like broader sequence context and co-binding of different factors. There are several things I find beautiful about these methods. First, we can now use these models to directly predict the impact of a mutation by feeding a model several versions of a sequence, each containing a different allele, and quantifying the change in the prediction. Second, since quite accurate models can be built using a single dataset from a cell type of interest, these methods remove the need to measure these molecular phenotypes across hundreds of samples, as is required for QTL analysis. This is a huge advantage. Below I take a brief look at some recently published methods (sorry if I am missing some), divided into two general classes.
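To make the first point concrete, here is a minimal sketch of scoring a single-base substitution by comparing a model's prediction for the reference and alternate sequences. Everything named here is an assumption for illustration: `toy_model` is a made-up stand-in that just counts a motif, not a trained model from any of the published tools, and `variant_effect` is a hypothetical helper.

```python
from typing import Callable


def variant_effect(seq: str, pos: int, alt: str,
                   model: Callable[[str], float]) -> float:
    """Return model(alternate sequence) - model(reference sequence).

    `pos` is a 0-based index into `seq`; `alt` is the substituted base.
    """
    if not 0 <= pos < len(seq):
        raise ValueError("pos is outside the sequence")
    alt_seq = seq[:pos] + alt + seq[pos + 1:]
    return model(alt_seq) - model(seq)


def toy_model(seq: str) -> float:
    """Hypothetical stand-in for a trained model.

    It simply counts occurrences of a made-up motif; a real model would
    return a predicted binding or accessibility score instead.
    """
    motif = "ATCG"
    hits = sum(seq[i:i + len(motif)] == motif
               for i in range(len(seq) - len(motif) + 1))
    return float(hits)


ref_seq = "GGATCGTT"
# Substituting T -> A at index 3 destroys the only copy of the motif.
delta = variant_effect(ref_seq, pos=3, alt="A", model=toy_model)
print(delta)  # -1.0
```

Repeating this for every position and every possible base gives the kind of in silico mutagenesis map that some of the tools below can produce.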

Kmer-based models

One class of methods is “kmer” based, meaning they train an underlying model to learn the effect of short kmers (e.g. ≤10 bp) on local sequence annotations. Super simplified version: every time I see the kmer ATCG, I see tons of binding by my transcription factor, but every time I see AAAT, I see nothing.
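The super simplified version can be sketched in a few lines. This is a plain log-odds counting scheme that I am assuming purely for illustration; it is much cruder than what the published kmer methods actually do. The sequences are made-up toy data, and `kmer_counts`, `kmer_weights`, and `score` are hypothetical helper names.

```python
import math
from collections import Counter


def kmer_counts(seqs, k):
    """Count every overlapping kmer of length k across a list of sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def kmer_weights(bound, unbound, k, pseudo=1.0):
    """Log-odds weight per kmer: positive means enriched in bound sequences."""
    pos = kmer_counts(bound, k)
    neg = kmer_counts(unbound, k)
    kmers = set(pos) | set(neg)
    # Pseudocounts keep the log finite for kmers seen in only one class.
    pos_total = sum(pos.values()) + pseudo * len(kmers)
    neg_total = sum(neg.values()) + pseudo * len(kmers)
    return {
        km: math.log((pos[km] + pseudo) / pos_total)
            - math.log((neg[km] + pseudo) / neg_total)
        for km in kmers
    }


def score(seq, weights, k):
    """Sum the weights of all kmers in seq; unseen kmers contribute zero."""
    return sum(weights.get(seq[i:i + k], 0.0)
               for i in range(len(seq) - k + 1))


# Toy data (made up): ATCG appears only in bound sequences, AAAT only in unbound.
bound = ["GGATCGTT", "CATCGA"]
unbound = ["GGAAATCC", "TAAATG"]
weights = kmer_weights(bound, unbound, k=4)

print(weights["ATCG"] > 0, weights["AAAT"] < 0)
print(score("CCATCGCC", weights, 4) > score("CCAAATCC", weights, 4))
```

Scoring the reference and alternate versions of a sequence with `score` and taking the difference gives a variant effect estimate in the same spirit as the allele comparison described earlier.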

“Deep learning” using convolutional neural nets

A second class of methods relies on “deep learning” (every time I hear that word I can’t help but think about this tweet by Daniel MacArthur). Specifically, these methods rely on a technique called deep convolutional neural networks (CNNs), which I had admittedly never heard of until a couple of months ago, and which I should probably learn more about. CNNs can capture complicated nonlinear sequence features that might not be well captured by kmer-based approaches, and it seems we can even pull out the relevant features in a form we can learn some biology from. I really think these will be a valuable approach for many aspects of genomics going forward.
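For anyone else new to CNNs, here is a minimal sketch of what a single convolutional filter does to a DNA sequence: one-hot encode the bases, slide the filter along the sequence, apply a ReLU, and max-pool. I am assuming a hand-set filter for illustration; in a real network the filter weights are learned from data and many filters and layers are stacked. The names `one_hot`, `conv1d_relu`, and `motif_filter` are hypothetical.

```python
BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows (A, C, G, T)."""
    return [[1.0 if b == base else 0.0 for base in BASES] for b in seq]


def conv1d_relu(x, filt, bias=0.0):
    """Slide a (width x 4) filter along a one-hot sequence, then apply ReLU."""
    width = len(filt)
    out = []
    for i in range(len(x) - width + 1):
        act = bias
        for j in range(width):
            for c in range(len(BASES)):
                act += x[i + j][c] * filt[j][c]
        out.append(max(0.0, act))  # ReLU: negative activations become zero
    return out


# Hand-set filter (an assumption, not learned): it rewards exact matches to ATCG.
motif_filter = one_hot("ATCG")

# With bias -3.0, only a window matching all four bases survives the ReLU.
acts = conv1d_relu(one_hot("GGATCGTT"), motif_filter, bias=-3.0)
pooled = max(acts)  # max-pooling over positions

print(acts)    # [0.0, 0.0, 1.0, 0.0, 0.0]
print(pooled)  # 1.0
```

A filter like this behaves much like a position weight matrix scan, which is one reason the learned first-layer filters can often be read as motifs.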

Are these methods accessible to non-machine learning gurus?

I was happy to see that all of these studies made some or all of their methods available by providing source code, interactive web applications, or both. deltaSVM provides source code and precomputed models. GERV comes packaged in a Docker container and has extensive instructions on how to use it on the Amazon cloud. DeepSEA provides a webapp to perform in silico mutagenesis or predict the epigenetic state of a sequence, as well as the source code to do so. DeepBind has a webapp to visualize motif patterns discovered by its models, plus packaged binaries to perform predictions, although I didn’t see a way to train new models. Finally, Basset lives up to its claim of making all of this accessible, and I think it is the one I’ll likely invest in learning. Although it took a little effort to install all the dependencies, there is good documentation and wonderful IPython tutorials walking through many of the steps to train and apply the models, plus precomputed models from public datasets. Kudos to all of these studies for making these things accessible.


I think these methods have huge potential applications in human genetics. One criticism I’ve heard is that these just provide one more annotation to add to our list. But they are much more than that: we can now perform a single experiment that then allows us to directly predict the impact of any variant in a given cell type. Sure, there are still limitations: TF binding will not be the answer every time, and there are likely variants, regions, and annotations these methods simply won’t work well for because something else is at play. But I think these will prove to be valuable tools going forward as we begin to sift through all of these non-coding variants in much deeper ways than we’ve been able to with genetic associations alone.
