View on GitHub

The complicated world of splice QTLs


Summary: There is a lot of interest in analyzing the effect of genetic variants on splicing, but as far as I can tell there is no consensus on the best way to analyze “splice QTLs”. A variety of methods have been used, each answering slightly different questions.

We now have publicly available large scale RNA-sequencing datasets (e.g. from the Geuvadis and GTEx projects, but also from many others) from individuals that also have genotype chip or whole genome sequencing data available. As a result we can now start to identify genetic variants involved in regulating not just gene expression levels, but also in regulating splicing events. There have been a lot of studies in the last several years surveying genetic effects on transcriptomic variation (e.g. Lappalainen et al. 2013, Pickrell et al. 2010, Montgomery et al. 2010, Battle et al. 2013, Zhao et al. 2013, Monlong et al. 2014), but there is no consensus on the best way to analyze “splice QTLs” (sQTLs).

For one thing, while several good tools exist, it is generally agreed that accurately quantifying expression levels of different isoforms is not a solved problem. But another challenge is that unlike gene expression levels, which can be for the most part represented by a single value per gene, there is a confusing variety of ways one might represent the phenotype of an alternatively spliced gene. Each of these representations is best suited to identify differnent types of events, and each comes with its own set of advantages and disadvantages.

So what is the best method to analyze sQTLs? What do most splicing variation events look like? (single exon events? or are they more complex?) How does the sQTL method used affect which events are likely to be picked up? These are some of the questions I have as a relative newcomer to the RNA-seq/splicing world, and from what I can tell we still don’t know the answers.

Below I summarize the methods I’ve seen used in some recent papers and what I think are advantages and disadvantages of each. In my opinion there is no “best method”, and the method you choose is strongly dependent on the type of events you’re looking for or expect to find.

Methods to detect sQTLs

Each method for detecting QTLs has the same general outline: define a phenotype, test for association between that phenotype and the genotype of a nearby variant. For gene-level eQTLs, this is usually straightforward: the phenotype is the expression level of the gene (after some normalization, removing covariates, etc.), and the genotype is something like 0/1/2 for a biallelic variant. For analyzing sQTLs, the question is how do you define the phenotype to capture splicing variation?. Each method below does this slightly differently.

Exon expression level

One widely used phenotype is simply exon expression level. (i.e., for each exon, how many reads mapped to that exon, after normalizing for things like the length of the exon, not going to get into FPKM/RPKM/TPM here).

exon expression level

Then you would simply test for association between expression of each exon and each variant nearby. This is one of the methods used by Montgomery et al. and also in the Geuvadis Project paper.

Transcript ratio

Another commonly used phenotype is transcript ratio: for each isoform, what percent of the total transcripts from that gene come from that isoform?

transcript ratio

Then you test for association between ratio for each transcript and each variant nearby. This method is used in the Geuvadis Project paper and by Battle et al..

Percent exon inclusion (or PSI=percent spliced in)

This method is concerned with inclusion rates of single exons at a time: how often is a given exon included vs. excluded from a transcript?

PSI method

Then you test for association between the PSI of each exon and each nearby variant. Really this is asking: of isoforms that could contain this exon, what percent of them do? This method is used in Monlong et al. to compare to a different method they developed (see below). By design, it specifically captures exon skipping events, but not other types of variation in transcript structure. PSI is mentioned in the Geuvadis Project paper (Figure 4b), although they do not perform QTL analysis on these values. It also forms the basis of the GLIMMPs method by Zhao et al.

A variation on exon inclusion is used in Pickrell et al, where they use as a phenotype the fraction of reads mapped to each exon divided by all reads mapped to that gene. This captures a different set of events than “PSI”: for instance it would capture events involving switching between different transcripts that could differ by multiple exons, whereas PSI is very specific in that it captures only events that differ by exclusion specifically of a single exon.

Multivariate approach

Monlong et al. take a different approach (called sQTLseekeR) by creating a multivariate phenotype for each gene consisting of the relative abundance of each transcript. (See Figure 1 of their paper).

They then do a MANOVA like test to find associations between genotype and transcript composition. This is somewhat related to the transcript ratio approach, in that it is testing for variation in which of a known set of isoforms is being expressed, but it involves only one phenotype value per gene, rather than per isoform.


So what method is best to use? This somewhat depends on what types of splicing events we expect, or what type we hope to find. But what do most splicing events look like? Monlong et al. and the Geuvadis paper used AStalavista to determine the type of events represented by their QTLs. (see Figure 4 of Monlong et al. shown below, plus Figure 2b in the Geuvadis paper).

Splicing event composition, Figure 4 of Monlong et al

Both found only ~15% of events to be simple exon skipping, with most events looking more complex (“complex 3’ event”, “complex 5’ event”, alternative exon usage, modified UTRs, etc.). However, it should be noted that these events could be biased by the fact that they used existing transcript annotations for these analyses, which limits the set of events that can be detected. Therefore, this might not represent the true distribution of types of splicing events.

Overall my conclusion is that splicing analysis is complicated! We’re starting to make some progres, but we have a long way to go to make sense of it all.

comments powered by Disqus