Improving allele-specific transcript inference using genome graphs
Name of grant holder
Jonas Andreas Sibbesen
Amount
DKK 450,685
Year
2018
Grant type
Internationalisation Fellowships
What?
Current methods for analysing RNA-seq data are generally based on comparing the reads to a linear reference genome. However, this approach is biased towards the reference: transcripts from regions where the sequenced individual differs markedly from the reference are harder to analyse than transcripts from regions where the two are more similar. This is known as mapping bias, and it can affect downstream analyses such as allele-specific transcript expression estimation. The problem can be mitigated by comparing the reads to a genome graph that contains not only the linear reference but also known genetic variation. The aim of this project is therefore to develop a method that improves estimation of allele-specific expression by reducing mapping bias using genome graphs.
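The effect can be illustrated with a toy sketch in Python. It is not part of the project or of any toolkit: the sequences and the helper names (`mismatches`, `best_linear`, `best_graph`) are invented for illustration, and alignment is reduced to ungapped mismatch counting.

```python
def mismatches(read, target):
    """Fewest mismatches over all ungapped placements of read in target."""
    n = len(read)
    return min(
        sum(a != b for a, b in zip(read, target[i:i + n]))
        for i in range(len(target) - n + 1)
    )

# Hypothetical locus: the reference allele and an alternative haplotype
# that differs from it at two adjacent positions.
REF = "ACGTACGGTTCAGGCTA"
ALT = "ACGTACGGAACAGGCTA"

def best_linear(read):
    # Linear reference: only the reference allele is available.
    return mismatches(read, REF)

def best_graph(read):
    # Stand-in for a genome graph: both haplotype paths are available.
    return min(mismatches(read, hap) for hap in (REF, ALT))

ref_read = "CGGTTCAGG"  # read sampled from the reference allele
alt_read = "CGGAACAGG"  # read sampled from the alternative allele

for name, read in (("ref", ref_read), ("alt", alt_read)):
    print(name, "linear:", best_linear(read), "graph:", best_graph(read))
```

Against the linear reference only the read from the alternative allele is penalised, so a mapper with a mismatch cutoff would tend to discard it more often. Against both haplotype paths the two reads are treated alike.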
Why?
In allele-specific expression (ASE) analysis, the expression levels of genes or transcripts on the maternal and paternal alleles are estimated independently. It is therefore essential that there is no mapping bias between the alleles. Analyses of ASE can, among other things, be used to investigate genomic imprinting or the effect of genomic variation on the expression of genes and transcripts. Genomic imprinting is a specific type of ASE in which a gene is expressed from only one allele, depending on whether that allele is maternal or paternal; disruption of this process has been shown to be associated with developmental diseases and cancer. Methods that handle mapping bias better are therefore needed to obtain more sensitive estimates of ASE.
How?
This project will address the problem of RNA-seq mapping bias by developing a spliced genome graph reference structure. This graph will contain not only known variants and haplotypes, but also transcriptomic information, such as known splice sites and transcripts. Just as known haplotypes can be used to guide the mapping of genome sequencing reads, haplotype-specific transcripts will be used to improve the mapping of RNA-seq reads. The project will be based on extending the variation graph toolkit (vg) so that it can map and analyse RNA-seq data. vg is a collaborative effort to create a common framework for methods that work on genome graphs. The method will be benchmarked using simulations and long reads from Oxford Nanopore direct RNA sequencing.
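As a rough illustration of what a spliced genome graph holds, the following Python sketch builds a tiny graph with one variant site and one splice junction, then spells out the haplotype-specific transcripts as paths. It is a simplification under stated assumptions and does not use vg's data model or API: the node sequences, the `nodes`/`edges` dictionaries and the helpers `paths` and `spell` are all hypothetical.

```python
# Hypothetical node sequences: two exons, one intron, and a SNP in exon 1.
nodes = {
    1: "ACG",     # exon 1, start
    2: "T",       # exon 1, reference allele
    3: "A",       # exon 1, alternative allele
    4: "CC",      # exon 1, end
    5: "GTAAGT",  # intron
    6: "GGA",     # exon 2
}

# Directed edges. The edge 4 -> 6 is the splice junction that skips the intron.
edges = {
    1: [2, 3],
    2: [4],
    3: [4],
    4: [5, 6],
    5: [6],
    6: [],
}

def paths(start, end):
    """Enumerate all node paths from start to end (the graph is acyclic)."""
    if start == end:
        return [[end]]
    return [[start] + rest for nxt in edges[start] for rest in paths(nxt, end)]

def spell(path):
    """Concatenate the node sequences along a path."""
    return "".join(nodes[n] for n in path)

# Haplotype-specific transcripts: paths that take the splice edge (no intron node).
transcripts = sorted(spell(p) for p in paths(1, 6) if 5 not in p)
print(transcripts)
```

Each spelled path is one allele of the same transcript, so a read from either allele has an exact match in the graph, including reads that span the splice junction.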