Improving discovery and genotyping of structural variation using genome graphs

Name of grant recipient

Jonas Andreas Sibbesen

Amount

DKK 350,000

Year

2017

Grant type

Internationalisation Fellowships

What?

Current methods for genotyping structural variation from high-throughput sequencing data are generally based on comparing the reads to a linear reference genome. However, this approach is biased towards the reference: regions that differ markedly between the sequenced individual and the reference are harder to infer than regions that are more similar. Hence, predicting structural variants is generally much harder than predicting simpler single-nucleotide variants (SNVs). This problem can be mitigated by comparing the reads to a genome graph that contains not only the linear reference but also the millions of variants already known. The aim of this project is thus to develop a method that improves discovery and genotyping of structural variation by reducing reference bias using genome graphs.
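As a rough illustration of reference bias, the toy sketch below builds a tiny genome graph with one known insertion and checks whether a read matches exactly. The node sequences, the graph layout and the read are all made up for illustration, and exact substring matching stands in for real read alignment; this is not the project's method.

```python
from typing import Dict, List

# Hypothetical toy graph: node id -> sequence, node id -> successor node ids.
# Node 2 is the reference allele; node 3 is a known insertion allele.
NODES: Dict[int, str] = {1: "ACGT", 2: "A", 3: "GGGTTT", 4: "CCA"}
EDGES: Dict[int, List[int]] = {1: [2, 3], 2: [4], 3: [4], 4: []}
REFERENCE_PATH: List[int] = [1, 2, 4]


def path_sequence(path: List[int]) -> str:
    """Concatenate the node sequences along a path."""
    return "".join(NODES[node] for node in path)


def all_paths(start: int, end: int) -> List[List[int]]:
    """Enumerate every path from start to end (fine for a tiny acyclic graph)."""
    if start == end:
        return [[end]]
    return [[start] + rest for nxt in EDGES[start] for rest in all_paths(nxt, end)]


def read_matches(read: str, sequences: List[str]) -> bool:
    """Exact-match stand-in for alignment: is the read a substring of any sequence?"""
    return any(read in seq for seq in sequences)


reference = path_sequence(REFERENCE_PATH)
haplotypes = [path_sequence(p) for p in all_paths(1, 4)]

# A read sampled from an individual carrying the insertion allele.
read = "GTGGGTTTCC"
print("matches linear reference:", read_matches(read, [reference]))
print("matches genome graph:    ", read_matches(read, haplotypes))
```

In this toy case the read spans the insertion, so it has no exact match in the linear reference but does match one path through the graph.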

Why?

One region of the human genome where reference bias is especially problematic is the highly polymorphic Human Leukocyte Antigen (HLA) locus. This region contains many important genes involved in regulating the immune response, and structural variation in this region has been shown to be associated with the development of autoimmune diseases. In addition, many other diseases have been shown to be caused by structural variation. The inability to resolve the full spectrum of variation in an individual therefore leaves sequencing studies blind to potential disease-causing variants. Methods are thus needed that address the reference-bias problem and accurately predict structural variation.

How?

This project will expand on previous work, carried out during my PhD, which involved the development of a genotyping method based on genome graphs. This method already addresses many aspects of the reference-bias problem; however, it is currently not able to discover new variation. This is a problem, since many structural variants are still unknown. To solve this, I will develop an algorithm that performs local de novo assembly, using a genome graph built from known variants as a template. Importantly, this assembler will use information across individuals, thus increasing sensitivity for low-coverage data. The algorithm will be based on memory-efficient k-mer Bloom filters; however, other efficient indexing strategies, such as the Burrows-Wheeler transform, will also be explored.
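To make the idea of a k-mer Bloom filter concrete, here is a minimal sketch that pools k-mers from the reads of several individuals into one filter. The class name, the filter size, the number of hash functions, the choice of BLAKE2b as hash and the example reads are all assumptions made for illustration; they do not describe the project's actual implementation.

```python
import hashlib
from typing import Iterator


class KmerBloomFilter:
    """Illustrative Bloom filter: no false negatives, some false positives."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, kmer: str) -> Iterator[int]:
        # Derive several hash functions by salting BLAKE2b with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                kmer.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, kmer: str) -> None:
        for pos in self._positions(kmer):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, kmer: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(kmer))


def kmers(sequence: str, k: int) -> Iterator[str]:
    """Yield every overlapping substring of length k."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i : i + k]


# Hypothetical low-coverage reads from two individuals, pooled into one filter.
reads_by_individual = {
    "sample_A": ["ACGTGGGTTT", "GGGTTTCCA"],
    "sample_B": ["CGTGGGTTTC"],
}
K = 5
pooled = KmerBloomFilter()
for reads in reads_by_individual.values():
    for read in reads:
        for kmer in kmers(read, K):
            pooled.add(kmer)

print("GGGTT" in pooled)  # inserted k-mers are always reported as present
```

Because membership queries can return false positives but never false negatives, an assembler built on such a filter would need to tolerate occasional spurious k-mers in exchange for the low memory footprint.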
