The fusion of two genes can occur as the result of genomic rearrangements such as the breakage and re-joining of two different chromosomes, or from rearrangements within a chromosome (deletions, insertions, inversions). Gene fusions are a common event in the development of some cancers, particularly hematological (blood) cancers, sarcomas, and prostate cancer. Next-generation sequencing (NGS) has recently led to the discovery of gene fusions in some breast cancers, lung cancers, and other types of solid tumors.
Not all genetic fusion leads to cancer, but…
Fusion genes may be oncogenic because they fundamentally change the function of a protein: removing a regulatory domain so that the protein is permanently active, changing the cellular location or the target of the protein, or increasing the expression of a gene by placing it under a highly active promoter. Not all fusion genes lead to cancer, but some are potent oncogenes.
One well-known example of an oncogenic fusion gene is found on the Philadelphia chromosome, named after the city in which it was discovered in 1960. This chromosome results from a reciprocal translocation between chromosomes 9 and 22 that creates a fusion between the BCR and ABL1 genes. In chronic myelogenous leukemia, 95% of patients have this translocation, and it is also found at a lower frequency in other types of leukemia. The resulting fusion gene encodes a novel tyrosine kinase which is highly active in phosphorylating the interleukin-3 receptor, consequently speeding up cell division and inhibiting DNA repair.
Both sides of the map
NGS can be used to detect gene fusions using several different approaches. However, standard whole-genome sequencing with single-end reads from Illumina or SOLiD technology is an inefficient way to find fusions. To detect a fusion, a read must span the fusion breakpoint with enough bases on each side to map to both of the chromosomes on the reference genome. This is impossible with 36 bp reads, difficult with 50 bp reads, and inefficient with 100 bp reads. Whole-genome sequencing with paired-end reads can produce “discordant mate-pairs”, where the two ends of a single template DNA fragment map to different chromosomes.
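To make the read-length argument concrete, here is a rough back-of-the-envelope sketch. It counts how many placements of a single-end read across a breakpoint leave a minimum number of mappable bases on each side. The 20 bp anchor is purely an assumption chosen for illustration (the real minimum depends on the aligner and the genome), and `spanning_positions` is a hypothetical helper name.

```python
def spanning_positions(read_len: int, min_anchor: int) -> int:
    """Count placements of a read over a breakpoint that leave at least
    `min_anchor` bases on each side of it.

    A read of length L overlapping the breakpoint can have 1..L-1 bases
    on the left side; only placements with min_anchor..L-min_anchor
    bases on the left are usable.
    """
    return max(0, read_len - 2 * min_anchor + 1)


if __name__ == "__main__":
    ANCHOR = 20  # assumed minimum mappable anchor, for illustration only
    for read_len in (36, 50, 100):
        usable = spanning_positions(read_len, ANCHOR)
        total = read_len - 1  # all placements that overlap the breakpoint
        print(f"{read_len} bp reads: {usable} of {total} placements usable")
```

Under that assumed anchor, a 36 bp read can never anchor on both sides, a 50 bp read can in only a small share of placements, and even a 100 bp read wastes a good part of them.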
Breakdancing wit da NGS!
The software tools BreakDancer and SVDetect are designed to filter a set of reference-aligned NGS reads to find multiple discordant pairs that map to the same combination of distant genomic loci. This allows for simultaneous detection of insertions, deletions, and translocations. Unfortunately, these tools are subject to high levels of false positive detections due to the mis-mapping of reads to repeated sequences (e.g., simple sequence repeats, transposons, and segmental duplications) which occur in multiple locations on the genome.
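The general idea of looking for several discordant pairs that agree on the same two loci can be sketched in a few lines. This is not the BreakDancer or SVDetect algorithm, just a toy illustration: the `Pair` record, the 1 kb binning, and the `min_support` cutoff are all assumptions, and the sketch only considers pairs whose ends land on different chromosomes.

```python
from collections import Counter
from typing import Dict, Iterable, NamedTuple, Tuple


class Pair(NamedTuple):
    """Mapped positions of the two ends of one paired-end fragment."""
    chrom1: str
    pos1: int
    chrom2: str
    pos2: int


Locus = Tuple[str, int]


def cluster_discordant(pairs: Iterable[Pair], bin_size: int = 1000,
                       min_support: int = 2) -> Dict[Tuple[Locus, Locus], int]:
    """Group inter-chromosomal pairs by coarse genomic bins and keep
    bin combinations supported by at least `min_support` pairs."""
    counts: Counter = Counter()
    for p in pairs:
        if p.chrom1 == p.chrom2:
            continue  # this sketch ignores intra-chromosomal events
        a = (p.chrom1, p.pos1 // bin_size)
        b = (p.chrom2, p.pos2 // bin_size)
        # Sort the two ends so that A->B and B->A count as the same event.
        key = tuple(sorted((a, b)))
        counts[key] += 1
    return {k: n for k, n in counts.items() if n >= min_support}
```

A single discordant pair is weak evidence, since one mis-mapped read in a repeat is enough to produce it. Requiring several independent pairs in the same bins is the simplest guard, although repeats can still generate clusters of their own.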
Putting on the Top Hat…
TopHat Fusion is a new software package designed to detect fusion genes using RNA-seq data. It works by a combination of two methods. Firstly, individual reads are split into 25 bp segments, which are each mapped independently to the reference genome with the Bowtie alignment tool. Secondly, paired-end reads are tested for discordant alignment (mapping to different chromosomes). The sequences of putative fusion sites are stitched together from the aligned segments and all reads are re-tested for alignments with at least 13 bp aligned on both sides of the junction point.
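The two numbers in that description, 25 bp segments and a 13 bp anchor on each side of the junction, can be illustrated with a small sketch. This is not TopHat Fusion's code. The function names are hypothetical, and how the real tool treats a short trailing segment is not covered here (the sketch simply keeps it).

```python
from typing import List


def split_segments(read: str, seg_len: int = 25) -> List[str]:
    """Cut a read into consecutive segments of `seg_len` bases.
    The final segment may be shorter (an assumption of this sketch)."""
    return [read[i:i + seg_len] for i in range(0, len(read), seg_len)]


def spans_junction(read_start: int, read_len: int, junction: int,
                   min_anchor: int = 13) -> bool:
    """Check whether a read aligned to a stitched fusion sequence covers
    the junction with at least `min_anchor` bases on each side.

    Coordinates are 0-based; `junction` is the index of the first base
    to the right of the fusion point.
    """
    left = junction - read_start                 # bases left of the junction
    right = read_start + read_len - junction     # bases right of the junction
    return left >= min_anchor and right >= min_anchor
```

For example, a 50 bp read starting at position 0 supports a junction at position 13 (13 bases on the left, 37 on the right), but not one at position 12.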
…and some very smart filters
Then a number of very smart filters are implemented to eliminate most of the false positives that result from multiple mapping of repeated sequences. The sum of reads spanning the breakpoint and paired ends flanking it must meet some threshold. One side of the fusion junction must fall within a RefSeq annotated gene. A segment from each side of the junction point is used as a query for a BLAST search against the entire reference genome to further eliminate repeats. Finally, the 300 bp flanks on either side of the fusion point are tested for uniformity. The scoring scheme prefers alignments that have no gaps and uniform depth of coverage across this window.
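The first two filters are simple enough to sketch. The `Candidate` record and the `min_support` value of 5 below are assumptions for illustration, not TopHat Fusion's actual data structures or defaults, and the BLAST and coverage-uniformity steps are left out entirely.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    """A putative fusion junction (hypothetical record for this sketch)."""
    spanning_reads: int          # reads that cross the junction itself
    flanking_pairs: int          # pairs with one end on each side of it
    left_gene: Optional[str]     # annotated gene on the left side, if any
    right_gene: Optional[str]    # annotated gene on the right side, if any


def passes_filters(c: Candidate, min_support: int = 5) -> bool:
    """Apply the support and annotation filters described above."""
    enough_support = c.spanning_reads + c.flanking_pairs >= min_support
    annotated = c.left_gene is not None or c.right_gene is not None
    return enough_support and annotated
```

Cheap checks like these are best run first, so that the expensive BLAST search is only needed for the candidates that survive.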
TopHat Fusion requires a very powerful computing system (at least 40 GB of RAM and days of compute time on many processors) and a lot of hands-on time from skilled informatics technicians, so it is not suitable for every investigator with an RNA sample.
Do you work on Fusion Genes? We’d be interested to hear your experiences with any of the above software tools.