It is no news to anyone in the genomics field that, as we constantly produce and accumulate big data, we badly need to manage that data better. As we move towards personalized medicine, we need to integrate epigenetic, genomic and transcriptomic data. It is more important than ever to curate this data and to look beyond present needs.
Look to the stars
While most scientific fields are dealing with large-data problems, some have gained an edge over others. In a recent article in Genome Biology, Aaron Golden describes how we could gaze up to the stars for a solution. The field of astronomy has developed a consistent, workable method to handle and concisely analyze data.
The virtual observatory
The astronomers' solution was to develop a ‘virtual observatory’ (VO) where any researcher could store data. The researcher creating the data did not need to be tech savvy, because another team curated the data into the virtual observatory. To ensure efficiency, the virtual observatory relied on two key components: the Flexible Image Transport System (FITS), an easy-to-use imaging standard, and a publicly accessible data drive.
What is the parallel between astronomical data and sequencing data?
Data flows from the telescope to raw images, which are eventually transformed into significant data points and then mapped back to the sky as celestial coordinates. The parallel to the telescope is the sequencer: data flows from the sequencer to raw reads, which are then aligned and mapped to a locus on the chromosome. Voila. It sounds plausible, but there are significant hurdles to overcome.
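To make the sequencing half of that parallel concrete, here is a minimal sketch of the "raw read to locus" step. It is a toy, not how production aligners work: the reference sequences, the reads and the align_read helper are all made up for illustration, and it only finds exact matches, whereas real aligners use indexed, mismatch-tolerant search.

```python
from typing import Dict, Optional, Tuple

# Hypothetical toy reference: chromosome name -> sequence (not real genome data).
REFERENCE: Dict[str, str] = {
    "chr1": "ACGTACGTTAGCCGATAGGCTA",
    "chr2": "TTGACCGGATATCCGGTTAACG",
}


def align_read(read: str, reference: Dict[str, str]) -> Optional[Tuple[str, int]]:
    """Return (chromosome, 1-based start) of the first exact match, or None."""
    for chrom, seq in reference.items():
        offset = seq.find(read)  # exact substring search only
        if offset != -1:
            return chrom, offset + 1  # convert 0-based offset to 1-based position
    return None


# Hypothetical raw reads, as they might come off a sequencer.
raw_reads = ["TAGCCGAT", "GGATATCC", "AAAAAAAA"]

# Map each read to a locus, much as a data point is mapped to sky coordinates.
loci = {read: align_read(read, REFERENCE) for read in raw_reads}

for read, locus in loci.items():
    print(read, "->", locus)
```

The last read has no match and maps to None, a small reminder that not every raw observation ends up with a coordinate.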
Close to the machines
One step in that direction would be to bring the bioinformatics workflow closer to the sequencing machines. As we get better at this, and as the cost of sequencing drops, we may reach a point where we can skip the intermediate step of creating raw reads and convert the sequencer's output directly into significant data. That data could then feed a repository that is accessible to all.
One repository to connect them
Cloud computing can make this possible, together with established tools from the companies that build the sequencers themselves. BaseSpace from Illumina and the Torrent Browser from Ion Torrent are steps in this direction. While we have the tools, we still need a repository to connect them all. InterMine, a publicly available database, is one such integrative biology tool.
The hurdles we face
The challenges that will come out of this are many: outdated gene information, contradictory information from individual groups, and inconsistent identifiers. There is also an extensive cost involved, comparable to switching the whole world to the metric system. But when so much is spent on research and development and on bringing sequencers to the forefront of diagnostics, it is important to invest in organization. There will also be a need to set a universal standard for sequencing read quality and analysis parameters. This may be the biggest hurdle of them all.
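As a rough illustration of what a shared read-quality standard could look like in practice, the sketch below decodes a FASTQ quality string and checks it against a threshold. It assumes the common Phred+33 encoding, where each character encodes a score Q and the error probability is 10^(-Q/10). The cut-off of Q20 and the function names are hypothetical, chosen only for the example; agreeing on such values is exactly the hurdle described above.

```python
from typing import List


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a FASTQ quality string into Phred scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in quality]


def mean_error_probability(quality: str) -> float:
    """Average per-base error probability, using P = 10 ** (-Q / 10)."""
    scores = phred_scores(quality)
    return sum(10 ** (-q / 10) for q in scores) / len(scores)


def passes_standard(quality: str, min_mean_q: float = 20.0) -> bool:
    """Check a read against a hypothetical shared threshold on mean quality."""
    scores = phred_scores(quality)
    return sum(scores) / len(scores) >= min_mean_q


# Hypothetical quality strings: 'I' encodes Q40, '5' encodes Q20, '!' encodes Q0.
for qual in ["IIII", "5555", "!!!!"]:
    print(qual, phred_scores(qual), passes_standard(qual))
```

If every group applied the same decoding and the same threshold, reads deposited in a common repository would be directly comparable.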
The World Wide Web was the inspiration for developing the virtual observatory. Can the VO be ours?