Next generation sequencing opened the doors to our genome. It gives massive amounts of information in a week – whereas Sanger sequencing takes thrice as long, and causes lab lesions due to the abusive use of pipettes.
Indeed, with minimal hands-on procedures we obtain a lot of data. But nothing in Science is ever easy. So, of course this new amount of information is very interesting – but also very hard to analyze.
Thanks to massively parallel sequencing, commonly known as next generation sequencing (NGS), we can obtain whole exomes (and even genomes!) of multiple samples in a matter of days. This crazy amount of data poses a challenge for bioinformatics. Developing a streamlined, highly automated pipeline is therefore a critical step in analyzing NGS data.
Before the analysis, there are a couple of things you should check:
- Run quality metrics – most NGS platforms indicate how the run is going, and whether the library is yielding good sequencing results (error rate, sequence alignment); and
- Sequencing coverage metrics – the data that shows whether your target is being sequenced, and how many reads cover it (coverage). You should always aim for good coverage: with higher coverage you can have more confidence that what you see in your results is actually there. If you are not getting good coverage, something may not be working quite right, whether in the pool preparation protocol, at the lab bench, in the sequencing protocol, or in the sample itself. It means it’s time for some good old fashioned troubleshooting!
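To put rough numbers on coverage, here is a minimal Python sketch. It assumes the usual back-of-the-envelope estimate (total sequenced bases divided by target size) and a per-position depth list that you would get from your own coverage tool. The function names `mean_coverage` and `fraction_covered` are made up for this illustration.

```python
def mean_coverage(num_reads, read_length, target_size):
    """Expected average depth: total sequenced bases / target size (in bp)."""
    if target_size <= 0:
        raise ValueError("target_size must be positive")
    return num_reads * read_length / target_size


def fraction_covered(depths, min_depth):
    """Fraction of target positions with depth >= min_depth.

    `depths` is a list with one read-depth value per target position
    (hypothetical input; produce it with your own coverage tool).
    """
    if not depths:
        return 0.0
    return sum(d >= min_depth for d in depths) / len(depths)


# Example: 1 million reads of 100 bp over a 50 Mb target
print(mean_coverage(1_000_000, 100, 50_000_000))
# Example: how much of a tiny 4-position target reaches 20x?
print(fraction_covered([0, 10, 25, 40], 20))
```

The average alone can hide gaps, which is why it helps to also look at the fraction of positions above a minimum depth.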
After the run, you finally have your data. The first thing to do in your pipeline is to align, or map, your reads to your reference genome, in a key step aptly called “sequence alignment”.
However, it is not as easy as it looks. First, the read lengths are relatively short, between 36 and 250 bp, which increases the likelihood that a read can be mapped to multiple locations (and don’t even get me started on genes with pseudo-genes). Second, base quality may be low, which is relatively normal in NGS platforms, especially in repetitive regions; it does mean that the reads contain higher rates of sequencing error. The third complication presented by NGS platforms is the sheer volume of data. A single run produces millions of sequencing reads, so you need a lot of computational power to process all of this information without any hiccups on the way. A laptop or a normal bench computer might not be powerful enough to handle it.
Several alignment programs are available (some free, some commercial) that you can use to make your life easier.
Problems When Mapping Multi-reads
So, what are multi-reads? Well, the genome is very, very big, and our data is composed of very small fragments. To make matters worse, our genome is not made up entirely of unique sequence: many nucleotide sequences are exactly like one another (or differ only slightly). Between pseudo-genes, repetitive sequences, and just sheer similarity between sequences, mapping our data to our genome may be a headache. The reads that map to multiple locations are often called multi-reads.
Multi-reads can be a problem because they influence downstream analyses (such as SNP calling) that rely on the unique regions that flank the repeats.
So, our software has some options: ignore all the multi-reads; assign each read to the location of its best alignment; or report all alignments up to a maximum number. None of these methods is always correct, and you must choose depending on the question you are asking of your data. Remember that your whole experiment should revolve around your data, and what you want to know. If you want to discover a mutation in a tumor sample, you expect it to be poorly represented, so any small deviation is important! However, if you are looking for an inherited (germline) mutation, it is expected in roughly 50% of the reads or more, which gives you more room to work with.
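To make those options concrete, here is a small Python sketch of three multi-read policies. It is only an illustration, not how any particular aligner implements them: the tuple layout `(read_id, chrom, pos, score)`, the policy names and the function `resolve_multireads` are all assumptions made up for this example.

```python
from collections import defaultdict


def resolve_multireads(alignments, policy="best", max_hits=3):
    """Apply a multi-read policy to (read_id, chrom, pos, score) tuples.

    policy: "ignore" drops every read with more than one alignment,
            "best"   keeps only the highest-scoring alignment
                     (ties go to the first one seen),
            "all"    keeps up to max_hits alignments, best first.
    Reads with a single alignment are always kept.
    """
    if policy not in ("ignore", "best", "all"):
        raise ValueError("unknown policy: " + policy)

    # Group the alignments by read
    by_read = defaultdict(list)
    for aln in alignments:
        by_read[aln[0]].append(aln)

    kept = []
    for hits in by_read.values():
        if len(hits) == 1:
            kept.extend(hits)                       # uniquely mapped read
        elif policy == "best":
            kept.append(max(hits, key=lambda a: a[3]))
        elif policy == "all":
            ranked = sorted(hits, key=lambda a: a[3], reverse=True)
            kept.extend(ranked[:max_hits])
        # policy == "ignore": multi-reads are simply skipped
    return kept


alignments = [
    ("r1", "chr1", 100, 60),   # unique read
    ("r2", "chr1", 500, 30),   # r2 maps to three places
    ("r2", "chr7", 900, 42),
    ("r2", "chrX", 50, 10),
]
for policy in ("ignore", "best", "all"):
    print(policy, resolve_multireads(alignments, policy, max_hits=2))
```

Running the same toy input through each policy shows how much the choice changes what reaches the downstream steps.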
Identifying Redundant Sequences
It is very important to identify redundant sequences: duplicate reads that result from PCR amplification. These duplicates must be removed before variant calling! Remember also that PCR amplification may introduce errors. If your PCR introduces one little nucleotide change and then amplifies it, something new ends up massively represented in your sequence, and your variant detection becomes less reliable. It is very difficult to recognize sequences that were altered by PCR, so they represent a big risk in your experiment (especially if you are looking for something that is not well represented in your sample). A safe way to catch them is to remove duplicates and/or run Sanger sequencing on the same sample (always remember to check your PCR primers to make sure there isn’t any chance of allele dropout, but that’s another topic altogether!). If you are confident in your Sanger sequencing, the change doesn’t show up in your electropherogram, and after you remove the duplicates it appears in only one read, you can be more confident that it was only a PCR artifact.
So, whenever several reads share the same sequence, start site and orientation, they may all come from the same unique DNA fragment, and you must remove the extra copies before continuing the analysis.
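As a rough illustration of that rule, the Python sketch below treats reads with the same chromosome, start site, orientation and sequence as duplicates and keeps one copy. The dictionary fields (`chrom`, `start`, `strand`, `seq`, `mean_qual`) and the choice to keep the copy with the highest mean quality are assumptions for this example; dedicated duplicate-marking tools use their own, more elaborate criteria.

```python
def remove_duplicates(reads):
    """Keep one read per (chrom, start, strand, seq) group.

    Within a group of duplicates, the read with the highest
    mean base quality is kept (first one wins on ties).
    """
    best = {}
    for read in reads:
        key = (read["chrom"], read["start"], read["strand"], read["seq"])
        if key not in best or read["mean_qual"] > best[key]["mean_qual"]:
            best[key] = read
    return list(best.values())


reads = [
    {"name": "a", "chrom": "chr1", "start": 100, "strand": "+", "seq": "ACGT", "mean_qual": 30},
    {"name": "b", "chrom": "chr1", "start": 100, "strand": "+", "seq": "ACGT", "mean_qual": 35},
    # Same start and sequence, opposite orientation: not a duplicate of a/b
    {"name": "c", "chrom": "chr1", "start": 100, "strand": "-", "seq": "ACGT", "mean_qual": 20},
]
print([r["name"] for r in remove_duplicates(reads)])
```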
Genotyping and SNP Detection
You have just mapped your whole dataset, but your work is not done yet. Still, if all the upstream work was done correctly, genotyping and SNP detection are much easier.
The next step in the computational pipeline is to call SNPs using a program such as GATK, SAMtools, SOAPsnp or VarScan. These programs detect differences between your reads and the reference genome and flag them.
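To show the basic idea of comparing reads with the reference, here is a deliberately naive Python sketch. It is not how the programs named above work (real callers rely on statistical models and base qualities); the function `call_variant` and its `min_depth` and `min_fraction` thresholds are assumptions invented for this example.

```python
from collections import Counter


def call_variant(ref_base, pileup_bases, min_depth=10, min_fraction=0.2):
    """Naive single-position call from the bases observed at that position.

    Returns (alt_base, allele_fraction) for the most common
    non-reference base, or None if depth is too low or no
    alternative base reaches min_fraction.
    """
    depth = len(pileup_bases)
    if depth < min_depth:
        return None                      # not enough coverage to say anything
    counts = Counter(b for b in pileup_bases if b != ref_base)
    if not counts:
        return None                      # every read matches the reference
    alt, n = counts.most_common(1)[0]
    fraction = n / depth
    return (alt, fraction) if fraction >= min_fraction else None


# Half the reads carry G: looks like a heterozygous variant
print(call_variant("A", "A" * 10 + "G" * 10))
# Low-fraction variant: only reported if the threshold is lowered
print(call_variant("A", "A" * 95 + "T" * 5))
print(call_variant("A", "A" * 95 + "T" * 5, min_fraction=0.03))
```

The threshold is where your experimental question comes in: a poorly represented variant needs a lower cutoff and, in turn, cleaner upstream data.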
Now your challenge is to accurately identify these variants: are they polymorphisms? Mutations? Were they described before? If so, what was the category assigned to them? How many times were they found in a normal population?…
What I am trying to say is, you still have a lot of work to do. But I can assure you that if you have a good pipeline upstream, your variant identification process will go more smoothly. And even better, you won’t second guess your variants, because they were obtained in the best way possible.
And good luck!

Image Credit: Pino D'Amico