Garbage In, Garbage Out? Quality Control of Your NGS Data

Written by: Derek Bickhart

Last updated: April 11, 2022

So, you’ve just received a call from the core facility that you hired to prepare and sequence your libraries. The facility director tells you that the sequence data from your next generation sequencing (NGS) experiment does not look good. You panic and, perhaps, let loose a scream of frustration—aaarrrrggghhhh! This project was going to be your primary data source for a big Nature paper that you’ve always dreamed about! Can you assess the problem and make some corrections to salvage some or all of the sequence data?!

What Was Your Starting Material?

Or, put another way, how did you isolate the DNA? Did you do a whole-blood DNA extraction? A genomic DNA extraction from microalgal or plant cells? Or perhaps your DNA isolation was done on yeast. The first step in troubleshooting your experiment is to look at the starting quality of your DNA. Starting with high-quality DNA is arguably the most important part of any experiment, and you must remove any sample contaminants that can inhibit NGS. For example, the anticoagulants often added to whole-blood samples can inhibit PCR and thereby interfere with your NGS results. Certain sample preservation methods, such as formalin-fixed, paraffin-embedded (FFPE) tissue, also make it challenging to recover high-quality DNA. The amount of starting material matters too, because you need enough of it to yield sufficient DNA for NGS. You can overcome these challenges by using methods or kits designed for these sample types. Without high-quality DNA, your NGS experiments will likely be a bust. Always (always!) check the purity of your isolated DNA before going forward. As computer programmers like to say: "GIGO" (garbage in, garbage out).
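If you are screening many preps at once, a quick script can flag suspect samples. Here is a minimal sketch based on the usual spectrophotometric rules of thumb (an A260/A280 ratio near 1.8 and an A260/A230 ratio of roughly 2.0 or above suggest clean DNA); the sample names, readings, and thresholds below are illustrative, not from this article:

    # Hypothetical helper: flag DNA preps whose absorbance ratios suggest contamination.
    def purity_flags(a260, a280, a230):
        flags = []
        if a260 / a280 < 1.8:       # protein or phenol carry-over suppresses A260/A280
            flags.append("low A260/A280")
        if a260 / a230 < 2.0:       # salts, phenol, or guanidine absorb near 230 nm
            flags.append("low A260/A230")
        return flags or ["looks clean"]

    # Illustrative spectrophotometer readings (A260, A280, A230) per sample:
    samples = {"blood_01": (1.20, 0.64, 0.55), "leaf_02": (0.90, 0.52, 0.60)}
    for name, (a260, a280, a230) in samples.items():
        print(name, purity_flags(a260, a280, a230))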

What Was the Sequencing Platform? Help May Be at Hand…

If you used ABI's SOLiD™, Illumina's HiSeq™, or Pacific Biosciences' sequencer (informally called the 'PacBio'), you can get initial quality estimates on your data in several ways. Most modern sequencing platforms generate quality estimates at runtime, so you can often retrieve quality assessments directly from the core facility itself. The HiSeq™ reports per-base quality scores and run-level quality metrics automatically, and the PacBio comes with a series of proprietary software tools for assessing its data. If the core facility has deleted the run quality metrics, or you cannot gain access to them, there is an open-source way to check the quality of your sequence data: FastQC. Developed at the Babraham Institute, FastQC gives you a wide range of quality metrics on your data. The program accepts FASTQ files (the raw output of many sequencing platforms) as well as several common sequence alignment formats, so a wide range of platforms is supported.
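For example, you can run FastQC over a whole directory of raw files with a few lines of scripting. This is a minimal sketch that assumes FastQC is installed and on your PATH; the raw_data and qc_reports directory names are placeholders:

    import subprocess
    from pathlib import Path

    # Write one FastQC report per raw FASTQ file into qc_reports/.
    reports = Path("qc_reports")
    reports.mkdir(exist_ok=True)
    for fastq in sorted(Path("raw_data").glob("*.fastq.gz")):
        subprocess.run(["fastqc", str(fastq), "-o", str(reports)], check=True)

Each run produces an HTML report you can open in a browser, summarizing per-base quality, GC content, adapter contamination, and other metrics.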

What Types of Problems Are You Looking For?

Many sequence data problems are platform-dependent. For example, the PacBio has a high (but consistent) error rate, the HiSeq™ preferentially amplifies high-GC regions, and Ion Torrent™ machines have trouble with long stretches of a single nucleotide (homopolymer repeats). If your data is in FASTQ format, each base of each read should have an assigned quality score. Assessing these scores on a read-by-read basis (using the machine's automatic quality reports or a FastQC analysis) should give you an idea of what may have happened to your data. If the quality scores from a HiSeq™ run suddenly drop by 20 points in the last few bases of read 2 of each pair, there was likely an issue with the machine or reagents at that point in the run. Check with your core facility, or contact the technical support hotline for your sequencing platform, if you find issues that are difficult to interpret.
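If you want to eyeball a drop-off yourself, a short script can compute the mean Phred quality at each read position. This sketch assumes 4-line FASTQ records with the common Phred+33 quality encoding; the file name is a placeholder:

    import gzip

    def mean_quality_per_position(path, max_reads=10000):
        # Mean Phred score at each read position, assuming 4-line FASTQ
        # records and Phred+33 encoding (standard for modern Illumina data).
        totals, counts = [], []
        opener = gzip.open if path.endswith(".gz") else open
        with opener(path, "rt") as handle:
            for i, line in enumerate(handle):
                if i % 4 != 3:                      # quality string is line 4 of each record
                    continue
                for pos, char in enumerate(line.rstrip("\n")):
                    if pos == len(totals):
                        totals.append(0)
                        counts.append(0)
                    totals[pos] += ord(char) - 33   # decode Phred+33
                    counts[pos] += 1
                if (i + 1) // 4 >= max_reads:       # a sample of reads is enough for a quick look
                    break
        return [t / c for t, c in zip(totals, counts)]

    # Flag positions whose mean quality falls below Q20 (e.g. a drop-off late in read 2):
    for pos, q in enumerate(mean_quality_per_position("sample_R2.fastq.gz"), start=1):
        if q < 20:
            print(f"position {pos}: mean quality {q:.1f}")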

What Can You Do to Save Your Data?

If your reads have a large quality-score drop-off at the end of each sequence, you can trim the last few bases from every read in the file. If your dataset contains large numbers of low-quality reads, you can filter them out by quality score. Many platforms offer proprietary tools for correcting their own data files, and there is also an open-source option: the FASTX-Toolkit. It lets you trim and filter raw FASTQ files before you begin aligning the data to your reference sequence. If you are planning to use your data for any type of assembly project, you may need more drastic error correction! Assembly requires 'pristine' data to work correctly, so many assembly programs come with their own error-correction toolkits. Expect to lose some sequence data after running a packaged error-correction toolkit, regardless of your data quality. And while most quality assessment is done after a problem occurs, checking your sequence data for errors should become a routine step in your analysis pipeline. It may just help you through the first round of review for that Nature paper you're planning!
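As a concrete illustration of that trimming and filtering step, here is a minimal pure-Python sketch, roughly analogous to what the FASTX-Toolkit's fastx_trimmer and fastq_quality_filter tools do. It assumes 4-line FASTQ records with Phred+33-encoded quality scores, and the file names and thresholds are placeholders:

    import gzip

    def trim_and_filter(in_path, out_path, keep_bases=90, min_q=20, min_fraction=0.8):
        # Keep the first `keep_bases` bases of each read, then drop reads where
        # fewer than `min_fraction` of the remaining bases reach Phred quality `min_q`.
        opener = gzip.open if in_path.endswith(".gz") else open
        kept = dropped = 0
        with opener(in_path, "rt") as fin, open(out_path, "w") as fout:
            while True:
                record = [fin.readline() for _ in range(4)]
                if not record[3]:                   # end of file
                    break
                header, seq, plus, qual = (field.rstrip("\n") for field in record)
                seq, qual = seq[:keep_bases], qual[:keep_bases]
                scores = [ord(c) - 33 for c in qual]
                good = sum(1 for q in scores if q >= min_q)
                if scores and good / len(scores) >= min_fraction:
                    fout.write(f"{header}\n{seq}\n{plus}\n{qual}\n")
                    kept += 1
                else:
                    dropped += 1
        print(f"kept {kept} reads, dropped {dropped}")

    trim_and_filter("sample_R2.fastq.gz", "sample_R2.trimmed.fastq")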

Further Resources:

FastQC: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
FASTX-Toolkit: https://hannonlab.cshl.edu/fastx_toolkit/

Derek is a US postdoctoral fellow working with next-generation sequencing data derived from many livestock and domestic animal species.
