How to Improve Your WGS DNA Library

In whole genome sequencing (WGS) initiatives, sequencing the full length of the genomic DNA sample just once is not enough, because genomes are usually very large. The human genome, for example, contains approximately 3 billion base pairs. Although sequencing accuracy for individual bases is very high, at that scale even an error rate of 1 in 1,000 bases results in about 3 million erroneous base calls in the genomic data. Moreover, the goal of most WGS efforts is to detect rare single nucleotide polymorphisms (SNPs) and point mutations in the genomic DNA. For example, various types of cancer¹ and neurodegenerative disease² are driven by single nucleotide variants. To distinguish such biological variation from artefactual sequencing errors, individual genomes are sequenced multiple times, which increases the accuracy of the base call at each position.
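The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not a model of any particular sequencer: the function names are hypothetical, and the second function assumes that errors are independent between reads, which the article does not state. Because it counts every erroneous read regardless of whether the wrong bases agree, it gives a conservative (upper-bound) estimate of how often a majority of reads would be wrong at one position.

```python
from math import comb


def expected_errors(genome_size_bp: int, per_base_error_rate: float) -> float:
    """Expected number of erroneous base calls in one pass over the genome."""
    return genome_size_bp * per_base_error_rate


def prob_majority_wrong(depth: int, per_base_error_rate: float) -> float:
    """Probability that more than half of `depth` reads are wrong at one position.

    Assumption (not from the article): errors are independent between reads.
    """
    e = per_base_error_rate
    threshold = depth // 2 + 1  # smallest count that is a strict majority
    return sum(
        comb(depth, k) * e**k * (1 - e) ** (depth - k)
        for k in range(threshold, depth + 1)
    )


if __name__ == "__main__":
    # Human-sized genome, 1 error per 1,000 bases (the figures used in the text).
    print(expected_errors(3_000_000_000, 1 / 1000))
    for depth in (1, 5, 30):
        print(depth, prob_majority_wrong(depth, 1 / 1000))
```

Under these assumptions, the chance that errors outvote the true base drops steeply as depth increases, which is the reason for sequencing each genome more than once.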

The number of times the entire genome, or a given reference position within it, is sequenced in a WGS initiative is called the coverage, read coverage, fragment count, or depth of sequencing. Shallow or low-coverage WGS refers to 0.1x to 0.2x sequencing coverage and is useful for detecting structural and copy number variations. Deep sequencing, which reads a whole genome sample approximately 30 times or more, is crucial for detecting single nucleotide variations (SNVs), including rare polymorphisms and point mutations, with high confidence. The high-throughput systems available from Illumina™ include the HiSeq™ series (HiSeq 2500, HiSeq 3000, HiSeq 4000, and HiSeq X) as well as the more recent NovaSeq™ 6000 system. These sequencing systems can flexibly sequence a large variety of genomes at coverages suitable for the desired application.
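Mean coverage is commonly estimated as the number of reads multiplied by the read length, divided by the genome size. The sketch below applies that relationship in both directions. The function names are hypothetical, the 150 bp read length is an illustrative assumption rather than a figure from this article, and the estimate ignores reads lost to filtering, duplication, or failed alignment.

```python
import math


def mean_coverage(num_reads: int, read_length_bp: int, genome_size_bp: int) -> float:
    """Estimated mean coverage: total sequenced bases divided by genome size."""
    return num_reads * read_length_bp / genome_size_bp


def reads_needed(target_coverage: float, read_length_bp: int, genome_size_bp: int) -> int:
    """Number of reads required to reach a target mean coverage (rounded up)."""
    return math.ceil(target_coverage * genome_size_bp / read_length_bp)


if __name__ == "__main__":
    genome = 3_000_000_000  # approximate human genome size in base pairs
    read_len = 150          # assumed read length, for illustration only

    print(reads_needed(30, read_len, genome))    # deep sequencing at 30x
    print(reads_needed(0.1, read_len, genome))   # shallow sequencing at 0.1x
    print(mean_coverage(600_000_000, read_len, genome))
```

For paired-end runs, each read of a pair counts separately in `num_reads`.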

Several studies show that a high GC percentage in a genomic region results in low sequencing depth.³ This dependence of sequencing depth on the density of GC bases in a segment of DNA is described as GC coverage bias, or GC bias. Understandably, strong GC bias affects sequencing data quality scores and skews data interpretation, particularly when the analysis focuses on detecting rare SNVs, copy number variations (CNVs), or insertions and deletions (indels).
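One simple way to look for GC bias is to compute the GC fraction of fixed-size windows along a reference sequence and compare the mean depth of windows in each GC bin with the overall mean depth. The sketch below is a hedged illustration of that idea only: the function names, window size, and bin count are assumptions, and the per-window depths are expected to come from your own alignment data.

```python
from collections import defaultdict


def gc_fraction(sequence: str) -> float:
    """Fraction of called bases (A, C, G, T) that are G or C."""
    seq = sequence.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0  # window contains only N or other ambiguous symbols
    return (seq.count("G") + seq.count("C")) / called


def windowed_gc(sequence: str, window_size: int) -> list[float]:
    """GC fraction of consecutive, non-overlapping windows."""
    return [
        gc_fraction(sequence[start:start + window_size])
        for start in range(0, len(sequence), window_size)
    ]


def normalized_depth_by_gc(gc_values, depths, num_bins=10):
    """Mean depth per GC bin divided by the overall mean depth.

    Values near 1.0 in every bin suggest even coverage; values well below
    1.0 in high-GC bins are consistent with GC bias.
    """
    overall_mean = sum(depths) / len(depths)
    bins = defaultdict(list)
    for gc, depth in zip(gc_values, depths):
        index = min(int(gc * num_bins), num_bins - 1)  # gc == 1.0 goes in last bin
        bins[index].append(depth)
    return {
        index: (sum(values) / len(values)) / overall_mean
        for index, values in sorted(bins.items())
    }


if __name__ == "__main__":
    toy_reference = "ATATATATGCGCGCGC"  # toy sequence, not real data
    toy_depths = [40, 20]               # made-up mean depth per window
    gc = windowed_gc(toy_reference, 8)
    print(gc)
    print(normalized_depth_by_gc(gc, toy_depths))
```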

A variety of factors affect GC bias, including the DNA shearing method, ligation efficiency, and PCR amplification. For example, non-uniform physical or enzymatic shearing of DNA during library preparation can result in fragment length bias. The method used to attach adaptors to the ends of DNA fragments can affect both the quality and the quantity of the mapped reads.

Additional bias can be introduced when DNA fragments are amplified using PCR, because some fragments in the library can be preferentially enriched over others during amplification. The conditions of DNA library construction are therefore a crucial determinant of GC bias. Although PCR amplification is a major source of GC bias, improved PCR protocols using optimized conditions have reduced amplification bias. Optimizing the thermocycler and its temperature ramp rate, and lengthening the denaturation phase to allow complete denaturation of strongly paired GC-rich regions, were found to significantly improve evenness of coverage. PCR-based and PCR-free Invitrogen™ Collibri™ PS DNA Library Prep Kits for high-throughput Illumina systems provide the most even coverage for DNA input amounts ranging from 1 to 1,000 ng. Moreover, in these kits, PCR can be considered optional because adaptor ligation does not require PCR amplification.

An essential aspect of optimizing a WGS workflow is to consider the biases that may occur and determine how to avoid or minimize their influence on the final data readout. With the development of the Invitrogen™ Collibri™ PCR-Free PS DNA Library Prep Kit for Illumina Systems and the Invitrogen™ Collibri™ PS DNA Library Prep Kit for Illumina Systems, it is possible to reduce these biases.