Reducing GC Bias in WGS: Moving Beyond PCR

Whole-genome sequencing (WGS) technologies have progressed significantly since the completion of the Human Genome Project in 2003. First-generation Sanger sequencers were limited by lengthy run times, high costs, and a throughput of only tens of kilobases per run. The arrival of second-generation sequencers in the mid-2000s sharply reduced sequencing costs and run times and raised throughput to hundreds of gigabases per run.

However, as genomic knowledge deepens and fast, inexpensive high-throughput WGS becomes more accessible, the need to identify and remove obstacles to clean, accurate sequencing data has grown. In the genomic landscape, where a single base change against a backdrop of billions of bases can be the difference between health and disease, accuracy is paramount. Sequencing data can deviate from the ideal uniform distribution of reads because of biases inherent in the sequencing technology. Key among these obstacles in high-throughput WGS are genomic regions with extremely low or high percentages of guanine and cytosine (GC) bases, homopolymeric regions, and coding and non-coding regions in which motifs shorter than ten nucleotides are repeated extensively in tandem.

PCR amplification is itself a major source of errors in sequencing data and can skew read coverage (i.e., the number of times a given region of the reference genome is sequenced). Cutting-edge sequencing technology works around this issue by eliminating the need for amplification altogether. For example, the Invitrogen™ Collibri™ PCR-Free PS DNA Library Prep Kit for Illumina Systems gives a median read length of approximately 350 to 550 bp and even read coverage across genomic windows with varying GC percentages. Overall, PCR-free technology in WGS offers major advantages, including longer read lengths, higher consensus accuracy at deep sequencing coverages of 30x or greater, and evenness of coverage throughout the genome.
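As a rough illustration of the coverage figures mentioned above, the sketch below estimates mean depth from read count, read length, and genome size, and summarizes evenness as a coefficient of variation across windows. The function names, the 150 bp read length, and the rounded 3.1 Gb genome size are illustrative assumptions, not values taken from any kit or instrument specification.

```python
import statistics

def mean_depth(n_reads, read_length_bp, genome_size_bp):
    """Average coverage depth: total sequenced bases divided by genome size."""
    return n_reads * read_length_bp / genome_size_bp

def reads_for_depth(target_depth, read_length_bp, genome_size_bp):
    """Smallest whole number of reads that reaches the target mean depth."""
    total_bases = target_depth * genome_size_bp
    return -(-total_bases // read_length_bp)  # integer ceiling division

def coverage_cv(window_depths):
    """Coefficient of variation of per-window depth; 0.0 means perfectly even.

    Returns None when the mean depth is zero, since the ratio is undefined.
    """
    mean = statistics.mean(window_depths)
    if mean == 0:
        return None
    return statistics.pstdev(window_depths) / mean

# Hypothetical inputs: ~3.1 Gb genome, 150 bp reads, 30x target.
GENOME_SIZE_BP = 3_100_000_000  # assumed round figure
READ_LENGTH_BP = 150            # assumed read length
n_reads = reads_for_depth(30, READ_LENGTH_BP, GENOME_SIZE_BP)
```

A lower coefficient of variation indicates more uniform coverage. In practice, per-window depths would come from an alignment file rather than a hand-written list.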

GC bias is the tendency of sequencing technologies to yield low read coverage, or fragment count, in GC-rich regions of the genome. Many biologically important regions of the human genome (such as promoters and coding regions) are GC rich. Moreover, GC content is heterogeneously distributed throughout the genome and frequently correlates with function, making it harder to distinguish GC bias from a true biological signal. Estimating the GC effect is further complicated because the bias cannot be eliminated by using larger bins or subsets of genomic regions; it persists even in sequence bins of 10 kb or greater. These difficulties in estimating and reducing GC bias demand special consideration during protocol design and quality control to maintain a high level of accuracy in WGS studies.
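One common way to inspect GC bias during quality control is to group genomic windows by GC content and compare each group's mean read count with the overall mean. The sketch below is a minimal version of that idea, assuming the input is a list of (window sequence, read count) pairs. The function names, the ten-bin default, and the toy data are hypothetical and are not part of any particular pipeline.

```python
from collections import defaultdict

def gc_counts(seq):
    """Return (number of G/C bases, number of unambiguous A/C/G/T bases)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return gc, gc + at

def normalized_coverage_by_gc(windows, n_bins=10):
    """Map each GC bin index to (mean read count in bin) / (overall mean).

    windows: iterable of (sequence, read_count) pairs.
    Bin i covers GC fractions from i/n_bins up to (i+1)/n_bins; a GC
    fraction of exactly 1.0 is placed in the top bin. Windows with no
    A/C/G/T bases (for example, all N) are skipped.
    """
    per_bin = defaultdict(list)
    for seq, read_count in windows:
        gc, called = gc_counts(seq)
        if called == 0:
            continue
        idx = min(gc * n_bins // called, n_bins - 1)  # integer math avoids float edge cases
        per_bin[idx].append(read_count)
    all_counts = [c for counts in per_bin.values() for c in counts]
    if not all_counts or sum(all_counts) == 0:
        return {}
    overall_mean = sum(all_counts) / len(all_counts)
    return {idx: (sum(counts) / len(counts)) / overall_mean
            for idx, counts in sorted(per_bin.items())}

# Toy data (hypothetical): AT-only, two balanced, and GC-only windows.
toy_windows = [
    ("AATTAATTAT", 20),
    ("ACGTACGTAC", 30),
    ("ACGTACGTAG", 30),
    ("GCGCGCGCGC", 20),
]
profile = normalized_coverage_by_gc(toy_windows)
```

Values near 1.0 in every bin suggest even coverage across GC content, while values well below 1.0 in the extreme bins are the pattern usually attributed to GC bias. Because GC content also correlates with function, such a profile is a diagnostic rather than proof of a technical artifact.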

Reducing GC bias is critical for removing obstacles that impair biological and biomedical applications and for improving coverage in biologically significant genomic hot spots. It also improves assemblies of human and model organism genomes without increasing sequencing depth.