It took scientists a little while to warm up to long-read sequencing, but now you couldn’t pry most of them away from their sequencers with a crowbar. Long reads — we’re talking 10,000 bases and more — provide a level of contiguity and completeness in genome assemblies that simply isn’t possible with short-read sequencers. They can reveal full structural variants and accurately represent long, repetitive regions that flummox their short-read counterparts.

For example, scientists sequencing microbial genomes have discovered that they can often generate fully closed assemblies with long reads, representing the whole genome in a single contig. With more complex organisms, it’s not uncommon to hear about assemblies that have one contig to represent each chromosome. With short reads, assemblies are far more fragmented, split into hundreds or even thousands of small pieces that are difficult to place in the correct order and orientation.

There are two vendors in long-read sequencing today: PacBio and Oxford Nanopore Technologies (ONT). Others are waiting in the wings. Scientists using either of these platforms don’t want just long reads; they want the longest reads. And that’s where automated DNA size selection comes in.

Long-read sequencers are limited most by the length of the fragments fed into them. You can have a machine capable of producing 100,000-base reads, but if you load only 500-base DNA fragments, you can’t get the benefit of long-read data. In some cases, these sequencers preferentially sequence smaller fragments, so even if you had a mix of long and short fragments in your library, you’d wind up with much shorter average read lengths than the instrument is capable of producing.
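To make the loading effect concrete, here is a minimal sketch in Python. It assumes, purely for illustration, that a fragment's chance of being sequenced is inversely proportional to its length. The real bias depends on the platform and chemistry, and the function name and fragment sizes here are hypothetical.

```python
def expected_read_length(fragment_lengths, weight=lambda length: 1.0 / length):
    """Expected read length when fragments are sampled with a length-dependent weight.

    The default weight (1 / length) is an illustrative assumption, not a
    measured property of any sequencer.
    """
    weights = [weight(length) for length in fragment_lengths]
    total_weight = sum(weights)
    weighted_sum = sum(l * w for l, w in zip(fragment_lengths, weights))
    return weighted_sum / total_weight


# Hypothetical library: half 2,000-base fragments, half 20,000-base fragments.
library = [2_000] * 50 + [20_000] * 50

# With a constant weight every fragment is equally likely to be read.
unbiased = expected_read_length(library, weight=lambda length: 1.0)
# With the assumed 1 / length weight, short fragments are read more often.
biased = expected_read_length(library)

print(f"No loading bias:   {unbiased:,.0f} bases")
print(f"With loading bias: {biased:,.0f} bases")
```

Under this toy model, an even mix of short and long fragments yields an expected read length much closer to the short fragments than to the simple average of the two sizes.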

Users of both PacBio and ONT sequencers have shown that size selection can remove the smaller fragments from a library prior to sequencing. This step may seem trivial, but studies show that it can double the average read length, simply by focusing the sequencer on the longest DNA fragments available.
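The arithmetic behind that gain can be sketched in a few lines of Python. The fragment lengths and the 7,000-base cutoff below are made-up values chosen for illustration, not data from any of the studies mentioned, and `size_select` is a hypothetical helper that mimics a simple minimum-length cut.

```python
from statistics import mean


def size_select(fragment_lengths, min_length):
    """Keep only fragments at least min_length bases long."""
    return [length for length in fragment_lengths if length >= min_length]


# Hypothetical fragment lengths (in bases) for a small toy library.
toy_library = [500, 1_000, 2_000, 3_000, 5_000, 8_000, 12_000, 20_000, 30_000, 40_000]

selected = size_select(toy_library, min_length=7_000)

mean_before = mean(toy_library)
mean_after = mean(selected)

print(f"Mean before: {mean_before:,.0f} bases")
print(f"Mean after:  {mean_after:,.0f} bases "
      f"({len(selected)} of {len(toy_library)} fragments kept)")
```

In this toy library, dropping the five fragments below the cutoff raises the mean fragment length from 12,150 to 22,000 bases. Real libraries will behave differently depending on their starting size distribution.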

Here’s a great example from blogger Lex Nederbragt with nice data and charts. In a more recent study of the human genome, scientists from the Icahn School of Medicine at Mount Sinai and several other institutions reported the first diploid human genome sequence and noted that size selection was essential for maximizing read length. “Without selection, smaller 2000 – 7000 bp molecules dominate the zero-mode waveguide loading distribution, decreasing the sub-readlength,” the researchers noted in the supplementary materials.

At a recent ONT user group meeting, scientist and blogger Keith Robison reported that the company had begun using the BluePippin™ automated size selection platform to increase average read lengths; some users demonstrated the ability to enrich for reads at least 20 Kb long. At a PacBio user group event last fall, CSO Jonas Korlach introduced a protocol for generating libraries of at least 30 Kb by using the BluePippin with Diagenode shearing.

To learn more, check out the long-read sequencing resources listed below.

Resources

Sage Protocols for PacBio

Scientist Profile: Long Reads at Mount Sinai

App Note: 7 Kb+ Libraries