Breaking Down the Assembly of Nucleic Acid Sequences

Published July 9, 2016

Microbiome, a term that has become a hot topic in recent years, has scientists of all disciplines wanting to know more. Microbes are everywhere, on any type of surface you can think of. By an often-cited estimate, our physical makeup, by number, consists of 10 bacterial cells to every one of our own. What’s more, approximately 99% of microbes cannot be cultured by conventional methods. Wow!

So how do we get to know our micro-sized counterparts better?

You too can do NGS

To answer this question, we need to turn to next-generation sequencing (NGS) and bioinformatics. Now before I scare you away, stick with me here. It isn’t as daunting a task as you think. Well, maybe it is, but with a little bit of help (and luck), you’ll be performing de novo sequencing of genomes in no time. If this biochemist/microbiologist can do it, so can you. Disclaimer: I have no formal training in computer science and most mistakes are my own.

The ever-increasing quality of high-throughput sequencing technologies and the continual decrease in their price have enabled bioinformatics wannabes like me to take a crack at sequencing: something that was once only possible for those with the money, resources, and know-how.

This is great because it enables more novice bioinformaticians to use NGS as a tool to answer complex biological questions. This is a problem because it enables more novice bioinformaticians to use NGS technologies.

Outlined below are some general steps and a few ‘pro tips’ to help you along the way.

General steps

Nucleic acid purification

The first step in any NGS project is to obtain your starting material. Extract either DNA or RNA from your sample (bioreactor, sediment, water column, etc.) using your favorite technique.

Care should be taken here. You need to extract nucleic acid equally well from everything in the sample, which can be difficult because of the sample’s heterogeneity.

Once extracted and before library preparation, quantify the nucleic acid and assess the quality.
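If you quantify by UV absorbance, the arithmetic is simple enough to script. Here is a minimal sketch, assuming you have A260 and A280 readings from a spectrophotometer. The function names and the tolerance are my own hypothetical choices; the conversion factors (an A260 of 1.0 is roughly 50 ng/µL for double-stranded DNA and 40 ng/µL for RNA) and the target A260/A280 ratios (about 1.8 for DNA, about 2.0 for RNA) are the usual rules of thumb, not hard limits.

```python
# Rule-of-thumb conversion factors: ng/uL of nucleic acid per unit of A260.
NG_PER_UL_PER_A260 = {"dsDNA": 50.0, "RNA": 40.0}

# Commonly quoted A260/A280 ratios for "clean" preparations.
TARGET_RATIO = {"dsDNA": 1.8, "RNA": 2.0}


def concentration_ng_per_ul(a260, kind="dsDNA", dilution_factor=1.0):
    """Estimate concentration from the A260 reading of a (possibly diluted) sample."""
    return a260 * NG_PER_UL_PER_A260[kind] * dilution_factor


def purity_ratio(a260, a280):
    """Return the A260/A280 ratio, a rough indicator of protein contamination."""
    return a260 / a280


def looks_pure(a260, a280, kind="dsDNA", tolerance=0.2):
    """True if the ratio is within `tolerance` of the target (tolerance is an assumption)."""
    return abs(purity_ratio(a260, a280) - TARGET_RATIO[kind]) <= tolerance


if __name__ == "__main__":
    # Hypothetical readings for a DNA prep measured after a 10-fold dilution.
    print(concentration_ng_per_ul(0.25, "dsDNA", dilution_factor=10))
    print(round(purity_ratio(0.25, 0.14), 2), looks_pure(0.25, 0.14))
```

Absorbance alone will not tell you whether the nucleic acid is intact, so treat this as a first check rather than the whole quality assessment.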

Pro tip: If you want DNA for (meta)genome analysis, commercial kits make this easy. I prefer a kit that allows you to isolate DNA and RNA, with the added option of protein. Before spending the time, energy, and money on sequencing, spend a lot of time optimizing this procedure, especially if the end goal is RNA. Isolation and purification of mRNA for (meta)transcriptome studies can be daunting, but with a solid pipeline in place, processing samples becomes much easier.

Library preparation

Once you have high-quality, purified nucleic acid, you need to prep it for sequencing. You might need to shear the nucleic acid to obtain a library with a specified insert size, and then add adapters to the samples, depending on the platform used and whether you are multiplexing your samples.

Pro tip: There are many ways to maximize sequencing depth and coverage while minimizing cost…multiplexing is one.
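To get a feel for the trade-off between multiplexing and coverage, a back-of-the-envelope calculation helps. The sketch below uses the standard estimate that expected coverage is (number of reads × read length) / genome size. It assumes reads are split evenly among the multiplexed samples and that every read is usable, neither of which holds exactly in practice. The function names and the example run are hypothetical, and for a metagenome the "genome size" is itself only a guess.

```python
def expected_coverage(total_reads, read_length, genome_size, samples=1):
    """Expected per-sample coverage when a run is shared evenly by `samples` libraries."""
    if samples < 1 or genome_size <= 0:
        raise ValueError("samples must be >= 1 and genome_size must be positive")
    return total_reads * read_length / (genome_size * samples)


def max_samples(total_reads, read_length, genome_size, target_coverage):
    """Largest number of samples that can share a run and still reach the target coverage."""
    if target_coverage <= 0 or genome_size <= 0:
        raise ValueError("target_coverage and genome_size must be positive")
    return int((total_reads * read_length) // (target_coverage * genome_size))


if __name__ == "__main__":
    # Hypothetical run: 20 million reads of 150 bp, 5 Mb genomes.
    print(expected_coverage(20_000_000, 150, 5_000_000))              # one sample
    print(expected_coverage(20_000_000, 150, 5_000_000, samples=12))  # twelve samples
    print(max_samples(20_000_000, 150, 5_000_000, target_coverage=50))
```

Running numbers like these before you order a run makes it easier to decide how many samples to pool per lane.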

Sequencing

Next, use a sequencing platform to determine who is there (genome) and/or what they are doing (transcriptome). Many types of platforms exist, and you should choose your platform based on the type of sequences that you want. For example, the Illumina platforms (both MiSeq and HiSeq) are great for generating lots of short reads, which is ideal for (meta)transcriptome analysis. Pacific Biosciences (PacBio) is great for generating long reads for (meta)genome analysis. A recent breakthrough in sequencing technologies moves away from conventional sequencing-by-synthesis and utilizes a nanopore to generate longer-than-PacBio reads…pretty cool if you ask me.

Pro tip: Make sure you understand each platform and your experimental design before spending the money to sequence. Try to shy away from sequencing for the sake of sequencing. Oftentimes these decisions are driven by pricing, availability, convenience, university policy, time constraints, or simply what others you know have used in the past. Know what you want/need but be flexible.

Assembly

Now it’s starting to get fun: assemble the sequencing reads into longer contigs and in some cases into even longer scaffolds. This step helps to solidify your confidence in the sequences that you have.

Pro tip: Examine many assemblers. Some assemblers are optimized for certain sequencing platforms, so make sure your assembler is compatible with your sequencing reads. For a metagenome/metatranscriptome study I tried Velvet, MetaVelvet, Ray, IDBA-UD, and SPAdes. I chose SPAdes for my dataset based on the total length of the assembly, the N50, and the percentage of reads mapping back to the assembly.
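The N50 mentioned above is easy to compute yourself once you have the contig lengths: sort the contigs from longest to shortest and walk down the list until you have covered half of the total assembly length; the length of the contig you stop on is the N50. Here is a minimal sketch. The function names and the example lengths are hypothetical, and it assumes you have already pulled the contig lengths out of each assembler's output.

```python
def n50(contig_lengths):
    """Length of the contig at which the running total reaches half the assembly."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("need at least one contig length")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length


def assembly_summary(contig_lengths):
    """Collect a few simple metrics for comparing assemblies side by side."""
    return {
        "contigs": len(contig_lengths),
        "total_length": sum(contig_lengths),
        "longest": max(contig_lengths),
        "n50": n50(contig_lengths),
    }


if __name__ == "__main__":
    # Hypothetical contig lengths from two assemblies of the same reads.
    assembly_a = [8000, 8000, 4000, 3000, 3000, 2000, 2000, 2000]
    assembly_b = [15000, 9000, 5000, 1500, 1500]
    for name, lengths in (("A", assembly_a), ("B", assembly_b)):
        print(name, assembly_summary(lengths))
```

A bigger N50 is not automatically a better assembly (misassemblies can inflate it), which is why it is worth weighing it together with total length and the fraction of reads that map back.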

Gene calling/gene prediction

The fun continues: use a program to search the contigs, scaffolds, or in some cases, single chromosomes for coding sequences (CDSs). Not all programs are created equal, and some are specific to either eukaryotes or prokaryotes. In my brief tenure I have used Prodigal and Glimmer, both of which I recommend. For initial gene calling and annotation, RAST is a very useful, automated resource, but from personal experience I recommend manual curation, especially for genomes recovered from a metagenome.
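To make the idea of gene calling concrete, here is a deliberately naive open reading frame (ORF) scanner. It is a teaching sketch only and is not how Prodigal or Glimmer work: those tools use statistical models, while this one simply looks for an ATG followed in frame by a stop codon on either strand. It also ignores alternative start codons such as GTG and TTG, which prokaryotes do use. The function names and the `min_codons` threshold are my own assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_strand(seq, min_codons):
    """Yield (start, end) for ATG...stop ORFs in the three frames of one strand.

    `end` is exclusive and includes the stop codon. `min_codons` counts the
    start codon but not the stop codon.
    """
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    yield start, i + 3
                start = None


def find_orfs(seq, min_codons=30):
    """Return (strand, start, end, sequence) tuples for both strands.

    Coordinates on the "-" strand are relative to the reverse complement.
    """
    seq = seq.upper()
    results = []
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for start, end in orfs_in_strand(strand_seq, min_codons):
            results.append((strand, start, end, strand_seq[start:end]))
    return results


if __name__ == "__main__":
    # Tiny made-up sequence; real contigs need a much larger min_codons.
    print(find_orfs("CCATGAAATTTGGGTAACC", min_codons=3))
```

Even a toy like this shows why the choice of minimum length and start codons changes what gets called a gene, which is part of why manual curation matters.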

Annotation

In this step you will use information from the gene-calling step to determine what each gene does. Annotation depends on the databases you search and the cutoff values you use (e.g. E-value or bit score). It is also possible to circumvent assembly and annotate the sequences directly. A popular platform for this is MG-RAST.
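Here is a rough sketch of how cutoff values shape an annotation. It assumes your similarity search results are in the 12-column, tab-separated layout that BLAST+ writes with `-outfmt 6` (query, subject, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, E-value, bit score). The cutoffs, the function name, and the example hits are hypothetical; pick thresholds that suit your own database and question.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(handle, max_evalue=1e-5, min_bitscore=50.0):
    """Keep, for each query, the highest bit score hit that passes both cutoffs.

    Returns {query_id: (subject_id, bitscore, evalue)}.
    """
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        evalue = float(hit["evalue"])
        bitscore = float(hit["bitscore"])
        if evalue > max_evalue or bitscore < min_bitscore:
            continue
        current = best.get(hit["qseqid"])
        if current is None or bitscore > current[1]:
            best[hit["qseqid"]] = (hit["sseqid"], bitscore, evalue)
    return best


if __name__ == "__main__":
    # Made-up hits: gene1 has two passing hits, gene2 has only a weak one.
    example = (
        "gene1\tprotB\t60.0\t180\t70\t2\t1\t180\t5\t184\t1e-10\t120\n"
        "gene1\tprotA\t95.0\t200\t10\t0\t1\t200\t1\t200\t1e-50\t350\n"
        "gene2\tprotC\t30.0\t50\t35\t1\t1\t50\t1\t50\t0.5\t25\n"
    )
    print(best_hits(io.StringIO(example)))
```

Loosen the cutoffs and gene2 suddenly gets an "annotation" too, which is exactly why the thresholds you report matter as much as the database you searched.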

For those of us who want to learn bioinformatics, there are many programming languages to choose from, along with many great resources for learning them, including some beginner’s guides to get you started. My advice is to go to workshops, talk to experts, take a Massive Open Online Course (MOOC), or get your hands on a dataset to examine. I found that I learned the most from trial and error.

Final Pro Tip: It is very likely that while you are running through your sequencing/bioinformatics pipeline, a new chemistry or sequencing technology will be released. Don’t get discouraged. You have spent the time and energy choosing an interesting hypothesis…you’ll be fine. Keep Calm and Sequence On.

So you’ve done all this work, now what? Now the real effort starts: the actual analysis (e.g. phylogenomics, metabolism).

Written by Daniel Ross

Image Credit: University of the Fraser Valley