The microbiome, a term that has become a hot topic in recent years, has scientists of all disciplines wanting to know more. Microbes are everywhere, on any type of surface you can think of. By an often-cited estimate, bacterial cells in our bodies outnumber our own ten to one (more recent estimates put the ratio closer to one to one). What’s more, approximately 99% of microbes cannot be cultured by conventional methods. Wow!
So how do we get to know our micro-sized counterparts better?
You too can do NGS
To answer this question, we need to turn to next generation sequencing (NGS) and bioinformatics. Now before I scare you away, stick with me here. It isn’t as daunting a task as you think. Well, maybe it is, but with a little bit of help (and luck), you’ll be performing de novo sequencing of genomes in no time. If this biochemist/microbiologist can do it, so can you. Disclaimer: I have no formal training in computer science and most mistakes are my own.
The ever-increasing quality of high-throughput sequencing technologies and the continual decrease in their price have enabled bioinformatics wannabes like me to take a crack at sequencing: something that was once only possible for those with the money, resources, and know-how.
This is great because it enables more novice bioinformaticians to use NGS as a tool to answer complex biological questions. This is a problem because it enables more novice bioinformaticians to use NGS technologies.
Outlined below are some general steps and a few ‘pro tips’ to help you along the way.
General steps
Nucleic acid purification
The first step in any NGS project is to obtain your starting material. Extract either DNA or RNA from your sample (bioreactor, sediment, water column, etc.) using your favorite technique.
Care should be taken here. You need to extract nucleic acid equally well from every organism in the sample; otherwise the sequencing results will over-represent whatever is easiest to lyse. This can be difficult due to the heterogeneity of the sample.
Once extracted and before library preparation, quantify the nucleic acid and assess the quality.
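One common way to do this quick check (an assumption here, since the post does not name a method) is UV absorbance. The sketch below uses the conventional conversion factors of about 50 ng/µL per A260 unit for double-stranded DNA and 40 ng/µL for RNA, plus the A260/A280 purity ratio. The function names, the `tolerance` value, and the example readings are made up for illustration, and fluorometric assays are often preferred for low-concentration samples.

```python
# Conventional conversion factors (1 cm path length): an A260 of 1.0
# corresponds to roughly 50 ng/uL of dsDNA or 40 ng/uL of RNA.
NG_PER_UL_PER_A260 = {"dsDNA": 50.0, "RNA": 40.0}


def concentration_ng_per_ul(a260, kind="dsDNA", dilution_factor=1.0):
    """Estimate nucleic acid concentration from absorbance at 260 nm."""
    return a260 * NG_PER_UL_PER_A260[kind] * dilution_factor


def purity_ratio(a260, a280):
    """A260/A280 ratio; ~1.8 is usually read as clean DNA, ~2.0 as clean RNA."""
    return a260 / a280


def looks_clean(a260, a280, kind="dsDNA", tolerance=0.15):
    """Hypothetical pass/fail check; the tolerance is an arbitrary choice."""
    target = 1.8 if kind == "dsDNA" else 2.0
    return abs(purity_ratio(a260, a280) - target) <= tolerance


if __name__ == "__main__":
    # Made-up readings for a sample diluted 1:10 before measurement.
    a260, a280 = 0.5, 0.27
    print("concentration (ng/uL):", concentration_ng_per_ul(a260, "dsDNA", 10))
    print("A260/A280:", round(purity_ratio(a260, a280), 2))
    print("passes purity check:", looks_clean(a260, a280, "dsDNA"))
```

A ratio well below the target usually points to protein or phenol carry-over, which is worth fixing before library preparation rather than after.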
Pro tip: If you want DNA for (meta)genome analysis, there are commercial kits available to easily do this. I prefer a kit that allows you to isolate DNA and RNA, with the added option of protein. Before spending the time, energy, and money sequencing, spend a lot of time optimizing this procedure, especially if the end goal is RNA. Isolation and purification of mRNA for (meta)transcriptome studies can be daunting but with a solid pipeline in place, processing samples becomes much easier.
Library preparation
Once you have high-quality, purified nucleic acid, you need to prep it for sequencing. You might need to shear the nucleic acid to obtain a library with a specified insert size, and then add adapters to the samples, depending on the platform used and whether you are multiplexing your samples.
Pro tip: There are many ways to maximize sequencing depth and coverage while minimizing cost…multiplexing (pooling several barcoded samples in a single run) is one.
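To get a feel for the depth-versus-cost trade-off, here is a back-of-the-envelope sketch. It assumes reads are split evenly across samples and that each sample is a single genome of known size; both are simplifications, and a metagenome has no single genome size. The function names and the run figures are illustrative, not taken from any vendor specification.

```python
def expected_coverage(total_reads, read_length_bp, genome_size_bp, n_samples=1):
    """Average depth per sample, assuming reads are shared evenly."""
    total_bases = total_reads * read_length_bp
    return total_bases / (n_samples * genome_size_bp)


def max_samples(total_reads, read_length_bp, genome_size_bp, target_coverage):
    """Largest number of samples per run that still reaches the target depth."""
    bases_needed_per_sample = target_coverage * genome_size_bp
    return int(total_reads * read_length_bp // bases_needed_per_sample)


if __name__ == "__main__":
    # Hypothetical run: 20 million reads of 150 bp, 5 Mb bacterial genomes.
    reads, read_len, genome = 20_000_000, 150, 5_000_000
    print("depth with 12 samples:", expected_coverage(reads, read_len, genome, 12))
    print("samples possible at 100x:", max_samples(reads, read_len, genome, 100))
```

In practice, barcoded samples never split perfectly evenly, so it is sensible to leave some headroom below the number this estimate suggests.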
Sequencing
Next, use a sequencing platform to determine who is there (genome) and/or what they are doing (transcriptome). Many types of platforms exist, and you should choose your platform based upon the type of sequences that you want. For example, Illumina platforms (both MiSeq and HiSeq) are great for generating lots of short reads, which is ideal for (meta)transcriptome analysis. Pacific Biosciences (PacBio) is great for generating long reads for (meta)genome analysis. A recent breakthrough in sequencing technologies moves away from conventional sequencing-by-synthesis and utilizes a nanopore to generate longer-than-PacBio reads…pretty cool if you ask me.
Pro tip: Make sure you understand each platform and your experimental design before spending the money to sequence. Try to shy away from sequencing for the sake of sequencing. Oftentimes these decisions are driven by pricing, availability, convenience, university policy, time constraints, or simply what others you know have used in the past. Know what you want/need, but be flexible.
Assembly
Now it’s starting to get fun: assemble the sequencing reads into longer contigs and in some cases into even longer scaffolds. This step helps to solidify your confidence in the sequences that you have.
Pro tip: Examine many assemblers. Some assemblers are optimized for certain sequencing platforms, so make sure your assembler is compatible with your sequencing reads. For a metagenome/metatranscriptome study I tried Velvet, metaVelvet, Ray, IDBA-UD, and SPAdes. I chose SPAdes for my dataset based upon the total assembly length, the N50, and the percentage of reads mapping back to the assembly.
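For readers new to these metrics: N50 is the contig length at which contigs of that length or longer contain at least half of the assembled bases. The sketch below computes it, along with total length and mapping percentage, from plain lists and counts. The function names and the toy numbers are illustrative; in practice, assembly-evaluation tools report these values for you.

```python
def n50(contig_lengths):
    """Length L such that contigs >= L hold at least half the assembled bases."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contigs given")


def percent_mapped(mapped_reads, total_reads):
    """Share of reads that align back to the assembly, as a percentage."""
    return 100.0 * mapped_reads / total_reads


def summarize(name, contig_lengths, mapped_reads, total_reads):
    """Collect the three comparison metrics for one assembly."""
    return {
        "assembler": name,
        "total_length": sum(contig_lengths),
        "n50": n50(contig_lengths),
        "percent_mapped": percent_mapped(mapped_reads, total_reads),
    }


if __name__ == "__main__":
    # Toy contig lengths (bp) and read counts, invented for illustration.
    print(summarize("assembly_A", [100, 80, 50, 40, 30], 900, 1000))
    print(summarize("assembly_B", [200, 20, 20, 20, 20], 850, 1000))
```

No single number settles the choice: a high N50 with a poor mapping rate can mean the assembler joined reads that do not belong together, which is why the three metrics are best read side by side.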
Gene calling/gene prediction
The fun continues: use a program to analyze the contigs, scaffolds, or in some cases, single chromosomes for coding sequences (CDSs). Not all programs are created equal, and some are specific to either eukaryotes or prokaryotes. In my brief tenure I have used Prodigal and Glimmer, both of which I recommend. For initial gene calling and annotation, RAST is a very useful, automated resource, but from personal experience, manual genome curation is recommended, especially for genomes isolated from a metagenome.
Annotation
In this step you will use information from the gene-calling step to determine what each gene does. Annotation depends on the databases you search and the cutoff values you apply (e.g. e-value or bitscore). It is also possible to skip assembly and annotate the unassembled reads directly. A popular platform for this is MG-RAST.
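To make the cutoff idea concrete, here is a minimal sketch that filters similarity-search hits by e-value and bitscore and keeps the best remaining hit per gene. It assumes the input is BLAST's standard 12-column tabular output (`-outfmt 6`). The thresholds, function name, and sample rows are arbitrary illustrations, not recommended values.

```python
import csv
import io

# Default column order of BLAST tabular output (-outfmt 6).
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(handle, max_evalue=1e-5, min_bitscore=50.0):
    """Return {query: (subject, bitscore, evalue)} for hits passing both cutoffs."""
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(BLAST_COLUMNS, row))
        evalue = float(hit["evalue"])
        bitscore = float(hit["bitscore"])
        if evalue > max_evalue or bitscore < min_bitscore:
            continue  # hit fails one of the cutoffs
        current = best.get(hit["qseqid"])
        if current is None or bitscore > current[1]:
            best[hit["qseqid"]] = (hit["sseqid"], bitscore, evalue)
    return best


if __name__ == "__main__":
    # Invented rows: gene2's only hit is too weak to pass the cutoffs.
    sample = (
        "gene1\tprotA\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-100\t380\n"
        "gene1\tprotB\t60.0\t180\t72\t2\t5\t184\t10\t189\t1e-20\t95.5\n"
        "gene2\tprotC\t35.0\t60\t39\t1\t1\t60\t4\t63\t0.5\t28.1\n"
    )
    for gene, (subject, bits, ev) in best_hits(io.StringIO(sample)).items():
        print(gene, subject, bits, ev)
```

Loosening or tightening `max_evalue` and `min_bitscore` changes which genes receive an annotation at all, which is why the cutoffs belong in your methods alongside the database name and version.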
For those of us who want to learn bioinformatics, there are many programming languages to choose from, along with many great resources for learning to program for NGS analysis. There are even some beginner’s guides to get you started. My advice is to go to workshops, talk to experts, take a Massive Open Online Course (MOOC), or get your hands on a dataset to examine. I found that I learned most from trial and error.
Final Pro Tip: It is very likely that while you are running through your sequencing/bioinformatics pipeline, a new chemistry or sequencing technology will be released. Don’t get discouraged. You have spent the time and energy choosing an interesting hypothesis…you’ll be fine. Keep Calm and Sequence On.
So you’ve done all this work, now what? Now the real effort starts: the actual analysis (e.g. phylogenomics, metabolism).