Let’s say that you’ve just finished gathering your NGS reads and you’re going to simulate the introduction of random mutations at specific rates into the reads. Before you move on to the next step, you need to determine which NGS data simulator will get the job done.
With the ever-increasing advancement of NGS in the field of molecular biology, improved simulation tools are constantly being developed to provide better performance and accuracy. These include EAGLE, pIRS, ReadSim, and SimSeq, among others.
However, it’s important to note that not all simulation tools are created equal, and they differ widely in functionality and input methods. For that reason, choosing an appropriate NGS simulator can be quite a headache.
With that in mind, here are the major steps for choosing an NGS data simulator.
Determine If a Reference Sequence Is Needed
The majority of the available NGS simulators (EAGLE, ReadSim, pIRS, and SimSeq) require a reference sequence to generate simulated reads. EAGLE and ReadSim allow you to specify a reference genome of any ploidy, whereas pIRS and SimSeq only simulate reads from a haploid reference. The XS read simulator, on the other hand, doesn’t need any reference sequence. Instead, it generates reads from scratch based on the chosen sequencing technology, nucleotide composition, and read length.
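To make the idea of reference-free simulation concrete, here is a minimal Python sketch that builds reads from nothing but a nucleotide composition and a read length. It is not how XS works internally; the function name, the composition values, and the assumption that every base is drawn independently are all mine, purely for illustration.

```python
import random

def simulate_reads_without_reference(n_reads, read_length, composition, seed=0):
    """Draw every base independently from the given nucleotide composition.

    Hypothetical helper for illustration only; real simulators model far more
    (quality scores, platform-specific headers, repeats, and so on).
    """
    rng = random.Random(seed)  # seeded so the example is reproducible
    bases = list(composition.keys())
    weights = list(composition.values())
    return [
        "".join(rng.choices(bases, weights=weights, k=read_length))
        for _ in range(n_reads)
    ]

# Assumed, made-up composition for an AT-rich genome.
composition = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
reads = simulate_reads_without_reference(n_reads=5, read_length=50, composition=composition)
for read in reads:
    print(read)
```

The trade-off is visible right away: without a reference, the reads carry no genomic structure, so this style of simulation suits benchmarking tools that care about format and composition more than biology.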
Genomic vs Metagenomic
Next, decide whether the reads should be simulated from one or several organisms.
If you need reads from several organisms, GemSim, Grinder, NeSSM, and BEAR (NO! Not the actual animal, the program!) take a set of reference sequences belonging to various taxa and generate reads resembling an actual metagenomic community.
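The core idea behind metagenomic simulation can be sketched in a few lines of Python: pick a taxon according to its relative abundance, then cut a read from that taxon’s reference. This is only a rough sketch, not the algorithm of any of the tools named above; the toy references, the abundance values, and the function name are assumptions made up for the example.

```python
import random

def simulate_metagenomic_reads(references, abundances, n_reads, read_length, seed=0):
    """Sample reads from several references, weighted by relative abundance.

    Hypothetical helper: returns (taxon, read) pairs so the origin of each
    read stays traceable.
    """
    rng = random.Random(seed)
    taxa = list(references)
    weights = [abundances[taxon] for taxon in taxa]
    reads = []
    for _ in range(n_reads):
        # Choose the source organism in proportion to its abundance.
        taxon = rng.choices(taxa, weights=weights, k=1)[0]
        genome = references[taxon]
        # Choose a start position so the read fits entirely inside the genome.
        start = rng.randrange(len(genome) - read_length + 1)
        reads.append((taxon, genome[start:start + read_length]))
    return reads

# Toy references and abundances, invented for illustration.
references = {"taxon_A": "ACGT" * 25, "taxon_B": "GGCATTA" * 15}
abundances = {"taxon_A": 0.8, "taxon_B": 0.2}

community_reads = simulate_metagenomic_reads(references, abundances, n_reads=10, read_length=30)
for taxon, read in community_reads:
    print(taxon, read)
```

Note that this weights taxa by read count rather than by genome size or copy number, which is a simplification a real simulator may handle differently.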
Introducing Genomic Variants
And of course, you can also introduce genomic variants such as single nucleotide polymorphisms (SNPs), insertions, deletions, inversions, translocations, copy number variants (CNVs), and short tandem repeats (STRs) into the reference sequence before the read generation step.
Depending on the simulator, you may need to give the program a file of known mutations in order to introduce them. In this case, EAGLE and DWGSIM require files in plain text or variant call format (VCF), while FASTQsim requires FASTQ files and a reference genome. Other simulators, such as GemSim, allow you to generate tab-delimited haplotype files to introduce genomic variants at specific locations. Along with Mason, GemSim may also allow users to generate population-level diversity, such as SNPs, through the generation of mutant sequences from a single reference sequence.
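If you just want to see what “introducing random mutations at a specific rate” looks like, here is a minimal Python sketch that sprinkles SNPs over a reference and records where they landed. It is not the method used by any of the simulators above; the function name, the toy reference, and the mutation rate are assumptions, and the printed table is a simplified tab-delimited listing, not a valid VCF file.

```python
import random

def introduce_snps(reference, snp_rate, seed=0):
    """Mutate each base with probability snp_rate.

    Hypothetical helper: returns the mutated sequence plus a list of
    (1-based position, reference base, alternate base) tuples.
    """
    rng = random.Random(seed)
    mutated = []
    variants = []
    for position, base in enumerate(reference, start=1):
        if base in "ACGT" and rng.random() < snp_rate:
            # Pick a substitute that differs from the original base.
            alt = rng.choice([b for b in "ACGT" if b != base])
            variants.append((position, base, alt))
            mutated.append(alt)
        else:
            mutated.append(base)
    return "".join(mutated), variants

# Toy reference and an exaggerated rate so that a few SNPs show up.
reference = "ACGTTGCAAGGCTTACGATCGATCGGATCCATG" * 3
mutant, variants = introduce_snps(reference, snp_rate=0.05)

print("POS\tREF\tALT")
for position, ref_base, alt_base in variants:
    print(f"{position}\t{ref_base}\t{alt_base}")
```

Keeping the list of introduced variants is the useful part: it gives you a ground truth to compare against when you later run a variant caller on the simulated reads.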
Determine the Sequencing Platforms
Last, but not least, you might want to check which sequencing platform you’re simulating. Each sequencing platform has its own protocol and error rate, from ~1% in Illumina to ~30% in Nanopore, so each will give you data with different characteristics.
Each sequencing platform is also prone to particular errors. For example, substitution errors are dominant in Illumina and SOLiD platforms, while indel errors are dominant in IonTorrent and 454. Not all of the aforementioned simulators take these differences into account. But the good news is that some let you specify the error parameters for each platform (as seen in ART, Mason, and pIRS). Several simulators model errors through a user-specified error parameter (DWGSIM and FASTQsim), while others use variable error rates along the reads (simhtsd and wgsim), an error distribution (Grinder), or errors generated with some noise (simNGS).
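The difference between a substitution-dominant and an indel-dominant platform is easier to picture with a small Python sketch. The one below applies errors to a read using just two knobs: an overall per-base error rate and the fraction of errors that are substitutions. Both the function and the two profiles are hypothetical, and the numbers are illustrative rather than measured values for any real platform or the defaults of any simulator.

```python
import random

def add_sequencing_errors(read, error_rate, sub_fraction, seed=0):
    """Apply per-base errors; non-substitution errors are split evenly
    between insertions and deletions. Hypothetical helper for illustration."""
    rng = random.Random(seed)
    out = []
    counts = {"sub": 0, "ins": 0, "del": 0}
    for base in read:
        if rng.random() >= error_rate:
            out.append(base)  # base read correctly
            continue
        if rng.random() < sub_fraction:
            # Substitution: replace with a different base.
            out.append(rng.choice([b for b in "ACGT" if b != base]))
            counts["sub"] += 1
        elif rng.random() < 0.5:
            # Insertion: keep the base and add a random extra one after it.
            out.append(base)
            out.append(rng.choice("ACGT"))
            counts["ins"] += 1
        else:
            # Deletion: drop the base entirely.
            counts["del"] += 1
    return "".join(out), counts

# Made-up profiles; the error rate is inflated so errors are visible in a short read.
profiles = {
    "substitution_dominant": {"error_rate": 0.05, "sub_fraction": 0.9},
    "indel_dominant": {"error_rate": 0.05, "sub_fraction": 0.1},
}

read = "ACGT" * 50
for name, params in profiles.items():
    noisy, counts = add_sequencing_errors(read, **params)
    print(name, len(noisy), counts)
```

A uniform error rate along the read is the simplest possible model. Simulators that use variable rates or empirical error distributions exist precisely because real error rates change with position in the read.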
Pick Your NGS Data Simulator
Now that you’ve got the basic idea, hopefully you’ll be able to choose a suitable simulator for your research.
Got any questions or tips that I might have left out in this article? Feel free to comment below.
- Escalona M, Rocha S, and Posada D. (2016) A comparison of tools for the simulation of genomic next-generation sequencing data. Nat Rev Genet. 17: p. 459–469.