Call Out the Variants and Genotypes
In this first step, you will determine the positions where at least one of your samples differs from the reference sequence, a process known as “variant calling”. The next step, where you evaluate the individual alleles at all variant sites, is known as “genotyping”. A large number of bioinformatics tools can help you find the variants within the NGS reads. For single-sample variant calling, you can use Atlas-SNP2, SLIDERII, or SOAPsnp. For genotyping or multi-sample variant calling, FreeBayes, GATK, QCALL, SAMtools, and SeqEM are good recommendations. How you process your data depends on the sequencing coverage you have. Sequencing coverage refers to the average number of times a single base is read during a sequencing experiment. Calculate it like this:
Coverage = Read count x Read length / Target sequence size
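To make the coverage formula concrete, here is a minimal Python sketch. The function names and the example numbers (2 million reads of 150 bp over a 10 Mb target) are made up for illustration; they do not come from any particular tool or dataset.

```python
import math


def mean_coverage(read_count: int, read_length: int, target_size: int) -> float:
    """Coverage = read count x read length / target sequence size."""
    if target_size <= 0:
        raise ValueError("target_size must be a positive number of bases")
    return read_count * read_length / target_size


def reads_needed(coverage: float, read_length: int, target_size: int) -> int:
    """Rearrange the same formula to estimate how many reads reach a given coverage."""
    if read_length <= 0:
        raise ValueError("read_length must be a positive number of bases")
    return math.ceil(coverage * target_size / read_length)


# Hypothetical experiment: 2,000,000 reads of 150 bp on a 10,000,000 bp target.
print(mean_coverage(2_000_000, 150, 10_000_000))  # 30.0, i.e. 30x
print(reads_needed(30, 150, 10_000_000))          # 2000000
```

Keep in mind that this is an average: individual positions can end up well above or below it, which is one reason real pipelines look at per-site depth too.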
For example, 10x coverage means each base has been read by 10 sequences on average, while 100x coverage means each base has been read by 100 sequences. The more often a base is sequenced, the higher the coverage, and the more reliable the call. Most publications require coverage of 10x to 50x, depending on the research application. Certain cancer research might require 100x coverage to ensure the quality of the data. With high coverage sequencing data, you could simply ignore the low-quality alleles and count only the high-quality alleles that you come across. On the other hand, if you are dealing with medium or low coverage sequencing data, you might want to use probabilistic or Bayesian methods to avoid undercalling heterozygous genotypes. These methods take additional information into account before deciding whether a given locus is heterozygous or homozygous, such as read coverage, the error rate of the NGS platform, and alignment quality scores.
Increase Sensitivity with Joint Calling
I almost forgot to mention that you can also get data with both high sensitivity and high specificity by combining information from the initial alignment with the local short reads you have assembled from scratch (de novo assembly). This method is called joint calling. As in almost any other in silico research, experimental design plays an important role in variant calling and genotyping. By using high coverage sequence data, you will get more accurate variant calls and genotype estimates. However, due to limited research budgets…*looks at the wallet* *sigh*, forcing yourself to obtain high coverage means sequencing fewer samples. Thus, you will get a poor representation of a population’s true genetic variation. On the other hand, low coverage sequencing of more samples might provide a better picture of the population’s variation, but it comes with much higher uncertainty and more errors. For that reason, you might want to use joint calling to reduce systematic errors and sampling bias and to strengthen poorly supported variants. By doing this, you get the best of both worlds. For this method to work, you need appropriate software, such as GATK’s HaplotypeCaller or Platypus. The results are reported in the Variant Call Format (VCF).
Filter Out the Errors in SNPs
Next, you should remove the false positives from the initial genotype calling data set. Doing this will improve the specificity tremendously. There are two filtering strategies to choose from. The first strategy is hard filtering. This strategy assumes that false positives often display unusual characteristics, for example:
- Low quality scores in duplicated regions, which lead to ambiguous alignment
- Poor alignment with the reference sequence caused by an incomplete reference sequence
- Strand bias (normally, genuine variants have roughly equal coverage on both the forward and reverse strands; here, however, the variant is supported by only the forward strand or only the reverse strand)
- A high number of variants clustering within a certain region
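The characteristics listed above can be turned into simple pass/fail rules. The following Python sketch shows the general idea only: the `Call` record, the filter labels, and every threshold are hypothetical values chosen for illustration, not recommended settings or the behaviour of any specific tool.

```python
from dataclasses import dataclass


@dataclass
class Call:
    chrom: str
    pos: int
    qual: float   # variant quality score
    mapq: float   # mean mapping quality of the supporting reads
    alt_fwd: int  # variant-supporting reads on the forward strand
    alt_rev: int  # variant-supporting reads on the reverse strand


# Illustrative thresholds (assumptions, tune them for your own data).
MIN_QUAL = 30.0
MIN_MAPQ = 40.0
CLUSTER_WINDOW = 10  # distance in bp
CLUSTER_SIZE = 3     # calls within the window, including the call itself


def hard_filter(calls):
    """Return (call, reasons) pairs; reasons is ["PASS"] when no rule fires."""
    ordered = sorted(calls, key=lambda c: (c.chrom, c.pos))
    results = []
    for call in ordered:
        reasons = []
        if call.qual < MIN_QUAL:
            reasons.append("LowQual")
        if call.mapq < MIN_MAPQ:
            reasons.append("LowMapQ")  # ambiguous or poor alignment
        # Strand bias: support on exactly one of the two strands.
        if (call.alt_fwd == 0) != (call.alt_rev == 0):
            reasons.append("StrandBias")
        # Clustering: count calls on the same chromosome within the window.
        nearby = sum(
            1
            for other in ordered
            if other.chrom == call.chrom
            and abs(other.pos - call.pos) <= CLUSTER_WINDOW
        )
        if nearby >= CLUSTER_SIZE:
            reasons.append("Cluster")
        results.append((call, reasons or ["PASS"]))
    return results


example = [
    Call("chr1", 100, 50.0, 60.0, 5, 6),
    Call("chr1", 900, 12.0, 60.0, 7, 0),
]
for call, reasons in hard_filter(example):
    print(call.chrom, call.pos, ",".join(reasons))
```

Each rule is checked independently, so a single call can collect several labels. That makes it easier to see afterwards which characteristic is removing most of your candidates.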