The Beginners Guide to DNA Sequence Alignment

on 15th of October, 2012 in Nucleic Acid Purification and Analysis
header image copy
About the Author:
I have what some might call an eclectic background. Originally a music major (yes, I wanted to be a rock star), I spent the first 17 years of my professional career as a technical advisor in the car business for Acura, BMW, Mazda and Volkswagen. Twelve years ago, my mother’s bout with cancer brought out the scientist in me.

I’ve been lucky enough to earn two degrees in cell and molecular biology and have a passion for virology and immunology. I’ve also been honored to serve as senior molecular biologist for the US Department of Defense’s Global Influenza Surveillance Program from 2006 -2011 where we made great strides...

Fortunately, those of us who have learned how to sequence know that aligning sequences is a lot easier and less time consuming than creating them. Whether you’re employing sequencing gels, Sanger-based methods, or the latest in pyrosequencing or ion torrent technologies, obtaining, manipulating and analyzing your sequences has never been easier.

We’re going to take a look at just the basics of sequence alignment to get you started.

How many sequences can I align?

You must have a minimum of 2 sequences to perform an alignment. For comparing 2 sequences you’ll need to perform a “pairwise” alignment. Most programs will align 3 or more sequences at a time and will require a different algorithm e.g. MUSCLE or one of the Clustal algorithms like ClustalW.

You can align several hundred to several thousand if you wish, but there are several factors that can make this straightforward and simple or a time hog if not impossible. First, you must choose an appropriate algorithm. For instance, the sequencing program MUSCLE can usually handle large data sets with a premium on accuracy. For some perspective, I can usually align ~750 sequences of 1000 nucleotides each in about an hour using MUSCLE. For aligning a large number of sequences, you must have sufficient computer memory and storage.

What is the difference between similarity and identity?

Identity is the degree of correlation between 2 un-gapped sequences, and indicates that the amino acids or nucleotides at a particular position are an exact match.  Generally, an identity of 25% or higher suggests the potential for similarity of function; an identity of 18-25% implies similarity of structure or function.  It is important to note that 2 or more completely unrelated sequences can have 20% identity or greater, so this is not a hard and fast rule. Similarity is the degree of resemblance between two sequences when they are compared, and indicates that the amino acids or nucleotides at a particular position have some properties in common (for instance, charge or hydrophobicity), but are not identical. A high percentage of similar residues can also suggest a conserved function or structure.

What is a “consensus” sequence?

A consensus sequence usually appears at the top of your alignment worktable, and each nucleotide (or amino acid) of the sequence is based on the residue that appears at that position most frequently in your aligned sequence. For instance, if you align 5 sequences, and the nucleotides at position 20 are A, A, T, A, and G, then the consensus sequence will have an A at position 20. The use of consensus sequences can be very useful when examining evolutionary relationships between sequences with high degrees of identity. It is also useful to use the consensus to identify potential gaps in your aligned sequences.

Why are gaps important?

A gap is one or more spaces in a single string of a given alignment and usually corresponds to an insertion or deletion in one or more sequences within the alignment. The insertion or deletion can be an artifact of sequencing chemistry and not indicative of the authentic DNA sequence. According to the European Bioinformatics Institute, there are several other potential explanations for:

  • A single mutation can create a gap (very common).
  • Unequal crossover in meiosis can lead to insertion or deletion of strings of bases.
  • DNA slippage in the replication procedure can result in the repetition of a string.
  • Retrovirus insertions.
  • Translocations of DNA between chromosomes.

How do I know my sequence data is good?

Alphabet soup. Lots of As, Ts, Cs and Gs. Regardless of your methods to obtain your sequences, the overall success and accuracy of your sequence alignments and subsequent analyses depend entirely on the quality of your sequence data. Things like solid upstream preparations, primer design and reagent quality can make you a hero….or a zero. So, in scientific terms, the quality of sequence data is directly proportional to the success and robustness of your alignments, you know, ‘garbage in – garbage out’.

In most cases, your raw data is “scored” and cleaned up by the sequencer software resulting in your finished, exportable sequence. Quality can be scored many different ways, depending on the technology and chemistry used, and utilizes criteria such as signal strength, number of contiguous nucleotides read and the ease with which each nucleotide is determined, e.g. a clean, unobstructed peak in a chromatograph. Other than As, Ts, Cs and Gs, it’s important to understand other codes that may appear in your data (click here for the complete list of IUPAC codes).

Some helpful tips

  • I would highly recommend getting in the habit of saving your work early and often!
  • Get used to FASTA file formats – you’ll need these when downloading from clearing houses like GenBank (http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml)
  • In general, if you’re aligning sequences with LTR (long terminal repeats) regions, you might try deleting these regions as long as they are all identical in composition and length – this will speed up your alignments without sacrificing accuracy.
  • The longer your sequences, the longer the time required.

Stay tuned for the next article in this series, in which we’ll talk about the different sequencing alignment programs that are available.

Photo Credits

One thought on “The Beginners Guide to DNA Sequence Alignment”

  1. Avatar of NULabMonkey NULabMonkey says:

    Just a couple of things, for large datasets (like your example of 750 sequences * 1000 nucleotides), it might be helpful to use more efficient methods such as MAFFT (http://mafft.cbrc.jp/alignment/software/)rather than MUSCLE.

    Furthermore most alignment programs will NOT actually give you anything resembling a "confidence score" by default. Tools like the GUIDANCE server (http://guidance.tau.ac.il/), Heads or Tails analysis (http://www.ncbi.nlm.nih.gov/pubmed/17387100), and others can provide SOME idea of the relative confidence of a given alignment at a given position. Furthermore, tools like T-Coffee and M-Coffee (http://www.tcoffee.org/) actually DO report confidence, and M-Coffee does so in kind of a cool way (by running many different alignments, aggregating them, and comparing agreement across the aggregate alignment).

    However, if one's genuinely interested in assessing the robustness of their alignment, then the best current option might be utilization of fully-probabilistic Bayesian methods, which sadly may be far too computationally expensive for your example dataset…unless you have a rather large computing cluster at your disposal.

    Just my 2-cents.

Leave a Reply