The Beginner’s Guide to DNA Sequence Alignment

by Jason Garner in Software & Tools
From the Bitesize Bio channel

Fortunately, those of us who have learned how to sequence know that aligning sequences is a lot easier and less time consuming than creating them. Whether you’re employing sequencing gels, Sanger-based methods, or the latest in pyrosequencing or Ion Torrent technologies, obtaining, manipulating and analyzing your sequences has never been easier.

We’re going to take a look at just the basics of sequence alignment to get you started.

How many sequences can I align?

You must have a minimum of 2 sequences to perform an alignment. For comparing 2 sequences you’ll need to perform a “pairwise” alignment. Aligning 3 or more sequences at a time is a multiple sequence alignment and requires a different algorithm, e.g. MUSCLE or one of the Clustal algorithms like ClustalW.
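To make the idea of a pairwise alignment concrete, here is a minimal sketch of a global alignment in the style of the Needleman-Wunsch algorithm. The function name and the scoring values (match +1, mismatch -1, gap -1) are illustrative assumptions rather than the defaults of any particular program, and real tools use more refined scoring schemes.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align two sequences. Returns (aligned_a, aligned_b, score).

    The scoring values are illustrative assumptions, not program defaults.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + step,  # align the two residues
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + step:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[len(a)][len(b)]


aligned_a, aligned_b, total = needleman_wunsch("ACGT", "AGT")
print(aligned_a)
print(aligned_b)
print(total)
```

The table has one cell for every pair of positions, so memory and time grow with the product of the two sequence lengths. That is one reason aligning many long sequences calls for dedicated programs.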

You can align several hundred to several thousand sequences if you wish, but several factors determine whether this is straightforward or a time hog, if not impossible. First, you must choose an appropriate algorithm. For instance, the alignment program MUSCLE can usually handle large data sets while keeping a premium on accuracy. For some perspective, I can usually align ~750 sequences of 1000 nucleotides each in about an hour using MUSCLE. Second, for aligning a large number of sequences, you must have sufficient computer memory and storage.

What is the difference between similarity and identity?

Identity is the degree to which 2 un-gapped sequences match exactly, and indicates that the amino acids or nucleotides at a particular position are identical. Generally, an identity of 25% or higher suggests the potential for similarity of function; an identity of 18-25% implies similarity of structure or function. It is important to note that 2 or more completely unrelated sequences can have 20% identity or greater, so this is not a hard and fast rule. Similarity is the degree of resemblance between two sequences when they are compared, and indicates that the amino acids or nucleotides at a particular position have some properties in common (for instance, charge or hydrophobicity), but are not identical. A high percentage of similar residues can also suggest a conserved function or structure.
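The distinction can be sketched in a few lines of Python. The residue groupings below are a simplified, hypothetical classification chosen for illustration (real programs typically use substitution matrices), and the function name is my own. Columns containing a gap are skipped, so the percentages refer only to positions present in both sequences.

```python
# Hypothetical, simplified amino acid groupings, for illustration only.
SIMILAR_GROUPS = [
    set("DE"),     # acidic
    set("KRH"),    # basic
    set("AVLIM"),  # hydrophobic, aliphatic
    set("FWY"),    # aromatic
    set("STNQ"),   # polar, uncharged
]


def identity_and_similarity(aln_a, aln_b, groups=SIMILAR_GROUPS):
    """Return (% identical, % similar but not identical) for two aligned sequences."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have the same length")
    identical = similar = compared = 0
    for x, y in zip(aln_a.upper(), aln_b.upper()):
        if x == "-" or y == "-":
            continue  # skip gapped columns
        compared += 1
        if x == y:
            identical += 1
        elif any(x in group and y in group for group in groups):
            similar += 1
    if compared == 0:
        return 0.0, 0.0
    return 100.0 * identical / compared, 100.0 * similar / compared


identity, similarity = identity_and_similarity("KVLD-E", "RVIDAE")
print(identity, similarity)
```

In this example, K/R and L/I fall into the same group and count as similar, while V, D and E match exactly and count as identical.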

What is a “consensus” sequence?

A consensus sequence usually appears at the top of your alignment worktable, and each nucleotide (or amino acid) of the sequence is based on the residue that appears at that position most frequently in your aligned sequences. For instance, if you align 5 sequences, and the nucleotides at position 20 are A, A, T, A, and G, then the consensus sequence will have an A at position 20. Consensus sequences can be very useful when examining evolutionary relationships between sequences with high degrees of identity. The consensus can also help you identify potential gaps in your aligned sequences.
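The majority rule described above can be sketched as follows. This is a simplified illustration, not how any particular alignment program builds its consensus: the function name is my own, ties are broken by whichever residue appears first in the column (an assumption), and gap characters are counted like any other symbol.

```python
from collections import Counter


def consensus(aligned_sequences):
    """Return the most frequent character at each column of an alignment."""
    lengths = {len(seq) for seq in aligned_sequences}
    if len(lengths) != 1:
        raise ValueError("need at least one sequence, all of the same length")
    result = []
    for column in zip(*aligned_sequences):
        # most_common(1) returns the top (character, count) pair for the column.
        residue, _count = Counter(column).most_common(1)[0]
        result.append(residue)
    return "".join(result)


# One column holding A, A, T, A and G, as in the example above.
print(consensus(["A", "A", "T", "A", "G"]))
```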

Why are gaps important?

A gap is one or more spaces in a single string of a given alignment and usually corresponds to an insertion or deletion in one or more sequences within the alignment. The insertion or deletion can be an artifact of sequencing chemistry and not indicative of the authentic DNA sequence. According to the European Bioinformatics Institute, there are several other potential explanations for gaps:

  • A single mutation can create a gap (very common).
  • Unequal crossover in meiosis can lead to insertion or deletion of strings of bases.
  • DNA slippage in the replication procedure can result in the repetition of a string.
  • Retrovirus insertions.
  • Translocations of DNA between chromosomes.
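When inspecting an alignment, it can help to list where the gaps fall in each sequence. The sketch below assumes gaps are written as hyphens, which is common but not universal. The function name is hypothetical.

```python
import re


def gap_runs(aligned_sequence):
    """Return (start, length) for each run of '-' in an aligned sequence.

    Positions are 0-based columns of the alignment.
    """
    return [
        (m.start(), m.end() - m.start())
        for m in re.finditer(r"-+", aligned_sequence)
    ]


print(gap_runs("AC--GT-A"))
```

A single long run and several isolated one-column gaps can point to different causes, so looking at run lengths is a reasonable first step before deciding whether a gap is real or a sequencing artifact.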

How do I know my sequence data is good?

Alphabet soup. Lots of As, Ts, Cs and Gs. Regardless of your methods to obtain your sequences, the overall success and accuracy of your sequence alignments and subsequent analyses depend entirely on the quality of your sequence data. Things like solid upstream preparations, primer design and reagent quality can make you a hero... or a zero. So, in scientific terms, the success and robustness of your alignments are directly proportional to the quality of your sequence data. You know, ‘garbage in, garbage out’.

In most cases, your raw data is “scored” and cleaned up by the sequencer software, resulting in your finished, exportable sequence. Quality can be scored many different ways, depending on the technology and chemistry used, and utilizes criteria such as signal strength, number of contiguous nucleotides read and the ease with which each nucleotide is determined, e.g. a clean, unobstructed peak in a chromatogram. Other than As, Ts, Cs and Gs, it’s important to understand other codes that may appear in your data (see the complete list of IUPAC codes).
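As a quick sanity check before aligning, you can count how many positions in a sequence are unambiguous bases and how many are IUPAC ambiguity codes. The sketch below uses the standard IUPAC nucleotide ambiguity codes. The function name and the shape of the returned summary are my own choices, and anything outside the table (including gap characters) is reported as unrecognized.

```python
from collections import Counter

# Standard IUPAC nucleotide ambiguity codes and the bases each can represent.
IUPAC_AMBIGUITY = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def summarize_bases(sequence):
    """Count unambiguous bases, ambiguity codes and unrecognized characters."""
    counts = Counter(sequence.upper())
    known = set("ACGT") | set(IUPAC_AMBIGUITY)
    return {
        "unambiguous": sum(counts[base] for base in "ACGT"),
        "ambiguous": sum(counts[code] for code in IUPAC_AMBIGUITY),
        "unrecognized": sorted(set(counts) - known),
    }


print(summarize_bases("ACGTNRYX"))
```

A high proportion of ambiguity codes, especially N, is a hint that the underlying read quality may be poor.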

Some helpful tips

  • I would highly recommend getting in the habit of saving your work early and often!
  • Get used to FASTA file formats – you’ll need these when downloading from clearing houses like GenBank (http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml)
  • In general, if you’re aligning sequences with long terminal repeat (LTR) regions, you might try deleting these regions as long as they are all identical in composition and length – this will speed up your alignments without sacrificing accuracy.
  • The longer your sequences, the longer the time required.
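If FASTA is new to you: each record is a header line starting with ">" followed by one or more lines of sequence. Below is a minimal sketch of a reader for that layout. The function name is hypothetical, and it assumes headers are unique (a repeated header would overwrite the earlier record). For real work, an established library is a safer choice.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a {header: sequence} dictionary."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = []
        elif header is not None:
            # Sequence data may be wrapped over several lines.
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


example = ">seq1 demo\nACGT\nGGCC\n>seq2\nTTAA\n"
print(parse_fasta(example))
```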

Stay tuned for the next article in this series, in which we’ll talk about the different sequence alignment programs that are available.

About the author

Jason Garner

I have what some might call an eclectic background. Originally a music major (yes, I wanted to be a rock star), I spent the first 17 years of my professional career as a technical advisor in the car business for Acura, BMW, Mazda and Volkswagen....

What do you think?

One comment

  1.

    Just a couple of things: for large datasets (like your example of 750 sequences * 1000 nucleotides), it might be helpful to use more efficient methods such as MAFFT (http://mafft.cbrc.jp/alignment/software/) rather than MUSCLE.

    Furthermore most alignment programs will NOT actually give you anything resembling a “confidence score” by default. Tools like the GUIDANCE server (http://guidance.tau.ac.il/), Heads or Tails analysis (http://www.ncbi.nlm.nih.gov/pubmed/17387100), and others can provide SOME idea of the relative confidence of a given alignment at a given position. Furthermore, tools like T-Coffee and M-Coffee (http://www.tcoffee.org/) actually DO report confidence, and M-Coffee does so in kind of a cool way (by running many different alignments, aggregating them, and comparing agreement across the aggregate alignment).

    However, if one’s genuinely interested in assessing the robustness of their alignment, then the best current option might be utilization of fully-probabilistic Bayesian methods, which sadly may be far too computationally expensive for your example dataset…unless you have a rather large computing cluster at your disposal.

    Just my 2-cents.
