The Beginners Guide to DNA Sequence Alignment

Bitesize Bio Search

Search below to delve into the Bitesize Bio archive. Here, you’ll find over two decades of the best articles, live events, podcasts, and resources, created by real experts and passionate mentors, to help you improve as a bioscientist. Whether you’re looking to learn something new or dig deep into a topic, you’ll find trustworthy, human-crafted content that’s ready to inspire and guide you.

Fortunately, those of us who have learned how to sequence know that aligning sequences is a lot easier and less time consuming than creating them. Whether you’re employing sequencing gels, Sanger-based methods, or the latest in pyrosequencing or ion torrent technologies, obtaining, manipulating and analyzing your sequences has never been easier.

We’re going to take a look at just the basics of sequence alignment to get you started.

How many sequences can I align?

You must have a minimum of 2 sequences to perform an alignment. For comparing 2 sequences you’ll need to perform a “pairwise” alignment. Most programs will align 3 or more sequences at a time and will require a different algorithm e.g. MUSCLE or one of the Clustal algorithms like ClustalW.

You can align several hundred to several thousand if you wish, but there are several factors that can make this straightforward and simple or a time hog if not impossible. First, you must choose an appropriate algorithm. For instance, the sequencing program MUSCLE can usually handle large data sets with a premium on accuracy. For some perspective, I can usually align ~750 sequences of 1000 nucleotides each in about an hour using MUSCLE. For aligning a large number of sequences, you must have sufficient computer memory and storage.

What is the difference between similarity and identity?

Identity is the degree of correlation between 2 un-gapped sequences, and indicates that the amino acids or nucleotides at a particular position are an exact match. Generally, an identity of 25% or higher suggests the potential for similarity of function; an identity of 18-25% implies similarity of structure or function. It is important to note that 2 or more completely unrelated sequences can have 20% identity or greater, so this is not a hard and fast rule. Similarity is the degree of resemblance between two sequences when they are compared, and indicates that the amino acids or nucleotides at a particular position have some properties in common (for instance, charge or hydrophobicity), but are not identical. A high percentage of similar residues can also suggest a conserved function or structure.

What is a “consensus” sequence?

A consensus sequence usually appears at the top of your alignment worktable, and each nucleotide (or amino acid) of the sequence is based on the residue that appears at that position most frequently in your aligned sequence. For instance, if you align 5 sequences, and the nucleotides at position 20 are A, A, T, A, and G, then the consensus sequence will have an A at position 20. The use of consensus sequences can be very useful when examining evolutionary relationships between sequences with high degrees of identity. It is also useful to use the consensus to identify potential gaps in your aligned sequences.

Why are gaps important?

A gap is one or more spaces in a single string of a given alignment and usually corresponds to an insertion or deletion in one or more sequences within the alignment. The insertion or deletion can be an artifact of sequencing chemistry and not indicative of the authentic DNA sequence. According to the European Bioinformatics Institute, there are several other potential explanations for:

A single mutation can create a gap (very common).
Unequal crossover in meiosis can lead to insertion or deletion of strings of bases.
DNA slippage in the replication procedure can result in the repetition of a string.
Retrovirus insertions.
Translocations of DNA between chromosomes.

How do I know my sequence data is good?

Alphabet soup. Lots of As, Ts, Cs and Gs. Regardless of your methods to obtain your sequences, the overall success and accuracy of your sequence alignments and subsequent analyses depend entirely on the quality of your sequence data. Things like solid upstream preparations, primer design and reagent quality can make you a hero….or a zero. So, in scientific terms, the quality of sequence data is directly proportional to the success and robustness of your alignments, you know, ‘garbage in – garbage out’.

In most cases, your raw data is “scored” and cleaned up by the sequencer software resulting in your finished, exportable sequence. Quality can be scored many different ways, depending on the technology and chemistry used, and utilizes criteria such as signal strength, number of contiguous nucleotides read and the ease with which each nucleotide is determined, e.g. a clean, unobstructed peak in a chromatograph. Other than As, Ts, Cs and Gs, it’s important to understand other codes that may appear in your data (click here for the complete list of IUPAC codes).

Some helpful tips

I would highly recommend getting in the habit of saving your work early and often!
Get used to FASTA file formats – you’ll need these when downloading from clearing houses like GenBank (https://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml)
In general, if you’re aligning sequences with LTR (long terminal repeats) regions, you might try deleting these regions as long as they are all identical in composition and length – this will speed up your alignments without sacrificing accuracy.
The longer your sequences, the longer the time required.

Stay tuned for the next article in this series, in which we’ll talk about the different sequencing alignment programs that are available.

You made it to the end—nice work! If you’re the kind of scientist who likes figuring things out without wasting half a day on trial and error, you’ll love our newsletter. Get 3 quick reads a week, packed with hard-won lab wisdom. Join FREE here.

Jason Garner

Jason has an MS in Cell and Molecular Biology from University of Texas at San Antonio. He has a passion for virology, immunology, infectious disease and forensics, and has been involved in business development for the past 10 years working in senior roles with Bio-Rad, Qiagen and currently, Thermo Fisher Scientific.

More 'DNA / RNA Manipulation and Analysis' articles

DNA / RNA Manipulation and Analysis

Re-cycling Electroporation Cuvettes
ByDr Nick Oswald

If you have ever worked out the price of an electroporation cuvette you will realise that, at several dollars each, they are worth recycling. Accounts on how amenable electroporation cuvettes are to recycling vary, but I find that as long as you treat them well it is possible to use single cuvette many times. It’s…

Read More Re-cycling Electroporation Cuvettes
DNA / RNA Manipulation and Analysis

Three Important Things to Check After Obtaining Your Plasmid
ByBhargava Morampalli

You have a new plasmid, now what do you do? You are excited to go further with your project. But before you can move on, you have to confirm the presence of your insert as well as the sequence and orientation of the insert. Is the insert the right size? Most people use restriction enzymes…

Read More Three Important Things to Check After Obtaining Your Plasmid
DNA / RNA Manipulation and Analysis

DEPC: The Wicked Witch of RNA?
ByVicki Doronina

If you have ever worked with RNA, you know about DEPC (diethylpyrocarbonate). You add it to water at a concentration of 0.1%, shake or stir, incubate at 37°C for two hours or at room temperature overnight and, as if targeted by a magic bullet, the RNAses that may have been in the water are gone….

Read More DEPC: The Wicked Witch of RNA?
DNA / RNA Manipulation and Analysis

CRISPR Genome Editing: What You Need to Know to Get Started
ByPete Jones

Get some ideas on what CRISPR can do for you and what using it involves.

Read More CRISPR Genome Editing: What You Need to Know to Get Started
DNA / RNA Manipulation and Analysis

Multiple Fragment Ligation: The Why and How
ByHeinz Reiske

You may be familiar with standard single fragment ligations, but did you know you can ligate multiple fragments into your vector all at the same time! Discover how to perform multiple fragment ligation, including the different methods and troubleshooting tips for when things go wrong.

Read More Multiple Fragment Ligation: The Why and How
DNA / RNA Manipulation and Analysis

RNA Isolation from Drosophila – Don’t Let the Cuticle Scare You!
ByKayly Lembke

Isolating RNA from either Drosophila larvae or adult heads can seem a bit daunting, primarily due to the presence of the cuticle. The cuticle is a protective exoskeleton comprised of insoluble collagens, cuticlins, glycoproteins, and lipids. While it may take some force to remove the cuticle, you can do this easily and without compromising your…

Read More RNA Isolation from Drosophila – Don’t Let the Cuticle Scare You!