How Does BLAST Work?

last updated: October 7, 2024

More than a pun on the explosive growth of sequencing data, BLAST makes annotation and comparison of similar sequences much easier. Created by a group at the U.S. National Center for Biotechnology Information and first published in 1990, the Basic Local Alignment Search Tool is arguably the most heavily used tool for sequence analysis (that’s available for free, anyway).

BLAST is a powerful and popular tool because it can find similarities between experimental and reference sequences (or a whole series of sequences) very quickly and accurately. There are several different types of BLAST algorithms, accessing databases for help with identifying genomes (RNA and DNA nucleotide sequences), proteins, and targeted genomic sections like SNPs or specifically targeted regions.

The BLAST sequence databases have grown steadily over the years as researchers deposit new sequences in GenBank and other public repositories, creating an ever-growing reference. This growth has only added to the accuracy and helpfulness of the databases. At the same time, NCBI has added computing power, and is now experimenting with Amazon Web Services to operate BLAST “in the cloud.”

What do you need to do to use BLAST?

Naturally, you’ll first need a computer and a sequence of something.

Going to the NCBI/BLAST website, you’ll see a number of options. Choose a species to search, or you can compare your sample against all the species in the database.

You’ll need to decide on a BLAST program:

  • To search nucleotides against nucleotides, select “blastn” or “megaBLAST” (this second option is considered the fastest).
  • To search proteins against proteins, select “blastp.”
  • “blastx” will search a protein database using your translated nucleotide query.
  • “tblastn” will do the opposite of blastx, searching a translated nucleotide database with your protein query.
  • And “tblastx” searches translated nucleotide databases with your translated nucleotide query.
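
The choice among these five programs comes down to the type of your query and the type of database you want to search. As a minimal sketch, the pairing can be written as a lookup table. The helper name `choose_program` and the type labels are hypothetical, made up for illustration; they are not part of any NCBI software.

```python
# Lookup from (query type, database type) to BLAST program name,
# following the list above. The labels are illustrative assumptions.
PROGRAMS = {
    ("nucleotide", "nucleotide"): "blastn",
    ("protein", "protein"): "blastp",
    ("translated nucleotide", "protein"): "blastx",
    ("protein", "translated nucleotide"): "tblastn",
    ("translated nucleotide", "translated nucleotide"): "tblastx",
}


def choose_program(query_type, database_type):
    """Return the BLAST program for a query/database pairing (hypothetical helper)."""
    try:
        return PROGRAMS[(query_type, database_type)]
    except KeyError:
        raise ValueError(
            f"no BLAST program for {query_type} query vs {database_type} database"
        )


if __name__ == "__main__":
    # A protein query against a translated nucleotide database
    print(choose_program("protein", "translated nucleotide"))
```

Note that a protein query against an untranslated nucleotide database is not in the table: the nucleotide side has to be translated before the two can be compared.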

There are a lot of specialized searches you can perform, too, including designing primers, finding conserved domains only, looking at immunoglobulin sequences and structures, and searching for possible vector contamination.

Once you’ve decided which BLAST program to use, the rest is web-based and very easy: just copy and paste your sequence into the right area, and fill out a few other areas per the instructions (each program is a little different, but easy to follow).

A wealth of BLAST resources

The NCBI provides so much material to get you started, it’s almost overwhelming.

Tutorials, web-based instructions, videos, and step-by-step programs can be found nearly anywhere on the BLAST site. One slightly annoying aspect of the NCBI BLAST pages, however, is the number of online courses that have been discontinued, but remain on the web sites. These same sites also contain new courses, but couldn’t an organization with a reputation for computerized prowess know how to take down a retired page?

Behind the scenes of BLAST

The NCBI estimates that about 200,000 “queries” (that’s your submission of a sequence) are made every week. Even with that load, depending on how many sequences you enter and how long those sequences are, you can get results back in a few minutes, possibly a handful of seconds.

BLAST works by finding the local alignments between sequences that score the best. The BLAST computers start by breaking your query into short, overlapping sets of letters, each called a “query word.” For proteins, a word is classically three amino acids in a specific order (for example, the amino acids M, K and V, in that order); nucleotide searches use longer words, typically 11 letters or more. The BLAST search then looks for the places along each database sequence in which a word appears. For proteins, it will also look for closely related “words,” such as those in which one letter is different. Each match is then extended in both directions and scored, and the database sequences with the highest-scoring alignments are reported as being “in the neighborhood” of your sample.
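
To make the word search concrete, here is a toy sketch in Python. It follows the simplified description above: it cuts a protein query into three-letter words, builds each word's “neighborhood” of words that differ by one letter, and reports where those words occur in a reference sequence. The function names are made up for illustration, and the one-letter-difference rule is a simplification; the real BLAST picks neighbors by scoring them against a substitution matrix, and then goes on to extend and score each match, which this sketch does not do.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def words(sequence, k=3):
    """Yield (position, word) for every overlapping k-letter word."""
    for i in range(len(sequence) - k + 1):
        yield i, sequence[i:i + k]


def neighbors(word, alphabet=AMINO_ACIDS):
    """Return the word itself plus every word differing by exactly one letter."""
    result = {word}
    for i, original in enumerate(word):
        for letter in alphabet:
            if letter != original:
                result.add(word[:i] + letter + word[i + 1:])
    return result


def find_seeds(query, subject, k=3):
    """Return sorted (query_pos, subject_pos) pairs where a query word,
    or one of its one-letter neighbors, occurs in the subject."""
    # Index every word of the subject by position for fast lookup.
    index = {}
    for pos, word in words(subject, k):
        index.setdefault(word, []).append(pos)

    seeds = []
    for qpos, word in words(query, k):
        for near in neighbors(word):
            for spos in index.get(near, []):
                seeds.append((qpos, spos))
    return sorted(seeds)


if __name__ == "__main__":
    print(find_seeds("MKV", "AAMKVAA"))  # exact word match
    print(find_seeds("MKV", "MRV"))      # match through a one-letter neighbor
```

Each pair returned is only a starting point, or “seed.” In the full algorithm, these seeds are what get extended into longer alignments.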

What results do you get?

When your BLAST search is finished, you’ll get a computerized “picture” of your results. Your “query” sequence will appear first. Below your query sequence, you’ll see a number of shorter lines, representing the reference sequences that were the most comparable to your query sequence. You’ll also get a percentage similarity estimate. Moving your mouse over the lines will show the identity of each “hit”. You’ll then be able to identify (one hopes) the species, gene or type of protein you’ve submitted for comparison.
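
One common way to express such a percentage is “percent identity”: the share of positions in an alignment where the two sequences carry the same letter. The sketch below assumes the two sequences have already been aligned to equal length, with `-` marking a gap. The function name is hypothetical, and the other statistics BLAST reports alongside this figure are not shown.

```python
def percent_identity(aligned_query, aligned_subject):
    """Percentage of alignment columns where both sequences have the same letter.

    Both inputs must already be aligned (equal length, '-' for gaps).
    Gap columns count toward the alignment length but never as matches.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_query:
        raise ValueError("alignment is empty")
    matches = sum(
        1
        for q, s in zip(aligned_query, aligned_subject)
        if q == s and q != "-"
    )
    return 100.0 * matches / len(aligned_query)


if __name__ == "__main__":
    # Three of the four aligned positions are identical
    print(percent_identity("ATCG", "ATGG"))
```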

What’s not to like?

BLAST does have a few shortcomings. Because the algorithms are making estimates of the best possible alignments, you may have errors pop up due to rare SNPs or an INDEL. There is a SNP BLAST search, however. In addition, if your query word “neighborhood” search includes too many three-letter word combinations, you’ll end up with sequences that really aren’t as similar as you hoped.
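
A little arithmetic shows how quickly a neighborhood can balloon. The sketch below counts how many three-letter protein words fall within a given number of letter differences of a query word. Counting by letter differences is a simplifying assumption made for illustration; BLAST itself decides which words are neighbors with a score threshold, but the trend is the same: the looser the rule, the more words match.

```python
from math import comb


def neighborhood_size(word_length, alphabet_size, max_mismatches):
    """Count words that differ from a given word in at most max_mismatches positions."""
    return sum(
        # choose which positions differ, then pick a different letter for each
        comb(word_length, m) * (alphabet_size - 1) ** m
        for m in range(max_mismatches + 1)
    )


if __name__ == "__main__":
    for allowed in range(4):
        print(allowed, neighborhood_size(3, 20, allowed))
```

With 20 amino acids, allowing one difference turns a single word into 58, two differences give 1,141, and three differences cover all 8,000 possible three-letter words, at which point every word matches and the search tells you nothing.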

However, NCBI is working on BLAST constantly, and it gets stronger as the databases it searches keep growing.

Andrew has been a freelance life science writer for more than 20 years. He has worked for academic institutions, startup biotechs, and major biopharmaceuticals, and as agriculture editor of the Genetic Literacy Project. He has an MS in Biotechnology from the University of Maryland, and a BA in Physical Anthropology from the University of Pennsylvania.
