“It’s them! Blast them!” So shouts a stormtrooper in STAR WARS Episode IV on spotting Han Solo and Princess Leia in the Death Star. Could any scientist back in 1977 have guessed that this phrase would one day apply not to the Rebel Alliance but to the modern era of bioinformatics? The Basic Local Alignment Search Tool, or BLAST, is the most popular online bioinformatics tool, one that biologists around the globe are using. But in order to fully exploit BLAST and use it properly, you have to understand that it is not just a search query against a database but something far more complicated, and that the information you get back is far more… precious!
History of BLAST
But first things first: The National Center for Biotechnology Information (NCBI) introduced the BLAST algorithm in 1990, accompanied by a publication from Altschul et al. That version had limited functions, but at the same time it was valuable for all those researchers who had newly sequenced amino acid (aa) or nucleotide (nt) chains and no idea how to compare those sequences against already existing databases. Back then, BLAST performed gap-free alignments only, and the output provided p values so that researchers could evaluate the significance of a result. In 1997 BLAST took a remarkable step towards the future of bioinformatics, when a gapped version was introduced along with the PSI-BLAST algorithm. Since then, new tools have been added frequently.
What’s the science behind BLAST?
Although most people think BLAST compares any given query directly against the GenBank database in real time, that’s not the case! Interestingly, BLAST works with sequence collections that have been converted into BLAST databases, a “secret” format that makes searching less time-consuming and more reliable. BLAST also runs several independent comparisons on your query in order to generate different results regarding taxonomy of the organism, structure, protein domain, or sequence title. At the same time it compares the query against the sequence database, which is a “heavier” workload, and then assembles the corresponding results. This is why, when you run a typical BLAST search, some information such as domain families or taxonomy arrives faster, and you can even interact with it until the sequences and their alignment scores appear.
But of course not all queries are the same. This is why there are different BLAST algorithms for different types of datasets.
The anatomy of BLAST
Databases
Naturally, the algorithm you are going to use is not the only question you have to ask yourself. The database you are going to search against is also something you have to think about. Before you BLAST a query you can choose which database you want to BLAST against (A). Although the most common option is the non-redundant (nr) database, which includes all the major existing databases, it’s not your only option. By choosing the right one you can significantly reduce the time of your BLAST search and increase the quality and specificity of the results you get back. To make the correct choice, click on the question mark (B), which reveals more detailed information about the database you have selected. Finally, if you already know which organism your query comes from, or which organism you want to BLAST against, you can specify it in the organism section (C), which will reduce the time and the scale of the BLAST operation.
BLAST with Software
Most of the time you have more than one query you want to identify. And when I say more than one, I do not mean two, but THOUSANDS. For example, RNA-sequencing results (ESTs) or proteins predicted after a de novo assembly can leave you with thousands of unidentified queries. By using software such as Blast2GO and a stable internet connection, you can search many queries simultaneously and automatically, with only one limitation… time!
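If you prefer scripts to web forms, the same three choices (program, database, organism) can be written down as request parameters. The sketch below only assembles the form body for a submission to the NCBI BLAST URL API and sends nothing. The function name `build_blast_request`, the example peptide, and the choice of `swissprot` as a smaller database are my own assumptions for illustration, so check the current NCBI documentation before relying on any of them.

```python
from urllib.parse import urlencode

# Public endpoint of the NCBI BLAST URL API (the request is NOT sent here).
BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def build_blast_request(sequence, program="blastp", database="nr", organism=None):
    """Return the URL-encoded form body for submitting one BLAST search.

    Hypothetical helper for illustration: it only assembles parameters.
    """
    params = {
        "CMD": "Put",          # submit a new search
        "PROGRAM": program,    # e.g. blastn, blastp, blastx, tblastn
        "DATABASE": database,  # nr, or a smaller and more specific database
        "QUERY": sequence,
    }
    if organism:
        # Restrict the search to one organism with an Entrez query.
        params["ENTREZ_QUERY"] = f"{organism}[Organism]"
    return urlencode(params)


# Made-up peptide, searched against a smaller database and a single organism.
body = build_blast_request("MKTAYIAKQR", database="swissprot", organism="Homo sapiens")
print(BLAST_URL)
print(body)
```

Leaving out `organism` drops the Entrez restriction, which corresponds to searching the whole database.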
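To give a feel for what batch tools do before they start searching, here is a minimal sketch that reads a multi-FASTA text and groups the records into batches. It is not how any particular package works internally; the helper names `read_fasta` and `batches`, the contig names, and the batch size are all made up for illustration.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def batches(records, size):
    """Yield consecutive groups of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


# Tiny made-up example; a real assembly would contain thousands of records.
example = """>contig_1
ATGGCGTACG
TTAGC
>contig_2
GGGCCCAAAT
>contig_3
ATATATAT
"""

records = read_fasta(example)
groups = list(batches(records, size=2))
for number, group in enumerate(groups, start=1):
    print(f"batch {number}: {[name for name, _ in group]}")
```

Each batch could then be handed to whatever search backend you use, one group at a time, so a failed run only costs you one batch and not the whole dataset.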
Different queries, different species of BLAST
In order to perform a BLAST search properly, it is wise to ask yourself what kind of information you are interested in. There is an entire “arsenal” of BLAST tools, which I’ll describe below from the most to the least common.
BLASTN
Compares nucleotide sequences against the nucleotide sequences of the chosen database(s). The database generally depends on the species you are studying.
BLASTP
Compares protein sequences against protein sequences from different species (databases). Other search types (BLASTX and TBLASTN) have been built on this more complicated algorithm over the years.
BLASTX
Compares the six possible translation frames of a nucleotide sequence against amino acid sequences.
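To make the “six frames” concrete, here is a small sketch that translates a nucleotide sequence in the three forward frames and the three frames of the reverse complement, using the standard genetic code. It shows the idea only and is not BLASTX itself; the function names and the short example sequence are my own.

```python
from itertools import product

# Standard genetic code, listed in TCAG order for all three codon positions.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons (e.g. with N) become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to translations."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frames("ATGGCCTAA")
for label, protein in frames.items():
    print(label, protein)
# +1 MA*
# +2 WP
# +3 GL
# -1 LGH
# -2 *A
# -3 RP
```

Only one of these frames is usually the biologically meaningful one, which is why all six are compared against the protein database.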
TBLASTN
It is like the opposite of BLASTX: it takes an amino acid query and compares it against nucleotide sequences from various species or builds, with those nucleotide sequences translated in all six reading frames.
MegaBLAST
Compares nucleotide sequences against nucleotide sequences, but its algorithm is optimized for identifying very similar sequences, for example from putatively related species. It searches for at least one exact match of 28 bases and then tries to extend it into a full alignment.
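The “exact match first, alignment second” idea can be sketched in a few lines. The code below indexes every word of the subject, looks up each query word, and then stretches a hit in both directions for as long as the bases stay identical. That last step is a deliberate simplification of mine, since the real program scores its extensions and tolerates mismatches and gaps. The function names are hypothetical, and the demo uses a word size of 6 only so that the toy sequences can produce a hit.

```python
def find_seeds(query, subject, word_size=28):
    """Return (query_pos, subject_pos) pairs where a whole word matches exactly."""
    index = {}
    for i in range(len(subject) - word_size + 1):
        index.setdefault(subject[i:i + word_size], []).append(i)
    seeds = []
    for q in range(len(query) - word_size + 1):
        for s in index.get(query[q:q + word_size], []):
            seeds.append((q, s))
    return seeds


def extend_exact(query, subject, q, s, word_size):
    """Grow a seed left and right while bases stay identical (simplified)."""
    left = 0
    while (q - left > 0 and s - left > 0
           and query[q - left - 1] == subject[s - left - 1]):
        left += 1
    right = word_size
    while (q + right < len(query) and s + right < len(subject)
           and query[q + right] == subject[s + right]):
        right += 1
    # Query start, subject start, and length of the identical stretch.
    return q - left, s - left, left + right


query = "TTACGTACGGA"
subject = "GGACGTACGTT"
seeds = find_seeds(query, subject, word_size=6)
print(seeds)  # [(2, 2), (3, 3)]
hit = extend_exact(query, subject, *seeds[0], word_size=6)
print(hit)    # (2, 2, 7)
```

A long word size means few random hits, which is what makes this strategy fast for near-identical sequences and blind to distant ones.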
PSI-BLAST
Position-Specific Iterated BLAST first performs a BLASTP comparison with your amino acid sequence. Using that information, it creates a matrix with the probability of a related sequence possessing the same residues at the same positions. This is called a Position-Specific Scoring Matrix (PSSM), and basically, PSI-BLAST searches the databases for protein sequences that score well against the PSSM and returns these sequences as results. It can help find distant evolutionary relationships. It is the most sensitive approach for finding protein sequences similar to your query.
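A toy version of a PSSM helps to see what “position specific” means. The sketch below counts residues in each column of a small made-up alignment, turns the counts into log-odds scores, and scores candidate sequences against the matrix. The uniform background frequencies and the flat pseudocount are simplifying assumptions of mine; PSI-BLAST itself uses sequence weighting and more elaborate pseudocounts.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def build_pssm(aligned, pseudocount=1.0):
    """Build a log-odds PSSM (one dict per column) from equal-length sequences."""
    background = 1.0 / len(ALPHABET)  # assumption: uniform background
    pssm = []
    for column in zip(*aligned):
        counts = Counter(column)
        total = len(column) + pseudocount * len(ALPHABET)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return pssm


def score(pssm, sequence):
    """Sum the position-specific scores of a sequence of the same length."""
    return sum(column[aa] for column, aa in zip(pssm, sequence))


# Made-up alignment: position 1 is always M, the other positions vary a little.
aligned = ["MKTA", "MKSA", "MRTA", "MKTG"]
pssm = build_pssm(aligned)

for candidate in ("MKTA", "MRSG", "WWWW"):
    print(candidate, round(score(pssm, candidate), 2))
```

The consensus-like sequence scores highest, a variant made of rarer residues scores lower, and an unrelated sequence scores negative, because every one of its residues is rarer in its column than the background would predict.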
RPS-BLAST
Reverse Position-Specific BLAST acts as the complement and opposite of PSI-BLAST by comparing your query to the PSSMs of NCBI’s Conserved Domain Database (CDD). It is a more sensitive way of identifying conserved domains in proteins than standard BLAST searching. It compares a query sequence to the PSSM profiles of conserved protein domains, whereas PSI-BLAST uses a PSSM profile to return protein sequences.
DELTA-BLAST
Domain Enhanced Lookup Time Accelerated BLAST was created in order to detect remote protein homologs faster than IMPALA used to. It was also designed to match your protein sequence to similar PSSMs. In contrast to RPS-BLAST, DELTA-BLAST uses the subset of a pre-constructed PSSM database that matches your protein sequence and then searches against a protein-sequence sub-database to yield better homology detection. The algorithm was released in 2012, and its main objective is to cut search time by a factor of 10 or even 100.
This is only an introduction to how BLAST works and how you can exploit it! There are still many tools you can use to enhance your search. Do not forget to think carefully about what you are looking for and what your final goal is.
Then, every answer will be at your fingertips.
For more information you can read the official help page of NCBI.
REFERENCES
1. Altschul SF et al. Basic local alignment search tool. J Mol Biol. 1990 Oct 5;215(3):403-10.