When I was being trained in microbiology as an undergrad, one of the first skills I acquired was the ability to quickly compare and visualize amino acid sequences using BLAST and ClustalW. Fifteen years later, both programs have only improved: the databases behind them have grown and the user interfaces have become simpler. Dimitris Skliros put together a great article on the BLAST tool that explains its inner workings and how the system works.
Here, I hope to illustrate how to use BLAST in combination with ClustalW to answer some very practical questions about protein sequences that you may find yourself stumbling into as you learn to use these tools.
Got an Amino Acid Sequence? BLAST it
There are two versions of BLAST software you can use. Dimitris talked about the first, found on the NCBI website, in the aforementioned article. It contains a few more options and variables. For the sake of consistency, I will be using the BLAST tool found on the ExPASy website.
Once you open the site, you can easily address the most basic question: “where did this sequence come from?” Simply copy and paste your amino acid sequence into the window and click “Run BLAST.”
Upon completion, you encounter a colored, graphical representation of the similarity between your query and the proteins identified in the BLAST database. A color scale runs from green (greater similarity) to red (lesser similarity). The graphic also shows areas of significant difference.
In the example shown in Figure 1, I ran a BLAST query on an “unknown” sequence and am showing the first two returned values as an illustration. They are both green, reflecting a high level of homology. You’ll notice two different naming schemes in this figure. The first refers to the canonical isoform of RIF1, hence “RIF1_HUMAN”. The second refers to “isoform 2” of this protein and uses the accession number instead of the protein name: “Q5UIP0-2”. Also, as the program considers less-similar proteins, these bars will become increasingly red, and the bars will become increasingly broken, indicating gaps in the sequence.
Scroll down further to see the “list of the matches.” For now, we’ll just focus on the top two hits again (Figure 2). Here, the accession number, the gene name associated with this protein, and the species associated with the protein are all listed. The most important values are the “score” and the “E-value” numbers. A higher score indicates a better match between the two sequences. The E-value estimates how many matches with a score at least this good you would expect to find purely by chance in a database of this size, so the lower the E-value, the less likely the hit is a chance match and the more likely it is biologically meaningful.
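To make the relationship between score and E-value a little more concrete, here is a minimal Python sketch of the standard approximation E = m × n × 2^(−S′), where S′ is the bit score, m is the query length and n is the total length of the database. The function name and the numbers are my own, chosen purely for illustration. BLAST itself uses "effective" lengths and scoring-system parameters, so don't expect this to reproduce the values on the results page exactly.

```python
def expected_hits(bit_score: float, query_len: int, db_len: int) -> float:
    """Approximate E-value from a bit score: E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical numbers, for illustration only (not taken from a real search).
query_len = 120            # residues in the query
db_len = 200_000_000       # total residues in the database (assumed)

for bits in (30, 50, 100):
    # Each extra bit of score halves the number of hits expected by chance.
    print(f"bit score {bits:>3}: E ~ {expected_hits(bits, query_len, db_len):.3g}")
```

The point to take away is the direction of the relationship: the score sits in the exponent, so a modest increase in score shrinks the E-value dramatically, and a bigger database raises the E-value for the same score.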
Scroll even further down to find the “alignments” section, which provides the most detail available (Figure 3).
The score and E-value are restated here, but now you can see new information. This includes the “identities,” which means “these two amino acids are identical”; the “positives,” which can be read as “these two amino acids are either identical or different but with similar chemistry”; and the “gaps,” which reflect any regions that are missing from either the query or the subject sequence. In this case, the protein in question is human RIF1.
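If it helps to see how those three numbers are counted, here is a rough Python sketch that tallies identities, positives and gaps for two already-aligned strings. The `SIMILAR_PAIRS` set is a deliberately tiny, hand-picked stand-in of my own. BLAST decides what counts as a "positive" using a substitution matrix such as BLOSUM62, not a list like this, and the two peptides below are made up.

```python
# Hypothetical, simplified set of "similar chemistry" pairs (an assumption;
# BLAST uses a substitution matrix such as BLOSUM62 for this decision).
SIMILAR_PAIRS = {
    frozenset("ST"), frozenset("DE"), frozenset("KR"),
    frozenset("IL"), frozenset("IV"), frozenset("LV"), frozenset("FY"),
}


def alignment_stats(query: str, subject: str) -> dict:
    """Count identities, positives and gaps in two aligned sequences."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    identities = positives = gaps = 0
    for q, s in zip(query, subject):
        if q == "-" or s == "-":
            gaps += 1                      # residue missing on one side
        elif q == s:
            identities += 1
            positives += 1                 # identical residues count as positives too
        elif frozenset((q, s)) in SIMILAR_PAIRS:
            positives += 1                 # different residue, similar chemistry
    return {"length": len(query), "identities": identities,
            "positives": positives, "gaps": gaps}


# Made-up aligned peptides, for illustration only.
stats = alignment_stats("MSTKDL-V", "MTTRELGI")
for key in ("identities", "positives", "gaps"):
    print(f"{key}: {stats[key]}/{stats['length']} "
          f"({100 * stats[key] / stats['length']:.0f}%)")
```

Note that positives always include the identities, which is why the "positives" percentage on a BLAST report is never lower than the "identities" percentage.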
Is this Protein Found in a Different Species?
Knowing what we know now, this question is easy to answer. In Figure 2, the species is given as a short code following the gene name (e.g., HUMAN). If you see anything other than “HUMAN” in this space, you’ve answered your question.
To address specifically which species it is you’re looking at, scroll down to the view provided in Figure 3. The species name is provided on the second line in the “OS=” field.
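If you ever need to pull these pieces out of many hits at once, the naming pattern is regular enough to parse. Below is a small Python sketch that assumes a UniProt-style header of the form `sp|ACCESSION|NAME_SPECIES description OS=Organism ...`. The function name is mine, and the example header was typed by hand in that style rather than copied from the live record, so treat its field values as illustrative.

```python
import re


def parse_uniprot_header(header: str) -> dict:
    """Extract accession, isoform, entry name, species code and OS= organism."""
    m = re.match(r">?(?:sp|tr)\|([^|]+)\|(\S+)", header)
    if m is None:
        raise ValueError("not a UniProt-style header")
    accession, entry_name = m.groups()
    # Isoforms carry a numeric suffix on the accession, e.g. Q5UIP0-2.
    base, _, isoform = accession.partition("-")
    # Entry names look like GENE_SPECIES, e.g. RIF1_HUMAN.
    species_code = entry_name.rsplit("_", 1)[-1]
    # OS= runs until the next two-letter KEY= field or the end of the line.
    os_match = re.search(r"OS=(.+?)(?=\s[A-Z]{2}=|$)", header)
    return {
        "accession": base,
        "isoform": isoform or None,
        "entry_name": entry_name,
        "species_code": species_code,
        "organism": os_match.group(1) if os_match else None,
    }


# Illustrative header (hand-written in the UniProt style, an assumption).
example = ">sp|Q5UIP0|RIF1_HUMAN Telomere-associated protein RIF1 OS=Homo sapiens OX=9606 GN=RIF1"
info = parse_uniprot_header(example)
print(info["species_code"], "|", info["organism"])
```

Filtering a list of hits for a species then becomes a one-liner, for example keeping only records whose `species_code` is not `"HUMAN"`.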
Is this Protein Found Specifically in Mice? Again, How Similar is it?
Using our RIF1 example again, go back to the list of matches shown in Figure 2 and scroll down further to see if “MOUSE” is one of our options.
Yep, there it is. Notice the lower score though, so it’s not a perfect match. Let’s go see how imperfect it is…(Figure 5)
Showing the entire sequence would be silly, so we just take a look at the statistics here to see how similar it is. 70% identity is a pretty high degree of similarity, and that increases to 80% if you also count residues with similar side-chain chemistry (the “positives”). These two proteins are highly similar, but not identical.
Can I Find a Region of Sequence Novelty Between my Protein and a Group of Similar Proteins?
BLAST is a great tool for comparing one polypeptide chain to another, but it becomes unwieldy when you want to consider more than two sequences at once. In this case, use ClustalW or its successor, Clustal Omega.
For the purposes of this example, I am using the first 120 amino acids of murine RIF1, with some small changes to illustrate how the program works. “Isoform 2” contains a series of point mutations in the first 10 residues, and “isoform 3” contains a five-residue deletion from the C-terminus (Figure 6). Also, it is important to notice a small difference here when compared to BLAST. When you enter the sequences to align, each one needs a title line that starts with “>” (this is the FASTA format), followed by the sequence itself. The program won’t run without it, and it doesn’t tell you this anywhere (Figure 6).
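If you are assembling the input from several sequences, a few lines of Python will add the “>” title lines for you. This is only a sketch: the function name is mine, and the three short peptides are made-up placeholders, not the actual RIF1 sequences used in the figures.

```python
def to_fasta(records: dict, width: int = 60) -> str:
    """Format {title: sequence} pairs as FASTA text, wrapping long sequences."""
    lines = []
    for title, seq in records.items():
        lines.append(f">{title}")                  # the title line Clustal needs
        for i in range(0, len(seq), width):
            lines.append(seq[i:i + width])         # wrap at `width` residues
    return "\n".join(lines) + "\n"


# Placeholder sequences, invented for illustration only.
records = {
    "wild_type": "MSGKLTAVEDSPLR",
    "isoform_2": "MTKKLTAVEDSPLR",
    "isoform_3": "MSGKLTAVE",
}
print(to_fasta(records))
```

The printed text can be pasted straight into the input box. Titles without spaces tend to survive best in the output, since some tools truncate the name at the first space.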
Upon clicking “submit,” the program simply lines the three sequences up, giving you an opportunity to quickly identify where the sequences are different. Take a look (Figure 7):
Here, you can very clearly see that isoform 3 and the wild-type sequence are conserved in the N-terminus. You also see that isoform 2 has the mutations. The program also suggests similarity, much like BLAST does. Serine (S) and threonine (T) have similar side chain chemistry, indicated by a “:”. Glycine (G) and lysine (K) do not, hence the blank. Further, isoform 2 and the wild type have similar C-termini. The gap present in isoform 3 is indicated by a “-” in the sequence, allowing the alignment to stay oriented properly while still emphasizing the missing wild-type residues.
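The symbols under the alignment follow a simple per-column rule, which the Python sketch below imitates: “*” when every sequence has the same residue, “:” when all residues fall within one strongly similar group, “.” for a weakly similar group, and a blank otherwise (including any column with a gap). The group lists are the ones I understand Clustal to use, reproduced from memory, so treat them as an assumption and check the Clustal documentation before relying on them. The aligned peptides are made up.

```python
# Residue groups as I understand Clustal defines them (an assumption).
STRONG_GROUPS = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK",
                 "MILV", "MILF", "HY", "FYW"]
WEAK_GROUPS = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND",
               "SNDEQK", "NDEQHK", "NEQHRK", "FVLIM", "HFY"]


def column_symbol(column: str) -> str:
    """Return the conservation symbol for one alignment column."""
    residues = set(column)
    if "-" in residues:
        return " "                                   # gap: no symbol
    if len(residues) == 1:
        return "*"                                   # fully conserved
    if any(residues <= set(group) for group in STRONG_GROUPS):
        return ":"                                   # strongly similar
    if any(residues <= set(group) for group in WEAK_GROUPS):
        return "."                                   # weakly similar
    return " "                                       # not similar


def consensus_line(aligned: list) -> str:
    """Build the symbol line for a list of equal-length aligned sequences."""
    return "".join(column_symbol("".join(col)) for col in zip(*aligned))


# Made-up aligned peptides, for illustration only.
aligned = ["MSGKL", "MTKKL", "MSGK-"]
for seq in aligned:
    print(seq)
print(consensus_line(aligned))
```

In this toy alignment the S/T column gets a “:” and the G/K column is left blank, matching what you see in Figure 7.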
These examples only scratch the surface of what these two programs are capable of, but hopefully they provide a nice “on ramp” from which you can start exploring and finding creative ways to solve the actual problems you come across in your work.