Basic Local Alignment Search Tool (BLAST) remains the cornerstone of biological sequence analysis. When users have one or few queries, the data are easy to interpret based on the graphical output that the search provides. However, when you have to analyze hundreds or a few thousand queries, running graphical outputs can be computationally intensive, difficult to analyze and time consuming.
Command line warriors on Unix and Linux platforms can get data cracking using text editors and programming languages, but what if you don’t have the training, time or resources to do that for the job at hand?
All BLAST webservers offer the data in text editable formats such as *.csv files which are amenable to quicker analysis. So in this small guide, I am going to show you how to set up a local BLAST database in BioEdit, do a BLAST run using parameters of your choice, and then use Microsoft Excel Pivot tables and charts for analysis and quick BLAST summaries. For a sequence file of a few hundred sequences, the whole exercise can be done in less than an hour (depending on your workstation configuration and personal inclination of course!)
Choose your sequences
For the purpose of this exercise, I am creating a database of human tRNAs in BioEdit. BioEdit is freeware that provides an extremely easy way to do sequence analysis on the Windows platform. So download the latest version of BioEdit and install it on your PC.
We next need two nucleotide/protein files: one acting as our database and the other as our query. I am going to download the Mus musculus (mouse) and human tRNA sequences in fasta format from genomic tRNA database (Figure 1). Save the files in a folder of your choice or under C:\BioEdit\database. We will use the mouse sequences as the query and human tRNA sequences as the database.
Figure 1. Downloading tRNA sequences
Create a database
Fire up BioEdit and you should see the interface shown in Figure 2a. Next, open the individual tRNA fasta files in BioEdit and create a database of human tRNA sequences.
To create a human tRNA database locally, go to Accessory Application→BLAST→Create a local nucleotide database file (Figure 2b). Open the human tRNA fasta file and BioEdit will format the file using the formatdb utility from NCBI and create a local nucleotide database. Verify that the database has been created by going to Accessory Application→BLAST→Local BLAST. The database should be visible in the drop down menu (Figure 2c).
Figure 2. Installing a local database in BioEdit
Search for matches
Next, you are going to use the batch BLAST functionality of BioEdit to search for matches to the human tRNAs.
Open the query mouse tRNA sequences file in BioEdit via File->Open. When you open the file in BioEdit, the sequence file headers are very long and the BLAST output may not be easy to analyze and filter. If needed, shorten the titles. To do this, select all sequences using Edit->Select All sequences. Go to Edità Copy sequence titles. Paste the data in an Excel spreadsheet and use the Text to columns under Data tab and separate data as shown in Figure 3. Copy the first column, which now contains the sequence names and go back to BioEdit and paste the titles over the existing names. This will rename the sequences with the shortened titles. Save the file (Figure 3).
Figure 3. Using Excel to shorten sequence names. Click to open full size image in a new window.
We are now ready to start the BLAST search. Select all sequences using Ctrl+A or Edit→Select all sequences. BioEdit will ask/warn you that more than one sequence is selected and if you want to do a batch BLAST. Select Yes and it will open the BLAST interface. Change parameters for the BLAST matrix and click Do Search. The blastall.exe program will run, and depending upon the parameters set, the size of the query and the processing power of your machine, you will eventually see an output that resembles that in Figure 4.
Figure 4. BLAST output summary. Click to open full size image in a new window.
Move to excel
The output is a tab delimited text file that can be copy-pasted or imported into Excel. The column headings are shown below each column. Insert a row above the first row in the Excel spreadsheet and enter the column headings to enable quick data analysis and filtering.
You could use Excel’s built-in filtering to sort and look at the data that you want to analyze, but a better way is to utilize Pivot tables to get a summary of the data. In this particular example, in which we used a batch BLAST search, the output table contains multiple subject hits for each query sequence and hence the number of rows that will be filled up can easily exceed a few thousand, making filtering laborious. Pivot tables are excellent tools to get an overview of the data without needing to filter the data.
Use pivot tables
Let’s assume that you have already pasted/imported the output text to an excel spreadsheet and inserted column names. To begin data analysis, select the data range by going to the last cell with data value in it (Ctrl+Shift+Rt arrow and then down arrow) and then click on Insert and select Pivot Table. Excel will ask you about the data range and where to put the Pivot table (Figure 5). You can insert it into a new sheet, but I like to have a summary in the data sheet itself. Choose any empty cell (say O2) and click OK.
Figure 5. Inserting Pivot tables in Excel
Let’s say you want to find how many subject hits of the alignment length >15 were found for each query in this output. Drag the Alignment Length field code in the Filter box (top right), drag the query field code below it. Then Drag subject into the rows box. Drag subject from the top field codes into the values box as shown in Figure 6a below. The Pivot table will update to show you the summary of the data. To filter hits that are >15 nt long, click on the pull down arrow next to Alignment Length and select the length matches of interest. You can also select one or few queries if further data filtering is needed.
Figure 7. Data filtering with Excel Pivot Tables. Click to open full size image in a new window.
Interpreting the data
As you can see, the Pivot table gives you a count of hits for each sequence that was used as a BLAST query. Double clicking on the individual row labels or the corresponding count opens a new worksheet that has details of all the hits for that particular sequence. Thus, Pivot Tables enable easy summary of the BLAST run while also enabling in depth analysis if needed. In addition you can create additional filters in the data range (e.g., gene names or genomic coordinates, etc.) and use Pivot tables to filter hits that match a certain gene or a genomic region.
While this is a quick and easier way of summarizing BLAST data, this is not the only one. This method is useful for small to medium size projects; if your query sequences run into several hundred thousand you are better off creating database tables with programs like Microsoft Access.
Though my method still needs user intervention and analysis, it speeds the process significantly, enabling a quick data summary before digging deep.
As a biologist you will no doubt have seen Enzyme Commission (EC) numbers. An EC number is group of four numbers separated by periods in papers discussing enzymes…something like this: EC 126.96.36.199. But do you know what these numbers mean? Or where they came from? Or why we use them? If not, I will aim […]
It’s great to have you in the Bitesize Bio family! We’ve sent you an email to confirm your registration. Please click on the link in the email or paste it into your browser to finalize your registration.
For more information on how to use Bitesize Bio, take a look at the following image (click it, for a larger version)
An error occured while registering you, please reload the page and try again