Most ‘wet lab’ biologists do not have much computer programming experience, which can make downstream analysis of next generation sequencing results a bit daunting. After the sequencing platform spits out your data, what do you do with it? That’s where Galaxy comes in.
What is Galaxy?
Galaxy is a bioinformatics workflow management system, created through a collaboration between Penn State University and Emory University. It is a collection of software packages that can be operated via a web browser on a public server. The graphical user interface means no knowledge of code is needed. Galaxy is also ‘open source’. If you are unfamiliar with this term, it means that the underlying code is publicly available and can be edited and improved by its users, rather than being proprietary, as is the case with Windows or Adobe software. The Galaxy project is written in the programming language Python, and the open source format means it is constantly being developed by the user community.
What can Galaxy do?
The list of modules is enormous! Summarizing them all in a single blog post is likely impossible, so I recommend perusing the site to see how Galaxy can work for your specific needs. The site contains many tutorials to orient new users.
Among its many functions is a Next Generation Sequencing Toolbox, which allows the user to convert between various sequence file formats such as Text, Tabular, SFF, FASTA, and FASTQ for Sanger, 454, and Illumina platforms. You can filter your datasets by quality score, trim characters from either end of your reads, sort by multiplexing identifier (or ‘barcode’), search for specific character strings in datasets, and run full statistical reports on your datasets showing traits such as average read length, read distribution, quality score distribution, over-represented sequences, and much more.
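For readers curious about what a quality filter or an end trim actually does to a read, here is a minimal Python sketch. It is not Galaxy's code; the function names (parse_fastq, trim, passes_quality), the Phred+33 quality encoding, and the cutoff of 20 are all assumptions chosen for illustration.

```python
from statistics import mean

def parse_fastq(text):
    """Yield (name, sequence, quality) from simple four-line FASTQ records."""
    lines = text.strip().splitlines()
    for i in range(0, len(lines), 4):
        yield lines[i][1:], lines[i + 1], lines[i + 3]

def phred_scores(quality, offset=33):
    """Convert a quality string to numeric scores (assumes Phred+33 encoding)."""
    return [ord(ch) - offset for ch in quality]

def trim(seq, qual, left=0, right=0):
    """Remove a fixed number of characters from each end of a read."""
    end = len(seq) - right
    return seq[left:end], qual[left:end]

def passes_quality(qual, min_mean=20):
    """Keep a read only if its mean quality score reaches the cutoff."""
    if not qual:
        return False
    return mean(phred_scores(qual)) >= min_mean

# Two made-up reads: the first has high quality scores, the second very low ones.
example = """@read1
ACGTACGTAC
+
IIIIIIIIII
@read2
ACGTACGTAC
+
##########
"""

kept = [name for name, seq, qual in parse_fastq(example) if passes_quality(qual)]
print(kept)
```

Galaxy's own tools do the same kind of thing behind the graphical interface, with more options and proper handling of the different FASTQ quality encodings.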
Benefits of Galaxy: cost and accessibility
Because Galaxy is open source and browser based, it can be accessed by anyone and is free of charge. You can use the public server, or if you have Mac OS or Linux, install Galaxy on your personal computer. With the high cost of proprietary sequence analysis software, Galaxy provides a clear cost benefit to labs operating on a tight budget.
As mentioned above, it doesn’t require programming knowledge, but if you can use and understand Python, you can help maximize its utility by developing it further.
Benefits of Galaxy: reproducibility
Galaxy allows the creation of saved ‘workflows’ that can be carried out on different datasets in exactly the same way each time. You can easily try different modules to optimize your workflow, and once you have settled on one, Galaxy will extract the key information from it and recreate it exactly so it can be applied to future datasets.
What are the drawbacks of Galaxy?
Galaxy has some limitations. Because it emphasizes usability and simplicity, it is often not suitable for workflows containing loops.
From personal experience, I can also say that its barcode parsing feature does not allow barcodes of different lengths. For example, if your list of barcodes contains 5mers and 8mers, you have to run the barcode splitter twice: once for the 5mers and once for the 8mers. Additionally, it cannot merge paired-end reads that overlap in the middle; it can only join two sets of reads end to end. Users should also be warned that the public server is often very busy during the week, so jobs and data uploads may sit in the queue for many hours before running. The server is typically faster on evenings and weekends.
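If you do know a little Python, the mixed-length barcode case can be handled outside Galaxy in a single pass. The sketch below is only an illustration under a few assumptions: barcodes sit at the very start of each read, matching is exact, and the names (split_by_barcode, sampleA, sampleB) and sequences are made up. Longer barcodes are tried first so that an 8mer is not mistaken for a 5mer that happens to share its first five bases.

```python
def split_by_barcode(reads, barcodes):
    """Sort reads into bins by a leading barcode, allowing mixed barcode lengths.

    reads    -- list of read sequences (strings)
    barcodes -- dict mapping a sample name to its barcode sequence
    """
    # Try longer barcodes first so a short barcode cannot shadow a longer one.
    ordered = sorted(barcodes.items(), key=lambda item: len(item[1]), reverse=True)
    bins = {name: [] for name in barcodes}
    bins["unmatched"] = []  # assumes no sample is itself named "unmatched"
    for read in reads:
        for name, barcode in ordered:
            if read.startswith(barcode):
                # Store the read with its barcode removed.
                bins[name].append(read[len(barcode):])
                break
        else:
            bins["unmatched"].append(read)
    return bins

# Hypothetical barcodes: one 5mer and one 8mer sharing the same first five bases.
barcodes = {"sampleA": "ACGTA", "sampleB": "ACGTATTG"}
reads = ["ACGTATTGCCCC", "ACGTAGGGG", "TTTTTTTT"]
result = split_by_barcode(reads, barcodes)
print(result)
```

A real splitter would also need to tolerate sequencing errors in the barcode and carry the quality scores along with each read, which this sketch leaves out.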
Galaxy is a handy tool for laboratory biologists dabbling in bioinformatics, or for those processing NGS data who have not been privileged to earn a computer science degree. Tools such as Galaxy are helping to bridge the gap between computer science and biology.