Large amounts of data? Check. Repetitive tasks? Check. If you work with next gen sequencing data, you have probably already realized it’s a good idea to learn a scripting language.
But learning a programming language is a major endeavour, and with lots of languages available how do you decide which one to study? And once you’ve decided on a language, what is the best way to master it? Here, we break down the differences between the programming languages most commonly used in next generation sequencing projects.
We’ll start with the two most popular scripting languages in bioinformatics: Python and Perl.
Today, Python is one of the most popular scripting languages for next gen sequencing projects. Relative to Perl, it has more rules and stylistic conventions. If the language had a motto, it would be a line from the Zen of Python: “There should be one, and preferably only one, obvious way to do it.” Many coders like this aspect of Python, because it makes other people’s scripts easier to read and understand.
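To give a feel for how readable a short Python script can be, here is a minimal sketch of two common sequence chores: reverse-complementing a read and computing its GC content. The function names are our own, chosen for illustration; they are not taken from Biopython or any other library, and the sketch assumes plain DNA strings made up of A, C, G, and T.

```python
# Hypothetical helpers for illustration; not part of any published library.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence.

    Characters other than A, C, G, T (for example N) are left unchanged.
    """
    return seq.translate(COMPLEMENT)[::-1]


def gc_content(seq: str) -> float:
    """Return the fraction of bases that are G or C (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    upper = seq.upper()
    return (upper.count("G") + upper.count("C")) / len(seq)


if __name__ == "__main__":
    read = "ATGCGTTA"  # made-up example read
    print(reverse_complement(read))
    print(gc_content(read))
```

For real projects, a library such as Biopython likely already covers tasks like these, so a hand-rolled version is mainly useful as a learning exercise.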
In the 1990s, Perl was by far the most popular scripting language for handling genetic sequencing data, and there are still many coders who use it as their primary scripting language. In contrast to Python, Perl embraces the motto “there is more than one way to do it.” Coders typically have the flexibility to achieve the same results in many different ways. Although many appreciate this aspect of Perl, it can make someone else’s scripts harder to read and understand.
One benefit of Perl’s staying power is that the BioPerl library is larger and more mature than the Biopython library. Chances are that you can find what you want in Biopython, but if not, check out BioPerl.
Many researchers are already familiar with Java, since this language is so commonly used in business and other fields. People comfortable using Java may find that BioJava best meets their needs. The BioJava documentation describes the modules currently available.
Although Ruby is a less common scripting language for bioinformatics applications than Python or Perl, fans think it is more intuitive and easier to use. And, of course, it has its own BioRuby library, with components for sequence analysis and more, described in the BioRuby documentation.
Finally, R is a popular (and free) statistical programming language. If you regularly deal with next gen sequencing data, chances are you have already used R and its free Bioconductor tools in your analyses or will need to at some point. It’s possible to write scripts in R, but many researchers find it is more efficient to use R for statistical analyses and a scripting language to write programs and manipulate data. Luckily, there are interfaces that allow R to be called from within other languages, such as Python or Perl, and vice versa.
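As a rough sketch of that division of labor, the Python snippet below tallies read lengths from FASTQ data and writes a tab-separated table that R can then load for the statistics. It assumes simple four-line FASTQ records (no wrapped sequence lines), and the function names and sample reads are made up for illustration.

```python
import csv
import io
from collections import Counter


def read_lengths(fastq_handle):
    """Yield the length of each read in a four-line-per-record FASTQ stream."""
    for line_number, line in enumerate(fastq_handle):
        if line_number % 4 == 1:  # the sequence is the second line of each record
            yield len(line.strip())


def write_length_table(fastq_handle, out_handle):
    """Write a two-column, tab-separated table: read length and number of reads."""
    counts = Counter(read_lengths(fastq_handle))
    writer = csv.writer(out_handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["length", "count"])
    for length in sorted(counts):
        writer.writerow([length, counts[length]])


if __name__ == "__main__":
    # Made-up reads held in memory; a real script would open a FASTQ file instead.
    example_fastq = io.StringIO(
        "@read1\nACGT\n+\nIIII\n"
        "@read2\nACGTAC\n+\nIIIIII\n"
        "@read3\nTTGA\n+\nIIII\n"
    )
    table = io.StringIO()
    write_length_table(example_fastq, table)
    print(table.getvalue(), end="")
```

If the table were saved to a file, say a hypothetical read_lengths.tsv, R's read.delim function could load it as a data frame, ready for plotting or modeling.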
So How Do I Decide?
The unique features of each language are important considerations, but the truth is that your decision may come down to other factors:
- What language everyone around you knows. You will most likely be sharing code with your colleagues, which requires a common language. In addition, if you are just learning a new language, you are going to have questions, and lots of them. Having experts nearby to help out will be invaluable!
- The available bioinformatics libraries. Once you understand your project’s needs, you can check out libraries such as Biopython and BioPerl to see whether they already contain the code you need. This will save you from having to write your own code, and if one library has what you need and another doesn’t, your decision may be made.
- Whether you will need more than one. Due to different collaborators’ preferences and the varying strengths of each language, you may end up working with a few. Luckily, many programmers find that after mastering one language, learning another is much easier.
Hopefully, the information presented here has helped you decide which language you’d like to learn. To find out more about how to get started, please check out my upcoming article, ‘Top resources for learning an NGS programming language.’