Choosing a Scripting Language for Next Generation Sequencing: Python, Perl, and More

Large amounts of data? Check. Repetitive tasks? Check. If you work with next gen sequencing data, you have probably already realized it’s a good idea to learn a scripting language.

But learning a programming language is a major endeavour, and with so many languages available, how do you decide which one to study? And once you’ve decided on a language, what is the best way to master it? Here, we break down the differences between the programming languages most commonly used in next generation sequencing projects.

We’ll start with the two most popular scripting languages in bioinformatics: Python and Perl.

Python

Today, Python is one of the most popular scripting languages for next gen sequencing projects. Relative to Perl, it has more rules and stylistic conventions. If the language had a motto, many coders feel it would be “there is only one way to do it.” Fans like this aspect of Python because it makes other people’s scripts easier to read and understand.
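
As a small illustration of that readability, here is a minimal sketch of a typical repetitive sequencing task: stepping through the reads in a FASTQ file. It assumes the simple layout of exactly four lines per record, and the function name read_fastq and the two example reads are made up for this illustration.

```python
import io
from itertools import islice


def read_fastq(handle):
    """Yield (read_id, sequence, quality) tuples from a FASTQ stream.

    Assumes exactly four lines per record (no wrapped sequences).
    """
    while True:
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if not record:
            return  # end of input
        if len(record) < 4:
            raise ValueError("truncated FASTQ record")
        header, sequence, _plus, quality = record
        if not header.startswith("@"):
            raise ValueError("FASTQ header lines must start with '@'")
        yield header[1:], sequence, quality


# Two made-up reads, purely for illustration; a real script would
# pass an open file instead of this in-memory text.
example = io.StringIO(
    "@read1\nACGTACGT\n+\nIIIIIIII\n"
    "@read2\nGGCCTA\n+\nIIIIII\n"
)

# Map each read ID to the length of its sequence.
read_lengths = {read_id: len(seq) for read_id, seq, _ in read_fastq(example)}
print(read_lengths)
```

Because the function is a generator, it handles one record at a time rather than loading the whole file, which matters when files hold millions of reads.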

Biopython is a handy and fairly comprehensive set of freely available Python tools for biological computation. Want to manipulate sequences or build a phylogenetic tree? Start here!
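
To give a flavour of what “manipulating sequences” means in practice, here is a minimal sketch in plain Python with no libraries. The helper names reverse_complement and gc_fraction and the example sequence are chosen for this illustration and only handle unambiguous DNA bases. Biopython's Seq objects offer a ready-made reverse_complement() method, so in real projects you would usually reach for the library rather than roll your own.

```python
# Complement table for unambiguous DNA bases (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence."""
    return sequence.translate(COMPLEMENT)[::-1]


def gc_fraction(sequence):
    """Return the fraction of bases that are G or C (0.0 if empty)."""
    if not sequence:
        return 0.0
    upper = sequence.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)


seq = "ATGCGC"  # made-up example sequence
print(reverse_complement(seq))
print(round(gc_fraction(seq), 3))
```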

Perl

In the 1990s, Perl was by far the most popular scripting language for handling genetic sequencing data, and there are still many coders who use it as their primary scripting language. In contrast to Python, Perl has the unofficial motto “there is more than one way to do it.” Coders typically have the flexibility to achieve the same results in many different ways. Although many appreciate this aspect of Perl, it can make someone else’s scripts harder to read and understand.

One benefit of Perl’s staying power is that the BioPerl library is more mature and larger than the Biopython library. Chances are that you can find what you want in Biopython, but if not, check out BioPerl.

Other Languages

Java

Many researchers are already familiar with Java, since this language is so commonly used in business and other fields. People comfortable using Java may find that BioJava best meets their needs. The BioJava project documents the modules that are currently available.

Is finding ways to visualize your data your primary goal? Then BioJS, an open source JavaScript framework, may be your best bet for making your data shine on the web. Note that JavaScript, despite the name, is a separate language from Java.

Ruby

Although Ruby is a less common scripting language for bioinformatics applications than Python or Perl, fans think it is more intuitive and easier to use. And, of course, it has its own BioRuby library with components for sequence analysis and more.

R

Finally, R is a popular (and free) statistical programming language. If you regularly deal with next gen sequencing data, chances are you have already used R and its free Bioconductor tools in your analyses or will need to at some point. It’s possible to write scripts in R, but many researchers find it is more efficient to use R for statistical analyses and a scripting language to write programs and manipulate data. Luckily, there are interfaces that allow R to be called from within other languages, such as Python or Perl, and vice versa.
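
One simple way to split the work as described above is to let a script do the data wrangling and hand R a tidy table. The sketch below assumes this file-based hand-off rather than one of the direct interfaces. The function name write_read_summary, the column names, and the file name read_summary.csv are all hypothetical, chosen for this illustration.

```python
import csv
import io


def write_read_summary(reads, handle):
    """Write one CSV row per read: ID, length, and GC fraction."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["read_id", "length", "gc_fraction"])
    for read_id, sequence in reads:
        upper = sequence.upper()
        gc = (upper.count("G") + upper.count("C")) / len(upper) if upper else 0.0
        writer.writerow([read_id, len(upper), f"{gc:.3f}"])


reads = [("read1", "ACGTACGT"), ("read2", "GGCCTA")]  # made-up reads

# An in-memory buffer stands in for a real file here; in practice you
# might use open("read_summary.csv", "w", newline="") instead.
buffer = io.StringIO()
write_read_summary(reads, buffer)
print(buffer.getvalue())
```

On the R side, a table written this way can be loaded with read.csv("read_summary.csv") and passed on to whatever statistical analysis you need.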

So How Do I Decide?

The unique features of each language are important considerations, but the truth is that your decision may come down to other factors.

What code everyone around you knows. Remember that you will most likely be sharing code with your colleagues, which requires you to share a common language. In addition, if you are just learning a new language, you are going to have questions – lots of them. Having experts nearby to help out will be invaluable!
The available bioinformatics libraries. Once you understand your project’s needs, you can check out libraries such as Biopython and BioPerl to see whether they already contain the code you need. This will save you from having to write your own code, and if one library has what you need and another doesn’t, your decision may be made.
Work with more than one. Due to different collaborators’ preferences and the varying strengths of each language, you may end up working with a few. Luckily, many programmers find that after mastering one language, learning another is much easier.

Hopefully, the information presented here has helped you decide which language you’d like to learn. To find out more about how to get started, please check out my upcoming article, ‘Top resources for learning an NGS programming language.’

Kristin Harper

Kristin has a PhD in Population Biology, Ecology, and Evolution, and a Master of Public Health in Global Epidemiology from Emory University.
