If you’re a bench biologist who has decided to incorporate next-generation sequencing (NGS) bioinformatics into your research, you might be feeling overwhelmed as you try to process the data. I felt that way when I did gene expression analysis for the first time, but I quickly learned that NGS bioinformatics is not only powerful but can also be fun! Here are some insights to keep in mind while you navigate these data-ridden waters. (If you’re interested in becoming a bioinformatician, Asmaa Ali’s article on the topic is a great place to start.)
To Code or Not to Code?
Coding is no longer a prerequisite for performing NGS bioinformatics in your research, unless bioinformatics is going to be your profession or make up the bulk of your research. If it isn’t, consider the advantages and disadvantages of going the coding route.
The Benefits of Learning to Code
Coding the calculations yourself will give you a lot of control. In addition, the software that requires you to code is mostly open source and free of charge, which makes your research cheaper. But maybe you’ve never used a programming language before, and code reads like gibberish to you.
Fortunately, the websites for Python and R, the programming environments most commonly used for bioinformatics, contain resources to get you started. The Python site has step-by-step tutorials, both as written instructions and as audio/video clips, and the R site has manuals and a book list.
If books are your thing, this book is great for getting started with Bioconductor, a collection of software packages for high-throughput genomics that is mostly distributed as add-on packages for R. For more instruction, MIT OpenCourseware, UCLA Coding Boot Camp, and Codecademy are just a few examples of places where you can take coding courses online.
Graphical User Interfaces Allow You to Avoid Coding
If you’d like to skip the coding aspect of NGS bioinformatics, you (or your company) might want to invest in a graphical user interface (GUI) program – i.e., one that doesn’t require a coding language. If you’re in academia, your university likely already has subscriptions to such programs and it won’t cost you or your lab a penny to use them; you’d just need to apply for a login account.
For instance, to help me analyze my RNA-seq data, my university provided me with Partek Flow and Partek Genomics Suite, as well as a wonderful program from Qiagen called Ingenuity Pathway Analysis (IPA), which yields a downstream analysis of the biological functions affected by the genes or transcripts in your data.
If you’re in industry, AltAnalyze, Apache Taverna, and UGENE are free alternatives to the Partek suite. You might also talk your boss into investing in software from Partek or Qiagen, or make the investment yourself if you own the company. Ultimately, it’s a judgment call, depending on your company’s finances, how satisfied you are with open-source software, and the extent to which you intend to use bioinformatics going forward.
Learn Some Linux Commands
A final note about code: even if you find you don’t need code for your NGS bioinformatics, I recommend learning at least a few Linux commands, if only to set up the file transfers you’ll need to get started. Even GUI programs sometimes have a few early steps at the command line. For instance, say you are doing an entire NGS project on Partek Flow. Before the NGS bioinformatics even begins, you will likely need to transfer the raw sequencing data files from the server that performed the sequencing to your computer or (more likely) to a cloud service for storing data and running processes that take up large amounts of space. Transferring the files at a Linux command line minimizes the chance of the process freezing mid-transfer, as is often seen with GUIs (Google Drive, for one, is notorious for file downloads freezing after the first 2.0 GB).
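Whichever way you move the files, it is worth confirming that they arrived intact before you start any analysis. The snippet below is a minimal sketch in Python rather than a prescribed workflow. It assumes your sequencing provider gives you an MD5 checksum for each file, which you should confirm with them; the function names `file_md5` and `transfer_is_intact` are made up for illustration.

```python
import hashlib


def file_md5(path, chunk_size=1024 * 1024):
    """Return the MD5 hex digest of a file.

    The file is read in chunks so that multi-gigabyte sequencing files
    never have to fit in memory all at once.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Keep reading until read() returns an empty bytes object (end of file).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def transfer_is_intact(path, expected_md5):
    """Compare the local file's checksum with the one reported by the sender.

    Whitespace and letter case in the expected value are ignored.
    """
    return file_md5(path) == expected_md5.strip().lower()
```

You would call `transfer_is_intact` with the path to a downloaded file and the checksum string you were given. If it returns `False`, transfer that file again before doing anything else with it.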
The Linux command line is readily accessible on a Mac or a computer running a Linux operating system, and with a client app such as PuTTY it is also easy to reach the command line of a remote Linux server from a Windows PC. Maker Pro provides one of many resource pages for learning Linux commands, which I have found to be much easier to remember than other programming languages. Besides, a bit of experience typing in commands will ready your brain for learning more code if you do decide to wade deeper into the wonderful world of computation one day.
Gather Your Resources
Whichever path of programs you choose, your learning curve will be steep in the beginning. Fortunately, many of the sources for learning programming languages also teach bioinformatics. If you’re at a university, your library is likely to have people whose job it is to connect you to the bioinformatics programs available at the school and teach you how to use them. Beyond the library, the makers of GUI programs, like the programming language sites, provide a lot of instruction on their own websites. Qiagen has a particularly impressive and still growing list of in-depth tutorial videos on ways to use IPA as well as the company’s other bioinformatics tools. AltAnalyze, Apache Taverna, and UGENE also come with their own tutorials. Free bioinformatics help is available at Biostars.
Always remember: you’ll have a lot of big and little questions that you can always type word for word into a search engine. Chances are, it’ll refer you to a forum where someone else asked the same question and got answers! Personally, I Google shamelessly for all my projects, bioinformatics and otherwise. Of course, you could always ask the question yourself on Quora, Reddit (particularly at the r/learnbioinformatics subreddit), or ResearchGate, but if you start with Google, you could get your answer right away.
More resources for teaching yourself NGS analysis are listed here and, for learning a coding language, here and here.
Your Project, Your Interpretation, Your Decisions
So you’ve run your NGS bioinformatics analyses and ended up with huge tables and tangles of networks. What now? Remember that you ultimately want to tell a verifiable story, not just cough up a bunch of numbers.
Make Sure Your Analysis Algorithms Were Sound
First off, it’s imperative to check that your in-silico data are reliable and not just artifacts of the combination of programs you happened to choose. Take the step in which you normalize your gene counts, for example. Which algorithm did you use? Another algorithm could have generated a completely different list of statistically significant genes. It helps if your project involves many knowns alongside your unknowns. If you’re checking a cell population that you know expresses certain cytokines more than another cell population, but you have no idea whether it also expresses more enzymes, check the cytokines. Do their expression patterns, under the algorithms you used, match those seen in the literature? If not, you’ll have to try a different algorithm (and the normalization step is usually a good place to start trying changes).
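To see how much the normalization step can matter, here is a toy sketch in Python. The counts are invented for illustration, and both functions are deliberately simplified: `cpm` scales each sample by its total count, while `median_of_ratios` follows the general idea behind the size factors used by tools such as DESeq2, without claiming to reproduce any package's exact behavior.

```python
import math
from statistics import median

# Invented raw counts: geneD is massively induced in sample_2,
# while the other three genes have identical raw counts in both samples.
counts = {
    "sample_1": {"geneA": 100, "geneB": 200, "geneC": 300, "geneD": 400},
    "sample_2": {"geneA": 100, "geneB": 200, "geneC": 300, "geneD": 4400},
}


def cpm(counts):
    """Counts per million: divide each count by the sample's total count."""
    normalized = {}
    for sample, genes in counts.items():
        total = sum(genes.values())
        normalized[sample] = {g: c / total * 1e6 for g, c in genes.items()}
    return normalized


def median_of_ratios(counts):
    """Divide each sample by a size factor: the median, over genes, of the
    ratio between the sample's count and the gene's geometric mean."""
    samples = list(counts)
    genes = list(counts[samples[0]])
    geo_mean = {}
    for g in genes:
        values = [counts[s][g] for s in samples]
        if all(v > 0 for v in values):  # genes with a zero count are skipped
            geo_mean[g] = math.exp(sum(math.log(v) for v in values) / len(values))
    normalized = {}
    for s in samples:
        size_factor = median(counts[s][g] / geo_mean[g] for g in geo_mean)
        normalized[s] = {g: counts[s][g] / size_factor for g in genes}
    return normalized


def fold_change(normalized, gene, a="sample_1", b="sample_2"):
    """Ratio of a gene's normalized expression in sample b over sample a."""
    return normalized[b][gene] / normalized[a][gene]


print("geneA fold change, CPM:", fold_change(cpm(counts), "geneA"))
print("geneA fold change, median of ratios:",
      fold_change(median_of_ratios(counts), "geneA"))
```

With these made-up numbers, geneA looks about five-fold lower in the second sample under CPM, because geneD inflates that sample's total, yet it looks unchanged under the median-of-ratios approach. Neither answer is automatically the right one, which is why checking your knowns against the literature is so useful.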
You may end up running three different algorithms for your NGS analysis, running functional analyses on all the results, and ending up with Venn diagrams showing a relatively small list on which they all agree! And that might already be your story (for now) if little is known about what you’re studying. If, again, you have a lot of knowns, let them be your guidepost and they’ll save you a lot of repetition. Your unknowns are where you’ll want to explore.
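If you do code, finding the overlap at the center of that Venn diagram takes only a few lines. This is a minimal sketch in Python; the gene lists are placeholders standing in for the significant-gene lists you would export from each analysis, and `consensus_genes` is a name chosen for this example.

```python
def consensus_genes(*gene_lists):
    """Return the genes that appear in every one of the given lists."""
    gene_sets = [set(genes) for genes in gene_lists]
    if not gene_sets:
        return set()
    return set.intersection(*gene_sets)


# Placeholder results from three hypothetical analysis pipelines.
method_1 = ["IL6", "TNF", "CXCL8", "IL1B"]
method_2 = ["TNF", "IL6", "IL10"]
method_3 = ["IL6", "TNF", "IFNG", "IL1B"]

shared = consensus_genes(method_1, method_2, method_3)
print(sorted(shared))
```

Genes that show up in only one or two of the lists are not necessarily wrong, but they deserve a closer look before you build benchwork on them.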
Getting Creative With the Unknowns
Your NGS data will almost certainly include genes whose expression patterns were not previously known in your tissue(s) of interest. Use them as a springboard for the next chapter of your project! Maybe you love endocytosis and it’s never been studied in the tissue your samples are from. Does your functional analysis show you statistically significant patterns for that? Sometimes you’ll get the most interesting answers when you pull from your data rather than let your data push at you.
Don’t worry if what you picked isn’t what has the highest-magnitude change in your data. It’s YOUR story, and your experiments will tell you whether you’re onto something or not. Combine poring over your tables with some good old-fashioned paper-reading about genes that have piqued your interest, and start planning that benchwork! After all, bioinformatics is a hypothesis generator, not a decision-maker. The latter role belongs to you! (Well, your PI, too, if you have one.)
Time Management Is Key
Bioinformatics calls for as much time planning as hands-on science with lengthy experiments. You never know when you’ll have to repeat an analysis with one or two changes (see above). Some analysis steps take hours or even days to run (in the background while you do other stuff). Bugs are common, and programs are often down while administrators fix them or make upgrades. So, as with pretty much any sizeable task, set aside more time than you think you’ll need and don’t wait until you’re near a deadline to start.
As you’ve seen, bioinformatics is fascinating, widely useful, and surprisingly accessible. The keys to succeeding in it are to find the right resources for you and never stop steering the big data ship. Good luck with your projects!