Working with large datasets can be frustrating and time-consuming. If only there were more tools out there to simplify things without needing to invest a PhD’s worth of time to learn how to use them! I am here to tell you that there is a solution, and a free one at that.
If you are working with high-throughput techniques that produce large datasets, you might have heard about the R programming language. Maybe you even have some colleagues who use it, but they told you that it is quite complicated and now you are too scared to give it a chance.
In this article I will give you some tips to lose the fear and start taking advantage of this extremely useful tool. In the following articles we will give you step-by-step instructions for using R to analyze your data.
What is R anyway?
R is a programming language that is widely used for statistics and graphics. It is also becoming very popular in biology thanks to the Bioconductor project (http://www.bioconductor.org), which provides R-based tools for the analysis of biological data.
Why is R so convenient?
- It is completely free
- It is open source, so it is constantly checked by its users (it is so widely used that any bug or error in the program is reported quickly)
- It is very useful for dealing with large amounts of data because it doesn’t require a high-powered computer (have you ever tried to work with a 20,000-row list in Excel?)
- It provides you with high-quality graphics
- It has a big community of users, so you can easily get support online
Very nice, but how does it help me with my research?
Hundreds of packages are available from the Bioconductor project (http://www.bioconductor.org/packages/release/bioc/). You can, for example:
- Get a list of strongly regulated genes from your microarray data
- Cluster time-series data
- Do a pathway or gene ontology analysis of any list of genes or proteins
- Get an idea of which transcription factors might be regulated, based on a list of regulated genes
- Check the quality of several types of data (sequencing, mass spectrometry, flow cytometry, microarrays…)
- Do an automated analysis of high-throughput qPCR data
- Create and simulate a mathematical model (Boolean, Bayesian…)
- Perform any statistical test with your data (that’s why R was created in the first place)
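To give a feel for the last point, here is a minimal sketch of a statistical test in base R. The two groups are made-up numbers generated on the spot (the names `control` and `treated` are just for illustration), so treat it as a toy example rather than a recipe for your own data.

```r
# Made-up measurements for two hypothetical groups (not real data)
set.seed(42)                          # makes the random numbers reproducible
control <- rnorm(10, mean = 5, sd = 1)
treated <- rnorm(10, mean = 7, sd = 1)

# Two-sample t-test; by default R runs the Welch version,
# which does not assume equal variances in the two groups
result <- t.test(treated, control)

# The result is a list: pick out the pieces you need
result$p.value     # p-value of the test
result$estimate    # mean of each group
```

Typing `result` on its own prints a full summary of the test, which is usually the quickest way to see everything at once.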
There have to be some bad things as well…
- You can do simple things “easily”, but it’s not intuitive. Be patient – it may take a couple of days before you get things working.
OK, you convinced me! How can I start?
If you have the opportunity to take a short introductory course at your university, don’t hesitate to do it. The instructors will guide you through the first steps and help you when you get your first error messages (this is normal and part of the fun of starting).
But don’t worry: if you have no choice but to start on your own, there are several tools that can help you. My favorite is RStudio (http://www.rstudio.com), which makes using R much more intuitive and user-friendly. Simple operations, like loading a data file, are built into the program so that you can do them with just one click (instead of typing a whole command line).
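For the curious, this is roughly what the typed version of “loading a data file” looks like in base R. The sketch below first creates a made-up 20,000-row table and saves it to a temporary file so that it runs on its own; the column names and values are invented for illustration, and with your own data you would simply pass your file's path to `read.csv()`.

```r
# Build a hypothetical table with 20,000 rows of made-up values
set.seed(1)
example_data <- data.frame(
  gene    = paste0("gene", 1:20000),
  control = rnorm(20000, mean = 10, sd = 1),
  treated = rnorm(20000, mean = 10, sd = 1)
)

# Save it as a temporary CSV file so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
write.csv(example_data, csv_path, row.names = FALSE)

# The typed equivalent of loading a file through a menu
loaded <- read.csv(csv_path)

head(loaded)   # show the first six rows
dim(loaded)    # number of rows and columns
```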
Moreover, there are several sites with free R tutorials for beginners:
So now you are ready to show off your “computer programming skills” to your colleagues who are still too afraid to try!