Introduction to Linux for High-Throughput Sequencing Analysis

So, you’ve spent time planning your high-throughput sequencing experiment. You’ve chosen how many replicates to use, deliberated about sequencing depth, and kept everything RNase-free. Now you have many gigabytes of data available. What’s next?

The first step of RNA-Seq analysis is aligning your sequencing reads to a reference genome, but before you can do that, you need to get your data onto a Linux server where the analysis tools run. This guide introduces the Linux command line and shows you how to get your data onto the server.

1. Get Access to a Linux Server

Some alignment tools require large amounts of RAM, and sequencing files are large, so you’ll want to use a high-performance Linux server. Helpfully, many universities make these available to students and staff and keep them updated with useful software. Some labs instead maintain their own computers. Find out what your options are.

2. Connect to That Computer

You probably won’t sit in front of the server that you’re using and type on it. Instead, you’ll connect through your own computer. Doing so is easy from a Mac, and not very complicated from a Windows machine.

The server that you’re using will have its own IP address. You’ll get a user name and password for it.

For a Mac, open the Terminal application and type “ssh <username>@<IP address>”, replacing the placeholders with your own details. It will then ask for your password.
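As a rough sketch of what that command looks like once the placeholders are filled in, the snippet below assembles it from three variables and prints it. The user name “jdoe” and the address 192.0.2.10 are made-up placeholders (the address comes from a range reserved for documentation), so swap in the details you were given.

```shell
#!/bin/sh
# Hypothetical login details: replace these with the ones for your server.
SERVER_USER="jdoe"          # assumed user name
SERVER_HOST="192.0.2.10"    # assumed IP address (reserved documentation range)
SERVER_PORT=22              # the usual SSH port

# Assemble the command and print it so you can check it before running it.
SSH_COMMAND="ssh -p ${SERVER_PORT} ${SERVER_USER}@${SERVER_HOST}"
echo "$SSH_COMMAND"
```

Typing the printed line into the terminal starts the connection. The “-p 22” part is optional, since ssh uses port 22 unless told otherwise.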

For Windows, you’ll need to download what’s called an SSH client, which makes the connection for you. The most common one is called PuTTY. After installing and opening PuTTY, type the server’s IP address into the Host Name field. The port number is almost always 22. Click Open; PuTTY will connect and then ask for your user name and password.

You’re now connected!

3. Adjust to Linux

If you’re new to Linux, getting used to the command line can take some time. Instead of clicking on folders or files to open them, you navigate by typing into the terminal. Some useful commands are:

pwd – print the path of your current directory
ls – list the contents of your current directory
ls -lh – list the contents of a directory with their size, last-modified date, and more information
rm <file name> – permanently remove the specified file (there is no recycle bin)
cd <location> – change directory to a specified directory
cd ../ – change directory to one level higher
mkdir <name> – make a new directory with the specified name
cp <source> <destination> – copy file from source to destination
mv <source> <destination> – move file from source to destination
head <file name> – see first ten lines of a file
tail <file name> – see last ten lines of a file
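cat <file name> – print the entire file to the screen
more <file name> – scroll through a file one screen at a time
gunzip <file name> – decompress a file with the .gz extension
gzip <file name> – compress a file, adding the .gz extension
tar -xzvf <file name> – extract and decompress the contents of an archive ending in .tar.gz or .tgz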
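To get a feel for these commands without risking real data, you could try them in a scratch directory. The session below is only a practice sketch: the file and directory names are invented, and “mktemp -d” and “tar -czf” (which create a temporary directory and a compressed archive, respectively) are not in the list above and are used here just to set up something to work on.

```shell
#!/bin/sh
set -e

# Work in a throwaway directory so nothing important can be touched.
WORKDIR=$(mktemp -d)
cd "$WORKDIR"
pwd                          # print where we are

mkdir practice               # make a new directory
cd practice                  # move into it

# Create a small 12-line text file to experiment with (hypothetical content).
printf 'read %s\n' 1 2 3 4 5 6 7 8 9 10 11 12 > reads.txt
ls -lh                       # list the file with its size and date

head reads.txt               # first ten lines
tail reads.txt               # last ten lines

cp reads.txt backup.txt              # copy the file
mv backup.txt reads_backup.txt       # rename (move) the copy

gzip reads_backup.txt                # compress: creates reads_backup.txt.gz
gunzip reads_backup.txt.gz           # decompress: restores reads_backup.txt

cd ../                               # go up one level
tar -czf practice.tar.gz practice    # bundle the directory into an archive
mkdir restored
cd restored
tar -xzvf ../practice.tar.gz         # extract the archive here
rm practice/reads_backup.txt         # delete one extracted file for good
ls practice                          # only reads.txt remains in this copy
```

Because everything happens inside the temporary directory, you can delete it afterward or simply leave it.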

You can find more commands online.

4. Download Your Data

There are a couple of options for downloading data. One is to download it to your computer, then transfer it to the Linux server. The second is to download the data directly to the server.

If you first download to your computer, you’ll create a backup of your data on your own computer. An easy way to get the data from your computer to the server is the FileZilla client, which can transfer files over SFTP using the same user name and password you use to log in.

If you download directly to the server, you save time (but remember to make a backup!). Whoever did the sequencing will provide a link to where the data are stored. While connected to the server, type in:

wget -c <link>

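If you disconnect from the server during the download, reconnect, go back to the same directory, and type that command again. As long as the site hosting your data supports resuming, the download will continue from where it stopped.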
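One way to make sure you always resume in the right place is to wrap the download in a small helper that moves into a fixed directory first. This is only a sketch: the function name “download_data”, the example link, and the directory are all invented for illustration.

```shell
#!/bin/sh
# download_data <link> <directory>
# Hypothetical helper: downloads into a fixed directory so that re-running it
# after a disconnect resumes the same partial file.
download_data() {
    link=$1
    dest=$2
    (
        # The parentheses run these steps in a subshell, so your own
        # current directory is unchanged when the function returns.
        mkdir -p "$dest" &&
        cd "$dest" &&
        wget -c "$link"
    )
}

# Example call with a made-up link and directory (uncomment and edit to use):
# download_data "https://example.org/run1/sample_R1.fastq.gz" "$HOME/sequencing_data"
```

If the connection drops, running the same call again picks the download back up in that directory.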

At this point, you have your data and are ready to start analyzing it! Stay tuned for more advice on choosing a genome alignment tool, detecting differential gene expression, and more!

Erin Wissink

Erin gained a PhD in Biochemistry and Molecular Biology from Cornell University. She is currently a Postdoctoral Associate at Cornell University.