So, you’ve spent time planning your high-throughput sequencing experiment. You’ve chosen how many replicates to use, deliberated about sequencing depth, and kept everything RNase-free. Now you have many gigabytes of data available. What’s next?
The first step of RNA-Seq analysis is aligning your sequencing reads to a reference genome, but before you can do that you need to get your data onto a Linux server where the analysis tools run. This guide will introduce you to using Linux on the command line and teach you how to get your data onto the Linux server.
1. Get Access to a Linux Server
Some alignment tools can require large amounts of RAM and your files are large, so you’ll want to use a high-performance Linux server. Helpfully, many universities have these available for their students and staff, and they keep them updated with useful software. Some labs decide to instead maintain their own computers. Find out what your options are.
2. Connect to That Computer
You probably won’t sit in front of the server that you’re using and type on it. Instead, you’ll connect through your own computer. Doing so is easy from a Mac, and not very complicated from a Windows machine.
The server that you’re using will have its own IP address. You’ll get a user name and password for it.
For a Mac, open the Terminal application and type “ssh <username>@<IP address>”. It will then ask for your password.
For Windows, you’ll need to download what’s called an SSH client, and it will connect you. The most common one is called PuTTY. After installing and opening PuTTY, type in the server’s IP address. The port number is almost always 22. Click Open; PuTTY will connect, then ask you for your user name and password.
You’re now connected!
3. Adjust to Linux
If you’re new to Linux, getting used to the command line can take some time. Instead of clicking on folders or files to open them, you navigate by typing into the terminal. Some useful commands are:
- pwd – print the path of your current directory
- ls – list the contents of your current directory (hidden files are not shown)
- ls -lh – list contents of directory with their size, when last modified, and more information
- rm <file name> – remove specified file
- cd <location> – change directory to a specified directory
- cd ../ – change directory to one level higher
- mkdir <name> – make a new directory with the specified name
- cp <source> <destination> – copy file from source to destination
- mv <source> <destination> – move file from source to destination
- head <file name> – see first ten lines of a file
- tail <file name> – see last ten lines of a file
- cat <file name> – see entire file
- more <file name> – scroll through entire file
- gunzip – unzip files with .gz extension
- gzip – zip file to .gz extension
- tar -xzvf – extract and unzip files from file ending in .tar.gz or .tgz
You can find more commands online.
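If you’d like to try these commands before touching real data, here is a short practice session you can paste into the terminal. It is only a sketch: the file and directory names (reads, sample.txt, and so on) are made up for the exercise, and it uses three commands that are not in the list above (mktemp to make a scratch directory, printf to create a small test file, and tar -czvf to build an archive so there is something to extract).

```shell
# Work in a throwaway directory so nothing important is touched.
practice_dir=$(mktemp -d)
cd "$practice_dir"
pwd                                # print where we are now

mkdir reads                        # make a new directory
cd reads                           # move into it

# Create a 12-line test file (stand-in for real data).
printf 'line %s\n' 1 2 3 4 5 6 7 8 9 10 11 12 > sample.txt

ls -lh                             # list contents with sizes and dates
head sample.txt                    # first ten lines
tail sample.txt                    # last ten lines

cp sample.txt copy.txt             # copy a file
mv copy.txt renamed.txt            # move it (here, just a rename)
gzip renamed.txt                   # compress to renamed.txt.gz
gunzip renamed.txt.gz              # decompress back to renamed.txt
cat renamed.txt                    # print the entire file

cd ../                             # go up one level
tar -czvf reads.tar.gz reads       # pack the directory into a .tar.gz archive

mkdir unpacked
cd unpacked
tar -xzvf ../reads.tar.gz          # extract the archive here
rm reads/renamed.txt               # remove one of the extracted files
ls -lh reads                       # only sample.txt is left in this copy
```

Everything happens inside the scratch directory, so you can delete it when you are done practicing.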
4. Download Your Data
There are a couple of options for downloading data. One is to download it to your computer, then transfer it to the Linux server. The second is to download the data directly to the server.
If you first download to your computer, you’ll create a backup of your data on your own computer. An easy way to get the data from your computer to the server is the FileZilla client.
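If you would rather stay on the command line, scp is a possible alternative to a graphical client. It copies files over the same SSH connection you use to log in and is normally installed alongside ssh. The sketch below wraps it in a small helper. The user name, IP address, directory, and the helper name upload_reads are all hypothetical placeholders, and it assumes the destination directory already exists on the server.

```shell
# Hypothetical values: replace with your own user name, server address, and path.
SERVER_USER="jdoe"
SERVER_IP="192.0.2.10"      # reserved example address, not a real server
REMOTE_DIR="rnaseq/raw"     # hypothetical directory inside your home on the server

# Copy one or more local files to the server over SSH.
# scp will ask for the same password you use to log in.
upload_reads() {
  scp "$@" "${SERVER_USER}@${SERVER_IP}:${REMOTE_DIR}/"
}

# Usage (run on your own computer, not on the server):
#   upload_reads sample1_R1.fastq.gz sample1_R2.fastq.gz
```

Pasting the block only defines the helper; nothing is transferred until you call upload_reads with your own file names.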
If you download directly to the server, you save time (but remember to make a backup!). Whoever did the sequencing will provide a link to where the data are stored. While connected to the server, type in:
wget -c <link>
If you disconnect from the server during the download, reconnect, go back to the same directory, and type that command again. Your download will continue from where it ended.
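For large files on a flaky network, you could let the shell repeat the command for you. The helper below is a minimal sketch, not a standard tool: the name fetch_reads, the limit of five attempts, and the 30-second pause are all arbitrary choices. It only covers a download that stops while you are still logged in; if your connection to the server itself drops, reconnect and rerun the command as described above.

```shell
# Retry an interrupted download, resuming the partial file each time.
# fetch_reads, max_tries, and the pause length are choices made for this sketch.
fetch_reads() {
  url="$1"
  tries=0
  max_tries=5
  # wget -c resumes a partly downloaded file instead of starting over.
  until wget -c "$url"; do
    tries=$((tries + 1))
    if [ "$tries" -ge "$max_tries" ]; then
      echo "Giving up after $tries attempts: $url" >&2
      return 1
    fi
    echo "Download interrupted; retrying in 30 seconds..." >&2
    sleep 30
  done
}

# Usage, with the link from your sequencing provider:
#   fetch_reads <link>
```

Run it from the directory where you want the data to end up, since wget -c looks for the partial file in the current directory.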
At this point, you have your data and are ready to start analyzing it! Stay tuned for more advice on choosing a genome alignment tool, detecting differential gene expression, and more!
Discover some of the wonderful free software that is available for biologists using Linux.