# An Easy Way to Start Using R in Your Research: Exploratory Data Analysis

As you've probably guessed from our previous articles, Introducing R and the Basic R Tutorial, we think the R programming language and RStudio are great tools for data analysis and figure production. And now we are about to prove it!

So, you've collected some data and are pretty sure you know what statistical test you wish to do on it (you did think about this when designing the experiment, right? Right?). But first, you want to take a look at the data, get a good feel for it and make sure you can proceed with confidence.

Time for some exploratory data analysis!

Last time in our venture into RStudio, we learned how to save data as a variable, make a boxplot, summarize the data, do a t-test, and get help. Today we'll build on that a bit: we'll make a bunch of plots and try to understand what they're telling us. In the process, we'll learn about a very useful R object: the data.frame.

Okay, a little refresher. How did we enter data last time? We used c(x1, x2, x3, x4, ..., xn) to concatenate all the x's into a vector, and saved it as a variable using the equal sign, like:

**x = c(1,4,2,-3,4.6)**

and recall that we put that code into a script and ran it by highlighting it and hitting control-enter, then watching what happens in the command window. (In these articles, commands to enter are bolded and what you see in the command window is indicated in blue text).

Well, what if we want to look at more than one group of data? We could do what we did last time, and do something like:

**group1 = c(1,4,2,-3,4.6)**

**group2 = c(7,4,5,12,4,2)**

and this works perfectly well for some things, like a t-test, which is set up to take data as two variables:

**t.test(group1,group2)**

as well as boxplots:

**boxplot(group1,group2)**

But soon enough you'll run into situations that start to get unwieldy doing it that way (in fact, right now).

Let's imagine a somewhat realistic experiment. Let's pretend you were studying two insect species populations, *Hertle* and *Gertie* (don't ask me). Both of these species are predators and eat a third species, *Deadie*. *Hertle* is usually better at exploiting *Deadie* than *Gertie* is, and so you'd expect to find *Hertle* more often than *Gertie*. However, *Gertie* can eat a second prey, *Pappamaw*, so it isn't fully outcompeted by *Hertle*.

Let's pretend you went to a bunch of different places and counted *Hertle*, *Gertie*, and *Pappamaw*. Complicating matters further, you also went to two different types of environments, grasslands and stream-beds, and are interested in these environments' effect on the insects' relationships.

How the heck are we going to represent this data in R in a way that is not too confusing??

Let me introduce *the dataframe*. A dataframe is a kind of object in R (like a variable) that is like an Excel spreadsheet. You can have data in a grid with column titles and you can access the information in those columns. We want one that looks like this:

|    | site | env_type  | hertle | gertie | pappamaw |
|----|------|-----------|--------|--------|----------|
| 1  | a    | grass     | 0      | 26     | 6        |
| 2  | b    | grass     | 2      | 14     | 4        |
| 3  | c    | grass     | 4      | 15     | 3        |
| 4  | d    | grass     | 2      | 22     | 4        |
| 5  | e    | streambed | 0      | 34     | 8        |
| 6  | f    | streambed | 5      | 6      | 2        |
| 7  | g    | streambed | 2      | 2      | 3        |
| 8  | h    | streambed | 8      | 12     | 0        |
| 9  | i    | streambed | 5      | 8      | 2        |
| 10 | j    | streambed | 10     | 0      | 0        |
| 11 | k    | streambed | 22     | 0      | 0        |
| 12 | l    | streambed | 42     | 1      | 0        |
| 13 | m    | grass     | 30     | 0      | 3        |
| 14 | n    | grass     | 59     | 6      | 0        |
| 15 | o    | grass     | 35     | 8      | 0        |

There is an important thing to notice about this setup: it's in what's known as "long" format. This means that each row is a unique observation, and all the measurements and variables associated with that observation are in different columns. This is the format that many, many R functions assume your data will be in, so I recommend always storing your data this way.

How do we build this data frame object in R? Today we'll do it the simplest, if not the most efficient, way that I know: using multiple c() commands and then putting them all together.

**site = c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o")**

**env_type = c("grass", "grass", "grass", "grass", "streambed", "streambed", "streambed", "streambed", "streambed", "streambed", "streambed", "streambed", "grass", "grass", "grass")**

**hertle = c(0,2,4,2,0,5,2,8,5,10,22,42,30,59,35)**

**gertie = c(26,14,15,22,34,6,2,12,8,0,0,1,0,6,8)**

**pappamaw = c(6,4,3,4,8,2,3,0,2,0,0,0,3,0,0)**

You may notice something new here: non-numeric data (i.e. character data) was put together using the c() command but with quotes around each string of characters (single or double quotes both work in R). This tells R that the information is a string of characters, and not the name of a variable. For example, if I tried:

**site = c(a,b,c,d)**

I would get an error because R would look for variables in the workspace named a, b, c, and d to put together into site, but wouldn't find them.

Now to put it all together into a data.frame, we use the data.frame() command:

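**my_data = data.frame(site=site,env_type=env_type,hertle=hertle,gertie=gertie,pappamaw=pappamaw,stringsAsFactors=TRUE)**

Now your data is all nicely put together into one object! You can click on it in the workspace to see it nicely arranged. Or type **my_data** into the console to have it show up there. The stringsAsFactors=TRUE argument asks R to store the text columns as factors (more on those in a moment); R versions before 4.0 did this by default, while newer versions keep text as character strings unless told otherwise. I should note that while I made the column titles the same as the variables, that is not necessary. The general way to think about putting together a data.frame is like this: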

Final_dataframe_name = data.frame(column1_title = column1_data, column2_title = column2_data, ...)
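As a compact variation on the steps above, here is a minimal sketch that builds the same table in a single data.frame() call, without the intermediate variables. The shortcuts letters[1:15] and rep() are my own substitutions for the typed-out c() vectors, and stringsAsFactors = TRUE is an assumption made so that the text columns come out as factors on any R version.

```r
# Build the example data set in one call (same values as in the post).
my_data = data.frame(
  site     = letters[1:15],  # built-in constant: "a" through "o"
  env_type = c(rep("grass", 4), rep("streambed", 8), rep("grass", 3)),
  hertle   = c(0, 2, 4, 2, 0, 5, 2, 8, 5, 10, 22, 42, 30, 59, 35),
  gertie   = c(26, 14, 15, 22, 34, 6, 2, 12, 8, 0, 0, 1, 0, 6, 8),
  pappamaw = c(6, 4, 3, 4, 8, 2, 3, 0, 2, 0, 0, 0, 3, 0, 0),
  stringsAsFactors = TRUE    # assumption: store text columns as factors
)

dim(my_data)    # number of rows, then number of columns
names(my_data)  # the column titles
```

Every column must have the same length (15 here), otherwise data.frame() stops with an error, which is a handy early check for typing mistakes.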

You can do a few neat commands to take a look at the data.frame. Let's try a few:

**head(my_data,5)**

head() shows you the first 5 observations in the data.frame called my_data.

**tail(my_data, 6)**

Like head(), but tail() shows the last observations. The number can be any size. I recommend using these to make sure your dataframe looks like you expect it to.

**str(my_data)**

This one is very important. It shows you the "structure" of your data, in other words, what types of data R thinks are in each column of the data.frame. Right now it says:

> str(my_data)

'data.frame': 15 obs. of 5 variables:

$ site : Factor w/ 15 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...

$ env_type: Factor w/ 2 levels "grass","streambed": 1 1 1 1 2 2 2 2 2 2 ...

$ hertle : num 0 2 4 2 0 5 2 8 5 10 ...

$ gertie : num 26 14 15 22 34 6 2 12 8 0 ...

$ pappamaw: num 6 4 3 4 8 2 3 0 2 0 ...

which is telling you that my_data is a data.frame with 15 observations (rows) of 5 variables (columns). Then it tells you the type of each variable and shows the first handful of values. The hertle, gertie and pappamaw columns are numbers, as expected. The site and env_type columns, on the other hand, are "factors." This is what we want them to be if we want to be able to separate the data based upon these variables. Don't worry if this is confusing; it will become clearer as you work more with R. A good rule of thumb is that if a variable represents different treatments, it makes sense to store it as a factor data type.
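If str() reports a column as chr instead of Factor, it can be converted after the fact. The sketch below is only an illustration under that assumption: it deliberately builds a cut-down copy of the data with stringsAsFactors = FALSE, checks the column types, and then converts env_type with factor().

```r
# A cut-down copy of the data, deliberately built with text kept as character.
my_data = data.frame(
  site     = letters[1:15],
  env_type = c(rep("grass", 4), rep("streambed", 8), rep("grass", 3)),
  hertle   = c(0, 2, 4, 2, 0, 5, 2, 8, 5, 10, 22, 42, 30, 59, 35),
  stringsAsFactors = FALSE
)

# One class name per column: env_type is "character" at this point.
col_types = sapply(my_data, class)
print(col_types)

# Convert the treatment column to a factor and inspect its levels.
my_data$env_type = factor(my_data$env_type)
levels(my_data$env_type)

# Count how many observations fall into each level.
env_counts = table(my_data$env_type)
print(env_counts)
```

The order of the levels matters later on: as.numeric() on a factor returns the position of each value in levels(), not anything about the text itself.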

Remember the summary() command which we used on a single variable last time? Well, we can use it on data.frames as well, and this is extremely useful to get a sense of your data!

> summary(my_data)

| site      | env_type    | hertle        | gertie        | pappamaw      |
|-----------|-------------|---------------|---------------|---------------|
| a :1      | grass :7    | Min. : 0.00   | Min. : 0.00   | Min. :0.000   |
| b :1      | streambed:8 | 1st Qu.: 2.00 | 1st Qu.: 1.50 | 1st Qu.:0.000 |
| c :1      |             | Median : 5.00 | Median : 8.00 | Median :2.000 |
| d :1      |             | Mean :15.07   | Mean :10.27   | Mean :2.333   |
| e :1      |             | 3rd Qu.:26.00 | 3rd Qu.:14.50 | 3rd Qu.:3.500 |
| f :1      |             | Max. :59.00   | Max. :34.00   | Max. :8.000   |
| (Other):9 |             |               |               |               |

For the factor variables, it tells you (at least some of) the variable "levels" and how many observations are in each level. For example, for env_type it says there are 7 grass and 8 streambed observations. In site, it shows that there is 1 observation in each of the levels shown, plus 9 observations in other levels. For hertle, gertie, and pappamaw you get a five-number summary (min, median, max, and quartiles) as well as the mean. From here, you can see that the range of population sizes differs between species, with hertle having a much larger maximum population size than the other species. You can also see that the mean is similar to the median within the gertie and pappamaw data, but that they are very different for hertle. Comparing the mean and the median is very useful: it can tell you if your data has a chance at being symmetric and normal (but certainly does *NOT* tell you that for sure!).
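The mean-versus-median comparison can also be pulled out directly rather than read off the summary() printout. This is a small sketch, assuming only the three count columns are needed; the names counts and gap are mine.

```r
# Just the three count columns from the example data.
counts = data.frame(
  hertle   = c(0, 2, 4, 2, 0, 5, 2, 8, 5, 10, 22, 42, 30, 59, 35),
  gertie   = c(26, 14, 15, 22, 34, 6, 2, 12, 8, 0, 0, 1, 0, 6, 8),
  pappamaw = c(6, 4, 3, 4, 8, 2, 3, 0, 2, 0, 0, 0, 3, 0, 0)
)

# sapply() applies a function to each column and returns one value per column.
col_means   = sapply(counts, mean)
col_medians = sapply(counts, median)

# A large positive gap hints at a long right tail (a few very big counts).
gap = col_means - col_medians
print(round(gap, 2))
```

The gap is by far the largest for hertle, which matches what the summary table suggests: a handful of sites with very large counts pull its mean well above its median.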

Let's spend the rest of the time pulling out specific pieces of the data.frame and plotting them using R's base graphics. Note: this is not going to show off some of the great reasons to store your data in a data.frame; that will come later when we really get into data manipulation and advanced plotting with some non-standard libraries. For now, it's best to start working with data.frames because more familiarity with them will make things *much* easier later on.

First, how do we get the data from a single column, for example, the hertle data?

There are a few ways, none of which is inherently better. The two methods I use most commonly look like this:

> my_data$hertle

[1] 0 2 4 2 0 5 2 8 5 10 22 42 30 59 35

and

> my_data[,"hertle"]

[1] 0 2 4 2 0 5 2 8 5 10 22 42 30 59 35

The first method, using the dollar sign, works on data frames (and lists). Putting a column title after a dollar sign after the data frame's name gives you the information in that column.

The second method, which uses square brackets, works on any R object that has rows and columns, such as matrices and data.frames (plain vectors use square brackets too, but with a single index and no comma). What is happening here is that we are saying what rows we want (all of them) and what columns we want (the one named "hertle"). The confusing thing about R is that if you want *everything* in a category, you don't give the square brackets any information. In other words, because there is nothing before the comma, that means we want all the rows.

**data_frame_name[rows_you_want,columns_you_want]** # will return the part of data_frame_name that is just the rows and columns you want.

We don't even have to use a name. Since we know the hertle data is in the third column, we could instead do:

**my_data[,3]**

What if we just wanted rows 1-5 of the hertle data? Easy! We would do this:

**my_data[1:5,3]**

The colon says "first and last and everything between," so this command says we want all the rows between 1 and 5 of column three, returning this:

> my_data[1:5,3]

[1] 0 2 4 2 0
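To tie the indexing forms together, here is a short sketch that checks they return the same values and shows one small extension, asking for two columns at once by name. The variable names first_five, two_cols and third_row are only for illustration.

```r
# The first three columns of the example data are enough for this sketch.
my_data = data.frame(
  site     = letters[1:15],
  env_type = c(rep("grass", 4), rep("streambed", 8), rep("grass", 3)),
  hertle   = c(0, 2, 4, 2, 0, 5, 2, 8, 5, 10, 22, 42, 30, 59, 35),
  gertie   = c(26, 14, 15, 22, 34, 6, 2, 12, 8, 0, 0, 1, 0, 6, 8),
  stringsAsFactors = TRUE
)

# Three ways to get the same column as a plain vector.
by_dollar = my_data$hertle
by_name   = my_data[, "hertle"]
by_number = my_data[, 3]

# Rows 1 to 5 of column three.
first_five = my_data[1:5, 3]

# Several columns at once: the result is a smaller data.frame.
two_cols = my_data[1:5, c("hertle", "gertie")]

# Nothing after the comma means "all columns" of row 3.
third_row = my_data[3, ]
print(third_row)
```

Selecting by name is usually the safer habit, because my_data[, 3] silently points at a different column if the column order ever changes.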

Let's leave the data manipulation at that for now; in later lessons we'll introduce more complicated manipulations.

Let's now use R's base graphics to look at some important exploratory data plots: boxplots, histograms, and scatterplots.

We made a boxplot in the last post so I won't go into detail explaining them here, but typing in the following should show you a boxplot of your data.

**boxplot(my_data$hertle,my_data$gertie,names=c("hertle", "gertie"),ylab="N",main="Population sizes")**

Here is a histogram:

**hist(my_data$hertle,xlab="population size",ylim=c(0,20),xlim=c(0,70),breaks=10)**

The parameters of hist shown here are the data, the x-axis label (xlab), the minimum and maximum that the y-axis shows (ylim), the minimum and maximum that the x-axis shows (xlim), and roughly how many bars I want the data broken into (breaks; R treats a single number here as a suggestion and may choose a slightly different count). Note that ylim and xlim both take in vectors of length 2, which I assign within the hist command using c().

It was important to specify ylim and xlim so that when we call the same command for gertie, the data are shown within the same range. In a later post, we'll use a plotting package that takes care of all of that for you.

Here's the same thing for gertie:

**hist(my_data$gertie,xlab="population size",ylim=c(0,20),xlim=c(0,70),breaks=10)**

Now, it would be nice to have both plots show simultaneously. To do that, we'll first have to tell R that when we plot, we want to plot 2 things. I'm not going to go into the details of the command that does this, except to show you how to say how many rows and columns of plots we want. We use the command:

**par(mfrow = c(1,2))**

which tells R to make 1 row and 2 columns. In other words, it reads it like this:

par(mfrow= c(num_rows,num_columns))

If we do:

**par(mfrow=c(1,2))**

**hist(my_data$hertle,xlab="population size",ylim=c(0,20),xlim=c(0,70),breaks=10)**

**hist(my_data$gertie,xlab="population size",ylim=c(0,20),xlim=c(0,70),breaks=10)**

We should see the two histograms side by side.

Now it's easy to see that hertle mostly has tiny population sizes but a long tail, whereas gertie has fewer tiny population sizes and a less extreme tail.
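One optional refinement to the side-by-side histograms: since breaks=10 is only a suggestion, the two panels can end up with different bar widths. A sketch of one way around that, assuming bins of width 10 are acceptable, is to hand hist() an explicit vector of break points and to restore the plotting layout afterwards. The names bins, old_par, h_hertle and h_gertie are mine.

```r
hertle = c(0, 2, 4, 2, 0, 5, 2, 8, 5, 10, 22, 42, 30, 59, 35)
gertie = c(26, 14, 15, 22, 34, 6, 2, 12, 8, 0, 0, 1, 0, 6, 8)

# Explicit bin edges shared by both panels: 0, 10, 20, ..., 70.
bins = seq(0, 70, by = 10)

# par() returns the previous settings, so they can be restored later.
old_par = par(mfrow = c(1, 2))

# hist() also returns the bin counts it drew, which can be inspected.
h_hertle = hist(hertle, breaks = bins, xlab = "population size",
                ylim = c(0, 20), main = "hertle")
h_gertie = hist(gertie, breaks = bins, xlab = "population size",
                ylim = c(0, 20), main = "gertie")

# Back to one plot per window.
par(old_par)

print(h_hertle$counts)
print(h_gertie$counts)
```

With identical bin edges, bar heights in the two panels can be compared directly, and resetting par() keeps later plots from being squeezed into the two-panel layout.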

Ok, now let's get into the meat of our data. We know that hertle is a better competitor than gertie, so do their populations negatively correlate? To look at this, we'll use the plot() command, and plot hertle vs. gertie.

**plot(my_data$hertle,my_data$gertie,xlab="hertle pop.",ylab="gertie pop.")**
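As a rough numerical companion to the scatterplot, the cor() function gives a correlation coefficient for the same two columns. This is only a sketch: it is not part of the original walkthrough, and with 15 sites and skewed counts the number should be read as a description, not a test.

```r
hertle = c(0, 2, 4, 2, 0, 5, 2, 8, 5, 10, 22, 42, 30, 59, 35)
gertie = c(26, 14, 15, 22, 34, 6, 2, 12, 8, 0, 0, 1, 0, 6, 8)

# Pearson correlation (the default): sensitive to the few very large counts.
r_pearson = cor(hertle, gertie)

# Spearman correlation works on ranks, so extreme values matter less.
r_spearman = cor(hertle, gertie, method = "spearman")

print(r_pearson)
print(r_spearman)
```

A negative value means that sites with more hertle tend to have fewer gertie; the plot is still the better guide to the shape of that relationship.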

Yes, in fact they only seem to exist where the other isn't. Let's see if pappamaw, gertie's alternative food source, has an effect on this. To examine this, we are going to set the size of the dot to vary depending on the population size of pappamaw. Changing the size of the dot is done using the cex parameter in plot commands, like so:

**plot(my_data$hertle,my_data$gertie,xlab="hertle pop.",ylab="gertie pop.",cex=my_data$pappamaw+1)**

Wow! In fact, the gertie population is only substantial where pappamaw is present, with one exception. Notice that I did cex = pappamaw + 1; this is because in some locations pappamaw has a population size of zero, and setting cex = 0 would make the dot be of size zero, so those sites would not show up at all.

We might also want the shape of the dot to change depending on whether the environment is grassland or a streambed. This is just as easy as changing dot size. Instead of cex, we'll use pch. However, we also have to make env_type, which is currently set as a factor, a number for plotting, because pch is expecting integers. This is easy to do: we just wrap my_data$env_type in the command as.numeric(), which turns each factor level into its position number (1 for grass, 2 for streambed). Note that this does not actually change your data frame; it is just a temporary change that only occurs within your plot. (If env_type were stored as plain character strings instead of a factor, as.numeric() would return NA, so convert it with factor() first.)

**plot(my_data$hertle,my_data$gertie,xlab="hertle pop.",ylab="gertie pop.",pch=as.numeric(my_data$env_type))**

Well, that worked, but what are the different shapes? We'll need to add a legend for that. Legends are added as a second command after the plot command. They work by doing legend(x, y, thing, thing_labels). The parameters are the location you want the legend at (x and y), what you want to show in the legend (here, we want to show the first two shapes, or pch=c(1,2)), and what we want to label those shapes as (here, c("grass","streambed")).

**legend(30,30,pch=c(1,2),c("grass","streambed"))**

Remember, you can type in **?legend** to ask for help and learn way more about what you can do with legends.

Finally, let's combine the size and the shapes to see how all four of our variables change together.

**plot(my_data$hertle,my_data$gertie,xlab="hertle pop.",ylab="gertie pop.",cex=(my_data$pappamaw+1)/1.5,pch=as.numeric(my_data$env_type))**

**legend(30,30,pch=c(1,2),c("grass","streambed"))**

That's one way to put all of that data in one plot!
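One thing that can go wrong with a hand-typed legend is that the labels drift out of step with the shapes. The sketch below is a hedged variation on the final plot, not a replacement for it: it builds the legend from levels() so the labels always follow the same order as the numbers that as.integer() produces, and it uses the keyword position "topright" instead of x and y coordinates. The names env and shape are mine.

```r
hertle   = c(0, 2, 4, 2, 0, 5, 2, 8, 5, 10, 22, 42, 30, 59, 35)
gertie   = c(26, 14, 15, 22, 34, 6, 2, 12, 8, 0, 0, 1, 0, 6, 8)
pappamaw = c(6, 4, 3, 4, 8, 2, 3, 0, 2, 0, 0, 0, 3, 0, 0)
env      = factor(c(rep("grass", 4), rep("streambed", 8), rep("grass", 3)))

# Each factor value becomes the position of its level: grass = 1, streambed = 2.
shape = as.integer(env)

# Dot size follows pappamaw (+1 so zero counts stay visible), shape follows env.
plot(hertle, gertie,
     xlab = "hertle pop.", ylab = "gertie pop.",
     cex = (pappamaw + 1) / 1.5,
     pch = shape)

# Labels and shapes are generated from the same levels, so they cannot mismatch.
legend("topright",
       legend = levels(env),
       pch    = seq_along(levels(env)))
```

If a third environment type were added to the data, this version of the legend would pick it up without any edits.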

Okay, so in this post we went through making a new type of R object, a data.frame, we learned how to access data within a dataframe, and we learned how to plot boxplots, histograms, and scatterplots, as well as how to modify dot shape and size within the scatterplot, and add a legend. As usual, I recommend using ?command to look into details of whatever command you are using.

Next time, we'll learn how to load datasets, both from files and from some of the standard ones provided by R, and then we'll also install a new package and remake plots like these in a much more efficient (and pretty) way, using ggplot2.

Please let me know if there is anything in particular you would like to see in these posts in the future!