site env_type hertle gertie pappamaw
1 a grass 0 26 6
2 b grass 2 14 4
3 c grass 4 15 3
4 d grass 2 22 4
5 e streambed 0 34 8
6 f streambed 5 6 2
7 g streambed 2 2 3
8 h streambed 8 12 0
9 i streambed 5 8 2
10 j streambed 10 0 0
11 k streambed 22 0 0
12 l streambed 42 1 0
13 m grass 30 0 3
14 n grass 59 6 0
15 o grass 35 8 0

There is an important thing to notice about this setup: it’s in what’s known as “long” format. This means that each row is a unique observation, and all the measurements and variables associated with that observation are in different columns. Many, many R functions assume your data will be in this format, so I recommend always storing your data this way.

How do we build this data frame object in R? Today we’ll do it the simplest, if not the most efficient, way that I know: using multiple c() commands and then putting them all together.

site = c('a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o')
env_type = c('grass', 'grass', 'grass', 'grass', 'streambed', 'streambed', 'streambed', 'streambed', 'streambed', 'streambed', 'streambed', 'streambed', 'grass', 'grass', 'grass')
hertle = c(0,2,4,2,0,5,2,8,5,10,22,42,30,59,35)
gertie = c(26,14,15,22,34,6,2,12,8,0,0,1,0,6,8)
pappamaw = c(6,4,3,4,8,2,3,0,2,0,0,0,3,0,0)

You may notice something new here: non-numeric data (i.e. character data) was put together using the c() command but with single quotes around each string of characters. This tells R that the information is a string of characters, and not variables. The quotes must be plain straight quotes; curly quotes pasted from a web page or word processor will cause an error. For example, if I tried:

site = c(a,b,c,d)

I would get an error because R would look for variables in the workspace named a, b, c, and d to put together into site, but wouldn’t find them.

Now to put it all together into a data.frame, we use the data.frame() command:

my_data = data.frame(site=site, env_type=env_type, hertle=hertle, gertie=gertie, pappamaw=pappamaw, stringsAsFactors=TRUE)

The stringsAsFactors=TRUE argument is only needed in R 4.0.0 and later, where data.frame() otherwise leaves character columns as plain character strings; older versions converted them to factors by default.

Now your data is all nicely put together into one object! You can click on it in the workspace to see it nicely arranged, or type my_data into the console to have it show up there. I should note that while I made the column titles the same as the variables, that is not necessary.
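To make that last point concrete, here is a minimal sketch in which the column titles differ from the names of the vectors that fill them. The names site_ids, habitat, hertle_n, and small_data are made up for this illustration and are not part of the data set above.

```r
# Hypothetical vectors, deliberately named differently from the columns
site_ids = c('a', 'b', 'c')
habitat  = c('grass', 'grass', 'streambed')
hertle_n = c(0, 2, 4)

# column_title = data_vector: the name on the left becomes the column title
small_data = data.frame(site = site_ids, env_type = habitat, hertle = hertle_n)

names(small_data)  # column titles, taken from the left-hand sides
dim(small_data)    # number of rows, then number of columns
```

The vectors on the right-hand side only need to be the same length; what they are called in the workspace has no effect on the resulting data.frame.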
The general way to think about putting together a data.frame is like this:

Final_dataframe_name = data.frame(column1_title = column1_data, column2_title = column2_data, …)

You can use a few neat commands to take a look at the data.frame. Let’s try a few:

head(my_data,5)

head() shows you the first 5 observations in the data.frame called my_data.

tail(my_data, 6)

Like head(), but tail() shows the last observations (here, the last 6). The number can be any size. I recommend using these to make sure your data.frame looks like you expect it to.

str(my_data)

This one is very important. It shows you the ‘structure’ of your data, in other words, what types of data R thinks are in each column of the data.frame. Right now it says:

> str(my_data)
'data.frame':	15 obs. of  5 variables:
 $ site    : Factor w/ 15 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ env_type: Factor w/ 2 levels "grass","streambed": 1 1 1 1 2 2 2 2 2 2 ...
 $ hertle  : num  0 2 4 2 0 5 2 8 5 10 ...
 $ gertie  : num  26 14 15 22 34 6 2 12 8 0 ...
 $ pappamaw: num  6 4 3 4 8 2 3 0 2 0 ...

which is telling you that my_data is a data.frame with 15 observations (rows) of 5 variables (columns). Then it tells you the type of each variable and shows the first handful of values. hertle, gertie and pappamaw are numbers, as expected. site and env_type, on the other hand, are ‘factors.’ This is what we want them to be if we want to be able to separate the data based upon these variables. Don’t worry if this is confusing; it will become clearer as you work more with R. A good rule of thumb is that if a variable represents different treatments, it makes sense to store it as a factor data type.

Remember the summary() command which we used on a single variable last time? Well, we can use it on data.frames as well, and this is extremely useful to get a sense of your data!

> summary(my_data)
site env_type hertle gertie pappamaw
a :1 grass :7 Min. : 0.00 Min. : 0.00 Min. :0.000
b :1 streambed:8 1st Qu.: 2.00 1st Qu.: 1.50 1st Qu.:0.000
c :1 Median : 5.00 Median : 8.00 Median :2.000
d :1 Mean :15.07 Mean :10.27 Mean :2.333
e :1 3rd Qu.:26.00 3rd Qu.:14.50 3rd Qu.:3.500
f :1 Max. :59.00 Max. :34.00 Max. :8.000
(Other):9

For the factor variables, it tells you (at least some of) the variable ‘levels’ and how many observations are in each level. For example, for env_type it says there are 7 grass and 8 streambed observations. For site, it shows 1 observation in each of the first six levels and lumps the remaining 9 levels together as (Other). For hertle, gertie, and pappamaw you get a five-number summary (min, first quartile, median, third quartile, and max) as well as the mean.

From here, you can see that the range of population sizes differs between species, with hertle having a much larger maximum population size than the other species. You can also see that the mean is similar to the median within the gertie and pappamaw data, but that they are very different for hertle. Comparing the mean and the median is very useful: it can tell you if your data has a chance at being symmetric and normal (but certainly does NOT tell you that for sure!).

Let’s spend the rest of the time pulling out specific pieces of the data.frame and plotting them using R’s base graphics. Note: this is not going to show off some of the great reasons to store your data in a data.frame; that will come later when we really get into data manipulation and advanced plotting with some non-standard libraries. For now, it’s best to start working with data.frames because more familiarity with them will make things much easier later on.

First, how do we get the data from a single column, for example, the hertle data? There are a few ways, none of which is inherently better. The two methods I use most commonly look like this:

> my_data$hertle
 [1]  0  2  4  2  0  5  2  8  5 10 22 42 30 59 35

and

> my_data[,'hertle']
 [1]  0  2  4  2  0  5  2  8  5 10 22 42 30 59 35

The first method, using the dollar sign, works on data frames (and lists). Putting a column title after a dollar sign after the data frame’s name gives you the information in that column. The second method, which uses square brackets, also works on other kinds of R objects (vectors, matrices, data.frames, etc.).
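As a quick check that the two methods really return the same thing, here is a small sketch on a cut-down stand-in data.frame. The names demo, by_dollar, and by_bracket are invented for this example.

```r
# A three-row stand-in for my_data (hypothetical, for illustration only)
demo = data.frame(site = c('a', 'b', 'c'), hertle = c(0, 2, 4))

by_dollar  = demo$hertle        # dollar-sign method
by_bracket = demo[, 'hertle']   # square-bracket method: all rows, column 'hertle'

# TRUE when both methods give exactly the same vector
identical(by_dollar, by_bracket)
```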
What is happening in the square-bracket version is that we are saying which rows we want (all of them) and which columns we want (the one named 'hertle'). The confusing thing about R is that if you want everything in a category, you leave that position in the square brackets empty. In other words, because there is nothing before the comma, we get all the rows.

data_frame_name[rows_you_want,columns_you_want] # will return the part of data_frame_name that is just the rows and columns you want.

We don’t even have to use a name. Since we know the hertle data is in the third column, we could instead do:

my_data[,3]

What if we just wanted rows 1-5 of the hertle data? Easy! We would do this:

my_data[1:5,3]

The colon says “first and last and everything between,” so this command says we want rows 1 through 5 of column three, returning this:

> my_data[1:5,3]
[1] 0 2 4 2 0

Let’s leave the data manipulation at that for now; in later lessons we’ll introduce more complicated manipulations.

Let’s now use R’s base graphics to look at some important exploratory data plots: boxplots, histograms, and scatterplots. We made a boxplot in the last post so I won’t go into detail explaining them here, but typing in the following should show you a boxplot of your data.

boxplot(my_data$hertle, my_data$gertie, names=c('hertle', 'gertie'), ylab='N', main='Population sizes')

Here is a histogram:

hist(my_data$hertle, xlab='population size', ylim=c(0,20), xlim=c(0,70), breaks=10)

The parameters of hist shown here are the data, the x-axis label (xlab), the minimum and maximum that the y-axis shows (ylim), the minimum and maximum that the x-axis shows (xlim), and roughly how many bars I want the data broken into (breaks; R treats a single number here as a suggestion and may choose a slightly different number of bars). Note that ylim and xlim both take vectors of length 2, which I create within the hist command using c(). It was important to specify ylim and xlim so that when we call the same command for gertie, the data are shown within the same range.
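The hard-coded limits could also be derived from the data, so that both histograms are guaranteed to share the same axes. The following is a rough sketch under one assumption that is not in the post: explicit bin edges every 10 individuals from 0 to 70. The names shared_breaks, h_counts, g_counts, and shared_ylim are made up for this example.

```r
hertle = c(0,2,4,2,0,5,2,8,5,10,22,42,30,59,35)
gertie = c(26,14,15,22,34,6,2,12,8,0,0,1,0,6,8)

# Assumed bin edges: 0, 10, ..., 70 (wide enough to cover both species)
shared_breaks = seq(0, 70, by = 10)

# plot = FALSE returns the bin counts without drawing anything
h_counts = hist(hertle, breaks = shared_breaks, plot = FALSE)$counts
g_counts = hist(gertie, breaks = shared_breaks, plot = FALSE)$counts

# One y-axis range tall enough for the fullest bin of either species
shared_ylim = c(0, max(h_counts, g_counts))

hist(hertle, breaks = shared_breaks, xlab = 'population size',
     xlim = range(shared_breaks), ylim = shared_ylim)
hist(gertie, breaks = shared_breaks, xlab = 'population size',
     xlim = range(shared_breaks), ylim = shared_ylim)
```

Passing a vector of bin edges to breaks, rather than a single number, makes R use exactly those bins, which keeps the bars of the two plots directly comparable.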
In a later post, we’ll use a plotting package that takes care of all of that for you. Here’s the same thing for gertie:

hist(my_data$gertie, xlab='population size', ylim=c(0,20), xlim=c(0,70), breaks=10)

Now, it would be nice to have both plots show simultaneously. To do that, we’ll first have to tell R that when we plot, we want to plot 2 things. I’m not going to go into the details of the command that does this, except to show you how to say how many rows and columns of plots we want. We use the command:

par(mfrow = c(1,2))

which tells R to lay the plots out in 1 row and 2 columns. In other words, it reads it like this:

par(mfrow = c(num_rows,num_columns))

If we do:

par(mfrow=c(1,2))
hist(my_data$hertle, xlab='population size', ylim=c(0,20), xlim=c(0,70), breaks=10)
hist(my_data$gertie, xlab='population size', ylim=c(0,20), xlim=c(0,70), breaks=10)

we should see the two histograms side by side. Now it’s easy to see that hertle has a longer tail, but mostly has tiny population sizes, whereas gertie has fewer tiny population sizes but also a less extreme tail.

Ok, now let’s get into the meat of our data. We know that hertle is a better competitor than gertie; do their populations negatively correlate? To look at this, we’ll use the plot() command, and plot hertle vs. gertie.

plot(my_data$hertle, my_data$gertie, xlab='hertle pop.', ylab='gertie pop.')

Yes, in fact they seem to exist mostly where the other isn’t. Let’s see if pappamaw, gertie’s alternative food source, has an effect on this. To examine this, we are going to set the size of each dot to vary depending on the population size of pappamaw. Changing the size of the dot is done using the cex parameter in plot commands, like so:

plot(my_data$hertle, my_data$gertie, xlab='hertle pop.', ylab='gertie pop.', cex=my_data$pappamaw+1)

The +1 is there because sites with no pappamaw would otherwise be drawn with a dot of size zero.
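The visual impression from the scatterplot can be backed up with a number. As a rough sketch, base R’s cor() computes the Pearson correlation by default. This check is an addition for illustration and not part of the analysis in the post, and the names r and dot_size are invented here.

```r
hertle   = c(0,2,4,2,0,5,2,8,5,10,22,42,30,59,35)
gertie   = c(26,14,15,22,34,6,2,12,8,0,0,1,0,6,8)
pappamaw = c(6,4,3,4,8,2,3,0,2,0,0,0,3,0,0)
my_data  = data.frame(hertle = hertle, gertie = gertie, pappamaw = pappamaw)

# Pearson correlation between the two competitors; a negative value
# means sites with more hertle tend to have fewer gertie
r = cor(my_data$hertle, my_data$gertie)
r

# Dot sizes for the scatterplot: pappamaw count plus 1, so the
# smallest dot has size 1 instead of 0
dot_size = my_data$pappamaw + 1

plot(my_data$hertle, my_data$gertie,
     xlab = 'hertle pop.', ylab = 'gertie pop.', cex = dot_size)
```

A single correlation coefficient is only a summary; with 15 sites and a skewed hertle distribution, the plot itself remains the more informative view.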
Please let me know if there is anything in particular you would like to see in these posts in the future!