We’ve been slowly coaxing you along in our R tutorials. We’ve introduced what R is, gave you a basic tutorial into how to use R and also spent some time learning how to explore your data with R.

By now you are probably itching to use R for more complicated analyses. To indulge you, I am going to diverge from the incremental teaching (which will continue in the future) and show you how to use R to do statistical comparisons of groups of data. I am also going to show you at least one way to plot the data that would be appropriate for each test. This post parallels another BitesizeBio article, which gives a guide to the statistical test to do depending on your data.

Below is the code you need to perform each test. A nice thing about having this code all in one place is that you can see that for many of the tests, R uses similar syntax. For example, many of the models follow the format of:

response_variable ~ independent_variable1 + independent_variable2

And also you can get more information from many models—if you put them into a variable–by calling summary(my_model) and plot(my_model).

In each case, I am going to do the following things in order:

- make up some data by pulling random numbers from a normal distribution (with rnorm) or a uniform distribution (with runif).
- plot the data. I tried to use a variety of methods, trying to choose what I would start with when examining this type of data,
- build the model and save it as a variable (skipped on the simpler tests), then
- get useful information out of the model. Remember, if you ever want more information on a function, just run ?function_name!

Just like my earlier articles, the commands to enter are bolded.

## Unpaired t-test

Used to compare the means of groups of data where each datapoint is independent, when the data from each group are similarly spread out around the group mean (i.e. have similar variance).

1.make up data

**group1_data = rnorm(10,1,2)**

**group2_data = rnorm(10,6,2)**

2. plot data

**boxplot(group1_data,group2_data)**

3. do test

**t.test(group1_data,group2_data)**

## Paired t-test

A paired t-test is used when each observation in one group has a matching observation in the other group. For example, you measured fluorescence in 50 cells, then dosed them with some drug then measured fluorescence again. For this to work right, the individuals (here, cells) need to be in the same place in the data in the before and after group. Like the unpaired t-test, you need to be able to assume that the observations are roughly normal (look like a bell curve when you plot them as a histogram).

1. make up data

**before_data = seq(from=2,to=20,length.out=10)+rnorm(10,0,2)**

**after_data = before_data-rnorm(10,5,2)**

2. plot data

**matplot(c(1,10),rbind(before_data,after_data),type=’l’,axes=F,ylab=’measure’)**

**axis(2)**

**axis(1,c(1,10),label=c(‘before’,’after’))**

3. do test

**t.test(before_data-after_data)**

**t.test(before_data,after_data,paired=TRUE)** # same thing

## Mann-Whitney

This is what you use when you want to do an unpaired t-test, but you don’t think your data are from normal distributions with similar variance. It’s kind-of more foolproof, but it is also less likely to find real differences when they are actually there. You can see in the example below that the example data is pulled from uniform distributions, and that the data don’t look like bell curves in the histograms.

1. make up data

**group1_data = runif(20,min=3,max=10)**

**group2_data = runif(20,min=6,max=15)**

**all_data = c(group1_data,group2_data)**

**group_labels = c(rep(‘group1’,20),rep(‘group2’,20))**

2. plot data

**par(mfrow=c(1,2)) **(this cool command lets you plot multiple plots in one figure)

**hist(group1_data,breaks=5,main=’histogram of group 1′, xlim=c(0,20))**

**hist(group2_data,breaks=5,main=’histogram of group 2′, xlim=c(0,20))**

3. do test

**wilcox.test(all_data ~ group_labels)**

## One-way ANOVA

Used to see if at least one group out of three or more is different from the rest of the groups. Here, we need to be able to assume that the *residuals *of the model are normally distributed, if your sample size is low. If it is pretty high (>50 per group), it’s probably okay even if the data aren’t actually normal. I think it is important to point out here that an ANOVA never tells you which group(s) are different. Instead, if you get a significant answer it only tells you that at least one group is different. From there, you can just take a look at the data to figure out why, or do a multiple comparisons, one of which I’ve tossed in here (a Tukey’s Honestly Significant Difference test).

1. make up data

**group1_data = rnorm(10,1,2)**

**group2_data = rnorm(10,6,2)**

**group3_data = rnorm(10,6,2)**

**all_data = c(group1_data,group2_data,group3_data)**

**group_labels=c(rep(‘group1’,10),rep(‘group2’,10),rep(‘group3’,10))**

2. plot data

**plot(factor(group_labels),all_data)**

3. do test

**my_anova_model = aov(all_data ~ factor(group_labels))**

**summary(my_anova_model)**

**plot(my_anova_model)*** # (to do a visual inspection of the ANOVA assumptions)

**TukeyHSD(my_anova_model)**

## Kruskal-Wallace

Just like the t-test needed a partner that could compare two groups when you weren’t sure if they were from normally-distributed populations, the one-way ANOVA also needed a non-normal partner. So the Kruskal-Wallace test looks to see if at least one group out of three or more is different from the others, and does not assume normally-distributed residuals. Similar to the Mann-Whitney, it is also less powerful but more fool-proof.

1.make up data

**group1_data = runif(15,min=1,max=2)**

**group2_data = runif(15,min=1,max=5)**

**group3_data = runif(15,min=1.4,max=3)**

**all_data = c(group1_data,group2_data,group3_data)**

**group_labels=c(rep(‘group1’,15),rep(‘group2’,15),rep(‘group3’,15))**

2.plot data

**par(mfrow=c(3,1))**

**hist(group1_data,xlim=c(0,6))**

**hist(group2_data,xlim=c(0,6))**

**hist(group3_data,xlim=c(0,6))**

3.do test

**kruskal.test(all_data ~ factor(group_labels))**

## Two-way ANOVA

Here is where the tests start getting pretty complicated, and the last one I’ll introduce is the two-way ANOVA. A two-way ANOVA is for when you’ve kind-of done two experiments simultaneously, ending up with (at least) four groups of data. For example, you are measuring fluorescence in cells. In half the cells, you do a knock-down of some gene. Separately, you treat half the cells with a drug. This ends up with four different groups: (1) knock-down, no drug, (2) knock-down, with drug, (3) no knock-down, no drug, (4) no knock-down, with drug. We say the treatments are in different ‘factors’: drug treatment and knock-down treatment.

The cool thing about a two-way ANOVA, and the reason we analyze this kind of data differently from a one-way ANOVA, is that the two way can tell you if the drug had any general effect, if the knock-down had any general effect, and if there was interaction between the two (e.g. the knock-down only had an effect when the drug was present).

1.make up data

**group1_data = rnorm(10,1,2) #no drug, knockdown**

**group2_data = rnorm(10,3,2) #no drug, no knockdown**

**group3_data = rnorm(10,6,2) #drug, knockdown**

**group4_data = rnorm(10,3,2) #drug, no knockdown**

**factor1 = c(rep(‘no drug’,20),rep(‘drug’,20))**

**factor2 = c(rep(‘knock down’,10),rep(‘no knock down’,10),rep(‘knock down’,10),rep(‘no knock down’,10))**

**all_data = c(group1_data,group2_data,group3_data,group4_data)**

2.plot data

**interaction.plot(factor1,factor2,all_data)**

3.do test

**my_2way_anova_model = aov(all_data ~ factor(factor1)*factor(factor2))**

**my_2way_anova_model**

**summary(my_2way_anova_model)**

**plot(my_2way_anova_model)**

Now that you have the code, you are ready to roll! Stay tuned for more!