Our previous articles have discussed statistical tests for comparing two datasets. Unfortunately, many experiments are more complicated and have three or more datasets.
Various statistical tests are used for comparing multiple data sets. Let’s focus on the right side of the diagram and talk about statistical tests for comparing more than two datasets.
One independent variable
One-way ANOVA (analysis of variance)
If you are comparing multiple sets of data in which there is just one independent variable, then the one-way ANOVA is the test for you!
ANOVA makes the same assumptions as the t-test: continuous data that are normally distributed and have the same variance. ANOVA produces an F-ratio from which the significance (p-value) is calculated. You don’t really need to know what the F-value is, but simply put it is the ratio of between-group variance to within-group variance. An F-value equal or close to 1 means that there is no significant difference between your groups, whereas the higher the F-value, the more likely it is that there is a true difference within your data.
ANOVA tests whether there is a significant difference within your data as a whole, and provides a single p-value, but won’t be able to tell you between which datasets the significance is found. To get more insight into your data and to discover where the significance lies you need to do a post-hoc (which simply means ‘after this’) test.
While the ANOVA is primarily used for comparing multiple sets of data, it can also be used as an alternative to the t-test when comparing two groups of data.
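As a rough illustration of the one-way ANOVA described above, here is a minimal sketch using SciPy's `f_oneway`. The three groups and their values are made up for the example, and the 0.05 cut-off is simply the conventional choice.

```python
from scipy import stats

# Hypothetical measurements for three groups (made-up numbers).
control = [4.1, 3.9, 4.3, 4.0, 4.2]
drug_a = [4.8, 5.1, 4.9, 5.3, 5.0]
drug_b = [4.2, 4.0, 4.4, 4.1, 4.3]

# f_oneway returns the F-ratio and the p-value for the null hypothesis
# that all group means are equal.
f_value, p_value = stats.f_oneway(control, drug_a, drug_b)
print(f"F = {f_value:.2f}, p = {p_value:.4g}")

# A single p-value for the data as a whole; it does not say which groups differ.
alpha = 0.05
anova_significant = p_value < alpha
print("Significant difference somewhere in the data:", anova_significant)
```

A significant result here only says that at least one group differs; a post-hoc test is still needed to find out which one.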
Kruskal-Wallis test
The Kruskal-Wallis test is similar to the one-way ANOVA; however, it is used when you cannot assume normal distribution or similar variances. As with all non-parametric tests (where no assumptions about distribution and variance are made) this test is less powerful, but more conservative, than its parametric equivalent. This test is an extension of the Mann-Whitney U test, meaning it is a rank test, but it allows for comparison of multiple samples. Like the ANOVA, this test tells you whether there is a significant difference within the data but does not tell you where that difference lies.
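For comparison, a minimal sketch of the Kruskal-Wallis test with SciPy's `kruskal` might look like this. The skewed values below are invented purely for illustration.

```python
from scipy import stats

# Hypothetical, skewed measurements for three groups (made-up numbers).
group_1 = [1.2, 1.5, 1.1, 9.8, 1.4, 1.3]
group_2 = [2.9, 3.4, 3.1, 15.2, 3.0, 3.3]
group_3 = [1.0, 1.6, 1.2, 1.3, 8.7, 1.5]

# kruskal ranks all values together and returns the H statistic and a
# p-value for the null hypothesis that the groups come from the same
# distribution.
h_value, p_value = stats.kruskal(group_1, group_2, group_3)
print(f"H = {h_value:.2f}, p = {p_value:.4g}")
```

Because the test works on ranks, the extreme values in each group have far less influence than they would in an ANOVA.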
Two independent variables
There are often situations in biology where you are looking at the effect of multiple variables on your chosen observation. For example, let’s say I am comparing the effects of Drug A and Drug B in humans to determine which drug gives a better outcome. Then I realize that the effects of the drug might be different in males versus females. In this scenario you can analyse the two variables (drug and sex) simultaneously. This has the benefit of being able to uncover whether the two independent variables interact and whether this affects the outcome you are measuring. For example, Drug A may have a great effect compared to Drug B, but only in males. So what tests exist for the analysis of such data?
Two-way ANOVA
You’ve probably guessed that this is simply an extension of the one-way ANOVA and therefore has the same assumptions (to refresh you, these are continuous data, approximately normal distribution and equal variance). Like the one-way ANOVA, the two-way also produces an F-value that is used to calculate the significance value. In a two-way ANOVA you are testing three null hypotheses: first, that the means of variable one are the same; second, that the means of variable two are the same; and finally, that the two variables do not influence one another (no interaction).
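To make the three null hypotheses concrete, here is a sketch that computes a two-way ANOVA by hand with NumPy for a balanced design (the same number of replicates in every drug/sex combination). The data are made up, the layout of the `data` array is an assumption of this example, and for unbalanced designs you would want a proper statistics package instead.

```python
import numpy as np
from scipy import stats

# Hypothetical balanced design: data[i, j, :] holds the replicates for
# drug i (A, B) and sex j (male, female). All numbers are made up.
data = np.array([
    [[8.1, 7.9, 8.4, 8.0], [5.1, 4.8, 5.3, 5.0]],  # Drug A: male, female
    [[5.2, 4.9, 5.0, 5.3], [5.0, 5.2, 4.7, 5.1]],  # Drug B: male, female
])
a, b, n = data.shape

grand = data.mean()
mean_a = data.mean(axis=(1, 2))  # mean for each drug
mean_b = data.mean(axis=(0, 2))  # mean for each sex
cell = data.mean(axis=2)         # mean for each drug/sex combination

# Sums of squares for the two main effects, the interaction and the error.
ss_a = b * n * np.sum((mean_a - grand) ** 2)
ss_b = a * n * np.sum((mean_b - grand) ** 2)
ss_ab = n * np.sum((cell - mean_a[:, None] - mean_b[None, :] + grand) ** 2)
ss_error = np.sum((data - cell[:, :, None]) ** 2)

df_error = a * b * (n - 1)
ms_error = ss_error / df_error

# One F-value and p-value per null hypothesis.
results = {}
for name, ss, df in [("drug", ss_a, a - 1),
                     ("sex", ss_b, b - 1),
                     ("interaction", ss_ab, (a - 1) * (b - 1))]:
    f_value = (ss / df) / ms_error
    p_value = stats.f.sf(f_value, df, df_error)
    results[name] = (f_value, p_value)
    print(f"{name}: F = {f_value:.2f}, p = {p_value:.4g}")
```

In this invented dataset Drug A only stands out in males, which is exactly the kind of pattern the interaction term is there to pick up.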
Scheirer-Ray-Hare
Of course this guide wouldn’t be complete without the provision of a test that is the non-parametric equivalent of the two-way ANOVA.
It is at this point I should admit – I am not a statistician. There appears to be some conflict as to whether or not a non-parametric two way ANOVA is possible, with some people recommending to use a one way non-parametric equivalent (such as the Kruskal-Wallis test) for the two variables separately.
However, one test that claims to be the non-parametric equivalent of a two-way ANOVA is the Scheirer-Ray-Hare test. This is an extension of the Kruskal-Wallis test and is therefore similar in the assumptions it makes (i.e. very few about distribution) and is a rank test. This test appears to be a relatively new test in the world of statistics and, as alluded to above, is seen as impossible by some. The relative newness of the test also means that it is not widely included in statistics packages. Therefore I would recommend using this test with caution, and consider curling up in a ball and rocking back and forth for a few hours before seeking the advice of your resident statistics expert.
Post-hoc testing
All of the above tests will tell you if there is any significant difference within your data as a whole, but they won’t tell you between which sets of data the significance lies. In order to delve deeper and uncover this detail you need to apply a post-hoc test.
When you are comparing multiple sets of data it might seem like a logical thought to simply perform an individual t-test between each set of data to determine the significance. However, not only is this going to take a long time, particularly if you have a large number of datasets to compare, it is also mathematically/statistically incorrect. In fact, using a t-test in this way will increase the risk of a Type 1 error because of the multiple comparisons problem (see below).
Instead of using a t-test to determine which datasets differ significantly, you need to use a separate post-hoc test to further investigate your data. Some of the most common ones include the Bonferroni method, the Dunn-Sidák method, and Tukey’s (not Turkey!) honestly significant difference (HSD) test. All these tests allow a deeper analysis of your datasets. Most, but not all, of these tests also correct for multiple comparisons.
The multiple comparisons problem
Ok, so I know I said this was a simple guide to statistics, but if you are anything like me, then you’ll want to know (at least at a basic level) why making multiple comparisons can be an issue when interpreting your data using statistics.
When I first started using statistics I didn’t understand why comparing individual datasets using a t-test was incorrect when you had multiple datasets. Surely comparing each group to each other individually would give you the same result as if you compared them in one larger test? Unfortunately that’s not the case.
When we pick a significance level of 0.05 we are accepting a 5% chance that a result will be found to be significant when it actually isn’t. This level of error is deemed acceptable in experiments where you are comparing two sets of data. However, when you are comparing multiple sets of data, 5% suddenly becomes quite a large figure. For example, if you are looking at the effect of a drug on 500 different genes you would expect to get 25 significant genes (5% of 500) by chance alone. The more comparisons you perform, the greater the number of false positives you are likely to get. Therefore it is necessary to correct for multiple comparisons in order to limit the chance of making a Type 1 error.
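The arithmetic behind this can be sketched in a few lines. The function names below are just for illustration, and the second calculation assumes the comparisons are independent of one another, which real data often are not.

```python
def expected_false_positives(alpha, n_tests):
    """Average number of false positives if no real effect exists."""
    return alpha * n_tests


def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive, assuming independent tests."""
    return 1 - (1 - alpha) ** n_tests


alpha = 0.05
for n_tests in (1, 10, 100, 500):
    expected = expected_false_positives(alpha, n_tests)
    any_false = familywise_error_rate(alpha, n_tests)
    print(f"{n_tests:>3} tests: expect {expected:.1f} false positives, "
          f"P(at least one) = {any_false:.3f}")
```

Even at 10 comparisons, the chance of at least one false positive is already far above the 5% you thought you had signed up for.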
How do we correct for multiple comparisons?
Instead of simply doing multiple t-tests, you need to use a post-hoc test that is designed to deal with multiple comparisons. There are lots of tests available, but I’ll just describe a few.
The Bonferroni method corrects for multiple comparisons by lowering the significance threshold in proportion to the number of comparisons being made. To generate the new, lower threshold, the original significance level (0.05) is divided by the number of comparisons being made, and each comparison’s p-value must fall below the resulting value to be considered significant. For example, with 10 comparisons the threshold becomes 0.05/10 = 0.005.
Another post-hoc test similar to the Bonferroni method is the Dunn-Sidák method. This post-hoc test also corrects for multiple comparisons by reducing the level required for significance from 0.05, but uses a slightly more complex formula [1 - 0.95^(1/k), where k = number of tests carried out].
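The two adjusted thresholds can be compared side by side with a short sketch. The function names are made up for this example; the formulas are the standard Bonferroni (alpha / k) and Sidák (1 - (1 - alpha)^(1/k)) corrections.

```python
def bonferroni_threshold(alpha, k):
    """Per-comparison significance threshold for k comparisons."""
    return alpha / k


def sidak_threshold(alpha, k):
    """Per-comparison threshold using the Sidak formula."""
    return 1 - (1 - alpha) ** (1 / k)


alpha = 0.05
for k in (2, 5, 10, 50):
    print(f"k = {k:>2}: Bonferroni = {bonferroni_threshold(alpha, k):.5f}, "
          f"Sidak = {sidak_threshold(alpha, k):.5f}")
```

The Sidák threshold is always very slightly higher than the Bonferroni one, so in practice the two methods rarely disagree.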
Both of these tests are considered to be quite conservative and they do increase the chance of Type 2 error (meaning that you determine a result to be not significant, although it actually is). I always prefer to use conservative tests since you can have more confidence that any significance you observe is really there.
Other post-hoc tests include Tukey’s honestly significant difference (HSD) test. This test is similar to a t-test, except that it compensates for multiple comparisons.
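If you have a reasonably recent version of SciPy (1.8 or later, as far as I am aware), Tukey's HSD is available as `scipy.stats.tukey_hsd`. The sketch below uses made-up data for three groups and is only meant to show the shape of the output.

```python
from scipy import stats

# Hypothetical measurements for three groups (made-up numbers).
control = [4.1, 3.9, 4.3, 4.0, 4.2]
drug_a = [4.8, 5.1, 4.9, 5.3, 5.0]
drug_b = [4.2, 4.0, 4.4, 4.1, 4.3]

result = stats.tukey_hsd(control, drug_a, drug_b)

# result.pvalue[i, j] is the adjusted p-value for group i versus group j,
# in the order the groups were passed in.
names = ["control", "drug_a", "drug_b"]
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        print(f"{names[i]} vs {names[j]}: p = {result.pvalue[i, j]:.4g}")
```

Unlike the single p-value from the ANOVA, this gives one p-value per pair of groups, already adjusted for the number of comparisons.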
However, be aware that not all post-hoc tests actually compensate for multiple comparisons. The Fisher least significant difference test (Fisher LSD test), while being a suitable post-hoc test for a one-way ANOVA, does not actually correct for multiple comparisons. This test should only be used when the ANOVA result is significant. This is because it assumes that if the ANOVA was significant you don’t need to lower the threshold at which something is considered to be significant. It does differ from doing a standard t-test, however, as it takes the whole dataset into account.
Post-hoc testing on non-parametric data
Opinions differ as to whether there are suitable post-hoc tests for non-parametric data. Suggestions include performing pairwise Mann-Whitney U tests (although, for the reasons described earlier regarding t-tests and multiple comparisons, this carries a risk of Type 1 errors), performing pairwise Mann-Whitney U tests with a Bonferroni adjustment to correct for multiple comparisons, or using the Nemenyi test (however, this requires all sample sizes to be equal).
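One of these suggestions, pairwise Mann-Whitney U tests with a Bonferroni adjustment, could be sketched as follows. The group names and values are invented for the example, and this is just one reasonable way to organise the comparisons.

```python
from itertools import combinations
from scipy import stats

# Hypothetical measurements for three groups (made-up numbers).
groups = {
    "low": [1.1, 1.4, 1.2, 1.6, 1.3, 1.5],
    "high": [3.1, 3.4, 2.9, 3.6, 3.3, 3.0],
    "other": [1.7, 1.0, 1.8, 1.25, 1.45, 1.35],
}

pairs = list(combinations(groups, 2))

# Bonferroni: divide the significance level by the number of comparisons.
alpha = 0.05
adjusted_alpha = alpha / len(pairs)

pairwise = {}
for name_x, name_y in pairs:
    _, p_value = stats.mannwhitneyu(groups[name_x], groups[name_y],
                                    alternative="two-sided")
    pairwise[(name_x, name_y)] = (p_value, p_value < adjusted_alpha)
    print(f"{name_x} vs {name_y}: p = {p_value:.4g}, "
          f"significant at {adjusted_alpha:.4f}: {p_value < adjusted_alpha}")
```

With three groups there are three pairwise comparisons, so each p-value is judged against 0.05/3 rather than 0.05.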
Now you know which test to use, how do you go about performing these tests?
Luckily there is a great array of statistics packages that make applying the majority of tests a breeze. In my final statistics post I’ll discuss the different statistics packages out there so you can find the best one for you.
Glossary: Understanding the Lingo
It’s terrible to read about a particular statistical test and have to look up the meaning of every third word. The type of data you have, the number of measurements, the range of your data values, and how your data cluster are all described using statistical terms. To determine which type of statistical test is the best fit for analyzing your data, you need to learn some statistics lingo.
Variables
Variables are anything that can be measured; they are your data points, and the type you have affects the statistical test you use. Measurement or numerical variables are the main type of variables that are obtained in biological research, so I’ll focus on these.
Measurement variables can be either continuous or discrete. Continuous variables can take any value between two points (for example half and quarter measurements), while discrete variables are whole numbers (such as a ranking from 1 to 5). As I will show you later, some tests can only be used with continuous variables, while others can accommodate discrete values.
Sample size (n)
Sample size refers to the number of data points in your set of data. In general, the larger your sample size, the better. However, factors such as time, cost and practicality limit the sample size you use. As an absolute minimum you need an n of 3 to perform a statistical test, but look at publications with similar experiments to determine what is considered acceptable. The size of your sample will affect the variance of your data (see below).
Data Spread
How spread out your data is gives you an idea of how reliable it is – data with low variance is more reliable than data with high variance. It is therefore useful to know how variable your data is, and there are several simple measures for determining this.
Variance
Variance is the simplest measure of the spread of the data and is the average of the squared differences from the mean. It tells us how spread out the data is from the mean; the larger the number, the more spread out the data (higher variance). To calculate variance, first find the mean of the data points. Next, find the difference between each sample and the mean and then square the result. Finally, average the results of the squared differences.
Note: if you are calculating sample variance (which you most likely are, since this means you are measuring just a sample of a population rather than the entire population) then you divide by n-1 when finding the average of the squared differences rather than n. This is to correct for the fact that you are only estimating the variance (since you are not measuring the entire population) rather than accurately computing it.
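The steps above can be followed by hand in a few lines and checked against Python's built-in `statistics` module. The data values here are made up for the example.

```python
import statistics

# Hypothetical measurements (made-up numbers).
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)

# Step 1: the mean. Step 2: squared differences from the mean.
mean = sum(data) / n
squared_diffs = [(x - mean) ** 2 for x in data]

# Step 3: average them, dividing by n for a whole population
# or by n - 1 for a sample.
population_variance = sum(squared_diffs) / n
sample_variance = sum(squared_diffs) / (n - 1)

print(f"population variance = {population_variance:.3f}")
print(f"sample variance     = {sample_variance:.3f}")

# The standard library gives the same answers.
print(statistics.pvariance(data), statistics.variance(data))
```

The sample variance is always a little larger than the population variance for the same numbers, and the gap shrinks as n grows.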
Standard Deviation
The standard deviation (SD) is the most widely used method for measuring the spread of the data. SD is simply the square root of the variance and similarly tells us how much the samples deviate from the mean. The standard deviation is often preferred to the variance as it is expressed in the same units as the data, making the results easier to interpret and display.
Standard Error
Standard error (SE) and SD are often thought to be interchangeable, however this isn’t the case. While SD tells you about the variability of your data, SE provides information on the precision of the sample mean.
SE is calculated by dividing the variance by n and taking the square root of this number. Since SE is calculated by dividing the variance by the sample size, it decreases with increasing sample size. SE therefore is often quoted rather than the SD as it tends to produce small numbers due to the additional division step, but choosing it over the SD for this reason alone can be misleading.
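To see how the SD and SE relate, here is a small sketch using only the standard library. The measurements are made up, and the sample (n - 1) variance is assumed.

```python
import math
import statistics

# Hypothetical measurements (made-up numbers).
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)

variance = statistics.variance(data)  # sample variance (divides by n - 1)
sd = math.sqrt(variance)              # spread of the data
se = math.sqrt(variance / n)          # precision of the sample mean

print(f"SD = {sd:.3f}")
print(f"SE = {se:.3f}")
```

Dividing the variance by n before taking the square root is the same as dividing the SD by the square root of n, which is why the SE is always smaller than the SD and keeps shrinking as the sample size grows.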
Distribution
Distribution, as the name implies, describes how your data is distributed. There are many ways your data can be distributed and this can affect the statistical test you use. The most well-known distribution is of course the normal distribution, which has a bell shape. A normal distribution means the data is symmetrical, with values higher and lower than the mean equally likely, but the frequency of values drops off quickly the further away from the mean.
Non-normal distributions are often skewed, meaning the mean is usually not in the middle. Many common statistical tests assume that the distribution is normal, so beware: these tests are not valid for highly skewed data.
p value
The p value is what you are searching for: the number that will tell you whether you have achieved the holy grail of science, statistical significance! It is generally considered that a result with a p value less than 0.05 is unlikely to have occurred by random chance and is therefore statistically significant. In contrast, results with a p value greater than 0.05 are not considered significant, as it cannot be ruled out that they occurred by random chance.
The p value is affected by sample size and if your sample size is too small you will not obtain a significant result even if the observed effect is real. Therefore you need to ensure you have a suitable sample size.
Paired or unpaired data
One factor that will be important in determining which type of test to use is whether or not your data is paired. Paired data is derived from equivalent and matched populations. For example, if you are comparing two drugs and you give drug A to 10 people of a certain age and population one day and 10 people of the same age and population drug B another day, your data is matched and you can use a paired test. If 10 people are given drug A but 15 people given drug B, then your data is unpaired.
Parametric vs non-parametric test
A parametric test is used when the data is assumed to be of normal distribution and equal variance. In contrast, non-parametric tests make no assumptions about distribution or variance. In general, non-parametric tests are less powerful but more conservative, so any significance you find with them is more likely to be real.
Type 1 and Type 2 errors
A statistical test can give a false result – often when the wrong test is used or a test is used incorrectly. Two types of errors can be encountered.
A Type 1 error is a false positive. It is when you conclude that a result is statistically significant when in fact it isn’t. A Type 2 error is a false negative; it occurs when actual significance is missed.

