Let’s Talk About Stats: Getting the Most out of your Multiple Datasets with Post-hoc Testing
So you’ve performed a test such as an ANOVA and found a statistically significant result (lucky you!). Now you want to know where that significance lies.
When you are comparing multiple sets of data, it might seem logical to simply perform an individual t-test between each pair of datasets to determine the significance. However, not only will this take a long time, particularly if you have a large number of datasets to compare, it is also statistically incorrect. In fact, using a t-test in this way will increase the chance of a Type I error because of something called the multiple comparisons problem (see below).
Instead of using repeated t-tests to determine which groups differ significantly, you need to use a post-hoc test to further investigate your data. Some of the most common ones include the Bonferroni method, the Dunn-Sidák method, and Tukey’s (not Turkey!) honestly significant difference (HSD) test. All of these tests allow a deeper analysis of your datasets. Most, but not all, of these tests also correct for multiple comparisons.
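To get a feel for how quickly the number of individual t-tests grows, here is a minimal Python sketch. The group names are made up purely for illustration; the only real point is that n groups need n(n-1)/2 pairwise tests.

```python
from itertools import combinations
from math import comb

# Hypothetical group labels, invented for this example.
groups = ["control", "dose_1", "dose_2", "dose_3", "dose_4"]

# Every unordered pair of groups would need its own t-test.
pairs = list(combinations(groups, 2))
print(len(pairs), "pairwise t-tests for", len(groups), "groups")

# The count grows roughly with the square of the number of groups.
for n in (3, 5, 10, 20):
    print(n, "groups ->", comb(n, 2), "pairwise t-tests")
```

Five groups already need 10 tests, and 20 groups would need 190.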
The multiple comparisons problem
Ok, so I know I said this was a simple guide to statistics, but if you are anything like me, then you’ll want to know (at least at a basic level) why making multiple comparisons can be an issue when interpreting your data using statistics.
When I first started using statistics, I didn’t understand why comparing individual datasets using a t-test was incorrect when you had multiple datasets. Surely comparing each group to each other individually would give you the same result as if you compared them in one larger test? Unfortunately, that’s not the case.
When we pick a significance level of 0.05, we are accepting a 5% chance that a result will be found to be significant when there is actually no real effect. This level of error is deemed acceptable in experiments where you are comparing two sets of data. However, when you are comparing multiple sets of data, 5% suddenly becomes quite a large figure. For example, if you are looking at the effect of a drug on 500 different genes, you would expect about 25 genes (5% of 500) to come up as significant by chance alone, even if the drug does nothing at all. The more comparisons you perform, the greater the number of false positives you are likely to get. Therefore it is necessary to correct for multiple comparisons in order to limit the chance of making a Type I error.
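The arithmetic behind this can be sketched in a few lines of Python. The function names are my own, and the second function assumes the tests are independent of each other, which real data will not always satisfy, so treat the numbers as a rough guide.

```python
def expected_false_positives(n_tests, alpha=0.05):
    """Average number of false positives if no real effect exists anywhere."""
    return n_tests * alpha


def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive, ASSUMING independent tests."""
    return 1 - (1 - alpha) ** n_tests


# 5% of 500 tests is 25 expected false positives.
print(expected_false_positives(500))

# The chance of at least one false positive climbs quickly with the
# number of comparisons: roughly 0.05, 0.14, 0.40 and 0.90 here.
for k in (1, 3, 10, 45):
    print(k, "comparisons ->", round(familywise_error_rate(k), 2))
```

So with 10 comparisons at the 0.05 level, there is already about a 40% chance that at least one "significant" result is a fluke.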
How do we correct for multiple comparisons?
Instead of simply doing multiple t-tests, you need to use a post-hoc test that is designed to deal with multiple comparisons. There are lots of tests available, but I’ll just describe a few.
The Bonferroni method corrects for multiple comparisons by lowering the significance level (the threshold a p-value must fall below) in proportion to the number of comparisons being made. To generate the new, lower threshold, the original significance level (0.05) is divided by the number of comparisons being made, and each comparison’s p-value is then checked against the resulting value.
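As a minimal sketch of the Bonferroni idea in Python (the p-values below are invented for illustration and do not come from any real experiment):

```python
def bonferroni_threshold(alpha, n_comparisons):
    """Significance level divided by the number of comparisons."""
    if n_comparisons < 1:
        raise ValueError("need at least one comparison")
    return alpha / n_comparisons


# Hypothetical p-values from three pairwise comparisons.
p_values = {"A vs B": 0.030, "A vs C": 0.004, "B vs C": 0.200}

threshold = bonferroni_threshold(0.05, len(p_values))
significant = [label for label, p in p_values.items() if p < threshold]

print("corrected threshold:", round(threshold, 4))
print("significant after correction:", significant)
```

Note that "A vs B" (p = 0.030) would have passed the usual 0.05 cut-off, but does not pass the corrected threshold of about 0.0167.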
Another post-hoc test similar to the Bonferroni method is the Dunn-Sidák method. This post-hoc test also corrects for multiple comparisons by reducing the level required for significance from 0.05, but uses a slightly more complex formula [1 - 0.95^(1/k), where k = number of tests carried out].
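For comparison, here is a small Python sketch of the Dunn-Sidák threshold, written in its general form 1 - (1 - alpha)^(1/k), which for alpha = 0.05 becomes 1 - 0.95^(1/k). The function name is my own, and the formula relies on the tests being independent.

```python
def sidak_threshold(alpha, k):
    """Per-comparison significance level for k independent tests."""
    if k < 1:
        raise ValueError("need at least one test")
    return 1 - (1 - alpha) ** (1 / k)


# Compare with the simpler Bonferroni threshold (alpha / k).
for k in (1, 3, 10):
    sidak = sidak_threshold(0.05, k)
    bonferroni = 0.05 / k
    print(k, "tests -> Sidak:", round(sidak, 5),
          "Bonferroni:", round(bonferroni, 5))
```

For any k greater than 1 the Dunn-Sidák threshold comes out very slightly higher (less strict) than the Bonferroni one, so in practice the two methods usually agree.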
Both of these tests are considered to be quite conservative, and they do increase the chance of a Type II error (meaning that you determine a result to be not significant when it actually is). I always prefer to use conservative tests, since you can have more confidence that any significance you observe is really there.
Other post-hoc tests include Tukey’s honestly significant difference (HSD) test. This test compares every pair of group means in much the same way as a t-test, except that it compensates for the number of comparisons being made.
However, be aware that not all post-hoc tests actually compensate for multiple comparisons. The Fisher least significant difference test (Fisher LSD test), while being a suitable post-hoc test for a one-way ANOVA, does not actually correct for multiple comparisons. This test should only be used when the ANOVA result is significant. This is because it assumes that if the ANOVA was significant, you don’t need to lower the threshold at which something is considered to be significant. It does differ from doing a standard t-test, however, as it takes the whole dataset into account.
Post-hoc testing on non-parametric data
Opinions differ as to whether there are suitable post-hoc tests for non-parametric data. Suggestions include performing pairwise Mann-Whitney U tests (although, for the reasons described earlier regarding t-tests and multiple comparisons, there is a risk of Type I errors), the Nemenyi test (however, this requires all sample sizes to be equal), or the Mann-Whitney U test with a Bonferroni adjustment of the p-value to correct for multiple comparisons.
Now that you know which test to use, how do you go about performing these tests?
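The Bonferroni adjustment of the p-value can be sketched as follows. The raw p-values are invented stand-ins for the output of pairwise Mann-Whitney U tests, which you would get from your statistics package. Multiplying each p-value by the number of comparisons (capped at 1) is equivalent to dividing the significance level by that number.

```python
def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    k = len(p_values)
    return [min(1.0, p * k) for p in p_values]


# Hypothetical raw p-values from three pairwise Mann-Whitney U tests.
raw = {"A vs B": 0.012, "A vs C": 0.030, "B vs C": 0.400}

adjusted = dict(zip(raw, bonferroni_adjust(list(raw.values()))))
still_significant = [label for label, p in adjusted.items() if p < 0.05]

for label, p in adjusted.items():
    print(label, "adjusted p =", round(p, 3))
print("significant at 0.05 after adjustment:", still_significant)
```

Here "A vs C" looked significant before adjustment (p = 0.030) but not afterwards (adjusted p of about 0.09).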
Luckily there is a great array of statistics packages that make applying the majority of tests a breeze. In my final statistics post I’ll discuss the different statistics packages out there so you can find the best one for you.
Uh, I’m pretty sure it’s the Tukey test and not the Turkey test. You may want to change the typo…and the picture.
Thanks for catching that! Although we managed to keep the picture incorporated!