Yay! P<0.05! Time to wrap up the project and publish it!
Wait a minute, do you trust the p value you got?
You might scratch your head and ask, “What are you talking about? You want me to get the p value lower? Perhaps 0.01 or even smaller?” No. A p value on its own is not very informative, no matter how impressive it looks. When it comes to data analysis and interpretation, you need to be very clear about what a given p value means and be aware of its intrinsic limitations.
What can your supervisor or colleagues learn from your results on the basis of a single p value? The only thing it tells them is that there appears to be an effect (p<0.05). For example, suppose you want to test whether treatment with drug A increases the performance of mice in a memory test. A significant p value would suggest that your drug-treated subjects performed better than their untreated littermates. But it cannot reveal the magnitude of the difference, which could be as big as a several-fold improvement or as small as a 10% increase. It’s important to remember that statistical significance is not equivalent to biological significance. A very high level of significance can mean nothing from a practical perspective. Besides, there is other valuable information that a p value alone cannot reveal, such as:
- Does the finding really reflect a true effect? (Reliability)
- How big is the effect? (Effect size)
- Is it reproducible? (Repeatability)
- What is the probability of getting a significant result (p<0.05) again in a repeat experiment? (Consistency)
- Can we estimate the result of a repeat experiment from the p value we get? (Predictability)
All of this hidden information explains why we should not treat the p value as the gold standard for evaluating data. In this article, I want to point out 3 common myths regarding p values that we should all be aware of.
p Value Myths
Myth 1: A result with p<0.05 says that there is less than a 5% chance that the null hypothesis is right and we are 95% confident that the positive result is a true effect (e.g., the drug treatment indeed improves memory test score in comparison to placebo).
The reported p value measures how improbable it would be to obtain a result at least as extreme as the observed one (for example, the value X1 from the drug treatment group) by random chance alone, under the assumption that the null hypothesis is true. A significant result (p<0.05) says that there is less than a 5% chance of seeing such an “outlier” result if the drug had no effect at all.
For example, suppose we are testing the efficacy of drug A in improving memory test scores. The statistical test reports that the average test score of 110 with drug A treatment is significantly higher than the average of 100 in the control group (p<0.05). What the test tells us is that, if drug A were no different from placebo, the chance of observing a difference between group averages at least this large would be less than 5%. However, the test cannot tell us how confident we can be in claiming that drug A really improves memory performance, or how reliable the significant p value is.
To quantify the reliability of a statistically significant effect, please refer to the positive predictive value (PPV).
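As a rough illustration of what the PPV captures, here is a minimal Python sketch. It assumes the standard definition of PPV as the share of significant results that are true positives, and it needs two inputs that a p value never gives you: the prior probability that the tested hypothesis is true and the statistical power of the test. The function name and the scenario values are hypothetical, chosen only for illustration.

```python
def positive_predictive_value(prior: float, power: float, alpha: float = 0.05) -> float:
    """Probability that a significant result reflects a true effect.

    prior: assumed fraction of tested hypotheses that are actually true
    power: probability of detecting a true effect (1 - beta)
    alpha: significance threshold (false positive rate under the null)
    """
    true_positives = power * prior
    false_positives = alpha * (1.0 - prior)
    return true_positives / (true_positives + false_positives)


# Hypothetical scenarios: the prior and power values are illustrative assumptions.
for prior, power in [(0.5, 0.8), (0.1, 0.8), (0.1, 0.2)]:
    ppv = positive_predictive_value(prior, power)
    print(f"prior={prior:.1f}, power={power:.1f} -> PPV={ppv:.2f}")
```

The same threshold of p<0.05 gives very different reliability depending on how plausible the hypothesis was to begin with and how well powered the experiment is.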
Myth 2: A result with a significant p value (p<0.05) suggests that a repeat experiment will return another significant p value.
A single p value gives you a very uncertain prediction about repeatability, and it cannot be used to estimate the result of a repeat experiment. Any p value is valid only for the sample from which it was calculated. You would be surprised to see how variable p values are across experimental replicates. This variability is even greater in low-powered tests. If the associated statistical power is low (often because of a small sample size), a repeat experiment can return a substantially different p value. In a commentary in Nature Methods, Halsey and colleagues demonstrated how inconsistent p values are across repeated comparisons. Their computer simulation showed a wide spread of p values (0-0.6) among 1,000 repeat comparisons. To be 80% confident that the next repeat experiment will be significant (p<0.05), we need a sample size (N) of 64! This is often impractical in the real world, especially when you are doing expensive experiments such as animal behavior studies or recruiting human volunteers for psychological tests. Unfortunately, “power failure” can result in a fickle p value and has been suggested as a major cause of the widespread irreproducible results in biology.
For estimating the result of the next repeat experiment, confidence intervals do a much better job than p values.
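You can get a feel for this variability with a small simulation of your own. The sketch below is not the simulation from the commentary. It is a simplified stand-in that assumes normally distributed scores with a known standard deviation of 1, a true difference of 0.5 standard deviations, and a two-sided z-test in place of a t-test, so that it runs with the Python standard library alone. All function names and parameter values are hypothetical.

```python
import random
from statistics import NormalDist, mean


def two_sample_p_value(a, b, sigma=1.0):
    """Two-sided z-test p value for a difference in means, assuming a known SD."""
    standard_error = sigma * (1.0 / len(a) + 1.0 / len(b)) ** 0.5
    z = (mean(a) - mean(b)) / standard_error
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def simulate_p_values(n_per_group, true_difference, replicates=1000, seed=0):
    """Repeat the same experiment many times and collect the p values."""
    rng = random.Random(seed)
    p_values = []
    for _ in range(replicates):
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        treated = [rng.gauss(true_difference, 1.0) for _ in range(n_per_group)]
        p_values.append(two_sample_p_value(treated, control))
    return p_values


# Assumed scenario: a true effect of 0.5 SD, tested with small and large groups.
for n in (10, 64):
    p_values = sorted(simulate_p_values(n_per_group=n, true_difference=0.5))
    significant = sum(p < 0.05 for p in p_values) / len(p_values)
    print(f"N={n}: fraction significant={significant:.2f}, "
          f"5th-95th percentile of p: {p_values[50]:.4f} to {p_values[949]:.4f}")
```

Under these assumptions the effect is real in every replicate, yet with 10 subjects per group only a minority of the repeats reach p<0.05, while with 64 per group most of them do.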
Want to know how to estimate the repeatability of a significant p value? Check out this article about statistical power.
Myth 3: A result with a very small p value indicates that the effect is big (e.g., a great improvement in the memory test when the associated p value is less than 0.01).
The p value tells us whether there is an effect (drug treatment improves memory) among the subjects we tested, but it cannot tell us how large the effect is. Indeed, no matter how impressive (p<0.001) or reproducible the result is, the p value gives us no insight into the magnitude of the effect. For that, we need another measure: effect size.
The p value tells you that the two groups (experimental vs control) are different, while the effect size (d) tells you the magnitude of the difference.
Let’s go back to the experiment testing drug efficacy in improving memory test scores. Looking at the data by eye, we can easily tell that drug B (50 vs 30) is more effective than drug A (35 vs 30) in improving the memory test score. Their associated p values agree with this intuitive judgement (a non-significant effect of A and a highly significant improvement with B).
However, let’s look at another batch of data from the same type of experiment. The only difference is that we increase the sample size (N) of the drug A group from 6 to 16. Although the magnitude of the test score increase is the same (35 vs 30), the statistical test now gives us an even smaller p value for A than for B. This result shows that the p value depends on sample size and that statistical significance does not necessarily have practical meaning. In effect, we can easily get an impressive p value by using a huge sample size (N) even though the difference between the two groups is small. If our judgement of drug efficacy is based on the p value alone, we will draw the faulty conclusion that drug A is more effective than B.
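To see how strongly sample size alone drives the p value, here is a small Python sketch. It keeps the observed difference fixed and only changes N. It is a simplification of the example above: both groups are assumed to have the same size, the standard deviation of 10 is an invented value, and a two-sided z-test with known standard deviation stands in for whatever test was actually used.

```python
from statistics import NormalDist


def p_value_for_difference(mean_difference, sd, n_per_group):
    """Two-sided z-test p value for a fixed observed difference and known SD."""
    standard_error = sd * (2.0 / n_per_group) ** 0.5
    z = abs(mean_difference) / standard_error
    return 2.0 * (1.0 - NormalDist().cdf(z))


# Hypothetical numbers: a 5-point difference (35 vs 30) with an assumed SD of 10.
for n in (6, 16, 50, 200):
    p = p_value_for_difference(mean_difference=5.0, sd=10.0, n_per_group=n)
    print(f"N={n:>3} per group: p={p:.4f}")
```

The difference between the groups never changes, but the p value keeps shrinking as N grows and eventually drops below 0.05.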
To make a more biologically sound judgement on the data you get, you need to also measure the “effect size”.
What is effect size and how do you calculate it?
Effect size quantifies the magnitude of the difference between groups. It tells us how big the difference is. Unlike the p value, it is independent of sample size (N). Depending on the type of comparison, effect size can be calculated with different indices. A common formula for calculating the effect size (d) for a design with two independent groups is:
d = |µ1 - µ2| / σ
µ1: the mean of the experimental group
µ2: the mean of the control group
σ: the standard deviation (often the pooled standard deviation of the two groups).
If we plug in the values from the drug efficacy experiment and calculate the corresponding effect sizes (formula: d=|µ1-µ2|/σ) for drugs A and B, we find that drug B (d=3.8) is definitely more “effective” than A (d=1.7).
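If you want to compute d from raw scores yourself, a minimal Python sketch could look like the following. It assumes that σ is taken as the pooled standard deviation of the two groups, which is one common choice. The scores are invented for illustration and are not the data behind the d values quoted above.

```python
from statistics import mean, variance


def cohens_d(experimental, control):
    """Effect size d = |mean1 - mean2| / pooled standard deviation."""
    n1, n2 = len(experimental), len(control)
    pooled_variance = ((n1 - 1) * variance(experimental)
                       + (n2 - 1) * variance(control)) / (n1 + n2 - 2)
    return abs(mean(experimental) - mean(control)) / pooled_variance ** 0.5


# Hypothetical memory test scores, invented for illustration only.
control = [28, 31, 27, 33, 30, 31]
drug_a = [33, 36, 32, 38, 35, 36]
print(f"d = {cohens_d(drug_a, control):.2f}")
```

Because d is scaled by the spread of the data, and not by the number of subjects, collecting more animals with the same means and spread would leave it essentially unchanged.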
A p value can fool you and your readers into making wrong judgements. To avoid misinterpretation and increase data reproducibility, more and more journals and scientific communities, including the American Psychological Association, require reporting not only the p value but also the effect size with its confidence interval. The p value simply indicates how likely a result at least this extreme would be under random chance alone, while the effect size with its associated confidence interval reveals the magnitude of the difference or association, the spread of the data points and, more importantly, a more reliable estimate of the result of a repeat experiment. With all this information at hand, readers can make more accurate judgements on the basis of the full spectrum of the data.