P-value abuse directly contributes to one of the biggest problems facing the scientific community: the prominence of false-positive results in the published literature. Contrary to popular interpretation, the p-value doesn't indicate the likelihood that the observed result was due to chance; it is the probability of observing data at least as extreme as yours, assuming the null hypothesis is true. Moreover, the p-value alone cannot speak to the strength of evidence, which is better inferred by also considering effect size, prior probability, and experimental reproducibility.
But this isn’t an article about the p-value, per se. Rather, I want to talk about something which we all have experience with, to some degree or another: data dredging, or p-hacking.
What is P-Hacking?
The term p-hacking, popularized in 2014 by Regina Nuzzo in Nature News, describes the conscious or subconscious manipulation of data in a way that produces a desired p-value. P-hacking typically works through "researcher degrees of freedom," the decisions left to the investigator: when to stop collecting data, whether the data will be transformed, which statistical tests (and parameters) will be used, and so on. By exploiting these degrees of freedom alone, even data with no real effect can produce a p-value under 0.05 an incredible 61% of the time. A great and timely tool to visualize this is available at 538 Science.
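The inflation is easy to demonstrate in simulation. The sketch below uses only the Python standard library; the two "hacks" it applies (dropping the most extreme point, collecting extra samples) are illustrative choices of researcher degrees of freedom, not the exact analyses from the 61% figure. Both groups are drawn from the same distribution, so every "significant" result is a false positive:

```python
import math
import random

def z_test_p(a, b):
    """Two-sided z-test for a difference in means, assuming unit variance
    (true here by construction, so the test is exact for this simulation)."""
    se = math.sqrt(1 / len(a) + 1 / len(b))
    z = (sum(a) / len(a) - sum(b) / len(b)) / se
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def hacked_p(rng, n=20):
    """Try three analyses of pure-noise data and keep the best p-value:
    the planned test, the test after dropping a's most extreme point,
    and the test after 'collecting more data' in both groups."""
    a = [rng.gauss(0, 1) for _ in range(n)]
    b = [rng.gauss(0, 1) for _ in range(n)]
    ps = [z_test_p(a, b)]
    trimmed = sorted(a, key=abs)[:-1]              # drop the most extreme point
    ps.append(z_test_p(trimmed, b))
    a2 = a + [rng.gauss(0, 1) for _ in range(10)]  # optional extra collection
    b2 = b + [rng.gauss(0, 1) for _ in range(10)]
    ps.append(z_test_p(a2, b2))
    return min(ps)

rng = random.Random(0)
trials = 2000
honest = sum(z_test_p([rng.gauss(0, 1) for _ in range(20)],
                      [rng.gauss(0, 1) for _ in range(20)]) < 0.05
             for _ in range(trials)) / trials
hacked = sum(hacked_p(rng) < 0.05 for _ in range(trials)) / trials
print(f"honest false-positive rate: {honest:.3f}")  # close to the nominal 0.05
print(f"hacked false-positive rate: {hacked:.3f}")  # noticeably inflated
```

With only two modest degrees of freedom the false-positive rate roughly doubles; stacking more choices, as in the original study, pushes it far higher.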
Is Close Enough Good Enough?
One of the authors of the study that brought researcher degrees of freedom to the forefront, Uri Simonsohn, points out that p-values just under 0.05 are extremely over-represented in the published literature. There's no good reason for this to be the case; instead, he (and others) posit that it is evidence of how widespread the hacking of borderline p-values has become. Indeed, it's tempting to see p = 0.06 and switch statistical tests, try a few transformations, or exclude the one point that didn't follow your trend. But this is wrong: the purpose of science is to establish what is true, and p-hacking muddies the waters, wastes time and resources, and may (in extreme cases) diminish the public's trust in science.
The point of this article is not to bash everyone who has subconsciously p-hacked at some point in their life. It’s likely that we all have, save those who were extremely well-trained from day one. But awareness of this phenomenon is critically important, both to prevent bias in our own research and to improve our ability to interpret the results of others. Here are three action items that you can take to the bench with you to prevent p-hacking and enhance the reliability of your conclusions:
- Decide your statistical parameters early, and report any changes. This means deciding your tests ahead of time (e.g., performing a two-tailed, equal-variance Student's t-test for outcome X between groups A and B, but not C). If something legitimate comes up, such as the variances turning out to be decidedly unequal, you can change this parameter, but you should report the rationale behind the change.
- Decide when to stop collecting data, and define an outlier, before you begin. Fix how many replicates will be performed (e.g., each sample will be measured exactly three times) and the threshold (e.g., more than 2.5 standard deviations from the mean) beyond which a sample or replicate will be excluded. This prevents stopping early because you have the result you want, and it prevents repeating the experiment until the result drifts toward what you desire.
- Correct for multiple comparisons, and replicate your own results. If you investigate multiple outcomes, make sure your statistics reflect that. If you stumble onto something interesting in a less-than-savory way (i.e., exploratory fishing), test the new hypothesis again under pre-determined experimental conditions to obtain a valid p-value.
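The correction in that last item can be made concrete. Below is a minimal sketch of one standard family-wise correction, the Holm–Bonferroni step-down procedure; the five p-values are invented for illustration:

```python
def holm_bonferroni(pvals, alpha=0.05):
    """Holm-Bonferroni step-down correction: returns a list of booleans
    marking which hypotheses are rejected at family-wise error rate alpha."""
    order = sorted(range(len(pvals)), key=lambda i: pvals[i])
    reject = [False] * len(pvals)
    for rank, i in enumerate(order):
        # The smallest p-value faces the strictest threshold, alpha / m;
        # each subsequent one faces alpha / (m - rank).
        if pvals[i] > alpha / (len(pvals) - rank):
            break
        reject[i] = True
    return reject

# Five outcomes tested on one dataset; only the strongest survives correction,
# even though three of the raw p-values sit below 0.05.
print(holm_bonferroni([0.004, 0.03, 0.04, 0.20, 0.60]))
# → [True, False, False, False, False]
```

Statistical packages offer the same procedure ready-made, but the point stands either way: two of those "significant" raw p-values are exactly the borderline results that multiple testing manufactures by chance.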
In general, avoiding p-hacking comes down to awareness, planning ahead, and being transparent when post-hoc changes are legitimately needed. Before signing off, I'll address what you're probably thinking: significant is still the magic word for many journals, and p-hacking does increase the number of significant (p < 0.05) results you can produce. But with growing awareness of p-hacking, the availability of other descriptive statistics, and the capacity to plan experiments better by knowing these intrinsic biases, we can (and should) proceed without the fear of seeing p = 0.051.
Hopefully, by doing so, we can contribute to more truth in science.