Data-dredging Bias

A distortion that arises from presenting the results of unplanned statistical tests as if they were a fully prespecified course of analyses.

Background

Data-dredging bias is a general category which includes a number of misuses of statistical inference (e.g. fishing, p-hacking). Each essentially involves probing the data in unplanned ways, then finding and reporting an “attractive” result without accurately conveying the course of analysis. For example, in nearly any analysis of data there are several “researcher degrees of freedom”, i.e., choices that must be made in the process of analysis. Ideally, these choices are guided by the principles of best practice and prespecified in a publicly available protocol. In contrast, p-hacking occurs when an initial analysis produces results that are close to being statistically significant and, in the absence of a study protocol, researchers then make analytic choices (e.g. how to handle outliers, whether to combine groups, which covariates to include or exclude) that produce a statistically significant p-value.[1]

While many different choices might be defensible, a canonical case of p-hacking would involve trying out multiple different options and reporting the result which yields the lowest p-value (particularly when alternative choices generate values that do not yield a significant result). Such an analysis can often generate statistically significant results in the absence of a true effect (i.e. ‘false positives’) and is thus unreliable.
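The inflation of false positives described above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not a model of any particular study: both groups are drawn from the same normal distribution (so there is no true effect), the "analytic choices" are represented by several independent outcome measures, the variance is treated as known, and the function names are hypothetical. With five independent tests at a threshold of .05, the chance that at least one is significant is 1 - 0.95^5, roughly 23%, rather than 5%.

```python
import math
import random


def z_test_p(group_a, group_b):
    """Two-sided p-value for a difference in means, assuming known unit variance."""
    diff = sum(group_a) / len(group_a) - sum(group_b) / len(group_b)
    se = math.sqrt(1 / len(group_a) + 1 / len(group_b))
    return math.erfc(abs(diff / se) / math.sqrt(2))


def false_positive_rate(n_outcomes, n_per_group=20, n_sims=2000, alpha=0.05, seed=1):
    """Share of null studies that look 'significant' when only the
    smallest of n_outcomes p-values is reported."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        p_values = []
        for _ in range(n_outcomes):
            # Both groups come from the same distribution: no true effect.
            a = [rng.gauss(0, 1) for _ in range(n_per_group)]
            b = [rng.gauss(0, 1) for _ in range(n_per_group)]
            p_values.append(z_test_p(a, b))
        if min(p_values) < alpha:  # report only the most "attractive" result
            hits += 1
    return hits / n_sims


if __name__ == "__main__":
    print("one prespecified outcome:", false_positive_rate(1))
    print("best of five outcomes:   ", false_positive_rate(5))
```

Real analytic choices are usually correlated rather than independent, so the inflation in practice may be smaller or larger than in this sketch; the point is only that selecting the lowest p-value breaks the nominal error rate.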

Though some forms of data-dredging are lamentably common, it is important to note that often such problems arise from a lack of awareness rather than malfeasance.[2] As Gelman and Loken (2014) note, “it can seem entirely appropriate to look at the data and construct reasonable rules for data exclusion, coding, and analysis that can lead to statistical significance” (p. 461).[3] In such cases an unconscious tendency to interpret the results in a biased fashion can be guarded against by prespecifying the course of analysis.

Apart from p-hacking, other forms of data-dredging include: assessing models with multiple combinations of variables and selectively reporting the “best” model (i.e., “fishing”);[4] making decisions about whether to collect new data on the basis of interim results; making post-hoc decisions about which statistical analyses to conduct; and generating a hypothesis to explain results which have already been obtained but presenting it as if it were a hypothesis one had prior to collecting the data (i.e., HARKing, “hypothesizing after the results are known”).[5] In general, these procedures are acceptable when transparently reported; however, when authors neglect to accurately report how the results were in fact generated, they are rightfully classified as data-dredging.
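One of the practices listed above, deciding whether to collect more data after looking at interim results, can be sketched in the same way. The simulation below is illustrative only: it assumes no true effect, a known unit variance, a test after every batch of 10 participants per group, and stopping as soon as p falls below .05. The function names and batch sizes are hypothetical.

```python
import math
import random


def z_test_p(group_a, group_b):
    """Two-sided p-value for a difference in means, assuming known unit variance."""
    diff = sum(group_a) / len(group_a) - sum(group_b) / len(group_b)
    se = math.sqrt(1 / len(group_a) + 1 / len(group_b))
    return math.erfc(abs(diff / se) / math.sqrt(2))


def optional_stopping_rate(max_looks, batch=10, n_sims=2000, alpha=0.05, seed=2):
    """Share of null studies declared significant when the data are tested
    after every batch and collection stops at the first p < alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        a, b = [], []
        for _ in range(max_looks):
            # Collect one more batch per group, then peek at the result.
            a.extend(rng.gauss(0, 1) for _ in range(batch))
            b.extend(rng.gauss(0, 1) for _ in range(batch))
            if z_test_p(a, b) < alpha:
                hits += 1
                break  # stop collecting and "report" the significant result
    return hits / n_sims


if __name__ == "__main__":
    print("single planned analysis:", optional_stopping_rate(1))
    print("up to five interim looks:", optional_stopping_rate(5))
```

With a single planned analysis the false-positive rate stays near the nominal 5%; with repeated unplanned looks it rises well above that. Prespecified stopping rules avoid the problem by fixing in advance when the data will be examined.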

Example

Despite numerous published trials and meta-analyses which appear to support the use of progestogens to mitigate pregnancy loss, the evidence does not support their use when trials are limited to those that were preregistered.[6] Of 93 randomized controlled trials, 22 were classified as unlikely to be p-hacked. Of these, only one produced a statistically significant result, and a meta-analysis of the trials found evidence that progesterone was not effective (RR = 1.00, 95% CI 0.94-1.07). In contrast, most previous meta-analyses support the use of progestogens. It is unlikely that this difference is the result of publication bias since, given the ratio of non-significant to significant results in preregistered trials, that explanation would require an enormous number of unpublished studies which found no statistically significant effect. Prior et al. therefore suggest that the difference is likely due to the inclusion of studies susceptible to data-dredging in the previous meta-analyses.[6]
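As a rough check on the figures above, one can ask how many significant results 22 trials would be expected to produce by chance alone. The sketch below assumes, purely for illustration, that every trial tested a single hypothesis at a threshold of .05, that the trials are independent, and that there is no true effect; none of these assumptions is taken from Prior et al., and the function names are hypothetical.

```python
def expected_significant(n_trials, alpha=0.05):
    """Expected number of significant trials if there is no true effect."""
    return n_trials * alpha


def prob_at_least_one_significant(n_trials, alpha=0.05):
    """Probability that at least one of n independent null trials is significant."""
    return 1 - (1 - alpha) ** n_trials


if __name__ == "__main__":
    n = 22  # trials classified as unlikely to be p-hacked
    print("expected significant trials:", expected_significant(n))
    print("P(at least one significant):", prob_at_least_one_significant(n))
```

Under these assumptions, about one significant trial out of 22 is what chance alone would be expected to produce, which is consistent with the pooled estimate of no effect.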

The work of Brian Wansink and the Cornell Food and Brand Lab presents a more extreme example, which has nevertheless become a paradigm case of data-dredging. In a November 2016 blog post Wansink described how, at his encouragement, a visiting scholar reanalyzed the data from a “failed” study and generated four articles supporting claims such as that food tastes worse when buffet prices are low and that men eat more when dining with women in order to impress them. Investigations revealed that Wansink and his team strategized how to generate statistical analyses which would produce flashy results. Crucially, the process by which the team generated the results was not revealed in the published papers. Instead, the results were presented as if the hypotheses had been developed before the data were gathered. Further investigation uncovered almost a decade’s worth of emails in which Wansink and his team strategized ways to dredge through data to find results they felt would be easy to publish, including correspondence with the visiting scholar in which Wansink requests that she “squeeze some blood out of this rock”.[7]

Impact

Though there is no definitive account of the frequency of data-dredging or the severity of the bias it induces, evidence for its existence derives from the fact that there is an unusually large number of published studies with a p-value slightly below .05.[8, 9, 10, 11] While some authors have concluded that p-hacking is unlikely to have had a significant effect on meta-analytic estimates,[12] this assumes that most meta-analyses will include a number of studies with large sample sizes. Authors using p-curve analyses have found that the distribution of p-values is consistent with most research investigating real effects; however, the data are also consistent with some forms of data-dredging.[13, 14] What is clear is that the bias induced by data-dredging will be most severe in cases where the effect size is small, the dependent measures are imprecise, research designs are flexible, and studies are conducted on small populations.[15]
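The intuition behind reading the distribution of significant p-values can be sketched with a short simulation. This is a simplified illustration, not the published p-curve procedure: test statistics are drawn directly from a normal distribution, the `noncentrality` parameter (a hypothetical name) stands for the true effect divided by its standard error, and no p-hacking is simulated. Under no effect, p-values are uniform, so about half of the significant ones fall below .025; under a real effect they pile up near zero. An excess of values just under .05 fits neither pattern.

```python
import math
import random


def share_below_half_alpha(noncentrality, n_studies=20000, alpha=0.05, seed=3):
    """Among significant results, the share with p < alpha / 2.

    noncentrality = true effect / standard error (0 means no true effect).
    """
    rng = random.Random(seed)
    significant = []
    for _ in range(n_studies):
        z = rng.gauss(noncentrality, 1)           # simulated test statistic
        p = math.erfc(abs(z) / math.sqrt(2))      # two-sided p-value
        if p < alpha:
            significant.append(p)
    low = sum(1 for p in significant if p < alpha / 2)
    return low / len(significant)


if __name__ == "__main__":
    print("no true effect:", share_below_half_alpha(0.0))
    print("real effect:   ", share_below_half_alpha(2.0))
```

Because confounding or selective reporting can also shift this distribution, a skewed curve on its own does not establish that the underlying effects are real.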

Preventive steps

John, Loewenstein, & Prelec (2012) found that researchers were generally unaware that data-dredging would induce bias.[2] Researchers frequently endorsed flawed practices such as deciding whether to gather more data after inspecting interim results or whether to exclude outliers after assessing the impact of doing so. Accordingly, there is hope that better statistical education would be beneficial. Authors should prespecify rules for stopping data collection and for how to analyse data (including how to handle outliers, any expected transformations of the variables, what covariates will be controlled for, etc.), and studies should be sufficiently powered to detect meaningful effects. Published studies should list all variables collected in the study, report all planned analyses (specifying primary and secondary outcomes), and include robustness analyses for methodological choices.[1] Optimally, these choices should be registered prior to beginning the study.[15] When methodological decisions were informed by the data collected, the results should be clearly identified as an exploratory analysis and replicated.
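As an illustration of what "sufficiently powered" can mean in practice, the sketch below uses the standard normal-approximation formula for comparing two group means, n per group = 2(z_{1-α/2} + z_{power})² / d², where d is the standardized effect size. The defaults (two-sided α = .05, 80% power) and the function name are assumptions chosen for the example; a real protocol would justify its own values and may need a different formula for its design.

```python
import math
from statistics import NormalDist


def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided, two-sample
    comparison of means (normal approximation, equal group sizes)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the target power
    return math.ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)


if __name__ == "__main__":
    for d in (0.8, 0.5, 0.2):
        print(f"standardized effect {d}: {n_per_group(d)} per group")
```

Fixing the sample size this way before data collection also gives a natural prespecified stopping rule: collect the planned number of participants, then analyse once.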

The p-curve analysis has been suggested as a formal procedure for correcting p-hacking;[16, 17] however, its validity has not yet been established empirically. Particularly in non-randomized studies, confounding can render the p-curve analysis unreliable.[14]

Ultimately, as Banks et al. (2016) have argued, data-dredging is a problem of bad barrels rather than bad apples because research systems incentivize producing nice-looking results.[19] Thus, one effective means of intervention would be to change the standards for conceiving, conducting, and publishing scientific research.[20] For example, the acceptance of articles on the basis of their design rather than their results would significantly alleviate the pressure to dredge data for “attractive” results.[21] Such alterations would represent sweeping changes for most disciplines.

Sources

[1] Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant. Psychological Science, 22(11), 1359–1366. https://doi.org/10.1177/0956797611417632
[2] John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the Prevalence of Questionable Research Practices with Incentives for Truth Telling. Psychological Science, 23(5), 524–532. https://doi.org/10.1177/0956797611430953
[3] Gelman, A., & Loken, E. (2014). The Statistical Crisis in Science. American Scientist, 102(6), 460. https://doi.org/10.1511/2014.111.460
[4] Selvin, H., & Stuart, A. (1966). Data-dredging procedures in Survey Analysis. The American Statistician, 20(3), 20-23.
[5] Kerr, N. (1998). HARKing: Hypothesizing After the Results are Known. Personality and Social Psychology Review, 2(3), 196-217.
[6] Prior, M., Hibberd, R., Asemota, N., & Thornton, J. (2017). Inadvertent p-hacking among trials and systematic reviews of the effect of progestogens in pregnancy? A systematic review and meta-analysis. BJOG: An International Journal of Obstetrics & Gynaecology, 124(7), 1008–1015. https://doi.org/10.1111/1471-0528.14506
[7] Lee, S. (2018). Sliced and Diced: The Inside Story Of How An Ivy League Food Scientist Turned Shoddy Data Into Viral Studies. BuzzFeed News. https://www.buzzfeednews.com/article/stephaniemlee/brian-wansink-cornell-p-hacking
[8] Jager, L. R., & Leek, J. T. (2014). An estimate of the science-wise false discovery rate and application to the top medical literature. Biostatistics, 15(1), 1–12. https://doi.org/10.1093/biostatistics/kxt007
[9] Albarqouni, L. N., López-López, J. A., & Higgins, J. P. T. (2017). Indirect evidence of reporting biases was found in a survey of medical research studies. Journal of Clinical Epidemiology, 83, 57–64. https://doi.org/10.1016/j.jclinepi.2016.11.013
[10] Perneger, T. V., & Combescure, C. (2017). The distribution of p-values in medical research articles suggested selective reporting associated with statistical significance. Journal of Clinical Epidemiology, 87, 70–77. https://doi.org/10.1016/j.jclinepi.2017.04.003
[11] Ioannidis, J. P. A. (2019). What Have We (Not) Learnt from Millions of Scientific Papers with p-Values? The American Statistician, 73(sup1), 20–25. https://doi.org/10.1080/00031305.2018.1447512
[12] Head, M. L., Holman, L., Lanfear, R., Kahn, A. T., & Jennions, M. D. (2015). The Extent and Consequences of p-hacking in Science. PLOS Biology, 13(3), e1002106. https://doi.org/10.1371/journal.pbio.1002106
[13] Bishop, D. V. M., & Thompson, P. A. (2016). Problems in using p-curve analysis and text-mining to detect rate of p-hacking and evidential value. PeerJ, 4, e1715. https://doi.org/10.7717/peerj.1715
[14] Bruns, S. B., & Ioannidis, J. P. A. (2016). P-Curve and p-Hacking in Observational Research. PLOS ONE, 11(2), e0149144. https://doi.org/10.1371/journal.pone.0149144
[15] Ioannidis, J. P. A. (2005). Why Most Published Research Findings Are False. PLoS Medicine, 2(8), e124. https://doi.org/10.1371/journal.pmed.0020124
[16] Wagenmakers, E.-J., Wetzels, R., Borsboom, D., van der Maas, H. L. J., & Kievit, R. A. (2012). An Agenda for Purely Confirmatory Research. Perspectives on Psychological Science, 7(6), 632–638. https://doi.org/10.1177/1745691612463078
[17] Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014). p-curve: A key to the file-drawer. Journal of Experimental Psychology: General, 143(2), 534–547. https://doi.org/10.1037/a0033242
[18] Simonsohn, U., Simmons, J. P., & Nelson, L. D. (2015). Better p-curves: Making p-curve analysis more robust to errors, fraud, and ambitious p-hacking, a Reply to Ulrich and Miller (2015). Journal of Experimental Psychology: General, 144(6), 1146–1152. https://doi.org/10.1037/xge0000104
[19] Banks, G. C., O’Boyle, E. H., Pollack, J. M., White, C. D., Batchelor, J. H., Whelpley, C. E., Abston, K. A., Bennett, A. A., & Adkins, C. L. (2016). Questions About Questionable Research Practices in the Field of Management: A Guest Commentary. Journal of Management, 42(1), 5–20. https://doi.org/10.1177/0149206315619011
[20] Munafò, M. R., Nosek, B. A., Bishop, D. V. M., Button, K. S., Chambers, C. D., Percie du Sert, N., Simonsohn, U., Wagenmakers, E.-J., Ware, J. J., & Ioannidis, J. P. A. (2017). A Manifesto for Reproducible Science. Nature Human Behaviour, 1(1). https://doi.org/10.1038/s41562-016-0021
[21] Nosek, B. A., & Lakens, D. (2014). Registered Reports: A Method to Increase the Credibility of Published Results. Social Psychology, 45(3), 137–141. https://doi.org/10.1027/1864-9335/a000192

