For social science research to fully exploit the opportunities afforded by machine learning, expert knowledge of both social science and computer science best research practices is essential. To demonstrate this, the following example highlights the dangers arising from the disconnect that still exists between the two fields.
A recent comprehensive review published in Psychological Methods on imputation methods for machine learning (Gunn et al., 2022) led readers through statistical multiple imputation methods for regularised LASSO regression. This is an important and relatively rare example of research addressing the use of machine learning by psychologists, integrating the two research streams and accommodating psychologists' need to statistically estimate the variance introduced by data imputation. Indeed, more work like this is needed!
However, the authors did not consider a well-known phenomenon in the machine learning literature: data (or information) leakage. Data leakage occurs when information from the test (or validation) set, which is supposed to act as a separate external dataset, leaks into and contaminates the training data. This happens when preprocessing steps, including imputation, are applied before the data are split into training and test sets (or before cross-validation).
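To make the mechanism concrete, the sketch below (an illustrative Python example using scikit-learn, with simulated data rather than the data behind Figure 1) contrasts a leaky workflow, where the imputer is fitted on the full dataset before cross-validation, with the correct workflow, where the imputer is fitted only on each training fold via a pipeline:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Simulated data: 200 observations, 10 predictors, outcome driven by the first
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X[:, 0] + rng.normal(size=200)
X[rng.random(X.shape) < 0.2] = np.nan  # introduce ~20% missingness at random

# Leaky workflow: the imputer sees ALL rows (including future test folds)
# before cross-validation is run
X_leaky = SimpleImputer(strategy="mean").fit_transform(X)
leaky_scores = cross_val_score(Lasso(alpha=0.1), X_leaky, y, cv=5)

# Correct workflow: the pipeline refits the imputer inside each CV training
# fold, so test-fold information never influences the imputation
pipe = make_pipeline(SimpleImputer(strategy="mean"), Lasso(alpha=0.1))
correct_scores = cross_val_score(pipe, X, y, cv=5)

print("Leaky CV R^2:  ", leaky_scores.mean())
print("Correct CV R^2:", correct_scores.mean())
```

With simple mean imputation the inflation is often modest; leakage tends to be more pronounced with model-based imputers (for example regression or k-nearest-neighbour imputation) and with more missing data, as discussed above.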
My own simulation analyses on real-world data (see Figure 1), as well as literature from other fields (Jaeger et al., 2020), clearly demonstrate that data leakage caused by imputing data before cross-validation (CV), or before any train/test split, can have non-trivial effects on the prediction R squared estimated by the model. As shown in Figure 1, when leakage occurs, the model overestimates prediction performance. This effect is more severe with more missing data or smaller sample sizes (Jaeger et al., 2020).
Figure 1. Inflated prediction performance (measured using prediction R squared) when data leakage is caused purely by imputation prior to cross-validation (CV).
If this practice remains widespread among those using machine learning in the social sciences, it has the potential to worsen the replication crisis through the reporting of inflated performance metrics, counteracting the very benefit that external validation offers Psychology.
This is just one of many problematic clashes of methodology and research practice I have encountered in my work at the intersection of Psychology, inferential statistics, and machine learning methods. Examples like this highlight the need for more interdisciplinary research effort: not only to make machine learning more accessible to social scientists and provide a springboard for novel insights and research methods, but also to ensure that, where these tools are adopted and adapted for social science research, they are used in a way that does not harm the research integrity of the field of Psychology.
Gunn, H. J., Hayati Rezvan, P., Fernández, M. I., & Comulada, W. S. (2022). How to apply variable selection machine learning algorithms with multiply imputed data: A missing discussion. Psychological Methods.
Jaeger, B. C., Tierney, N. J., & Simon, N. R. (2020). When to Impute? Imputation before and during cross-validation. arXiv preprint arXiv:2010.00718.