Understanding what quantitative data captures and the context in which it is produced is crucial for preventing bias and overfitting in applied data science and AI. When conducting big data analyses, the researcher is often removed from the data collection process, which increases the likelihood of misinterpreting the variables, the sample population, or the environment in which the data were collected. Here, I illustrate how such misunderstandings can lead to procedural overfitting (Yarkoni & Westfall, 2017) and algorithmic bias. I also provide examples of how qualitative data, in combination with interpretable or explainable AI, can help.
Procedural overfitting, also referred to as "human-in-the-loop overfitting" (Hofman et al., 2017), occurs when decisions made within the researcher's degrees of freedom cause a model to perform exceedingly well on a specific dataset but fail to generalize to other data or contexts. It can be caused by questionable research practices (John et al., 2012), human biases (e.g., confirmation bias), or, as I argue, a lack of knowledge and understanding of the data. For example, the researcher can incorrectly specify the research question, make incorrect assumptions about data independence, or misunderstand what the variables are measuring. Below, I describe how misconceptions about what the outcome variable measures can lead to answering a different, easier question than the one claimed.
Misunderstanding what engineered features capture
Let’s walk through an example from my research on understanding the attributes of people who regularly buy single-use plastic bags (Lavelle-Hill et al., 2020). In this study, the outcome variable was real-world bag-buying behavior, measured using transaction data, and the predictors were shopping behavior, demographics, personality traits, and attitudes, measured using a large self-report questionnaire. As is often the case in big data analyses, we needed to engineer our outcome variable. This required a number of decisions, an exercise of the researcher's "degrees of freedom". For example, to obtain a binary variable, we grouped people into two equally sized categories: regular bag buyers and irregular bag buyers. An initial naive grouping, based on the absolute number of bags purchased, led to incredibly high predictive performance from our machine learning algorithm.
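The kind of naive grouping described above can be sketched in a few lines of Python. The function name, the median split, and the bag counts are all illustrative assumptions, not the paper's actual procedure:

```python
import statistics

def naive_binary_outcome(bag_counts):
    """Label a shopper 1 ('regular buyer') if their total bag count is at or
    above the sample median, else 0 -- yielding two roughly equal groups."""
    median = statistics.median(bag_counts)
    return [1 if count >= median else 0 for count in bag_counts]

# Illustrative bag totals for six shoppers (the median here is 11.5).
print(naive_binary_outcome([2, 15, 40, 3, 22, 8]))  # → [0, 1, 1, 0, 1, 0]
```

The problem is that a raw count like this absorbs everything that drives total purchases, including how often a person shops at all.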
But prediction performance alone is not the goal in social science research! Using eXplainable AI (XAI) methods to unpack which variables the algorithm used to make its predictions, it became clear that our outcome was capturing something subtly different from what we had anticipated. The most important variables in the model were all related to shopping frequency. What we were really predicting was a person's overall shopping activity at the store (because people who shop more frequently also need more plastic bags).
Thus, although we had a model that accurately predicted frequent vs. infrequent plastic bag buyers, as we had defined them, the prediction performance was clearly driven by the confounding variable of overall shopping activity. What we really wanted to investigate was which factors predict single-use plastic bag purchasing when controlling for overall shopping activity.
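A small simulation makes the confound concrete. All quantities here are invented for illustration: bags bought are made to track trip counts, so a "model" that sees only shopping frequency recovers the naive label almost perfectly:

```python
import random
import statistics

random.seed(0)

# Simulated shoppers: bags bought scale with trips made (plus noise),
# independent of any attitude toward plastic bags.
trips = [random.randint(5, 100) for _ in range(1000)]
bags = [t * random.uniform(0.8, 1.2) for t in trips]

# Naive outcome: median split on raw bag counts.
median_bags = statistics.median(bags)
labels = [1 if b >= median_bags else 0 for b in bags]

# A "predictor" that uses only shopping frequency matches the labels closely.
median_trips = statistics.median(trips)
preds = [1 if t >= median_trips else 0 for t in trips]
accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
print(f"accuracy from frequency alone: {accuracy:.2f}")
```

High accuracy here says nothing about attitudes or demographics; it only confirms that the label is a proxy for shopping frequency.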
As it is not straightforward to control for variables in machine learning models, we re-engineered our outcome variable to partial out the effects of (1) overall shopping activity and (2) whether or not a bag was actually needed to carry home the purchase (given the number and size of items bought). The latter was to prevent us from accidentally predicting people who regularly purchase one item versus many, or small items (such as gum, which can fit in a pocket) versus large ones. We did this by engineering our outcome variable as the proportion of shopping trips on which a person bought a single-use plastic bag, restricted to trips where we estimated a bag was needed (using the overall volume of items in the shopping basket).
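A minimal sketch of this re-engineered outcome, assuming a simple volume threshold for "a bag was needed" (the field names and the threshold are hypothetical, not the values used in the study):

```python
# Assumed cut-off above which we deem that a bag was needed (hypothetical).
BASKET_VOLUME_THRESHOLD = 4.0

def bag_buying_rate(shopping_trips):
    """Proportion of bag-needing trips on which a bag was actually bought.

    shopping_trips: list of dicts with 'basket_volume' and 'bought_bag' keys.
    """
    needed = [t for t in shopping_trips
              if t["basket_volume"] >= BASKET_VOLUME_THRESHOLD]
    if not needed:
        return None  # no trips required a bag; the outcome is undefined
    return sum(t["bought_bag"] for t in needed) / len(needed)

trips = [
    {"basket_volume": 1.0, "bought_bag": False},  # pocket-sized basket, ignored
    {"basket_volume": 6.0, "bought_bag": True},
    {"basket_volume": 8.0, "bought_bag": False},
]
print(bag_buying_rate(trips))  # → 0.5
```

Because the denominator counts only bag-needing trips, a frequent shopper and an infrequent one are put on the same scale.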
Although we can never perfectly eliminate confounds in behavioral trace data, our re-engineered outcome variable allowed us to better capture people’s choices to opt (or not) for an environmentally friendly alternative to single-use plastic bags. When predicting the new outcome variable, we did not achieve as high a prediction accuracy, because we had (mostly) partialled out the strong effect of shopping frequency. However, when we analyzed the feature importance metrics, we gained interesting insights into the psychological and demographic drivers of single-use plastic bag buying (see Lavelle-Hill et al., 2020). This example shows how correctly specifying the research problem, question, and outcome variable is crucial for conducting meaningful social science research with digital data. This is particularly pertinent in the social sciences, where we typically want to understand a phenomenon and not just predict it.
Using qualitative methods to understand the nuances in quantitative data
In the example above, misunderstanding what the outcome variable captures harms only the effort to understand the research topic. But in applied data science, where algorithms are increasingly employed to help us make decisions, misunderstanding the data can have real consequences for people's lives. This is where qualitative data can help.
Qualitative methods provide a crucial mechanism for understanding the more complex and nuanced behaviors captured in data. In the plastic bag example, using interpretable machine learning methods, combined with general knowledge about shopping behavior, was enough to re-specify the outcome variable to capture what we wanted. But what if we don’t have “general knowledge” of the domain? Or worse, we think we do, but in actuality we do not.
The real-world consequences of misinterpreting data became particularly apparent to me while working on a project at The Alan Turing Institute, in collaboration with charity partners, NGOs, and the UK government, to estimate the prevalence of sexual exploitation from online sex worker adverts. After speaking with different stakeholders, the danger of making assumptions quickly became apparent: as someone from outside the communities the data involve (i.e., sex workers and paying customers), it is easy to misread the behaviors, activities, and roles captured in the data.
To illustrate, let’s take our attempt to understand the large volume of highly similar or near-duplicate adverts offering sexual services in the web-scraped data. After speaking to and interviewing different stakeholders, it became clear that there were several behavioral explanations for this artifact, each with a wildly different interpretation.
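To give a feel for how such near-duplicates might be surfaced, here is a minimal sketch using Jaccard similarity over word 3-shingles. The advert texts and the approach itself are illustrative assumptions, not the project's actual pipeline:

```python
def shingles(text, k=3):
    """Return the set of k-word shingles in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(text_a, text_b):
    """Jaccard similarity between the shingle sets of two texts."""
    a, b = shingles(text_a), shingles(text_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Invented adverts that differ only in the advertised name.
ad1 = "new in town sweet and friendly anna offers relaxing massage call now"
ad2 = "new in town sweet and friendly bella offers relaxing massage call now"
sim = jaccard(ad1, ad2)
print(f"similarity: {sim:.2f}")  # most shingles are shared
```

On its own, a high similarity score cannot distinguish a worker re-posting under different personas from a third party advertising multiple people; that distinction requires the qualitative context.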
For example, reportedly, a professional sex worker will re-post different versions of the same advert under different names, either for privacy/safety reasons or to create different personas to market themselves to different or new clientele. Alternatively, a “pimp” might use the same template text to advertise multiple different girls, changing only their names and descriptions of their appearance. The latter case implies that the sex worker is not in control of their own work. Therefore, while one interpretation of the duplicate ads implicates potentially exploitative practices, the other exemplifies professionalism and business-savvy behavior from an independent sex worker.
Figure 1. The spatial-temporal distribution of three adverts with a large amount of duplication
This was just one of many examples we came across in which a potential indicator of exploitation could also be an indicator of sex worker professionalism. It was only through ongoing qualitative research, speaking to domain experts, law enforcement, survivors, and charities that have worked with both professional sex workers and victims of sexual exploitation, as well as the websites hosting the adverts, that we (as outsiders) could even start to understand the complex and multi-faceted phenomenon hidden behind the data.
In this project, had we not gained the qualitative insights from professional sex workers, and instead used only the perspective of a law enforcement officer looking to detect sexual exploitation, we could have built a biased algorithm that unfairly directed unwanted police attention towards professional sex workers, a community that already reports feeling marginalized (and sometimes targeted) by the police.
A take-home message?
In summary, qualitative data collected from interviews, surveys, and the “lived experiences” of those within the data's context represent a crucial part of responsible quantitative data science. Such an approach can help protect against procedural overfitting arising from the misspecification of variables, and helps prevent accidental bias in algorithms that results from interpreting the data from a narrow perspective.
Hofman, J. M., Sharma, A., & Watts, D. J. (2017). Prediction and explanation in social systems. Science, 355(6324), 486-488.
John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological Science, 23(5), 524-532.
Lavelle-Hill, R., Goulding, J., Smith, G., Clarke, D. D., & Bibby, P. A. (2020). Psychological and demographic predictors of plastic bag consumption in transaction data. Journal of Environmental Psychology, 72, 101473. https://doi.org/10.1016/j.jenvp.2020.101473
Yarkoni, T., & Westfall, J. (2017). Choosing prediction over explanation in psychology: Lessons from machine learning. Perspectives on Psychological Science, 12(6), 1100-1122.