We begin by describing a typical data annotation workflow to illustrate the contexts of data work within the ML pipeline (section 4.1). Our study identified three approaches towards diversity in practice and the underlying logics motivating our participants’ choices (section 4.2). We then present the barriers limiting a nuanced consideration of annotator diversity (section 4.3).
4.1 Data Annotation Workflow and Tasks
The first step in a typical data annotation workflow was identifying the data needs for the ML projects while considering the downstream applications (depicted in figure 1). The practitioners then selected the annotation service/platform and proceeded to design the annotation task with the help of the platform managers, starting with a pilot phase to test and iteratively improve the annotation guidelines. These improvements often included additional examples or edge cases to provide clearer instructions or clarifications to the annotators. While the annotators labelled the data in bulk, the practitioners monitored the process regularly to ensure the quality of the annotations. The monitoring often involved comparing the annotated data with a ‘golden dataset’ created by experts or the practitioners, and verified by the machine learning models they built.
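As a concrete illustration of this monitoring step, the sketch below shows one common form such a check can take: comparing bulk annotations against a small expert-verified ‘golden’ set and flagging annotators whose agreement falls below a threshold. The data format, field names, and the 0.9 cut-off are our own illustrative assumptions rather than details drawn from any participant’s pipeline.

# Illustrative sketch (not from any participant's pipeline): a minimal quality
# check comparing bulk annotations against a small, expert-verified 'golden'
# set and flagging low-agreement annotators.
from collections import defaultdict

golden = {"item_01": "toxic", "item_02": "not_toxic"}  # expert labels (assumed format)

# (annotator_id, item_id, label) triples, as they might arrive from a platform export
annotations = [
    ("a1", "item_01", "toxic"),
    ("a1", "item_02", "not_toxic"),
    ("a2", "item_01", "not_toxic"),
    ("a2", "item_02", "not_toxic"),
]

def agreement_with_golden(annotations, golden):
    """Per-annotator accuracy on items that appear in the golden set."""
    hits, totals = defaultdict(int), defaultdict(int)
    for annotator, item, label in annotations:
        if item in golden:
            totals[annotator] += 1
            hits[annotator] += int(label == golden[item])
    return {a: hits[a] / totals[a] for a in totals}

THRESHOLD = 0.9  # illustrative cut-off; real projects tune this per task
scores = agreement_with_golden(annotations, golden)
flagged = [a for a, score in scores.items() if score < THRESHOLD]
print(scores)   # {'a1': 1.0, 'a2': 0.5}
print(flagged)  # ['a2']

Checks of this kind treat the golden labels as ground truth, an assumption that recurs in practitioners’ accounts of objectivity discussed in section 4.2.1.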
In our study, most practitioners relied on internal data infrastructures to produce annotations. These platforms were used to recruit and manage annotators and facilitate annotation tasks. Our survey results showed a preference for these internal platforms over external annotator-facing marketplaces such as MTurk (echoing Wang et al. [95]). Of the 44 survey respondents, 25 used internal infrastructures and 8 outsourced to third-party vendors like Appen and Scale AI. The top factors influencing platform choice were cost (15), timeline (16), and platform quality (17) (reflected through proprietary tech, UX, and support). Only 8 practitioners considered the diversity of data workers on the platform as a deciding factor.
The types of ML projects that our participants worked on skewed heavily towards language-based tasks. Among the survey respondents, 22 (50%) worked on language-related tasks (classification or generation), eight identified object/entity recognition as their task type, and five worked on human evaluation of model-generated data. Examples of language tasks from our interview participants include semantic parsing, translation, de-contextualising sentences, harmful content detection, and more. Other types of projects represented among our interview and survey respondents include detecting anomalies in chest X-rays, segmenting rivers in images for flood forecasting, and developing taxonomies of items found on an online marketplace.
4.2 Approaches to Annotator Diversity
Below, we capture the varied perspectives on annotator diversity among the participants in our study, from some who considered diversity irrelevant, to others who made efforts to accommodate it, even if only in partial ways. Most practitioners acknowledged the role of annotator subjectivities in the annotation process; some emphasised the importance of diversity in achieving a balanced view and bringing previously overlooked sub-populations into consideration. However, many went on to explain their decisions to not consider annotator diversity when setting up annotation tasks. Their primary focus was on achieving a certain level of quality, which was often measured against how closely an annotator followed pre-defined parameters and guidelines. We discuss the justifications practitioners provided for taking a representationalist approach to annotator diversity and prioritising measures of quality.
According to the survey results, a significant majority of respondents (75%) considered diversity to be a somewhat to extremely influential factor (≥3 on the Likert scale) in the quality of their annotated datasets. However, despite the perceived importance of various individual attributes and characteristics of annotators, very few of these factors were utilised in the recruitment process (see table 2). For instance, only 4 out of 15 practitioners who considered gender to be a relevant criterion included it in their selection criteria. Furthermore, even when additional information on annotator characteristics was available, it was rarely utilised in the recruitment process (table 2, column ‘Available Information’). These findings suggest a potential disconnect between the perceived importance of diversity in annotation and actual recruitment practices.
Additionally, our interviews revealed a significant contrast between participants’ reflections on the concept of annotator diversity and its actual implementation. Annotator diversity was often understood as representation of a particular perspective or point of view. For example, annotators were selected based on the low-resource dialect they spoke or the high flood risk areas they lived in, using language and location as proxies for diversity. Very few participants actively recruited annotators based on their lived experiences, knowledge, or expertise as facets of diversity. In the rare cases where local knowledge and expertise were considered important, measurable criteria were applied to assess the annotators’ expertise or knowledge. For instance, P6’s motivation for selecting annotators for a mapping project was to include individuals from underrepresented backgrounds and “capture the other parts of the population”. Similarly, P9 spoke of recruiting “a person of X identity with Y knowledge” to ensure diversity. Overall, interview participants demonstrated a view of diversity through the lens of categories and metrics, rather than one tied to experiential knowledge and expertise.
4.2.1 The pursuit of objective annotations.
Many practitioners invoked the domain and specific nature of their annotation task (e.g., text style transfer) to justify de-prioritising annotator diversity. In both the survey and interviews, several practitioners stated that annotator diversity was not necessary when the task was considered objective. One survey respondent stated, “our data had ground truth,” to suggest how some annotation tasks can be objectively assessed. A belief in the objectivity of certain annotation tasks was often based on the notion that some questions have definitive answers and certain content can be definitively labelled. Annotator diversity was dismissed as irrelevant, particularly by those who described their annotation tasks as requiring subject expertise, such as detecting anomalies in chest X-rays or linguistic corpus detection tasks. P1 highlighted this perspective in their experience with annotation tasks:
“Our primary consideration was how medically trained the annotators were and how much time they had for the annotation. So with regard to factors for diversity, I do not think that was a consideration because it was never intended to be used in the general population.”
We observed similar practices when annotation results triggered additional quality checks. In cases of disagreement between annotators, resolvers (acting as experts) were brought in to make the final decision on the correct annotation. For a language understanding annotation task for voice assistants, P11 described how resolvers would handle discrepancies, either by choosing one of the existing annotations as correct or by creating a new one from scratch. The resolvers’ expertise was often determined by professional experience (typically a greater number of years working in annotation).
Practitioners considered the design of tasks as an effective intervention point to ensure objective annotations. They used training sessions and guidelines to teach annotators how to make correct judgements by closely following the instructions without deviation. Participants provided examples of their tasks (e.g., river segmentation) which were intended to capture predetermined phenomena that could be made explicit through the annotation instructions. Detailed instructions were passed from practitioners to annotators through layers of quality checks conducted by platform leads or team managers. Instruction documents broke down annotation tasks into simple, repeatable sub-tasks that were “very hard to answer in a biased way” (P8), all as efforts to reduce inconsistencies and to standardise the work for all raters. As P7 articulated, “it is less about choosing the right raters and more about ensuring that they have that [standardised] understanding.” In effect, being ‘objective’ was considered a trainable skill, and the training sessions and guidelines were essential for instructing annotators to ‘see’ objectively.
4.2.2 The attempt to remove bias.
Most practitioners recognised the complexity of accounting for annotators’ diverse subjectivities, and used that as a justification to avoid over-complicating the goal of achieving useful and testable AI/ML outcomes. AI/ML workflows were designed to facilitate consistent evaluation across a range of source data, tasks, techniques, and annotators. This control over the development process helped participants compare the performance of AI/ML models and identify areas for optimisation. For example, P6 included questions with definitive answers in their dataset “to have an easy way to evaluate the answers in the end.” Speaking of their specific area of work, P6 explained how information-seeking tasks are created using “a specific span, so you can point to which span contains the answers.” In effect, practitioners enacted mechanisms to circumvent complexity by limiting the plausible options in an annotation task and reducing ambiguity when evaluating a model’s performance.
The desire to control ambiguity and complexity in data annotations extended to addressing annotator subjectivities, which were regularly framed as a form of ‘bias’ manifested in disagreements between annotators. In discussions about the implications of diversity, participants frequently conflated the concept of diversity with bias, viewing it not as something to be understood, but rather as a source of variability to be corrected or technically resolved. Practitioners made concerted efforts to minimise the effects of annotator diversity in order to make practical progress in modelling. To account for potential biases and differences in annotators’ backgrounds and experiences, interventions were carefully implemented throughout the annotation process in order to eliminate disagreement and reduce annotator bias.
Many practitioners acknowledged that data quality (typically measured as an accuracy rate) and disagreement could be a result of flaws in the design of the annotation guidelines or annotation interfaces. However, a few participants attributed disagreement and inconsistency to individual annotators’ attributes. Differences were not understood in terms of annotators’ diverse opinions but rather as unsatisfactory work quality, or worse, questionable work ethics. P3’s comment is illustrative of this perspective: “[the] reason for disagreement could be multiple factors. They don’t have the required knowledge, [the] task itself could be ambiguous [...] Second is the quality of guidelines. Third is their motivation and the quality of the work. If they are doing it without a high consideration for quality, they may not push themselves enough for high quality output and that could show up in the disagreements.”
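To make concrete how disagreement typically surfaced as a quality metric rather than as a signal of differing perspectives, the sketch below computes a simple per-item agreement score (the share of matching annotator pairs) and routes low-agreement items to a resolver. The data, the pairwise-agreement measure, and the 0.67 threshold are illustrative assumptions, not drawn from any participant’s setup.

# Illustrative sketch (assumed data and threshold, not any participant's code):
# per-item agreement computed as the share of annotator pairs that match.
# Used this way, disagreement becomes a quality flag to be resolved rather
# than a record of differing perspectives.
from itertools import combinations

labels_by_item = {
    "item_01": ["toxic", "toxic", "not_toxic"],
    "item_02": ["not_toxic", "not_toxic", "not_toxic"],
}

def pairwise_agreement(labels):
    """Fraction of annotator pairs assigning the same label to an item."""
    pairs = list(combinations(labels, 2))
    return sum(a == b for a, b in pairs) / len(pairs)

for item, labels in labels_by_item.items():
    score = pairwise_agreement(labels)
    status = "ok" if score >= 0.67 else "needs resolver"  # illustrative threshold
    print(item, round(score, 2), status)
# item_01 0.33 needs resolver
# item_02 1.0 ok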
4.2.3 The quest for neutral representation.
Only a handful of participants took the extra step of incorporating annotator information into their data production and model building processes. In an effort to engage with annotator diversity, they recruited annotators from various backgrounds, for instance seeking a balanced gender ratio and recruiting across multiple geographic locations. However, these practitioners struggled to determine and prioritise a set of relevant social categories for their specific task and domain. For example, P5 expressed the desire to capture a “representation of every single person and every single dimension” in their research on toxicity annotation, and the rationale behind this attempt: “otherwise we get biased annotations, and if we train models on that, they will amplify this bias [...] There is a disagreement based on demographic characteristics. Even if your other demographic attributes are the same, just because of the location, you might have a different perception of the data.” Seeking representative annotations was closely intertwined with efforts to eliminate bias and the pursuit of objectivity.
Practitioners often described how collecting a diverse range of perspectives can accurately reflect the real world, and that this representation and aggregation can achieve a neutral stance for building machine learning models. Participants who were attuned to the effects of diversity often attempted to capture differences in annotation patterns and establish a correlation between the patterns and the identities of the annotators. Many also noted the tension between representing the diverse global population and catering to their specific user base. P1 discussed their medical image annotation project where they procured and annotated 80% of their training data from the Global South, despite the eventual deployment of their models in the Global North due to varying data regulations. A few participants attempted to build systems that would effectively serve marginalised groups, but struggled to justify the additional resources (e.g., time and budget) needed for these efforts from a business perspective.
The current state of data and machine learning practices did not actively support explorations of annotator diversity, limiting early attempts to incorporate diverse annotator perspectives into AI/ML models. When annotators provided label assessments from diverse viewpoints, practitioners were unable to distinguish minority opinions from ‘noise’ that deviated from instructions. The annotations, potentially rich in diversity, were aggregated and distilled to eventually arrive at an agreement or an acceptable range useful for AI/ML modelling. However, at an individual level, annotators were trained to adhere closely to task instructions and to move away from their individual interpretations. At a cohort level, the majority vote technique was commonly used to select the salient result, producing data that was neither diverse nor neutral.
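The sketch below illustrates the majority-vote aggregation described above on assumed toy data: whichever label a minority of annotators chose, whether noise or a genuinely different perspective, simply disappears from the aggregated dataset.

# Illustrative sketch of majority-vote aggregation (assumed data; ties are
# broken arbitrarily by Counter ordering). Minority labels, whether noise or
# genuine minority perspectives, do not survive into the aggregated dataset.
from collections import Counter

labels_by_item = {
    "item_01": ["hate_speech", "hate_speech", "benign"],
    "item_02": ["benign", "benign", "benign", "hate_speech"],
}

aggregated = {
    item: Counter(labels).most_common(1)[0][0]
    for item, labels in labels_by_item.items()
}
print(aggregated)  # {'item_01': 'hate_speech', 'item_02': 'benign'}
# The dissenting votes are gone: the aggregate records neither who
# disagreed nor why.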
4.3 Barriers to Incorporating Diversity
In our study, participants identified several challenges to accommodating annotator diversity. Firstly, participants reported a lack of access to information about annotators, hindering their ability to understand and account for annotators’ unique perspectives and backgrounds. Additionally, the limited communication and collaboration between practitioners and annotators resulted in practitioners having minimal knowledge of annotators beyond their worker identification (worker-ID). Lastly, the lack of clear and actionable pathways to incorporate annotator socio-demographic information into the development and evaluation process further diminished the motivation for practitioners to prioritise diversity among data annotators.
4.3.1 Lack of information about annotators.
Several practitioners expressed their lack of knowledge about the annotators working on their tasks. Often, the only information they had about these annotators came from website brochures or blog posts from the third-party data-labelling platforms they used for recruitment. This information was not specific to their projects, but rather was reported in aggregate and publicly available. In practice, annotation projects were run on a ‘good-faith basis’, where the third-party platforms were trusted to satisfy the annotator recruitment requirements, and there were rarely any opportunities to confirm if annotators met the specified criteria. This lack of transparency raised concerns among practitioners.
Out of the 44 survey respondents, 19 reported having access to the annotators’ geographic location and 15 had access to their language proficiency. Other commonly available annotator information included education level (11), subject matter expertise (11), and gender (10). While 17 respondents did not face any challenges in obtaining the information they required, many others cited short project timelines (8) and legal constraints (7) that restricted their ability to better understand their annotators’ backgrounds. Additionally, 18 respondents reported difficulties in accessing suitable annotators for their tasks. Overall, the respondents expressed a lack of control over the selection of their annotators.
While understanding annotators’ backgrounds and considering diversity was important for improving the data production process, the incorporation of annotator information into data production had legal implications as well. Practitioners were wary of collecting sensitive personal information about annotators, such as their sexual or political orientations. This created a tension between protecting annotator privacy and gaining a more nuanced understanding of their backgrounds. In managing annotation projects at a big tech company, P8 explained the challenges of recruiting a diverse group of annotators:
“You are not allowed to select people for a job based on certain characteristics. It is illegal to give a questionnaire as to their sexual orientation and select people based on certain orientations to fill up a data center. Even in countries where it can be done, there is no way [big tech company] would expose themselves to a potential PR nightmare of an article about how [big tech company] is selecting certain sexual orientations for data annotations.”
A range of structures, including legal, ethical, and corporate considerations, created obstacles to recruiting diverse annotators. As P8 noted, these considerations intersected to make “selecting annotators to be diverse... impossible”. Additionally, the moral obligation to protect annotators from potential harm during repeated annotation tasks involving harmful content added another layer of complexity to the process. For P8, any effort to collect information about annotators had to treat their well-being as a fundamental consideration.
4.3.2 Separation of operations.
Most practitioners in our interviews relied on third-party platforms to annotate their datasets. However, a separation of operations between the annotation platforms and the practitioners led to several challenges. Annotation platforms, while helpful in reducing the workload for practitioners, introduced a disconnect between the practitioners and the annotators. Most annotation projects were mediated by a platform manager or team lead who facilitated the communication between the practitioners and annotators. As a result, direct communication between these groups was rare, if it happened at all. For example, P14 outlined the communication barriers in getting the most value out of their annotation process at a large tech company:
“The thing is that it was all contracted out externally. [Big tech company] has this business deal with [annotation company]. All of [big tech company]’s tools are proprietary and internal and we have a specific interface where the moderators quickly review things. All of [the annotation company]’s workers are out in the Philippines. In the US, the majority of them are in Austin, but there is a total disconnect between the actual [big tech company] engineers and the contingent workers.”
Challenges still existed even when direct communication between practitioners and annotators was possible. P5 discussed the difficulties in building trust and gathering information from annotators. In one project using MTurk, the annotators were uncomfortable sharing personal information with P5’s project team. The annotators wanted to understand why their information was being collected and how it would be used, but despite the explanation from P5, the annotators’ distrust of MTurk (and consequently P5’s team) persisted. Because intermediaries facilitated the interactions between practitioners and annotators, a three-way trust needed to be established, but the practitioners had limited power to rebuild trust between annotators and the platform. Therefore, factors such as establishing trust with annotators and determining appropriate pay came before considerations of diversity among annotators.
Geographical distance, time differences, and heavily facilitated communication reinforced and amplified the separation between practitioners and annotators. In order to avoid significant delays in communication, practitioners often had to resolve ‘inconsistencies’ in data labels on their own. While this was acknowledged as poor practice, it was largely dictated by tight turnaround times and business pressures. The lack of information and communication channels further exacerbated the separation between practitioners and their data annotators. Without knowing their annotators or having the opportunity to meet them, practitioners were more likely to consider them as interchangeable workers carrying out standardised tasks. As a result, only a few practitioners conceptualised the impact of annotator identities on their data and the importance of diversity within these identities.
4.3.3 Competing priorities in machine learning development.
The prevailing practices driving ML workflows presented challenges to conceptualising and operationalising annotator diversity. Several practitioners noted how, under the pressure of short-term development timelines, they had to prioritise curating larger datasets and building better-performing models over bringing in a diverse group of annotators. Many participants worked in emerging and niche application areas, with a focus on exploring the limits and capabilities of ML models by testing new ideas and concepts. They had to prioritise ‘hitting the ground running’ and reaching an ‘MVP’ (minimum viable product) before considering the specifics of their annotator pool. P11 articulated a sentiment echoed by many participants who struggled to find evidence to justify annotator diversity:
“Even if [annotator diversity] could matter, it depends on where the project is and the priorities—you want [the product] to work well for everyone but at an early stage you’re trying to just make the product work in general. [...] It takes additional resources to address smaller user groups. There’s a persistent, open question on whether it was even a priority to get models working for, let’s say older people or for speakers of dialect where there’s not that many users because that ends up being harder to justify. It is always about trying to satisfy the needs of the largest groups of people.”
For ML practitioners, data annotation was a part of the ML pipeline that had to integrate seamlessly with the other components of the workflow, such as model building. P12 highlighted how data collection and annotation pipelines were often configured to support model building rather than advancing task understanding. The complexity of incorporating diverse annotator subjectivities stood in conflict with machine learning pipelines designed to produce definitive answers.
Both the platform managers and practitioners actively worked towards reducing the cost of annotation in setting up ML data production. As previously documented, annotation companies often recruit their data workers from countries in the Global South (such as India or the Philippines) where labour costs are considerably lower, while their engineering offices are in the Global North, in order to remain cost-effective (similar to [61, 64, 95]). We observed similar patterns in our study
(e.g., P14). According to P16 (a platform account manager with a mid-size annotation company), annotation companies operated in a highly competitive environment where they typically hired workers from lower-wage countries to offer pricing lower than competitors. Practitioners also focused on minimising annotation costs and thus had limited control over which annotators were assigned to their projects. However, P6 noted the connection between appropriate incentives and ‘data quality’, positioning fair compensation as a means to high-quality data rather than an end in itself.
Practitioners discussed the complexities of setting up an annotation pipeline, including creating instruction documents, choosing a platform, and refining the process through multiple iterations. They also noted that setting up annotation tasks was effort-intensive, taking away valuable time and resources that could be spent on model building. In addition to ‘competing’ with model building for time and resources, data annotation procurement accounted for a significant portion of the operational costs of ML projects, making it difficult to justify repeating the process for any diversity-related concerns that surfaced post-hoc. Annotator diversity, which had yet to be proven to add value to model building, was often overlooked in favour of more pressing priorities, such as building a better-performing model quickly and cost-effectively.