Next, we present how the articles collected from the four communities detect bias in a system using auditing and discrimination discovery approaches.
4.1 Auditing Approaches for Bias Detection
The most common auditing approach used for bias detection involves humans (external testers, researchers, journalists or the end users) acting as the auditors of the system. In IR systems, researchers usually perform an audit by submitting queries to search engine(s) and analyzing the results. For instance, Vincent et al. [
204] performed an audit on Google result pages, where six types of important queries (e.g., trending, expensive advertising) were analyzed. The goal was to examine the importance of
user-generated content (UGC) on the Web, in terms of the quality of information that search engines provide to users (i.e., whether such content was favored or penalized). Similarly, Kay et al. [
105], Magno et al. [
132], and Otterbacher et al. [
147] submit queries to image search engines to study the perpetuation of gender stereotypes, while Metaxa et al. [
139] consider the impact of gender and racial representation in image search results for common occupations. They compare gender and racial composition of occupations to that reflected in image search and find evidence of deviations on both dimensions. They also compare the gender representation data with that collected earlier by Kay et al. [
105], finding little improvement over time.
Another example of bias detection in a search engine via auditing is the work of Kliman-Silver et al. [
112] who examine the influence of geolocation on Web search (Google) personalization. They collect and analyze Google results for 240 queries over 30 days from 59 different GPS coordinates, looking for systematic differences. In addition, Robertson et al. [
169] audited Google
search engine result pages (SERPs) collected by study participants for evidence of filter bubble effects. Participants in the study completed a questionnaire on their political leaning and used a browser extension allowing the researchers to collect their SERPs.
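To make this kind of location-based audit concrete, the following minimal sketch compares the result lists collected for the same query at two locations using a simple Jaccard overlap; the example lists and the choice of overlap measure are illustrative assumptions, not the exact analysis performed in the cited studies.

```python
# Illustrative sketch: comparing result lists collected for the same query
# from different locations. The example data and the Jaccard measure are
# assumptions for demonstration, not the published methodology.

def jaccard(results_a, results_b):
    """Overlap between two sets of result URLs (1.0 = identical sets)."""
    a, b = set(results_a), set(results_b)
    return len(a & b) / len(a | b) if a | b else 1.0

# Hypothetical top-5 URLs returned for one query at two GPS coordinates.
serp_location_1 = ["url1", "url2", "url3", "url4", "url5"]
serp_location_2 = ["url1", "url3", "url6", "url7", "url8"]

print(f"Jaccard overlap: {jaccard(serp_location_1, serp_location_2):.2f}")
# Systematically low overlap across many queries and locations would point
# to location-based personalization of the results.
```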
Kulshrestha et al. [
119] propose an auditing technique where queries are submitted on Twitter, to measure bias in Twitter search results as compared to search engines. The proposed technique considers both input and output bias. Input bias captures what a user would see if shown a random set of items relevant to her query, while output bias refers to the ranked results; contrasting the two isolates the bias introduced by the ranking mechanism. In addition, Johnson et al. [
101] investigate demographic bias in Twitter results, using the retrieval of geotagged content through the Twitter API as the auditing technique. Another example where researchers are the auditors is the study of Edelman et al. [
57], where the authors run an experiment to audit Airbnb applications for racial bias in ranked results, more specifically against African American names.
Another cluster of user-based studies in IR systems concerned the detection of perceived biases about search and/or during a search for information. In these studies, users are the auditors. For instance, Kodama et al. [
113] assessed young people’s mental models of the Google search engine, through a drawing task. Many informants anthropomorphized Google, and a few focused on inferring its internal workings. The authors called for a better understanding of young people’s conceptions of search tools, to better design information literacy interventions and programs. In addition, Otterbacher et al. [
148] described a study in which participants were the auditors for detecting perceived bias. They were shown image search results for queries on personal traits (e.g., “sensitive person,” “intelligent person”) and were asked to evaluate the results on a number of aspects, including the extent to which they were “biased.”
Auditing approaches using ML have also been widely used. A situational testing auditing approach has been proposed by Luong et al. [
130], to detect discrimination against a specific group of individuals, using an ML algorithm. K-nearest neighbors was combined with the situation testing approach to identify a group of tuples with similar characteristics to a target individual. Zhang et al. [
227] proposed an improvement over the method of Luong et al. [
130], by engaging
Causal Bayesian networks (CBNs), which are probabilistic graphical models used for reasoning and inference. For the development of a CBN, the causal structure of the dataset and the causal effect of each attribute on the decision are used to guide the identification of tuples similar to a target individual. Robertson et al. [
170] present an auditing approach for opaque algorithms, called “recursive algorithm interrogation,” used for detecting bias in search engines. The auto-complete functions of Google and Bing are treated as opaque algorithms: the authors recursively submitted queries, and their resulting child queries, to create a network of each algorithm's suggestions.
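As a rough illustration of the situation-testing idea described above, the following sketch compares the decision outcomes of a target individual's nearest neighbours in the protected and unprotected groups; the data, parameters, and threshold are assumptions for demonstration, not the exact algorithm of Luong et al.

```python
# Minimal sketch of k-NN situation testing, assuming a tabular dataset with a
# binary protected attribute, a binary decision, and numeric features.
# This illustrates the idea only, not the published algorithm.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def situation_testing(X, protected, decision, target_idx, k=5, threshold=0.3):
    """Gap in positive-decision rates between the target's k nearest
    unprotected and protected neighbours; a large gap flags possible discrimination."""
    target = X[target_idx].reshape(1, -1)
    rates = {}
    for group, mask in [("protected", protected == 1), ("unprotected", protected == 0)]:
        nn = NearestNeighbors(n_neighbors=k).fit(X[mask])
        _, idx = nn.kneighbors(target)
        rates[group] = decision[mask][idx[0]].mean()  # positive-outcome rate among neighbours
    diff = rates["unprotected"] - rates["protected"]
    return diff, diff >= threshold

# Toy data: 2 numeric features, protected attribute, decision (1 = favourable outcome).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
protected = rng.integers(0, 2, 200)
decision = rng.integers(0, 2, 200)
print(situation_testing(X, protected, decision, target_idx=0))
```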
Hu et al. [
92] audited Google SERPs snippets, for evidence of partisanship where the generation of snippets is an opaque process. Moreover, Le et al. [
122] audit Google News Search for evidence of reinforcing a user’s presumed partisanship. Using a sock-puppet technique, a browser first visited a political Web page and then conducted a Google News search. The results of the audit suggested significant reinforcement of inferred partisanship via personalization. In addition, Eslami et al. [
62] use a cross-platform audit technique that analyzed online ratings of hotels across three platforms, to understand how users perceived and managed biases in reviews.
In the HCI literature, auditing often involves characterizing the behavior of the algorithm from a user perspective. For instance, in Matsangidou and Otterbacher [
135], the authors consider the inferences on physical attractiveness made by image tagging algorithms on images of people. They audited the output of four image recognition APIs on standardized portraits of people across genders and races. In a more recent work [
12], the authors use auditing to understand machine behaviors in proprietary image tagging algorithms. The authors created a highly controlled dataset of images of people, superimposed on gender-stereotyped backgrounds. Evaluating five proprietary algorithms, they find that in three of them, gender inference is hindered when a background is introduced. Of the two that “see” both backgrounds and gender, the one whose output is most consistent with human stereotyping processes is superior at recognizing gender. Another example is the work of Eslami et al. [
63], where the authors describe a qualitative study of online discussions about the existence and opacity of Yelp's review filtering algorithm. The authors further enhanced the results by conducting 15 interviews with Yelp users who acted as auditors of the system, in an attempt to understand how the review filtering algorithm works.
Auditing approaches have also been used to detect bias in ML classification systems. For instance, in Reference [
24], the authors (developers) audit three automated facial analysis algorithms to detect gender inequalities in the classification results. They found that males were classified more accurately than females in all three algorithms and that all the algorithms performed worst on darker-skinned female subjects.
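A simple way to carry out this kind of classification audit is to compute accuracy separately for each intersectional subgroup, as in the sketch below; the data frame is hypothetical, whereas the actual audit used benchmark face datasets.

```python
# Illustrative audit of a classifier's output by intersectional subgroup
# (hypothetical data, not the original benchmark).
import pandas as pd

results = pd.DataFrame({
    "gender":    ["female", "female", "male", "male", "female", "male"],
    "skin_type": ["darker", "lighter", "darker", "lighter", "darker", "darker"],
    "correct":   [0, 1, 1, 1, 0, 1],   # 1 = classifier's label was correct
})

# Accuracy per (gender, skin type) subgroup; large gaps indicate unequal error rates.
print(results.groupby(["gender", "skin_type"])["correct"].mean())
```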
Recently, automated methods for auditing have been introduced to detect gender or race bias in the context of online housing advertisements and search engine rankings. Asplund et al. [
6] propose the use of controlled “sock-puppet” auditing techniques, i.e., automated systems that mimic user behavior, playing the role of the human testers used in offline audits. They use these techniques to investigate gender-based and race-based discrimination in the context of online housing advertisements, as well as bias in search-result ranking. The authors use the definition of disparate impact to decide whether each of the two application systems is fair.
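Disparate impact is commonly operationalized as the ratio of favourable-outcome rates between the protected and unprotected groups, with 0.8 as the conventional “four-fifths” threshold. The following minimal sketch, with assumed data, illustrates the check without reproducing the cited audit.

```python
# Minimal sketch of the disparate-impact ratio; the data are assumptions.
import numpy as np

outcome = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])    # 1 = favourable (e.g., ad shown, high rank)
protected = np.array([1, 1, 1, 0, 0, 0, 1, 1, 0, 0])  # 1 = protected group member

rate_protected = outcome[protected == 1].mean()
rate_unprotected = outcome[protected == 0].mean()
di_ratio = rate_protected / rate_unprotected

# By the common "four-fifths" rule, a ratio below 0.8 signals disparate impact.
print(f"Disparate impact ratio: {di_ratio:.2f}, fair: {di_ratio >= 0.8}")
```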
4.2 Discrimination Discovery
A common approach for discrimination discovery is to compute metrics that detect any direct or indirect discrimination against specific groups in the data. Examples include absolute measures, conditional measures, or statistical tests [
233]. Absolute measures define the magnitude of discrimination over a dataset by taking into account the protected characteristics and the predicted outcome. Statistical tests, rather than measuring the magnitude of discrimination, indicate its presence/absence at a dataset level. Conditional measures compute the magnitude of discrimination that cannot be explained by any non-protected characteristics of individuals. Fairness notions have also been used in many works as metrics for discrimination.
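For instance, an absolute measure such as the risk difference and a statistical test of independence can both be computed directly from a decision table. The following sketch uses an assumed contingency table and is only a minimal illustration of these two families of measures.

```python
# Sketch of an absolute measure (risk difference) and a statistical test
# (chi-square) for discrimination discovery; the table values are illustrative.
import numpy as np
from scipy.stats import chi2_contingency

# Contingency table: rows = protected / unprotected group, columns = denied / granted.
table = np.array([[35, 15],    # protected group
                  [20, 30]])   # unprotected group

# Absolute measure: risk difference = P(denied | protected) - P(denied | unprotected).
risk_diff = table[0, 0] / table[0].sum() - table[1, 0] / table[1].sum()

# Statistical test: is the decision independent of group membership?
chi2, p_value, _, _ = chi2_contingency(table)

print(f"risk difference = {risk_diff:.2f}, chi-square p-value = {p_value:.4f}")
```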
In Bellogin et al. [
15], the authors detect statistical biases in the evaluation metrics used in recommender systems that affect the assessment of recommendation effectiveness. They found that the evaluation metrics suffer from sparsity and popularity biases. Many works focus on investigating racial bias in advertising recommendation systems. For instance, Sweeney [
192] investigates racial bias in the advertisements recommended by an ad server when particular names are searched in the Google and Reuters search engines. She finds that ads for services providing criminal-record checks were significantly more likely to be served when the searched name was a typically Black first name. Ali et al. [
2], Speicher et al. [
188] and Imana et al. [
94] detected significantly skewed ad delivery on racial lines in Facebook ads for employment, financial services and housing. More specifically, in Reference [
94], the authors first build an audience that allows them to infer the gender of the ad recipients on platforms that do not provide ad delivery statistics along gender lines, i.e., Facebook and LinkedIn. They use this audience to distinguish skew in ad delivery due to protected categories from skew due to differences in qualifications among people in the targeted audience. Indirectly, they measure the “equality of opportunity” fairness notion. Another example of bias detection in RecSys and online social networks is the work of Chakraborty et al. [
33], who detect demographic bias in the crowds of Twitter users whose posts determine which topics are recommended as trending. The bias is detected by comparing the characteristics of the trend promoters with the demographics of the general Twitter population. Apart from demographic bias, political bias is very common in social networks. In Jiang et al. [
98], the authors measure fairness in social media contexts based on the fairness notions of demographic parity and equalized odds (a minimal sketch of both notions is given below), and use them to detect political bias in content moderation. Bias in the social platform Facebook has also been assessed through reverse engineering of the Facebook API ranking algorithm using logistic regression in Reference [
89]. More specifically, the authors identify the features of a post that would affect its odds of being selected. Sentiment analysis reveals that there are significant differences in the sentiment word usage between the selected and non-selected posts.
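Both fairness notions mentioned above can be checked directly from a model's predictions. The sketch below, with assumed labels, predictions, and group memberships, computes the demographic-parity and equalized-odds gaps; it is a generic illustration rather than the measurement pipeline of the cited work.

```python
# Sketch of demographic parity and equalized odds gaps computed from model
# predictions; labels, predictions, and group membership are assumed data.
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0, 0, 0])
group  = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])   # 0/1 = two demographic groups

def rate(mask):
    """Positive-prediction rate within the rows selected by mask."""
    return y_pred[mask].mean() if mask.any() else float("nan")

# Demographic parity: P(pred = 1) should be similar across groups.
dp_gap = abs(rate(group == 0) - rate(group == 1))

# Equalized odds: true-positive and false-positive rates should match across groups.
tpr = lambda g: rate((group == g) & (y_true == 1))
fpr = lambda g: rate((group == g) & (y_true == 0))
eo_gap = max(abs(tpr(0) - tpr(1)), abs(fpr(0) - fpr(1)))

print(f"demographic parity gap = {dp_gap:.2f}, equalized odds gap = {eo_gap:.2f}")
```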
In IR systems, discrimination discovery is primarily used in user-focused studies. Weber and Castillo [
208] conducted a study of user search habits, which involved a large-scale analysis of Web logs from Yahoo!. Using the logs, as well as users’ profile information and US-census information (e.g., average income within a given zip code), the authors were able to characterize the typical behaviors of various segments of the population and detect any discrimination related to the users’ sensitive demographic attributes. In a similar manner, Yom-Tov [
221] used search query logs to characterize the differences in the way that users of different ages, genders, and income brackets, formulate health-related queries. His driving concern was the ability to discover users with similar profiles, according to their demographic information (user cohorts), who are looking for the same information, e.g., a health condition. Pal et al. [
150] considered the identification of experts in the context of a question-answering community. Their analysis revealed that as compared to other users with less expertise, experts exhibited significant selection biases in their engagement with content. They proposed to exploit this bias in a probabilistic model, to identify both current and potential experts. A method to identify selection bias, IMITATE, has also been proposed in Dost et al. [
56]. IMITATE investigates the dataset’s probability density, then adds generated points to smooth out the density and have it resemble a Gaussian, the most common density occurring in real-world applications.
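A loose sketch of that idea follows: fit a Gaussian to an observed feature, compare the fit against the empirical histogram, and flag regions whose observed mass falls well below the fitted density as possibly under-sampled. The data, bin counts, and flagging threshold are assumptions, and the sketch is not the published IMITATE algorithm.

```python
# Loose sketch of density-based selection-bias detection: fit a Gaussian to a
# feature and look for regions under-represented relative to that fit.
# Illustrative only; not the published IMITATE algorithm.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
# Assumed biased sample: values above 1.0 were (artificially) under-collected.
data = rng.normal(0, 1, 5000)
data = data[(data < 1.0) | (rng.random(5000) < 0.3)]

mu, sigma = norm.fit(data)                        # fitted Gaussian
hist, edges = np.histogram(data, bins=20, density=True)
centers = (edges[:-1] + edges[1:]) / 2
expected = norm.pdf(centers, mu, sigma)

# Bins with far less mass than the Gaussian fit hint at under-sampled regions.
under = centers[hist < 0.5 * expected]
print("possibly under-represented regions around:", np.round(under, 2))
```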
In a study of information exposure on the Mendeley platform for sharing academic research, Thelwall and Maflahi [
196] illustrated a
home-country bias. Articles were significantly more likely to be read by users in the home country of the authors, as compared to users located in other countries. Chen et al. [
35] investigated direct and indirect (implicit) gender-based discrimination by resume search engines toward their users. Direct discrimination happens when the system explicitly uses the inferred gender or other sensitive attributes to rank candidates, while indirect discrimination happens when the system unintentionally discriminates against users (indirectly via sensitive attributes). The results suggested that the system under review indirectly discriminates against females; however, it does not explicitly use gender as a parameter. Another method for detecting bias in search engine results involves the use of metrics that quantify bias in search engines [
142]. A series of articles by Wilkie et al. [
211,
212,
213] and a paper of Bashir and Rauber [
13] investigate the identification of retrieval bias in IR systems. Bashir and Rauber study the relationship between query characteristics and document retrievability using the TREC Chemical Retrieval track. In Wilkie and Azzopardi [
212], they examined the issue of fairness vs. performance. Wilkie and Azzopardi [
211] consider specific measures of retrieval bias and the correlation to the system performance. Wilkie and Azzopardi [
213] consider the issue of bias resulting from the process of pooling in the creation of test sets. A recent study [
178] detects gender and race bias in the annotation process of the training data of image databases used for facial analysis. The authors found that image databases rarely contain underlying source material explaining how those identities are defined. Further, when the databases are annotated with race and gender information, their authors rarely describe the annotation process. Instead, classifications of race and gender are portrayed as insignificant, indisputable, and apolitical.
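The retrieval-bias analyses discussed above (Bashir and Rauber; Wilkie and Azzopardi) rest on the notion of document retrievability: how often a document surfaces in the top results over a large query set, with inequality across the collection summarized by a measure such as the Gini coefficient. The sketch below, with assumed ranked lists and a rank cutoff, illustrates this style of measure rather than any specific published formulation.

```python
# Minimal sketch of a retrievability-style bias measure: count how often each
# document appears in the top-c results over a query set, then summarize the
# inequality of those counts with a Gini coefficient. Rankings are assumed.
import numpy as np

def gini(x):
    """Gini coefficient of a list of non-negative scores (0 = equal, 1 = maximally unequal)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    cum = np.cumsum(x)
    return (n + 1 - 2 * (cum / cum[-1]).sum()) / n if cum[-1] > 0 else 0.0

# Hypothetical rankings: query -> ordered document ids.
rankings = {"q1": ["d1", "d2", "d3"], "q2": ["d1", "d3", "d4"], "q3": ["d1", "d2", "d5"]}
docs = ["d1", "d2", "d3", "d4", "d5", "d6"]
c = 2  # rank cutoff

retrievability = {d: sum(d in ranked[:c] for ranked in rankings.values()) for d in docs}
print(retrievability)
print(f"Gini over retrievability scores: {gini(list(retrievability.values())):.2f}")
# A high Gini coefficient means a few documents dominate the top ranks,
# i.e., the system exhibits strong retrieval bias.
```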
A set of works in HCI analyzes crowdsourced data from OpenStreetMap to detect potential biases such as gender and geographic information bias [
48,
161]. In a similar vein, two other studies use crowdsourcing to detect bias in human versus algorithmic decision-making [
11,
78]. Green and Chen [
78] run a crowdsourcing study to examine the influence of algorithmic risk assessments on human decision-making, while Barlas et al. [
11] compared human- and algorithm-generated descriptions of images of people in a crowdsourcing study, in an attempt to identify what is perceived as fair when describing the depicted person. Crowdsourcing studies for detecting bias have also been used in IR systems [
59,
127].
Many works study the problem of bias detection in textual data using data mining methods that concern specific protected groups. The typical approach is to extract association or classification rules from the data and to evaluate these rules with respect to the discrimination of protected groups [
157,
171]. For instance, Datta et al. [
49] analyze gender discrimination in online advertising (Google ads) using ML techniques to identify gender-based ad serving patterns. Specifically, they train a classifier to learn differences in the served ads and to predict the corresponding gender. Similarly, Leavy et al. [
123] detect gender bias in training textual data by identifying linguistic features that are gender-discriminative, according to gender theory and sociolinguistics. Zhao et al. [
229] detect gender bias in coreference resolution systems. They introduce a new benchmark dataset, WinoBias, which focuses on gender bias. They also use a data augmentation approach that, in combination with existing word-embedding debiasing techniques, removes the gender bias demonstrated in the data. Madaan et al. [
131] detect gender discrimination in movies using knowledge graphs and word embeddings after analyzing the data (considering, for example, mentions of each gender, emotions of the actors, occupation of each gender, and screen time in the movies). In a similar vein, Ferrer et al. [
66] propose a data-driven approach to discover and categorize language bias encoded in the vocabulary of online communities on the Reddit platform. They use word embeddings to discover the words most biased toward protected attributes, apply k-means clustering combined with a semantic analysis system to label the clusters, and use sentiment analysis to further characterize biased words. Rekabsaz and his colleagues [
164] also explore the detection of societal bias in word-embedding models by utilizing first-order co-occurrence relations between words and representative concept words. Islam et al. [
95] introduce a collaborative filtering method to detect gender bias in social media. Their proposed method is called
Neural Fair Collaborative Filtering (NFCF). They also use de-biased embeddings and fairness interventions via a penalty term.
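As a rough illustration of this family of embedding-based bias detection methods, one can project word vectors onto a “gender direction” and rank words by that projection. The tiny hand-made vectors below are assumptions; real studies use embeddings trained on a target corpus, and the cited approaches are more elaborate than this sketch.

```python
# Rough illustration of embedding-based bias detection: rank words by the
# projection of their vectors onto a "gender direction". The hand-made
# vectors are assumptions; real analyses use trained embeddings.
import numpy as np

embeddings = {
    "he":       np.array([ 1.0, 0.1, 0.0]),
    "she":      np.array([-1.0, 0.1, 0.0]),
    "nurse":    np.array([-0.7, 0.5, 0.2]),
    "engineer": np.array([ 0.6, 0.4, 0.3]),
    "teacher":  np.array([-0.2, 0.6, 0.1]),
}

def unit(v):
    return v / np.linalg.norm(v)

gender_direction = unit(embeddings["he"] - embeddings["she"])

# Cosine of each word with the gender direction; large |value| = more gendered.
bias = {w: float(unit(v) @ gender_direction)
        for w, v in embeddings.items() if w not in ("he", "she")}
for word, score in sorted(bias.items(), key=lambda kv: kv[1]):
    print(f"{word:10s} {score:+.2f}")   # negative = closer to "she", positive = closer to "he"
```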
A cluster of works in IR addresses the detection of biases such as age-based bias, text-frequency bias, and stylistic biases in sentiment classification [
51,
163,
184]. Other examples of detecting bias in classifiers that use sentiment analysis are the existence of offensive language or stereotyping of sensitive attributes in automated hate speech detection algorithms [
7,
50] and the detection of cultural biases at Wikipedia pages using sentiment analysis [
27]. Shandilya et al. [
181] also detect the under-representation of sensitive attributes in the summaries produced by summarization algorithms. Keyes [
107] identified the problem of automatic gender recognition in HCI research and how the approaches followed until recently are discriminatory toward transgender people. For systems to be fair, Keyes [
107] proposed alternative methods and the development of more inclusive approaches in the gender inference process and evaluation. Apart from automatic gender recognition, an additional significant advancement in the field of HCI is that of data-driven personas. Salminen et al. [
175] investigated the presence of demographic bias in automatically generated, data-driven personas. They discovered that the more personas they generated, the more diverse the sample became in terms of gender and age representation. Practitioners who use data-driven personas should consider the possibility of unintentional bias in the data they use, which is consequently transferred to the personas they generate.
Multiple approaches have been proposed in ML that detect discrimination in the data or classifier. Choi et al. [
40] discover and mine discrimination patterns that reveal whether an individual is classified differently when some sensitive attributes are observed. The algorithm detects discrimination patterns in a Naive Bayes classifier using branch-and-bound search and removes them, learning maximum-likelihood parameters subject to the elimination of these patterns. Pedreschi et al. [
157] use an opaque predictive model to extract frequent classification rules based on an inductive approach. Background knowledge is used to identify the groups to be flagged as potentially discriminated against. In contrast, Zhang et al. [
226] use a causal Bayesian network and a structure learning algorithm to identify the causal factors of discrimination. The direct causal effect of the protected variable on the dependent variable represents the sensitivity of the dependent variable to changes in the discrimination grounds while all other variables are held fixed. They also detect discrimination in the prediction/classification outcome by computing the classification error rate (error bias). In a more recent work, Zucker et al. [
234] introduce Arbiter, a new domain-specific programming language for ML practitioners, which allows users to make guarantees about the level of bias in the models they produce.
The notion of divergence [
153], which estimates the difference in classification performance measures, has also been proposed as a metric to identify data subgroups in which a classifier performs differently. Pastor et al. [
154] introduce DivExplorer, an interactive visual analytics tool that identifies algorithmic bias using the divergence notion. An interactive system to detect fairness issues in classifiers has also been proposed in Reference [
125]. The system is called DENOUNCER and it allows users to explore fairness issues for a given test dataset, considering different fairness notions. In addition, Nargesian et al. [
143] detect the groups in the dataset that are unfairly treated by the classifier by developing an exploration-exploitation-based strategy. Their approach captures the cost and approximations of group distributions in the given dataset.
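To illustrate the divergence notion, the sketch below compares a subgroup's false-positive rate against the overall false-positive rate; the data frame is assumed, and tools such as DivExplorer automate this kind of comparison over many subgroups rather than the single attribute shown here.

```python
# Sketch of the divergence notion: difference between a subgroup's error rate
# and the overall error rate. The predictions, labels, and subgroup are assumed.
import pandas as pd

df = pd.DataFrame({
    "gender": ["f", "f", "m", "m", "f", "m", "f", "m"],
    "y_true": [0, 0, 0, 1, 1, 0, 0, 1],
    "y_pred": [1, 0, 0, 1, 1, 1, 1, 1],
})

def fpr(frame):
    """False-positive rate within a (sub)group of rows."""
    neg = frame[frame.y_true == 0]
    return (neg.y_pred == 1).mean() if len(neg) else float("nan")

overall_fpr = fpr(df)
for value, subgroup in df.groupby("gender"):
    divergence = fpr(subgroup) - overall_fpr
    print(f"gender={value}: FPR divergence = {divergence:+.2f}")
```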
In IR systems, a common type of bias is the cognitive or perception bias that arises from the manner in which information is presented to users, in combination with the user’s own cognition and/or perception. For example, Jansen and Resnick [
96] analyzed the behaviors of 56 participants engaged in e-commerce search tasks, with the goal of understanding users’ perceptions of sponsored versus un-sponsored (organic) Web links. The links suggested by the search engine were manipulated to control content and quality. Even controlling for these factors, it was shown that users have a strong preference for organic Web links. In a similar vein, Bar-Ilan et al. [
10] conducted a user experiment to examine the effect of position in a search engine results page (SERP). Across a variety of queries and synthetic orderings of the results, they demonstrated a strong placement bias; a result’s placement, along with a small effect of its source, is the main determinant of perceived quality. User perception is also examined in a study [
139] where the authors consider people’s impressions of occupations and sense of belonging in a given field when shown search results with different proportions of women and people of color. They find that both axes of representation as well as people’s own racial and gender identities impact their experience of image search results. Gezici et al. [
73] propose a new evaluation framework to measure bias in the content of SERPs (on political and controversial search topics) by measuring stance and ideological bias. They propose three novel fairness-aware measures of bias based on common IR utility-based evaluation measures.
Ryen White, of Microsoft Research, has published extensively on detecting users’ perception bias during and after a search, particularly when trying to find information to answer health-related queries. In an initial work [
209], a user study focused on finding yes-no answers to medical questions showed that pre-search beliefs influence users’ search behaviors. For instance, those with strong pre-search beliefs are less likely to explore the results page, thus reinforcing the above-mentioned position bias. A follow-up study by White and Horvitz [
210] looked more specifically at users’ beliefs about the efficacy of medical treatments, and how these beliefs could be influenced by a Web search. An example of detecting user perception bias in recommender systems was presented in Reference [
172], where drivers’ perceptions of the Uber application were investigated, taking into consideration the drivers’ profiles and their historical performance.