We first describe background information about the participants and the manipulation check results, then we present the details of the qualitative and quantitative analyses.
We conducted a thematic analysis of the open-ended responses about product preference and an effect size analysis for each UX metric. Because the interaction effects between guideline compliance and AI performance were not significant in most studies, we report them only when they were detected. The thematic analysis revealed participants’ perceptions of the products, providing context for each guideline’s impact on the metrics and pointing to pitfalls in applying and implementing each guideline.
The manipulation check results showed that two of the 18 factorial surveys failed to manipulate the independent variable. We reflect on the issues with the vignettes of the two guidelines, and present the results for each of the 16 remaining studies.
It is important to keep in mind that the results do not support comparisons across guidelines and are discussed independently: each factorial survey used different products and features in its vignettes, so the 16 studies were independent. Future users of this protocol might choose to collect data about only the few guidelines relevant to their specific product or feature.
4.1 Participants’ Background
In total, we collected 1,300 responses from MTurk. As we will describe in Section 4.2, the studies for Guidelines 2 and 16 failed the manipulation check; the total number of participants in the successful studies was therefore 1,155. Of those, we eliminated the responses from participants who failed the attention checks, leaving 1,043 participants in the data analysis. While some of the factorial surveys ended up with fewer than 65 valid responses (the targeted sample size; see Table 1), the fact that most medium and large effects are also statistically significant (see Figures 2–20) suggests adequate statistical power. The absence of effects for some dependent variables in some of the factorial surveys is therefore unlikely to be due to low statistical power.
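This power argument can be illustrated with a rough sensitivity sketch (not part of the original analysis; the one-way between-subjects F-test framing, the α = .05 threshold, and the benchmark η² values of 0.14 and 0.01 are our assumptions). The power of a two-group comparison at the targeted sample size can be approximated from the noncentral F distribution:

```python
from scipy import stats

def anova_power(eta2, n_total, k_groups=2, alpha=0.05):
    """Approximate power of a one-way between-subjects F test."""
    f2 = eta2 / (1.0 - eta2)                    # Cohen's f^2 from eta squared
    df1 = k_groups - 1
    df2 = n_total - k_groups
    lam = f2 * n_total                          # noncentrality parameter
    crit = stats.f.ppf(1.0 - alpha, df1, df2)   # critical F at level alpha
    return stats.ncf.sf(crit, df1, df2, lam)    # P(F exceeds crit | effect)

print(anova_power(0.14, 65))  # large effect at the targeted sample size
print(anova_power(0.01, 65))  # small effect
```

Under these assumptions, a large effect (η² ≈ 0.14) at n = 65 is detected with power close to 0.9, while a small effect (η² ≈ 0.01) is often missed, which is consistent with the reasoning above: significant medium and large effects indicate adequate power, but null results for small effects remain inconclusive.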
We present the aggregated background data about the participants who passed the attention checks in Appendix A, including a breakdown of participant demographics (Tables 2 and 3) and attitudes towards AI (Tables 4 and 5) by experimental group. There were no statistically significant differences between the participants in the optimal and sub-optimal AI performance conditions for any guideline.
Among the 1,043 responses, 539 (52%) participants identified as female, 488 (47%) as male, and 13 (1.2%) as nonbinary or gender nonconforming; two preferred not to answer, and one did not respond. Participants skewed young. Their distribution across age groups was as follows: 128 (12%) 18–24 years old; 437 (42%) 25–34 years old; 266 (26%) 35–44 years old; 139 (13%) 45–54 years old; 58 (6%) 55–64 years old; and 14 (1%) 65–74 years old; one participant did not answer.
Most participants had positive prior experiences with the product type in the vignettes they were exposed to: 953 (91%) considered products in the same category useful, and 950 (91%) considered them reliable.
Most participants had some familiarity with computer science or technology through college-level coursework, a degree, and/or programming experience; of the 1,043 respondents, 357 (34%) had no such experience.
Participants’ attitudes towards AI tended to be positive. Most participants stated they would support the development of AI (852, 82%), 86 (8%) would oppose it, 101 (9.7%) were neutral, and four did not know or did not answer. We also asked participants to indicate their feelings toward progress in AI. The most common feelings were curiosity (728, 70%), excitement (528, 51%), and optimism (505, 48%). Negative-leaning feelings such as concern (318, 30%), apprehension (255, 24%), and unease (177, 17%) were less widespread. When asked whether society will become better or worse because of increased automation and AI, most participants (736, 71%) indicated better, 166 (16%) thought it would become worse, and 140 (13%) thought it would not change. Overall, our respondents’ attitudes towards AI were more favorable than those of the American public [119].
Most of the respondents (1,005, 96%) found the vignettes easy to understand, and 24 (2%) were neutral. Most participants (971, 93%) perceived the product scenarios in the vignettes as having medium-stakes impact, confirming that their perceptions aligned with our intent to study medium-stakes products.
4.3 Factorial Survey Results
We now present the results of the 16 factorial surveys that successfully manipulated the independent variable. For each study, we include information pertaining to each research question: the number of participants who preferred [Product A] and [Product V], along with the reasons participants provided for their preferences, which offer insights into their perceptions. For each study, we also present the effect sizes (measured in generalized eta squared, noted as \(\eta _G^2\)) on the dependent variables and, where applicable, interaction effects. Although \(\eta _G^2\) values range from 0 to 1, to convey the directionality of the effects we use the additive inverse of \(\eta _G^2\) to indicate when [Product V] received more positive ratings on the dependent variables. The factorial surveys of most guidelines did not show statistically significant interaction effects between guideline compliance and AI performance on UX metrics (see Table 6 in Appendix A), except for Guidelines 11, 13, and 15 (see Figures 12, 15, and 18). Each of these guidelines demonstrated interaction effects on several UX metrics, but only the interaction effect for Guideline 13 had substantiated effect sizes. This suggests that AI performance did not significantly influence the impact of guideline compliance on user perception and the different aspects of UX we tested. Therefore, we combine the results for both levels of AI performance when reporting the results for each guideline.
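For readers unfamiliar with the metric, the following standard definition from the statistics literature (not reproduced from this paper’s materials) may help. In a fully between-subjects design in which all factors are manipulated, generalized eta squared coincides with partial eta squared:

```latex
\[
\eta_G^2 \;=\; \frac{SS_{\text{effect}}}{SS_{\text{effect}} + SS_{\text{error}}}
\]
```

Conventional benchmarks for \(\eta _G^2\) are roughly 0.01 (small), 0.06 (medium), and 0.14 (large).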
When discussing the qualitative results, we focus on dominant themes to keep the summaries concise. For example, if only one or two participants preferred [Product V], their comments are not summarized unless they provide insights into nuances of applying a guideline.
G1: Make Clear What the System Can Do.
The vignettes described a coaching feature in a presentation application. [Product A] informed users of the presentation coach’s capabilities with a detailed list of features, while [Product V] used a generic statement: “We will help you improve your presentation style.”
We saw a strong preference for [Product A] (55, 86%). The qualitative results show that 19 participants preferred [Product A] because of the feature’s specific description: “Both ‘coaches’ aim to tweak presentations, but [Product A] explicitly states how it functions. Based off the narrative, I don’t know much about how [Product V] specifically aims to improve presentation skills.” These participants’ comments are consistent with the substantiated effects on feeling less uncertain and more secure. Another 12 participants expressed their understanding that [Product A] would provide more detailed feedback than [Product V]: “Again, it seems more useful. I don’t need general feedback, I need specific information in order to improve and [Product A] has that.” Even though the vignette was written carefully so as not to overstate what the presentation coach can do, some respondents interpreted [Product A] as being able to do more, which might explain the substantiated effect on perceived performance ( \(\eta _G^2=0.02\) ). The rest of the participants had various reasons for preferring [Product A], such as finding it useful or innovative (6), liking what it did (4), wanting to improve their presentations (3), and other (7). Among the participants who preferred [Product V], three did not feel comfortable being recorded by the product, and two wrote statements favorable to [Product A], suggesting they might have selected [Product V] by mistake.
While the qualitative analysis focused on the dominant themes in user perceptions, several comments do indicate impacts specific to the coaching feature. The three participants who did not feel comfortable being recorded might prefer [Product A] in a different product where no recording is needed. Meanwhile, those who preferred [Product A] expressed a preference for specificity in the feature description, which is consistent with accepted best practices in writing for user interfaces [72]. In addition, previous work on rehearsal support and feedback systems [100, 109, 111] primarily focused on quality improvement when evaluating the efficacy of the systems. This study complements those evaluation strategies and reveals additional user concerns about being recorded. A unique finding and important consideration when applying Guideline 1 is the need to balance a clear, specific description of the AI system against overstating its capabilities.
G3: Time Services Based on Context.
The vignettes described an email application that either stopped notifications when it sensed that the user was busy [Product A] or popped up notifications as messages came in [Product V].
[Product A] was preferred by 55 (81%) participants. Those who preferred [Product A] appreciated that it protected the user’s focus (25) and that it knew when it was appropriate to show notifications (24): “I like the fact that [Product A] can predict when I am too busy to deal with notifications popping up on my screen and disturbing my concentration”; “[Product A] seems to have a better understanding of when notifications are appropriate and inappropriate.” Consistent with the qualitative results, the quantitative results showed a medium effect on feeling more productive ( \(\eta _G^2=0.15\) ). Interestingly, even though it was the product that controlled notifications, [Product A] also made participants feel more in control ( \(\eta _G^2=0.16\) ). However, the desire for control was one of the reasons five participants gave for preferring [Product V]. Another five participants preferred [Product V] because they wanted to see notifications as they came in, and three respondents had privacy concerns: “[Product A] is tracking me somehow because it holds the notifications, so I prefer the less invasive [Product V].”
The results reflect an interesting tension about control: some participants felt more in control when the AI protected their focus; others felt more in control when the product showed notifications as new messages arrived. The findings suggest that when AI systems make decisions for users, these decisions can increase convenience but may also decrease the perception of control. This tension can arise in any product that applies G3, not only e-mail applications. Prior research on notifications has also pointed out challenges such as user stress and feelings of hindrance [59] and the costs of interruptions [44, 45]. This study extends previous findings by identifying factors for successfully timing services based on context, such as providing users with sufficient control, making clear what information the app tracks, and addressing privacy concerns.
G4: Show Contextually Relevant Information.
The vignettes described a document editing application with a feature that provided definitions of acronyms. [Product A] showed definitions of acronyms specific to the user’s workplace and relevant to the user’s document. [Product V] always used a standard list of definitions from a popular dictionary.
Respondents were split in their preferences, with 26 (39%) preferring [Product V]. This is aligned with the lack of substantiated effects in the quantitative results. Those who preferred [Product A] (40) mentioned reasons such as it being more tailored to their work. However, the participants who preferred [Product V] raised various concerns about [Product A]. Some (11) perceived [Product A] as too limiting: “I would prefer to see all of the possible definitions as opposed to having the software narrow the options for me.” Six participants raised concerns about trust or possible errors: “I would rather use [Product V] because it gives me the choice of choosing which definition I would want to go by. [Product A] would be easier, but if [Product A] were to make a mistake on me, I would have a hard time trusting it because I did not make any part of the decision.” Four participants were concerned about privacy: “I would prefer to use [Product A] because [Product V] feels a bit more intrusive. I would be nervous that it is pulling data from things like my other software and my browsing history. This would be unacceptable since I work with sensitive PII [personally identifiable information].”
The choice of the acronym feature revealed a common dilemma in personalization: how to customize content without creating a filter bubble [87]. The results also suggest that users of personalized features are highly aware of these issues and desire more control over their information exposure, which aligns with the findings of more extensive studies of deployed recommendation systems [41, 77]. It is also possible that the findings for this study were overly influenced by the phrasing of the vignette, which led to [Product A] being perceived as more limiting than we had intended. However, some of the results point to a well-known tension between privacy and personalization [60].
G5: Match Relevant Social Norms.
The vignettes described a document editing application. [Product A] introduced suggestions for improving writing style with the statement “Consider using...,” while [Product V] used a different tone: “You made a mistake. Replace with...”.
Most respondents (51, 74%) preferred [Product A]. Participants perceived the difference in tone as a matter of politeness, and 49 referred to it in their open-ended comments: “I consider the ‘You made a mistake.’ rather obnoxious. If you are saying something like that, you need to be absolutely perfect, no ‘sometimes made mistakes’ allowed.” However, 13 of the 18 participants who preferred [Product V] actually liked its tone: “I like the way it is blunt.” They interpreted it as a sign of confidence: “[Product V] sounds more confident, which makes me trust it more than I trust myself. [Product A] would just be a nuisance to me.” It is not surprising that the polite tone had an effect on feeling less inadequate. Applying the guideline also had a small effect on feeling more secure ( \(\eta _G^2=0.02\) ).
The results indicate some disagreement among participants about what constitutes acceptable social norms: for some participants, the blunter tone was acceptable, too. Had the vignette used offensive language, the results might have been different. Even so, the results point to the importance of conducting user research to understand social norms and what is acceptable to different user groups. The results also indicate an interaction between G5 and G2, as the blunter tone was interpreted as a sign of confidence and better product performance. It is important to take G2 into consideration when applying Guideline 5 to describe system behaviors or capabilities, so as to avoid creating unrealistic expectations about the AI system.
G6: Mitigate Social Biases.
The vignettes described an online search engine. When searching for images of CEOs and doctors, [Product A] showed people of different genders and skin tones, while [Product V] did not show images of women or people of color.
[Product A] was preferred by 64 (94%) respondents and resulted in substantiated effects on all UX metrics, one being small and the others being either medium or large. Most (34) participants’ reasons for preferring [Product A] converged around the theme that [Product A] did not have bias, or an agenda with bad intentions: “Although it makes mistakes sometimes as well, it seems more well-intentioned than [Product V]. Or rather, [Product V]’s creators.” This is consistent with the large effects on UX metrics related to trust ( \(\eta _G^2=0.30\) ). Another 14 participants referred to benefits of diverse search results: “Diversity of results (all relevant to the search) allows me easy access to a variety of content, and gives me a degree of control over what content is available for me to use.” Yet another reason for preferring [Product A] was the perception that it reflected reality, mentioned by nine participants: “I work in the medical field and understand that many positions have more women than men filling the roles, and to have a search result that excludes them, as well as individuals of different races, is an inaccurate depiction of the occupation.” Together, these reasons align with effects on the UX metrics related to productivity ( \(\eta _G^2=0.30\) ) and perceived performance ( \(\eta _G^2=0.26\) ). Among the four participants who preferred [Product V], two indicated that [Product A] had an agenda: “Most CEOs and doctors are white males. I feel like the first one is forcing its agenda of being inclusive.” One participant seems to have clicked [Product V] by mistake, as their open-ended comment actually describes [Product A]:
“I believe the [Product V] search engine mitigates undesirable and unfair stereotypes and biases. Every individual deserves to be treated equally. I understand that we all have preference in choosing which person we feel we are comfortable to communicate with, but through [Product V] system, every individual may have the chance to excel better than before.”
Due to increased media coverage (e.g., [31, 71, 117]), the general public is sensitive to issues of social bias in AI systems [82] and in U.S. society at large. As in society at large, feelings about social biases were intense, as suggested by the strong language in participant quotes. Mitigating social biases is a complex issue, and even though it is difficult to achieve, there is convergence in academia and industry about the importance of doing so (see [82] for a survey of this topic).
G7: Support Efficient Invocation.
The vignettes were about a feature in a presentation application that suggested alternative slide layouts. [Product A] had a button to manually invoke the layout design ideas if the feature did not trigger automatically, while [Product V] did not have a button for manual invocation.
[Product A] was preferred by 58 (97%) participants. The main reason, mentioned by 33 participants, was that [Product A] provided the option to request design help manually. A total of 17 participants found [Product A] more user-friendly and efficient, and three felt it offered more freedom and control: “I would prefer being able to get help easily if I need it. [Product A] seems more useful;” “[Product A] gives me more control over the program and allows me to tell when I want to apply suggestions about formatting.” Participants’ comments are consistent with the effects on UX metrics about feeling in control ( \(\eta _G^2=0.21\) ), feeling productive ( \(\eta _G^2=0.23\) ), and those related to perceived product quality.
The results for G7 point to the fundamental user need for easy access to, and control over, invoking the available features in an app. This need is not limited to slide editors or layout recommendation features and is consistent with existing design heuristics about user control (e.g., [79, 105]). Of course, with other types of interfaces, the specific interaction for efficient invocation would be different (e.g., gestures or voice commands [35, 64]), but the finding that G7 supports user control would still apply.
G8: Support Efficient Dismissal.
The vignettes used the same features as for G7, but manipulated guideline compliance through the presence or absence of a button to dismiss slide design suggestions when they were not needed.
Results showed that 56 (92%) participants preferred [Product A]. The presence of the dismissal option was the main reason, mentioned by 28 participants. The other 27 participants commented on various UX aspects that differentiated [Product A] from [Product V], such as choice and control: “[Product A] seems to give me a little more control over my user experience”; frustration: “Because I would get frustrated not being able to get the help off the screen”; productivity: “Being able to remove/hide a tool that is not needed at the time will allow for more productivity and less frustration”; and user-friendliness: “It’s more intuitive and has a more well-designed interface.” Feeling in control ( \(\eta _G^2=0.23\) ) and more productive ( \(\eta _G^2=0.19\) ) were also apparent in the quantitative effects on UX metrics, as were three metrics related to perceived product quality: usefulness ( \(\eta _G^2=0.16\) ), performance ( \(\eta _G^2=0.16\) ), and NPS ( \(\eta _G^2=0.14\) ).
Similar to G7, efficient dismissal supports user control and might be implemented differently in other interface types. We recommend AI practitioners follow UX research best practices and assess the effectiveness of a planned interaction, potentially using the same research protocol presented here.
G9: Support Efficient Correction.
The vignettes described the same layout suggestions feature as in the studies for G7 and G8. Upon selecting one of the design recommendations, [Product A] allowed the user to make further changes to the layout, while [Product V] did not allow such changes.
The results for G9 show that 56 (93%) respondents preferred [Product A] and all effects on UX metrics were substantiated. Besides stating that they liked the ability to modify the suggested layout (32 respondents), 25 participants’ open-ended comments expressed feelings consistent with the quantitative effects on the UX metrics: feeling in control ( \(\eta _G^2=0.32\) ): “I do not like feeling out of control. I like to be able to alter the layout to suit my individual needs”; feeling productive ( \(\eta _G^2=0.18\) ): “[Product A] allows you to edit things such as size and reposition things which would make the job easier to do”; reliability ( \(\eta _G^2=0.19\) ): “[Product A] is more reliable as it allows myself, the user, to fix mistakes that the software will make from time-to-time”; trust ( \(\eta _G^2=0.20\) ): “I am happy for the supported help of the Design Helper, but I don’t fully trust it, and I want to feel like I am in control of my slide deck software. Therefore, [Product A] seems like a better fit for me.” Participants also expressed feelings not captured in the quantitative UX metrics: freedom and flexibility: “I like having the freedom of being able to adjust the suggested layout if needed”; avoiding frustration: “The ability to customize and make adjustments on [Product A] is a significant improvement over [Product V]. [Product V] would make me very frustrated and unhappy if I couldn’t make things just right.”
The results for G9 also point to the fundamental user need to have control over AI systems and to be able to edit their outputs.
G10: Scope Services when in Doubt.
The vignettes described an auto-complete feature in a document editing application. [Product A] provided a list of options when it was not sure which word the user was trying to type. [Product V], on the other hand, automatically completed a word with its best guess after the user typed the first few letters.
[Product A] was preferred by 62 (94%) participants. In the open-ended comments, 31 respondents perceived [Product A] as able to reduce the likelihood of errors, which is consistent with the medium effect on perceived performance ( \(\eta _G^2=0.15\) ) and reducing uncertainty ( \(\eta _G^2=0.15\) ). Another 19 participants simply stated they preferred it because of the way it worked. Participants mentioned [Product A] made them feel in control: “I would feel more in control, because I would have a choice to make, instead of one being made for me,” which is also reflected by a medium effect in quantitative results ( \(\eta _G^2=0.20\) ); more efficient: “I would prefer the speed of which I can click on the word to avoid having to type the whole word myself. [Product V] would confuse me and make me sad”; and less trusting of [Product V]: “I don’t trust [Product V]’s automatic features,” echoing a small effect on trust ( \(\eta _G^2=0.09\) ).
The results for G10 indicate that handing over control from a less confident AI system to the human user is important for maintaining a positive UX. In this case, the cost of engaging the user in disambiguation was lower than the cost of making a mistake. There might be scenarios where engaging in disambiguation dialogues can be distracting to the user (e.g., while driving). When applying Guideline 10, it is important to consider the relative costs of engaging in disambiguation versus simply degrading services when the system is in doubt. These insights align with previous work using simulation [65] or implemented systems [107].
G11: Make Clear Why the System Did What It Did.
The vignettes used a spreadsheet application that generated insights and recommended charts. [Product A] made available an explanation for these recommendations, whereas [Product V] did not.
All but one participant (64, 98%) preferred [Product A]. Besides liking that [Product A] provided an explanation, mentioned by 29 respondents, participants stated that [Product A] helped users understand and check for errors: “If the software is making mistakes then an explanation is 100% needed to avoid frustration and inaccuracy,” reinforcing the importance of making an explanation available especially when the AI might be wrong [1, 5]. In fact, this guideline is one of the few that showed a statistically significant interaction effect with AI performance: applying the guideline made a bigger difference for trust (p = 0.04) and decreased expectation of harm (p = 0.03) when AI performance was sub-optimal (Figure 12). Not surprisingly, applying Guideline 11 also resulted in a medium effect on perceived product performance ( \(\eta _G^2=0.15\) ). Additionally, six participants mentioned that explanations were useful or valuable, and six saw [Product A] as more reliable or trustworthy, consistent with the quantitative UX effects on trust ( \(\eta _G^2=0.14\) ) and reliability ( \(\eta _G^2=0.14\) ): “I would trust [Product A] more seeing that it gives you more access to important information.”
The results point to the need for explanations of AI system behavior, a principle not limited to a specific product or feature. The results echo previous findings that the mere presence of an explanation increases user trust in an AI system [56, 96], which is just as concerning here. Careful consideration is needed of how to design, implement, and deliver explanations to users in different scenarios. For example, recent work has found that explanations can lead to inappropriate trust and over-reliance on AI systems [24, 86], or even poor performance in human-AI collaboration [14].
G12: Remember Recent Interactions.
The vignettes described an e-mail application. When attaching a file, [Product A] showed a list of recent files to choose from, whereas [Product V] opened a standard file explorer window for the user to navigate to files.
[Product A] was preferred by 60 (89%) participants. In the open-ended answers, 30 participants made comments about various aspects of [Product A]’s better UX, which are consistent with the substantiated effect on PU ( \(\eta _G^2=0.02\) ). Some aspects mentioned by the participants were not included among the UX metrics we measured: convenience: “I like how it shows my recent files because its more convenient”; ease of use: “I would want it to remember recently used files to make my work easier”; user-friendliness: “I am more likely to reopen a file that I have recently looked at, so I think [Product A] would be more user-friendly for my purposes.” Productivity was measured in the UX metrics, and two participants mentioned it: “I certainly like the option of a program allowing me to simply click and attach a recent file. This for me, is incredibly productive without having to constantly find where a file may or may not be located to attach it.” Surprisingly, this is inconsistent with the unsubstantiated effect ( \(\eta _G^2=0.01\) ) in the quantitative results. This could be explained by quotes from the nine participants who preferred [Product V], who thought the feature was not needed: “Don’t ever use when a program shows the recent files. Not needed”; or were concerned about privacy: “I don’t really want my email application looking at what I’m working on. I just want it to be neutral and allow me to choose what I want to send.”
The results point to a general privacy concern about how much information it is acceptable for the system to track. This concern is not bound to an email app or a file explorer, and can be especially prevalent in the Internet of Things (IoT) and AI-infused cyber-physical systems [112].
G13: Learn from User Behavior.
The vignettes described a presentation application with the same layout helper feature used in G7 through G9. [Product A] personalized its design suggestions based on previous user behavior, while [Product V] always showed the default recommendations.
[Product A] was preferred by 49 (82%) participants. Participants who preferred it not only liked that it learned user preferences (23), but also found it more efficient (18), which is consistent with the medium effect on PU ( \(\eta _G^2=0.17\) ). We also observed statistically significant and substantiated interaction effects on two UX metrics between guideline compliance and AI performance. In the sub-optimal AI performance condition, applying Guideline 13 resulted in improved effects on BI ( \(p=0.02, \eta _G^2=0.03\) ) and feeling secure ( \(p=0.04, \eta _G^2=0.03\) ). Participants’ open-ended answers such as the one below suggest users might be more likely to tolerate sub-optimal AI performance when there is an indication that the system is learning and might improve over time. “From the two [Product A] seemed to have a better user experience. Although it did make some mistakes they outweigh the ease of use and learning ability of [Product A].”
This guideline focuses on the system “learning” from user behavior, yet some participants assumed that this learning would directly improve system behaviors. Because participants appear to conflate the two guidelines, learning from user behavior (G13) without actually improving the system appropriately (G14) could be perceived as problematic. In practice, it is also important to convey to users how their interactions are used for system improvement or other purposes (G16). Because these three guidelines can be deeply interrelated in practice, it is important to consider their interaction when designing and evaluating human-AI interaction. Prior research has studied adaptive user interfaces that learn from user behaviors in various contexts, such as autonomous driving [110] and e-learning [62], to name a few. Our study results confirm the positive effects of adapting user interfaces based on user behaviors and point to the importance of combining multiple guidelines when designing such adaptive user interfaces.
G14: Update and Adapt Cautiously.
The vignettes described a document editing application. [Product A] adapted a part of its menu to the user’s current actions, while [Product V] changed its entire menu.
[Product A] was preferred by 55 (81%) participants. Among those who preferred [Product A], 17 participants liked [Product A]’s consistency or found [Product V] disruptive: “I feel like [Product V] would constantly change the entire menu bar and these changes would be disruptive or distracting to me as I tried to work. [Product A] would be less intrusive.” These perceptions might align with the small quantitative effects on feeling less inadequate ( \(\eta _G^2=0.03\) ), less uncertain ( \(\eta _G^2=0.08\) ), and reliability ( \(\eta _G^2=0.05\) ). Consistent with the substantiated effects on control ( \(\eta _G^2=0.09\) ) and productivity ( \(\eta _G^2=0.06\) ), 15 respondents mentioned these aspects in their open-ended comments: “I like having control of the main functions that I find useful and not having the entire bar change would be more beneficial for me”; “Since both programs occasionally make mistakes, I would be more productive when using [Product A] because I would remember where the tool buttons are.”
The results point to the importance of having a consistent UX and reducing the burden of learning an updated system, consistent with existing research on backward compatibility [13]. Previous work has recognized that maintaining consistency in adaptive user interfaces is challenging [57]. Future research can use this research protocol to experiment with multiple design ideas before investing in implementation.
G15: Encourage Granular Feedback.
The vignettes were about a spreadsheet application. [Product A] had an option to provide feedback on suggested charts, while [Product V] did not.
A total of 60 (91%) participants preferred [Product A]. The strong preference for [Product A] is also reflected in the substantiated effects on almost all UX metrics, except the decreased expectation of harm. In the open-ended comments, 25 participants simply stated that the reason for their preference was the availability of the feedback feature, but others provided more nuanced reasons. Another 21 participants thought that because it asked for feedback, [Product A] would learn and adapt to user needs: “I feel that if I can let the program know which features are useful to me and which aren’t, it may get better at predicting which features to suggest to me.” This could explain the statistically significant interaction effects on perceptions of reliability (p = 0.04) and feeling less inadequate (p = 0.03): “I feel like in the long run, it will become more reliable, unlike [Product V] which has no way of knowing what it is doing wrong, making it unreliable.”
The results suggest that asking for granular feedback from the users (G15) sets the expectation that the system will learn from the feedback (G13) and improve over time (G14). In a longer-term interaction scenario, user perception might change over time depending on system performance. Because these guidelines appear to be deeply interrelated, it is important to consider their interaction when designing and evaluating human-AI interaction.
G17: Provide Global Controls.
The vignettes described an email application. [Product A] provided a setting where the user could teach the system that certain contacts are important, so their emails always went to the “Important” inbox rather than being miscategorized by the system into the “Other” inbox. [Product V], on the other hand, did not have this global control.
[Product A] was preferred by 58 (89%) participants. Of those, 35 participants chose [Product A] because of the availability of the feature. Another 19 participants associated [Product A] with aspects of UX that are consistent with the small effects in the quantitative results: feelings of control ( \(\eta _G^2=0.10\) ): “Because it somehow gives you some control of the e-mails you want to be marked as important.”; trust ( \(\eta _G^2=0.05\) ): “With [Product A] I have more trust that the emails are where they are supposed to be...”; reliability ( \(\eta _G^2=0.03\) ): “[Product A] has more user functionality so you can customize the app to your liking and seems more reliable.”; and usefulness ( \(\eta _G^2=0.06\) ): “The setting to classify people would be useful if it works.” In the conditions with sub-optimal AI performance, four participants raised concerns about possible mistakes: “Because it [Product V] is not promising something like marking people as important going forward and then failing to deliver something like that. I would expect [the product] to fail and because of that I would be watching it more closely and manually monitoring it.”
The results revealed that users prefer to have control over system behaviors, but might have additional concerns when system performance is sub-optimal: users might not trust that the global controls will influence system behavior in the way they desire. This is not limited to a specific app or feature.
G18: Notify Users About Changes.
The vignettes used the same sorting feature in an email application as in G17. [Product A] would send a notification when the e-mail categorization feature underwent an update and changed the way it worked, while [Product V] did not.
[Product A] was preferred by 61 (90%) participants. When explaining their preference, 51 participants stated they wanted to know and stay informed: “I would want to know changes coming ahead of time”; “I would prefer to use [Product A] because it would notify me when it made changes, which would make me feel like I was more in the loop with what was going on and that I wouldn’t be surprised when I signed onto my email account.” These sentiments align with the quantitative effects of feeling more secure ( \(\eta _G^2=0.04\) ), less uncertain ( \(\eta _G^2=0.03\) ), and perhaps trusting the product more ( \(\eta _G^2=0.02\) ). Also, participants mentioned [Product A] would make them feel more in control, which also showed a substantiated quantitative effect ( \(\eta _G^2=0.03\) ): “Because it keeps me aware of what actions is taking, make me feel more secure and in control of [Product A]. With [Product V] I feel like I’m left in the dark.”
The results suggest that the action of notifying users of system changes influences UX positively and minimizes surprises caused by a system update. This result is consistent with industry best practices (e.g., [76]), but in reality might be contingent on the assumption that users will pay attention to such notifications.