
[MILESTONE] Run an A/B test to evaluate Edit Check (references) impact
Closed, Resolved · Public

Description

Reference Check is designed to increase the likelihood that newcomers and Junior Contributors who are editing from within Sub-Saharan Africa:

  1. Publish edits that they are proud of and experienced volunteers consider useful
  2. Return to edit again in the future

This task involves running an A/B test (or perhaps a multivariate test [i]) to evaluate the extent to which this initial Edit Check has been effective at impacting newcomers and Junior Contributors in the ways described above.

Decision(s) To Be Made

  • Decide whether the impact Edit Check is having on users' behavior is positive enough for the feature to be made available by default at all Wikipedias.

Hypotheses

  • KPI
    Hypothesis: The quality of new content edits newcomers and Junior Contributors make in the main namespace will increase, because a greater percentage of these edits will include a reference or an explicit acknowledgement as to why they lack references.
    Metric(s) for evaluation: 1) Proportion of published edits that add new content and include a reference or an explicit acknowledgement of why a citation was not added; 2) Proportion of published edits that add new content (T333714) and are reverted within 48 hours (or have a high revision risk score, if we use the revision risk model (T317700, T343938)).
  • Curiosity #1
    Hypothesis: Newcomers and Junior Contributors will be more aware of the need to add a reference when contributing new content, because the visual editor will prompt them to do so in cases where they have not done so themselves.
    Metric(s) for evaluation: Increase in the proportion of newcomers and Junior Contributors who publish at least one new content edit that includes a reference.
  • Curiosity #2
    Hypothesis: Newcomers and Junior Contributors will be more likely to return to publish a new content edit in the future that includes a reference, because Edit Check will have caused them to realize references are required when contributing new content to Wikipedia.
    Metric(s) for evaluation: 1) Proportion of newcomers and Junior Contributors who successfully publish an edit within which Edit Check was activated and return to make an unreverted edit to a main namespace during the identified retention period; 2) Proportion of newcomers and Junior Contributors who publish an edit within which Edit Check was activated and return to make a new content edit with a reference to a main namespace during the identified retention period.
  • Curiosity #3 [v]
    Hypothesis: Newcomers and Junior Contributors will be more likely to change an unreliable reference if presented with that information while attempting to add a reference when contributing new content.
    Metric(s) for evaluation: Proportion of newcomers and Junior Contributors who elect to "try new source" after being presented with the reference reliability check.
  • Curiosity #4
    Hypothesis: The quality of new content edits published by newcomers and Junior Contributors will increase, because these contributors will be made aware that the reference they are adding is deemed unreliable and unfit for publishing on-wiki.
    Metric(s) for evaluation: Proportion of new content edits by newcomers and Junior Contributors presented with the reference reliability check that are reverted within 48 hours (or have a high revision risk score).
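To make the KPI proportions concrete, here is a minimal sketch of how they might be computed. The field names (adds_new_content, has_reference, acknowledged_no_ref, reverted_48h) are hypothetical stand-ins, not the real schemas from T333714 or T317700:

```python
# Toy published-edit records; field names are illustrative only.
edits = [
    {"adds_new_content": True,  "has_reference": True,  "acknowledged_no_ref": False, "reverted_48h": False},
    {"adds_new_content": True,  "has_reference": False, "acknowledged_no_ref": True,  "reverted_48h": False},
    {"adds_new_content": True,  "has_reference": False, "acknowledged_no_ref": False, "reverted_48h": True},
    {"adds_new_content": False, "has_reference": False, "acknowledged_no_ref": False, "reverted_48h": False},
]

# Both KPI metrics are proportions over new-content edits only.
new_content = [e for e in edits if e["adds_new_content"]]

# KPI metric 1: share of new-content edits with a reference or an
# explicit acknowledgement of why a citation was not added.
sourced = sum(e["has_reference"] or e["acknowledged_no_ref"] for e in new_content)
kpi_sourced_rate = sourced / len(new_content)

# KPI metric 2: share of new-content edits reverted within 48 hours.
kpi_revert_rate = sum(e["reverted_48h"] for e in new_content) / len(new_content)

print(kpi_sourced_rate, kpi_revert_rate)  # 2/3 and 1/3 for this toy data
```

In practice these would be computed per wiki and per experiment arm rather than over a flat list.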

Leading indicators

See T352130.

Guardrails

This section describes the metrics we will use to make sure other important parts/dimensions of the "editing ecosystem" are not being negatively impacted by Edit Check. The scenarios named in the chart below emerged through T325851.

  • 1) Edit quality decreases (T317700)
    Metric(s) for evaluation: Proportion of published edits that add new content and are reverted within 48 hours (or have a high revision risk score, if we use the revision risk model (T317700)). Will include a breakdown of the revert rate of published edits with and without a reference added.
  • 2) Edit completion rate drastically decreases
    Metric(s) for evaluation: Proportion of edits that are started (event.action = init) and are successfully published (event.action = saveSuccess).
  • 3) Edit abandonment rate drastically increases
    Metric(s) for evaluation: Proportion of contributors who are presented Edit Check feedback and abandon their edits (indicated by event.action = abort and event.abort_type = abandon).
  • 4) People shown Edit Check are blocked at higher rates
    Metric(s) for evaluation: Proportion of contributors blocked after publishing an edit where Edit Check was shown.
  • 5) High false positive or false negative rates
    Metric(s) for evaluation: A) Proportion of new content edits published without a reference and without being shown Edit Check (an indicator of false negatives); B) Proportion of contributors who dismiss adding a citation and select "I didn't add new information" or another indicator that their edit doesn't require a citation.
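Guardrails 2 and 3 above are defined in terms of EditAttemptStep-style events. As a minimal sketch, assuming a simplified event shape (session, action, abort_type) rather than the real schema:

```python
# Toy event stream; the real EditAttemptStep events carry many more fields.
events = [
    {"session": "a", "action": "init"},
    {"session": "a", "action": "saveSuccess"},
    {"session": "b", "action": "init"},
    {"session": "b", "action": "abort", "abort_type": "abandon"},
    {"session": "c", "action": "init"},
]

started = {e["session"] for e in events if e["action"] == "init"}
saved = {e["session"] for e in events if e["action"] == "saveSuccess"}
abandoned = {
    e["session"] for e in events
    if e["action"] == "abort" and e.get("abort_type") == "abandon"
}

# Guardrail 2: edits started that were successfully published.
completion_rate = len(saved & started) / len(started)

# Guardrail 3: edits started that were explicitly abandoned.
abandonment_rate = len(abandoned & started) / len(started)

print(completion_rate, abandonment_rate)  # 1/3 and 1/3 for this toy data
```

The real guardrail additionally restricts the abandonment denominator to sessions where Edit Check feedback was shown; that filter is omitted here for brevity.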

A/B Test: Decision Matrix

  • 1) Scenario: Edit Check is disrupting, discouraging, or otherwise getting in the way of volunteers who are attempting to make edits in good faith. Read: people are less likely to publish the edits they start.
    Indicator(s): Significant drop in edit completion and spike in edit abandonment in good-faith edit sessions [iii][iv] where Edit Check is activated. Will include a breakdown to review edits where the reference reliability check was included.
    Plan of action: Pause scaling plans; investigate changes to the UX.
  • 2) Scenario: Edit Check is increasing the likelihood that people will publish destructive edits.
    Indicator(s): Increase in the proportion of contributors blocked after publishing an edit where Edit Check was activated; increase in the proportion of published edits within which Edit Check was activated that are reverted within 48 hours, relative to new content edits within which Edit Check was NOT activated.
    Plan of action: Pause scaling plans; review edits to identify patterns of abuse and propose UX changes to mitigate them.
  • 3) Scenario: Edit Check is causing people to publish edits that align with project policies.
    Indicator(s): Increase in the proportion of edits within which Edit Check was activated that include a reference and are not reverted within 48 hours, relative to new content edits without a reference within which Edit Check was NOT activated.
    Plan of action: Move forward with scaling plans.
  • 4) Scenario: Edit Check is effective at causing people to accompany new content edits with a reference, but those references are unreliable.
    Indicator(s): Increase in the proportion of published edits within which Edit Check was activated that include a reference, and an increase or no change in the proportion of these edits that are reverted within 48 hours.
    Plan of action: Block scaling plans on the reference reliability work (T276857).
  • 5) Scenario: Edit Check is not effective at causing people to accompany new content edits with a reference, but is not disruptive to volunteers.
    Indicator(s): No change or a decrease in the proportion of published edits within which Edit Check was activated that include a reference, and A) no significant drop in edit completion rate or spike in abandonment rate, and B) no significant spike in block or revert rates.
    Plan of action: Move forward with scaling plans.

i. Where a "multivariate test" in this context could look like tests wherein we compare: A) multiple variations of Reference Check user experiences or B) people who are shown the source editor by default, to people who are shown VE by default, and people who are shown VE by default with Edit Check activated, as @MNeisler and @DLynch raised offline
ii. See T331582#9132480
iii. Being able to distinguish edits made in good faith from those made in bad faith depends on T343938
iv. Per the reasons @MNeisler discovered and named in T343938#9368298, it is not feasible to use the revert risk model to assess whether an edit was made in good or bad faith: "This would require us to determine if it is a good-faith edit session while the user is attempting an edit, which is not feasible yet per engineering constraints @Pablo mentioned in T343938#9082581. The revision risk model requires a revision ID, which is only stored with published edits."
v. Per discussions with the Editing team, we have decided not to include the reference reliability check in this AB test. We will review the impact of this feature in a separate deployment.

Related Objects

Event Timeline


Update: I've added the leading indicators @MNeisler and I discussed offline on 6 Sep 2023.

MNeisler triaged this task as Medium priority.
MNeisler added a project: Product-Analytics.
MNeisler moved this task from Triage to Current Quarter on the Product-Analytics board.
ppelberg updated the task description. (Show Details)
ppelberg moved this task from Upcoming to Needs Discussion / Investigation on the Editing-team board.

I've updated the task description with the proposals @MNeisler and I converged on during today's (18 Oct 2023) offline meeting.

Next steps

ppelberg renamed this task from [Analysis] Run an A/B test to evaluate Edit Check (references) impact to [MILESTONE] Run an A/B test to evaluate Edit Check (references) impact. (Nov 22 2023, 6:20 PM)

I've updated the task description to include curiosities (Curiosity #3 and #4) and guardrails (Guardrails #2 and #3) we'd like to review as part of the planned incorporation of the reference reliability check (user experience being defined in T347531).

The measurement plan has also been updated to incorporate these changes.

@MNeisler @ppelberg: just want to verify that

Proportion of published edits that add new content (T333714) and are reverted within 48 hours (or have a high revision risk score) if we use revision risk model

is what would be reported in WE1: Program Indicator Plan Template_WE1_Contributor Experiences once initial data is available (and then updated as Edit Check is deployed to more wikis)

Trizek-WMF subscribed.

From T345298, we can proceed with the following wikis:

  • arwiki
  • afwiki
  • frwiki
  • itwiki
  • jawiki
  • ptwiki
  • swwiki
  • yowiki
  • viwiki
  • zhwiki

We might have eswiki as a bonus in a few days.

Change 1003052 had a related patch set uploaded (by DLynch; author: DLynch):

[mediawiki/extensions/WikimediaEvents@master] EditAttemptStep: log buckets for the edit check test

https://gerrit.wikimedia.org/r/1003052

Change 1003053 had a related patch set uploaded (by DLynch; author: DLynch):

[mediawiki/extensions/VisualEditor@master] Enrollment for the edit check a/b test

https://gerrit.wikimedia.org/r/1003053

Change 1003052 merged by jenkins-bot:

[mediawiki/extensions/WikimediaEvents@master] EditAttemptStep: log buckets for the edit check test

https://gerrit.wikimedia.org/r/1003052

Change 1004351 had a related patch set uploaded (by DLynch; author: DLynch):

[operations/mediawiki-config@master] Launch the Visual Editor edit check a/b test

https://gerrit.wikimedia.org/r/1004351

Change 1003053 merged by jenkins-bot:

[mediawiki/extensions/VisualEditor@master] Enrollment for the edit check a/b test

https://gerrit.wikimedia.org/r/1003053

I've been asked to include eswiki as well, so the config patch has been updated to add them.

Change 1004708 had a related patch set uploaded (by DLynch; author: DLynch):

[mediawiki/extensions/WikimediaEvents@wmf/1.42.0-wmf.18] EditAttemptStep: log buckets for the edit check test

https://gerrit.wikimedia.org/r/1004708

Change 1004709 had a related patch set uploaded (by DLynch; author: DLynch):

[mediawiki/extensions/VisualEditor@wmf/1.42.0-wmf.18] Enrollment for the edit check a/b test

https://gerrit.wikimedia.org/r/1004709

Change 1004351 merged by jenkins-bot:

[operations/mediawiki-config@master] Launch the Visual Editor edit check a/b test

https://gerrit.wikimedia.org/r/1004351

Change 1004708 merged by jenkins-bot:

[mediawiki/extensions/WikimediaEvents@wmf/1.42.0-wmf.18] EditAttemptStep: log buckets for the edit check test

https://gerrit.wikimedia.org/r/1004708

Change 1004709 merged by jenkins-bot:

[mediawiki/extensions/VisualEditor@wmf/1.42.0-wmf.18] Enrollment for the edit check a/b test

https://gerrit.wikimedia.org/r/1004709

Mentioned in SAL (#wikimedia-operations) [2024-02-19T21:25:05Z] <zabe@deploy2002> Started scap: Backport for [[gerrit:1004708|EditAttemptStep: log buckets for the edit check test (T342930)]], [[gerrit:1004709|Enrollment for the edit check a/b test (T342930)]], [[gerrit:1004351|Launch the Visual Editor edit check a/b test (T342930 T352127)]], [[gerrit:1004748|Default VE on mobile for other wikis (T352127)]]

Mentioned in SAL (#wikimedia-operations) [2024-02-19T21:26:26Z] <zabe@deploy2002> kemayo and zabe: Backport for [[gerrit:1004708|EditAttemptStep: log buckets for the edit check test (T342930)]], [[gerrit:1004709|Enrollment for the edit check a/b test (T342930)]], [[gerrit:1004351|Launch the Visual Editor edit check a/b test (T342930 T352127)]], [[gerrit:1004748|Default VE on mobile for other wikis (T352127)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-02-19T21:42:31Z] <zabe@deploy2002> Finished scap: Backport for [[gerrit:1004708|EditAttemptStep: log buckets for the edit check test (T342930)]], [[gerrit:1004709|Enrollment for the edit check a/b test (T342930)]], [[gerrit:1004351|Launch the Visual Editor edit check a/b test (T342930 T352127)]], [[gerrit:1004748|Default VE on mobile for other wikis (T352127)]] (duration: 17m 25s)

I've completed an initial analysis of the Edit Check AB test results including review of the KPI, Curiosity #1, and Guardrails 1-3. Results are summarized in this slide deck which was presented to the team on 12 March 2024.

Below are the current proposed next steps:

  • Complete review of guardrails #3-4.
  • Complete review of retention as part of Curiosity #2.
  • Add findings for editors from Sub-Saharan Africa (SSA) once data is available. There have currently been only about 70 published edits from SSA in the test. We will extend the test duration so we can get statistically significant findings related to editors volunteering from within SSA.
  • Edit Completion Rate (Guardrail #2)
    • Look at edit completion rate of constructive edits (unreverted)
    • Review options to limit edit completion rate of control group to only eligible edits.
    • Shift the edit completion start point to "saveIntent", as that is the point at which Edit Check is or would be shown.
  • Revert Rate (KPI and Guardrail #1)
    • Review the sources people are citing when they add new content that includes a reference. Note: The analysis I completed in T346982 can be re-used to help complete this task.
    • As the observed change between the test and control groups was only slight, it might be beneficial to apply a statistical model to confirm the impact of the Edit Check on edit revert rate.
    • Review impact if revision risk score is used instead of revert rate as a measure of quality.
  • Summarize results into a final report that can be shared on the project page
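On the point above about applying a statistical model to confirm the revert-rate impact: a simple first pass is a two-proportion z-test comparing revert rates between the test and control groups. The sketch below uses only the standard library; the counts are illustrative, not the observed test results:

```python
from math import sqrt, erf

def two_proportion_z(reverts_a, n_a, reverts_b, n_b):
    """Two-sided two-proportion z-test using the pooled normal approximation."""
    p_a, p_b = reverts_a / n_a, reverts_b / n_b
    p_pool = (reverts_a + reverts_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Standard normal CDF via erf: Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Illustrative counts: 120/1000 reverts in the test group vs 150/1000 in control.
z, p = two_proportion_z(120, 1000, 150, 1000)
print(z, p)
```

With these made-up counts the difference is borderline at the conventional 0.05 level; with the small observed differences noted above, a larger sample (e.g., the extended SSA data) or a regression model controlling for wiki and platform would give a firmer answer.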

The proposed next steps you identified align with the next steps the presentation you made on 12 March 2024 brought to mind for me...thank you for documenting these and all of the work that inspired them, @MNeisler

Completed review of guardrails #4-5. Findings are summarized on slides 35-45 of the slide deck.

Note: We decided to end the AB test on 5 April 2024. At the end of the test, I will review the complete data to confirm any changes to already-calculated metrics and to calculate the metrics that require more data (retention rate [Curiosity #2] and findings for editors from SSA).

I've completed the retention rate analysis (Curiosity #2) now that we have two full months of data. Findings are summarized on slides 20-28 of the slide deck.

Notes re methodology:

  • I used second-month retention as the retention period. This is defined as "out of the users who made an edit during their first 30 days, the proportion who also made an edit during their second 30 days", and aligns with the new editor retention contributor metric used by Movement Insights.
  • Compared first edits where reference check was activated in the test group to eligible edits (edits where reference check would have been activated) in the control group.
  • Analysis is currently limited to registered newcomers and Junior Contributors.
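The second-month retention definition above can be sketched as follows. The data shapes are hypothetical toy inputs; the real analysis runs against edit-history tables and the experiment bucketing:

```python
from datetime import date, timedelta

# Toy data: each user's first qualifying edit, and all of their edits.
first_edit = {"u1": date(2024, 2, 10), "u2": date(2024, 2, 12), "u3": date(2024, 2, 15)}
all_edits = {
    "u1": [date(2024, 2, 10), date(2024, 3, 20)],  # edited again in days 31-60
    "u2": [date(2024, 2, 12)],                     # never returned
    "u3": [date(2024, 2, 15), date(2024, 2, 20)],  # returned, but within first 30 days
}

def retained(user):
    """True if the user edited during their second 30 days (days 31-60)."""
    start = first_edit[user]
    lo, hi = start + timedelta(days=30), start + timedelta(days=60)
    return any(lo < d <= hi for d in all_edits[user])

rate = sum(retained(u) for u in first_edit) / len(first_edit)
print(rate)  # 1/3 for this toy data
```

Per the methodology notes, the real cohorts are restricted to registered newcomers and Junior Contributors, comparing test-group edits where reference check activated against eligible control-group edits.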

All results and methodology are now summarized in the final report. Please let me know if you have any questions or suggested additions/changes.

Note: I'm drafting a high-level summary, with a link to this report, that can be published on mediawiki. This is being completed as part of T352131.
cc @ppelberg

Review the sources people are citing when they add new content that includes a reference. Note: The analysis I completed in T346982 can be re-used to help complete this task.

Note: I added a new tab to the "References people cite when adding new content" dashboard ("February-March 2024 Reference Check Available"), which includes an updated snapshot of the references added by new content edits at the wikis where the AB test was run. You can use the "was reference check shown" filter to review the types of URLs and domains published by edits shown this prompt.

This data can be used to help investigate the types of domains and URLs included in published edits that are still reverted after being shown the reference check.


Oh, this is spectacular, @MNeisler! I can see it being immediately useful for T348060: Expand the Reference Reliability check to include a wider range of sources.

Specifically, I can see us sharing this data with volunteers and empowering them to decide what – if any – domains they would like to provide people attempting to cite them feedback about...

Screenshot 2024-05-14 at 2.04.22 PM.png (541×862 px, 79 KB)

In the meantime, a clarifying question about the meaning of each column within the Domain summary stats view:

Would it be accurate for me to understand the columns below as meaning...?

  • num_domain_occurrences: the number of references that cite a given domain across all pages in the main namespace at a given project
  • Number of distinct pages domain added to with new content edit: the number of distinct pages in the main namespace of a given project where a domain was cited during the period of time the data shown in this chart was gathered
  • Number of new content edits that included domain: the number of edits tagged with editcheck-newcontent that include a given domain at a given project
  • Revert rate: the percentage of "Number of new content edits that included domain" that are reverted within 48 hours

@ppelberg

Would it be accurate for me to understand the columns below as meaning...?

num_domain_occurrences: the number of references that cite a given domain across all pages in the main namespace at a given project

The num_domain_occurrences comes from the external links table and indicates the number of times that domain appears across all pages on a given project.

It's not restricted to references added with new content edits; it counts any external link on a project.

  • Number of distinct pages domain added to with new content edit: the number of distinct pages in the main namespace of a given project where a domain was cited during the period of time the data shown in this chart was gathered

Correct.

  • Number of new content edits that included domain: the number of edits tagged with editcheck-newcontent that include a given domain at a given project

This is also limited to edits tagged with editcheck-newreference.

  • Revert rate: the percentage of "Number of new content edits that included domain" that are reverted within 48 hours

Correct