Report of Reference Check Leading Indicators

Authors: Irene Florez & Megan Neisler

Published: December 3, 2025
Modified: December 5, 2025

The Editing team is evaluating the impact of Reference Check through an A/B test.
The A/B test has one key performance indicator with two parts, two optional curiosities to explore if time allows, and four guardrails, the last of which contains two sub-parts.

KPI Hypothesis: The number of constructive edits newcomers publish will increase because a greater percentage of edits that add new content will include a reference or an explicit acknowledgement of why the edit lacks references. KPI Metric(s) for evaluation: 1) Proportion of published edits that add new content and include a reference or explicit acknowledgement of why a citation was not added; 2) Proportion of published edits that add new content (T333714) and are constructive (read: NOT reverted within 48 hours).


Per the Edit Check Reference Check A/B test report, when Reference Check was shown, edits were 2.2× more likely to include a new reference and be constructive (i.e., not reverted within 48 hours) than otherwise. The English Wikipedia Reference Check A/B test will examine how its numbers compare to this 2024 finding.



Here we review leading indicators before proceeding to analysis of the full A/B test data.


Leading indicators:

⭐ Newcomers are not encountering Reference Check: Proportion of new content edits within which Reference Check is shown (Frequency)
⭐ Newcomers are not understanding the feature: Proportion of contributors that are presented Reference Check and abandon their edits (Edit Completion Rate)
⭐ Reference Check is causing disruption: Proportion of published edits that add new content and are reverted within 48 hours (Revert Rate)
* Reference Check is causing disruption: (If time) Proportion of people blocked after publishing an edit where Reference Check was shown
* People deem Reference Check irrelevant: Proportion of edits wherein people elect NOT to cite the text they are attempting to add (Dismissal Rate)

  • In this AB test, users in the test group will be shown Reference Check if attempting an edit that meets the requirements for the check to be shown in VisualEditor. The control group is provided the default editing experience where no Reference Check is shown.
  • We collected two weeks of AB test events logged between 7 November 2025 and 21 November 2025 on English Wikipedia.
  • We relied on events logged in EditAttemptStep, VisualEditorFeatureUse, and change tags recorded in the revision tags table.
    • Published edits eligible for Reference Check are identified by the editcheck-references revision tag.
    • For filtering to new content edits we use editcheck-newcontent.
    • To identify edits where Reference Check was shown we use VisualEditorFeatureUse events: event.feature = 'editCheck-addReference' AND event.action = 'check-shown-presave'
    • action-reject: editor dismissed Reference Check
    • edit-check-feedback-reason-*: reason for dismissal
  • For calculating Edit Completion Rate, we assume that all edits reaching saveIntent are eligible.
  • For calculating Revert Rate, published edits eligible for Reference Check are identified by the editcheck-references revision tag.
    See the instrumentation spec for more details.
  • Data was limited to mobile web and desktop edits completed in the main (article) namespace using VisualEditor on English Wikipedia. We also limited to edits completed by unregistered users and users with 100 or fewer edits, as those are the users that would be shown Reference Check under the default config settings.
  • For each leading indicator metric, we reviewed the following dimensions: by experiment group (test and control), by platform (mobile web or desktop), by user experience and status. We also reviewed some indicators such as edit completion rate by the number of checks shown within a single editing session.
    • Note: For the by user experience analysis, we split newer editors into three experience level groups: (1) unregistered, (2) newcomer (registered user making their first edit on Wikipedia), and (3) Junior Contributor (user that has made between 2 and 100 edits).

Results are based on initial AB test data to check if any adjustments to the feature need to be prioritized. More event data will be needed to confirm statistical significance of these findings. We will review the complete AB test data as part of the analysis for T400101.

Code
# load packages
shhh <- function(expr) suppressPackageStartupMessages(suppressWarnings(suppressMessages(expr)))
shhh({
    library(lubridate)
    library(ggplot2)
    library(dplyr)
    library(gt)
    library(IRdisplay)
})
#set preferences
options(dplyr.summarise.inform = FALSE)
options(repr.plot.width = 15, repr.plot.height = 10)

# colorblind-friendly palette:
cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")

Early indicators suggest Reference Check is shown fairly often—more than Paste Check or Tone Check, but less frequently than earlier multi-check estimates—and it fires more on mobile web than desktop. While the check may slightly lower edit completion rates (especially when many checks appear), it is also associated with reduced revert rates across all user groups. Effects vary by experience level: newcomers see more checks, and although completion rates rise for unregistered and newcomer editors, they fall for junior contributors (93.8% → 87.8%, a 6.4% relative decrease).

Reference Check Frequency

Reference Check was shown at least once in 42.4% of all published new-content edits by newer editors in the test group. This is higher than observed in Paste Check (36%), higher than Paste Check’s initial estimates in T403861 for published edits, and much higher than Tone Check in the Leading Indicators Analysis (9%). This frequency is lower than trends observed in the Multi-Check Indicators Analysis, where Reference Check was presented in about 78% of published new-content edits.

  • By platform: A notably higher proportion of mobile web edits were shown Reference Check (76.3%) compared to desktop (38.2%). This contrasts with the Paste Check Leading Indicators report patterns, where desktop edits were more frequently shown Paste Check (39%) than mobile web edits (24%).

  • By whether multiple checks were shown: 12.3% of all published new-content edits were shown more than one Reference Check in a session. This is slightly higher than Paste Check (11.4%) and lower than the 2025 Multiple Reference Check A/B test where 27% of published new-content edits in which Reference Check activated displayed multiple Reference Checks.

  • By number of checks shown: The majority of sessions shown Reference Check saw only one check (71%). Two checks were shown in 12.3% of sessions; fewer than 4% received more than six checks. For comparison, out of all editing sessions shown Paste Check, in the Paste Check Leading Indicators Report analysis, 68% included only one Paste Check shown.

  • By user experience: Reference Check appears slightly more frequently for newcomers: newcomer new content edits are 2.5% more likely to be shown Reference Check than those of unregistered users, while junior contributor edits are 38.8% less likely than newcomer edits. In the 2025 Multiple Reference Check A/B test we observed a noticeably stronger effect for newcomers than for unregistered users; junior contributors in the treatment group, however, showed the highest rates of adding references when exposed to multiple reference checks.

Edit Completion Rate

Edits shown Reference Check are completed at a lower rate (87.1%) than eligible edits not shown Reference Check (90.6%), a 4% relative decrease. This aligns with the 2024 Reference Check A/B test, where showing Reference Check produced a 10% decrease in edit completion rate relative to control.

  • By platform: Completion decreased modestly on both platforms. On mobile web there was a 1.5% relative decrease for the treatment group (79.7%) compared to the control (80.9%). Desktop saw a 5.9% relative decrease for the treatment group (89.2%) compared to the control (94.8%). In the 2024 Reference Check A/B test, the pattern was more dramatic on mobile (–24.3%) than desktop (–3.1%). Early trends here look milder by comparison but consistent in direction.

  • By whether multiple checks were shown: Sessions with multiple Reference Checks show lower completion rates than single-check sessions. This is the opposite of that seen in the Paste Check Leading Indicators Analysis and Tone Check Leading Indicators Analysis, where presenting multiple checks did not reduce completion rates. For reference, the 2025 Multiple Reference Check A/B test showed that the edit completion rate for users presented multiple reference checks was lower (74%) than that for users presented a single reference check (75%).

  • By number of checks shown: We don’t see a significant increase in edit abandonment rate when 2–5 Reference Checks are presented in a single session. However, completion declines when 6–10 checks are shown and falls substantially above 10. In the Paste Check Leading Indicators Analysis, we did not observe a notable increase in edit abandonment even when many (>3) Paste Checks were shown within a session.

  • By user experience: Edit completion rates increased for unregistered editors (control: 80.8%, treatment: 84.7%) and for newcomers (control: 84.3%, treatment: 87.1%). Junior contributors in the treatment group (87.8%) saw a 6.4% decrease relative to their control group (93.8%) counterparts. This variation across user experiences echoes differences seen in the 2024 Multi Check Leading Indicators Analysis, where unregistered editors were largely unaffected, newcomers showed small declines when multiple checks were presented, and junior contributors in the treatment group showed a slight relative increase (3%) compared to the control.

Revert Rate

Published new-content edits shown Reference Check are reverted less frequently, with a 13.7% relative decrease compared to eligible edits not shown the check (29.3% for the control and 25.3% for the treatment). This is steeper than the 8.6% relative decrease observed in the 2024 Reference Check A/B test and higher than revert rates in the 2025 multiple Edit Checks A/B test, where control and treatment estimates were around 22.5–23.6%. Important note: These revert rates include edits where the final published text may not include a reference. We plan to review the proportion of new content edits shown or eligible to be shown Reference Check that include a reference in the AB test analysis.
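The relative figures quoted in this report follow the usual (treatment − control) / control arithmetic. A minimal R sketch (the relative_change helper is ours for illustration, not part of the analysis code):

```r
# Illustrative helper (not from the analysis code): relative change
# of a treatment rate versus its control rate
relative_change <- function(control, treatment) {
  (treatment - control) / control
}

# Revert rates quoted above: control 29.3%, treatment 25.3%
round(relative_change(0.293, 0.253) * 100, 1)  # -13.7, i.e. a 13.7% relative decrease
```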

  • By platform: Both desktop (24.4%) and mobile web (29.1%) treatment groups show improved revert rates relative to their controls (desktop: 25.3%, mobile web: 45.8%). Mobile web editors in the treatment group saw a 36.5% relative decrease compared to the control group while desktop editors saw a 3.6% relative decrease. This is consistent with earlier findings: In the 2024 Reference Check A/B test, relative revert rates decreased on both platforms (desktop –9.4%, mobile –5.9%) and in the 2025 Multiple Reference Check A/B test, mobile web treatment group edits also tended to show higher revert rates than desktop treatment group edits.

  • By whether multiple checks were shown: Edits shown multiple Reference Checks experienced higher revert rates than those shown only one, increasing from 24.4% for single-check sessions to 27.6% for multiple-check sessions. This same pattern was visible in the Tone Check Leading Indicators Analysis (multiple tone checks → higher revert rates). In contrast, in the 2025 Multiple Reference Check A/B test, the multi-check treatment group had a substantially lower revert rate (–34.7%) relative to single-check edits—likely because edits that require multiple references differ systematically from those that only trigger one.

  • By user experience: Revert rates decreased across all user experience types. Unregistered editors saw a decrease from 36.8% in the control to 32.2% in the treatment, newcomers from 42.3% to 39.6%, and junior contributors from 25.1% to 20.4%. This pattern is consistent with the 2024 Reference Check AB Test and in part with the 2025 Multiple Reference Checks A/B test analyses where revert rates decreased for newcomers and slightly increased for junior contributors and unregistered editors when comparing treatment to control groups. The 2025 Multiple Reference Checks A/B test highlights, “Results vary slightly based on the type of user completing the edit but none of the observed changes were statistically significant.”

A) Reference Check Frequency

Question: Are newer editors encountering Reference Check?

Methodology: We reviewed the proportion of published new content edits where at least one Reference Check was shown during the editing session (event.feature = 'editCheck-addReference' AND event.action = 'check-shown-presave').

Published edits eligible for Reference Check are identified by the editcheck-references revision tag.

This analysis was specifically limited to edits that were successfully published and identified as new content edits with the tag editcheck-newcontent.

Code
#load frequency data
reference_check_frequency_data <-
  read.csv(
    file = 'data/1-reference_check_save_data.tsv',
    header = TRUE,
    sep = "\t",
    stringsAsFactors = FALSE
  ) 
Code
# Cleaning up dataset and renaming fields to clarify meanings

# Set experience level group and factor levels
reference_check_frequency_data <- reference_check_frequency_data %>%
  mutate(
    experience_level_group = case_when(
     user_edit_count == 0 & user_status == 'registered' ~ 'Newcomer',
     user_edit_count == 0 & user_status == 'unregistered' ~ 'Unregistered',
      user_edit_count > 0 &  user_edit_count <= 100 ~ "Junior Contributor",
      user_edit_count >  100 ~ "Non-Junior Contributor"   
    ),
    experience_level_group = factor(experience_level_group,
         levels = c("Unregistered","Newcomer", "Non-Junior Contributor", "Junior Contributor")
   ))  

#rename experiment field to clarify
reference_check_frequency_data <- reference_check_frequency_data %>%
  mutate(test_group = factor(test_group,
         levels = c('2025-09-editcheck-addReference-control', '2025-09-editcheck-addReference-test'),
         labels = c("control (no Reference Check)", "test (reference check available)")))


#rename platform from phone to mobile web to clarify meaning
reference_check_frequency_data <- reference_check_frequency_data %>%
  mutate(platform = factor(platform,
         levels = c('phone', 'desktop'),
         labels = c("mobile web", "desktop")))
Code
#Set fields and factor levels to assess number of checks shown

reference_check_frequency_data <- reference_check_frequency_data %>%
  mutate(
    multiple_checks_shown = 
         ifelse(n_checks_shown > 1, 1, 0),  
     multiple_checks_shown = factor( multiple_checks_shown ,
         levels = c(0,1)))
         
# note these buckets can be adjusted as needed based on distribution of data
reference_check_frequency_data <- reference_check_frequency_data %>%
  mutate(
    checks_shown_bucket = case_when(
     is.na(n_checks_shown) ~ '0',
     n_checks_shown == 1   ~ '1', 
     n_checks_shown == 2  ~ '2',
     n_checks_shown > 2 & n_checks_shown <= 5  ~ "3-5",
     n_checks_shown > 5 & n_checks_shown <= 10  ~ "6-10", 
     n_checks_shown > 10 ~ "over 10" 
    ),
    checks_shown_bucket = factor(checks_shown_bucket ,
         levels = c("0","1","2", "3-5", "6-10", "over 10")
   ))  

Overall

Code
reference_checks_shown_saved_overall <- reference_check_frequency_data %>%
    filter(test_group == "test (reference check available)"  #limit to test group edits
             & is_new_content ==1 ) %>% #limit to published  new content edits
    group_by(test_group) %>%
    summarise(n_editing_session = n_distinct(editing_session),
              n_editing_session_refcheck = n_distinct(editing_session[was_reference_check_shown == 1])) %>%
    mutate(prop_check_shown = paste0(round(n_editing_session_refcheck/n_editing_session * 100, 1), "%"))  %>%
    gt()  %>%
    tab_header(
    title = "Published new content edits shown at least one Reference Check"
      )  %>%
    opt_stylize(5) %>%
  cols_label(
    test_group = "Experiment Group",
    n_editing_session = "Number of edits",
    n_editing_session_refcheck = "Number of edits shown Reference Check",   
    prop_check_shown = "Proportion of edits shown Reference Check"
  ) %>%
    tab_source_note(
        gt::md('Limited to published new content edits by unregistered users and users with 100 or fewer edits')
    )


display_html(as_raw_html(reference_checks_shown_saved_overall))
Published new content edits shown at least one Reference Check
Experiment Group Number of edits Number of edits shown Reference Check Proportion of edits shown Reference Check
test (reference check available) 1827 774 42.4%
Limited to published new content edits by unregistered users and users with 100 or fewer edits

Reference Check was shown at least once in 42.4% of all published new content edits by newer editors in the test group. This is higher than rates observed for Paste Check (36%) in the Paste Check Leading Indicators Analysis report, higher than Paste Check’s initial estimates in T403861, and substantially higher than rates observed in the Tone Check Leading Indicators Analysis Report, where Tone Check was shown in 9% of all published new content edits.

In the Multi-Check Indicators Analysis, we observed that Reference Check was shown in nearly 80% of published new content edits.

By whether multiple checks were shown

Code
reference_checks_shown_saved_bymultiple <- reference_check_frequency_data %>%
    filter(test_group == "test (reference check available)" & #limit to test group edits
            is_new_content ==1) %>% #limit to published  new content edits
    group_by(test_group) %>%
    summarise(n_editing_session = n_distinct(editing_session),
              n_editing_session_multicheck = n_distinct(editing_session[was_reference_check_shown == 1 & multiple_checks_shown == 1])) %>%
    mutate(prop_check_shown = paste0(round(n_editing_session_multicheck/n_editing_session * 100, 1), "%")) %>% 
    gt()  %>%
    tab_header(
    title = "Published new content edits shown multiple Reference Checks"
      )  %>%
    opt_stylize(5) %>%
  cols_label(
    test_group = "Experiment Group",
    n_editing_session = "Number of edits",
    n_editing_session_multicheck = "Number of edits shown multiple Reference Checks",   
    prop_check_shown = "Proportion of edits shown multiple Reference Checks"
  ) %>%
    tab_source_note(
        gt::md('Limited to published new content edits by unregistered users and users with 100 or fewer edits')
    )


display_html(as_raw_html(reference_checks_shown_saved_bymultiple))
Published new content edits shown multiple Reference Checks
Experiment Group Number of edits Number of edits shown multiple Reference Checks Proportion of edits shown multiple Reference Checks
test (reference check available) 1827 225 12.3%
Limited to published new content edits by unregistered users and users with 100 or fewer edits

12.3% of all published new content edits were shown more than one Reference Check in a session. This is slightly higher than observed in the Paste Check Leading Indicators Analysis report (11.4%) and lower than in the 2025 Multiple Reference Check A/B test, where, within the test group, 27% of published new-content edits in which Reference Check activated (1,697 edits) displayed multiple Reference Checks within a single editing session.

By number of checks shown

Code
reference_checks_shown_saved_bynchecks <- reference_check_frequency_data %>%
     filter(test_group == "test (reference check available)" & #limit to test group edits
             is_new_content ==1 & was_reference_check_shown == 1) %>% #limit to published  new content edits
    mutate(total_sessions = n_distinct(editing_session)) %>%
    group_by(total_sessions, checks_shown_bucket) %>%
    summarise(n_editing_session_refcheck = n_distinct(editing_session)) %>%
    mutate(prop_check_shown = paste0(round(n_editing_session_refcheck/total_sessions * 100, 2), "%")) %>%
    ungroup() %>%
    select(-c(1,3)) %>%
    #mutate(n_editing_session_refcheck = ifelse(n_editing_session_refcheck < 50, "<50", n_editing_session_refcheck))  %>% #sanitizing per data publication guidelines
    gt()  %>%
    tab_header(
    title = "Published new content edits by number of Reference Checks shown"
      )  %>%
    opt_stylize(5) %>%
  cols_label(
    checks_shown_bucket = "Number of Reference Checks shown",
    #n_editing_session_refcheck = "Number of edits",   
    prop_check_shown = "Proportion of edits"
  ) %>%
    tab_source_note(
        gt::md('Limited to published new content edits shown at least one Reference Check')
    )


display_html(as_raw_html(reference_checks_shown_saved_bynchecks))
Published new content edits by number of Reference Checks shown
Number of Reference Checks shown Proportion of edits
1 70.93%
2 12.27%
3-5 12.14%
6-10 2.84%
over 10 1.81%
Limited to published new content edits shown at least one Reference Check

Reference Check was shown only once in the majority of all editing sessions shown Reference Check (71%). For sessions that display multiple Reference Checks, the largest share shows two checks (12.3%), while fewer than 4% present more than six.

For comparison, Paste Check was shown only once in the majority of all editing sessions shown Paste Check (68%) per the Paste Check Leading Indicators Analysis report.

By platform

Code
reference_checks_shown_byplatform <- reference_check_frequency_data %>%
      filter(test_group == "test (reference check available)" & #limit to test group edits
             is_new_content ==1) %>% #limit to published  new content edits
    group_by(platform) %>%
    summarise(n_editing_session = n_distinct(editing_session),
              n_editing_session_refcheck = n_distinct(editing_session[was_reference_check_shown == 1])) %>%
    mutate(prop_check_shown = paste0(round(n_editing_session_refcheck/n_editing_session * 100, 1), "%"))  %>%
    mutate(n_editing_session_refcheck = ifelse(n_editing_session_refcheck < 50, "<50", n_editing_session_refcheck))%>% #sanitizing per data publication guideline
    #select(-2) %>% #removing total number of edits column to santize data for publication
    gt()  %>%
    tab_header(
    title = "Published new content edits shown Reference Check by platform"
      )  %>%
    opt_stylize(5) %>%
  cols_label(
    platform = "Platform",
    #n_editing_session = "Number of edits",
    n_editing_session_refcheck = "Number of edits shown Reference Check",   
    prop_check_shown = "Proportion of edits shown Reference Check"
  ) %>%
    tab_source_note(
        gt::md('Limited to published new content edits by unregistered users and users with 100 or fewer edits')
    )


display_html(as_raw_html(reference_checks_shown_byplatform))
Published new content edits shown Reference Check by platform
Platform n_editing_session Number of edits shown Reference Check Proportion of edits shown Reference Check
mobile web 198 151 76.3%
desktop 1629 623 38.2%
Limited to published new content edits by unregistered users and users with 100 or fewer edits

A higher proportion of edits on mobile web are shown Reference Check (76.3%) compared to desktop (38.2%), a 99.7% relative difference between mobile web and desktop.
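The 99.7% figure is the relative difference between the two platform proportions; as a quick sketch in R (values copied from the table above):

```r
# Proportions of published new content edits shown Reference Check,
# taken from the table above
mobile_web <- 0.763
desktop    <- 0.382

# Relative difference of mobile web over desktop
round((mobile_web - desktop) / desktop * 100, 1)  # 99.7
```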

For comparison, in the Paste Check Leading Indicators Analysis report we observed that a higher proportion of edits on desktop were shown Paste Check (39%) compared to mobile (24%).

By user experience

Code
reference_checks_shown_byuser_status <- reference_check_frequency_data %>%
      filter(test_group == "test (reference check available)" & #limit to test group edits
            is_new_content ==1) %>% #limit to published  new content edits
    group_by(experience_level_group ) %>%
    summarise(n_editing_session = n_distinct(editing_session),
              n_editing_session_refcheck = n_distinct(editing_session[was_reference_check_shown == 1])) %>%
    mutate(prop_check_shown = paste0(round(n_editing_session_refcheck/n_editing_session * 100, 1), "%"))  %>%
    select(-2) %>% #removing total number of edits column to santize data for publication
    gt()  %>%
    tab_header(
    title = "Published new content edits shown Reference Check by user experience"
      )  %>%
    opt_stylize(5) %>%
  cols_label(
    experience_level_group  = "User Experience",
    #n_editing_session = "Number of edits",
    n_editing_session_refcheck = "Number of edits shown Reference Check",   
    prop_check_shown = "Proportion of edits shown Reference Check"
  ) %>%
    tab_source_note(
        gt::md('Limited to published new content edits by unregistered users and users with 100 or fewer edits')
    )


display_html(as_raw_html(reference_checks_shown_byuser_status ))
Published new content edits shown Reference Check by user experience
User Experience Number of edits shown Reference Check Proportion of edits shown Reference Check
Unregistered 149 59.1%
Newcomer 106 60.6%
Junior Contributor 519 37.1%
Limited to published new content edits by unregistered users and users with 100 or fewer edits

Reference Check appears slightly more frequently for newcomers.

Newcomer new content edits are 2.5% more likely to be shown Reference Check than those of unregistered users, while junior contributor edits are 38.8% less likely than newcomer edits.

In the 2025 Multiple Reference Check A/B test we observed a noticeably stronger effect for newcomers than for unregistered users; junior contributors in the treatment group, however, showed the highest rates of adding references when exposed to multiple reference checks.

B) Reference Check Edit Completion Rate

Question: Do newer editors understand the feature?

Methodology: We reviewed the proportion of edits where Reference Check was shown at least once during the edit session and that were successfully published (event.action = saveSuccess). These edits were compared to the completion rate of edits in the control group that were eligible but not shown Reference Check, as implemented in T402460.

The edit_completion_rate query filters to saveIntent events, per the comment: “the moment when reference check would be shown if eligible and in test group”.

Per the Tone Check methodology and Paste Check methodology, we compare:

  • Test group: edits where Reference Check was shown
  • Control group: edits that were eligible but not shown

Note: This analysis excludes edits that were abandoned prior to reaching the point where Reference Check was or would have been shown.

Code
# load data for assessing edit completion rate
edit_completion_rates <-
  read.csv(
    file = 'data/2-edit_completion_rate.tsv',
    header = TRUE,
    sep = "\t",
    stringsAsFactors = FALSE
  ) 
Code
# Set experience level group and factor levels
edit_completion_rates <- edit_completion_rates %>%
  mutate(
    experience_level_group = case_when(
     user_edit_count == 0 & user_status == 'registered' ~ 'Newcomer',
     user_edit_count == 0 & user_status == 'unregistered' ~ 'Unregistered',
      user_edit_count > 0 &  user_edit_count <= 100 ~ "Junior Contributor",
      user_edit_count >  100 ~ "Non-Junior Contributor"   
    ),
    experience_level_group = factor(experience_level_group,
         levels = c("Unregistered","Newcomer", "Non-Junior Contributor", "Junior Contributor")
   ))  

#rename experiment field to clarify
edit_completion_rates <- edit_completion_rates %>%
  mutate(test_group = factor(test_group,
         levels = c('2025-09-editcheck-addReference-control', '2025-09-editcheck-addReference-test'),
         labels = c("control (not shown Reference Check)", "test (reference check available)")))

#rename platform from phone to mobile web to clarify meaning
edit_completion_rates <- edit_completion_rates %>%
  mutate(platform = factor(platform,
         levels = c('phone', 'desktop'),
         labels = c("mobile web", "desktop")))
Code
#Set fields and factor levels to assess number of checks shown

edit_completion_rates  <- edit_completion_rates %>%
  mutate(
    multiple_checks_shown = 
         ifelse(n_checks_shown > 1, "multiple checks shown", "one check shown"),  
     multiple_checks_shown = factor( multiple_checks_shown ,
         levels = c("one check shown", "multiple checks shown")))
         
# note these buckets can be adjusted as needed based on distribution of data
edit_completion_rates  <- edit_completion_rates %>%
  mutate(
    checks_shown_bucket = case_when(
     is.na(n_checks_shown) ~ '0',
     n_checks_shown == 1  ~ '1', 
     n_checks_shown == 2  ~ '2',
     n_checks_shown > 2 & n_checks_shown <= 5  ~ "3-5",
     n_checks_shown > 5 & n_checks_shown <= 10 ~ "6-10", 
     n_checks_shown > 10 ~ "over 10" 
    ),
    checks_shown_bucket = factor(checks_shown_bucket ,
         levels = c("0","1","2", "3-5", "6-10","over 10")
   ))
Code
# define set of all eligible edits to review (eligible in control and shown in test)
edit_completion_rates <- edit_completion_rates %>%
    mutate(is_test_eligible  = ifelse(
        (test_group == "test (reference check available)" & reference_check_shown == 1) |
        (test_group == "control (not shown Reference Check)"), 'eligible', 'not eligible'), #use different labels
      is_test_eligible = 
      factor(
      is_test_eligible,
      levels = c("eligible",  "not eligible" )
  ))

Overall

Code
edit_completion_rate_overall <- edit_completion_rates %>%
    filter(is_test_eligible == 'eligible') %>% #limit to eligible edits
    group_by(test_group) %>%
    summarise(n_edits = n_distinct(editing_session),
              n_saves = n_distinct(editing_session[saved_edit > 0])) %>%
    mutate(completion_rate = paste0(round(n_saves/n_edits * 100, 1), "%"))
Code
# plot visualization of overall edit completion rates
dodge <- position_dodge(width=0.9)

p <- edit_completion_rate_overall  %>%
    ggplot(aes(x= test_group, y = n_saves/n_edits)) +
    geom_col(position = 'dodge', fill = 'dodgerblue4') +
    scale_y_continuous(labels = scales::percent) +
      geom_text(aes(label = paste(completion_rate), fontface=2), vjust=1.2, size = 10, color = "white") +
    scale_fill_manual(values= cbPalette, name = "Reason")  +
    labs (y = "Percent of edit attempts completed",
           x = "Experiment Group",
          title = "Reference Check edit completion rate",
           caption = "Limited to edit attempts shown or eligible to be shown at least one Reference Check")  +
    theme(
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        plot.title = element_text(hjust = 0.5),
        text = element_text(size=24),
        legend.position= "none",
        axis.line = element_line(colour = "black")) 
      
p

Edits shown Reference Check are completed at a lower rate (87.1%) than eligible edits in the control group that were not shown Reference Check (90.6%). This is a 4% relative decrease, directionally consistent with the 2024 Reference Check A/B test, where the edit completion rate for edits shown Reference Check was 10% lower than in the control group.
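As a quick arithmetic check (not part of the analysis pipeline), the relative decrease quoted above can be reproduced directly from the two completion rates:

```r
# Sanity check of the relative decrease quoted above, using the two
# completion rates from the chart (control 90.6%, treatment 87.1%).
control_rate <- 0.906
treatment_rate <- 0.871
relative_decrease <- (control_rate - treatment_rate) / control_rate
round(relative_decrease * 100, 1)  # ~3.9%, i.e. the ~4% relative decrease cited
```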

By whether multiple checks were shown

Code
edit_completion_rate_bymulti <- edit_completion_rates %>%
    filter(reference_check_shown == 1 &
            test_group == 'test (reference check available)') %>%
    group_by(test_group, multiple_checks_shown) %>%
    summarise(n_edits = n_distinct(editing_session),
              n_saves = n_distinct(editing_session[saved_edit > 0])) %>%
    mutate(completion_rate = paste0(round(n_saves/n_edits * 100, 1), "%")) %>%   
    gt()  %>%
    tab_header(
    title = "Reference Check edit completion rate by whether multiple checks were shown"
      )  %>%
    opt_stylize(5) %>%
  cols_label(
    test_group = "Experiment group",
    multiple_checks_shown = "Multiple Reference Checks shown",
    n_edits = "Number of edit attempts shown Reference Check",
    n_saves = "Number of published edits",
    completion_rate = "Proportion of edits saved"
  ) %>%
    tab_source_note(
         gt::md('Limited to edit attempts shown or eligible to be shown at least one Reference Check')
    )


display_html(as_raw_html(edit_completion_rate_bymulti))
Reference Check edit completion rate by whether multiple checks were shown
Multiple Reference Checks shown Number of edit attempts shown Reference Check Number of published edits Proportion of edits saved
test (reference check available)
one check shown 629 560 89%
multiple checks shown 280 232 82.9%
Limited to edit attempts shown or eligible to be shown at least one Reference Check

The edit completion rate of edits shown multiple Reference Checks is lower than that of edits shown a single Reference Check. For comparison, the Paste Check Leading Indicators and Tone Check Leading Indicators analyses showed the opposite. Per the Paste Check Leading Indicators Analysis, “We currently don’t see any increase in edit abandonment rate even if a large number (>3) Paste Checks are shown in a single session”. Additionally, edits shown one and two Paste Checks had similar proportions of edits saved: 50.1% for those shown one and 53.4% for those shown two. In the Tone Check Leading Indicators analysis, edit completion was 66.5% for edits shown one check and 66.9% for those shown multiple checks.

In the 2025 Multiple Reference Check A/B test the edit completion rate for users presented multiple reference checks was lower (74%) than that for users presented a single reference check (75%).

By number of checks shown

Code
edit_completion_rate_bynchecks <- edit_completion_rates %>%
    filter(reference_check_shown == 1 & test_group == 'test (reference check available)')  %>% #limit to reference checks shown and test group    
    group_by(test_group, checks_shown_bucket) %>%
    summarise(n_edits = n_distinct(editing_session),
              n_saves = n_distinct(editing_session[saved_edit > 0])) %>%
    mutate(completion_rate = paste0(round(n_saves/n_edits * 100, 1), "%")) %>%  
    ungroup()%>%  
    mutate(n_edits = ifelse(n_edits < 50, "<50", n_edits),
           n_saves = ifelse(n_saves < 50, "<50", n_saves))  %>% #sanitizing per data publication guidelines
    group_by(test_group) %>%  
    gt()  %>%
    tab_header(
    title = "Reference Check edit completion rate by the number of checks shown"
      )  %>%
    opt_stylize(5) %>%
  cols_label(
    checks_shown_bucket = "Number of Reference Checks shown",
    n_edits = "Number of edit attempts shown Reference Check",
    n_saves = "Number of published edits",
    completion_rate = "Proportion of edits saved"
  ) %>%
    tab_source_note(
        gt::md('Limited to edit attempts shown or eligible to be shown at least one Reference Check')
    )


display_html(as_raw_html(edit_completion_rate_bynchecks))
Reference Check edit completion rate by the number of checks shown
Number of Reference Checks shown Number of edit attempts shown Reference Check Number of published edits Proportion of edits saved
test (reference check available)
1 629 560 89%
2 111 98 88.3%
3-5 107 97 90.7%
6-10 <50 <50 76.7%
over 10 <50 <50 43.8%
Limited to edit attempts shown or eligible to be shown at least one Reference Check

We don’t see a significant increase in edit abandonment rate when 2-5 Reference Checks are presented in a single session. However, the completion rate declines for 6-10 checks and falls substantially when more than 10 are shown. In the Paste Check Leading Indicators Analysis report we saw no increase in the edit abandonment rate even when more than three Paste Checks were shown in a single session.
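As an illustrative follow-up (not part of the original analysis), the published counts in the table above could be fed into a two-sample proportion test to check whether the difference between one check (560 of 629 saved) and two checks (98 of 111 saved) is statistically meaningful; the 6-10 and over-10 buckets have too few edits (<50) for a reliable comparison:

```r
# Illustrative significance check using the published (unsanitized) counts above.
# prop.test compares the completion proportions for 1 vs 2 checks shown.
prop.test(x = c(560, 98), n = c(629, 111))
```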

By platform

Code
edit_completion_rate_byplatform <- edit_completion_rates %>%
    #filter(reference_check_shown == 1) %>%
    filter(is_test_eligible == 'eligible') %>% #limit to eligible edits
    group_by(platform, test_group) %>%
    summarise(n_edits = n_distinct(editing_session),
              n_saves = n_distinct(editing_session[saved_edit > 0])) %>%
    mutate(completion_rate = paste0(round(n_saves/n_edits * 100, 1), "%")) %>% 
    #mutate(n_saves = ifelse(n_saves < 50, "<50", n_saves))%>% #sanitizing per data publication guideline
    select(-c(3,4)) %>% 
    gt()  %>%
    tab_header(
    title = "Reference Check edit completion rate by platform"
      )  %>%
    opt_stylize(5) %>%
  cols_label(
    test_group = "Experiment Group",
    platform = "Platform",
    #n_edits = "Number of edit attempts shown tone check",
    #n_saves = "Number of published edits",
    completion_rate = "Proportion of edits saved"
  ) %>%
    tab_source_note(
        gt::md('Limited to edit attempts shown or eligible to be shown at least one Reference Check')
    )


display_html(as_raw_html(edit_completion_rate_byplatform))
Reference Check edit completion rate by platform
Experiment Group Proportion of edits saved
mobile web
control (not shown Reference Check) 80.9%
test (reference check available) 79.7%
desktop
control (not shown Reference Check) 94.8%
test (reference check available) 89.2%
Limited to edit attempts shown or eligible to be shown at least one Reference Check

We observed slight decreases in edit completion on both platforms: a 1.5% relative decrease for mobile web edits in the treatment group compared to the control, and a 5.9% relative decrease on desktop.

In the 2024 Reference Check A/B test, the decrease was significantly larger on mobile than on desktop: edit completion rate fell by 24.3% (13.5 pp) on mobile, while on desktop it fell by only 3.1% (2.3 pp).

By user experience

Code
edit_completion_rate_byuserstatus <- edit_completion_rates %>%
    #filter(reference_check_shown == 1) %>%
    filter(is_test_eligible == 'eligible') %>% #limit to eligible edits
    group_by(experience_level_group, test_group) %>%
    summarise(n_edits = n_distinct(editing_session),
              n_saves = n_distinct(editing_session[saved_edit > 0])) %>%
    mutate(completion_rate = paste0(round(n_saves/n_edits * 100, 1), "%")) %>%   
    #select(-c(3,4)) %>% #data sanitizing for publication
    gt()  %>%
    tab_header(
    title = "Reference Check edit completion rate by editor experience"
      )  %>%
 opt_stylize(5) %>%
  cols_label(
    test_group = "Test Group",
    experience_level_group = "User Experience",
    n_edits = "Number of edit attempts shown Reference check",
    n_saves = "Number of published edits",
    completion_rate = "Proportion of edits saved"
  ) %>%
    tab_source_note(
        gt::md('Limited to edit attempts shown or eligible to be shown at least one Reference Check')
    )


display_html(as_raw_html(edit_completion_rate_byuserstatus))
Reference Check edit completion rate by editor experience
Test Group Number of edit attempts shown Reference check Number of published edits Proportion of edits saved
Unregistered
control (not shown Reference Check) 6753 5458 80.8%
test (reference check available) 177 150 84.7%
Newcomer
control (not shown Reference Check) 3648 3074 84.3%
test (reference check available) 124 108 87.1%
Junior Contributor
control (not shown Reference Check) 27777 26043 93.8%
test (reference check available) 608 534 87.8%
Limited to edit attempts shown or eligible to be shown at least one Reference Check

Edit completion rate increased for unregistered editors and newcomers in the treatment group, compared to the control.

Junior contributors in the treatment group saw a 6.4% relative decrease compared to those in the control. Differing completion rates by experience were also observed in the 2024 Multi Check Leading Indicators Analysis, where unregistered editors saw no notable difference, newcomers in the treatment group receiving multiple checks saw a slight decline, and junior contributors in the treatment group saw a 3% relative increase compared to the control.

C) Reference Check Revert Rate

Question: Is Reference Check causing any disruption?

Methodology: We reviewed the proportion of all published new content edits where Reference Check was shown at least once in an editing session that were reverted within 48 hours. This was compared to the revert rate of edits in the control group identified as eligible but not shown Reference Check.

Code
# load data for assessing reference check published data
edit_check_save_data <-
  read.csv(
    file = 'data/1-reference_check_save_data.tsv',
    header = TRUE,
    sep = "\t",
    stringsAsFactors = FALSE
  ) 
Code
# Set experience level group and factor levels
edit_check_save_data <- edit_check_save_data %>%
  mutate(
    experience_level_group = case_when(
     user_edit_count == 0 & user_status == 'registered' ~ 'Newcomer',
     user_edit_count == 0 & user_status == 'unregistered' ~ 'Unregistered',
      user_edit_count > 0 &  user_edit_count <= 100 ~ "Junior Contributor",
      user_edit_count >  100 ~ "Non-Junior Contributor"   
    ),
    experience_level_group = factor(experience_level_group,
         levels = c("Unregistered","Newcomer", "Non-Junior Contributor", "Junior Contributor")
   ))  

#rename experiment field to clarify
edit_check_save_data <- edit_check_save_data %>%
  mutate(test_group = factor(test_group,
         #levels = c('2025-09-editcheck-paste-control', '2025-09-editcheck-paste-test'),
         levels = c('2025-09-editcheck-addReference-control', '2025-09-editcheck-addReference-test'),
         labels = c("control (Reference Check not shown)", "test (Reference Check shown)")))


#rename platform from phone to mobile web to clarify meaning
edit_check_save_data <- edit_check_save_data %>%
  mutate(platform = factor(platform,
         levels = c('phone', 'desktop'),
         labels = c("mobile web", "desktop")))
Code
# set field to indicate if more than one check was shown in a single session. Note: This should only be applicable to the test group 

edit_check_save_data <- edit_check_save_data %>%
  mutate(
    multiple_checks_shown = 
         ifelse(n_checks_shown > 1, "multiple checks shown", "single check shown"),  
     multiple_checks_shown = factor( multiple_checks_shown ,
         levels = c("single check shown", "multiple checks shown")))
         
# note these buckets can be adjusted as needed based on distribution of data
edit_check_save_data <- edit_check_save_data %>%
  mutate(
    checks_shown_bucket = case_when(
     is.na(n_checks_shown) ~ '0',
     n_checks_shown == 1   ~ '1', 
     n_checks_shown == 2  ~ '2',
     n_checks_shown > 2 & n_checks_shown <= 5  ~ "3-5",
     n_checks_shown > 5 & n_checks_shown <= 10  ~ "6-10", 
     n_checks_shown > 10  ~ "over 10" 
    ),
    checks_shown_bucket = factor(checks_shown_bucket ,
         levels = c("0","1","2", "3-5", "6-10","over 10")
   ))  
Code
# define set of all eligible edits to review (eligible in control and activated in test)
edit_check_save_data <- edit_check_save_data %>%
    mutate(is_test_eligible  = ifelse(
        #(test_group == "test (Reference Check shown)" & is_test_eligible == 'eligible') | #circular reference, doesn't work
        (test_group == "test (Reference Check shown)" & was_reference_check_shown == 1) |
        (test_group == "control (Reference Check not shown)" & is_reference_check_eligible == 1) , 'eligible', 'not eligible'),
      is_test_eligible = 
      factor(
      is_test_eligible,
      levels = c("eligible",  "not eligible" )
  ))

Overall

Code
edit_check_reverts_overall <- edit_check_save_data %>%
    filter(is_new_content == 1 & is_test_eligible == 'eligible') %>% #limit to eligible edits
    group_by(test_group) %>%
    summarise(n_edits = n_distinct(editing_session),
              n_reverts = n_distinct(editing_session[was_reverted == 1])) %>% #count sessions reverted within 48 hours
    mutate(revert_rate = paste0(round(n_reverts/n_edits * 100, 1), "%"))
Code
# plot visualization of overall edit completion rates
dodge <- position_dodge(width=0.9)

p <- edit_check_reverts_overall %>%
    ggplot(aes(x= test_group, y = n_reverts/n_edits)) +
    geom_col(position = 'dodge', fill = 'dodgerblue4') +
    scale_y_continuous(labels = scales::percent) +
      geom_text(aes(label = paste(revert_rate), fontface=2), vjust=1.2, size = 10, color = "white") +
    scale_fill_manual(values= cbPalette, name = "Reason")  +
    labs (y = "Percent of edits reverted",
           x = "Experiment Group",
          title = "New content edit revert rate",
           caption = "Limited to published new content edits shown or eligible to be shown Reference Check")  +
    theme(
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        plot.title = element_text(hjust = 0.5),
        text = element_text(size=24),
        legend.position= "none",
        axis.line = element_line(colour = "black")) 
      
p

Overall, published new content edits shown Reference Check are reverted less frequently: a 13.7% relative decrease in revert rate for published edits where Reference Check was shown compared to edits eligible but not shown Reference Check.

This is steeper than the 8.6% relative decrease in the new content edit revert rate observed in the 2024 Reference Check A/B test when Reference Check was available.

Revert rates both for edits shown Reference Check (25.3%) and for edits eligible to be shown Reference Check (29.3%) are higher than the revert rates we observed in the 2024 Reference Check A/B test (25.6% for the control and 23.4% for the treatment) and in the 2025 multiple Edit Checks A/B test (22.5% for the control and 23.6% for the treatment).

Note: these revert rates include edits where the final published text may not include a reference. We plan to review the proportion of new content edits shown or eligible to be shown Reference Check that include a reference in the full A/B test analysis.

By whether multiple checks were shown

Code
edit_check_revert_bymultiple <- edit_check_save_data %>%
    filter(is_new_content == 1 & was_reference_check_shown == 1
          & test_group == 'test (Reference Check shown)' 
         ) %>%
    group_by( multiple_checks_shown) %>%
    summarise(n_edits = n_distinct(editing_session),
              n_reverts = n_distinct(editing_session[was_reverted == 1])) %>% #count sessions reverted within 48 hours
    mutate(revert_rate = paste0(round(n_reverts/n_edits * 100, 1), "%")) %>%   
    select(-c(2,3)) %>% # removing granular data columns for publication
    gt()  %>%
    tab_header(
    title = "New content edit revert rate by whether multiple checks were shown"
      )  %>%
    opt_stylize(5) %>%
  cols_label(
    multiple_checks_shown = "Multiple Checks",
    #n_edits = "Number of published new content edits",
    #n_reverts = "Number of edits reverted ",
    revert_rate = "Proportion of new content edits that were reverted"
  ) %>%
    tab_source_note(
        gt::md('Limited to published new content edits shown or eligible to be shown Reference Check')
    )


display_html(as_raw_html(edit_check_revert_bymultiple ))
New content edit revert rate by whether multiple checks were shown
Multiple Checks Proportion of new content edits that were reverted
single check shown 24.4%
multiple checks shown 27.6%
Limited to published new content edits shown or eligible to be shown Reference Check

We observed an increase in revert rate for edits that were shown multiple Reference Checks. This was also observed in the Tone Check leading indicator analysis (edits shown multiple tone checks were reverted more frequently).

In the 2025 Multiple Reference Check A/B test we observed a 34.7% decrease in revert rate when directly comparing edits presented multiple checks compared to edits presented a single reference check. It was noted that this decrease was likely in part because the types of edits that warrant multiple Reference Checks are less likely to be reverted than the types of edits that warrant only a single check.

By platform

Code
edit_check_revert_byplatform <- edit_check_save_data %>%
    #filter(is_new_content == 1 &was_reference_check_shown == 1) %>%
    filter(is_new_content == 1 & is_test_eligible == 'eligible') %>% #limit to eligible edits
    group_by(platform, test_group) %>%
    summarise(n_edits = n_distinct(editing_session),
              n_reverts = n_distinct(editing_session[was_reverted == 1])) %>% #count sessions reverted within 48 hours
    mutate(revert_rate = paste0(round(n_reverts/n_edits * 100, 1), "%")) %>%   
    select(-c(3,4)) %>% # removing granular data columns for publication
    gt()  %>%
    tab_header(
    title = "New content edit revert rate by platform"
      )  %>%
  opt_stylize(5) %>%
  cols_label(
    test_group = "Test Group",
    platform = "Platform",
    #n_edits = "Number of published new content edits",
   # n_reverts = "Number of edits reverted",
    revert_rate = "Proportion of new content edits that were reverted"
  ) %>%
    tab_source_note(
        gt::md('Limited to published new content edits shown or eligible to be shown Reference Check')
    )



display_html(as_raw_html(edit_check_revert_byplatform ))
New content edit revert rate by platform
Test Group Proportion of new content edits that were reverted
mobile web
control (Reference Check not shown) 45.8%
test (Reference Check shown) 29.1%
desktop
control (Reference Check not shown) 25.3%
test (Reference Check shown) 24.4%
Limited to published new content edits shown or eligible to be shown Reference Check

Treatment group editors on both mobile web and desktop saw lower revert rates than the control, with desktop treatment edits reverted less often (24.4%) than mobile web treatment edits (29.1%).

Mobile web editors in the treatment group saw a 36.5% relative decrease compared to the control group, while desktop editors saw a 3.6% relative decrease.

This is consistent with earlier findings: in the 2024 Reference Check A/B test, there was a slight decrease in the revert rate of new content on both desktop and mobile platforms. The relative revert rate decreased by 9.4% (1.7 pp) on desktop and by 5.9% (2 pp) on mobile.

In the 2025 Multiple Reference Check A/B test, we also observed higher revert rates for mobile web treatment group edits (29.1%) compared to desktop treatment group edits (24.4%).

By user experience

Code
edit_check_revert_byuserexp <- edit_check_save_data %>%
    #filter(is_new_content == 1 & was_reference_check_shown == 1) %>% #limit to eligible edits
    filter(is_new_content == 1 & is_test_eligible == 'eligible') %>% #limit to eligible edits
    group_by(experience_level_group,test_group) %>%
    summarise(n_edits = n_distinct(editing_session),
              n_reverts = n_distinct(editing_session[was_reverted == 1])) %>% #count sessions reverted within 48 hours
    mutate(revert_rate = paste0(round(n_reverts/n_edits * 100, 1), "%")) %>%
    select(-c(3,4)) %>% # removing granular data columns for publication
    gt()  %>%
    tab_header(
    title = "New content edit revert rate by user experience"
      )  %>%
   opt_stylize(5) %>%
  cols_label(
    test_group = "Experiment Group",
    experience_level_group  = "User Status",
    #n_edits = "Number of published new content edits",
    #n_reverts = "Number of edits reverted",
    revert_rate = "Proportion of new content edits that were reverted"
  ) %>%
    tab_source_note(
        gt::md('Limited to published new content edits shown or eligible to be shown Reference Check')
    )



display_html(as_raw_html(edit_check_revert_byuserexp))
New content edit revert rate by user experience
Experiment Group Proportion of new content edits that were reverted
Unregistered
control (Reference Check not shown) 36.8%
test (Reference Check shown) 32.2%
Newcomer
control (Reference Check not shown) 42.3%
test (Reference Check shown) 39.6%
Junior Contributor
control (Reference Check not shown) 25.1%
test (Reference Check shown) 20.4%
Limited to published new content edits shown or eligible to be shown Reference Check

Revert rates decreased across all user types, consistent with the 2024 Reference Check A/B test and in part with the 2025 Multiple Reference Check A/B test analyses where revert rates decreased for newcomers and slightly increased for junior contributors and unregistered editors when comparing treatment to control groups. The 2025 Multiple Reference Checks A/B test highlights, “Results vary slightly based on the type of user completing the edit but none of the observed changes were statistically significant.”

D) Reference Check Dismissal Rate (Users that select to keep text without adding a reference)

Question: Do people find Reference Check relevant?

Methodology: We reviewed the proportion of published edits shown Reference Check wherein people elected to keep the text they added (i.e. the Reference Check was dismissed). This was determined by edits where the user dismissed a Reference Check at least once in a session (event.feature = 'editCheck-addReference' AND event.action = 'action-reject').

The analysis includes splits by the reason the user selected for keeping the text.
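A minimal sketch of how dismissal sessions could be flagged from the event stream, assuming a tidy event-level table; the column names (`feature`, `action`, `editing_session`) are illustrative stand-ins based on the event fields named above, not the actual schema:

```r
library(dplyr)

# Hypothetical event-level data; real field names may differ.
events <- tibble::tribble(
  ~editing_session, ~feature,                 ~action,
  "s1",             "editCheck-addReference", "action-reject",
  "s2",             "editCheck-addReference", "action-accept",
  "s3",             "editCheck-addReference", "action-reject"
)

# A session counts as a dismissal if it rejected a Reference Check at least once
dismissal_sessions <- events %>%
  filter(feature == "editCheck-addReference", action == "action-reject") %>%
  distinct(editing_session)
```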

Code
# load data for assessing edit reject frequency
edit_check_reject_data <-
  read.csv(
    file = 'data/reference_check_rejects_data.tsv',
    header = TRUE,
    sep = "\t",
    stringsAsFactors = FALSE
  )

Code
# Set experience level group and factor levels
edit_check_reject_data <- edit_check_reject_data %>%
  mutate(
    experience_level_group = case_when(
      user_edit_count == 0 & user_status == 'registered' ~ 'Newcomer',
      user_edit_count == 0 & user_status == 'unregistered' ~ 'Unregistered',
      user_edit_count > 0 & user_edit_count <= 100 ~ "Junior Contributor",
      user_edit_count > 100 ~ "Non-Junior Contributor"
    ),
    experience_level_group = factor(experience_level_group,
         levels = c("Unregistered", "Newcomer", "Non-Junior Contributor", "Junior Contributor")
   ))

#rename experiment field to clarify
edit_check_reject_data <- edit_check_reject_data %>%
  mutate(test_group = factor(test_group,
         #levels = c('2025-09-editcheck-paste-control', '2025-09-editcheck-paste-test'),
         levels = c('2025-09-editcheck-addReference-control', '2025-09-editcheck-addReference-test'),
         labels = c("control (no Reference Check)", "test (shown Reference Check)")))

#rename platform from phone to mobile web to clarify meaning
edit_check_reject_data <- edit_check_reject_data %>%
  mutate(platform = factor(platform,
         levels = c('phone', 'desktop'),
         labels = c("mobile web", "desktop")))

Code
# Set fields and factor levels to assess number of checks shown
edit_check_reject_data <- edit_check_reject_data %>%
  mutate(
    multiple_checks_shown =
         ifelse(n_checks_shown > 1, "multiple checks shown", "single check shown"),
    multiple_checks_shown = factor(multiple_checks_shown,
         levels = c("single check shown", "multiple checks shown")))

# note these buckets can be adjusted as needed based on distribution of data
edit_check_reject_data <- edit_check_reject_data %>%
  mutate(
    checks_shown_bucket = case_when(
      is.na(n_checks_shown) ~ '0',
      n_checks_shown == 1 ~ '1',
      n_checks_shown == 2 ~ '2',
      n_checks_shown > 2 & n_checks_shown <= 5 ~ "3-5",
      n_checks_shown > 5 & n_checks_shown <= 10 ~ "6-10",
      n_checks_shown > 10 ~ "over 10"
    ),
    checks_shown_bucket = factor(checks_shown_bucket,
         levels = c("0", "1", "2", "3-5", "6-10", "over 10")
   ))

Code
# shorten and clarify reason field names
edit_check_reject_data <- edit_check_reject_data %>%
  mutate(
    reject_reason = case_when(
      reject_reason == 'no_reject_reason' ~ 'No reason provided',
      reject_reason == 'edit-check-feedback-reason-other' ~ 'Other',
      reject_reason == 'edit-check-feedback-reason-uncertain' ~ 'Uncertain',
      reject_reason == 'edit-check-feedback-reason-common-knowledge' ~ 'Common knowledge',
      reject_reason == 'edit-check-feedback-reason-irrelevant' ~ 'Irrelevant'
    ),
    reject_reason = factor(reject_reason,
         levels = c("No reason provided", "Other", "Uncertain", "Common knowledge", "Irrelevant")
   ))

Overall

Code
# overall dismissal rate
edit_check_dismissal_overall <- edit_check_reject_data %>%
    filter(was_reference_check_shown == 1) %>% #limit to edits where Reference Check was shown
    summarise(n_edits = n_distinct(editing_session),
              n_rejects = n_distinct(editing_session[n_rejects > 0])) %>% #count sessions where a check was dismissed
    mutate(dismissal_rate = paste0(round(n_rejects/n_edits * 100, 1), "%")) %>%
    gt() %>%
    tab_header(
    title = "Reference Check dismissal rate"
      ) %>%
    opt_stylize(5) %>%
  cols_label(
    n_edits = "Number of edits shown Reference Check",
    n_rejects = "Number of edits that dismissed Reference Check",
    dismissal_rate = "Proportion of edits where Reference Check was dismissed"
  ) %>%
    tab_source_note(
        gt::md('Limited to published edits where at least one Reference Check was shown')
    )


display_html(as_raw_html(edit_check_dismissal_overall))

Users elected to keep the text without adding a reference in 62% of edits shown Reference Check.

This dismissal rate is higher than the rates observed for Paste Check (55%), Tone Check (57%), and the 2024 Reference Check A/B test, in which 51.8% of contributors dismissed the citation prompt by explicitly indicating that the information they were adding did not need a reference.

By dismissal reason

Code
edit_check_dismissal_byreason_overall <- edit_check_reject_data %>%
    filter(was_reference_check_shown == 1 & n_rejects > 0) %>% #limit to edits where shown and user elected to keep text
    group_by(reject_reason) %>%
    summarise(n_edits_rejected = n_distinct(editing_session)) %>%
    mutate(select_rate = paste0(round(n_edits_rejected/sum(n_edits_rejected) * 100, 1), "%"))

Code
# plot bar chart of reason selection
dodge <- position_dodge(width=0.9)

p <- edit_check_dismissal_byreason_overall %>%
    # Reorder by frequency (descending)
    arrange(desc(n_edits_rejected)) %>%
    mutate(reject_reason = factor(reject_reason, levels = unique(reject_reason))) %>%
    ggplot(aes(x = reject_reason, y = n_edits_rejected/sum(n_edits_rejected))) +
    geom_col(position = 'dodge', fill = 'dodgerblue4') +
    scale_y_continuous(labels = scales::percent) +
    geom_text(aes(label = paste(select_rate, "", n_edits_rejected, "edits"), fontface=2), vjust=1.2, size = 10, color = "white") +
    scale_fill_manual(values = cbPalette, name = "Reason") +
    labs(y = "Percent of edits",
         x = "Selected reason",
         title = "Reasons users selected for keeping the text without adding a reference - Sorted",
         caption = "Limited to published edits where a user selected to keep the text without adding a reference") +
    theme(
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        plot.title = element_text(hjust = 0.5),
        text = element_text(size=24),
        legend.position = "none",
        axis.line = element_line(colour = "black"))

p

By whether multiple checks were shown

Code
edit_check_dismissal_bymultiple <- edit_check_reject_data %>%
    filter(was_reference_check_shown == 1) %>% #limit to edits where Reference Check was shown
    group_by(multiple_checks_shown) %>%
    summarise(n_edits = n_distinct(editing_session),
              n_rejects = n_distinct(editing_session[n_rejects > 0])) %>% #count sessions where a check was dismissed
    mutate(dismissal_rate = paste0(round(n_rejects/n_edits * 100, 1), "%")) %>%
    gt() %>%
    tab_header(
    title = "Reference Check dismissal rate by whether multiple checks were shown"
      ) %>%
    opt_stylize(5) %>%
  cols_label(
    multiple_checks_shown = "Multiple Checks",
    n_edits = "Number of edits shown Reference Check",
    n_rejects = "Number of edits that dismissed Reference Check",
    dismissal_rate = "Proportion of edits where Reference Check was dismissed"
  ) %>%
    tab_source_note(
        gt::md('Limited to published edits where at least one Reference Check was shown')
    )


display_html(as_raw_html(edit_check_dismissal_bymultiple))

We see a higher dismissal rate if more checks are shown; this was not observed with Tone Check.

In the 2025 Multiple Reference Check A/B test, we observed a 66.3% dismissal rate for the single check group compared to 54.8% for the multiple checks treatment group.

By platform

Code
edit_check_dismissal_byplatform <- edit_check_reject_data %>%
    filter(was_reference_check_shown == 1) %>% #limit to edits where Reference Check was shown
    group_by(platform) %>%
    summarise(n_edits = n_distinct(editing_session),
              n_rejects = n_distinct(editing_session[n_rejects > 0])) %>% #count sessions where a check was dismissed
    mutate(dismissal_rate = paste0(round(n_rejects/n_edits * 100, 1), "%")) %>%
    ungroup() %>%
    #mutate(n_edits = ifelse(n_edits < 50, "<50", n_edits),
    #       n_rejects = ifelse(n_rejects < 50, "<50", n_rejects)) %>% #sanitizing per data publication guidelines
    #select(-2) %>%
    gt() %>%
    tab_header(
    title = "Reference Check dismissal rate by platform"
      ) %>%
    opt_stylize(5) %>%
  cols_label(
    platform = "Platform",
    n_edits = "Number of edits shown Reference Check",
    n_rejects = "Number of edits that dismissed Reference Check",
    dismissal_rate = "Proportion of edits where Reference Check was dismissed"
  ) %>%
    tab_source_note(
        gt::md('Limited to published edits where at least one Reference Check was shown')
    )


display_html(as_raw_html(edit_check_dismissal_byplatform))

Users are slightly more likely to keep the text without adding a reference on mobile web: they elected to keep the text in 62.8% of published mobile web edits where Reference Check was shown, compared to 61.9% of published desktop edits, a roughly 1.4% relative difference. This is directionally similar to the 2024 Reference Check A/B test, where the rate of reference checks declined was higher on mobile (58%) than on desktop (44.3%).

The Paste Check Leading Indicators Analysis report showed the opposite: users selected to keep the pasted text in 48% of published mobile web edits where Paste Check was shown, compared to 56% of published desktop edits, a 14% relative decrease in the Paste Check dismissal rate on mobile compared to desktop.

Dismissal reason by platform

Code
edit_check_dismissal_byreason_byplatform <- edit_check_reject_data %>%
    filter(was_reference_check_shown == 1 & n_rejects > 0) %>% #limit to edits where shown and user elected to keep text
    group_by(platform, reject_reason) %>%
    summarise(n_edits_rejected = n_distinct(editing_session)) %>%
    mutate(select_rate = round(n_edits_rejected/sum(n_edits_rejected), 2))

Code
# plot bar chart of reason selection
dodge <- position_dodge(width=0.9)

p <- edit_check_dismissal_byreason_byplatform %>%
    # Calculate total across platforms for sorting
    group_by(reject_reason) %>%
    mutate(total_edits = sum(n_edits_rejected)) %>%
    ungroup() %>%
    # Reorder by overall frequency (descending)
    arrange(desc(total_edits)) %>%
    mutate(reject_reason = factor(reject_reason, levels = unique(reject_reason))) %>%
    ggplot(aes(x = reject_reason, y = select_rate, fill = reject_reason)) +
    geom_col(position = 'dodge') +
    scale_y_continuous(labels = scales::percent) +
    geom_text(aes(label = paste0(select_rate * 100, "%"), fontface=2), vjust=1.2, size = 10, color = "white") +
    facet_grid(~ platform) +
    labs(y = "Percent of edits",
         x = "Selected reason",
         title = "Reasons users selected for keeping text without adding a reference - Sorted") +
    scale_fill_manual(values = cbPalette, name = "Reason") +
    theme(
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        plot.title = element_text(hjust = 0.5),
        text = element_text(size=24),
        legend.position = "bottom",
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.line = element_line(colour = "black"))

p

By user experience

edit_check_dismissal_byuserexp <- edit_check_reject_data %>%
    filter(was_reference_check_shown == 1) %>% # limit to edits where shown
    group_by(experience_level_group) %>%
    summarise(n_edits = n_distinct(editing_session),
              n_rejects = n_distinct(editing_session[n_rejects > 0])) %>% # limit to new content edits without a reference
    mutate(dismissal_rate = paste0(round(n_rejects / n_edits * 100, 1), "%")) %>%
    ungroup() %>%
    mutate(n_edits = ifelse(n_edits < 50, "<50", n_edits),
           n_rejects = ifelse(n_rejects < 50, "<50", n_rejects)) %>% # sanitizing per data publication guidelines
    #select(-2) %>%
    gt() %>%
    tab_header(
        title = "Reference Check dismissal rate by user experience"
    ) %>%
    opt_stylize(5) %>%
    cols_label(
        experience_level_group = "User Experience",
        n_edits = "Number of edits shown Reference Check",
        n_rejects = "Number of edits that dismissed Reference Check",
        dismissal_rate = "Proportion of edits where Reference Check was dismissed"
    ) %>%
    tab_source_note(
        gt::md('Limited to published edits where at least one Reference Check was shown')
    )

display_html(as_raw_html(edit_check_dismissal_byuserexp))

Unlike in the Paste Check Leading Indicators Analysis report and the Tone Check analysis, unregistered users dismissed Reference Check at higher rates than Newcomers and Junior Contributors.

Dismissal reason by user experience

edit_check_dismissal_byreason_byuserexp <- edit_check_reject_data %>%
    filter(was_reference_check_shown == 1 & n_rejects > 0) %>% # limit to edits where shown and user elected to keep text
    group_by(experience_level_group, reject_reason) %>%
    summarise(n_edits_rejected = n_distinct(editing_session)) %>%
    mutate(select_rate = round(n_edits_rejected / sum(n_edits_rejected), 2))

# plot bar chart of reason selection
dodge <- position_dodge(width=0.9)

p <- edit_check_dismissal_byreason_byuserexp %>%
    # Calculate total across experience groups for sorting
    group_by(reject_reason) %>%
    mutate(total_edits = sum(n_edits_rejected)) %>%
    ungroup() %>%
    # Reorder by overall frequency (descending)
    arrange(desc(total_edits)) %>%
    mutate(reject_reason = factor(reject_reason, levels = unique(reject_reason))) %>%
    ggplot(aes(x= reject_reason, y = select_rate, fill = reject_reason)) +
    geom_col(position = 'dodge') +
    scale_y_continuous(labels = scales::percent) +
      geom_text(aes(label = paste0(select_rate * 100, "%"), fontface=2), vjust=1.2, size = 10, color = "white") +
    facet_grid( ~ experience_level_group) +
    labs (y = "Percent of edits ",
           x = "Selected reason",
          title = "Reasons users selected for keeping text without adding a reference - Sorted")  +
    scale_fill_manual(values= cbPalette, name = "Reason")  +
    theme(
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        plot.title = element_text(hjust = 0.5),
        text = element_text(size=24),
        legend.position= "bottom",
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.line = element_line(colour = "black")) 
      
p