Report of Tone Check Leading Indicators

Published

September 24, 2025

Modified

October 2, 2025

Overview

The Editing team is evaluating the impact of Tone Check through an A/B test.

Tone Check is an Edit check that uses a language model to prompt people adding promotional, derogatory, or otherwise subjective language to consider “neutralizing” the tone of what they are writing. Multiple tone checks can be shown within an editing session either during the edit or during the save process. You can find more details about this check on the Project Page.

The Tone Check A/B test was deployed on 3 September 2025 to French, Japanese, and Portuguese Wikipedias. Prior to completing the full analysis, we reviewed the following set of leading indicators two weeks after starting the Tone Check A/B test:

  1. Proportion of edits Tone Check is shown within
  2. Proportion of contributors that are presented Tone Check and complete their edits
  3. Proportion of edits wherein people elect to dismiss/not change the text they’ve added
  4. Proportion of published edits that add new content and are reverted within 48 hours
  5. Proportion of people blocked after publishing an edit where Tone Check was shown
  6. Proportion of edits that are published before the model is able to return an evaluation

Decision to be made: What – if any – adjustments/investigations will we prioritize for us to be confident moving forward with evaluating the Tone Check’s impact in T387918?

Please see the task description for additional details.

Methodology

  • We collected two weeks of AB test events logged between 8 September 2025 and 22 September 2025 on French, Japanese, and Portuguese Wikipedia.
  • In this AB test, users in the test group are shown tone check when they attempt an edit in VisualEditor that meets the requirements for the check to be shown. The control group is provided the default editing experience, where no tone check is shown.
  • For each leading indicator metric, we reviewed the following dimensions: by experiment group (test and control), by platform (mobile web or desktop), by user experience and status, and by partner Wikipedia. For some indicators, such as completion rate, we also reviewed results by the number of checks shown within a single editing session.
  • We relied on events logged in VisualEditorFeatureUse and change tags recorded in the revision tags table. See instrumentation spec.
  • Data was limited to mobile and desktop edits made to main namespace pages using VisualEditor on one of the partner Wikipedias. We also limited the data to edits by unregistered users and users with 100 or fewer edits, as those are the users that would be shown tone check under the default config settings.

Note

Results are based on initial AB test data and are intended to check whether any adjustments to the feature need to be prioritized. More event data will be needed to confirm statistical significance for many of these findings, especially for any per user experience or per Wikipedia breakdowns. We will review the complete AB test data as part of the analysis in T387918.

Summary of results

  1. Proportion of edits Tone Check is shown within

    • Tone Check has been shown within 421 editing sessions across all three partner Wikipedias over the reviewed two-week timeframe. This represents only about 0.1% of all edit attempts. When limited to saved edits, Tone Check has been shown in 9% of all published new content edits (125 of 1,377 published edits in the test group) since the AB test started.

    • It appears slightly more often in published desktop edits than in mobile ones: it has been shown in 9.5% of all published new content edits on desktop and 7.6% of all published new content edits on mobile.

  2. Proportion of contributors that are presented Tone Check and complete their edits

    • We’ve observed a slight 2.3% relative decrease in the edit completion rate for edits shown tone check compared to eligible edits in the control group. In the test group, 66.7% of all edits shown tone check were successfully completed compared to 68.3% in the control group. So far, there have been no significant decreases in edit completion rate by experience level, Wikipedia, or for editing sessions where multiple tone checks were shown.
  3. Proportion of edits wherein people elect to dismiss/not change the text they’ve added

    • A little over half of all published edits where tone check was shown (57%) included at least one tone check that the user dismissed. This is similar to the rates observed for Reference Check.
    • Tone checks are dismissed more frequently on desktop compared to mobile. 63.8% of all published desktop edits where tone check was shown include at least one check that was dismissed compared to 39% of all published mobile edits.
  4. Proportion of published edits that add new content and are reverted within 48 hours

    • There have been no significant changes in the revert rate of all new content edits overall or by platform or Wikipedia. However, we’ve observed decreases in revert rate when limiting to edits where tone check was shown or eligible to be shown.
    • For edits where tone check was shown at least once in an editing session, we’ve observed a 5.3% decrease in the revert rate of desktop edits and a 19% decrease in the revert rate of mobile edits compared to eligible edits in the control group.
    • For edits shown tone check where text was revised to address the issue, the revert rate decreased by almost a factor of two compared to eligible control edits. More published edit data is needed to confirm impacts on a per Wikipedia and per platform basis.
  5. Proportion of people blocked after publishing an edit where Tone Check was shown

    • Less than 1% of users have been blocked after publishing an edit where at least one tone check was shown.
  6. Proportion of edits that are published before the model is able to return an evaluation

    • Only about 0.6% of all published edits (264 edits) in the AB test were saved before the model returned an evaluation. The majority of these edits occurred in the control group and on desktop.
Code
# load packages
shhh <- function(expr) suppressPackageStartupMessages(suppressWarnings(suppressMessages(expr)))
shhh({
    library(lubridate)
    library(ggplot2)
    library(dplyr)
    library(gt)
    library(IRdisplay)
})
#set preferences
options(dplyr.summarise.inform = FALSE)
options(repr.plot.width = 15, repr.plot.height = 10)

Proportion of published edits shown at least one Tone Check

Question: Are newcomers encountering Tone Check?

Methodology: We reviewed the number of published new content edits where at least one Tone Check was shown during the editing session (event.feature = 'editCheck-Tone' AND event.action IN ('check-shown-midedit', 'check-shown-presave')).

This analysis was specifically limited to edits that were successfully published and identified as new content edits with the tag editcheck-newcontent.
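
The session-level flag described in this methodology can be illustrated with a small sketch. The `events` data frame, its column names, and the placeholder values `some-other-feature` and `some-other-action` are hypothetical stand-ins for the logged event fields; the query that actually produced `tone_check_frequency_data.tsv` is not shown in this report and may differ.

```r
# Hypothetical event-level rows; column names and placeholder values are assumptions.
events <- data.frame(
  editing_session = c("a", "a", "b", "c", "c"),
  feature = c("editCheck-Tone", "editCheck-Tone", "some-other-feature",
              "editCheck-Tone", "some-other-feature"),
  action = c("check-shown-midedit", "check-shown-presave", "check-shown-presave",
             "some-other-action", "saveSuccess"),
  stringsAsFactors = FALSE
)

# An event counts as "tone check shown" only when both conditions hold.
shown_actions <- c("check-shown-midedit", "check-shown-presave")
events$is_tone_shown <- events$feature == "editCheck-Tone" &
  events$action %in% shown_actions

# Collapse to one row per editing session: number of checks shown and a 0/1 flag.
n_checks <- tapply(events$is_tone_shown, events$editing_session, sum)
sessions <- data.frame(
  editing_session = names(n_checks),
  n_checks_shown = as.integer(n_checks),
  was_tone_check_shown = as.integer(n_checks > 0)
)
print(sessions)
```

In this toy data only session `a` is flagged: session `b` saw a different check, and session `c` has a Tone Check event that is not one of the two "shown" actions.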

Code
#load frequency data
tone_check_frequency_data <-
  read.csv(
    file = 'Queries/data/tone_check_frequency_data.tsv',
    header = TRUE,
    sep = "\t",
    stringsAsFactors = FALSE
  ) 
Code

# Set experience level group and factor levels
tone_check_frequency_data <- tone_check_frequency_data %>%
  mutate(
    experience_level_group = case_when(
     user_edit_count == 0 & user_status == 'registered' ~ 'Newcomer',
     user_edit_count == 0 & user_status == 'unregistered' ~ 'Unregistered',
      user_edit_count > 0 &  user_edit_count <= 100 ~ "Junior Contributor",
      user_edit_count >  100 ~ "Non-Junior Contributor"   
    ),
    experience_level_group = factor(experience_level_group,
         levels = c("Unregistered", "Newcomer", "Junior Contributor", "Non-Junior Contributor")
   ))  

#rename experiment field to clarify
tone_check_frequency_data <- tone_check_frequency_data %>%
  mutate(test_group = factor(test_group,
         levels = c('2025-09-editcheck-tone-control', '2025-09-editcheck-tone-test'),
         labels = c("control (no tone check)", "test (tone check available)")))
Code
#Set fields and factor levels to assess number of checks shown

tone_check_frequency_data <- tone_check_frequency_data %>%
  mutate(
    multiple_checks_shown = 
         ifelse(n_checks_shown > 1, 1, 0),  
     multiple_checks_shown = factor( multiple_checks_shown ,
         levels = c(0,1)))
         
# note these buckets can be adjusted as needed based on distribution of data
tone_check_frequency_data <- tone_check_frequency_data %>%
  mutate(
    checks_shown_bucket = case_when(
     is.na(n_checks_shown) ~ '0',
     n_checks_shown == 1   ~ '1', 
     n_checks_shown == 2  ~ '2',
     n_checks_shown > 2 & n_checks_shown <= 5  ~ "3-5",
     n_checks_shown > 5 & n_checks_shown <= 10  ~ "6-10", 
     n_checks_shown > 10 ~ "over 10" 
    ),
    checks_shown_bucket = factor(checks_shown_bucket ,
         levels = c("0","1","2", "3-5", "6-10", "over 10")
   ))  
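
As an alternative to the `case_when` call above, the same buckets could be expressed with base R's `cut()`. This is a sketch rather than the report's method: the helper name `bucket_checks_shown` is hypothetical, and it assumes `n_checks_shown` is `NA` for sessions where no tone check was shown. One difference to be aware of is that a literal count of 0 lands in the "0" bucket here, whereas the `case_when` version would leave it as `NA`.

```r
# Hypothetical helper: map a count of checks shown to the report's bucket labels.
bucket_checks_shown <- function(n_checks_shown) {
  labels <- c("0", "1", "2", "3-5", "6-10", "over 10")
  # Treat missing counts as zero checks shown, mirroring is.na(...) ~ '0'.
  n <- ifelse(is.na(n_checks_shown), 0, n_checks_shown)
  # Right-closed intervals: (-Inf,0], (0,1], (1,2], (2,5], (5,10], (10,Inf)
  cut(n, breaks = c(-Inf, 0, 1, 2, 5, 10, Inf), labels = labels, right = TRUE)
}

buckets <- bucket_checks_shown(c(NA, 1, 2, 3, 5, 6, 10, 11))
print(buckets)
```

Because the breaks are right-closed, the boundary values 5 and 10 fall in "3-5" and "6-10" respectively, which matches the `<=` comparisons in the original.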

Overall proportion of all edit attempts

We first looked at the proportion of all VE edit attempts to determine how frequently across all editing sessions this check was appearing. Note: This includes a large number of edits that were abandoned prior to reaching a point where tone check would be shown.

Code
tone_checks_shown_overall <- tone_check_frequency_data %>%
    filter(test_group == "test (tone check available)") %>% #limit to test group edits
    group_by(test_group) %>%
    summarise(n_editing_session = n_distinct(editing_session),
              n_editing_session_tonecheck = n_distinct(editing_session[was_tone_check_shown == 1])) %>%
    mutate(prop_check_shown = paste0(round(n_editing_session_tonecheck/n_editing_session * 100, 1), "%"))  %>%
    gt()  %>%
    tab_header(
    title = "Edit attempts shown at least one tone check by experiment group"
      )  %>%
    opt_stylize(5) %>%
  cols_label(
    test_group = "Experiment Group",
    n_editing_session = "Number of edit attempts",
    n_editing_session_tonecheck = "Number of edit attempts shown tone check",
    prop_check_shown = "Proportion of edit attempts shown tone check"
  ) %>%
    tab_source_note(
        gt::md('Limited to edits by unregistered users and users with 100 or fewer edits')
    )


display_html(as_raw_html(tone_checks_shown_overall))
Edit attempts shown at least one tone check by experiment group

| Experiment Group | Number of edit attempts | Number of edit attempts shown tone check | Proportion of edit attempts shown tone check |
|---|---|---|---|
| test (tone check available) | 701484 | 421 | 0.1% |

Limited to edits by unregistered users and users with 100 or fewer edits

Overall proportion of saved new content edits

We then limited to saved new content edits to check the prevalence of tone checks in published edits.

Code
tone_checks_shown_saved_overall <- tone_check_frequency_data %>%
    filter(test_group == "test (tone check available)" & #limit to test group edits
            was_saved == 1 & is_new_content ==1) %>% #limit to published  new content edits
    group_by(test_group) %>%
    summarise(n_editing_session = n_distinct(editing_session),
              n_editing_session_tonecheck = n_distinct(editing_session[was_tone_check_shown == 1])) %>%
    mutate(prop_check_shown = paste0(round(n_editing_session_tonecheck/n_editing_session * 100, 1), "%"))  %>%
    gt()  %>%
    tab_header(
    title = "Published new content edits shown at least one tone check by experiment group"
      )  %>%
    opt_stylize(5) %>%
  cols_label(
    test_group = "Experiment Group",
    n_editing_session = "Number of edits",
    n_editing_session_tonecheck = "Number of edits shown tone check",
    prop_check_shown = "Proportion of edits shown tone check"
  ) %>%
    tab_source_note(
        gt::md('Limited to published new content edits by unregistered users and users with 100 or fewer edits')
    )


display_html(as_raw_html(tone_checks_shown_saved_overall))
Published new content edits shown at least one tone check by experiment group

| Experiment Group | Number of edits | Number of edits shown tone check | Proportion of edits shown tone check |
|---|---|---|---|
| test (tone check available) | 1377 | 125 | 9.1% |

Limited to published new content edits by unregistered users and users with 100 or fewer edits

By whether multiple checks were shown

Code
tone_checks_shown_saved_bymultiple <- tone_check_frequency_data %>%
    filter(test_group == "test (tone check available)" & #limit to test group edits
            was_saved == 1 & is_new_content ==1) %>% #limit to published  new content edits
    group_by(test_group) %>%
    summarise(n_editing_session = n_distinct(editing_session),
              n_editing_session_multicheck = n_distinct(editing_session[was_tone_check_shown == 1 & multiple_checks_shown == 1])) %>%
    mutate(prop_check_shown = paste0(round(n_editing_session_multicheck/n_editing_session * 100, 1), "%")) %>% 
    gt()  %>%
    tab_header(
    title = "Published new content edits shown multiple tone checks in the test group"
      )  %>%
    opt_stylize(5) %>%
  cols_label(
    test_group = "Experiment Group",
    n_editing_session = "Number of edits",
    n_editing_session_multicheck = "Number of edits shown multiple tone checks",   
    prop_check_shown = "Proportion of edits shown multiple tone checks"
  ) %>%
    tab_source_note(
        gt::md('Limited to published new content edits by unregistered users and users with 100 or fewer edits')
    )


display_html(as_raw_html(tone_checks_shown_saved_bymultiple))
Published new content edits shown multiple tone checks in the test group

| Experiment Group | Number of edits | Number of edits shown multiple tone checks | Proportion of edits shown multiple tone checks |
|---|---|---|---|
| test (tone check available) | 1377 | 93 | 6.8% |

Limited to published new content edits by unregistered users and users with 100 or fewer edits

By number of checks shown

Code
tone_checks_shown_saved_bynchecks <- tone_check_frequency_data %>%
     filter(test_group == "test (tone check available)" & #limit to test group edits
            was_saved == 1 & is_new_content ==1) %>% #limit to published  new content edits
    #filter(was_tone_check_shown == 1) %>% #if you want to limit to just edits shown tone check
    mutate(total_sessions = n_distinct(editing_session)) %>%
    group_by(total_sessions, checks_shown_bucket) %>%
    summarise(n_editing_session_tonecheck = n_distinct(editing_session)) %>%
    mutate(prop_check_shown = paste0(round(n_editing_session_tonecheck/total_sessions * 100, 2), "%")) %>%
    ungroup() %>%
    select(-c(1,3)) %>%
    #mutate(n_editing_session_tonecheck = ifelse(n_editing_session_tonecheck < 50, "<50", n_editing_session_tonecheck))  %>% #sanitizing per data publication guidelines
    gt()  %>%
    tab_header(
    title = "Published new content edits by total number of tone checks shown"
      )  %>%
    opt_stylize(5) %>%
  cols_label(
    checks_shown_bucket = "Number of tone checks shown",
    #n_editing_session_tonecheck = "Number of edits",   
    prop_check_shown = "Proportion of edits"
  ) %>%
    tab_source_note(
        gt::md('Limited to published new content edits by unregistered users and users with 100 or fewer edits')
    )


display_html(as_raw_html(tone_checks_shown_saved_bynchecks))
Published new content edits by total number of tone checks shown

| Number of tone checks shown | Proportion of edits |
|---|---|
| 0 | 90.92% |
| 1 | 2.32% |
| 2 | 2.69% |
| 3-5 | 2.69% |
| 6-10 | 0.73% |
| over 10 | 0.65% |

Limited to published new content edits by unregistered users and users with 100 or fewer edits
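
As a quick consistency check, the bucket shares in this table can be summed to recover the headline figures quoted elsewhere in the report. The sketch below only re-adds the published percentages, so small rounding differences against the count-based tables are expected.

```r
# Percentages copied from the table above (share of published new content edits).
bucket_share <- c("0" = 90.92, "1" = 2.32, "2" = 2.69,
                  "3-5" = 2.69, "6-10" = 0.73, "over 10" = 0.65)

# All buckets together should account for every edit (up to rounding).
total <- sum(bucket_share)

# At least one check shown: everything except the "0" bucket.
any_check <- sum(bucket_share[names(bucket_share) != "0"])

# More than one check shown: buckets "2" and above.
multiple_checks <- sum(bucket_share[c("2", "3-5", "6-10", "over 10")])

# Six or more checks shown: the two largest buckets.
six_or_more <- sum(bucket_share[c("6-10", "over 10")])

print(round(c(total = total, any_check = any_check,
              multiple_checks = multiple_checks, six_or_more = six_or_more), 2))
```

Note that the "6-10" bucket starts at six checks, so the last sum describes sessions with six or more checks shown.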

By platform

Code
tone_checks_shown_byplatform <- tone_check_frequency_data %>%
      filter(test_group == "test (tone check available)" & #limit to test group edits
            was_saved == 1 & is_new_content ==1) %>% #limit to published  new content edits
    group_by(platform) %>%
    summarise(n_editing_session = n_distinct(editing_session),
              n_editing_session_tonecheck = n_distinct(editing_session[was_tone_check_shown == 1])) %>%
    mutate(prop_check_shown = paste0(round(n_editing_session_tonecheck/n_editing_session * 100, 1), "%"))  %>%
    mutate(n_editing_session_tonecheck = ifelse(n_editing_session_tonecheck < 50, "<50", n_editing_session_tonecheck))%>% #sanitizing per data publication guideline
    select(-2) %>% #removing total number of edits column to sanitize data for publication
    gt()  %>%
    tab_header(
    title = "Published new content edits shown at least one tone check by platform"
      )  %>%
    opt_stylize(5) %>%
  cols_label(
    platform = "Platform",
    #n_editing_session = "Number of edits",
    n_editing_session_tonecheck = "Number of edits shown tone check",   
    prop_check_shown = "Proportion of edits shown tone check"
  ) %>%
    tab_source_note(
        gt::md('Limited to published new content edits by unregistered users and users with 100 or fewer edits')
    )


display_html(as_raw_html(tone_checks_shown_byplatform))
Published new content edits shown at least one tone check by platform

| Platform | Number of edits shown tone check | Proportion of edits shown tone check |
|---|---|---|
| desktop | 99 | 9.5% |
| phone | <50 | 7.6% |

Limited to published new content edits by unregistered users and users with 100 or fewer edits
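
The `ifelse(... < 50, "<50", ...)` masking step recurs in several of the breakdowns. A minimal sketch of how it could be factored into one helper follows; the name `mask_small_counts` and its `threshold` argument are hypothetical and not part of the report's code.

```r
# Hypothetical helper: replace counts below a threshold with a masked label.
# Returns a character vector, since masked and unmasked values share a column.
mask_small_counts <- function(n, threshold = 50) {
  ifelse(n < threshold, paste0("<", threshold), as.character(n))
}

masked <- mask_small_counts(c(99, 12, 50))
print(masked)
```

A count exactly equal to the threshold is left unmasked, matching the strict `<` comparison used in the code above.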

By user experience

Code
tone_checks_shown_byuser_status <- tone_check_frequency_data %>%
      filter(test_group == "test (tone check available)" & #limit to test group edits
            was_saved == 1 & is_new_content ==1) %>% #limit to published  new content edits
    group_by(experience_level_group ) %>%
    summarise(n_editing_session = n_distinct(editing_session),
              n_editing_session_tonecheck = n_distinct(editing_session[was_tone_check_shown == 1])) %>%
    mutate(prop_check_shown = paste0(round(n_editing_session_tonecheck/n_editing_session * 100, 1), "%"))  %>%
    mutate(n_editing_session_tonecheck = ifelse(n_editing_session_tonecheck < 50, "<50", n_editing_session_tonecheck))%>% #sanitizing per data publication guideline
    select(-2) %>% #removing total number of edits column to sanitize data for publication
    gt()  %>%
    tab_header(
    title = "Published new content edits shown at least one tone check by user experience"
      )  %>%
    opt_stylize(5) %>%
  cols_label(
    experience_level_group  = "User Experience",
    #n_editing_session = "Number of edits",
    n_editing_session_tonecheck = "Number of edits shown tone check",   
    prop_check_shown = "Proportion of edits shown tone check"
  ) %>%
    tab_source_note(
        gt::md('Limited to published new content edits by unregistered users and users with 100 or fewer edits')
    )


display_html(as_raw_html(tone_checks_shown_byuser_status ))
Published new content edits shown at least one tone check by user experience

| User Experience | Number of edits shown tone check | Proportion of edits shown tone check |
|---|---|---|
| Unregistered | <50 | 14.2% |
| Newcomer | <50 | 13.8% |
| Junior Contributor | 51 | 6% |

Limited to published new content edits by unregistered users and users with 100 or fewer edits

By Partner Wikipedia

Code
tone_checks_shown_bywiki <- tone_check_frequency_data %>%
     filter(test_group == "test (tone check available)" & #limit to test group edits
            was_saved == 1 & is_new_content ==1) %>% #saved test group edits
    group_by(wiki) %>%
    summarise(n_editing_session = n_distinct(editing_session),
              n_editing_session_tonecheck = n_distinct(editing_session[was_tone_check_shown == 1])) %>%
    mutate(prop_check_shown = paste0(round(n_editing_session_tonecheck/n_editing_session * 100, 1), "%"))  %>%
      mutate(n_editing_session_tonecheck = ifelse(n_editing_session_tonecheck < 50, "<50", n_editing_session_tonecheck))%>% #sanitizing per data publication guideline
    select(-2) %>% #removing total number of edits column to sanitize data for publication
    gt()  %>%
    tab_header(
    title = "Published new content edits shown at least one tone check by Wikipedia"
      )  %>%
    opt_stylize(5) %>%
  cols_label(
    wiki  = "Wikipedia",
    #n_editing_session = "Number of edits",
    n_editing_session_tonecheck = "Number of edits shown tone check",   
    prop_check_shown = "Proportion of edits shown tone check"
  ) %>%
    tab_source_note(
        gt::md('Limited to published new content edits by unregistered users and users with 100 or fewer edits')
    )


display_html(as_raw_html(tone_checks_shown_bywiki))
Published new content edits shown at least one tone check by Wikipedia

| Wikipedia | Number of edits shown tone check | Proportion of edits shown tone check |
|---|---|---|
| frwiki | 93 | 11.4% |
| jawiki | <50 | 6.3% |
| ptwiki | <50 | 5.1% |

Limited to published new content edits by unregistered users and users with 100 or fewer edits

Key Insights

  • Tone Check has been shown within 9% of all published new content edits (125 of 1,377 published edits in the test group) since the AB test started.
    • The majority of new content edits shown tone check (74%) were shown more than one tone check in an editing session (93 edits total; 6.8% of all published new content edits).
    • Only 1.4% of all published new content edits were shown 6 or more tone checks within a session.
  • It has been shown in 9.5% of all published new content edits on desktop and 7.6% of all published new content edits on mobile.
  • Tone Check appears more frequently in published new content edits by newcomers and unregistered users (about 14% each) compared to Junior Contributors (6%).
  • Frequency varies slightly by partner Wikipedia. 11% of all published new content edits at French Wikipedia have been shown Tone Check compared to 6% at Japanese and 5% at Portuguese Wikipedia.
  • Overall, Tone Check appears much less frequently than Reference Check, which was shown in close to 80% of all editing sessions.
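
The shares in these insights can be recomputed directly from the counts in the test-group tables of this section. The sketch below uses a small hypothetical helper, `pct`, and is only a restatement of the published counts.

```r
# Counts taken from the test-group tables in this section.
edit_attempts <- 701484        # all VE edit attempts
attempts_shown <- 421          # edit attempts shown at least one tone check
published_new_content <- 1377  # published new content edits
published_shown <- 125         # published edits shown at least one tone check
published_multiple <- 93       # published edits shown more than one tone check

# Hypothetical helper: percentage of a total, rounded.
pct <- function(x, total, digits = 1) round(x / total * 100, digits)

share_of_attempts <- pct(attempts_shown, edit_attempts, 2)
share_of_published <- pct(published_shown, published_new_content)
share_multiple_of_published <- pct(published_multiple, published_new_content)
share_multiple_of_shown <- pct(published_multiple, published_shown)

print(c(share_of_attempts = share_of_attempts,
        share_of_published = share_of_published,
        share_multiple_of_published = share_multiple_of_published,
        share_multiple_of_shown = share_multiple_of_shown))
```

Keeping two decimals for the edit-attempt share makes it visible that the 0.1% shown in the first table is a rounded value.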

Proportion of contributors that are presented tone check and complete their edits

Question: Do newcomers understand the feature?

Methodology: We reviewed the proportion of edits where tone check was shown at least once during the edit session and that were successfully published (event.action = saveSuccess). These edits were compared to the completion rate of edits in the control group that were eligible but not shown tone check, as implemented in T394952.

Note: This analysis excludes edits that were abandoned prior to reaching the point where tone check was or would have been shown.
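
To make the completion-rate definition concrete, here is a minimal sketch on toy data. The rows are invented for illustration; the column names follow those used in this section (`tone_check_shown`, `saved_edit`), and it is assumed that `tone_check_shown` also flags eligible sessions in the control group.

```r
# Hypothetical session-level rows; values are invented for illustration.
sessions <- data.frame(
  editing_session = c("s1", "s2", "s3", "s4", "s5", "s6"),
  test_group = c("control", "control", "control", "test", "test", "test"),
  tone_check_shown = c(1, 1, 0, 1, 1, 1),  # shown (test) or eligible (control)
  saved_edit = c(1, 0, 1, 1, 1, 0),
  stringsAsFactors = FALSE
)

# Keep only sessions that reached the point where a check was or would have been shown.
eligible <- sessions[sessions$tone_check_shown == 1, ]

# Completion rate = published sessions / eligible sessions, per experiment group.
rate_by_group <- sapply(split(eligible, eligible$test_group), function(g) {
  n_edits <- length(unique(g$editing_session))
  n_saves <- length(unique(g$editing_session[g$saved_edit > 0]))
  round(n_saves / n_edits * 100, 1)
})
print(rate_by_group)
```

Session `s3` is saved but never reached an eligible point, so it is excluded from both the numerator and the denominator.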

Code
# load data for assessing edit completion rate
edit_completion_rates <-
  read.csv(
    file = 'Queries/data/edit_completion_rate.tsv',
    header = TRUE,
    sep = "\t",
    stringsAsFactors = FALSE
  ) 
Code
# Set experience level group and factor levels
edit_completion_rates <- edit_completion_rates %>%
  mutate(
    experience_level_group = case_when(
     user_edit_count == 0 & user_status == 'registered' ~ 'Newcomer',
     user_edit_count == 0 & user_status == 'unregistered' ~ 'Unregistered',
      user_edit_count > 0 &  user_edit_count <= 100 ~ "Junior Contributor",
      user_edit_count >  100 ~ "Non-Junior Contributor"   
    ),
    experience_level_group = factor(experience_level_group,
         levels = c("Unregistered", "Newcomer", "Junior Contributor", "Non-Junior Contributor")
   ))  

#rename experiment field to clarify
edit_completion_rates <- edit_completion_rates %>%
  mutate(test_group = factor(test_group,
         levels = c('2025-09-editcheck-tone-control', '2025-09-editcheck-tone-test'),
         labels = c("control (no tone check)", "test (tone check available)")))
Code
#Set fields and factor levels to assess number of checks shown

edit_completion_rates  <- edit_completion_rates %>%
  mutate(
    multiple_checks_shown = 
         ifelse(n_checks_shown > 1, "multiple checks shown", "one check shown"),  
     multiple_checks_shown = factor( multiple_checks_shown ,
         levels = c("one check shown", "multiple checks shown")))
         
# note these buckets can be adjusted as needed based on distribution of data
edit_completion_rates  <- edit_completion_rates %>%
  mutate(
    checks_shown_bucket = case_when(
     is.na(n_checks_shown) ~ '0',
     n_checks_shown == 1  ~ '1', 
     n_checks_shown == 2  ~ '2',
     n_checks_shown > 2 & n_checks_shown <= 5  ~ "3-5",
     n_checks_shown > 5 & n_checks_shown <= 10 ~ "6-10", 
     n_checks_shown > 10 ~ "over 10" 
    ),
    checks_shown_bucket = factor(checks_shown_bucket ,
         levels = c("0","1","2", "3-5", "6-10","over 10")
   ))

Overall by experiment group

Code
edit_completion_rate_overall <- edit_completion_rates %>%
    filter(tone_check_shown == 1) %>% #limit to sessions where tone check was shown
    group_by(test_group) %>%
    summarise(n_edits = n_distinct(editing_session),
              n_saves = n_distinct(editing_session[saved_edit > 0])) %>%
    mutate(completion_rate = paste0(round(n_saves/n_edits * 100, 1), "%")) %>% 
    gt()  %>%
    tab_header(
    title = "Edit completion rate by experiment group"
      )  %>%
opt_stylize(5) %>%
  cols_label(
    test_group = "Experiment Group",
    n_edits = "Number of edit attempts",
    n_saves = "Number of published edits",
    completion_rate = "Proportion of edits saved"
  ) %>%
    tab_source_note(
        gt::md('Limited to edit attempts shown or eligible to be shown at least one tone check')
    )



display_html(as_raw_html(edit_completion_rate_overall ))
Edit completion rate by experiment group

| Experiment Group | Number of edit attempts | Number of published edits | Proportion of edits saved |
|---|---|---|---|
| control (no tone check) | 284 | 194 | 68.3% |
| test (tone check available) | 430 | 287 | 66.7% |

Limited to edit attempts shown or eligible to be shown at least one tone check
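
One way to gauge whether the gap between the two groups is distinguishable from noise is a two-proportion test on the counts in this table. The sketch below uses base R's `prop.test`; it is not the team's analysis method, and it treats editing sessions as independent observations, which is a simplifying assumption.

```r
# Counts from the table above: published edits out of edit attempts per group.
saves    <- c(control = 194, test = 287)
attempts <- c(control = 284, test = 430)

# Two-sided test of equal completion rates (continuity correction is on by default).
result <- prop.test(x = saves, n = attempts)

print(round(result$estimate * 100, 1))  # completion rate per group, in percent
print(result$p.value)                   # p-value for the difference in rates
print(result$conf.int)                  # 95% interval for control minus test
```

The full A/B analysis may use a different approach, for example one that accounts for repeated sessions by the same user.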

By whether multiple checks were shown for test group

Code
edit_completion_rate_bymulti <- edit_completion_rates %>%
    filter(tone_check_shown == 1 &
            test_group == 'test (tone check available)') %>%
    group_by(test_group, multiple_checks_shown) %>%
    summarise(n_edits = n_distinct(editing_session),
              n_saves = n_distinct(editing_session[saved_edit > 0])) %>%
    mutate(completion_rate = paste0(round(n_saves/n_edits * 100, 1), "%")) %>%   
    gt()  %>%
    tab_header(
    title = "Edit completion rate by if multiple checks were shown"
      )  %>%
    opt_stylize(5) %>%
  cols_label(
    test_group = "Experiment group",
    multiple_checks_shown = "Multiple tone checks shown",
    n_edits = "Number of edit attempts shown tone check",
    n_saves = "Number of published edits",
    completion_rate = "Proportion of edits saved"
  ) %>%
    tab_source_note(
         gt::md('Limited to edit attempts shown or eligible to be shown at least one tone check')
    )


display_html(as_raw_html(edit_completion_rate_bymulti))
Edit completion rate by if multiple checks were shown

| Experiment group | Multiple tone checks shown | Number of edit attempts shown tone check | Number of published edits | Proportion of edits saved |
|---|---|---|---|---|
| test (tone check available) | one check shown | 158 | 105 | 66.5% |
| test (tone check available) | multiple checks shown | 272 | 182 | 66.9% |

Limited to edit attempts shown or eligible to be shown at least one tone check

By number of checks shown for test group

Code
edit_completion_rate_bynchecks <- edit_completion_rates %>%
    filter(tone_check_shown == 1 & test_group == 'test (tone check available)')  %>% #limit to tone checks shown and test group
    group_by(test_group, checks_shown_bucket) %>%
    summarise(n_edits = n_distinct(editing_session),
              n_saves = n_distinct(editing_session[saved_edit > 0])) %>%
    mutate(completion_rate = paste0(round(n_saves/n_edits * 100, 1), "%")) %>%  
    ungroup()%>%  
    mutate(n_edits = ifelse(n_edits < 50, "<50", n_edits),
           n_saves = ifelse(n_saves < 50, "<50", n_saves))  %>% #sanitizing per data publication guidelines
    group_by(test_group) %>%  
    gt()  %>%
    tab_header(
    title = "Edit completion rate by the number of tone checks shown"
      )  %>%
    opt_stylize(5) %>%
  cols_label(
    checks_shown_bucket = "Number of tone checks shown",
    n_edits = "Number of edit attempts shown tone check",
    n_saves = "Number of published edits",
    completion_rate = "Proportion of edits saved"
  ) %>%
    tab_source_note(
        gt::md('Limited to edit attempts shown or eligible to be shown at least one tone check')
    )


display_html(as_raw_html(edit_completion_rate_bynchecks))
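Edit completion rate by the number of tone checks shown

| Experiment Group | Number of tone checks shown | Number of edit attempts shown tone check | Number of published edits | Proportion of edits saved |
|---|---|---|---|---|
| test (tone check available) | 1 | 158 | 105 | 66.5% |
| test (tone check available) | 2 | 132 | 91 | 68.9% |
| test (tone check available) | 3-5 | 91 | 63 | 69.2% |
| test (tone check available) | 6-10 | <50 | <50 | 50% |
| test (tone check available) | over 10 | <50 | <50 | 66.7% |

Limited to edit attempts shown or eligible to be shown at least one tone check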

By Platform

Code
edit_completion_rate_byplatform <- edit_completion_rates %>%
    filter(tone_check_shown == 1) %>%
    group_by(platform, test_group) %>%
    summarise(n_edits = n_distinct(editing_session),
              n_saves = n_distinct(editing_session[saved_edit > 0])) %>%
    mutate(completion_rate = paste0(round(n_saves/n_edits * 100, 1), "%")) %>% 
    #mutate(n_saves = ifelse(n_saves < 50, "<50", n_saves))%>% #sanitizing per data publication guideline
    select(-c(3,4)) %>% 
    gt()  %>%
    tab_header(
    title = "Edit completion rate by experiment group and platform"
      )  %>%
    opt_stylize(5) %>%
  cols_label(
    test_group = "Experiment Group",
    platform = "Platform",
    #n_edits = "Number of edit attempts shown tone check",
    #n_saves = "Number of published edits",
    completion_rate = "Proportion of edits saved"
  ) %>%
    tab_source_note(
        gt::md('Limited to edit attempts shown or eligible to be shown at least one tone check')
    )


display_html(as_raw_html(edit_completion_rate_byplatform))
Edit completion rate by experiment group and platform

| Platform | Experiment Group | Proportion of edits saved |
|---|---|---|
| desktop | control (no tone check) | 69.4% |
| desktop | test (tone check available) | 65.8% |
| phone | control (no tone check) | 64.5% |
| phone | test (tone check available) | 69.4% |

Limited to edit attempts shown or eligible to be shown at least one tone check

By user experience

Code
edit_completion_rate_byuserstatus <- edit_completion_rates %>%
    filter(tone_check_shown == 1) %>%
    group_by(experience_level_group, test_group) %>%
    summarise(n_edits = n_distinct(editing_session),
              n_saves = n_distinct(editing_session[saved_edit > 0])) %>%
    mutate(completion_rate = paste0(round(n_saves/n_edits * 100, 1), "%")) %>%   
    select(-c(3,4)) %>% #data sanitizing for publication
    gt()  %>%
    tab_header(
    title = "Edit completion rate by experiment group and editor experience"
      )  %>%
 opt_stylize(5) %>%
  cols_label(
    test_group = "Test Group",
    experience_level_group = "User Experience",
    #n_edits = "Number of edit attempts shown tone check",
    #n_saves = "Number of published edits",
    completion_rate = "Proportion of edits saved"
  ) %>%
    tab_source_note(
        gt::md('Limited to edit attempts shown or eligible to be shown at least one tone check')
    )


display_html(as_raw_html(edit_completion_rate_byuserstatus))
Edit completion rate by experiment group and editor experience

| User Experience | Test Group | Proportion of edits saved |
|---|---|---|
| Unregistered | control (no tone check) | 55.8% |
| Unregistered | test (tone check available) | 57.1% |
| Newcomer | control (no tone check) | 64.7% |
| Newcomer | test (tone check available) | 63.9% |
| Junior Contributor | control (no tone check) | 78.3% |
| Junior Contributor | test (tone check available) | 75.6% |

Limited to edit attempts shown or eligible to be shown at least one tone check

By Partner Wikipedia

Code
edit_completion_rate_bywiki <- edit_completion_rates %>%
    filter(tone_check_shown == 1) %>%
    group_by(wiki, test_group) %>%
    summarise(n_edits = n_distinct(editing_session),
              n_saves = n_distinct(editing_session[saved_edit > 0])) %>%
    mutate(completion_rate = paste0(round(n_saves/n_edits * 100, 1), "%")) %>% 
    select(-c(3,4)) %>% #data sanitizing for publication
    gt()  %>%
    tab_header(
    title = "Edit completion rate by experiment group and Wikipedia"
      )  %>%
 opt_stylize(5) %>%
  cols_label(
    test_group = "Test Group",
    wiki = "Wikipedia",
    #n_edits = "Number of edit attempts shown tone check",
    #n_saves = "Number of published edits",
    completion_rate = "Proportion of edits saved"
  ) %>%
    tab_source_note(
        gt::md('Limited to edit attempts shown or eligible to be shown at least one tone check')
    )



display_html(as_raw_html(edit_completion_rate_bywiki ))
Edit completion rate by experiment group and Wikipedia
Test Group Proportion of edits saved
frwiki
control (no tone check) 71.9%
test (tone check available) 70.2%
jawiki
control (no tone check) 65%
test (tone check available) 63.5%
ptwiki
control (no tone check) 56.4%
test (tone check available) 51%
Limited to edit attempts shown or eligible to be shown at least one tone check

Key Insights

  • We’ve observed only a slight decrease in the edit completion rate for edits shown tone check compared to eligible edits in the control group. In the test group, 66.7% of all edits shown tone check were successfully completed compared to 68.3% in the control group (a 2.3% relative decrease).
  • We have not seen any significant decreases when multiple checks were shown. Edit completion rate was about 66% even when more than 6 tone checks were shown in an editing session (Note: There have been very few editing sessions where more than 6 checks were shown, so more data is needed to verify impacts on completion rates at this level).
  • We have not observed any significant decreases in edit completion rate by platform or user experience type.
  • Currently, the decrease in edit completion has primarily been observed on desktop edits. On mobile, the edit completion rate for edits shown tone check has increased compared to the control (64.5% (control) → 69.4% (test); +7.6% increase)
  • We did not observe any significant changes in completion rate by experience group. The highest relative decrease was observed for Junior Contributors (-3.4%).
  • Results vary by wiki. Per the table above, we observed a 2.3% relative decrease in edit completion rate at Japanese Wikipedia (65% → 63.5%) and close to a 10% relative decrease at Portuguese Wikipedia (56.4% → 51%). Each of these wikis currently has a small sample of tone check edits to review (<100 per test group), so more data is needed to confirm trends.
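
The relative changes quoted in these insights can be reproduced from the published rates. The sketch below is illustrative only: `relative_change` is a hypothetical helper that is not part of the original analysis code, and the inputs are the rounded percentages reported above, so results may differ slightly from values computed on the raw counts.

```r
# Relative change (in percent) of the test rate versus the control rate.
# `relative_change` is a hypothetical helper, not taken from the analysis code.
relative_change <- function(control_rate, test_rate) {
  round((test_rate - control_rate) / control_rate * 100, 1)
}

# Inputs are the rounded completion rates reported in this section.
overall <- relative_change(68.3, 66.7)  # all edits shown or eligible for tone check
mobile  <- relative_change(64.5, 69.4)  # mobile web edits
ptwiki  <- relative_change(56.4, 51.0)  # Portuguese Wikipedia

print(c(overall = overall, mobile = mobile, ptwiki = ptwiki))
```

A negative value indicates that the completion rate in the test group is lower than in the control group.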

Proportion of edits wherein people elect to dismiss/not change the text they’ve added

Question: Do people find Tone Check relevant?

Methodology: We reviewed the proportion of published edits shown tone check wherein people elected to dismiss changing the text they added. This was determined by edits where the user dismissed a tone check at least once in a session (event.feature = 'editCheck-tone' AND event.action = 'action-dismiss').

We also reviewed the proportion of all saved edits that the model still identified as having non-neutral language even after being shown tone check at least once. These edits are tagged with both editcheck-tone and editcheck-tone-shown.
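
As a rough illustration of the session-level dismissal flag described above, the sketch below counts dismissals per editing session from a small, made-up event log. The data frame, its column names, and the placeholder action `some-other-action` are assumptions for illustration; only the `editCheck-tone` feature and `action-dismiss` action come from the methodology, and the actual query behind `edit_check_rejects_data.tsv` may differ.

```r
# Hypothetical event log: one row per logged event (all values are made up).
events <- data.frame(
  editing_session = c("s1", "s1", "s2", "s3", "s3"),
  feature = c("editCheck-tone", "editCheck-tone", "editCheck-tone",
              "editCheck-reference", "editCheck-tone"),
  action  = c("some-other-action", "action-dismiss", "some-other-action",
              "action-dismiss", "some-other-action"),
  stringsAsFactors = FALSE
)

# An event counts as a tone check dismissal only if both conditions hold.
is_tone_dismiss <- events$feature == "editCheck-tone" &
  events$action == "action-dismiss"

# Number of tone check dismissals per editing session.
n_rejects <- tapply(is_tone_dismiss, events$editing_session, sum)
sessions <- data.frame(
  editing_session = names(n_rejects),
  n_rejects = as.integer(n_rejects)
)

# Share of sessions with at least one dismissed tone check.
dismissal_rate <- round(mean(sessions$n_rejects > 0) * 100, 1)
print(sessions)
print(dismissal_rate)
```

In this toy data only session `s1` counts as a dismissal, because the dismissal in `s3` belongs to a different check.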

Code
# load data for assessing edit reject frequency
edit_check_reject_data <-
  read.csv(
    file = 'Queries/data/edit_check_rejects_data.tsv',
    header = TRUE,
    sep = "\t",
    stringsAsFactors = FALSE
  ) 
Code
# Set experience level group and factor levels
edit_check_reject_data <- edit_check_reject_data %>%
  mutate(
    experience_level_group = case_when(
     user_edit_count == 0 & user_status == 'registered' ~ 'Newcomer',
     user_edit_count == 0 & user_status == 'unregistered' ~ 'Unregistered',
      user_edit_count > 0 &  user_edit_count <= 100 ~ "Junior Contributor",
      user_edit_count >  100 ~ "Non-Junior Contributor"   
    ),
    experience_level_group = factor(experience_level_group,
         levels = c("Unregistered","Newcomer", "Non-Junior Contributor", "Junior Contributor")
   ))  

#rename experiment field to clarify
edit_check_reject_data <- edit_check_reject_data %>%
  mutate(test_group = factor(test_group,
         levels = c('2025-09-editcheck-tone-control', '2025-09-editcheck-tone-test'),
         labels = c("control (no tone check)", "test (tone check available)")))
Code
#Set fields and factor levels to assess number of checks shown
#Note limited to 1 sidebar open as we're looking for cases where multiple checks presented in a single sidebar (vs user going back and forth)

edit_check_reject_data <- edit_check_reject_data %>%
  mutate(
    multiple_checks_shown = 
         ifelse(n_checks_shown > 1, "multiple checks shown", "single check shown"),  
     multiple_checks_shown = factor( multiple_checks_shown ,
         levels = c("single check shown", "multiple checks shown")))
         
# note these buckets can be adjusted as needed based on distribution of data
edit_check_reject_data <- edit_check_reject_data %>%
  mutate(
    checks_shown_bucket = case_when(
     is.na(n_checks_shown) ~ '0',
     n_checks_shown == 1  ~ '1', 
     n_checks_shown == 2 ~ '2',
     n_checks_shown > 2 & n_checks_shown <= 5 ~ "3-5",
     n_checks_shown > 5 & n_checks_shown <= 10 ~ "6-10", 
     n_checks_shown > 10 ~ "over 10" 
    ),
    checks_shown_bucket = factor(checks_shown_bucket ,
         levels = c("0","1","2", "3-5", "6-10", "over 10")
   ))   

Overall published edits where tone check was dismissed

Code
edit_check_dismissal_overall <- edit_check_reject_data %>%
    filter(was_edit_check_shown == 1) %>% #limit to where shown
    summarise(n_edits = n_distinct(editing_session),
              n_rejects = n_distinct(editing_session[n_rejects > 0])) %>% #sessions with at least one dismissed tone check
    mutate(dismissal_rate = paste0(round(n_rejects/n_edits * 100, 1), "%")) %>%   
    gt()  %>%
    tab_header(
    title = "Proportion of edits where at least one tone check was dismissed"
      )  %>%
    opt_stylize(5) %>%
  cols_label(
    n_edits = "Number of edits shown tone check",
    n_rejects = "Number of edits that dismissed tone check",
    dismissal_rate = "Proportion of edits where tone check was dismissed"
  ) 

display_html(as_raw_html(edit_check_dismissal_overall ))
Proportion of edits where at least one tone check was dismissed
Number of edits shown tone check Number of edits that dismissed tone check Proportion of edits where tone check was dismissed
287 164 57.1%

Overall published edits that still have tone issue at save after being shown tone check

Code
tone_check_issues_remain_overall <- edit_check_reject_data %>%
    filter(was_edit_check_shown == 1) %>% #limit to where shown
    summarise(n_edits = n_distinct(editing_session),
              n_rejects = n_distinct(editing_session[is_tone_check_eligible == 1])) %>% #sessions identified as having a tone issue at time of save
    mutate(dismissal_rate = paste0(round(n_rejects/n_edits * 100, 1), "%")) %>%   
    gt()  %>%
    tab_header(
    title = "Proportion of edits still identified as having non-neutral language after being shown tone check"
      )  %>%
    opt_stylize(5) %>%
  cols_label(
    n_edits = "Number of edits shown tone check",
    n_rejects = "Number of edits still identified as eligible",
    dismissal_rate = "Proportion of edits with non-neutral language"
  ) 


display_html(as_raw_html(tone_check_issues_remain_overall ))
Proportion of edits still identified as having non-neutral language after being shown tone check
Number of edits shown tone check Number of edits still identified as eligible Proportion of edits with non-neutral language
287 184 64.1%

By if multiple checks shown

Code
edit_check_dismissal_bymultiple <- edit_check_reject_data %>%
    filter(was_edit_check_shown == 1) %>% #limit to where shown
    group_by(multiple_checks_shown) %>%
    summarise(n_edits = n_distinct(editing_session),
              n_rejects = n_distinct(editing_session[n_rejects > 0])) %>% #sessions with at least one dismissed tone check
    mutate(dismissal_rate = paste0(round(n_rejects/n_edits * 100, 1), "%")) %>%   
    gt()  %>%
    tab_header(
    title = "Proportion of edits where at least one tone check was dismissed by if multiple checks shown"
      )  %>%
 opt_stylize(5) %>%
  cols_label(
     multiple_checks_shown = "Multiple Checks",
    n_edits = "Number of edits shown tone check",
    n_rejects = "Number of edits that dismissed tone check",
    dismissal_rate = "Proportion of edits where tone check was dismissed"
  ) %>%
    tab_source_note(
        gt::md('Limited to published edits where at least one tone check was shown and dismissed')
    )


display_html(as_raw_html(edit_check_dismissal_bymultiple ))
Proportion of edits where at least one tone check was dismissed by if multiple checks shown
Multiple Checks Number of edits shown tone check Number of edits that dismissed tone check Proportion of edits where tone check was dismissed
single check shown 104 71 68.3%
multiple checks shown 183 94 51.4%
Limited to published edits where at least one tone check was shown and dismissed

By platform

Code
edit_check_dismissal_byplatform <- edit_check_reject_data %>%
    filter(was_edit_check_shown == 1) %>% #limit to where shown
    group_by(platform) %>%
    summarise(n_edits = n_distinct(editing_session),
              n_rejects = n_distinct(editing_session[n_rejects > 0])) %>% #sessions with at least one dismissed tone check
    mutate(dismissal_rate = paste0(round(n_rejects/n_edits * 100, 1), "%")) %>% 
    ungroup() %>%
    mutate(n_edits = ifelse(n_edits < 50, "<50", n_edits),
     n_rejects = ifelse(n_rejects < 50, "<50", n_rejects))  %>% #sanitizing per data publication guidelines
    select(-2) %>%
    gt()  %>%
    tab_header(
    title = "Proportion of edits where at least one tone check was dismissed by platform"
      )  %>%
    opt_stylize(5) %>%
  cols_label(
    platform = "Platform",
    #n_edits = "Number of edits shown tone check",
    n_rejects = "Number of edits that dismissed tone check",
    dismissal_rate = "Proportion of edits where tone check was dismissed"
  ) 
display_html(as_raw_html(edit_check_dismissal_byplatform ))
Proportion of edits where at least one tone check was dismissed by platform
Platform Number of edits that dismissed tone check Proportion of edits where tone check was dismissed
desktop 136 63.8%
phone <50 39.2%

By user experience

Code
edit_check_dismissal_byuserexp <- edit_check_reject_data %>%
    filter(was_edit_check_shown == 1) %>% #limit to where shown
    group_by(experience_level_group) %>%
    summarise(n_edits = n_distinct(editing_session),
              n_rejects = n_distinct(editing_session[n_rejects > 0])) %>% #sessions with at least one dismissed tone check
    mutate(dismissal_rate = paste0(round(n_rejects/n_edits * 100, 1), "%")) %>%   
    ungroup() %>%
    mutate(n_edits = ifelse(n_edits < 50, "<50", n_edits),
     n_rejects = ifelse(n_rejects < 50, "<50", n_rejects))  %>% #sanitizing per data publication guidelines
    select(-2) %>%
    gt()  %>%
    tab_header(
    title = "Proportion of edits where at least one tone check was dismissed by user experience"
      )  %>%
    opt_stylize(5) %>%
  cols_label(
    experience_level_group = "User Experience",
    #n_edits = "Number of edits shown tone check",
    n_rejects = "Number of edits that dismissed tone check",
    dismissal_rate = "Proportion of edits where tone check was dismissed"
  )
display_html(as_raw_html(edit_check_dismissal_byuserexp ))
Proportion of edits where at least one tone check was dismissed by user experience
User Experience Number of edits that dismissed tone check Proportion of edits where tone check was dismissed
Unregistered <50 55.7%
Newcomer <50 62.3%
Junior Contributor 84 57.5%

By partner Wikipedia

Code
edit_check_dismissal_bywiki <- edit_check_reject_data %>%
    filter(was_edit_check_shown == 1) %>% #limit to where shown
    group_by(wiki) %>%
    summarise(n_edits = n_distinct(editing_session),
              n_rejects = n_distinct(editing_session[n_rejects > 0])) %>% #sessions with at least one dismissed tone check
    mutate(dismissal_rate = paste0(round(n_rejects/n_edits * 100, 1), "%")) %>% 
   ungroup() %>%
    mutate(n_edits = ifelse(n_edits < 50, "<50", n_edits),
     n_rejects = ifelse(n_rejects < 50, "<50", n_rejects))  %>% #sanitizing per data publication guidelines
    select(-2) %>%
    gt()  %>%
    tab_header(
    title = "Proportion of edits where at least one tone check was dismissed by partner Wikipedia"
      )  %>%
    opt_stylize(5) %>%
  cols_label(
    wiki = "Wikipedia",
    #n_edits = "Number of edits shown tone check",
    n_rejects = "Number of edits that dismissed tone check",
    dismissal_rate = "Proportion of edits where tone check was dismissed"
  ) 

display_html(as_raw_html(edit_check_dismissal_bywiki ))
Proportion of edits where at least one tone check was dismissed by partner Wikipedia
Wikipedia Number of edits that dismissed tone check Proportion of edits where tone check was dismissed
frwiki 135 63.1%
jawiki <50 40.4%
ptwiki <50 46.2%

Key Insights

  • A little over half of all published edits where tone check was shown (57%) included at least one check that the user dismissed. This is similar to the rates observed for Reference Check.
    • 64% of published edits shown tone check were still identified as having non-neutral language at the time they were published.
  • Tone checks are dismissed more frequently on desktop compared to mobile. 63.8% of all published desktop edits where tone check was shown include at least one check that was dismissed compared to 39% of all published mobile edits.
  • Newcomers are slightly more likely to dismiss a tone check compared to unregistered or Junior Contributors. 62.3% of all published edits by newcomers included at least one dismissal of a tone check compared to 57.5% of edits by Junior Contributors.
  • Dismissal rates are also currently higher at French Wikipedia compared to Japanese and Portuguese Wikipedia.
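
Because several of these breakdowns rest on small samples, it can help to look at the uncertainty around the dismissal rates. The sketch below is not part of the original analysis; it assumes the counts published in the tables above (164 of 287 edits overall; 71 of 104 single-check edits and 94 of 183 multiple-check edits) and uses base R's `prop.test` to compute a score-based confidence interval and a two-sample comparison.

```r
# Counts taken from the published tables in this section.
overall_test <- prop.test(x = 164, n = 287)
overall_ci <- overall_test$conf.int  # 95% confidence interval for the overall dismissal rate

# Compare dismissal rates for single-check vs multiple-check edits.
single_vs_multiple <- prop.test(x = c(71, 94), n = c(104, 183))

print(round(overall_ci * 100, 1))
print(single_vs_multiple$p.value)
```

Wide intervals or a large p-value would be consistent with the note that more data is needed before drawing conclusions from per-segment differences.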

Proportion of published new content edits that are reverted within 48 hours

Question: Is tone check causing any disruption?

Methodology: We reviewed the proportion of all published new content edits where tone check was shown at least once in an editing session (identified by the editCheck-tone-shown tag) and that were reverted within 48 hours. This was compared to the revert rate of edits in the control group identified as eligible for tone check (identified by the editcheck-tone tag).

Code
# load data for assessing tone check published data
edit_check_save_data <-
  read.csv(
    file = 'Queries/data/edit_check_saves_data.tsv',
    header = TRUE,
    sep = "\t",
    stringsAsFactors = FALSE
  ) 
Code
# Set experience level group and factor levels
edit_check_save_data <- edit_check_save_data %>%
  mutate(
    experience_level_group = case_when(
     user_edit_count == 0 & user_status == 'registered' ~ 'Newcomer',
     user_edit_count == 0 & user_status == 'unregistered' ~ 'Unregistered',
      user_edit_count > 0 &  user_edit_count <= 100 ~ "Junior Contributor",
      user_edit_count >  100 ~ "Non-Junior Contributor"   
    ),
    experience_level_group = factor(experience_level_group,
         levels = c("Unregistered","Newcomer", "Non-Junior Contributor", "Junior Contributor")
   ))  

#rename experiment field to clarify

edit_check_save_data <- edit_check_save_data %>%
  mutate(test_group = factor(test_group,
         levels = c('2025-09-editcheck-tone-control', '2025-09-editcheck-tone-test'),
         labels = c("control (no tone check shown)", "test (tone check shown)")))
Code
# set field to indicate if more than one check was shown in a single session. Note: This should only be applicable to the test group 

edit_check_save_data <- edit_check_save_data %>%
  mutate(
    multiple_checks_shown = 
         ifelse(n_checks_shown > 1, "multiple checks shown", "single check shown"),  
     multiple_checks_shown = factor( multiple_checks_shown ,
         levels = c("single check shown", "multiple checks shown")))
         
# note these buckets can be adjusted as needed based on distribution of data
edit_check_save_data <- edit_check_save_data %>%
  mutate(
    checks_shown_bucket = case_when(
     is.na(n_checks_shown) ~ '0',
     n_checks_shown == 1   ~ '1', 
     n_checks_shown == 2  ~ '2',
     n_checks_shown > 2 & n_checks_shown <= 5  ~ "3-5",
     n_checks_shown > 5 & n_checks_shown <= 10  ~ "6-10", 
     n_checks_shown > 10  ~ "over 10" 
    ),
    checks_shown_bucket = factor(checks_shown_bucket ,
         levels = c("0","1","2", "3-5", "6-10","over 10")
   ))  
Code
# define set of all eligible edits to review (eligible in control and activated in test)
edit_check_save_data <- edit_check_save_data %>%
    mutate(is_test_eligible  = ifelse(
        (test_group == 'test (tone check shown)' & was_tone_check_shown_tag == 1) |
        (test_group == 'control (no tone check shown)' & is_tone_check_eligible == 1) , 'eligible', 'not eligible'),
      is_test_eligible = 
      factor(
      is_test_eligible,
      levels = c("eligible",  "not eligible" )
  )) 

Code
# use tone check eligible tag to define edits that were detected as having non-neutral language
edit_check_save_data <- edit_check_save_data %>%
    mutate(is_tone_check_eligible  = ifelse(is_tone_check_eligible == 1, 'non-neutral language detected', 'tone check addressed'),
      is_tone_check_eligible = 
      factor(
      is_tone_check_eligible,
      levels = c("non-neutral language detected",  "tone check addressed" )
  )) 

Overall by experiment group

Code
edit_check_save_overall <- edit_check_save_data %>%
    filter(is_new_content == 1 & is_test_eligible == 'eligible') %>% #limit to eligible edits
    group_by(test_group) %>%
    summarise(n_edits = n_distinct(editing_session),
              n_reverts = n_distinct(editing_session[was_reverted == 1])) %>% #sessions whose published edit was reverted
    mutate(revert_rate = paste0(round(n_reverts/n_edits * 100, 1), "%")) %>%
    select(-c(2,3)) %>% # removing granular data columns for publication
    gt()  %>%
    tab_header(
    title = "New content edit revert rate by experiment group"
      )  %>%
    opt_stylize(5) %>%
  cols_label(
    test_group = "Test Group",
    #n_edits = "Number of published edits shown tone check",
    #n_reverts = "Number of edits reverted",
    revert_rate = "Proportion of new content edits that were reverted"
  ) %>%
    tab_source_note(
        gt::md('Limited to published new content edits shown or eligible to be shown tone check')
    )



display_html(as_raw_html(edit_check_save_overall ))
New content edit revert rate by experiment group
Test Group Proportion of new content edits that were reverted
control (no tone check shown) 24.8%
test (tone check shown) 21.7%
Limited to published new content edits shown or eligible to be shown tone check

By if multiple checks were shown

Code
edit_check_revert_bymultiple <- edit_check_save_data %>%
    filter(is_new_content == 1 & is_test_eligible == 'eligible' 
          & test_group == 'test (tone check shown)' 
          & !is.na(multiple_checks_shown)) %>% #limit to eligible edits and remove 2 abnormal test instances tagged as eligible but not shown a check
    group_by( multiple_checks_shown) %>%
    summarise(n_edits = n_distinct(editing_session),
              n_reverts = n_distinct(editing_session[was_reverted == 1])) %>% #sessions whose published edit was reverted
    mutate(revert_rate = paste0(round(n_reverts/n_edits * 100, 1), "%")) %>%   
    select(-c(2,3)) %>% # removing granular data columns for publication
    gt()  %>%
    tab_header(
    title = "New content edit revert rate by if multiple checks were shown"
      )  %>%
    opt_stylize(5) %>%
  cols_label(
    multiple_checks_shown = "Multiple Check",
    #n_edits = "Number of published new content edits",
    #n_reverts = "Number of edits reverted ",
    revert_rate = "Proportion of new content edits that were reverted"
  ) %>%
    tab_source_note(
        gt::md('Limited to published new content edits shown or eligible to be shown tone check')
    )


display_html(as_raw_html(edit_check_revert_bymultiple ))
New content edit revert rate by if multiple checks were shown
Multiple Check Proportion of new content edits that were reverted
single check shown 13.8%
multiple checks shown 24.7%
Limited to published new content edits shown or eligible to be shown tone check

By Platform

Code
edit_check_revert_byplatform <- edit_check_save_data %>%
    filter(is_new_content == 1 & is_test_eligible == 'eligible') %>%
    group_by( platform, test_group) %>%
    summarise(n_edits = n_distinct(editing_session),
              n_reverts = n_distinct(editing_session[was_reverted == 1])) %>% #sessions whose published edit was reverted
    mutate(revert_rate = paste0(round(n_reverts/n_edits * 100, 1), "%")) %>%   
    select(-c(3,4)) %>% # removing granular data columns for publication
    gt()  %>%
    tab_header(
    title = "New content edit revert rate by platform"
      )  %>%
  opt_stylize(5) %>%
  cols_label(
    test_group = "Test Group",
    platform = "Platform",
    #n_edits = "Number of published new content edits",
    #n_reverts = "Number of edits reverted",
    revert_rate = "Proportion of new content edits that were reverted"
  ) %>%
    tab_source_note(
        gt::md('Limited to published new content edits shown or eligible to be shown tone check')
    )



display_html(as_raw_html(edit_check_revert_byplatform ))
New content edit revert rate by platform
Test Group Proportion of new content edits that were reverted
desktop
control (no tone check shown) 20.7%
test (tone check shown) 19.6%
phone
control (no tone check shown) 35.5%
test (tone check shown) 28.6%
Limited to published new content edits shown or eligible to be shown tone check

By user experience

Code
edit_check_revert_byuserexp <- edit_check_save_data %>%
    filter(is_new_content == 1 & is_test_eligible == 'eligible') %>%
    group_by(experience_level_group,test_group ) %>%
    summarise(n_edits = n_distinct(editing_session),
              n_reverts = n_distinct(editing_session[was_reverted == 1])) %>% #sessions whose published edit was reverted
    mutate(revert_rate = paste0(round(n_reverts/n_edits * 100, 1), "%")) %>%
    select(-c(3,4)) %>% # removing granular data columns for publication
    gt()  %>%
    tab_header(
    title = "New content edit revert rate by user experience"
      )  %>%
   opt_stylize(5) %>%
  cols_label(
    test_group = "Experiment Group",
    experience_level_group  = "User Status",
    #n_edits = "Number of published new content edits",
    #n_reverts = "Number of edits reverted",
    revert_rate = "Proportion of new content edits that were reverted"
  ) %>%
    tab_source_note(
        gt::md('Limited to published new content edits shown or eligible to be shown tone check')
    )



display_html(as_raw_html(edit_check_revert_byuserexp))
New content edit revert rate by user experience
Experiment Group Proportion of new content edits that were reverted
Unregistered
control (no tone check shown) 23.9%
test (tone check shown) 23.3%
Newcomer
control (no tone check shown) 11.8%
test (tone check shown) 30.8%
Junior Contributor
control (no tone check shown) 30%
test (tone check shown) 15.7%
Limited to published new content edits shown or eligible to be shown tone check

By partner Wikipedia

Code
edit_check_revert_bywiki <- edit_check_save_data %>%
    filter(is_new_content == 1 & is_test_eligible == 'eligible' ) %>%
    group_by( wiki, test_group) %>%
    summarise(n_edits = n_distinct(editing_session),
              n_reverts = n_distinct(editing_session[was_reverted == 1])) %>% #sessions whose published edit was reverted
    mutate(revert_rate = paste0(round(n_reverts/n_edits * 100, 1), "%")) %>%  
    select(-c(3,4)) %>% # removing granular data columns for publication
    gt()  %>%
    tab_header(
    title = "New content edit revert rate by partner Wikipedia"
      )  %>%
  opt_stylize(5) %>%
  cols_label(
    test_group = "Experiment Group",
    wiki  = "Wikipedia",
    #n_edits = "Number of published new content edits",
    #n_reverts = "Number of edits reverted",
    revert_rate = "Proportion of new content edits that were reverted"
  )  %>%
    tab_source_note(
        gt::md('Limited to published new content edits shown or eligible to be shown tone check')
    )


display_html(as_raw_html(edit_check_revert_bywiki))
New content edit revert rate by partner Wikipedia
Experiment Group Proportion of new content edits that were reverted
frwiki
control (no tone check shown) 24.2%
test (tone check shown) 21.3%
jawiki
control (no tone check shown) 13.3%
test (tone check shown) 22.2%
ptwiki
control (no tone check shown) 57.1%
test (tone check shown) 23.1%
Limited to published new content edits shown or eligible to be shown tone check

Key Insights

  • There have been no significant changes in the revert rate of new content edits overall or by platform or Wikipedia. However, we’ve observed decreases in revert rate when limiting to edits where tone check was shown or eligible to be shown.
  • Overall, the revert rate of published edits where tone check was shown was 21.7%, compared to 24.8% for eligible edits in the control group (a decrease of 3.1 percentage points, or about 12.5% in relative terms).
  • We’ve observed a 5.3% relative decrease in the revert rate of desktop edits where tone check was shown (20.7% → 19.6%) and a 19% relative decrease in the revert rate of mobile edits (35.5% → 28.6%).
  • More data is needed to confirm per Wikipedia and per experience level trends.

Revert rate of published edits identified by model as having non-neutral language

In the revert rate analysis above, we reviewed the overall revert rate of all published edits shown tone check but did not consider how many of those edits revised the text to address any problematic language prior to publishing.

In this analysis, we review the revert rate of edits shown tone check by whether the saved edit still included non-neutral language. This is to check our hypothesis that addressing tone issues identified in text will decrease the likelihood that newcomers’ edits will be reverted. To complete this analysis, we used the revision tag created in T388716 to identify when the model detects non-neutral language within a new content edit.

Overall

Code
tone_check_eligible_revert_overall <- edit_check_save_data %>%
    filter( is_test_eligible == 'eligible',
          test_group == 'test (tone check shown)') %>% #limit to edits where edit check was shown 
    group_by(is_tone_check_eligible) %>%
    summarise(n_edits = n_distinct(editing_session),
             n_reverts = n_distinct(editing_session[was_reverted == 1]))  %>%  #look at reverted
     mutate(revert_rate = paste0(round(n_reverts/n_edits * 100, 1), "%")) %>%  
    select(-c(2,3)) %>% # removing granular data columns
    gt()  %>%
    tab_header(
    title = "Revert rate of edits by if non-neutral language was detected at time of publishing"
      )  %>%
  opt_stylize(5) %>%
  cols_label(
    is_tone_check_eligible  = "Were tone issues detected at time of save?",
    #n_edits = "Number of published edits",
    #n_reverts = "Number of edits reverted",
    revert_rate = "Proportion of edits that were reverted"
  )  %>%
    tab_source_note(
        gt::md('Limited to edits shown tone check in the test group')
    )


display_html(as_raw_html(tone_check_eligible_revert_overall))
Revert rate of edits by if non-neutral language was detected at time of publishing
Were tone issues detected at time of save? Proportion of edits that were reverted
non-neutral language detected 27.4%
tone check addressed 12.1%
Limited to edits shown tone check in the test group

By Platform

Code
tone_check_eligible_revert_byplatform <- edit_check_save_data %>%
    filter( is_test_eligible == 'eligible',
          test_group == 'test (tone check shown)') %>% #limit to edits where edit check was shown
    group_by(platform, is_tone_check_eligible) %>%
    summarise(n_edits = n_distinct(editing_session),
             n_reverts = n_distinct(editing_session[was_reverted == 1]))  %>%  #look at reverted
     mutate(revert_rate = paste0(round(n_reverts/n_edits * 100, 1), "%")) %>% 
   select(-c(3,4)) %>% # removing granular data columns
    gt()  %>%
    tab_header(
    title = "Revert rate of edits by if non-neutral language was detected at time of publishing"
      )  %>%
  opt_stylize(5) %>%
  cols_label(
    platform  = "Platform",
    is_tone_check_eligible  = "Were tone issues detected at time of save?",
    #n_edits = "Number of published edits",
    #n_reverts = "Number of edits reverted",
    revert_rate = "Proportion of edits that were reverted"
  )  %>%
    tab_source_note(
        gt::md('Limited to edits shown tone check in the test group')
    )


display_html(as_raw_html(tone_check_eligible_revert_byplatform))
Revert rate of edits by if non-neutral language was detected at time of publishing
Were tone issues detected at time of save? Proportion of edits that were reverted
desktop
non-neutral language detected 23.2%
tone check addressed 8.1%
phone
non-neutral language detected 36.1%
tone check addressed 29.4%
Limited to edits shown tone check in the test group

By User Experience

Code
tone_check_eligible_revert_byuserexp <- edit_check_save_data %>%
     filter( is_test_eligible == 'eligible',
          test_group == 'test (tone check shown)') %>% #limit to edits where edit check was shown
    group_by(experience_level_group, is_tone_check_eligible) %>%
    summarise(n_edits = n_distinct(editing_session),
             n_reverts = n_distinct(editing_session[was_reverted == 1]))  %>%  #look at reverted
     mutate(revert_rate = paste0(round(n_reverts/n_edits * 100, 1), "%")) %>% 
   select(-c(3,4)) %>% # removing granular data columns
    gt()  %>%
    tab_header(
    title = "Revert rate of edits shown or eligible to be shown tone check by if tone issues were detected at time of publishing"
      )  %>%
  opt_stylize(5) %>%
  cols_label(
    experience_level_group  = "User Experience",
    is_tone_check_eligible  = "Were tone issues detected at time of save?",
    #n_edits = "Number of published edits",
    #n_reverts = "Number of edits reverted",
    revert_rate = "Proportion of edits that were reverted"
  )  %>%
    tab_source_note(
        gt::md('Limited to edits shown tone check in the test group')
    )


display_html(as_raw_html(tone_check_eligible_revert_byuserexp))
Revert rate of edits shown or eligible to be shown tone check by if tone issues were detected at time of publishing
Were tone issues detected at time of save? Proportion of edits that were reverted
Unregistered
non-neutral language detected 34.4%
tone check addressed 15%
Newcomer
non-neutral language detected 41.9%
tone check addressed 20%
Junior Contributor
non-neutral language detected 18.1%
tone check addressed 7.8%
Limited to edits shown tone check in the test group

By Partner Wikipedia

Code
tone_check_eligible_revert_bywiki <- edit_check_save_data %>%
     filter( is_test_eligible == 'eligible',
          test_group == 'test (tone check shown)') %>% #limit to edits where edit check was shown
    group_by(wiki,  is_tone_check_eligible) %>%
    summarise(n_edits = n_distinct(editing_session),
             n_reverts = n_distinct(editing_session[was_reverted == 1]))  %>%  #look at reverted
     mutate(revert_rate = paste0(round(n_reverts/n_edits * 100, 1), "%")) %>%  
    select(-c(3,4)) %>% # removing granular data columns
    gt()  %>%
    tab_header(
    title = "Revert rate of edits shown or eligible to be shown tone check by if tone issues were detected at time of publishing"
      )  %>%
  opt_stylize(5) %>%
  cols_label(
    wiki  = "Wikipedia",
    is_tone_check_eligible  = "Were tone issues detected at time of save?",
    #n_edits = "Number of published edits",
    #n_reverts = "Number of edits reverted",
    revert_rate = "Proportion of edits that were reverted"
  )  %>%
    tab_source_note(
        gt::md('Limited to edits shown tone check in the test group')
    )


display_html(as_raw_html(tone_check_eligible_revert_bywiki))
Revert rate of edits shown or eligible to be shown tone check by if tone issues were detected at time of publishing
Were tone issues detected at time of save? Proportion of edits that were reverted
frwiki
non-neutral language detected 25.7%
tone check addressed 14%
jawiki
non-neutral language detected 33.3%
tone check addressed 10%
ptwiki
non-neutral language detected 40%
tone check addressed 7.1%
Limited to edits shown tone check in the test group

Key Insights

  • Edits where text was revised to address the tone check shown were 2x less likely to be reverted.
  • On mobile, edits where tone checks were addressed were 13% less likely to be reverted; on desktop, edits where tone checks were addressed were almost 3x less likely to be reverted.
  • Decreases were observed across all user experience levels and partner Wikipedias; however, more data is needed to confirm the trends for these breakdowns, as the number of published edits per wiki and per user experience level is still limited.
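
The "times less likely" comparisons above are ratios of two revert rates. As a rough check, the sketch below recomputes that ratio from the rounded percentages in the by-wiki table. The data frame and its column names are introduced here for illustration only, and because the inputs are rounded and the per-wiki samples are small, the resulting ratios are approximate.

```r
# Revert rates (%) copied from the "By Partner Wikipedia" table above.
# The data frame and column names are illustrative, not part of the original analysis.
revert_rates <- data.frame(
  wiki      = c("frwiki", "jawiki", "ptwiki"),
  detected  = c(25.7, 33.3, 40.0),  # non-neutral language detected at time of save
  addressed = c(14.0, 10.0, 7.1)    # tone check addressed
)

# A ratio above 1 means edits with unaddressed tone issues were reverted more often.
revert_rates$times_less_likely <- round(revert_rates$detected / revert_rates$addressed, 1)

print(revert_rates)
```

The spread of the per-wiki ratios is one reason the breakdowns are treated as preliminary until more edits are published.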

Proportion of people blocked after publishing an edit where Tone Check was shown

Question: Is tone check causing any disruption?

Methodology: We gathered all edits where edit check was shown from the mediawiki_revision_change_tag table and joined with mediawiki_private_cu_changes to gather user name info. We then reviewed both global and local blocks made within 6 hours of the tone check event as identified in the logging table.
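
The 6-hour window described above can be sketched as a join followed by a time-difference filter. This is a minimal illustration under stated assumptions: the `edits` and `blocks` tables, their column names, and the sample rows are all hypothetical stand-ins for the real query output, and the actual query may differ.

```r
library(dplyr)

# Hypothetical inputs: one row per tagged edit and one row per block event.
edits <- tibble(
  user_name = c("A", "B", "C"),
  edit_ts   = as.POSIXct(c("2025-09-10 10:00:00", "2025-09-10 11:00:00",
                           "2025-09-11 09:00:00"), tz = "UTC")
)
blocks <- tibble(
  user_name = c("A", "C"),
  block_ts  = as.POSIXct(c("2025-09-10 12:30:00", "2025-09-12 09:00:00"), tz = "UTC")
)

block_window_hours <- 6  # window used in the methodology above

blocked_within_window <- edits %>%
  left_join(blocks, by = "user_name") %>%  # users without a block get NA block_ts
  mutate(
    hours_to_block = as.numeric(difftime(block_ts, edit_ts, units = "hours")),
    is_blocked = !is.na(hours_to_block) &
      hours_to_block >= 0 &
      hours_to_block <= block_window_hours
  ) %>%
  group_by(user_name) %>%
  summarise(is_blocked = any(is_blocked), .groups = "drop")

print(blocked_within_window)
```

In this toy data only user A counts as blocked: user B was never blocked, and user C was blocked a full day after the edit.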

Code
# load data for assessing blocks
edit_check_blocks <-
  read.csv(
    file = 'Queries/data/edit_check_eligible_users_blocked.csv',
    header = TRUE,
    sep = ",",
    stringsAsFactors = FALSE
  ) 
Code
# relabel the experiment bucket field for readability
edit_check_blocks <- edit_check_blocks %>%
  mutate(test_group = factor(bucket,
         levels = c('2025-09-editcheck-tone-control', '2025-09-editcheck-tone-test'),
         labels = c("control (no tone check)", "test (tone check available)")))
Code
edit_check_local_blocks_overall <- edit_check_blocks %>%
    #filter(user_id == 0) %>%
    group_by(test_group) %>%
    summarise(blocked_users = n_distinct(ip[is_local_blocked == 'True' | is_global_blocked == 'True']), # users with a local or global block
              all_users = n_distinct(ip))  %>%  # all users in the experiment group
    mutate(prop_blocks = paste0(round(blocked_users/all_users * 100, 1), "%")) %>%
    select(-c(blocked_users, all_users)) %>% # removing granular data columns
    gt()  %>%
    tab_header(
    title = "Proportion of users blocked by experiment group"
      )  %>%
  opt_stylize(5) %>%
  cols_label(
    test_group = "Test Group",
    prop_blocks = "Proportion of users blocked"
  )  %>%
    tab_source_note(
        gt::md('Limited to users blocked 6 hours after publishing an edit where tone check was shown')
    )


display_html(as_raw_html(edit_check_local_blocks_overall))
Proportion of users blocked by experiment group

Test Group | Proportion of users blocked
test (tone check available) | 0.9%

Limited to users blocked 6 hours after publishing an edit where tone check was shown

Key Insights

  • In the test group, 0.9% of users were blocked after publishing an edit where at least one tone check was shown, compared to 0% of users in the control group. This difference is not statistically significant.
  • No global blocks were issued to any users who published an edit where at least one tone check was shown.
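
The report does not state which test was used to assess significance. For counts this sparse, Fisher's exact test is one common option, sketched below. The counts are hypothetical and chosen only to roughly match the published rates (about 0.9% versus 0%); they are not the experiment's actual user counts.

```r
# Hypothetical counts for illustration only; the report publishes rates, not counts.
block_counts <- matrix(
  c(1, 109,   # test group: blocked, not blocked
    0, 100),  # control group: blocked, not blocked
  nrow = 2, byrow = TRUE,
  dimnames = list(group = c("test", "control"),
                  outcome = c("blocked", "not_blocked"))
)

# Fisher's exact test avoids the large-sample approximation of a chi-squared test.
test_result <- fisher.test(block_counts)
p_value <- test_result$p.value
is_significant <- p_value < 0.05

print(p_value)
print(is_significant)
```

With a single blocked user in total, a difference of this size cannot be distinguished from chance, which is consistent with the conclusion above.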

Proportion of edits that are published before the model is able to return an evaluation

Question: Is the model unable to evaluate the tone of a published edit quickly enough?

Methodology: In T388716, we added instrumentation (feature: editCheck-tone, action: save-before-check-finalized) to indicate that an edit was published before the model returned an evaluation. These events would not have the editcheck-tone tag applied to indicate whether the published edit includes promotional language.

For this analysis, we reviewed the proportion of all published edits in each test group where this event was logged to determine how frequently this occurs.
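
The code below relies on a per-session `saves_before_finalized` flag. One way such a flag could be derived from raw events is sketched here. The `events` table and its sample rows are hypothetical; only the feature and action names come from the methodology above, and the actual pipeline that produced `edit_check_save_data` may differ.

```r
library(dplyr)

# Hypothetical event log: one row per logged event within an editing session.
# "other-feature" and "other-action" are placeholders, not real event names.
events <- tibble(
  editing_session = c("s1", "s1", "s2", "s3", "s3"),
  feature = c("editCheck-tone", "other-feature", "other-feature",
              "editCheck-tone", "other-feature"),
  action  = c("save-before-check-finalized", "other-action", "other-action",
              "other-action", "other-action")
)

# Flag each session that logged the save-before-check-finalized event at least once.
session_flags <- events %>%
  group_by(editing_session) %>%
  summarise(
    saves_before_finalized = as.integer(any(
      feature == "editCheck-tone" & action == "save-before-check-finalized",
      na.rm = TRUE
    )),
    .groups = "drop"
  )

# Share of sessions published before the model returned an evaluation.
saves_before_rate <- mean(session_flags$saves_before_finalized)

print(session_flags)
```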

Overall by Experiment Group

Code
saves_before_rate_overall <- edit_check_save_data %>%
    group_by(test_group) %>%
    summarise(n_edits = n_distinct(editing_session),
              n_prior_saves = n_distinct(editing_session[saves_before_finalized == 1])) %>% 
    mutate(saves_before_rate = paste0(round(n_prior_saves /n_edits * 100, 1), "%")) %>%
    mutate(n_prior_saves = if_else(n_prior_saves < 50, "<50", as.character(n_prior_saves))) %>% # sanitizing per data publication guideline; always returns character
    select(-n_edits) %>% # drop the raw edit count
    gt()  %>%
    tab_header(
    title = "Edits published before the model returns an evaluation"
      )  %>%
    opt_stylize(5) %>%
  cols_label(
    test_group = "Test Group",
    #n_edits = "Number of published edits",
    n_prior_saves = "Number of edits published before check finalized",
    saves_before_rate = "Proportion of edits published before check finalized"
  ) 



display_html(as_raw_html(saves_before_rate_overall))
Edits published before the model returns an evaluation

Test Group | Number of edits published before check finalized | Proportion of edits published before check finalized
control (no tone check shown) | 249 | 1.2%
test (tone check shown) | <50 | 0.1%

By platform

Code
saves_before_rate_byplatform <- edit_check_save_data %>%
    group_by(platform, test_group) %>%
    summarise(n_edits = n_distinct(editing_session),
              n_prior_saves = n_distinct(editing_session[saves_before_finalized == 1])) %>% 
    mutate(saves_before_rate = paste0(round(n_prior_saves /n_edits * 100, 1), "%")) %>% 
    mutate(n_prior_saves = if_else(n_prior_saves < 50, "<50", as.character(n_prior_saves))) %>% # sanitizing per data publication guideline; type-stable across platform groups
    select(-n_edits) %>% # drop the raw edit count
    gt()  %>%
    tab_header(
    title = "Edits published before the model returns an evaluation by platform"
      )  %>%
    opt_stylize(5) %>%
  cols_label(
    test_group = "Test Group",
    platform = "Platform",
    #n_edits = "Number of published edits",
    n_prior_saves = "Number of edits published before check finalized",
    saves_before_rate = "Proportion of edits published before check finalized"
  ) 



display_html(as_raw_html(saves_before_rate_byplatform))
Edits published before the model returns an evaluation by platform

Test Group | Number of edits published before check finalized | Proportion of edits published before check finalized
desktop
  control (no tone check shown) | 166 | 1.3%
  test (tone check shown) | <50 | 0.1%
phone
  control (no tone check shown) | 83 | 1%
  test (tone check shown) | <50 | 0%

By partner Wikipedia

Code
saves_before_rate_bywiki <- edit_check_save_data %>%
    group_by(wiki, test_group) %>%
    summarise(n_edits = n_distinct(editing_session),
              n_prior_saves = n_distinct(editing_session[saves_before_finalized == 1])) %>% 
    mutate(saves_before_rate = paste0(round(n_prior_saves /n_edits * 100, 1), "%")) %>% 
    mutate(n_prior_saves = if_else(n_prior_saves < 50, "<50", as.character(n_prior_saves))) %>% # sanitizing per data publication guideline; type-stable across wiki groups
    select(-n_edits) %>% # drop the raw edit count
    gt()  %>%
    tab_header(
    title = "Edits published before the model returns an evaluation by Wikipedia"
      )  %>%
    opt_stylize(5) %>%
  cols_label(
    test_group = "Test Group",
    wiki = "Wikipedia",
    #n_edits = "Number of published edits",
    n_prior_saves = "Number of edits published before check finalized",
    saves_before_rate = "Proportion of edits published before check finalized"
  )



display_html(as_raw_html(saves_before_rate_bywiki))
Edits published before the model returns an evaluation by Wikipedia

Test Group | Number of edits published before check finalized | Proportion of edits published before check finalized
frwiki
  control (no tone check shown) | 71 | 0.6%
  test (tone check shown) | <50 | 0.1%
jawiki
  control (no tone check shown) | 156 | 2.1%
  test (tone check shown) | <50 | 0.1%
ptwiki
  control (no tone check shown) | <50 | 0.8%
  test (tone check shown) | <50 | 0%

Key Insights

  • About 0.6% of all published edits (264 edits) in the AB test were saved before the model returned an evaluation.
  • The majority of these edits occurred in the control group and on desktop.
  • This occurs very infrequently in the test group. Only 0.1% of all published edits in the test group (all on desktop) were saved before the model returned an evaluation. This is expected based on comments documented in T388716#10911327.
  • We’ve observed a slightly higher rate of this occurring on Japanese Wikipedia (2.1% of control-group edits) compared to the other two partner Wikipedias (0.6% on French and 0.8% on Portuguese Wikipedia).
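
The tables in this section mask any count below 50 as "<50", following the data publication guideline noted in the code comments. A small helper like the one below is one way to apply that rule consistently. The function name and `threshold` parameter are assumptions made for this sketch and are not part of the original analysis.

```r
# Hypothetical helper: mask small counts before publishing a table.
# Both branches return character, so the result type does not depend on the data.
sanitize_count <- function(n, threshold = 50) {
  ifelse(n < threshold, paste0("<", threshold), as.character(n))
}

masked <- sanitize_count(c(249, 15, 71))
print(masked)
```

Converting the unmasked branch to character matters in grouped pipelines: if one group has no small counts, a mixed numeric and character result can fail to combine across groups.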