Tone Check AB Test Report

Published

February 6, 2026

Modified

April 10, 2026

Overview

The Wikimedia Foundation’s Editing team is working on a set of improvements for the visual editor to help new volunteers understand and follow some of the policies necessary to make constructive changes to Wikipedia projects.

In this AB test, we are evaluating the impact of Tone Check. Tone Check is an Edit Check that uses a language model to prompt people adding promotional, derogatory, or otherwise subjective language to consider “neutralizing” the tone of what they are writing. Tone Check is the first Edit Check that uses machine learning: in this case, a BERT language model initially selected and fine-tuned by the Research team to identify biased language within the new text people are attempting to publish to Wikipedia.

This A/B test will help us make the following decision:

What – if any – changes in the Tone Check UX, and/or the model that enables it, will we make before we can be confident in the following?

  • Newcomers and Junior Contributors that encounter Tone Check are more likely to publish new content edits in the main namespace that are devoid of biased language.
  • Newcomers and Junior Contributors will intuitively interact with the Tone Check experience in ways that are NOT disruptive to them or the wikis.

This work is guided by the Wikimedia Foundation Annual Plan, specifically by the Wiki Experiences 1.1 objective key result: Increase the rate at which editors with ≤100 cumulative edits publish constructive edits on mobile web by 4%, as measured by controlled experiments (by the end of Q2).

You can find more details about this check on the Project Page.

The Tone Check A/B test was deployed on 3 September 2025 to French, Japanese, and Portuguese Wikipedias.

Methodology

AB Test Design

The team ran an AB test from 3 September 2025 through 28 January 2026 to determine the impact of presenting Tone Check to eligible editing sessions and evaluate the extent to which the feature, in its current form, warrants being deployed to all wikis.

Specifically, we want to test the following hypothesis:

  • If we prompt newcomers and Junior Contributors to reconsider the tone they are writing in when software detects them using what experienced volunteers would agree is non-neutral/peacock language, then we will decrease the percentage of new content edits newcomers publish that are reverted on the grounds of WP:NPOV (and related policies).

During this experiment, 50% of users editing a main namespace page with Visual Editor on desktop or mobile web were randomly assigned to the test group and could be shown Tone Check if their edit met the specified requirements. The other 50% were randomly assigned to the control group and could not be shown Tone Check.

The test included all mobile web and desktop contributors (both registered and unregistered) to the 3 participating wikis that started an edit with Visual Editor. Users remained in the same test group for the duration of the test. We limited the analysis to edits completed by unregistered users and users with 100 or fewer edits as those are the users that would be shown Tone Check under the default config settings.
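The analysis population described above can be sketched with a simple filter. This is a minimal illustration in base R on made-up rows; the column names `user_status` and `user_edit_count` follow the dataset used later in this report, but the values and the object names `sessions` and `analysis_sessions` are hypothetical.

```r
# Toy sessions; values are invented for illustration only.
sessions <- data.frame(
  editing_session = c("s1", "s2", "s3", "s4"),
  user_status     = c("unregistered", "registered", "registered", "registered"),
  user_edit_count = c(0, 0, 57, 450),
  stringsAsFactors = FALSE
)

# Keep unregistered users and registered users with 100 or fewer edits.
in_analysis <- sessions$user_status == "unregistered" |
  sessions$user_edit_count <= 100

analysis_sessions <- sessions[in_analysis, ]
analysis_sessions$editing_session
```

In this toy data only session `s4` (450 prior edits) falls outside the analysis population.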

Figure 1: Tone Check AB Test Bucketing Overview

As shown in Figure 1, not all edits bucketed in the AB test experiment met the requirements for being shown Tone Check. Tone Check was shown in about 11% of all published new content edits in the test group (989 edits). It was shown at similar rates on both desktop and mobile web.

In this analysis, we compared all new content edits that were shown Tone Check to edits that were eligible but not shown Tone Check in the control group (based on instrumentation added in this task). This comparison was done to ensure the analysis is focused on the actual effects of the feature.

Evaluation Plan

We used a set of primary and secondary metrics to evaluate the impact of this feature. We also reviewed a set of guardrails to ensure that Tone Check was not disruptive to the contributor or to the Wikipedias. These metrics are documented in the task.

For each metric, we reviewed the following dimensions: overall by experiment group (test and control), by platform (mobile web or desktop), by user experience and status, and by partner Wikipedia. We also reviewed some indicators such as edit completion rate by the number of checks shown within a single editing session to determine if there was a significant impact at a certain number of checks presented.

Note: For the user experience analysis, we split newer editors into three experience level groups: (1) unregistered, (2) newcomer (registered user making their first edit on Wikipedia), and (3) Junior Contributor (user that has made between 1 and 100 edits).

Please refer to the data collection notebook for more details on the steps to collect the data reviewed in this report.

Summary of Results

New content edits published without biased language

  • Tone Check successfully decreases the frequency of non-neutral language in published content. Users with access to Tone Check were 15.6% less likely to publish edits containing non-neutral language than users in the control group (falling from 9.6% to 8.1%; a -1.5 pp decrease). We have 99.8% confidence that this improvement is directly attributable to the tool.
  • However, Tone Check’s level of impact depends heavily on the platform. Results confirm a highly significant impact on Desktop, where we observed the largest reduction in non-neutral language. In contrast, there was no detectable effect on Mobile Web so far.
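The report quotes each change both as a relative change (%) and as a percentage-point (pp) difference. As a minimal sketch of how the two relate, using the overall rates quoted above (the helper names `pp_change` and `relative_change` are hypothetical and not part of the report's code):

```r
# Percentage-point change: the simple difference between two rates (in %).
pp_change <- function(control, test) test - control

# Relative change: the same difference expressed as a share of the control rate.
relative_change <- function(control, test) (test - control) / control * 100

control_rate <- 9.6  # % of control edits with non-neutral language (from the report)
test_rate    <- 8.1  # % of test edits with non-neutral language (from the report)

round(pp_change(control_rate, test_rate), 1)        # -1.5
round(relative_change(control_rate, test_rate), 1)  # -15.6
```

A small percentage-point difference can correspond to a much larger relative change when the baseline rate is low, which is why both figures are reported.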

New content edits revert rate

  • Edits made by users shown Tone Check are also 15% less likely to be reverted than eligible control edits (29.5% → 25.1%; a -4.4 pp decrease).
  • This reduction is primarily driven by Junior Contributors. While we observed a statistically significant -33% relative [-10.2 pp] decrease in reverts for Junior Contributors, we did not confirm any change in the revert rate of newcomers or unregistered users. These trends indicate that Tone Check may be more effective for people who have already completed at least one edit on Wikipedia. Since these users are more experienced, their edits are less likely to be reverted for other policy violations compared to registered users completing their first edit or unregistered users.

New Content edit revert rate: impact of removing non-neutral language

  • When a user removes non-neutral language in response to a Tone Check, the likelihood of that edit being reverted decreases significantly. Across both platforms, there was a -44.1% decrease in the revert rate for edits where the prompt was addressed. This confirms that Tone Check is highly effective at helping people identify and correct edits that would otherwise be reverted.
  • We observed decreases on both platforms, but there is a larger impact on desktop compared to mobile web. On desktop, we observed a significant -47% decrease [-13.4 pp] in revert rate for people who revised their text in response to Tone Check.
  • On mobile web, there was a -14.8% [-4.8 pp] decrease in revert rate for edits where non-neutral language was removed. Mobile web edits appear to be inherently trickier for newcomers and are still more likely to be reverted compared to desktop edits, even when non-neutral language is removed.

Edit Completion Rate

  • Tone Check does not appear to be causing any significant disruption to most people’s editing experience. Edit completion rates for people shown Tone Check decreased only slightly, by -3.2% (-1.6 percentage points). This decrease was primarily concentrated on Desktop (-2.6%), with no significant change on Mobile Web.
    • The decrease in completion rate does not exceed 10% until more than 10 tone checks are presented in a single editing session. For these edits, edit completion rate decreased to 44.3% (a -12% decrease from the control). These edits represent only 3% of edits and are potentially low-quality edits that we’d want to deter.
  • While completion rates slightly decreased for newcomers and unregistered users, they slightly increased for Junior Contributors, suggesting the check is encouraging and helps a portion of people complete their edit successfully.

Constructive Edit Rate

  • Tone Check improved the rate of constructive edits by +6.2% [4.4 percentage points]. We observed improvements in overall edit quality at each of the three partner Wikipedias.
  • Aligned with the revert rate findings, the magnitude of impact varies by platform. On desktop, constructive edit rate increased by +6.4% while we observed no statistically significant change in mobile web constructive edits.
  • Tone Check appears especially effective at increasing the constructive edit rate of registered Junior Contributors, where we observed a +14.8% increase [10.2 pp] in constructive edit rates. When limited to desktop edits, there was a +19.7% increase in constructive edits by Junior Contributors.

Retention Rate

  • We further found that people shown Tone Check were more likely to return, indicating that the feature results in a positive editing experience for most contributors.
  • People who encountered Tone Check are 24% more likely to return to make a constructive edit in their second week. Retention rates increased from 5.8% to 7.2% when Tone Check was shown (+1.4 percentage points).
  • We observed increases for both mobile web and desktop users and across all user types as well.

Guardrails. Tone Check is not causing significant disruption on either desktop or mobile web based on analysis of identified guardrails. The decline rate is lower than other existing Edit Checks, and there was no spike in user blocks or revert rates.

Code
# load packages
shhh <- function(expr) suppressPackageStartupMessages(suppressWarnings(suppressMessages(expr)))
shhh({
    library(lubridate)
    library(ggplot2)
    library(dplyr)
    library(gt)
    library(IRdisplay)
    library(tidyr)
      # Modeling completed using the relax package developed by Mikhail Popov (WMF)
    library(relax) # https://gitlab.wikimedia.org/repos/product-analytics/experimentation-lab/relax
    set.seed(5)
    
})
#set preferences
options(dplyr.summarise.inform = FALSE)
options(repr.plot.width = 15, repr.plot.height = 10)

# colorblind-friendly palette:
cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")

Data Cleaning

Code
# load tone check save data (initial dataset)
tone_check_publish_data_1 <-
  read.csv(
    file = 'data/tone_check_save_data_AB.tsv',
    header = TRUE,
    sep = "\t",
    stringsAsFactors = TRUE
  ) 

# load tone check save data (second dataset)
# Second dataset was created to obtain updated event data while preserving the initial aggregated dataset that could no longer
# be queried in Data Lake due to data retention policies. 
tone_check_publish_data_2 <-
  read.csv(
    file = 'data/tone_check_save_data_AB_pt2.tsv',
    header = TRUE,
    sep = "\t",
    stringsAsFactors = TRUE
  ) 

# Combine the two datasets

tone_check_publish_data <- rbind(tone_check_publish_data_1, tone_check_publish_data_2)
Code
# Cleaning up dataset and renaming fields to clarify meanings

# Set experience level group and factor levels
tone_check_publish_data <- tone_check_publish_data  |>
  mutate(
    experience_level_group = case_when(
     user_edit_count == 0 & user_status == 'registered' ~ 'Newcomer',
     user_edit_count == 0 & user_status == 'unregistered' ~ 'Unregistered',
      user_edit_count > 0 &  user_edit_count <= 100 ~ "Junior Contributor",
      user_edit_count >  100 ~ "Non-Junior Contributor"   # these users should already be filtered out of the dataset; kept as a sanity check
    ),
    experience_level_group = factor(experience_level_group,
         levels = c("Unregistered","Newcomer", "Non-Junior Contributor", "Junior Contributor")
   ))  

#rename test group field to clarify groups
tone_check_publish_data <- tone_check_publish_data  |>
  mutate(test_group = factor(test_group,
         levels = c('2025-09-editcheck-tone-control', '2025-09-editcheck-tone-test'),
         labels = c("control (eligible but not shown tone check)", "test (tone check shown)")))


#rename platform from phone to mobile web to clarify meaning
tone_check_publish_data <- tone_check_publish_data  |>
  mutate(platform = factor(platform,
         levels = c('phone', 'desktop'),
         labels = c("mobile web", "desktop")))


# rename Wiki values to human readable form
wiki_name_lookup <- c(
  "jawiki" = "Japanese Wikipedia",
  "ptwiki" = "Portuguese Wikipedia",
  "frwiki" = "French Wikipedia"
)

tone_check_publish_data <- tone_check_publish_data %>%
  mutate(
    wiki = recode(wiki, !!!wiki_name_lookup)
  )
Code
#Set fields and factor levels to assess number of checks shown

tone_check_publish_data <- tone_check_publish_data  |>
  mutate(
    multiple_checks_shown = case_when(
         test_group == "test (tone check shown)" & n_checks_shown == 1 ~ "one tone check",
         test_group == "test (tone check shown)" & n_checks_shown > 1 ~ "multiple tone checks",
         TRUE ~ "no tone checks" #default if no conditions met
    ) ,
     multiple_checks_shown = factor(multiple_checks_shown ,
         levels = c('no tone checks', 'one tone check', 'multiple tone checks')
        ))
         
# note these buckets can be adjusted as needed based on distribution of data
tone_check_publish_data <- tone_check_publish_data  |>
  mutate(
    checks_shown_bucket = case_when(
     test_group == "test (tone check shown)" & is.na(n_checks_shown) ~ '0',
     test_group == "test (tone check shown)" & n_checks_shown == 1   ~ '1', 
     test_group == "test (tone check shown)" & n_checks_shown == 2  ~ '2',
     test_group == "test (tone check shown)" & n_checks_shown > 2 & n_checks_shown <= 5  ~ "3-5",
     test_group == "test (tone check shown)" & n_checks_shown > 5 & n_checks_shown <= 10  ~ "6-10", 
     test_group == "test (tone check shown)" & n_checks_shown > 10 ~ "over 10" 
    ),
    checks_shown_bucket = factor(checks_shown_bucket ,
         levels = c("0","1","2", "3-5", "6-10", "over 10")
   ))  

# define set of all eligible edits to review (eligible in control and shown tone check in test)
# Note: there are 5 edits in the control group that were identified as eligible in VEFU instrumentation
# but did not have the eligible tag applied
tone_check_publish_data <- tone_check_publish_data |>
    mutate(is_test_eligible  = ifelse(
        (test_group == 'test (tone check shown)' & was_tone_check_shown_tag == 1) |
        (test_group == 'control (eligible but not shown tone check)' & is_tone_check_eligible == 1) , 'eligible', 'not eligible'),
      is_test_eligible = 
      factor(
      is_test_eligible,
      levels = c("eligible",  "not eligible" )
  )) 


# use tone check eligible tag to define test edits where tone check was addressed (tone_check_eligible == 0)

tone_check_publish_data <- tone_check_publish_data  |>
    mutate(is_tone_check_addressed  = case_when(
        test_group == 'control (eligible but not shown tone check)' & is_tone_check_eligible == 1  ~ 'Eligible control edits',
        test_group == 'test (tone check shown)' & was_tone_check_shown_tag == 1 & is_tone_check_eligible == 0 ~ 'Tone check shown and addressed',
        TRUE ~ "Tone check shown but not addressed"),
        is_tone_check_addressed = factor(
      is_tone_check_addressed,
      levels = c('Eligible control edits', 'Tone check shown but not addressed', 'Tone check shown and addressed')
  )) 

# We also removed all edits that were published before the model returned an evaluation.
# These events would not have the `editcheck-tone` tag applied to indicate if the published edit 
# includes promotional language. 
# This was done using events added in [T388716](https://phabricator.wikimedia.org/T388716#10872915).

tone_check_publish_data <- tone_check_publish_data |>
    filter(was_saved_before_check == 0)
 

New content edits published without biased language (Primary Metric)

Hypothesis: The quality of new content edits newcomers and Junior Contributors make in the main namespace will increase because a greater percentage of these edits will not contain non-neutral language.

Methodology: As part of this hypothesis, we first evaluated if Tone Check reduces the frequency of non-neutral language in published edits.

We reviewed the proportion of all new content edits published without biased language (identified by the editcheck-tone tag, created in T388716 to identify when the model detected non-neutral language at the time of publishing).

Overall

Code
tone_issue_edits_overall <- tone_check_publish_data  |>
    filter(is_new_content ==1 ) |> #limit to new content edits
    group_by(test_group)  |>
    summarise(n_edits = n_distinct(editing_session),
              n_tone_issues = n_distinct(editing_session[is_tone_check_eligible == 1])) |> # tone issues detected
    mutate(non_neutral_language_rate = paste0(round(n_tone_issues/n_edits * 100, 1), "%")) 
Code
# plot visualization of non-neutral edits
dodge <- position_dodge(width=0.9)

p <- tone_issue_edits_overall |>
    ggplot(aes(x= test_group, y = n_tone_issues/n_edits, fill = test_group)) +
    geom_col(position = 'dodge') +
    scale_y_continuous(labels = scales::percent) +
    geom_text(aes(label = paste(non_neutral_language_rate, "\n", n_tone_issues,"edits \n with non-neutral language"), 
                  fontface=2), vjust=1.2, size = 8, color = "white") +
    scale_fill_manual(values= c("#999999", "dodgerblue4"), name = "Experiment Group")  +
    scale_x_discrete(breaks = c("control (eligible but not shown tone check)",
                                "test (tone check shown)"),
                   labels = c("Control (no tone check)", "Test (tone check available)")) + #renaming as this metric is not limited to shown tone checks
    labs (y = "Percent of new content edits ",
          x = "Experiment Group",
          title = "New content edits with non-neutral language",
          caption = "Limited to published new content edits")  +
    theme(
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        plot.title = element_text(hjust = 0.5),
        text = element_text(size=24),
        axis.text.x = element_text(size = 24),
        axis.title.x = element_text(margin = margin(t = 20, unit = "pt")),
        legend.position= "none",
        axis.line = element_line(colour = "black")) 
      
p

Tone Check successfully decreases the prevalence of non-neutral language in published content. Across both platforms, there was a -15.6% decrease [-1.5 percentage points] in the proportion of new content edits published with non-neutral language for the test group where Tone Check was available.

Note: The rate observed for the control (9.6%) is similar to the rates we observed in an initial baseline analysis estimating the frequency of these types of edits and rates identified in the leading indicator analysis.

By Platform

Code
tone_issue_edits_byplatform <- tone_check_publish_data  |>
    filter(is_new_content ==1) |>
    group_by(platform, test_group)  |>
    summarise(n_edits = n_distinct(editing_session),
              n_tone_issues = n_distinct(editing_session[is_tone_check_eligible == 1])) |> # tone issues detected
    mutate(non_neutral_language = paste0(round(n_tone_issues/n_edits * 100, 1), "%"))  |>
   select(-c(3,4)) %>% # removing granular data columns
    gt()  |>
    tab_header(
    title = md("New Content edits with non-neutral language by\nplatform")
      )  |>
  opt_stylize(5) |>
  cols_label(
    platform  = "Platform",
    test_group  = "Experiment Group",
    #n_edits = "Number of published edits",
    #n_tone_issues = "Number of edits with non-neutral language",
    non_neutral_language = "Proportion of edits with non-neutral language"
  )  |>
    tab_source_note(
        gt::md('Limited to published new content edits')
    )


display_html(as_raw_html(tone_issue_edits_byplatform))

New Content edits with non-neutral language by platform

Platform    Experiment Group                             Proportion of edits with non-neutral language
mobile web  control (eligible but not shown tone check)  9.2%
mobile web  test (tone check shown)                      9.4%
desktop     control (eligible but not shown tone check)  9.7%
desktop     test (tone check shown)                      7.6%

Limited to published new content edits

Trends vary by platform. On desktop, we observed a -21% decrease [-2 pp] in the proportion of edits with non-neutral language. There was no statistically significant change on mobile web.

By User Experience

Code
tone_issue_edits_byuserexp <- tone_check_publish_data  |>
    filter(is_new_content ==1) |>
    group_by(experience_level_group, test_group)  |>
    summarise(n_edits = n_distinct(editing_session),
              n_tone_issues = n_distinct(editing_session[is_tone_check_eligible == 1])) |> # tone check issues detected
    mutate(non_neutral_language = paste0(round(n_tone_issues/n_edits * 100, 1), "%"))  |>
   select(-c(3,4)) %>% # removing granular data columns
    gt()  |>
    tab_header(
    title = "New content edits with non-neutral language by user experience"
      )  |>
  opt_stylize(5) |>
  cols_label(
    experience_level_group  = "User Experience",
    test_group  = "Experiment Group",
    #n_edits = "Number of published edits",
    #n_tone_issues = "Number of edits with non-neutral language",
    non_neutral_language = "Proportion of edits with non-neutral language"
  )  |>
    tab_source_note(
        gt::md('Limited to published new content edits')
    )


display_html(as_raw_html(tone_issue_edits_byuserexp ))

New content edits with non-neutral language by user experience

User Experience     Experiment Group                             Proportion of edits with non-neutral language
Unregistered        control (eligible but not shown tone check)  13.2%
Unregistered        test (tone check shown)                      12.5%
Newcomer            control (eligible but not shown tone check)  13.1%
Newcomer            test (tone check shown)                      10.6%
Junior Contributor  control (eligible but not shown tone check)  8%
Junior Contributor  test (tone check shown)                      6.6%

Limited to published new content edits

Tone Check decreases the frequency of non-neutral language for all reviewed user types.

We saw the largest absolute decrease in the proportion of non-neutral edits published by newcomers (-19% [-2.5 pp]). Unregistered users saw the smallest change (-5.3% [-0.7 pp]).

By Wikipedia

Code
tone_issue_edits_bywiki <- tone_check_publish_data  |>
   filter(is_new_content ==1) |>
    group_by(wiki, test_group)  |>
    summarise(n_edits = n_distinct(editing_session),
              n_tone_issues = n_distinct(editing_session[is_tone_check_eligible == 1])) |> # tone issues detected
    mutate(non_neutral_language = paste0(round(n_tone_issues/n_edits * 100, 1), "%"))  |>
    select(-c(3,4)) %>% # removing granular data columns
    gt()  |>
    tab_header(
    title = "New content edits with non-neutral language by Wikipedia"
      )  |>
  opt_stylize(5) |>
  cols_label(
    wiki  = "Wikipedia",
    test_group  = "Experiment Group",
    #n_edits = "Number of published edits",
    #n_tone_issues = "Number of edits with non-neutral language",
    non_neutral_language = "Proportion of edits with non-neutral language"
  )  |>
    tab_source_note(
        gt::md('Limited to new content edits')
    )


display_html(as_raw_html(tone_issue_edits_bywiki ))

New content edits with non-neutral language by Wikipedia

Wikipedia             Experiment Group                             Proportion of edits with non-neutral language
French Wikipedia      control (eligible but not shown tone check)  11.5%
French Wikipedia      test (tone check shown)                      10.5%
Japanese Wikipedia    control (eligible but not shown tone check)  6.7%
Japanese Wikipedia    test (tone check shown)                      3.2%
Portuguese Wikipedia  control (eligible but not shown tone check)  6.9%
Portuguese Wikipedia  test (tone check shown)                      6.1%

Limited to new content edits

We also observed decreases in the proportion of edits with non-neutral language at each partner Wikipedia. At Japanese Wikipedia, there was a significant -52.2% decrease [-3.5 pp] in the proportion of edits with non-neutral language when Tone Check was shown to eligible edits.

Confirming the impact of Tone Check on edits published without biased language

We analyzed the above results using two complementary statistical frameworks (Bayesian and Frequentist) to correctly infer the impact of offering Tone Check on decreasing the likelihood a new content edit includes biased language when published. This allows us to confirm if the observed changes detailed above are statistically significant (did not occur due to random chance).

Since multiple edits can be made by the same user, we first calculated the rates for each user (proportion of all edits saved by a user that include non-neutral language).
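To illustrate why the unit of analysis matters, here is a minimal base R sketch on made-up data. The column `has_tone_issue` and all values are hypothetical; the point is only that a pooled edit-level rate and an average of per-user rates can differ when a few users contribute many edits.

```r
# Toy edit-level data: user "a" made four edits, users "b" and "c" one each.
edits <- data.frame(
  user_id        = c("a", "a", "a", "a", "b", "c"),
  has_tone_issue = c(1, 1, 1, 1, 0, 0),
  stringsAsFactors = FALSE
)

# Pooled rate: every edit counts equally, so the prolific user dominates.
pooled_rate <- mean(edits$has_tone_issue)

# Per-user rate: one proportion per user, then averaged across users.
per_user_rate  <- tapply(edits$has_tone_issue, edits$user_id, mean)
mean_user_rate <- mean(per_user_rate)

c(pooled = pooled_rate, per_user = mean_user_rate)
```

Aggregating to one rate per user keeps a single prolific contributor from dominating the comparison and matches the user-level randomization of the experiment.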

Note: This is an implementation of the Bayesian and Frequentist engines also used in Test Kitchen’s automated analytics.

Code
# calculate the proportion for each user

tone_issue_edits_overall_byuser <- tone_check_publish_data  |>
    filter(is_new_content ==1) |>
    group_by(test_group, platform, user_id)  |>
    summarise(n_edits = n_distinct(editing_session),
              n_tone_issues = n_distinct(editing_session[is_tone_check_eligible == 1])) |> # tone issues detected
    mutate(non_neutral_language_rate = n_tone_issues/n_edits) 
Code
# rename field names to align with relax package naming convention
tone_issue_edits_overall_byuser <- tone_issue_edits_overall_byuser |>
  mutate(variation = factor(test_group,
         levels = c("control (eligible but not shown tone check)", "test (tone check shown)"),
         labels = c("control", "treatment")))

tone_issue_edits_overall_byuser$outcome = tone_issue_edits_overall_byuser$non_neutral_language_rate 
Code
overall_impact_toneissues <- tone_issue_edits_overall_byuser |>
    analyze_relative_lift(metric_type = "proportion") |>
    gt() |>
  tab_header(
    title = md("**Evaluating Tone Check impact on edits published with non-neutral language**"),
    subtitle = md("Difference in Metric (Test Group - Control Group)")
  ) |>
  tab_spanner(
    label = md("**Bayesian Analysis**"),
    columns = c(estimate_bayes, chance_to_win, cred_lower, cred_upper)
  ) |>
  tab_spanner(
    label = md("**Frequentist Analysis**"),
    columns = c(estimate_freq, p_value, conf_lower, conf_upper)
  ) |>
  
  # Rename Columns for clarity ---
  cols_label(
    estimate_bayes = md("Point Estimate"),
    chance_to_win = md("Chance to Win"),
    cred_lower = md("95% CI Lower"),
    cred_upper = md("95% CI Upper"),
    estimate_freq = md("Point Estimate"),
    p_value = md("*p*-value"),
    conf_lower = md("95% CI Lower"),
    conf_upper = md("95% CI Upper")
  ) |>
  
  # Apply formatting (decimals) ---
  fmt_number(
    columns = everything(),
    decimals = 3 # Use 3 decimals for precision
  ) |>
  
  # Footnote explaining how to read the 95% intervals ---
  tab_footnote(
    footnote = md("The 95% intervals do not cross zero indicating the results are statistically significant."),
    locations = cells_column_labels(columns = c(cred_lower, conf_lower))
  ) %>%
  
  #  Style the table ---
  tab_options(
    table.border.top.color = "lightgray",
    column_labels.border.bottom.color = "black",
    column_labels.border.bottom.width = px(2),
    data_row.padding = px(5)
  )

display_html(as_raw_html(overall_impact_toneissues))

Evaluating Tone Check impact on edits published with non-neutral language
Difference in Metric (Test Group - Control Group)

Analysis     Point Estimate  Chance to Win  p-value  95% CI Lower [1]  95% CI Upper
Bayesian     −0.136          0.002          n/a      −0.228            −0.043
Frequentist  −0.139          n/a            0.004    −0.232            −0.046

[1] The 95% intervals do not cross zero indicating the results are statistically significant.

Analysis of the A/B test data confirms a statistically significant reduction in non-neutral language when both platforms are combined. We have high confidence (99.8%) that this effect is driven by Tone Check.
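For readers unfamiliar with the Bayesian columns, the following is a rough base R sketch of how a relative lift, a 95% credible interval, and a "chance to win" style probability can be approximated from two proportions. It is not the relax package's implementation: it uses a simple Beta-Binomial model on pooled counts rather than per-user rates, the counts are hypothetical, and it assumes "chance to win" means the posterior probability that the treatment value exceeds the control value.

```r
set.seed(5)

# Hypothetical counts for illustration only (not the experiment's data).
control_edits <- 5000; control_issues <- 480
test_edits    <- 5000; test_issues    <- 405

# Beta(1, 1) prior updated with observed issue / no-issue counts.
n_draws   <- 100000
p_control <- rbeta(n_draws, 1 + control_issues, 1 + control_edits - control_issues)
p_test    <- rbeta(n_draws, 1 + test_issues,    1 + test_edits - test_issues)

# Relative lift of test over control for each posterior draw.
relative_lift <- (p_test - p_control) / p_control

point_estimate   <- mean(relative_lift)
credible_95      <- quantile(relative_lift, c(0.025, 0.975))
# Share of posterior draws in which the test rate is higher than control.
prob_test_higher <- mean(p_test > p_control)

list(point_estimate = point_estimate,
     credible_95 = credible_95,
     prob_test_higher = prob_test_higher)
```

Because a lower rate of non-neutral language is the desired outcome here, a value of `prob_test_higher` near zero would correspond to high confidence that the treatment reduced the rate.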

Code
# check by platform numbers

platform_impact_toneissues <- tone_issue_edits_overall_byuser |>
    group_by(platform) |>
    group_modify(~ analyze_relative_lift(.x, metric_type = "proportion"))|>
    gt() |>
  tab_header(
    title = md("**Evaluating Tone Check impact on edits published in non-neutral language**"),
    subtitle = md("Difference in Metric (Test Group - Control Group)")
  ) |>
  tab_spanner(
    label = md("**Bayesian Analysis**"),
    columns = c(estimate_bayes, chance_to_win, cred_lower, cred_upper)
  ) |>
  tab_spanner(
    label = md("**Frequentist Analysis**"),
    columns = c(estimate_freq, p_value, conf_lower, conf_upper)
  ) |>
  
  # Rename Columns for clarity ---
  cols_label(
    platform = md("Platform"),
    estimate_bayes = md("Point Estimate"),
    chance_to_win = md("Chance to Win"),
    cred_lower = md("95% CI Lower"),
    cred_upper = md("95% CI Upper"),
    estimate_freq = md("Point Estimate"),
    p_value = md("*p*-value"),
    conf_lower = md("95% CI Lower"),
    conf_upper = md("95% CI Upper")
  ) |>
  
  # Apply formatting (decimals) ---
  fmt_number(
    columns = everything(),
    decimals = 3 # Use 3 decimals for precision
  ) |>
  
  # Footnote explaining how to read the 95% intervals ---
  tab_footnote(
    footnote = md("The 95% intervals do not cross zero indicating results are statistically significant."),
    locations = cells_column_labels(columns = c(cred_lower, conf_lower))
  ) %>%
  
  #  Style the table ---
  tab_options(
    table.border.top.color = "lightgray",
    column_labels.border.bottom.color = "black",
    column_labels.border.bottom.width = px(2),
    data_row.padding = px(5)
  )

display_html(as_raw_html(platform_impact_toneissues))

Evaluating Tone Check impact on edits published in non-neutral language
Difference in Metric (Test Group - Control Group)

Platform    Analysis     Point Estimate  Chance to Win  p-value  95% CI Lower [1]  95% CI Upper
mobile web  Bayesian     −0.013          0.444          n/a      −0.192            0.167
mobile web  Frequentist  −0.014          n/a            0.883    −0.203            0.174
desktop     Bayesian     −0.186          0.000          n/a      −0.291            −0.081
desktop     Frequentist  −0.192          n/a            0.000    −0.299            −0.086

[1] The 95% intervals do not cross zero indicating results are statistically significant.

However, Tone Check’s effectiveness depends heavily on the platform. Results confirm a highly significant impact on Desktop (p < 0.001), where the reduction was most pronounced. In contrast, there was no detectable effect on Mobile Web (p = 0.883), suggesting that people respond to Tone Check differently when making mobile edits.
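The platform-level conclusion can be read directly off the interval bounds. As a small sketch in base R, using the frequentist estimates and 95% confidence bounds reported in the table above (the data frame and the `excludes_zero` column are hypothetical names introduced here):

```r
# Frequentist estimates and 95% confidence bounds copied from the table above.
platform_results <- data.frame(
  platform   = c("mobile web", "desktop"),
  estimate   = c(-0.014, -0.192),
  conf_lower = c(-0.203, -0.299),
  conf_upper = c(0.174, -0.086),
  stringsAsFactors = FALSE
)

# An interval lying entirely on one side of zero excludes "no effect".
platform_results$excludes_zero <-
  platform_results$conf_lower > 0 | platform_results$conf_upper < 0

platform_results[, c("platform", "excludes_zero")]
```

Only the desktop interval excludes zero, which matches the reading that the effect is confirmed on desktop but not detectable on mobile web.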

New Content Edit Revert Rate (Primary Metric)

Hypothesis: The quality of new content edits newcomers and Junior Contributors make in the main namespace will increase because a greater percentage of these edits will not contain non-neutral language.

Methodology:

In addition to evaluating if Tone Check reduces the frequency of non-neutral language, we also wanted to assess the impact of Tone Check on edit revert rate.

To do this, we reviewed the proportion of all published new content edits where tone check was shown at least once in an editing session (identified by editCheck-tone-shown tag) and were reverted within 48 hours. This was compared to the revert rate of edits in the control group identified as eligible for Tone Check (identified by editcheck-tone tag).

Note: This metric does not consider the final text of the published edit. It’s possible edits shown Tone Check still included non-neutral language at the time of publishing if the Tone Check was not addressed. It’s also possible that non-neutral language was removed but the edit was still reverted for other reasons. The purpose of this metric is to evaluate if presenting a Tone Check to a user while editing will increase the overall quality of new content edits.
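As an illustration of the 48-hour revert window, here is a minimal base R sketch on made-up timestamps. The report's dataset already carries a `was_reverted` field; the `saved_at` and `reverted_at` columns below are hypothetical and only show one way such a flag could be derived, treating exactly 48 hours as inside the window.

```r
# Toy revisions; timestamps and column names are hypothetical.
revisions <- data.frame(
  editing_session = c("s1", "s2", "s3"),
  saved_at = as.POSIXct(
    c("2025-09-03 10:00:00", "2025-09-03 10:00:00", "2025-09-03 10:00:00"),
    tz = "UTC"
  ),
  reverted_at = as.POSIXct(
    c("2025-09-04 09:00:00", "2025-09-06 10:00:00", NA),
    tz = "UTC"
  ),
  stringsAsFactors = FALSE
)

# Hours between publishing and revert; NA when the edit was never reverted.
hours_to_revert <- as.numeric(
  difftime(revisions$reverted_at, revisions$saved_at, units = "hours")
)

# Never-reverted edits and late reverts both count as not reverted.
revisions$was_reverted <- as.integer(!is.na(hours_to_revert) & hours_to_revert <= 48)

revert_rate <- mean(revisions$was_reverted)
```

Here `s1` is reverted after 23 hours and counts toward the rate, `s2` is reverted after 72 hours and does not, and `s3` is never reverted.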

Overall

Code
tone_check_reverts_overall <- tone_check_publish_data  |>
    filter(is_new_content == 1 & is_test_eligible == 'eligible') %>% #limit to edit shown or eligible to be shown tone check
    group_by(test_group)  |>
    summarise(n_edits = n_distinct(editing_session),
              n_reverts = n_distinct(editing_session[was_reverted == 1])) %>% # reverted within 48 hours
    mutate(revert_rate = paste0(round(n_reverts/n_edits * 100, 1), "%")) 
Code
# plot visualization of overall edit revert rates
dodge <- position_dodge(width=0.9)

p <- tone_check_reverts_overall  |>
    ggplot(aes(x= test_group, y = n_reverts/n_edits, fill = test_group)) +
    geom_col(position = 'dodge') +
    scale_y_continuous(labels = scales::percent) +
    geom_text(aes(label = paste(revert_rate, "\n", n_reverts,"reverted edits"), fontface=2), vjust=1.2, size = 10, color = "white") +
    scale_fill_manual(values= c("#999999", "dodgerblue4"), name = "Experiment Group")  +
    labs (y = "Percent of edits reverted ",
           x = "Experiment Group",
          title = "New content edit revert rate",
           caption = "Limited to published new content edits shown or eligible to be shown Tone Check")  +
    theme(
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        plot.title = element_text(hjust = 0.5),
        text = element_text(size=24),
        axis.text.x = element_text(size = 24),
        axis.title.x = element_text(margin = margin(t = 20, unit = "pt")),
        legend.position= "none",
        axis.line = element_line(colour = "black")) 
      
p

People shown Tone Check are less likely to have their edits reverted. Across both platforms, there was a 15% relative decrease [-4.4 pp] in the revert rate of edits shown Tone Check in the test group compared to edits eligible but not shown Tone Check in the control group.

By whether multiple checks were shown

Code
tone_check_reverts_bymultiple  <- tone_check_publish_data  |>
    filter(is_new_content == 1 & is_test_eligible == 'eligible' 
           & multiple_checks_shown != "no tone checks"  # Removing 3 events where eligible edits in control were incorrectly tagged as being shown checks
          & test_group == 'test (tone check shown)') |> 
    group_by( multiple_checks_shown) %>%
    summarise(n_edits = n_distinct(editing_session),
              n_reverts = n_distinct(editing_session[was_reverted == 1])) |> 
    mutate(revert_rate = paste0(round(n_reverts/n_edits * 100, 1), "%")) |>   
    select(-c(2,3)) %>% # removing granular data columns for publication
    gt()  |>
    tab_header(
    title = "New content edit revert rate by if multiple checks were shown"
      )  |>
    opt_stylize(5) |>
  cols_label(
    multiple_checks_shown = "Multiple Check",
    #n_edits = "Number of published new content edits",
    #n_reverts = "Number of edits reverted ",
    revert_rate = "Proportion of new content edits that were reverted"
  ) |>
    tab_source_note(
        gt::md('Limited to published new content edits shown or eligible to be shown tone check')
    )


display_html(as_raw_html(tone_check_reverts_bymultiple))
New content edit revert rate by if multiple checks were shown

| Multiple Check | Proportion of new content edits that were reverted |
|---|---|
| one tone check | 25.7% |
| multiple tone checks | 25.2% |

Limited to published new content edits shown or eligible to be shown tone check

The number of Tone Checks shown within a single editing session does not appear to affect the likelihood that an edit is reverted. The revert rate for edits shown one tone check (25.7%) is about the same as for edits shown multiple tone checks (25.2%).

While we initially observed a lower revert rate for edits shown a single tone check in the leading indicator analysis, additional test data indicates that the revert rate of these edits is similar to edits shown multiple tone checks.

By Platform

Code
tone_check_publish_byplatform <- tone_check_publish_data |>
    filter(is_test_eligible == 'eligible' ) |>
    group_by(platform, test_group) |>
    summarise(n_edits = n_distinct(editing_session),
              n_reverts = n_distinct(editing_session[was_reverted == 1])) |> #reverted within 48 hours
    mutate(revert_rate = paste0(round(n_reverts/n_edits * 100, 1), "%")) |>  
    select(-c(3,4)) %>% # removing granular data columns for publication
    gt()  |>
    tab_header(
    title = "New content edit revert rate by platform"
      )  |>
  opt_stylize(5) |>
  cols_label(
    test_group = "Experiment Group",
    platform = "Platform",
    #n_edits = "Number of published new content edits",
    #n_reverts = "Number of edits reverted",
    revert_rate = "Proportion of new content edits that were reverted"
  ) |>
    tab_source_note(
        gt::md('Limited to published new content edits shown or eligible to be shown Tone Check')
    )



display_html(as_raw_html(tone_check_publish_byplatform))
New content edit revert rate by platform

| Platform | Experiment Group | Proportion of new content edits that were reverted |
|---|---|---|
| mobile web | control (eligible but not shown tone check) | 34.5% |
| mobile web | test (tone check shown) | 34.6% |
| desktop | control (eligible but not shown tone check) | 25% |
| desktop | test (tone check shown) | 20.2% |

Limited to published new content edits shown or eligible to be shown Tone Check

The decrease in new content edit revert rate is primarily driven by a decrease in the revert rate of desktop edits.

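We observed a 19% relative decrease [-4.8 pp] in the revert rate of desktop edits shown Tone Check, while the mobile web revert rate was essentially unchanged (34.5% vs. 34.6%). The per-platform test reported under "Confirming the impact of Tone Check on revert rate" does not confirm either change at the 95% level.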
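Throughout this report, changes are quoted both as a relative change and as a percentage-point (pp) change. As a minimal sketch of how the two relate, the base R helper below (rate_change is a hypothetical name, not a function from the analysis) recomputes both figures from the rounded desktop rates in the table above. Because the published rates are rounded, results computed this way can differ slightly from figures derived from the raw counts.

```r
# rate_change is a hypothetical helper for illustration only.
# Inputs are revert rates expressed as proportions (0.25 = 25%).
rate_change <- function(control_rate, test_rate) {
  # Percentage-point change: simple difference between the two rates
  pp_change <- (test_rate - control_rate) * 100
  # Relative change: difference expressed as a share of the control rate
  rel_change <- (test_rate - control_rate) / control_rate * 100
  c(pp = round(pp_change, 1), relative_pct = round(rel_change, 1))
}

# Desktop rates from the by-platform table: control 25%, test 20.2%
desktop_change <- rate_change(control_rate = 0.25, test_rate = 0.202)
print(desktop_change)
```

With these inputs the helper returns -4.8 pp and -19.2%, matching the desktop figures quoted above.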

By User Experience

Code
tone_check_revert_byuserexp <- tone_check_publish_data |>
    filter(is_new_content == 1 & is_test_eligible == 'eligible') |>
    group_by(experience_level_group,test_group) |>
    summarise(n_edits = n_distinct(editing_session),
              n_reverts = n_distinct(editing_session[was_reverted == 1])) |> #reverted within 48 hours
    mutate(revert_rate = paste0(round(n_reverts/n_edits * 100, 1), "%")) |>
    select(-c(3,4)) |> # removing granular data columns for publication
    gt()  |>
    tab_header(
    title = "New content edit revert rate by user experience"
      )  |>
   opt_stylize(5) |>
  cols_label(
    test_group = "Experiment Group",
    experience_level_group  = "User Status",
    #n_edits = "Number of published new content edits",
    #n_reverts = "Number of edits reverted",
    revert_rate = "Proportion of new content edits that were reverted"
  ) |>
    tab_source_note(
        gt::md('Limited to published new content edits shown or eligible to be shown Tone Check')
    )



display_html(as_raw_html(tone_check_revert_byuserexp))
New content edit revert rate by user experience

| User Status | Experiment Group | Proportion of new content edits that were reverted |
|---|---|---|
| Unregistered | control (eligible but not shown tone check) | 34.7% |
| Unregistered | test (tone check shown) | 37% |
| Newcomer | control (eligible but not shown tone check) | 21% |
| Newcomer | test (tone check shown) | 26% |
| Junior Contributor | control (eligible but not shown tone check) | 31% |
| Junior Contributor | test (tone check shown) | 20.8% |

Limited to published new content edits shown or eligible to be shown Tone Check

Results vary based on user experience. While we observed a statistically significant 33% relative decrease [-10.2 pp] in reverts for Junior Contributors, we did not confirm any change in the revert rate of Newcomers or unregistered users.

These trends indicate that Tone Check may be more effective for people who have already succeeded in completing at least one edit on a Wikipedia namespace. Since these users are more experienced, their edits are less likely to be reverted for other policy violations compared to users completing their first edit.

By partner Wikipedia

Code
tone_check_revert_bywiki <- tone_check_publish_data |>
    filter(is_new_content == 1 & is_test_eligible == 'eligible' ) |>
    group_by(wiki, test_group) |>
    summarise(n_edits = n_distinct(editing_session),
              n_reverts = n_distinct(editing_session[was_reverted == 1])) |> 
    mutate(revert_rate = paste0(round(n_reverts/n_edits * 100, 1), "%")) |>  
    select(-c(3,4)) %>% # removing granular data columns for publication
    gt()  |>
    tab_header(
    title = "New content edit revert rate by partner Wikipedia"
      )  |>
  opt_stylize(5) |>
  cols_label(
    test_group = "Experiment Group",
    wiki  = "Wikipedia",
    #n_edits = "Number of published new content edits",
    #n_reverts = "Number of edits reverted",
    revert_rate = "Proportion of new content edits that were reverted"
  )  |>
    tab_source_note(
        gt::md('Limited to wikis with > 100 published new content edits')
    )


display_html(as_raw_html(tone_check_revert_bywiki))
New content edit revert rate by partner Wikipedia

| Wikipedia | Experiment Group | Proportion of new content edits that were reverted |
|---|---|---|
| French Wikipedia | control (eligible but not shown tone check) | 30.8% |
| French Wikipedia | test (tone check shown) | 29% |
| Japanese Wikipedia | control (eligible but not shown tone check) | 31.2% |
| Japanese Wikipedia | test (tone check shown) | 10.9% |
| Portuguese Wikipedia | control (eligible but not shown tone check) | 21.4% |
| Portuguese Wikipedia | test (tone check shown) | 16.9% |

Limited to wikis with > 100 published new content edits

New content edit revert rate decreased for users shown Tone Check at all three partner Wikipedias, by at least 5% in relative terms.

We again see an especially high impact on edit quality at Japanese Wikipedia, where there was a 65% relative decrease [-20.3 pp] in the revert rate of edits shown Tone Check compared to eligible edits in the control group.

Due to the small sample size of per-Wikipedia edits, we are currently not able to confirm the statistical significance of the decreases at any of these Wikipedias, but the direction and magnitude of change suggest that Tone Check is having a positive effect on edit quality at each partner Wikipedia.

Confirming the impact of Tone Check on revert rate

Code
# calculate the proportion for each user

tone_check_reverts_overall_byuser <- tone_check_publish_data |>
    filter(is_test_eligible == 'eligible') |> #limit to edit shown or eligible to be shown tone check
    group_by(test_group, platform, user_id) |>
    summarise(n_edits = n_distinct(editing_session),
              n_reverts = n_distinct(editing_session[was_reverted == 1])) |> # reverted within 48 hours
    mutate(revert_rate = n_reverts/n_edits) # proportion for each user 
Code
# rename field names to align with relax package naming convention
tone_check_reverts_overall_byuser <- tone_check_reverts_overall_byuser |>
  mutate(variation = factor(test_group,
         levels = c("control (eligible but not shown tone check)", "test (tone check shown)"),
         labels = c("control", "treatment")))

tone_check_reverts_overall_byuser$outcome = tone_check_reverts_overall_byuser$revert_rate
Code
overall_impact_reverts <- tone_check_reverts_overall_byuser |>
    analyze_relative_lift(metric_type = "proportion", ci_level = 0.90) |>
    gt() |>
  tab_header(
    title = md("**Evaluating Tone Check impact on new content revert rate**"),
    subtitle = md("Difference in Metric (Test Group - Control Group)")
  ) |>
  tab_spanner(
    label = md("**Bayesian Analysis**"),
    columns = c(estimate_bayes, chance_to_win, cred_lower, cred_upper)
  ) |>
  tab_spanner(
    label = md("**Frequentist Analysis**"),
    columns = c(estimate_freq, p_value, conf_lower, conf_upper)
  ) |>
  
  # Rename Columns for clarity ---
  cols_label(
    estimate_bayes = md("Point Estimate"),
    chance_to_win = md("Chance to Win"),
    cred_lower = md("90% CI Lower"),
    cred_upper = md("90% CI Upper"),
    estimate_freq = md("Point Estimate"),
    p_value = md("*p*-value"),
    conf_lower = md("90% CI Lower"),
    conf_upper = md("90% CI Upper")
  ) |>
  
  # Apply Formatting (Decimals and CI Grouping) ---
  fmt_number(
    columns = everything(),
    decimals = 3 # Use 3 decimals for precision
  ) |>
  
  # Highlight key finding (Inconclusive) ---
  # tab_footnote(
  #   footnote = md("The 95% intervals cross zero, indicating no statistically conclusive difference."),
  #   locations = cells_column_labels(columns = c(cred_lower, conf_lower))
  # ) %>%
  
  #  Style the table ---
  tab_options(
    table.border.top.color = "lightgray",
    column_labels.border.bottom.color = "black",
    column_labels.border.bottom.width = px(2),
    data_row.padding = px(5)
  )

display_html(as_raw_html(overall_impact_reverts))
Evaluating Tone Check impact on new content revert rate
Difference in Metric (Test Group - Control Group)

| Bayesian: Point Estimate | Bayesian: Chance to Win | Bayesian: 90% CI Lower | Bayesian: 90% CI Upper | Frequentist: Point Estimate | Frequentist: p-value | Frequentist: 90% CI Lower | Frequentist: 90% CI Upper |
|---|---|---|---|---|---|---|---|
| −0.080 | 0.043 | −0.158 | −0.003 | −0.082 | 0.083 | −0.161 | −0.004 |

Results confirm a slight but statistically significant decrease in the revert rate of edits shown Tone Check across all edits. Analysis indicates a 95.7% chance that Tone Check successfully reduces the likelihood of an edit being reverted. At a 90% confidence level, the results are statistically significant (p = 0.083).

Code
# check by platform numbers

platform_impact_reverts <- tone_check_reverts_overall_byuser |>
    group_by(platform) |>
    group_modify(~ analyze_relative_lift(.x, metric_type = "proportion" ))|>
    gt() |>
  tab_header(
    title = md("**Evaluating Tone Check impact on revert rate by platform**"),
    subtitle = md("Difference in Metric (Test Group - Control Group)")
  ) |>
  tab_spanner(
    label = md("**Bayesian Analysis**"),
    columns = c(estimate_bayes, chance_to_win, cred_lower, cred_upper)
  ) |>
  tab_spanner(
    label = md("**Frequentist Analysis**"),
    columns = c(estimate_freq, p_value, conf_lower, conf_upper)
  ) |>
  
  # Rename Columns for clarity ---
  cols_label(
    platform = md("Platform"),
    estimate_bayes = md("Point Estimate"),
    chance_to_win = md("Chance to Win"),
    cred_lower = md("95% CI Lower"),
    cred_upper = md("95% CI Upper"),
    estimate_freq = md("Point Estimate"),
    p_value = md("*p*-value"),
    conf_lower = md("95% CI Lower"),
    conf_upper = md("95% CI Upper")
  ) |>
  
  # Apply Formatting (Decimals and CI Grouping) ---
  fmt_number(
    columns = everything(),
    decimals = 3 # Use 3 decimals for precision
  ) |>
  
  # Highlight key finding (Inconclusive) ---
  tab_footnote(
    footnote = md("The 95% intervals cross zero, indicating no statistically conclusive difference."),
    locations = cells_column_labels(columns = c(cred_lower, conf_lower))
  ) %>%
  
  #  Style the table ---
  tab_options(
    table.border.top.color = "lightgray",
    column_labels.border.bottom.color = "black",
    column_labels.border.bottom.width = px(2),
    data_row.padding = px(5)
  )

display_html(as_raw_html(platform_impact_reverts))
Evaluating Tone Check impact on revert rate by platform
Difference in Metric (Test Group - Control Group)

| Platform | Bayesian: Point Estimate | Bayesian: Chance to Win | Bayesian: 95% CI Lower¹ | Bayesian: 95% CI Upper | Frequentist: Point Estimate | Frequentist: p-value | Frequentist: 95% CI Lower¹ | Frequentist: 95% CI Upper |
|---|---|---|---|---|---|---|---|---|
| mobile web | −0.024 | 0.370 | −0.166 | 0.118 | −0.026 | 0.732 | −0.172 | 0.121 |
| desktop | −0.092 | 0.063 | −0.210 | 0.026 | −0.096 | 0.119 | −0.217 | 0.025 |

¹ The 95% intervals cross zero, indicating no statistically conclusive difference.

While we do not have sufficient data to confirm statistical significance at the strict 95% level on a per-platform basis, the results strongly indicate that Tone Check is decreasing the revert rate on Desktop.

The Bayesian analysis shows a 93.7% probability that the tool reduces Desktop reverts, with an estimated change of -9.2% (the frequentist point estimate is -9.6%). On mobile web, there was almost no change in the overall new content revert rate.
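The way the two results tables above are read in the text can be summarised in a few lines of base R. This is only a reading aid built from the published table values, not part of the original analysis. It follows the report's own convention of treating 1 minus "Chance to Win" as the probability that Tone Check reduces reverts, and it assumes a result counts as significant when the frequentist interval excludes zero at the stated confidence level.

```r
# Values copied from the overall (90% level) and per-platform (95% level)
# results tables above; column names follow the output shown in the report.
results <- data.frame(
  segment       = c("overall", "mobile web", "desktop"),
  chance_to_win = c(0.043, 0.370, 0.063),
  p_value       = c(0.083, 0.732, 0.119),
  conf_lower    = c(-0.161, -0.172, -0.217),
  conf_upper    = c(-0.004, 0.121, 0.025),
  ci_level      = c(0.90, 0.95, 0.95),
  stringsAsFactors = FALSE
)

# Probability that Tone Check reduces reverts, as read in the report text
results$prob_reduces_reverts <- 1 - results$chance_to_win

# An interval that excludes zero indicates significance at that level
results$ci_excludes_zero <- results$conf_lower > 0 | results$conf_upper < 0

# Equivalent check on the p-value against alpha = 1 - confidence level
results$p_below_alpha <- results$p_value < (1 - results$ci_level)

print(results[, c("segment", "prob_reduces_reverts",
                  "ci_excludes_zero", "p_below_alpha")])
```

Only the overall result passes both checks, and only at the 90% level. This is why the per-platform findings are described as indicative rather than confirmed.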

New Content edit revert rate: Impact of removing non-neutral language (Primary Metric)

Hypothesis: The quality of new content edits newcomers and Junior Contributors make in the main namespace will increase because a greater percentage of these edits will not contain non-neutral language.

Methodology: As the final piece to evaluate this hypothesis, we reviewed the revert rate of new content edits in the test group for people who removed non-neutral language in response to Tone Check. Here we are measuring the impact of a person making the change Tone Check is prompting. Does removing non-neutral language decrease the likelihood that an edit is reverted?

In this section, we isolated the direct impact of Tone Check by comparing a specific subset: Control edits that contained non-neutral language versus Test edits where the user actively removed that language in response to a Tone Check. To do this, we used the revision tag created in T388716 to identify when the model detects non-neutral language within a new content edit at the time of publishing.

Overall

Code
tone_check_eligible_revert_overall <- tone_check_publish_data  |>
        filter(is_new_content == 1 & is_test_eligible == 'eligible' ) |> #limit to edit shown or eligible to be shown tone check
    group_by(is_tone_check_addressed )  |> #group by presence of non-neutral language
    summarise(n_edits = n_distinct(editing_session),
              n_reverts = n_distinct(editing_session[was_reverted == 1])) |> # reverted within 48 hours and tone check issues addressed
    mutate(revert_rate = round(n_reverts/n_edits, 3))  |>
    ungroup() |>
    mutate(n_edits = ifelse(n_edits < 50, "<50", n_edits),
           n_reverts = ifelse(n_reverts < 50, "<50", n_reverts))  #sanitizing per data publication guidelines
Code
# plot visualization of overall edit revert rates
dodge <- position_dodge(width=0.9)

p <- tone_check_eligible_revert_overall  |>
    filter(is_tone_check_addressed != 'Tone check shown but not addressed') |> #removing edits in test group where tone issues were not addressed for this analysis
    ggplot(aes(x= is_tone_check_addressed , y = revert_rate, fill = is_tone_check_addressed )) +
    geom_col(position = 'dodge') +
    scale_y_continuous(labels = scales::percent) +
    geom_text(aes(label = paste(revert_rate * 100,"%", "\n", n_reverts,"reverted edits"), fontface=2), vjust=1.2, size = 8, color = "white") +
    scale_fill_manual(values= c("#999999", "dodgerblue4"), name = "Experiment Group")  +
    labs (y = "Percent of edits reverted ",
           x = "Experiment Group",
          title = "New Content revert rate: Impact of removing non-neutral language",
           caption = "Limited to published new content edits by unregistered users or users with 100 or fewer edits")  +
    theme(
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        plot.title = element_text(hjust = 0.5),
        text = element_text(size=20),
        axis.text.x = element_text(size = 20),
        axis.title.x = element_text(margin = margin(t = 20, unit = "pt")),
        legend.position= "none",
        axis.line = element_line(colour = "black")) 
p

When Tone Check successfully prompts a user to remove non-neutral language, the likelihood of that edit being reverted drops substantially. There was a 44.1% relative decrease in the revert rate of edits where people removed non-neutral language in response to a Tone Check prompt.

By Platform

Code
tone_check_eligible_revert_byplatform <- tone_check_publish_data |>
    filter(is_new_content == 1 & is_test_eligible == 'eligible' &
         is_tone_check_addressed != 'Tone check shown but not addressed') |> #limit to edits where tone check was addressed
    group_by(platform, is_tone_check_addressed ) |>
    summarise(n_edits = n_distinct(editing_session),
             n_reverts = n_distinct(editing_session[was_reverted == 1]))  |>  #look at reverted
     mutate(revert_rate = paste0(round(n_reverts/n_edits * 100, 1), "%")) |>
   select(-c(3,4)) |> # removing granular data columns
    gt()  |>
    tab_header(
    title = "New Content revert rate: Impact of removing non-neutral language by platform"
      )  |>
  opt_stylize(5) |>
  cols_label(
    platform  = "Platform",
    is_tone_check_addressed   = "Were tone issues detected at time of save?",
    #n_edits = "Number of published edits",
    #n_reverts = "Number of edits reverted",
    revert_rate = "Proportion of edits that were reverted"
  )  |>
    tab_source_note(
        gt::md('Limited to published new content edits shown or eligible to be shown Tone Check')
    )


display_html(as_raw_html(tone_check_eligible_revert_byplatform))
New Content revert rate: Impact of removing non-neutral language by platform

| Platform | Were tone issues detected at time of save? | Proportion of edits that were reverted |
|---|---|---|
| mobile web | Eligible control edits | 32.4% |
| mobile web | Tone check shown and addressed | 27.6% |
| desktop | Eligible control edits | 28.5% |
| desktop | Tone check shown and addressed | 15.1% |

Limited to published new content edits shown or eligible to be shown Tone Check

We observed decreases on both platforms, but the impact is larger on desktop than on mobile web. On desktop, we observed a 47% relative decrease [-13.4 pp] in revert rate for people who revised their text in response to Tone Check.

On mobile web, there was a 14.8% relative decrease [-4.8 pp] in revert rate for edits where non-neutral language was removed. Mobile web edits appear to be inherently trickier for newcomers and are still more likely to be reverted compared to desktop edits, even when non-neutral language is removed.

By User Experience

Code
tone_check_eligible_revert_byuserexp <- tone_check_publish_data |>
     filter(is_new_content == 1 & is_test_eligible == 'eligible' &
             is_tone_check_addressed != 'Tone check shown but not addressed') |> #limit to edits where tone check was addressed
    group_by(experience_level_group, is_tone_check_addressed) |>
    summarise(n_edits = n_distinct(editing_session),
             n_reverts = n_distinct(editing_session[was_reverted == 1]))  |>  #look at reverted
     mutate(revert_rate = paste0(round(n_reverts/n_edits * 100, 1), "%")) |> 
   select(-c(3,4)) |># removing granular data columns
    gt()  |>
    tab_header(
    title = "New Content revert rate: Impact of removing non-neutral language by user experience"
      )  |>
  opt_stylize(5) |>
  cols_label(
    experience_level_group  = "User Experience",
    is_tone_check_addressed   = "Were tone issues detected at time of save?",
    #n_edits = "Number of published edits",
    #n_reverts = "Number of edits reverted",
    revert_rate = "Proportion of edits that were reverted"
  )  |>
    tab_source_note(
        gt::md('Limited to published new content edits shown or eligible to be shown Tone Check')
    )


display_html(as_raw_html(tone_check_eligible_revert_byuserexp))
New Content revert rate: Impact of removing non-neutral language by user experience

| User Experience | Were tone issues detected at time of save? | Proportion of edits that were reverted |
|---|---|---|
| Unregistered | Eligible control edits | 34.7% |
| Unregistered | Tone check shown and addressed | 23.8% |
| Newcomer | Eligible control edits | 21% |
| Newcomer | Tone check shown and addressed | 21.5% |
| Junior Contributor | Eligible control edits | 31% |
| Junior Contributor | Tone check shown and addressed | 13.7% |

Limited to published new content edits shown or eligible to be shown Tone Check

We observed the highest impact for Junior Contributors, where there was a 55.8% relative decrease [-17.3 pp] in revert rate, compared to a slight 2.5% increase [+0.5 pp] for Newcomers and a 31.4% decrease [-10.9 pp] for unregistered users.

For Newcomers and unregistered users, addressing tone issues may have less of an impact because their edits are frequently reverted for other policy violations that Tone Check is not designed to catch. Junior contributors have already successfully completed at least one edit and are more likely to publish an edit where non-neutral language is the only issue.

By Partner Wikipedia

Code
tone_check_eligible_revert_bywiki <- tone_check_publish_data|>
     filter(is_new_content == 1 & is_test_eligible == 'eligible' &
     is_tone_check_addressed != 'Tone check shown but not addressed') |> #limit to edits where tone check was addressed
    group_by(wiki, is_tone_check_addressed) %>%
    summarise(n_edits = n_distinct(editing_session),
             n_reverts = n_distinct(editing_session[was_reverted == 1]))  |>  #look at reverted
     mutate(revert_rate = paste0(round(n_reverts/n_edits * 100, 1), "%")) |>  
    select(-c(3,4)) %>% # removing granular data columns
    gt()  |>
    tab_header(
    title = "New Content revert rate: Impact of removing non-neutral language by partner Wikipedia"
      )  |>
  opt_stylize(5) |>
  cols_label(
    wiki  = "Wikipedia",
    is_tone_check_addressed  = "Were tone issues detected at time of save?",
    #n_edits = "Number of published edits",
    #n_reverts = "Number of edits reverted",
    revert_rate = "Proportion of edits that were reverted"
  )  |>
    tab_source_note(
        gt::md('Limited to published new content edits shown or eligible to be shown Tone Check')
    )


display_html(as_raw_html(tone_check_eligible_revert_bywiki))
New Content revert rate: Impact of removing non-neutral language by partner Wikipedia

| Wikipedia | Were tone issues detected at time of save? | Proportion of edits that were reverted |
|---|---|---|
| French Wikipedia | Eligible control edits | 30.8% |
| French Wikipedia | Tone check shown and addressed | 22.5% |
| Japanese Wikipedia | Eligible control edits | 31.2% |
| Japanese Wikipedia | Tone check shown and addressed | 8% |
| Portuguese Wikipedia | Eligible control edits | 21.4% |
| Portuguese Wikipedia | Tone check shown and addressed | 4.5% |

Limited to published new content edits shown or eligible to be shown Tone Check

We also observed decreases across all three partner Wikipedias; however, the magnitude of impact varies, highlighting different revert behavior in each community.

At Japanese and Portuguese Wikipedias, removing non-neutral language from edits reduces the revert rate to less than 10% (a relative decrease of over 70%), while there was less of an impact on French Wikipedia. See specific changes below:

  • Japanese Wikipedia: -74.4% [-23.2 pp]
  • Portuguese Wikipedia: -79% [-16.9 pp]
  • French Wikipedia: -26.9% [-8.3 pp]

Note: There is a smaller sample size of published edits eligible for Tone Check on a per Wikipedia basis (< 300 edits) so these rates are more susceptible to noise.
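To illustrate why rates based on a few hundred edits are noisy, the sketch below uses base R's prop.test to compute a Wilson score interval for the same observed rate at two sample sizes. The counts are hypothetical, since per-wiki counts are not published in this report, and wilson_ci is an illustrative helper name.

```r
# wilson_ci is a hypothetical helper; the counts used below are invented
# to show how interval width depends on sample size.
wilson_ci <- function(n_reverts, n_edits, conf_level = 0.95) {
  # prop.test with correct = FALSE returns the Wilson score interval
  ci <- prop.test(n_reverts, n_edits,
                  conf.level = conf_level, correct = FALSE)$conf.int
  c(lower = ci[1], upper = ci[2])
}

# Same observed revert rate (8%) at two hypothetical sample sizes
small_sample <- wilson_ci(n_reverts = 8, n_edits = 100)
large_sample <- wilson_ci(n_reverts = 80, n_edits = 1000)

# Interval width shrinks as the number of edits grows
small_width <- unname(small_sample["upper"] - small_sample["lower"])
large_width <- unname(large_sample["upper"] - large_sample["lower"])

print(round(rbind(small_sample, large_sample), 3))
```

With only 100 edits the plausible range around an 8% revert rate spans roughly ten percentage points, so per-wiki differences of a few points should be interpreted cautiously.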

Confirming the impact of removing non-neutral language on new content revert rate

We then evaluated the impact of removing non-neutral language on new content revert rate, controlling for variance by user and wiki. For this analysis, we specifically compared edits with non-neutral language in the control group (eligible control edits) to edits in the test group where Tone Check was shown and non-neutral language was removed.

Code
tone_check_eligible_revert_overall_byuser <- tone_check_publish_data  |>
    filter(is_new_content == 1 & is_test_eligible == 'eligible' & 
          is_tone_check_addressed != 'Tone check shown but not addressed') |> #directly comparing eligible control to test where tone check addressed
    group_by(is_tone_check_addressed, platform, user_id)  |>  #use presence of non-neutral language as the variation
    summarise(n_edits = n_distinct(editing_session),
              n_reverts = n_distinct(editing_session[was_reverted == 1])) |> # reverted within 48 hours
    mutate(revert_rate = n_reverts/n_edits)
Code
# rename field names to align with relax package naming convention
tone_check_eligible_revert_overall_byuser <- tone_check_eligible_revert_overall_byuser |>
  mutate(variation = factor(is_tone_check_addressed,
         levels = c("Eligible control edits", "Tone check shown and addressed"),
         labels = c("control", "treatment")))

tone_check_eligible_revert_overall_byuser$outcome = tone_check_eligible_revert_overall_byuser$revert_rate 
Code
overall_impact_toneeligible <- tone_check_eligible_revert_overall_byuser |>
    analyze_relative_lift(metric_type = "proportion") |>
    gt() |>
  tab_header(
    title = md("**Evaluating the impact of removing non-neutral language on revert rate**"),
    subtitle = md("Difference in Metric (Test Group - Control Group)")
  ) |>
  tab_spanner(
    label = md("**Bayesian Analysis**"),
    columns = c(estimate_bayes, chance_to_win, cred_lower, cred_upper)
  ) |>
  tab_spanner(
    label = md("**Frequentist Analysis**"),
    columns = c(estimate_freq, p_value, conf_lower, conf_upper)
  ) |>
  
  # Rename Columns for clarity ---
  cols_label(
    estimate_bayes = md("Point Estimate"),
    chance_to_win = md("Chance to Win"),
    cred_lower = md("95% CI Lower"),
    cred_upper = md("95% CI Upper"),
    estimate_freq = md("Point Estimate"),
    p_value = md("*p*-value"),
    conf_lower = md("95% CI Lower"),
    conf_upper = md("95% CI Upper")
  ) |>
  
  # Apply Formatting (Decimals and CI Grouping) ---
  fmt_number(
    columns = everything(),
    decimals = 3 # Use 3 decimals for precision
  ) |>
  
  # Highlight key finding (statistically significant) ---
  tab_footnote(
    footnote = md("The 95% intervals do not cross zero indicating the results are statistically significant."),
    locations = cells_column_labels(columns = c(cred_lower, conf_lower))
  ) %>%
  
  #  Style the table ---
  tab_options(
    table.border.top.color = "lightgray",
    column_labels.border.bottom.color = "black",
    column_labels.border.bottom.width = px(2),
    data_row.padding = px(5)
  )

display_html(as_raw_html(overall_impact_toneeligible))
Evaluating the impact of removing non-neutral language on revert rate
Difference in Metric (Test Group - Control Group)

| Bayesian: Point Estimate | Bayesian: Chance to Win | Bayesian: 95% CI Lower¹ | Bayesian: 95% CI Upper | Frequentist: Point Estimate | Frequentist: p-value | Frequentist: 95% CI Lower¹ | Frequentist: 95% CI Upper |
|---|---|---|---|---|---|---|---|
| −0.350 | 0.000 | −0.534 | −0.165 | −0.388 | 0.000 | −0.583 | −0.193 |

¹ The 95% intervals do not cross zero indicating the results are statistically significant.

We confirmed a statistically significant reduction in revert rate for edits where non-neutral language was removed in the final published edit.

For users who removed non-neutral language in response to a Tone Check, we observed an estimated 38% relative reduction in the likelihood of an edit being reverted (frequentist point estimate −0.388; Bayesian −0.350). Both the Bayesian “Chance to Win” and the frequentist p-value round to 0.000 at three decimal places. This indicates that the reduction is very unlikely to be due to chance and is consistent with the improved language driving the lower revert rate.

Code
# check by platform numbers

platform_impact_toneeligible <- tone_check_eligible_revert_overall_byuser |>
    group_by(platform) |>
    group_modify(~ analyze_relative_lift(.x, metric_type = "proportion" ))|>
    gt() |>
  tab_header(
    title = md("**Evaluating the impact of removing non-neutral language on platform revert rate**"),
    subtitle = md("Difference in Metric (Test Group - Control Group)")
  ) |>
  tab_spanner(
    label = md("**Bayesian Analysis**"),
    columns = c(estimate_bayes, chance_to_win, cred_lower, cred_upper)
  ) |>
  tab_spanner(
    label = md("**Frequentist Analysis**"),
    columns = c(estimate_freq, p_value, conf_lower, conf_upper)
  ) |>
  
  # Rename Columns for clarity ---
  cols_label(
    platform = md("Platform"),
    estimate_bayes = md("Point Estimate"),
    chance_to_win = md("Chance to Win"),
    cred_lower = md("95% CI Lower"),
    cred_upper = md("95% CI Upper"),
    estimate_freq = md("Point Estimate"),
    p_value = md("*p*-value"),
    conf_lower = md("95% CI Lower"),
    conf_upper = md("95% CI Upper")
  ) |>
  
  # Apply Formatting (Decimals and CI Grouping) ---
  fmt_number(
    columns = everything(),
    decimals = 3 # Use 3 decimals for precision
  ) |>
  
  #  Style the table ---
  tab_options(
    table.border.top.color = "lightgray",
    column_labels.border.bottom.color = "black",
    column_labels.border.bottom.width = px(2),
    data_row.padding = px(5)
  )

display_html(as_raw_html(platform_impact_toneeligible))
Evaluating the impact of removing non-neutral language on platform revert rate
Difference in Metric (Test Group - Control Group)

| Platform | Bayesian: Point Estimate | Bayesian: Chance to Win | Bayesian: 95% CI Lower | Bayesian: 95% CI Upper | Frequentist: Point Estimate | Frequentist: p-value | Frequentist: 95% CI Lower | Frequentist: 95% CI Upper |
|---|---|---|---|---|---|---|---|---|
| mobile web | −0.088 | 0.330 | −0.479 | 0.303 | −0.158 | 0.559 | −0.699 | 0.384 |
| desktop | −0.342 | 0.001 | −0.551 | −0.133 | −0.392 | 0.001 | −0.616 | −0.167 |

Per platform findings:

  • Desktop: The model shows a highly significant 34.2% relative drop in the revert rate for edits where non-neutral language was removed (Bayesian point estimate −0.342; p = 0.001). There is a 99.9% probability that Desktop users who address Tone Check suggestions are less likely to be reverted.

  • Mobile web: Results show directional signs that removing non-neutral language decreases reverts, with an estimated change of -8.8% (Bayesian point estimate −0.088). While there is a 67% chance that addressing Tone Check issues decreases reverts, we cannot confirm statistical significance (p = 0.559). This is likely due to a smaller sample size and higher “signal noise” on mobile, where other factors (such as technical errors or other policy violations) often result in reverts even if non-neutral language is removed.

Edit Completion Rate (Primary Metric)

Hypothesis: Newcomers and Junior Contributors will experience Tone Check as encouraging because it will offer them more clarity about what is expected of the new information they add to Wikipedia.

Methodology: We reviewed the proportion of attempted edits that were successfully published and not reverted. To reduce noise from edits abandoned earlier in the editing workflow, the analysis is limited to edits that reached the point where Tone Check was shown (test group) or would have been shown (control group).

We excluded edits that were reverted to ensure we were measuring Tone Check’s impact on productive contributions.

Code
# load data for assessing edit completion rate
tone_check_completion_rates_1 <-
  read.csv(
    file = 'data/tone_check_completion_data.tsv',
    header = TRUE,
    sep = "\t",
    stringsAsFactors = FALSE
  ) 
Code
# load edit completion rate (second dataset)
# Second dataset was created to obtain updated event data while preserving the initial aggregated dataset, which could no longer
# be queried in the Data Lake due to data retention policies. 
tone_check_completion_rates_2 <-
  read.csv(
    file = 'data/tone_check_completion_data_pt2.tsv',
    header = TRUE,
    sep = "\t",
    stringsAsFactors = FALSE # match the first dataset so rbind() combines columns of the same type
  ) 
Code
# Combine the two datasets

tone_check_completion_rates <- rbind(tone_check_completion_rates_1, tone_check_completion_rates_2)
Code
# Set experience level group and factor levels
tone_check_completion_rates <- tone_check_completion_rates|>
  mutate(
    experience_level_group = case_when(
     user_edit_count == 0 & user_status == 'registered' ~ 'Newcomer',
     user_edit_count == 0 & user_status == 'unregistered' ~ 'Unregistered',
      user_edit_count > 0 &  user_edit_count <= 100 ~ "Junior Contributor",
      user_edit_count >  100 ~ "Non-Junior Contributor"   
    ),
    experience_level_group = factor(experience_level_group,
         levels = c("Unregistered","Newcomer", "Non-Junior Contributor", "Junior Contributor")
   ))  

#rename experiment field to clarify
tone_check_completion_rates <- tone_check_completion_rates |>
 mutate(test_group = factor(test_group,
         levels = c('2025-09-editcheck-tone-control', '2025-09-editcheck-tone-test'),
         labels = c("control (eligible but not shown tone check)", "test (tone check shown)")))


#rename platform from phone to mobile web to clarify meaning
tone_check_completion_rates <- tone_check_completion_rates|>
  mutate(platform = factor(platform,
         levels = c('phone', 'desktop'),
         labels = c("mobile web", "desktop")))

tone_check_completion_rates  <- tone_check_completion_rates  |>
  mutate(
    wiki = recode(wiki, !!!wiki_name_lookup)
  )
Code
#Set fields and factor levels to assess number of checks shown

tone_check_completion_rates  <- tone_check_completion_rates |>
  mutate(
    multiple_checks_shown = 
         ifelse(n_checks_shown > 1, "multiple checks shown", "one check shown"),  
     multiple_checks_shown = factor( multiple_checks_shown ,
         levels = c("one check shown", "multiple checks shown")))
         
# note these buckets can be adjusted as needed based on distribution of data
tone_check_completion_rates  <- tone_check_completion_rates |>
  mutate(
    checks_shown_bucket = case_when(
     is.na(n_checks_shown) ~ '0',
     n_checks_shown == 1  ~ '1', 
     n_checks_shown == 2  ~ '2',
     n_checks_shown > 2 & n_checks_shown <= 5  ~ "3-5",
     n_checks_shown > 5 & n_checks_shown <= 10 ~ "6-10", 
     n_checks_shown > 10 ~ "over 10" 
    ),
    checks_shown_bucket = factor(checks_shown_bucket ,
         levels = c("0","1","2", "3-5", "6-10","over 10")
   ))
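
As a side note on the bucketing step above, the same bins could also be expressed with base R's cut(), which keeps the break points in one place if the buckets need adjusting. This is only a sketch: the helper name bucket_checks_shown is hypothetical and not part of the original analysis, and it assumes missing counts should map to "0" as in the case_when() version.

```r
# Hypothetical helper (not from the original analysis): bucket the number of
# Tone Checks shown using right-closed intervals (0,1], (1,2], (2,5], (5,10], (10,Inf].
bucket_checks_shown <- function(n_checks_shown) {
  labels <- c("1", "2", "3-5", "6-10", "over 10")
  bucket <- cut(n_checks_shown, breaks = c(0, 1, 2, 5, 10, Inf), labels = labels)
  bucket <- as.character(bucket)
  bucket[is.na(n_checks_shown)] <- "0"  # missing counts are treated as zero checks shown
  factor(bucket, levels = c("0", labels))
}

example_buckets <- bucket_checks_shown(c(NA, 1, 2, 3, 5, 6, 10, 11))
print(table(example_buckets))
```

Changing the bins then only requires editing the breaks and labels vectors together.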

Overall

Code
tone_check_completion_rate_overall <- tone_check_completion_rates %>%
    filter(tone_check_shown == 1 ) %>% #limit to sessions where tone check was shown
    group_by(test_group) %>%
    summarise(n_edits = n_distinct(editing_session),
              n_saves = n_distinct(editing_session[saved_edit > 0 & was_reverted == 0 ])) %>% #saved and not reverted
    mutate(completion_rate = paste0(round(n_saves/n_edits * 100, 1), "%")) 
Code
# plot visualization of overall edit completion rates
dodge <- position_dodge(width=0.9)

p <- tone_check_completion_rate_overall  %>%
    ggplot(aes(x= test_group, y = n_saves/n_edits, fill = test_group)) +
    geom_col(position = 'dodge') +
    scale_y_continuous(labels = scales::percent) +
    geom_text(aes(label = paste(completion_rate, "\n", n_saves,"saved edits"), fontface=2), vjust=1.2, size = 10, color = "white") +
    scale_fill_manual(values= c("#999999", "dodgerblue4"), name = "Experiment Group")  +
    labs (y = "Percent of edit attempts completed ",
           x = "Experiment Group",
          title = "Edit completion rate",
           caption = "Limited to edits shown or eligible to be shown at least one Tone Check and not reverted")  +
    theme(
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        plot.title = element_text(hjust = 0.5),
        text = element_text(size=24),
        axis.text.x = element_text(size = 24),
        axis.title.x = element_text(margin = margin(t = 20, unit = "pt")),
        legend.position= "none",
        axis.line = element_line(colour = "black")) 

      
p

Edit completion rates for people shown Tone Check decreased only slightly, by 3.2% (1.6 percentage points).

By whether multiple checks were shown

Code
tone_check_completion_rate_bymulti <- tone_check_completion_rates %>%
    filter(tone_check_shown == 1 &
            test_group == 'test (tone check shown)'
          ) %>%
    group_by(test_group, multiple_checks_shown) %>%
    summarise(n_edits = n_distinct(editing_session),
              n_saves = n_distinct(editing_session[saved_edit > 0])) %>% # note: unlike the other breakdowns, reverted edits are not excluded here
    mutate(completion_rate = paste0(round(n_saves/n_edits * 100, 1), "%")) %>%   
    gt()  %>%
    tab_header(
    title = "Tone Check edit completion rate by if multiple checks were shown"
      )  %>%
    opt_stylize(5) %>%
  cols_label(
    test_group = "Experiment group",
    multiple_checks_shown = "Multiple Tone Checks shown",
    n_edits = "Number of edit attempts shown Tone Check",
    n_saves = "Number of published edits",
    completion_rate = "Proportion of edits saved"
  ) %>%
    tab_source_note(
         gt::md('Limited to edits shown at least one Tone Check; published edits include those later reverted')
    )


display_html(as_raw_html(tone_check_completion_rate_bymulti))
Tone Check edit completion rate by if multiple checks were shown
Multiple Tone Checks shown Number of edit attempts shown Tone Check Number of published edits Proportion of edits saved
test (tone check shown)
one check shown 2049 1287 62.8%
multiple checks shown 2798 1782 63.7%
Limited to edits shown at least one Tone Check; published edits include those later reverted

By number of checks shown

Code
tone_check_completion_rate_bynchecks <- tone_check_completion_rates %>%
    filter(tone_check_shown == 1 & test_group == 'test (tone check shown)')  %>% #limit to tone checks shown and test group
    group_by(test_group, checks_shown_bucket) %>%
    summarise(n_edits = n_distinct(editing_session),
              n_saves = n_distinct(editing_session[saved_edit > 0 & was_reverted == 0])) %>%
    mutate(completion_rate = paste0(round(n_saves/n_edits * 100, 1), "%")) %>%  
    ungroup()%>%  
    mutate(n_edits = ifelse(n_edits < 50, "<50", n_edits),
           n_saves = ifelse(n_saves < 50, "<50", n_saves))  %>% #sanitizing per data publication guidelines
    group_by(test_group) %>%  
    gt()  %>%
    tab_header(
    title = "Tone Check edit completion rate by the number of checks shown"
      )  %>%
    opt_stylize(5) %>%
  cols_label(
    checks_shown_bucket = "Number of Tone Checks shown",
    n_edits = "Number of edit attempts shown Tone Check",
    n_saves = "Number of published edits",
    completion_rate = "Proportion of edits saved"
  ) %>%
    tab_source_note(
        gt::md('Limited to edits shown at least one Tone Check in the test group and not reverted')
    )


display_html(as_raw_html(tone_check_completion_rate_bynchecks))
Tone Check edit completion rate by the number of checks shown
Number of Tone Checks shown Number of edit attempts shown Tone Check Number of published edits Proportion of edits saved
test (tone check shown)
1 2049 1011 49.3%
2 1452 718 49.4%
3-5 834 394 47.2%
6-10 327 150 45.9%
over 10 185 82 44.3%
Limited to edits shown at least one Tone Check in the test group and not reverted

The majority of published new content edits (73%) were shown two or fewer Tone Checks within a single editing session. When two or fewer checks are presented, we see only about a 1.6% decrease in edit completion rate.

The decrease in completion rate does not exceed 10% until more than 10 Tone Checks are presented in a single editing session. For these edits, the edit completion rate decreased to 44.3% (a 12% decrease from the control). However, editing sessions with over 10 Tone Checks represent only 3% of published edits where Tone Check was shown and are likely an indicator of very low-quality edits that we’d want to deter.
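
The shares quoted above can be reproduced from the published table counts. The sketch below is not the original analysis code; it assumes the control completion rate is the pooled control rate implied by the by-platform table in this report (336 + 1597 published edits out of 864 + 2983 attempts).

```r
# Test-group counts by number of Tone Checks shown (copied from the table above)
by_bucket <- data.frame(
  bucket  = c("1", "2", "3-5", "6-10", "over 10"),
  n_edits = c(2049, 1452, 834, 327, 185),
  n_saves = c(1011, 718, 394, 150, 82)
)

# Share of published edits in each bucket, and the running (cumulative) share
by_bucket$share_of_saves   <- by_bucket$n_saves / sum(by_bucket$n_saves)
by_bucket$cumulative_share <- cumsum(by_bucket$share_of_saves)

# Assumption: pooled control completion rate taken from the by-platform table
control_rate <- (336 + 1597) / (864 + 2983)

# Relative change in completion rate for each bucket versus the control group
by_bucket$completion_rate <- by_bucket$n_saves / by_bucket$n_edits
by_bucket$relative_change <- by_bucket$completion_rate / control_rate - 1

share_two_or_fewer <- round(by_bucket$cumulative_share[2] * 100)  # 73
share_over_ten     <- round(by_bucket$share_of_saves[5] * 100)    # 3
change_over_ten    <- round(by_bucket$relative_change[5] * 100)   # -12
print(c(share_two_or_fewer, share_over_ten, change_over_ten))
```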

By Platform

Code
tone_check_completion_rate_byplatform <- tone_check_completion_rates %>%
    filter(tone_check_shown == 1) %>%
    group_by(platform, test_group) %>%
    summarise(n_edits = n_distinct(editing_session),
              n_saves = n_distinct(editing_session[saved_edit > 0 & was_reverted == 0 ])) %>%
    mutate(completion_rate = paste0(round(n_saves/n_edits * 100, 1), "%")) %>% 
    #mutate(n_saves = ifelse(n_saves < 50, "<50", n_saves))%>% #sanitizing per data publication guideline
    #select(-c(3,4)) %>% 
    gt()  %>%
    tab_header(
    title = "Tone Check edit completion rate by platform"
      )  %>%
    opt_stylize(5) %>%
  cols_label(
    test_group = "Experiment Group",
    platform = "Platform",
    n_edits = "Number of edit attempts shown Tone Check",
    n_saves = "Number of published edits",
    completion_rate = "Proportion of edits saved"
  ) %>%
    tab_source_note(
        gt::md('Limited to edits shown or eligible to be shown at least one Tone Check and not reverted')
    )


display_html(as_raw_html(tone_check_completion_rate_byplatform))
Tone Check edit completion rate by platform
Experiment Group Number of edit attempts shown Tone Check Number of published edits Proportion of edits saved
mobile web
control (eligible but not shown tone check) 864 336 38.9%
test (tone check shown) 1266 489 38.6%
desktop
control (eligible but not shown tone check) 2983 1597 53.5%
test (tone check shown) 3581 1866 52.1%
Limited to edits shown or eligible to be shown at least one Tone Check and not reverted

This decrease was primarily concentrated on desktop (-2.6%; -1.4pp), with no meaningful change in completion rates observed for mobile web (38.9% vs. 38.6%). Mobile web users are nearly as likely to publish their edit whether they see a Tone Check or not.

By User Experience

Code
tone_check_completion_rate_byuserstatus <- tone_check_completion_rates %>%
    filter(tone_check_shown == 1) %>%
    group_by(experience_level_group, test_group) %>%
    summarise(n_edits = n_distinct(editing_session),
              n_saves = n_distinct(editing_session[saved_edit > 0 & was_reverted == 0 ])) %>%
    mutate(completion_rate = paste0(round(n_saves/n_edits * 100, 1), "%")) %>%   
    #select(-c(3,4)) %>% #data sanitizing for publication
    gt()  %>%
    tab_header(
    title = "Tone check edit completion rate by user experience"
      )  %>%
 opt_stylize(5) %>%
  cols_label(
    test_group = "Experiment Group",
    experience_level_group = "User type",
    n_edits = "Number of edit attempts shown Tone Check",
    n_saves = "Number of published edits",
    completion_rate = "Proportion of edits saved"
  ) %>%
    tab_source_note(
        gt::md('Limited to edits shown or eligible to be shown at least one Tone Check and not reverted')
    )


display_html(as_raw_html(tone_check_completion_rate_byuserstatus))
Tone check edit completion rate by user experience
Experiment Group Number of edit attempts shown Tone Check Number of published edits Proportion of edits saved
Unregistered
control (eligible but not shown tone check) 1144 433 37.8%
test (tone check shown) 1613 555 34.4%
Newcomer
control (eligible but not shown tone check) 712 323 45.4%
test (tone check shown) 861 367 42.6%
Junior Contributor
control (eligible but not shown tone check) 1991 1177 59.1%
test (tone check shown) 2373 1433 60.4%
Limited to edits shown or eligible to be shown at least one Tone Check and not reverted

The impact of Tone Check on edit completion rate varies by user experience. Relative changes are listed below:

  • Unregistered: 9.0% decrease [-3.4pp]
  • Newcomers: 6.4% decrease [-2.9pp]
  • Junior Contributors: 2.2% increase [+1.3pp]

While Tone Check resulted in slight decreases in edit completion rates for Newcomers and unregistered users, it caused minimal disruption to Junior Contributors. We observed a 2.2% relative increase in the completion rate of Junior Contributors shown Tone Check, suggesting the check may encourage and facilitate successful publishing for a subset of users.
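
To make the distinction between the relative change (%) and the percentage point change (pp) concrete, here is a minimal sketch using the Junior Contributor counts from the table above. The helper rate_change is hypothetical and not part of the original analysis. Figures computed from raw counts can differ by about 0.1 from figures computed from already-rounded rates.

```r
# Hypothetical helper: absolute (percentage point) and relative (%) change
# between a control and a test proportion, computed from raw counts.
rate_change <- function(control_saves, control_edits, test_saves, test_edits) {
  control_rate <- control_saves / control_edits
  test_rate    <- test_saves / test_edits
  c(
    pp_change       = (test_rate - control_rate) * 100,
    relative_change = (test_rate / control_rate - 1) * 100
  )
}

# Junior Contributor counts from the table above
junior <- rate_change(control_saves = 1177, control_edits = 1991,
                      test_saves = 1433, test_edits = 2373)
print(round(junior, 1))  # pp_change 1.3, relative_change 2.2
```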

By Partner Wikipedia

Code
tone_check_completion_rate_bywiki <- tone_check_completion_rates %>%
    filter(tone_check_shown == 1) %>%
    group_by(wiki, test_group) %>%
    summarise(n_edits = n_distinct(editing_session),
              n_saves = n_distinct(editing_session[saved_edit > 0 & was_reverted == 0])) %>%
    mutate(completion_rate = paste0(round(n_saves/n_edits * 100, 1), "%")) %>% 
    #filter(n_edits > 200) %>%  #limit to wikis with sufficient events
    #select(-c(3,4)) %>% #data sanitizing for publication
    gt()  %>%
    tab_header(
    title = "Tone Check edit completion rate by Wikipedia"
      )  %>%
 opt_stylize(5) %>%
  cols_label(
    test_group = "Experiment Group",
    wiki = "Wikipedia",
    n_edits = "Number of edit attempts shown Tone Check",
    n_saves = "Number of published edits",
    completion_rate = "Proportion of edits saved"
  ) %>%
    tab_source_note(
        gt::md('Limited to Wikipedias with at least 200 edit attempts during reviewed timeframe')
    )



display_html(as_raw_html(tone_check_completion_rate_bywiki ))
Tone Check edit completion rate by Wikipedia
Experiment Group Number of edit attempts shown Tone Check Number of published edits Proportion of edits saved
French Wikipedia
control (eligible but not shown tone check) 2579 1245 48.3%
test (tone check shown) 3332 1600 48%
Japanese Wikipedia
control (eligible but not shown tone check) 704 420 59.7%
test (tone check shown) 902 476 52.8%
Portuguese Wikipedia
control (eligible but not shown tone check) 564 268 47.5%
test (tone check shown) 613 279 45.5%
Limited to Wikipedias with at least 200 edit attempts during reviewed timeframe

The largest decrease in edit completion rate was at Japanese Wikipedia (11.6% decrease [-6.9pp]), while there was almost no change at French Wikipedia (48.3% to 48%).

Results indicate that changes in edit completion and revert rates moved together at each Wikipedia. The largest decrease in completion occurred at Japanese Wikipedia, which also saw the most substantial decrease in revert rate. In contrast, French Wikipedia saw almost no change in edit completion rates and only a small decrease in revert rates.

This pattern suggests that Tone Check may be deterring some lower-quality edits that would otherwise have been reverted.

Confirming impact of Tone Check on edit completion rate

Code
# calculate the proportion for each user

tone_check_completion_rate_overall_byuser <- tone_check_completion_rates %>%
    filter(tone_check_shown == 1 ) %>% #limit to sessions where tone check was shown
    group_by(test_group, platform, user_id) %>%
    summarise(n_edits = n_distinct(editing_session),
              n_saves = n_distinct(editing_session[saved_edit > 0 & was_reverted == 0])) %>%
    mutate(completion_rate = n_saves/n_edits)
Code
# rename field names to align with relax package naming convention
tone_check_completion_rate_overall_byuser <- tone_check_completion_rate_overall_byuser |>
  mutate(variation = factor(test_group,
         levels = c("control (eligible but not shown tone check)", "test (tone check shown)"),
         labels = c("control", "treatment")))

tone_check_completion_rate_overall_byuser$outcome = tone_check_completion_rate_overall_byuser$completion_rate 
Code
overall_impact_completes <- tone_check_completion_rate_overall_byuser |>
    analyze_relative_lift(metric_type = "proportion") |>
    gt() |>
  tab_header(
    title = md("**Evaluating Tone Check impact on edit completion rate**"),
    subtitle = md("Difference in Metric (Test Group - Control Group)")
  ) |>
  tab_spanner(
    label = md("**Bayesian Analysis**"),
    columns = c(estimate_bayes, chance_to_win, cred_lower, cred_upper)
  ) |>
  tab_spanner(
    label = md("**Frequentist Analysis**"),
    columns = c(estimate_freq, p_value, conf_lower, conf_upper)
  ) |>
  
  # Rename Columns for clarity ---
  cols_label(
    estimate_bayes = md("Point Estimate"),
    chance_to_win = md("Chance to Win"),
    cred_lower = md("95% CI Lower"),
    cred_upper = md("95% CI Upper"),
    estimate_freq = md("Point Estimate"),
    p_value = md("*p*-value"),
    conf_lower = md("95% CI Lower"),
    conf_upper = md("95% CI Upper")
  ) |>
  
  # Apply Formatting (Decimals and CI Grouping) ---
  fmt_number(
    columns = everything(),
    decimals = 3 # Use 3 decimals for precision
  ) |>
  
  # Highlight key finding (conclusive) ---
  tab_footnote(
    footnote = md("The 95% intervals do not cross zero, indicating a statistically conclusive difference"),
    locations = cells_column_labels(columns = c(cred_lower, conf_lower))
  ) %>%
  
  #  Style the table ---
  tab_options(
    table.border.top.color = "lightgray",
    column_labels.border.bottom.color = "black",
    column_labels.border.bottom.width = px(2),
    data_row.padding = px(5)
  )

display_html(as_raw_html(overall_impact_completes))
Evaluating Tone Check impact on edit completion rate
Difference in Metric (Test Group - Control Group)
Bayesian Analysis
Frequentist Analysis
Point Estimate Chance to Win 95% CI Lower1 95% CI Upper Point Estimate p-value 95% CI Lower1 95% CI Upper
−0.078 0.000 −0.123 −0.033 −0.079 0.001 −0.124 −0.034
1 The 95% intervals do not cross zero, indicating a statistically conclusive difference

Results indicate that Tone Check introduced a small but statistically significant (p = 0.001) level of friction into the editing process. Estimates indicate that Tone Check likely decreased the edit completion rate by 7.9 percentage points.

Code
# check by platform numbers

platform_impact_completes <- tone_check_completion_rate_overall_byuser |>
    group_by(platform) |>
    group_modify(~ analyze_relative_lift(.x, metric_type = "proportion"))|>
    gt() |>
  tab_header(
    title = md("**Evaluating Tone Check impact on completion rate by platform**"),
    subtitle = md("Difference in Metric (Test Group - Control Group)")
  ) |>
  tab_spanner(
    label = md("**Bayesian Analysis**"),
    columns = c(estimate_bayes, chance_to_win, cred_lower, cred_upper)
  ) |>
  tab_spanner(
    label = md("**Frequentist Analysis**"),
    columns = c(estimate_freq, p_value, conf_lower, conf_upper)
  ) |>
  
  # Rename Columns for clarity ---
  cols_label(
    platform = md("Platform"),
    estimate_bayes = md("Point Estimate"),
    chance_to_win = md("Chance to Win"),
    cred_lower = md("95% CI Lower"),
    cred_upper = md("95% CI Upper"),
    estimate_freq = md("Point Estimate"),
    p_value = md("*p*-value"),
    conf_lower = md("95% CI Lower"),
    conf_upper = md("95% CI Upper")
  ) |>
  
  # Apply Formatting (Decimals and CI Grouping) ---
  fmt_number(
    columns = everything(),
    decimals = 3 # Use 3 decimals for precision
  ) |>
  
  # Highlight key finding (inconclusive on mobile web) ---
  tab_footnote(
    footnote = md("For mobile web, the 95% intervals cross zero, indicating no statistically conclusive difference."),
    locations = cells_column_labels(columns = c(cred_lower, conf_lower))
  ) %>%
  
  #  Style the table ---
  tab_options(
    table.border.top.color = "lightgray",
    column_labels.border.bottom.color = "black",
    column_labels.border.bottom.width = px(2),
    data_row.padding = px(5)
  )

display_html(as_raw_html(platform_impact_completes))
Evaluating Tone Check impact on completion rate by platform
Difference in Metric (Test Group - Control Group)
Bayesian Analysis
Frequentist Analysis
Point Estimate Chance to Win 95% CI Lower1 95% CI Upper Point Estimate p-value 95% CI Lower1 95% CI Upper
mobile web
−0.024 0.342 −0.140 0.092 −0.025 0.679 −0.143 0.093
desktop
−0.082 0.000 −0.131 −0.034 −0.083 0.001 −0.131 −0.035
1 For mobile web, the 95% intervals cross zero, indicating no statistically conclusive difference.

The per-platform analysis reveals that the 3.2% overall decrease in completion rates is driven by desktop editors.

On desktop, we confirmed a statistically significant decrease in edit completion rate (p = 0.001). Tone Check shown to users on desktop likely caused a decrease of around 8.3 percentage points. On mobile web, we observed a small, statistically insignificant decrease in edit completion rate (p = 0.679).
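
As a rough cross-check on the session-level counts, one could run a standard two-sample test of proportions on the pooled by-platform numbers. This is only a sketch and not the method used in this report: the tables above come from analyze_relative_lift applied to per-user completion rates, whereas base R's prop.test below treats every editing session as independent, so its estimate and p-value are not expected to match the modeled results.

```r
# Session-level counts pooled across platforms (from the by-platform table)
saves    <- c(test = 489 + 1866, control = 336 + 1597)
attempts <- c(test = 1266 + 3581, control = 864 + 2983)

# Two-sample test of equal proportions (stats package, shipped with base R)
check <- prop.test(x = saves, n = attempts)

# Pooled difference, test minus control, in percentage points
diff_pp <- unname(check$estimate[1] - check$estimate[2]) * 100
print(round(diff_pp, 2))
print(check$conf.int)  # 95% confidence interval for the difference in proportions
print(check$p.value)
```

The pooled difference is much smaller than the per-user estimate, which illustrates how strongly the unit of analysis (session vs. user) can shape the size of the measured effect.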

Constructive Edit Rate

Hypothesis: A larger proportion of new content edits by Newcomers and Junior Contributors will be constructive because they will be made aware that the new text they’re attempting to publish needs to be written in a neutral tone, when they don’t first think/know to write in this way themselves.

Methodology: The proportion of all published edits by users with ≤100 cumulative edits on mobile web in the main namespace that are constructive (not reverted within 48 hours). Similar to revert rate, the analysis was limited to new content edits shown or eligible to be shown Tone Check so we can isolate data to edits that would be impacted by this feature.

Note: This metric is also the WE 1.1 Key Result. We will include Tone Check’s impact on this metric as part of our evaluation of the collective impact of interventions deployed under WE 1.1 on this metric.

Overall

Code
tone_check_constructive_overall <- tone_check_publish_data |>
    filter(is_new_content == 1 & is_test_eligible == 'eligible') |> #limit to eligible new content edits
    group_by(test_group) |>
    summarise(n_edits = n_distinct(editing_session),
              n_const = n_distinct(editing_session[was_reverted == 0])) |> #edits not reverted
    # pooled rate per test group; n_edits, n_const, and constructive_edit_rate are used by the plot that follows
    mutate(constructive_edit_rate = paste0(round(n_const/n_edits * 100, 1), "%"))
Code
# plot visualization of overall constructive edit rates
dodge <- position_dodge(width=0.9)

p <- tone_check_constructive_overall |>
    ggplot(aes(x= test_group, y = n_const/n_edits, fill = test_group)) +
    geom_col(position = 'dodge') +
    scale_y_continuous(labels = scales::percent) +
    geom_text(aes(label = paste(constructive_edit_rate, "\n", n_const,"constructive edits"), fontface=2), vjust=1.2, size = 10, color = "white") +
    scale_fill_manual(values= c("#999999", "dodgerblue4"), name = "Experiment Group")  +
    labs (y = "Percent of edits that were constructive ",
           x = "Experiment Group",
          title = "Constructive edit rate",
           caption = "Limited to published new content edits shown or eligible to be shown Tone Check")  +
  theme(
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        plot.title = element_text(hjust = 0.5),
        text = element_text(size=24),
        axis.text.x = element_text(size = 24),
        axis.title.x = element_text(margin = margin(t = 20, unit = "pt")),
        legend.position= "none",
        axis.line = element_line(colour = "black")) 

      
p

Overall, constructive edit rates increased by 6.2% (4.4 percentage points) for people shown Tone Check in the test group.

By platform

Code
tone_check_constructive_byplatform <- tone_check_publish_data |>
    filter( is_test_eligible == 'eligible') |>
    group_by(platform, test_group) |>
    summarise(n_edits = n_distinct(editing_session),
              n_const = n_distinct(editing_session[was_reverted == 0])) |> 
    mutate(constructive_edit_rate = paste0(round(n_const/n_edits * 100, 1), "%")) |>   
    select(-c(3,4)) %>% # removing granular data columns for publication
    gt()  |>
    tab_header(
    title = "Constructive edit rate by platform"
      )  |>
  opt_stylize(5)  |>
  cols_label(
    test_group = "Test Group",
    platform = "Platform",
    #n_edits = "Number of published new content edits",
   # n_const = "Number of constructive edits",
    constructive_edit_rate = "Proportion of new content edits that were constructive"
  ) |>
    tab_source_note(
        gt::md('Limited to published new content edits shown or eligible to be shown Tone Check')
    )

display_html(as_raw_html(tone_check_constructive_byplatform))
Constructive edit rate by platform
Test Group Proportion of new content edits that were constructive
mobile web
control (eligible but not shown tone check) 65.5%
test (tone check shown) 65.4%
desktop
control (eligible but not shown tone check) 75%
test (tone check shown) 79.8%
Limited to published new content edits shown or eligible to be shown Tone Check

We continue to see differing trends on mobile web compared to desktop. Constructive edit rates increased on desktop while remaining essentially unchanged on mobile web (65.5% vs. 65.4%).

On desktop, the constructive edit rate increased by 6.4% (75% to 79.8%), while we observed no statistically significant change in mobile web constructive edits.

By user experience

Code
tone_check_constructive_byexp <- tone_check_publish_data  |>
    filter(platform == 'desktop' & is_new_content == 1 & is_test_eligible == 'eligible')  |>
    group_by(experience_level_group, test_group) %>%
    summarise(n_edits = n_distinct(editing_session),
              n_const = n_distinct(editing_session[was_reverted == 0]))  |> 
    mutate(constructive_edit_rate = paste0(round(n_const/n_edits * 100, 1), "%"))  |>   
    select(-c(3,4))  |> # removing granular data columns for publication
    gt()   |>
    tab_header(
    title = "Constructive edit rate by user experience"
      )  %>%
  opt_stylize(5)  |>
  cols_label(
    test_group = "Test Group",
    experience_level_group = "User type",
    #n_edits = "Number of published new content edits",
   # n_const = "Number of constructive edits",
    constructive_edit_rate = "Proportion of new content edits that were constructive"
  )  |>
    tab_source_note(
        gt::md('Limited to published new content edits on desktop shown or eligible to be shown Tone Check')
    )



display_html(as_raw_html(tone_check_constructive_byexp ))
Constructive edit rate by user experience
Test Group Proportion of new content edits that were constructive
Unregistered
control (eligible but not shown tone check) 74.4%
test (tone check shown) 69.1%
Newcomer
control (eligible but not shown tone check) 79.6%
test (tone check shown) 77.6%
Junior Contributor
control (eligible but not shown tone check) 68%
test (tone check shown) 81.4%
Limited to published new content edits on desktop shown or eligible to be shown Tone Check

The increase in constructive edit rate appears to be primarily due to an increase in constructive edits by Junior Contributors shown Tone Check, where we observed a 14.8% increase [+10.2pp] across both platforms. The table above is limited to desktop edits, where there was a 19.7% increase in constructive edits by Junior Contributors (68% to 81.4%).

By Partner Wikipedia

Code
tone_check_constructive_bywiki <- tone_check_publish_data |>
    filter(is_new_content == 1 & is_test_eligible == 'eligible') |>
    group_by(wiki, test_group) %>%
    summarise(n_edits = n_distinct(editing_session),
              n_const = n_distinct(editing_session[was_reverted == 0])) |> 
    mutate(constructive_edit_rate = paste0(round(n_const/n_edits * 100, 1), "%")) |> 
    #filter(n_edits > 100) %>%  #limit to wikis with sufficient events
    select(-c(3,4)) |> # removing granular data columns for publication
    gt()  |>
    tab_header(
    title = "Constructive edit rate by Wikipedia"
      )  |>
  opt_stylize(5) |>
  cols_label(
    test_group = "Test Group",
    wiki = md("**Wikipedia**"),
    #n_edits = "Number of published new content edits",
    #n_const = "Number of constructive edits",
    constructive_edit_rate = "Proportion of new content edits that were constructive"
  )  


display_html(as_raw_html(tone_check_constructive_bywiki ))
Constructive edit rate by Wikipedia
Test Group Proportion of new content edits that were constructive
French Wikipedia
control (eligible but not shown tone check) 69.2%
test (tone check shown) 71%
Japanese Wikipedia
control (eligible but not shown tone check) 68.8%
test (tone check shown) 89.1%
Portuguese Wikipedia
control (eligible but not shown tone check) 78.6%
test (tone check shown) 83.1%

Constructive edit rates were higher in the test group at all three partner Wikipedias.

Aligned with the decreased revert rate findings, Tone Check had the biggest impact on constructive edit rates at Japanese Wikipedia, where there was a 29.5% increase (68.8% to 89.1%) in the constructive edit rate for users shown Tone Check compared to eligible edits in the control group.

Due to the small sample size of per-Wikipedia edits, we are currently not able to confirm the statistical significance of the increases at any of these Wikipedias, but the direction and magnitude of change indicate that Tone Check is having a positive effect on edit quality at each partner Wikipedia.

Confirming impact of Tone Check on constructive edit rate

We also modeled the impact of Tone Check on constructive edit rates to confirm the magnitude and direction of Tone Check’s effect on the proportion of a user’s edits that are constructive. This helps account for random effects of the user and wiki.

Code
tone_check_constructive_overall_byuser <- tone_check_publish_data |>
    filter(is_test_eligible == 'eligible') |> #limit to eligible edits
    group_by(test_group, platform, user_id) |>
    summarise(n_edits = n_distinct(editing_session),
              n_const = n_distinct(editing_session[was_reverted == 0])) |> #edits not reverted
    mutate(constructive_edit_rate = n_const/n_edits) 
Code
# rename test group field names to align with relax package naming convention
tone_check_constructive_overall_byuser  <- tone_check_constructive_overall_byuser |>
  mutate(variation = factor(test_group,
         levels = c("control (eligible but not shown tone check)", "test (tone check shown)"),
         labels = c("control", "treatment")))
Code
# create new column name to align with relax package naming
tone_check_constructive_overall_byuser$outcome = tone_check_constructive_overall_byuser$constructive_edit_rate
Code
# overall impact
overall_impact_const_edits <- tone_check_constructive_overall_byuser |>
    analyze_relative_lift(metric_type = "proportion", ci_level = 0.9) |>
    gt() |>
  tab_header(
    title = md("**Evaluating Tone Check impact on overall constructive edit rate**"),
    subtitle = md("Difference in Metric (Test Group - Control Group)")
  ) |>
  tab_spanner(
    label = md("**Bayesian Analysis**"),
    columns = c(estimate_bayes, chance_to_win, cred_lower, cred_upper)
  ) |>
  tab_spanner(
    label = md("**Frequentist Analysis**"),
    columns = c(estimate_freq, p_value, conf_lower, conf_upper)
  ) |>
  
  # Rename Columns for clarity ---
  cols_label(
    estimate_bayes = md("Point Estimate"),
    chance_to_win = md("Chance to Win"),
    cred_lower = md("90% CI Lower"),
    cred_upper = md("90% CI Upper"),
    estimate_freq = md("Point Estimate"),
    p_value = md("*p*-value"),
    conf_lower = md("90% CI Lower"),
    conf_upper = md("90% CI Upper")
  ) |>
  
  # Apply Formatting (Decimals and CI Grouping) ---
  fmt_number(
    columns = everything(),
    decimals = 3 # Use 3 decimals for precision
  ) |>
  
  #  Style the table ---
  tab_options(
    table.border.top.color = "lightgray",
    column_labels.border.bottom.color = "black",
    column_labels.border.bottom.width = px(2),
    data_row.padding = px(5)
  )

display_html(as_raw_html(overall_impact_const_edits))
Evaluating Tone Check impact on overall constructive edit rate
Difference in Metric (Test Group - Control Group)

| Analysis | Point Estimate | Chance to Win | p-value | 90% CI Lower | 90% CI Upper |
|---|---|---|---|---|---|
| Bayesian | 0.032 | 0.948 | - | 0.000 | 0.064 |
| Frequentist | 0.032 | - | 0.103 | 0.000 | 0.064 |

Tone Check was associated with a slight increase in the overall constructive edit rate (point estimate 0.032), but the result falls just short of statistical significance at the 90% confidence level (p = 0.103; the lower bound of the 90% interval sits at 0.000). The Chance to Win indicates a 94.8% probability that Tone Check increases the likelihood a user completes a constructive edit.
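
As a rough consistency check on the table above, the frequentist p-value and the Bayesian Chance to Win can both be approximated from the point estimate and the 90% interval alone. The sketch below assumes a symmetric normal interval and an uninformative prior; it is not the relax package's implementation, and the variable names are illustrative.

```r
# Values read from the overall results table (rounded to 3 decimals)
estimate <- 0.032
ci_lower <- 0.000
ci_upper <- 0.064
ci_level <- 0.90

# Back out the standard error implied by a symmetric normal interval
z_crit    <- qnorm(1 - (1 - ci_level) / 2)   # critical value for a 90% interval
std_error <- (ci_upper - ci_lower) / (2 * z_crit)

# Two-sided p-value and the probability that the true difference is above zero
z_stat        <- estimate / std_error
p_two_sided   <- 2 * (1 - pnorm(z_stat))
chance_to_win <- pnorm(z_stat)

round(c(p_value = p_two_sided, chance_to_win = chance_to_win), 3)
```

Because the rounded interval starts exactly at zero, this approximation gives a p-value of about 0.10 and a Chance to Win of about 0.95, which is in line with the reported 0.103 and 0.948. The small gaps come from rounding in the published table.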

Code
# check by platform numbers

platform_impact_constr_edits <- tone_check_constructive_overall_byuser  |>
    group_by(platform) |>
    group_modify(~ analyze_relative_lift(.x, metric_type = "proportion", ci_level = 0.9))|>
    gt() |>
  tab_header(
    title = md("**Evaluating Tone Check impact on constructive edit rate by platform**"),
    subtitle = md("Difference in Metric (Test Group - Control Group)")
  ) |>
  tab_spanner(
    label = md("**Bayesian Analysis**"),
    columns = c(estimate_bayes, chance_to_win, cred_lower, cred_upper)
  ) |>
  tab_spanner(
    label = md("**Frequentist Analysis**"),
    columns = c(estimate_freq, p_value, conf_lower, conf_upper)
  ) |>
  
  # Rename Columns for clarity ---
  cols_label(
    platform = md("Platform"),
    estimate_bayes = md("Point Estimate"),
    chance_to_win = md("Chance to Win"),
    cred_lower = md("90% CI Lower"),
    cred_upper = md("90% CI Upper"),
    estimate_freq = md("Point Estimate"),
    p_value = md("*p*-value"),
    conf_lower = md("90% CI Lower"),
    conf_upper = md("90% CI Upper")
  ) |>
  
  # Apply Formatting (Decimals and CI Grouping) ---
  fmt_number(
    columns = everything(),
    decimals = 3 # Use 3 decimals for precision
  ) |>
  
  #  Style the table ---
  tab_options(
    table.border.top.color = "lightgray",
    column_labels.border.bottom.color = "black",
    column_labels.border.bottom.width = px(2),
    data_row.padding = px(5)
  )

display_html(as_raw_html(platform_impact_constr_edits))
Evaluating Tone Check impact on constructive edit rate by platform
Difference in Metric (Test Group - Control Group)

| Platform | Analysis | Point Estimate | Chance to Win | p-value | 90% CI Lower | 90% CI Upper |
|---|---|---|---|---|---|---|
| mobile web | Bayesian | 0.015 | 0.630 | - | −0.060 | 0.091 |
| mobile web | Frequentist | 0.016 | - | 0.737 | −0.061 | 0.092 |
| desktop | Bayesian | 0.030 | 0.926 | - | −0.004 | 0.064 |
| desktop | Frequentist | 0.030 | - | 0.147 | −0.004 | 0.064 |

Results suggest that Tone Check is most effective at increasing the quality of edits on desktop, where we see a positive trend (+3.0 pp). The Bayesian 92.6% Chance to Win suggests the tool is likely increasing constructive edits on desktop, although the 90% intervals still include zero (p = 0.147).

Mobile Web results are slightly directionally positive (+1.5 pp) but are not yet statistically significant. Consistent with other findings, presenting Tone Check on mobile web does not appear to be disruptive but has less of an impact on user behavior compared to desktop.

Constructive Retention Rate (Second Week)

Hypothesis: Newcomers and Junior Contributors will be more likely to return to publish a new content edit in the future that does not include non-neutral language because Tone Check will have caused them to realize when they are at risk of this not being true.

Methodology: We reviewed the proportion of Newcomers and Junior Contributors who published an edit in the main namespace where Tone Check was shown and then returned to make an unreverted main namespace edit between 7 and 14 days after their first edit (second week retention).
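
The retention definition above can be illustrated with a small base R sketch. The `edits` data frame, its column names, and the inclusive day 7 to day 14 window are assumptions made for illustration; the report's actual query against the production tables is not shown here.

```r
# Hypothetical edit log: one row per published main-namespace edit
edits <- data.frame(
  user_id      = c(1, 1, 2, 2, 3),
  edit_date    = as.Date(c("2025-09-03", "2025-09-12",
                           "2025-09-04", "2025-09-06",
                           "2025-09-05")),
  was_reverted = c(0, 0, 0, 0, 0)
)

# Days elapsed since each user's first edit
edit_day  <- as.numeric(edits$edit_date)
first_day <- ave(edit_day, edits$user_id, FUN = min)
edits$days_since_first <- edit_day - first_day

# An edit counts toward second week retention if it is unreverted and
# falls 7 to 14 days after the first edit (window bounds are an assumption)
edits$in_second_week <- edits$days_since_first >= 7 &
  edits$days_since_first <= 14 &
  edits$was_reverted == 0

# A user is retained if they have at least one qualifying edit
retained_by_user <- tapply(edits$in_second_week, edits$user_id, any)
retention_rate   <- mean(retained_by_user)
retention_rate
```

In this toy data only user 1 returns inside the window, so one of the three users is retained.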

Code
# load  retention
constructive_retention_rate_1 <-
  read.csv(
    file = 'data/constructive_retention_data_14day.tsv',  
    header = TRUE,
    sep = "\t",
    stringsAsFactors = FALSE
  ) 
Code
# load constructive retention rate (second dataset)
constructive_retention_rate_2 <-
  read.csv(
    file = 'data/constructive_retention_data_14day_pt2.tsv',
    header = TRUE,
    sep = "\t",
    stringsAsFactors = FALSE  # match the first dataset so rbind() combines like column types
  ) 
Code
# Combine the two datasets

constructive_retention_rate <- rbind(constructive_retention_rate_1, constructive_retention_rate_2)
Code
# Cleaning up dataset and renaming fields to clarify meanings

# Set experience level group and factor levels
constructive_retention_rate <- constructive_retention_rate %>%
  mutate(
    experience_level_group = case_when(
     user_edit_count == 0 & user_status == 'registered' ~ 'Newcomer',
     user_edit_count == 0 & user_status == 'unregistered' ~ 'Unregistered',
      user_edit_count > 0 &  user_edit_count <= 100 ~ "Junior Contributor",
      user_edit_count >  100 ~ "Non-Junior Contributor"   
    ),
    experience_level_group = factor(experience_level_group,
         levels = c("Unregistered","Newcomer", "Non-Junior Contributor", "Junior Contributor")
   ))  

#rename experiment field to clarify
constructive_retention_rate <- constructive_retention_rate %>%
  mutate(test_group = factor(test_group,
         levels = c('2025-09-editcheck-tone-control', '2025-09-editcheck-tone-test'),
         labels = c("control (eligible but not shown tone check)", "test (tone check shown)")))


#rename platform from phone to mobile web to clarify meaning
constructive_retention_rate <-constructive_retention_rate %>%
  mutate(platform = factor(platform,
         levels = c('phone', 'desktop'),
         labels = c("mobile web", "desktop")))

Overall

Code

constructive_retention_overall <- constructive_retention_rate %>%
    group_by(test_group)  %>%
    summarise(return_editors = sum(return_editors),
              editors = sum(editors),
        retention_rate = paste0(round(return_editors/editors * 100, 1), "%"))
Code
constructive_retention_overall_table <- constructive_retention_overall  %>%
  gt()  %>%
  tab_header(
    title = "Constructive second week retention rate"
  )  %>%
  cols_label(
    test_group = "Experiment group",
    return_editors = "Number of editors that returned second week",
    editors = "Number of first week editors",
    retention_rate = "Retention rate"
  ) %>%
  opt_stylize(5) %>%
  tab_footnote(
    footnote = "Limited to users shown or eligible to be shown at least one Tone Check their first week",
    locations = cells_column_labels(
      columns = 'retention_rate'
    )
  ) 

display_html(as_raw_html(constructive_retention_overall_table))
Constructive second week retention rate
Experiment group Number of editors that returned second week Number of first week editors Retention rate1
control (eligible but not shown tone check) 115 1995 5.8%
test (tone check shown) 167 2309 7.2%
1 Limited to users shown or eligible to be shown at least one Tone Check their first week

People who encountered Tone Check were about 25% more likely to return to make a constructive edit in their second week. 7.2% of people in the test group (shown Tone Check) returned to make a subsequent constructive edit (167 of 2,309), compared to 5.8% in the control group (115 of 1,995), a difference of roughly 1.5 percentage points before rounding.

This suggests that, rather than discouraging users, Tone Check may make them feel more supported or successful in their contributions, leading them to return at higher rates.

By Platform

Code

constructive_retention_byplatform <- constructive_retention_rate %>%
    group_by(platform, test_group)  %>%
    summarise(return_editors = sum(return_editors),
              editors = sum(editors),
        retention_rate = paste0(round(return_editors/editors * 100, 1), "%"))
Code
constructive_retention_byplatform_table <- constructive_retention_byplatform %>%
 select(-c(3,4)) |> # removing granular data columns for publication
  gt()  %>%
  tab_header(
    title = "Constructive second week retention rate by platform"
  )  %>%
  cols_label(
    test_group = "Experiment group",
    platform = "Platform",
    #return_editors = "Number of editors that returned second week",
    #editors = "Number of first week editors",
    retention_rate = "Retention rate"
  ) %>%
  opt_stylize(5) %>%
  tab_footnote(
    footnote = "Limited to users shown or eligible to be shown at least one Tone Check",
    locations = cells_column_labels(
      columns = 'retention_rate'
    )
  ) 

display_html(as_raw_html(constructive_retention_byplatform_table))
Constructive second week retention rate by platform
Experiment group Retention rate1
mobile web
control (eligible but not shown tone check) 3.1%
test (tone check shown) 5.2%
desktop
control (eligible but not shown tone check) 6.8%
test (tone check shown) 7.9%
1 Limited to users shown or eligible to be shown at least one Tone Check

While we don’t have sufficient sample size to confirm statistical significance on a per platform basis, we observed high relative increases in constructive retention rate on both platforms.

By User Experience

Code

constructive_retention_byuserexp <- constructive_retention_rate %>%
    group_by(experience_level_group, test_group)  %>%
    summarise(return_editors = sum(return_editors),
              editors = sum(editors),
        retention_rate = paste0(round(return_editors/editors * 100, 1), "%"))
Code
constructive_retention_byuserexp_table <- constructive_retention_byuserexp %>%
select(-c(3,4)) |> # removing granular data columns for publication
  gt()  %>%
  tab_header(
    title = "Constructive second week retention rate by user experience"
  )  %>%
  cols_label(
    test_group = "Experiment group",
    experience_level_group = "Experience level group",
    #return_editors = "Number of editors that returned second week",
    #editors = "Number of first week editors",
    retention_rate = "Retention rate"
  ) %>%
  opt_stylize(5) %>%
  tab_footnote(
    footnote = "Limited to users shown or eligible to be shown at least one Tone Check",
    locations = cells_column_labels(
      columns = 'retention_rate'
    )
  ) 

display_html(as_raw_html(constructive_retention_byuserexp_table))
Constructive second week retention rate by user experience
Experiment group Retention rate1
Unregistered
control (eligible but not shown tone check) 0.4%
test (tone check shown) 1.3%
Newcomer
control (eligible but not shown tone check) 2.7%
test (tone check shown) 3.8%
Junior Contributor
control (eligible but not shown tone check) 9.5%
test (tone check shown) 11%
1 Limited to users shown or eligible to be shown at least one Tone Check

We observed increases in retention rate across all user groups as well.

Confirming Impact on Retention Rate

Because retention outcomes can be assumed to be independent across users (a user can only be retained once), we use a simple two-sample test of proportions to confirm significance.

Code
# reframe data for model

constructive_retention_overall <- constructive_retention_rate %>%
    group_by(test_group)  %>%
    summarise(return_editors = sum(return_editors),
              editors = sum(editors),
        retention_rate = return_editors/editors )
Code
#Extract the vectors
successes <- constructive_retention_overall$return_editors
totals <- constructive_retention_overall$editors

#Run  Proportion Test
res <- prop.test(x = successes, n = totals, conf.level = 0.90)

res

The result is directionally positive and shows a clear improvement in constructive retention.

We are 90% confident that Tone Check results in a relative increase in constructive retention of somewhere between 3.3% and 47.7%.
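
For readers who want to trace these figures, the sketch below reruns the proportion test directly on the counts published in the overall retention table. Two things here are assumptions rather than the report's confirmed method: the groups are ordered test first so that the interval reads as test minus control, and the relative range is obtained by dividing the absolute interval by the control rate, which treats the control rate as fixed.

```r
# Counts from the "Constructive second week retention rate" table
returned <- c(test = 167, control = 115)
editors  <- c(test = 2309, control = 1995)

# Two-sample test of proportions at the 90% confidence level
# (prop.test applies a continuity correction by default)
res <- prop.test(x = returned, n = editors, conf.level = 0.90)

# Absolute difference in retention rate (test - control) and its interval
abs_ci <- as.numeric(res$conf.int)

# Convert to a relative change against the control retention rate
control_rate <- returned[["control"]] / editors[["control"]]
test_rate    <- returned[["test"]] / editors[["test"]]
rel_lift     <- test_rate / control_rate - 1
rel_ci       <- abs_ci / control_rate

round(100 * c(relative_lift = rel_lift,
              lower = rel_ci[1],
              upper = rel_ci[2]), 1)
```

Under these assumptions the relative interval should land close to the 3.3% to 47.7% range quoted above, with a relative lift near the middle of that range.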

Guardrails

We identified a set of 5 guardrails to make sure that Tone Check is not negatively impacting people’s experience completing an edit or causing disruption on the wikis. These were identified through a pre-mortem task completed at the beginning of the project. We’ve confirmed that Tone Check did not cause any decreases in edit quality (see the New Content edit rate section) or decreases in edit completion rate greater than 20% (see the edit completion rate section). We also confirmed that Tone Check did not result in high block rates or false positive rates (see sections below).

Guardrail #1: False Positive Rate

Description: Proportion of contributors that decline revising the text they have drafted and indicate that it was irrelevant.

Methodology: For this check, we define a false positive as a case where a contributor declines to revise the text they have drafted (event.feature = 'editCheck-tone' AND event.action = 'action-dismiss') and selects “The tone is appropriate” when declining the check. We further limited the analysis to edits that were not reverted within 48 hours (an indicator of a quality edit).

Overall

Code

# overall dismissal rate
tone_check_false_positive_overall <- tone_check_reject_data %>%
    filter(was_tone_check_shown == 1 & is_new_content == 1) %>% #limit to where shown
    summarise(n_edits = n_distinct(editing_session),
              n_rejects = n_distinct(editing_session[n_rejects > 0 & reject_reason == 'The tone is appropriate'
                                                    & was_reverted == 0])) %>% # at least one tone check declined as appropriate and edit not reverted
    mutate(dismissal_rate = paste0(round(n_rejects/n_edits * 100, 1), "%")) %>%   
    gt()  %>%
    tab_header(
    title = "Edits where user declined to revise text because the tone was appropriate"
      )  %>%
    opt_stylize(5) %>%
  cols_label(
    n_edits = "Number of edits shown Tone check",
    n_rejects = "Number of edits that declined Tone Check as irrelevant",
    dismissal_rate = "Decline Rate"
  ) %>%
    tab_source_note(
        gt::md('Limited to unreverted published edits where at least one Tone Check was shown')
    )

display_html(as_raw_html(tone_check_false_positive_overall))
Edits where user declined to revise text because the tone was appropriate
Number of edits shown Tone check Number of edits that declined Tone Check as irrelevant Decline Rate
1729 283 16.4%
Limited to unreverted published edits where at least one Tone Check was shown

Editors declined a Tone Check and selected “The tone is appropriate” in 16.4% of all published edits where Tone Check was shown. This excludes edits that were reverted within 48 hours.

For comparison, this is higher than the rates observed for Reference Check (6.6% of editors indicated that the content they were adding did not require a reference) and lower than Paste Check (30% of editors indicated that they wrote the content).

By Platform

Code
# platform false positive rate
tone_check_false_positive_byplatform <- tone_check_reject_data %>%
    group_by(platform) %>%
    filter(was_tone_check_shown == 1 & is_new_content == 1) %>% #limit to where shown
    summarise(n_edits = n_distinct(editing_session),
              n_rejects = n_distinct(editing_session[n_rejects > 0 & reject_reason == 'The tone is appropriate'
                                                    & was_reverted == 0])) %>% # at least one tone check declined as appropriate and edit not reverted
    mutate(dismissal_rate = paste0(round(n_rejects/n_edits * 100, 1), "%")) %>%   
    select(-c(2,3)) |> # removing granular data columns for publication
    gt()  %>%
    tab_header(
    title = "Edits where user declined to revise text because the tone was appropriate by platform"
      )  %>%
    opt_stylize(5) %>%
  cols_label(
    platform = "Platform",
    #n_edits = "Number of edits shown Tone check",
    #n_rejects = "Number of edits that declined Tone Check as irrelevant",
    dismissal_rate = "Decline Rate"
  ) %>%
    tab_source_note(
        gt::md('Limited to unreverted published edits where at least one Tone Check was shown')
    )

display_html(as_raw_html(tone_check_false_positive_byplatform))
Edits where user declined to revise text because the tone was appropriate by platform
Platform Decline Rate
mobile web 15%
desktop 16.8%
Limited to unreverted published edits where at least one Tone Check was shown

There are similar rates on mobile web and desktop.

By User Experience

Code
# user experience false positive rate
tone_check_false_positive_byuserexp <- tone_check_reject_data %>%
    group_by(experience_level_group) %>%
    filter(was_tone_check_shown == 1 & is_new_content == 1) %>% #limit to where shown
    summarise(n_edits = n_distinct(editing_session),
              n_rejects = n_distinct(editing_session[n_rejects > 0 & reject_reason == 'The tone is appropriate'
                                                    & was_reverted == 0])) %>% # at least one tone check declined as appropriate and edit not reverted
    mutate(dismissal_rate = paste0(round(n_rejects/n_edits * 100, 1), "%")) %>%   
    select(-c(2,3)) |> # removing granular data columns for publication
    gt()  %>%
    tab_header(
    title = "Edits where user declined to revise text because the tone was appropriate by user experience"
      )  %>%
    opt_stylize(5) %>%
  cols_label(
    experience_level_group = "User experience",
    #n_edits = "Number of edits shown Tone check",
    #n_rejects = "Number of edits that declined Tone Check as irrelevant",
    dismissal_rate = "Decline Rate"
  ) %>%
    tab_source_note(
        gt::md('Limited to unreverted published edits where at least one Tone Check was shown')
    )

display_html(as_raw_html(tone_check_false_positive_byuserexp))
Edits where user declined to revise text because the tone was appropriate by user experience
User experience Decline Rate
Unregistered 18.2%
Newcomer 16%
Junior Contributor 16%
Limited to unreverted published edits where at least one Tone Check was shown

By Partner Wikipedia

Code
# per wiki false positive rate
tone_check_false_positive_bywiki <- tone_check_reject_data %>%
    group_by(wiki) %>%
    filter(was_tone_check_shown == 1 & is_new_content == 1) %>% #limit to where shown
    summarise(n_edits = n_distinct(editing_session),
              n_rejects = n_distinct(editing_session[n_rejects > 0 & reject_reason == 'The tone is appropriate'
                                                    & was_reverted == 0])) %>% # at least one tone check declined as appropriate and edit not reverted
    mutate(dismissal_rate = paste0(round(n_rejects/n_edits * 100, 1), "%")) %>%   
    select(-c(2,3)) |> # removing granular data columns for publication
    gt()  %>%
    tab_header(
    title = "Edits where user declined to revise text because the tone was appropriate by partner Wikipedia"
      )  %>%
    opt_stylize(5) %>%
  cols_label(
    wiki = "Wikipedia",
    #n_edits = "Number of edits shown Tone check",
    #n_rejects = "Number of edits that declined Tone Check as irrelevant",
    dismissal_rate = "Decline Rate"
  ) %>%
    tab_source_note(
        gt::md('Limited to unreverted published edits where at least one Tone Check was shown')
    )

display_html(as_raw_html(tone_check_false_positive_bywiki))
Edits where user declined to revise text because the tone was appropriate by partner Wikipedia
Wikipedia Decline Rate
French Wikipedia 17.4%
Japanese Wikipedia 7.3%
Portuguese Wikipedia 19.1%
Limited to unreverted published edits where at least one Tone Check was shown

At Japanese Wikipedia, editors declined Tone Check as irrelevant (tone was appropriate) in only 7.3% of edits where it was shown. This is notably lower than the rates observed for French Wikipedia (17.4%) and Portuguese Wikipedia (19.1%).

Notably, Japanese Wikipedia is also where we observed the highest increase in edit quality as measured by decrease in revert rates.

Guardrail #2: Block Rate

Description: Proportion of contributors blocked after publishing an edit where Tone Check was shown, compared to contributors eligible but not shown Tone Check.

Methodology: We gathered all edits where Edit Check was shown from the mediawiki_revision_change_tag table and joined them with mediawiki_private_cu_changes to gather user name info. We then reviewed both global and local blocks made within 6 hours of the Tone Check event, as identified in the logging table.
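
The 6-hour block window described above can be sketched as a join between check events and block log entries. Everything in this example is hypothetical: the `events` and `blocks` data frames, their column names, and the timestamps are made up for illustration and do not reflect the schema of the production tables.

```r
# Hypothetical check events: one row per user with the time of the event
events <- data.frame(
  user     = c("a", "b", "c"),
  event_ts = as.POSIXct(c("2025-09-10 10:00:00",
                          "2025-09-10 11:00:00",
                          "2025-09-10 09:00:00"), tz = "UTC")
)

# Hypothetical block log entries (local or global)
blocks <- data.frame(
  user     = c("a", "c"),
  block_ts = as.POSIXct(c("2025-09-10 12:30:00",
                          "2025-09-11 09:30:00"), tz = "UTC")
)

# Keep every event, attaching a block timestamp where one exists
merged <- merge(events, blocks, by = "user", all.x = TRUE)

# Flag blocks that happened within 6 hours after the event
merged$hours_to_block <- as.numeric(
  difftime(merged$block_ts, merged$event_ts, units = "hours")
)
merged$blocked_within_6h <- !is.na(merged$hours_to_block) &
  merged$hours_to_block >= 0 &
  merged$hours_to_block <= 6

block_rate <- mean(merged$blocked_within_6h)
block_rate
```

Here user "a" is blocked 2.5 hours after the event and counts toward the rate, user "c" is blocked a day later and does not, and user "b" is never blocked.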

Code
# load data for assessing blocks
edit_check_blocks <-
  read.csv(
    file = 'data/tone_check_eligible_users_blocked.csv',
    header = TRUE,
    sep = ",",
    stringsAsFactors = FALSE
  ) 
Code
#rename experiment field to clarify
edit_check_blocks <- edit_check_blocks%>%
  mutate(test_group = factor(bucket,
         levels = c('2025-09-editcheck-tone-control', '2025-09-editcheck-tone-test'),
         labels = c("control (eligible but not shown tone check)", "test (tone check shown)")))
Code
edit_check_local_blocks_overall <- edit_check_blocks %>%
    #filter(user_id == 0) %>% #filter to identify logged out users
    group_by(test_group) %>%
    summarise(blocked_users = n_distinct(ip[is_local_blocked == 'True' | is_global_blocked == 'True']),
              all_users = n_distinct(ip))  %>%  #look at blocks
    mutate(prop_blocks = paste0(round(blocked_users/all_users * 100, 1), "%")) %>%
    select(-c(2,3)) %>% #removing granular data columns 
    gt()  %>%
    tab_header(
    title = "Proportion of users blocked by experiment group"
      )  %>%
  opt_stylize(5) %>%
  cols_label(
    test_group = "Test Group",
    prop_blocks = "Proportion of users blocked"
  )  %>%
    tab_source_note(
        gt::md('Limited to users blocked within 6 hours after publishing an edit where Tone Check was shown')
    )


display_html(as_raw_html(edit_check_local_blocks_overall))
Proportion of users blocked by experiment group
Test Group Proportion of users blocked
control (eligible but not shown tone check) 0.8%
test (tone check shown) 0.8%
Limited to users blocked within 6 hours after publishing an edit where Tone Check was shown

People shown Tone Check are not blocked at higher rates than users in the control group. 0.8% of users were blocked in both the test and control groups.

Appendix

We reviewed a number of additional secondary metrics or curiosities. These are used to learn more about the impact of Tone Check on editing behavior, but are not primary targets of the intervention.

Tone Check Decline Rates and Reasons

Hypothesis: Knowing the reasons why people elect not to revise tone when the Check prompts them to do so (by platform) will help us decide what (if anything) can be done to decrease the proportion of people on desktop who do so.

Methodology: We reviewed the proportion of published new content edits shown Tone Check wherein people elected not to revise the tone of the text they added (i.e. the Tone Check was dismissed), split by the decline reason the user selected.

This was determined by edits where the user dismissed a Tone Check at least once in a session (event.feature = 'editCheck-tone' AND event.action = 'action-dismiss'). The analysis includes splits by the reason the user selected for dismissing the check.

Code
# load data for assessing edit reject frequency
tone_check_reject_data_1 <-
  read.csv(
    file = 'data/tone_check_rejects_data_ab.tsv',
    header = TRUE,
    sep = "\t",
    stringsAsFactors = FALSE
  ) 
Code
# load data for assessing edit reject frequency (second dataset)
tone_check_reject_data_2 <-
  read.csv(
    file = 'data/tone_check_rejects_data_ab_pt2.tsv',
    header = TRUE,
    sep = "\t",
    stringsAsFactors = FALSE
  ) 
Code
# Combine the two datasets

tone_check_reject_data <- rbind(tone_check_reject_data_1, tone_check_reject_data_2)
Code
# Set experience level group and factor levels
tone_check_reject_data <- tone_check_reject_data %>%
  mutate(
    experience_level_group = case_when(
     user_edit_count == 0 & user_status == 'registered' ~ 'Newcomer',
     user_edit_count == 0 & user_status == 'unregistered' ~ 'Unregistered',
      user_edit_count > 0 &  user_edit_count <= 100 ~ "Junior Contributor",
      user_edit_count >  100 ~ "Non-Junior Contributor"   
    ),
    experience_level_group = factor(experience_level_group,
         levels = c("Unregistered","Newcomer", "Non-Junior Contributor", "Junior Contributor")
   ))  

#rename experiment field to clarify
tone_check_reject_data <- tone_check_reject_data %>%
  mutate(test_group = factor(test_group,
         levels = c('2025-09-editcheck-tone-control', '2025-09-editcheck-tone-test'),
         labels = c("control (eligible but not shown Tone Check)", "test (shown Tone Check)")))



#rename platform from phone to mobile web to clarify meaning
tone_check_reject_data <- tone_check_reject_data %>%
  mutate(platform = factor(platform,
         levels = c('phone', 'desktop'),
         labels = c("mobile web", "desktop")))

# rename Wiki names
tone_check_reject_data  <- tone_check_reject_data  %>%
  mutate(
    wiki = recode(wiki, !!!wiki_name_lookup)
  )
Code
#Set fields and factor levels to assess number of checks shown

tone_check_reject_data <- tone_check_reject_data %>%
  mutate(
    multiple_checks_shown = 
         ifelse(n_checks_shown > 1, "multiple checks shown", "single check shown"),  
     multiple_checks_shown = factor( multiple_checks_shown ,
         levels = c("single check shown", "multiple checks shown")))
         
# note these buckets can be adjusted as needed based on distribution of data
tone_check_reject_data <- tone_check_reject_data %>%
  mutate(
    checks_shown_bucket = case_when(
     is.na(n_checks_shown) ~ '0',
     n_checks_shown == 1  ~ '1', 
     n_checks_shown == 2 ~ '2',
     n_checks_shown > 2 & n_checks_shown <= 5 ~ "3-5",
     n_checks_shown > 5 & n_checks_shown <= 10 ~ "6-10", 
     n_checks_shown > 10 ~ "over 10" 
    ),
    checks_shown_bucket = factor(checks_shown_bucket ,
         levels = c("0","1","2", "3-5", "6-10", "over 10")
   ))   
Code
# shorten and clarify reason field names
tone_check_reject_data <- tone_check_reject_data %>%
  mutate(
    reject_reason = case_when(
     reject_reason == 'no_reject_reason' ~ 'No reason provided',
     reject_reason == 'edit-check-feedback-reason-other'  ~ 'None applies', 
     reject_reason == 'edit-check-feedback-reason-appropriate' ~ 'The tone is appropriate',
     reject_reason == 'edit-check-feedback-reason-uncertain'  ~ 'Not sure how to revise tone'
    ),
    reject_reason = factor(reject_reason ,
         levels = c("No reason provided","None applies","The tone is appropriate", "Not sure how to revise tone")
   )) 

Overall

Code
# overall dismissal rate
tone_check_dismissal_overall <- tone_check_reject_data %>%
    filter(was_tone_check_shown == 1 & is_new_content == 1) %>% #limit to where shown
    summarise(n_edits = n_distinct(editing_session),
              n_rejects = n_distinct(editing_session[n_rejects > 0])) %>% # at least one tone check declined
    mutate(dismissal_rate = paste0(round(n_rejects/n_edits * 100, 1), "%")) %>%   
    gt()  %>%
    tab_header(
    title = "Tone Check decline rate"
      )  %>%
    opt_stylize(5) %>%
  cols_label(
    n_edits = "Number of edits shown Tone check",
    n_rejects = "Number of edits that declined Tone Check",
    dismissal_rate = "Proportion of edits where Tone Check was declined"
  ) %>%
    tab_source_note(
        gt::md('Limited to published edits where at least one Tone Check was shown')
    )


display_html(as_raw_html(tone_check_dismissal_overall ))
Tone Check decline rate
Number of edits shown Tone check Number of edits that declined Tone Check Proportion of edits where Tone Check was declined
1729 646 37.4%
Limited to published edits where at least one Tone Check was shown

Tone Check was declined in 37.4% of all new content edits where Tone Check was shown at least once during the editing session. This is lower than the rates reported for other available checks, including Paste Check (54.8% decline rate).

Code
### By decline reason

tone_check_dismissal_byreason_overall <- tone_check_reject_data %>%
    filter(is_new_content == 1 & was_tone_check_shown == 1 
           & n_rejects > 0) %>% #limit to where shown and user elected to not revise text
    group_by(reject_reason)  %>%
    summarise(n_edits_rejected = n_distinct(editing_session)) %>%
    mutate(select_rate = paste0(round(n_edits_rejected/sum(n_edits_rejected) * 100, 1), "%"))
Code
# plot bar chart of reason selection
dodge <- position_dodge(width=0.9)

p <- tone_check_dismissal_byreason_overall  %>%
    ggplot(aes(x= reject_reason, y = n_edits_rejected/sum(n_edits_rejected))) +
    geom_col(position = 'dodge', fill = 'dodgerblue4') +
    scale_y_continuous(labels = scales::percent) +
      geom_text(aes(label = paste(select_rate, "\n", n_edits_rejected,"edits"), fontface=2), vjust=1.2, size = 8, color = "white") +
    scale_fill_manual(values= cbPalette, name = "Reason")  +
    labs (y = "Percent of edits ",
           x = "Selected reason",
          title = "Reasons users selected for not revising text",
           caption = "Limited to published edits where a user elected to not revise text")  +
   theme(
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        plot.title = element_text(hjust = 0.5),
        text = element_text(size=24),
        axis.text.x = element_text(size = 18),
        axis.title.x = element_text(margin = margin(t = 20, unit = "pt")),
        legend.position= "none",
        axis.line = element_line(colour = "black")) 

      
p

Editors selected “The tone is appropriate” in over half (58.9%) of all published new content edits where the user elected to not revise their text.

By whether multiple checks were shown

Code

tone_check_dismissal_bymultiple <- tone_check_reject_data %>%
    filter(is_new_content == 1 & was_tone_check_shown == 1) %>% #limit to where shown
    group_by(multiple_checks_shown) %>%
    summarise(n_edits = n_distinct(editing_session),
              n_rejects = n_distinct(editing_session[n_rejects > 0 & was_reverted == 0])) %>% # at least one tone check declined and edit not reverted
    mutate(dismissal_rate = paste0(round(n_rejects/n_edits * 100, 1), "%")) %>%   
    gt()  %>%
    tab_header(
    title = "Tone Check decline rate by if multiple checks shown"
      )  %>%
 opt_stylize(5) %>%
  cols_label(
     multiple_checks_shown = "Multiple Checks",
    n_edits = "Number of edits shown Tone Check",
    n_rejects = "Number of edits that declined Tone Check",
    dismissal_rate = "Proportion of edits where Tone Check was declined"
  ) %>%
    tab_source_note(
        gt::md('Limited to published edits where at least one Tone Check was shown')
    )

display_html(as_raw_html(tone_check_dismissal_bymultiple ))
Tone Check decline rate by if multiple checks shown
Multiple Checks Number of edits shown Tone Check Number of edits that declined Tone Check Proportion of edits where Tone Check was declined
single check shown 512 178 34.8%
multiple checks shown 1217 292 24%
Limited to published edits where at least one Tone Check was shown

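As we also observed in the leading indicators report, the decline rate is lower when multiple checks were shown (24% compared to 34.8% when a single check was shown). Note that the declined counts in this table exclude reverted edits.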

Edits where multiple checks are shown are likely longer edits where the user may have more incentive to ensure their edit does not get reverted.
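
To gauge whether the gap between single and multiple checks is larger than chance variation, one option is a two-sample test of proportions on the counts in the table above. This is only a sketch and not part of the original analysis. It assumes editing sessions are independent, which may not hold when the same person contributes several sessions.

```r
# Counts from the "Tone Check decline rate by if multiple checks shown" table
declined <- c(single = 178, multiple = 292)
shown    <- c(single = 512, multiple = 1217)

# Two-sample test of proportions at the 90% confidence level,
# matching the confidence level used elsewhere in this report
res <- prop.test(x = declined, n = shown, conf.level = 0.90)

res$estimate   # decline rate in each group
res$conf.int   # interval for (single - multiple)
res$p.value
```

An interval that sits entirely above zero would indicate that sessions with a single check are declined at a reliably higher rate than sessions with multiple checks.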

By Platform

Code
tone_check_dismissal_byplatform <- tone_check_reject_data %>%
    filter(is_new_content == 1 & was_tone_check_shown == 1) %>% #limit to where shown
    group_by(platform) %>%
    summarise(n_edits = n_distinct(editing_session),
              n_rejects = n_distinct(editing_session[n_rejects > 0])) %>% # at least one tone check declined
    mutate(dismissal_rate = paste0(round(n_rejects/n_edits * 100, 1), "%")) %>% 
    ungroup() %>%
    #mutate(n_edits = ifelse(n_edits < 50, "<50", n_edits),
     #n_rejects = ifelse(n_rejects < 50, "<50", n_rejects))  %>% #sanitizing per data publication guidelines
    #select(-2) %>%
    gt()  %>%
    tab_header(
    title = "Tone Check decline rate by platform"
      )  %>%
    opt_stylize(5) %>%
  cols_label(
    platform = "Platform",
    n_edits = "Number of edits shown Tone check",
    n_rejects = "Number of edits that declined Tone Check",
    dismissal_rate = "Proportion of edits where Tone Check was declined"
  ) %>%
    tab_source_note(
        gt::md('Limited to published edits where at least one Tone Check was shown')
    )

display_html(as_raw_html(tone_check_dismissal_byplatform ))
Tone Check decline rate by platform

| Platform | Number of edits shown Tone check | Number of edits that declined Tone Check | Proportion of edits where Tone Check was declined |
|---|---|---|---|
| mobile web | 380 | 151 | 39.7% |
| desktop | 1349 | 496 | 36.8% |

Limited to published edits where at least one Tone Check was shown

Tone checks are declined only slightly more frequently on mobile compared to desktop. 36.8% of all published desktop edits where Tone Check was shown include at least one check that was declined compared to 39.7% of all published mobile edits.

This suggests that the lower impact Tone Check has on mobile web edit quality is not due to users explicitly rejecting the check.
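
To gauge how much uncertainty surrounds each platform's decline rate, a confidence interval can be computed from the counts in the table above. This is a rough sketch using base R's one-sample `prop.test`; the object names are illustrative and the intervals are not part of the original report.

```r
# Counts copied from the "Tone Check decline rate by platform" table
declined <- c(mobile_web = 151, desktop = 496)
shown    <- c(mobile_web = 380, desktop = 1349)

# 95% confidence interval for each platform's decline rate
platform_ci <- t(sapply(names(shown), function(p) {
  ci <- prop.test(declined[[p]], shown[[p]])$conf.int
  c(rate = declined[[p]] / shown[[p]], lower = ci[1], upper = ci[2])
}))

# One row per platform, values in percent
round(platform_ci * 100, 1)
```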

Code
### Decline reason by platform

tone_check_dismissal_byreason_byplatform <- tone_check_reject_data %>%
    filter(is_new_content == 1 & was_tone_check_shown == 1 & 
           n_rejects > 0 ) %>% #limit to where shown and user did not revise text
    group_by(platform, reject_reason)  %>%
    summarise(n_edits_rejected = n_distinct(editing_session)) %>%
    mutate(select_rate = round(n_edits_rejected/sum(n_edits_rejected), 2))
Code
# plot bar chart of reason selection
dodge <- position_dodge(width=0.9)
# slightly larger chart needed here
options(repr.plot.width = 18, repr.plot.height = 10)

p <- tone_check_dismissal_byreason_byplatform  %>%
    ggplot(aes(x= reject_reason, y =select_rate, fill = reject_reason)) +
    geom_col(position = 'dodge') +
    scale_y_continuous(labels = scales::percent) +
      geom_text(aes(label = paste0(select_rate * 100, "%"), fontface=2), vjust=1.2, size = 10, color = "white") +
    facet_grid(~ platform ) +
    labs (y = "Percent of edits ",
           x = "Selected reason",
          title = "Reasons users selected for not revising text")  +
      scale_fill_manual(values= cbPalette, name = "Reason")  +
   theme(
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        plot.title = element_text(hjust = 0.5),
        text = element_text(size=24),
        legend.position= "bottom",
        legend.text=element_text(size=18),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.line = element_line(colour = "black")) 

      
p

On both mobile web and desktop, “the tone is appropriate” is the most frequently selected reason for electing not to revise text. The other decline options, including “None applies” and “Not sure how to revise tone”, see similar rates of selection on both platforms.

By User Experience

Code
tone_check_dismissal_byuserexp <- tone_check_reject_data %>%
    filter(is_new_content ==1 & was_tone_check_shown == 1) %>% #limit to where shown
    group_by(experience_level_group) %>%
    summarise(n_edits = n_distinct(editing_session),
              n_rejects = n_distinct(editing_session[n_rejects > 0])) %>% # sessions where at least one Tone Check was declined
    mutate(dismissal_rate = paste0(round(n_rejects/n_edits * 100, 1), "%")) %>%   
    ungroup() %>%
    mutate(n_edits = ifelse(n_edits < 50, "<50", n_edits),
     n_rejects = ifelse(n_rejects < 50, "<50", n_rejects))  %>% #sanitizing per data publication guidelines
    #select(-2) %>%
    gt()  %>%
    tab_header(
    title = "Tone Check decline rate by user experience"
      )  %>%
    opt_stylize(5) %>%
  cols_label(
    experience_level_group = "User Experience",
    n_edits = "Number of edits shown Tone check",
    n_rejects = "Number of edits that declined Tone Check",
    dismissal_rate = "Proportion of edits where Tone Check was declined"
  )%>%
    tab_source_note(
        gt::md('Limited to published edits where at least one Tone Check was shown')
    )

display_html(as_raw_html(tone_check_dismissal_byuserexp ))
Tone Check decline rate by user experience

| User Experience | Number of edits shown Tone check | Number of edits that declined Tone Check | Proportion of edits where Tone Check was declined |
|---|---|---|---|
| Unregistered | 291 | 141 | 48.5% |
| Newcomer | 406 | 141 | 34.7% |
| Junior Contributor | 1032 | 366 | 35.5% |

Limited to published edits where at least one Tone Check was shown

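Unregistered users are more likely to decline a Tone Check than registered users. Unregistered users declined at least one Tone Check in 48.5% of their published new content edits where Tone Check was shown, compared to 34.7% for Newcomers and 35.5% for Junior Contributors.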
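
The decline rate counts an edit once if at least one Tone Check within it was declined, regardless of how many checks were declined. The sketch below illustrates that rule in base R on a small hypothetical data frame; the column names mirror those in the report's code, but the rows are invented for illustration.

```r
# Hypothetical toy data: one row per editing session where Tone Check was shown
toy_sessions <- data.frame(
  editing_session        = c("a", "b", "c", "d", "e"),
  experience_level_group = c("Unregistered", "Unregistered",
                             "Newcomer", "Newcomer", "Newcomer"),
  n_rejects              = c(2, 0, 0, 1, 0)  # Tone Checks declined in the session
)

# A session counts as declined if at least one check was declined
toy_sessions$declined <- toy_sessions$n_rejects > 0

# Share of sessions with at least one decline, per experience group
decline_rate <- tapply(toy_sessions$declined,
                       toy_sessions$experience_level_group, mean)
round(decline_rate * 100, 1)
```

Session "a" declined two checks but contributes a single declined edit to the numerator, just like session "d".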

Code
### Dismissal reason by user experience

tone_check_dismissal_byreason_byuserexp <- tone_check_reject_data %>%
    filter(is_new_content ==1 & was_tone_check_shown == 1 
           & n_rejects > 0) %>% #limit to where shown and user did not revise their text
    group_by(experience_level_group, reject_reason)  %>%
    summarise(n_edits_rejected = n_distinct(editing_session)) %>%
    mutate(select_rate = round(n_edits_rejected/sum(n_edits_rejected),2)) 
Code
# plot bar chart of reason selection
dodge <- position_dodge(width=0.9)
# slightly larger chart needed here
options(repr.plot.width = 18, repr.plot.height = 10)

p <- tone_check_dismissal_byreason_byuserexp %>%
    ggplot(aes(x= reject_reason, y = select_rate, fill = reject_reason)) +
    geom_col(position = 'dodge') +
    scale_y_continuous(labels = scales::percent) +
      geom_text(aes(label = paste0(select_rate * 100, "%"), fontface=2), vjust=1.2, size = 10, color = "white") +
    facet_grid( ~ experience_level_group) +
    labs (y = "Percent of edits ",
           x = "Selected reason",
          title = "Reasons users selected for not revising their text")  +
    scale_fill_manual(values= cbPalette, name = "Reason")  +
    theme(
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        plot.title = element_text(hjust = 0.5),
        text = element_text(size=24),
        legend.position= "bottom",
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.line = element_line(colour = "black")) 
      
p

There were no significant differences in the distribution of decline reasons across the three reviewed user groups.

A higher proportion of unregistered editors (62%) selected “The Tone is appropriate” compared to registered users (~57%). Unregistered editors are also slightly more likely to select “I am not sure how to revise tone” and less likely to select “No reason provided” compared to registered editors.
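
A chi-squared test of independence is one way such a comparison of reason distributions could be checked. The sketch below uses hypothetical counts: the report publishes only percentages for the selected reasons, so the split across reasons is a placeholder, and the reason labels are shortened for illustration.

```r
# Hypothetical counts of declined edits by selected reason (rows) and user group (columns).
# These numbers are placeholders, not published results.
reason_counts <- matrix(
  c(87, 80, 208,    # tone is appropriate
    20, 18,  50,    # not sure how to revise tone
    34, 43, 108),   # other or no reason
  nrow = 3, byrow = TRUE,
  dimnames = list(
    reason = c("tone_appropriate", "not_sure_how", "other"),
    group  = c("Unregistered", "Newcomer", "Junior Contributor")
  )
)

# Within-group share of each reason (each column sums to 1), mirroring `select_rate`
reason_share <- prop.table(reason_counts, margin = 2)
round(reason_share, 2)

# Test whether the reason distribution differs across user groups
reason_test <- chisq.test(reason_counts)
reason_test$p.value
```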

By partner Wikipedia

Code
tone_check_dismissal_bywiki <- tone_check_reject_data %>%
    filter(is_new_content == 1 & was_tone_check_shown == 1) %>% #limit to where shown
    group_by(wiki) %>%
    summarise(n_edits = n_distinct(editing_session),
              # note: unlike the tables above, declines are counted only for edits that were not reverted
              n_rejects = n_distinct(editing_session[n_rejects > 0 & was_reverted == 0])) %>%
    mutate(dismissal_rate = paste0(round(n_rejects/n_edits * 100, 1), "%")) %>% 
    #filter(n_edits > 50) %>%  # limit to wikis with over 50 edits.
   ungroup() %>%
    mutate(n_edits = ifelse(n_edits < 50, "<50", n_edits),
    n_rejects = ifelse(n_rejects < 50, "<50", n_rejects))  %>% #sanitizing per data publication guidelines
    select(-2) %>%
    gt()  %>%
    tab_header(
    title = "Tone Check decline rate by partner Wikipedia"
      )  %>%
    opt_stylize(5) %>%
  cols_label(
    wiki = "Wikipedia",
    #n_edits = "Number of edits shown Tone check",
    n_rejects = "Number of edits that declined Tone Check",
    dismissal_rate = "Proportion of edits where Tone Check was declined"
  ) %>%
    tab_source_note(
        gt::md('Limited to published edits where at least one Tone Check was shown')
    )

display_html(as_raw_html(tone_check_dismissal_bywiki ))
Tone Check decline rate by partner Wikipedia

| Wikipedia | Number of edits that declined Tone Check | Proportion of edits where Tone Check was declined |
|---|---|---|
| French Wikipedia | 335 | 27.3% |
| Japanese Wikipedia | 61 | 27.9% |
| Portuguese Wikipedia | 75 | 26.5% |

Limited to published edits where at least one Tone Check was shown
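Decline rates are very similar across all three partner Wikipedias. Because this table counts only declined edits that were not reverted (see the `was_reverted == 0` condition in the code above), its rates are lower than, and not directly comparable to, those in the preceding decline rate tables.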

Distinct users that publish a reverted edit

Hypothesis: Newcomers and Junior Contributors will be more aware of the need to write in a neutral tone when contributing new text because the visual editor will prompt them to do so in cases where they have written text that contains non-neutral language.

Methodology: The proportion of newcomers and Junior Contributors shown or eligible to be shown Tone Check that publish at least one new content edit that was reverted.

This metric is similar to the revert rate analysis except that it looks at the proportion of distinct editors rather than distinct edits. The results do not differ significantly from those reported in the Primary Metric 1: Revert rate section, as the majority of newcomers and Junior Contributors published just one new content edit during the reviewed time period. See details below.
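
The difference between the edit-level and user-level metrics can be illustrated with a small hypothetical example in base R. The column names follow the report's code, while the rows are invented.

```r
# Hypothetical toy data: one row per published new content edit
toy_edits <- data.frame(
  user_id      = c(1, 1, 1, 2, 3),
  was_reverted = c(1, 0, 0, 0, 1)
)

# Edit-level revert rate: reverted edits / all edits
edit_revert_rate <- mean(toy_edits$was_reverted == 1)

# User-level revert rate: users with at least one reverted edit / all users
users            <- unique(toy_edits$user_id)
reverted_users   <- unique(toy_edits$user_id[toy_edits$was_reverted == 1])
user_revert_rate <- length(reverted_users) / length(users)

c(edit_level = edit_revert_rate, user_level = user_revert_rate)
```

The two rates diverge here only because user 1 published three edits. When every user publishes exactly one edit, the two metrics coincide.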

Overall

Code
tone_check_reverts_byuser_overall <- tone_check_publish_data |>
    filter(is_new_content == 1 & is_test_eligible == 'eligible' ) |> #limit to eligible edits
    group_by(test_group) |>
    summarise(n_users = n_distinct(user_id),
              n_users_revert = n_distinct(user_id[was_reverted == 1])) |> #reverted within 48 hours
    mutate(revert_rate = paste0(round(n_users_revert/n_users * 100, 1), "%")) 
Code
# plot visualization of overall users reverted
dodge <- position_dodge(width=0.9) 

p <- tone_check_reverts_byuser_overall  |>
    ggplot(aes(x= test_group, y = n_users_revert/n_users, fill = test_group)) +
    geom_col(position = 'dodge') +
    scale_y_continuous(labels = scales::percent) +
    geom_text(aes(label = paste(revert_rate, "\n", n_users_revert,"users reverted"), fontface=2), vjust=1.2, size = 10, color = "white") +
    scale_fill_manual(values= c("#999999", "dodgerblue4"), name = "Experiment Group")  +
    labs (y = "Percent of distinct users reverted ",
           x = "Experiment Group",
          title = "Proportion of users with at least one reverted edit",
           caption = "Limited to published new content edits shown or eligible to be shown Tone Check")  +
    theme(
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        plot.title = element_text(hjust = 0.5),
        text = element_text(size=24),
        axis.text.x = element_text(size = 24),
        axis.title.x = element_text(margin = margin(t = 20, unit = "pt")),
        legend.position= "none",
        axis.line = element_line(colour = "black")) 

      
p

By If multiple checks were shown

Code
tone_check_revert_byuser_bymultiple <- tone_check_publish_data |>
    filter(is_new_content == 1 & is_test_eligible == 'eligible' 
          & test_group == 'test (tone check shown)' 
         & multiple_checks_shown != "no tone checks") |>
    group_by( multiple_checks_shown) %>%
    summarise(n_users = n_distinct(user_id),
              n_revert_users = n_distinct(user_id[was_reverted == 1])) |> # users with at least one reverted edit
    mutate(revert_rate = paste0(round(n_revert_users/n_users * 100, 1), "%")) |>   
    select(-c(2,3)) |> # removing granular data columns for publication
    gt()  |>
    tab_header(
    title = "Users with at least one reverted edit by if multiple checks were shown"
      )  |>
    opt_stylize(5) |>
  cols_label(
    multiple_checks_shown = "Multiple Check",
    #n_users = "Number of users",
    #n_revert_users = "Number of users reverted ",
    revert_rate = "Proportion of distinct users that were reverted"
  ) |>
    tab_source_note(
        gt::md('Limited to published new content edits shown or eligible to be shown Tone Check')
    )


display_html(as_raw_html(tone_check_revert_byuser_bymultiple))

Users with at least one reverted edit by if multiple checks were shown

| Multiple Check | Proportion of distinct users that were reverted |
|---|---|
| one tone check | 26% |
| multiple tone checks | 26.6% |

Limited to published new content edits shown or eligible to be shown Tone Check

By Platform

Code
tone_check_revert_byuser_byplatform <- tone_check_publish_data |>
    filter(is_new_content == 1 & is_test_eligible == 'eligible'
         ) |>
    group_by(platform, test_group) |>
    summarise(n_users = n_distinct(user_id),
              n_revert_users = n_distinct(user_id[was_reverted == 1])) %>% # users with at least one reverted edit
    mutate(revert_rate = paste0(round(n_revert_users/n_users * 100, 1), "%")) %>%   
    select(-c(3,4)) %>% # removing granular data columns for publication
    gt()  |>
    tab_header(
    title = "Users with at least one reverted edit by platform"
      )  |>
    opt_stylize(5) %>%
  cols_label(
    test_group = "Experiment Group",
    platform = "Platform",
    #n_users = "Number of Users",
    #n_revert_users = "Number of users reverted ",
    revert_rate = "Proportion of distinct users that were reverted"
  ) |>
    tab_source_note(
        gt::md('Limited to users who published new content edits shown or eligible to be shown Tone Check')
    )


display_html(as_raw_html(tone_check_revert_byuser_byplatform))

Users with at least one reverted edit by platform

| Platform | Experiment Group | Proportion of distinct users that were reverted |
|---|---|---|
| mobile web | control (eligible but not shown tone check) | 33% |
| mobile web | test (tone check shown) | 37.4% |
| desktop | control (eligible but not shown tone check) | 24.1% |
| desktop | test (tone check shown) | 22.7% |

Limited to users who published new content edits shown or eligible to be shown Tone Check

By User Experience

Code
tone_check_revert_byuser_byuserexp <- tone_check_publish_data |>
    filter(is_new_content == 1 & is_test_eligible == 'eligible'
         ) |>
    group_by(experience_level_group, test_group) |>
    summarise(n_users = n_distinct(user_id),
              n_revert_users = n_distinct(user_id[was_reverted == 1])) %>% # users with at least one reverted edit
    mutate(revert_rate = paste0(round(n_revert_users/n_users * 100, 1), "%")) |>  
    select(-c(3,4)) %>% # removing granular data columns for publication
    gt()  |>
    tab_header(
    title = "Users with at least one reverted edit by user experience"
      )  |>
    opt_stylize(5) |>
  cols_label(
    test_group = "Test group",
    experience_level_group = "User experience",
    #n_users = "Number of users",
    #n_revert_users = "Number of users reverted ",
    revert_rate = "Proportion of distinct users that were reverted"
  ) |>
    tab_source_note(
        gt::md('Limited to users who published new content edits shown or eligible to be shown Tone Check')
    )


display_html(as_raw_html(tone_check_revert_byuser_byuserexp))

Users with at least one reverted edit by user experience

| User experience | Test group | Proportion of distinct users that were reverted |
|---|---|---|
| Unregistered | control (eligible but not shown tone check) | 34.9% |
| Unregistered | test (tone check shown) | 37.2% |
| Newcomer | control (eligible but not shown tone check) | 21.5% |
| Newcomer | test (tone check shown) | 26.4% |
| Junior Contributor | control (eligible but not shown tone check) | 25.5% |
| Junior Contributor | test (tone check shown) | 22.5% |

Limited to users who published new content edits shown or eligible to be shown Tone Check

Constructive Retention Rate (Tone Check not shown again)

We also reviewed the proportion of newcomers and Junior Contributors who published an edit in which Tone Check was activated and then returned 7 to 14 days later to make a new content edit where Tone Check was not shown.
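
The retention definition above can be sketched on a small hypothetical edit log in base R. The data, the column names, and the inclusive 7 to 14 day window boundaries are assumptions for illustration; the report itself loads pre-aggregated counts rather than computing retention from raw edits.

```r
# Hypothetical toy data: one row per published new content edit
toy_edits <- data.frame(
  user_id          = c(1, 1, 2, 2, 3),
  edit_date        = as.Date(c("2025-09-03", "2025-09-12", "2025-09-04",
                               "2025-09-06", "2025-09-05")),
  tone_check_shown = c(TRUE, FALSE, TRUE, FALSE, TRUE)
)

# TRUE if the user made an edit without Tone Check 7 to 14 days after
# their first edit where Tone Check was shown (window bounds assumed inclusive)
is_retained <- function(user_edits) {
  shown_dates <- user_edits$edit_date[user_edits$tone_check_shown]
  if (length(shown_dates) == 0) return(NA)
  days_after <- as.numeric(user_edits$edit_date - min(shown_dates))
  any(!user_edits$tone_check_shown & days_after >= 7 & days_after <= 14)
}

retained       <- sapply(split(toy_edits, toy_edits$user_id), is_retained)
retention_rate <- mean(retained, na.rm = TRUE)
retention_rate
```

Only user 1 counts as retained: user 2 returned after 2 days, which falls outside the window, and user 3 never returned.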

Code
# load constructive retention rate (first dataset)
retention_rate_norepeat_check_1 <-
  read.csv(
    file = 'data/retention_notone_data.tsv',  
    header = TRUE,
    sep = "\t",
    stringsAsFactors = FALSE
  ) 
Code
# load constructive retention rate (second dataset)
retention_rate_norepeat_check_2 <-
  read.csv(
    file = 'data/retention_notone_data_pt2.tsv',
    header = TRUE,
    sep = "\t",
    stringsAsFactors = FALSE  # match the first dataset so both load text columns as character
  ) 
Code
# Combine the two datasets

retention_rate_norepeat_check <- rbind(retention_rate_norepeat_check_1, retention_rate_norepeat_check_2)
Code
# Cleaning up dataset and renaming fields to clarify meanings

# Set experience level group and factor levels
retention_rate_norepeat_check <- retention_rate_norepeat_check %>%
  mutate(
    experience_level_group = case_when(
      user_edit_count == 0 & user_status == 'registered' ~ 'Newcomer',
      user_edit_count == 0 & user_status == 'unregistered' ~ 'Unregistered',
      user_edit_count > 0 & user_edit_count <= 100 ~ "Junior Contributor",
      user_edit_count > 100 ~ "Non-Junior Contributor"
    ),
    # order levels from least to most experienced
    experience_level_group = factor(experience_level_group,
         levels = c("Unregistered", "Newcomer", "Junior Contributor", "Non-Junior Contributor")
   ))  

#rename experiment field to clarify
retention_rate_norepeat_check <- retention_rate_norepeat_check %>%
  mutate(test_group = factor(test_group,
         levels = c('2025-09-editcheck-tone-control', '2025-09-editcheck-tone-test'),
         labels = c("control (eligible but not shown tone check)", "test (tone check shown)")))


#rename platform from phone to mobile web to clarify meaning
retention_rate_norepeat_check <- retention_rate_norepeat_check %>%
  mutate(platform = factor(platform,
         levels = c('phone', 'desktop'),
         labels = c("mobile web", "desktop")))

Overall

Code
retention_rate_norepeat_overall <- retention_rate_norepeat_check %>%
    group_by(test_group)  %>%
    summarise(return_editors = sum(return_editors),
              editors = sum(editors),
        retention_rate = paste0(round(return_editors/editors * 100, 1), "%"))
Code
retention_rate_norepeat_overall_table <- retention_rate_norepeat_overall  %>%
select(-c(2,3)) |> # removing granular data columns for publication
  gt()  %>%
  tab_header(
    title = "Constructive second week retention rate (tone check not shown again)"
  )  %>%
  cols_label(
    test_group = "Experiment group",
    #return_editors = "Number of editors that returned second week",
    #editors = "Number of first week editors",
    retention_rate = "Retention rate"
  ) %>%
  opt_stylize(5) %>%
  tab_footnote(
    footnote = "Limited to users shown or eligible to be shown at least one Tone Check",
    locations = cells_column_labels(
      columns = 'retention_rate'
    )
  ) 

display_html(as_raw_html(retention_rate_norepeat_overall_table))
Constructive second week retention rate (tone check not shown again)

| Experiment group | Retention rate¹ |
|---|---|
| control (eligible but not shown tone check) | 1.6% |
| test (tone check shown) | 1.7% |

¹ Limited to users shown or eligible to be shown at least one Tone Check

Fewer than 2% of editors in both the control and test groups returned in their second week to make another new content edit where Tone Check was not shown or not eligible to be shown. While we see a slight increase in the retention rate for editors shown Tone Check (1.7% compared to 1.6%), we are unable to confirm statistical significance due to the small effect and sample size.
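
To illustrate why a difference this small is hard to confirm, the sketch below estimates how many editors per group a two-proportion comparison would need to detect it. It uses base R's `power.prop.test` with the two rates from the table above; the 80% power and 5% significance level are conventional assumptions, not values taken from the report.

```r
# Retention rates from the overall table above (control 1.6%, test 1.7%)
p_control <- 0.016
p_test    <- 0.017

# Editors needed per group to detect this difference
# (80% power and a 5% two-sided significance level are assumed thresholds)
required <- power.prop.test(p1 = p_control, p2 = p_test,
                            power = 0.80, sig.level = 0.05)
ceiling(required$n)
```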

By Platform

Code
retention_rate_norepeat_byplatform <- retention_rate_norepeat_check %>%
    group_by(platform, test_group)  %>%
    summarise(return_editors = sum(return_editors),
              editors = sum(editors),
        retention_rate = paste0(round(return_editors/editors * 100, 1), "%"))
Code
retention_rate_norepeat_byplatform_table <- retention_rate_norepeat_byplatform  %>%
  select(-c(3,4)) %>% # removing granular data columns for publication
  gt()  %>%
  tab_header(
    title = "Constructive second week retention rate (tone check not shown again) by platform"
  )  %>%
  cols_label(
    test_group = "Experiment group",
    platform = "Platform",
    #return_editors = "Number of editors that returned second week",
    #editors = "Number of first week editors",
    retention_rate = "Retention rate"
  ) %>%
  opt_stylize(5) %>%
  tab_footnote(
    footnote = "Limited to users shown or eligible to be shown at least one Tone Check",
    locations = cells_column_labels(
      columns = 'retention_rate'
    )
  ) 

display_html(as_raw_html(retention_rate_norepeat_byplatform_table))
Constructive second week retention rate (tone check not shown again) by platform

| Platform | Experiment group | Retention rate¹ |
|---|---|---|
| mobile web | control (eligible but not shown tone check) | 0.5% |
| mobile web | test (tone check shown) | 1.7% |
| desktop | control (eligible but not shown tone check) | 2.1% |
| desktop | test (tone check shown) | 1.7% |

¹ Limited to users shown or eligible to be shown at least one Tone Check

By User Experience

Code
retention_rate_norepeat_byuserexp <- retention_rate_norepeat_check %>%
    group_by(experience_level_group, test_group)  %>%
    summarise(return_editors = sum(return_editors),
              editors = sum(editors),
        retention_rate = paste0(round(return_editors/editors * 100, 1), "%"))
Code
retention_rate_norepeat_byuserexp_table <- retention_rate_norepeat_byuserexp  %>%
  select(-c(3,4)) %>% # removing granular data columns for publication
  gt()  %>%
  tab_header(
    title = "Constructive second week retention rate (tone check not shown again) by user experience"
  )  %>%
  cols_label(
    test_group = "Experiment group",
    experience_level_group = "Experience level group",
    #return_editors = "Number of editors that returned second week",
    #editors = "Number of first week editors",
    retention_rate = "Retention rate"
  ) %>%
  opt_stylize(5) %>%
  tab_footnote(
    footnote = "Limited to users shown or eligible to be shown at least one Tone Check",
    locations = cells_column_labels(
      columns = 'retention_rate'
    )
  ) 

display_html(as_raw_html(retention_rate_norepeat_byuserexp_table))
Constructive second week retention rate (tone check not shown again) by user experience

| Experience level group | Experiment group | Retention rate¹ |
|---|---|---|
| Unregistered | control (eligible but not shown tone check) | 0% |
| Unregistered | test (tone check shown) | 0.5% |
| Newcomer | control (eligible but not shown tone check) | 1.1% |
| Newcomer | test (tone check shown) | 2.3% |
| Junior Contributor | control (eligible but not shown tone check) | 2.7% |
| Junior Contributor | test (tone check shown) | 2% |

¹ Limited to users shown or eligible to be shown at least one Tone Check