The Wikimedia Foundation’s Editing team is working on a set of improvements for the visual editor to help new volunteers understand and follow some of the policies necessary to make constructive changes to Wikipedia projects.
In this AB test, we are evaluating the impact of Tone Check. Tone Check is an Edit Check that uses a language model to prompt people adding promotional, derogatory, or otherwise subjective language to consider “neutralizing” the tone of what they are writing. Tone Check is the first Edit Check that uses machine learning: in this case, a BERT language model initially selected and fine-tuned by the Research team to identify biased language within the new text people are attempting to publish to Wikipedia.
This A/B test will help us make the following decision:
What – if any – changes in the Tone Check UX, and/or the model that enables it, will we make before we can be confident in the following?
Newcomers and Junior Contributors that encounter Tone Check are more likely to publish new content edits in the main namespace that are devoid of biased language.
Newcomers and Junior Contributors will intuitively interact with the Tone Check experience in ways that are NOT disruptive to them or the wikis.
This work is guided by the Wikimedia Foundation Annual Plan, specifically by the Wiki Experiences 1.1 objective key result: Increase the rate at which editors with ≤100 cumulative edits publish constructive edits on mobile web by 4%, as measured by controlled experiments (by the end of Q2).
You can find more details about this check on the Project Page.
The Tone Check A/B test was deployed on 3 September 2025 to French, Japanese, and Portuguese Wikipedias.
Methodology
AB Test Design
The team ran an AB test from 3 September 2025 through 28 January 2026 to determine the impact of presenting Tone Check to eligible editing sessions and evaluate the extent to which the feature, in its current form, warrants being deployed to all wikis.
Specifically, we want to test the following hypothesis:
If we prompt newcomers and Junior Contributors to reconsider the tone they are writing in when software detects them using what experienced volunteers would agree is non-neutral/peacock language, then we will decrease the percentage of new content edits newcomers publish that are reverted on the grounds of WP:NPOV (and related policies).
During this experiment, 50% of users editing a desktop or mobile main namespace page using Visual Editor were randomly assigned to the test group and could be shown Tone Check if their edit met the specified requirements, and 50% were randomly assigned to the control group and could not be shown Tone Check.
The test included all mobile web and desktop contributors (both registered and unregistered) to the 3 participating wikis that started an edit with Visual Editor. Users remained in the same test group for the duration of the test. We limited the analysis to edits completed by unregistered users and users with 100 or fewer edits as those are the users that would be shown Tone Check under the default config settings.
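The sticky 50/50 assignment described above can be illustrated with a small sketch. This is not the experiment platform's actual bucketing logic, which this report does not describe; the user IDs and the `group_lookup` helper are hypothetical and only show how assigning each user once keeps them in the same group across sessions.

```r
# Hypothetical sketch of sticky 50/50 bucketing; not the experiment platform's real logic.
set.seed(5)

# One entry per editing session; a user can appear in several sessions (made-up IDs).
session_users <- c("u1", "u2", "u3", "u1", "u4", "u2", "u5", "u1")

# Assign each distinct user to a group exactly once ...
unique_users <- unique(session_users)
group_lookup <- setNames(
  sample(c("control", "test"), size = length(unique_users), replace = TRUE),
  unique_users
)

# ... then look the group up for every session, so a returning user keeps the same group.
session_groups <- unname(group_lookup[session_users])

data.frame(user = session_users, test_group = session_groups)
```

Keying the assignment on the user rather than the session is what lets the later per-user analysis treat each user as belonging to a single experiment group.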
Figure 1: Tone Check AB Test Bucketing Overview
As shown in Figure 1, not all edits bucketed in the AB test experiment met the requirements for being shown Tone Check. Tone Check was shown in about 11% of all published new content edits in the test group (989 edits). It was shown at similar rates on both desktop and mobile web.
In this analysis, we compared all new content edits that were shown Tone Check to edits that were eligible but not shown Tone Check in the control group (based on instrumentation added in this task). This comparison was done to ensure the analysis is focused on the actual effects of the feature.
Evaluation Plan
We used a set of primary and secondary metrics to evaluate the impact of this feature. We also reviewed a set of guardrails to ensure that Tone Check was not disruptive to the contributor or to the Wikipedias. These metrics are documented in the task.
For each metric, we reviewed the following dimensions: overall by experiment group (test and control), by platform (mobile web or desktop), by user experience and status, and by partner Wikipedia. We also reviewed some indicators such as edit completion rate by the number of checks shown within a single editing session to determine if there was a significant impact at a certain number of checks presented.
Note: For the user experience analysis, we split newer editors into three experience level groups: (1) unregistered, (2) newcomer (registered user making their first edit on Wikipedia), and (3) Junior Contributor (user that has made between 1 and 100 edits).
Please refer to the data collection notebook for more details on the steps to collect the data reviewed in this report.
Summary of Results
New content edits published without biased language
Tone Check successfully decreases the frequency of non-neutral language in published content. Users with access to Tone Check were 15.6% less likely to publish edits containing non-neutral language (falling from 9.6% to 8.1%; a -1.5 pp decrease) compared to the control group. We have 99.8% confidence that this improvement is directly attributable to the tool.
However, Tone Check’s level of impact depends heavily on the platform. Results confirm a highly significant impact on Desktop, where we observed the largest reduction in non-neutral language. In contrast, there was no detectable effect yet on Mobile Web.
New content edits revert rate
Edits made by users shown Tone Check are also 15% less likely to be reverted than eligible control edits (29.5% → 25.1%; a -4.4 pp decrease).
This reduction is primarily driven by Junior Contributors. While we observed a statistically significant -33% relative [-10.2 pp] decrease in reverts for Junior Contributors, we did not confirm any change in the revert rate of newcomers or unregistered users. These trends indicate that Tone Check may be more effective for people who have already succeeded in completing at least one edit on Wikipedia. Since these users are more experienced, their edits are less likely to be reverted for other policy violations compared to registered users completing their first edit or unregistered users.
New Content edit revert rate: impact of removing non-neutral language
When a user removes non-neutral language in response to a Tone Check, the likelihood of that edit being reverted decreases significantly. Across both platforms, there was a -44.1% decrease in the revert rate for edits where the prompt was addressed. This confirms that Tone Check is highly effective at helping people identify and correct edits that would otherwise be reverted.
We observed decreases on both platforms, but there is a larger impact on desktop compared to mobile web. On desktop, we observed a significant -47% decrease [-13.4 pp] in revert rate for people who revised their text in response to Tone Check.
On mobile web, there was a -14.8% [-4.8 pp] decrease in revert rate for edits where non-neutral language was removed. Mobile web edits appear to be inherently trickier for newcomers and are still more likely to be reverted compared to desktop edits, even when non-neutral language is removed.
Edit Completion Rate
Tone Check does not appear to be causing any significant disruption to most people’s editing experience. Edit completion rates for people shown Tone Check decreased only slightly, by -3.2% (-1.6 percentage points). This decrease was primarily concentrated on Desktop (-2.6%), with no significant change on Mobile Web.
The decrease in completion rate does not exceed 10% until more than 10 tone checks are presented in a single editing session. For these edits, edit completion rate decreased to 44.3% (a -12% decrease from the control). These edits represent only 3% of edits and are potentially low-quality edits that we’d want to deter.
While completion rates slightly decreased for newcomers and unregistered users, they slightly increased for Junior Contributors, suggesting the check is encouraging and helps a portion of people complete their edit successfully.
Constructive Edit Rate
Tone Check improved the rate of constructive edits by +6.2% [+4.4 percentage points]. We observed improvements in overall edit quality at each of the three partner Wikipedias.
Aligned with the revert rate findings, the magnitude of impact varies by platform. On desktop, constructive edit rate increased by +6.4% while we observed no statistically significant change in mobile web constructive edits.
Tone Check appears especially effective at increasing the constructive edit rate of registered Junior Contributors, where we observed a +14.8% increase [+10.2 pp] in constructive edit rates. When limited to desktop edits, there was a +19.7% increase in constructive edits by Junior Contributors.
Retention Rate
We further found that people shown Tone Check were more likely to return, indicating that the feature results in a positive editing experience for most contributors.
People who encountered Tone Check are 24% more likely to return again to make a constructive edit in their second week. Retention rates increased from 5.8% to 7.2% when Tone Check was shown (+1.4 percentage points).
We observed increases for both mobile web and desktop users and across all user types as well.
Guardrails
Tone Check is not causing significant disruption on either desktop or mobile web based on analysis of identified guardrails. The decline rate is lower than other existing Edit Checks, and there was no spike in user blocks or revert rates.
Code
# load packages
shhh <- function(expr) suppressPackageStartupMessages(suppressWarnings(suppressMessages(expr)))
shhh({
  library(lubridate)
  library(ggplot2)
  library(dplyr)
  library(gt)
  library(IRdisplay)
  library(tidyr)
  # Modeling completed using the relax package developed by Mikhail Popov (WMF)
  library(relax) # https://gitlab.wikimedia.org/repos/product-analytics/experimentation-lab/relax
  set.seed(5)
})

# set preferences
options(dplyr.summarise.inform = FALSE)
options(repr.plot.width = 15, repr.plot.height = 10)

# colorblind-friendly palette:
cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
Data Cleaning
Code
# load tone check save data (initial dataset)
tone_check_publish_data_1 <- read.csv(
  file = 'data/tone_check_save_data_AB.tsv',
  header = TRUE,
  sep = "\t",
  stringsAsFactors = TRUE
)

# load tone check save data (second dataset)
# The second dataset was created to obtain updated event data while preserving the initial
# aggregated dataset, which could no longer be queried in the Data Lake due to data retention policies.
tone_check_publish_data_2 <- read.csv(
  file = 'data/tone_check_save_data_AB_pt2.tsv',
  header = TRUE,
  sep = "\t",
  stringsAsFactors = TRUE
)

# Combine the two datasets
tone_check_publish_data <- rbind(tone_check_publish_data_1, tone_check_publish_data_2)
Code
# Clean up dataset and rename fields to clarify meanings

# Set experience level group and factor levels
tone_check_publish_data <- tone_check_publish_data |>
  mutate(
    experience_level_group = case_when(
      user_edit_count == 0 & user_status == 'registered' ~ 'Newcomer',
      user_edit_count == 0 & user_status == 'unregistered' ~ 'Unregistered',
      user_edit_count > 0 & user_edit_count <= 100 ~ "Junior Contributor",
      # these users should already be filtered out of the dataset; included to confirm
      user_edit_count > 100 ~ "Non-Junior Contributor"
    ),
    experience_level_group = factor(
      experience_level_group,
      levels = c("Unregistered", "Newcomer", "Non-Junior Contributor", "Junior Contributor")
    )
  )

# rename test group field to clarify groups
tone_check_publish_data <- tone_check_publish_data |>
  mutate(test_group = factor(
    test_group,
    levels = c('2025-09-editcheck-tone-control', '2025-09-editcheck-tone-test'),
    labels = c("control (eligible but not shown tone check)", "test (tone check shown)")
  ))

# rename platform from phone to mobile web to clarify meaning
tone_check_publish_data <- tone_check_publish_data |>
  mutate(platform = factor(
    platform,
    levels = c('phone', 'desktop'),
    labels = c("mobile web", "desktop")
  ))

# rename wiki values to human-readable form
wiki_name_lookup <- c(
  "jawiki" = "Japanese Wikipedia",
  "ptwiki" = "Portuguese Wikipedia",
  "frwiki" = "French Wikipedia"
)
tone_check_publish_data <- tone_check_publish_data %>%
  mutate(wiki = recode(wiki, !!!wiki_name_lookup))
Code
# Set fields and factor levels to assess number of checks shown
tone_check_publish_data <- tone_check_publish_data |>
  mutate(
    multiple_checks_shown = case_when(
      test_group == "test (tone check shown)" & n_checks_shown == 1 ~ "one tone check",
      test_group == "test (tone check shown)" & n_checks_shown > 1 ~ "multiple tone checks",
      TRUE ~ "no tone checks" # default if no conditions met
    ),
    multiple_checks_shown = factor(
      multiple_checks_shown,
      levels = c('no tone checks', 'one tone check', 'multiple tone checks')
    )
  )

# note these buckets can be adjusted as needed based on distribution of data
tone_check_publish_data <- tone_check_publish_data |>
  mutate(
    checks_shown_bucket = case_when(
      test_group == "test (tone check shown)" & is.na(n_checks_shown) ~ '0',
      test_group == "test (tone check shown)" & n_checks_shown == 1 ~ '1',
      test_group == "test (tone check shown)" & n_checks_shown == 2 ~ '2',
      test_group == "test (tone check shown)" & n_checks_shown > 2 & n_checks_shown <= 5 ~ "3-5",
      test_group == "test (tone check shown)" & n_checks_shown > 5 & n_checks_shown <= 10 ~ "6-10",
      test_group == "test (tone check shown)" & n_checks_shown > 10 ~ "over 10"
    ),
    checks_shown_bucket = factor(
      checks_shown_bucket,
      levels = c("0", "1", "2", "3-5", "6-10", "over 10")
    )
  )

# define set of all eligible edits to review (eligible in control and shown tone check in test)
# Note there are 5 edits in the control group that were identified as eligible in VEFU instrumentation
# but did not have the eligible tag applied
tone_check_publish_data <- tone_check_publish_data |>
  mutate(
    is_test_eligible = ifelse(
      (test_group == 'test (tone check shown)' & was_tone_check_shown_tag == 1) |
        (test_group == 'control (eligible but not shown tone check)' & is_tone_check_eligible == 1),
      'eligible', 'not eligible'
    ),
    is_test_eligible = factor(
      is_test_eligible,
      levels = c("eligible", "not eligible")
    )
  )

# use tone check eligible tag to define test edits where tone check was addressed (is_tone_check_eligible == 0)
tone_check_publish_data <- tone_check_publish_data |>
  mutate(
    is_tone_check_addressed = case_when(
      test_group == 'control (eligible but not shown tone check)' & is_tone_check_eligible == 1 ~ 'Eligible control edits',
      test_group == 'test (tone check shown)' & was_tone_check_shown_tag == 1 & is_tone_check_eligible == 0 ~ 'Tone check shown and addressed',
      TRUE ~ "Tone check shown but not addressed"
    ),
    is_tone_check_addressed = factor(
      is_tone_check_addressed,
      levels = c('Eligible control edits', 'Tone check shown but not addressed', 'Tone check shown and addressed')
    )
  )

# We also removed all edits that were published before the model returned an evaluation.
# These events would not have the `editcheck-tone` tag applied to indicate if the published edit
# includes promotional language.
# This was done using events added in [T388716](https://phabricator.wikimedia.org/T388716#10872915).
tone_check_publish_data <- tone_check_publish_data |>
  filter(was_saved_before_check == 0)
New content edits published without biased language (Primary Metric)
Hypothesis: The quality of new content edits newcomers and Junior Contributors make in the main namespace will increase because a greater percentage of these edits will not contain non-neutral language.
Methodology: As part of this hypothesis, we first evaluated if Tone Check reduces the frequency of non-neutral language in published edits.
We reviewed the proportion of all new content edits published without biased language (identified by the editcheck-tone tag, created in T388716 to identify when the model detected non-neutral language at the time of publishing).
Overall
Code
tone_issue_edits_overall <- tone_check_publish_data |>
  filter(is_new_content == 1) |> # limit to new content edits
  group_by(test_group) |>
  summarise(
    n_edits = n_distinct(editing_session),
    n_tone_issues = n_distinct(editing_session[is_tone_check_eligible == 1]) # tone issues detected
  ) |>
  mutate(non_neutral_language_rate = paste0(round(n_tone_issues / n_edits * 100, 1), "%"))
Code
# plot visualization of non-neutral edits
dodge <- position_dodge(width = 0.9)

p <- tone_issue_edits_overall |>
  ggplot(aes(x = test_group, y = n_tone_issues / n_edits, fill = test_group)) +
  geom_col(position = 'dodge') +
  scale_y_continuous(labels = scales::percent) +
  geom_text(
    aes(
      label = paste(non_neutral_language_rate, "\n", n_tone_issues, "edits \n with non-neutral language"),
      fontface = 2
    ),
    vjust = 1.2, size = 8, color = "white"
  ) +
  scale_fill_manual(values = c("#999999", "dodgerblue4"), name = "Experiment Group") +
  # x-axis labels are renamed because this metric is not limited to edits shown tone checks
  scale_x_discrete(
    breaks = c("control (eligible but not shown tone check)", "test (tone check shown)"),
    labels = c("Control (no tone check)", "Test (tone check available)")
  ) +
  labs(
    y = "Percent of new content edits ",
    x = "Experiment Group",
    title = "New content edits with non-neutral language",
    caption = "Limited to published new content edits"
  ) +
  theme(
    panel.grid.minor = element_blank(),
    panel.background = element_blank(),
    plot.title = element_text(hjust = 0.5),
    text = element_text(size = 24),
    axis.text.x = element_text(size = 24),
    axis.title.x = element_text(margin = margin(t = 20, unit = "pt")),
    legend.position = "none",
    axis.line = element_line(colour = "black")
  )

p
Tone Check successfully decreases the prevalence of non-neutral language in published content. Across both platforms, there was a -15.6% decrease [-1.5 percentage points] in the proportion of new content edits published with non-neutral language for the test group where Tone Check was available.
Note: The rate observed for the control (9.6%) is similar to the rates we observed in an initial baseline analysis estimating the frequency of these types of edits and rates identified in the leading indicator analysis.
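Because this report quotes each change both as a relative percentage and in percentage points (pp), a short sketch of the arithmetic may help. `describe_change` is a hypothetical helper written for this illustration, not part of the analysis code; the input rates are the overall control and test rates reported above.

```r
# Hypothetical helper: express a change between two rates both ways.
describe_change <- function(control_rate, test_rate) {
  # Absolute difference, in percentage points
  pp_change <- (test_rate - control_rate) * 100
  # Difference relative to the control rate, in percent
  relative_change <- (test_rate - control_rate) / control_rate * 100
  c(pp = round(pp_change, 1), relative_pct = round(relative_change, 1))
}

# Overall rates reported above: 9.6% (control) and 8.1% (test)
overall_change <- describe_change(0.096, 0.081)
overall_change
```

For the overall rates this gives -1.5 pp and -15.6%, matching the figures quoted above. The same small absolute gap looks larger in relative terms when the baseline rate is low, which is why both numbers are reported.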
By Platform
Code
tone_issue_edits_byplatform <- tone_check_publish_data |>
  filter(is_new_content == 1) |>
  group_by(platform, test_group) |>
  summarise(
    n_edits = n_distinct(editing_session),
    n_tone_issues = n_distinct(editing_session[is_tone_check_eligible == 1]) # tone issues detected
  ) |>
  mutate(non_neutral_language = paste0(round(n_tone_issues / n_edits * 100, 1), "%")) |>
  select(-c(3, 4)) %>% # removing granular data columns
  gt() |>
  tab_header(
    title = md("New Content edits with non-neutral language by\nplatform")
  ) |>
  opt_stylize(5) |>
  cols_label(
    platform = "Platform",
    test_group = "Experiment Group",
    # n_edits = "Number of published edits",
    # n_tone_issues = "Number of edits with non-neutral language",
    non_neutral_language = "Proportion of edits with non-neutral language"
  ) |>
  tab_source_note(
    gt::md('Limited to published new content edits')
  )

display_html(as_raw_html(tone_issue_edits_byplatform))
New Content edits with non-neutral language by platform

| Platform | Experiment Group | Proportion of edits with non-neutral language |
| --- | --- | --- |
| mobile web | control (eligible but not shown tone check) | 9.2% |
| mobile web | test (tone check shown) | 9.4% |
| desktop | control (eligible but not shown tone check) | 9.7% |
| desktop | test (tone check shown) | 7.6% |

Limited to published new content edits
Trends vary by platform. On desktop, we observed a -21% decrease [-2 pp] in the proportion of edits with non-neutral language. There was no statistically significant change on mobile web.
By User Experience
Code
tone_issue_edits_byuserexp <- tone_check_publish_data |>
  filter(is_new_content == 1) |>
  group_by(experience_level_group, test_group) |>
  summarise(
    n_edits = n_distinct(editing_session),
    n_tone_issues = n_distinct(editing_session[is_tone_check_eligible == 1]) # tone check issues detected
  ) |>
  mutate(non_neutral_language = paste0(round(n_tone_issues / n_edits * 100, 1), "%")) |>
  select(-c(3, 4)) %>% # removing granular data columns
  gt() |>
  tab_header(
    title = "New content edits with non-neutral language by user experience"
  ) |>
  opt_stylize(5) |>
  cols_label(
    experience_level_group = "User Experience",
    test_group = "Experiment Group",
    # n_edits = "Number of published edits",
    # n_tone_issues = "Number of edits with non-neutral language",
    non_neutral_language = "Proportion of edits with non-neutral language"
  ) |>
  tab_source_note(
    gt::md('Limited to published new content edits')
  )

display_html(as_raw_html(tone_issue_edits_byuserexp))
New content edits with non-neutral language by user experience

| User Experience | Experiment Group | Proportion of edits with non-neutral language |
| --- | --- | --- |
| Unregistered | control (eligible but not shown tone check) | 13.2% |
| Unregistered | test (tone check shown) | 12.5% |
| Newcomer | control (eligible but not shown tone check) | 13.1% |
| Newcomer | test (tone check shown) | 10.6% |
| Junior Contributor | control (eligible but not shown tone check) | 8% |
| Junior Contributor | test (tone check shown) | 6.6% |

Limited to published new content edits
Tone Check decreases the frequency of non-neutral language for all reviewed user types.
We saw the largest absolute decrease in the proportion of non-neutral edits published by newcomers (-19% [-2.5 pp]). Unregistered users saw the smallest change (-5.3% [-0.7 pp]).
By Wikipedia
Code
tone_issue_edits_bywiki <- tone_check_publish_data |>
  filter(is_new_content == 1) |>
  group_by(wiki, test_group) |>
  summarise(
    n_edits = n_distinct(editing_session),
    n_tone_issues = n_distinct(editing_session[is_tone_check_eligible == 1]) # tone issues detected
  ) |>
  mutate(non_neutral_language = paste0(round(n_tone_issues / n_edits * 100, 1), "%")) |>
  select(-c(3, 4)) %>% # removing granular data columns
  gt() |>
  tab_header(
    title = "New content edits with non-neutral language by Wikipedia"
  ) |>
  opt_stylize(5) |>
  cols_label(
    wiki = "Wikipedia",
    test_group = "Experiment Group",
    # n_edits = "Number of published edits",
    # n_tone_issues = "Number of edits with non-neutral language",
    non_neutral_language = "Proportion of edits with non-neutral language"
  ) |>
  tab_source_note(
    gt::md('Limited to new content edits')
  )

display_html(as_raw_html(tone_issue_edits_bywiki))
New content edits with non-neutral language by Wikipedia

| Wikipedia | Experiment Group | Proportion of edits with non-neutral language |
| --- | --- | --- |
| French Wikipedia | control (eligible but not shown tone check) | 11.5% |
| French Wikipedia | test (tone check shown) | 10.5% |
| Japanese Wikipedia | control (eligible but not shown tone check) | 6.7% |
| Japanese Wikipedia | test (tone check shown) | 3.2% |
| Portuguese Wikipedia | control (eligible but not shown tone check) | 6.9% |
| Portuguese Wikipedia | test (tone check shown) | 6.1% |

Limited to new content edits
We also observed decreases in the proportion of edits with non-neutral language at each partner Wikipedia. At Japanese Wikipedia, there was a significant -52.2% decrease [-3.5 pp] in the proportion of edits with non-neutral language when tone check was shown to eligible edits.
Confirming the impact of Tone Check on edits published without biased language
We analyzed the above results using two complementary statistical frameworks (Bayesian and Frequentist) to correctly infer the impact of offering Tone Check on decreasing the likelihood a new content edit includes biased language when published. This allows us to confirm if the observed changes detailed above are statistically significant (did not occur due to random chance).
Since multiple edits can be made by the same user, we first calculated the rates for each user (proportion of all edits saved by a user that include non-neutral language).
Note: This is an implementation of the Bayesian and Frequentist engines also used in Test Kitchen’s automated analytics.
Code
# calculate the proportion for each user
tone_issue_edits_overall_byuser <- tone_check_publish_data |>
  filter(is_new_content == 1) |>
  group_by(test_group, platform, user_id) |>
  summarise(
    n_edits = n_distinct(editing_session),
    n_tone_issues = n_distinct(editing_session[is_tone_check_eligible == 1]) # tone issues detected
  ) |>
  mutate(non_neutral_language_rate = n_tone_issues / n_edits)
Code
# rename field names to align with relax package naming convention
tone_issue_edits_overall_byuser <- tone_issue_edits_overall_byuser |>
  mutate(variation = factor(
    test_group,
    levels = c("control (eligible but not shown tone check)", "test (tone check shown)"),
    labels = c("control", "treatment")
  ))

tone_issue_edits_overall_byuser$outcome <- tone_issue_edits_overall_byuser$non_neutral_language_rate
Code
overall_impact_toneissues <- tone_issue_edits_overall_byuser |>
  analyze_relative_lift(metric_type = "proportion") |>
  gt() |>
  tab_header(
    title = md("**Evaluating Tone Check impact on edits published with non-neutral language**"),
    subtitle = md("Difference in Metric (Test Group - Control Group)")
  ) |>
  tab_spanner(
    label = md("**Bayesian Analysis**"),
    columns = c(estimate_bayes, chance_to_win, cred_lower, cred_upper)
  ) |>
  tab_spanner(
    label = md("**Frequentist Analysis**"),
    columns = c(estimate_freq, p_value, conf_lower, conf_upper)
  ) |>
  # Rename columns for clarity ---
  cols_label(
    estimate_bayes = md("Point Estimate"),
    chance_to_win = md("Chance to Win"),
    cred_lower = md("95% CI Lower"),
    cred_upper = md("95% CI Upper"),
    estimate_freq = md("Point Estimate"),
    p_value = md("*p*-value"),
    conf_lower = md("95% CI Lower"),
    conf_upper = md("95% CI Upper")
  ) |>
  # Apply formatting (decimals) ---
  fmt_number(
    columns = everything(),
    decimals = 3 # Use 3 decimals for precision
  ) |>
  # Add a footnote explaining how to read the intervals ---
  tab_footnote(
    footnote = md("The 95% intervals do not cross zero indicating the results are statistically significant."),
    locations = cells_column_labels(columns = c(cred_lower, conf_lower))
  ) %>%
  # Style the table ---
  tab_options(
    table.border.top.color = "lightgray",
    column_labels.border.bottom.color = "black",
    column_labels.border.bottom.width = px(2),
    data_row.padding = px(5)
  )

display_html(as_raw_html(overall_impact_toneissues))
Evaluating Tone Check impact on edits published with non-neutral language
Difference in Metric (Test Group - Control Group)

| Analysis | Point Estimate | Chance to Win (Bayesian) / p-value (Frequentist) | 95% CI Lower¹ | 95% CI Upper |
| --- | --- | --- | --- | --- |
| Bayesian Analysis | −0.136 | 0.002 | −0.228 | −0.043 |
| Frequentist Analysis | −0.139 | 0.004 | −0.232 | −0.046 |

¹ The 95% intervals do not cross zero indicating the results are statistically significant.
Analysis of the A/B test data confirms a statistically significant reduction in non-neutral language across both platforms combined. We have high confidence (99.8%) that this effect is driven by Tone Check.
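To make the frequentist columns easier to interpret, here is a minimal sketch of one standard way to estimate a relative lift and its confidence interval from per-user outcomes, using the delta method. This is an assumption for illustration only: it is not taken from the relax package, whose internals are not shown in this report, and `relative_lift_delta` and the simulated data are hypothetical.

```r
# Hypothetical sketch: relative lift with a delta-method confidence interval.
# Not the relax implementation; the data below is simulated, not from the experiment.
relative_lift_delta <- function(outcome, variation, conf_level = 0.95) {
  control <- outcome[variation == "control"]
  treatment <- outcome[variation == "treatment"]

  mean_c <- mean(control)
  mean_t <- mean(treatment)
  # Variance of each group mean
  var_mean_c <- var(control) / length(control)
  var_mean_t <- var(treatment) / length(treatment)

  # Relative lift: (treatment - control) / control
  lift <- (mean_t - mean_c) / mean_c
  # Delta-method standard error for a ratio of two independent means
  se <- sqrt(var_mean_t / mean_c^2 + mean_t^2 * var_mean_c / mean_c^4)

  z <- qnorm(1 - (1 - conf_level) / 2)
  c(
    estimate = lift,
    conf_lower = lift - z * se,
    conf_upper = lift + z * se,
    p_value = 2 * pnorm(-abs(lift / se))
  )
}

# Simulated per-user outcomes (1 = published non-neutral language), for illustration only
set.seed(5)
sim_outcome <- c(rbinom(2000, 1, 0.10), rbinom(2000, 1, 0.08))
sim_variation <- rep(c("control", "treatment"), each = 2000)

sim_result <- relative_lift_delta(sim_outcome, sim_variation)
round(sim_result, 3)
```

An interval that lies entirely below zero is read here as a statistically significant decrease, which is the same reading the table footnote applies to the reported intervals.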
Code
# check by platform numbers
platform_impact_toneissues <- tone_issue_edits_overall_byuser |>
  group_by(platform) |>
  group_modify(~ analyze_relative_lift(.x, metric_type = "proportion")) |>
  gt() |>
  tab_header(
    title = md("**Evaluating Tone Check impact on edits published in non-neutral language**"),
    subtitle = md("Difference in Metric (Test Group - Control Group)")
  ) |>
  tab_spanner(
    label = md("**Bayesian Analysis**"),
    columns = c(estimate_bayes, chance_to_win, cred_lower, cred_upper)
  ) |>
  tab_spanner(
    label = md("**Frequentist Analysis**"),
    columns = c(estimate_freq, p_value, conf_lower, conf_upper)
  ) |>
  # Rename columns for clarity ---
  cols_label(
    platform = md("Platform"),
    estimate_bayes = md("Point Estimate"),
    chance_to_win = md("Chance to Win"),
    cred_lower = md("95% CI Lower"),
    cred_upper = md("95% CI Upper"),
    estimate_freq = md("Point Estimate"),
    p_value = md("*p*-value"),
    conf_lower = md("95% CI Lower"),
    conf_upper = md("95% CI Upper")
  ) |>
  # Apply formatting (decimals) ---
  fmt_number(
    columns = everything(),
    decimals = 3 # Use 3 decimals for precision
  ) |>
  # Add a footnote explaining how to read the intervals ---
  tab_footnote(
    footnote = md("The 95% intervals do not cross zero indicating results are statistically significant."),
    locations = cells_column_labels(columns = c(cred_lower, conf_lower))
  ) %>%
  # Style the table ---
  tab_options(
    table.border.top.color = "lightgray",
    column_labels.border.bottom.color = "black",
    column_labels.border.bottom.width = px(2),
    data_row.padding = px(5)
  )

display_html(as_raw_html(platform_impact_toneissues))
Evaluating Tone Check impact on edits published in non-neutral language
Difference in Metric (Test Group - Control Group)

| Platform | Analysis | Point Estimate | Chance to Win (Bayesian) / p-value (Frequentist) | 95% CI Lower¹ | 95% CI Upper |
| --- | --- | --- | --- | --- | --- |
| mobile web | Bayesian Analysis | −0.013 | 0.444 | −0.192 | 0.167 |
| mobile web | Frequentist Analysis | −0.014 | 0.883 | −0.203 | 0.174 |
| desktop | Bayesian Analysis | −0.186 | 0.000 | −0.291 | −0.081 |
| desktop | Frequentist Analysis | −0.192 | 0.000 | −0.299 | −0.086 |

¹ The 95% intervals do not cross zero indicating results are statistically significant.
However, Tone Check’s effectiveness depends heavily on the platform. Results confirm a highly significant impact on Desktop (p < 0.001), where the reduction was most pronounced. In contrast, there was no detectable effect on Mobile Web (p = 0.883), suggesting that people respond to Tone Check differently when making mobile edits.
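The "Chance to Win" column can be read with the help of a small Bayesian sketch. The version below assumes that chance to win means the posterior probability that the treatment rate exceeds the control rate, which is consistent with the small values reported alongside negative estimates but is not confirmed by this report. It uses a simple edit-level Beta-Binomial model with made-up counts, whereas the report's analysis works on per-user rates through the relax package, so treat it as an illustration of the idea only.

```r
# Hypothetical sketch: Monte Carlo "chance to win" under a Beta-Binomial model.
# Assumes chance to win = P(treatment rate > control rate); counts below are made up.
set.seed(5)

chance_to_win <- function(events_c, n_c, events_t, n_t, n_draws = 100000) {
  # Posterior draws for each group's rate, starting from a uniform Beta(1, 1) prior
  draws_c <- rbeta(n_draws, 1 + events_c, 1 + n_c - events_c)
  draws_t <- rbeta(n_draws, 1 + events_t, 1 + n_t - events_t)
  # Share of draws in which the treatment rate is higher than the control rate
  mean(draws_t > draws_c)
}

# Illustrative counts: 96 of 1000 control edits vs 60 of 1000 treatment edits flagged
ctw <- chance_to_win(events_c = 96, n_c = 1000, events_t = 60, n_t = 1000)
ctw
```

Under this reading, a value near 0 for a metric we want to decrease means the treatment rate is very likely lower than the control rate, while a value near 0.5 (as on mobile web) means the data cannot distinguish the two groups.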
New Content Edit Revert Rate (Primary Metric)
Hypothesis: The quality of new content edits newcomers and Junior Contributors make in the main namespace will increase because a greater percentage of these edits will not contain non-neutral language.
Methodology:
In addition to evaluating if Tone Check reduces the frequency of non-neutral language, we also wanted to assess the impact of Tone Check on edit revert rate.
To do this, we reviewed the proportion of all published new content edits where tone check was shown at least once in an editing session (identified by editCheck-tone-shown tag) and were reverted within 48 hours. This was compared to the revert rate of edits in the control group identified as eligible for Tone Check (identified by editcheck-tone tag).
Note: This metric does not consider the final text of the published edit. It’s possible edits shown Tone Check still included non-neutral language at the time of publishing if the Tone Check was not addressed. It’s also possible that non-neutral language was removed but the edit was still reverted for other reasons. The purpose of this metric is to evaluate if presenting a tone check to a user while editing will increase the overall quality of new content edits.
Overall
Code
tone_check_reverts_overall <- tone_check_publish_data |>
  filter(is_new_content == 1 & is_test_eligible == 'eligible') %>% # limit to edits shown or eligible to be shown tone check
  group_by(test_group) |>
  summarise(
    n_edits = n_distinct(editing_session),
    n_reverts = n_distinct(editing_session[was_reverted == 1]) # reverted within 48 hours
  ) %>%
  mutate(revert_rate = paste0(round(n_reverts / n_edits * 100, 1), "%"))
Code
# plot visualization of overall edit revert rates
dodge <- position_dodge(width = 0.9)

p <- tone_check_reverts_overall |>
  ggplot(aes(x = test_group, y = n_reverts / n_edits, fill = test_group)) +
  geom_col(position = 'dodge') +
  scale_y_continuous(labels = scales::percent) +
  geom_text(
    aes(label = paste(revert_rate, "\n", n_reverts, "reverted edits"), fontface = 2),
    vjust = 1.2, size = 10, color = "white"
  ) +
  scale_fill_manual(values = c("#999999", "dodgerblue4"), name = "Experiment Group") +
  labs(
    y = "Percent of edits reverted ",
    x = "Experiment Group",
    title = "New content edit revert rate",
    caption = "Limited to published new content edits shown or eligible to be shown Tone Check"
  ) +
  theme(
    panel.grid.minor = element_blank(),
    panel.background = element_blank(),
    plot.title = element_text(hjust = 0.5),
    text = element_text(size = 24),
    axis.text.x = element_text(size = 24),
    axis.title.x = element_text(margin = margin(t = 20, unit = "pt")),
    legend.position = "none",
    axis.line = element_line(colour = "black")
  )

p
Edits by people shown Tone Check are less likely to be reverted. Across both platforms, there was a -15% decrease [-4.4 pp] in the revert rate of edits shown Tone Check in the test group compared to edits eligible but not shown Tone Check in the control group.
By if multiple checks were shown
Code
tone_check_reverts_bymultiple <- tone_check_publish_data |>
  filter(
    is_new_content == 1 &
      is_test_eligible == 'eligible' &
      # removing 3 events where eligible edits in control were incorrectly tagged as being shown checks
      multiple_checks_shown != "no tone checks" &
      test_group == 'test (tone check shown)'
  ) |>
  group_by(multiple_checks_shown) %>%
  summarise(
    n_edits = n_distinct(editing_session),
    n_reverts = n_distinct(editing_session[was_reverted == 1])
  ) |>
  mutate(revert_rate = paste0(round(n_reverts / n_edits * 100, 1), "%")) |>
  select(-c(2, 3)) %>% # removing granular data columns for publication
  gt() |>
  tab_header(
    title = "New content edit revert rate by if multiple checks were shown"
  ) |>
  opt_stylize(5) |>
  cols_label(
    multiple_checks_shown = "Multiple Check",
    # n_edits = "Number of published new content edits",
    # n_reverts = "Number of edits reverted ",
    revert_rate = "Proportion of new content edits that were reverted"
  ) |>
  tab_source_note(
    gt::md('Limited to published new content edits shown or eligible to be shown tone check')
  )

display_html(as_raw_html(tone_check_reverts_bymultiple))
New content edit revert rate by if multiple checks were shown

| Multiple Check | Proportion of new content edits that were reverted |
| --- | --- |
| one tone check | 25.7% |
| multiple tone checks | 25.2% |

Limited to published new content edits shown or eligible to be shown tone check
The number of Tone Checks shown within a single editing session does not appear to affect the likelihood that an edit is reverted. The revert rate for edits shown one tone check is about the same as for edits shown multiple tone checks (~25%).
While we initially observed a lower revert rate for edits shown a single tone check in the leading indicator analysis, additional test data indicates that the revert rate of these edits is similar to edits shown multiple tone checks.
By Platform
Code
tone_check_publish_byplatform <- tone_check_publish_data |>
  # NOTE: unlike the other breakdowns in this section, this filter does not restrict to is_new_content == 1
  filter(is_test_eligible == 'eligible') |>
  group_by(platform, test_group) |>
  summarise(
    n_edits = n_distinct(editing_session),
    n_reverts = n_distinct(editing_session[was_reverted == 1]) # reverted within 48 hours
  ) |>
  mutate(revert_rate = paste0(round(n_reverts / n_edits * 100, 1), "%")) |>
  select(-c(3, 4)) %>% # removing granular data columns for publication
  gt() |>
  tab_header(title = "New content edit revert rate by platform") |>
  opt_stylize(5) |>
  cols_label(
    test_group = "Experiment Group",
    platform = "Platform",
    # n_edits = "Number of published new content edits",
    # n_reverts = "Number of edits reverted",
    revert_rate = "Proportion of new content edits that were reverted"
  ) |>
  tab_source_note(
    gt::md('Limited to published new content edits shown or eligible to be shown Tone Check')
  )

display_html(as_raw_html(tone_check_publish_byplatform))
New content edit revert rate by platform

| Platform | Experiment Group | Proportion of new content edits that were reverted |
| --- | --- | --- |
| mobile web | control (eligible but not shown tone check) | 34.5% |
| mobile web | test (tone check shown) | 34.6% |
| desktop | control (eligible but not shown tone check) | 25% |
| desktop | test (tone check shown) | 20.2% |

Limited to published new content edits shown or eligible to be shown Tone Check
The decrease in new content edit revert rate is primarily driven by a decrease in the revert rate of desktop edits.
We observed a 19% relative decrease [-4.8 pp] in the revert rate of desktop edits shown Tone Check. Note that the per-platform significance analysis later in this section does not confirm this decrease at the 95% level. On mobile web, the revert rate was essentially unchanged (34.5% vs. 34.6%).
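The relative and percentage-point figures quoted throughout this section are related by simple arithmetic. As a minimal sketch, the hypothetical helper `rate_change()` below (not part of the original analysis) reproduces the desktop figures from the rounded rates in the table above:

```r
# Hypothetical helper: compare a control and a test revert rate (both proportions).
rate_change <- function(control_rate, test_rate) {
  pp_change <- (test_rate - control_rate) * 100                       # percentage points
  relative_change <- (test_rate - control_rate) / control_rate * 100  # percent of control
  c(pp = round(pp_change, 1), relative = round(relative_change, 1))
}

# Desktop rates from the by-platform table: 25% control, 20.2% test
desktop_change <- rate_change(control_rate = 0.25, test_rate = 0.202)
desktop_change
# pp = -4.8, relative = -19.2
```

Because the published tables show rounded rates, figures recomputed this way can differ slightly from those derived from the unrounded counts.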
By User Experience
Code
tone_check_revert_byuserexp <- tone_check_publish_data |>
  filter(is_new_content == 1 & is_test_eligible == 'eligible') |>
  group_by(experience_level_group, test_group) |>
  summarise(
    n_edits = n_distinct(editing_session),
    n_reverts = n_distinct(editing_session[was_reverted == 1]) # reverted within 48 hours
  ) |>
  mutate(revert_rate = paste0(round(n_reverts / n_edits * 100, 1), "%")) |>
  select(-c(3, 4)) |> # removing granular data columns for publication
  gt() |>
  tab_header(title = "New content edit revert rate by user experience") |>
  opt_stylize(5) |>
  cols_label(
    test_group = "Experiment Group",
    experience_level_group = "User Status",
    # n_edits = "Number of published new content edits",
    # n_reverts = "Number of edits reverted",
    revert_rate = "Proportion of new content edits that were reverted"
  ) |>
  tab_source_note(
    gt::md('Limited to published new content edits shown or eligible to be shown Tone Check')
  )

display_html(as_raw_html(tone_check_revert_byuserexp))
New content edit revert rate by user experience

| User Status | Experiment Group | Proportion of new content edits that were reverted |
| --- | --- | --- |
| Unregistered | control (eligible but not shown tone check) | 34.7% |
| Unregistered | test (tone check shown) | 37% |
| Newcomer | control (eligible but not shown tone check) | 21% |
| Newcomer | test (tone check shown) | 26% |
| Junior Contributor | control (eligible but not shown tone check) | 31% |
| Junior Contributor | test (tone check shown) | 20.8% |

Limited to published new content edits shown or eligible to be shown Tone Check
Results vary based on user experience. While we observed a statistically significant 33% relative decrease [-10.2 pp] in reverts for Junior Contributors, we could not confirm any change in the revert rate for Newcomers or unregistered users.
These trends indicate that Tone Check may be more effective for people who have already succeeded in completing at least one edit on a Wikipedia namespace. Since these users are more experienced, their edits are less likely to be reverted for other policy violations compared to users completing their first edit.
By partner Wikipedia
Code
tone_check_revert_bywiki <- tone_check_publish_data |>
  filter(is_new_content == 1 & is_test_eligible == 'eligible') |>
  group_by(wiki, test_group) |>
  summarise(
    n_edits = n_distinct(editing_session),
    n_reverts = n_distinct(editing_session[was_reverted == 1])
  ) |>
  mutate(revert_rate = paste0(round(n_reverts / n_edits * 100, 1), "%")) |>
  select(-c(3, 4)) %>% # removing granular data columns for publication
  gt() |>
  tab_header(title = "New content edit revert rate by partner Wikipedia") |>
  opt_stylize(5) |>
  cols_label(
    test_group = "Experiment Group",
    wiki = "Wikipedia",
    # n_edits = "Number of published new content edits",
    # n_reverts = "Number of edits reverted",
    revert_rate = "Proportion of new content edits that were reverted"
  ) |>
  tab_source_note(
    gt::md('Limited to wikis with > 100 published new content edits')
  )

display_html(as_raw_html(tone_check_revert_bywiki))
New content edit revert rate by partner Wikipedia

| Wikipedia | Experiment Group | Proportion of new content edits that were reverted |
| --- | --- | --- |
| French Wikipedia | control (eligible but not shown tone check) | 30.8% |
| French Wikipedia | test (tone check shown) | 29% |
| Japanese Wikipedia | control (eligible but not shown tone check) | 31.2% |
| Japanese Wikipedia | test (tone check shown) | 10.9% |
| Portuguese Wikipedia | control (eligible but not shown tone check) | 21.4% |
| Portuguese Wikipedia | test (tone check shown) | 16.9% |

Limited to wikis with > 100 published new content edits
The new content edit revert rate decreased for users shown Tone Check at all three partner Wikipedias, by at least 5% in relative terms.
We again see an especially high impact on edit quality at Japanese Wikipedia, where there was a 65% relative decrease in the revert rate of edits shown Tone Check compared to eligible edits in the control group.
Due to the small per-Wikipedia sample sizes, we are currently not able to confirm the statistical significance of the decreases at any of these Wikipedias, but the direction and magnitude of change indicate that Tone Check is having a positive effect on edit quality at each partner Wikipedia.
Confirming the impact of Tone Check on revert rate
Code
# calculate the proportion for each user
tone_check_reverts_overall_byuser <- tone_check_publish_data |>
  filter(is_test_eligible == 'eligible') |> # limit to edits shown or eligible to be shown tone check
  group_by(test_group, platform, user_id) |>
  summarise(
    n_edits = n_distinct(editing_session),
    n_reverts = n_distinct(editing_session[was_reverted == 1]) # reverted within 48 hours
  ) |>
  mutate(revert_rate = n_reverts / n_edits) # proportion for each user
Code
# rename field names to align with relax package naming convention
tone_check_reverts_overall_byuser <- tone_check_reverts_overall_byuser |>
  mutate(
    variation = factor(
      test_group,
      levels = c("control (eligible but not shown tone check)", "test (tone check shown)"),
      labels = c("control", "treatment")
    )
  )

tone_check_reverts_overall_byuser$outcome <- tone_check_reverts_overall_byuser$revert_rate
Code
overall_impact_reverts <- tone_check_reverts_overall_byuser |>
  analyze_relative_lift(metric_type = "proportion", ci_level = 0.90) |>
  gt() |>
  tab_header(
    title = md("**Evaluating Tone Check impact on new content revert rate**"),
    subtitle = md("Difference in Metric (Test Group - Control Group)")
  ) |>
  tab_spanner(
    label = md("**Bayesian Analysis**"),
    columns = c(estimate_bayes, chance_to_win, cred_lower, cred_upper)
  ) |>
  tab_spanner(
    label = md("**Frequentist Analysis**"),
    columns = c(estimate_freq, p_value, conf_lower, conf_upper)
  ) |>
  # Rename Columns for clarity ---
  cols_label(
    estimate_bayes = md("Point Estimate"),
    chance_to_win = md("Chance to Win"),
    cred_lower = md("90% CI Lower"),
    cred_upper = md("90% CI Upper"),
    estimate_freq = md("Point Estimate"),
    p_value = md("*p*-value"),
    conf_lower = md("90% CI Lower"),
    conf_upper = md("90% CI Upper")
  ) |>
  # Apply Formatting (Decimals and CI Grouping) ---
  fmt_number(
    columns = everything(),
    decimals = 3 # Use 3 decimals for precision
  ) |>
  # Highlight key finding (Inconclusive) ---
  # tab_footnote(
  #   footnote = md("The 95% intervals cross zero, indicating no statistically conclusive difference."),
  #   locations = cells_column_labels(columns = c(cred_lower, conf_lower))
  # ) %>%
  # Style the table ---
  tab_options(
    table.border.top.color = "lightgray",
    column_labels.border.bottom.color = "black",
    column_labels.border.bottom.width = px(2),
    data_row.padding = px(5)
  )

display_html(as_raw_html(overall_impact_reverts))
Evaluating Tone Check impact on new content revert rate
Difference in Metric (Test Group - Control Group)

| Analysis | Point Estimate | Chance to Win (Bayesian) or p-value (Frequentist) | 90% CI Lower | 90% CI Upper |
| --- | --- | --- | --- | --- |
| Bayesian Analysis | −0.080 | 0.043 | −0.158 | −0.003 |
| Frequentist Analysis | −0.082 | 0.083 | −0.161 | −0.004 |
Results confirm a slight but statistically significant decrease in the revert rate of edits shown Tone Check across all edits. Analysis indicates a 95.7% chance that Tone Check successfully reduces the likelihood of an edit being reverted. At a 90% confidence level, the results are statistically significant (p = 0.083).
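The figures in the paragraph above can be recovered from the table with a few lines of arithmetic. The sketch below is illustrative only; it assumes, as the text implies, that the reported "Chance to Win" is the probability that the revert rate is higher in the test group, so its complement is the probability that Tone Check reduces reverts.

```r
# Values copied from the overall table above
chance_to_win <- 0.043
p_value       <- 0.083
ci_level      <- 0.90  # ci_level passed to analyze_relative_lift in the code above

# Assumption (inferred from the text): 1 - chance_to_win is the
# probability that Tone Check reduces the revert rate
prob_reduces_reverts <- 1 - chance_to_win

# Compare the p-value against the significance threshold implied by each level
significant_at_90 <- p_value < (1 - ci_level)
significant_at_95 <- p_value < 0.05

prob_reduces_reverts  # 0.957
significant_at_90     # TRUE
significant_at_95     # FALSE
```

The result clears the 90% threshold used for this analysis but would not clear a stricter 95% threshold.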
Code
# check by platform numbers
platform_impact_reverts <- tone_check_reverts_overall_byuser |>
  group_by(platform) |>
  group_modify(~ analyze_relative_lift(.x, metric_type = "proportion")) |>
  gt() |>
  tab_header(
    title = md("**Evaluating Tone Check impact on revert rate by platform**"),
    subtitle = md("Difference in Metric (Test Group - Control Group)")
  ) |>
  tab_spanner(
    label = md("**Bayesian Analysis**"),
    columns = c(estimate_bayes, chance_to_win, cred_lower, cred_upper)
  ) |>
  tab_spanner(
    label = md("**Frequentist Analysis**"),
    columns = c(estimate_freq, p_value, conf_lower, conf_upper)
  ) |>
  # Rename Columns for clarity ---
  cols_label(
    platform = md("Platform"),
    estimate_bayes = md("Point Estimate"),
    chance_to_win = md("Chance to Win"),
    cred_lower = md("95% CI Lower"),
    cred_upper = md("95% CI Upper"),
    estimate_freq = md("Point Estimate"),
    p_value = md("*p*-value"),
    conf_lower = md("95% CI Lower"),
    conf_upper = md("95% CI Upper")
  ) |>
  # Apply Formatting (Decimals and CI Grouping) ---
  fmt_number(
    columns = everything(),
    decimals = 3 # Use 3 decimals for precision
  ) |>
  # Highlight key finding (Inconclusive) ---
  tab_footnote(
    footnote = md("The 95% intervals cross zero, indicating no statistically conclusive difference."),
    locations = cells_column_labels(columns = c(cred_lower, conf_lower))
  ) %>%
  # Style the table ---
  tab_options(
    table.border.top.color = "lightgray",
    column_labels.border.bottom.color = "black",
    column_labels.border.bottom.width = px(2),
    data_row.padding = px(5)
  )

display_html(as_raw_html(platform_impact_reverts))
Evaluating Tone Check impact on revert rate by platform
Difference in Metric (Test Group - Control Group)

| Platform | Analysis | Point Estimate | Chance to Win (Bayesian) or p-value (Frequentist) | 95% CI Lower [1] | 95% CI Upper |
| --- | --- | --- | --- | --- | --- |
| mobile web | Bayesian Analysis | −0.024 | 0.370 | −0.166 | 0.118 |
| mobile web | Frequentist Analysis | −0.026 | 0.732 | −0.172 | 0.121 |
| desktop | Bayesian Analysis | −0.092 | 0.063 | −0.210 | 0.026 |
| desktop | Frequentist Analysis | −0.096 | 0.119 | −0.217 | 0.025 |

[1] The 95% intervals cross zero, indicating no statistically conclusive difference.
While we do not have sufficient data to confirm statistical significance at the strict 95% level on a per-platform basis, the results suggest that Tone Check is decreasing the revert rate on desktop.
The Bayesian analysis shows a 93.7% probability that the tool reduces desktop reverts, with point estimates of -9.2% (Bayesian) and -9.6% (frequentist). On mobile web, there was almost no change in the overall new content revert rate.
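The "crosses zero" reading of the by-platform table can be checked mechanically. In this minimal sketch the interval bounds are copied from the table above, and `crosses_zero()` is a hypothetical helper introduced only for illustration:

```r
# Interval bounds copied from the by-platform table above
platform_intervals <- data.frame(
  platform   = c("mobile web", "desktop"),
  cred_lower = c(-0.166, -0.210),
  cred_upper = c(0.118, 0.026),
  conf_lower = c(-0.172, -0.217),
  conf_upper = c(0.121, 0.025)
)

# Hypothetical helper: TRUE when an interval contains zero,
# i.e. the data are compatible with no change in revert rate
crosses_zero <- function(lower, upper) lower < 0 & upper > 0

platform_intervals$bayes_crosses_zero <-
  with(platform_intervals, crosses_zero(cred_lower, cred_upper))
platform_intervals$freq_crosses_zero <-
  with(platform_intervals, crosses_zero(conf_lower, conf_upper))

platform_intervals[, c("platform", "bayes_crosses_zero", "freq_crosses_zero")]
```

Both platforms return TRUE for both intervals, which is why neither per-platform result is conclusive at the 95% level. The desktop upper bounds (0.026 and 0.025) sit much closer to zero than the mobile web ones.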
New Content edit revert rate: Impact of removing non-neutral language (Primary Metric)
Hypothesis: The quality of new content edits newcomers and Junior Contributors make in the main namespace will increase because a greater percentage of these edits will not contain non-neutral language.
Methodology: As the final piece to evaluate this hypothesis, we reviewed the revert rate of new content edits in the test group for people that removed non-neutral language in response to Tone Check. Here we are measuring the impact of a person making the change Tone Check prompts them to make. Does removing non-neutral language decrease the likelihood that an edit is reverted?
In this section, we isolated the direct impact of Tone Check by comparing a specific subset: control edits that contained non-neutral language versus test edits where the user actively removed that language in response to a Tone Check. To do this, we used the revision tag created in T388716 to identify when the model detects non-neutral language within a new content edit at the time of publishing.
Overall
Code
tone_check_eligible_revert_overall <- tone_check_publish_data |>
  filter(is_new_content == 1 & is_test_eligible == 'eligible') |> # limit to edits shown or eligible to be shown tone check
  group_by(is_tone_check_addressed) |> # group by presence of non-neutral language
  summarise(
    n_edits = n_distinct(editing_session),
    n_reverts = n_distinct(editing_session[was_reverted == 1])
  ) |> # reverted within 48 hours and tone check issues addressed
  mutate(revert_rate = round(n_reverts / n_edits, 3)) |>
  ungroup() |>
  mutate(
    n_edits = ifelse(n_edits < 50, "<50", n_edits),
    n_reverts = ifelse(n_reverts < 50, "<50", n_reverts)
  ) # sanitizing per data publication guidelines
Code
# plot visualization of overall edit revert rates
dodge <- position_dodge(width = 0.9)

p <- tone_check_eligible_revert_overall |>
  # removing edits in test group where tone issues were not addressed for this analysis
  filter(is_tone_check_addressed != 'Tone check shown but not addressed') |>
  ggplot(aes(x = is_tone_check_addressed, y = revert_rate, fill = is_tone_check_addressed)) +
  geom_col(position = 'dodge') +
  scale_y_continuous(labels = scales::percent) +
  geom_text(
    aes(label = paste(revert_rate * 100, "%", "\n", n_reverts, "reverted edits"), fontface = 2),
    vjust = 1.2, size = 8, color = "white"
  ) +
  scale_fill_manual(values = c("#999999", "dodgerblue4"), name = "Experiment Group") +
  labs(
    y = "Percent of edits reverted ",
    x = "Experiment Group",
    title = "New Content revert rate: Impact of removing non-neutral language",
    caption = "Limited to published new content edits by unregistered users or users with 100 or fewer edits"
  ) +
  theme(
    panel.grid.minor = element_blank(),
    panel.background = element_blank(),
    plot.title = element_text(hjust = 0.5),
    text = element_text(size = 20),
    axis.text.x = element_text(size = 20),
    axis.title.x = element_text(margin = margin(t = 20, unit = "pt")),
    legend.position = "none",
    axis.line = element_line(colour = "black")
  )

p
When Tone Check successfully prompts a user to remove non-neutral language, the likelihood of that edit being reverted drops significantly. There was a 44.1% relative decrease in the revert rate of edits where people removed non-neutral language in response to a Tone Check prompt.
By Platform
Code
tone_check_eligible_revert_byplatform <- tone_check_publish_data |>
  filter(
    is_new_content == 1 &
      is_test_eligible == 'eligible' &
      is_tone_check_addressed != 'Tone check shown but not addressed'
  ) |> # limit to edits where tone check was addressed
  group_by(platform, is_tone_check_addressed) |>
  summarise(
    n_edits = n_distinct(editing_session),
    n_reverts = n_distinct(editing_session[was_reverted == 1]) # look at reverted
  ) |>
  mutate(revert_rate = paste0(round(n_reverts / n_edits * 100, 1), "%")) |>
  select(-c(3, 4)) |> # removing granular data columns
  gt() |>
  tab_header(title = "New Content revert rate: Impact of removing non-neutral language by platform") |>
  opt_stylize(5) |>
  cols_label(
    platform = "Platform",
    is_tone_check_addressed = "Were tone issues detected at time of save?",
    # n_edits = "Number of published edits",
    # n_reverts = "Number of edits reverted",
    revert_rate = "Proportion of edits that were reverted"
  ) |>
  tab_source_note(
    gt::md('Limited to published new content edits shown or eligible to be shown Tone Check')
  )

display_html(as_raw_html(tone_check_eligible_revert_byplatform))
New Content revert rate: Impact of removing non-neutral language by platform

| Platform | Were tone issues detected at time of save? | Proportion of edits that were reverted |
| --- | --- | --- |
| mobile web | Eligible control edits | 32.4% |
| mobile web | Tone check shown and addressed | 27.6% |
| desktop | Eligible control edits | 28.5% |
| desktop | Tone check shown and addressed | 15.1% |

Limited to published new content edits shown or eligible to be shown Tone Check
We observed decreases on both platforms, but the impact is larger on desktop than on mobile web. On desktop, we observed a significant 47% relative decrease [-13.4 pp] in the revert rate for people who revised their text in response to Tone Check.
On mobile web, there was a 14.8% relative decrease [-4.8 pp] in the revert rate for edits where non-neutral language was removed. Mobile web edits appear to be inherently trickier for newcomers and are still more likely to be reverted than desktop edits, even when non-neutral language is removed.
By User Experience
Code
tone_check_eligible_revert_byuserexp <- tone_check_publish_data |>
  filter(
    is_new_content == 1 &
      is_test_eligible == 'eligible' &
      is_tone_check_addressed != 'Tone check shown but not addressed'
  ) |> # limit to edits where tone check was addressed
  group_by(experience_level_group, is_tone_check_addressed) |>
  summarise(
    n_edits = n_distinct(editing_session),
    n_reverts = n_distinct(editing_session[was_reverted == 1]) # look at reverted
  ) |>
  mutate(revert_rate = paste0(round(n_reverts / n_edits * 100, 1), "%")) |>
  select(-c(3, 4)) |> # removing granular data columns
  gt() |>
  tab_header(title = "New Content revert rate: Impact of removing non-neutral language by user experience") |>
  opt_stylize(5) |>
  cols_label(
    experience_level_group = "User Experience",
    is_tone_check_addressed = "Were tone issues detected at time of save?",
    # n_edits = "Number of published edits",
    # n_reverts = "Number of edits reverted",
    revert_rate = "Proportion of edits that were reverted"
  ) |>
  tab_source_note(
    gt::md('Limited to published new content edits shown or eligible to be shown Tone Check')
  )

display_html(as_raw_html(tone_check_eligible_revert_byuserexp))
New Content revert rate: Impact of removing non-neutral language by user experience

| User Experience | Were tone issues detected at time of save? | Proportion of edits that were reverted |
| --- | --- | --- |
| Unregistered | Eligible control edits | 34.7% |
| Unregistered | Tone check shown and addressed | 23.8% |
| Newcomer | Eligible control edits | 21% |
| Newcomer | Tone check shown and addressed | 21.5% |
| Junior Contributor | Eligible control edits | 31% |
| Junior Contributor | Tone check shown and addressed | 13.7% |

Limited to published new content edits shown or eligible to be shown Tone Check
We observed the highest impact for Junior Contributors, where there was a 55.8% relative decrease [-17.3 pp] in the revert rate, compared to a slight 2.5% relative increase [+0.5 pp] for Newcomers and a 31.4% relative decrease [-10.9 pp] for unregistered users.
For Newcomers and unregistered users, addressing tone issues may have less of an impact because their edits are frequently reverted for other policy violations that Tone Check is not designed to catch. Junior contributors have already successfully completed at least one edit and are more likely to publish an edit where non-neutral language is the only issue.
By Partner Wikipedia
Code
tone_check_eligible_revert_bywiki <- tone_check_publish_data |>
  filter(
    is_new_content == 1 & # written as == 1 for consistency with the other breakdowns
      is_test_eligible == 'eligible' &
      is_tone_check_addressed != 'Tone check shown but not addressed'
  ) |> # limit to edits where tone check was addressed
  group_by(wiki, is_tone_check_addressed) %>%
  summarise(
    n_edits = n_distinct(editing_session),
    n_reverts = n_distinct(editing_session[was_reverted == 1]) # look at reverted
  ) |>
  mutate(revert_rate = paste0(round(n_reverts / n_edits * 100, 1), "%")) |>
  select(-c(3, 4)) %>% # removing granular data columns
  gt() |>
  tab_header(title = "New Content revert rate: Impact of removing non-neutral language by partner Wikipedia") |>
  opt_stylize(5) |>
  cols_label(
    wiki = "Wikipedia",
    is_tone_check_addressed = "Were tone issues detected at time of save?",
    # n_edits = "Number of published edits",
    # n_reverts = "Number of edits reverted",
    revert_rate = "Proportion of edits that were reverted"
  ) |>
  tab_source_note(
    gt::md('Limited to published new content edits shown or eligible to be shown Tone Check')
  )

display_html(as_raw_html(tone_check_eligible_revert_bywiki))
New Content revert rate: Impact of removing non-neutral language by partner Wikipedia

| Wikipedia | Were tone issues detected at time of save? | Proportion of edits that were reverted |
| --- | --- | --- |
| French Wikipedia | Eligible control edits | 30.8% |
| French Wikipedia | Tone check shown and addressed | 22.5% |
| Japanese Wikipedia | Eligible control edits | 31.2% |
| Japanese Wikipedia | Tone check shown and addressed | 8% |
| Portuguese Wikipedia | Eligible control edits | 21.4% |
| Portuguese Wikipedia | Tone check shown and addressed | 4.5% |

Limited to published new content edits shown or eligible to be shown Tone Check
We also observed decreases across all three partner Wikipedias; however, the magnitude of impact varies, highlighting different revert behavior in each community.
At Japanese and Portuguese Wikipedias, removing non-neutral language from edits reduces the revert rate to less than 10% (a relative decrease of over 70%), while there was less of an impact on French Wikipedia. See specific changes below:
Japanese Wikipedia: -74.4% [-23.2 pp]
Portuguese Wikipedia: -79% [-16.9 pp]
French Wikipedia: -26.9% [-8.3 pp]
Note: There is a smaller sample size of published edits eligible for Tone Check on a per Wikipedia basis (< 300 edits) so these rates are more susceptible to noise.
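To give a rough sense of how much noise a sample of this size carries, the sketch below computes an approximate 95% margin of error for a proportion using the normal approximation. The inputs are illustrative assumptions only (a 30% revert rate and sample sizes of 300, 100, and 50 edits); they are not the per-wiki counts from the experiment, which are not published here, and `margin_of_error()` is a hypothetical helper.

```r
# Hypothetical helper: approximate margin of error for a proportion
# (normal approximation), returned as a proportion
margin_of_error <- function(p, n, conf_level = 0.95) {
  z <- qnorm(1 - (1 - conf_level) / 2)  # about 1.96 for a 95% level
  z * sqrt(p * (1 - p) / n)
}

# Illustrative inputs only: a 30% revert rate at three sample sizes
moe_pp <- round(margin_of_error(p = 0.30, n = c(300, 100, 50)) * 100, 1)
moe_pp
# 5.2 9.0 12.7 (percentage points)
```

Under these assumptions, even 300 edits leave a margin of about ±5 pp on a single group's revert rate, and subgroups such as edits where the tone check was addressed will typically be smaller still.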
Confirming the impact of removing non-neutral language on new content revert rate
We then evaluated the impact of removing non-neutral language on new content revert rate, controlling for variance by user and wiki. For this analysis, we specifically compared edits with non-neutral language in the control group (eligible control edits) to edits in the test group where Tone Check was shown and non-neutral language was removed.
Code
tone_check_eligible_revert_overall_byuser <- tone_check_publish_data |>
  filter(
    is_new_content == 1 &
      is_test_eligible == 'eligible' &
      is_tone_check_addressed != 'Tone check shown but not addressed'
  ) |> # directly comparing eligible control to test where tone check addressed
  group_by(is_tone_check_addressed, platform, user_id) |> # use presence of non-neutral language as the variation
  summarise(
    n_edits = n_distinct(editing_session),
    n_reverts = n_distinct(editing_session[was_reverted == 1]) # reverted within 48 hours
  ) |>
  mutate(revert_rate = n_reverts / n_edits)
Code
# rename field names to align with relax package naming convention
tone_check_eligible_revert_overall_byuser <- tone_check_eligible_revert_overall_byuser |>
  mutate(
    variation = factor(
      is_tone_check_addressed,
      levels = c("Eligible control edits", "Tone check shown and addressed"),
      labels = c("control", "treatment")
    )
  )

tone_check_eligible_revert_overall_byuser$outcome <- tone_check_eligible_revert_overall_byuser$revert_rate
Code
overall_impact_toneeligible <- tone_check_eligible_revert_overall_byuser |>
  analyze_relative_lift(metric_type = "proportion") |>
  gt() |>
  tab_header(
    title = md("**Evaluating the impact of removing non-neutral language on revert rate**"),
    subtitle = md("Difference in Metric (Test Group - Control Group)")
  ) |>
  tab_spanner(
    label = md("**Bayesian Analysis**"),
    columns = c(estimate_bayes, chance_to_win, cred_lower, cred_upper)
  ) |>
  tab_spanner(
    label = md("**Frequentist Analysis**"),
    columns = c(estimate_freq, p_value, conf_lower, conf_upper)
  ) |>
  # Rename Columns for clarity ---
  cols_label(
    estimate_bayes = md("Point Estimate"),
    chance_to_win = md("Chance to Win"),
    cred_lower = md("95% CI Lower"),
    cred_upper = md("95% CI Upper"),
    estimate_freq = md("Point Estimate"),
    p_value = md("*p*-value"),
    conf_lower = md("95% CI Lower"),
    conf_upper = md("95% CI Upper")
  ) |>
  # Apply Formatting (Decimals and CI Grouping) ---
  fmt_number(
    columns = everything(),
    decimals = 3 # Use 3 decimals for precision
  ) |>
  # Highlight key finding (Statistically significant) ---
  tab_footnote(
    footnote = md("The 95% intervals do not cross zero, indicating the results are statistically significant."),
    locations = cells_column_labels(columns = c(cred_lower, conf_lower))
  ) %>%
  # Style the table ---
  tab_options(
    table.border.top.color = "lightgray",
    column_labels.border.bottom.color = "black",
    column_labels.border.bottom.width = px(2),
    data_row.padding = px(5)
  )

display_html(as_raw_html(overall_impact_toneeligible))
Evaluating the impact of removing non-neutral language on revert rate
Difference in Metric (Test Group - Control Group)

| Analysis | Point Estimate | Chance to Win (Bayesian) or p-value (Frequentist) | 95% CI Lower [1] | 95% CI Upper |
| --- | --- | --- | --- | --- |
| Bayesian Analysis | −0.350 | 0.000 | −0.534 | −0.165 |
| Frequentist Analysis | −0.388 | 0.000 | −0.583 | −0.193 |

[1] The 95% intervals do not cross zero, indicating the results are statistically significant.
We confirmed a statistically significant reduction in revert rate for edits where non-neutral language was removed in the final published edit.
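For users that removed non-neutral language in response to a Tone Check, the point estimates indicate a reduction of roughly 35% (Bayesian) to 39% (frequentist) in the likelihood of an edit being reverted. These estimates come from analyze_relative_lift and are best read as relative changes, since a 38 percentage point drop would exceed the control revert rates reported in the tables above (roughly 21% to 35%). Both the Bayesian "Chance to Win" and the frequentist p-value round to 0.000 at three decimal places, so the reduction is very unlikely to be due to chance and is consistent with the improved language lowering the likelihood of a revert.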
Code
# check by platform numbers
platform_impact_toneeligible <- tone_check_eligible_revert_overall_byuser |>
  group_by(platform) |>
  group_modify(~ analyze_relative_lift(.x, metric_type = "proportion")) |>
  gt() |>
  tab_header(
    title = md("**Evaluating the impact of removing non-neutral language on platform revert rate**"),
    subtitle = md("Difference in Metric (Test Group - Control Group)")
  ) |>
  tab_spanner(
    label = md("**Bayesian Analysis**"),
    columns = c(estimate_bayes, chance_to_win, cred_lower, cred_upper)
  ) |>
  tab_spanner(
    label = md("**Frequentist Analysis**"),
    columns = c(estimate_freq, p_value, conf_lower, conf_upper)
  ) |>
  # Rename Columns for clarity ---
  cols_label(
    platform = md("Platform"),
    estimate_bayes = md("Point Estimate"),
    chance_to_win = md("Chance to Win"),
    cred_lower = md("95% CI Lower"),
    cred_upper = md("95% CI Upper"),
    estimate_freq = md("Point Estimate"),
    p_value = md("*p*-value"),
    conf_lower = md("95% CI Lower"),
    conf_upper = md("95% CI Upper")
  ) |>
  # Apply Formatting (Decimals and CI Grouping) ---
  fmt_number(
    columns = everything(),
    decimals = 3 # Use 3 decimals for precision
  ) |>
  # Style the table ---
  tab_options(
    table.border.top.color = "lightgray",
    column_labels.border.bottom.color = "black",
    column_labels.border.bottom.width = px(2),
    data_row.padding = px(5)
  )

display_html(as_raw_html(platform_impact_toneeligible))
Evaluating the impact of removing non-neutral language on platform revert rate
Difference in Metric (Test Group - Control Group)

| Platform | Analysis | Point Estimate | Chance to Win (Bayesian) or p-value (Frequentist) | 95% CI Lower | 95% CI Upper |
| --- | --- | --- | --- | --- | --- |
| mobile web | Bayesian Analysis | −0.088 | 0.330 | −0.479 | 0.303 |
| mobile web | Frequentist Analysis | −0.158 | 0.559 | −0.699 | 0.384 |
| desktop | Bayesian Analysis | −0.342 | 0.001 | −0.551 | −0.133 |
| desktop | Frequentist Analysis | −0.392 | 0.001 | −0.616 | −0.167 |
Per-platform findings:
Desktop: The model shows a highly significant drop in the revert rate for edits where non-neutral language was removed, with a Bayesian point estimate of -34.2% in relative terms (p = 0.001). There is a 99.9% probability that desktop users who address Tone Check suggestions are less likely to be reverted.
Mobile web: Results show directional signs that removing non-neutral language decreases reverts, with a Bayesian point estimate of -8.8% in relative terms. While there is a 67% chance that addressing Tone Check issues decreases reverts, we cannot confirm statistical significance (p = 0.559). This is likely due to a smaller sample size and more noise on mobile, where other factors (such as technical errors or other policy violations) often result in reverts even if non-neutral language is removed.
Edit Completion Rate (Primary Metric)
Hypothesis: Newcomers and Junior Contributors will experience Tone Check as encouraging because it will offer them more clarity about what is expected of the new information they add to Wikipedia.
Methodology: We reviewed the proportion of attempted edits that were successfully published and not reverted. To reduce noise from edits abandoned earlier in the editing workflow, we limited this analysis to edits that reached the point where Tone Check was or would have been shown.
We excluded edits that were reverted to ensure we were measuring Tone Check's impact on productive contributions.
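As a minimal sketch of this definition, the example below computes a completion rate on a tiny, entirely hypothetical set of editing sessions (not experiment data). The column names `editing_session`, `saved_edit`, and `was_reverted` mirror those used in the analysis code in this section.

```r
# Hypothetical sessions, for illustration only (not real experiment data)
sessions <- data.frame(
  editing_session = c("a", "b", "c", "d", "e"),
  saved_edit      = c(1, 1, 0, 1, 0),  # 1 = the edit was published
  was_reverted    = c(0, 1, 0, 0, 0)   # 1 = the published edit was reverted
)

# Denominator: every edit attempt that reached the Tone Check step
n_edits <- length(unique(sessions$editing_session))

# Numerator: attempts that were published and not subsequently reverted
completed <- sessions$saved_edit > 0 & sessions$was_reverted == 0
n_saves <- length(unique(sessions$editing_session[completed]))

completion_rate <- n_saves / n_edits
completion_rate
# 0.4 (sessions "a" and "d" count; "b" was published but reverted)
```

Counting a published-then-reverted edit as not completed is what separates this metric from a plain publish rate.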
Code
# load data for assessing edit completion rate
tone_check_completion_rates_1 <- read.csv(
  file = 'data/tone_check_completion_data.tsv',
  header = TRUE,
  sep = "\t",
  stringsAsFactors = FALSE
)
Code
# load edit completion rate (second dataset)
# Second dataset was created to obtain updated event data while preserving the initial aggregated dataset,
# which could no longer be queried in the Data Lake due to data retention policies.
tone_check_completion_rates_2 <- read.csv(
  file = 'data/tone_check_completion_data_pt2.tsv',
  header = TRUE,
  sep = "\t",
  stringsAsFactors = TRUE
)
Code
# Combine the two datasets
tone_check_completion_rates <- rbind(tone_check_completion_rates_1, tone_check_completion_rates_2)
Code
# Set experience level group and factor levels
tone_check_completion_rates <- tone_check_completion_rates |>
  mutate(
    experience_level_group = case_when(
      user_edit_count == 0 & user_status == 'registered' ~ 'Newcomer',
      user_edit_count == 0 & user_status == 'unregistered' ~ 'Unregistered',
      user_edit_count > 0 & user_edit_count <= 100 ~ "Junior Contributor",
      user_edit_count > 100 ~ "Non-Junior Contributor"
    ),
    experience_level_group = factor(
      experience_level_group,
      levels = c("Unregistered", "Newcomer", "Non-Junior Contributor", "Junior Contributor")
    )
  )

# rename experiment field to clarify
tone_check_completion_rates <- tone_check_completion_rates |>
  mutate(
    test_group = factor(
      test_group,
      levels = c('2025-09-editcheck-tone-control', '2025-09-editcheck-tone-test'),
      labels = c("control (eligible but not shown tone check)", "test (tone check shown)")
    )
  )

# rename platform from phone to mobile web to clarify meaning
tone_check_completion_rates <- tone_check_completion_rates |>
  mutate(
    platform = factor(
      platform,
      levels = c('phone', 'desktop'),
      labels = c("mobile web", "desktop")
    )
  )

tone_check_completion_rates <- tone_check_completion_rates |>
  mutate(wiki = recode(wiki, !!!wiki_name_lookup))
Code
# Set fields and factor levels to assess number of checks shown
tone_check_completion_rates <- tone_check_completion_rates |>
  mutate(
    multiple_checks_shown = ifelse(n_checks_shown > 1, "multiple checks shown", "one check shown"),
    multiple_checks_shown = factor(
      multiple_checks_shown,
      levels = c("one check shown", "multiple checks shown")
    )
  )

# note these buckets can be adjusted as needed based on distribution of data
tone_check_completion_rates <- tone_check_completion_rates |>
  mutate(
    checks_shown_bucket = case_when(
      is.na(n_checks_shown) ~ '0',
      n_checks_shown == 1 ~ '1',
      n_checks_shown == 2 ~ '2',
      n_checks_shown > 2 & n_checks_shown <= 5 ~ "3-5",
      n_checks_shown > 5 & n_checks_shown <= 10 ~ "6-10",
      n_checks_shown > 10 ~ "over 10"
    ),
    checks_shown_bucket = factor(
      checks_shown_bucket,
      levels = c("0", "1", "2", "3-5", "6-10", "over 10")
    )
  )
Overall
Code
tone_check_completion_rate_overall <- tone_check_completion_rates %>%
  filter(tone_check_shown == 1) %>% # limit to sessions where tone check was shown
  group_by(test_group) %>%
  summarise(
    n_edits = n_distinct(editing_session),
    n_saves = n_distinct(editing_session[saved_edit > 0 & was_reverted == 0]) # saved and not reverted
  ) %>%
  mutate(completion_rate = paste0(round(n_saves / n_edits * 100, 1), "%"))
Code
# plot visualization of overall edit completion rates
dodge <- position_dodge(width = 0.9)

p <- tone_check_completion_rate_overall %>%
  ggplot(aes(x = test_group, y = n_saves / n_edits, fill = test_group)) +
  geom_col(position = 'dodge') +
  scale_y_continuous(labels = scales::percent) +
  geom_text(
    aes(label = paste(completion_rate, "\n", n_saves, "saved edits"), fontface = 2),
    vjust = 1.2, size = 10, color = "white"
  ) +
  scale_fill_manual(values = c("#999999", "dodgerblue4"), name = "Experiment Group") +
  labs(
    y = "Percent of edit attempts completed ",
    x = "Experiment Group",
    title = "Edit completion rate",
    caption = "Limited to edits shown or eligible to be shown at least one Tone Check and not reverted"
  ) +
  theme(
    panel.grid.minor = element_blank(),
    panel.background = element_blank(),
    plot.title = element_text(hjust = 0.5),
    text = element_text(size = 24),
    axis.text.x = element_text(size = 24),
    axis.title.x = element_text(margin = margin(t = 20, unit = "pt")),
    legend.position = "none",
    axis.line = element_line(colour = "black")
  )

p
Edit completion rates for people shown Tone Check decreased only slightly: a 3.2% relative decrease (1.6 percentage points).
By if multiple checks were shown
Code
tone_check_completion_rate_bymulti <- tone_check_completion_rates %>%
  filter(tone_check_shown == 1 & test_group == 'test (tone check shown)') %>%
  group_by(test_group, multiple_checks_shown) %>%
  summarise(
    n_edits = n_distinct(editing_session),
    # NOTE: unlike the other breakdowns, this count does not filter on was_reverted == 0
    n_saves = n_distinct(editing_session[saved_edit > 0])
  ) %>%
  mutate(completion_rate = paste0(round(n_saves / n_edits * 100, 1), "%")) %>%
  gt() %>%
  tab_header(title = "Tone Check edit completion rate by if multiple checks were shown") %>%
  opt_stylize(5) %>%
  cols_label(
    test_group = "Experiment group",
    multiple_checks_shown = "Multiple Tone Checks shown",
    n_edits = "Number of edit attempts shown Tone Check",
    n_saves = "Number of published edits",
    completion_rate = "Proportion of edits saved"
  ) %>%
  tab_source_note(gt::md('Limited to edits shown at least one Tone Check and not reverted'))

display_html(as_raw_html(tone_check_completion_rate_bymulti))
Tone Check edit completion rate by if multiple checks were shown

| Experiment group | Multiple Tone Checks shown | Number of edit attempts shown Tone Check | Number of published edits | Proportion of edits saved |
|---|---|---|---|---|
| test (tone check shown) | one check shown | 2049 | 1287 | 62.8% |
| test (tone check shown) | multiple checks shown | 2798 | 1782 | 63.7% |

Limited to edits shown at least one Tone Check and not reverted
By number of checks shown
Code
tone_check_completion_rate_bynchecks <- tone_check_completion_rates %>%
  filter(tone_check_shown == 1 & test_group == 'test (tone check shown)') %>% # limit to Tone Checks shown and test group
  group_by(test_group, checks_shown_bucket) %>%
  summarise(
    n_edits = n_distinct(editing_session),
    n_saves = n_distinct(editing_session[saved_edit > 0 & was_reverted == 0])
  ) %>%
  mutate(completion_rate = paste0(round(n_saves / n_edits * 100, 1), "%")) %>%
  ungroup() %>%
  mutate(
    n_edits = ifelse(n_edits < 50, "<50", n_edits),
    n_saves = ifelse(n_saves < 50, "<50", n_saves)
  ) %>% # sanitizing per data publication guidelines
  group_by(test_group) %>%
  gt() %>%
  tab_header(title = "Tone Check edit completion rate by the number of checks shown") %>%
  opt_stylize(5) %>%
  cols_label(
    checks_shown_bucket = "Number of Tone Check shown",
    n_edits = "Number of edit attempts shown Tone Check",
    n_saves = "Number of published edits",
    completion_rate = "Proportion of edits saved"
  ) %>%
  tab_source_note(gt::md('Limited to edits shown at least one Tone Check in the test group and not reverted'))

display_html(as_raw_html(tone_check_completion_rate_bynchecks))
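Tone Check edit completion rate by the number of checks shown

| Experiment group | Number of Tone Check shown | Number of edit attempts shown Tone Check | Number of published edits | Proportion of edits saved |
|---|---|---|---|---|
| test (tone check shown) | 1 | 2049 | 1011 | 49.3% |
| test (tone check shown) | 2 | 1452 | 718 | 49.4% |
| test (tone check shown) | 3-5 | 834 | 394 | 47.2% |
| test (tone check shown) | 6-10 | 327 | 150 | 45.9% |
| test (tone check shown) | over 10 | 185 | 82 | 44.3% |

Limited to edits shown at least one Tone Check in the test group and not reverted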
The majority of published new content edits (73%) were shown two or fewer Tone Checks within a single editing session. When two or fewer checks are presented, we see only about a 1.6% relative decrease in edit completion rate.
The relative decrease in completion rate does not exceed 10% until more than 10 Tone Checks are presented in a single editing session. For these edits, the edit completion rate decreased to 44.3% (a 12% relative decrease from the control). However, editing sessions with over 10 Tone Checks represent only 3% of published edits where Tone Check was shown and are likely an indicator of very low quality edits that we’d want to deter.
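The shares quoted above can be recomputed from the published counts. The following is a minimal base R sketch, assuming the counts in the "by the number of checks shown" table are the inputs; the data frame and column names are illustrative and not part of the original analysis.

```r
# Counts copied from the "by the number of checks shown" table (test group only).
buckets <- data.frame(
  checks_shown = c("1", "2", "3-5", "6-10", "over 10"),
  n_edits      = c(2049, 1452, 834, 327, 185),
  n_saves      = c(1011, 718, 394, 150, 82)
)

# Completion rate within each bucket.
buckets$completion_rate <- buckets$n_saves / buckets$n_edits

# Share of all published (unreverted) edits that falls in each bucket.
buckets$share_of_saves <- buckets$n_saves / sum(buckets$n_saves)

# Share of published edits that were shown one or two Tone Checks.
share_two_or_fewer <- sum(buckets$share_of_saves[buckets$checks_shown %in% c("1", "2")])

print(buckets)
print(round(100 * share_two_or_fewer, 1))
```

Working from counts rather than from the rounded percentages avoids compounding rounding error when several buckets are combined.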
By Platform
Code
tone_check_completion_rate_byplatform <- tone_check_completion_rates %>%
  filter(tone_check_shown == 1) %>%
  group_by(platform, test_group) %>%
  summarise(
    n_edits = n_distinct(editing_session),
    n_saves = n_distinct(editing_session[saved_edit > 0 & was_reverted == 0])
  ) %>%
  mutate(completion_rate = paste0(round(n_saves / n_edits * 100, 1), "%")) %>%
  # mutate(n_saves = ifelse(n_saves < 50, "<50", n_saves)) %>% # sanitizing per data publication guideline
  # select(-c(3,4)) %>%
  gt() %>%
  tab_header(title = "Tone Check edit completion rate by platform") %>%
  opt_stylize(5) %>%
  cols_label(
    test_group = "Experiment Group",
    platform = "Platform",
    n_edits = "Number of edit attempts shown Tone Check",
    n_saves = "Number of published edits",
    completion_rate = "Proportion of edits saved"
  ) %>%
  tab_source_note(gt::md('Limited to edits shown or eligible to be shown at least one Tone Check and not reverted'))

display_html(as_raw_html(tone_check_completion_rate_byplatform))
Tone Check edit completion rate by platform

| Platform | Experiment Group | Number of edit attempts shown Tone Check | Number of published edits | Proportion of edits saved |
|---|---|---|---|---|
| mobile web | control (eligible but not shown tone check) | 864 | 336 | 38.9% |
| mobile web | test (tone check shown) | 1266 | 489 | 38.6% |
| desktop | control (eligible but not shown tone check) | 2983 | 1597 | 53.5% |
| desktop | test (tone check shown) | 3581 | 1866 | 52.1% |

Limited to edits shown or eligible to be shown at least one Tone Check and not reverted
The overall decrease was concentrated on desktop (-2.6%; -1.4pp), with almost no change in completion rates observed on mobile web (38.9% vs. 38.6%). Mobile users are nearly as likely to publish their edit whether they see a Tone Check or not.
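This report quotes changes both as relative percentages and as percentage points (pp). The sketch below shows how the two relate, using the counts from the by-platform table; `rate_change` is a hypothetical helper written for this illustration and is not part of the original analysis. Results may differ in the last digit from the figures quoted in the text, which appear to be derived from the rounded rates.

```r
# Counts copied from the "by platform" table.
platform_counts <- data.frame(
  platform      = c("mobile web", "desktop"),
  control_edits = c(864, 2983),
  control_saves = c(336, 1597),
  test_edits    = c(1266, 3581),
  test_saves    = c(489, 1866)
)

# Hypothetical helper: percentage-point and relative change between groups.
rate_change <- function(control_saves, control_edits, test_saves, test_edits) {
  control_rate <- control_saves / control_edits
  test_rate <- test_saves / test_edits
  data.frame(
    control_rate    = control_rate,
    test_rate       = test_rate,
    pp_change       = 100 * (test_rate - control_rate),               # absolute difference
    relative_change = 100 * (test_rate - control_rate) / control_rate # scaled by control rate
  )
}

# Per-platform changes.
by_platform <- cbind(
  platform = platform_counts$platform,
  with(platform_counts, rate_change(control_saves, control_edits, test_saves, test_edits))
)

# Overall change, pooling both platforms.
overall <- with(
  platform_counts,
  rate_change(sum(control_saves), sum(control_edits), sum(test_saves), sum(test_edits))
)

print(by_platform)
print(overall)
```

The percentage-point change is the plain difference between the two rates, while the relative change divides that difference by the control rate, which is why the relative figure is always larger in magnitude when the control rate is below 100%.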
By User Experience
Code
tone_check_completion_rate_byuserstatus <- tone_check_completion_rates %>%
  filter(tone_check_shown == 1) %>%
  group_by(experience_level_group, test_group) %>%
  summarise(
    n_edits = n_distinct(editing_session),
    n_saves = n_distinct(editing_session[saved_edit > 0 & was_reverted == 0])
  ) %>%
  mutate(completion_rate = paste0(round(n_saves / n_edits * 100, 1), "%")) %>%
  # select(-c(3,4)) %>% # data sanitizing for publication
  gt() %>%
  tab_header(title = "Tone check edit completion rate by user experience") %>%
  opt_stylize(5) %>%
  cols_label(
    test_group = "Experiment Group",
    experience_level_group = "User type", # label fixed: was a duplicate "Experiment Group"
    n_edits = "Number of edit attempts shown Tone Check",
    n_saves = "Number of published edits",
    completion_rate = "Proportion of edits saved"
  ) %>%
  tab_source_note(gt::md('Limited to edits shown or eligible to be shown at least one Tone Check and not reverted'))

display_html(as_raw_html(tone_check_completion_rate_byuserstatus))
Tone check edit completion rate by user experience

| User experience | Experiment Group | Number of edit attempts shown Tone Check | Number of published edits | Proportion of edits saved |
|---|---|---|---|---|
| Unregistered | control (eligible but not shown tone check) | 1144 | 433 | 37.8% |
| Unregistered | test (tone check shown) | 1613 | 555 | 34.4% |
| Newcomer | control (eligible but not shown tone check) | 712 | 323 | 45.4% |
| Newcomer | test (tone check shown) | 861 | 367 | 42.6% |
| Junior Contributor | control (eligible but not shown tone check) | 1991 | 1177 | 59.1% |
| Junior Contributor | test (tone check shown) | 2373 | 1433 | 60.4% |

Limited to edits shown or eligible to be shown at least one Tone Check and not reverted
The impacts of Tone Check on edit completion rate vary based on user experience. See relative changes below:
Unregistered: -9.0% decrease [-3.4pp]
Newcomers: -6.2% decrease [-2.8pp]
Junior Contributors: +2.2% increase [1.3pp]
While Tone Check coincided with slight decreases in edit completion rates for newcomers and unregistered users, it caused minimal disruption to Junior Contributors. We observed a +2.2% relative increase in the completion rate of Junior Contributors shown Tone Check, suggesting that the check may encourage and facilitate successful publishing for a subset of users.
By Partner Wikipedia
Code
tone_check_completion_rate_bywiki <- tone_check_completion_rates %>%
  filter(tone_check_shown == 1) %>%
  group_by(wiki, test_group) %>%
  summarise(
    n_edits = n_distinct(editing_session),
    n_saves = n_distinct(editing_session[saved_edit > 0 & was_reverted == 0])
  ) %>%
  mutate(completion_rate = paste0(round(n_saves / n_edits * 100, 1), "%")) %>%
  # filter(n_edits > 200) %>% # limit to wikis with sufficient events
  # select(-c(3,4)) %>% # data sanitizing for publication
  gt() %>%
  tab_header(title = "Tone Check edit completion rate by Wikipedia") %>%
  opt_stylize(5) %>%
  cols_label(
    test_group = "Experiment Group",
    wiki = "Wikipedia",
    n_edits = "Number of edit attempts shown Tone Check",
    n_saves = "Number of published edits",
    completion_rate = "Proportion of edits saved"
  ) %>%
  tab_source_note(gt::md('Limited to Wikipedias with at least 200 edit attempts during reviewed timeframe'))

display_html(as_raw_html(tone_check_completion_rate_bywiki))
Tone Check edit completion rate by Wikipedia

| Wikipedia | Experiment Group | Number of edit attempts shown Tone Check | Number of published edits | Proportion of edits saved |
|---|---|---|---|---|
| French Wikipedia | control (eligible but not shown tone check) | 2579 | 1245 | 48.3% |
| French Wikipedia | test (tone check shown) | 3332 | 1600 | 48% |
| Japanese Wikipedia | control (eligible but not shown tone check) | 704 | 420 | 59.7% |
| Japanese Wikipedia | test (tone check shown) | 902 | 476 | 52.8% |
| Portuguese Wikipedia | control (eligible but not shown tone check) | 564 | 268 | 47.5% |
| Portuguese Wikipedia | test (tone check shown) | 613 | 279 | 45.5% |

Limited to Wikipedias with at least 200 edit attempts during reviewed timeframe
The largest decrease in edit completion rate was at Japanese Wikipedia (-11.6% [-6.9pp]), while French Wikipedia saw almost no change (48.3% vs. 48%).
The change in edit completion rate tracks the change in revert rate at each Wikipedia. The largest decrease in completion occurred on Japanese Wikipedia, which also saw the most substantial decrease in revert rate. In contrast, French Wikipedia saw almost no change in edit completion rates and only a small decrease in revert rates.
This pattern suggests that Tone Check is effectively deterring some lower-quality edits that would have been reverted.
Confirming impact of Tone Check on edit completion rate
Code
# calculate the proportion for each user
tone_check_completion_rate_overall_byuser <- tone_check_completion_rates %>%
  filter(tone_check_shown == 1) %>% # limit to sessions where tone check was shown
  group_by(test_group, platform, user_id) %>%
  summarise(
    n_edits = n_distinct(editing_session),
    n_saves = n_distinct(editing_session[saved_edit > 0 & was_reverted == 0])
  ) %>%
  mutate(completion_rate = n_saves / n_edits)
Code
# rename field names to align with relax package naming convention
tone_check_completion_rate_overall_byuser <- tone_check_completion_rate_overall_byuser |>
  mutate(
    variation = factor(
      test_group,
      levels = c("control (eligible but not shown tone check)", "test (tone check shown)"),
      labels = c("control", "treatment")
    )
  )

tone_check_completion_rate_overall_byuser$outcome = tone_check_completion_rate_overall_byuser$completion_rate
Code
overall_impact_completes <- tone_check_completion_rate_overall_byuser |>
  analyze_relative_lift(metric_type = "proportion") |>
  gt() |>
  tab_header(
    title = md("**Evaluating Tone Check impact on edit completion rate**"),
    subtitle = md("Difference in Metric (Test Group - Control Group)")
  ) |>
  tab_spanner(
    label = md("**Bayesian Analysis**"),
    columns = c(estimate_bayes, chance_to_win, cred_lower, cred_upper)
  ) |>
  tab_spanner(
    label = md("**Frequentist Analysis**"),
    columns = c(estimate_freq, p_value, conf_lower, conf_upper)
  ) |>
  # Rename columns for clarity
  cols_label(
    estimate_bayes = md("Point Estimate"),
    chance_to_win = md("Chance to Win"),
    cred_lower = md("95% CI Lower"),
    cred_upper = md("95% CI Upper"),
    estimate_freq = md("Point Estimate"),
    p_value = md("*p*-value"),
    conf_lower = md("95% CI Lower"),
    conf_upper = md("95% CI Upper")
  ) |>
  # Apply formatting (decimals)
  fmt_number(
    columns = everything(),
    decimals = 3 # use 3 decimals for precision
  ) |>
  # Highlight key finding (intervals exclude zero)
  tab_footnote(
    footnote = md("The 95% intervals do not cross zero, indicating a statistically significant difference"),
    locations = cells_column_labels(columns = c(cred_lower, conf_lower))
  ) %>%
  # Style the table
  tab_options(
    table.border.top.color = "lightgray",
    column_labels.border.bottom.color = "black",
    column_labels.border.bottom.width = px(2),
    data_row.padding = px(5)
  )

display_html(as_raw_html(overall_impact_completes))
Evaluating Tone Check impact on edit completion rate
Difference in Metric (Test Group - Control Group)

| Bayesian: Point Estimate | Bayesian: Chance to Win | Bayesian: 95% CI Lower¹ | Bayesian: 95% CI Upper | Frequentist: Point Estimate | Frequentist: p-value | Frequentist: 95% CI Lower¹ | Frequentist: 95% CI Upper |
|---|---|---|---|---|---|---|---|
| −0.078 | 0.000 | −0.123 | −0.033 | −0.079 | 0.001 | −0.124 | −0.034 |

¹ The 95% intervals do not cross zero, indicating a statistically significant difference
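Results indicate that Tone Check introduced a small but statistically significant (p = 0.001) level of friction into the editing process. Estimates indicate that Tone Check likely decreased edit completion rate by 7.9 percentage points. Note that this estimate is based on per-user completion rates, so it is not directly comparable to the pooled session-level rates reported in the tables above.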
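The modelled estimate is built on one completion rate per user, whereas the descriptive tables pool all editing sessions. The two can differ when a few users account for many sessions. The toy example below, using made-up data and hypothetical column names, illustrates the difference; it does not reproduce the `analyze_relative_lift` computation.

```r
# Hypothetical toy data: one row per editing session.
# User "a" has 10 sessions (8 saved); users "b" and "c" have one session each.
sessions <- data.frame(
  user_id = c(rep("a", 10), "b", "c"),
  saved   = c(rep(1, 8), 0, 0, 0, 1)
)

# Pooled rate: every session counts equally.
pooled_rate <- mean(sessions$saved)

# Per-user rate: compute each user's completion rate first, then average
# across users, so every user counts equally regardless of session count.
per_user <- aggregate(saved ~ user_id, data = sessions, FUN = mean)
per_user_rate <- mean(per_user$saved)

print(pooled_rate)
print(per_user_rate)
```

In this toy case the pooled rate is pulled up by the one prolific user, while the per-user average weights the two single-session users just as heavily.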
Code
# check by platform numbers
platform_impact_completes <- tone_check_completion_rate_overall_byuser |>
  group_by(platform) |>
  group_modify(~ analyze_relative_lift(.x, metric_type = "proportion")) |>
  gt() |>
  tab_header(
    title = md("**Evaluating Tone Check impact on completion rate by platform**"),
    subtitle = md("Difference in Metric (Test Group - Control Group)")
  ) |>
  tab_spanner(
    label = md("**Bayesian Analysis**"),
    columns = c(estimate_bayes, chance_to_win, cred_lower, cred_upper)
  ) |>
  tab_spanner(
    label = md("**Frequentist Analysis**"),
    columns = c(estimate_freq, p_value, conf_lower, conf_upper)
  ) |>
  # Rename columns for clarity
  cols_label(
    platform = md("Platform"),
    estimate_bayes = md("Point Estimate"),
    chance_to_win = md("Chance to Win"),
    cred_lower = md("95% CI Lower"),
    cred_upper = md("95% CI Upper"),
    estimate_freq = md("Point Estimate"),
    p_value = md("*p*-value"),
    conf_lower = md("95% CI Lower"),
    conf_upper = md("95% CI Upper")
  ) |>
  # Apply formatting (decimals)
  fmt_number(
    columns = everything(),
    decimals = 3 # use 3 decimals for precision
  ) |>
  # Highlight key finding (inconclusive where intervals cross zero)
  tab_footnote(
    footnote = md("The 95% intervals cross zero, indicating no statistically conclusive difference."),
    locations = cells_column_labels(columns = c(cred_lower, conf_lower))
  ) %>%
  # Style the table
  tab_options(
    table.border.top.color = "lightgray",
    column_labels.border.bottom.color = "black",
    column_labels.border.bottom.width = px(2),
    data_row.padding = px(5)
  )

display_html(as_raw_html(platform_impact_completes))
Evaluating Tone Check impact on completion rate by platform
Difference in Metric (Test Group - Control Group)

| Platform | Bayesian: Point Estimate | Bayesian: Chance to Win | Bayesian: 95% CI Lower¹ | Bayesian: 95% CI Upper | Frequentist: Point Estimate | Frequentist: p-value | Frequentist: 95% CI Lower¹ | Frequentist: 95% CI Upper |
|---|---|---|---|---|---|---|---|---|
| mobile web | −0.024 | 0.342 | −0.140 | 0.092 | −0.025 | 0.679 | −0.143 | 0.093 |
| desktop | −0.082 | 0.000 | −0.131 | −0.034 | −0.083 | 0.001 | −0.131 | −0.035 |

¹ The 95% intervals cross zero, indicating no statistically conclusive difference.
The per-platform analysis reveals that the 3.2% overall decrease in completion rates is driven by desktop editors.
On desktop, we confirmed a statistically significant decrease in edit completion rate. Tone Check shown to users on desktop likely caused a decrease of around 8.3%. On mobile web, we observed a small decrease in edit completion rate that was not statistically significant.
Constructive Edit Rate
Hypothesis: A larger proportion of new content edits by Newcomers and Junior Contributors will be constructive because they will be made aware that the new text they’re attempting to publish needs to be written in a neutral tone, when they don’t first think/know to write in this way themselves.
Methodology: The proportion of all published edits by users with ≤100 cumulative edits in the main namespace on mobile web that are constructive (not reverted within 48 hours). Similar to revert rate, the analysis was limited to new content edits shown or eligible to be shown Tone Check so we can isolate data to edits that would be impacted by this feature.
Note: This metric is also the WE 1.1 Key Result. We will include Tone Check’s impact on this metric as part of our evaluation of the collective impact of interventions deployed under WE 1.1 on this metric.
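As an illustration of the definition above, the sketch below flags an edit as constructive when it was not reverted within 48 hours of being saved. The data and the column names (`saved_at`, `reverted_at`) are hypothetical, and treating a revert at exactly 48 hours as "within 48 hours" is an assumption of this sketch, not something the methodology specifies.

```r
# Hypothetical published edits; reverted_at is NA when the edit was never reverted.
edits <- data.frame(
  edit_id = 1:4,
  saved_at = as.POSIXct(
    c("2025-09-03 10:00:00", "2025-09-03 10:30:00",
      "2025-09-04 09:00:00", "2025-09-05 08:00:00"),
    tz = "UTC"
  ),
  reverted_at = as.POSIXct(
    c(NA, "2025-09-03 12:00:00", "2025-09-07 09:00:00", NA),
    tz = "UTC"
  )
)

# Hours between saving and reverting (NA when never reverted).
hours_to_revert <- as.numeric(difftime(edits$reverted_at, edits$saved_at, units = "hours"))

# Constructive: never reverted, or reverted only after the 48-hour window.
edits$constructive <- is.na(hours_to_revert) | hours_to_revert > 48

# Constructive edit rate: share of published edits that are constructive.
constructive_edit_rate <- mean(edits$constructive)

print(edits)
print(constructive_edit_rate)
```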
Overall
Code
tone_check_constructive_overall <- tone_check_publish_data |>
  filter(is_new_content == 1 & is_test_eligible == 'eligible') |> # limit to eligible edits
  group_by(test_group, user_id) |>
  summarise(
    n_edits = n_distinct(editing_session),
    n_const = n_distinct(editing_session[was_reverted == 0]) # edits not reverted
  ) |>
  mutate(constructive_edit_rate = n_const / n_edits) |>
  group_by(test_group) |>
  summarise(avg_rate = mean(constructive_edit_rate))
Code
# plot visualization of overall constructive edit rates
# tone_check_constructive_overall only carries avg_rate per test group,
# so the plot and its labels use that column
dodge <- position_dodge(width = 0.9)

p <- tone_check_constructive_overall |>
  ggplot(aes(x = test_group, y = avg_rate, fill = test_group)) +
  geom_col(position = 'dodge') +
  scale_y_continuous(labels = scales::percent) +
  geom_text(
    aes(label = paste0(round(avg_rate * 100, 1), "%"), fontface = 2),
    vjust = 1.2, size = 10, color = "white"
  ) +
  scale_fill_manual(values = c("#999999", "dodgerblue4"), name = "Experiment Group") +
  labs(
    y = "Percent of edits that were constructive ",
    x = "Experiment Group",
    title = "Constructive edit rate",
    caption = "Limited to published new content edits shown or eligible to be shown Tone Check"
  ) +
  theme(
    panel.grid.minor = element_blank(),
    panel.background = element_blank(),
    plot.title = element_text(hjust = 0.5),
    text = element_text(size = 24),
    axis.text.x = element_text(size = 24),
    axis.title.x = element_text(margin = margin(t = 20, unit = "pt")),
    legend.position = "none",
    axis.line = element_line(colour = "black")
  )

p
Overall, constructive edit rates increased by 6.2% (4.4 percentage points) for people shown Tone Check in the test group.
By platform
Code
tone_check_constructive_byplatform <- tone_check_publish_data |>
  filter(is_test_eligible == 'eligible') |>
  group_by(platform, test_group) |>
  summarise(
    n_edits = n_distinct(editing_session),
    n_const = n_distinct(editing_session[was_reverted == 0])
  ) |>
  mutate(constructive_edit_rate = paste0(round(n_const / n_edits * 100, 1), "%")) |>
  select(-c(3,4)) %>% # removing granular data columns for publication
  gt() |>
  tab_header(title = "Constructive edit rate by platform") |>
  opt_stylize(5) |>
  cols_label(
    test_group = "Test Group",
    platform = "Platform",
    # n_edits = "Number of published new content edits",
    # n_const = "Number of constructive edits",
    constructive_edit_rate = "Proportion of new content edits that were constructive"
  ) |>
  tab_source_note(gt::md('Limited to published new content edits shown or eligible to be shown Tone Check'))

display_html(as_raw_html(tone_check_constructive_byplatform))
Constructive edit rate by platform

| Platform | Test Group | Proportion of new content edits that were constructive |
|---|---|---|
| mobile web | control (eligible but not shown tone check) | 65.5% |
| mobile web | test (tone check shown) | 65.4% |
| desktop | control (eligible but not shown tone check) | 75% |
| desktop | test (tone check shown) | 79.8% |

Limited to published new content edits shown or eligible to be shown Tone Check
We continue to see differing trends on mobile web compared to desktop. Constructive edit rates increased on desktop while remaining essentially unchanged on mobile web (65.5% vs. 65.4%).
On desktop, the constructive edit rate increased by 6.4% (4.8 percentage points), while we observed no statistically significant change in mobile web constructive edits.
By user experience
Code
tone_check_constructive_byexp <- tone_check_publish_data |>
  filter(platform == 'desktop' & is_new_content == 1 & is_test_eligible == 'eligible') |> # desktop only
  group_by(experience_level_group, test_group) %>%
  summarise(
    n_edits = n_distinct(editing_session),
    n_const = n_distinct(editing_session[was_reverted == 0])
  ) |>
  mutate(constructive_edit_rate = paste0(round(n_const / n_edits * 100, 1), "%")) |>
  select(-c(3,4)) |> # removing granular data columns for publication
  gt() |>
  tab_header(title = "Constructive edit rate by user experience") %>%
  opt_stylize(5) |>
  cols_label(
    test_group = "Test Group",
    experience_level_group = "User type",
    # n_edits = "Number of published new content edits",
    # n_const = "Number of constructive edits",
    constructive_edit_rate = "Proportion of new content edits that were constructive"
  ) |>
  tab_source_note(gt::md('Limited to published new content edits shown or eligible to be shown Tone Check'))

display_html(as_raw_html(tone_check_constructive_byexp))
Constructive edit rate by user experience

| User type | Test Group | Proportion of new content edits that were constructive |
|---|---|---|
| Unregistered | control (eligible but not shown tone check) | 74.4% |
| Unregistered | test (tone check shown) | 69.1% |
| Newcomer | control (eligible but not shown tone check) | 79.6% |
| Newcomer | test (tone check shown) | 77.6% |
| Junior Contributor | control (eligible but not shown tone check) | 68% |
| Junior Contributor | test (tone check shown) | 81.4% |

Limited to published new content edits shown or eligible to be shown Tone Check
The increase in constructive edit rate appears to be primarily due to an increase in constructive edits by Junior Contributors shown Tone Check, where we observed a +14.8% increase [10.2 pp] across platforms. When limited to desktop edits (the data shown in the table above), there was a 19.7% increase [13.4 pp] in constructive edits by Junior Contributors.
By Partner Wikipedia
Code
tone_check_constructive_bywiki <- tone_check_publish_data |>
  filter(is_new_content == 1 & is_test_eligible == 'eligible') |>
  group_by(wiki, test_group) %>%
  summarise(
    n_edits = n_distinct(editing_session),
    n_const = n_distinct(editing_session[was_reverted == 0])
  ) |>
  mutate(constructive_edit_rate = paste0(round(n_const / n_edits * 100, 1), "%")) |>
  # filter(n_edits > 100) %>% # limit to wikis with sufficient events
  select(-c(3,4)) |> # removing granular data columns for publication
  gt() |>
  tab_header(title = "Constructive edit rate by Wikipedia") |>
  opt_stylize(5) |>
  cols_label(
    test_group = "Test Group",
    wiki = md("**Wikipedia**"),
    # n_edits = "Number of published new content edits",
    # n_const = "Number of constructive edits",
    constructive_edit_rate = "Proportion of new content edits that were constructive"
  )

display_html(as_raw_html(tone_check_constructive_bywiki))
Constructive edit rate by Wikipedia

| Wikipedia | Test Group | Proportion of new content edits that were constructive |
|---|---|---|
| French Wikipedia | control (eligible but not shown tone check) | 69.2% |
| French Wikipedia | test (tone check shown) | 71% |
| Japanese Wikipedia | control (eligible but not shown tone check) | 68.8% |
| Japanese Wikipedia | test (tone check shown) | 89.1% |
| Portuguese Wikipedia | control (eligible but not shown tone check) | 78.6% |
| Portuguese Wikipedia | test (tone check shown) | 83.1% |
Tone Check increased constructive edit rates at all three partner Wikipedias.
Aligned with the decreased revert rate findings, we confirmed that Tone Check has the biggest impact on constructive edit rates at Japanese Wikipedia, where there was a +29.5% increase in the constructive edit rate for users shown Tone Check compared to eligible edits in the control group.
Due to the small per-Wikipedia sample sizes, we are currently not able to confirm the statistical significance of the increases at any of these Wikipedias, but the direction and magnitude of change indicate that Tone Check is having a positive effect on edit quality at each partner Wikipedia.
Confirming impact of Tone Check on constructive edit rate
We also modeled the impact of Tone Check on constructive edit rates to confirm the magnitude and direction of Tone Check’s effect on a user completing a higher proportion of constructive edits. This helps account for random effects of the user and wiki.
Code
tone_check_constructive_overall_byuser <- tone_check_publish_data |>
  filter(is_test_eligible == 'eligible') |> # limit to eligible edits
  group_by(test_group, platform, user_id) |>
  summarise(
    n_edits = n_distinct(editing_session),
    n_const = n_distinct(editing_session[was_reverted == 0]) # edits not reverted
  ) |>
  mutate(constructive_edit_rate = n_const / n_edits)
Code
# rename test group field names to align with relax package naming convention
tone_check_constructive_overall_byuser <- tone_check_constructive_overall_byuser |>
  mutate(
    variation = factor(
      test_group,
      levels = c("control (eligible but not shown tone check)", "test (tone check shown)"),
      labels = c("control", "treatment")
    )
  )
Code
# create new column name to align with relax package naming
tone_check_constructive_overall_byuser$outcome = tone_check_constructive_overall_byuser$constructive_edit_rate
Code
# overall impact
overall_impact_const_edits <- tone_check_constructive_overall_byuser |>
  analyze_relative_lift(metric_type = "proportion", ci_level = 0.9) |>
  gt() |>
  tab_header(
    title = md("**Evaluating Tone Check impact on overall constructive edit rate**"),
    subtitle = md("Difference in Metric (Test Group - Control Group)")
  ) |>
  tab_spanner(
    label = md("**Bayesian Analysis**"),
    columns = c(estimate_bayes, chance_to_win, cred_lower, cred_upper)
  ) |>
  tab_spanner(
    label = md("**Frequentist Analysis**"),
    columns = c(estimate_freq, p_value, conf_lower, conf_upper)
  ) |>
  # Rename columns for clarity
  cols_label(
    estimate_bayes = md("Point Estimate"),
    chance_to_win = md("Chance to Win"),
    cred_lower = md("90% CI Lower"),
    cred_upper = md("90% CI Upper"),
    estimate_freq = md("Point Estimate"),
    p_value = md("*p*-value"),
    conf_lower = md("90% CI Lower"),
    conf_upper = md("90% CI Upper")
  ) |>
  # Apply formatting (decimals)
  fmt_number(
    columns = everything(),
    decimals = 3 # use 3 decimals for precision
  ) |>
  # Style the table
  tab_options(
    table.border.top.color = "lightgray",
    column_labels.border.bottom.color = "black",
    column_labels.border.bottom.width = px(2),
    data_row.padding = px(5)
  )

display_html(as_raw_html(overall_impact_const_edits))
Evaluating Tone Check impact on overall constructive edit rate
Difference in Metric (Test Group - Control Group)

| Bayesian: Point Estimate | Bayesian: Chance to Win | Bayesian: 90% CI Lower | Bayesian: 90% CI Upper | Frequentist: Point Estimate | Frequentist: p-value | Frequentist: 90% CI Lower | Frequentist: 90% CI Upper |
|---|---|---|---|---|---|---|---|
| 0.032 | 0.948 | 0.000 | 0.064 | 0.032 | 0.103 | 0.000 | 0.064 |
The Tone Check feature resulted in a slight increase in the overall constructive edit rate that falls just short of statistical significance at the 90% confidence level (p = 0.103; the lower bound of the 90% interval is 0.000). The Bayesian Chance to Win indicates a 94.8% probability that Tone Check increases the likelihood a user completes a constructive edit.
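To see how the reported interval and p-value relate, the sketch below back-calculates an approximate two-sided p-value from the frequentist point estimate and 90% interval in the table. It assumes a symmetric, normal-approximation interval, which may not be exactly how `analyze_relative_lift` computes its output, and it works from values rounded to three decimals.

```r
# Values copied from the frequentist columns of the table above.
estimate <- 0.032
ci_lower <- 0.000
ci_upper <- 0.064
ci_level <- 0.90

# Critical value for a two-sided interval at the given level.
z_crit <- qnorm(1 - (1 - ci_level) / 2)

# Standard error implied by the interval width (normal approximation).
std_error <- (ci_upper - ci_lower) / (2 * z_crit)

# Two-sided p-value for the null hypothesis of no difference.
z_stat <- estimate / std_error
p_value <- 2 * pnorm(-abs(z_stat))

# The interval excludes zero only if both bounds lie on the same side of zero.
excludes_zero <- ci_lower > 0 || ci_upper < 0

print(round(p_value, 3))
print(excludes_zero)
```

An interval whose bound sits at zero corresponds to a p-value right at the matching threshold (here 0.10), which is consistent with the reported p = 0.103 given rounding.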
Code
# check by platform numbers
platform_impact_constr_edits <- tone_check_constructive_overall_byuser |>
  group_by(platform) |>
  group_modify(~ analyze_relative_lift(.x, metric_type = "proportion", ci_level = 0.9)) |>
  gt() |>
  tab_header(
    title = md("**Evaluating Tone Check impact on constructive edit rate by platform**"),
    subtitle = md("Difference in Metric (Test Group - Control Group)")
  ) |>
  tab_spanner(
    label = md("**Bayesian Analysis**"),
    columns = c(estimate_bayes, chance_to_win, cred_lower, cred_upper)
  ) |>
  tab_spanner(
    label = md("**Frequentist Analysis**"),
    columns = c(estimate_freq, p_value, conf_lower, conf_upper)
  ) |>
  # Rename columns for clarity
  cols_label(
    platform = md("Platform"),
    estimate_bayes = md("Point Estimate"),
    chance_to_win = md("Chance to Win"),
    cred_lower = md("90% CI Lower"),
    cred_upper = md("90% CI Upper"),
    estimate_freq = md("Point Estimate"),
    p_value = md("*p*-value"),
    conf_lower = md("90% CI Lower"),
    conf_upper = md("90% CI Upper")
  ) |>
  # Apply formatting (decimals)
  fmt_number(
    columns = everything(),
    decimals = 3 # use 3 decimals for precision
  ) |>
  # Style the table
  tab_options(
    table.border.top.color = "lightgray",
    column_labels.border.bottom.color = "black",
    column_labels.border.bottom.width = px(2),
    data_row.padding = px(5)
  )

display_html(as_raw_html(platform_impact_constr_edits))
Evaluating Tone Check impact on constructive edit rate by platform
Difference in Metric (Test Group - Control Group)

| Platform | Bayesian: Point Estimate | Bayesian: Chance to Win | Bayesian: 90% CI Lower | Bayesian: 90% CI Upper | Frequentist: Point Estimate | Frequentist: p-value | Frequentist: 90% CI Lower | Frequentist: 90% CI Upper |
|---|---|---|---|---|---|---|---|---|
| mobile web | 0.015 | 0.630 | −0.060 | 0.091 | 0.016 | 0.737 | −0.061 | 0.092 |
| desktop | 0.030 | 0.926 | −0.004 | 0.064 | 0.030 | 0.147 | −0.004 | 0.064 |
Results suggest that Tone Check is most effective at increasing the quality of edits on desktop, where we see a positive trend. The Bayesian 92.6% Chance to Win suggests the tool is likely to be increasing constructive edits, although the 90% interval narrowly includes zero (p = 0.147).
Mobile web results are slightly directionally positive (+1.5 pp) but are not statistically significant. Consistent with other findings, presenting Tone Check on mobile web does not appear to be disruptive but has less of an impact on user behavior compared to desktop.
Constructive Retention Rate (Second Week)
Hypothesis: Newcomers and Junior Contributors will be more likely to return to publish a new content edit in the future that does not include non-neutral language because Tone Check will have caused them to realize when they are at risk of this not being true.
Methodology: First we reviewed the proportion of Newcomers and Junior Contributors that publish an edit on a main namespace where Tone Check was shown and successfully return to make an unreverted edit to a main namespace between 7 and 14 days after their first edit (second week retention).
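The retention definition above can be sketched on toy data as follows. The data frame, column names, and the inclusive 7 to 14 day window are assumptions made for this illustration; the production query may define the window boundaries differently.

```r
# Hypothetical published main-namespace edits, one row per edit.
edits <- data.frame(
  user_id = c("a", "a", "b", "b", "c", "d", "d"),
  edit_date = as.Date(c(
    "2025-09-03", "2025-09-12", # a: returns on day 9, unreverted
    "2025-09-03", "2025-09-05", # b: returns on day 2 only (too early)
    "2025-09-04",               # c: never returns
    "2025-09-04", "2025-09-13"  # d: returns on day 9, but that edit is reverted
  )),
  was_reverted = c(0, 0, 0, 0, 0, 0, 1)
)

# Days elapsed since each user's first edit.
edit_day <- as.numeric(edits$edit_date)
first_day <- ave(edit_day, edits$user_id, FUN = min)
edits$days_since_first <- edit_day - first_day

# A qualifying return edit is unreverted and falls in the second week.
edits$is_return <- edits$days_since_first >= 7 &
  edits$days_since_first <= 14 &
  edits$was_reverted == 0

# A user is retained if they have at least one qualifying return edit.
retained <- tapply(edits$is_return, edits$user_id, any)
retention_rate <- mean(retained)

print(retained)
print(retention_rate)
```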
constructive_retention_overall_table <- constructive_retention_overall %>%
  gt() %>%
  tab_header(title = "Constructive second week retention rate") %>%
  cols_label(
    test_group = "Experiment group",
    return_editors = "Number of editors that returned second week",
    editors = "Number of first week editors",
    retention_rate = "Retention rate"
  ) %>%
  opt_stylize(5) %>%
  tab_footnote(
    footnote = "Limited to users shown or eligible to be shown at least one Tone Check their first week",
    locations = cells_column_labels(columns = 'retention_rate')
  )

display_html(as_raw_html(constructive_retention_overall_table))
Constructive second week retention rate

| Experiment group | Number of editors that returned second week | Number of first week editors | Retention rate¹ |
|---|---|---|---|
| control (eligible but not shown tone check) | 115 | 1995 | 5.8% |
| test (tone check shown) | 167 | 2309 | 7.2% |

¹ Limited to users shown or eligible to be shown at least one Tone Check their first week
People who encountered Tone Check were 24% more likely to return to make a constructive edit in their second week: 7.2% of people shown Tone Check in the test group returned to make a subsequent constructive edit, compared to 5.8% in the control group (+1.4 percentage points).
This suggests that rather than discouraging users, Tone Check may make them feel more supported or successful in their contributions, leading them to return at higher rates.
constructive_retention_byplatform_table <- constructive_retention_byplatform %>%
  select(-c(3,4)) |> # removing granular data columns for publication
  gt() %>%
  tab_header(title = "Constructive second week retention rate by platform") %>%
  cols_label(
    test_group = "Experiment group",
    platform = "Platform",
    # return_editors = "Number of editors that returned second week",
    # editors = "Number of first week editors",
    retention_rate = "Retention rate"
  ) %>%
  opt_stylize(5) %>%
  tab_footnote(
    footnote = "Limited to users shown or eligible to be shown at least one Tone Check",
    locations = cells_column_labels(columns = 'retention_rate')
  )

display_html(as_raw_html(constructive_retention_byplatform_table))
Constructive second week retention rate by platform

| Platform | Experiment group | Retention rate¹ |
|---|---|---|
| mobile web | control (eligible but not shown tone check) | 3.1% |
| mobile web | test (tone check shown) | 5.2% |
| desktop | control (eligible but not shown tone check) | 6.8% |
| desktop | test (tone check shown) | 7.9% |

¹ Limited to users shown or eligible to be shown at least one Tone Check
While we don’t have sufficient sample size to confirm statistical significance on a per platform basis, we observed high relative increases in constructive retention rate on both platforms.
constructive_retention_byuserexp_table <- constructive_retention_byuserexp %>%
  select(-c(3,4)) |> # removing granular data columns for publication
  gt() %>%
  tab_header(title = "Constructive second week retention rate by user experience") %>%
  cols_label(
    test_group = "Experiment group",
    experience_level_group = "Experience level group",
    # return_editors = "Number of editors that returned second week",
    # editors = "Number of first week editors",
    retention_rate = "Retention rate"
  ) %>%
  opt_stylize(5) %>%
  tab_footnote(
    footnote = "Limited to users shown or eligible to be shown at least one Tone Check",
    locations = cells_column_labels(columns = 'retention_rate')
  )

display_html(as_raw_html(constructive_retention_byuserexp_table))
Constructive second week retention rate by user experience

| Experience level group | Experiment group | Retention rate¹ |
|---|---|---|
| Unregistered | control (eligible but not shown tone check) | 0.4% |
| Unregistered | test (tone check shown) | 1.3% |
| Newcomer | control (eligible but not shown tone check) | 2.7% |
| Newcomer | test (tone check shown) | 3.8% |
| Junior Contributor | control (eligible but not shown tone check) | 9.5% |
| Junior Contributor | test (tone check shown) | 11% |

¹ Limited to users shown or eligible to be shown at least one Tone Check
We observed increases in retention rate across all user groups as well.
Confirming Impact on Retention Rate
Because retention outcomes can be assumed to be independent of one another (each user can be retained only once), we use a simple test of proportions to confirm significance.
Code
# reframe data for model
constructive_retention_overall <- constructive_retention_rate %>%
  group_by(test_group) %>%
  summarise(
    return_editors = sum(return_editors),
    editors = sum(editors),
    retention_rate = return_editors / editors
  )
Code
# Extract the vectors
successes <- constructive_retention_overall$return_editors
totals <- constructive_retention_overall$editors

# Run Proportion Test
res <- prop.test(x = successes, n = totals, conf.level = 0.90)
res
The result is directionally positive and shows a clear improvement in constructive retention.
We are 90% confident that Tone Check results in a relative increase in retention of between 3.3% and 47.7%.
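As a minimal sketch of how such a test behaves, the example below runs `prop.test` on hypothetical counts. The counts are invented for illustration and are not the experiment's data. How the relative interval quoted above was derived is not shown in this section, so the sketch only illustrates the test call, the interval on the absolute difference, and a point estimate of relative lift.

```r
# Hypothetical counts, for illustration only (NOT the experiment's data)
successes <- c(control = 120, test = 160)    # editors who returned
totals    <- c(control = 4000, test = 4000)  # first-week editors

# Two-sample test of equal proportions with a 90% confidence level
res <- prop.test(x = successes, n = totals, conf.level = 0.90)

# prop.test reports the interval for (control rate - test rate),
# so a fully negative interval means the test group's rate is higher.
res$conf.int

# Point estimate of the relative lift in the test group
rates <- successes / totals
relative_lift <- unname((rates["test"] - rates["control"]) / rates["control"])
relative_lift
```

Note that `prop.test` returns a confidence interval for the absolute difference in proportions; converting that to an interval on the relative increase requires an additional step that depends on the method chosen.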
Guardrails
We identified a set of 5 guardrails to make sure that Tone Check is not negatively impacting people's experience completing an edit or causing disruption on the wikis. These were identified through a pre-mortem task completed at the beginning of the project. We’ve confirmed that Tone Check did not cause any decrease in edit quality (see New Content edit rate section) or a decrease in edit completion rate greater than 20% (see edit completion rate section). We also confirmed that Tone Check did not result in high block rates or false positive rates (see sections below).
Guardrail #1: False Positive Rate
Description: Proportion of contributors that decline revising the text they have drafted and indicate that it was irrelevant.
Methodology: For this check, we define a false positive as the proportion of contributors that decline revising the text they have drafted (event.feature = 'editCheck-tone' AND event.action = 'action-dismiss') and select “The tone is appropriate” when declining the check. We further limited the analysis to edits that were not reverted within 48 hours (an indicator of a quality edit).
Overall
Code
# overall dismissal rate
tone_check_false_positive_overall <- tone_check_reject_data %>%
  filter(was_tone_check_shown == 1 & is_new_content == 1) %>% # limit to where shown
  summarise(
    n_edits = n_distinct(editing_session),
    # at least one Tone Check declined as appropriate and edit not reverted
    n_rejects = n_distinct(editing_session[n_rejects > 0 & reject_reason == 'The tone is appropriate' & was_reverted == 0])
  ) %>%
  mutate(dismissal_rate = paste0(round(n_rejects / n_edits * 100, 1), "%")) %>%
  gt() %>%
  tab_header(
    title = "Edits where user declined to revise text because the tone was appropriate"
  ) %>%
  opt_stylize(5) %>%
  cols_label(
    n_edits = "Number of edits shown Tone check",
    n_rejects = "Number of edits that declined Tone Check as irrelevant",
    dismissal_rate = "Decline Rate"
  ) %>%
  tab_source_note(
    gt::md('Limited to unreverted published edits where at least one Tone Check was shown')
  )

display_html(as_raw_html(tone_check_false_positive_overall))

Edits where user declined to revise text because the tone was appropriate

| Number of edits shown Tone check | Number of edits that declined Tone Check as irrelevant | Decline Rate |
|---|---|---|
| 1729 | 283 | 16.4% |

Limited to unreverted published edits where at least one Tone Check was shown
Editors declined a Tone Check and selected “The tone is appropriate” in 16.4% of all published edits where Tone Check was shown. This excludes edits that were reverted within 48 hours.
For comparison, this is higher than the rates observed for Reference Check (6.6% of editors indicated that the content they were adding did not require a reference) and lower than Paste Check (30% of editors indicated that they wrote the content).
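The false positive definition above can be sketched on a small toy dataset. The rows below are invented for illustration; only the column names mirror those used in the report's code. The last line recomputes the published overall rate from the counts in the table (283 of 1729).

```r
# Toy data, invented for illustration; column names mirror the report's code
sessions <- data.frame(
  editing_session = c("a", "b", "c", "d", "e"),
  n_rejects       = c(1, 0, 2, 1, 0),
  reject_reason   = c("The tone is appropriate", NA, "None applies",
                      "The tone is appropriate", NA),
  was_reverted    = c(0, 0, 0, 1, 0),
  stringsAsFactors = FALSE
)

# A session counts as a false positive when a check was declined,
# the reason was "The tone is appropriate", and the edit was not reverted.
# %in% is used so that missing reasons evaluate to FALSE rather than NA.
is_false_positive <- with(
  sessions,
  n_rejects > 0 & reject_reason %in% "The tone is appropriate" & was_reverted == 0
)

n_edits <- length(unique(sessions$editing_session))
n_false_positive <- length(unique(sessions$editing_session[is_false_positive]))
false_positive_rate <- round(n_false_positive / n_edits * 100, 1)

# Recompute the published overall rate from the table counts
published_rate <- round(283 / 1729 * 100, 1)
```

In the toy data only session "a" qualifies: session "c" gave a different reason and session "d" was reverted.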
By Platform
Code
# platform false positive rate
tone_check_false_positive_byplatform <- tone_check_reject_data %>%
  group_by(platform) %>%
  filter(was_tone_check_shown == 1 & is_new_content == 1) %>% # limit to where shown
  summarise(
    n_edits = n_distinct(editing_session),
    # at least one Tone Check declined as appropriate and edit not reverted
    n_rejects = n_distinct(editing_session[n_rejects > 0 & reject_reason == 'The tone is appropriate' & was_reverted == 0])
  ) %>%
  mutate(dismissal_rate = paste0(round(n_rejects / n_edits * 100, 1), "%")) %>%
  select(-c(2,3)) |> # removing granular data columns for publication
  gt() %>%
  tab_header(
    title = "Edits where user declined to revise text because the tone was appropriate by platform"
  ) %>%
  opt_stylize(5) %>%
  cols_label(
    platform = "Platform",
    #n_edits = "Number of edits shown Tone check",
    #n_rejects = "Number of edits that declined Tone Check as irrelevant",
    dismissal_rate = "Decline Rate"
  ) %>%
  tab_source_note(
    gt::md('Limited to unreverted published edits where at least one Tone Check was shown')
  )

display_html(as_raw_html(tone_check_false_positive_byplatform))

Edits where user declined to revise text because the tone was appropriate by platform

| Platform | Decline Rate |
|---|---|
| mobile web | 15% |
| desktop | 16.8% |

Limited to unreverted published edits where at least one Tone Check was shown
Rates are similar on mobile web (15%) and desktop (16.8%).
By User Experience
Code
# user experience false positive rate
tone_check_false_positive_byuserexp <- tone_check_reject_data %>%
  group_by(experience_level_group) %>%
  filter(was_tone_check_shown == 1 & is_new_content == 1) %>% # limit to where shown
  summarise(
    n_edits = n_distinct(editing_session),
    # at least one Tone Check declined as appropriate and edit not reverted
    n_rejects = n_distinct(editing_session[n_rejects > 0 & reject_reason == 'The tone is appropriate' & was_reverted == 0])
  ) %>%
  mutate(dismissal_rate = paste0(round(n_rejects / n_edits * 100, 1), "%")) %>%
  select(-c(2,3)) |> # removing granular data columns for publication
  gt() %>%
  tab_header(
    title = "Edits where user declined to revise text because the tone was appropriate by user experience"
  ) %>%
  opt_stylize(5) %>%
  cols_label(
    experience_level_group = "User experience",
    #n_edits = "Number of edits shown Tone check",
    #n_rejects = "Number of edits that declined Tone Check as irrelevant",
    dismissal_rate = "Decline Rate"
  ) %>%
  tab_source_note(
    gt::md('Limited to unreverted published edits where at least one Tone Check was shown')
  )

display_html(as_raw_html(tone_check_false_positive_byuserexp))

Edits where user declined to revise text because the tone was appropriate by user experience

| User experience | Decline Rate |
|---|---|
| Unregistered | 18.2% |
| Newcomer | 16% |
| Junior Contributor | 16% |

Limited to unreverted published edits where at least one Tone Check was shown
By Partner Wikipedia
Code
# per wiki false positive rate
tone_check_false_positive_bywiki <- tone_check_reject_data %>%
  group_by(wiki) %>%
  filter(was_tone_check_shown == 1 & is_new_content == 1) %>% # limit to where shown
  summarise(
    n_edits = n_distinct(editing_session),
    # at least one Tone Check declined as appropriate and edit not reverted
    n_rejects = n_distinct(editing_session[n_rejects > 0 & reject_reason == 'The tone is appropriate' & was_reverted == 0])
  ) %>%
  mutate(dismissal_rate = paste0(round(n_rejects / n_edits * 100, 1), "%")) %>%
  select(-c(2,3)) |> # removing granular data columns for publication
  gt() %>%
  tab_header(
    title = "Edits where user declined to revise text because the tone was appropriate by partner Wikipedia"
  ) %>%
  opt_stylize(5) %>%
  cols_label(
    wiki = "Wikipedia",
    #n_edits = "Number of edits shown Tone check",
    #n_rejects = "Number of edits that declined Tone Check as irrelevant",
    dismissal_rate = "Decline Rate"
  ) %>%
  tab_source_note(
    gt::md('Limited to unreverted published edits where at least one Tone Check was shown')
  )

display_html(as_raw_html(tone_check_false_positive_bywiki))

Edits where user declined to revise text because the tone was appropriate by partner Wikipedia

| Wikipedia | Decline Rate |
|---|---|
| French Wikipedia | 17.4% |
| Japanese Wikipedia | 7.3% |
| Portuguese Wikipedia | 19.1% |

Limited to unreverted published edits where at least one Tone Check was shown
At Japanese Wikipedia, editors declined Tone Check as irrelevant (tone was appropriate) in only 7.3% of edits where it was shown. This is notably lower than the rates observed for French Wikipedia (17.4%) and Portuguese Wikipedia (19.1%).
Notably, Japanese Wikipedia is also where we observed the highest increase in edit quality as measured by decrease in revert rates.
Guardrail #2: Block Rate
Description: Proportion of contributors blocked after publishing an edit where Tone Check was shown, compared to contributors eligible but not shown Tone Check.
Methodology: We gathered all edits where Edit Check was shown from the mediawiki_revision_change_tag table and joined them with mediawiki_private_cu_changes to gather user name info. We then reviewed both global and local blocks made within 6 hours of the Tone Check event, as identified in the logging table.
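The 6-hour window described in the methodology can be sketched as follows. This is a minimal illustration on invented rows: the data frame, its column names (`check_time`, `block_time`), and the choice to treat the window as inclusive are all assumptions, not the report's actual schema.

```r
# Toy data, invented for illustration (hypothetical column names)
events <- data.frame(
  user = c("u1", "u2", "u3"),
  check_time = as.POSIXct(
    c("2025-09-10 10:00:00", "2025-09-10 12:00:00", "2025-09-11 08:00:00"),
    tz = "UTC"
  ),
  block_time = as.POSIXct(
    c("2025-09-10 13:30:00", "2025-09-11 12:00:00", NA),  # NA = never blocked
    tz = "UTC"
  ),
  stringsAsFactors = FALSE
)

# Hours between the check event and the block (NA when there was no block)
hours_to_block <- as.numeric(
  difftime(events$block_time, events$check_time, units = "hours")
)

# Flag users blocked within 6 hours after the check event
events$blocked_within_6h <- !is.na(hours_to_block) &
  hours_to_block >= 0 & hours_to_block <= 6

block_rate <- mean(events$blocked_within_6h)
```

Here only `u1` is counted: `u2` was blocked a day later and `u3` was never blocked.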
Code
# load data for assessing blocks
edit_check_blocks <- read.csv(
  file = 'data/tone_check_eligible_users_blocked.csv',
  header = TRUE,
  sep = ",",
  stringsAsFactors = FALSE
)
Code
# rename experiment field to clarify
edit_check_blocks <- edit_check_blocks %>%
  mutate(test_group = factor(
    bucket,
    levels = c('2025-09-editcheck-tone-control', '2025-09-editcheck-tone-test'),
    labels = c("control (eligible but not shown tone check)", "test (tone check shown)")
  ))
Code
edit_check_local_blocks_overall <- edit_check_blocks %>%
  #filter(user_id == 0) %>% #filter to identify logged out users
  group_by(test_group) %>%
  summarise(
    # users with a local or global block
    blocked_users = n_distinct(ip[is_local_blocked == 'True' | is_global_blocked == 'True']),
    all_users = n_distinct(ip)
  ) %>%
  mutate(prop_blocks = paste0(round(blocked_users / all_users * 100, 1), "%")) %>%
  select(-c(2,3)) %>% # removing granular data columns
  gt() %>%
  tab_header(
    title = "Proportion of users blocked by experiment group"
  ) %>%
  opt_stylize(5) %>%
  cols_label(
    test_group = "Test Group",
    prop_blocks = "Proportion of users blocked"
  ) %>%
  tab_source_note(
    gt::md('Limited to users blocked within 6 hours of publishing an edit where Tone Check was shown')
  )

display_html(as_raw_html(edit_check_local_blocks_overall))
Proportion of users blocked by experiment group

| Test Group | Proportion of users blocked |
|---|---|
| control (eligible but not shown tone check) | 0.8% |
| test (tone check shown) | 0.8% |

Limited to users blocked within 6 hours of publishing an edit where Tone Check was shown
People shown Tone Check are not blocked at higher rates than users in the control group. 0.8% of users were blocked in both the test and control groups.
Appendix
We reviewed a number of additional secondary metrics or curiosities. These are used to learn more about the impact of Tone Check on editing behavior, but are not primary targets of the intervention.
Tone Check Decline Rates and Reasons
Hypothesis: Knowing the reasons why people elect not to revise tone when the Check prompts them to do so (by platform) will help us decide what, if anything, can be done to decrease the proportion of people on desktop who do so.
Methodology: We reviewed the proportion of published new content edits shown Tone Check in which people elected not to revise the tone of the text they added (i.e. the Tone Check was dismissed), split by the decline reason the user selected.
This was determined by edits where the user dismissed a Tone Check at least once in a session (event.feature = 'editCheck-tone' AND event.action = 'action-dismiss'). The analysis includes splits by the reason the user selected for dismissing the check.
Code
# load data for assessing edit reject frequency
tone_check_reject_data_1 <- read.csv(
  file = 'data/tone_check_rejects_data_ab.tsv',
  header = TRUE,
  sep = "\t",
  stringsAsFactors = FALSE
)
# Combine the two datasets
tone_check_reject_data <- rbind(tone_check_reject_data_1, tone_check_reject_data_2)
Code
# Set experience level group and factor levels
tone_check_reject_data <- tone_check_reject_data %>%
  mutate(
    experience_level_group = case_when(
      user_edit_count == 0 & user_status == 'registered' ~ 'Newcomer',
      user_edit_count == 0 & user_status == 'unregistered' ~ 'Unregistered',
      user_edit_count > 0 & user_edit_count <= 100 ~ "Junior Contributor",
      user_edit_count > 100 ~ "Non-Junior Contributor"
    ),
    experience_level_group = factor(
      experience_level_group,
      levels = c("Unregistered", "Newcomer", "Non-Junior Contributor", "Junior Contributor")
    )
  )

# rename experiment field to clarify
tone_check_reject_data <- tone_check_reject_data %>%
  mutate(test_group = factor(
    test_group,
    levels = c('2025-09-editcheck-tone-control', '2025-09-editcheck-tone-test'),
    labels = c("control (eligible but not shown Tone Check)", "test (shown Tone Check)")
  ))

# rename platform from phone to mobile web to clarify meaning
tone_check_reject_data <- tone_check_reject_data %>%
  mutate(platform = factor(
    platform,
    levels = c('phone', 'desktop'),
    labels = c("mobile web", "desktop")
  ))

# rename Wiki names
tone_check_reject_data <- tone_check_reject_data %>%
  mutate(wiki = recode(wiki, !!!wiki_name_lookup))
Code
# Set fields and factor levels to assess number of checks shown
tone_check_reject_data <- tone_check_reject_data %>%
  mutate(
    multiple_checks_shown = ifelse(n_checks_shown > 1, "multiple checks shown", "single check shown"),
    multiple_checks_shown = factor(
      multiple_checks_shown,
      levels = c("single check shown", "multiple checks shown")
    )
  )

# note these buckets can be adjusted as needed based on distribution of data
tone_check_reject_data <- tone_check_reject_data %>%
  mutate(
    checks_shown_bucket = case_when(
      is.na(n_checks_shown) ~ '0',
      n_checks_shown == 1 ~ '1',
      n_checks_shown == 2 ~ '2',
      n_checks_shown > 2 & n_checks_shown <= 5 ~ "3-5",
      n_checks_shown > 5 & n_checks_shown <= 10 ~ "6-10",
      n_checks_shown > 10 ~ "over 10"
    ),
    checks_shown_bucket = factor(
      checks_shown_bucket,
      levels = c("0", "1", "2", "3-5", "6-10", "over 10")
    )
  )
Code
# shorten and clarify reason field names
tone_check_reject_data <- tone_check_reject_data %>%
  mutate(
    reject_reason = case_when(
      reject_reason == 'no_reject_reason' ~ 'No reason provided',
      reject_reason == 'edit-check-feedback-reason-other' ~ 'None applies',
      reject_reason == 'edit-check-feedback-reason-appropriate' ~ 'The tone is appropriate',
      reject_reason == 'edit-check-feedback-reason-uncertain' ~ 'Not sure how to revise tone'
    ),
    reject_reason = factor(
      reject_reason,
      levels = c("No reason provided", "None applies", "The tone is appropriate", "Not sure how to revise tone")
    )
  )
Overall
Code
# overall dismissal rate
tone_check_dismissal_overall <- tone_check_reject_data %>%
  filter(was_tone_check_shown == 1 & is_new_content == 1) %>% # limit to where shown
  summarise(
    n_edits = n_distinct(editing_session),
    n_rejects = n_distinct(editing_session[n_rejects > 0]) # at least one Tone Check declined
  ) %>%
  mutate(dismissal_rate = paste0(round(n_rejects / n_edits * 100, 1), "%")) %>%
  gt() %>%
  tab_header(
    title = "Tone Check decline rate"
  ) %>%
  opt_stylize(5) %>%
  cols_label(
    n_edits = "Number of edits shown Tone check",
    n_rejects = "Number of edits that declined Tone Check",
    dismissal_rate = "Proportion of edits where Tone Check was declined"
  ) %>%
  tab_source_note(
    gt::md('Limited to published edits where at least one Tone Check was shown')
  )

display_html(as_raw_html(tone_check_dismissal_overall))

Tone Check decline rate

| Number of edits shown Tone check | Number of edits that declined Tone Check | Proportion of edits where Tone Check was declined |
|---|---|---|
| 1729 | 646 | 37.4% |

Limited to published edits where at least one Tone Check was shown
Tone Check was declined in 37.4% of all new content edits where Tone Check was shown at least once during an editing session. This is lower than the rates reported for other available checks, including Paste Check (54.8% decline rate).
Code
### By decline reason
tone_check_dismissal_byreason_overall <- tone_check_reject_data %>%
  filter(is_new_content == 1 & was_tone_check_shown == 1 & n_rejects > 0) %>% # limit to where shown and user elected to not revise text
  group_by(reject_reason) %>%
  summarise(n_edits_rejected = n_distinct(editing_session)) %>%
  mutate(select_rate = paste0(round(n_edits_rejected / sum(n_edits_rejected) * 100, 1), "%"))
Code
# plot bar chart of reason selection
dodge <- position_dodge(width = 0.9)

p <- tone_check_dismissal_byreason_overall %>%
  ggplot(aes(x = reject_reason, y = n_edits_rejected / sum(n_edits_rejected))) +
  geom_col(position = 'dodge', fill = 'dodgerblue4') +
  scale_y_continuous(labels = scales::percent) +
  geom_text(aes(label = paste(select_rate, "\n", n_edits_rejected, "edits"), fontface = 2), vjust = 1.2, size = 8, color = "white") +
  scale_fill_manual(values = cbPalette, name = "Reason") +
  labs(
    y = "Percent of edits ",
    x = "Selected reason",
    title = "Reasons users selected for not revising text",
    caption = "Limited to published edits where a user elected to not revise text"
  ) +
  theme(
    panel.grid.minor = element_blank(),
    panel.background = element_blank(),
    plot.title = element_text(hjust = 0.5),
    text = element_text(size = 24),
    axis.text.x = element_text(size = 18),
    axis.title.x = element_text(margin = margin(t = 20, unit = "pt")),
    legend.position = "none",
    axis.line = element_line(colour = "black")
  )

p
Editors selected “The tone is appropriate” in over half (58.9%) of all published new content edits where the user elected to not revise their text.
By whether multiple checks were shown
Code
tone_check_dismissal_bymultiple <- tone_check_reject_data %>%
  filter(is_new_content == 1 & was_tone_check_shown == 1) %>% # limit to where shown
  group_by(multiple_checks_shown) %>%
  summarise(
    n_edits = n_distinct(editing_session),
    # at least one Tone Check declined and edit not reverted
    n_rejects = n_distinct(editing_session[n_rejects > 0 & was_reverted == 0])
  ) %>%
  mutate(dismissal_rate = paste0(round(n_rejects / n_edits * 100, 1), "%")) %>%
  gt() %>%
  tab_header(
    title = "Tone Check decline rate by if multiple checks shown"
  ) %>%
  opt_stylize(5) %>%
  cols_label(
    multiple_checks_shown = "Multiple Checks",
    n_edits = "Number of edits shown Tone Check",
    n_rejects = "Number of edits that declined Tone Check",
    dismissal_rate = "Proportion of edits where Tone Check was declined"
  ) %>%
  tab_source_note(
    gt::md('Limited to published edits where at least one Tone Check was shown')
  )

display_html(as_raw_html(tone_check_dismissal_bymultiple))

Tone Check decline rate by if multiple checks shown

| Multiple Checks | Number of edits shown Tone Check | Number of edits that declined Tone Check | Proportion of edits where Tone Check was declined |
|---|---|---|---|
| single check shown | 512 | 178 | 34.8% |
| multiple checks shown | 1217 | 292 | 24% |

Limited to published edits where at least one Tone Check was shown
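As we also observed in the leading indicators report, the decline rate is lower when multiple checks were shown (24%, versus 34.8% when a single check was shown). Note that the declined counts in this table exclude edits that were later reverted.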
Edits where multiple checks are shown are likely longer edits where the user may have more incentive to ensure their edit does not get reverted.
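As a quick arithmetic check, the decline rates and the gap between them can be recomputed from the counts published in the table above. This is a minimal sketch; the variable names are our own.

```r
# Counts taken from the "Tone Check decline rate by if multiple checks shown" table
counts <- data.frame(
  group     = c("single check shown", "multiple checks shown"),
  n_edits   = c(512, 1217),
  n_rejects = c(178, 292),
  stringsAsFactors = FALSE
)

# Decline rate per group, as a percentage rounded to one decimal place
raw_rates <- counts$n_rejects / counts$n_edits
counts$decline_rate_pct <- round(raw_rates * 100, 1)

# Gap between single-check and multiple-check sessions, in percentage points
gap_pp <- round((raw_rates[1] - raw_rates[2]) * 100, 1)

print(counts)
gap_pp
```

Reporting the gap in percentage points keeps it distinct from a relative change, which would be computed against the single-check rate.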
By Platform
Code
tone_check_dismissal_byplatform <- tone_check_reject_data %>%
  filter(is_new_content == 1 & was_tone_check_shown == 1) %>% # limit to where shown
  group_by(platform) %>%
  summarise(
    n_edits = n_distinct(editing_session),
    n_rejects = n_distinct(editing_session[n_rejects > 0]) # at least one Tone Check declined
  ) %>%
  mutate(dismissal_rate = paste0(round(n_rejects / n_edits * 100, 1), "%")) %>%
  ungroup() %>%
  #mutate(n_edits = ifelse(n_edits < 50, "<50", n_edits),
  #n_rejects = ifelse(n_rejects < 50, "<50", n_rejects)) %>% #sanitizing per data publication guidelines
  #select(-2) %>%
  gt() %>%
  tab_header(
    title = "Tone Check decline rate by platform"
  ) %>%
  opt_stylize(5) %>%
  cols_label(
    platform = "Platform",
    n_edits = "Number of edits shown Tone check",
    n_rejects = "Number of edits that declined Tone Check",
    dismissal_rate = "Proportion of edits where Tone Check was declined"
  ) %>%
  tab_source_note(
    gt::md('Limited to published edits where at least one Tone Check was shown')
  )

display_html(as_raw_html(tone_check_dismissal_byplatform))

Tone Check decline rate by platform

| Platform | Number of edits shown Tone check | Number of edits that declined Tone Check | Proportion of edits where Tone Check was declined |
|---|---|---|---|
| mobile web | 380 | 151 | 39.7% |
| desktop | 1349 | 496 | 36.8% |

Limited to published edits where at least one Tone Check was shown
Tone checks are declined only slightly more frequently on mobile compared to desktop. 36.8% of all published desktop edits where Tone Check was shown include at least one check that was declined compared to 39.7% of all published mobile edits.
This suggests that the lower impact Tone Check has on mobile web edit quality is not due to users explicitly rejecting the check.
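One way to gauge whether the platform difference above is distinguishable from noise is a two-sample test of proportions on the published counts. This sketch is not part of the original analysis, and it assumes editing sessions can be treated as independent observations, which may not hold if the same editor contributed several sessions.

```r
# Counts from the "Tone Check decline rate by platform" table
declined <- c(mobile_web = 151, desktop = 496)
shown    <- c(mobile_web = 380, desktop = 1349)

# Two-sample test of equal proportions (default 95% confidence level)
res <- prop.test(x = declined, n = shown)

# Decline rates per platform, as percentages
decline_rate_pct <- round(declined / shown * 100, 1)
decline_rate_pct

# p-value for the null hypothesis that both platforms share one decline rate
res$p.value
```

With these counts the p-value is above 0.05, which is consistent with reading the mobile and desktop decline rates as only slightly different.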
Code
### Decline reason by platform
tone_check_dismissal_byreason_byplatform <- tone_check_reject_data %>%
  filter(is_new_content == 1 & was_tone_check_shown == 1 & n_rejects > 0) %>% # limit to where shown and user did not revise text
  group_by(platform, reject_reason) %>%
  summarise(n_edits_rejected = n_distinct(editing_session)) %>%
  mutate(select_rate = round(n_edits_rejected / sum(n_edits_rejected), 2))
Code
# plot bar chart of reason selection
dodge <- position_dodge(width = 0.9)

# slightly larger chart needed here
options(repr.plot.width = 18, repr.plot.height = 10)

p <- tone_check_dismissal_byreason_byplatform %>%
  ggplot(aes(x = reject_reason, y = select_rate, fill = reject_reason)) +
  geom_col(position = 'dodge') +
  scale_y_continuous(labels = scales::percent) +
  geom_text(aes(label = paste0(select_rate * 100, "%"), fontface = 2), vjust = 1.2, size = 10, color = "white") +
  facet_grid(~ platform) +
  labs(
    y = "Percent of edits ",
    x = "Selected reason",
    title = "Reasons users selected for not revising text"
  ) +
  scale_fill_manual(values = cbPalette, name = "Reason") +
  theme(
    panel.grid.minor = element_blank(),
    panel.background = element_blank(),
    plot.title = element_text(hjust = 0.5),
    text = element_text(size = 24),
    legend.position = "bottom",
    legend.text = element_text(size = 18),
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    axis.line = element_line(colour = "black")
  )

p
On both mobile web and desktop, “The tone is appropriate” is the most frequently selected reason for electing to not revise text. The other decline options, including “None applies” and “Not sure how to revise tone”, see similar rates of selection on both platforms.
By User Experience
Code
tone_check_dismissal_byuserexp <- tone_check_reject_data %>%
  filter(is_new_content == 1 & was_tone_check_shown == 1) %>% # limit to where shown
  group_by(experience_level_group) %>%
  summarise(
    n_edits = n_distinct(editing_session),
    n_rejects = n_distinct(editing_session[n_rejects > 0]) # at least one Tone Check declined
  ) %>%
  mutate(dismissal_rate = paste0(round(n_rejects / n_edits * 100, 1), "%")) %>%
  ungroup() %>%
  mutate(
    n_edits = ifelse(n_edits < 50, "<50", n_edits),
    n_rejects = ifelse(n_rejects < 50, "<50", n_rejects)
  ) %>% # sanitizing per data publication guidelines
  #select(-2) %>%
  gt() %>%
  tab_header(
    title = "Tone Check decline rate by user experience"
  ) %>%
  opt_stylize(5) %>%
  cols_label(
    experience_level_group = "User Experience",
    n_edits = "Number of edits shown Tone check",
    n_rejects = "Number of edits that declined Tone Check",
    dismissal_rate = "Proportion of edits where Tone Check was declined"
  ) %>%
  tab_source_note(
    gt::md('Limited to published edits where at least one Tone Check was shown')
  )

display_html(as_raw_html(tone_check_dismissal_byuserexp))

Tone Check decline rate by user experience

| User Experience | Number of edits shown Tone check | Number of edits that declined Tone Check | Proportion of edits where Tone Check was declined |
|---|---|---|---|
| Unregistered | 291 | 141 | 48.5% |
| Newcomer | 406 | 141 | 34.7% |
| Junior Contributor | 1032 | 366 | 35.5% |

Limited to published edits where at least one Tone Check was shown
Unregistered users are more likely to decline a Tone Check than registered users. Tone Check was declined in 48.5% of published new content edits by unregistered users where it was shown, compared to 34.7% for Newcomers and 35.5% for Junior Contributors.
Code
### Dismissal reason by user experience
tone_check_dismissal_byreason_byuserexp <- tone_check_reject_data %>%
  filter(is_new_content == 1 & was_tone_check_shown == 1 & n_rejects > 0) %>% # limit to where shown and user did not revise their text
  group_by(experience_level_group, reject_reason) %>%
  summarise(n_edits_rejected = n_distinct(editing_session)) %>%
  mutate(select_rate = round(n_edits_rejected / sum(n_edits_rejected), 2))
Code
# plot bar chart of reason selection
dodge <- position_dodge(width = 0.9)

# slightly larger chart needed here
options(repr.plot.width = 18, repr.plot.height = 10)

p <- tone_check_dismissal_byreason_byuserexp %>%
  ggplot(aes(x = reject_reason, y = select_rate, fill = reject_reason)) +
  geom_col(position = 'dodge') +
  scale_y_continuous(labels = scales::percent) +
  geom_text(aes(label = paste0(select_rate * 100, "%"), fontface = 2), vjust = 1.2, size = 10, color = "white") +
  facet_grid(~ experience_level_group) +
  labs(
    y = "Percent of edits ",
    x = "Selected reason",
    title = "Reasons users selected for not revising their text"
  ) +
  scale_fill_manual(values = cbPalette, name = "Reason") +
  theme(
    panel.grid.minor = element_blank(),
    panel.background = element_blank(),
    plot.title = element_text(hjust = 0.5),
    text = element_text(size = 24),
    legend.position = "bottom",
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    axis.line = element_line(colour = "black")
  )

p
There were no significant differences in the distribution of decline reasons across the three reviewed user groups.
A higher proportion of unregistered editors (62%) selected “The tone is appropriate” compared to registered users (~57%). Unregistered editors are also slightly more likely to select “Not sure how to revise tone” and less likely to select “No reason provided” compared to registered editors.
By partner Wikipedia
Code
tone_check_dismissal_bywiki <- tone_check_reject_data %>%
  filter(is_new_content == 1 & was_tone_check_shown == 1) %>% # limit to where shown
  group_by(wiki) %>%
  summarise(
    n_edits = n_distinct(editing_session),
    n_rejects = n_distinct(editing_session[n_rejects > 0 & was_reverted == 0])
  ) %>%
  mutate(dismissal_rate = paste0(round(n_rejects / n_edits * 100, 1), "%")) %>%
  #filter(n_edits > 50) %>% # limit to wikis with over 50 edits.
  ungroup() %>%
  mutate(
    n_edits = ifelse(n_edits < 50, "<50", n_edits),
    n_rejects = ifelse(n_rejects < 50, "<50", n_rejects)
  ) %>% # sanitizing per data publication guidelines
  select(-2) %>%
  gt() %>%
  tab_header(
    title = "Tone Check decline rate by partner Wikipedia"
  ) %>%
  opt_stylize(5) %>%
  cols_label(
    wiki = "Wikipedia",
    #n_edits = "Number of edits shown Tone check",
    n_rejects = "Number of edits that declined Tone Check",
    dismissal_rate = "Proportion of edits where Tone Check was declined"
  ) %>%
  tab_source_note(
    gt::md('Limited to published edits where at least one Tone Check was shown')
  )

display_html(as_raw_html(tone_check_dismissal_bywiki))

Tone Check decline rate by partner Wikipedia

| Wikipedia | Number of edits that declined Tone Check | Proportion of edits where Tone Check was declined |
|---|---|---|
| French Wikipedia | 335 | 27.3% |
| Japanese Wikipedia | 61 | 27.9% |
| Portuguese Wikipedia | 75 | 26.5% |

Limited to published edits where at least one Tone Check was shown
Decline rates are very similar across all three Partner Wikipedias.
Distinct users that publish a reverted edit
Hypothesis: Newcomers and Junior Contributors will be more aware of the need to write in a neutral tone when contributing new text because the visual editor will prompt them to do so in cases where they have written text that contains non-neutral language.
Methodology: The proportion of newcomers and Junior Contributors shown or eligible to be shown Tone Check that publish at least one new content edit that was reverted.
This metric is similar to the revert rate analysis except that it looks at the proportion of distinct editors rather than distinct edits. The results did not differ significantly from those reported in the Primary Metric 1: Revert rate section, as the majority of newcomers and Junior Contributors posted just one new content edit during the reviewed time period. See details below.
# plot visualization of overall users reverted
dodge <- position_dodge(width = 0.9)

p <- tone_check_reverts_byuser_overall |>
  ggplot(aes(x = test_group, y = n_users_revert / n_users, fill = test_group)) +
  geom_col(position = 'dodge') +
  scale_y_continuous(labels = scales::percent) +
  geom_text(aes(label = paste(revert_rate, "\n", n_users_revert, "users reverted"), fontface = 2), vjust = 1.2, size = 10, color = "white") +
  scale_fill_manual(values = c("#999999", "dodgerblue4"), name = "Experiment Group") +
  labs(
    y = "Percent of distinct users reverted ",
    x = "Experiment Group",
    title = "Proportion of users with at least one reverted edit",
    caption = "Limited to published new content edits shown or eligible to be shown Tone Check"
  ) +
  theme(
    panel.grid.minor = element_blank(),
    panel.background = element_blank(),
    plot.title = element_text(hjust = 0.5),
    text = element_text(size = 24),
    axis.text.x = element_text(size = 24),
    axis.title.x = element_text(margin = margin(t = 20, unit = "pt")),
    legend.position = "none",
    axis.line = element_line(colour = "black")
  )

p
By whether multiple checks were shown
Code
tone_check_revert_byuser_bymultiple <- tone_check_publish_data |>
  filter(is_new_content == 1 & is_test_eligible == 'eligible' & test_group == 'test (tone check shown)' & multiple_checks_shown != "no tone checks") |>
  group_by(multiple_checks_shown) %>%
  summarise(
    n_users = n_distinct(user_id),
    n_revert_users = n_distinct(user_id[was_reverted == 1]) # users with at least one reverted edit
  ) |>
  mutate(revert_rate = paste0(round(n_revert_users / n_users * 100, 1), "%")) |>
  select(-c(2,3)) |> # removing granular data columns for publication
  gt() |>
  tab_header(
    title = "Users with at least one reverted edit by if multiple checks were shown"
  ) |>
  opt_stylize(5) |>
  cols_label(
    multiple_checks_shown = "Multiple Check",
    #n_users = "Number of users",
    #n_revert_users = "Number of users reverted",
    revert_rate = "Proportion of distinct users that were reverted"
  ) |>
  tab_source_note(
    gt::md('Limited to published new content edits shown or eligible to be shown Tone Check')
  )

display_html(as_raw_html(tone_check_revert_byuser_bymultiple))

Users with at least one reverted edit by if multiple checks were shown

| Multiple Check | Proportion of distinct users that were reverted |
|---|---|
| one tone check | 26% |
| multiple tone checks | 26.6% |

Limited to published new content edits shown or eligible to be shown Tone Check
By Platform
Code
tone_check_revert_byuser_byplatform <- tone_check_publish_data |>
  filter(is_new_content == 1 & is_test_eligible == 'eligible') |>
  group_by(platform, test_group) |>
  summarise(
    n_users = n_distinct(user_id),
    n_revert_users = n_distinct(user_id[was_reverted == 1]) # users with at least one reverted edit
  ) %>%
  mutate(revert_rate = paste0(round(n_revert_users / n_users * 100, 1), "%")) %>%
  select(-c(3,4)) %>% # removing granular data columns for publication
  gt() |>
  tab_header(
    title = "Users with at least one reverted edit by platform"
  ) |>
  opt_stylize(5) %>%
  cols_label(
    test_group = "Experiment Group",
    platform = "Platform",
    #n_users = "Number of Users",
    #n_revert_users = "Number of users reverted ",
    revert_rate = "Proportion of distinct users that were reverted"
  ) |>
  tab_source_note(
    gt::md('Limited to users who published new content edits shown or eligible to be shown Tone Check')
  )

display_html(as_raw_html(tone_check_revert_byuser_byplatform))

Users with at least one reverted edit by platform

| Platform | Experiment Group | Proportion of distinct users that were reverted |
|---|---|---|
| mobile web | control (eligible but not shown tone check) | 33% |
| mobile web | test (tone check shown) | 37.4% |
| desktop | control (eligible but not shown tone check) | 24.1% |
| desktop | test (tone check shown) | 22.7% |

Limited to users who published new content edits shown or eligible to be shown Tone Check
By User Experience
Code
tone_check_revert_byuser_byuserexp <- tone_check_publish_data |>
  filter(is_new_content == 1 & is_test_eligible == 'eligible') |>
  group_by(experience_level_group, test_group) |>
  summarise(
    n_users = n_distinct(user_id),
    n_revert_users = n_distinct(user_id[was_reverted == 1]) # users with at least one reverted edit
  ) %>%
  mutate(revert_rate = paste0(round(n_revert_users / n_users * 100, 1), "%")) |>
  select(-c(3,4)) %>% # removing granular data columns for publication
  gt() |>
  tab_header(
    title = "Users with at least one reverted edit by user experience"
  ) |>
  opt_stylize(5) |>
  cols_label(
    test_group = "Test group",
    experience_level_group = "User experience",
    #n_users = "Number of users",
    #n_revert_users = "Number of users reverted ",
    revert_rate = "Proportion of distinct users that were reverted"
  ) |>
  tab_source_note(
    gt::md('Limited to users who published new content edits shown or eligible to be shown Tone Check')
  )

display_html(as_raw_html(tone_check_revert_byuser_byuserexp))

Users with at least one reverted edit by user experience

| User experience | Test group | Proportion of distinct users that were reverted |
|---|---|---|
| Unregistered | control (eligible but not shown tone check) | 34.9% |
| Unregistered | test (tone check shown) | 37.2% |
| Newcomer | control (eligible but not shown tone check) | 21.5% |
| Newcomer | test (tone check shown) | 26.4% |
| Junior Contributor | control (eligible but not shown tone check) | 25.5% |
| Junior Contributor | test (tone check shown) | 22.5% |

Limited to users who published new content edits shown or eligible to be shown Tone Check
Constructive Retention Rate (Tone Check not shown again)
We also reviewed the proportion of Newcomers and Junior Contributors who published an edit in which Tone Check was activated and then returned 7 to 14 days later to make a new content edit in which Tone Check was not shown.
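As a rough illustration of how a retention rate of this kind could be derived, the sketch below works through a toy edit log in base R. It is not the query used for this report: the `edits` data frame, its column names (`edit_date`, `tone_check_active`), and the choice to measure the 7 to 14 day window from each user's first edit with Tone Check activated are all assumptions made for the example.

```r
# Toy edit log. All column names and values here are hypothetical.
edits <- data.frame(
  user_id           = c(1, 1, 2, 2, 3),
  edit_date         = as.Date(c("2025-09-03", "2025-09-12",
                                "2025-09-04", "2025-09-06",
                                "2025-09-05")),
  tone_check_active = c(TRUE, FALSE, TRUE, FALSE, TRUE)
)

# Work in whole days so date arithmetic stays numeric
edits$day <- as.numeric(edits$edit_date)

# Day of each user's first edit in which Tone Check was activated
active    <- edits[edits$tone_check_active, ]
first_day <- tapply(active$day, active$user_id, min)

# Later edits without Tone Check, measured from that first edit (assumed window)
later        <- edits[!edits$tone_check_active, ]
days_after   <- later$day - first_day[as.character(later$user_id)]
returned_ids <- unique(later$user_id[which(days_after >= 7 & days_after <= 14)])

editors        <- length(first_day)     # users with a qualifying first edit
return_editors <- length(returned_ids)  # users who came back in the window
retention_rate <- paste0(round(return_editors / editors * 100, 1), "%")
retention_rate
```

In this toy log only user 1 returns inside the window, so one of three first-week editors counts as retained.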
retention_rate_norepeat_overall_table <- retention_rate_norepeat_overall %>%
  select(-c(2, 3)) |> # removing granular data columns for publication
  gt() %>%
  tab_header(title = "Constructive second week retention rate (tone check not shown again)") %>%
  cols_label(test_group = "Experiment group",
             # return_editors = "Number of editors that returned second week",
             # editors = "Number of first week editors",
             retention_rate = "Retention rate") %>%
  opt_stylize(5) %>%
  tab_footnote(footnote = "Limited to users shown or eligible to be shown at least one Tone Check",
               locations = cells_column_labels(columns = 'retention_rate'))

display_html(as_raw_html(retention_rate_norepeat_overall_table))
Constructive second week retention rate (tone check not shown again)

| Experiment group | Retention rate¹ |
|---|---|
| control (eligible but not shown tone check) | 1.6% |
| test (tone check shown) | 1.7% |

¹ Limited to users shown or eligible to be shown at least one Tone Check
Fewer than 2% of editors in both the control and test groups returned in their second week to make another new content edit where Tone Check was not shown or not eligible to be shown. Although the retention rate is slightly higher for editors shown Tone Check (1.7% vs. 1.6%), we are unable to confirm statistical significance given the small effect and sample size.
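To illustrate why a difference of this size is hard to confirm, the sketch below compares two retention rates with Fisher's exact test in base R. The counts are hypothetical, since the published tables omit the underlying numbers of editors; they were chosen only so that the rates match the 1.6% and 1.7% reported above. Fisher's exact test is used here as one reasonable option when few editors return, and is not necessarily the method applied in this analysis.

```r
# Hypothetical counts for illustration only; the report does not publish them.
editors        <- c(control = 1000, test = 1000)  # first week editors (assumed)
return_editors <- c(control = 16,   test = 17)    # second week returners (assumed)

retention_rate <- return_editors / editors

# 2x2 table of returned vs. did not return, by experiment group
retention_counts <- rbind(
  returned     = return_editors,
  not_returned = editors - return_editors
)

# Exact test of whether the odds of returning differ between the groups
retention_test <- fisher.test(retention_counts)

round(retention_rate * 100, 1)
retention_test$p.value
```

With counts on this scale the p-value is far above conventional thresholds, which is consistent with the conclusion that the observed difference cannot be distinguished from noise.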
retention_rate_norepeat_byplatform_table <- retention_rate_norepeat_byplatform %>%
  select(-c(3, 4)) %>% # removing granular data columns for publication
  gt() %>%
  tab_header(title = "Constructive second week retention rate (tone check not shown again) by platform") %>%
  cols_label(test_group = "Experiment group",
             platform = "Platform",
             # return_editors = "Number of editors that returned second week",
             # editors = "Number of first week editors",
             retention_rate = "Retention rate") %>%
  opt_stylize(5) %>%
  tab_footnote(footnote = "Limited to users shown or eligible to be shown at least one Tone Check",
               locations = cells_column_labels(columns = 'retention_rate'))

display_html(as_raw_html(retention_rate_norepeat_byplatform_table))
Constructive second week retention rate (tone check not shown again) by platform

| Platform | Experiment group | Retention rate¹ |
|---|---|---|
| mobile web | control (eligible but not shown tone check) | 0.5% |
| mobile web | test (tone check shown) | 1.7% |
| desktop | control (eligible but not shown tone check) | 2.1% |
| desktop | test (tone check shown) | 1.7% |

¹ Limited to users shown or eligible to be shown at least one Tone Check
retention_rate_norepeat_byuserexp_table <- retention_rate_norepeat_byuserexp %>%
  select(-c(3, 4)) %>% # removing granular data columns for publication
  gt() %>%
  tab_header(title = "Constructive second week retention rate (tone check not shown again) by user experience") %>%
  cols_label(test_group = "Experiment group",
             experience_level_group = "Experience level group",
             # return_editors = "Number of editors that returned second week",
             # editors = "Number of first week editors",
             retention_rate = "Retention rate") %>%
  opt_stylize(5) %>%
  tab_footnote(footnote = "Limited to users shown or eligible to be shown at least one Tone Check",
               locations = cells_column_labels(columns = 'retention_rate'))

display_html(as_raw_html(retention_rate_norepeat_byuserexp_table))
Constructive second week retention rate (tone check not shown again) by user experience

| Experience level group | Experiment group | Retention rate¹ |
|---|---|---|
| Unregistered | control (eligible but not shown tone check) | 0% |
| Unregistered | test (tone check shown) | 0.5% |
| Newcomer | control (eligible but not shown tone check) | 1.1% |
| Newcomer | test (tone check shown) | 2.3% |
| Junior Contributor | control (eligible but not shown tone check) | 2.7% |
| Junior Contributor | test (tone check shown) | 2% |

¹ Limited to users shown or eligible to be shown at least one Tone Check