Reference Check A/B Test Report

Author
Affiliation

Irene Florez

Published

December 19, 2025

Modified

January 6, 2026

1 Summary in Brief

Summary

TL;DR: When Reference Check was shown, edits were much more likely to add a new reference, edits were more often constructive, and reverts declined, at the cost of a slight reduction in edit completion; these patterns are especially evident on mobile web.

Brief Summary: When Reference Check is shown, edits are significantly more likely to add a reference, especially on mobile web. Edits shown Reference Check are directionally more constructive and less likely to be reverted within 48 hours, with the strongest and most consistent improvements on mobile web. While Reference Check slightly reduces edit completion, the decrease is modest.

  • References added: Edits were more likely to add at least one reference when Reference Check was shown, with very large gains on mobile web and clear gains on desktop.
  • Constructive edits: Edits were more likely to be constructive (not reverted within 48 hours), with the strongest improvement on mobile web.
  • Reverts: Revert rates declined overall, with the largest reduction on mobile web.
  • Edit completion: Completion rates decreased slightly across platforms.

More references, fewer reverts, and improved constructiveness on mobile demonstrate the benefits of Reference Check on English Wikipedia and outweigh the observed reduction in completion.

2 Key Results

2.0.1 References Added or Acknowledged (KPI #1)

High-level: When Reference Check was shown, edits were far more likely to add a reference or acknowledge/explain why they did not.

Why KPI #1b: Direct test–control comparisons for KPI #1 are hard to interpret because the “Decline” option exists only in test. We therefore focus on KPI #1b, which removes this imbalance.

KPI #1b — Reference Added (Shown / Eligible):

  • Desktop: ~2.2× more likely (30.7% → 68.2%)
  • Mobile web: ~17.5× more likely (2.8% → 48.9%)
  • Interpretation: The increase in reference inclusion is large and statistically significant across models and simpler comparisons.

KPI #1b — Reference Added (Availability / ITT):

  • Overall: 56.3% → 68.3% (+12.1 pp, +21.5%)
  • Desktop: 60.5% → 70.6% (+10.2 pp, +16.8%)
  • Mobile web: ~2.2× more likely (22.0% → 47.8%)
  • Interpretation: Even under a conservative ITT view, Reference Check increases the likelihood that constructive new-content edits include a reference, with the strongest lift on mobile web.
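
The percentage-point and relative deltas above follow directly from the control and test rates. A quick sketch of the arithmetic, using the rounded rates shown here (so results can differ slightly from the report's unrounded values):

```r
# Percentage-point and relative change from the rounded rates quoted above
control <- c(overall = 0.563, desktop = 0.605, mobile_web = 0.220)
test    <- c(overall = 0.683, desktop = 0.706, mobile_web = 0.478)

abs_diff_pp <- (test - control) * 100       # absolute difference in percentage points
rel_change  <- (test - control) / control   # relative change vs control

round(abs_diff_pp, 1)   # overall 12.0, desktop 10.1, mobile_web 25.8
round(rel_change, 3)    # mobile_web ~1.173, i.e. roughly 2.2x the control rate
```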

2.0.2 Constructive Edits (Not Reverted Within 48 Hours) (KPI #2)

  • Desktop: 75.5% → 77.9% (+3.2% relative; the overall adjusted regression does not find a statistically significant across-platform effect)
  • Mobile web: 56.4% → 66.7% (+18.2% relative; within-platform contrast statistically significant)

Interpretation: Results consistently favor the test group, with the clearest improvements on mobile web. While cross-platform differences are not definitive, the pattern suggests larger gains on mobile. When adjusting for reference inclusion, the mobile effect attenuates, indicating part of the benefit operates through added references.

2.0.3 Revert Rate Within 48 Hours (Lower Is Better) (Guardrail #1)

  • Overall: −14.5% relative (28.2% → 24.1%)
  • Desktop: −9.8% relative (24.5% → 22.1%)
  • Mobile web: −23.6% relative (43.6% → 33.3%)

Interpretation: Edits shown Reference Check were less likely to be reverted across analyses, with the strongest and most reliable reduction on mobile web. Although cross-platform differences are not statistically definitive, within-mobile contrasts and the relax-based lift analyses show clear reductions. Edits that added a new reference were much less likely to be reverted, supporting the quality mechanism.

2.0.4 Edit Completion (SaveIntent → SaveSuccess) (Guardrail #2)

  • Overall: −4.8% relative (88.3% → 84.1%)
  • Desktop: −6.8% relative (94.0% → 87.6%)
  • Mobile web: −6.3% relative (74.1% → 69.4%)

Interpretation: Reference Check introduces measurable friction that reduces completion rates, but this trade-off coincides with higher-quality outcomes: increased reference inclusion, fewer reverts, and improved constructiveness on mobile web.

3 Overview

The Wikimedia Foundation’s Editing team is working on a set of improvements for the visual editor to help new volunteers understand and follow some of the policies necessary to make constructive changes to Wikipedia projects.

This work is guided by the Wikimedia Foundation Annual Plan, specifically by the Wiki Experiences 1.1 objective's key result: Increase the rate at which editors with ≤100 cumulative edits publish constructive edits on mobile web by 4%, as measured by controlled experiments (by the end of Q2).

In this A/B test the Editing team is evaluating the impact of Reference Check. Reference Check invites users who have added more than 50 new characters to a page in the article namespace to include a reference in the edit they’re making, if they have not already done so, at the time they indicate their intent to save. More information about the features of this tool and project updates is available on the project page.

English Wikipedia Reference Check KPI Hypothesis: The number of constructive edits newcomers publish will increase because a greater percentage of edits that add new content will include a reference or an explicit acknowledgement of why these edits lack references. KPI Metric(s) for evaluation: 1) Proportion of published edits that add new content and include a reference or an explicit acknowledgement of why a citation was not added; 2) Proportion of published edits that add new content (T333714) and are constructive (read: NOT reverted within 48 hours).


From the Edit Check Reference-Check A/B test report: when Reference Check was shown, edits were 2.2× more likely to include a new reference and to be constructive (i.e., not reverted within 48 hours) than otherwise. The English Wikipedia Reference Check A/B test will look at how its numbers compare to this 2024 finding.

Code
# Load packages
shhh <- function(expr) suppressPackageStartupMessages(suppressWarnings(suppressMessages(expr)))
shhh({
  library(lubridate)
  library(ggplot2)
  library(dplyr)
  library(gt)
  library(IRdisplay)
  library(tidyr)
  library(relax)
  library(tibble)
  library(lme4)
  library(broom)
  library(broom.mixed)
  library(broom.helpers)
  # NOTE: We intentionally do NOT attach brms here.
  # In some environments, brms/rstan can fail to load due to binary toolchain issues.
  # The notebook runs brms models on a best-effort basis via safe_brm() and will skip if unavailable.
  set.seed(5)
})

# Preferences
options(dplyr.summarise.inform = FALSE)
# Reduce default plot size (tablet-friendly)
options(repr.plot.width = 9, repr.plot.height = 5.5)

# Colorblind-friendly palette
cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")

# Configuration (edit for reruns)
experiment_name <- "reference_check_ab_test_2025"
experiment_bucket_test <- "2025-09-editcheck-addReference-test"
experiment_bucket_control <- "2025-09-editcheck-addReference-control"
experiment_buckets <- c(experiment_bucket_test, experiment_bucket_control)
wiki_list <- c("enwiki")

# Bucket label normalization (collapse long labels to control/test)
# NOTE: bucket_map is set for 2025 RC; update for other experiments or set via env.
bucket_map <- c(
  "2025-09-editcheck-addReference-control" = "control",
  "2025-09-editcheck-addReference-test" = "test"
)
normalize_buckets <- function(df) {
  if (is.null(df) || !"test_group" %in% names(df)) return(df)
  df %>% mutate(test_group = {
    tg <- trimws(as.character(test_group))
    tg <- dplyr::case_when(
      tg == experiment_bucket_control ~ "control",
      tg == experiment_bucket_test ~ "test",
      TRUE ~ tg
    )
    tg <- recode(tg, !!!bucket_map, .default = tg)
    tg <- ifelse(grepl("addreference-control", tg, ignore.case = TRUE), "control",
                 ifelse(grepl("addreference-test", tg, ignore.case = TRUE), "test", tg))
    tg
  })
}

normalize_platforms <- function(df) {
  if (is.null(df) || !"platform" %in% names(df)) return(df)
  df %>% mutate(platform = {
    pf <- trimws(as.character(platform))
    pf <- dplyr::case_when(
      tolower(pf) %in% c("phone", "mobile") ~ "mobile-web",
      TRUE ~ pf
    )
    pf
  })
}

# Apply normalization after load too (in case downstream merges introduce new labels)
renorm_buckets <- function(df) normalize_platforms(normalize_buckets(df))

# Construct analysis groups per updated methodology
# - For KPI1/KPI2/Guardrail1: test = rows where RC was shown at least once; control = eligible-but-not-shown
make_rc_ab_group_published <- function(df) {
  if (is.null(df)) return(df)
  df <- df %>% apply_aliases() %>% renorm_buckets()
  need <- c("test_group", "was_reference_check_shown", "was_reference_check_eligible")
  if (!all(need %in% names(df))) return(df)

  df %>%
    mutate(
      ab_group = dplyr::case_when(
        test_group == "test" & was_reference_check_shown == 1 ~ "test",
        test_group == "control" & was_reference_check_eligible == 1 & (is.na(was_reference_check_shown) | was_reference_check_shown != 1) ~ "control",
        TRUE ~ NA_character_
      ),
      test_group = ab_group
    ) %>%
    filter(!is.na(test_group)) %>%
    add_experience_group()
}

# Guardrail2 constraint: test group is shown-only; control group is not restricted by eligibility
make_rc_ab_group_completion <- function(df) {
  if (is.null(df)) return(df)
  df <- df %>% apply_aliases() %>% renorm_buckets()
  if (!all(c("test_group", "was_reference_check_shown") %in% names(df))) return(df)

  df %>%
    filter(!(test_group == "test" & (is.na(was_reference_check_shown) | was_reference_check_shown != 1)))
}

# Experience group for Guardrail #2 reporting
add_experience_group <- function(df) {
  if (is.null(df)) return(df)
  if (!("user_edit_count" %in% names(df)) || !("user_status" %in% names(df))) return(df)
  df %>% mutate(
    experience_level_group = dplyr::case_when(
      user_edit_count == 0 & user_status == "registered" ~ "Newcomer",
      user_edit_count == 0 & user_status == "unregistered" ~ "Unregistered",
      user_edit_count > 0 & user_edit_count <= 100 ~ "Junior Contributor",
      user_edit_count > 100 ~ "Non-Junior Contributor",
      TRUE ~ NA_character_
    ),
    experience_level_group = factor(
      experience_level_group,
      levels = c("Unregistered", "Newcomer", "Junior Contributor", "Non-Junior Contributor")
    )
  )
}

# Column aliasing (plug-and-play for drifted names)
col_aliases <- list(
  was_reference_check_shown         = c("reference_check_shown", "rc_shown"),
  was_reference_check_eligible      = c("reference_check_eligible", "rc_eligible"),
  saved_edit                        = c("save_success", "saved"),
  was_reverted                      = c("reverted_48h", "mw_reverted"),
  was_reference_included            = c("reference_added", "has_reference_added", "has_reference", "reference_added_or_acknowledged"),
  has_reference_or_acknowledgement  = c("has_reference_or_ack"),
  added_reference_or_acknowledgement = c("added_reference_or_ack")
)
apply_aliases <- function(df, aliases = col_aliases) {
  if (is.null(df)) return(df)
  for (nm in names(aliases)) {
    if (!(nm %in% names(df))) {
      cand <- aliases[[nm]]
      hit <- cand[cand %in% names(df)][1]
      if (!is.na(hit)) names(df)[names(df) == hit] <- nm
    }
  }
  df
}

# Timestamp candidates for coverage checks
ts_candidates <- c("event_dt", "rev_timestamp", "mw_timestamp", "first_edit_time", "return_time", "dt", "timestamp")

# Default sanitize_counts (if not provided elsewhere)
if (!exists("sanitize_counts")) {
  sanitize_counts <- function(df, cols) {
    keep <- intersect(cols, names(df))
    if (length(keep) == 0) return(df)
    df %>% mutate(across(all_of(keep), ~ if (!is.numeric(.)) {
      as.character(.)
    } else {
      dplyr::case_when(
        is.na(.) ~ NA_character_,
        . < 50 ~ "<50",
        TRUE ~ scales::comma(., accuracy = 1)
      )
    }))
  }
}

# Data directory (hard-coded relative to this notebook)
output_dir <- file.path("data", experiment_name)
dir.create(output_dir, recursive = TRUE, showWarnings = FALSE)
#message("Data directory: ", normalizePath(output_dir, mustWork = FALSE))

# Expected input files from the collection notebook
files <- list(
  reference_check_save_data = file.path(output_dir, "reference_check_save_data.tsv"),
  constructive_retention_data = file.path(output_dir, "constructive_retention_data.tsv"),
  edit_completion_rate_data = file.path(output_dir, "edit_completion_rate_data.tsv"),
  reference_check_rejects_data = file.path(output_dir, "reference_check_rejects_data.tsv")
)
Code
# Helper: sanitize counts < 50 per publication guidelines
# (overrides the fallback default defined in the setup chunk above)
sanitize_counts <- function(df, cols) {
  keep <- intersect(cols, names(df))
  if (length(keep) == 0) return(df)
  df %>% mutate(across(all_of(keep), ~ ifelse(!is.na(.) & . < 50, "<50", as.character(.))))
}
Code
# Plot helpers (paste-check style)
pc_theme <- function() {
  ggplot2::theme_minimal(base_size = 11) +
    ggplot2::theme(
      legend.position = "bottom",
      panel.grid.minor = ggplot2::element_blank(),
      # Padding so titles/labels/annotations do not touch plot edges
      plot.margin = ggplot2::margin(10, 14, 10, 10),
      plot.title = ggplot2::element_text(margin = ggplot2::margin(b = 8)),
      plot.subtitle = ggplot2::element_text(margin = ggplot2::margin(b = 8)),
      axis.title.x = ggplot2::element_text(margin = ggplot2::margin(t = 8)),
      axis.title.y = ggplot2::element_text(margin = ggplot2::margin(r = 8))
    )
}
Code
# Table helpers: rates and relative change vs control by platform
make_rate_table <- function(df, value_col, group_cols = c("test_group", "platform")) {
  df %>%
    group_by(across(all_of(group_cols))) %>%
    summarise(rate = mean(.data[[value_col]], na.rm = TRUE), n = n(), .groups = "drop")
}

# Build a self-explanatory change table: control/test rates + absolute diff (pp) + relative change + Ns
make_rel_change_dim <- function(rate_tbl, dim_col = "platform", rate_col = "rate") {
  if (is.null(rate_tbl) || nrow(rate_tbl) == 0) {
    return(tibble(
      !!rlang::sym(dim_col) := character(),
      control_rate = numeric(),
      test_rate = numeric(),
      abs_diff_pp = numeric(),
      rel_change = numeric(),
      n_control = numeric(),
      n_test = numeric()
    ))
  }

  if (!(dim_col %in% names(rate_tbl))) {
    stop("make_rel_change_dim: dim_col not found: ", dim_col)
  }
  if (!("test_group" %in% names(rate_tbl))) {
    stop("make_rel_change_dim: missing required column: test_group")
  }
  if (!(rate_col %in% names(rate_tbl))) {
    stop("make_rel_change_dim: rate_col not found: ", rate_col)
  }

  rate_tbl <- renorm_buckets(rate_tbl)

  # Work with a stable internal key so joins are never broken by tidy-eval
  rt <- rate_tbl %>%
    rename(dim = all_of(dim_col))

  # Rates wide
  wide_rate <- rt %>%
    select(dim, test_group, value = all_of(rate_col)) %>%
    tidyr::pivot_wider(names_from = test_group, values_from = value)

  if (!"control" %in% names(wide_rate)) wide_rate$control <- NA_real_
  if (!"test" %in% names(wide_rate)) wide_rate$test <- NA_real_

  out <- wide_rate %>%
    transmute(
      dim,
      control_rate = control,
      test_rate = test,
      abs_diff_pp = (test - control) * 100,
      rel_change = dplyr::if_else(is.na(control) | control == 0, NA_real_, (test - control) / control)
    )

  # Optional: pivot counts if present
  if ("n" %in% names(rt)) {
    wide_n <- rt %>%
      select(dim, test_group, n) %>%
      tidyr::pivot_wider(names_from = test_group, values_from = n)
    if (!"control" %in% names(wide_n)) wide_n$control <- NA_real_
    if (!"test" %in% names(wide_n)) wide_n$test <- NA_real_
    out <- out %>%
      left_join(wide_n %>% transmute(dim, n_control = control, n_test = test), by = "dim")
  }

  # Optional: pivot denominators if present (used in dismissal rate tables)
  if ("distinct_sessions" %in% names(rt)) {
    wide_denom <- rt %>%
      select(dim, test_group, distinct_sessions) %>%
      tidyr::pivot_wider(names_from = test_group, values_from = distinct_sessions)
    if (!"control" %in% names(wide_denom)) wide_denom$control <- NA_real_
    if (!"test" %in% names(wide_denom)) wide_denom$test <- NA_real_
    out <- out %>%
      left_join(
        wide_denom %>% transmute(dim, distinct_sessions_control = control, distinct_sessions_test = test),
        by = "dim"
      )
  }

  out %>%
    rename(!!rlang::sym(dim_col) := dim)
}

make_rel_change <- function(rate_tbl, rate_col = "rate") {
  make_rel_change_dim(rate_tbl, dim_col = "platform", rate_col = rate_col)
}

4 Analysis

4.0.1 Methodology reference

KPI 1: References (included or acknowledged) We evaluate whether a reference was included or the editor explicitly acknowledged missing references (via one of the four valid decline reasons). We compare outcomes across experiment groups and slices and report uncertainty via regression + Bayesian lift.

KPI 2: Constructive edits We define an edit as constructive if it is not reverted within 48 hours (1 = not reverted within 48h). We estimate differences across groups and slices via regression + Bayesian lift.

Guardrail 1: Content quality We examine 48-hour revert rates, including a breakdown stratified by whether a reference was included. We use relax for lift and may include prop.test as a lightweight audit check.
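
The prop.test audit mentioned above can be sketched as follows. The counts here are illustrative only (the overall revert rates from this report applied to a hypothetical 1,000 edits per arm), not the experiment's actual Ns:

```r
# Two-sample test of proportions on illustrative revert counts
# x = reverted edits, n = total edits per arm (hypothetical n = 1,000)
audit <- prop.test(x = c(282, 241), n = c(1000, 1000))
audit$estimate   # 0.282 (control) vs 0.241 (test)
audit$p.value    # small p-value suggests the difference is unlikely under H0
```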

Guardrail 2: Edit completion We define completion as the transition from save intent to a successful save (saveIntent → saveSuccess). We analyze completion rates via regression + Bayesian lift, with the test group limited to sessions where Reference Check was shown.

Modeling details Primary approach: logistic regression (glm) + Bayesian (relax; brms optional) for uncertainty. Mixed-effects models are not used in this enwiki-only report.
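
As a sketch of the primary modeling approach, a logistic regression on simulated data; the real analysis uses this notebook's per-edit data frames, and the rates below are invented for illustration:

```r
# Logistic regression sketch: outcome ~ group + platform, simulated data
set.seed(5)
n <- 4000
sim <- data.frame(
  test_group = sample(c("control", "test"), n, replace = TRUE),
  platform   = sample(c("desktop", "mobile-web"), n, replace = TRUE)
)
# Invented rates: higher reference inclusion in the test group
p <- ifelse(sim$test_group == "test", 0.65, 0.30)
sim$ref_included <- rbinom(n, 1, p)

fit <- glm(ref_included ~ test_group + platform, data = sim, family = binomial())
exp(coef(fit))[["test_grouptest"]]  # adjusted odds ratio; true value ~4.3 under these rates
```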

Data coverage checks:

  • Buckets present and counts
  • Wiki coverage
  • Date span (min/max timestamps)
Code
# Reference Check exposure among published new-content edits (test bucket; by editing_session_id)
# Note: `editing_session_id` corresponds to a single edit in this dataset; it links together that edit’s steps/events.
if (require_cols(reference_check_save_data,
                 c("test_group", "is_new_content", "editing_session", "was_reference_check_shown"),
                 "Reference Check exposure (editing_session_id)")) {

  df_test_nc <- reference_check_save_data %>%
    renorm_buckets() %>%
    filter(test_group == "test", is_new_content == 1)

  exposure_sessions <- df_test_nc %>%
    group_by(editing_session) %>%
    summarise(
      shown_any = as.integer(any(was_reference_check_shown == 1, na.rm = TRUE)),
      .groups = "drop"
    ) %>%
    mutate(exposure_bucket = dplyr::if_else(shown_any == 1L, "shown", "not_shown")) %>%
    count(exposure_bucket) %>%
    mutate(pct = n / sum(n))

  render_pct_table(
    exposure_sessions,
    "Reference Check exposure among published new-content edits (test bucket; by editing_session_id)",
    c(exposure_bucket = "Bucket",
      n = "Count (edits; distinct editing_session_id)",
      pct = "Percent of published new-content edits"),
    note_text = paste(
      "Denominator = distinct editing_session_id (edits) among published new-content edits in the test bucket",
      "(`test_group == 'test'` and `is_new_content == 1`).",
      "An edit is counted as exposed/shown if any event for that edit has `was_reference_check_shown == 1`",
      "(measured via VEFU; action=check-shown-presave)."
    )
  )
}
Reference Check exposure among published new-content edits (test bucket; by editing_session_id)
Bucket      Count (edits; distinct editing_session_id)   Percent of published new-content edits
not_shown   2,558                                         61.9%
shown       1,574                                         38.1%

Table note: Denominator = distinct editing_session_id (edits) among published new-content edits in the test bucket (test_group == 'test' and is_new_content == 1). An edit is counted as exposed/shown if any event for that edit has was_reference_check_shown == 1 (measured via VEFU; action=check-shown-presave).

“Among published new-content sessions in the test bucket, what fraction had the check shown at least once?”

4.2 Key metrics

4.2.1 KPI #1 Reference Added or acknowledged why a citation was not added

Metric: Proportion of published edits that add new content and either include a reference or explicitly acknowledge why a citation was not added.

Methodology: We analyze published edits that add new content.

Test group: The test group includes editing sessions where Reference Check was shown at least once during the editing session. This corresponds to event.feature = "editCheck-addReference" and event.action = "check-shown-presave". Only published edits are included. An edit is counted if it either adds at least one net new reference or includes an explicit acknowledgement for missing references by selecting one of the four valid decline reasons.

Control group: The control group includes published edits identified as eligible but not shown Reference Check. An edit is counted if it includes at least one net new reference.

We compare proportions between experiment groups (control vs test) overall and by platform / user status / checks-shown buckets. For adjusted comparisons we use multivariable logistic regression (glm); for lift/uncertainty we also report Bayesian lift via relax when available.

Note: Similar to KPI1 in Multi Check

Results: When Reference Check was shown, edits were far more likely to either add a reference or clearly acknowledge why they didn’t. Given the current KPI1 definition, direct test–control comparisons are difficult to interpret because the “Decline” option is available only in test. To ensure a fair comparison across test and control groups, we focus on KPI1b, which compares control and test without the decline option.

Code
# KPI1 bar (Reference Added or acknowledged why a citation was not added), by test_group
# Updated per KPI #1 methodology:
# - population: new-content edits within analysis groups (shown test vs eligible-not-shown control)
# - outcome: test counts (reference included OR valid decline acknowledgement); control counts (reference included)
if (!is.null(reference_check_save_data)) {
  kpi1_df <- reference_check_save_data %>%
    make_rc_ab_group_published() %>%
    filter(is_new_content == 1)

  ref_flag <- pick_first(reference_flag_candidates, kpi1_df)

  if (!is.null(ref_flag) && all(c("test_group") %in% names(kpi1_df))) {
    valid_reasons <- c(
      "edit-check-feedback-reason-common-knowledge",
      "edit-check-feedback-reason-irrelevant",
      "edit-check-feedback-reason-uncertain",
      "edit-check-feedback-reason-other"
    )

    # Add acknowledgement (selected decline reason) at the editing_session level, when available
    if (!is.null(reference_check_rejects_data) &&
        all(c("editing_session", "reject_reason") %in% names(reference_check_rejects_data)) &&
        ("editing_session" %in% names(kpi1_df))) {
      ack_sessions <- reference_check_rejects_data %>%
        renorm_buckets() %>%
        filter(reject_reason %in% valid_reasons) %>%
        distinct(editing_session) %>%
        mutate(ack_reason_selected = 1L)
      kpi1_df <- kpi1_df %>% left_join(ack_sessions, by = "editing_session")
    }

    # If the acknowledgement join above did not run (rejects data unavailable),
    # create the column so the mutate below does not fail on a missing object.
    if (!("ack_reason_selected" %in% names(kpi1_df))) {
      kpi1_df$ack_reason_selected <- NA_integer_
    }

    kpi1_bar_df <- kpi1_df %>%
      mutate(
        ref_included = dplyr::if_else(is.na(.data[[ref_flag]]), 0L, as.integer(.data[[ref_flag]] == 1)),
        ack_reason_selected = dplyr::if_else(is.na(ack_reason_selected), 0L, as.integer(ack_reason_selected == 1)),
        has_ref_ack = dplyr::if_else(
          test_group == "test",
          as.integer((ref_included == 1) | (ack_reason_selected == 1)),
          ref_included
        )
      ) %>%
      group_by(test_group) %>%
      summarise(rate = mean(has_ref_ack, na.rm = TRUE), n = n(), .groups = "drop")

    kpi1_bar <- kpi1_bar_df %>%
      mutate(label = scales::percent(rate, accuracy = 0.1)) %>%
      ggplot(aes(x = test_group, y = rate, fill = test_group)) +
      geom_col() +
      geom_text(aes(label = label), vjust = -0.2, size = 3) +
      scale_y_continuous(labels = scales::percent_format(), expand = expansion(mult = c(0, 0.12))) +
      scale_fill_manual(values = c("control" = "#999999", "test" = "dodgerblue4")) +
      labs(title = "KPI1: Reference Added or acknowledged why a citation was not added",
           x = "Test group",
           y = "Percent of new-content edits") +
      pc_theme() +
      guides(fill = "none")

    print(kpi1_bar)
  } else {
    message("KPI1 plot: reference flag or required columns missing")
  }
} else {
  message("KPI1 plot: data not loaded")
}

Chart note (definition of Rate / denominator)

  • KPI 1 (Reference Added or acknowledged why a citation was not added):
    • Rate = mean(has_ref_ack) where has_ref_ack is a 0/1 flag. In the test group, outcome=1 if the edit either includes a reference (was_reference_included == 1) or the user selected one of the four valid decline reasons. In the control group, outcome=1 if the edit includes a reference.
    • Denominator = new-content edit rows in reference_check_save_data (is_new_content == 1) within the analysis groups (test = RC shown at least once; control = eligible-but-not-shown).

4.2.2 KPI #1b New reference included

Metric: Proportion of published edits that add new content, are constructive (not reverted within 48 hours), and include at least one net new reference.

Methodology: We analyze published edits that add new content and exclude edits reverted within 48 hours.

Test group: The test group includes editing sessions where Reference Check was shown at least once during the editing session.

Control group: The control group includes published edits identified as eligible but not shown Reference Check.

Population definition (KPI #1b): A published edit is included if is_new_content == 1 and was_reverted != 1 (i.e., not reverted within 48 hours).

Outcome definition (KPI #1b): An edit is counted (outcome=1) if it includes at least one net new reference (in this notebook: was_reference_included == 1 when available).

Important: population / comparability note

This notebook’s primary KPI #1b reporting uses shown vs eligible-not-shown analysis groups:

  • Test = new-content edits where Reference Check was shown at least once.
  • Control = new-content edits tagged eligible but not shown.

This is an exposure-style (per-protocol) estimate: it answers “what is the effect when Reference Check is actually shown?”.

For comparability with prior reports (e.g., 2024), we also report an availability / intent-to-treat (ITT) version for KPI #1b:

  • Test vs Control assignment buckets, among all published new-content edits (not restricted to shown/eligible). The denominator is all new-content edits, and the outcome is 1 only if the edit both includes a new reference and is not reverted within 48 hours.

Why KPI #1b? KPI #1 counts edits that either include a new reference or explicitly acknowledge why a citation was not added. KPI #1b removes the acknowledgement component to isolate the effect on references actually added.

Results: When Reference Check was shown, editors were much more likely to add a new reference. This effect is large and statistically significant in the pooled model (adjusted for platform). How big is the change:

  • Desktop: Editors were ~2.2× more likely to add a new reference (30.7% → 68.2%).
  • Mobile-web: Editors were ~17.5× more likely to add a new reference (2.8% → 48.9%).

The increase in references added is substantial on both platforms. Across both adjusted models and simpler comparisons, the evidence is clear: Reference Check materially increases the likelihood that editors add a new reference.

Note: KPI #1 and KPI #1b above compare edits where Reference Check was shown with edits that were eligible but not shown (exposure-style, per-protocol). For KPI #1b, this is further limited to constructive new-content edits that were not reverted within 48 hours.

In one sentence: Among constructive new‑content edits (not reverted within 48 hours), edits where Reference Check was shown were ~2.2× more likely on desktop (30.7% → 68.2%) and ~17.5× more likely on mobile web (2.8% → 48.9%) to include at least one net new reference compared to eligible edits where Reference Check was not shown.

KPI #1b (Shown/Eligible; reference added on constructive new-content edits):

  • Overall: ↑ +38.7 pp (26.5% → 65.2%), +146.0% relative (~2.46×)
  • Desktop: ↑ +37.5 pp (30.7% → 68.2%), +122.1% relative (~2.2×)
  • Mobile web: ↑ +46.1 pp (2.8% → 48.9%), +1,646.4% relative (~17.5×)
  • Evidence: glm (Table 1b) OR = 5.56 (95% CI 4.65–6.66), p < 0.001
  • Relax (relative lift): Bayesian = +1.23 (95% CrI 1.00–1.46, P(test better) = 1.00); frequentist = +1.46 (95% CI 1.21–1.71), p < 0.001
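
To relate the adjusted odds ratio above back to rates, a quick conversion sketch using the overall control rate from this section (the implied test rate differs slightly from the raw 65.2% because the OR is covariate-adjusted):

```r
# Convert an odds ratio into the implied test-group rate, given the control rate
or <- 5.56   # adjusted OR from Table 1b
p0 <- 0.265  # overall control rate (26.5%)

odds1 <- (p0 / (1 - p0)) * or
p1 <- odds1 / (1 + odds1)
round(p1, 3)  # ~0.667, i.e. an implied test rate of about 66.7%
```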

Code
# KPI1b bar (new reference included), by test_group
if (!is.null(reference_check_save_data)) {
  kpi1b_df <- reference_check_save_data %>%
    make_rc_ab_group_published() %>%
    filter(is_new_content == 1)

  # KPI #1b population: constructive new-content edits only (exclude reverted within 48h)
  if ("was_reverted" %in% names(kpi1b_df)) {
    kpi1b_df <- kpi1b_df %>% filter(is.na(was_reverted) | was_reverted != 1)
  }

  # KPI1b uses reference-included only (no acknowledgement)
  ref_col <- if ("was_reference_included" %in% names(kpi1b_df)) {
    "was_reference_included"
  } else {
    pick_first(c("was_reference_included", "reference_added", "has_reference_added", "has_reference"), kpi1b_df)
  }

  if (!is.null(ref_col) && all(c("test_group") %in% names(kpi1b_df))) {
    kpi1b_bar <- kpi1b_df %>%
      mutate(ref_included = ifelse(is.na(.data[[ref_col]]), 0L, as.integer(.data[[ref_col]] == 1))) %>%
      group_by(test_group) %>%
      summarise(rate = mean(ref_included, na.rm = TRUE), n = n(), .groups = "drop") %>%
      mutate(label = scales::percent(rate, accuracy = 0.1)) %>%
      ggplot(aes(x = test_group, y = rate, fill = test_group)) +
      geom_col() +
      geom_text(aes(label = label), vjust = -0.2, size = 3) +
      scale_y_continuous(labels = scales::percent_format(), expand = expansion(mult = c(0, 0.12))) +
      scale_fill_manual(values = c("control" = "#999999", "test" = "dodgerblue4")) +
      labs(title = "KPI1b: New reference included",
           x = "Test group",
           y = "Percent of constructive new-content edits") +
      pc_theme() +
      guides(fill = "none")

    print(kpi1b_bar)
  } else {
    message("KPI1b plot: required columns missing (reference flag or test_group)")
  }
} else {
  message("KPI1b plot: data not loaded")
}

Chart note (definition of Rate / denominator)

  • KPI 1b (new reference included):
    • Rate = mean(ref_included) where ref_included is a 0/1 flag (1 = at least one net new reference included).
    • Denominator = constructive new-content edit rows in reference_check_save_data where is_new_content == 1 and was_reverted != 1 (not reverted within 48 hours), within the analysis groups (test = RC shown at least once; control = eligible-but-not-shown).
Code
# KPI #1b tables (platform control vs test and deltas)
# Updated per methodology:
# - test = new-content edits where RC was shown at least once
# - control = new-content edits eligible-but-not-shown
# - outcome: reference included only (no acknowledgement)
if (!is.null(reference_check_save_data)) {
  kpi1b_df <- reference_check_save_data %>%
    make_rc_ab_group_published() %>%
    filter(is_new_content == 1)

  # KPI #1b population: constructive new-content edits only (exclude reverted within 48h)
  if ("was_reverted" %in% names(kpi1b_df)) {
    kpi1b_df <- kpi1b_df %>% filter(is.na(was_reverted) | was_reverted != 1)
  }

  # pick_first already tries "was_reference_included" first, so the if/else wrapper is redundant
  ref_col <- pick_first(c("was_reference_included", "reference_added", "has_reference_added", "has_reference"), kpi1b_df)

  needed_cols <- c("test_group", "platform")
  missing <- setdiff(c(needed_cols, if (is.null(ref_col)) character() else ref_col), names(kpi1b_df))

  if (is.null(ref_col)) {
    message("KPI #1b tables: reference-included flag not found in data")
  } else if (length(missing) > 0) {
    message("KPI #1b tables: missing columns: ", paste(missing, collapse = ", "))
  } else {
    kpi1b_df <- kpi1b_df %>%
      mutate(ref_included = ifelse(is.na(.data[[ref_col]]), 0L, as.integer(.data[[ref_col]] == 1)))

    # Overall (control vs test) + change vs control
    kpi1b_overall_rates <- make_rate_table(kpi1b_df, "ref_included", group_cols = c("test_group")) %>%
      mutate(scope = "Overall")
    kpi1b_overall_rel <- make_rel_change_dim(kpi1b_overall_rates, dim_col = "scope")

    render_rate_rel(
      kpi1b_overall_rates,
      kpi1b_overall_rel,
      "KPI #1b: new reference included (overall)",
      "KPI #1b: change vs control (overall)",
      c(test_group = "Test group", scope = "Scope", rate = "Rate", n = "Count (edits)"),
      note_rate = "Rate = mean(0/1 outcome) where outcome=1 means at least one net new reference included (per edit). Denominator = constructive new-content edits (rows) in `reference_check_save_data` where `is_new_content == 1` and `was_reverted != 1` (not reverted within 48 hours), within each analysis group (shown test vs eligible-not-shown control)."
    )

    # By platform (control vs test) + change vs control
    kpi1b_rates <- kpi1b_df %>%
      make_rate_table("ref_included")

    kpi1b_rel <- make_rel_change(kpi1b_rates)

    render_rate_rel(
      kpi1b_rates, kpi1b_rel,
      "KPI #1b: new reference included (by platform)",
      "KPI #1b: change vs control (by platform)",
      c(test_group = "Test group", platform = "Platform", rate = "Rate", n = "Count (edits)"),
      note_rate = "Rate = mean(0/1 outcome) where outcome=1 means at least one net new reference included (per edit). Denominator = constructive new-content edits (rows) in `reference_check_save_data` where `is_new_content == 1` and `was_reverted != 1` (not reverted within 48 hours), within each analysis group (shown test vs eligible-not-shown control) for each (test group × platform)."
    )

    # User experience breakdown (Unregistered / Newcomer / Junior Contributor)
    if ("experience_level_group" %in% names(kpi1b_df)) {
      kpi1b_exp_df <- kpi1b_df %>%
        filter(!is.na(experience_level_group), experience_level_group %in% c("Unregistered", "Newcomer", "Junior Contributor"))

      kpi1b_exp_rates <- make_rate_table(kpi1b_exp_df, "ref_included", group_cols = c("test_group", "experience_level_group"))
      kpi1b_exp_rel <- make_rel_change_dim(kpi1b_exp_rates, dim_col = "experience_level_group")

      render_rate_rel(
        kpi1b_exp_rates, kpi1b_exp_rel,
        "KPI #1b: new reference included (by user experience)",
        "KPI #1b: change vs control (by user experience)",
        c(test_group = "Test group", experience_level_group = "User experience", rate = "Rate", n = "Count (edits)"),
        note_rate = "Rate = mean(0/1 outcome) where outcome=1 means at least one net new reference included (per edit). Denominator = constructive new-content edits (rows) in `reference_check_save_data` where `is_new_content == 1` and `was_reverted != 1` (not reverted within 48 hours), within each analysis group (shown test vs eligible-not-shown control) for each (test group × user experience)."
      )
    } else {
      message("KPI #1b user experience tables: experience_level_group not available in reference_check_save_data")
    }
  }
} else {
  message("KPI #1b tables: data not loaded")
}
KPI #1b: new reference included (overall)
Test group Rate Count (edits) Scope
control 26.5% 1196 Overall
test 65.2% 1195 Overall

Table note: Rate = mean(0/1 outcome) where outcome=1 means at least one net new reference included (per edit). Denominator = constructive new-content edits (rows) in reference_check_save_data where is_new_content == 1 and was_reverted != 1 (not reverted within 48 hours), within each analysis group (shown test vs eligible-not-shown control).

KPI #1b: change vs control (overall)
Scope Control rate Test rate Absolute difference (pp) Relative change vs control N (control) N (test)
Overall 26.5% 65.2% 38.7 145.9% 1196 1195

Table note: Absolute difference (pp) = (test − control) × 100. Relative change = (test − control) / control.
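The delta arithmetic in these change-vs-control tables can be sketched directly from the rounded rates above (the tables themselves are computed from unrounded rates, so the last decimal can differ slightly):

```r
# Delta arithmetic for the change-vs-control tables, using the rounded
# KPI #1b overall rates from the table above.
control_rate <- 0.265  # control: new reference included
test_rate    <- 0.652  # test: new reference included

abs_diff_pp <- (test_rate - control_rate) * 100          # percentage points
rel_change  <- (test_rate - control_rate) / control_rate # relative change vs control

round(abs_diff_pp, 1)       # 38.7 pp
round(rel_change * 100, 1)  # ~146.0% (the table's 145.9% uses unrounded rates)
```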

KPI #1b: new reference included (by platform)
Test group Platform Rate Count (edits)
control desktop 30.7% 1015
control mobile-web 2.8% 181
test desktop 68.2% 1009
test mobile-web 48.9% 186

Table note: Rate = mean(0/1 outcome) where outcome=1 means at least one net new reference included (per edit). Denominator = constructive new-content edits (rows) in reference_check_save_data where is_new_content == 1 and was_reverted != 1 (not reverted within 48 hours), within each analysis group (shown test vs eligible-not-shown control) for each (test group × platform).

KPI #1b: change vs control (by platform)
Platform Control rate Test rate Absolute difference (pp) Relative change vs control N (control) N (test)
desktop 30.7% 68.2% 37.4 121.8% 1015 1009
mobile-web 2.8% 48.9% 46.2 1,671.1% 181 186

Table note: Absolute difference (pp) = (test − control) × 100. Relative change = (test − control) / control.

KPI #1b: new reference included (by user experience)
Test group User experience Rate Count (edits)
control Unregistered 13.8% 160
control Newcomer 31.6% 136
control Junior Contributor 28.0% 900
test Unregistered 63.9% 183
test Newcomer 69.2% 143
test Junior Contributor 64.8% 869

Table note: Rate = mean(0/1 outcome) where outcome=1 means at least one net new reference included (per edit). Denominator = constructive new-content edits (rows) in reference_check_save_data where is_new_content == 1 and was_reverted != 1 (not reverted within 48 hours), within each analysis group (shown test vs eligible-not-shown control) for each (test group × user experience).

KPI #1b: change vs control (by user experience)
User experience Control rate Test rate Absolute difference (pp) Relative change vs control N (control) N (test)
Unregistered 13.8% 63.9% 50.2 365.0% 160 183
Newcomer 31.6% 69.2% 37.6 119.0% 136 143
Junior Contributor 28.0% 64.8% 36.8 131.4% 900 869

Table note: Absolute difference (pp) = (test − control) × 100. Relative change = (test − control) / control.

Code
# KPI #1b by checks shown (bucketed) + platform/user_status slice
# Updated per methodology (shown test vs eligible-not-shown control)
if (!is.null(reference_check_save_data)) {
  kpi1b_df <- reference_check_save_data %>%
    make_rc_ab_group_published() %>%
    filter(is_new_content == 1)

  # KPI #1b population: constructive new-content edits only (exclude reverted within 48h)
  if ("was_reverted" %in% names(kpi1b_df)) {
    kpi1b_df <- kpi1b_df %>% filter(is.na(was_reverted) | was_reverted != 1)
  }

  # pick_first already tries "was_reference_included" first, so the if/else wrapper is redundant
  ref_col <- pick_first(c("was_reference_included", "reference_added", "has_reference_added", "has_reference"), kpi1b_df)

  if (!is.null(ref_col) && all(c("test_group", "n_checks_shown") %in% names(kpi1b_df))) {
    kpi1b_df <- kpi1b_df %>%
      mutate(ref_included = ifelse(is.na(.data[[ref_col]]), 0L, as.integer(.data[[ref_col]] == 1)))

    df_kpi1b_checks <- kpi1b_df %>%
      # Checks-shown buckets come from RC shown events; we report this slice for the test group only.
      filter(test_group == "test") %>%
      mutate(checks_bucket = case_when(
        is.na(n_checks_shown) ~ "unknown",
        n_checks_shown == 0 ~ "0",
        n_checks_shown == 1 ~ "1",
        n_checks_shown == 2 ~ "2",
        n_checks_shown >= 3 ~ "3+"
      )) %>%
      group_by(checks_bucket) %>%
      summarise(rate = mean(ref_included, na.rm = TRUE), n = n(), .groups = "drop")

    render_slice(
      df_kpi1b_checks,
      "KPI #1b by checks shown (test group only)",
      c(checks_bucket = "Checks shown", rate = "Rate", n = "Count (edits)"),
      note_text = "Rate = mean(0/1 outcome) in the test group only where outcome=1 means reference included (no acknowledgement). Denominator = constructive new-content edits (rows) in `reference_check_save_data` where `is_new_content == 1` and `was_reverted != 1` (not reverted within 48 hours), within the shown test group for each checks-shown bucket. Control is excluded because it has no comparable checks-shown event stream."
    )

    # KPI #1b by platform and user experience (Unregistered / Newcomer / Junior Contributor)
    if (all(c("platform", "experience_level_group") %in% names(kpi1b_df))) {
      kpi1b_exp_slices <- kpi1b_df %>%
        filter(!is.na(experience_level_group), experience_level_group %in% c("Unregistered", "Newcomer", "Junior Contributor")) %>%
        group_by(test_group, platform, experience_level_group) %>%
        summarise(rate = mean(ref_included, na.rm = TRUE), n = n(), .groups = "drop")

      render_slice(
        kpi1b_exp_slices,
        "KPI #1b: new reference included by platform and user experience",
        c(test_group = "Test group", platform = "Platform", experience_level_group = "User experience", rate = "Rate", n = "Count (edits)"),
        note_text = "Rate = mean(0/1 outcome) where outcome=1 means reference included (no acknowledgement). Denominator = constructive new-content edits (rows) in `reference_check_save_data` where `is_new_content == 1` and `was_reverted != 1` (not reverted within 48 hours), within each analysis group (shown test vs eligible-not-shown control) for each (test group × platform × user experience)."
      )
    } else {
      message("KPI #1b user experience slice: required columns missing in reference_check_save_data")
    }
  } else {
    message("KPI #1b by checks: required columns missing in reference_check_save_data")
  }
} else {
  message("KPI #1b by checks: data not loaded")
}
KPI #1b by checks shown (test group only)
Checks shown Rate Count (edits)
1 62.1% 832
2 66.9% 166
3+ 76.6% 197

Table note: Rate = mean(0/1 outcome) in the test group only where outcome=1 means reference included (no acknowledgement). Denominator = constructive new-content edits (rows) in reference_check_save_data where is_new_content == 1 and was_reverted != 1 (not reverted within 48 hours), within the shown test group for each checks-shown bucket. Control is excluded because it has no comparable checks-shown event stream.

KPI #1b: new reference included by platform and user experience
Test group Platform User experience Rate Count (edits)
control desktop Unregistered 18.6% 118
control desktop Newcomer 36.1% 119
control desktop Junior Contributor 31.7% 778
control mobile-web Unregistered 0.0% <50
control mobile-web Newcomer 0.0% <50
control mobile-web Junior Contributor 4.1% 122
test desktop Unregistered 67.2% 137
test desktop Newcomer 72.4% 127
test desktop Junior Contributor 67.7% 745
test mobile-web Unregistered 54.3% <50
test mobile-web Newcomer 43.8% <50
test mobile-web Junior Contributor 47.6% 124

Table note: Rate = mean(0/1 outcome) where outcome=1 means reference included (no acknowledgement). Denominator = constructive new-content edits (rows) in reference_check_save_data where is_new_content == 1 and was_reverted != 1 (not reverted within 48 hours), within each analysis group (shown test vs eligible-not-shown control) for each (test group × platform × user experience).

4.2.2.1 Confirming the impact of Reference Check

Code
# KPI #1b model (reference included only)
# Updated per methodology (shown test vs eligible-not-shown control)
if (!is.null(reference_check_save_data)) {
  df_kpi1b <- reference_check_save_data %>%
    make_rc_ab_group_published() %>%
    filter(is_new_content == 1)

  # KPI #1b population: constructive new-content edits only (exclude reverted within 48h)
  if ("was_reverted" %in% names(df_kpi1b)) {
    df_kpi1b <- df_kpi1b %>% filter(is.na(was_reverted) | was_reverted != 1)
  }

  # pick_first already tries "was_reference_included" first, so the if/else wrapper is redundant
  ref_col <- pick_first(c("was_reference_included", "reference_added", "has_reference_added", "has_reference"), df_kpi1b)

  if (!is.null(ref_col) && all(c("test_group", "platform") %in% names(df_kpi1b))) {
    df_kpi1b <- df_kpi1b %>%
      mutate(ref_included = ifelse(is.na(.data[[ref_col]]), 0L, as.integer(.data[[ref_col]] == 1)))

    tryCatch({
      m_kpi1b <- glm(ref_included ~ test_group + platform, data = df_kpi1b, family = binomial())
      render_binom_model(
        m_kpi1b,
        "Table 1b. Adjusted odds ratios (ORs) from multivariable logistic regression for KPI #1b outcome among constructive new-content edits.",
        note_text = "Outcome=1 means at least one net new reference included on a constructive new-content edit (not reverted within 48h). Population is restricted to shown test vs eligible-not-shown control. Adjusted for platform. OR>1 indicates higher odds of the outcome."
      )
    }, error = function(e) {
      message("KPI #1b model error: ", e$message)
    })

    # KPI #1b Bayesian lift (relax) — reference included only
    kpi1b_relax_df <- df_kpi1b %>%
      transmute(
        outcome = ref_included,
        variation = dplyr::case_when(
          test_group == "control" ~ "control",
          test_group == "test" ~ "treatment",
          TRUE ~ as.character(test_group)
        )
      )
    render_relax(kpi1b_relax_df, "KPI #1b", metric_type = "proportion", better = "higher")
  } else {
    message("KPI #1b model: required columns missing")
  }
} else {
  message("KPI #1b model: data not loaded")
}
Table 1b. Adjusted odds ratios (ORs) from multivariable logistic regression for KPI #1b outcome among constructive new-content edits.
Term OR CI low CI high SE p-value
Intercept 0.414 0.362 0.471 0.067 <0.001
test_grouptest 5.555 4.645 6.659 0.092 <0.001
platformmobile-web 0.301 0.229 0.392 0.137 <0.001

Table note: Outcome=1 means at least one net new reference included on a constructive new-content edit (not reverted within 48h). Population is restricted to shown test vs eligible-not-shown control. Adjusted for platform. OR>1 indicates higher odds of the outcome.
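As a sanity check, the intercept and treatment odds ratios from Table 1b can be back-transformed to predicted rates. This is a sketch: the intercept is the baseline odds for the reference level (desktop control), so these are desktop-reference predictions rather than the raw marginal rates, and they will not match the observed rates exactly.

```r
# Back-transform Table 1b coefficients (odds scale) to probabilities.
# Intercept = baseline odds (desktop control); test_grouptest OR multiplies those odds.
odds_control_desktop <- 0.414  # intercept (odds)
or_test              <- 5.555  # test_grouptest

p_control <- odds_control_desktop / (1 + odds_control_desktop)
odds_test <- odds_control_desktop * or_test
p_test    <- odds_test / (1 + odds_test)

round(p_control, 3)  # ~0.293 (observed desktop control: 30.7%)
round(p_test, 3)     # ~0.697 (observed desktop test: 68.2%)
```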

KPI #1b
Relative lift ((Treatment − Control) / Control)
Bayesian Analysis
Frequentist Analysis
Point Estimate Chance to Win P(Treatment better) 95% CrI Lower 95% CrI Upper Point Estimate p-value 95% CI Lower 95% CI Upper
1.231 1.000 1.000 0.998 1.464 1.459 0.000 1.206 1.713

Interpretation: Based on relax, the posterior probability that treatment is better than control is 100.0% (computed as Chance to Win).
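The "Chance to Win" quantity can be illustrated with a flat-prior Monte Carlo sketch over independent Beta posteriors. The success counts below are reconstructed from the rounded rates and Ns in the overall KPI #1b table; relax may use different priors, so this sketch will not reproduce its interval exactly.

```r
# Sketch: P(treatment better) via independent Beta(1,1) posteriors,
# with counts reconstructed from the rounded rates and Ns above.
set.seed(42)
n_c <- 1196; x_c <- round(0.265 * n_c)  # control: ~317 edits with a new reference
n_t <- 1195; x_t <- round(0.652 * n_t)  # test:    ~779 edits with a new reference

draws_c <- rbeta(1e5, 1 + x_c, 1 + n_c - x_c)
draws_t <- rbeta(1e5, 1 + x_t, 1 + n_t - x_t)

mean(draws_t > draws_c)                         # ~1 ("Chance to Win")
quantile(draws_t / draws_c - 1, c(.025, .975))  # ~95% interval on relative lift
```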

4.2.2.2 KPI #1b (availability / ITT)

This section reports KPI #1b using assignment buckets (control vs test) among all published new-content edits (not restricted to shown/eligible). This matches the 2024-style definition where the denominator is all new-content edits and the outcome is 1 only if the edit both (a) includes a new reference and (b) is not reverted within 48 hours.

Results: We also report KPI #1b by test vs. control assignment, regardless of whether Reference Check was shown (intent-to-treat). This view includes all published new-content edits in the target population and is directly comparable to the 2024 Reference Check study. Under this more conservative view, Reference Check still shows a benefit: edits in the test group were more likely to be constructive new-content edits that included a net new reference. How big is the change (ITT):

  • Overall: 56.3% → 68.3% (+12.1 pp, +21.5% relative)
  • Desktop: 60.5% → 70.6% (+10.2 pp, +16.8% relative)
  • Mobile web: 22.0% → 47.8% (+25.8 pp; editors were ~2.2× more likely to add a reference)

Even under the conservative ITT lens (not restricted to shown/eligible), edits in the test group were more likely to be constructive new-content edits that included a net new reference, especially on mobile-web; this lift is statistically significant in the adjusted ITT model.
Note: The 2024 Reference Check report stated: “Users [i.e., based on an edit-level (edit-session) comparison, not unique-user aggregation] were 2.2 times more likely to publish a new content edit that included a reference and was constructive (not reverted within 48 hours) when reference check was shown to eligible edits.” It also noted: “On mobile, new content edits by contributors were 4.2 times more likely to include a reference and not be reverted when reference check was shown to eligible edits.”

KPI #1b (Availability / ITT; 2024-comparable): ↑ +12.1 pp (56.3% → 68.3%), +21.5% relative.
Evidence: glm (Table 1c) OR = 1.69 (95% CI 1.54–1.86), p < 0.001.
Relax (ITT, relative lift): +0.215 (Bayesian 95% CrI 0.172–0.255; Frequentist 95% CI 0.173–0.256; p < 0.001).
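A quick consistency check on these summary figures, using the rounded numbers above (a sketch; small discrepancies are rounding):

```r
# The relax relative-lift point estimate should reproduce the test rate
# when applied to the control rate.
control_rate <- 0.563
rel_lift     <- 0.215  # frequentist point estimate from relax (ITT)

round(control_rate * (1 + rel_lift), 3)  # ~0.684, matching the reported 68.3%
```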

Code
# KPI #1b (availability / ITT; 2024-comparable): constructive reference-including edits
# - Uses assignment buckets (control vs test) among all new-content edits
# - Denominator: all published new-content edits in the target population (<=100 edits or unregistered)
# - Outcome: 1 only if (reference included) AND (not reverted within 48h)
if (!is.null(reference_check_save_data) &&
    all(c("test_group", "is_new_content", "was_reverted") %in% names(reference_check_save_data))) {

  df_itt <- reference_check_save_data %>%
    renorm_buckets() %>%
    filter(is_new_content == 1)

  # Match target population used elsewhere: unregistered OR <= 100 edits
  if (all(c("user_status", "user_edit_count") %in% names(df_itt))) {
    df_itt <- df_itt %>%
      filter(user_status == "unregistered" | (!is.na(user_edit_count) & user_edit_count <= 100))
  }

  # Outcome flag for reference included
  # pick_first already tries "was_reference_included" first, so the if/else wrapper is redundant
  ref_col <- pick_first(c("was_reference_included", "reference_added", "has_reference_added", "has_reference"), df_itt)

  if (is.null(ref_col)) {
    message("KPI #1b (ITT): reference-included flag not found in data")
  } else {
    df_itt <- df_itt %>%
      mutate(
        ref_included = ifelse(is.na(.data[[ref_col]]), 0L, as.integer(.data[[ref_col]] == 1)),
        unreverted_48h = ifelse(is.na(was_reverted), NA_integer_, as.integer(was_reverted != 1)),
        outcome = ifelse(is.na(unreverted_48h), NA_integer_, as.integer((ref_included == 1) & (unreverted_48h == 1))),
        test_group = factor(test_group, levels = c("control", "test"))
      )

    # If edits are not 1:1 with editing_session, dedupe to per-edit first
    if ("editing_session" %in% names(df_itt)) {
      df_itt <- df_itt %>%
        group_by(editing_session) %>%
        summarise(
          test_group = dplyr::first(test_group),
          platform = dplyr::first(if ("platform" %in% names(df_itt)) platform else NA),
          outcome = {
            if (all(is.na(outcome))) {
              NA_real_
            } else {
              base::max(outcome, na.rm = TRUE)
            }
          },
          .groups = "drop"
        )
    }

    # Persist the final ITT per-edit frame for the chart below
    df_kpi1b_itt <- df_itt

    # 1) Overall ITT rate + change vs control
    itt_overall_rates <- make_rate_table(df_itt, "outcome", group_cols = c("test_group")) %>%
      mutate(scope = "Overall")
    itt_overall_rel <- make_rel_change_dim(itt_overall_rates, dim_col = "scope")

    render_rate_rel(
      itt_overall_rates,
      itt_overall_rel,
      "Constructive new content edits that include a new reference and are not reverted",
      "Constructive new content edits that include a new reference and are not reverted: change vs control",
      c(test_group = "Experiment group", scope = "Scope", rate = "Rate", n = "Count (edits)"),
      note_rate = paste(
        "Rate = mean(0/1 outcome) where outcome=1 means (new reference included) AND (not reverted within 48h) (per edit).",
        "Denominator = all published new-content edits within each assignment bucket (control vs test)",
        "in the target population (unregistered OR <=100 edits)."
      )
    )

    # 2) By platform ITT rate + change vs control
    if ("platform" %in% names(df_itt)) {
      df_itt <- df_itt %>% mutate(platform = factor(platform))
      itt_rates <- make_rate_table(df_itt, "outcome", group_cols = c("test_group", "platform"))
      itt_rel <- make_rel_change(itt_rates)

      render_rate_rel(
        itt_rates,
        itt_rel,
        "Constructive new content edits that include a new reference and are not reverted (by platform)",
        "Constructive new content edits that include a new reference and are not reverted: change vs control (by platform)",
        c(test_group = "Experiment group", platform = "Platform", rate = "Rate", n = "Count (edits)"),
        note_rate = paste(
          "Rate = mean(0/1 outcome) where outcome=1 means (new reference included) AND (not reverted within 48h) (per edit).",
          "Denominator = all published new-content edits within each (assignment bucket × platform)",
          "in the target population (unregistered OR <=100 edits)."
        )
      )
    }

  }
} else {
  message("KPI #1b (ITT): data not loaded or required columns missing")
}
Constructive new content edits that include a new reference and are not reverted
Experiment group Rate Count (edits) Scope
control 56.3% 4074 Overall
test 68.3% 4132 Overall

Table note: Rate = mean(0/1 outcome) where outcome=1 means (new reference included) AND (not reverted within 48h) (per edit). Denominator = all published new-content edits within each assignment bucket (control vs test) in the target population (unregistered OR <=100 edits).

Constructive new content edits that include a new reference and are not reverted: change vs control
Scope Control rate Test rate Absolute difference (pp) Relative change vs control N (control) N (test)
Overall 56.3% 68.3% 12.1 21.5% 4074 4132

Table note: Absolute difference (pp) = (test − control) × 100. Relative change = (test − control) / control.

Constructive new content edits that include a new reference and are not reverted (by platform)
Experiment group Platform Rate Count (edits)
control desktop 60.5% 3629
control mobile-web 22.0% 445
test desktop 70.6% 3718
test mobile-web 47.8% 414

Table note: Rate = mean(0/1 outcome) where outcome=1 means (new reference included) AND (not reverted within 48h) (per edit). Denominator = all published new-content edits within each (assignment bucket × platform) in the target population (unregistered OR <=100 edits).

Constructive new content edits that include a new reference and are not reverted: change vs control (by platform)
Platform Control rate Test rate Absolute difference (pp) Relative change vs control N (control) N (test)
desktop 60.5% 70.6% 10.2 16.8% 3629 3718
mobile-web 22.0% 47.8% 25.8 117.2% 445 414

Table note: Absolute difference (pp) = (test − control) × 100. Relative change = (test − control) / control.

Code
# KPI #1b (availability / ITT) bar chart — 2024-comparable
# Constructive new content edits that include a new reference and are not reverted
if (exists("df_kpi1b_itt") && !is.null(df_kpi1b_itt) && all(c("test_group", "outcome") %in% names(df_kpi1b_itt))) {
  kpi1b_itt_bar_df <- df_kpi1b_itt %>%
    filter(!is.na(outcome)) %>%
    group_by(test_group) %>%
    summarise(rate = mean(outcome, na.rm = TRUE), n = dplyr::n(), .groups = "drop") %>%
    mutate(label = scales::percent(rate, accuracy = 0.1))

  kpi1b_itt_bar_df %>%
    ggplot(aes(x = test_group, y = rate, fill = test_group)) +
    geom_col(width = 0.9) +
    geom_text(aes(label = label), vjust = 1.2, color = "white", size = 6) +
    scale_y_continuous(labels = scales::percent) +
    scale_fill_manual(values = c("control" = "#999999", "test" = "steelblue2")) +
    labs(
      title = "Constructive new content edits that include a new reference",
      x = "Experiment Group",
      y = "Percent of new content edits"
    ) +
    pc_theme() +
    guides(fill = "none")
} else {
  message("KPI #1b (ITT) bar: df_kpi1b_itt not available; run the ITT section above")
}

Chart note (definition of Rate / denominator)

  • Rate = mean(0/1 outcome) where outcome=1 means the edit both included a new reference and was not reverted within 48 hours.
  • Denominator = all published new-content edits in the target population (unregistered users or users with 100 or fewer edits), within each assignment bucket (control vs test).
  • Interpretation: this is the 2024-comparable “availability / ITT” view; it is not restricted to shown/eligible edits.

4.2.2.3 Confirming the impact of Reference Check

Code
# KPI #1b (availability / ITT) confirmation
# Uses the ITT per-edit frame created above (df_kpi1b_itt)
if (exists("df_kpi1b_itt") && !is.null(df_kpi1b_itt) && all(c("test_group", "outcome") %in% names(df_kpi1b_itt))) {
  df_itt_stats <- df_kpi1b_itt

  # 1) Adjusted logistic regression (glm)
  if ("platform" %in% names(df_itt_stats)) {
    tryCatch({
      m_itt <- glm(outcome ~ test_group + platform, data = df_itt_stats, family = binomial())
      render_binom_model(
        m_itt,
        "Table 1c. Adjusted odds ratios (ORs) from multivariable logistic regression for KPI #1b outcome among new-content edits (availability / ITT).",
        note_text = "Outcome=1 means the edit included a new reference AND was not reverted within 48 hours (per edit). Denominator includes all new-content edits in the target population (unregistered OR <=100 edits) and is not restricted to shown/eligible. Adjusted for platform. OR>1 indicates higher odds of the outcome."
      )
    }, error = function(e) {
      message("KPI #1b (ITT) model error: ", e$message)
    })
  } else {
    message("KPI #1b (ITT) model: platform not available")
  }

  # 2) Bayesian/Frequentist lift (relax)
  itt_relax_df <- df_itt_stats %>%
    transmute(
      outcome = outcome,
      variation = dplyr::case_when(
        test_group == "control" ~ "control",
        test_group == "test" ~ "treatment",
        TRUE ~ as.character(test_group)
      )
    )
  render_relax(itt_relax_df, "KPI #1b (availability / ITT)", metric_type = "proportion", better = "higher")
} else {
  message("KPI #1b (ITT) confirmation: df_kpi1b_itt not available; run the ITT section above")
}
Table 1c. Adjusted odds ratios (ORs) from multivariable logistic regression for KPI #1b outcome among new-content edits (availability / ITT).
Term OR CI low CI high SE p-value
Intercept 1.477 1.385 1.576 0.033 <0.001
test_grouptest 1.693 1.544 1.856 0.047 <0.001
platformmobile-web 0.273 0.235 0.317 0.077 <0.001

Table note: Outcome=1 means the edit included a new reference AND was not reverted within 48 hours (per edit). Denominator includes all new-content edits in the target population (unregistered OR <=100 edits) and is not restricted to shown/eligible. Adjusted for platform. OR>1 indicates higher odds of the outcome.

KPI #1b (availability / ITT)
Relative lift ((Treatment − Control) / Control)
Bayesian Analysis
Frequentist Analysis
Point Estimate Chance to Win P(Treatment better) 95% CrI Lower 95% CrI Upper Point Estimate p-value 95% CI Lower 95% CI Upper
0.214 1.000 1.000 0.172 0.255 0.215 0.000 0.173 0.256

Interpretation: Based on relax, the posterior probability that treatment is better than control is 100.0% (computed as Chance to Win).

4.2.3 KPI #2 Constructive edits

Metric: Proportion of published edits that add new content (T333714) and are constructive, defined as not reverted within 48 hours.

Methodology: This metric is computed on edits where is_new_content == 1. Constructive is defined as the edit not being reverted within 48 hours of publication.

Test group: The test group includes new-content edits where Reference Check was shown at least once during the editing session.

Control group: The control group includes new-content edits identified as eligible but not shown Reference Check.

Results: Edits shown Reference Check were more likely to be constructive, especially on mobile web. On desktop, constructive edits increased modestly from 75.5% to 77.9% (+3.2% relative), a difference that is not statistically significant in the regression model. Constructive outcomes trend higher when Reference Check is shown on both platforms, with a much larger improvement on mobile web. The most conservative adjusted model (the overall test_grouptest effect plus the platform interaction) cannot fully rule out chance at this sample size, but the simpler and relax-based comparisons point in the same direction: the test group performs better. Both the overall results and the mobile-web-only analysis show meaningful improvements, with the strongest gains on mobile web. Although we cannot definitively conclude that the mobile improvement is larger than the desktop one, the results consistently suggest this pattern. When we additionally adjust for whether an edit added a new reference, the mobile-web advantage shrinks and is no longer statistically clear, indicating that part of the benefit may operate through increased reference inclusion. Overall, the evidence suggests Reference Check improves constructive editing outcomes, especially on mobile web.

Constructive Edits (Not Reverted Within 48 Hours), KPI #2 Mobile-Web Only:
On mobile web, Reference Check meaningfully increases the likelihood that an edit is constructive: editors were 18.2% more likely to make a constructive edit (56.4% → 66.7%). The within-platform adjusted contrast is statistically significant and aligns with the unadjusted comparisons, so we treat this as strong within-platform evidence while noting that mobile web is a subgroup slice. In the conditional model, which also adjusts for reference inclusion, the mobile-web contrast is not statistically significant, consistent with the interpretation that some of the effect operates through increased reference inclusion.
In short, the results are consistent with a path where: Reference Check → references added → edits are constructive

KPI #2 (Constructive = not reverted within 48h, new-content edits):
Overall: ↑ +4.1 pp (71.8% → 75.9%), +5.7% relative.
Desktop: ↑ +2.4 pp (75.5% → 77.9%), +3.2% relative.
Mobile web: ↑ +10.3 pp (56.4% → 66.7%), +18.2% relative (within-platform contrast statistically significant).
Evidence: glm (Table 2) overall across-platform treatment term not significant (p=0.146);
mobile-web contrast significant (Table 2A: OR = 1.55, p = 0.010).
Conditional model: mobile-web contrast not significant when adjusting for reference inclusion (Table 2E: OR = 1.16, p = 0.390).
Relax (relative lift): overall +0.057 (p = 0.010); mobile-web-only +0.182 (p = 0.018).
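The overall KPI #2 comparison can be checked with an unadjusted two-proportion test, using the counts from the "Constructive (48h unreverted) among new content" table in this section. This is a sketch complementing the adjusted model, not a replacement for it:

```r
# Unadjusted two-proportion test of the overall KPI #2 result
# (constructive = not reverted within 48h), from the reported counts.
x <- c(test = 1195, control = 1196)              # constructive edits
n <- c(test = 1195 + 379, control = 1196 + 469)  # all new-content edits per group

pt <- prop.test(x, n)
round(pt$estimate, 3)  # 0.759 vs 0.718
round(pt$p.value, 3)   # ~0.01, in line with the overall relax result
```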

Code
# KPI2 bar (constructive = not reverted within 48h), by test_group
# Updated per methodology (shown test vs eligible-not-shown control)
if (!is.null(reference_check_save_data) &&
    all(c("test_group", "is_new_content", "was_reverted",
          "was_reference_check_shown", "was_reference_check_eligible") %in% names(reference_check_save_data))) {
  kpi2_bar <- reference_check_save_data %>%
    make_rc_ab_group_published() %>%
    filter(is_new_content == 1) %>%
    mutate(constructive = ifelse(was_reverted == 1, 0, 1)) %>%
    group_by(test_group) %>%
    summarise(rate = mean(constructive, na.rm = TRUE), n = n(), .groups = "drop") %>%
    mutate(label = scales::percent(rate, accuracy = 0.1)) %>%
    ggplot(aes(x = test_group, y = rate, fill = test_group)) +
    geom_col() +
    geom_text(aes(label = label), vjust = -0.2, size = 3) +
    scale_y_continuous(labels = scales::percent_format(), expand = expansion(mult = c(0, 0.12))) +
    scale_fill_manual(values = c("control" = "#999999", "test" = "dodgerblue4")) +
    labs(title = "KPI2: Constructive (not reverted within 48h)",
         x = "Test group",
         y = "Percent of new-content edits") +
    pc_theme() +
    guides(fill = "none")
  print(kpi2_bar)
} else {
  message("KPI2 plot: required columns missing in reference_check_save_data")
}

Chart note (definition of Rate / denominator)

  • KPI 2 (constructive):
    • Rate = mean(constructive) where constructive is 1 if was_reverted == 0 (not reverted within 48h) and 0 otherwise.
    • Denominator = new-content edit rows in reference_check_save_data (is_new_content == 1) within the analysis groups (test = RC shown at least once; control = eligible-but-not-shown).

Code
# 2) Constructive edits (revert within 48h) among new content
# Align with KPI #2 population (shown test vs eligible-not-shown control)
if (!is.null(reference_check_save_data) &&
    all(c("test_group", "is_new_content", "was_reverted",
          "was_reference_check_shown", "was_reference_check_eligible") %in% names(reference_check_save_data))) {
  constructive_summary <- reference_check_save_data %>%
    make_rc_ab_group_published() %>%
    filter(is_new_content == 1) %>%
    mutate(reverted_flag = ifelse(was_reverted == 1, "reverted", "not_reverted")) %>%
    count(test_group, reverted_flag) %>%
    group_by(test_group) %>%
    mutate(pct = n / sum(n)) %>%
    arrange(test_group, desc(n))
  constructive_summary <- renorm_buckets(constructive_summary)
  render_pct_table(
    constructive_summary,
    "Constructive (48h unreverted) among new content",
    c(test_group = "Test group", reverted_flag = "Revert status", n = "Count (edits)", pct = "Percent of new-content edits"),
    note_text = "Percent of new-content edits = share of new-content edits within each analysis group (shown test vs eligible-not-shown control). <br>`Revert status` is derived from `was_reverted` (within 48h)."
  )
} else {
  message("Constructive summary: required columns missing in reference_check_save_data")
}
Constructive (48h unreverted) among new content
Revert status Count (edits) Percent of new-content edits
control
not_reverted 1196 71.8%
reverted 469 28.2%
test
not_reverted 1195 75.9%
reverted 379 24.1%

Table note: Percent of new-content edits = share of new-content edits within each analysis group (shown test vs eligible-not-shown control).
Revert status is derived from was_reverted (within 48h).

Code
# KPI #2 tables (constructive) by platform and deltas
# Updated per methodology (shown test vs eligible-not-shown control)
if (!is.null(reference_check_save_data) &&
    all(c("test_group", "platform", "is_new_content", "was_reverted",
          "was_reference_check_shown", "was_reference_check_eligible") %in% names(reference_check_save_data))) {
  df_kpi2 <- reference_check_save_data %>%
    make_rc_ab_group_published() %>%
    filter(is_new_content == 1) %>%
    mutate(constructive = ifelse(was_reverted == 1, 0, 1))

  # Overall (control vs test) + change vs control
  kpi2_overall_rates <- make_rate_table(df_kpi2, "constructive", group_cols = c("test_group")) %>%
    mutate(scope = "Overall")
  kpi2_overall_rel <- make_rel_change_dim(kpi2_overall_rates, dim_col = "scope")

  render_rate_rel(
    kpi2_overall_rates,
    kpi2_overall_rel,
    "KPI #2: constructive (48h unreverted) overall",
    "KPI #2: change vs control (overall)",
    c(test_group = "Test group", scope = "Scope", rate = "Rate", n = "Count (edits)"),
    note_rate = "Rate = mean(0/1 outcome) where outcome=1 means not reverted within 48h (per edit). Denominator = new-content edits (rows) in `reference_check_save_data` within each analysis group (shown test vs eligible-not-shown control)."
  )

  # By platform (control vs test) + change vs control
  kpi2_rates <- df_kpi2 %>% make_rate_table("constructive")
  kpi2_rel <- make_rel_change(kpi2_rates)

  render_rate_rel(
    kpi2_rates, kpi2_rel,
    "KPI #2: constructive (48h unreverted) by platform",
    "KPI #2: change vs control (by platform)",
    c(test_group = "Test group", platform = "Platform", rate = "Rate", n = "Count (edits)"),
    note_rate = "Rate = mean(0/1 outcome) where outcome=1 means not reverted within 48h (per edit). Denominator = new-content edits (rows) in `reference_check_save_data` within each analysis group (shown test vs eligible-not-shown control) for each (test group × platform).",
    group_col = "platform",
    within_group_order_col = "test_group",
    within_group_order = c("control", "test")
  )

  # User experience breakdown (Unregistered / Newcomer / Junior Contributor)
  if ("experience_level_group" %in% names(df_kpi2)) {
    kpi2_exp_df <- df_kpi2 %>%
      filter(!is.na(experience_level_group), experience_level_group %in% c("Unregistered", "Newcomer", "Junior Contributor"))

    kpi2_exp_rates <- make_rate_table(kpi2_exp_df, "constructive", group_cols = c("test_group", "experience_level_group"))
    kpi2_exp_rel <- make_rel_change_dim(kpi2_exp_rates, dim_col = "experience_level_group")

    render_rate_rel(
      kpi2_exp_rates, kpi2_exp_rel,
      "KPI #2: constructive (48h unreverted) by user experience",
      "KPI #2: change vs control (by user experience)",
      c(test_group = "Test group", experience_level_group = "User experience", rate = "Rate", n = "Count (edits)"),
      note_rate = "Rate = mean(0/1 outcome) where outcome=1 means not reverted within 48h (per edit). Denominator = new-content edits (rows) in `reference_check_save_data` within each analysis group (shown test vs eligible-not-shown control) for each (test group × user experience)."
    )
  } else {
    message("KPI #2 user experience tables: experience_level_group not available in reference_check_save_data")
  }
} else {
  message("KPI #2 tables: required columns missing in reference_check_save_data")
}
KPI #2: constructive (48h unreverted) overall
Test group Rate Count (edits) Scope
control 71.8% 1665 Overall
test 75.9% 1574 Overall

Table note: Rate = mean(0/1 outcome) where outcome=1 means not reverted within 48h (per edit). Denominator = new-content edits (rows) in reference_check_save_data within each analysis group (shown test vs eligible-not-shown control).

KPI #2: change vs control (overall)
Scope Control rate Test rate Absolute difference (pp) Relative change vs control N (control) N (test)
Overall 71.8% 75.9% 4.1 5.7% 1665 1574

Table note: Absolute difference (pp) = (test − control) × 100. Relative change = (test − control) / control.
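The delta formulas in the table note can be reproduced directly from the published counts; a minimal check using the overall KPI #2 cells above:

```r
# Recompute the overall KPI #2 deltas from the published counts.
# Same formulas as the table note: pp = (test - control) * 100,
# relative = (test - control) / control.
control_rate <- 1196 / 1665   # not-reverted share, control
test_rate    <- 1195 / 1574   # not-reverted share, test

abs_diff_pp <- (test_rate - control_rate) * 100
rel_change  <- (test_rate - control_rate) / control_rate

round(abs_diff_pp, 1)  # 4.1 pp
round(rel_change, 3)   # 0.057 (+5.7%)
```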

KPI #2: constructive (48h unreverted) by platform
Test group Rate Count (edits)
desktop
control 75.5% 1344
test 77.9% 1295
mobile-web
control 56.4% 321
test 66.7% 279

Table note: Rate = mean(0/1 outcome) where outcome=1 means not reverted within 48h (per edit). Denominator = new-content edits (rows) in reference_check_save_data within each analysis group (shown test vs eligible-not-shown control) for each (test group × platform).

KPI #2: change vs control (by platform)
Platform Control rate Test rate Absolute difference (pp) Relative change vs control N (control) N (test)
desktop 75.5% 77.9% 2.4 3.2% 1344 1295
mobile-web 56.4% 66.7% 10.3 18.2% 321 279

Table note: Absolute difference (pp) = (test − control) × 100. Relative change = (test − control) / control.

KPI #2: constructive (48h unreverted) by user experience
Test group User experience Rate Count (edits)
control Unregistered 62.7% 255
control Newcomer 59.1% 230
control Junior Contributor 76.3% 1180
test Unregistered 70.4% 260
test Newcomer 64.7% 221
test Junior Contributor 79.5% 1093

Table note: Rate = mean(0/1 outcome) where outcome=1 means not reverted within 48h (per edit). Denominator = new-content edits (rows) in reference_check_save_data within each analysis group (shown test vs eligible-not-shown control) for each (test group × user experience).

KPI #2: change vs control (by user experience)
User experience Control rate Test rate Absolute difference (pp) Relative change vs control N (control) N (test)
Unregistered 62.7% 70.4% 7.6 12.2% 255 260
Newcomer 59.1% 64.7% 5.6 9.4% 230 221
Junior Contributor 76.3% 79.5% 3.2 4.2% 1180 1093

Table note: Absolute difference (pp) = (test − control) × 100. Relative change = (test − control) / control.

Code
# KPI #2 by platform and user experience
# Updated per methodology (shown test vs eligible-not-shown control)
if (!is.null(reference_check_save_data) &&
    all(c("test_group", "is_new_content", "was_reverted", "platform", "experience_level_group",
          "was_reference_check_shown", "was_reference_check_eligible") %in% names(reference_check_save_data))) {
  df_kpi2 <- reference_check_save_data %>%
    make_rc_ab_group_published() %>%
    filter(is_new_content == 1) %>%
    mutate(constructive = ifelse(was_reverted == 1, 0, 1))

  kpi2_slices <- df_kpi2 %>%
    filter(!is.na(experience_level_group), experience_level_group %in% c("Unregistered", "Newcomer", "Junior Contributor")) %>%
    group_by(test_group, platform, experience_level_group) %>%
    summarise(rate = mean(constructive, na.rm = TRUE), n = n(), .groups = "drop")

  render_slice(
    kpi2_slices,
    "KPI #2: constructive by platform and user experience",
    c(test_group = "Test group", platform = "Platform", experience_level_group = "User experience", rate = "Rate", n = "Count (edits)"),
    note_text = "Rate = mean(0/1 outcome) where outcome=1 means not reverted within 48h (per edit). Denominator = new-content edits (rows) in `reference_check_save_data` within each analysis group (shown test vs eligible-not-shown control) for each (test group × platform × user experience)."
  )
} else {
  message("KPI #2 slices: required columns missing in reference_check_save_data")
}
KPI #2 slices: required columns missing in reference_check_save_data
Code
# KPI #2 by checks shown (bucketed)
# Updated per methodology (shown test vs eligible-not-shown control)
if (!is.null(reference_check_save_data) &&
    all(c("test_group", "is_new_content", "was_reverted", "n_checks_shown",
          "was_reference_check_shown", "was_reference_check_eligible") %in% names(reference_check_save_data))) {
  df_kpi2_checks <- reference_check_save_data %>%
    make_rc_ab_group_published() %>%
    filter(is_new_content == 1, test_group == "test") %>%
    mutate(
      constructive = ifelse(was_reverted == 1, 0, 1),
      checks_bucket = case_when(
        is.na(n_checks_shown) ~ "unknown",
        n_checks_shown == 0 ~ "0",
        n_checks_shown == 1 ~ "1",
        n_checks_shown == 2 ~ "2",
        n_checks_shown >= 3 ~ "3+"
      )
    ) %>%
    group_by(checks_bucket) %>%
    summarise(rate = mean(constructive, na.rm = TRUE), n = n(), .groups = "drop") %>%
    mutate(checks_bucket = factor(checks_bucket, levels = c("unknown", "0", "1", "2", "3+"))) %>%
    arrange(checks_bucket)

  render_slice(
    df_kpi2_checks,
    "KPI #2 by checks shown (test group only)",
    c(checks_bucket = "Checks shown", rate = "Rate", n = "Count (edits)"),
    note_text = "Rate = mean(0/1 outcome) in the test group only where outcome=1 means not reverted within 48h (per edit). Denominator = new-content edits (rows) in `reference_check_save_data` within the shown test group for each checks-shown bucket. Control is excluded because it has no comparable checks-shown event stream."
  )
} else {
  message("KPI #2 by checks: required columns missing in reference_check_save_data")
}
KPI #2 by checks shown (test group only)
Checks shown Rate Count (edits)
1 76.6% 1086
2 79.8% 208
3+ 70.4% 280

Table note: Rate = mean(0/1 outcome) in the test group only where outcome=1 means not reverted within 48h (per edit). Denominator = new-content edits (rows) in reference_check_save_data within the shown test group for each checks-shown bucket. Control is excluded because it has no comparable checks-shown event stream.

Code
# Per-wiki sanity: constructive
# Only render when multiple wikis are present.
if (!is.null(reference_check_save_data) &&
    all(c("wiki", "test_group", "is_new_content", "was_reverted") %in% names(reference_check_save_data))) {
  if (dplyr::n_distinct(reference_check_save_data$wiki) <= 1) {
    message("Per-wiki constructive: skipped (single wiki)")
  } else {
    per_wiki_constructive <- reference_check_save_data %>%
      filter(is_new_content == 1) %>%
      mutate(constructive = ifelse(was_reverted == 1, 0, 1)) %>%
      group_by(wiki, test_group) %>%
      summarise(rate = mean(constructive, na.rm = TRUE), n = n(), .groups = "drop")
    render_slice(
      per_wiki_constructive,
      "Per-wiki constructive",
      c(wiki = "Wiki", test_group = "Test group", rate = "Rate", n = "Count (edits)"),
      note_text = "Rate = mean(0/1 outcome) where outcome=1 means not reverted within 48h (per edit). Denominator = new-content edits (rows) in `reference_check_save_data` for each (wiki × test group)."
    )
  }
} else {
  message("Per-wiki constructive: required columns missing in reference_check_save_data")
}
Per-wiki constructive: skipped (single wiki)

4.2.3.1 Confirming the impact of Reference Check

Model A estimates the primary KPI #2 estimand: the total effect of treatment on the constructive rate.

Model B is an optional conditional analysis, reported only when editcheck-newreference is directly observed, and estimates the treatment effect holding reference inclusion constant.
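To make the Model A / Model B distinction concrete, here is a minimal simulation; all variable names and effect sizes are illustrative, not taken from the experiment. Because treatment raises the chance a reference is included, and references raise constructiveness, the total-effect coefficient (Model A) absorbs the reference pathway, while the conditional coefficient (Model B) reflects only the direct effect:

```r
# Illustrative simulation: total effect (Model A) vs conditional effect (Model B).
set.seed(42)
n <- 20000
treat <- rbinom(n, 1, 0.5)

# Treatment raises the chance a reference is included...
ref <- rbinom(n, 1, plogis(-1 + 1.5 * treat))
# ...and references (plus a small direct effect) raise constructiveness.
constructive <- rbinom(n, 1, plogis(0.8 + 0.1 * treat + 0.7 * ref))

m_a <- glm(constructive ~ treat, family = binomial())        # total effect
m_b <- glm(constructive ~ treat + ref, family = binomial())  # conditional effect

# The total-effect coefficient is larger than the conditional one,
# analogous to the Table 2 vs Table 2D pattern.
coef(m_a)[["treat"]]
coef(m_b)[["treat"]]
```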

Code
# KPI #2 model (constructive = not reverted within 48h)
# Updated per methodology (shown test vs eligible-not-shown control)
if (!is.null(reference_check_save_data) &&
    all(c("test_group", "is_new_content", "was_reverted", "platform",
          "was_reference_check_shown", "was_reference_check_eligible") %in% names(reference_check_save_data))) {
  df_kpi2 <- reference_check_save_data %>%
    make_rc_ab_group_published() %>%
    filter(is_new_content == 1) %>%
    mutate(
      constructive = ifelse(was_reverted == 1, 0, 1),
      test_group = factor(test_group, levels = c("control", "test")),
      platform = factor(platform)
    )

  tryCatch({
    IRdisplay::display_markdown("**Model A (total effect)**")

    # Model A (total effect): primary KPI #2 estimand
    if (dplyr::n_distinct(df_kpi2$platform) > 1) {
      m_kpi2_a <- glm(constructive ~ test_group * platform, data = df_kpi2, family = binomial())
    } else {
      m_kpi2_a <- glm(constructive ~ test_group + platform, data = df_kpi2, family = binomial())
    }

    render_binom_model(
      m_kpi2_a,
      "Table 2. Adjusted odds ratios (ORs) from multivariable logistic regression for constructive outcome (not reverted within 48h) among new-content edits.",
      note_text = paste(
        "Model A estimates the primary KPI #2 estimand: the total effect of treatment on the constructive rate.",
        "Outcome=1 means not reverted within 48h on a new-content edit.",
        "Population is restricted to shown test vs eligible-not-shown control.",
        "Includes a test_group×platform interaction when platform has multiple levels.",
        "OR>1 indicates higher odds of the outcome.",
        sep = " "
      )
    )

    # Table 2A: mobile-web treatment vs control (Model A)
    if ("mobile-web" %in% levels(df_kpi2$platform)) {
      nd_ctrl <- data.frame(
        test_group = factor("control", levels = levels(df_kpi2$test_group)),
        platform = factor("mobile-web", levels = levels(df_kpi2$platform))
      )
      nd_test <- nd_ctrl %>% mutate(test_group = factor("test", levels = levels(df_kpi2$test_group)))

      mw_contrast <- tidy_glm_contrast_or(
        model = m_kpi2_a,
        newdata_control = nd_ctrl,
        newdata_test = nd_test,
        label = "Mobile-web: treatment vs control"
      )

      render_or_contrast_table(
        mw_contrast,
        "Table 2A. Mobile-web treatment vs control (constructive outcome) among new-content edits.",
        note_text = "This contrast is computed from Table 2 (Model A) as the log-odds difference between (test, mobile-web) and (control, mobile-web), converted to an OR with a Wald 95% CI and two-sided p-value."
      )
    } else {
      message("KPI #2 Table 2A skipped: platform level 'mobile-web' not present in df_kpi2")
    }

    IRdisplay::display_markdown("**Model B (conditional; only when editcheck-newreference is observed)**")

    # Model B (conditional): only when editcheck-newreference is directly observed
    if ("was_reference_included" %in% names(df_kpi2)) {
      df_kpi2 <- df_kpi2 %>%
        mutate(was_reference_included = ifelse(is.na(was_reference_included), 0L, as.integer(was_reference_included == 1)))

      if (dplyr::n_distinct(df_kpi2$platform) > 1) {
        m_kpi2_b <- glm(constructive ~ test_group * platform + was_reference_included, data = df_kpi2, family = binomial())
      } else {
        m_kpi2_b <- glm(constructive ~ test_group + platform + was_reference_included, data = df_kpi2, family = binomial())
      }

      render_binom_model(
        m_kpi2_b,
        "Table 2D. Adjusted odds ratios (ORs) from multivariable logistic regression for constructive outcome (not reverted within 48h) among new-content edits (conditional on reference inclusion).",
        note_text = "Model B is an optional conditional analysis that estimates the treatment effect holding reference inclusion constant, and is reported only when editcheck-newreference is directly observed."
      )

      # Table 2E: mobile-web treatment vs control (Model B)
      if ("mobile-web" %in% levels(df_kpi2$platform)) {
        ref_mean_mw <- mean(df_kpi2$was_reference_included[df_kpi2$platform == "mobile-web"], na.rm = TRUE)

        nd_ctrl_b <- data.frame(
          test_group = factor("control", levels = levels(df_kpi2$test_group)),
          platform = factor("mobile-web", levels = levels(df_kpi2$platform)),
          was_reference_included = ref_mean_mw
        )
        nd_test_b <- nd_ctrl_b %>% mutate(test_group = factor("test", levels = levels(df_kpi2$test_group)))

        mw_contrast_b <- tidy_glm_contrast_or(
          model = m_kpi2_b,
          newdata_control = nd_ctrl_b,
          newdata_test = nd_test_b,
          label = "Mobile-web: treatment vs control (conditional)"
        )

        render_or_contrast_table(
          mw_contrast_b,
          "Table 2E. Mobile-web treatment vs control (constructive outcome) among new-content edits (conditional on reference inclusion).",
          note_text = "This contrast is computed from Table 2D (Model B), holding was_reference_included at its mobile-web mean."
        )
      }
    } else {
      message("KPI #2 Model B skipped: was_reference_included (editcheck-newreference) not available in df_kpi2")
    }
  }, error = function(e) {
    message("KPI #2 model error: ", e$message)
  })
} else {
  message("KPI #2 model: required columns missing or data not loaded")
}

Model A (total effect)

Table 2. Adjusted odds ratios (ORs) from multivariable logistic regression for constructive outcome (not reverted within 48h) among new-content edits.
Term OR CI low CI high SE p-value
Intercept 3.085 2.728 3.498 0.063 <0.001
test_grouptest 1.144 0.955 1.371 0.092 0.146
platformmobile-web 0.419 0.325 0.540 0.129 <0.001
test_grouptest:platformmobile-web 1.353 0.927 1.978 0.193 0.118

Table note: Model A estimates the primary KPI #2 estimand: the total effect of treatment on the constructive rate. Outcome=1 means not reverted within 48h on a new-content edit. Population is restricted to shown test vs eligible-not-shown control. Includes a test_group×platform interaction when platform has multiple levels. OR>1 indicates higher odds of the outcome.

Table 2A. Mobile-web treatment vs control (constructive outcome) among new-content edits.
Contrast OR CI low CI high SE p-value
Mobile-web: treatment vs control 1.547 1.109 2.157 0.170 0.010

Table note: This contrast is computed from Table 2 (Model A) as the log-odds difference between (test, mobile-web) and (control, mobile-web), converted to an OR with a Wald 95% CI and two-sided p-value.
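The helper `tidy_glm_contrast_or()` is not shown in this report; the note above describes a standard Wald contrast on the coefficient scale, which can be sketched as follows (the data-generating step is purely illustrative):

```r
# Sketch of the Wald contrast behind Table 2A: for contrast vector c,
# OR = exp(c'beta) with SE = sqrt(c' V c) and a two-sided normal p-value.
set.seed(1)
d <- data.frame(
  test_group = factor(sample(c("control", "test"), 4000, replace = TRUE)),
  platform   = factor(sample(c("desktop", "mobile-web"), 4000, replace = TRUE))
)
d$y <- rbinom(4000, 1, plogis(1 + 0.3 * (d$test_group == "test") *
                                    (d$platform == "mobile-web")))

m <- glm(y ~ test_group * platform, data = d, family = binomial())

# Contrast: (test, mobile-web) minus (control, mobile-web)
X <- model.matrix(~ test_group * platform,
                  data = data.frame(
                    test_group = factor(c("test", "control"),
                                        levels = levels(d$test_group)),
                    platform   = factor("mobile-web",
                                        levels = levels(d$platform))))
cvec <- X[1, ] - X[2, ]

est <- sum(cvec * coef(m))                        # log-odds difference
se  <- sqrt(drop(t(cvec) %*% vcov(m) %*% cvec))   # delta-method SE

or   <- exp(est)
ci   <- exp(est + c(-1.96, 1.96) * se)            # Wald 95% CI
pval <- 2 * pnorm(-abs(est) / se)
```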

Model B (conditional; only when editcheck-newreference is observed)

Table 2D. Adjusted odds ratios (ORs) from multivariable logistic regression for constructive outcome (not reverted within 48h) among new-content edits (conditional on reference inclusion).
Term OR CI low CI high SE p-value
Intercept 2.613 2.296 2.981 0.067 <0.001
test_grouptest 0.873 0.718 1.060 0.099 0.170
platformmobile-web 0.482 0.374 0.624 0.131 <0.001
was_reference_included 2.069 1.716 2.501 0.096 <0.001
test_grouptest:platformmobile-web 1.332 0.910 1.954 0.195 0.142

Table note: Model B is an optional conditional analysis that estimates the treatment effect holding reference inclusion constant, and is reported only when editcheck-newreference is directly observed.

Table 2E. Mobile-web treatment vs control (constructive outcome) among new-content edits (conditional on reference inclusion).
Contrast OR CI low CI high SE p-value
Mobile-web: treatment vs control (conditional) 1.162 0.825 1.637 0.175 0.390

Table note: This contrast is computed from Table 2D (Model B), holding was_reference_included at its mobile-web mean.

Code
# KPI #2 Bayesian confirmation (brms) + Bayesian lift (relax) — constructive (not reverted within 48h)
# Updated per methodology (shown test vs eligible-not-shown control)
if (!is.null(reference_check_save_data) &&
    all(c("test_group", "is_new_content", "was_reverted",
          "was_reference_check_shown", "was_reference_check_eligible") %in% names(reference_check_save_data))) {

  df_kpi2_brm <- reference_check_save_data %>%
    make_rc_ab_group_published() %>%
    filter(is_new_content == 1) %>%
    mutate(
      constructive = ifelse(was_reverted == 1, 0, 1),
      test_group = factor(test_group, levels = c("control", "test"))
    )

  # Hierarchical Bayesian regression when user_id is present
  if (all(c("user_id", "platform", "experience_level_group") %in% names(df_kpi2_brm))) {
    df_brm <- df_kpi2_brm %>%
      mutate(
        platform = factor(platform),
        experience_level_group = droplevels(experience_level_group)
      ) %>%
      filter(!is.na(user_id), !is.na(constructive), !is.na(test_group), !is.na(platform))

    if (dplyr::n_distinct(df_brm$user_id) > 1 && dplyr::n_distinct(df_brm$test_group) == 2) {
      if (!requireNamespace("brms", quietly = TRUE)) {
        message("KPI #2 brms: skipped (brms not available / cannot be loaded in this environment)")
      } else if (!exists("safe_brm", mode = "function")) {
        message("KPI #2 brms: skipped (safe_brm not defined; run the setup/helper cells first)")
      } else {
        priors <- c(
          brms::set_prior(prior = "std_normal()", class = "b"),
          brms::set_prior("cauchy(0, 5)", class = "sd")
        )

        fit_brm <- safe_brm(
          constructive ~ test_group + platform + experience_level_group + (1 | user_id),
          data = df_brm,
          prior = priors,
          seed = 5,
          chains = 4,
          cores = 4,
          refresh = 0
        )

        if (!is.null(fit_brm)) {
          # Posterior-derived lift (probability space) + OR summary (multi-check style)
          nd_ctrl <- df_brm %>% mutate(test_group = factor("control", levels = c("control", "test")))
          nd_test <- df_brm %>% mutate(test_group = factor("test", levels = c("control", "test")))

          render_brms_confirm_table(
            fit = fit_brm,
            title = "Table 2B. Hierarchical Bayesian confirmation for constructive outcome among new-content edits.",
            coef_name = "b_test_grouptest",
            newdata_control = nd_ctrl,
            newdata_test = nd_test,
            note_text = "Posterior-derived average lift is computed as the per-draw mean of Pr(outcome|test) − Pr(outcome|control) over the observed covariate distribution (platform + experience), using population-level predictions (re_formula = NA)."
          )
        }
      }
    } else {
      message("KPI #2 brms: skipped (insufficient variation in user_id or test_group)")
    }
  } else {
    message("KPI #2 brms: skipped (missing user_id/platform/experience_level_group)")
  }

  # relax lift table (audit + uncertainty)
  IRdisplay::display_markdown("**Bayesian Analysis / Frequentist Analysis (overall)**")
  kpi2_df <- df_kpi2_brm %>%
    transmute(
      outcome = constructive,
      variation = dplyr::case_when(
        test_group == "control" ~ "control",
        test_group == "test" ~ "treatment",
        TRUE ~ as.character(test_group)
      )
    )
  render_relax(kpi2_df, "KPI #2", metric_type = "proportion", better = "higher")

  IRdisplay::display_markdown("**Bayesian Analysis / Frequentist Analysis (mobile-web only)**")
  # Mobile-web only: explicit control vs treatment inference
  if ("platform" %in% names(df_kpi2_brm)) {
    df_kpi2_mw <- df_kpi2_brm %>%
      filter(platform == "mobile-web")

    if (nrow(df_kpi2_mw) == 0) {
      message("KPI #2 (mobile-web only): skipped (no rows after filtering platform == 'mobile-web')")
    } else if (dplyr::n_distinct(df_kpi2_mw$test_group) < 2) {
      message("KPI #2 (mobile-web only): skipped (need both control and test groups)")
    } else {
      # relax (mobile-web)
      kpi2_mw_df <- df_kpi2_mw %>%
        transmute(
          outcome = constructive,
          variation = dplyr::case_when(
            test_group == "control" ~ "control",
            test_group == "test" ~ "treatment",
            TRUE ~ as.character(test_group)
          )
        )
      render_relax(kpi2_mw_df, "KPI #2 (mobile-web only)", metric_type = "proportion", better = "higher")

      # brms confirmation (mobile-web) when user_id is present
      if (all(c("user_id", "experience_level_group") %in% names(df_kpi2_mw))) {
        df_brm_mw <- df_kpi2_mw %>%
          mutate(experience_level_group = droplevels(experience_level_group)) %>%
          filter(!is.na(user_id), !is.na(constructive), !is.na(test_group))

        if (dplyr::n_distinct(df_brm_mw$user_id) > 1 && dplyr::n_distinct(df_brm_mw$test_group) == 2) {
          if (!requireNamespace("brms", quietly = TRUE)) {
            message("KPI #2 brms (mobile-web): skipped (brms not available / cannot be loaded in this environment)")
          } else if (!exists("safe_brm", mode = "function")) {
            message("KPI #2 brms (mobile-web): skipped (safe_brm not defined; run the setup/helper cells first)")
          } else {
            priors <- c(
              brms::set_prior(prior = "std_normal()", class = "b"),
              brms::set_prior("cauchy(0, 5)", class = "sd")
            )

            fit_brm_mw <- safe_brm(
              constructive ~ test_group + experience_level_group + (1 | user_id),
              data = df_brm_mw,
              prior = priors,
              seed = 5,
              chains = 4,
              cores = 4,
              refresh = 0
            )

            if (!is.null(fit_brm_mw)) {
              nd_ctrl <- df_brm_mw %>% mutate(test_group = factor("control", levels = c("control", "test")))
              nd_test <- df_brm_mw %>% mutate(test_group = factor("test", levels = c("control", "test")))

              render_brms_confirm_table(
                fit = fit_brm_mw,
                title = "Table 2C. Hierarchical Bayesian confirmation for constructive outcome among mobile-web new-content edits.",
                coef_name = "b_test_grouptest",
                newdata_control = nd_ctrl,
                newdata_test = nd_test,
                note_text = "Mobile-web-only brms model adjusts for experience group (platform is constant and omitted). Posterior-derived average lift is computed as the per-draw mean of Pr(outcome|test) − Pr(outcome|control) over the observed covariate distribution (experience), using population-level predictions (re_formula = NA)."
              )
            }
          }
        } else {
          message("KPI #2 brms (mobile-web): skipped (insufficient variation in user_id or test_group)")
        }
      } else {
        message("KPI #2 brms (mobile-web): skipped (missing user_id/experience_level_group)")
      }
    }
  }
} else {
  message("KPI #2 relax: required columns missing or data not loaded")
}
[Sampling log: 4 MCMC chains finished successfully (~24s each). rstan then failed to load (undefined TBB symbol in rstan.so), so the brms fit was skipped (backend=cmdstanr); continuing with glm + relax outputs. To run brms reliably, prefer cmdstanr with CmdStan installed and a stable R toolchain.]

Bayesian Analysis / Frequentist Analysis (overall)

KPI #2
Relative lift ((Treatment − Control) / Control)
Bayesian Analysis: Point estimate 0.057; Chance to Win / P(Treatment better) 0.995; 95% CrI [0.013, 0.100]
Frequentist Analysis: Point estimate 0.057; p-value 0.010; 95% CI [0.014, 0.100]

Interpretation: Based on relax, the posterior probability that treatment is better than control is 99.5% (computed as Chance to Win).
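The relax internals are not shown in this report, but a "Chance to Win" of this kind can be approximated with independent Beta(1, 1)-prior posteriors on the two proportions, using the overall KPI #2 counts (not-reverted / total) from the tables above:

```r
# Posterior draws for each group's constructive rate under a flat Beta prior.
set.seed(7)
p_test    <- rbeta(1e6, 1 + 1195, 1 + (1574 - 1195))  # test posterior
p_control <- rbeta(1e6, 1 + 1196, 1 + (1665 - 1196))  # control posterior

ctw <- mean(p_test > p_control)  # P(treatment better)
ctw  # close to the reported Chance to Win of 0.995
```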

Bayesian Analysis / Frequentist Analysis (mobile-web only)

KPI #2 (mobile-web only)
Relative lift ((Treatment − Control) / Control)
Bayesian Analysis: Point estimate 0.171; Chance to Win / P(Treatment better) 0.989; 95% CrI [0.026, 0.317]
Frequentist Analysis: Point estimate 0.182; p-value 0.018; 95% CI [0.032, 0.333]

Interpretation: Based on relax, the posterior probability that treatment is better than control is 98.9% (computed as Chance to Win).

[Sampling log (mobile-web model): 4 MCMC chains finished successfully (~3s each). rstan then failed to load (undefined TBB symbol in rstan.so), so the brms fit was skipped (backend=cmdstanr); continuing with glm + relax outputs. To run brms reliably, prefer cmdstanr with CmdStan installed and a stable R toolchain.]

4.2.4 Guardrail #1 Content quality (reverts)

Metric: Proportion of published new-content edits that are reverted within 48 hours.

Methodology: We review revert rates for all published new-content edits and compare the test and control groups.

Test group: The test group includes published new-content edits where Reference Check was shown at least once during the editing session.

Control group: The control group includes published new-content edits identified as eligible but not shown Reference Check.

Additional analysis: We include a breakdown of revert rates for published edits with a reference added and published edits without a reference added.

Note: This guardrail mirrors KPI #2 in the Multi Check analysis.

Results: Edits shown Reference Check were less likely to be reverted within 48 hours (−14.5% relative: 28.2% control, 24.1% test). How big is the change:

- Desktop: revert rates declined 9.8% relative (24.5% → 22.1%).
- Mobile web: revert rates declined 23.6% relative (43.6% → 33.3%).

Overall, edits were less likely to be reverted when editors were shown Reference Check, a reduction supported by both the overall and mobile-web-only relax results. The decrease was most pronounced on mobile web, where the within-platform adjusted contrast is statistically significant (Table 3A). Because neither the across-platform treatment term nor the platform interaction term in the overall adjusted regression is statistically significant, we treat "mobile improves more than desktop" as suggestive rather than established. Importantly, edits that include a new reference were much less likely to be reverted, reinforcing the proposed quality mechanism behind the feature.
Note: In the 2024 Reference Check report, the new-content edit revert rate decreased by 8.6% when Reference Check was available. While the feature introduced some nonconstructive new-content edits with a reference (a 5 percentage point (pp) increase), it produced a larger share of constructive new-content edits with a reference added (a 23.4 pp increase).

Guardrail #1 (Revert within 48h; lower is better):
Overall: ↓ −4.1 pp (28.2% → 24.1%), −14.5% relative.
Desktop: ↓ −2.4 pp (24.5% → 22.1%), −9.8% relative.
Mobile-web: ↓ −10.3 pp (43.6% → 33.3%), −23.6% relative.
Evidence: glm (Table 3) overall across-platform treatment term not significant (p=0.146);
mobile-web contrast significant (Table 3A p=0.010).
Relax (relative lift): overall −0.141 (p = 0.004); mobile-web-only −0.236 (p = 0.004).
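As a quick arithmetic check, the per-platform deltas above follow directly from the rates reported in the tables below. A minimal base-R sketch (the rates are transcribed from the tables, not recomputed from raw data):

```r
# Reproduce the Guardrail #1 deltas from the reported revert rates
rates <- data.frame(
  scope   = c("Overall", "Desktop", "Mobile-web"),
  control = c(0.282, 0.245, 0.436),
  test    = c(0.241, 0.221, 0.333)
)
# Absolute difference in percentage points and relative change vs control
rates$abs_pp  <- round((rates$test - rates$control) * 100, 1)
rates$rel_pct <- round((rates$test - rates$control) / rates$control * 100, 1)
rates
# Overall: -4.1 pp, -14.5%; Desktop: -2.4 pp, -9.8%; Mobile-web: -10.3 pp, -23.6%
```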

Code
# Revert rate bar (guardrail quality) by test group
# Updated per methodology (shown test vs eligible-not-shown control)
if (!is.null(reference_check_save_data) &&
    all(c("test_group", "was_reverted", "is_new_content",
          "was_reference_check_shown", "was_reference_check_eligible") %in% names(reference_check_save_data))) {
  revert_plot <- reference_check_save_data %>%
    make_rc_ab_group_published() %>%
    filter(is_new_content == 1) %>%
    mutate(reverted = ifelse(was_reverted == 1, 1, 0)) %>%
    group_by(test_group) %>%
    summarise(rate = mean(reverted, na.rm = TRUE), n = n(), .groups = "drop") %>%
    mutate(label = scales::percent(rate, accuracy = 0.1)) %>%
    ggplot(aes(x = test_group, y = rate, fill = test_group)) +
    geom_col() +
    geom_text(aes(label = label), vjust = -0.2, size = 3) +
    scale_y_continuous(labels = scales::percent_format(), expand = expansion(mult = c(0, 0.12))) +
    scale_fill_manual(values = c("control" = "#999999", "test" = "dodgerblue4")) +
    labs(title = "Revert rate by test group",
         x = "Test group",
         y = "Percent reverted (48h)") +
    pc_theme() +
    guides(fill = "none")
  print(revert_plot)
} else {
  message("Revert plot: required columns missing in reference_check_save_data")
}

Chart note (definition of Rate / denominator) - Revert rate by test group: Rate = mean(reverted) where reverted is 1 if was_reverted == 1 and 0 otherwise. Denominator = new-content edits (rows) in reference_check_save_data for each analysis group (shown test vs eligible-not-shown control).

Code
# Guardrail #1 tables (revert rate) by platform and deltas
# Updated per methodology (shown test vs eligible-not-shown control)
if (!is.null(reference_check_save_data) &&
    all(c("test_group", "platform", "is_new_content", "was_reverted",
          "was_reference_check_shown", "was_reference_check_eligible") %in% names(reference_check_save_data))) {
  gr1_df <- reference_check_save_data %>%
    make_rc_ab_group_published() %>%
    filter(is_new_content == 1) %>%
    mutate(reverted = ifelse(was_reverted == 1, 1, 0))

  # Overall (control vs test) + change vs control
  gr1_overall_rates <- make_rate_table(gr1_df, "reverted", group_cols = c("test_group")) %>%
    mutate(scope = "Overall")
  gr1_overall_rel <- make_rel_change_dim(gr1_overall_rates, dim_col = "scope")

  render_rate_rel(
    gr1_overall_rates,
    gr1_overall_rel,
    "Guardrail #1: revert rate (48h) overall",
    "Guardrail #1: change vs control (overall)",
    c(test_group = "Test group", scope = "Scope", rate = "Revert rate", n = "Count (edits)"),
    note_rate = "Revert rate = mean(0/1 outcome) where outcome=1 means reverted within 48h (per edit). Denominator = new-content edits (rows) in `reference_check_save_data` within each analysis group (shown test vs eligible-not-shown control)."
  )

  # By platform (control vs test) + change vs control
  guardrail1_rates <- gr1_df %>% make_rate_table("reverted")
  guardrail1_rel <- make_rel_change(guardrail1_rates)

  render_rate_rel(
    guardrail1_rates, guardrail1_rel,
    "Guardrail #1: revert rate (48h) by platform",
    "Guardrail #1: change vs control (by platform)",
    c(test_group = "Test group", platform = "Platform", rate = "Revert rate", n = "Count (edits)"),
    note_rate = "Revert rate = mean(0/1 outcome) where outcome=1 means reverted within 48h (per edit). Denominator = new-content edits (rows) in `reference_check_save_data` within each analysis group (shown test vs eligible-not-shown control) for each (test group × platform)."
  )

  # Guardrail #1 by checks shown (bucketed)
  if ("n_checks_shown" %in% names(gr1_df)) {
    gr1_checks <- gr1_df %>%
      # Checks-shown buckets come from RC shown events; we report this slice for the test group only.
      filter(test_group == "test") %>%
      mutate(checks_bucket = case_when(
        is.na(n_checks_shown) ~ "unknown",
        n_checks_shown == 0 ~ "0",
        n_checks_shown == 1 ~ "1",
        n_checks_shown == 2 ~ "2",
        n_checks_shown >= 3 ~ "3+"
      )) %>%
      group_by(checks_bucket) %>%
      summarise(rate = mean(reverted, na.rm = TRUE), n = n(), .groups = "drop") %>%
      mutate(checks_bucket = factor(checks_bucket, levels = c("unknown", "0", "1", "2", "3+"))) %>%
      arrange(checks_bucket)

    render_slice(
      gr1_checks,
      "Guardrail #1 by checks shown (test group only)",
      c(checks_bucket = "Checks shown", rate = "Revert rate", n = "Count (edits)"),
      note_text = "Revert rate = mean(0/1 outcome) in the test group only where outcome=1 means reverted within 48h (per edit). Denominator = new-content edits (rows) in `reference_check_save_data` within the shown test group for each checks-shown bucket. Control is excluded because it has no comparable checks-shown event stream."
    )
  } else {
    message("Guardrail #1 by checks: required columns missing in reference_check_save_data")
  }

  # User experience breakdown (Unregistered / Newcomer / Junior Contributor)
  if ("experience_level_group" %in% names(gr1_df)) {
    gr1_exp_df <- gr1_df %>%
      filter(!is.na(experience_level_group), experience_level_group %in% c("Unregistered", "Newcomer", "Junior Contributor"))

    gr1_exp_rates <- make_rate_table(gr1_exp_df, "reverted", group_cols = c("test_group", "experience_level_group"))
    gr1_exp_rel <- make_rel_change_dim(gr1_exp_rates, dim_col = "experience_level_group")

    render_rate_rel(
      gr1_exp_rates, gr1_exp_rel,
      "Guardrail #1: revert rate (48h) by user experience",
      "Guardrail #1: change vs control (by user experience)",
      c(test_group = "Test group", experience_level_group = "User experience", rate = "Revert rate", n = "Count (edits)"),
      note_rate = "Revert rate = mean(0/1 outcome) where outcome=1 means reverted within 48h (per edit). Denominator = new-content edits (rows) in `reference_check_save_data` within each analysis group (shown test vs eligible-not-shown control) for each (test group × user experience)."
    )

    # Guardrail #1 by platform and user experience
    gr1_exp_slices <- gr1_exp_df %>%
      group_by(test_group, platform, experience_level_group) %>%
      summarise(rate = mean(reverted, na.rm = TRUE), n = n(), .groups = "drop")

    render_slice(
      gr1_exp_slices,
      "Guardrail #1: revert rate (48h) by platform and user experience",
      c(test_group = "Test group", platform = "Platform", experience_level_group = "User experience", rate = "Revert rate", n = "Count (edits)"),
      note_text = "Revert rate = mean(0/1 outcome) where outcome=1 means reverted within 48h (per edit). Denominator = new-content edits (rows) in `reference_check_save_data` within each analysis group (shown test vs eligible-not-shown control) for each (test group × platform × user experience)."
    )
  } else {
    message("Guardrail #1 user experience tables: experience_level_group not available in reference_check_save_data")
  }
} else {
  message("Guardrail #1 tables: required columns missing in reference_check_save_data")
}
Guardrail #1: revert rate (48h) overall
Test group Revert rate Count (edits) Scope
control 28.2% 1665 Overall
test 24.1% 1574 Overall

Table note: Revert rate = mean(0/1 outcome) where outcome=1 means reverted within 48h (per edit). Denominator = new-content edits (rows) in reference_check_save_data within each analysis group (shown test vs eligible-not-shown control).

Guardrail #1: change vs control (overall)
scope Control rate Test rate Absolute difference (pp) Relative change vs control N (control) N (test)
Overall 28.2% 24.1% -4.1 -14.5% 1665 1574

Table note: Absolute difference (pp) = (test − control) × 100. Relative change = (test − control) / control.

Guardrail #1: revert rate (48h) by platform
Test group Platform Revert rate Count (edits)
control desktop 24.5% 1344
control mobile-web 43.6% 321
test desktop 22.1% 1295
test mobile-web 33.3% 279

Table note: Revert rate = mean(0/1 outcome) where outcome=1 means reverted within 48h (per edit). Denominator = new-content edits (rows) in reference_check_save_data within each analysis group (shown test vs eligible-not-shown control) for each (test group × platform).

Guardrail #1: change vs control (by platform)
Platform Control rate Test rate Absolute difference (pp) Relative change vs control N (control) N (test)
desktop 24.5% 22.1% -2.4 -9.8% 1344 1295
mobile-web 43.6% 33.3% -10.3 -23.6% 321 279

Table note: Absolute difference (pp) = (test − control) × 100. Relative change = (test − control) / control.

Guardrail #1 by checks shown (test group only)
Checks shown Revert rate Count (edits)
1 23.4% 1086
2 20.2% 208
3+ 29.6% 280

Table note: Revert rate = mean(0/1 outcome) in the test group only where outcome=1 means reverted within 48h (per edit). Denominator = new-content edits (rows) in reference_check_save_data within the shown test group for each checks-shown bucket. Control is excluded because it has no comparable checks-shown event stream.

Guardrail #1: revert rate (48h) by user experience
Test group User experience Revert rate Count (edits)
control Unregistered 37.3% 255
control Newcomer 40.9% 230
control Junior Contributor 23.7% 1180
test Unregistered 29.6% 260
test Newcomer 35.3% 221
test Junior Contributor 20.5% 1093

Table note: Revert rate = mean(0/1 outcome) where outcome=1 means reverted within 48h (per edit). Denominator = new-content edits (rows) in reference_check_save_data within each analysis group (shown test vs eligible-not-shown control) for each (test group × user experience).

Guardrail #1: change vs control (by user experience)
experience_level_group Control rate Test rate Absolute difference (pp) Relative change vs control N (control) N (test)
Unregistered 37.3% 29.6% -7.6 -20.5% 255 260
Newcomer 40.9% 35.3% -5.6 -13.6% 230 221
Junior Contributor 23.7% 20.5% -3.2 -13.6% 1180 1093

Table note: Absolute difference (pp) = (test − control) × 100. Relative change = (test − control) / control.

Guardrail #1: revert rate (48h) by platform and user experience
Test group Platform User experience Revert rate Count (edits)
control desktop Unregistered 34.4% 180
control desktop Newcomer 36.4% 187
control desktop Junior Contributor 20.4% 977
control mobile-web Unregistered 44.0% 75
control mobile-web Newcomer 60.5% <50
control mobile-web Junior Contributor 39.9% 203
test desktop Unregistered 29.0% 193
test desktop Newcomer 33.5% 191
test desktop Junior Contributor 18.2% 911
test mobile-web Unregistered 31.3% 67
test mobile-web Newcomer 46.7% <50
test mobile-web Junior Contributor 31.9% 182

Table note: Revert rate = mean(0/1 outcome) where outcome=1 means reverted within 48h (per edit). Denominator = new-content edits (rows) in reference_check_save_data within each analysis group (shown test vs eligible-not-shown control) for each (test group × platform × user experience).

Code
# Guardrail #1 model (revert rate)
# Updated per methodology (shown test vs eligible-not-shown control)
if (!is.null(reference_check_save_data) &&
    all(c("test_group", "is_new_content", "was_reverted", "platform",
          "was_reference_check_shown", "was_reference_check_eligible") %in% names(reference_check_save_data))) {

  df_g1 <- reference_check_save_data %>%
    make_rc_ab_group_published() %>%
    filter(is_new_content == 1) %>%
    mutate(
      reverted = ifelse(was_reverted == 1, 1, 0),
      test_group = factor(test_group, levels = c("control", "test")),
      platform = factor(platform)
    )

  if (all(c("test_group", "platform", "reverted") %in% names(df_g1))) {
    tryCatch({
      # Model A (total effect): primary revert-rate estimand
      if (dplyr::n_distinct(df_g1$platform) > 1) {
        m_g1_a <- glm(reverted ~ test_group * platform, data = df_g1, family = binomial())
      } else {
        m_g1_a <- glm(reverted ~ test_group + platform, data = df_g1, family = binomial())
      }

      render_binom_model(
        m_g1_a,
        "Table 3. Adjusted odds ratios (ORs) from multivariable logistic regression for reverted within 48h among new-content edits.",
        note_text = paste(
          "Model A estimates the total effect of treatment on revert rate.",
          "Outcome=1 means reverted within 48h on a new-content edit.",
          "Population is restricted to shown test vs eligible-not-shown control.",
          "Includes a test_group×platform interaction when platform has multiple levels.",
          "OR>1 indicates higher odds of revert.",
          sep = " "
        )
      )

      # Table 3A: mobile-web treatment vs control (Model A)
      if ("mobile-web" %in% levels(df_g1$platform)) {
        nd_ctrl <- data.frame(
          test_group = factor("control", levels = levels(df_g1$test_group)),
          platform = factor("mobile-web", levels = levels(df_g1$platform))
        )
        nd_test <- nd_ctrl %>% mutate(test_group = factor("test", levels = levels(df_g1$test_group)))

        mw_contrast <- tidy_glm_contrast_or(
          model = m_g1_a,
          newdata_control = nd_ctrl,
          newdata_test = nd_test,
          label = "Mobile-web: treatment vs control"
        )

        render_or_contrast_table(
          mw_contrast,
          "Table 3A. Mobile-web treatment vs control (revert rate within 48h) among new-content edits.",
          note_text = "This contrast is computed from Table 3 (Model A) as the log-odds difference between (test, mobile-web) and (control, mobile-web), converted to an OR with a Wald 95% CI and two-sided p-value."
        )
      } else {
        message("Guardrail #1 Table 3A skipped: platform level 'mobile-web' not present in df_g1")
      }

      # Model B (conditional): only when editcheck-newreference is directly observed
      if ("was_reference_included" %in% names(df_g1)) {
        df_g1 <- df_g1 %>%
          mutate(was_reference_included = ifelse(is.na(was_reference_included), 0L, as.integer(was_reference_included == 1)))

        if (dplyr::n_distinct(df_g1$platform) > 1) {
          m_g1_b <- glm(reverted ~ test_group * platform + was_reference_included, data = df_g1, family = binomial())
        } else {
          m_g1_b <- glm(reverted ~ test_group + platform + was_reference_included, data = df_g1, family = binomial())
        }

        render_binom_model(
          m_g1_b,
          "Table 3B. Adjusted odds ratios (ORs) from multivariable logistic regression for reverted within 48h among new-content edits (conditional on reference inclusion).",
          note_text = "Model B is an optional conditional analysis that estimates the treatment effect holding reference inclusion constant, and is reported only when editcheck-newreference is directly observed."
        )

        # Table 3C: mobile-web treatment vs control (Model B)
        if ("mobile-web" %in% levels(df_g1$platform)) {
          ref_mean_mw <- mean(df_g1$was_reference_included[df_g1$platform == "mobile-web"], na.rm = TRUE)

          nd_ctrl_b <- data.frame(
            test_group = factor("control", levels = levels(df_g1$test_group)),
            platform = factor("mobile-web", levels = levels(df_g1$platform)),
            was_reference_included = ref_mean_mw
          )
          nd_test_b <- nd_ctrl_b %>% mutate(test_group = factor("test", levels = levels(df_g1$test_group)))

          mw_contrast_b <- tidy_glm_contrast_or(
            model = m_g1_b,
            newdata_control = nd_ctrl_b,
            newdata_test = nd_test_b,
            label = "Mobile-web: treatment vs control (conditional)"
          )

          render_or_contrast_table(
            mw_contrast_b,
            "Table 3C. Mobile-web treatment vs control (revert rate within 48h) among new-content edits (conditional on reference inclusion).",
            note_text = "This contrast is computed from Table 3B (Model B), holding was_reference_included at its mobile-web mean."
          )
        }
      } else {
        message("Guardrail #1 Model B skipped: was_reference_included (editcheck-newreference) not available in df_g1")
      }
    }, error = function(e) {
      message("Guardrail #1 model error: ", e$message)
    })
  } else {
    message("Guardrail #1 model: required columns missing after aliasing")
  }
} else {
  message("Guardrail #1 model: required columns missing or data not loaded")
}
Table 3. Adjusted odds ratios (ORs) from multivariable logistic regression for reverted within 48h among new-content edits.
Term OR CI low CI high SE p-value
Intercept 0.324 0.286 0.367 0.063 <0.001
test_grouptest 0.874 0.730 1.048 0.092 0.146
platformmobile-web 2.386 1.851 3.073 0.129 <0.001
test_grouptest:platformmobile-web 0.739 0.506 1.078 0.193 0.118

Table note: Model A estimates the total effect of treatment on revert rate. Outcome=1 means reverted within 48h on a new-content edit. Population is restricted to shown test vs eligible-not-shown control. Includes a test_group×platform interaction when platform has multiple levels. OR>1 indicates higher odds of revert.

Table 3A. Mobile-web treatment vs control (revert rate within 48h) among new-content edits.
Contrast OR CI low CI high SE p-value
Mobile-web: treatment vs control 0.646 0.464 0.902 0.170 0.010

Table note: This contrast is computed from Table 3 (Model A) as the log-odds difference between (test, mobile-web) and (control, mobile-web), converted to an OR with a Wald 95% CI and two-sided p-value.
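Since Model A is fit on the log-odds scale, the Table 3A contrast can be recovered by hand from Table 3: the mobile-web treatment effect is the product of the treatment OR and the interaction OR (summing log-odds terms). A sketch using the rounded ORs as printed, so the result matches Table 3A only up to rounding:

```r
# Mobile-web treatment vs control (Model A):
# log-odds difference = beta[test_grouptest] + beta[test_grouptest:platformmobile-web],
# so on the OR scale the contrast is the product of the two ORs.
or_mw <- 0.874 * 0.739   # Table 3 ORs
round(or_mw, 3)          # 0.646, matching Table 3A

# The same product on the Table 3B (Model B) coefficients recovers Table 3C:
or_mw_cond <- 1.146 * 0.751
round(or_mw_cond, 3)     # 0.861, matching Table 3C
```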

Table 3B. Adjusted odds ratios (ORs) from multivariable logistic regression for reverted within 48h among new-content edits (conditional on reference inclusion).
Term OR CI low CI high SE p-value
Intercept 0.383 0.335 0.435 0.067 <0.001
test_grouptest 1.146 0.943 1.393 0.099 0.170
platformmobile-web 2.073 1.604 2.676 0.131 <0.001
was_reference_included 0.483 0.400 0.583 0.096 <0.001
test_grouptest:platformmobile-web 0.751 0.512 1.099 0.195 0.142

Table note: Model B is an optional conditional analysis that estimates the treatment effect holding reference inclusion constant, and is reported only when editcheck-newreference is directly observed.

Table 3C. Mobile-web treatment vs control (revert rate within 48h) among new-content edits (conditional on reference inclusion).
Contrast OR CI low CI high SE p-value
Mobile-web: treatment vs control (conditional) 0.861 0.611 1.212 0.175 0.390

Table note: This contrast is computed from Table 3B (Model B), holding was_reference_included at its mobile-web mean.

Code
# Guardrail #1 Bayesian lift (relax) — revert rate among new content
# Updated per methodology (shown test vs eligible-not-shown control)
if (!is.null(reference_check_save_data) &&
    all(c("test_group", "is_new_content", "was_reverted",
          "was_reference_check_shown", "was_reference_check_eligible") %in% names(reference_check_save_data))) {

  df_g1_relax <- reference_check_save_data %>%
    make_rc_ab_group_published() %>%
    filter(is_new_content == 1) %>%
    mutate(
      reverted = ifelse(was_reverted == 1, 1, 0),
      test_group = factor(test_group, levels = c("control", "test"))
    )

  # Overall: control vs treatment
  g1_df <- df_g1_relax %>%
    transmute(
      outcome = reverted,
      variation = dplyr::case_when(
        test_group == "control" ~ "control",
        test_group == "test" ~ "treatment",
        TRUE ~ as.character(test_group)
      )
    )
  IRdisplay::display_markdown("**Bayesian Analysis / Frequentist Analysis (overall)**")
  render_relax(g1_df, "Guardrail #1", metric_type = "proportion", better = "lower")

  IRdisplay::display_markdown("**Bayesian Analysis / Frequentist Analysis (mobile-web only)**")
  if ("platform" %in% names(df_g1_relax)) {
    df_g1_mw <- df_g1_relax %>% filter(platform == "mobile-web")

    if (nrow(df_g1_mw) == 0) {
      message("Guardrail #1 (mobile-web only): skipped (no rows after filtering platform == 'mobile-web')")
    } else if (dplyr::n_distinct(df_g1_mw$test_group) < 2) {
      message("Guardrail #1 (mobile-web only): skipped (need both control and test groups)")
    } else {
      g1_mw_df <- df_g1_mw %>%
        transmute(
          outcome = reverted,
          variation = dplyr::case_when(
            test_group == "control" ~ "control",
            test_group == "test" ~ "treatment",
            TRUE ~ as.character(test_group)
          )
        )
      render_relax(g1_mw_df, "Guardrail #1 (mobile-web only)", metric_type = "proportion", better = "lower")
    }
  }
} else {
  message("Guardrail #1 relax: required columns missing or data not loaded")
}

Bayesian Analysis / Frequentist Analysis (overall)

Guardrail #1
Relative lift ((Treatment − Control) / Control)
Bayesian Analysis
Frequentist Analysis
Point Estimate Chance to Win P(Treatment better) 95% CrI Lower 95% CrI Upper Point Estimate p-value 95% CI Lower 95% CI Upper
−0.141 0.002 0.998 −0.239 −0.043 −0.145 0.004 −0.245 −0.046

Interpretation: Based on relax, the posterior probability that treatment is better than control is 99.8% (computed as 1 − Chance to Win, since lower revert rates are better for this guardrail).

Bayesian Analysis / Frequentist Analysis (mobile-web only)

Guardrail #1 (mobile-web only)
Relative lift ((Treatment − Control) / Control)
Bayesian Analysis
Frequentist Analysis
Point Estimate Chance to Win P(Treatment better) 95% CrI Lower 95% CrI Upper Point Estimate p-value 95% CI Lower 95% CI Upper
−0.220 0.002 0.998 −0.373 −0.067 −0.236 0.004 −0.395 −0.077

Interpretation: Based on relax, the posterior probability that treatment is better than control is 99.8% (computed as 1 − Chance to Win, since lower revert rates are better for this guardrail).
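For intuition, the relax posterior can be approximated with a conjugate Beta-Binomial simulation on the raw revert counts (469/1665 control, 379/1574 test; see the prop test below). This is an illustrative sketch under flat Beta(1, 1) priors, not the actual relax implementation, so the numbers match only approximately:

```r
set.seed(42)
n_draws <- 1e5

# Posterior draws of each group's revert rate under a Beta(1, 1) prior
ctrl <- rbeta(n_draws, 1 + 469, 1 + 1665 - 469)
test <- rbeta(n_draws, 1 + 379, 1 + 1574 - 379)

# Lower revert rate is better, so "treatment better" means test < control
p_better <- mean(test < ctrl)
rel_lift <- median((test - ctrl) / ctrl)

round(p_better, 3)  # close to the reported P(Treatment better) of 0.998
round(rel_lift, 3)  # close to the reported -0.141 point estimate
```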

Code
# Quick proportion tests for Guardrail #1 (revert) and Guardrail #2 (completion)
# Note: these are lightweight audit checks; primary inference is via regression + relax.

# Guardrail #1: revert rate among new-content edits (shown test vs eligible-not-shown control)
if (!is.null(reference_check_save_data) &&
    all(c("test_group", "is_new_content", "was_reverted",
          "was_reference_check_shown", "was_reference_check_eligible") %in% names(reference_check_save_data))) {
  df_g1 <- reference_check_save_data %>%
    make_rc_ab_group_published() %>%
    filter(is_new_content == 1)

  prop_df <- df_g1 %>%
    group_by(test_group) %>%
    summarise(success = sum(was_reverted == 1, na.rm = TRUE), total = n(), .groups = "drop")

  ctrl <- prop_df %>% filter(test_group == "control")
  tst <- prop_df %>% filter(test_group == "test")

  if (nrow(ctrl) == 1 && nrow(tst) == 1) {
    render_prop_test(ctrl$success, ctrl$total, tst$success, tst$total, "Prop test (Guardrail #1, revert rate)")
  }
}
Prop test (Guardrail #1, revert rate)
Group Success Total Rate
control 469 1665 28.2%
test 379 1574 24.1%
Prop test (Guardrail #1, revert rate) (prop.test)
Metric Value
p_value 0.00916
statistic 6.79

4.2.5 Guardrail #2 Edit completion

Metric: Proportion of edits that reach the point where Reference Check was shown or would have been shown and are successfully published, defined as event.action = "saveSuccess".

Eligibility: Eligible editing sessions are those where a user clicks publish, defined as event.action = "saveIntent", and successfully publishes the edit, defined as event.action = "saveSuccess".

Test group constraint: In the test group, analysis is limited to edits where Reference Check was shown at least once.

Methodology: We review the proportion of edits by newcomers, junior contributors with fewer than 100 edits, and unregistered users that reach saveIntent and successfully publish. Analysis is limited to edits that are not reverted within 48 hours.

Test group: The test group includes edits where Reference Check was shown.

Control group: The control group includes all edits that reach saveIntent. The control group cannot be limited to eligible-but-not-shown edits because eligibility is only tagged on published edits.

Note: This metric is similar to Guardrail #1 in the 2024 Reference Check experiment.
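The saveIntent/saveSuccess flags described above can be derived by aggregating the raw event stream per editing session. The sketch below is purely illustrative: the events table, its session_id and action columns, and the toy action values are hypothetical stand-ins, not the actual pipeline schema:

```r
library(dplyr)

# Hypothetical per-action event stream for three editing sessions
events <- tibble::tribble(
  ~session_id, ~action,
  "s1", "saveIntent",
  "s1", "saveSuccess",  # completed edit
  "s2", "saveIntent",   # clicked publish but never saved
  "s3", "init"          # never clicked publish: excluded from the denominator
)

completion <- events %>%
  group_by(session_id) %>%
  summarise(
    reached_intent = any(action == "saveIntent"),
    saved_edit     = as.integer(any(action == "saveSuccess")),
    .groups = "drop"
  ) %>%
  filter(reached_intent)  # Guardrail #2 denominator: sessions reaching saveIntent

mean(completion$saved_edit)  # completion rate for this toy stream: 0.5
```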

Results: We did not observe a drastic decrease in edit completion rate. Reference Check slightly reduces the likelihood that an edit is completed (−4.8% relative: 88.3% control, 84.1% test), and this effect is statistically significant. How big is the change by platform:

* Desktop: completion decreased 6.8% relative (94.0% → 87.6%).
* Mobile-web: completion decreased 6.3% relative (74.1% → 69.4%).

This guardrail shows a real and statistically significant decrease in completion. Overall, Reference Check introduces measurable friction that lowers completion rates, but this trade-off coincides with higher-quality outcomes: more references added, fewer reverts, and a higher share of constructive edits on mobile web.
Note: In the 2024 Reference Check report, edit completion rate decreased by 10% for edits where Reference Check was shown compared with the control group, and the decrease was larger on mobile than on desktop: 24.3% relative (13.5 pp) on mobile versus 3.1% relative (2.3 pp) on desktop. The completion rates reported in 2024 include saved edits that were later reverted.

Guardrail #2 (Completion = saveIntent → saveSuccess):
Overall: ↓ −4.2 pp (88.3% → 84.1%), −4.8% relative.
Desktop: ↓ −6.4 pp (94.0% → 87.6%), −6.8% relative.
Mobile-web: ↓ −4.7 pp (74.1% → 69.4%), −6.3% relative.
Evidence: glm (Table 4) OR = 0.58 (95% CI 0.49–0.67), p < 0.001.
Relax (relative lift): −0.048 (p < 0.001).
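As with Guardrail #1, a lightweight audit of the overall effect is a two-sample proportion test on the saveSuccess counts from the completion table below (58,283/65,969 control vs 1,223/1,454 test); primary inference remains the regression and relax analyses:

```r
# Two-sample proportion test on completion counts (control vs test)
ct <- prop.test(x = c(58283, 1223), n = c(65969, 1454))
round(unname(ct$estimate), 3)  # 0.883 (control), 0.841 (test)
ct$p.value                     # well below 0.001, consistent with the glm and relax results
```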

Code
# Completion (saveSuccess) bar by test group
# Updated per methodology (shown-only in test; focus population; unreverted when available)
if (!is.null(edit_completion_rate_data) &&
    all(c("test_group", "saved_edit", "was_reference_check_shown") %in% names(edit_completion_rate_data))) {
  ec_df <- edit_completion_rate_data %>%
    make_rc_ab_group_completion() %>%
    add_experience_group()

  if ("experience_level_group" %in% names(ec_df)) {
    ec_df <- ec_df %>% filter(experience_level_group %in% c("Newcomer", "Junior Contributor", "Unregistered"))
  }
  if ("was_reverted" %in% names(ec_df)) {
    ec_df <- ec_df %>% filter(is.na(was_reverted) | was_reverted != 1)
  }

  completion_plot <- ec_df %>%
    group_by(test_group) %>%
    summarise(rate = mean(saved_edit, na.rm = TRUE), n = n(), .groups = "drop") %>%
    mutate(label = scales::percent(rate, accuracy = 0.1)) %>%
    ggplot(aes(x = test_group, y = rate, fill = test_group)) +
    geom_col() +
    geom_text(aes(label = label), vjust = -0.2, size = 3) +
    scale_y_continuous(labels = scales::percent_format(), expand = expansion(mult = c(0, 0.12))) +
    scale_fill_manual(values = c("control" = "#999999", "test" = "dodgerblue4")) +
    labs(title = "Completion (saveSuccess) by test group",
         x = "Test group",
         y = "Percent saveSuccess") +
    pc_theme() +
    guides(fill = "none")
  print(completion_plot)
} else {
  message("Completion plot: required columns missing in edit_completion_rate_data")
}

Chart note (definition of Rate / denominator)

  • Completion (saveSuccess) by test group:
    • Rate = mean(saved_edit) where saved_edit is a 0/1 outcome (1 = saveSuccess).
    • Denominator = rows (events) in edit_completion_rate_data within each experiment group after Guardrail #2 filters (shown-only in test; focus population).

Note: these completion results exclude edits reverted within 48 hours when the was_reverted flag is available in edit_completion_rate_data.

Code
# 3) Edit completion rate (saveIntent -> saveSuccess)
# Updated to match Guardrail #2 methodology:
# - test rows are shown-only; control includes all saveIntent rows
# - focus population: Newcomer / Junior Contributor / Unregistered
# - exclude edits reverted within 48 hours when available
if (!is.null(edit_completion_rate_data) &&
    all(c("test_group", "saved_edit", "was_reference_check_shown") %in% names(edit_completion_rate_data))) {

  ec_df <- edit_completion_rate_data %>%
    make_rc_ab_group_completion() %>%
    add_experience_group()

  if ("experience_level_group" %in% names(ec_df)) {
    ec_df <- ec_df %>% filter(experience_level_group %in% c("Newcomer", "Junior Contributor", "Unregistered"))
  }
  if ("was_reverted" %in% names(ec_df)) {
    ec_df <- ec_df %>% filter(is.na(was_reverted) | was_reverted != 1)
  }

  completion_summary <- ec_df %>%
    mutate(
      was_reference_check_shown = ifelse(was_reference_check_shown == 1, "shown", "not_shown"),
      saved_edit = ifelse(saved_edit == 1, "saved", "not_saved")
    ) %>%
    count(test_group, was_reference_check_shown, saved_edit) %>%
    group_by(test_group, was_reference_check_shown) %>%
    mutate(pct = n / sum(n)) %>%
    arrange(test_group, was_reference_check_shown, desc(n))

  completion_summary <- renorm_buckets(completion_summary)

  render_pct_table(
    completion_summary,
    "Edit completion (saveIntent → saveSuccess)",
    c(test_group = "Test group", was_reference_check_shown = "Reference Check shown", saved_edit = "Outcome", n = "Count (events)", pct = "Percent of events"),
    note_text = "Percent of events = share of rows (events) in `edit_completion_rate_data` within each (test group × Reference Check shown) after the Guardrail #2 filters (shown-only in test; focus population). This table is a breakdown of events into `saved` vs `not_saved` (saveSuccess vs not saveSuccess). Note: these completion results exclude edits reverted within 48 hours when the `was_reverted` flag is available."
  )
} else {
  message("Completion summary: required columns missing in edit_completion_rate_data")
}
Edit completion (saveIntent → saveSuccess)
Outcome Count (events) Percent of events
control - not_shown
saved 58283 88.3%
not_saved 7686 11.7%
test - shown
saved 1223 84.1%
not_saved 231 15.9%

Table note: Percent of events = share of rows (events) in edit_completion_rate_data within each (test group × Reference Check shown) after the Guardrail #2 filters (shown-only in test; focus population). This table is a breakdown of events into saved vs not_saved (saveSuccess vs not saveSuccess). Note: these completion results exclude edits reverted within 48 hours when the was_reverted flag is available.

Code
# Completion tables by platform and deltas (Guardrail #2)
# Updated per methodology:
# - test group is limited to rows where RC was shown at least once
# - control group includes all rows reaching saveIntent (as represented in edit_completion_rate_data)
# - population focus: Newcomer / Junior (<=100 edits) / Unregistered; limit to unreverted within 48h when available
if (!is.null(edit_completion_rate_data) &&
    all(c("test_group", "platform", "saved_edit", "was_reference_check_shown") %in% names(edit_completion_rate_data))) {
  ec_df <- edit_completion_rate_data %>%
    make_rc_ab_group_completion() %>%
    add_experience_group()

  if ("experience_level_group" %in% names(ec_df)) {
    ec_df <- ec_df %>% filter(experience_level_group %in% c("Newcomer", "Junior Contributor", "Unregistered"))
  }
  if ("was_reverted" %in% names(ec_df)) {
    ec_df <- ec_df %>% filter(is.na(was_reverted) | was_reverted != 1)
  }

  # Overall (control vs test) + change vs control
  completion_overall_rates <- make_rate_table(ec_df, "saved_edit", group_cols = c("test_group")) %>%
    mutate(scope = "Overall")
  completion_overall_rel <- make_rel_change_dim(completion_overall_rates, dim_col = "scope")

  render_rate_rel(
    completion_overall_rates,
    completion_overall_rel,
    "Completion (saveSuccess) overall",
    "Completion: change vs control (overall)",
    c(test_group = "Test group", scope = "Scope", rate = "Rate", n = "Count (events)"),
    note_rate = "Rate = mean(0/1 outcome) where outcome=1 means saveSuccess (per event). Test group is shown-only; control includes all saveIntent rows. Population is restricted to Newcomer/Junior/Unregistered. Denominator = rows in `edit_completion_rate_data` within each experiment group after these filters. Note: these completion results exclude edits reverted within 48 hours when the `was_reverted` flag is available."
  )

  # By platform (control vs test) + change vs control
  completion_rates <- ec_df %>% make_rate_table("saved_edit")
  completion_rel <- make_rel_change(completion_rates)

  render_rate_rel(
    completion_rates, completion_rel,
    "Completion (saveSuccess) by platform",
    "Completion: change vs control (by platform)",
    c(test_group = "Test group", platform = "Platform", rate = "Rate", n = "Count (events)"),
    note_rate = "Rate = mean(0/1 outcome) where outcome=1 means saveSuccess (per event). Test group is shown-only; control includes all saveIntent rows. Population is restricted to Newcomer/Junior/Unregistered. Denominator = rows in `edit_completion_rate_data` for each (test group × platform) after these filters. Note: these completion results exclude edits reverted within 48 hours when the `was_reverted` flag is available."
  )
} else {
  message("Completion tables: required columns missing in edit_completion_rate_data")
}
Completion (saveSuccess) overall
Test group Rate Count (events) Scope
control 88.3% 65969 Overall
test 84.1% 1454 Overall

Table note: Rate = mean(0/1 outcome) where outcome=1 means saveSuccess (per event). Test group is shown-only; control includes all saveIntent rows. Population is restricted to Newcomer/Junior/Unregistered. Denominator = rows in edit_completion_rate_data within each experiment group after these filters. Note: these completion results exclude edits reverted within 48 hours when the was_reverted flag is available.

Completion: change vs control (overall)
scope Control rate Test rate Absolute difference (pp) Relative change vs control N (control) N (test)
Overall 88.3% 84.1% -4.2 -4.8% 65969 1454

Table note: Absolute difference (pp) = (test − control) × 100. Relative change = (test − control) / control.
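These deltas can be reproduced directly from the counts in the overall completion table; a minimal base-R sketch:

```r
# Overall completion counts, taken from the tables above
control_saved <- 58283; control_total <- 65969
test_saved    <- 1223;  test_total    <- 1454

control_rate <- control_saved / control_total   # ~0.883
test_rate    <- test_saved / test_total         # ~0.841

abs_diff_pp <- (test_rate - control_rate) * 100          # absolute difference, pp: ~-4.2
rel_change  <- (test_rate - control_rate) / control_rate # relative change vs control: ~-0.048
```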

Completion (saveSuccess) by platform
Test group Platform Rate Count (events)
control desktop 94.0% 47267
control mobile-web 74.1% 18702
test desktop 87.6% 1176
test mobile-web 69.4% 278

Table note: Rate = mean(0/1 outcome) where outcome=1 means saveSuccess (per event). Test group is shown-only; control includes all saveIntent rows. Population is restricted to Newcomer/Junior/Unregistered. Denominator = rows in edit_completion_rate_data for each (test group × platform) after these filters. Note: these completion results exclude edits reverted within 48 hours when the was_reverted flag is available.

Completion: change vs control (by platform)
Platform Control rate Test rate Absolute difference (pp) Relative change vs control N (control) N (test)
desktop 94.0% 87.6% -6.4 -6.8% 47267 1176
mobile-web 74.1% 69.4% -4.7 -6.3% 18702 278

Table note: Absolute difference (pp) = (test − control) × 100. Relative change = (test − control) / control.

Code
# Guardrail #2 model (completion = saveSuccess vs saveIntent)
# Updated per methodology (shown-only in test; control includes all saveIntent rows; focus population)
if (!is.null(edit_completion_rate_data)) {
  df_g2 <- edit_completion_rate_data %>%
    make_rc_ab_group_completion() %>%
    add_experience_group() %>%
    mutate(completed = saved_edit)

  if ("experience_level_group" %in% names(df_g2)) {
    df_g2 <- df_g2 %>% filter(experience_level_group %in% c("Newcomer", "Junior Contributor", "Unregistered"))
  }
  if ("was_reverted" %in% names(df_g2)) {
    df_g2 <- df_g2 %>% filter(is.na(was_reverted) | was_reverted != 1)
  }

  if (all(c("test_group", "platform", "completed") %in% names(df_g2)) &&
      ("experience_level_group" %in% names(df_g2))) {
    tryCatch({
      # Include was_reference_check_shown only if it varies after filtering (avoid perfect collinearity)
      f_g2 <- completed ~ test_group + platform + experience_level_group
      if ("was_reference_check_shown" %in% names(df_g2) && length(unique(df_g2$was_reference_check_shown)) > 1) {
        f_g2 <- completed ~ test_group + platform + experience_level_group + was_reference_check_shown
      }

      m_g2 <- glm(f_g2, data = df_g2, family = binomial())
      render_binom_model(
        m_g2,
        "Table 4. Adjusted odds ratios (ORs) from multivariable logistic regression for saveSuccess among saveIntent events.",
        note_text = "Outcome=1 means saveSuccess (per event). Test rows are shown-only; control includes all saveIntent rows. Population is restricted to Newcomer/Junior/Unregistered and to unreverted rows when available. Adjusted for platform and experience group (and for Reference Check shown when it varies). OR>1 indicates higher odds of the outcome."
      )
    }, error = function(e) {
      message("Guardrail #2 model error: ", e$message)
    })
  } else {
    message("Guardrail #2 model: required columns missing or data not loaded")
  }
} else {
  message("Guardrail #2 model: data not loaded")
}
Table 4. Adjusted odds ratios (ORs) from multivariable logistic regression for saveSuccess among saveIntent events.
Term OR CI low CI high SE p-value
Intercept 6.242 5.903 6.603 0.029 <0.001
test_grouptest 0.575 0.494 0.671 0.078 <0.001
platformmobile-web 0.212 0.201 0.222 0.026 <0.001
experience_level_groupNewcomer 1.243 1.146 1.349 0.042 <0.001
experience_level_groupJunior Contributor 3.495 3.301 3.700 0.029 <0.001
was_reference_check_shown NA NA NA NA NA (dropped by glm: aliased with test_group, since shown occurs only in test rows)

Table note: Outcome=1 means saveSuccess (per event). Test rows are shown-only; control includes all saveIntent rows. Population is restricted to Newcomer/Junior/Unregistered and to unreverted rows when available. Adjusted for platform and experience group (and for Reference Check shown when it varies). OR>1 indicates higher odds of the outcome.
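For readers less familiar with the OR column: exponentiating a logistic-regression coefficient yields the adjusted odds ratio. A self-contained sketch on simulated data (the variable names and effect size here are illustrative, not taken from the study dataset):

```r
set.seed(42)
n <- 5000
group <- rbinom(n, 1, 0.5)          # 0 = control, 1 = test (simulated assignment)
p     <- plogis(2.0 - 0.5 * group)  # true log-odds shift of -0.5 for the test group
completed <- rbinom(n, 1, p)

m  <- glm(completed ~ group, family = binomial())
or <- unname(exp(coef(m))["group"])       # adjusted OR for test vs control
ci <- exp(confint.default(m))["group", ]  # Wald 95% CI on the OR scale
```

With a true log-odds shift of -0.5, the estimated OR should land near exp(-0.5) ≈ 0.61; OR < 1 means lower odds of completion in the test group, mirroring the direction of the test_grouptest row in Table 4.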

Code
# Completion by platform and user_status
# Updated per methodology (shown-only in test; focus population; unreverted when available)
if (!is.null(edit_completion_rate_data)) {
  ec_df <- edit_completion_rate_data %>%
    make_rc_ab_group_completion() %>%
    add_experience_group()

  if ("experience_level_group" %in% names(ec_df)) {
    ec_df <- ec_df %>% filter(experience_level_group %in% c("Newcomer", "Junior Contributor", "Unregistered"))
  }
  if ("was_reverted" %in% names(ec_df)) {
    ec_df <- ec_df %>% filter(is.na(was_reverted) | was_reverted != 1)
  }

  if (all(c("test_group", "saved_edit", "platform", "user_status") %in% names(ec_df))) {
    completion_slices <- ec_df %>%
      group_by(test_group, platform, user_status) %>%
      summarise(rate = mean(saved_edit, na.rm = TRUE), n = n(), .groups = "drop")
    render_slice(
      completion_slices,
      "Completion by platform and user status",
      c(test_group = "Test group", platform = "Platform", user_status = "User status", rate = "Rate", n = "Count (events)"),
      note_text = "Rate = mean(0/1 outcome) where outcome=1 means saveSuccess (per event). Test rows are shown-only; control includes all saveIntent rows. Population is restricted to Newcomer/Junior/Unregistered and to unreverted rows when available. Denominator = rows in `edit_completion_rate_data` for each (test group × platform × user status) after these filters."
    )

    # Completion by experience group (explicitly matches methodology)
    if ("experience_level_group" %in% names(ec_df)) {
      completion_exp <- ec_df %>%
        group_by(test_group, experience_level_group) %>%
        summarise(rate = mean(saved_edit, na.rm = TRUE), n = n(), .groups = "drop")
      render_slice(
        completion_exp,
        "Completion by experience group",
        c(test_group = "Test group", experience_level_group = "Experience group", rate = "Rate", n = "Count (events)"),
        note_text = "Rate = mean(0/1 outcome) where outcome=1 means saveSuccess (per event). Denominator = rows in `edit_completion_rate_data` for each (test group × experience group) after the same filters used above."
      )
    }
  } else {
    message("Completion slices: required columns missing in edit_completion_rate_data")
  }
} else {
  message("Completion slices: edit_completion_rate_data not loaded")
}

# Completion by number of checks shown (bucketed)
if (!is.null(edit_completion_rate_data)) {
  ec_df <- edit_completion_rate_data %>%
    make_rc_ab_group_completion() %>%
    add_experience_group()

  if ("experience_level_group" %in% names(ec_df)) {
    ec_df <- ec_df %>% filter(experience_level_group %in% c("Newcomer", "Junior Contributor", "Unregistered"))
  }
  if ("was_reverted" %in% names(ec_df)) {
    ec_df <- ec_df %>% filter(is.na(was_reverted) | was_reverted != 1)
  }

  if (all(c("test_group", "saved_edit", "n_checks_shown") %in% names(ec_df))) {
    ec_df <- ec_df %>%
      mutate(checks_bucket = case_when(
        is.na(n_checks_shown) ~ "unknown",
        n_checks_shown == 0 ~ "0",
        n_checks_shown == 1 ~ "1",
        n_checks_shown == 2 ~ "2",
        n_checks_shown >= 3 ~ "3+"
      ))
    completion_by_checks <- ec_df %>%
      # Checks-shown buckets come from RC shown events; we report this slice for the test group only.
      filter(test_group == "test") %>%
      group_by(checks_bucket) %>%
      summarise(rate = mean(saved_edit, na.rm = TRUE), n = n(), .groups = "drop")

    render_slice(
      completion_by_checks,
      "Completion by checks shown (test group only)",
      c(checks_bucket = "Checks shown", rate = "Rate", n = "Count (events)"),
      note_text = "Rate = mean(0/1 outcome) in the test group only where outcome=1 means saveSuccess (per event). Denominator = rows in `edit_completion_rate_data` within the shown test group for each checks-shown bucket after the same filters used above. Control is excluded because it has no comparable checks-shown event stream."
    )
  } else {
    message("Completion by checks: required columns missing in edit_completion_rate_data")
  }
} else {
  message("Completion by checks: edit_completion_rate_data not loaded")
}
Completion by platform and user status
Test group Platform User status Rate Count (events)
control desktop registered 96.1% 41697
control desktop unregistered 78.0% 5570
control mobile-web registered 76.3% 14291
control mobile-web unregistered 66.8% 4411
test desktop registered 89.1% 1001
test desktop unregistered 78.9% 175
test mobile-web registered 68.9% 212
test mobile-web unregistered 71.2% 66

Table note: Rate = mean(0/1 outcome) where outcome=1 means saveSuccess (per event). Test rows are shown-only; control includes all saveIntent rows. Population is restricted to Newcomer/Junior/Unregistered and to unreverted rows when available. Denominator = rows in edit_completion_rate_data for each (test group × platform × user status) after these filters.

Completion by experience group
Test group Experience group Rate Count (events)
control Unregistered 73.0% 9981
control Newcomer 79.4% 5615
control Junior Contributor 92.4% 50373
test Unregistered 76.8% 241
test Newcomer 83.4% 175
test Junior Contributor 85.9% 1038

Table note: Rate = mean(0/1 outcome) where outcome=1 means saveSuccess (per event). Denominator = rows in edit_completion_rate_data for each (test group × experience group) after the same filters used above.

Completion by checks shown (test group only)
Checks shown Rate Count (events)
1 86.3% 985
2 84.7% 202
3+ 75.7% 267

Table note: Rate = mean(0/1 outcome) in the test group only where outcome=1 means saveSuccess (per event). Denominator = rows in edit_completion_rate_data within the shown test group for each checks-shown bucket after the same filters used above. Control is excluded because it has no comparable checks-shown event stream.

Code
# Per-wiki sanity: completion
# Updated per methodology (shown-only in test; focus population; unreverted when available)
# Only render when multiple wikis are present.
if (!is.null(edit_completion_rate_data) &&
    all(c("wiki", "test_group", "saved_edit", "was_reference_check_shown") %in% names(edit_completion_rate_data))) {
  if (dplyr::n_distinct(edit_completion_rate_data$wiki) <= 1) {
    message("Per-wiki completion: skipped (single wiki)")
  } else {
    ec_df <- edit_completion_rate_data %>%
      make_rc_ab_group_completion() %>%
      add_experience_group()

    if ("experience_level_group" %in% names(ec_df)) {
      ec_df <- ec_df %>% filter(experience_level_group %in% c("Newcomer", "Junior Contributor", "Unregistered"))
    }
    if ("was_reverted" %in% names(ec_df)) {
      ec_df <- ec_df %>% filter(is.na(was_reverted) | was_reverted != 1)
    }

    per_wiki_completion <- ec_df %>%
      group_by(wiki, test_group) %>%
      summarise(rate = mean(saved_edit, na.rm = TRUE), n = n(), .groups = "drop")

    render_slice(
      per_wiki_completion,
      "Per-wiki completion",
      c(wiki = "Wiki", test_group = "Test group", rate = "Rate", n = "Count (events)"),
      note_text = "Rate = mean(0/1 outcome) where outcome=1 means saveSuccess (per event). Test rows are shown-only; control includes all saveIntent rows. Denominator = rows in `edit_completion_rate_data` for each (wiki × test group) after these filters."
    )
  }
} else {
  message("Per-wiki completion: required columns missing in edit_completion_rate_data")
}
Per-wiki completion: skipped (single wiki)
Code
# Dismissal significance: prop.test by platform and by user_status
# (Normalize to control/test for readability.)
if (!is.null(reference_check_rejects_data) &&
    all(c("test_group", "platform", "user_status", "reject_reason", "editing_session") %in% names(reference_check_rejects_data))) {
  dismiss_df <- reference_check_rejects_data %>%
    renorm_buckets() %>%
    filter(reject_reason %in% c(
      "edit-check-feedback-reason-common-knowledge",
      "edit-check-feedback-reason-irrelevant",
      "edit-check-feedback-reason-uncertain",
      "edit-check-feedback-reason-other"
    ))

  base_df <- reference_check_rejects_data %>% renorm_buckets()

  # By platform
  plat_sessions <- base_df %>%
    group_by(test_group, platform) %>%
    summarise(total_sessions = n_distinct(editing_session), .groups = "drop")
  plat_dismiss <- dismiss_df %>%
    group_by(test_group, platform) %>%
    summarise(dismiss_sessions = n_distinct(editing_session), .groups = "drop")
  plat_join <- plat_sessions %>%
    left_join(plat_dismiss, by = c("test_group", "platform")) %>%
    mutate(dismiss_sessions = coalesce(dismiss_sessions, 0))

  if (all(c("control", "test") %in% plat_join$test_group)) {
    ctrl <- plat_join %>% filter(test_group == "control")
    tst <- plat_join %>% filter(test_group == "test")
    for (p in intersect(ctrl$platform, tst$platform)) {
      c_row <- ctrl %>% filter(platform == p)
      t_row <- tst %>% filter(platform == p)
      if (nrow(c_row) == 1 && nrow(t_row) == 1) {
        cat("\nProp test (Dismissal rate) platform =", p, "\n")
        print(prop.test(
          c(t_row$dismiss_sessions, c_row$dismiss_sessions),
          c(t_row$total_sessions,   c_row$total_sessions)
        ))
      }
    }
  }

  # By user_status
  us_sessions <- base_df %>%
    group_by(test_group, user_status) %>%
    summarise(total_sessions = n_distinct(editing_session), .groups = "drop")
  us_dismiss <- dismiss_df %>%
    group_by(test_group, user_status) %>%
    summarise(dismiss_sessions = n_distinct(editing_session), .groups = "drop")
  us_join <- us_sessions %>%
    left_join(us_dismiss, by = c("test_group", "user_status")) %>%
    mutate(dismiss_sessions = coalesce(dismiss_sessions, 0))

  if (all(c("control", "test") %in% us_join$test_group)) {
    ctrl2 <- us_join %>% filter(test_group == "control")
    tst2 <- us_join %>% filter(test_group == "test")
    for (u in intersect(ctrl2$user_status, tst2$user_status)) {
      c_row <- ctrl2 %>% filter(user_status == u)
      t_row <- tst2 %>% filter(user_status == u)
      if (nrow(c_row) == 1 && nrow(t_row) == 1) {
        cat("\nProp test (Dismissal rate) user_status =", u, "\n")
        print(prop.test(
          c(t_row$dismiss_sessions, c_row$dismiss_sessions),
          c(t_row$total_sessions,   c_row$total_sessions)
        ))
      }
    }
  }
} else {
  message("Dismissal prop tests: required columns missing in reference_check_rejects_data")
}
Code
# Guardrail #2 Bayesian lift (relax) — completion (saveSuccess vs saveIntent)
# Updated per methodology (shown-only in test; focus population; unreverted when available)
if (!is.null(edit_completion_rate_data) &&
    all(c("test_group", "saved_edit", "was_reference_check_shown") %in% names(edit_completion_rate_data))) {
  ec_df <- edit_completion_rate_data %>%
    make_rc_ab_group_completion() %>%
    add_experience_group()

  if ("experience_level_group" %in% names(ec_df)) {
    ec_df <- ec_df %>% filter(experience_level_group %in% c("Newcomer", "Junior Contributor", "Unregistered"))
  }
  if ("was_reverted" %in% names(ec_df)) {
    ec_df <- ec_df %>% filter(is.na(was_reverted) | was_reverted != 1)
  }

  guardrail2_df <- ec_df %>%
    transmute(
      outcome = saved_edit,
      variation = dplyr::case_when(
        test_group == "control" ~ "control",
        test_group == "test" ~ "treatment",
        TRUE ~ as.character(test_group)
      )
    )

  render_relax(guardrail2_df, "Guardrail #2", metric_type = "proportion", better = "higher")
} else {
  message("Guardrail #2 relax: required columns missing or data not loaded")
}
Guardrail #2
Relative lift ((Treatment − Control) / Control)
Bayesian analysis: Point estimate −0.048; Chance to win (P(Treatment better)) 0.000; 95% CrI (−0.069, −0.026)
Frequentist analysis: Point estimate −0.048; p-value 0.000; 95% CI (−0.069, −0.027)

Interpretation: Based on relax, the posterior probability that treatment is better than control is 0.0% (computed as Chance to Win).
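The relax output can be sanity-checked with a conjugate Beta-Binomial posterior on the overall completion counts; a minimal sketch assuming uniform Beta(1, 1) priors:

```r
set.seed(1)
draws <- 1e5
# Posterior draws for each group's completion rate (Beta(1, 1) prior + binomial likelihood)
post_control <- rbeta(draws, 1 + 58283, 1 + 65969 - 58283)
post_test    <- rbeta(draws, 1 + 1223,  1 + 1454 - 1223)

lift <- (post_test - post_control) / post_control  # relative lift draws
ctw  <- mean(post_test > post_control)             # "chance to win"; effectively 0 here
cri  <- quantile(lift, c(0.025, 0.975))            # ~(-0.069, -0.026)
```

With these sample sizes the posterior for the relative lift concentrates tightly below zero, so the chance to win is effectively 0 and the 95% CrI closely matches the table above.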

Code
# Guardrail #2: completion rate (test shown-only; control all saveIntent rows)
if (!is.null(edit_completion_rate_data) &&
    all(c("test_group", "saved_edit", "was_reference_check_shown") %in% names(edit_completion_rate_data))) {
  ec_df <- edit_completion_rate_data %>%
    make_rc_ab_group_completion() %>%
    add_experience_group()

  if ("experience_level_group" %in% names(ec_df)) {
    ec_df <- ec_df %>% filter(experience_level_group %in% c("Newcomer", "Junior Contributor", "Unregistered"))
  }
  if ("was_reverted" %in% names(ec_df)) {
    ec_df <- ec_df %>% filter(is.na(was_reverted) | was_reverted != 1)
  }

  prop_df2 <- ec_df %>%
    group_by(test_group) %>%
    summarise(success = sum(saved_edit == 1, na.rm = TRUE), total = n(), .groups = "drop")

  ctrl <- prop_df2 %>% filter(test_group == "control")
  tst <- prop_df2 %>% filter(test_group == "test")

  if (nrow(ctrl) == 1 && nrow(tst) == 1) {
    render_prop_test(ctrl$success, ctrl$total, tst$success, tst$total, "Prop test (Guardrail #2, completion)")
  }
}
Prop test (Guardrail #2, completion)
Group Success Total Rate
control 58283 65969 88.3%
test 1223 1454 84.1%
Prop test (Guardrail #2, completion) (prop.test)
Metric Value
p_value 8.56e-07
statistic 24.2
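This prop.test can be reproduced from the two counts alone:

```r
# Two-sample test of equal proportions (test vs control), matching the table above
pt <- prop.test(x = c(1223, 58283), n = c(1454, 65969))
pt$estimate  # ~0.841 (test) vs ~0.883 (control)
pt$p.value   # ~8.6e-07
```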

5 Reference

  • Focus: Reference Check A/B test KPIs (reference added, or an acknowledgement of why a citation was not added; constructive edits) in addition to guardrails (revert rate, completion, dismissals, retention).
  • Dimensions: group (control, test), platform (mobile web, desktop), user status (registered, unregistered), and # checks-shown buckets. (This report is enwiki-only.)
  • Statistical meaningfulness: Primary inference uses multivariable logistic regression (glm; binomial), with statistical significance defined a priori as two-sided p<0.05. As a robustness check, we fit a relax Bayesian model; effects with posterior probability >95% of a non-null association were considered corroborated.
  • In this A/B test, users in the test group were shown Reference Check when attempting an edit that met the requirements for the check to be shown in VisualEditor. The control group received the default editing experience, with no Reference Check shown.
  • We collected AB test events logged between 8 November 2025 and 8 December 2025 on English Wikipedia.
  • We relied on events logged in EditAttemptStep, VisualEditorFeatureUse, and change tags recorded in the revision tags table.
    • Published edits eligible for Reference Check are identified by the editcheck-references revision tag.
    • For filtering to new content edits we use editcheck-newcontent.
    • To identify edits where Reference Check was shown, we use VisualEditorFeatureUse events with event.feature = editCheck-addReference and event.action = check-shown-presave.
    • action-reject: editor dismissed Reference Check
    • edit-check-feedback-reason-*: Reason for dismissal
  • For calculating Edit Completion Rate, we assume that all edits reaching saveIntent are eligible.
  • For calculating Revert Rate, published edits eligible for Reference Check are identified by the editcheck-references revision tag. See the instrumentation spec for more details.
  • Data was limited to mobile web and desktop edits completed in the main (article) namespace using VisualEditor on English Wikipedia. We also limited the data to edits completed by unregistered users and users with 100 or fewer edits, as those are the users who would be shown Reference Check under the default config settings.
  • For each metric, we reviewed the following dimensions: by experiment group (test and control), by platform (mobile web or desktop), by user experience and status. We also reviewed some indicators such as edit completion rate by the number of checks shown within a single editing session.
    • Note: For the by user experience analysis, we split newer editors into three experience level groups: (1) unregistered, (2) newcomer (registered user making their first edit on Wikipedia), and (3) junior contributor, a registered contributor with >0 and ≤100 edits (i.e., 1–100).
  • Data collection: collect_enwiki_refcheck_ab_test_data.ipynb
  • Styling: paste-check-aligned plot/table defaults; method-note callout CSS included.
  • Models: logistic regression (glm) + Bayesian lift (relax) for inference/uncertainty.
  • New-content edit: An edit where is_new_content == 1 in the dataset (i.e., edits tagged/flagged as new-content in instrumentation).

As indicated in https://www.mediawiki.org/wiki/Edit_check/Tags: “Tag applied to all edits in which new content is added”, where “new content” in this context is defined by the conditions defined in T324730 and now codified in editcheck/modules/init.js.
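The three-way experience grouping used throughout this report can be sketched as a small base-R classifier (column semantics assumed from the definitions in this section; illustrative only):

```r
# Classify editors into the focus experience groups:
# Unregistered; Newcomer (registered, first edit); Junior Contributor (1-100 prior edits)
experience_group <- function(registered, edit_count) {
  ifelse(!registered, "Unregistered",
    ifelse(edit_count == 0, "Newcomer",
      ifelse(edit_count <= 100, "Junior Contributor", "Out of scope")))
}

groups <- experience_group(c(FALSE, TRUE, TRUE, TRUE), c(5, 0, 42, 500))
# "Unregistered" "Newcomer" "Junior Contributor" "Out of scope"
```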

  • Constructive: A new-content edit that was not reverted within 48 hours. In code: constructive = 1 when was_reverted != 1 (a missing was_reverted is treated as not reverted).
  • Returned (retention): A per-user 0/1 flag where 1 = the user made at least one subsequent saveSuccess 7–14 days after their first eligible edit, and 0 otherwise.
  • Dismissal rate:
    • Numerator: count of dismissal events (filtered to the 4 valid reasons).
    • Denominator: distinct editing_session count in the same slice (e.g., by test group × platform).
  • Revision tags (from measurement plan):

    • editcheck-references (Reference Check eligible)
    • editcheck-references-shown (Reference Check shown; we treat this as a secondary/audit signal)
    • editcheck-newcontent (new content edit)
    • editcheck-newreference (net new reference added)
    • mw-reverted (reverted)
  • Dataset column mapping used in this notebook:

    • is_new_content == 1 ↔︎ editcheck-newcontent
    • was_reference_check_eligible == 1 ↔︎ editcheck-references
    • was_reference_check_shown == 1 ↔︎ VisualEditorFeatureUse (VEFU) event.feature = editCheck-addReference, event.action = check-shown-presave
    • was_reference_included == 1 ↔︎ editcheck-newreference
    • was_reverted == 1 ↔︎ mw-reverted within 48 hours (per the collection definition)
  • Key events used for engagement / dismissal breakdowns (feature editCheck-addReference unless noted):

    • check-shown-presave (RC shown), action-accept, action-reject.
    • edit-check-feedback-shown (survey), edit-check-feedback-reason-* with valid RC reasons: other, uncertain, common-knowledge, irrelevant.
    • editCheckDialog window-open-from-check-[moment] (sidebar open; used as an auxiliary signal when needed).
  • Note: relevant-paste, ignored-paste-*, and check-learn-more are Paste Check events; we keep them in the reference list for cross-notebook consistency, but we do not use them for Reference Check metrics.

  • Control nuance from multi-check 2024: control had RC available; we use treatment-only comparisons there.

  • Core identifiers: wiki, test_group, user_id, user_status (registered vs unregistered), user_edit_count, experience_level_group (Unregistered / Newcomer / Junior Contributor), editing_session (EditAttemptStep editing_session_id), platform.
  • Reference / acknowledgement flags (acknowledgement = an explanation of why a citation was not added), resolved pick-first: has_reference_or_acknowledgement, added_reference_or_acknowledgement, has_reference, reference_added, has_reference_added, was_reference_included.
  • Retention flags (pick-first): retained_7_14d, retained_14d, retained, returned (retention flag: qualifying return edit in 7–14d window after first RC shown/eligible; if absent, use retention_flag_candidates).
  • Revert flag: was_reverted (48h window from collection queries).
  • New content flag: is_new_content.
  • Shown/eligible flags: was_reference_check_shown, n_checks_shown, was_reference_check_eligible, reference_check_shown.
  • Completion / outcome: saved_edit, event_action (expect saveIntent, saveSuccess).
  • Dismissals: was_reference_check_rejected, n_rejects, reject_reason.
  • Retention aggregates: return_editors (returned users in window), editors (total users in grouping; retention denominator).
  • For this A/B test, retention is defined as the editor returning to edit 7–14 days after first being shown Reference Check.
  • The code looks for a retention flag in retention_flag_candidates (e.g., retained_7_14d, retained_14d, etc.).
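The dismissal-rate definition above (sessions with a valid-reason dismissal over distinct sessions in the slice) can be sketched on a toy event log (all rows here are illustrative):

```r
valid_reasons <- c("edit-check-feedback-reason-common-knowledge",
                   "edit-check-feedback-reason-irrelevant",
                   "edit-check-feedback-reason-uncertain",
                   "edit-check-feedback-reason-other")

# Toy event log: one row per Reference Check event (NA = no dismissal)
events <- data.frame(
  editing_session = c("s1", "s1", "s2", "s3", "s3"),
  reject_reason   = c(NA, "edit-check-feedback-reason-other", NA,
                      "edit-check-feedback-reason-uncertain", "not-a-valid-reason"),
  stringsAsFactors = FALSE
)

denominator <- length(unique(events$editing_session))  # distinct sessions: 3
numerator   <- length(unique(events$editing_session[
  events$reject_reason %in% valid_reasons]))           # sessions with a valid dismissal: 2

dismissal_rate <- numerator / denominator  # 2/3
```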

Previous Reference Check Reports:
* Multi_check_ab_test_report
* Multi Check Indicators Ticket
* Multi Check Leading Indicators Report
* Reference Check AB Test
Previous Check Reports:
* Paste_check_leading_indicators Gitlab
* Paste_check_leading_indicators
* Paste Check Ticket

Methodology references: multi-check (2024), paste check (2025), edit-check references (2023). Note: in multi-check, the control also had Reference Check enabled; we referenced treatment-group-only comparisons from that report and avoided direct control deltas from that work for this study.