This is the notebook containing code and results of the analysis of the section editing experiment (T218851). This analysis is tracked in ticket T211239 and performed on SWAP.
Key findings:
Mobile VE users with section editing were 1.03 times more likely to finish making an edit after starting
The probability of success (defined as successfully finishing a mobile VE session, resulting in an edit to the article) increased from 0.3567 in the control group to 0.3664 in the treatment
This 0.0097 increase in success probability is statistically significant at the \(\alpha = 0.05\) level (p-value = 0.034)
This roughly translates to 1 fewer VE edit being abandoned for every 100 VE edits started if we enable section editing for everyone, everywhere (compared to keeping the status quo)
The average number of sessions per user was 1.948 in the treatment group and 1.940 in the control group
This difference is not statistically significant (p-value = 0.8)
# Packages:
library(glue)
library(zeallot)
library(magrittr)
import::from(dplyr, group_by, keep_where = filter, ungroup, summarize,
             mutate, rename, select, arrange, n, left_join, distinct,
             vars, everything, starts_with)
import::from(tidyr, spread, gather)
library(repr)
library(ggplot2)
library(patchwork)

# Helper functions:
import::from(polloi, compress)
inf2na <- function(x) {
  y <- x
  y[is.infinite(x)] <- NA
  return(y)
}
nan2na <- function(x) {
  y <- x
  y[is.nan(x)] <- NA
  return(y)
}
na2zero <- function(x) {
  y <- x
  y[is.na(x)] <- 0
  return(y)
}
suppress_messages_warnings <- function(x) {
  suppressMessages(suppressWarnings(x))
}
to_html <- function(df, ...) {
  df %>%
    knitr::kable(format = "html", ...) %>%
    as.character() %>%
    IRdisplay::display_html()
}
Data
The data come from client-side EventLogging, recorded with the EditAttemptStep schema. We fetch the daily session summaries using the following query:
query <- "USE event;
SELECT
  wiki,
  event.user_id,
  IF(event.user_id % 2 = 0, 'control', 'treatment') AS bucket,
  event.session_token AS mw_session_token,
  event.page_id,
  MIN(dt) AS session_dt_start,
  MAX(dt) AS session_dt_end,
  MAX(event.user_editcount) AS user_edit_count,
  MAX(event.init_type) AS init_type,
  MAX(event.init_mechanism) AS init_mechanism,
  SUM(IF(event.action = 'init', 1, 0)) > 0 AS ve_session_initialized,
  SUM(IF(event.action = 'ready', 1, 0)) > 0 AS ve_session_readied,
  SUM(IF(event.action = 'loaded', 1, 0)) > 0 AS ve_session_loaded,
  SUM(IF(event.action = 'saveSuccess', 1, 0)) > 0 AS ve_session_succeeded,
  SUM(IF(event.action = 'abort', 1, 0)) > 0 AS ve_session_aborted
FROM EditAttemptStep
WHERE year = ${year} AND month = ${month} AND day = ${day}
  AND event.page_ns = 0 -- main articles only
  AND event.user_id > 0 -- only logged-in users
  AND event.page_id > 0 -- page creation VE sessions
  AND wiki RLIKE 'wiki$'
  AND NOT wiki IN('${exclude_wikis}')
  AND event.session_token IS NOT NULL
  AND event.editor_interface = 'visualeditor'
  AND event.platform IN('phone', 'tablet')
GROUP BY
  wiki,
  event.user_id,
  IF(event.user_id % 2 = 0, 'control', 'treatment'),
  event.session_token,
  event.platform,
  event.page_id"
query_data <- function() {
  # see https://phabricator.wikimedia.org/T211239
  start_date <- as.Date("2019-03-29") # not 2019-03-18
  end_date <- as.Date("2019-06-10")
  test_dates <- seq(start_date, end_date, by = "day")
  exclude_wikis <- c(
    'wikidatawiki', 'commonswiki', 'mediawikiwiki', 'metawiki',
    'sourceswiki', 'specieswiki', 'outreachwiki', 'testwiki', 'incubatorwiki',
    paste0(c('he', 'bn', 'zh_yue'), 'wiki') # 1st wave had 100% rollout (not A/B tested)
  ) %>%
    paste0(collapse = "', '")
  results <- purrr::map_dfr(test_dates, function(date) {
    # message("Fetching mobile VE session data from ", date)
    c(year, month, day) %<-% wmf::extract_ymd(date)
    query <- glue(query, .open = "${", .close = "}")
    result <- suppress_messages_warnings(wmf::query_hive(query))
    result %<>%
      mutate(
        date = date,
        bucket = ifelse(user_id %% 2 == 0, 'control', 'treatment'),
        unique_user_id = paste(wiki, user_id)
      ) %>%
      dplyr::mutate_at(vars(starts_with("session_dt_")), lubridate::ymd_hms) %>%
      dplyr::mutate_at(vars(starts_with("init_")), ~ ifelse(.x == "NULL", NA, .x)) %>%
      dplyr::mutate_at(vars(starts_with("ve_session")), ~ .x == "true") %>%
      select(date, wiki, bucket, user_id, mw_session_token, page_id, everything()) %>%
      arrange(date, wiki, bucket, user_id, session_dt_start)
    if (date < "2019-04-02") {
      wikis <- paste0(c('hi', 'ar', 'fa', 'id', 'mr', 'ms', 'ml', 'th', 'az', 'sq'), 'wiki')
      result <- result[result$wiki %in% wikis, ] # only keep where test was live as of 28 March 2019
      # 2 April 2019 is all remaining wikis
    }
    return(result)
  })
  return(results)
}

if (file.exists("daily_data.rds")) {
  daily_data <- readr::read_rds("daily_data.rds")
} else {
  daily_data <- query_data()
  readr::write_rds(daily_data, "daily_data.rds", compress = "gz")
}
Here, we analyze the probability of a successful VE session once a session has been initiated (and VE has loaded), where success is “an edit is published”. For this, we use the BCDA package to compare success probabilities \(\pi_1\) and \(\pi_2\) between the \(n_1\) sessions in the treatment bucket (“group 1” in the analysis below) and the \(n_2\) sessions in the control bucket (“group 2” below). In this simple check, we treat each session as independent of the others, even though some sessions come from the same user.
We model the successes \(y_1\) and \(y_2\) with Binomial distributions having parameters \((\pi_1, n_1)\) and \((\pi_2, n_2)\), respectively. We assign Beta priors to \(\pi_1\) and \(\pi_2\), which gives the traditional beta-binomial model and makes it very easy to sample from the posterior. Using those samples, we can calculate credible intervals for the quantities \(\Delta_\pi = \pi_1 - \pi_2\) (the difference between the success probabilities) and the relative risk \(\theta = \frac{\pi_1}{\pi_2}\).
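As a minimal sketch of the sampling step: with a Beta prior and a Binomial likelihood, each posterior is again a Beta distribution, so draws can be taken directly with `rbeta`. The counts below are hypothetical placeholders rather than the experiment's data, and the Beta(1, 1) prior is an assumption made for illustration; it is not necessarily the prior that BCDA uses.

```r
set.seed(42)

# Hypothetical counts for illustration only (NOT the experiment's data)
n1 <- 1000; y1 <- 480  # group 1 (treatment): sessions, successes
n2 <- 1000; y2 <- 470  # group 2 (control): sessions, successes

# Assumed Beta(1, 1) prior on both success probabilities
a <- 1; b <- 1
draws <- 10000

# Conjugacy: Beta(a, b) prior + Binomial(n, pi) data => Beta(a + y, b + n - y)
pi1 <- rbeta(draws, a + y1, b + n1 - y1)
pi2 <- rbeta(draws, a + y2, b + n2 - y2)

delta <- pi1 - pi2  # difference in success probabilities
theta <- pi1 / pi2  # relative risk

# Posterior median and 95% credible interval for each quantity
ci <- function(x) quantile(x, c(0.025, 0.5, 0.975))
summary_table <- rbind(difference = ci(delta), relative_risk = ci(theta))
print(summary_table)
```

A credible interval for \(\Delta_\pi\) that excludes 0 (or for \(\theta\) that excludes 1) indicates a difference between the groups under this model.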
options(digits = 3)
BCDA::present_bbfit(bb, raw = TRUE) %>% to_html()
| Group 1 | Group 2 | Pr(Success) in Group 1 | Pr(Success) in Group 2 | Difference | Relative Risk | Odds Ratio |
|---|---|---|---|---|---|---|
| 70351 | 65638 | 48.138% (47.763%, 48.503%) | 47.219% (46.839%, 47.600%) | 0.919% (0.391%, 1.439%) | 1.019 (1.008, 1.031) | 1.038 (1.016, 1.059) |
According to this model, initialized VE sessions made by users in the treatment group had a higher probability of success than VE sessions initialized by users in the control group. The increase of 0.92 percentage points is from an average of 47.22% in the control group to 48.14% in the treatment group. That is, sessions with mobile section editing are 1.019 times as likely to result in a contribution as sessions without it.
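As a quick arithmetic check, the difference, relative risk, and odds ratio can be recomputed from the two reported success probabilities. This uses plug-in point estimates only, so it approximates the posterior summaries in the table rather than reproducing how they were computed.

```r
# Point estimates reported in the table (group 1 = treatment, group 2 = control)
p_treatment <- 0.48138
p_control <- 0.47219

difference <- p_treatment - p_control
relative_risk <- p_treatment / p_control

# Odds = p / (1 - p); the odds ratio compares the odds in the two groups
odds <- function(p) p / (1 - p)
odds_ratio <- odds(p_treatment) / odds(p_control)

print(round(c(
  difference = difference,
  relative_risk = relative_risk,
  odds_ratio = odds_ratio
), 4))
```

The odds ratio is further from 1 than the relative risk because both success probabilities are close to 0.5, where odds change faster than probabilities.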
These frequentist results are consistent with the Bayesian results above – an increase of 0.92 percentage points in probability and an odds ratio of 1.038. However, there is one major unresolved issue.
Multilevel model of success probability
The issue with the above approaches is that the sessions are assumed to be independent, which is not the actual case. Multiple sessions can belong to the same user who is just more (or less) likely than others to make edits. Furthermore, each language of Wikipedia can have its own base probability of an initiated VE session ending successfully. Therefore, a more correct model of success probability takes into consideration the within-user correlations and the between-user/within-wiki correlations. Let \(y = 0\) if the initialized (and readied/loaded) VE session did not result in a contribution and \(y = 1\) if it did. Then the outcome \(y_i\) of the \(i\)-th session by \(j\)-th user from \(k\)-th wiki can be modeled as follows:
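A form consistent with this description (per-user and per-wiki varying intercepts, a constant treatment effect), and with the symbols \(\mu\), \(\gamma_k\), and \(\beta\) used in the per-wiki section, is sketched here. The user-level intercept \(\alpha_j\), the treatment indicator \(x_j\), and the variance symbols are notation assumed for illustration:

\[
\begin{aligned}
y_i &\sim \text{Bernoulli}(\pi_i), \\
\text{logit}(\pi_i) &= \mu + \beta x_{j[i]} + \alpha_{j[i]} + \gamma_{k[i]}, \\
\alpha_j &\sim \mathcal{N}(0, \sigma_\alpha^2), \\
\gamma_k &\sim \mathcal{N}(0, \sigma_\gamma^2),
\end{aligned}
\]

where \(j[i]\) and \(k[i]\) denote the user and the wiki of session \(i\), and \(x_j = 1\) if user \(j\) is in the treatment group and \(0\) otherwise.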
By taking into consideration that sessions from the same user would be similar and that users from the same wiki would behave similarly, we increase the ability of the model to extract the effect of the treatment. This gives an approximate increase of 0.97 percentage points in success probability – from 35.67% in the control group to 36.64% in the treatment group.
Using the Gelman & Hill (2006) “divide by 4” rule and the delta method to get an upper bound on the change in probability, we get an approximate increase of 1.05 percentage points with a 95% confidence interval of (0.08%, 2.02%).
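The “divide by 4” rule rests on the fact that the logistic curve is steepest at \(\pi = 0.5\), where its slope is \(\beta/4\). So \(\hat{\beta}/4\) bounds the change in probability per unit change in the predictor, and by the delta method its standard error is \(\text{SE}(\hat{\beta})/4\). A minimal sketch follows; the coefficient and standard error in it are back-calculated from the interval reported above purely for illustration, and are not read from the fitted model.

```r
# Assumed values, back-calculated from the reported 1.05% (0.08%, 2.02%);
# they are NOT taken from the fitted model's output.
beta_hat <- 0.042
se_beta <- 0.0198

divide_by_4 <- function(beta, se, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)
  est <- beta / 4  # maximum slope of the logistic curve, reached at pi = 0.5
  se4 <- se / 4    # delta method: SE of a constant multiple of beta
  c(estimate = est, lower = est - z * se4, upper = est + z * se4)
}

upper_bound <- divide_by_4(beta_hat, se_beta)
print(round(100 * upper_bound, 2))  # in percentage points
```

With these assumed inputs the sketch gives an estimate of 1.05 with bounds of 0.08 and 2.02 percentage points, matching the figures quoted above by construction.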
Gelman, A., & Hill, J. (2006). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press.
Per-wiki improvements
With \(\hat{\mu}\) as the overall intercept estimate, \(\hat{\gamma}_k\) as the \(k\)-th wiki’s estimated intercept, and \(\hat{\beta}\) as the estimated effect of treatment on the linear scale, we can estimate the average probability of completion of the control and the treatment groups on \(k\)-th wiki as:
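A form consistent with these definitions applies the inverse logit to the linear predictor, with the user-level intercept set to its mean of zero: \(\text{logit}^{-1}(\hat{\mu} + \hat{\gamma}_k)\) for the control group and \(\text{logit}^{-1}(\hat{\mu} + \hat{\gamma}_k + \hat{\beta})\) for the treatment group. The sketch below is an assumption about that computation, and every numeric value and wiki name in it is hypothetical.

```r
# Hypothetical estimates for illustration only (not the fitted values)
mu_hat <- -0.60     # overall intercept
beta_hat <- 0.04    # treatment effect on the linear (logit) scale
gamma_hat <- c(wiki_a = 0.30, wiki_b = -0.20, wiki_c = 0.00)  # per-wiki intercepts

# plogis() is the inverse logit: plogis(x) = 1 / (1 + exp(-x))
per_wiki <- data.frame(
  wiki = names(gamma_hat),
  control = plogis(mu_hat + gamma_hat),
  treatment = plogis(mu_hat + gamma_hat + beta_hat),
  row.names = NULL
)
per_wiki$improvement <- per_wiki$treatment - per_wiki$control
print(per_wiki)
```

Because the inverse logit is nonlinear, the same \(\hat{\beta}\) yields a different improvement in probability on each wiki, with the largest improvements where the baseline probability is closest to 0.5.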
If we define exposure as the time (in days) that the user has had the ability to edit sections in VE on mobile since their first mobile VE session, then we can consider a model where the success probability of each session is also affected by this exposure time:
However, both of these models are worse (as determined by AIC) than our original one with the per-user, per-wiki varying intercepts and a constant effect of treatment, which means the additional complexity is unnecessary.
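For reference, the exposure variable defined above can be derived from session start times. This is a minimal sketch in base R, assuming a hypothetical data frame with one row per session and columns named after those produced by the query (`unique_user_id`, `session_dt_start`). The rows are made up for illustration.

```r
# Hypothetical sessions (made-up users and timestamps)
sessions <- data.frame(
  unique_user_id = c("xxwiki 1", "xxwiki 1", "xxwiki 1", "yywiki 7", "yywiki 7"),
  session_dt_start = as.POSIXct(
    c("2019-04-02 10:00:00", "2019-04-03 10:00:00", "2019-04-09 22:00:00",
      "2019-05-01 08:00:00", "2019-05-01 20:00:00"),
    tz = "UTC"
  )
)

# Start of each user's first mobile VE session, repeated on every row of that user
start_secs <- as.numeric(sessions$session_dt_start)
first_session <- ave(start_secs, sessions$unique_user_id, FUN = min)

# Exposure: days elapsed since the user's first session (86400 seconds per day)
sessions$exposure_days <- (start_secs - first_session) / 86400
print(sessions)
```

Under this definition a user's first session always has an exposure of 0 days.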
Average sessions per user
Users enrolled at the start of the test have more opportunities to see and use section editing, so they can accumulate many more sessions than users who entered the test later, especially near its end. We therefore compare the average number of sessions per user within the first week of entering the test between the two groups. As a result, users who entered the dataset in the last week of the test were not included in this analysis.
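One way to build such per-user counts is sketched below in base R. The data frame, its rows, and the name `per_user_counts_sketch` are hypothetical. The exact cutoffs (sessions fewer than 7 days after a user's first session, users entering at least 7 days before the end date) are assumptions about how “first week” and “last week” were applied.

```r
test_end <- as.Date("2019-06-10")  # end date of the test, as in query_data()

# Hypothetical sessions, one row per session (made-up users and dates)
sessions <- data.frame(
  unique_user_id = c("xxwiki 1", "xxwiki 1", "xxwiki 1",
                     "xxwiki 2", "yywiki 7", "yywiki 7"),
  bucket = c("treatment", "treatment", "treatment",
             "control", "control", "control"),
  date = as.Date(c("2019-04-02", "2019-04-05", "2019-04-20",
                   "2019-04-10", "2019-06-08", "2019-06-09"))
)

# Day on which each user entered the test (their first session), as a day number
first_date <- ave(as.numeric(sessions$date), sessions$unique_user_id, FUN = min)
days_in <- as.numeric(sessions$date) - first_date

# Keep sessions from each user's first 7 days, and drop users who entered
# during the last week of the test (they lack a full week of observation)
keep <- days_in < 7 & first_date <= as.numeric(test_end) - 7
eligible <- transform(sessions[keep, ], n = 1)

# One row per user, with n = number of sessions in their first week
per_user_counts_sketch <- aggregate(
  n ~ unique_user_id + bucket, data = eligible, FUN = sum
)
print(per_user_counts_sketch)
```

A data frame of this shape, with one row per user and the columns `n` and `bucket`, is what a call of the form `t.test(n ~ bucket, ...)` expects.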
t.test(n ~ bucket, data = per_user_session_counts, paired = FALSE)
Welch Two Sample t-test
data: n by bucket
t = -0.261, df = 38800, p-value = 0.79
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.074627 0.057070
sample estimates:
mean in group control mean in group treatment
1.9396 1.9483
Users with section editing had slightly more sessions (within 7 days of their first VE session) than users without, but that increase was not statistically significant (p-value = 0.8).