Variables impacting user account block rate on English Wikipedia

Author

Megan Neisler, Staff Data Scientist, Wikimedia Foundation

Modified

2025-01-27

Overview

As part of WE 4.2, the Trust and Safety team is exploring ways to reliably associate an individual with their actions (sockpuppetry mitigation) and to combine existing signals (e.g. IP addresses, account history, request attributes) to allow for more precise targeting of actions on bad actors.

This analysis aims to explore a sample of users on English Wikipedia and identify patterns that affect the chances of a registered user account being blocked. This analysis will determine how to weight a set of publicly available data points in calculating an overall reputation score for an account. This account reputation score could then be presented to functionaries/administrators/moderators on the wikis to assist in anti-abuse work.

See task description and the code repo for more details.

Methodology

First, we collected data on a set of publicly available data points associated with registered user accounts, including edit and block history. The Overview of User Attributes Reviewed table shown below lists the user attributes considered and their data sources.

We used this data to explore patterns across a sample of blocked user accounts on English Wikipedia and compare them to patterns across a sample of non-blocked user accounts. For this analysis, we gathered all users who created an account on English Wikipedia between May 2024 and July 2024. Data was limited to accounts less than 90 days old at the time of the analysis, on the assumption that older accounts with persistent bad-faith activity will already have been blocked. Bots and auto-created accounts were also excluded. This resulted in a sample of 253,633 users, of which about 2,406 users (roughly 0.9% of all users) were issued a block during the reviewed timeframe.

Two classification methods were then used to understand the relative importance of each user attribute for an account being blocked. Random forests were used to assess how important each user attribute is for classification. This was followed by logistic regression modeling to assess the magnitude and direction of each feature's impact on the probability of an account being blocked.
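As a rough illustration of the second step only, the sketch below fits a logistic regression with base R's glm() on simulated accounts and reads off odds ratios. The data, the effect sizes, and the column name has_confirmed_email are assumptions made for this example; it is not the model specification used in this report.

```r
# Toy illustration only: simulated accounts, not the report's data.
set.seed(42)
n <- 5000
toy <- data.frame(
  has_article_created = rbinom(n, 1, 0.05),
  has_confirmed_email = rbinom(n, 1, 0.5)   # hypothetical column name
)

# Assumed "true" effects, used only to simulate a block outcome
log_odds <- -4 + 2 * toy$has_article_created - 0.5 * toy$has_confirmed_email
toy$is_blocked <- rbinom(n, 1, plogis(log_odds))

# Logistic regression: block status as a function of the two attributes
fit <- glm(
  is_blocked ~ has_article_created + has_confirmed_email,
  data = toy,
  family = binomial()
)

# Exponentiated coefficients are odds ratios:
# above 1 raises the odds of a block, below 1 lowers them
odds_ratios <- exp(coef(fit))
print(round(odds_ratios, 2))
```

The sign of a coefficient gives the direction of the association and the size of the odds ratio gives its magnitude, which is the information a random forest importance score alone does not provide.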

Show the code


var_info <- read.delim('user_attribute_table.tsv', sep='\t')

var_tbl <- (
    var_info %>%
    gt(groupname_col = 'Type', rowname_col='Variable') %>%
    opt_stylize() %>%
    tab_header('Overview of User Attributes Reviewed') %>%
    tab_options(
        table.font.size = px(14)
    ) %>%
    tab_source_note(
        gt::md('data sources documentation: [mediawiki_history](https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/MediaWiki_history), \
                [globaluser table](https://www.mediawiki.org/wiki/Extension:CentralAuth/globaluser_table), \
                [logging](https://www.mediawiki.org/wiki/Manual:Logging_table), \
                [mediawiki user blocks change](https://schema.wikimedia.org/repositories/primary/jsonschema/mediawiki/user/blocks-change/current.yaml)')
    )
)

display_html(as_raw_html(var_tbl))
Overview of User Attributes Reviewed
Variable | Description | Source
Logical
Has confirmed email address | Has confirmed email address | centralauth.globaluser
Has completed edit | Has completed at least one edit on any namespace prior to being blocked | mediawiki_history
Has received thanks | Has received at least one thanks from another user | logging
Has reverted edit within 12 hours | Has completed at least one edit that was reverted within 12 hours after it was published | mediawiki_history
Has article created | Has created at least one article since creating an account | mediawiki_history
Has completed content edits | Has completed at least one content edit since creating an account (identified by page_namespace_is_content_historical = TRUE) | mediawiki_history
Has completed non content edit | Has completed at least one edit on a non-content namespace since creating an account (identified by page_namespace_is_content_historical = FALSE) | mediawiki_history
Has sleeper account activity | Sleeper account activity is defined as not making an edit for at least 30 days after creating an account and then making more than 5 edits within 5 minutes | mediawiki_history
Has user page | User page exists | mediawiki_history
Has historical (expired, since removed) blocks | Has received historical blocks since creating an account that have since expired | mediawiki_user_history
Has blocks on other wikis | Has account blocks that have been issued on other wikis | mediawiki_user_blocks_change
Has check user reports | Has check user reports issued | cu_log
Numerical
Number of articles created | Number of articles created by the user since creating an account (identified by revision_parent_id = 0 AND page_namespace_historical = 0) | mediawiki_history
Number of reverted edits | Number of the user's edits that were reverted (identified by revision_is_identity_reverted = true) since creating an account | mediawiki_history
Number of expired blocks | Number of historical blocks that the user has received since creating an account that have since expired | mediawiki_user_blocks_change
Number of thanks received | Number of thanks received from other users | logging
Number of all edits | Total number of edits completed on any page namespace since creating an account | mediawiki_history
Number of content edits | Total number of edits completed on a content namespace since creating an account | mediawiki_history
Number of non-content edits | Total number of edits completed on a non-content namespace since creating an account | mediawiki_history
Number of unreverted edits | Total number of unreverted edits | mediawiki_history
Maximum edit size | The largest absolute edit size in bytes completed by a user since creating an account (identified by revision_text_bytes_diff) | mediawiki_history
Ratio of reverted edits to all edits | The ratio of reverted edits to all edits by the user since creating an account | mediawiki_history
Number of articles created | Total number of articles created by the user since creating an account | mediawiki_history
Number of edits within 1 hour | Total number of edits completed on any page namespace within 1 hour of creating an account | mediawiki_history
Number of edits within 24 hours | Total number of edits completed on any page namespace within 24 hours of creating an account | mediawiki_history
Number of reverted edits within 12 hours | Total number of edits reverted within 12 hours | mediawiki_history
Number of reverted edits within 10 minutes | Total number of edits reverted within 10 minutes | mediawiki_history
Number of check user reports | Number of check user reports issued | cu_log
Categorical
User rights level | The current user rights of the user (identified by event_user_groups) | mediawiki_history
User edit bucket | Edit count bucket of the user on English Wikipedia | mediawiki_history
Data sources documentation: mediawiki_history, globaluser table, logging, mediawiki user blocks change

Setup

Show the code
shhh <- function(expr) suppressPackageStartupMessages(suppressWarnings(suppressMessages(expr)))
shhh({
    library(tidyverse);
    library(gt);
    library(IRdisplay);
    library(htmltools);
    library(scales);
    library(glue);
    library(gridExtra);
    library(gtsummary);
    library(data.table);
    library(ggthemes);
    library(magrittr)
# modeling support
    library(caret);
    library(tictoc);
    library(ranger);
    library(RhpcBLASctl);
    library(broom);
    library(broom.helpers);
    library(parameters);
    library(arm)
})

select <- dplyr::select

# random seed for reproducible results
set.seed(2024)

# set options
options(dplyr.summarise.inform = FALSE)
options(repr.plot.width = 15, repr.plot.height = 10)
options(warn=-1)

blas_set_num_threads(1)
options(
    digits = 3, 
    scipen = 50, 
    repr.plot.width = 15, 
    repr.plot.height = 10
)

# random seed for reproducible results
set.seed(2024)
Show the code
user_attribute_data <-
  read.csv(
    file = 'data/all_users_data.tsv',
    header = TRUE,
    sep = "\t",
    stringsAsFactors = FALSE
  ) 
Show the code
 # display great tables horizontally
display_tbl_hrz <- function(tables, space = 10) {
    
    tables_html <- lapply(tables, function(tbl) {
        div(style = sprintf("margin-right: %spx;", space),
            HTML(as.character(as.tags(tbl))))
    })
    
    main_div <- div(style = "display: flex; justify-content: space-around; flex-wrap: wrap;", tables_html)
    return(main_div)
}

# calculate percentage difference
pct_diff <- function(old, new) {
    return(((new - old) / old) * 100)
}


# format big numbers
format_big_number <- function(num) {
  if (num < 1000) {
    return(as.character(num))
  } else if (num < 1000000) {
    return(paste(format(num / 1000, nsmall = 1), "K", sep = ""))
  } else if (num < 1000000000) {
    return(paste(format(num / 1000000, nsmall = 1), "M", sep = ""))
  } else {
    return(paste(format(num / 1000000000, nsmall = 1), "B", sep = ""))
  }
}

Data Processing

Please refer to the data collection and the data processing notebooks for more details on the steps to collect and process the data reviewed in this report.

Overview of Blocks in Sample

Blocks by Duration

Show the code
blocked_users_duration_tbl <- user_attribute_data %>%
    filter(is_blocked == 1) %>%
    group_by(block_duration) %>%
    summarise(n_users = n_distinct(user_name)) %>%
    mutate(pct_users = paste0(round(n_users/sum(n_users) * 100, 0), "%"))  %>% 
    select(-2)  %>% 
    gt()  %>%
opt_stylize(5) %>%
    tab_header(
    title = "Proportion of blocked user accounts by block duration"
      )  %>%
  cols_label(
    block_duration= "Block duration",
    pct_users = "Proportion of users",
  ) 


display_html(as_raw_html(blocked_users_duration_tbl))
Proportion of blocked user accounts by block duration
Block duration Proportion of users
permanent 97%
temporary 3%

The majority of blocked registered accounts (97%) in the sample have been issued a permanent block.

Average time duration from user registration to block

Show the code
time_to_block <- user_attribute_data  %>%
    filter(is_blocked == 1) %>%
    group_by(user_name) %>%
    reframe(registration_time = user_registration_timestamp,
              block_time = block_timestamp,
            duration = as.numeric(difftime(block_time, registration_time, units = "days")))  
Show the code
time_to_block_distribution <- round(quantile(time_to_block$duration, probs = c(0.25, 0.5, 0.75), na.rm = TRUE), 0)
Show the code
options(repr.plot.width = 20, repr.plot.height = 15)

options(scipen = 999)
p <- time_to_block %>%
    ggplot(aes(x=duration)) + 
    geom_histogram(color = 'black', fill = "#999999") +
    geom_vline(aes(xintercept = time_to_block_distribution[1], color = '25th Percentile'), linetype = 'dashed', linewidth = 1.5) +
     geom_vline(aes(xintercept = time_to_block_distribution[2], color = '50th Percentile'), linetype = 'dashed', linewidth = 1.5) +
    geom_vline(aes(xintercept = time_to_block_distribution [3], color = '75th Percentile'), linetype = 'dashed', linewidth = 1.5) +
      geom_label(aes(x = time_to_block_distribution[1], y =300, label = paste0(time_to_block_distribution[1], " days"))) +
      geom_label(aes(x = time_to_block_distribution[2], y =300, label = paste0(time_to_block_distribution[2] , " days"))) +
      geom_label(aes(x = time_to_block_distribution[3], y = 300 , label = paste0(time_to_block_distribution[3], " days"))) +
      scale_x_log10() +
    scale_y_continuous(labels = label_comma())+
    labs (title = "Time from user registration to first block",
          y = "Number of users",
         x= "Time to first block (days, log scale)")   + 
    scale_color_manual(name='Percentiles',
                     breaks=c('25th Percentile', '50th Percentile', '75th Percentile'),
                     values=c("25th Percentile" = "darkblue", "50th Percentile" = "darkmagenta", "75th Percentile" = "darkred")) +
    theme(
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        plot.title = element_text(hjust = 0.5),
        text = element_text(size=25),
        legend.position= "bottom",
        axis.line = element_line(colour = "black")) 

p

Figure 1: The median block time for user accounts that registered during the reviewed timeframe is 34 days after registration.

The median block time is 29 days after account registration on English Wikipedia. Note: This dataset is limited to newer accounts that were created between May 2024 and July 2024 and as a result does not include accounts that took longer than 90 days to block.

By Block Reason

The block reason comes from the comment field of the mediawiki_user_blocks_change table. Note: As there is currently no structured data on block reason, we parsed the user-provided text in this field for keywords and sorted the data into a set of block categories. More bespoke block reasons that did not fit into the identified categories were grouped into the category “Other”.

There is currently a task open to provide more structured data to associate with a particular block.
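The keyword-based categorization described above could look roughly like the sketch below. The function name categorize_block_reason and the keyword patterns are hypothetical, chosen only to illustrate the approach; the actual rules are in the data processing notebook and may differ. Category labels are taken from the block reason table in this report.

```r
# Minimal sketch: map free-text block comments to categories by keyword.
# Patterns are illustrative assumptions, not the rules used in this analysis.
categorize_block_reason <- function(comment) {
  # Ordered by priority: the first matching pattern wins
  patterns <- c(
    checkuserblock   = "checkuser",
    sockpuppet       = "sock",
    spam_advertising = "spam|advertis|promot",
    vandalism        = "vandal"
  )
  text <- tolower(comment)
  category <- rep("Other", length(text))
  # Apply patterns in reverse so higher-priority patterns overwrite lower ones
  for (name in rev(names(patterns))) {
    category[grepl(patterns[[name]], text)] <- name
  }
  category
}

example_comments <- c(
  "{{checkuserblock-account}}",
  "Abusing multiple accounts: sockpuppet of another user",
  "Spam / advertising-only account",
  "Vandalism-only account",
  "Per discussion on the noticeboard"
)
block_categories <- categorize_block_reason(example_comments)
print(block_categories)
```

Comments that match none of the patterns fall through to "Other", which mirrors how bespoke reasons are grouped in this report.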

Show the code

blocked_users_block_reason_tbl <- user_attribute_data %>%
    filter(is_blocked == 1) %>%
    group_by(block_reason_simple) %>%
    summarise(n_users = n_distinct(user_id)) %>%
    mutate(pct_users = paste0(round(n_users/sum(n_users) * 100, 1), "%")) %>%
    arrange(desc(n_users)) %>%
    select(-2) %>%
     gt()  %>%
      tab_header(
    title = "Proportion of blocked users by block reason"
      )  %>%
opt_stylize(1) %>%
  cols_label(
    block_reason_simple = "Block reason",
    pct_users = "Proportion of users",
  ) %>%
  tab_footnote(
    footnote = "Based on categorization of user-provided text for the block reason. \n The Other category includes text that did not clearly identify a reason or was too bespoke to categorize",
    locations = cells_column_labels(
      columns = "block_reason_simple"
    )
  ) 


display_html(as_raw_html(blocked_users_block_reason_tbl))
Proportion of blocked users by block reason
Block reason [1] Proportion of users
sockpuppet 30.2%
spam_advertising 27.7%
checkuserblock 14.9%
disruptive 6.9%
Not_here_to_build_enyclopedia 6.9%
Other 6%
username_similar_to_organization 2.2%
long_term_abuse 1.8%
vandalism 1.7%
username_policy_violation 1.6%
making_legal_threat 0.1%
trolling 0%
user_name_bot 0%
[1] Based on categorization of user-provided text for the block reason. The Other category includes text that did not clearly identify a reason or was too bespoke to categorize.

The most frequent reason identified for a user account block during the reviewed timeframe was sockpuppetry (30.2% of all blocked accounts) followed by spam or advertising activity (27.7% of all blocked accounts).

These trends vary based on the duration of the block as shown in the table below:

Show the code


block_reason_tmp_tbl <- user_attribute_data %>%
    filter(is_blocked == 1 &
          block_duration == 'temporary') %>%
    group_by(block_reason_simple) %>%
    summarise(n_users = n_distinct(user_id)) %>%
    mutate(pct_users = paste0(round(n_users/sum(n_users) * 100, 2), "%")) %>%
    arrange(desc(n_users)) %>%
    select(-2) %>%
     gt()  %>%
      tab_header(
    title = "Proportion of temporary blocked users by block reason"
      )  %>%
opt_stylize(1) %>%
  cols_label(
    block_reason_simple = "Block reason",
    pct_users = "Proportion of users",
  ) 

block_reason_perm_tbl <- user_attribute_data %>%
    filter(is_blocked == 1 &
          block_duration == 'permanent') %>%
    group_by(block_reason_simple) %>%
    summarise(n_users = n_distinct(user_id)) %>%
    mutate(pct_users = paste0(round(n_users/sum(n_users) * 100, 2), "%")) %>%
    arrange(desc(n_users)) %>%
    select(-2) %>%
     gt()  %>%

      tab_header(
    title = "Proportion of permanent blocked users by block reason"
      )  %>%
opt_stylize(1) %>%
  cols_label(
    block_reason_simple = "Block reason",
    pct_users = "Proportion of users",
  ) 

display_tbl_hrz(list(as_raw_html(block_reason_tmp_tbl), as_raw_html(block_reason_perm_tbl)), space = 0)
Proportion of temporary blocked users by block reason
Block reason Proportion of users
Other 59.46%
disruptive 27.03%
spam_advertising 8.11%
sockpuppet 4.05%
vandalism 1.35%
Proportion of permanent blocked users by block reason
Block reason Proportion of users
sockpuppet 30.87%
spam_advertising 28.29%
checkuserblock 15.27%
Not_here_to_build_enyclopedia 7.03%
disruptive 6.36%
Other 4.53%
username_similar_to_organization 2.2%
long_term_abuse 1.87%
vandalism 1.75%
username_policy_violation 1.66%
making_legal_threat 0.08%
trolling 0.04%
user_name_bot 0.04%

Temporarily blocked accounts are more likely to have bespoke user-provided block reasons, which are currently grouped under the “Other” category. Temporarily blocked accounts are also more likely to be blocked for disruptive activity, while permanently blocked accounts are more likely to be blocked for sockpuppetry and advertising activity.

Distribution of published edits by blocked users

A number of the reviewed user attributes relate to the user’s editing activity. Let’s review the typical number of edits completed by users who are blocked within the 90-day timeframe.

Show the code
blocked_users_edits <- user_attribute_data %>%
    filter(is_blocked == 1,
          num_all_edits > 0)  %>%
    group_by(user_id) %>% # a few cases of a single user listed twice due to multiple blocks
    summarise(num_edits_completed = sum(num_all_edits))
Show the code
blocked_edits_distribution <- quantile(blocked_users_edits$num_edits_completed,probs = c(0.25, 0.5, 0.75))
Show the code
options(scipen = 999)
blocked_edits_histogram <- blocked_users_edits  %>%
    ggplot(aes(x=num_edits_completed)) + 
    geom_histogram(color = 'black', fill = "#999999") +
    geom_vline(aes(xintercept = blocked_edits_distribution[1], color = '25th Percentile'), linetype = 'dashed', linewidth = 1.5) +
     geom_vline(aes(xintercept = blocked_edits_distribution[2], color = '50th Percentile'), linetype = 'dashed', linewidth = 1.5) +
    geom_vline(aes(xintercept = blocked_edits_distribution[3], color = '75th Percentile'), linetype = 'dashed', linewidth = 1.5) +
      geom_label(aes(x = blocked_edits_distribution[1], y =250, label = paste0(blocked_edits_distribution[1], " edits"))) +
      geom_label(aes(x = blocked_edits_distribution[2], y =250, label = paste0(blocked_edits_distribution[2] , " edits"))) +
      geom_label(aes(x = blocked_edits_distribution[3], y = 250 , label = paste0(blocked_edits_distribution[3], " edits"))) +
    scale_y_continuous(labels = label_comma())+
    scale_x_log10() +
    labs (title = "Number of edits completed by blocked user accounts",
          y = "Number of blocked users",
         x= "Number of edits (log scale)")   + 
    scale_color_manual(name='Percentiles',
                     breaks=c('25th Percentile', '50th Percentile', '75th Percentile'),
                     values=c("25th Percentile" = "darkblue", "50th Percentile" = "darkmagenta", "75th Percentile" = "darkred")) +
    theme(
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        plot.title = element_text(hjust = 0.5),
        text = element_text(size=20),
        legend.position= "bottom",
        axis.line = element_line(colour = "black")) 

blocked_edits_histogram

The median number of edits (across any namespace) completed by blocked accounts within 90 days of creating an account is 5 edits. Temporarily blocked accounts complete considerably more edits: their median is 32 edits, compared to a median of 3 edits for permanently blocked accounts.
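The code shown above reports quartiles for all blocked accounts together. A split by block duration could be computed along the following lines; the toy values are made up for illustration, and only the column names block_duration and num_all_edits come from the code in this report.

```r
# Toy data with made-up edit counts, one row per blocked account
toy_blocked <- data.frame(
  block_duration = c("temporary", "temporary", "temporary",
                     "permanent", "permanent", "permanent"),
  num_all_edits  = c(4, 20, 60, 1, 2, 9)
)

# Median number of edits within each block duration group
median_by_duration <- tapply(
  toy_blocked$num_all_edits,
  toy_blocked$block_duration,
  median
)
print(median_by_duration)
```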

Data exploration: Patterns across blocked accounts

We first completed an exploratory analysis of the data to identify common patterns of blocked accounts across various user attributes. This included editing activity and other publicly available data about the user, such as user rights level and whether they have a confirmed email address.

Please refer to the blocked_user_data_exploration.ipynb notebook for more details on patterns across blocked accounts for the reviewed attributes including trends by block duration and block reason. Key insights from this analysis are summarized below.

Summary of key data points associated with blocked accounts
  • Confirmed email address: The majority (60%) of blocked user accounts do not have a confirmed email address. This is consistent for both temporarily and permanently blocked accounts but varies based on block reason.
  • Received Thanks: 99% of all permanently blocked user accounts and 87% of all temporarily blocked user accounts have never received thanks from any user.
    • If a blocked user did receive thanks, they typically received just 1. 75% of all blocked accounts received fewer than 4 thanks.
  • Number of Articles Created: Only 3% of blocked registered user accounts created at least one article prior to being blocked. Temporarily blocked users are slightly more likely to have created at least one article than permanently blocked users.
    • For the accounts that did create an article, the majority (75%) created fewer than 5 articles. The median number of articles created by blocked accounts is 2 articles.
    • Accounts blocked for sockpuppetry were the most likely to publish at least one article (8%) and accounts blocked for promotion and advertising activity were the least likely (0.8%) to publish at least one article during the reviewed timeframe.
  • Number of reverted edits:
    • Patterns differ significantly for temporarily and permanently blocked accounts. Almost all (96%) temporarily blocked users have made at least one reverted edit compared to 51% of permanently blocked users.
    • For temporarily blocked accounts, the median number of reverted edits is 16 edits. Almost all of these reverts occur within 12 hours.
  • Block reason also impacts the observed revert rate.
    • Accounts blocked for disruptive behavior were the most likely to complete an edit that was reverted, while accounts blocked for promotion/advertising and for user names associated with an organization were less likely to complete a reverted edit. This is likely because many of the accounts blocked for promotion and advertising are blocked due to a suspicious user name rather than editing activity.
  • Number of Edits
    • Temporarily blocked accounts are much more likely to complete a higher number of edits. The median number of edits completed by temporarily blocked accounts is 32 edits compared to a median of 3 edits completed by permanently blocked accounts.
  • The most edits are typically completed by accounts blocked for disruption or sockpuppetry. The fewest edits are completed by accounts blocked for user name violations.
  • Edit Size
    • The average edit size by blocked accounts is 304 bytes while the typical maximum edit size is 699 bytes. This appears to be much higher than the edit size by non-blocked users, but we will confirm this in the next part of the analysis when comparing trends to accounts that are not blocked.
    • While there is not much difference in the average edit size between temporarily and permanently blocked users, temporarily blocked users are much more likely to make larger edits.
  • Recently created account and high amount of activity:
    • Temporarily blocked accounts complete a median of 5 edits within 24 hours of registering.
    • Accounts blocked for sockpuppetry, disruptive activity, and violating WP:NOTHERE are also more likely to complete a higher number of edits right after registering.
  • Check user reports:
    • 15% of all blocked accounts have been checked with the CheckUser tool.
    • Patterns vary by block reason. Only 3% of accounts blocked for promotion and advertising have at least 1 check user report vs 50% of accounts blocked for sockpuppetry.
  • If the user’s “User:” namespace exists
    • The majority of accounts have not created a User Page on English Wikipedia. Trends are mostly consistent by block duration and reason.
  • The user rights for the user: 99% of all blocked accounts on English Wikipedia do not have any current or previously held user permissions.

Data exploration: Review of block rate across user attributes

We then compared these patterns to user accounts that are not currently blocked to explore the impact of each user attribute on the likelihood that an account was blocked during the reviewed timeframe.

For this data exploration, we reviewed all accounts created between May and June 2024 on English Wikipedia and the account’s block status at the time this dataset was gathered.

Editing Activity

A number of the reviewed user attributes are associated with the user account’s editing activity. There are two types of attributes we can consider: proportion of users that completed at least one of the editing activities and also the average number of those types of edits completed.

A number of these attributes are only relevant for users that have completed at least one edit.

User accounts that completed at least one edit

Show the code
# How many blocked users completed at least one edit
prop_users_w_edits <- user_attribute_data %>%
group_by(is_blocked, has_completed_edit) %>%
 summarise(n_users = n_distinct(user_name)) %>%
    mutate(pct_users = n_users/sum(n_users))

Blocked accounts are more likely to have completed at least one edit. Only 14% of blocked accounts did not complete an edit, compared to about 65% of non-blocked accounts during the reviewed timeframe.

Show the code
# compare proportion across blocked and non blocked accounts

calc_proportion <- function (col, df = user_attribute_data, is_col_bool = TRUE) {
    result <- (
        df %>%
        group_by(.data[[col]]) %>%
        summarise(
            Total = n_distinct(user_name),
            blocked = n_distinct(user_name[is_blocked == 1]),
            not_blocked = n_distinct(user_name[is_blocked == 0])
        ) %>%
        mutate(
            pct_blocked = blocked / Total,
            pct_not_blocked = not_blocked / Total,
            Variable = col,
            Total = Total
        ) %>%
        select(-c(blocked, not_blocked))
    )

    if (is_col_bool) {
        colnames(result)[1] <- 'TF'
    
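        # relabel TRUE/FALSE as Yes/No in one step
        # (passing extra arguments through across() is deprecated since dplyr 1.1)
        result <- (
            result %>%
            mutate(TF = str_replace_all(as.character(TF), c('TRUE' = 'Yes', 'FALSE' = 'No')))
        )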
    }

    return(data.frame(result))
}
Show the code
completed_edit_col <- 'has_completed_edit'

completed_edit_proportion_list <- list()
for (col in completed_edit_col) {
    completed_edit_proportion_list[[col]] <- calc_proportion(col, user_attribute_data)
}

completed_edit_proportion <- do.call(rbind, completed_edit_proportion_list)

## for editors that completed at least one edit

has_edit_proportion_tbl <- (
    completed_edit_proportion%>%
    mutate(Total_fmt = sapply(Total, format_big_number)) %>%
    gt(
        groupname_col = 'Variable', 
        rowname_col = 'TF'
    ) %>%
    fmt_percent(
        c('pct_blocked', 'pct_not_blocked'), 
        decimals=1
    ) %>%
    data_color(
        c('pct_blocked', 'pct_not_blocked'),
        palette = 'GnBu'
    ) %>%
    tab_spanner(
        'Is User Account Blocked', 
        c('pct_blocked', 'pct_not_blocked')
    ) %>%
    cols_label(
        pct_blocked = 'Yes',
        pct_not_blocked = 'No'
    ) %>%
    cols_hide('Total_fmt') %>%
    tab_header('Blocked proportion of user accounts by editing status') %>%
    opt_stylize(4)
)

display_html(as_raw_html(has_edit_proportion_tbl))
Blocked proportion of user accounts by editing status
has_completed_edit | Total | Blocked: Yes | Blocked: No
No | 162770 | 0.2% | 99.8%
Yes | 90863 | 2.3% | 97.7%

If we look at the block rate, 2.3% of all accounts that completed at least one edit during the reviewed timeframe were blocked compared to 0.2% of accounts that did not complete an edit.
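The two framings used in this section divide the same counts by different totals. The share of blocked accounts that edited is computed within each block status, while the block rate is computed within each editing status. The sketch below shows both on a made-up ten-account data frame; the counts are illustrative assumptions, not the report's data.

```r
# Toy data: 10 made-up accounts
toy <- data.frame(
  is_blocked         = c(1, 1, 1, 0, 0, 0, 0, 0, 0, 0),
  has_completed_edit = c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE,
                         FALSE, FALSE, FALSE, FALSE)
)

counts <- table(
  has_completed_edit = toy$has_completed_edit,
  is_blocked = toy$is_blocked
)

# Block rate: within each editing status, what share of accounts is blocked?
# (each row sums to 1)
block_rate <- prop.table(counts, margin = 1)

# Share by block status: within blocked / non-blocked accounts,
# what share completed an edit? (each column sums to 1)
share_by_block_status <- prop.table(counts, margin = 2)

print(round(block_rate, 2))
print(round(share_by_block_status, 2))
```

Because blocks are rare, the block rate stays small even when most blocked accounts have edited, so the two tables answer different questions and should not be compared directly.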

Show the code
# create dataset limited to editors only

user_attribute_data_editors <- user_attribute_data %>%
filter(num_all_edits > 0)
Show the code
editing_logical_cols <- c('has_received_thanks', 
                  'has_reverted_edit_12hours', 'has_article_created', 'has_completed_content',
                  'has_completed_non_content', 'has_sleeper_account_activity')


logical_cols_proportion_list <- list()
for (col in editing_logical_cols) {
    logical_cols_proportion_list[[col]] <- calc_proportion(col, user_attribute_data_editors)
}

logical_cols_proportion <- do.call(rbind, logical_cols_proportion_list)

## for editors that completed at least one edit

editors_proportion_tbl <- (
    logical_cols_proportion %>%
    mutate(Total_fmt = sapply(Total, format_big_number)) %>%
    gt(
        groupname_col = 'Variable', 
        rowname_col = 'TF'
    ) %>%
    fmt_percent(
        c('pct_blocked', 'pct_not_blocked'), 
        decimals=1
    ) %>%
    data_color(
        c('pct_blocked', 'pct_not_blocked'),
        palette = 'GnBu'
    ) %>%
    tab_spanner(
        'Is User Account Blocked', 
        c('pct_blocked', 'pct_not_blocked')
    ) %>%
    cols_label(
        pct_blocked = 'Yes',
        pct_not_blocked = 'No'
    ) %>%
    cols_hide('Total_fmt') %>%
    tab_header('Editors only: Blocked proportion of user accounts by editing activity') %>%
    opt_stylize(4)
)

display_html(as_raw_html(editors_proportion_tbl))
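Editors only: Blocked proportion of user accounts by editing activity
Variable | Value | Total | Blocked: Yes | Blocked: No
has_received_thanks | No | 89385 | 2.2% | 97.8%
has_received_thanks | Yes | 1478 | 5.8% | 94.2%
has_reverted_edit_12hours | No | 62077 | 1.5% | 98.5%
has_reverted_edit_12hours | Yes | 28786 | 3.9% | 96.1%
has_article_created | No | 89597 | 2.1% | 97.9%
has_article_created | Yes | 1266 | 17.3% | 82.7%
has_completed_content | No | 28180 | 2.3% | 97.7%
has_completed_content | Yes | 62683 | 2.3% | 97.7%
has_completed_non_content | No | 48187 | 1.3% | 98.7%
has_completed_non_content | Yes | 42676 | 3.4% | 96.6%
has_sleeper_account_activity | No | 89301 | 2.2% | 97.8%
has_sleeper_account_activity | Yes | 1562 | 7.2% | 92.8%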
Summary
  • Overall, the block rate is higher for user accounts that completed at least one of the identified editing activities compared to user accounts that did not.
  • has_completed_non_content: The block rate is slightly higher for user accounts that completed at least one edit on a non-content page (3.4%) than for user accounts that completed at least one edit on a content page (2.3%).
  • has_reverted_edit_12hours: We also reviewed whether the user had an edit reverted within 12 hours of its revision timestamp. This 12-hour time-to-revert criterion was proposed because it was also identified as one of the criteria for potential vandalism edits in T349083. The block rate is higher for users who had at least one such reverted edit: 3.9% of all user accounts that had an edit reverted within 12 hours of the revision timestamp were blocked.
  • has_article_created: We observed the highest block rate for accounts that created at least one new article in the reviewed timeframe. 17.3% of all accounts that created an article on English Wikipedia were blocked.
  • has_received_thanks: We also observed a higher block rate for accounts that received thanks. 5.8% of all accounts that received thanks were issued a block; however, this is likely because blocked accounts also tend to make more edits than non-blocked user accounts.
  • has_sleeper_account_activity: We also observed that user accounts are more likely to be blocked if they edit at a rapid rate after a period of inactivity following account creation. For this analysis, sleeper account activity was defined as making an edit at least 30 days after creating an account and making more than 5 edits within 5 minutes.
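The sleeper account definition leaves some room for interpretation. The sketch below assumes one reading: the account's first edit comes at least 30 days after registration, and at some point the account makes more than 5 edits inside a 5-minute window. The function name is_sleeper and its parameters are hypothetical; the flag used in this report is computed in the data processing notebook and may be defined differently.

```r
# Minimal sketch of one possible sleeper-account check (assumed interpretation)
is_sleeper <- function(registration_time, edit_times,
                       dormant_days = 30, burst_edits = 5, burst_minutes = 5) {
  edit_times <- sort(edit_times)
  n <- length(edit_times)
  # "More than burst_edits" edits are required, so fewer cannot qualify
  if (n <= burst_edits) return(FALSE)

  # Dormancy: the first edit must come at least dormant_days after registration
  days_to_first_edit <- as.numeric(
    difftime(edit_times[1], registration_time, units = "days")
  )
  if (days_to_first_edit < dormant_days) return(FALSE)

  # Burst: some run of (burst_edits + 1) consecutive edits spans
  # at most burst_minutes
  k <- burst_edits
  spans <- as.numeric(
    difftime(edit_times[(k + 1):n], edit_times[1:(n - k)], units = "mins")
  )
  any(spans <= burst_minutes)
}

registered <- as.POSIXct("2024-05-01 00:00:00", tz = "UTC")
# Six edits within 150 seconds, starting 40 days after registration
burst_after_dormancy <- registered + 40 * 86400 + c(0, 30, 60, 90, 120, 150)
# Same burst, but starting 2 days after registration
burst_early <- registered + 2 * 86400 + c(0, 30, 60, 90, 120, 150)

print(is_sleeper(registered, burst_after_dormancy))
print(is_sleeper(registered, burst_early))
```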

Average Editing User Attributes by Blocked Status

Show the code
editing_num_cols <- c(
    'num_all_edits', 'num_content_edits', 'num_non_content_edits', 'max_edit_size',
     'num_articles_created', 'num_edits_1hrs','num_edits_24hrs', 'num_reverted_edits_10minutes',
    'num_reverted_edits_12hours', 'num_reverted_edits_all', 'ratio_revert_edits', 'num_unreverted_edits'
)
Show the code
edits_by_block_status <- data.frame(t(
    user_attribute_data  %>%
    filter(num_all_edits > 0) %>%  #only applicable to editors
    select(is_blocked, all_of(editing_num_cols)) %>%
    group_by(is_blocked) %>%
    summarize(across(all_of(editing_num_cols), mean))
))


colnames(edits_by_block_status) <- c('No', 'Yes')
edits_by_block_status$Variable <- rownames(edits_by_block_status)
rownames(edits_by_block_status) <- NULL

edits_by_block_status_tbl <- (
    edits_by_block_status[-1, ] %>%
    gt() %>%
    cols_move_to_start('Yes') %>%
    cols_move_to_start('Variable') %>%
    tab_spanner('Is User Account Blocked', c('Yes', 'No')) %>%
    tab_header(
        'Editing activity by user account blocked status',
        'Limited only to accounts with at least one edit'
    ) %>%
    fmt_number(c('Yes', 'No'), decimals=1) %>%
    opt_stylize(style = 4)
)


display_html(as_raw_html(edits_by_block_status_tbl))
Editing activity by user account blocked status
Limited only to accounts with at least one edit
Variable | Is User Account Blocked: Yes | Is User Account Blocked: No
num_all_edits 46.7 7.7
num_content_edits 33.1 5.1
num_non_content_edits 13.6 2.5
max_edit_size 5,282.4 1,793.9
num_articles_created 1.1 0.1
num_edits_1hrs 2.2 1.8
num_edits_24hrs 4.5 2.8
num_reverted_edits_10minutes 2.0 0.5
num_reverted_edits_12hours 5.4 1.0
num_reverted_edits_all 10.0 1.3
ratio_revert_edits 0.4 0.3
num_unreverted_edits 36.7 6.4
Summary
  • Overall, there is a higher rate of editing activity across all reviewed editing attributes for blocked accounts.
  • num_all_edits & num_content_edits & num_non_content_edits: On average, blocked accounts complete about 6 times as many edits overall (46.7 vs. 7.7), including roughly 6 times as many content edits and 5 times as many non-content edits. Note: The reviewed sample was limited to newly created accounts; as a result, this reflects the number of edits completed within a relatively short time period (90 days).
  • max_edit_size & avg_edit_size: On average, blocked users make larger-sized edits. The average maximum edit size by blocked accounts was 5,282 compared to 1,794 for accounts that were not blocked.
  • num_reverted_edits_10minutes & num_reverted_edits_12hours & num_reverted_edits_all: For each user account, we reviewed the total number of reverted edits as well as the number of edits reverted within 10 minutes and within 12 hours, to determine any impact from the speed of reverts. Blocked user accounts have more reverted edits on average across all three of these attributes. We see the largest relative difference in the total number of reverted edits: an average of 10.0 reverted edits for blocked accounts compared to 1.3 for non-blocked accounts.
  • num_articles_created: While we found that 17% of all user accounts that created an article were blocked, these blocked accounts typically create just one article before they are blocked.
  • ratio_revert_edits: On average, blocked accounts have a slightly higher ratio of reverted edits to all edits compared to non-blocked accounts.
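
The relative differences quoted in this summary can be reproduced from the group means in the table above. This is a small arithmetic sketch; the data frame name `edit_means` and the `ratio` column are illustrative only.

```r
# Group means copied from the table above (accounts with at least one edit)
edit_means <- data.frame(
  variable    = c("num_all_edits", "num_content_edits",
                  "num_non_content_edits", "num_reverted_edits_all"),
  blocked     = c(46.7, 33.1, 13.6, 10.0),
  not_blocked = c(7.7, 5.1, 2.5, 1.3)
)

# Ratio of the blocked mean to the non-blocked mean, to one decimal place
edit_means$ratio <- round(edit_means$blocked / edit_means$not_blocked, 1)
edit_means
```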

User Edit Count

Show the code
# setting threshold per data publication guidelines.
threshold <- 150
Show the code
## order and factor user edit count
edit_buckets <- c('No edits', '1-10', '11-99', '100-999', '1000-4999', '5000+')

user_attribute_data <- (
    user_attribute_data %>%
    mutate(
        user_edit_bucket = factor(user_edit_bucket, levels = edit_buckets)
    )
)
Show the code
 plot_del_pct_bar <- function(variable, var_label) {
    
  plot <- (
      ggplot(
          calc_proportion(variable, is_col_bool = FALSE), 
          # aes_string() is deprecated; the .data pronoun accepts a column name held in a string
          aes(x = .data[[variable]], y = pct_blocked, fill = pct_blocked)
      ) +
      theme_classic() +
      geom_bar(stat = 'identity', position = 'dodge') + 
      geom_text(
           aes(label = ifelse(Total < threshold, (paste(scales::percent(pct_blocked), glue("<{threshold}"))),
                             (paste(scales::percent(pct_blocked), glue('of {Total}')))), fontface = 2),
          vjust = -0.5, 
          size = 5, 
          color = 'black'
      ) +
      scale_y_continuous(labels = percent) +
      labs(
          x = var_label,
          y = 'Percentage of Users Blocked',
          title = glue('Proportion of blocked users by {var_label}')
      ) +
      theme(
          text = element_text(size=16),
          plot.title = element_text(hjust = 0.5),
          legend.position = 'none'
      ) +
      scale_fill_gradient(low = 'PowderBlue', high='SteelBlue')
  )

  return(plot)
}

      
Show the code
user_edit_bucket_del_plot <- plot_del_pct_bar('user_edit_bucket', 'user edit bucket')
Show the code
user_edit_bucket_del_plot

Summary

User Edit Count

  • Users with lower edit counts have a lower rate of being blocked.
  • Users with higher edit counts (over 1000) have the highest rate of being blocked; however, there are also fewer of these accounts.
  • It’s also important to note that the dataset was limited to newly created accounts. As a result, these trends also reflect the rate of editing. Fewer users are able to make a large number of edits within 90 days of creating an account.
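
The bar labels in the plotting function above hide exact counts for small groups. A minimal base-R sketch of that labelling rule follows; the helper name `bucket_label` and the example inputs are assumptions, and `sprintf()` stands in for `scales::percent()`.

```r
# Hypothetical helper mirroring the geom_text() label logic: show the exact
# group size only when it is at or above the publication threshold.
bucket_label <- function(pct_blocked, total, threshold = 150) {
  pct <- sprintf("%.1f%%", 100 * pct_blocked)
  ifelse(
    total < threshold,
    paste(pct, paste0("<", threshold)),  # small group: suppress the count
    paste(pct, "of", total)              # large group: show the count
  )
}

bucket_label(pct_blocked = c(0.15, 0.009), total = c(136, 253487))
```

The first label reports only that the group has fewer than 150 accounts, while the second includes the full count.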

Blocked Attributes

We also reviewed past moderation or administrative activities against the user, including historical (expired) blocks, blocks issued on other wikis, and check user reports. This analysis includes all user accounts, including accounts that have not completed an edit.

Show the code
blocked_logical_cols <- c('has_historical_blocks', 'has_other_wiki_blocks', 'has_cu_reports')

logical_cols_proportion_list <- list()
for (col in blocked_logical_cols) {
    logical_cols_proportion_list[[col]] <- calc_proportion(col, user_attribute_data)
}

logical_cols_proportion <- do.call(rbind, logical_cols_proportion_list)

logical_cols_proportion_tbl <- (
    logical_cols_proportion %>%
    mutate(Total_fmt = sapply(Total, format_big_number)) %>%
    gt(
        groupname_col = 'Variable', 
        rowname_col = 'TF'
    ) %>%
    fmt_percent(
        c('pct_blocked', 'pct_not_blocked'), 
        decimals=1
    ) %>%
    data_color(
        c('pct_blocked', 'pct_not_blocked'),
        palette = 'GnBu'
    ) %>%
    tab_spanner(
        'Is User Account Blocked', 
        c('pct_blocked', 'pct_not_blocked')
    ) %>%
    cols_label(
        pct_blocked = 'Yes',
        pct_not_blocked = 'No'
    ) %>%
    cols_hide('Total_fmt') %>%
    tab_header('Proportion of blocked accounts by block history attribute') %>%
    opt_stylize(4)
)

display_html(as_raw_html(logical_cols_proportion_tbl))
Proportion of blocked accounts by block history attribute
Attribute value | Total | Is User Account Blocked: Yes | Is User Account Blocked: No
has_historical_blocks
No 253449 0.9% 99.1%
Yes 184 13.0% 87.0%
has_other_wiki_blocks
No 253487 0.9% 99.1%
Yes 146 22.6% 77.4%
has_cu_reports
No 250989 0.7% 99.3%
Yes 2644 23.1% 76.9%
Show the code
blocked_num_cols <- c(
    'num_exp_blocks'
)

exp_blocks_by_block_status <- data.frame(t(
    user_attribute_data  %>%
    #filter(num_exp_blocks > 0) %>%
    select(is_blocked, all_of(blocked_num_cols)) %>%
    group_by(is_blocked) %>%
    summarize(across(all_of(blocked_num_cols), mean))
))


colnames(exp_blocks_by_block_status) <- c('No', 'Yes')
exp_blocks_by_block_status$Variable <- rownames(exp_blocks_by_block_status)
rownames(exp_blocks_by_block_status) <- NULL

exp_blocks_by_block_status_tbl <- (
    exp_blocks_by_block_status[-1, ] %>%
    gt() %>%
    cols_move_to_start('Yes') %>%
    cols_move_to_start('Variable') %>%
    tab_spanner('Is User Account Blocked', c('Yes', 'No')) %>%
    tab_header(
        'Number of expired blocks by user account blocked status'
    ) %>%
    fmt_number(c('Yes', 'No'), decimals=1) %>%
    opt_stylize(style = 5)
)


display_html(as_raw_html(exp_blocks_by_block_status_tbl))
Number of expired blocks by user account blocked status
Variable | Is User Account Blocked: Yes | Is User Account Blocked: No
num_exp_blocks 0.0 0.0
Summary
  • We see higher block rates associated with other blocks or check user activity.
  • 13% of all user accounts that have had a historical, since expired block now have a current block issued to them.
  • 22.6% of all user accounts that have blocks on other wikis, and 23.1% of accounts that have had check user reports issued, are blocked.
  • While we see higher block rates for these attributes compared to editing activity, a much smaller number of users met these criteria. In the sample dataset reviewed, fewer than 200 users were identified as having a historical block (184) or a block on another wiki (146). Each group represents less than 0.1% of all user accounts.
  • The average number of expired blocks also rounds to 0, as the majority of users have not previously had a block issued to them. If we limit to accounts that have had a block, almost all accounts have been issued only one expired block.
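
The tables in this section rely on a `calc_proportion()` helper that is defined outside this part of the report. As a rough illustration of the kind of calculation involved, the base-R sketch below computes block rates by the value of a logical attribute. The function name `block_rate_by_flag`, its output columns, and the toy data are assumptions and may differ from the actual helper.

```r
# Hypothetical stand-in for the report's helper: block rate per value of a
# logical attribute column.
block_rate_by_flag <- function(data, flag_col, outcome_col = "is_blocked") {
  groups <- split(data[[outcome_col]], data[[flag_col]])
  pct_blocked <- vapply(groups, mean, numeric(1))
  data.frame(
    Variable        = flag_col,
    TF              = names(groups),
    Total           = vapply(groups, length, integer(1)),
    pct_blocked     = pct_blocked,
    pct_not_blocked = 1 - pct_blocked,
    row.names       = NULL
  )
}

# Toy data: 4 accounts with a check user report (1 blocked), 6 without (1 blocked)
toy <- data.frame(
  has_cu_reports = c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE),
  is_blocked     = c(TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE)
)

block_rate_by_flag(toy, "has_cu_reports")
```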

Other User Account Attributes

We also identified some miscellaneous account attributes such as the existence of a user page and if they have a confirmed email address.

Show the code
other_logical_cols <- c('has_user_page', 'has_confirmed_email')

logical_cols_proportion_list <- list()
for (col in other_logical_cols) {
    logical_cols_proportion_list[[col]] <- calc_proportion(col, user_attribute_data)
}

logical_cols_proportion <- do.call(rbind, logical_cols_proportion_list)

logical_cols_proportion_tbl <- (
    logical_cols_proportion %>%
    mutate(Total_fmt = sapply(Total, format_big_number)) %>%
    gt(
        groupname_col = 'Variable', 
        rowname_col = 'TF'
    ) %>%
    fmt_percent(
        c('pct_blocked', 'pct_not_blocked'), 
        decimals=1
    ) %>%
    data_color(
        c('pct_blocked', 'pct_not_blocked'),
        palette = 'GnBu'
    ) %>%
    tab_spanner(
        'Is User Account Blocked', 
        c('pct_blocked', 'pct_not_blocked')
    ) %>%
    cols_label(
        pct_blocked = 'Yes',
        pct_not_blocked = 'No'
    ) %>%
    cols_hide('Total_fmt') %>%
    tab_header('Proportion of blocked accounts by other user attribute activity') %>%
    opt_stylize(4)
)

display_html(as_raw_html(logical_cols_proportion_tbl))
Proportion of blocked accounts by other user attribute activity
Attribute value | Total | Is User Account Blocked: Yes | Is User Account Blocked: No
has_user_page
No 244386 0.7% 99.3%
Yes 9247 6.7% 93.3%
has_confirmed_email
No 166405 0.9% 99.1%
Yes 87228 1.0% 99.0%
Summary
  • has_user_page: User accounts with a user page created have a higher block rate than user accounts without one. Note: This checks whether the user page was created but not the extent of content on the page. It is possible that this trend is due to a higher frequency of comments left on talk pages of users who are blocked or made an edit that was reverted.
  • has_confirmed_email: There is minimal difference in the block rate of accounts with a confirmed email and those without. About 1% of user accounts with and without a confirmed email address are blocked.

User Rights Level

Show the code
## order and factor user rights level

user_rights <- c('none', 'confirmed', 'extended','extendedconfirmed, patroller' )

user_attribute_data <- (
    user_attribute_data %>%
    mutate(
        user_rights_level = factor(user_rights_level, levels = user_rights, ordered = TRUE),
    )
)
Show the code
 plot_del_pct_bar <- function(variable, var_label) {
    
  plot <- (
      ggplot(
          calc_proportion(variable, is_col_bool = FALSE), 
          # aes_string() is deprecated; the .data pronoun accepts a column name held in a string
          aes(x = .data[[variable]], y = pct_blocked, fill = pct_blocked)
      ) +
      theme_classic() +
      geom_bar(stat = 'identity', position = 'dodge') + 
      geom_text(
           aes(label = ifelse(Total < threshold, (paste(scales::percent(pct_blocked), glue("<{threshold}"))),
                             (paste(scales::percent(pct_blocked), glue('of {Total}')))), fontface = 2),
          vjust = -0.5, 
          size = 5, 
          color = 'black'
      ) +
      scale_y_continuous(labels = percent) +
      labs(
          x = var_label,
          y = 'Percentage of Users Blocked',
          title = glue('Proportion of blocked users by {var_label}')
      ) +
      theme(
          text = element_text(size=16),
          plot.title = element_text(hjust = 0.5),
          legend.position = 'none'
      ) +
      scale_fill_gradient(low = 'PowderBlue', high='SteelBlue')
  )

  return(plot)
}
Show the code
user_rights_del_plot <- plot_del_pct_bar('user_rights_level', 'User Rights Level')
Summary

User Rights Level:

  • The majority of all reviewed user accounts do not have any assigned user rights, which is expected as this analysis is currently focused on accounts recently created.
  • 136 accounts were identified as having extended rights and 15% of these accounts were blocked. On English Wikipedia, a registered editor receives the extendedconfirmed right automatically once the account has existed for at least 30 days and has made at least 500 edits. Given the quick editing rate observed for blocked accounts, it makes sense that these editors were granted the right automatically prior to being blocked.

Determining variable importance with random forest modeling

We used the gathered sample of 253,633 users to train a random forest (an ensemble classification algorithm) to classify whether a user was issued either a temporary or a permanent block. The random forest model allows us to understand variable importance in the dataset. This type of model considers interactions between variables, allowing it to identify attributes that, while not individually significant, become important in combination with others.

To understand the variable importance, we will be using the permutation importance approach, which gives the Mean Decrease in Accuracy (MDA). In this approach, the values of a given variable are randomly permuted to see how that affects classification accuracy.
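
The idea behind permutation importance can be sketched in base R on simulated data. This is a simplified illustration under stated assumptions: it uses a logistic regression as a stand-in classifier and measures accuracy on the training data, whereas the `ranger` random forest used in this analysis computes the measure internally on out-of-bag samples. The column names `signal` and `noise` are made up.

```r
set.seed(2024)
n <- 2000
toy <- data.frame(
  signal = rnorm(n),  # drives the outcome
  noise  = rnorm(n)   # unrelated to the outcome
)
toy$is_blocked <- rbinom(n, 1, plogis(3 * toy$signal))

# Stand-in classifier (assumption): logistic regression instead of a forest
fit <- glm(is_blocked ~ signal + noise, data = toy, family = binomial())

accuracy <- function(model, data) {
  pred <- as.integer(predict(model, newdata = data, type = "response") > 0.5)
  mean(pred == data$is_blocked)
}

baseline <- accuracy(fit, toy)

# Shuffle one column at a time and record the resulting drop in accuracy
permutation_importance <- sapply(c("signal", "noise"), function(v) {
  shuffled <- toy
  shuffled[[v]] <- sample(shuffled[[v]])
  baseline - accuracy(fit, shuffled)
})
permutation_importance
```

Shuffling the informative column breaks its link to the outcome and accuracy falls sharply, while shuffling the unrelated column leaves accuracy essentially unchanged.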

Data preparation for modeling

Since the reviewed attributes have vastly different scales, we will standardize the data before using logistic regression. This will help the optimization algorithm converge faster and more efficiently. Standardized coefficients are also easier to compare, as each represents the change in the log-odds of the outcome per one standard deviation change in the predictor.

Show the code
all_numerical_cols <- c('num_all_edits', 'num_content_edits', 'num_non_content_edits', 'max_edit_size', 
     'num_articles_created', 'num_edits_1hrs','num_edits_24hrs', 'num_reverted_edits_10minutes',
    'num_reverted_edits_12hours', 'num_reverted_edits_all', 'ratio_revert_edits', 'num_unreverted_edits', 'num_exp_blocks',
    'num_cu_requests', 'num_thanks'
)
Show the code
user_attribute_data_norm <- user_attribute_data %>% mutate(across(all_of(all_numerical_cols), log1p))
Show the code
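# Standardize the log-transformed columns. Chain from user_attribute_data_norm
# so that the log1p step above is not overwritten.
user_attribute_data_norm <- user_attribute_data_norm %>% 
    mutate(across(all_of(all_numerical_cols), ~ as.vector(scale(.x))))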
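
As a small illustration of the two transformation steps, the sketch below applies `log1p()` and then `scale()` to a made-up, heavily skewed vector of edit counts. The values are toy data, not taken from the sample.

```r
edits <- c(0, 1, 5, 50, 500)              # toy, heavily skewed edit counts

edits_log <- log1p(edits)                 # log(1 + x), defined for zero counts
edits_std <- as.vector(scale(edits_log))  # subtract the mean, divide by the sd

round(edits_std, 2)
c(mean = mean(edits_std), sd = sd(edits_std))
```

After both steps the column has mean 0 and standard deviation 1, and the largest count no longer dominates the range.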

Several of the attributes explored are highly correlated or redundant, which might make it difficult for the model to accurately estimate the individual effect of each variable. We can calculate the correlation between all the numerical user attributes in the data set.

Show the code
# all_of() makes the selection by an external character vector explicit
cor_matrix <- cor(select(user_attribute_data_norm, all_of(all_numerical_cols))) 

Based on a review of the correlations across attributes, the following pairs of attributes are highly correlated (above 0.7):

  • num_all_edits and num_content_edits
  • num_all_edits and num_unreverted_edits
  • num_edits_1hrs and num_edits_24hrs
  • num_reverted_edits_10minutes and num_reverted_edits_12hours
  • num_reverted_edits_all and num_reverted_edits_12hours

To limit redundancy, we keep one attribute from each correlated group in the model, choosing those that showed a large difference between blocked and non-blocked accounts in the exploratory analysis:

  • num_reverted_edits_12hours
  • num_edits_24hrs
  • num_content_edits
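
A correlation matrix can be screened for highly correlated pairs programmatically. The base-R sketch below uses simulated columns (the names are borrowed from this analysis, but the values are made up) and the same 0.7 cutoff.

```r
set.seed(1)
n <- 500
toy <- data.frame(num_content_edits = rexp(n, rate = 0.2))
toy$num_all_edits <- toy$num_content_edits + rexp(n, rate = 2)  # mostly content edits
toy$max_edit_size <- rexp(n, rate = 0.001)                      # generated independently

cor_matrix <- cor(toy)

# Upper triangle only, so each pair is listed once and the diagonal is skipped
high <- which(abs(cor_matrix) > 0.7 & upper.tri(cor_matrix), arr.ind = TRUE)

high_pairs <- data.frame(
  var_1 = rownames(cor_matrix)[high[, "row"]],
  var_2 = colnames(cor_matrix)[high[, "col"]],
  r     = round(cor_matrix[high], 2)
)
high_pairs
```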

Variable importance across all accounts

Show the code
# create training dataset (80% of the dataset) and test dataset (20% of the dataset)

# Remove identifier, non-attribute variables and variables identified as duplicative
model_data_rf <- select(user_attribute_data_norm, -c(user_name, user_id, 
                                                   block_timestamp, block_expiration_timestamp, user_registration_timestamp, 
                                                   block_duration, block_reason_simple, avg_edit_size, num_unreverted_edits, num_all_edits, num_edits_1hrs, num_reverted_edits_10minutes,
                                                   num_reverted_edits_all, user_rights_level, user_edit_bucket))

trn_index_rf <- createDataPartition(model_data_rf$is_blocked, p = 0.8, list = FALSE)
trn_data_rf <- model_data_rf[trn_index_rf, ]
test_data_rf <- model_data_rf[-trn_index_rf, ]
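
Because only about 1% of accounts in the sample are blocked, it matters that the split preserves the share of blocked accounts in both parts, which `createDataPartition()` does by sampling within each outcome class. A base-R sketch of that idea follows; the helper name `stratified_split` and the toy outcome vector are assumptions for illustration.

```r
# Hypothetical helper: sample a proportion `p` of row indices within each
# outcome class so that the class balance is preserved in the training set.
stratified_split <- function(y, p = 0.8) {
  by_class <- split(seq_along(y), y)
  picked <- lapply(by_class, function(i) {
    i[sample.int(length(i), size = round(p * length(i)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(42)
is_blocked <- c(rep(FALSE, 990), rep(TRUE, 10))  # toy outcome, 1% blocked
train_idx <- stratified_split(is_blocked, p = 0.8)

length(train_idx)
mean(is_blocked[train_idx])   # blocked share in the training rows
mean(is_blocked[-train_idx])  # blocked share in the held-out rows
```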
Show the code
tic()

rf_model_ranger <- ranger(
    
    formula = is_blocked ~ ., 
    data = trn_data_rf, 
    num.trees = 501,
    
    # suggested default for classification
    mtry = floor(sqrt(ncol(trn_data_rf))),
    verbose = TRUE,

    # mean decreasing accuracy
    importance = 'permutation',
    
    # ranger package by default uses all available cores at disposal
    # ideal to specify threads, especially if on a shared server
    num.threads = 6
)


toc()

saveRDS(rf_model_ranger, file='data/rf_model.rds')
Computing permutation importance.. Progress: 87%. Estimated remaining time: 4 seconds.
62.174 sec elapsed
Show the code
var_importance <- data.frame(
    variable = names(rf_model_ranger$variable.importance),
    imp = rf_model_ranger$variable.importance
)

var_importance <- var_importance %>% mutate(
    # imp_pct = imp / sum(imp) * 100,
    imp_norm = imp / max(imp)
)

row.names(var_importance) <- NULL
Show the code
options(repr.plot.width = 12, repr.plot.height = 10) 
var_imp_plot <- (
    ggplot(var_importance, aes(x = reorder(variable, imp_norm), y=imp_norm, fill=imp_norm)) + 
    geom_bar(stat = 'identity') +
    # geom_text(
    #     aes(label = sprintf("%.1f%%", imp_pct), fontface = 1),
    #     hjust = -0.1
    # ) + 
    coord_flip() + 
    labs(
        title = 'Importance of user attributes in predicting a user being blocked',
        x = 'Variable',
        y = 'Mean Decrease in Accuracy (Relative to Max)'
    ) + 
    theme_classic() + 
    theme(
        plot.margin = unit(c(1, 0, 1, 1), "cm"),
        text = element_text(size=14),
        plot.title = element_text(hjust = 0.5, size=18),
        legend.position = 'none'
    ) + 
    scale_fill_gradient(low = 'lightblue', high = 'darkblue')
)

var_imp_plot

Summary:
  • Each bar indicates the average decrease in accuracy (relative to the maximum).
  • Many of the editing-related attributes were identified as the most important in predicting a user being blocked.
    • The ratio of all reverted edits to all total edits was also identified as an important user attribute. Note: This ratio reflects all edits and is not limited to a current page namespace.
    • A high decrease in accuracy was observed when the values of the user’s maximum edit size were permuted.
    • The number of non-content edits is slightly more important to predicting if a user was blocked compared to the number of content edits.
    • The total number of reverted edits within 12 hours was also identified as important.
    • The model also indicates that whether a user page exists is important in determining if a user was blocked. Note: This could be related to comments being left on a user’s talk page.
  • Non-editing attributes were identified as less important in this model. Whether the user has a check user report was identified as the most important non-editing attribute, but it has a relatively low mean decrease in accuracy (MDA) compared to editing attributes.

Note: This doesn’t necessarily mean that these variables don’t have any importance in deciding the blocked outcome, but when considering the overall set of variables available, they are less important compared to others. Random forests favor features with more distinct values. In this dataset, the non-editing user attributes had limited distinct values.

We will investigate further using regression modeling to estimate the impact of each of these variables on the outcome and identify which features both models agree on.

Logistic Regression Modeling

We then use logistic regression modeling to assess the magnitude and direction of the user attribute’s impact on whether a user was blocked or not.

Note: If this analysis is extended to other Wikipedias, it would be good to use a hierarchical regression model to handle random effects that may be caused by differing block behavior on various Wikipedias.

For this method, we split the dataset into a training set (90% of accounts), on which the logistic regression model is fit, and a test set (the remaining 10%), which is held out for evaluating the fitted model.

Show the code
outcome <- 'is_blocked'

# exclude identifier columns and redundant variables
blocked_glm_exclude_cols <- c('user_name','user_id', 'block_timestamp', 'user_registration_timestamp', 'has_completed_edit',
                              'block_expiration_timestamp', 'block_duration', 'block_reason_simple', 
                             'avg_edit_size', 'num_unreverted_edits', 'num_all_edits', 'num_edits_1hrs', 'num_reverted_edits_10minutes',
                                                   'num_reverted_edits_all', 'user_rights_level', 'user_edit_bucket')

blocked_glm_data <- select(user_attribute_data_norm, -all_of(blocked_glm_exclude_cols))
blocked_glm_num_cols <- names(select(blocked_glm_data, where(is.numeric)))


#create training and test data sets
blocked_glm_trn_idx <- createDataPartition(blocked_glm_data[[outcome]], p = 0.9, list = FALSE)
blocked_glm_trn_data <- blocked_glm_data[blocked_glm_trn_idx, ]
blocked_glm_tst_data <- blocked_glm_data[-blocked_glm_trn_idx, ]

blocked_predictors <- setdiff(names(blocked_glm_data), outcome)
blocked_glm_formula <- as.formula(glue('{outcome} ~ {paste(blocked_predictors, collapse = \' + \')}'))

tic()
blas_set_num_threads(16)
blocked_glm <- glm(
    formula = blocked_glm_formula,
    family = binomial(link = 'logit'),
    #weights = weight,
    data = blocked_glm_trn_data
)
blas_set_num_threads(1)
print('model has been built.')
toc()
[1] "model has been built."
1.037 sec elapsed

Model Summary

Show the code

blocked_glm_numerical_features_summary_tbl <- tbl_regression(blocked_glm, include = all_of(blocked_glm_num_cols))

blocked_glm_logical_cols <- setdiff(names(select(blocked_glm_data, where(is.logical))), outcome)

blocked_glm_logical_features_summary_tbl <- (
    tbl_regression(
        blocked_glm, 
        include = all_of(blocked_glm_logical_cols), 
        show_single_row = all_of(blocked_glm_logical_cols),
        tidy_fun = broom.helpers::tidy_parameters
    )
)
Profiled confidence intervals may take longer time to compute.
  Use `ci_method="wald"` for faster computation of CIs.
Show the code
 #| column: page

display_tbl_hrz(
    list(
        as_raw_html(
            as_gt(
                blocked_glm_numerical_features_summary_tbl
            ) %>% 
            opt_stylize(5) %>%
            tab_header('Numerical features of user accounts')
            
        ), 
        as_raw_html(
            as_gt(
                blocked_glm_logical_features_summary_tbl
            ) %>% 
            opt_stylize(5) %>%
            tab_header('Logical features of user accounts')
        )
    )
)
Numerical features of user accounts
Characteristic log(OR)¹ 95% CI¹ p-value
num_thanks -0.57 -1.0, -0.14 0.011
num_exp_blocks 0.26 -3.1, 3.5 0.9
num_cu_requests -0.40 -0.79, -0.02 0.040
num_reverted_edits_12hours 0.24 0.13, 0.34 <0.001
num_edits_24hrs -0.29 -0.35, -0.22 <0.001
num_articles_created -0.09 -0.29, 0.11 0.4
num_content_edits 0.19 0.11, 0.26 <0.001
num_non_content_edits 0.17 0.10, 0.25 <0.001
max_edit_size 0.10 0.07, 0.13 <0.001
ratio_revert_edits 1.3 0.99, 1.6 <0.001
¹ OR = Odds Ratio, CI = Confidence Interval
Logical features of user accounts
Characteristic log(OR)¹ 95% CI¹ p-value
has_received_thanks 0.12 -0.39, 0.64 0.6
has_reverted_edit_12hours -0.12 -0.31, 0.07 0.2
has_article_created 0.61 0.28, 0.93 <0.001
has_completed_content -0.26 -0.44, -0.08 0.005
has_completed_non_content 0.46 0.29, 0.63 <0.001
has_sleeper_account_activity 0.91 0.68, 1.1 <0.001
has_confirmed_email -0.05 -0.16, 0.05 0.3
has_user_page 0.35 0.22, 0.47 <0.001
has_historical_blocks 0.25 -2.2, 2.8 0.8
has_other_wiki_blocks 2.1 1.5, 2.6 <0.001
has_cu_reports 2.4 2.0, 2.7 <0.001
¹ OR = Odds Ratio, CI = Confidence Interval
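
The tables above report coefficients on the log-odds scale. Exponentiating a coefficient gives the corresponding odds ratio, which can be easier to read. The short sketch below does this for a few of the reported values; the vector name `log_or` is illustrative.

```r
# log(OR) values copied from the two tables above
log_or <- c(
  has_cu_reports               = 2.4,
  has_other_wiki_blocks        = 2.1,
  has_sleeper_account_activity = 0.91,
  num_edits_24hrs              = -0.29
)

# Odds ratio: above 1 raises the odds of a block, below 1 lowers them
odds_ratio <- exp(log_or)
round(odds_ratio, 2)
```

For example, a log(OR) of 2.4 corresponds to roughly 11 times the odds of being blocked, holding the other attributes fixed.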

Logistic regression coefficient review

Show the code
 #| column: page

confidence_intervals <- confint.default(blocked_glm) %>%
  as.data.frame %>% { .$term <- gsub("`", "", rownames(.)); . } %>%
  set_rownames(NULL) %>%
  dplyr::mutate(estimate = coef(blocked_glm),
                `increases chances of` = ifelse(estimate > 0, "blocked", "not blocked"))


p <- ggplot(confidence_intervals, aes(y = estimate, x = term, color = `increases chances of`)) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_pointrange(aes(ymin = `2.5 %`, ymax = `97.5 %`)) +
  scale_y_continuous(limits = c(-8, 8)) +
  scale_color_brewer(type = "qual", palette = "Set1", direction = -1) +
  coord_flip() +
  ggthemes::theme_tufte(base_family = "Gill Sans", base_size = 16) +
  labs(x = "Variable", y = "Estimate",
       title = "Coefficient plot of logistic regression model of \n factors associated with blocked accounts ",
      caption = "Point estimates and 95% confidence intervals for coefficients used in the logistic regression model.
Positive coefficients increase the odds of the user being blocked" ) +
  theme(legend.position = "bottom", panel.grid = element_line(color = "gray80"),
        panel.grid.major.y = element_line(color = "black", size = 0.1))

p

A number of user attributes were assessed to have no significant effect on the outcome variable, such as whether the account received thanks, has a confirmed email address, or has historical blocks. The ratio of reverted edits, sleeper account activity, and administrative activity such as blocks on other wikis and check user reports were identified as factors that significantly increased the chance of a user account being blocked.

We will further explore the impact of variables determined to have a significant impact on whether an account is blocked in the next section.

Model interpretation and summary

Show the code
blocked_glm_coeffs <- data.frame(coeff = blocked_glm$coefficients)
blocked_glm_intercept <- blocked_glm_coeffs['(Intercept)', ]

Logical Variables

Show the code
prob_impact <- function(
    var_name,
    var_info, 
    intercept, 
    fixefs, 
    ranefs, 
    ranef_group='11-20', 
    incl_ranef=TRUE, 
    return_pct_diff=TRUE
) {
      
    if (incl_ranef) {
        baseline <- intercept + ranefs[ranef_group, 'Intercept']
    } else {
        baseline <- intercept
    }

    coeff <- fixefs[var_name, ]

    if (var_info$var_type == 'num') {
        
        init_val <- var_info$base
        increase <- var_info$increase

        if (var_info$is_ln) {
            
            init_log_odds <- baseline + coeff * log1p(init_val)
            new_log_odds <- baseline + coeff * log1p(init_val + increase)
            
        } else {
            
            init_log_odds <- baseline + coeff * init_val
            new_log_odds <- baseline + coeff * (init_val + increase)
        }
        
    } else if (var_info$var_type == 'bool') {
        init_log_odds <- baseline
        new_log_odds <- baseline + coeff
    }

    init_val_prob <- invlogit(init_log_odds)
    increased_prob <- invlogit(new_log_odds)
    
    if (return_pct_diff) {
        return(pct_diff(init_val_prob, increased_prob))
    } else {
        return(increased_prob)  # was `new_prob`, which is never defined
    }
}
Show the code
# Define rows for the significant columns
blocked_glm_logical_cols_sig <- c('has_article_created','has_completed_non_content', 'has_completed_content',  'has_sleeper_account_activity', 
                                  'has_user_page', 'has_other_wiki_blocks', 'has_cu_reports')
Show the code
blocked_glm_logical_features_summary <- data.frame(
    feature = blocked_glm_logical_cols_sig, 
    coeff = blocked_glm_coeffs[blocked_glm_logical_cols_sig, ]
        
)

row.names(blocked_glm_logical_features_summary) <- blocked_glm_logical_cols_sig

for (col in blocked_glm_logical_cols_sig) {
    blocked_glm_logical_features_summary[col, 'prob_diff'] <- prob_impact(
        col, list(var_type = 'bool'), intercept = blocked_glm_intercept, fixefs = blocked_glm_coeffs, incl_ranef = FALSE
    ) / 100
}
Show the code
blocked_glm_logical_features_summary_tbl <- (
    blocked_glm_logical_features_summary %>%
    gt(rowname_col = 'feature') %>%
    fmt_percent('prob_diff', decimals=0) %>%    
    cols_label(
        coeff = 'Coefficient',
        prob_diff = '% Change'
    ) %>%
    tab_header('Change in Probability of User Account Block', 'for logical variables') %>%
    tab_source_note('% Change is when the given feature is TRUE') %>%
    opt_stylize()
)

display_html(as_raw_html(blocked_glm_logical_features_summary_tbl))
Change in Probability of User Account Block
for logical variables
Coefficient % Change
has_article_created 0.593 80%
has_completed_non_content 0.409 50%
has_completed_content -0.317 −27%
has_sleeper_account_activity 0.890 141%
has_user_page 0.331 39%
has_other_wiki_blocks 1.987 602%
has_cu_reports 2.231 784%
% Change is when the given feature is TRUE
Summary
  • The impact on the probability of being blocked is highest when check user reports have been issued or the user account has been blocked on other wikis.
  • Sleeper account activity (a quick rate of editing after a period of time since account creation) was also identified as one of the attributes with the highest impact on the probability of being blocked.
  • The probability of a user being blocked also increased if they completed a non-content edit or created an article shortly after creating an account, while the probability of being blocked decreased if they completed at least one content edit.
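
The "% Change" column for logical variables compares the predicted probability with the feature set to TRUE against the baseline probability implied by the intercept alone. The sketch below walks through that calculation. The `invlogit` and `pct_diff` definitions are hypothetical stand-ins for helpers defined elsewhere in the report, and the intercept value is an assumption chosen for illustration, since the fitted intercept is not shown.

```r
# Hypothetical stand-ins for helpers that are used above but defined elsewhere
invlogit <- function(x) plogis(x)                    # log-odds -> probability
pct_diff <- function(old, new) 100 * (new - old) / old

intercept <- -5      # assumed baseline log-odds, for illustration only
coeff     <- 2.231   # has_cu_reports coefficient from the table above

p_false <- invlogit(intercept)          # feature is FALSE
p_true  <- invlogit(intercept + coeff)  # feature is TRUE

c(p_false = p_false, p_true = p_true, pct_change = pct_diff(p_false, p_true))
```

Because the baseline probability is small, the percent change in probability is close to, but slightly below, the percent change in odds, `100 * (exp(coeff) - 1)`.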

Numerical Variables

Show the code
# limit to variables where a significant difference was found

blocked_glm_num_cols_sig <- c('num_reverted_edits_12hours','num_edits_24hrs', 'num_content_edits', 'num_non_content_edits', 'max_edit_size', 'ratio_revert_edits')
Show the code
blocked_glm_numerical_features_summary <- data.frame(
    feature = blocked_glm_num_cols_sig,
    base_l1 = c(3, 2, 10, 10, 1000, 0.2),
    base_l2 = c(5, 10, 100, 100, 2000, 0.2),
    increase_l1 = c(10, 3, 20, 20, 2000, 0.3),
    increase_l2 = c(15, 50, 200, 200, 3000, 0.4)
)

row.names(blocked_glm_numerical_features_summary) <- blocked_glm_num_cols_sig

for (col in blocked_glm_num_cols_sig) {
    row_index <- which(row.names(blocked_glm_numerical_features_summary) == col)
    
    col_info <- list(
        var_type = 'num',
        is_ln = TRUE,
        base = blocked_glm_numerical_features_summary[row_index, 'base_l1'], 
        increase = blocked_glm_numerical_features_summary[row_index, 'increase_l1']        
    )
    
    blocked_glm_numerical_features_summary[row_index, 'prob_diff_l1'] <- prob_impact(col, col_info, intercept = blocked_glm_intercept, fixefs = blocked_glm_coeffs, incl_ranef = FALSE) / 100

    col_info <- list(
        var_type = 'num',
        is_ln = TRUE,
        base = blocked_glm_numerical_features_summary[row_index, 'base_l2'], 
        increase = blocked_glm_numerical_features_summary[row_index, 'increase_l2']        
    )
    
   blocked_glm_numerical_features_summary[row_index, 'prob_diff_l2'] <- prob_impact(col, col_info, intercept = blocked_glm_intercept, fixefs = blocked_glm_coeffs, incl_ranef = FALSE) / 100
}
Show the code
blocked_glm_numerical_features_summary_tbl <- (
    blocked_glm_numerical_features_summary %>%
    gt(rowname_col = 'feature') %>%
    tab_spanner(
        label = 'Scenario 1',
        columns = ends_with('l1')
    ) %>%
    tab_spanner(
        label = 'Scenario 2',
        columns = ends_with('l2')
    ) %>%
    cols_label(
        starts_with("base") ~ "Initial",
        starts_with("increase") ~ "Increase",
        starts_with("prob") ~ "% Change"
    ) %>%
    fmt_percent(
        starts_with('prob')
    ) %>%
    data_color(
        starts_with('prob'),
        palette  = 'RdYlBu',
        reverse = TRUE
    ) %>%
    opt_stylize() %>%
    tab_header('Change in Probability of User Account Block', 'for numerical variables')
)

display_html(as_raw_html(blocked_glm_numerical_features_summary_tbl))
Change in Probability of User Account Block, for numerical variables

| Feature | Scenario 1: Initial | Scenario 1: Increase | Scenario 1: % Change | Scenario 2: Initial | Scenario 2: Increase | Scenario 2: % Change |
|---|---|---|---|---|---|---|
| num_reverted_edits_12hours | 3.0 | 10.0 | 32.94% | 5.0 | 15.0 | 32.90% |
| num_edits_24hrs | 2.0 | 3.0 | −18.46% | 10.0 | 50.0 | −39.66% |
| num_content_edits | 10.0 | 20.0 | 21.95% | 100.0 | 200.0 | 23.11% |
| num_non_content_edits | 10.0 | 20.0 | 19.30% | 100.0 | 200.0 | 20.33% |
| max_edit_size | 1000.0 | 2000.0 | 10.82% | 2000.0 | 3000.0 | 8.94% |
| ratio_revert_edits | 0.2 | 0.3 | 32.63% | 0.2 | 0.4 | 43.90% |

Summary:

To help understand how each user attribute affects the probability of being blocked, we calculated the change in probability as the difference between the probability for a given initial value and the probability for the new value. We reviewed two scenarios. The first scenario starts from a lower initial value and the second from a larger initial value (for ratio_revert_edits, the initial value is the same in both scenarios and only the Increase value differs).
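
As a rough illustration of this calculation, the sketch below computes the difference in probability between an initial value and a new value, assuming a logistic regression with a single log-transformed predictor and all other attributes held at their reference values (absorbed into the intercept). The function name `prob_change_sketch`, the use of `log1p()` as the transform, and the intercept and coefficient values are hypothetical; this is not the `prob_impact()` function used in the analysis.

```r
# Hypothetical sketch: change in predicted probability for a log-transformed
# numerical predictor in a logistic regression. The intercept and coefficient
# passed in the example are made-up values, not estimates from this analysis.
prob_change_sketch <- function(intercept, beta, base, new) {
    # Linear predictor on the log-odds scale; log1p() is an assumed transform
    eta_base <- intercept + beta * log1p(base)
    eta_new <- intercept + beta * log1p(new)

    # plogis() is the inverse logit, mapping log-odds to a probability
    p_base <- plogis(eta_base)
    p_new <- plogis(eta_new)

    c(p_base = p_base, p_new = p_new, diff = p_new - p_base)
}

# Example: a predictor moving from 3 to 10 with illustrative coefficients
example <- prob_change_sketch(intercept = -2, beta = 0.8, base = 3, new = 10)
print(round(example, 4))
```

Because the inverse logit is nonlinear, the same change in a predictor can produce a different change in probability depending on the starting value, which is one reason to review more than one scenario.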

Increases in all of the numerical attributes led to an increase in the probability of a user account block, except for the number of edits completed by the user within 24 hours, which led to a decrease. An increase in the ratio of a user’s reverted edits to all edits completed produced the largest increase in Scenario 2 (43.90%). In Scenario 1 its effect (32.63%) was comparable to that of the number of edits reverted within 12 hours (32.94%).

Conclusions

The two models reveal a similar set of high-impact features, namely:

  • Ratio of reverted edits to all edits completed
  • Maximum edit size
  • Number of edits reverted within 12 hours
  • Has completed a non-content edit
  • Has created a new article
  • Has a user page

The logistic regression model also identified these as high-impact features:

  • Has other wiki blocks
  • Has check user reports
  • Has sleeper account activity

The models confirmed these attributes had the highest impact on the likelihood of an account being blocked and could be considered in the calculation of an account reputation score.

Follow-up analysis suggestions:

  • Extend analysis to other Wikipedias where the type of user attributes as well as the magnitude and direction of their impact on a user’s block status may vary due to different moderation and administrative practices.
  • Consider additional user attributes. The analysis revealed that a quick rate of editing, especially when a large portion of those edits were reverted, increases the likelihood of a user being blocked. We can explore a few more definitions of quick editing rates to see if we can find a more precise indicator.
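
As one possible starting point for comparing definitions of editing rate, the sketch below counts a single user's edits within several time windows after account creation and converts each count to edits per hour. The timestamps, window lengths, and variable names are all hypothetical and are not drawn from the data used in this analysis.

```r
# Hypothetical edit times for one user, in seconds since account creation
edit_secs <- c(60, 120, 200, 4000, 90000)

# Candidate window lengths (in hours) to compare as editing-rate definitions
windows_hours <- c(1, 6, 12, 24)

# Count the edits that fall within each window after account creation
edits_in_window <- sapply(windows_hours, function(h) sum(edit_secs <= h * 3600))
names(edits_in_window) <- paste0("edits_", windows_hours, "h")

# Edits per hour within each window, as one possible rate definition
rate_per_hour <- edits_in_window / windows_hours

print(edits_in_window)
print(round(rate_per_hour, 3))
```

Each window yields a separate candidate attribute that could be evaluated with the same classification methods.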

Acknowledgements

The approach in this analysis references the following reports and analyses:

  • Variables Affecting Deletion Rate of Articles, Krishna Chaitanya Velaga, 2024 (repo)
  • Anticipating Zero Results From Query Features, Mikhail Popov, 2016 (report)

Reuse