Megan Neisler, Staff Data Scientist, Wikimedia Foundation
Modified: 2025-01-27
Overview
As part of WE 4.2, the Trust and Safety team is exploring ways to reliably associate an individual with their actions (sockpuppetry mitigation) and to combine existing signals (e.g. IP addresses, account history, request attributes) to allow for more precise targeting of actions on bad actors.
This analysis aims to explore a sample of users on English Wikipedia and identify patterns that affect the chances of a registered user account being blocked. This analysis will determine how to weight a set of publicly available data points in calculating an overall reputation score for an account. This account reputation score could then be presented to functionaries/administrators/moderators on the wikis to assist in anti-abuse work.
First, we collected a set of publicly available data points associated with registered user accounts, including edit and block history. The Overview of User Attributes Reviewed table below lists the user attributes considered and their data sources.
We used this data to explore patterns across a sample of blocked user accounts on English Wikipedia and compare them to patterns across a sample of non-blocked user accounts. For this analysis, we gathered all users who created an account on English Wikipedia between May 2024 and July 2024. Data was limited to accounts less than 90 days old at the time of the analysis, on the assumption that older accounts with persistent bad-faith activity will already have been blocked. Bots and any auto-created accounts were also excluded. This resulted in a sample of 253,633 users, of which 2,406 (about 0.9% of all users) were issued a block during the reviewed timeframe.
Two classification methods were then used to understand the relative importance of each user attribute to an account being blocked. Random forests were used to assess how important each user attribute is in classification. This was followed by logistic regression modeling to assess the magnitude and direction of each feature’s impact on the probability of an account being blocked.
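To illustrate what "magnitude and direction" means for the logistic regression step, here is a minimal sketch in base R. It uses synthetic data, and the column names and simulated effect sizes are assumptions chosen for illustration; it is not the model fitted in this report.

```r
# Synthetic data; column names are illustrative, not the report's dataset.
set.seed(1)
n <- 2000
toy <- data.frame(
  has_user_page = rbinom(n, 1, 0.2),
  num_reverted_edits = rpois(n, 1)
)

# Assumed "true" effects, used only to simulate a blocked/not-blocked outcome.
p_block <- plogis(-3 + 1.5 * toy$has_user_page + 0.4 * toy$num_reverted_edits)
toy$is_blocked <- rbinom(n, 1, p_block)

# Logistic regression of block status on the two attributes.
fit <- glm(is_blocked ~ has_user_page + num_reverted_edits,
           data = toy, family = binomial())

# The sign of a coefficient gives the direction of the association;
# exp(coefficient) gives the odds ratio, i.e. the magnitude.
coefs <- coef(fit)
odds_ratios <- exp(coefs)
print(round(cbind(estimate = coefs, odds_ratio = odds_ratios), 2))
```

An odds ratio above 1 indicates that the attribute is associated with higher odds of being blocked, and a value below 1 indicates lower odds.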
Overview of User Attributes Reviewed

| Type | Attribute | Description | Data source |
|---|---|---|---|
| Logical | Has completed edit | Has completed at least one edit on any namespace prior to being blocked | mediawiki_history |
| Logical | Has received thanks | Has received at least one thanks from another user | logging |
| Logical | Has reverted edit within 12 hours | Has completed at least one edit that was reverted within 12 hours after it was published | mediawiki_history |
| Logical | Has article created | Has created at least one article since creating an account | mediawiki_history |
| Logical | Has completed content edits | Has completed at least one content edit since creating an account (identified by page_namespace_is_content_historical = TRUE) | mediawiki_history |
| Logical | Has completed non content edit | Has completed at least one edit on a non-content namespace since creating an account (identified by page_namespace_is_content_historical = FALSE) | mediawiki_history |
| Logical | Has sleeper account activity | Made no edits for at least 30 days after creating an account, then made over 5 edits within 5 minutes | mediawiki_history |
| Logical | Has user page | User page exists | mediawiki_history |
| Logical | Has historical (expired, since removed) blocks | Has received blocks since creating an account that have since expired | mediawiki_user_history |
| Logical | Has blocks on other wikis | Has account blocks that have been issued on other wikis | mediawiki_user_blocks_change |
| Logical | Has check user reports | Has check user reports issued | cu_log |
| Numerical | Number of articles created | Number of articles created by the user since creating an account (identified by revision_parent_id = 0 AND page_namespace_historical = 0) | mediawiki_history |
| Numerical | Number of reverted edits | Number of the user's edits that were reverted (identified by revision_is_identity_reverted = true) since creating an account | mediawiki_history |
| Numerical | Number of expired blocks | Number of historical blocks that the user has received since creating an account that have since expired | mediawiki_user_blocks_change |
| Numerical | Number of thanks received | Number of thanks received from other users | logging |
| Numerical | Number of all edits | Total number of edits completed on any page namespace since creating an account | mediawiki_history |
| Numerical | Number of content edits | Total number of edits completed on a content namespace since creating an account | mediawiki_history |
| Numerical | Number of non-content edits | Total number of edits completed on a non-content namespace since creating an account | mediawiki_history |
| Numerical | Number of unreverted edits | Total number of unreverted edits | mediawiki_history |
| Numerical | Maximum edit size | The largest absolute edit size in bytes completed by a user since creating an account (identified by revision_text_bytes_diff) | mediawiki_history |
| Numerical | Ratio of reverted edits to all edits | The ratio of reverted edits to all edits by the user since creating an account | mediawiki_history |
| Numerical | Number of articles created | Total number of articles created by the user since creating an account | mediawiki_history |
| Numerical | Number of edits within 1 hour | Total number of edits completed on any page namespace within 1 hour of creating an account | mediawiki_history |
| Numerical | Number of edits within 24 hours | Total number of edits completed on any page namespace within 24 hours of creating an account | mediawiki_history |
| Numerical | Number of reverted edits within 12 hours | Total number of edits reverted within 12 hours | mediawiki_history |
| Numerical | Number of reverted edits within 10 minutes | Total number of edits reverted within 10 minutes | mediawiki_history |
| Numerical | Number of check user reports | Number of check user reports issued | cu_log |
| Categorical | User rights level | The current user rights of the user (identified by event_user_groups) | mediawiki_history |
| Categorical | User edit bucket | Edit count bucket of the user on English Wikipedia | |
Please refer to the data collection and the data processing notebooks for more details on the steps to collect and process the data reviewed in this report.
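The sleeper account definition in the attribute table can be sketched as a small base R function. This is an illustration under stated assumptions, not the logic from the data processing notebook: the function name and arguments are hypothetical, "over 5 edits" is read as 6 or more, and "within 5 minutes" is read as any run of consecutive edits spanning at most 300 seconds.

```r
# Hypothetical helper: flags sleeper-style activity for one account.
# registered: POSIXct registration time; edit_times: POSIXct vector of edits.
is_sleeper <- function(registered, edit_times,
                       quiet_days = 30, burst_n = 6, burst_secs = 300) {
  if (length(edit_times) == 0) return(FALSE)
  edit_times <- sort(edit_times)

  # Condition 1: the first edit happens at least `quiet_days` after registration.
  days_to_first <- as.numeric(difftime(edit_times[1], registered, units = "days"))
  if (days_to_first < quiet_days || length(edit_times) < burst_n) return(FALSE)

  # Condition 2: some run of `burst_n` consecutive edits spans <= `burst_secs`.
  secs <- as.numeric(edit_times)
  k <- burst_n - 1
  spans <- secs[(1 + k):length(secs)] - secs[1:(length(secs) - k)]
  any(spans <= burst_secs)
}

reg <- as.POSIXct("2024-05-01 00:00:00", tz = "UTC")
burst_after_quiet <- reg + 40 * 86400 + seq(0, 150, by = 30)   # 6 edits in 150 s
burst_on_day_one  <- reg + 1 * 86400 + seq(0, 150, by = 30)    # too early
slow_after_quiet  <- reg + 40 * 86400 + seq(0, 5 * 3600, by = 3600)  # hourly edits

sleeper_flags <- c(
  is_sleeper(reg, burst_after_quiet),
  is_sleeper(reg, burst_on_day_one),
  is_sleeper(reg, slow_after_quiet)
)
print(sleeper_flags)
```

Only the first toy account satisfies both conditions; the other two fail the quiet-period and burst conditions respectively.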
Overview of Blocks in Sample
Blocks by Duration
blocked_users_duration_tbl <- user_attribute_data %>%
  filter(is_blocked == 1) %>%
  group_by(block_duration) %>%
  summarise(n_users = n_distinct(user_name)) %>%
  mutate(pct_users = paste0(round(n_users / sum(n_users) * 100, 0), "%")) %>%
  select(-2) %>%
  gt() %>%
  opt_stylize(5) %>%
  tab_header(title = "Proportion of blocked user accounts by block duration") %>%
  cols_label(
    block_duration = "Block duration",
    pct_users = "Proportion of users"
  )

display_html(as_raw_html(blocked_users_duration_tbl))
Proportion of blocked user accounts by block duration

| Block duration | Proportion of users |
|---|---|
| permanent | 97% |
| temporary | 3% |
The majority of blocked registered accounts (97%) in the sample have been issued a permanent block.
Average time duration from user registration to block
options(repr.plot.width = 20, repr.plot.height = 15)
options(scipen = 999)

p <- time_to_block %>%
  ggplot(aes(x = duration)) +
  geom_histogram(color = 'black', fill = "#999999") +
  geom_vline(aes(xintercept = time_to_block_distribution[1], color = '25th Percentile'), linetype = 'dashed', linewidth = 1.5) +
  geom_vline(aes(xintercept = time_to_block_distribution[2], color = '50th Percentile'), linetype = 'dashed', linewidth = 1.5) +
  geom_vline(aes(xintercept = time_to_block_distribution[3], color = '75th Percentile'), linetype = 'dashed', linewidth = 1.5) +
  geom_label(aes(x = time_to_block_distribution[1], y = 300, label = paste0(time_to_block_distribution[1], " days"))) +
  geom_label(aes(x = time_to_block_distribution[2], y = 300, label = paste0(time_to_block_distribution[2], " days"))) +
  geom_label(aes(x = time_to_block_distribution[3], y = 300, label = paste0(time_to_block_distribution[3], " days"))) +
  scale_x_log10() +
  scale_y_continuous(labels = label_comma()) +
  labs(
    title = "Time from user registration to first block",
    y = "Number of users",
    x = "Time to first block (days, log scale)"
  ) +
  scale_color_manual(
    name = 'Percentiles',
    breaks = c('25th Percentile', '50th Percentile', '75th Percentile'),
    values = c("25th Percentile" = "darkblue", "50th Percentile" = "darkmagenta", "75th Percentile" = "darkred")
  ) +
  theme(
    panel.grid.minor = element_blank(),
    panel.background = element_blank(),
    plot.title = element_text(hjust = 0.5),
    text = element_text(size = 25),
    legend.position = "bottom",
    axis.line = element_line(colour = "black")
  )

p
Figure 1: The median block time for user accounts that registered during the reviewed timeframe is 34 days after registration.
The median block time is 29 days after account registration on English Wikipedia. Note: This dataset is limited to newer accounts that were created between May 2024 and July 2024 and as a result does not include accounts that took longer than 90 days to block.
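The `time_to_block_distribution` vector used for the percentile lines is not defined in the code shown in this report. A minimal sketch of how such quartiles could be computed in base R, using toy durations rather than the report's data (the variable names here are illustrative):

```r
# Toy registration-to-block durations in days (illustrative values only).
toy_durations <- c(1, 2, 3, 4, 5)

# quantile() returns the 25th, 50th and 75th percentiles as a named vector.
toy_distribution <- quantile(toy_durations, probs = c(0.25, 0.5, 0.75))
print(toy_distribution)

# Elements can then be accessed by position, as the plotting code does.
median_days <- unname(toy_distribution[2])
```

Because the histogram uses a log-scaled x-axis, the median is a more representative summary of a skewed distribution like this than the mean.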
By Block Reason
The block reason comes from the comment field of the mediawiki_user_blocks_change table. Note: As there is currently no structured data on block reason, we parsed the user-provided text in this field for keywords and sorted the data into a set of block categories. More bespoke block reasons that did not fit into the identified categories were grouped into the category “Other”.
There is currently a task open to provide more structured data to associate with a particular block.
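The keyword-based categorization described above can be sketched as follows. The patterns and their priority order are assumptions for illustration only; the actual keyword rules are in the data processing notebook and are not reproduced here.

```r
# Hypothetical keyword rules; the first matching category (in this order) wins.
reason_patterns <- c(
  sockpuppet       = "sock",
  spam_advertising = "spam|advertis|promot",
  vandalism        = "vandal",
  disruptive       = "disrupt"
)

categorise_block_reason <- function(comment, patterns = reason_patterns) {
  text <- tolower(comment)
  out <- rep("Other", length(text))
  # Assign in reverse order so that earlier patterns overwrite later ones.
  for (label in rev(names(patterns))) {
    out[grepl(patterns[[label]], text)] <- label
  }
  out
}

toy_comments <- c(
  "Abusing multiple accounts: sockpuppetry",
  "Spam / advertising-only account",
  "Personal request"
)
toy_categories <- categorise_block_reason(toy_comments)
print(toy_categories)
```

Free-text matching of this kind is sensitive to wording and to the order of the rules, which is one reason structured block-reason data would be preferable.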
blocked_users_block_reason_tbl <- user_attribute_data %>%
  filter(is_blocked == 1) %>%
  group_by(block_reason_simple) %>%
  summarise(n_users = n_distinct(user_id)) %>%
  mutate(pct_users = paste0(round(n_users / sum(n_users) * 100, 1), "%")) %>%
  arrange(desc(n_users)) %>%
  select(-2) %>%
  gt() %>%
  tab_header(title = "Proportion of blocked users by block reason") %>%
  opt_stylize(1) %>%
  cols_label(
    block_reason_simple = "Block reason",
    pct_users = "Proportion of users"
  ) %>%
  tab_footnote(
    footnote = "Based on categorization of user provided text for the block reason. \n The Other category includes text that did not clearly identify a reason or were too bespoke to categorize",
    locations = cells_column_labels(columns = "block_reason_simple")
  )

display_html(as_raw_html(blocked_users_block_reason_tbl))
Proportion of blocked users by block reason

| Block reason¹ | Proportion of users |
|---|---|
| sockpuppet | 30.2% |
| spam_advertising | 27.7% |
| checkuserblock | 14.9% |
| disruptive | 6.9% |
| Not_here_to_build_enyclopedia | 6.9% |
| Other | 6% |
| username_similar_to_organization | 2.2% |
| long_term_abuse | 1.8% |
| vandalism | 1.7% |
| username_policy_violation | 1.6% |
| making_legal_threat | 0.1% |
| trolling | 0% |
| user_name_bot | 0% |

¹ Based on categorization of user-provided text for the block reason. The Other category includes text that did not clearly identify a reason or was too bespoke to categorize.
The most frequent reason identified for a user account block during the reviewed timeframe was sockpuppetry (30.2% of all blocked accounts) followed by spam or advertising activity (27.7% of all blocked accounts).
These trends vary based on the duration of the block, as shown in the tables below:
Proportion of temporary blocked users by block reason

| Block reason | Proportion of users |
|---|---|
| Other | 59.46% |
| disruptive | 27.03% |
| spam_advertising | 8.11% |
| sockpuppet | 4.05% |
| vandalism | 1.35% |

Proportion of permanent blocked users by block reason

| Block reason | Proportion of users |
|---|---|
| sockpuppet | 30.87% |
| spam_advertising | 28.29% |
| checkuserblock | 15.27% |
| Not_here_to_build_enyclopedia | 7.03% |
| disruptive | 6.36% |
| Other | 4.53% |
| username_similar_to_organization | 2.2% |
| long_term_abuse | 1.87% |
| vandalism | 1.75% |
| username_policy_violation | 1.66% |
| making_legal_threat | 0.08% |
| trolling | 0.04% |
| user_name_bot | 0.04% |
Temporarily blocked accounts are more likely to have bespoke user-provided block reasons, which are currently grouped under the “Other” category. Temporarily blocked accounts are also more likely to be blocked for disruptive activity, while permanently blocked accounts are more likely to be blocked for sockpuppetry and advertising activity.
Distribution of published edits by blocked users
A number of the reviewed user attributes relate to the user’s editing activity. Let’s review the typical number of edits completed by users that are blocked within the 90 day timeframe.
blocked_users_edits <- user_attribute_data %>%
  filter(is_blocked == 1, num_all_edits > 0) %>%
  group_by(user_id) %>% # a few cases of a single user listed twice due to multiple blocks
  summarise(num_edits_completed = sum(num_all_edits))
options(scipen = 999)

blocked_edits_histogram <- blocked_users_edits %>%
  ggplot(aes(x = num_edits_completed)) +
  geom_histogram(color = 'black', fill = "#999999") +
  geom_vline(aes(xintercept = blocked_edits_distribution[1], color = '25th Percentile'), linetype = 'dashed', linewidth = 1.5) +
  geom_vline(aes(xintercept = blocked_edits_distribution[2], color = '50th Percentile'), linetype = 'dashed', linewidth = 1.5) +
  geom_vline(aes(xintercept = blocked_edits_distribution[3], color = '75th Percentile'), linetype = 'dashed', linewidth = 1.5) +
  geom_label(aes(x = blocked_edits_distribution[1], y = 250, label = paste0(blocked_edits_distribution[1], " edits"))) +
  geom_label(aes(x = blocked_edits_distribution[2], y = 250, label = paste0(blocked_edits_distribution[2], " edits"))) +
  geom_label(aes(x = blocked_edits_distribution[3], y = 250, label = paste0(blocked_edits_distribution[3], " edits"))) +
  scale_y_continuous(labels = label_comma()) +
  scale_x_log10() +
  labs(
    title = "Number of edits completed by blocked user accounts",
    y = "Number of blocked users",
    x = "Number of edits (log scale)"
  ) +
  scale_color_manual(
    name = 'Percentiles',
    breaks = c('25th Percentile', '50th Percentile', '75th Percentile'),
    values = c("25th Percentile" = "darkblue", "50th Percentile" = "darkmagenta", "75th Percentile" = "darkred")
  ) +
  theme(
    panel.grid.minor = element_blank(),
    panel.background = element_blank(),
    plot.title = element_text(hjust = 0.5),
    text = element_text(size = 20),
    legend.position = "bottom",
    axis.line = element_line(colour = "black")
  )

blocked_edits_histogram
The median number of edits (across any namespace) completed by blocked accounts within 90 days of creating an account is 5 edits. We observe even more edits completed by temporarily blocked accounts: their median is 32 edits, compared to a median of 3 edits for permanently blocked accounts.
Data exploration: Patterns across blocked accounts
We first completed an exploratory analysis of the data to identify common patterns of blocked accounts across various user attributes. This included editing activity and other publicly available data about the user, such as user rights level and whether they have a confirmed email address.
Please refer to the blocked_user_data_exploration.ipynb notebook for more details on patterns across blocked accounts for the reviewed attributes including trends by block duration and block reason. Key insights from this analysis are summarized below.
Summary of key data points associated with blocked accounts
Confirmed email address: The majority (60%) of blocked user accounts do not have a confirmed email address. This is consistent for both temporarily and permanently blocked accounts but varies based on block reason.
Received Thanks: 99% of all permanently blocked user accounts and 87% of all temporarily blocked user accounts have never received Thanks from any user.
If a blocked user did receive a thanks, they typically received just 1 thanks. 75% of all blocked accounts received fewer than 4 thanks.
Number of Articles Created: Only 3% of blocked registered user accounts created at least one article prior to being blocked. Temporary blocked users are slightly more likely to have created at least one article than permanently blocked users.
For the accounts that did create an article, the majority (75%) created fewer than 5 articles. The median number of articles created by blocked accounts is 2 articles.
Accounts blocked for sockpuppetry were the most likely to publish at least one article (8%) and accounts blocked for promotion and advertising activity were the least likely (0.8%) to publish at least one article during the reviewed timeframe.
Number of reverted edits:
Patterns differ significantly for temporary and permanent blocked accounts. Almost all (96%) of temporary blocked users have made at least one reverted edit compared to 51% of permanently blocked users.
For temporarily blocked accounts, the median number of reverted edits is 16 edits. Almost all of these reverts occur within 12 hours.
Block reason also impacts the observed revert rate.
Accounts blocked for disruptive behavior were the most likely to complete an edit that was reverted while accounts blocked for promotion/advertising and user names associated with an organization were less likely to complete a reverted edit. This is likely because many of the accounts blocked for promotion and advertising are due to a suspicious user name versus editing activity.
Number of Edits
Temporary blocked accounts are much more likely to complete a higher number of edits. The median number of edits completed by temporary blocked accounts is 32 edits compared to a median of 3 edits completed by permanent blocked accounts.
The most edits are typically completed by accounts blocked for disruption or sockpuppetry. The fewest edits are completed by accounts blocked for user name violations.
Edit Size
The average edit size by blocked accounts is 304 bytes while the typical maximum edit size is 699 bytes. This appears to be much higher than the edit size by non-blocked users; we will confirm this in the next part of the analysis when comparing trends to accounts that are not blocked.
While there is not much difference in the average edit size between temporarily and permanently blocked users, temporarily blocked users are much more likely to make larger edits.
Recently created account and high amount of activity:
Temporarily blocked accounts complete a median of 5 edits within 24 hours of registering.
Accounts blocked for sockpuppetry, disruptive activity, and violating WP:NOTHERE are also more likely to complete a higher number of edits right after registering.
Check user reports:
15% of all blocked accounts have been checked with the CheckUser tool.
Patterns vary by block reason. Only 3% of accounts blocked for promotion and advertising have at least 1 check user report vs 50% of accounts blocked for sockpuppetry.
If the user’s “User:” namespace exists
The majority of accounts have not created a User Page on English Wikipedia. Trends are mostly consistent by block duration and reason.
The user rights for the user: 99% of all blocked accounts on English Wikipedia do not have any current or previously held user permissions.
Data exploration: Review of block rate across user attributes
We then compared these patterns to user accounts that are not currently blocked to explore the impact of each user attribute on the likelihood that account was blocked during the reviewed timeframe.
For this data exploration, we reviewed all accounts created between May and July 2024 on English Wikipedia and each account’s block status at the time this dataset was gathered.
Editing Activity
A number of the reviewed user attributes are associated with the user account’s editing activity. There are two types of attributes we can consider: proportion of users that completed at least one of the editing activities and also the average number of those types of edits completed.
A number of these attributes are only relevant for users that have completed at least one edit.
User accounts that completed at least one edit
# How many blocked users completed at least one edit
prop_users_w_edits <- user_attribute_data %>%
  group_by(is_blocked, has_completed_edit) %>%
  summarise(n_users = n_distinct(user_name)) %>%
  mutate(pct_users = n_users / sum(n_users))
Blocked accounts are more likely to have completed at least one edit. Only 14% of blocked accounts did not complete an edit during the reviewed timeframe, compared to about 65% of non-blocked accounts.
# compare proportion across blocked and non blocked accounts
calc_proportion <- function(col, df = user_attribute_data, is_col_bool = TRUE) {
  result <- (
    df %>%
      group_by(.data[[col]]) %>%
      summarise(
        Total = n_distinct(user_name),
        blocked = n_distinct(user_name[is_blocked == 1]),
        not_blocked = n_distinct(user_name[is_blocked == 0])
      ) %>%
      mutate(
        pct_blocked = blocked / Total,
        pct_not_blocked = not_blocked / Total,
        Variable = col,
        Total = Total
      ) %>%
      select(-c(blocked, not_blocked))
  )

  if (is_col_bool) {
    colnames(result)[1] <- 'TF'
    result <- (
      result %>%
        mutate(
          across('TF', str_replace, 'TRUE', 'Yes'),
          across('TF', str_replace, 'FALSE', 'No')
        )
    )
  }

  return(data.frame(result))
}
completed_edit_col <- 'has_completed_edit'
completed_edit_proportion_list <- list()

for (col in completed_edit_col) {
  completed_edit_proportion_list[[col]] <- calc_proportion(col, user_attribute_data)
}

completed_edit_proportion <- do.call(rbind, completed_edit_proportion_list)

## for editors that completed at least one edit
has_edit_proportion_tbl <- (
  completed_edit_proportion %>%
    mutate(Total_fmt = sapply(Total, format_big_number)) %>%
    gt(groupname_col = 'Variable', rowname_col = 'TF') %>%
    fmt_percent(c('pct_blocked', 'pct_not_blocked'), decimals = 1) %>%
    data_color(c('pct_blocked', 'pct_not_blocked'), palette = 'GnBu') %>%
    tab_spanner('Is User Account Blocked', c('pct_blocked', 'pct_not_blocked')) %>%
    cols_label(pct_blocked = 'Yes', pct_not_blocked = 'No') %>%
    cols_hide('Total_fmt') %>%
    tab_header('Blocked proportion of user accounts by editing status') %>%
    opt_stylize(4)
)

display_html(as_raw_html(has_edit_proportion_tbl))
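Blocked proportion of user accounts by editing status

| Variable | Value | Total | Is User Account Blocked: Yes | Is User Account Blocked: No |
|---|---|---|---|---|
| has_completed_edit | No | 162770 | 0.2% | 99.8% |
| has_completed_edit | Yes | 90863 | 2.3% | 97.7% |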
If we look at the block rate, 2.3% of all accounts that completed at least one edit during the reviewed timeframe were blocked compared to 0.2% of accounts that did not complete an edit.
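The two views used in this section (the share of blocked accounts that have an attribute versus the block rate among accounts with that attribute) come from normalising the same cross-tabulation in different directions. A minimal base R sketch on toy data, with illustrative values that are not from the report:

```r
# Toy data: 10 accounts, 4 with at least one edit, 3 blocked overall.
toy <- data.frame(
  has_completed_edit = c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE),
  is_blocked         = c(1,    1,    0,    0,    1,     0,     0,     0,     0,     0)
)

tab <- table(has_completed_edit = toy$has_completed_edit,
             is_blocked = toy$is_blocked)

# Row-wise proportions: block rate within each editing status.
block_rate <- prop.table(tab, margin = 1)

# Column-wise proportions: editing status within blocked / non-blocked accounts.
share_by_status <- prop.table(tab, margin = 2)

print(round(block_rate, 3))
print(round(share_by_status, 3))
```

In this toy example half of the accounts with an edit are blocked (a block rate), while one third of the blocked accounts have no edits (a share of blocked accounts); the two numbers answer different questions.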
# create dataset limited to editors only
user_attribute_data_editors <- user_attribute_data %>%
  filter(num_all_edits > 0)
editing_logical_cols <- c(
  'has_received_thanks', 'has_reverted_edit_12hours', 'has_article_created',
  'has_completed_content', 'has_completed_non_content', 'has_sleeper_account_activity'
)
logical_cols_proportion_list <- list()

for (col in editing_logical_cols) {
  logical_cols_proportion_list[[col]] <- calc_proportion(col, user_attribute_data_editors)
}

logical_cols_proportion <- do.call(rbind, logical_cols_proportion_list)

## for editors that completed at least one edit
editors_proportion_tbl <- (
  logical_cols_proportion %>%
    mutate(Total_fmt = sapply(Total, format_big_number)) %>%
    gt(groupname_col = 'Variable', rowname_col = 'TF') %>%
    fmt_percent(c('pct_blocked', 'pct_not_blocked'), decimals = 1) %>%
    data_color(c('pct_blocked', 'pct_not_blocked'), palette = 'GnBu') %>%
    tab_spanner('Is User Account Blocked', c('pct_blocked', 'pct_not_blocked')) %>%
    cols_label(pct_blocked = 'Yes', pct_not_blocked = 'No') %>%
    cols_hide('Total_fmt') %>%
    tab_header('Editors only: Blocked proportion of user accounts by editing activity') %>%
    opt_stylize(4)
)

display_html(as_raw_html(editors_proportion_tbl))
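Editors only: Blocked proportion of user accounts by editing activity

| Variable | Value | Total | Is User Account Blocked: Yes | Is User Account Blocked: No |
|---|---|---|---|---|
| has_received_thanks | No | 89385 | 2.2% | 97.8% |
| has_received_thanks | Yes | 1478 | 5.8% | 94.2% |
| has_reverted_edit_12hours | No | 62077 | 1.5% | 98.5% |
| has_reverted_edit_12hours | Yes | 28786 | 3.9% | 96.1% |
| has_article_created | No | 89597 | 2.1% | 97.9% |
| has_article_created | Yes | 1266 | 17.3% | 82.7% |
| has_completed_content | No | 28180 | 2.3% | 97.7% |
| has_completed_content | Yes | 62683 | 2.3% | 97.7% |
| has_completed_non_content | No | 48187 | 1.3% | 98.7% |
| has_completed_non_content | Yes | 42676 | 3.4% | 96.6% |
| has_sleeper_account_activity | No | 89301 | 2.2% | 97.8% |
| has_sleeper_account_activity | Yes | 1562 | 7.2% | 92.8% |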
Summary
Overall, the block rate is higher for user accounts that completed at least one of the identified editing activities compared to user accounts that did not.
has_completed_non_content: The block rate is slightly higher for user accounts that completed at least one edit on a non-content page (3.4%) than for user accounts that completed at least one edit on a content page (2.3%).
has_reverted_edit_12hours: We also reviewed whether the user had an edit reverted within 12 hours of its revision timestamp. This 12-hour time-to-revert criterion was chosen because it was also identified as one of the criteria for identifying potential vandalism edits in T349083. The block rate is higher for users who had at least one edit reverted: 3.9% of all user accounts that had an edit reverted within 12 hours of the revision timestamp were blocked.
has_article_created: We observed the highest block rate for accounts that created at least one new article in the reviewed timeframe. 17.3% of all accounts that created an article on English Wikipedia were blocked.
has_received_thanks: We also observed a higher block rate for accounts that received thanks. 5.8% of all accounts that received thanks were issued a block; however, this is likely because blocked accounts also tend to make more edits than non-blocked user accounts.
has_sleeper_account_activity: We also observed that user accounts are more likely to be blocked if they began editing rapidly after a period of inactivity following account creation (7.2% compared to 2.2%). For this analysis, sleeper account activity was defined as not editing for at least 30 days after creating an account and then making over 5 edits within 5 minutes.
edits_by_block_status <- data.frame(t(
  user_attribute_data %>%
    filter(num_all_edits > 0) %>% # only applicable to editors
    select(is_blocked, all_of(editing_num_cols)) %>%
    group_by(is_blocked) %>%
    summarize(across(all_of(editing_num_cols), mean))
))
colnames(edits_by_block_status) <- c('No', 'Yes')
edits_by_block_status$Variable <- rownames(edits_by_block_status)
rownames(edits_by_block_status) <- NULL

edits_by_block_status_tbl <- (
  edits_by_block_status[-1, ] %>%
    gt() %>%
    cols_move_to_start('Yes') %>%
    cols_move_to_start('Variable') %>%
    tab_spanner('Is User Account Blocked', c('Yes', 'No')) %>%
    tab_header(
      'Editing activity by user account blocked status',
      'Limited only to accounts with at least one edit'
    ) %>%
    fmt_number(c('Yes', 'No'), decimals = 1) %>%
    opt_stylize(style = 4)
)

display_html(as_raw_html(edits_by_block_status_tbl))
Editing activity by user account blocked status
Limited only to accounts with at least one edit

| Variable | Is User Account Blocked: Yes | Is User Account Blocked: No |
|---|---|---|
| num_all_edits | 46.7 | 7.7 |
| num_content_edits | 33.1 | 5.1 |
| num_non_content_edits | 13.6 | 2.5 |
| max_edit_size | 5,282.4 | 1,793.9 |
| num_articles_created | 1.1 | 0.1 |
| num_edits_1hrs | 2.2 | 1.8 |
| num_edits_24hrs | 4.5 | 2.8 |
| num_reverted_edits_10minutes | 2.0 | 0.5 |
| num_reverted_edits_12hours | 5.4 | 1.0 |
| num_reverted_edits_all | 10.0 | 1.3 |
| ratio_revert_edits | 0.4 | 0.3 |
| num_unreverted_edits | 36.7 | 6.4 |
Summary
Overall, there is a higher rate of editing activity across all reviewed editing attributes for blocked accounts.
num_all_edits & num_content_edits & num_non_content_edits: On average, blocked accounts complete about 6 times as many content edits and 5 times as many non-content edits as non-blocked accounts. Note: The reviewed sample was limited to newly created accounts; as a result, this reflects the number of edits completed within a relatively short time period (90 days).
max_edit_size & avg_edit_size: On average, blocked users make larger edits. The average maximum edit size for blocked accounts was 5,282 bytes compared to 1,794 bytes for accounts that were not blocked.
num_reverted_edits_10minutes & num_reverted_edits_12hours & num_reverted_edits_all: For each user account, we reviewed the total number of reverted edits as well as the number of edits reverted within 10 minutes and within 12 hours to determine any impact from the speed of reverts. Blocked user accounts have more reverted edits on average across all three of these attributes. We see the largest relative difference in the total number of reverted edits: an average of 10 reverted edits for blocked accounts compared to about 1 for non-blocked accounts.
num_articles_created: While we found that 17% of all user accounts that created an article were blocked, these blocked accounts typically create just one article before they are blocked.
ratio_revert_edits: On average, blocked accounts have a slightly higher ratio of reverted edits to all edits compared to non-blocked accounts.
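The relative differences quoted in this summary can be checked directly from the group means in the "Editing activity by user account blocked status" table. The sketch below simply re-enters a subset of those published means; the data frame and column names are illustrative.

```r
# Group means copied from the table above (editors only).
edit_means <- data.frame(
  variable = c("num_all_edits", "num_content_edits", "num_non_content_edits",
               "num_reverted_edits_10minutes", "num_reverted_edits_12hours",
               "num_reverted_edits_all"),
  blocked     = c(46.7, 33.1, 13.6, 2.0, 5.4, 10.0),
  not_blocked = c(7.7, 5.1, 2.5, 0.5, 1.0, 1.3)
)

# Ratio of the blocked mean to the non-blocked mean for each attribute.
edit_means$ratio <- round(edit_means$blocked / edit_means$not_blocked, 1)
print(edit_means)

# Attribute with the largest relative difference among those listed.
largest_gap <- edit_means$variable[which.max(edit_means$ratio)]
```

Because these are ratios of means over skewed count data, a small number of very active accounts can influence them strongly.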
User Edit Count
# setting threshold per data publication guidelines.
threshold <- 150
## order and factor user edit count
edit_buckets <- c('No edits', '1-10', '11-99', '100-999', '1000-4999', '5000+')
user_attribute_data <- (
  user_attribute_data %>%
    mutate(user_edit_bucket = factor(user_edit_bucket, levels = edit_buckets))
)
Users with lower edit counts have a lower rate of being blocked.
Users with higher edit counts (over 1000) have the highest rate of being blocked; however, there are also fewer of these accounts.
It’s also important to note that the dataset was limited to newly created accounts. As a result, these trends also reflect the rate of editing. Fewer users are able to make a large number of edits within 90 days of creating an account.
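For reference, the edit buckets used above could be derived from a raw edit count along the following lines. This is a sketch: the bucket boundaries are inferred from the bucket labels, and the helper name is hypothetical; the report's actual derivation is in the data processing notebook.

```r
edit_buckets <- c("No edits", "1-10", "11-99", "100-999", "1000-4999", "5000+")

# Hypothetical helper: maps a vector of edit counts to an ordered bucket factor.
# With right = TRUE the intervals are (-Inf,0], (0,10], (10,99], (99,999], ...
bucket_edit_count <- function(n_edits) {
  cut(n_edits,
      breaks = c(-Inf, 0, 10, 99, 999, 4999, Inf),
      labels = edit_buckets,
      right = TRUE)
}

toy_counts <- c(0, 1, 10, 11, 100, 1000, 5000)
toy_buckets <- bucket_edit_count(toy_counts)
print(data.frame(n_edits = toy_counts, bucket = toy_buckets))
```

Keeping the result as a factor with explicit levels preserves the bucket order in tables and plots, which is what the `factor(..., levels = edit_buckets)` step in the report's code achieves.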
Blocked Attributes
We also reviewed past moderation or administrative actions taken against the user, including historical, expired blocks as well as blocks issued on other wikis and check user reports. This analysis includes all user accounts, including accounts that have not completed an edit.
blocked_logical_cols <- c('has_historical_blocks', 'has_other_wiki_blocks', 'has_cu_reports')
logical_cols_proportion_list <- list()

for (col in blocked_logical_cols) {
  logical_cols_proportion_list[[col]] <- calc_proportion(col, user_attribute_data)
}

logical_cols_proportion <- do.call(rbind, logical_cols_proportion_list)

logical_cols_proportion_tbl <- (
  logical_cols_proportion %>%
    mutate(Total_fmt = sapply(Total, format_big_number)) %>%
    gt(groupname_col = 'Variable', rowname_col = 'TF') %>%
    fmt_percent(c('pct_blocked', 'pct_not_blocked'), decimals = 1) %>%
    data_color(c('pct_blocked', 'pct_not_blocked'), palette = 'GnBu') %>%
    tab_spanner('Is User Account Blocked', c('pct_blocked', 'pct_not_blocked')) %>%
    cols_label(pct_blocked = 'Yes', pct_not_blocked = 'No') %>%
    cols_hide('Total_fmt') %>%
    tab_header('Proportion of user Accounts that have completed Editing activity') %>%
    opt_stylize(4)
)

display_html(as_raw_html(logical_cols_proportion_tbl))
Proportion of user Accounts that have completed Editing activity
Number of expired blocks by user account blocked status

| Variable | Is User Account Blocked: Yes | Is User Account Blocked: No |
|---|---|---|
| num_exp_blocks | 0.0 | 0.0 |
Summary
We see higher block rates associated with other blocks or check user activity.
13% of all user accounts that have had a historical, since expired block now have a current block issued to them.
22.6% of all user accounts that have other wiki blocks or had check user reports issued are blocked.
While we see higher block rates for these attributes compared to editing activity, there is also a smaller number of users that met these criteria. In the sample dataset reviewed, fewer than 200 users were identified as having a historical or other wiki block. This represents less than 1% of all user accounts.
The average number of expired blocks is also 0, as the majority of users have not previously had a block issued to them. If we limit to accounts that have had a block, almost all accounts have only been issued one expired block.
Other User Account Attributes
We also identified some miscellaneous account attributes such as the existence of a user page and if they have a confirmed email address.
other_logical_cols <- c('has_user_page', 'has_confirmed_email')
logical_cols_proportion_list <- list()

for (col in other_logical_cols) {
  logical_cols_proportion_list[[col]] <- calc_proportion(col, user_attribute_data)
}

logical_cols_proportion <- do.call(rbind, logical_cols_proportion_list)

logical_cols_proportion_tbl <- (
  logical_cols_proportion %>%
    mutate(Total_fmt = sapply(Total, format_big_number)) %>%
    gt(groupname_col = 'Variable', rowname_col = 'TF') %>%
    fmt_percent(c('pct_blocked', 'pct_not_blocked'), decimals = 1) %>%
    data_color(c('pct_blocked', 'pct_not_blocked'), palette = 'GnBu') %>%
    tab_spanner('Is User Account Blocked', c('pct_blocked', 'pct_not_blocked')) %>%
    cols_label(pct_blocked = 'Yes', pct_not_blocked = 'No') %>%
    cols_hide('Total_fmt') %>%
    tab_header('Proportion of blocked accounts by other user attribute activity') %>%
    opt_stylize(4)
)

display_html(as_raw_html(logical_cols_proportion_tbl))
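Proportion of blocked accounts by other user attribute activity
Attribute | Value | Total | Is User Account Blocked: Yes | Is User Account Blocked: No
has_user_page | No | 244386 | 0.7% | 99.3%
has_user_page | Yes | 9247 | 6.7% | 93.3%
has_confirmed_email | No | 166405 | 0.9% | 99.1%
has_confirmed_email | Yes | 87228 | 1.0% | 99.0%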
Summary
has_user_page: User accounts with a user page have a higher block rate than user accounts without one. Note: This checks whether the user page was created but not the extent of content on the page. It is possible that this trend is due to the higher frequency of comments left on talk pages of users who are blocked or who made an edit that was reverted.
has_confirmed_email: There is minimal difference in the blocked rate of accounts with a confirmed email and those without. About 1% of user accounts with and without a confirmed email address are blocked.
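The table code above calls a calc_proportion helper whose definition is not shown. As a rough illustration of the kind of computation involved, the sketch below builds the same output columns (Variable, TF, Total, pct_blocked, pct_not_blocked) from a logical attribute. The function name calc_proportion_sketch, its arguments, and the toy data are all assumptions, not the report's implementation.

```r
# Hypothetical stand-in for the report's calc_proportion() helper (its real
# definition is not shown). Column names mirror those used in the table code.
calc_proportion_sketch <- function(col, data, outcome = "is_blocked") {
  # Cross-tabulate attribute value (rows) against block status (columns)
  counts <- table(
    factor(data[[col]], levels = c(FALSE, TRUE)),
    factor(data[[outcome]], levels = c(FALSE, TRUE))
  )
  totals <- rowSums(counts)
  data.frame(
    Variable = col,
    TF = c("No", "Yes"),
    Total = as.integer(totals),
    pct_blocked = as.numeric(counts[, "TRUE"] / totals),
    pct_not_blocked = as.numeric(counts[, "FALSE"] / totals),
    row.names = NULL
  )
}

# Toy data, made up for illustration only
toy_users <- data.frame(
  has_user_page = c(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE),
  is_blocked    = c(TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE)
)
toy_proportions <- calc_proportion_sketch("has_user_page", toy_users)
print(toy_proportions)
```

Because each row is divided by its own total, pct_blocked and pct_not_blocked sum to 1 within every attribute value, which matches how the percentages in the table read.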
User Rights Level
## order and factor user rights level
user_rights <- c('none', 'confirmed', 'extended', 'extendedconfirmed, patroller')
user_attribute_data <- (
  user_attribute_data %>%
    mutate(
      user_rights_level = factor(user_rights_level, levels = user_rights, ordered = TRUE)
    )
)
user_rights_del_plot <- plot_del_pct_bar('user_rights_level', 'User Rights Level')
Summary
User Rights Level:
The majority of all reviewed user accounts do not have any assigned user rights, which is expected as this analysis is currently focused on accounts recently created.
136 accounts were identified as having extended rights, and 15% of these accounts were blocked. On English Wikipedia, a registered editor receives the extendedconfirmed right automatically once the account has existed for at least 30 days and has made at least 500 edits. Given the quick editing rate observed for blocked accounts, it makes sense that these editors were granted the right automatically prior to being blocked.
Determining variable importance with random forest modeling
We used the gathered sample of 253,633 users to train a random forest (an ensemble classification algorithm) to predict whether a user was issued either a temporary or a permanent block. The random forest model allows us to understand variable importance in the dataset. This type of model considers interactions between variables, allowing it to identify attributes that, while not individually significant, become important in combination with others.
To understand variable importance, we use the permutation importance approach, which gives the Mean Decrease in Accuracy (MDA). In this approach, the values of a given variable are randomly permuted to see how that affects classification accuracy.
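The permutation idea can be illustrated on a small scale. The sketch below is a minimal base R version under simplifying assumptions: it uses made-up toy data, a logistic regression in place of a random forest, and measures accuracy on the same data the model was fitted on (ranger's permutation importance uses out-of-bag samples instead). The names toy, accuracy, and permutation_importance are hypothetical.

```r
set.seed(42)  # arbitrary seed so the toy example is reproducible

# Toy data (made up): `signal` drives the outcome, `noise` does not
n <- 2000
toy <- data.frame(signal = rnorm(n), noise = rnorm(n))
toy$is_blocked <- rbinom(n, 1, plogis(2 * toy$signal))

# Any classifier works for the illustration; a logistic regression keeps the
# sketch dependency-free (the report itself uses a random forest via ranger)
fit <- glm(is_blocked ~ signal + noise, data = toy, family = binomial())

accuracy <- function(model, data) {
  predicted <- as.integer(predict(model, newdata = data, type = "response") > 0.5)
  mean(predicted == data$is_blocked)
}

permutation_importance <- function(model, data, variable, n_repeats = 10) {
  baseline <- accuracy(model, data)
  drops <- vapply(seq_len(n_repeats), function(i) {
    shuffled <- data
    # Shuffling one column breaks its link to the outcome
    shuffled[[variable]] <- sample(shuffled[[variable]])
    baseline - accuracy(model, shuffled)
  }, numeric(1))
  mean(drops)  # mean decrease in accuracy for this variable
}

mda <- c(
  signal = permutation_importance(fit, toy, "signal"),
  noise  = permutation_importance(fit, toy, "noise")
)
print(mda)
```

A variable the model depends on shows a clear drop in accuracy when shuffled, while an irrelevant variable shows a drop close to zero.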
Data preparation for modeling
Since the reviewed attributes have vastly different scales, we standardize the data before using logistic regression. This helps the optimization algorithm converge faster and more efficiently. Standardized coefficients are also easier to interpret, as each represents the change in the log-odds of the outcome per one standard deviation change in the predictor.
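As a small illustration of standardization, the sketch below centers and scales two columns with base R's scale(). The values are made up, and the exact preprocessing used to build user_attribute_data_norm is not shown in this report, so this is only one plausible way to do it.

```r
# Toy attribute values (made up) on very different scales
toy_attributes <- data.frame(
  num_content_edits = c(0, 1, 2, 5, 40, 300),
  max_edit_size = c(12, 150, 800, 2500, 9000, 60000)
)

# scale() subtracts each column's mean and divides by its standard deviation
standardized <- as.data.frame(scale(toy_attributes))

# After standardization every column has mean 0 and standard deviation 1
print(round(colMeans(standardized), 10))
print(apply(standardized, 2, sd))
```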
Several of the attributes explored are highly correlated or redundant, which might make it difficult for the model to accurately estimate the individual effect of each variable. We can calculate the correlation between all the numerical user attributes in the dataset.
Based on a review of the correlations across attributes, the following pairs of attributes have a high correlation (above 0.7):
num_all_edits: num_content_edits
num_all_edits: num_unreverted_edits
num_edits_1hrs: num_edits_24hrs
num_reverted_edits_10minutes: num_reverted_edits_12hours
num_reverted_edits_all: num_reverted_edits_12hours
From these correlated groups, we keep the following attributes in the model, which exploratory analysis showed to have a large difference between blocked and non-blocked accounts:
num_reverted_edits_12hours
num_edits_24hrs
num_content_edits
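The correlation review described above could be carried out along the following lines. This is a sketch on made-up data; the use of Pearson correlation (the default of cor()) and the helper names are assumptions, since the report does not show this step.

```r
set.seed(1)  # arbitrary seed for a reproducible toy example
n <- 500
num_content_edits <- rpois(n, 5)
toy_numeric <- data.frame(
  num_content_edits = num_content_edits,
  num_all_edits = num_content_edits + rpois(n, 1),  # nearly duplicates the column above
  max_edit_size = rexp(n, 1 / 500)                  # unrelated to the edit counts
)

cor_matrix <- cor(toy_numeric, method = "pearson")

# Keep each pair once (upper triangle) and flag absolute correlations above 0.7
high <- which(abs(cor_matrix) > 0.7 & upper.tri(cor_matrix), arr.ind = TRUE)
high_pairs <- data.frame(
  var_1 = rownames(cor_matrix)[high[, "row"]],
  var_2 = colnames(cor_matrix)[high[, "col"]],
  correlation = cor_matrix[high]
)
print(high_pairs)
```

Only one attribute from each flagged pair then needs to stay in the model, which is the choice made in the list above.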
Variable importance across all accounts
# create training dataset (80% of the dataset) and test dataset (20% of the dataset)
# Remove identifier, non-attribute variables and variables identified as duplicative
model_data_rf <- select(
  user_attribute_data_norm,
  -c(user_name, user_id, block_timestamp, block_expiration_timestamp,
     user_registration_timestamp, block_duration, block_reason_simple,
     avg_edit_size, num_unreverted_edits, num_all_edits, num_edits_1hrs,
     num_reverted_edits_10minutes, num_reverted_edits_all, user_rights_level,
     user_edit_bucket)
)
trn_index_rf <- createDataPartition(model_data_rf$is_blocked, p = 0.8, list = FALSE)
trn_data_rf <- model_data_rf[trn_index_rf, ]
test_data_rf <- model_data_rf[-trn_index_rf, ]
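With only about 1% of accounts blocked, it matters that both the training and test sets keep a similar share of blocked accounts. The sketch below shows one way a class-stratified split could be written in base R. It is an illustration on made-up data, not a description of how createDataPartition works internally, and stratified_train_index is a hypothetical helper.

```r
set.seed(7)  # arbitrary seed for a reproducible toy example

# Toy data (made up): 10 blocked accounts out of 1,000, i.e. 1% blocked
toy <- data.frame(
  user_id = seq_len(1000),
  is_blocked = rep(c(TRUE, FALSE), times = c(10, 990))
)

stratified_train_index <- function(outcome, p) {
  # Sample a fraction p of the row indices within each outcome class
  by_class <- split(seq_along(outcome), outcome)
  picked <- lapply(by_class, function(idx) {
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

train_idx <- stratified_train_index(toy$is_blocked, p = 0.8)
train <- toy[train_idx, ]
test <- toy[-train_idx, ]

# Both splits keep the same share of blocked accounts
print(c(train = mean(train$is_blocked), test = mean(test$is_blocked)))
```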
tic()
rf_model_ranger <- ranger(
  formula = is_blocked ~ .,
  data = trn_data_rf,
  num.trees = 501,
  # suggested default for classification
  mtry = floor(sqrt(ncol(trn_data_rf))),
  verbose = TRUE,
  # mean decreasing accuracy
  importance = 'permutation',
  # ranger package by default uses all available cores at disposal
  # ideal to specify threads, especially if on a shared server
  num.threads = 6
)
toc()
saveRDS(rf_model_ranger, file = 'data/rf_model.rds')
options(repr.plot.width = 12, repr.plot.height = 10)
var_imp_plot <- (
  ggplot(var_importance, aes(x = reorder(variable, imp_norm), y = imp_norm, fill = imp_norm)) +
    geom_bar(stat = 'identity') +
    # geom_text(
    #   aes(label = sprintf("%.1f%%", imp_pct), fontface = 1),
    #   hjust = -0.1
    # ) +
    coord_flip() +
    labs(
      title = 'Importance of user attributes in predicting a user being blocked',
      x = 'Variable',
      y = 'Mean Decreasing Accuracy (Relative to Max)'
    ) +
    theme_classic() +
    theme(
      plot.margin = unit(c(1, 0, 1, 1), "cm"),
      text = element_text(size = 14),
      plot.title = element_text(hjust = 0.5, size = 18),
      legend.position = 'none'
    ) +
    scale_fill_gradient(low = 'lightblue', high = 'darkblue')
)
var_imp_plot
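The plotting code above reads from a var_importance data frame whose construction is not shown. Given the axis label "Relative to Max", one plausible construction divides each raw importance value by the largest one. The sketch below assumes exactly that; the importance values are made up and var_importance_sketch is a hypothetical name.

```r
# Illustrative permutation-importance values (made up, not the report's results).
# For a ranger model fitted with importance = 'permutation', the raw values are
# stored in the fitted object's `variable.importance` element.
raw_importance <- c(
  ratio_revert_edits = 0.0040,
  max_edit_size = 0.0030,
  has_confirmed_email = 0.0002
)

var_importance_sketch <- data.frame(
  variable = names(raw_importance),
  imp = unname(raw_importance)
)

# Express each importance relative to the most important variable (max = 1)
var_importance_sketch$imp_norm <- var_importance_sketch$imp / max(var_importance_sketch$imp)
var_importance_sketch <- var_importance_sketch[order(-var_importance_sketch$imp_norm), ]
print(var_importance_sketch)
```

Normalizing this way keeps the ranking intact and puts the bars on a 0 to 1 scale, so the plot shows relative rather than absolute decreases in accuracy.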
Summary:
Each bar indicates the average decrease in accuracy (relative to the maximum).
Many of the editing-related attributes were identified as the most important in predicting a user being blocked.
The ratio of all reverted edits to all total edits was also identified as an important user attribute. Note: This ratio reflects all edits and is not limited to a particular page namespace.
A high decrease in accuracy was observed when the values of the user’s maximum edit size were permuted.
The number of non-content edits is slightly more important in predicting whether a user was blocked than the number of content edits.
The total number of reverted edits within 12 hours was also identified as important.
The model also indicates that whether a user page exists for the user is important in determining whether the user was blocked. Note: This could be related to comments being left on a user’s talk page.
Non-editing attributes were identified as less important in this model. Whether the user has a check user report was identified as the most important non-editing attribute, but it has a relatively low mean decrease in accuracy (MDA) compared to editing attributes.
Note: This doesn’t necessarily mean that these variables don’t have any importance in deciding the blocked outcome, but when considering the overall set of variables available, they are less important compared to others. Random forests favor features with more distinct values. In this dataset, the non-editing user attributes had limited distinct values.
We will investigate further using regression modeling to estimate the impact of each of these variables on the outcome and identify which features both models agree on.
Logistic Regression Modeling
We then use logistic regression modeling to assess the magnitude and direction of the user attribute’s impact on whether a user was blocked or not.
Note: If this analysis is extended to other Wikipedias, it would be good to use a hierarchical regression model to handle random effects that may be caused by differing block behavior on various Wikipedias.
For this method, we split the dataset into a training set, on which the logistic regression model is trained, and a test set, to which the trained model is applied to classify the result.
outcome <- 'is_blocked'
# exclude identifier columns and redundant variables
blocked_glm_exclude_cols <- c(
  'user_name', 'user_id', 'block_timestamp', 'user_registration_timestamp',
  'has_completed_edit', 'block_expiration_timestamp', 'block_duration',
  'block_reason_simple', 'avg_edit_size', 'num_unreverted_edits', 'num_all_edits',
  'num_edits_1hrs', 'num_reverted_edits_10minutes', 'num_reverted_edits_all',
  'user_rights_level', 'user_edit_bucket'
)
blocked_glm_data <- select(user_attribute_data_norm, -all_of(blocked_glm_exclude_cols))
blocked_glm_num_cols <- names(select(blocked_glm_data, where(is.numeric)))
# create training and test data sets
blocked_glm_trn_idx <- createDataPartition(blocked_glm_data[[outcome]], p = 0.9, list = FALSE)
blocked_glm_trn_data <- blocked_glm_data[blocked_glm_trn_idx, ]
blocked_glm_tst_data <- blocked_glm_data[-blocked_glm_trn_idx, ]
blocked_predictors <- setdiff(names(blocked_glm_data), outcome)
blocked_glm_formula <- as.formula(glue('{outcome} ~ {paste(blocked_predictors, collapse = \' + \')}'))
tic()
blas_set_num_threads(16)
blocked_glm <- glm(
  formula = blocked_glm_formula,
  family = binomial(link = 'logit'),
  # weights = weight,
  data = blocked_glm_trn_data
)
blas_set_num_threads(1)
print('model has been built.')
toc()
[1] "model has been built."
1.037 sec elapsed
Model Summary
blocked_glm_numerical_features_summary_tbl <- tbl_regression(
  blocked_glm,
  include = all_of(blocked_glm_num_cols)
)
blocked_glm_logical_cols <- setdiff(names(select(blocked_glm_data, where(is.logical))), outcome)
blocked_glm_logical_features_summary_tbl <- (
  tbl_regression(
    blocked_glm,
    include = all_of(blocked_glm_logical_cols),
    show_single_row = all_of(blocked_glm_logical_cols),
    tidy_fun = broom.helpers::tidy_parameters
  )
)
Profiled confidence intervals may take longer time to compute.
Use `ci_method="wald"` for faster computation of CIs.
#| column: page
display_tbl_hrz(list(
  as_raw_html(
    as_gt(blocked_glm_numerical_features_summary_tbl) %>%
      opt_stylize(5) %>%
      tab_header('Numerical features of user accounts')
  ),
  as_raw_html(
    as_gt(blocked_glm_logical_features_summary_tbl) %>%
      opt_stylize(5) %>%
      tab_header('Logical features of user accounts')
  )
))
Numerical features of user accounts
Characteristic | log(OR) | 95% CI | p-value
num_thanks | -0.57 | -1.0, -0.14 | 0.011
num_exp_blocks | 0.26 | -3.1, 3.5 | 0.9
num_cu_requests | -0.40 | -0.79, -0.02 | 0.040
num_reverted_edits_12hours | 0.24 | 0.13, 0.34 | <0.001
num_edits_24hrs | -0.29 | -0.35, -0.22 | <0.001
num_articles_created | -0.09 | -0.29, 0.11 | 0.4
num_content_edits | 0.19 | 0.11, 0.26 | <0.001
num_non_content_edits | 0.17 | 0.10, 0.25 | <0.001
max_edit_size | 0.10 | 0.07, 0.13 | <0.001
ratio_revert_edits | 1.3 | 0.99, 1.6 | <0.001
OR = Odds Ratio, CI = Confidence Interval
Logical features of user accounts
Characteristic | log(OR) | 95% CI | p-value
has_received_thanks | 0.12 | -0.39, 0.64 | 0.6
has_reverted_edit_12hours | -0.12 | -0.31, 0.07 | 0.2
has_article_created | 0.61 | 0.28, 0.93 | <0.001
has_completed_content | -0.26 | -0.44, -0.08 | 0.005
has_completed_non_content | 0.46 | 0.29, 0.63 | <0.001
has_sleeper_account_activity | 0.91 | 0.68, 1.1 | <0.001
has_confirmed_email | -0.05 | -0.16, 0.05 | 0.3
has_user_page | 0.35 | 0.22, 0.47 | <0.001
has_historical_blocks | 0.25 | -2.2, 2.8 | 0.8
has_other_wiki_blocks | 2.1 | 1.5, 2.6 | <0.001
has_cu_reports | 2.4 | 2.0, 2.7 | <0.001
OR = Odds Ratio, CI = Confidence Interval
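The tables report coefficients as log odds ratios, which can be hard to read directly. Exponentiating a log odds ratio gives the odds ratio itself. The short example below applies exp() to a few of the rounded values from the tables above; the variable names log_or and odds_ratio are introduced only for this illustration.

```r
# Rounded log(OR) values copied from the model summary tables
log_or <- c(
  has_cu_reports = 2.4,
  has_other_wiki_blocks = 2.1,
  ratio_revert_edits = 1.3,
  num_edits_24hrs = -0.29
)

# exp() converts a log odds ratio into an odds ratio
odds_ratio <- exp(log_or)
print(round(odds_ratio, 2))
```

For example, a log odds ratio of 2.4 corresponds to an odds ratio of about 11, meaning the odds of being blocked are roughly eleven times higher when the feature is TRUE, holding the other predictors fixed. A negative log odds ratio gives an odds ratio below 1, indicating lower odds of being blocked.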
Logistic regression coefficient review
#| column: page
confidence_intervals <- confint.default(blocked_glm) %>%
  as.data.frame %>%
  { .$term <- gsub("`", "", rownames(.)); . } %>%
  set_rownames(NULL) %>%
  dplyr::mutate(
    estimate = coef(blocked_glm),
    `increases chances of` = ifelse(estimate > 0, "blocked", "not blocked")
  )
p <- ggplot(confidence_intervals, aes(y = estimate, x = term, color = `increases chances of`)) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_pointrange(aes(ymin = `2.5 %`, ymax = `97.5 %`)) +
  scale_y_continuous(limits = c(-8, 8)) +
  scale_color_brewer(type = "qual", palette = "Set1", direction = -1) +
  coord_flip() +
  ggthemes::theme_tufte(base_family = "Gill Sans", base_size = 16) +
  labs(
    x = "Variable", y = "Estimate",
    title = "Coefficient plot of logistic regression model of \n factors associated with blocked accounts ",
    caption = "Point estimates and 95% confidence intervals for coefficients used in the logistic regression model. Positive coefficients increase the odds of the user being blocked"
  ) +
  theme(
    legend.position = "bottom",
    panel.grid = element_line(color = "gray80"),
    panel.grid.major.y = element_line(color = "black", size = 0.1)
  )
p
A number of user attributes were assessed to have no significant effect on the outcome variable, such as the number of expired blocks and whether the account has a confirmed email address. The ratio of reverted edits, being identified as a sleeper account, and administrative activity such as blocks on other wikis and check user reports were identified as factors that significantly increased the chance of a user account being blocked.
We will further explore the impact of variables determined to have a significant impact on whether an account is blocked in the next section.
blocked_glm_logical_features_summary_tbl <- (
  blocked_glm_logical_features_summary %>%
    gt(rowname_col = 'feature') %>%
    fmt_percent('prob_diff', decimals = 0) %>%
    cols_label(coeff = 'Coefficient', prob_diff = '% Change') %>%
    tab_header('Change in Probability of User Account Block', 'for logical variables') %>%
    tab_source_note('% Change is when the given feature is TRUE') %>%
    opt_stylize()
)
display_html(as_raw_html(blocked_glm_logical_features_summary_tbl))
Change in Probability of User Account Block for logical variables
Feature | Coefficient | % Change
has_article_created | 0.593 | 80%
has_completed_non_content | 0.409 | 50%
has_completed_content | -0.317 | −27%
has_sleeper_account_activity | 0.890 | 141%
has_user_page | 0.331 | 39%
has_other_wiki_blocks | 1.987 | 602%
has_cu_reports | 2.231 | 784%
% Change is when the given feature is TRUE
Summary
The impact on the probability of being blocked is highest when check user reports have been issued or the user account has been blocked on other wikis.
Sleeper account activity (defined as a quick rate of editing from new accounts, especially after a while since account creation) was also identified as one of the attributes with the highest impact on the probability of being blocked.
The probability of a user being blocked also increased if they completed a non-content edit or created an article shortly after creating an account, while the probability of being blocked decreased if they completed at least one content edit.
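The code that turns a coefficient into a "% Change" value is not shown. One common way to do this, sketched below, is to shift a baseline probability on the log-odds scale by the coefficient and report the relative change. The function name relative_prob_change and the 1% baseline are assumptions chosen for illustration, so the output is not expected to reproduce the table exactly.

```r
relative_prob_change <- function(coefficient, baseline_prob) {
  # Move from the baseline to feature = TRUE on the log-odds scale,
  # then convert back to a probability
  new_prob <- plogis(qlogis(baseline_prob) + coefficient)
  (new_prob - baseline_prob) / baseline_prob
}

# Coefficient for has_article_created taken from the table above;
# the 1% baseline probability is an assumed value
change <- relative_prob_change(coefficient = 0.593, baseline_prob = 0.01)
print(round(change, 2))
```

When the baseline probability is small, as it is here, the relative change in probability is close to exp(coefficient) - 1, the relative change in odds.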
Numerical Variables
# limit to variables where a significant difference was found
blocked_glm_num_cols_sig <- c(
  'num_reverted_edits_12hours', 'num_edits_24hrs', 'num_content_edits',
  'num_non_content_edits', 'max_edit_size', 'ratio_revert_edits'
)
blocked_glm_numerical_features_summary_tbl <- (
  blocked_glm_numerical_features_summary %>%
    gt(rowname_col = 'feature') %>%
    tab_spanner(label = 'Scenario 1', columns = ends_with('l1')) %>%
    tab_spanner(label = 'Scenario 2', columns = ends_with('l2')) %>%
    cols_label(
      starts_with("base") ~ "Initial",
      starts_with("increase") ~ "Increase",
      starts_with("prob") ~ "% Change"
    ) %>%
    fmt_percent(starts_with('prob')) %>%
    data_color(starts_with('prob'), palette = 'RdYlBu', reverse = TRUE) %>%
    opt_stylize() %>%
    tab_header('Change in Probability of User Account Block', 'for numerical variables')
)
display_html(as_raw_html(blocked_glm_numerical_features_summary_tbl))
Change in Probability of User Account Block for numerical variables
Feature | Scenario 1: Initial | Scenario 1: Increase | Scenario 1: % Change | Scenario 2: Initial | Scenario 2: Increase | Scenario 2: % Change
num_reverted_edits_12hours | 3.0 | 10.0 | 32.94% | 5.0 | 15.0 | 32.90%
num_edits_24hrs | 2.0 | 3.0 | −18.46% | 10.0 | 50.0 | −39.66%
num_content_edits | 10.0 | 20.0 | 21.95% | 100.0 | 200.0 | 23.11%
num_non_content_edits | 10.0 | 20.0 | 19.30% | 100.0 | 200.0 | 20.33%
max_edit_size | 1000.0 | 2000.0 | 10.82% | 2000.0 | 3000.0 | 8.94%
ratio_revert_edits | 0.2 | 0.3 | 32.63% | 0.2 | 0.4 | 43.90%
Summary:
To help understand changes in probability due to the impact of each user attribute, we calculated the change in probability of being blocked as the difference between the probability for a given initial value and the probability for the new value. We reviewed two scenarios. A lower initial value is reviewed in the first scenario and compared to a larger initial value in the second scenario.
Increases in all of the numerical attributes led to an increase in the probability of a user account block, except for the number of edits completed by the user within 24 hours. An increase in the ratio of a user’s reverted edits to all edits completed leads to the biggest increase in the probability of a user being blocked.
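The scenario comparison described above can be sketched as a small function that evaluates the model's predicted probability at the initial and new values and reports the relative difference. Everything in this sketch is hypothetical: the function name scenario_prob_change, the coefficient, the baseline log-odds, and the mean and standard deviation used for standardization are made-up values, because the preprocessing behind user_attribute_data_norm is not shown.

```r
scenario_prob_change <- function(coefficient, initial, new, baseline_logit,
                                 transform = identity) {
  # `transform` is a placeholder for whatever preprocessing was applied to the
  # predictor before fitting (not shown in the report); identity is an assumption
  p_initial <- plogis(baseline_logit + coefficient * transform(initial))
  p_new <- plogis(baseline_logit + coefficient * transform(new))
  (p_new - p_initial) / p_initial
}

# Hypothetical standardization with an assumed mean of 15 and sd of 40
standardize <- function(x) (x - 15) / 40

# Hypothetical inputs: coefficient 0.3 and a 1% baseline probability
change_1 <- scenario_prob_change(0.3, initial = 10, new = 20,
                                 baseline_logit = qlogis(0.01), transform = standardize)
change_2 <- scenario_prob_change(0.3, initial = 100, new = 200,
                                 baseline_logit = qlogis(0.01), transform = standardize)
print(round(c(scenario_1 = change_1, scenario_2 = change_2), 3))
```

A positive coefficient always yields a positive change when the value increases, and a negative coefficient (as for num_edits_24hrs) yields a negative change. The size of the change depends on both the starting point and the transformation applied to the predictor.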
Conclusions
The two models reveal a similar set of high-impact features, namely:
Ratio of reverted edits to all edits completed
Maximum edit size
Number of edits reverted within 12 hours
Has completed a non content edit
Has created a new article
Has a user page
The logistic regression model also identified these as high-impact features:
Has other wiki blocks
Has check user reports
Has sleeper account activity
The models confirmed these attributes had the highest impact on the likelihood of an account being blocked and could be considered in the calculation of an account reputation score.
Follow-up analysis suggestions:
Extend analysis to other Wikipedias where the type of user attributes as well as the magnitude and direction of their impact on a user’s block status may vary due to different moderation and administrative practices.
Consider additional user attributes. The analysis revealed that a quick rate of editing, especially when a large portion of those edits were reverted, increases the likelihood of a user being blocked. We can explore a few more definitions of quick editing rates to see if we can find a more precise indicator.
Acknowledgements
The approach in this analysis references the following reports and analyses: