Purpose:

Special:Investigate has been deployed on all wikis. This dashboard monitors adoption, user experience, and feature usage of the new tool. (T257539)

Adoption: Comparing usage for the new tool versus the old tool

  • How often is Special:Investigate accessed?
  • How often is Special:CheckUser accessed?

User experience

  • How many users does an investigation start with?
  • How many records are fetched per investigation?
  • How often did we display incomplete results to the user?
  • How long did a user spend waiting for the results?

Feature usage

  • Over the course of an investigation, how often did a user access the individual tabs?
  • How many times did a user pin a highlight?
  • How often did the block feature get used and how many users were blocked?

User reliance on external IP tools

  • Which tools under the IP address were clicked and how many times?

Other metrics we'd like to track once a 'Done' button has been developed (a hypothetical query sketch follows this list)

  • How often did a user quit the investigation (page close or back button) while the results were being generated?
  • How many users are under investigation by the time the investigation ends?
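
Neither a quit event nor a done event exists in the schema yet, so the following is purely a hypothetical sketch of how these two metrics might be queried once such actions are instrumented. The 'quit' and 'done' action values are invented for illustration:

# Hypothetical sketch: the 'quit' and 'done' actions below are not (yet) part of the schema.
query_done_metrics = '''
SELECT
  SUM(IF(event.action = 'quit', 1, 0)) AS quits_while_generating,
  AVG(IF(event.action = 'done', event.targetsCount, NULL)) AS avg_targets_at_done
FROM event.specialinvestigate
WHERE year = 2021
'''
# hive.run(query_done_metrics)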

Schema: SpecialInvestigate EventLogging data
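
Apart from the adoption comparison (which reads cu_log via MariaDB), the queries below read from the event.specialinvestigate Hive table. The fields this dashboard relies on are event.action (submit, query, pin, block, tool), event.tab, event.tool, event.targetsCount, event.resultsCount, event.resultsIncomplete, and event.queryTime. A minimal sketch for eyeballing recent events; the column list simply mirrors the queries used later and the LIMIT is arbitrary:

from wmfdata import hive

sample = hive.run('''
SELECT
  wiki,
  event.action,            -- submit, query, pin, block, tool
  event.tab,               -- compare, preliminary-check, timeline (NULL on the default landing tab)
  event.tool,               -- external tool clicked under an IP address
  event.targetsCount,
  event.resultsCount,
  event.resultsIncomplete,
  event.queryTime
FROM event.specialinvestigate
WHERE year = 2021
LIMIT 20
''')
sample.head()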

Deploy schedule:

Deployed on itwiki and svwiki on September 24.
Deployed on all projects on October 8.

In [1]:
%%capture
%set_env http_proxy=http://webproxy.eqiad.wmnet:8080
%set_env https_proxy=https://webproxy.eqiad.wmnet:8080
%set_env no_proxy=localhost,127.0.0.1
In [2]:
import wmfdata
In [3]:
wmfdata.utils.insert_code_toggle()
In [4]:
from wmfdata import hive, mariadb
import pandas as pd
In [5]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.ticker as ticker
import matplotlib.dates as mdates
In [6]:
from datetime import datetime, timedelta, date
In [7]:
now = pd.Timestamp.utcnow()
today=now.date()
last_Saturday=(today - timedelta(days=today.weekday()+2)).strftime("%Y%m%d")
month_1st_day= datetime.today().replace(day=1).strftime("%Y-%m-%d")
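
For concreteness, a worked example assuming the run date shown in the notebook footer (2021-10-15, a Friday):

# today         -> datetime.date(2021, 10, 15)
# last_Saturday -> '20211009'   (the Saturday closing the last complete Sunday-Saturday week)
# month_1st_day -> '2021-10-01'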
In [8]:
start_wmf_raw_date = '20210101'
# the first day of the week is Sunday
end_wmf_raw_date=last_Saturday

lastyear_start_wmf_raw_date = '20200101'
lastyear_end_wmf_raw_date='20201231'

Comparing usage for the new tool versus the old tool

In [9]:
wikis =hive.run(
'''
SELECT database_code, database_group
FROM  canonical_data.wikis 
WHERE
    database_group in (
        "commons", "incubator", "foundation", "mediawiki", "meta", "sources", 
        "species","wikibooks", "wikidata", "wikinews", "wikipedia", "wikiquote",
        "wikisource", "wikiversity", "wikivoyage", "wiktionary"
    )
'''
).sort_values("database_code").set_index("database_code")
In [10]:
local_CU_list=['metawiki','commonswiki', 'specieswiki', 'wikidatawiki','arwiki', 'bnwiki','cawiki','cswiki','dawiki' ,
               'dewiki', 'enwiki','eswiki','fawiki','fiwiki','frwiki', 'hewiki','hrwiki','huwiki', 'idwiki','itwiki',
               'jawiki', 'kowiki', 'mlwiki', 'nlwiki', 'plwiki', 'ptwiki', 'ruwiki', 'simplewiki', 'slwiki',
               'srwiki', 'svwiki', 'trwiki', 'ukwiki', 'viwiki','enwikibooks','enwiktionary']
In [11]:
checkuser_requests_query='''
SELECT 
 DATE_FORMAT(from_unixtime(UNIX_timestamp(cul_timestamp)), '%U') AS week_n,
 '{wiki}' AS wiki_db,'{admin}' AS admin_type, 
 CASE WHEN cul_type='investigate' THEN 'investigate' ELSE 'checkuser' END AS tool,
 COUNT(cul_id) AS requests
FROM cu_log 
 WHERE  substr(cul_timestamp,1,8)  BETWEEN '{START_YYYYMMDD}' AND '{END_YYYYMMDD}'
 GROUP BY  DATE_FORMAT(from_unixtime(UNIX_timestamp(cul_timestamp)), '%U'), 
 CASE WHEN cul_type='investigate' THEN 'investigate' ELSE 'checkuser' END 
 ORDER BY wiki_db, week_n
 LIMIT 10000
'''
In [12]:
columns = ['week_n','wiki_db', 'admin_type','tool','requests']
df_admin_request= pd.DataFrame( columns=columns)
df_admin_request_lastyear= pd.DataFrame( columns=columns)
In [13]:
for w in wikis.index.values:
    if w in local_CU_list:
        try: 
            tmp=mariadb.run(checkuser_requests_query.format(wiki=w,admin='local_checkuser', START_YYYYMMDD=start_wmf_raw_date,END_YYYYMMDD=end_wmf_raw_date),w)
            df_admin_request=df_admin_request.append(tmp, ignore_index = True)
        except ValueError:
            pass
    else:
        try:    
            tmp=mariadb.run(checkuser_requests_query.format(wiki=w,admin='global_checkuser', START_YYYYMMDD=start_wmf_raw_date,END_YYYYMMDD=end_wmf_raw_date),w)
            df_admin_request=df_admin_request.append(tmp, ignore_index = True)
        except ValueError:
            pass
In [14]:
for w in wikis.index.values:
    if w in local_CU_list:
        try: 
            tmp=mariadb.run(checkuser_requests_query.format(wiki=w,admin='local_checkuser', START_YYYYMMDD=lastyear_start_wmf_raw_date,END_YYYYMMDD=lastyear_end_wmf_raw_date),w)
            df_admin_request_lastyear=df_admin_request_lastyear.append(tmp, ignore_index = True)
        except ValueError:
            pass
    else:
        try:    
            tmp=mariadb.run(checkuser_requests_query.format(wiki=w,admin='global_checkuser', START_YYYYMMDD=lastyear_start_wmf_raw_date,END_YYYYMMDD=lastyear_end_wmf_raw_date),w)
            df_admin_request_lastyear=df_admin_request_lastyear.append(tmp, ignore_index = True)
        except ValueError:
            pass
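
The two cells above differ only in the date range. A minimal consolidation sketch, assuming the same wmfdata.mariadb API, that collects the per-wiki frames and concatenates once instead of calling DataFrame.append (deprecated in newer pandas releases):

def weekly_requests(start_yyyymmdd, end_yyyymmdd):
    # Run the cu_log query against every wiki and stack the per-wiki results.
    frames = []
    for w in wikis.index.values:
        admin = 'local_checkuser' if w in local_CU_list else 'global_checkuser'
        try:
            frames.append(mariadb.run(
                checkuser_requests_query.format(
                    wiki=w, admin=admin,
                    START_YYYYMMDD=start_yyyymmdd, END_YYYYMMDD=end_yyyymmdd
                ),
                w
            ))
        except ValueError:
            pass  # mirrors the error handling in the cells above
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame(columns=columns)

# df_admin_request = weekly_requests(start_wmf_raw_date, end_wmf_raw_date)
# df_admin_request_lastyear = weekly_requests(lastyear_start_wmf_raw_date, lastyear_end_wmf_raw_date)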
In [15]:
fig, ax = plt.subplots(nrows=2, ncols=1,figsize=(16,12))

fig.tight_layout(pad=1.08, w_pad=0.5, h_pad=6)
ax[0].set_title('Check Requests on Wikis With Local Checkusers',fontweight="bold",fontsize = 16,y=1.08)
ax[0].set_xlabel('Week',  fontsize = 16)
ax[0].set_ylabel('Requests',  fontsize = 16)

ax[0].plot(df_admin_request[(df_admin_request['tool']=='checkuser')&(df_admin_request['admin_type']=='local_checkuser' )].groupby(['week_n'])['requests'].agg('sum'), 'o-',color='green',label='2021 Checkuser')
ax[0].plot(df_admin_request[(df_admin_request['tool']=='investigate')&(df_admin_request['admin_type']=='local_checkuser' )].groupby(['week_n'])['requests'].agg('sum'), 'o-',color='orange', label='2021 Special:Investigate')
ax[0].plot(df_admin_request_lastyear[(df_admin_request_lastyear['tool']=='checkuser')&(df_admin_request_lastyear['admin_type']=='local_checkuser' )].groupby(['week_n'])['requests'].agg('sum'), 'o--',color='green', label='2020 Checkuser')
ax[0].plot(df_admin_request_lastyear[(df_admin_request_lastyear['tool']=='investigate')&(df_admin_request_lastyear['admin_type']=='local_checkuser' )].groupby(['week_n'])['requests'].agg('sum'), 'o--',color='orange', label='2020 Special:Investigate')

ax[0].set_xlim(1,52)
ax[0].set_ylim(0,4000)
ax[0].legend(loc='upper right')


ax2 = ax[0].twiny()  # instantiate a second axes that shares the same x-axis
ax2.set_xlim(0,12)
ax2.set_xticks([0.5,1.5,2.5,3.5,4.5,5.5,6.5,7.5,8.5,9.5,10.5,11.5])
ax2.set_xticklabels(['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'])
ax2.tick_params(length=0)
ax2.set_xlabel('Month',fontsize=16)

# use a fresh minor locator per axis (locator instances should not be shared across axes)
ax2.xaxis.set_minor_locator(ticker.MultipleLocator(1))
ax2.grid(axis='x', which='minor')

ax[1].set_title('Check Requests on Wikis Without Local Checkusers',fontweight="bold",fontsize = 16,y=1.08)
ax[1].set_xlabel('Week',fontsize = 16)
ax[1].set_ylabel('Requests',fontsize = 16)

ax[1].set_xlim(1,52)
ax[1].set_ylim(0,1000)

ax3=ax[1].twiny()
ax3.set_xlim(0,12)
ax3.set_xticks([0.5,1.5,2.5,3.5,4.5,5.5,6.5,7.5,8.5,9.5,10.5,11.5])
ax3.set_xticklabels(['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'])
ax3.tick_params(length=0)
ax3.xaxis.set_minor_locator(ticker.MultipleLocator(1))
ax3.grid(axis='x', which='minor')

ax[1].plot(df_admin_request[(df_admin_request['tool']=='checkuser')&(df_admin_request['admin_type']=='global_checkuser' )].groupby(['week_n'])['requests'].agg('sum'), 'o-',color='green', label='2021 Checkuser')
ax[1].plot(df_admin_request[(df_admin_request['tool']=='investigate')&(df_admin_request['admin_type']=='global_checkuser' )].groupby(['week_n'])['requests'].agg('sum'), 'o-',color='orange', label='2021 Special:Investigate')
ax[1].plot(df_admin_request_lastyear[(df_admin_request_lastyear['tool']=='checkuser')&(df_admin_request_lastyear['admin_type']=='global_checkuser' )].groupby(['week_n'])['requests'].agg('sum'), 'o--',color='green', label='2020 Checkuser')
ax[1].plot(df_admin_request_lastyear[(df_admin_request_lastyear['tool']=='investigate')&(df_admin_request_lastyear['admin_type']=='global_checkuser' )].groupby(['week_n'])['requests'].agg('sum'), 'o--',color='orange',label='2020 Special:Investigate')

ax[1].legend(loc='upper right')
#plt.savefig("adoption.png")
plt.show()

How often is Special:Investigate accessed?

Top 10

In [16]:
df_admin_request[df_admin_request['tool']=='investigate'].sort_values(by=['week_n', 'requests'], ascending=False).head(10)
Out[16]:
week_n wiki_db admin_type tool requests
566 40 eswiki local_checkuser investigate 76
389 40 enwiki local_checkuser investigate 63
1281 40 ptwiki local_checkuser investigate 16
1595 40 ukwiki local_checkuser investigate 8
1100 40 metawiki local_checkuser investigate 7
1404 40 simplewiki local_checkuser investigate 6
862 40 idwiki local_checkuser investigate 3
1593 39 ukwiki local_checkuser investigate 75
564 39 eswiki local_checkuser investigate 38
387 39 enwiki local_checkuser investigate 17

Bottom 10

In [17]:
df_admin_request[df_admin_request['tool']=='investigate'].sort_values(by=['week_n', 'requests'], ascending=[True, False ]).tail(10)
Out[17]:
week_n wiki_db admin_type tool requests
861 39 idwiki local_checkuser investigate 4
1279 39 ptwiki local_checkuser investigate 4
1402 39 simplewiki local_checkuser investigate 2
566 40 eswiki local_checkuser investigate 76
389 40 enwiki local_checkuser investigate 63
1281 40 ptwiki local_checkuser investigate 16
1595 40 ukwiki local_checkuser investigate 8
1100 40 metawiki local_checkuser investigate 7
1404 40 simplewiki local_checkuser investigate 6
862 40 idwiki local_checkuser investigate 3
In [18]:
week_list = list(range(0, 53)) 
df_week_n= pd.DataFrame(week_list, columns=['week_n'])
In [19]:
df_tmp=df_admin_request[(df_admin_request['tool']=='investigate')].groupby(['week_n'])['requests'].agg('sum').to_frame().reset_index()
In [20]:
df_tmp['week_n']=df_tmp['week_n'].astype(int)
In [21]:
df_to_draw=pd.merge(df_week_n, df_tmp,how='left', left_on = 'week_n',right_on='week_n')
In [22]:
df_tmp_pre=df_admin_request_lastyear[(df_admin_request_lastyear['tool']=='investigate')].groupby(['week_n'])['requests'].agg('sum').to_frame().reset_index()
In [23]:
df_tmp_pre['week_n']=df_tmp_pre['week_n'].astype(int)
In [24]:
df_to_draw_pre=pd.merge(df_week_n, df_tmp_pre,how='left', left_on = 'week_n',right_on='week_n')
In [25]:
fig, ax = plt.subplots(nrows=1, ncols=1,figsize=(16,6))

fig.tight_layout(pad=1.08, w_pad=0.5, h_pad=6)
ax.set_title('Special Investigate Requests on All Wikis',fontweight="bold",fontsize = 16,y=1.08)
ax.set_xlabel('Week',  fontsize = 16)
ax.set_ylabel('Requests',  fontsize = 16)

ax.plot(df_to_draw['week_n'],df_to_draw['requests'], 'o-',label='2021')
ax.plot(df_to_draw_pre['week_n'],df_to_draw_pre['requests'], 'o--',label='2020')
ax.set_xlim(1,52)
ax.set_ylim(0,1000)

ax.grid(axis='x', which='minor')

ax2 = ax.twiny()  # instantiate a second axes that shares the same x-axis
ax2.set_xlim(0,12)
ax2.set_xticks([0.5,1.5,2.5,3.5,4.5,5.5,6.5,7.5,8.5,9.5,10.5,11.5])
ax2.set_xticklabels(['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'])
ax2.tick_params(length=0)
ax2.set_xlabel('Month',fontsize=16)
ax2.xaxis.set_minor_locator(ticker.MultipleLocator(1))
ax2.grid(axis='x', which='minor')
ax.legend(loc='upper right')
#plt.savefig("adoption2.png")
plt.show()

How many users does an investigation start with?

In [26]:
query_targetsCount = '''
SELECT event.targetsCount
FROM event.specialinvestigate
WHERE year =2021  and wiki!='testwiki' AND event.action='submit'
'''
In [27]:
df_targetsCount=hive.run(query_targetsCount)
In [28]:
plt.figure(figsize=(18, 6))
plt.hist(df_targetsCount['targetscount'], bins=50, histtype='bar')
plt.title('Distribution of # of Users Investigated per Submit',fontweight="bold",fontsize = 16,y=1.08)
plt.xlabel('# of Users',fontsize = 16)
plt.ylabel('Investigation Count',fontsize = 16)
#plt.savefig("users_investigated.png")
plt.show()

How many records are fetched per investigation?

In [29]:
query_result_count = '''
SELECT event.resultscount AS result_count
FROM event.specialinvestigate
WHERE year =2021 
AND event.resultscount IS not NULL
'''
In [30]:
df_result_count=hive.run(query_result_count)
In [31]:
plt.figure(figsize=(18, 6))
plt.hist(df_result_count['result_count'], bins=50, histtype='bar')
plt.title('Distribution of # of Results Returned',fontweight="bold",fontsize = 16,y=1.08)
plt.xlabel('# of results',fontsize = 16)
plt.ylabel('Investigation Count',fontsize = 16)
#plt.savefig("results.png")
plt.show()
In [32]:
query_result_count_hist = '''
SELECT event.resultscount AS results, COUNT(1) AS investigate
FROM event.specialinvestigate
WHERE year =2021 
AND event.resultscount IS not NULL
GROUP BY event.resultscount
'''
In [33]:
df_result_count_hist=hive.run(query_result_count_hist)
Top 20 most common result counts
In [34]:
df_result_count_hist.sort_values(by=['investigate'], ascending=False).head(20)
Out[34]:
results investigate
50 51 482
74 101 114
1 1 95
2 2 79
4 4 79
3 3 72
5 5 61
6 6 58
7 7 53
8 8 50
15 15 49
0 0 47
9 9 42
105 201 34
12 12 33
14 14 33
49 50 30
122 501 28
13 13 28
10 10 26

How often did we display incomplete results to the user?

In [35]:
query_result_incomplete_count = '''
SELECT COUNT(1) AS resultsIncomplete_count
FROM event.specialinvestigate
WHERE year =2021 and event.resultsIncomplete 
'''
In [36]:
df_result_incomplete_count=hive.run(query_result_incomplete_count)
In [37]:
df_result_incomplete_count.head()
Out[37]:
resultsincomplete_count
0 4

How long did a user spend waiting for the results?

In [38]:
query_queryTime = '''
SELECT event.queryTime
FROM event.specialinvestigate
WHERE year =2021  AND event.queryTime IS not null
'''
In [39]:
df_queryTime=hive.run(query_queryTime)
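
Query times should be numeric, so a small defensive step (a sketch; it assumes nothing about the units) coerces any stray non-numeric rows to NaN and drops them before plotting:

# Coerce to numeric and drop rows that fail to parse.
df_queryTime['querytime'] = pd.to_numeric(df_queryTime['querytime'], errors='coerce')
df_queryTime = df_queryTime.dropna(subset=['querytime'])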
In [40]:
plt.figure(figsize=(18, 6))
plt.hist(df_queryTime['querytime'], bins=50, histtype='bar')
plt.title('Distribution of Query Time',fontweight="bold",fontsize = 16,y=1.08)
plt.xlabel('Query Time',fontsize = 16)
plt.ylabel('Investigation Count',fontsize = 16)
#plt.savefig("querytime.png")
plt.show()
Top 10 waiting times
In [41]:
df_queryTime.sort_values (by='querytime', ascending=False).head(10)
Out[41]:
(querytime values in this output were contaminated with stray Hadoop/Parquet log text and are omitted)

Over the course of an investigation, how often did a user access the individual tabs?

Tab name                               Value in schema
Default landing tab (IPs/User Agents)  NaN
IPs/User Agents                        compare
Account Information                    preliminary-check
Timeline                               timeline
In [42]:
query_tab_clicks='''
SELECT event.tab, count(1) AS tab_clicks
FROM event.specialinvestigate
WHERE year =2021 AND event.action IN ('query','submit')
GROUP BY event.tab
'''
In [43]:
df_tab_clicks=hive.run(query_tab_clicks)
In [44]:
df_tab_clicks
Out[44]:
tab tab_clicks
0 NaN 839
1 compare 1347
2 preliminary-check 130
3 timeline 463

How many times did a user pin a highlight?

In [45]:
query_pins='''

SELECT COUNT(1) AS pins
FROM event.specialinvestigate
WHERE year =2021 
AND event.action = 'pin'
'''
In [46]:
df_pins=hive.run(query_pins)
In [47]:
df_pins
Out[47]:
pins
0 125

How often did the block feature get used?

In [48]:
query_blocks='''

SELECT COUNT(1) AS blocks
FROM event.specialinvestigate
WHERE year =2021 
AND event.action = 'block'


'''
In [49]:
df_blocks=hive.run(query_blocks)
In [50]:
df_blocks
Out[50]:
blocks
0 30

How many users were blocked?

In [51]:
query_blocked_users='''

SELECT  SUM(event.targetsCount) AS blocked_users
FROM event.specialinvestigate
WHERE year =2021 
AND event.action = 'block'

'''
In [52]:
df_blocked_users=hive.run(query_blocked_users)
In [53]:
df_blocked_users
Out[53]:
blocked_users
0 75

Which tools under the IP address were clicked and how many times?

In [54]:
query_tool_usage='''
SELECT  event.tool, COUNT(1) tool_usage
FROM event.specialinvestigate
WHERE year =2021 AND event.action='tool'
GROUP BY event.tool
ORDER BY tool_usage DESC
LIMIT 100000
'''
In [55]:
df_tool_usage=hive.run(query_tool_usage)
In [56]:
df_tool_usage
Out[56]:
tool tool_usage
0 Special:Contributions 173
1 whois.toolforge.org 71
2 whatismyipaddress.com 35
3 Special:InvestigateLog 17
4 ipcheck.toolforge.org 15
In [57]:
print("This dashboard was generated on {} at {}. It is maintained by Jennifer Wang, Product Analytics. If you have questions or feedback, please email jwang@wikimedia.org or product-analytics@wikimedia.org.".format(today.strftime("%Y-%m-%d"), now.strftime("%H:%M %Z")))
This dashboard was generated on 2021-10-15 at 19:49 UTC. It is maintained by Jennifer Wang, Product Analytics. If you have questions or feedback, please email jwang@wikimedia.org or product-analytics@wikimedia.org.