https://phabricator.wikimedia.org/T221891
In this analysis, we use the ORES draft topic model to get the topics of articles viewed on English Wikipedia in March 2019.
The outcome topics are the mid-level categories of WikiProject directory (see the hierarchy).
# Notebook preamble: inject a small jQuery snippet that hides all input cells
# by default and adds a button to toggle them, so readers see the narrative
# and outputs without the raw code.
from IPython.display import HTML
HTML('''<script>
code_show=true;
function code_toggle() {
if (code_show){
$('div.input').hide();
} else {
$('div.input').show();
}
code_show = !code_show
}
$( document ).ready(code_toggle);
</script>
The raw code for this notebook is by default hidden for easier reading.
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code"></form>
''')
%load_ext sql_magic
# Spark/Hive session setup: queries in %%read_sql cells below run through this
# YARN-backed HiveContext.
import findspark, os
os.environ['SPARK_HOME'] = '/usr/lib/spark2';
findspark.init()
import pyspark
import pyspark.sql
conf = pyspark.SparkConf().setMaster("yarn") # Use master yarn here if you are going to query large datasets.
# Executor/driver sizing for scanning a full month of wmf.pageview_hourly.
conf.set('spark.executor.memory', '8g')
conf.set('spark.yarn.executor.memoryOverhead', '1024')
conf.set('spark.executor.cores', '4')
conf.set('spark.dynamicAllocation.maxExecutors', '32')
conf.set('spark.driver.memory', '4g')
# maxResultSize is raised to 10g because the top-1M-pages result set is
# collected back to the driver as a pandas DataFrame.
conf.set('spark.driver.maxResultSize', '10g')
conf.set('spark.logConf', True)
sc = pyspark.SparkContext(conf=conf)
spark_hive = pyspark.sql.HiveContext(sc)
# Point sql_magic at the HiveContext created above.
%config SQL.conn_name = 'spark_hive'
import requests
import pandas as pd
import json
import matplotlib.pyplot as plt
%%read_sql enwiki_pageviews -d
-- Total March 2019 page views per article-namespace page on English
-- Wikipedia, counting human traffic only (agent_type = 'user' excludes
-- bots/spiders). Result lands in the pandas DataFrame `enwiki_pageviews`.
select page_id, sum(view_count) as pageviews
from wmf.pageview_hourly
where year=2019 and month=3
and namespace_id = 0
and project = 'en.wikipedia'
and agent_type = 'user'
group by page_id
# Share of total views per page, then rank pages by views (descending).
enwiki_pageviews['proportion'] = enwiki_pageviews['pageviews'] / enwiki_pageviews['pageviews'].sum()
enwiki_pageviews = enwiki_pageviews.sort_values(by='pageviews', ascending=False)
print('Total page views: ' + str(enwiki_pageviews.pageviews.sum()))
# Fixed typo in the printed label ("unqiue" -> "unique").
print('Number of unique pages: ' + str(enwiki_pageviews.shape[0]))
# Positional slice works because the frame was just sorted by pageviews.
print('Top 1M pages account for ' + str(round(enwiki_pageviews.proportion[:1000000].sum() * 100, 2)) + '% of total page views.')
# Page views distribution
# NOTE(review): iloc[1:1000] skips row 0 (the single most-viewed page,
# presumably Main_Page) and plots 999 rows, while the title says
# "Top 1000 Pages" — confirm whether the off-by-one exclusion is intended.
distr_plot = enwiki_pageviews.iloc[1:1000].hist(column='pageviews', bins=100, grid=False, figsize=(15,8))
plt.title('Page Views Distribution of the Top 1000 Pages')
plt.xlabel('Page Views')
plt.ylabel('Number of Pages')
%%read_sql enwiki_pv_rev -d
-- Attach page titles and latest revision ids to the top 1M most-viewed
-- pages. The rev_id (page_latest) is what ORES scores, so pages that do not
-- match a live, non-redirect article row in p get NULL rev_ids here.
with v as (
-- Top 1M article-namespace pages by March 2019 human page views.
select page_id, sum(view_count) as pageviews
from wmf.pageview_hourly
where year=2019 and month=3
and namespace_id = 0
and project = 'en.wikipedia'
and agent_type = 'user'
group by page_id
order by pageviews desc
limit 1000000
), p as (
-- Current enwiki article pages (no redirects) from the 2019-03 snapshot.
select page_id, page_title, page_latest
from wmf_raw.mediawiki_page
where wiki_db = 'enwiki'
and snapshot = '2019-03'
and page_id is not null
and page_namespace = 0
and not page_is_redirect
)
-- Left join keeps every top-viewed page even when no title/revision matches.
select v.page_id, p.page_title, p.page_latest as rev_id, v.pageviews
from v left join p on v.page_id=p.page_id
# Save rev_id into a json file
# One {"rev_id": ...} object per line (JSON Lines), the input format expected
# by `ores score_revisions`; rows with NaN rev_id (pages with no match in the
# page table) are dropped, and the float NaN-carrier column is cast back to int.
enwiki_pv_rev[['rev_id']].dropna().astype('int64').to_json(path_or_buf='input_rev_id.json', orient='records', lines=True)
%%bash
# Score every revision with the ORES drafttopic model via the bulk CLI,
# 4 requests in parallel; output is one JSON object per input line, written
# to output_drafttopic_enwiki.json for parsing below.
ores score_revisions https://ores.wikimedia.org "cxie@wikimedia.org analyzing article topics" enwiki drafttopic --parallel-requests=4 --input=input_rev_id.json > output_drafttopic_enwiki.json
# Get topic from ORES draft topic output
def get_pred_topic_best(input_json):
    """Return the highest-probability drafttopic label from one ORES response.

    Parameters
    ----------
    input_json : dict
        One parsed line of `ores score_revisions` output; expected shape is
        {'score': {'drafttopic': {'score': {'probability': {label: prob}}}}}.

    Returns
    -------
    str
        The label with the largest probability, or 'Unknown' when the
        response is malformed (missing keys) or has no probabilities.
    """
    try:
        topics = input_json['score']['drafttopic']['score']['probability']
        # max() with a key is O(n); the original sorted the whole dict just to
        # take the first element. An empty dict raises ValueError here (the
        # original raised IndexError), mapping to 'Unknown' either way.
        best = max(topics, key=topics.get)
    except (ValueError, KeyError):
        best = 'Unknown'
    return best
# Parse the ORES bulk-scoring output (one JSON object per line) into a
# rev_id -> topic DataFrame.
# Fix: DataFrame.append inside a loop is O(n^2) (it copies the frame each
# iteration) and was removed in pandas 2.0 — accumulate plain rows in a list
# and build the frame once instead.
rows = []
with open('output_drafttopic_enwiki.json') as json_file:
    for line in json_file:
        try:
            ores_results = json.loads(line)
            rows.append([ores_results['rev_id'], get_pred_topic_best(ores_results)])
        except ValueError:
            # Print unparseable lines for inspection instead of aborting.
            print(line)
topic_df = pd.DataFrame(rows, columns=['rev_id', 'topic'])
topic_df.rev_id = topic_df.rev_id.astype(int)
# Attach topics to the page-view table; pages whose revision was not scored
# (NaN rev_id, or no matching row in topic_df) fall back to 'Unknown'.
enwiki_pv_rev_topic = enwiki_pv_rev.merge(topic_df, how = 'left', on='rev_id')
enwiki_pv_rev_topic['topic'] = enwiki_pv_rev_topic['topic'].fillna(value='Unknown')
# Proportion is relative to the top-1M-pages total, not to all enwiki views.
enwiki_pv_rev_topic['proportion']= enwiki_pv_rev_topic['pageviews']/enwiki_pv_rev_topic['pageviews'].sum()
# Preview: 50 most-viewed pages with their predicted topics.
enwiki_pv_rev_topic[['page_title','pageviews','proportion','topic']].sort_values(by='pageviews', ascending=False).reset_index(drop=True).head(50)
The table below shows the page views by topic for the top 1M pages viewed in March 2019 on English Wikipedia. Each topic's corresponding proportion of the total page views of the top 1M pages is also calculated. The Main page is excluded from this table, so the proportions do not sum to 100%.
# Page views and view share per mid-level topic, Main_Page excluded.
# Fix: selecting multiple groupby columns with a bare tuple
# (['pageviews', 'proportion'] inside single brackets) was deprecated in
# pandas 0.25 and removed in 2.0 — a list selection needs double brackets.
enwiki_pv_rev_topic_summary = (enwiki_pv_rev_topic[enwiki_pv_rev_topic.page_title != 'Main_Page']
                               .groupby('topic', as_index=False)[['pageviews', 'proportion']]
                               .sum()
                               .sort_values(by='pageviews', ascending=False))
enwiki_pv_rev_topic_summary
The table below aggregates the table above by WikiProject Directory (broad topics).
# Roll mid-level topics up to the WikiProject Directory top level: the broad
# topic is the prefix before the first '.' (e.g. 'Culture.Media' -> 'Culture').
enwiki_pv_rev_topic_summary['broad topic'] = enwiki_pv_rev_topic_summary.topic.str.split(pat=".", n=1, expand=True)[0]
# Fix: multi-column groupby selection needs a list in double brackets; the
# tuple form ['pageviews', 'proportion'] was removed in pandas 2.0.
enwiki_pv_rev_topic_summary.groupby('broad topic', as_index=False)[['pageviews', 'proportion']].sum().sort_values(by='pageviews', ascending=False)