English Wikipedia Page Views by Topics

https://phabricator.wikimedia.org/T221891

In this analysis, we use the ORES draft topic model to get the topics of articles viewed on English Wikipedia in March 2019.

The output topics are the mid-level categories of the WikiProject Directory (see the hierarchy).
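
As a rough illustration of the label format, and assuming every topic label has the form `<broad topic>.<mid-level topic>` (as the labels in the tables of this notebook do), a label can be split on its first dot. The helper name `split_topic` is hypothetical and is not part of ORES.

```python
def split_topic(label):
    """Split a topic label into (broad, mid-level) parts on the first dot."""
    broad, _, mid = label.partition(".")
    return broad, mid

# Labels taken from the tables in this notebook; "Unknown" has no mid-level part.
examples = ["Culture.Entertainment", "History_And_Society.Transportation", "Unknown"]
parsed = [split_topic(label) for label in examples]

for label, (broad, mid) in zip(examples, parsed):
    print(label, "->", broad, "|", mid)
```

Labels without a dot, such as `Unknown`, come back with an empty mid-level part.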

In [1]:
from IPython.display import HTML

HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
The raw code for this notebook is by default hidden for easier reading.
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code"></form>
''')
In [2]:
%load_ext sql_magic

import findspark, os
os.environ['SPARK_HOME'] = '/usr/lib/spark2'
findspark.init()
import pyspark
import pyspark.sql
conf = pyspark.SparkConf().setMaster("yarn")  # Use master yarn here if you are going to query large datasets.
conf.set('spark.executor.memory', '8g')
conf.set('spark.yarn.executor.memoryOverhead', '1024')
conf.set('spark.executor.cores', '4')
conf.set('spark.dynamicAllocation.maxExecutors', '32')
conf.set('spark.driver.memory', '4g')
conf.set('spark.driver.maxResultSize', '10g')
conf.set('spark.logConf', True)
sc = pyspark.SparkContext(conf=conf)
spark_hive = pyspark.sql.HiveContext(sc)

%config SQL.conn_name = 'spark_hive'
In [3]:
import requests
import pandas as pd
import json
import matplotlib.pyplot as plt
In [ ]:
%%read_sql enwiki_pageviews -d
select page_id, sum(view_count) as pageviews
from wmf.pageview_hourly
where year=2019 and month=3
and namespace_id = 0
and project = 'en.wikipedia'
and agent_type = 'user'
group by page_id
In [6]:
enwiki_pageviews['proportion']= enwiki_pageviews['pageviews']/enwiki_pageviews['pageviews'].sum()
enwiki_pageviews = enwiki_pageviews.sort_values(by='pageviews', ascending=False)
In [7]:
print('Total page views: ' + str(enwiki_pageviews.pageviews.sum()))
Total page views: 7495953809
In [8]:
print('Number of unique pages: ' + str(enwiki_pageviews.shape[0]))
Number of unique pages: 7162726
In [9]:
print('Top 1M pages account for ' + str(round(enwiki_pageviews.proportion[:1000000].sum() * 100,2)) + '% of total page views.')
Top 1M pages account for 92.01% of total page views.
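
The top-N share computed above can be sketched on a toy table. The view counts below are made up for illustration, and the helper `top_n_share` is hypothetical. The frame is assumed to be sorted by page views in descending order, as in the cell above.

```python
import pandas as pd

# Toy view counts (illustrative values, not real data).
toy = pd.DataFrame({"page_id": [1, 2, 3, 4, 5],
                    "pageviews": [50, 30, 10, 6, 4]})
toy = toy.sort_values(by="pageviews", ascending=False).reset_index(drop=True)
toy["proportion"] = toy["pageviews"] / toy["pageviews"].sum()
toy["cumulative_share"] = toy["proportion"].cumsum()

def top_n_share(df, n):
    """Share of all views captured by the first n rows of a frame sorted descending."""
    return df["proportion"].iloc[:n].sum()

share_top2 = top_n_share(toy, 2)
print(str(round(share_top2 * 100, 2)) + "% of views come from the top 2 pages")
```

Using `.iloc` makes it explicit that the slice is positional, which matters once the frame has been re-sorted.
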
In [ ]:
# Page views distribution
# iloc[1:1000] skips row 0, the most-viewed page (Main_Page), so 999 pages are plotted
distr_plot = enwiki_pageviews.iloc[1:1000].hist(column='pageviews', bins=100, grid=False, figsize=(15,8))
plt.title('Page Views Distribution of the Top 1000 Pages')
plt.xlabel('Page Views')
plt.ylabel('Number of Pages')
In [ ]:
%%read_sql enwiki_pv_rev -d
with v as (
    select page_id, sum(view_count) as pageviews
    from wmf.pageview_hourly
    where year=2019 and month=3
    and namespace_id = 0
    and project = 'en.wikipedia'
    and agent_type = 'user'
    group by page_id
    order by pageviews desc
    limit 1000000
), p as (
    select page_id, page_title, page_latest
    from wmf_raw.mediawiki_page
    where wiki_db = 'enwiki'
    and snapshot = '2019-03'
    and page_id is not null
    and page_namespace = 0
    and not page_is_redirect
)

select v.page_id, p.page_title, p.page_latest as rev_id, v.pageviews
from v left join p on v.page_id=p.page_id
In [19]:
# Save rev_id into a json file
enwiki_pv_rev[['rev_id']].dropna().astype('int64').to_json(path_or_buf='input_rev_id.json', orient='records', lines=True)
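
To show what the file written above looks like, here is a minimal sketch using the same `to_json` options on a toy frame. The revision IDs are hypothetical. We assume the scoring utility expects one JSON object per line with a `rev_id` field, which is the layout this call produces.

```python
import json
import pandas as pd

# Hypothetical revision IDs; the missing value stands for a page without a matching revision.
toy = pd.DataFrame({"rev_id": [1001.0, None, 1003.0]})

# With no path given, to_json returns the JSON-lines text instead of writing a file.
text = toy[["rev_id"]].dropna().astype("int64").to_json(orient="records", lines=True)

records = [json.loads(line) for line in text.splitlines() if line.strip()]
print(records)
```

The `dropna()` call removes rows with no revision ID, and `astype('int64')` keeps the IDs from being written as floats.
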
In [ ]:
%%bash
ores score_revisions https://ores.wikimedia.org "cxie@wikimedia.org analyzing article topics" enwiki drafttopic --parallel-requests=4 --input=input_rev_id.json > output_drafttopic_enwiki.json
In [22]:
# Get the highest-probability topic from the ORES draft topic output
def get_pred_topic_best(input_json):
    try:
        topics = input_json['score']['drafttopic']['score']['probability']
        # Key with the largest probability; an empty dict raises ValueError
        best = max(topics, key=topics.get)
    except (KeyError, ValueError):
        best = 'Unknown'
    return best

rows = []
with open('output_drafttopic_enwiki.json') as json_file:
    for line in json_file:
        try:
            ores_results = json.loads(line)
            rows.append((ores_results['rev_id'], get_pred_topic_best(ores_results)))
        except ValueError:
            # Print lines that are not valid JSON so they can be inspected
            print(line)

# Build the frame once; DataFrame.append in a loop is slow and was removed in pandas 2.0
topic_df = pd.DataFrame(rows, columns=['rev_id', 'topic'])
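
The selection rule used above (keep the topic with the highest predicted probability, fall back to 'Unknown') can be tried on hand-written records. This is a sketch only: the nesting of the successful record follows the keys read by the parsing function, while the probabilities, the revision IDs and the layout of the failed record are assumptions made for illustration. The name `best_topic` is hypothetical.

```python
def best_topic(record):
    """Return the highest-probability topic of one score record, or 'Unknown'."""
    try:
        probabilities = record["score"]["drafttopic"]["score"]["probability"]
        return max(probabilities, key=probabilities.get)
    except (KeyError, ValueError):
        # KeyError: an expected key is missing; ValueError: no probabilities at all.
        return "Unknown"

# Hypothetical records; the probabilities are made up for illustration.
scored = {"rev_id": 1001, "score": {"drafttopic": {"score": {"probability": {
    "Culture.Entertainment": 0.81, "Culture.Broadcasting": 0.12, "STEM.Technology": 0.02}}}}}
failed = {"rev_id": 1002, "score": {"drafttopic": {"error": {"message": "hypothetical error"}}}}

results = [(record["rev_id"], best_topic(record)) for record in (scored, failed)]
print(results)
```

Because only the single best topic is kept, an article that scores highly on several topics is still counted under one of them.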
In [25]:
topic_df.rev_id=topic_df.rev_id.astype(int)
enwiki_pv_rev_topic = enwiki_pv_rev.merge(topic_df, how = 'left', on='rev_id')
enwiki_pv_rev_topic['topic'] = enwiki_pv_rev_topic['topic'].fillna(value='Unknown')
enwiki_pv_rev_topic['proportion']= enwiki_pv_rev_topic['pageviews']/enwiki_pv_rev_topic['pageviews'].sum()

Top 50 articles read in March 2019 on English Wikipedia

In [29]:
enwiki_pv_rev_topic[['page_title','pageviews','proportion','topic']].sort_values(by='pageviews', ascending=False).reset_index(drop=True).head(50)
Out[29]:
page_title pageviews proportion topic
0 Main_Page 479884777 0.069575 Culture.Language and literature
1 Captain_Marvel_(film) 6862452 0.000995 Culture.Entertainment
2 Luke_Perry 6323974 0.000917 Culture.Internet culture
3 Us_(2019_film) 4496740 0.000652 Culture.Entertainment
4 Freddie_Mercury 4219683 0.000612 Culture.Language and literature
5 Deaths_in_2019 3465759 0.000502 Culture.Language and literature
6 Murder_of_Dee_Dee_Blanchard 2958431 0.000429 Culture.Language and literature
7 Brie_Larson 2932960 0.000425 Culture.Entertainment
8 Mötley_Crüe 2898841 0.000420 Culture.Performing arts
9 Boeing_737_MAX 2678576 0.000388 History_And_Society.Transportation
10 XHamster 2583785 0.000375 Culture.Internet culture
11 Disappearance_of_Madeleine_McCann 2549882 0.000370 Geography.Europe
12 Louis_Tomlinson 2498069 0.000362 Culture.Performing arts
13 Kayden_Boche 2493771 0.000362 Culture.Language and literature
14 Michael_Jackson 2445447 0.000355 Culture.Language and literature
15 Momo_Challenge_hoax 2320873 0.000336 Culture.Internet culture
16 Lori_Loughlin 2261315 0.000328 Culture.Broadcasting
17 List_of_Marvel_Cinematic_Universe_films 2183426 0.000317 Culture.Entertainment
18 Nikki_Sixx 2182354 0.000316 Culture.Performing arts
19 Elizabeth_Holmes 2147236 0.000311 History_And_Society.Business and economics
20 Avengers:_Endgame 2096072 0.000304 Culture.Entertainment
21 Vince_Neil 2048563 0.000297 Culture.Performing arts
22 The_Umbrella_Academy_(TV_series) 2010815 0.000292 Culture.Broadcasting
23 Bible 1977906 0.000287 Culture.Philosophy and religion
24 Tommy_Lee 1962732 0.000285 Culture.Performing arts
25 Keith_Flint 1953548 0.000283 Culture.Language and literature
26 Bonnie_and_Clyde 1945857 0.000282 Culture.Language and literature
27 Mick_Mars 1939377 0.000281 Culture.Performing arts
28 Wade_Robson 1897451 0.000275 Culture.Performing arts
29 Christchurch_mosque_shootings 1866022 0.000271 Geography.Countries
30 Beto_O'Rourke 1850413 0.000268 Geography.Countries
31 WrestleMania_35 1748988 0.000254 Culture.Entertainment
32 Billie_Eilish 1737580 0.000252 Culture.Performing arts
33 2019_Indian_general_election 1721628 0.000250 Geography.Countries
34 Saint_Patrick's_Day 1660495 0.000241 Geography.Europe
35 Lady_Gaga 1641279 0.000238 Culture.Performing arts
36 Queen_(band) 1482866 0.000215 Culture.Performing arts
37 Alexandria_Ocasio-Cortez 1475223 0.000214 History_And_Society.Politics and government
38 Leaving_Neverland 1467295 0.000213 Culture.Internet culture
39 United_States 1444284 0.000209 Geography.Countries
40 Nick_Jonas 1414818 0.000205 Culture.Internet culture
41 Triple_Frontier_(film) 1406955 0.000204 Culture.Entertainment
42 Google 1403287 0.000203 STEM.Technology
43 Alex_Trebek 1397506 0.000203 Culture.Language and literature
44 Kate_Beckinsale 1397442 0.000203 Culture.Entertainment
45 Game_of_Thrones 1354419 0.000196 Culture.Entertainment
46 Jacinda_Ardern 1340463 0.000194 Geography.Oceania
47 A_Star_Is_Born_(2018_film) 1329926 0.000193 Culture.Entertainment
48 Shazam!_(film) 1325363 0.000192 Culture.Entertainment
49 Felicity_Huffman 1263590 0.000183 Culture.Entertainment

Topics by page views

The table below shows page views by topic for the top 1M pages viewed on English Wikipedia in March 2019, along with each topic's proportion of the total page views of those top 1M pages. The Main Page is excluded from this table, so the proportions do not sum to 100%.
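
As a quick sanity check, the proportions that remain after excluding the Main Page should add up to one minus the Main Page's own share. The sketch below uses only figures copied from this notebook: the Main Page proportion from the top 50 list and the broad-topic proportions reported at the end.

```python
# Figures copied from the tables in this notebook.
main_page_share = 0.069575
broad_topic_shares = {
    "Culture": 0.514216,
    "Geography": 0.175243,
    "STEM": 0.121473,
    "History_And_Society": 0.117551,
    "Assistance": 0.001333,
    "Unknown": 0.000609,
}

remaining = sum(broad_topic_shares.values())
print(round(remaining, 6), round(1 - main_page_share, 6))
```

Both values come out at about 93% of the page views of the top 1M pages.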

In [27]:
enwiki_pv_rev_topic_summary = (enwiki_pv_rev_topic[enwiki_pv_rev_topic.page_title != 'Main_Page']
          .groupby('topic', as_index = False)[['pageviews', 'proportion']]
          .sum()
          .sort_values(by='pageviews', ascending=False))
enwiki_pv_rev_topic_summary
Out[27]:
topic pageviews proportion
19 Geography.Countries 876013906 0.127007
7 Culture.Entertainment 791003561 0.114682
10 Culture.Language and literature 789910351 0.114523
12 Culture.Performing arts 509424862 0.073858
15 Culture.Sports 467226462 0.067740
5 Culture.Broadcasting 440542717 0.063871
20 Geography.Europe 257480010 0.037330
36 STEM.Medicine 216034124 0.031321
41 STEM.Technology 177204035 0.025692
29 History_And_Society.Transportation 171271898 0.024831
24 History_And_Society.Business and economics 169232555 0.024536
27 History_And_Society.Military and warfare 155619000 0.022562
30 STEM.Biology 146682206 0.021266
13 Culture.Philosophy and religion 141343937 0.020492
28 History_And_Society.Politics and government 141265006 0.020481
26 History_And_Society.History and society 136577117 0.019801
16 Culture.Visual arts 133202298 0.019312
9 Culture.Internet culture 129473614 0.018771
8 Culture.Food and drink 104995798 0.015223
31 STEM.Chemistry 67170928 0.009739
38 STEM.Physics 57990065 0.008408
23 Geography.Oceania 49815409 0.007222
35 STEM.Mathematics 46069158 0.006679
25 History_And_Society.Education 36830555 0.005340
40 STEM.Space 33884499 0.004913
32 STEM.Engineering 29384135 0.004260
39 STEM.Science 26784289 0.003883
14 Culture.Plastic arts 17006891 0.002466
33 STEM.Geosciences 14654692 0.002125
37 STEM.Meteorology 12751036 0.001849
6 Culture.Crafts and hobbies 11105379 0.001610
22 Geography.Maps 9339909 0.001354
17 Geography.Bodies of water 8718838 0.001264
11 Culture.Media 7327281 0.001062
21 Geography.Landforms 6971947 0.001011
42 STEM.Time 6109674 0.000886
3 Assistance.Maintenance 4603435 0.000667
43 Unknown 4200563 0.000609
4 Culture.Arts 4177723 0.000606
1 Assistance.Contents systems 3176108 0.000460
34 STEM.Information science 3122904 0.000453
2 Assistance.Files 898307 0.000130
0 Assistance.Article improvement and grading 519105 0.000075
18 Geography.Cities 373871 0.000054

The table below aggregates the table above by the top-level categories of the WikiProject Directory (broad topics).

In [28]:
enwiki_pv_rev_topic_summary['broad topic'] = enwiki_pv_rev_topic_summary.topic.str.split(pat=".", n=1, expand=True)[0]
enwiki_pv_rev_topic_summary.groupby('broad topic', as_index = False)[['pageviews', 'proportion']].sum().sort_values(by='pageviews', ascending=False)
Out[28]:
broad topic pageviews proportion
1 Culture 3546740874 0.514216
2 Geography 1208713890 0.175243
4 STEM 837841745 0.121473
3 History_And_Society 810796131 0.117551
0 Assistance 9196955 0.001333
5 Unknown 4200563 0.000609