Topics of articles translated by Google from English Wikipedia

https://phabricator.wikimedia.org/T219660

In this analysis, we use the ORES draft topic model to identify the topics of English Wikipedia articles that were translated by Google and viewed in March 2019. The predicted topics are the mid-level categories of the WikiProject directory (see the hierarchy).

The goal of this analysis is to find the topics that readers are interested in but for which articles are not available (or are of poor quality) in their local language. With this result, we can recommend those popular topics to editors in local communities.

It is worth noting that the way a translation is initiated reflects different reader motivations, so we break down the analysis by these two types:

  • Pages translated by Toledo automatically (Google integrates automatically translated pages into search results). These are articles where 1) Google thinks the quality of the content in the local language is not as good as the translated page AND 2) users are interested in them and thus click through from the search results.
  • User-initiated translation. Users paste the article link into Google Translate, or click the "Translate this page" link in their search results. In this case, users are well aware that they are reading a translated article and are willing to put in more effort to do so, which indicates a stronger interest in the article. We break down this analysis by translation target language.
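As a rough illustration of this split, the sketch below classifies a referer URL the same way the queries later in this notebook do: a query string containing client=srp counts as Toledo, anything else as user-initiated, and the target language is read from tl with a fallback to hl. The function name classify_referer and the example URL are hypothetical, and this is plain Python rather than the Hive SQL actually used here.

```python
from urllib.parse import urlsplit, parse_qs


def classify_referer(referer):
    """Return (kind, target_language) for a Google Translate referer URL.

    kind is 'toledo' when the query string contains client=srp, else 'user'.
    target_language is the tl parameter, falling back to hl when tl is
    missing or empty; it is None when neither is present.
    """
    query = urlsplit(referer).query
    kind = 'toledo' if 'client=srp' in query else 'user'
    # parse_qs drops parameters with blank values by default, so an empty
    # "tl=" falls through to hl, like the tl/hl condition in the queries.
    params = parse_qs(query)
    target = (params.get('tl') or params.get('hl') or [None])[0]
    return kind, target


# Hypothetical referer, for illustration only.
example = 'https://translate.google.com/translate?hl=en&sl=en&tl=hi&u=https://en.wikipedia.org/wiki/India'
print(classify_referer(example))  # ('user', 'hi')
```

Matching client=srp as a substring mirrors the `like '%client=srp%'` condition, so it shares that condition's looseness: any query string that happens to contain the text is counted as Toledo.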

Key Take-aways:

  • 56.3% of the pageviews served by Toledo are STEM (science, technology and engineering) related, which seems to align with the information from Google.
  • Among STEM, Medicine is the most popular, followed by Biology.
  • For both Indonesian and Hindi users, Countries is the most-viewed topic among user-initiated translations of English Wikipedia articles.
  • Comparing articles served by Toledo (the majority of whose readers are Indonesian) with Indonesian user-initiated translations, it seems that Indonesian readers' demand for Culture-related content (43.2% of Indonesian user-initiated translation pageviews, versus 19.4% of Toledo pageviews) has not been fully met by Toledo yet.
In [1]:
from IPython.display import HTML

HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
The raw code for this notebook is by default hidden for easier reading.
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code"></form>
''')
In [2]:
%load_ext sql_magic

import findspark, os
os.environ['SPARK_HOME'] = '/usr/lib/spark2';
findspark.init()
import pyspark
import pyspark.sql
conf = pyspark.SparkConf().setMaster("yarn")  # Use master yarn here if you are going to query large datasets.
conf.set('spark.executor.memory', '4g')
conf.set('spark.executor.cores', '4')
conf.set('spark.driver.memory', '4g')
conf.set('spark.driver.maxResultSize', '4g')
conf.set('spark.logConf', True)
sc = pyspark.SparkContext(conf=conf)
spark_hive = pyspark.sql.HiveContext(sc)

%config SQL.conn_name = 'spark_hive'
In [3]:
import requests
import pandas as pd
import json

1. Translated pageviews by source languages

From the table below, we can see that the vast majority of the pages are translated from English to other languages, so for this first exploration we will only check the topics of articles on English Wikipedia.

In [4]:
%%read_sql translated_pv_by_host
select uri_host, client_srp, 
sum(count) as pageviews
from chelsyx.toledo_pageviews
where year=2019
and month = 3
group by uri_host, client_srp
order by pageviews desc
limit 15
Query started at 05:29:01 PM UTC; Query executed in 0.29 m
Out[4]:
uri_host client_srp pageviews
0 en.m.wikipedia.org False 5298730
1 en.m.wikipedia.org None 1592200
2 en.m.wikipedia.org True 816705
3 en.wikipedia.org False 739397
4 en.wikipedia.org None 287853
5 de.wikipedia.org False 202173
6 fr.wikipedia.org False 154832
7 ru.wikipedia.org False 119574
8 de.wikipedia.org None 118290
9 es.wikipedia.org False 111514
10 ja.wikipedia.org False 107760
11 it.wikipedia.org False 86300
12 fr.wikipedia.org None 72633
13 zh.wikipedia.org False 53637
14 es.wikipedia.org None 45966
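
As a rough check of the "vast majority" claim, the rows above can be summed by language. This is a minimal sketch in plain Python with the counts copied by hand from the table; the helper name language_of is only for illustration, and because the query has `limit 15`, the share is relative to these 15 rows rather than to all translated pageviews.

```python
from collections import defaultdict

# (uri_host, pageviews) pairs copied from the output table above; the
# client_srp column is ignored here.
rows = [
    ('en.m.wikipedia.org', 5298730), ('en.m.wikipedia.org', 1592200),
    ('en.m.wikipedia.org', 816705), ('en.wikipedia.org', 739397),
    ('en.wikipedia.org', 287853), ('de.wikipedia.org', 202173),
    ('fr.wikipedia.org', 154832), ('ru.wikipedia.org', 119574),
    ('de.wikipedia.org', 118290), ('es.wikipedia.org', 111514),
    ('ja.wikipedia.org', 107760), ('it.wikipedia.org', 86300),
    ('fr.wikipedia.org', 72633), ('zh.wikipedia.org', 53637),
    ('es.wikipedia.org', 45966),
]


def language_of(uri_host):
    """Return the language subdomain, e.g. 'en' for 'en.m.wikipedia.org'."""
    return uri_host.split('.')[0]


totals = defaultdict(int)
for host, views in rows:
    totals[language_of(host)] += views

grand_total = sum(totals.values())
shares = {lang: views / grand_total for lang, views in totals.items()}

# Print the languages from largest to smallest share.
for lang, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f'{lang}: {share:.1%}')
```

Within these 15 rows, English (desktop and mobile hosts combined) accounts for roughly 89% of the pageviews.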

2. Pages translated by Toledo

These are topics that:

  • Google thinks are not covered as well by the content in the local language as by the translated page AND
  • Users are interested in and thus click through from the search results
In [5]:
%%read_sql translated_page_toledo -d
select w.namespace_id, w.page_id, p.page_title, p.page_latest as rev_id, count(*) as pageviews
from wmf.webrequest w join wmf_raw.mediawiki_page p on (w.page_id=p.page_id and w.namespace_id=p.page_namespace and 
                                                       p.wiki_db='enwiki' and p.snapshot='2019-03')
where
    year = 2019
    and month = 3
    and is_pageview
    and w.namespace_id=0
    and x_analytics_map['translationengine'] = 'GT'
    and parse_url(referer, 'QUERY') like '%client=srp%'
    and uri_host in ('en.wikipedia.org','en.m.wikipedia.org')
group by w.namespace_id, w.page_id, p.page_title, p.page_latest
Query started at 05:29:18 PM UTC; Query executed in 24.43 m
In [6]:
print('Number of unique pages: ' + str(translated_page_toledo.shape[0]))
Number of unique pages: 195003
In [7]:
# Save rev_id into a json file
translated_page_toledo[['rev_id']].to_json(path_or_buf='input_rev_id.json', orient='records', lines=True)
In [8]:
%%bash
ores score_revisions https://ores.wikimedia.org "cxie@wikimedia.org analyzing article topics" enwiki drafttopic --parallel-requests=4 --input=input_rev_id.json > output_drafttopic_toledo.json
2019-04-25 06:02:03,951 INFO:ores.utilities.score_revisions -- Reading input from input_rev_id.json
2019-04-25 06:02:03,951 INFO:ores.utilities.score_revisions -- Writing output to from <stdout>
In [7]:
# Get topic from ORES draft topic output

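def get_pred_topic_best(input_json):
    """Return the highest-probability drafttopic label, or 'Unknown' if no score is available."""
    try:
        topics = input_json['score']['drafttopic']['score']['probability']
        best = sorted(topics, key=topics.get, reverse=True)[0]
    except (IndexError, KeyError):
        best = 'Unknown'
    return best

# Collect (rev_id, topic) pairs in a list and build the DataFrame once;
# DataFrame.append was removed in pandas 2.0 and is slow when called per row.
rows = []
with open('output_drafttopic_toledo.json') as json_file:
    for line in json_file:
        try:
            ores_results = json.loads(line)
            rows.append((ores_results['rev_id'], get_pred_topic_best(ores_results)))
        except ValueError:
            # Show lines that are not valid JSON instead of failing the whole run
            print(line)

topic_df = pd.DataFrame(rows, columns=['rev_id', 'topic'])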

The table below shows the number of pageviews and the corresponding proportions in March 2019, broken down by topic (mid-level categories of the WikiProject directory). Please note that a topic of "Unknown" means the ORES draft topic model could not return a topic for those articles.
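
To make the "Unknown" case concrete, here is a minimal sketch of how one line of the ORES output maps to a topic. The nested score/drafttopic/score/probability path comes from get_pred_topic_best above; the function name best_topic, the two sample records, their probabilities, and the shape of the error record are all made up for illustration.

```python
import json


def best_topic(record):
    """Return the highest-probability topic, or 'Unknown' if none is available."""
    try:
        probabilities = record['score']['drafttopic']['score']['probability']
        return max(probabilities, key=probabilities.get)
    except (KeyError, TypeError, ValueError):
        # KeyError/TypeError: the expected path is missing or not a dict;
        # ValueError: the probability dict is empty.
        return 'Unknown'


# Hypothetical output lines: one scored revision, one without a score.
sample_lines = [
    '{"rev_id": 1, "score": {"drafttopic": {"score": {"probability": '
    '{"STEM.Medicine": 0.9, "STEM.Biology": 0.4}}}}}',
    '{"rev_id": 2, "score": {"drafttopic": {"error": {"type": "TimeoutError"}}}}',
]

results = []
for line in sample_lines:
    record = json.loads(line)
    results.append((record['rev_id'], best_topic(record)))

print(results)  # [(1, 'STEM.Medicine'), (2, 'Unknown')]
```

Only the single most probable label is kept, so an article that plausibly belongs to several topics contributes all of its pageviews to one of them.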

The most popular articles served by Toledo are Medicine related (23.5% of pageviews), followed by Countries (12%) and Biology (11.1%).

In [8]:
topic_df.rev_id=topic_df.rev_id.astype(int)
translated_page_toledo_topic = (translated_page_toledo
          .merge(topic_df, how = 'left', on='rev_id')
          .groupby('topic', as_index = False)['pageviews']
          .sum()
          .sort_values(by='pageviews', ascending=False)
         )
translated_page_toledo_topic['proportion']= translated_page_toledo_topic['pageviews']/translated_page_toledo_topic['pageviews'].sum()
translated_page_toledo_topic
Out[8]:
topic pageviews proportion
36 STEM.Medicine 191313 0.235254
19 Geography.Countries 97464 0.119850
30 STEM.Biology 90225 0.110948
41 STEM.Technology 58771 0.072270
10 Culture.Language and literature 55456 0.068193
31 STEM.Chemistry 54125 0.066556
20 Geography.Europe 22294 0.027415
7 Culture.Entertainment 21822 0.026834
38 STEM.Physics 20398 0.025083
26 History_And_Society.History and society 19846 0.024404
13 Culture.Philosophy and religion 16283 0.020023
27 History_And_Society.Military and warfare 16256 0.019990
12 Culture.Performing arts 15262 0.018767
24 History_And_Society.Business and economics 14016 0.017235
16 Culture.Visual arts 13396 0.016473
15 Culture.Sports 13051 0.016049
35 STEM.Mathematics 11663 0.014342
8 Culture.Food and drink 9063 0.011145
29 History_And_Society.Transportation 8504 0.010457
32 STEM.Engineering 8261 0.010158
39 STEM.Science 7234 0.008896
40 STEM.Space 6096 0.007496
5 Culture.Broadcasting 6087 0.007485
28 History_And_Society.Politics and government 5875 0.007224
33 STEM.Geosciences 5562 0.006839
9 Culture.Internet culture 3894 0.004788
23 Geography.Oceania 3867 0.004755
22 Geography.Maps 3337 0.004103
25 History_And_Society.Education 2423 0.002980
37 STEM.Meteorology 2379 0.002925
14 Culture.Plastic arts 1277 0.001570
6 Culture.Crafts and hobbies 1168 0.001436
3 Assistance.Maintenance 1098 0.001350
17 Geography.Bodies of water 1045 0.001285
42 STEM.Time 939 0.001155
11 Culture.Media 862 0.001060
1 Assistance.Contents systems 701 0.000862
34 STEM.Information science 572 0.000703
21 Geography.Landforms 534 0.000657
4 Culture.Arts 276 0.000339
2 Assistance.Files 141 0.000173
43 Unknown 133 0.000164
18 Geography.Cities 133 0.000164
0 Assistance.Article improvement and grading 117 0.000144

The table below aggregates the table above by the top level of the WikiProject directory (broad topics). In terms of broad topics, the most popular articles served by Toledo are STEM (science, technology and engineering) related (56.3% of pageviews), followed by Culture (19.4%) and Geography (15.8%). This seems to align with the information Google has given us.

In [9]:
translated_page_toledo_topic['broad topic'] = translated_page_toledo_topic.topic.str.split(pat=".", n=1, expand=True)[0]
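# Select the columns with a list; indexing a groupby with a bare tuple of names fails in current pandas
translated_page_toledo_topic.groupby('broad topic', as_index = False)[['pageviews', 'proportion']].sum().sort_values(by='pageviews', ascending=False)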
Out[9]:
broad topic pageviews proportion
4 STEM 457538 0.562626
1 Culture 157897 0.194163
2 Geography 128674 0.158228
3 History_And_Society 66920 0.082290
0 Assistance 2057 0.002529
5 Unknown 133 0.000164
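
The percentages quoted for the broad topics can be re-derived from the pageview counts in the table above. The sketch below is plain Python with the six counts copied by hand; the dictionary name toledo_broad is only for illustration.

```python
# Broad-topic pageviews for pages served by Toledo, copied from the table above.
toledo_broad = {
    'STEM': 457538,
    'Culture': 157897,
    'Geography': 128674,
    'History_And_Society': 66920,
    'Assistance': 2057,
    'Unknown': 133,
}

total = sum(toledo_broad.values())
proportions = {topic: views / total for topic, views in toledo_broad.items()}

# Print each broad topic as a percentage of all Toledo pageviews.
for topic, share in sorted(proportions.items(), key=lambda item: item[1], reverse=True):
    print(f'{topic}: {share:.1%}')
```

The first three lines printed are STEM at 56.3%, Culture at 19.4% and Geography at 15.8%, matching the figures in the text.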

3. Indonesian users initiated translation

In this section, we look at the topics of articles for which Indonesian users initiate translations from English. Users paste the article link into Google Translate, or click the "Translate this page" link in their search results. In this case, users are well aware that they are reading a translated article and are willing to put in more effort to do so, which indicates a stronger interest in the article.

In [10]:
%%read_sql translated_page_id -d
select w.namespace_id, w.page_id, p.page_title, p.page_latest as rev_id, count(*) as pageviews
from wmf.webrequest w join wmf_raw.mediawiki_page p on (w.page_id=p.page_id and w.namespace_id=p.page_namespace and 
                                                       p.wiki_db='enwiki' and p.snapshot='2019-03')
where
    year = 2019
    and month = 3
    and is_pageview
    and w.namespace_id=0
    and x_analytics_map['translationengine'] = 'GT'
    and parse_url(referer, 'QUERY') not like '%client=srp%'
    and uri_host in ('en.wikipedia.org','en.m.wikipedia.org')
    and (
        regexp_extract(parse_url(referer, 'QUERY'), '(^|[&?])tl=([^&]*)', 2) = 'id' 
        or ((regexp_extract(parse_url(referer, 'QUERY'), '(^|[&?])tl=([^&]*)', 2) is null or regexp_extract(parse_url(referer, 'QUERY'), '(^|[&?])tl=([^&]*)', 2)='') 
             and regexp_extract(parse_url(referer, 'QUERY'), '(^|[&?])hl=([^&]*)', 2) = 'id')
        )
group by w.namespace_id, w.page_id, p.page_title, p.page_latest
Query started at 06:24:42 PM UTC; Query executed in 15.24 m
In [11]:
print('Number of unique pages: ' + str(translated_page_id.shape[0]))
Number of unique pages: 249252
In [15]:
# Save rev_id into a json file
translated_page_id[['rev_id']].to_json(path_or_buf='input_rev_id.json', orient='records', lines=True)
In [16]:
%%bash
ores score_revisions https://ores.wikimedia.org "cxie@wikimedia.org analyzing article topics" enwiki drafttopic --parallel-requests=4 --input=input_rev_id.json > output_drafttopic_id.json
2019-04-25 06:38:33,177 INFO:ores.utilities.score_revisions -- Reading input from input_rev_id.json
2019-04-25 06:38:33,177 INFO:ores.utilities.score_revisions -- Writing output to from <stdout>

The table below shows the number of pageviews and the corresponding proportions in March 2019, broken down by topic (mid-level categories of the WikiProject directory). Please note that a topic of "Unknown" means the ORES draft topic model could not return a topic for those articles.

The most popular articles are Countries related (15.5% of pageviews), followed by Entertainment (10.7%) and Language and literature (9%).

In [12]:
# Get topic from ORES draft topic output
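# Collect (rev_id, topic) pairs in a list and build the DataFrame once;
# DataFrame.append was removed in pandas 2.0 and is slow when called per row.
rows = []
with open('output_drafttopic_id.json') as json_file:
    for line in json_file:
        try:
            ores_results = json.loads(line)
            rows.append((ores_results['rev_id'], get_pred_topic_best(ores_results)))
        except ValueError:
            # Show lines that are not valid JSON instead of failing the whole run
            print(line)

topic_df = pd.DataFrame(rows, columns=['rev_id', 'topic'])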
In [13]:
topic_df.rev_id=topic_df.rev_id.astype(int)
translated_page_id_topic = (translated_page_id
          .merge(topic_df, how = 'left', on='rev_id')
          .groupby('topic', as_index = False)['pageviews']
          .sum()
          .sort_values(by='pageviews', ascending=False)
         )
translated_page_id_topic['proportion']= translated_page_id_topic['pageviews']/translated_page_id_topic['pageviews'].sum()
translated_page_id_topic
Out[13]:
topic pageviews proportion
19 Geography.Countries 107235 0.154901
7 Culture.Entertainment 73774 0.106566
10 Culture.Language and literature 62470 0.090238
12 Culture.Performing arts 40229 0.058111
15 Culture.Sports 39626 0.057240
36 STEM.Medicine 37125 0.053627
30 STEM.Biology 34486 0.049815
27 History_And_Society.Military and warfare 28867 0.041698
29 History_And_Society.Transportation 26110 0.037716
41 STEM.Technology 24823 0.035857
24 History_And_Society.Business and economics 24269 0.035057
16 Culture.Visual arts 23786 0.034359
20 Geography.Europe 22121 0.031954
26 History_And_Society.History and society 20565 0.029706
13 Culture.Philosophy and religion 17405 0.025141
5 Culture.Broadcasting 15982 0.023086
31 STEM.Chemistry 13582 0.019619
8 Culture.Food and drink 11299 0.016321
9 Culture.Internet culture 9519 0.013750
28 History_And_Society.Politics and government 8623 0.012456
32 STEM.Engineering 7760 0.011209
38 STEM.Physics 6804 0.009828
23 Geography.Oceania 5023 0.007256
25 History_And_Society.Education 4671 0.006747
40 STEM.Space 3844 0.005553
39 STEM.Science 3673 0.005306
35 STEM.Mathematics 3659 0.005285
33 STEM.Geosciences 2927 0.004228
14 Culture.Plastic arts 2434 0.003516
37 STEM.Meteorology 1513 0.002186
22 Geography.Maps 1501 0.002168
6 Culture.Crafts and hobbies 1393 0.002012
3 Assistance.Maintenance 981 0.001417
11 Culture.Media 868 0.001254
17 Geography.Bodies of water 748 0.001080
42 STEM.Time 487 0.000703
21 Geography.Landforms 480 0.000693
4 Culture.Arts 435 0.000628
34 STEM.Information science 406 0.000586
1 Assistance.Contents systems 393 0.000568
18 Geography.Cities 124 0.000179
2 Assistance.Files 103 0.000149
43 Unknown 100 0.000144
0 Assistance.Article improvement and grading 59 0.000085

The table below aggregates the table above by the top level of the WikiProject directory (broad topics). In terms of broad topics, the most popular articles are Culture related (43.2% of pageviews), followed by STEM (20.4%) and Geography (19.8%). Since the majority of Toledo readers are also Indonesian, comparing this with the articles served by Toledo (56.3% STEM, 19.4% Culture) suggests that Indonesian readers' demand for Culture-related content has not been fully met by Toledo yet.

In [14]:
translated_page_id_topic['broad topic'] = translated_page_id_topic.topic.str.split(pat=".", n=1, expand=True)[0]
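# Select the columns with a list; indexing a groupby with a bare tuple of names fails in current pandas
translated_page_id_topic.groupby('broad topic', as_index = False)[['pageviews', 'proportion']].sum().sort_values(by='pageviews', ascending=False)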
Out[14]:
broad topic pageviews proportion
1 Culture 299220 0.432223
4 STEM 141089 0.203803
2 Geography 137232 0.198231
3 History_And_Society 113105 0.163380
0 Assistance 1536 0.002219
5 Unknown 100 0.000144
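
The comparison with Toledo can be made explicit by subtracting the broad-topic shares of the two groups. This is a minimal sketch in plain Python, with the proportions copied by hand from the two aggregate tables in this notebook; the variable names are only for illustration. A gap in shares is suggestive, but on its own it does not prove unmet demand.

```python
# Broad-topic shares copied from the aggregate tables in sections 2 and 3.
toledo = {
    'STEM': 0.562626,
    'Culture': 0.194163,
    'Geography': 0.158228,
    'History_And_Society': 0.082290,
}
indonesian_user_initiated = {
    'Culture': 0.432223,
    'STEM': 0.203803,
    'Geography': 0.198231,
    'History_And_Society': 0.163380,
}

# Positive gap: the topic is a larger share of Indonesian user-initiated
# pageviews than of Toledo pageviews.
gaps = {topic: indonesian_user_initiated[topic] - toledo[topic] for topic in toledo}

for topic, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f'{topic}: {gap * 100:+.1f} percentage points')
```

Culture shows the largest positive gap (about 24 percentage points), while STEM is about 36 percentage points lower among Indonesian user-initiated translations than among Toledo pageviews.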

4. Hindi users initiated translation

In this section, we look at the topics of articles for which Hindi users initiate translations from English. Users paste the article link into Google Translate, or click the "Translate this page" link in their search results. In this case, users are well aware that they are reading a translated article and are willing to put in more effort to do so, which indicates a stronger interest in the article.

In [15]:
%%read_sql translated_page_hi -d
select w.namespace_id, w.page_id, p.page_title, p.page_latest as rev_id, count(*) as pageviews
from wmf.webrequest w join wmf_raw.mediawiki_page p on (w.page_id=p.page_id and w.namespace_id=p.page_namespace and 
                                                       p.wiki_db='enwiki' and p.snapshot='2019-03')
where
    year = 2019
    and month = 3
    and is_pageview
    and w.namespace_id=0
    and x_analytics_map['translationengine'] = 'GT'
    and parse_url(referer, 'QUERY') not like '%client=srp%'
    and uri_host in ('en.wikipedia.org','en.m.wikipedia.org')
    and (
        regexp_extract(parse_url(referer, 'QUERY'), '(^|[&?])tl=([^&]*)', 2) = 'hi' 
        or ((regexp_extract(parse_url(referer, 'QUERY'), '(^|[&?])tl=([^&]*)', 2) is null or regexp_extract(parse_url(referer, 'QUERY'), '(^|[&?])tl=([^&]*)', 2)='') 
             and regexp_extract(parse_url(referer, 'QUERY'), '(^|[&?])hl=([^&]*)', 2) = 'hi')
        )
group by w.namespace_id, w.page_id, p.page_title, p.page_latest
Query started at 07:07:00 PM UTC; Query executed in 14.27 m
In [16]:
print('Number of unique pages: ' + str(translated_page_hi.shape[0]))
Number of unique pages: 178329
In [20]:
# Save rev_id into a json file
translated_page_hi[['rev_id']].to_json(path_or_buf='input_rev_id.json', orient='records', lines=True)
In [21]:
%%bash
ores score_revisions https://ores.wikimedia.org "cxie@wikimedia.org analyzing article topics" enwiki drafttopic --parallel-requests=4 --input=input_rev_id.json > output_drafttopic_hi.json
2019-04-25 17:38:17,514 INFO:ores.utilities.score_revisions -- Reading input from input_rev_id.json
2019-04-25 17:38:17,514 INFO:ores.utilities.score_revisions -- Writing output to from <stdout>

The table below shows the number of pageviews and the corresponding proportions in March 2019, broken down by topic (mid-level categories of the WikiProject directory). Please note that a topic of "Unknown" means the ORES draft topic model could not return a topic for those articles.

The most popular articles are Countries related (25.5% of pageviews), followed by Medicine (8.9%) and Biology (7.2%).

In [17]:
# Get topic from ORES draft topic output
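# Collect (rev_id, topic) pairs in a list and build the DataFrame once;
# DataFrame.append was removed in pandas 2.0 and is slow when called per row.
rows = []
with open('output_drafttopic_hi.json') as json_file:
    for line in json_file:
        try:
            ores_results = json.loads(line)
            rows.append((ores_results['rev_id'], get_pred_topic_best(ores_results)))
        except ValueError:
            # Show lines that are not valid JSON instead of failing the whole run
            print(line)

topic_df = pd.DataFrame(rows, columns=['rev_id', 'topic'])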
In [18]:
topic_df.rev_id=topic_df.rev_id.astype(int)
translated_page_hi_topic = (translated_page_hi
          .merge(topic_df, how = 'left', on='rev_id')
          .groupby('topic', as_index = False)['pageviews']
          .sum()
          .sort_values(by='pageviews', ascending=False)
         )
translated_page_hi_topic['proportion']= translated_page_hi_topic['pageviews']/translated_page_hi_topic['pageviews'].sum()
translated_page_hi_topic
Out[18]:
topic pageviews proportion
19 Geography.Countries 205031 0.255371
36 STEM.Medicine 71635 0.089223
30 STEM.Biology 57729 0.071903
10 Culture.Language and literature 52215 0.065035
24 History_And_Society.Business and economics 46858 0.058363
26 History_And_Society.History and society 40072 0.049911
31 STEM.Chemistry 32089 0.039968
41 STEM.Technology 31559 0.039307
28 History_And_Society.Politics and government 27793 0.034617
13 Culture.Philosophy and religion 21281 0.026506
7 Culture.Entertainment 19453 0.024229
12 Culture.Performing arts 18407 0.022926
38 STEM.Physics 16815 0.020943
29 History_And_Society.Transportation 15887 0.019788
27 History_And_Society.Military and warfare 15196 0.018927
15 Culture.Sports 14328 0.017846
25 History_And_Society.Education 14318 0.017833
20 Geography.Europe 13763 0.017142
8 Culture.Food and drink 12409 0.015456
16 Culture.Visual arts 10917 0.013597
32 STEM.Engineering 10844 0.013506
39 STEM.Science 6848 0.008529
40 STEM.Space 5474 0.006818
35 STEM.Mathematics 5306 0.006609
5 Culture.Broadcasting 4837 0.006025
9 Culture.Internet culture 4614 0.005747
33 STEM.Geosciences 4128 0.005142
37 STEM.Meteorology 3002 0.003739
22 Geography.Maps 2943 0.003666
6 Culture.Crafts and hobbies 2576 0.003208
17 Geography.Bodies of water 2443 0.003043
11 Culture.Media 2415 0.003008
34 STEM.Information science 2353 0.002931
23 Geography.Oceania 2122 0.002643
14 Culture.Plastic arts 1209 0.001506
3 Assistance.Maintenance 905 0.001127
42 STEM.Time 811 0.001010
21 Geography.Landforms 751 0.000935
1 Assistance.Contents systems 463 0.000577
43 Unknown 385 0.000480
4 Culture.Arts 254 0.000316
18 Geography.Cities 157 0.000196
2 Assistance.Files 157 0.000196
0 Assistance.Article improvement and grading 123 0.000153

The table below aggregates the table above by the top level of the WikiProject directory (broad topics). In terms of broad topics, the most popular articles are STEM related (31% of pageviews), followed by Geography (28.3%), Culture (20.5%) and History and Society (19.9%).

In [19]:
translated_page_hi_topic['broad topic'] = translated_page_hi_topic.topic.str.split(pat=".", n=1, expand=True)[0]
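# Select the columns with a list; indexing a groupby with a bare tuple of names fails in current pandas
translated_page_hi_topic.groupby('broad topic', as_index = False)[['pageviews', 'proportion']].sum().sort_values(by='pageviews', ascending=False)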
Out[19]:
broad topic pageviews proportion
4 STEM 248593 0.309629
2 Geography 227210 0.282995
1 Culture 164915 0.205406
3 History_And_Society 160124 0.199438
0 Assistance 1648 0.002053
5 Unknown 385 0.000480