Topics of articles translated by Google from English Wikipedia¶

https://phabricator.wikimedia.org/T219660

In this analysis, we use the ORES draft topic model to get the topics of articles viewed and translated by Google in March 2019. The predicted topics are the mid-level categories of the WikiProject directory (see the hierarchy).

The goal of this analysis is to identify the topics that readers are interested in but for which articles are unavailable, or of poor quality, in their local language. With this result, we can recommend those popular topics to editors in local communities.

It is worth noting that how a translation was initiated reflects different reader motivations, so we break down the analysis by these two types:

Pages translated by Toledo automatically (Google integrates automatically translated pages into search results). These articles indicate that 1) Google thinks the quality of content in local languages is not as good as that of the translated pages AND 2) users are interested in these articles and thus click through from the search results.
User-initiated translation. Users paste the article links into Google Translate, or click on the "Translate this page" link in their search results. In this case, users are well aware that they are reading a translated article and are willing to put in more effort to do so, which indicates a stronger interest in the articles. We break down this analysis by translation target language.
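
The two types can be told apart by the `client=srp` parameter in the referer's query string, which is the condition the queries in this notebook use. The following is a minimal sketch of that rule in Python, not the code the analysis runs: the function name `translation_type` and the example URLs are hypothetical, and it matches the parsed `client` parameter exactly, whereas the SQL uses a looser substring match (`like '%client=srp%'`).

```python
from urllib.parse import urlsplit, parse_qs


def translation_type(referer):
    """Label a referer as Toledo ('toledo') or user-initiated ('user_initiated').

    Assumption: a referer whose query string carries client=srp comes from
    a translated page embedded in search results (Toledo).
    """
    query = parse_qs(urlsplit(referer).query)
    if 'srp' in query.get('client', []):
        return 'toledo'
    return 'user_initiated'


# Hypothetical referers, for illustration only
toledo_ref = 'https://translate.example/translate?client=srp&tl=id&u=x'
user_ref = 'https://translate.example/translate?hl=id&tl=id&u=x'

print(translation_type(toledo_ref))
print(translation_type(user_ref))
```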

Key Take-aways:¶

56.3% of pageviews served by Toledo are STEM (science, tech and engineering) related, which seems to align with the information from Google.
Among STEM, Medicine is the most popular, followed by Biology.
Indonesian and Hindi users like to translate and read articles about Countries on English Wikipedia.
Comparing articles served by Toledo (the majority of whose readers are Indonesian) with Indonesian user-initiated translations, it seems that Indonesian readers' demand for Culture-related content (43.2% of pageviews from Indonesian user-initiated translation) hasn't been fully fulfilled by Toledo yet.

from IPython.display import HTML

HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
The raw code for this notebook is by default hidden for easier reading.
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code"></form>
''')

%load_ext sql_magic

import findspark, os
os.environ['SPARK_HOME'] = '/usr/lib/spark2';
findspark.init()
import pyspark
import pyspark.sql
conf = pyspark.SparkConf().setMaster("yarn")  # Use master yarn here if you are going to query large datasets.
conf.set('spark.executor.memory', '4g')
conf.set('spark.executor.cores', '4')
conf.set('spark.driver.memory', '4g')
conf.set('spark.driver.maxResultSize', '4g')
conf.set('spark.logConf', True)
sc = pyspark.SparkContext(conf=conf)
spark_hive = pyspark.sql.HiveContext(sc)

%config SQL.conn_name = 'spark_hive'

import requests
import pandas as pd
import json

1. Translated pageviews by source languages¶

From the table below, we can see that the vast majority of the pages are translated from English to other languages, so for this first exploration we will only check the topics of articles on English Wikipedia.

%%read_sql translated_pv_by_host
select uri_host, client_srp, 
sum(count) as pageviews
from chelsyx.toledo_pageviews
where year=2019
and month = 3
group by uri_host, client_srp
order by pageviews desc
limit 15

Query started at 05:29:01 PM UTC; Query executed in 0.29 m
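
As a rough check on the claim above, the sketch below sums the pageviews of the 15 rows returned by the query and computes the share that goes to English Wikipedia hosts. The counts are copied from the query output; the variable names are illustrative. Because the query is limited to 15 rows, the share covers only those rows, not all translated traffic.

```python
# (uri_host, pageviews) pairs copied from the 15-row query output
top_rows = [
    ('en.m.wikipedia.org', 5298730), ('en.m.wikipedia.org', 1592200),
    ('en.m.wikipedia.org', 816705), ('en.wikipedia.org', 739397),
    ('en.wikipedia.org', 287853), ('de.wikipedia.org', 202173),
    ('fr.wikipedia.org', 154832), ('ru.wikipedia.org', 119574),
    ('de.wikipedia.org', 118290), ('es.wikipedia.org', 111514),
    ('ja.wikipedia.org', 107760), ('it.wikipedia.org', 86300),
    ('fr.wikipedia.org', 72633), ('zh.wikipedia.org', 53637),
    ('es.wikipedia.org', 45966),
]

total = sum(pv for _, pv in top_rows)
# Desktop and mobile English hosts both start with 'en.'
english = sum(pv for host, pv in top_rows if host.startswith('en.'))
english_share = english / total

print(f'English hosts: {english_share:.1%} of pageviews in the top 15 rows')
```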

2. Pages translated by Toledo¶

These are topics that:

Google thinks the quality of content in local languages is not as good as that of the translated pages AND
Users are interested in and thus click through the search results

%%read_sql translated_page_toledo -d
select w.namespace_id, w.page_id, p.page_title, p.page_latest as rev_id, count(*) as pageviews
from wmf.webrequest w join wmf_raw.mediawiki_page p on (w.page_id=p.page_id and w.namespace_id=p.page_namespace and 
                                                       p.wiki_db='enwiki' and p.snapshot='2019-03')
where
    year = 2019
    and month = 3
    and is_pageview
    and w.namespace_id=0
    and x_analytics_map['translationengine'] = 'GT'
    and parse_url(referer, 'QUERY') like '%client=srp%'
    and uri_host in ('en.wikipedia.org','en.m.wikipedia.org')
group by w.namespace_id, w.page_id, p.page_title, p.page_latest

Query started at 05:29:18 PM UTC; Query executed in 24.43 m

print('Number of unique pages: ' + str(translated_page_toledo.shape[0]))

Number of unique pages: 195003

# Save rev_id into a json file
translated_page_toledo[['rev_id']].to_json(path_or_buf='input_rev_id.json', orient='records', lines=True)

%%bash
ores score_revisions https://ores.wikimedia.org "cxie@wikimedia.org analyzing article topics" enwiki drafttopic --parallel-requests=4 --input=input_rev_id.json > output_drafttopic_toledo.json

2019-04-25 06:02:03,951 INFO:ores.utilities.score_revisions -- Reading input from input_rev_id.json
2019-04-25 06:02:03,951 INFO:ores.utilities.score_revisions -- Writing output to from <stdout>

# Get topic from ORES draft topic output

def get_pred_topic_best(input_json):
    try:
        topics = input_json['score']['drafttopic']['score']['probability']
        best = sorted(topics, key=topics.get, reverse=True)[0]
    except (IndexError, KeyError) as error:
        best = 'Unknown'
    return best

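# Collect (rev_id, topic) pairs first and build the DataFrame once;
# repeated DataFrame.append calls are slow and were removed in pandas 2.0.
rows = []
with open('output_drafttopic_toledo.json') as json_file:
    for line in json_file:
        try:
            ores_results = json.loads(line)
            rows.append((ores_results['rev_id'], get_pred_topic_best(ores_results)))
        except ValueError:
            # Print lines that are not valid JSON so they can be inspected
            print(line)

topic_df = pd.DataFrame(rows, columns=['rev_id', 'topic'])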
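
To make the topic selection concrete, here is a minimal standard-library sketch of the same idea: read one JSON record per line and keep the label with the highest probability, falling back to "Unknown" when no probabilities are present. The nested key path is the one used in `get_pred_topic_best`; the two sample records, including the shape of the error record, are made up for illustration and are not real ORES output.

```python
import io
import json


def best_topic(record):
    """Return the highest-probability topic label, or 'Unknown' if missing."""
    try:
        probs = record['score']['drafttopic']['score']['probability']
        return max(probs, key=probs.get)
    except (KeyError, TypeError, ValueError):
        # Missing keys, a non-dict value, or an empty probability dict
        return 'Unknown'


# Hypothetical JSON-lines input: one scored revision, one failed revision
sample = io.StringIO(
    '{"rev_id": 1, "score": {"drafttopic": {"score": {"probability": '
    '{"STEM.Medicine": 0.9, "Culture.Sports": 0.1}}}}}\n'
    '{"rev_id": 2, "score": {"drafttopic": {"error": {"message": "x"}}}}\n'
)

pairs = [(rec['rev_id'], best_topic(rec)) for rec in map(json.loads, sample)]
print(pairs)
```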

The table below shows the number of pageviews and the corresponding proportions in March 2019, broken down by topic (mid-level categories of the WikiProject directory). Please note that a topic of "Unknown" means the ORES draft topic model could not determine a topic for those articles.

The most popular articles served by Toledo are Medicine related (23.5% pageviews), followed by Countries (12%) and Biology (11.1%).

topic_df.rev_id=topic_df.rev_id.astype(int)
translated_page_toledo_topic = (translated_page_toledo
          .merge(topic_df, how = 'left', on='rev_id')
          .groupby('topic', as_index = False)['pageviews']
          .sum()
          .sort_values(by='pageviews', ascending=False)
         )
translated_page_toledo_topic['proportion']= translated_page_toledo_topic['pageviews']/translated_page_toledo_topic['pageviews'].sum()
translated_page_toledo_topic

The table below aggregates the table above by WikiProject Directory (broad topics). We can see that in terms of broad topics, the most popular articles served by Toledo are STEM (science, tech and engineering) related (56.3% pageviews), followed by Culture (19.4%) and Geography (15.8%). This seems to align with the information Google has given us.

translated_page_toledo_topic['broad topic'] = translated_page_toledo_topic.topic.str.split(pat=".", n=1, expand=True)[0]
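# Select the two columns with a list; tuple-style selection after groupby is not supported in recent pandas
translated_page_toledo_topic.groupby('broad topic', as_index = False)[['pageviews', 'proportion']].sum().sort_values(by='pageviews', ascending=False)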
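
The broad topic is simply the part of the label before the first ".". The sketch below shows the same split-and-sum step with the standard library on a handful of rows copied from the Toledo topic table. It is only a subset, so the totals are smaller than those in the full broad-topic table; the variable names are illustrative.

```python
from collections import defaultdict

# A few (topic, pageviews) rows copied from the Toledo topic table (subset)
topic_pageviews = [
    ('STEM.Medicine', 191313),
    ('Geography.Countries', 97464),
    ('STEM.Biology', 90225),
    ('STEM.Technology', 58771),
    ('Culture.Language and literature', 55456),
    ('STEM.Chemistry', 54125),
    ('Unknown', 133),
]

broad_totals = defaultdict(int)
for topic, pageviews in topic_pageviews:
    # Split once on the first '.'; labels without a '.' (e.g. 'Unknown')
    # are kept unchanged
    broad_topic = topic.split('.', 1)[0]
    broad_totals[broad_topic] += pageviews

for broad_topic, pageviews in sorted(broad_totals.items(), key=lambda kv: -kv[1]):
    print(broad_topic, pageviews)
```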

3. Indonesian users initiated translation¶

In this section, we look at the topics of articles for which Indonesian users initiated translations from English. Users paste the article links into Google Translate, or click on the "Translate this page" link in their search results. In this case, users are well aware that they are reading a translated article and are willing to put in more effort to do so, which indicates a stronger interest in the articles.

%%read_sql translated_page_id -d
select w.namespace_id, w.page_id, p.page_title, p.page_latest as rev_id, count(*) as pageviews
from wmf.webrequest w join wmf_raw.mediawiki_page p on (w.page_id=p.page_id and w.namespace_id=p.page_namespace and 
                                                       p.wiki_db='enwiki' and p.snapshot='2019-03')
where
    year = 2019
    and month = 3
    and is_pageview
    and w.namespace_id=0
    and x_analytics_map['translationengine'] = 'GT'
    and parse_url(referer, 'QUERY') not like '%client=srp%'
    and uri_host in ('en.wikipedia.org','en.m.wikipedia.org')
    and (
        regexp_extract(parse_url(referer, 'QUERY'), '(^|[&?])tl=([^&]*)', 2) = 'id' 
        or ((regexp_extract(parse_url(referer, 'QUERY'), '(^|[&?])tl=([^&]*)', 2) is null or regexp_extract(parse_url(referer, 'QUERY'), '(^|[&?])tl=([^&]*)', 2)='') 
             and regexp_extract(parse_url(referer, 'QUERY'), '(^|[&?])hl=([^&]*)', 2) = 'id')
        )
group by w.namespace_id, w.page_id, p.page_title, p.page_latest

Query started at 06:24:42 PM UTC; Query executed in 15.24 m
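
The query above takes the translation target language from the `tl` parameter of the referer and falls back to `hl` when `tl` is missing or empty. The following is a rough Python equivalent of that rule for a single referer, offered as a sketch rather than the code the analysis runs: the function name `target_language` and the example URLs are hypothetical, and `urllib.parse` is used in place of the regular expressions in the SQL.

```python
from urllib.parse import urlsplit, parse_qs


def target_language(referer):
    """Return the 'tl' value, else the 'hl' value, else None."""
    # parse_qs drops parameters with blank values by default, so an empty
    # 'tl=' behaves like a missing one
    query = parse_qs(urlsplit(referer).query)
    tl = query.get('tl', [''])[0]
    hl = query.get('hl', [''])[0]
    return tl or hl or None


# Hypothetical referers, for illustration only
print(target_language('https://translate.example/translate?hl=en&tl=id&u=x'))
print(target_language('https://translate.example/translate?hl=id&tl=&u=x'))
print(target_language('https://translate.example/translate?u=x'))
```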

print('Number of unique pages: ' + str(translated_page_id.shape[0]))

Number of unique pages: 249252

# Save rev_id into a json file
translated_page_id[['rev_id']].to_json(path_or_buf='input_rev_id.json', orient='records', lines=True)

%%bash
ores score_revisions https://ores.wikimedia.org "cxie@wikimedia.org analyzing article topics" enwiki drafttopic --parallel-requests=4 --input=input_rev_id.json > output_drafttopic_id.json

2019-04-25 06:38:33,177 INFO:ores.utilities.score_revisions -- Reading input from input_rev_id.json
2019-04-25 06:38:33,177 INFO:ores.utilities.score_revisions -- Writing output to from <stdout>

The table below shows the number of pageviews and the corresponding proportions in March 2019, broken down by topic (mid-level categories of the WikiProject directory). Please note that a topic of "Unknown" means the ORES draft topic model could not determine a topic for those articles.

The most popular articles are Countries related (15.5% pageviews), followed by Entertainment (10.7%) and Language and literature (9%).

# Get topic from ORES draft topic output
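# Collect (rev_id, topic) pairs first and build the DataFrame once;
# repeated DataFrame.append calls are slow and were removed in pandas 2.0.
rows = []
with open('output_drafttopic_id.json') as json_file:
    for line in json_file:
        try:
            ores_results = json.loads(line)
            rows.append((ores_results['rev_id'], get_pred_topic_best(ores_results)))
        except ValueError:
            # Print lines that are not valid JSON so they can be inspected
            print(line)

topic_df = pd.DataFrame(rows, columns=['rev_id', 'topic'])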

topic_df.rev_id=topic_df.rev_id.astype(int)
translated_page_id_topic = (translated_page_id
          .merge(topic_df, how = 'left', on='rev_id')
          .groupby('topic', as_index = False)['pageviews']
          .sum()
          .sort_values(by='pageviews', ascending=False)
         )
translated_page_id_topic['proportion']= translated_page_id_topic['pageviews']/translated_page_id_topic['pageviews'].sum()
translated_page_id_topic

The table below aggregates the table above by WikiProject Directory (broad topics). We can see that in terms of broad topics, the most popular articles are Culture related (43.2% pageviews), followed by STEM (20.4%) and Geography (19.8%). Since the majority of Toledo readers are also Indonesian, and Culture accounts for only 19.4% of Toledo pageviews, we can infer that Indonesian readers' demand for Culture-related content hasn't been fully fulfilled by Toledo yet.

translated_page_id_topic['broad topic'] = translated_page_id_topic.topic.str.split(pat=".", n=1, expand=True)[0]
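# Select the two columns with a list; tuple-style selection after groupby is not supported in recent pandas
translated_page_id_topic.groupby('broad topic', as_index = False)[['pageviews', 'proportion']].sum().sort_values(by='pageviews', ascending=False)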
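
The comparison with Toledo can be made explicit by putting the two broad-topic distributions side by side. The sketch below recomputes each share from the pageview counts in the two broad-topic tables and prints the difference (Indonesian user-initiated minus Toledo). The counts are copied from the notebook output; the variable and function names are illustrative, and the gap is only a descriptive measure, not a test of significance.

```python
# Broad-topic pageviews copied from the two broad-topic output tables
toledo = {
    'STEM': 457538, 'Culture': 157897, 'Geography': 128674,
    'History_And_Society': 66920, 'Assistance': 2057, 'Unknown': 133,
}
indonesian = {
    'Culture': 299220, 'STEM': 141089, 'Geography': 137232,
    'History_And_Society': 113105, 'Assistance': 1536, 'Unknown': 100,
}


def shares(counts):
    """Convert raw pageview counts into proportions of the total."""
    total = sum(counts.values())
    return {topic: pv / total for topic, pv in counts.items()}


toledo_share = shares(toledo)
indonesian_share = shares(indonesian)

# Positive gap: the topic is a larger share of user-initiated translations
gap = {t: indonesian_share[t] - toledo_share[t] for t in toledo}
for topic, diff in sorted(gap.items(), key=lambda kv: -kv[1]):
    print(f'{topic:20s} {diff:+.3f}')
```

On these counts Culture shows the largest positive gap and STEM the largest negative one, which is the pattern described above.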

4. Hindi users initiated translation¶

In this section, we look at the topics of articles for which Hindi users initiated translations from English. Users paste the article links into Google Translate, or click on the "Translate this page" link in their search results. In this case, users are well aware that they are reading a translated article and are willing to put in more effort to do so, which indicates a stronger interest in the articles.

%%read_sql translated_page_hi -d
select w.namespace_id, w.page_id, p.page_title, p.page_latest as rev_id, count(*) as pageviews
from wmf.webrequest w join wmf_raw.mediawiki_page p on (w.page_id=p.page_id and w.namespace_id=p.page_namespace and 
                                                       p.wiki_db='enwiki' and p.snapshot='2019-03')
where
    year = 2019
    and month = 3
    and is_pageview
    and w.namespace_id=0
    and x_analytics_map['translationengine'] = 'GT'
    and parse_url(referer, 'QUERY') not like '%client=srp%'
    and uri_host in ('en.wikipedia.org','en.m.wikipedia.org')
    and (
        regexp_extract(parse_url(referer, 'QUERY'), '(^|[&?])tl=([^&]*)', 2) = 'hi' 
        or ((regexp_extract(parse_url(referer, 'QUERY'), '(^|[&?])tl=([^&]*)', 2) is null or regexp_extract(parse_url(referer, 'QUERY'), '(^|[&?])tl=([^&]*)', 2)='') 
             and regexp_extract(parse_url(referer, 'QUERY'), '(^|[&?])hl=([^&]*)', 2) = 'hi')
        )
group by w.namespace_id, w.page_id, p.page_title, p.page_latest

Query started at 07:07:00 PM UTC; Query executed in 14.27 m

print('Number of unique pages: ' + str(translated_page_hi.shape[0]))

Number of unique pages: 178329

# Save rev_id into a json file
translated_page_hi[['rev_id']].to_json(path_or_buf='input_rev_id.json', orient='records', lines=True)

%%bash
ores score_revisions https://ores.wikimedia.org "cxie@wikimedia.org analyzing article topics" enwiki drafttopic --parallel-requests=4 --input=input_rev_id.json > output_drafttopic_hi.json

2019-04-25 17:38:17,514 INFO:ores.utilities.score_revisions -- Reading input from input_rev_id.json
2019-04-25 17:38:17,514 INFO:ores.utilities.score_revisions -- Writing output to from <stdout>

The table below shows the number of pageviews and the corresponding proportions in March 2019, broken down by topic (mid-level categories of the WikiProject directory). Please note that a topic of "Unknown" means the ORES draft topic model could not determine a topic for those articles.

The most popular articles are Countries related (25.5% pageviews), followed by Medicine (8.9%) and Biology (7.2%).

# Get topic from ORES draft topic output
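# Collect (rev_id, topic) pairs first and build the DataFrame once;
# repeated DataFrame.append calls are slow and were removed in pandas 2.0.
rows = []
with open('output_drafttopic_hi.json') as json_file:
    for line in json_file:
        try:
            ores_results = json.loads(line)
            rows.append((ores_results['rev_id'], get_pred_topic_best(ores_results)))
        except ValueError:
            # Print lines that are not valid JSON so they can be inspected
            print(line)

topic_df = pd.DataFrame(rows, columns=['rev_id', 'topic'])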

topic_df.rev_id=topic_df.rev_id.astype(int)
translated_page_hi_topic = (translated_page_hi
          .merge(topic_df, how = 'left', on='rev_id')
          .groupby('topic', as_index = False)['pageviews']
          .sum()
          .sort_values(by='pageviews', ascending=False)
         )
translated_page_hi_topic['proportion']= translated_page_hi_topic['pageviews']/translated_page_hi_topic['pageviews'].sum()
translated_page_hi_topic

The table below aggregates the table above by WikiProject Directory (broad topics). We can see that in terms of broad topics, the most popular articles are STEM related (31% pageviews), followed by Geography (28.3%) and Culture (20.5%).

translated_page_hi_topic['broad topic'] = translated_page_hi_topic.topic.str.split(pat=".", n=1, expand=True)[0]
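# Select the two columns with a list; tuple-style selection after groupby is not supported in recent pandas
translated_page_hi_topic.groupby('broad topic', as_index = False)[['pageviews', 'proportion']].sum().sort_values(by='pageviews', ascending=False)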

Output for section 1: translated pageviews by host and client_srp, March 2019 (top 15 rows)
	uri_host	client_srp	pageviews
0	en.m.wikipedia.org	False	5298730
1	en.m.wikipedia.org	None	1592200
2	en.m.wikipedia.org	True	816705
3	en.wikipedia.org	False	739397
4	en.wikipedia.org	None	287853
5	de.wikipedia.org	False	202173
6	fr.wikipedia.org	False	154832
7	ru.wikipedia.org	False	119574
8	de.wikipedia.org	None	118290
9	es.wikipedia.org	False	111514
10	ja.wikipedia.org	False	107760
11	it.wikipedia.org	False	86300
12	fr.wikipedia.org	None	72633
13	zh.wikipedia.org	False	53637
14	es.wikipedia.org	None	45966

Output for section 2: pages translated by Toledo, pageviews by topic
	topic	pageviews	proportion
36	STEM.Medicine	191313	0.235254
19	Geography.Countries	97464	0.119850
30	STEM.Biology	90225	0.110948
41	STEM.Technology	58771	0.072270
10	Culture.Language and literature	55456	0.068193
31	STEM.Chemistry	54125	0.066556
20	Geography.Europe	22294	0.027415
7	Culture.Entertainment	21822	0.026834
38	STEM.Physics	20398	0.025083
26	History_And_Society.History and society	19846	0.024404
13	Culture.Philosophy and religion	16283	0.020023
27	History_And_Society.Military and warfare	16256	0.019990
12	Culture.Performing arts	15262	0.018767
24	History_And_Society.Business and economics	14016	0.017235
16	Culture.Visual arts	13396	0.016473
15	Culture.Sports	13051	0.016049
35	STEM.Mathematics	11663	0.014342
8	Culture.Food and drink	9063	0.011145
29	History_And_Society.Transportation	8504	0.010457
32	STEM.Engineering	8261	0.010158
39	STEM.Science	7234	0.008896
40	STEM.Space	6096	0.007496
5	Culture.Broadcasting	6087	0.007485
28	History_And_Society.Politics and government	5875	0.007224
33	STEM.Geosciences	5562	0.006839
9	Culture.Internet culture	3894	0.004788
23	Geography.Oceania	3867	0.004755
22	Geography.Maps	3337	0.004103
25	History_And_Society.Education	2423	0.002980
37	STEM.Meteorology	2379	0.002925
14	Culture.Plastic arts	1277	0.001570
6	Culture.Crafts and hobbies	1168	0.001436
3	Assistance.Maintenance	1098	0.001350
17	Geography.Bodies of water	1045	0.001285
42	STEM.Time	939	0.001155
11	Culture.Media	862	0.001060
1	Assistance.Contents systems	701	0.000862
34	STEM.Information science	572	0.000703
21	Geography.Landforms	534	0.000657
4	Culture.Arts	276	0.000339
2	Assistance.Files	141	0.000173
43	Unknown	133	0.000164
18	Geography.Cities	133	0.000164
0	Assistance.Article improvement and grading	117	0.000144

Output for section 2: pages translated by Toledo, pageviews by broad topic
	broad topic	pageviews	proportion
4	STEM	457538	0.562626
1	Culture	157897	0.194163
2	Geography	128674	0.158228
3	History_And_Society	66920	0.082290
0	Assistance	2057	0.002529
5	Unknown	133	0.000164

Output for section 3: Indonesian user-initiated translation, pageviews by topic
	topic	pageviews	proportion
19	Geography.Countries	107235	0.154901
7	Culture.Entertainment	73774	0.106566
10	Culture.Language and literature	62470	0.090238
12	Culture.Performing arts	40229	0.058111
15	Culture.Sports	39626	0.057240
36	STEM.Medicine	37125	0.053627
30	STEM.Biology	34486	0.049815
27	History_And_Society.Military and warfare	28867	0.041698
29	History_And_Society.Transportation	26110	0.037716
41	STEM.Technology	24823	0.035857
24	History_And_Society.Business and economics	24269	0.035057
16	Culture.Visual arts	23786	0.034359
20	Geography.Europe	22121	0.031954
26	History_And_Society.History and society	20565	0.029706
13	Culture.Philosophy and religion	17405	0.025141
5	Culture.Broadcasting	15982	0.023086
31	STEM.Chemistry	13582	0.019619
8	Culture.Food and drink	11299	0.016321
9	Culture.Internet culture	9519	0.013750
28	History_And_Society.Politics and government	8623	0.012456
32	STEM.Engineering	7760	0.011209
38	STEM.Physics	6804	0.009828
23	Geography.Oceania	5023	0.007256
25	History_And_Society.Education	4671	0.006747
40	STEM.Space	3844	0.005553
39	STEM.Science	3673	0.005306
35	STEM.Mathematics	3659	0.005285
33	STEM.Geosciences	2927	0.004228
14	Culture.Plastic arts	2434	0.003516
37	STEM.Meteorology	1513	0.002186
22	Geography.Maps	1501	0.002168
6	Culture.Crafts and hobbies	1393	0.002012
3	Assistance.Maintenance	981	0.001417
11	Culture.Media	868	0.001254
17	Geography.Bodies of water	748	0.001080
42	STEM.Time	487	0.000703
21	Geography.Landforms	480	0.000693
4	Culture.Arts	435	0.000628
34	STEM.Information science	406	0.000586
1	Assistance.Contents systems	393	0.000568
18	Geography.Cities	124	0.000179
2	Assistance.Files	103	0.000149
43	Unknown	100	0.000144
0	Assistance.Article improvement and grading	59	0.000085

Output for section 3: Indonesian user-initiated translation, pageviews by broad topic
	broad topic	pageviews	proportion
1	Culture	299220	0.432223
4	STEM	141089	0.203803
2	Geography	137232	0.198231
3	History_And_Society	113105	0.163380
0	Assistance	1536	0.002219
5	Unknown	100	0.000144

Output for section 4: Hindi user-initiated translation, pageviews by topic
	topic	pageviews	proportion
19	Geography.Countries	205031	0.255371
36	STEM.Medicine	71635	0.089223
30	STEM.Biology	57729	0.071903
10	Culture.Language and literature	52215	0.065035
24	History_And_Society.Business and economics	46858	0.058363
26	History_And_Society.History and society	40072	0.049911
31	STEM.Chemistry	32089	0.039968
41	STEM.Technology	31559	0.039307
28	History_And_Society.Politics and government	27793	0.034617
13	Culture.Philosophy and religion	21281	0.026506
7	Culture.Entertainment	19453	0.024229
12	Culture.Performing arts	18407	0.022926
38	STEM.Physics	16815	0.020943
29	History_And_Society.Transportation	15887	0.019788
27	History_And_Society.Military and warfare	15196	0.018927
15	Culture.Sports	14328	0.017846
25	History_And_Society.Education	14318	0.017833
20	Geography.Europe	13763	0.017142
8	Culture.Food and drink	12409	0.015456
16	Culture.Visual arts	10917	0.013597
32	STEM.Engineering	10844	0.013506
39	STEM.Science	6848	0.008529
40	STEM.Space	5474	0.006818
35	STEM.Mathematics	5306	0.006609
5	Culture.Broadcasting	4837	0.006025
9	Culture.Internet culture	4614	0.005747
33	STEM.Geosciences	4128	0.005142
37	STEM.Meteorology	3002	0.003739
22	Geography.Maps	2943	0.003666
6	Culture.Crafts and hobbies	2576	0.003208
17	Geography.Bodies of water	2443	0.003043
11	Culture.Media	2415	0.003008
34	STEM.Information science	2353	0.002931
23	Geography.Oceania	2122	0.002643
14	Culture.Plastic arts	1209	0.001506
3	Assistance.Maintenance	905	0.001127
42	STEM.Time	811	0.001010
21	Geography.Landforms	751	0.000935
1	Assistance.Contents systems	463	0.000577
43	Unknown	385	0.000480
4	Culture.Arts	254	0.000316
18	Geography.Cities	157	0.000196
2	Assistance.Files	157	0.000196
0	Assistance.Article improvement and grading	123	0.000153

Output for section 4: Hindi user-initiated translation, pageviews by broad topic
	broad topic	pageviews	proportion
4	STEM	248593	0.309629
2	Geography	227210	0.282995
1	Culture	164915	0.205406
3	History_And_Society	160124	0.199438
0	Assistance	1648	0.002053
5	Unknown	385	0.000480