https://phabricator.wikimedia.org/T219660
In this analysis, we use the ORES draft topic model to get the topics of articles viewed and translated by Google in March 2019. The outcome topic is the mid-level categories of WikiProject directory (see the hierarchy).
The goal of this analysis is to figure out the topics readers are interested in, but those articles are not available (or their quality is not good) in their local language. With this result, we can recommend those popular topics to editors in local communities.
It is worth to note that how the translation was initiated represents different motivations of readers, and thus we break down the analysis by these two types:
from IPython.display import HTML
HTML('''<script>
code_show=true;
function code_toggle() {
if (code_show){
$('div.input').hide();
} else {
$('div.input').show();
}
code_show = !code_show
}
$( document ).ready(code_toggle);
</script>
The raw code for this notebook is by default hidden for easier reading.
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code"></form>
''')
%load_ext sql_magic
import findspark, os
os.environ['SPARK_HOME'] = '/usr/lib/spark2';
findspark.init()
import pyspark
import pyspark.sql
conf = pyspark.SparkConf().setMaster("yarn") # Use master yarn here if you are going to query large datasets.
conf.set('spark.executor.memory', '4g')
conf.set('spark.executor.cores', '4')
conf.set('spark.driver.memory', '4g')
conf.set('spark.driver.maxResultSize', '4g')
conf.set('spark.logConf', True)
sc = pyspark.SparkContext(conf=conf)
spark_hive = pyspark.sql.HiveContext(sc)
%config SQL.conn_name = 'spark_hive'
import requests
import pandas as pd
import json
From the table below, we can see that the vast majority of the pages are translated from English to other languages, so for this first exploration we will only check the topics of articles on English Wikipedia.
%%read_sql translated_pv_by_host
select uri_host, client_srp,
sum(count) as pageviews
from chelsyx.toledo_pageviews
where year=2019
and month = 3
group by uri_host, client_srp
order by pageviews desc
limit 15
These are topics that:
%%read_sql translated_page_toledo -d
select w.namespace_id, w.page_id, p.page_title, p.page_latest as rev_id, count(*) as pageviews
from wmf.webrequest w join wmf_raw.mediawiki_page p on (w.page_id=p.page_id and w.namespace_id=p.page_namespace and
p.wiki_db='enwiki' and p.snapshot='2019-03')
where
year = 2019
and month = 3
and is_pageview
and w.namespace_id=0
and x_analytics_map['translationengine'] = 'GT'
and parse_url(referer, 'QUERY') like '%client=srp%'
and uri_host in ('en.wikipedia.org','en.m.wikipedia.org')
group by w.namespace_id, w.page_id, p.page_title, p.page_latest
print('Number of unique pages: ' + str(translated_page_toledo.shape[0]))
# Save rev_id into a json file
translated_page_toledo[['rev_id']].to_json(path_or_buf='input_rev_id.json', orient='records', lines=True)
%%bash
ores score_revisions https://ores.wikimedia.org "cxie@wikimedia.org analyzing article topics" enwiki drafttopic --parallel-requests=4 --input=input_rev_id.json > output_drafttopic_toledo.json
# Get topic from ORES draft topic output
def get_pred_topic_best(input_json):
try:
topics = input_json['score']['drafttopic']['score']['probability']
best = sorted(topics, key=topics.get, reverse=True)[0]
except (IndexError, KeyError) as error:
best = 'Unknown'
return best
topic_df = pd.DataFrame([])
with open('output_drafttopic_toledo.json') as json_file:
for line in json_file:
try:
ores_results = json.loads(line)
topic_df = topic_df.append(pd.DataFrame([[ores_results['rev_id'], get_pred_topic_best(ores_results)]]))
except ValueError:
print(line)
topic_df.columns = ['rev_id', 'topic']
The table below shows the number of pageviews and the corresponding proprotions in March 2019, broken down by the topics (mid-level categories of WikiProject directory). Please note that when topic is "Unknown", it means the ORES draft topic model can't figure out the topics for those articles.
The most popular articles served by Toledo are Medicine related (23.5% pageviews), followed by Countries (12%) and Biology (11.1%).
topic_df.rev_id=topic_df.rev_id.astype(int)
translated_page_toledo_topic = (translated_page_toledo
.merge(topic_df, how = 'left', on='rev_id')
.groupby('topic', as_index = False)['pageviews']
.sum()
.sort_values(by='pageviews', ascending=False)
)
translated_page_toledo_topic['proportion']= translated_page_toledo_topic['pageviews']/translated_page_toledo_topic['pageviews'].sum()
translated_page_toledo_topic
The table below aggregates the table above by WikiProject Directory (broad topics). We can see that in terms of broad topics, the most popular articles served by Toledo are STEM (science, tech and engineering) related (56.3% pageviews), followed by Culture (19.4%) and Geography (15.8%). This seems to align with the information that Gooogle tell us.
translated_page_toledo_topic['broad topic'] = translated_page_toledo_topic.topic.str.split(pat=".", n=1, expand=True)[0]
translated_page_toledo_topic.groupby('broad topic', as_index = False)['pageviews', 'proportion'].sum().sort_values(by='pageviews', ascending=False)
In this section, we look at the topics of articles for which Indonesian users initiate translations from English. Users paste the article links into Google translate, or click on the "Translate this page" link from their search result. In this case, users are well aware that they are reading a translated article and willing to put more effort to do that, which is an indication of a stronger interest in the articles.
%%read_sql translated_page_id -d
select w.namespace_id, w.page_id, p.page_title, p.page_latest as rev_id, count(*) as pageviews
from wmf.webrequest w join wmf_raw.mediawiki_page p on (w.page_id=p.page_id and w.namespace_id=p.page_namespace and
p.wiki_db='enwiki' and p.snapshot='2019-03')
where
year = 2019
and month = 3
and is_pageview
and w.namespace_id=0
and x_analytics_map['translationengine'] = 'GT'
and parse_url(referer, 'QUERY') not like '%client=srp%'
and uri_host in ('en.wikipedia.org','en.m.wikipedia.org')
and (
regexp_extract(parse_url(referer, 'QUERY'), '(^|[&?])tl=([^&]*)', 2) = 'id'
or ((regexp_extract(parse_url(referer, 'QUERY'), '(^|[&?])tl=([^&]*)', 2) is null or regexp_extract(parse_url(referer, 'QUERY'), '(^|[&?])tl=([^&]*)', 2)='')
and regexp_extract(parse_url(referer, 'QUERY'), '(^|[&?])hl=([^&]*)', 2) = 'id')
)
group by w.namespace_id, w.page_id, p.page_title, p.page_latest
print('Number of unique pages: ' + str(translated_page_id.shape[0]))
# Save rev_id into a json file
translated_page_id[['rev_id']].to_json(path_or_buf='input_rev_id.json', orient='records', lines=True)
%%bash
ores score_revisions https://ores.wikimedia.org "cxie@wikimedia.org analyzing article topics" enwiki drafttopic --parallel-requests=4 --input=input_rev_id.json > output_drafttopic_id.json
The table below shows the number of pageviews and the corresponding proprotions in March 2019, broken down by the topics (mid-level categories of WikiProject directory). Please note that when topic is "Unknown", it means the ORES draft topic model can't figure out the topics for those articles.
The most popular articles are Countries related (15.5% pageviews), followed by Entertainment (10.7%), Language and literature (9%).
# Get topic from ORES draft topic output
topic_df = pd.DataFrame([])
with open('output_drafttopic_id.json') as json_file:
for line in json_file:
try:
ores_results = json.loads(line)
topic_df = topic_df.append(pd.DataFrame([[ores_results['rev_id'], get_pred_topic_best(ores_results)]]))
except ValueError:
print(line)
topic_df.columns = ['rev_id', 'topic']
topic_df.rev_id=topic_df.rev_id.astype(int)
translated_page_id_topic = (translated_page_id
.merge(topic_df, how = 'left', on='rev_id')
.groupby('topic', as_index = False)['pageviews']
.sum()
.sort_values(by='pageviews', ascending=False)
)
translated_page_id_topic['proportion']= translated_page_id_topic['pageviews']/translated_page_id_topic['pageviews'].sum()
translated_page_id_topic
The table below aggregates the table above by WikiProject Directory (broad topics). We can see that in terms of broad topics, the most popular articles are Culture related (43.2% pageviews), followed by STEM (20.4%) and Geography (19.8%). Comparing with the articles served by Toledo, since the majority of their readers are also Indonesian, we can infer that the demand of Indonesian readers haven't been fully fullfilled by Toledo yet.
translated_page_id_topic['broad topic'] = translated_page_id_topic.topic.str.split(pat=".", n=1, expand=True)[0]
translated_page_id_topic.groupby('broad topic', as_index = False)['pageviews', 'proportion'].sum().sort_values(by='pageviews', ascending=False)
In this section, we look at the topics of articles for which Hindi users initiate translations from English. Users paste the article links into Google translate, or click on the "Translate this page" link from their search result. In this case, users are well aware that they are reading a translated article and willing to put more effort to do that, which is an indication of a stronger interest in the articles.
%%read_sql translated_page_hi -d
select w.namespace_id, w.page_id, p.page_title, p.page_latest as rev_id, count(*) as pageviews
from wmf.webrequest w join wmf_raw.mediawiki_page p on (w.page_id=p.page_id and w.namespace_id=p.page_namespace and
p.wiki_db='enwiki' and p.snapshot='2019-03')
where
year = 2019
and month = 3
and is_pageview
and w.namespace_id=0
and x_analytics_map['translationengine'] = 'GT'
and parse_url(referer, 'QUERY') not like '%client=srp%'
and uri_host in ('en.wikipedia.org','en.m.wikipedia.org')
and (
regexp_extract(parse_url(referer, 'QUERY'), '(^|[&?])tl=([^&]*)', 2) = 'hi'
or ((regexp_extract(parse_url(referer, 'QUERY'), '(^|[&?])tl=([^&]*)', 2) is null or regexp_extract(parse_url(referer, 'QUERY'), '(^|[&?])tl=([^&]*)', 2)='')
and regexp_extract(parse_url(referer, 'QUERY'), '(^|[&?])hl=([^&]*)', 2) = 'hi')
)
group by w.namespace_id, w.page_id, p.page_title, p.page_latest
print('Number of unique pages: ' + str(translated_page_hi.shape[0]))
# Save rev_id into a json file
translated_page_hi[['rev_id']].to_json(path_or_buf='input_rev_id.json', orient='records', lines=True)
%%bash
ores score_revisions https://ores.wikimedia.org "cxie@wikimedia.org analyzing article topics" enwiki drafttopic --parallel-requests=4 --input=input_rev_id.json > output_drafttopic_hi.json
The table below shows the number of pageviews and the corresponding proprotions in March 2019, broken down by the topics (mid-level categories of WikiProject directory). Please note that when topic is "Unknown", it means the ORES draft topic model can't figure out the topics for those articles.
The most popular articles are Countries related (25.5% pageviews), followed by Medicine (8.9%), Biology (7.2%).
# Get topic from ORES draft topic output
topic_df = pd.DataFrame([])
with open('output_drafttopic_hi.json') as json_file:
for line in json_file:
try:
ores_results = json.loads(line)
topic_df = topic_df.append(pd.DataFrame([[ores_results['rev_id'], get_pred_topic_best(ores_results)]]))
except ValueError:
print(line)
topic_df.columns = ['rev_id', 'topic']
topic_df.rev_id=topic_df.rev_id.astype(int)
translated_page_hi_topic = (translated_page_hi
.merge(topic_df, how = 'left', on='rev_id')
.groupby('topic', as_index = False)['pageviews']
.sum()
.sort_values(by='pageviews', ascending=False)
)
translated_page_hi_topic['proportion']= translated_page_hi_topic['pageviews']/translated_page_hi_topic['pageviews'].sum()
translated_page_hi_topic
The table below aggregates the table above by WikiProject Directory (broad topics). We can see that in terms of broad topics, the most popular articles are STEM related (31% pageviews), followed by Geography (28.3%), Culture (20.5%).
translated_page_hi_topic['broad topic'] = translated_page_hi_topic.topic.str.split(pat=".", n=1, expand=True)[0]
translated_page_hi_topic.groupby('broad topic', as_index = False)['pageviews', 'proportion'].sum().sort_values(by='pageviews', ascending=False)