Last time we trained the hewiki model on an analysis chain different from the one used when running the test, so the results were invalid. We re-ran the test with a new model trained against the new analysis chain, using the same sampling rate as last time.
This test ran from 02 January 2018 to 16 January 2018 on hewiki. There were 3 test groups: control, ltr-1024, ltr-1024-i. This report includes fulltext searches. Refer to Phabricator ticket T182616 for more details.
Fulltext search events:

- Deleted 232 duplicated events.
- Deleted 98,498 unnecessary check-in events, keeping only the last check-in per page visit.
- Deleted 5 events with negative load times.
- Removed 465 orphan (SERP-less) events.
- Removed 0 sessions falling into multiple test groups.
- Removed 2 sessions with more than 50 searches.
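The cleaning steps above can be sketched as a small pipeline. The event fields used here (`event_id`, `session_id`, `action`, `page_id`, `load_time`, `timestamp`) are hypothetical stand-ins for the actual event schema, and this covers only the first four steps:

```python
def clean_events(events):
    """Sketch of the event-cleaning steps, assuming each event is a dict
    with hypothetical fields: event_id, session_id, action, page_id,
    load_time, timestamp."""
    # 1. Drop exact duplicates by event_id, keeping the first occurrence.
    seen, deduped = set(), []
    for e in events:
        if e['event_id'] not in seen:
            seen.add(e['event_id'])
            deduped.append(e)
    # 2. For check-in events, keep only the last one per page visit.
    last_checkin = {}
    for e in deduped:
        if e['action'] == 'checkin':
            key = (e['session_id'], e['page_id'])
            if key not in last_checkin or e['timestamp'] > last_checkin[key]['timestamp']:
                last_checkin[key] = e
    cleaned = [e for e in deduped
               if e['action'] != 'checkin'
               or last_checkin[(e['session_id'], e['page_id'])] is e]
    # 3. Drop events with negative load times.
    cleaned = [e for e in cleaned if e.get('load_time', 0) >= 0]
    # 4. Drop orphan (SERP-less) events: sessions with no searchResultPage.
    serp_sessions = {e['session_id'] for e in cleaned
                     if e['action'] == 'searchResultPage'}
    return [e for e in cleaned if e['session_id'] in serp_sessions]
```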
| Days | Events | Sessions | Page IDs | SERPs | Unique search queries | Searches | Same-wiki clicks | Other clicks |
|---|---|---|---|---|---|---|---|---|
| 15 | 115,525 | 34,575 | 86,314 | 70,353 | 68,890 | 56,432 | 16,739 | 0 |
Event type identifies the context in which the event was created. Every time a new search is performed, a searchResultPage event is created. When the user clicks a link in the results, a visitPage event is created. When the user has dwelled on the visited page for N seconds, a checkin event occurs. If the user clicks an interwiki result provided by TextCat language detection, there is an iwclick event. If the user clicks a sister-project search result in the sidebar, that's an ssclick. If the user interacts with a result to explore similar pages, categories, or translations, there are hover-on, hover-off, and esclick events.
The goal here is to see whether the proportions of operating system (OS) and browser usage are similar between the groups. To aid decision making, a Bayes factor is computed for each row. If one group has a very different OS/browser share breakdown (Bayes factor > 2), something in the implementation may be biasing the sampling in favor of some OSes or browsers. Note that for brevity we show only the top 10 OSes/browsers, and that we don't actually expect the numbers to differ, so this is included purely as a diagnostic.
Operating systems:
Browsers:
ltr-1024 vs. control
PaulScore is a measure of search result relevance that takes into account the positions of the clicked results, and is computed via the following steps:
We can calculate the confidence interval of PaulScore\((F)\) by approximating its distribution via bootstrapping.
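As a sketch, assuming one common formulation of PaulScore (per-search score is \(\sum F^{\text{position}}\) over 0-indexed clicked positions, averaged across searches) and a percentile bootstrap that resamples searches with replacement; the function names and the list-of-click-positions representation are hypothetical:

```python
import random

def paulscore(click_positions, f=0.7):
    """PaulScore(F), assuming: each element of click_positions is one
    search's list of clicked result positions (0-indexed); a search's
    score is sum(f ** pos); the overall score is the mean across searches."""
    per_search = [sum(f ** pos for pos in positions)
                  for positions in click_positions]
    return sum(per_search) / len(per_search)

def bootstrap_ci(click_positions, f=0.7, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample searches with replacement and
    take the alpha/2 and 1 - alpha/2 quantiles of the resampled scores."""
    rng = random.Random(seed)
    stats = sorted(
        paulscore([rng.choice(click_positions) for _ in click_positions], f)
        for _ in range(n_boot)
    )
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```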
Users may return directly to the search results page after clicking through to an article (within 10 minutes). We computed two kinds of return rate:
Among users with at least one click in their search, the proportion of searches that return to the same search page
Among users with at least one click in their search session, the proportion of sessions that return to search for something else (a different search results page, but within the same session)
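The two rates above can be sketched as follows. The data shape is entirely hypothetical (each session is a list of searches; each search carries click timestamps and SERP-impression timestamps in seconds), so this illustrates the definitions rather than the report's actual computation:

```python
def return_rates(sessions, window_secs=600):
    """Returns (serp_return_rate, session_return_rate), assuming each
    session is a list of searches and each search is a dict with
    'clicks' (click timestamps) and 'serp_times' (timestamps of SERP
    impressions for that search), all in seconds."""
    searches_with_click = serp_returns = 0
    sessions_with_click = session_returns = 0
    for session in sessions:
        clicked = [s for s in session if s['clicks']]
        if not clicked:
            continue
        sessions_with_click += 1
        # Rate 1: a click followed by a revisit of the same SERP
        # within the 10-minute window.
        for s in clicked:
            searches_with_click += 1
            if any(0 < t - c <= window_secs
                   for c in s['clicks'] for t in s['serp_times']):
                serp_returns += 1
        # Rate 2: a different search shown after the session's first click.
        first_search = min(clicked, key=lambda s: min(s['clicks']))
        first_click = min(first_search['clicks'])
        if any(min(s['serp_times']) > first_click
               for s in session if s is not first_search):
            session_returns += 1
    return (serp_returns / searches_with_click if searches_with_click else 0.0,
            session_returns / sessions_with_click if sessions_with_click else 0.0)
```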
We use a technique called interleaving to evaluate the user-perceived relevance of search results from the experimental configuration. In it, each user is their own baseline – we perform two searches behind the scenes and then interleave them together into a single set of results using the team draft algorithm described by Chapelle et al. (2012). For all the graphs in this section, A refers to the control group or the first group named in the interleaved group name, and B refers to the second (or only) group named in the interleaved group name. For example, if the interleaved group name is ‘ltr-i-1024’, A is the control group and B is group ‘1024’; if the interleaved group name is ‘ltr-i-20-1024’, A is group ‘20’ and B is group ‘1024’.
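A minimal sketch of team-draft interleaving is below. The drafting rule (team with fewer picks drafts next, coin flip on ties) follows the usual description of the algorithm; the function name and result representation are illustrative:

```python
import random

def team_draft_interleave(a, b, rng=None):
    """Team-draft interleaving: the team with fewer picks drafts next
    (coin flip on ties), taking its highest-ranked result not already in
    the interleaved list. Returns (interleaved, teams), where teams[i]
    names the team ('A' or 'B') credited if result i is clicked."""
    rng = rng or random.Random()
    ranked = {'A': a, 'B': b}
    interleaved, teams, seen = [], [], set()
    count = {'A': 0, 'B': 0}
    while any(d not in seen for d in a) or any(d not in seen for d in b):
        if count['A'] < count['B']:
            team = 'A'
        elif count['B'] < count['A']:
            team = 'B'
        else:
            team = rng.choice(['A', 'B'])
        pick = next((d for d in ranked[team] if d not in seen), None)
        if pick is None:  # this team is exhausted; draft from the other
            team = 'B' if team == 'A' else 'A'
            pick = next((d for d in ranked[team] if d not in seen), None)
            if pick is None:
                break
        interleaved.append(pick)
        teams.append(team)
        count[team] += 1
        seen.add(pick)
    return interleaved, teams
```

Because each user sees one merged list, a click on a result is credited to whichever ranker drafted it, making the comparison within-user rather than between groups.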