Our current DBN training requires at least 10 instances of a particular query to create training data. We tested raising that minimum to 20 instances (dbn20) or 35 instances (dbn35). The control is the production LTR model trained with the current standard DBN minimum of 10. We also have two groups (dbn20-i and dbn35-i) that interleave the results from the control with the two test groups (dbn20 and dbn35), respectively.
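For illustration, the minimum-instances threshold amounts to dropping rare queries from the click logs before DBN training. Here is a minimal sketch, assuming a pandas DataFrame of click records; the `query` and `session_id` column names are hypothetical, not the production pipeline's schema:

```python
import pandas as pd

def filter_training_queries(clicks: pd.DataFrame, min_instances: int = 10) -> pd.DataFrame:
    """Keep only clicks on queries seen in at least `min_instances` sessions."""
    counts = clicks.groupby("query")["session_id"].nunique()
    eligible = counts.index[counts >= min_instances]
    return clicks[clicks["query"].isin(eligible)]

# The control uses min_instances=10; the test groups use 20 (dbn20) and 35 (dbn35).
```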

This test ran from 02 November 2017 to 09 November 2017 on enwiki. There were 5 test groups: control, dbn20, dbn35, dbn20-i, dbn35-i. This report includes fulltext searches. Refer to Phabricator ticket T177520 for more details.

Data Clean-up

Fulltext search events: deleted 4,474 duplicated events; deleted 0 unnecessary check-in events (when a page visit has multiple check-ins, we keep only the last one); deleted 0 events with negative load times; removed 7,805 orphan (SERP-less) events; removed 0 sessions falling into multiple test groups; and removed 340 sessions with more than 50 searches.
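A minimal sketch of these clean-up steps, assuming a pandas DataFrame of EventLogging records; all column names here (`event_id`, `session_id`, `group`, `event_type`, `page_id`, `serp_id`, `load_time`, `timestamp`) are hypothetical:

```python
import pandas as pd

def clean_events(events: pd.DataFrame, max_searches: int = 50) -> pd.DataFrame:
    # Drop duplicated events.
    events = events.drop_duplicates(subset="event_id")

    # Keep only the last check-in event per page visit.
    is_checkin = events["event_type"] == "checkin"
    last_checkin = (events[is_checkin]
                    .sort_values("timestamp")
                    .drop_duplicates(subset=["session_id", "page_id"], keep="last"))
    events = pd.concat([events[~is_checkin], last_checkin])

    # Drop events with negative load times (NaN load times are kept).
    events = events[~(events["load_time"] < 0)]

    # Remove orphan events: clicks/check-ins whose SERP was never logged.
    serp_ids = set(events.loc[events["event_type"] == "searchResultPage", "serp_id"])
    events = events[(events["event_type"] == "searchResultPage")
                    | events["serp_id"].isin(serp_ids)]

    # Remove sessions assigned to more than one test group.
    groups_per_session = events.groupby("session_id")["group"].nunique()
    single_group = groups_per_session.index[groups_per_session == 1]
    events = events[events["session_id"].isin(single_group)]

    # Remove sessions with more than `max_searches` searches.
    searches = events[events["event_type"] == "searchResultPage"]
    n_searches = searches.groupby("session_id")["serp_id"].nunique()
    keep = n_searches.index[n_searches <= max_searches]
    return events[events["session_id"].isin(keep)]
```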

Data Summary


Test Summary

| Days | Events | Sessions | Page IDs | SERPs | Unique search queries | Searches | Same-wiki clicks | Other clicks |
|-----:|-------:|---------:|---------:|------:|----------------------:|---------:|-----------------:|-------------:|
| 8 | 1,126,914 | 292,966 | 794,763 | 606,705 | 645,589 | 501,913 | 179,414 | 0 |


Events

Event type identifies the context in which the event was created. Every time a new search is performed, a searchResultPage event is created. When the user clicks a link in the results, a visitPage event is created. When the user has dwelled on the visited page for N seconds, a checkin event occurs. If the user clicks an interwiki result provided by TextCat language detection, there is an iwclick event. If the user clicks a sister-project search result in the sidebar, that's an ssclick. If the user interacts with a result to explore similar (pages, categories, translations), there are hover-on, hover-off, and esclick events.
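For reference, here is the taxonomy above as a lookup table. The event type strings come from the schema described here; the tally helper is a hypothetical illustration, not part of the analysis code:

```python
from collections import Counter

EVENT_TYPES = {
    "searchResultPage": "a new search was performed",
    "visitPage": "the user clicked a same-wiki result",
    "checkin": "the user dwelled on a visited page for N seconds",
    "iwclick": "click on an interwiki result from TextCat language detection",
    "ssclick": "click on a sister-project result in the sidebar",
    "hover-on": "the user opened an 'explore similar' widget",
    "hover-off": "the user closed an 'explore similar' widget",
    "esclick": "click on an 'explore similar' result (pages, categories, translations)",
}

def tally_events(events):
    """Count events by type, e.g. events = [{"event_type": "visitPage"}, ...]."""
    return Counter(e["event_type"] for e in events)
```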

Searches

(Figure: searches by the number of same-wiki results returned.)

(Figure: SERPs by offset.)

Browser & OS

The goal here is to see whether the proportions of operating system (OS) and browser usage are similar between the groups. To aid decision making, a Bayes factor is computed for each row. If one group has a very different OS/browser share breakdown (Bayes factor > 2), there might be something wrong with the implementation that caused, or is causing, the sampling to be biased in favor of some OSes/browsers. Note that for brevity we show only the top 10 OSes/browsers, and that we don't actually expect the numbers to differ, so this is included purely as a diagnostic.
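The report doesn't state which Bayes factor method was used; one common choice is the BIC (Schwarz) approximation for a test of independence between test group and OS/browser, sketched below with hypothetical counts:

```python
import numpy as np
from scipy.stats import chi2_contingency

def approx_bayes_factor(table: np.ndarray) -> float:
    """BIC-approximate Bayes factor BF10 for group/category dependence.

    table: groups x categories contingency table of counts.
    BF10 ~= exp((G2 - dof * ln(n)) / 2), where G2 is the likelihood-ratio
    (G-test) statistic of the independence test and n is the total count.
    """
    g2, _, dof, _ = chi2_contingency(table, correction=False,
                                     lambda_="log-likelihood")
    n = table.sum()
    return float(np.exp(0.5 * (g2 - dof * np.log(n))))

# Hypothetical counts: one OS vs. all others, in two test groups.
print(approx_bayes_factor(np.array([[5200, 44800], [5050, 44950]])))
# A value > 2 would flag a suspicious OS/browser imbalance between groups.
```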

Operating systems:

Browsers:

Results of Statistical Analysis

Same-wiki Zero Results Rate

Same-wiki Engagement

Comparisons: dbn20 vs. control; dbn35 vs. control.

First Clicked Same-Wiki Result’s Position

Maximum Clicked Position for Same-Wiki Results

PaulScore

PaulScore is a measure of search result relevance that takes into account the positions of the clicked results. It is computed via the following steps (a sketch implementation follows the list):

  1. Pick a scoring factor \(0 < F < 1\) (larger values of \(F\) increase the weight of clicks on lower-ranked results).
  2. For the \(i\)-th search session \(S_i\) \((i = 1, \ldots, n)\) containing \(m\) queries \(Q_1, \ldots, Q_m\) and search result sets \(\mathbf{R}_1, \ldots, \mathbf{R}_m\):
      1. For each \(j\)-th search query \(Q_j\) with result set \(\mathbf{R}_j\), let \(\nu_j\) be the query score: \[\nu_j = \sum_{k~\in~\{\text{0-based positions of clicked results in}~\mathbf{R}_j\}} F^k.\]
      2. Let the user's average query score \(\bar{\nu}_{(i)}\) be \[\bar{\nu}_{(i)} = \frac{1}{m} \sum_{j = 1}^m \nu_j.\]
  3. Then the PaulScore is the average of all users' average query scores: \[\text{PaulScore}(F)~=~\frac{1}{n} \sum_{i = 1}^n \bar{\nu}_{(i)}.\]
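The Python sketch below transcribes these steps directly. The input representation (each session as a list of queries, each query as the list of 0-based positions of its clicked results) is an assumption for illustration, not the report's actual data format:

```python
def paul_score(sessions, f):
    """Compute PaulScore(F) as defined above.

    sessions: list of sessions; each session is a list of queries;
    each query is a list of 0-based positions of clicked results.
    f: scoring factor, 0 < f < 1.
    """
    session_means = []
    for queries in sessions:
        if not queries:
            continue  # skip sessions with no queries
        # Query score: sum of f^k over clicked positions k.
        query_scores = [sum(f ** k for k in clicks) for clicks in queries]
        # The user's average query score.
        session_means.append(sum(query_scores) / len(query_scores))
    # PaulScore: average of all users' average query scores.
    return sum(session_means) / len(session_means)

# Example: first session clicks position 0 (query 1) and position 2 (query 2);
# second session clicks positions 0 and 1 in its single query.
print(paul_score([[[0], [2]], [[0, 1]]], f=0.5))  # (0.625 + 1.5) / 2 = 1.0625
```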

We can calculate a confidence interval for PaulScore\((F)\) by approximating its distribution via bootstrapping.
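For instance, a percentile bootstrap over sessions could look like the following sketch, which resamples whole sessions with replacement and reuses the `paul_score` function above (the exact procedure used for the report may differ):

```python
import random

def paul_score_ci(sessions, f, n_boot=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap (1 - alpha) confidence interval for PaulScore(F)."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_boot):
        # Resample whole sessions with replacement.
        resample = [rng.choice(sessions) for _ in sessions]
        scores.append(paul_score(resample, f))
    scores.sort()
    lower = scores[int((alpha / 2) * n_boot)]
    upper = scores[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper
```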