Our current DBN (dynamic Bayesian network) training requires at least 10 instances of a particular query before that query is used to create training data. We tested raising that minimum to 20 instances (dbn20) or 35 instances (dbn35). The control is the production LTR (learning-to-rank) model, trained with the current standard DBN minimum of 10. Two further groups, dbn20-i and dbn35-i, interleave the results from control with those from dbn20 and dbn35, respectively.
This test ran from 02 November 2017 to 09 November 2017 on enwiki. There were 5 test groups: control, dbn20, dbn35, dbn20-i, dbn35-i. This report includes fulltext searches. Refer to Phabricator ticket T177520 for more details.
Fulltext search events: Deleted 4,474 duplicate events. Deleted 0 unnecessary check-in events, keeping only the last check-in. Deleted 0 events with negative load time. Removed 7,805 orphan (SERP-less) events. Removed 0 sessions falling into multiple test groups. Removed 340 sessions with more than 50 searches.
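The cleaning steps listed above can be sketched roughly as follows. This is a minimal illustration and not the pipeline used for this report: the field names (`event_id`, `session_id`, `search_id`, `page_id`, `action`, `ts`, `group`, `load_time`) are hypothetical, and keeping the last check-in per session and page is an assumption about what "the last one" refers to.

```python
from collections import defaultdict

MAX_SEARCHES_PER_SESSION = 50  # threshold stated in the report


def clean_events(events):
    """Apply the cleaning steps to a list of event dicts (hypothetical schema)."""
    # 1. Drop duplicated events (same event_id seen more than once).
    seen, deduped = set(), []
    for e in events:
        if e["event_id"] not in seen:
            seen.add(e["event_id"])
            deduped.append(e)

    # 2. Keep only the last check-in per (session, page); assumed grouping.
    last_checkin = {}
    for e in deduped:
        if e["action"] == "checkin":
            key = (e["session_id"], e["page_id"])
            if key not in last_checkin or e["ts"] > last_checkin[key]["ts"]:
                last_checkin[key] = e
    kept = [
        e for e in deduped
        if e["action"] != "checkin"
        or last_checkin[(e["session_id"], e["page_id"])] is e
    ]

    # 3. Drop events with a negative load time.
    kept = [e for e in kept if e.get("load_time") is None or e["load_time"] >= 0]

    # 4. Drop orphan events: their search has no searchResultPage (SERP) event.
    serp_ids = {e["search_id"] for e in kept if e["action"] == "searchResultPage"}
    kept = [e for e in kept if e["search_id"] in serp_ids]

    # 5 and 6. Drop sessions in several test groups or with too many searches.
    groups, searches = defaultdict(set), defaultdict(set)
    for e in kept:
        groups[e["session_id"]].add(e["group"])
        if e["action"] == "searchResultPage":
            searches[e["session_id"]].add(e["search_id"])
    bad = {s for s, g in groups.items() if len(g) > 1}
    bad |= {s for s, q in searches.items() if len(q) > MAX_SEARCHES_PER_SESSION}
    return [e for e in kept if e["session_id"] not in bad]
```

The order of the steps mirrors the order in which the counts are reported; a different order could change the individual counts even if the final data set were the same.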
| Days | Events | Sessions | Page IDs | SERPs | Unique search queries | Searches | Same-wiki clicks | Other clicks |
|---|---|---|---|---|---|---|---|---|
| 8 | 1,126,914 | 292,966 | 794,763 | 606,705 | 645,589 | 501,913 | 179,414 | 0 |
Event type identifies the context in which the event was created. Every time a new search is performed, a searchResultPage event is created. When the user clicks a link in the results, a visitPage event is created. When the user has dwelled for N seconds, a checkin event occurs. If the user clicks an interwiki result provided by TextCat language detection, there is an iwclick event. If the user clicks on a sister search result from the sidebar, that’s an ssclick. If the user interacts with a result to explore similar pages, categories, or translations, there are hover-on, hover-off, and esclick events.
The goal here is to see whether the proportions of operating system (OS) and browser usage are similar between the groups. To aid decision making, a Bayes factor is computed for each row. If one group has a very different OS/browser share breakdown (Bayes factor > 2), there might be something wrong with the implementation that biased, or is still biasing, the sampling in favor of some OSes/browsers. Note that for brevity we show only the top 10 OSes/browsers, and that we don’t actually expect the numbers to differ, so this is included purely as a diagnostic.
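To make the diagnostic concrete, here is a minimal sketch of one way a Bayes factor for a single row could be computed. It is not necessarily the method behind this report: it assumes just two groups, binomial counts, and uniform Beta(1, 1) priors, and compares "each group has its own share" against "both groups have the same share". The function name and arguments are illustrative.

```python
from math import exp, lgamma, log


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_bf10(x1, n1, x2, n2):
    """Log Bayes factor for 'different shares' (H1) against 'same share' (H0).

    x1 of n1 users in group 1 and x2 of n2 users in group 2 use a given
    OS or browser. Uniform Beta(1, 1) priors are assumed, so the binomial
    coefficients cancel and only Beta functions remain.
    """
    log_m1 = log_beta(x1 + 1, n1 - x1 + 1) + log_beta(x2 + 1, n2 - x2 + 1)
    log_m0 = log_beta(x1 + x2 + 1, n1 + n2 - x1 - x2 + 1)
    return log_m1 - log_m0


# Flag a row when the evidence for a difference exceeds the report's threshold.
flagged = log_bf10(10, 100, 90, 100) > log(2)
```

Working on the log scale avoids overflow when the counts are large; `exp(log_bf10(...))` recovers the Bayes factor itself when it is of moderate size.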
Operating systems:
Browsers:
PaulScore is a measure of search results’ relevancy which takes into account the position of the clicked results, and is computed via the following steps:
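As a rough illustration, the sketch below assumes the usual formulation of PaulScore: pick a scoring factor F with 0 < F < 1, score each search as the sum of F^k over the 0-based positions k of its clicked results, and average those scores over all searches. The function names and the input layout (one list of clicked positions per search) are hypothetical.

```python
def search_score(click_positions, f):
    """Sum of f**k over the 0-based positions k of the clicked results."""
    return sum(f ** k for k in click_positions)


def paulscore(searches, f):
    """Average search score; `searches` holds one list of clicked positions per search."""
    if not 0 < f < 1:
        raise ValueError("scoring factor must satisfy 0 < F < 1")
    if not searches:
        raise ValueError("need at least one search")
    return sum(search_score(clicks, f) for clicks in searches) / len(searches)


# Two searches: one click on the top result, one click on the second result.
example = paulscore([[0], [1]], f=0.5)  # (0.5**0 + 0.5**1) / 2 = 0.75
```

Under this definition a click on the first result always contributes 1, and larger values of F give more weight to clicks further down the list.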
We can calculate the confidence interval of PaulScore\((F)\) by approximating its distribution via bootstrapping.
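A minimal sketch of a percentile bootstrap is shown here, assuming the per-search scores have already been computed and that PaulScore is their mean. The number of resamples, the confidence level, and the fixed seed are illustrative choices, not values taken from this report.

```python
import random


def bootstrap_ci(scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `scores`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(scores)
    # Resample with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(n_boot))
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


# Hypothetical per-search scores for one test group.
lower, upper = bootstrap_ci([1.0, 0.5, 0.0, 1.5, 0.25, 1.0, 0.0, 0.5])
```

Comparing intervals computed this way for each test group against the control gives a sense of whether an observed difference in PaulScore exceeds resampling noise.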