Phab ticket: https://phabricator.wikimedia.org/T174396

query <- "SELECT
  timestamp,
  wiki,
  event_action AS action,
  event_numberOfResults, 
  event_resultSetType,
  event_platform,
  event_searchSessionToken,
  event_userSessionToken,
  userAgent
FROM MobileWebSearch_12054448
WHERE LEFT(timestamp, 8) >= '20170701'
  AND LEFT(timestamp, 8) < '20170901'
;
"
message("Using SSH tunnel & connection to Analytics-Store...")
mobilewebsearch_july_august <- wmf::mysql_read(query, "log", con = con)
message("Closing connection...")
wmf::mysql_close(con)
save(mobilewebsearch_july_august, file = "mobilewebsearch_july_august.RData")
load("data/mobilewebsearch_july_august.RData")
events <- mobilewebsearch_july_august

events <- events %>%
  mutate(
    timestamp = lubridate::ymd_hms(timestamp),
    date = as.Date(timestamp)
  ) %>%
  arrange(event_userSessionToken, event_searchSessionToken, timestamp)

# Parse useragent
events <- events %>%
  cbind(., purrr::map_df(.$userAgent, ~ wmf::null2na(jsonlite::fromJSON(.x, simplifyVector = FALSE)))) %>%
  mutate(
    browser = paste(browser_family, browser_major),
    os = case_when(
      is.na(os_major) ~ os_family,
      !is.na(os_major) & !is.na(os_minor) ~ paste0(os_family, " ", os_major, ".", os_minor),
      TRUE ~ paste(os_family, os_major)
    )
  )

Problem

We’ve seen a pattern that the mobile usage increase during the weekends (e.g. mobile web pageviews). However, our dashboard didn’t show this pattern in August. The following graph replicate what we saw from the dashboard. The weekend bump pattern is clear in the period of Jul 19 - Aug 14, but not in other times. If we look at the dashboard, we can see this pattern from time to time (e.g. Jan - Feb 2017).

events %>%
  filter(action %in% c('click-result', 'session-start', 'impression-results')) %>%
  #filter(action == 'click-result') %>%
  group_by(date, action) %>%
  tally %>%
  ggplot(aes(x=date, y=n, colour=action)) + 
  geom_line(size=1.2) +
  scale_x_date(name = "Date", date_breaks = "1 week", date_labels = "%A\n%b %d") +
  scale_y_continuous(labels=polloi::compress, name = "Events") +
  ggtitle("Mobile web search events, prefix only") +
  theme(legend.position = "bottom") +
  geom_vline(xintercept = as.numeric(as.Date("2017-07-19")),
               linetype = "dashed", color = "black") +
  geom_vline(xintercept = as.numeric(as.Date("2017-08-14")),
               linetype = "dashed", color = "black")

Check event logging

Fetch mobile web search (Schema:MobileWebSearch) data from July 1st - August 31st. User session token is used to identify users’ full interaction with the search field. The number of unique user session token is 78488. Search session token is used to identify a search, which starts at the point a distinct search field value is sent to the server or the data in the search field will correspond to a result set available in the client side cache. A search session correspond to an API call. There may be multiple search session token generated as a user type in the search field. The number of unique searchSessionToken is 623391. A user session can have multiple search sessions. It’s worth mentioning that there is no event logging schema tracking full-text search on mobile web currently.

Number of user sessions and search sessions

The following graph shows the number of user sessions, identified by user session token. We can see weekend bumps from Jul 1st to Aug 14 in all wikis.

events %>%
  mutate(is_enwiki = ifelse(wiki == "enwiki", "enwiki", "other wiki")) %>%
  group_by(date, is_enwiki) %>%
  summarize(n_session = length(unique(event_userSessionToken))) %>%
  tidyr::spread(is_enwiki, n_session) %>%
  mutate(all = enwiki + `other wiki`) %>%
  tidyr::gather(Group, n_session, -date) %>%
  ggplot(aes(x=date, y=n_session, colour=Group)) + 
  geom_line(size=1.2) +
  scale_x_date(name = "Date", date_breaks = "1 week", date_labels = "%A\n%b %d") +
  scale_y_continuous(labels=polloi::compress, name = "Number of user sessions") +
  ggtitle("Number of user sessions on mobile web, prefix only") +
  theme(legend.position = "bottom") +
  # geom_vline(xintercept = as.numeric(as.Date("2017-07-19")),
               # linetype = "dashed", color = "black") +
  geom_vline(xintercept = as.numeric(as.Date("2017-08-14")),
               linetype = "dashed", color = "black")

The following graph shows the number of search sessions, identified by search session token. We can see weekend bumps from Jul 19 to Aug 14 in all wikis.

events %>%
  mutate(is_enwiki = ifelse(wiki == "enwiki", "enwiki", "other wiki")) %>%
  group_by(date, is_enwiki) %>%
  summarize(n_search = length(unique(event_searchSessionToken))) %>%
  tidyr::spread(is_enwiki, n_search) %>%
  mutate(all = enwiki + `other wiki`) %>%
  tidyr::gather(Group, n_search, -date) %>%
  ggplot(aes(x=date, y=n_search, colour=Group)) + 
  geom_line(size=1.2) +
  scale_x_date(name = "Date", date_breaks = "1 week", date_labels = "%A\n%b %d") +
  scale_y_continuous(labels=polloi::compress, name = "Number of search sessions") +
  ggtitle("Number of search sessions on mobile web, prefix only") +
  theme(legend.position = "bottom") +
  geom_vline(xintercept = as.numeric(as.Date("2017-07-19")),
  linetype = "dashed", color = "black") +
  geom_vline(xintercept = as.numeric(as.Date("2017-08-14")),
               linetype = "dashed", color = "black")

Break down by volume of search sessions per user session

We compute the number of search sessions for each user session. Then we put the user sessions whose number of search sessions is higher than the 90th percentile into “high volume” group and the others into “low volume” group. The following graphs show weekend bumps in the low volume group before August 14.

temp <- events %>%
  group_by(event_userSessionToken) %>%
  summarize(n_search = length(unique(event_searchSessionToken))) %>%
  mutate(volume = ifelse(n_search < quantile(n_search, probs = 0.9), "low volume", "high volume"))

events %>%
  left_join(temp, by = "event_userSessionToken") %>%
  group_by(date, volume) %>%
  summarize(n_session = length(unique(event_userSessionToken))) %>%
  ggplot(aes(x=date, y=n_session, colour=volume)) + 
  geom_line(size=1.2) +
  scale_x_date(name = "Date", date_breaks = "1 week", date_labels = "%A\n%b %d") +
  scale_y_continuous(labels=polloi::compress, name = "Number of user sessions") +
  labs(title = "Number of user sessions on mobile web, by volume of search per session", subtitle = "prefix only") +
  theme(legend.position = "bottom") +
  # geom_vline(xintercept = as.numeric(as.Date("2017-07-19")),
               # linetype = "dashed", color = "black") +
  geom_vline(xintercept = as.numeric(as.Date("2017-08-14")),
               linetype = "dashed", color = "black")

events %>%
  left_join(temp, by = "event_userSessionToken") %>%
  group_by(date, volume) %>%
  summarize(n_search = length(unique(event_searchSessionToken))) %>%
  ggplot(aes(x=date, y=n_search, colour=volume)) + 
  geom_line(size=1.2) +
  scale_x_date(name = "Date", date_breaks = "1 week", date_labels = "%A\n%b %d") +
  scale_y_continuous(labels=polloi::compress, name = "Number of search sessions") +
  labs(title = "Number of search sessions on mobile web, by volume of search per session", subtitle = "prefix only") +
  theme(legend.position = "bottom") +
  # geom_vline(xintercept = as.numeric(as.Date("2017-07-19")),
               # linetype = "dashed", color = "black") +
  geom_vline(xintercept = as.numeric(as.Date("2017-08-14")),
               linetype = "dashed", color = "black")

Break down by browser, OS and wiki

We don’t see any abnormality from the following tables.

Number of user sessions by browser and OS

os Proportion
Android 4.1 1.8% (16669)
Android 4.2 3.0% (28532)
Android 4.4 11.0% (104740)
Android 5.0 3.9% (37223)
Android 5.1 11.0% (104220)
Android 6.0 13.9% (132079)
Android 7.0 5.5% (52408)
iOS 10.2 3.5% (33652)
iOS 10.3 30.8% (293094)
iOS 9.3 5.1% (48113)
Other OSes 10.5% (100103)
browser Proportion
Android 4 4.7% (44258)
Chrome Mobile 30 2.9% (27105)
Chrome Mobile 58 2.5% (23407)
Chrome Mobile 59 12.5% (118727)
Chrome Mobile 60 1.9% (18281)
Chrome Mobile iOS 59 1.8% (17513)
Mobile Safari 10 32.0% (304454)
Mobile Safari 9 5.0% (47709)
Other browsers 32.3% (307044)
Samsung Internet 3 2.1% (20319)
Samsung Internet 5 2.3% (22016)

Number of user sessions and search sessions by wiki (top 10):

Number of user session

wiki prop
enwiki 47.3% (37158)
eswiki 7.9% (6172)
dewiki 6.6% (5215)
jawiki 5.1% (3976)
ruwiki 4.7% (3687)
frwiki 4.0% (3109)
itwiki 3.0% (2365)
ptwiki 2.7% (2089)
zhwiki 2.6% (2079)
arwiki 2.5% (1970)

Number of search sessions

wiki prop
enwiki 46.6% (290412)
eswiki 9.8% (61300)
dewiki 6.3% (39142)
jawiki 4.4% (27722)
ruwiki 4.4% (27340)
frwiki 4.2% (25898)
ptwiki 3.4% (21018)
arwiki 3.0% (18469)
itwiki 2.7% (17094)
zhwiki 1.8% (11523)

Number of events per session by action type

Sam point me to T167236 as it might be related to this problem. So let’s count the number of events per session for each type of action, then group by counts to see if our problem is similar to T167236. Unlike T167236, we don’t have a “99.99% of sessions only consist of one pageview” situation here.

After we check the source code, we are sure that the EventLogging-provided class that we use for logging events (mw.eventLog.Schema) doesn’t sample per-event, it samples per pageview. In the case of the MobileWebSearch instrumentation, 1 in every 1000 pageviews will log all events generated by the instrumentation in that pageview, i.e. complete flows through the funnel are captured by the instrumentation. Thus, our mobile search pattern was not affected by a sampling issue.

events %>%
  group_by(action, event_userSessionToken) %>%
  summarise(counts = n()) %>%
  mutate(counts = ifelse(counts >= 5, "5+ events", counts)) %>%
  group_by(action, counts) %>%
  summarise(n_session = n()) %>%
  ggplot(aes(x = counts, y=n_session)) +
  geom_bar(stat = "identity", position = "dodge") + 
  scale_y_continuous(labels = polloi::compress) +
  geom_text(aes(label = n_session, vjust = -0.5), position = position_dodge(width = 1), size = 3) +
  labs(y = "Number of sessions", x = "Number of Events per session", title = "Distribution of number of events per session") + 
  facet_wrap(~action, ncol = 3)

Check webrequest

#Count daily SERP
start_date <- as.Date("2017-07-01")
end_date <- as.Date("2017-09-01")
daily_serp <- do.call(rbind, lapply(seq(start_date, end_date, "day"), function(date) {
  cat("Fetching webrequest data from ", as.character(date), "\n")
  clause_data <- wmf::date_clause(date)
  query <- paste("USE wmf;
    SELECT
      date,
      COUNT(DISTINCT CONCAT(client_ip, user_agent, query)) AS n_search,
      COUNT(*) AS n_serp, -- a lot of NULL searchToken, so don't count DISTINCT CONCAT(client_ip, user_agent, searchToken)
      COUNT(DISTINCT CONCAT(client_ip, user_agent)) AS n_user
    FROM (
      SELECT 
        CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) AS date,
        access_method,
        client_ip, 
        user_agent,
        agent_type,
        PARSE_URL(CONCAT('http://', uri_host, uri_path, uri_query), 'QUERY', 'search') AS query,
        PARSE_URL(CONCAT('http://', uri_host, uri_path, uri_query), 'QUERY', 'searchToken') AS searchToken -- a lot of NULL
      FROM webrequest",
    clause_data$date_clause,
    "AND webrequest_source = 'text'
    AND access_method = 'mobile web'
    AND agent_type = 'user'
    AND is_pageview
    -- flag for pageviews that are search results pages
    AND page_id IS NULL
    AND (
      uri_path = '/wiki/Special:Search'
      OR (
        uri_path = '/w/index.php'
        AND (
          LENGTH(PARSE_URL(CONCAT('http://', uri_host, uri_path, uri_query), 'QUERY', 'search')) > 0
          OR LENGTH(PARSE_URL(CONCAT('http://', uri_host, uri_path, uri_query), 'QUERY', 'searchToken')) > 0
        )
      )
    )
  ) AS serp
  GROUP BY date;") 
  results <- wmf::query_hive(query)
  return(results)
}))
readr::write_rds(daily_serp, "data/T174396/daily_serp.rds", "gz")
# Local
system("scp chelsyx@stat5:~/data/T174396/daily_serp.rds data/")


#Count daily Pageviews from SERP
start_date <- as.Date("2017-07-01")
end_date <- as.Date("2017-09-01")
daily_pv_from_serp <- do.call(rbind, lapply(seq(start_date, end_date, "day"), function(date) {
  cat("Fetching webrequest data from ", as.character(date), "\n")
  clause_data <- wmf::date_clause(date)
  query <- paste("USE wmf;
    SELECT
      date,
      COUNT(DISTINCT CONCAT(client_ip, user_agent, query)) AS n_search,
      COUNT(*) AS n_pv, 
      COUNT(DISTINCT CONCAT(client_ip, user_agent)) AS n_user
    FROM (
      SELECT 
        CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) AS date,
        access_method,
        client_ip, 
        user_agent,
        agent_type,
        PARSE_URL(referer, 'QUERY', 'search') AS query,
        PARSE_URL(referer, 'QUERY', 'searchToken') AS searchToken
      FROM webrequest",
    clause_data$date_clause,
    "AND webrequest_source = 'text'
    AND access_method = 'mobile web'
    AND agent_type = 'user'
    AND is_pageview
    -- only those that have been referred by a search results page:
    AND referer_class = 'internal'
    AND (
      LENGTH(PARSE_URL(referer, 'QUERY', 'search')) > 0
      OR LENGTH(PARSE_URL(referer, 'QUERY', 'searchToken')) > 0
    )
  ) AS pv
  GROUP BY date;")  
  results <- wmf::query_hive(query)
  return(results)
}))
readr::write_rds(daily_pv_from_serp, "data/T174396/daily_pv_from_serp.rds", "gz")
# Local
system("scp chelsyx@stat5:~/data/T174396/daily_pv_from_serp.rds data/")

The following two graphs show the number of full-text search on mobile web and pageviews from full-text search result pages. Interestingly, they has the same pattern as desktop full-text search – low in the weekends but high in the weekdays.

daily_serp <- readr::read_rds("data/daily_serp.rds") %>%
  mutate(date = lubridate::ymd(date)) %>%
  tidyr::gather(Group, counts, -date)
daily_pv_from_serp <- readr::read_rds("data/daily_pv_from_serp.rds") %>%
  mutate(date = lubridate::ymd(date)) %>%
  tidyr::gather(Group, counts, -date)

daily_serp %>%
  ggplot(aes(x=date, y=counts, colour=Group)) + 
  geom_line(size=1.2) +
  scale_x_date(name = "Date", date_breaks = "1 week", date_labels = "%A\n%b %d") +
  scale_y_continuous(labels=polloi::compress, name = "Counts") +
  ggtitle("Daily fulltext search on mobile web, no automata") +
  theme(legend.position = "bottom")

daily_pv_from_serp %>%
  ggplot(aes(x=date, y=counts, colour=Group)) + 
  geom_line(size=1.2) +
  scale_x_date(name = "Date", date_breaks = "1 week", date_labels = "%A\n%b %d") +
  scale_y_continuous(labels=polloi::compress, name = "Counts") +
  ggtitle("Daily pageviews on mobile web from fulltext SERP, no automata") +
  theme(legend.position = "bottom")

Discussion

  • From our dashboard and the graphs above, we can see the weekend bumps in prefix search on mobile web from time to time (e.g. Jan - Feb 2017). And if we count tokens instead of events, and filter out high volume user sessions, this pattern seems to be clearer. Thus besides counting the number of events, we can add a new graph on the dashboard which counts the number of user session tokens and filter sessions by volume (see T176811).

  • We also notice that our data has a clearer pattern before March 29th, so it’s possible that our pattern issue may be related to the drop in the number of search events on March 29th. See T176464 for more details.

  • The pattern of full-text search on mobile web looks suspicious, and further investigation is needed. See T176815 for more details.