Wikimedia Privacy Engineering

Pageviews Differential Privacy — Current

Welcome to the Wikimedia Foundation's differentially-private daily pageview data release!

This dataset uses differential privacy to safely facilitate the large-scale release of pageview data at a low level of granularity, allowing users to conduct analysis on hundreds of thousands of pages per day on a country-project level.

Differential privacy injects a controlled amount of statistical noise into data before it is released. This noise impedes attempts to recover information about any single individual in the dataset without significantly changing the conclusions that can be drawn from the data.

You can find more information about this project on its metawiki homepage.

Update #1: From 15 Feb 2024 on, this dataset includes information about countries deemed "medium risk" and "higher risk" on the updated Country and Territory Protection List. The per-risk-level epsilon values, release thresholds, and related details are described in the rest of this document.

Update #2: During dataset upgrades to add Wikidata QIDs in early March 2024, we discovered two pre-existing dataset bugs. First, due to an internal database naming error, data from the United States was not published from 6 Feb 2023 to 19 Sep 2023. Second, due to pipeline orchestration bugs, seven days of data (19 Jun 2023; 25 Oct 2023; and 13, 17, 19, 23, and 27 Nov 2023) were previously missing from the dataset. Due to WMF's data retention guidelines, the identifying data that would enable an exact recalculation of those days has been dropped. To at least partially rectify the situation, we've made smaller datasets for those countries and days available using the same techniques (and with the same privacy guarantees) as the historical pageview dataset that spans from 2017 to early 2023. Please see the linked documentation for precise descriptions of that algorithm's privacy and utility guarantees.

To download dataset files, go to the current dataset homepage.

Dataset characteristics

  • Time range: 6 Feb 2023 - present
  • Time granularity: daily
  • Data features:
    • country (excluding countries with a "not published" risk classification on the Country and Territory Protection List)
    • project (e.g. “en.wikipedia”, “wikidata”, “zh.wikibooks”)
    • page_id (numerical ID for a given page — together with project, this forms a unique identifier)
    • page_title (the page title for a given page_id)
    • item_id (the cross-project Wikidata QID for the page if one exists; otherwise an empty string)
    • gbc (the differentially-private number of pageviews this page_id received)
  • Dataset structure:
country_project_page/
|-> 2023-2-6.csv
|-> 2023-2-7.csv
...
|-> <year>-<month>-<day>.csv
  • Hive table access (all days): differential_privacy.country_project_page
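
As a minimal sketch of working with this layout: file names follow the unpadded <year>-<month>-<day>.csv pattern shown above, so a small helper can build the name for a given date. The reader below assumes (this page does not confirm it) that each CSV starts with a header row naming the six features listed above and that gbc is integer-valued. The names daily_filename and read_rows are hypothetical helpers, and the sample row is made up.

```python
import csv
import io
from datetime import date


def daily_filename(day: date) -> str:
    """Build the file name for one day, e.g. 2023-2-6.csv (no zero padding)."""
    return f"{day.year}-{day.month}-{day.day}.csv"


def read_rows(handle):
    """Yield one dict per row, with gbc parsed as an integer.

    Assumption: the file has a header row with the column names
    country, project, page_id, page_title, item_id, gbc.
    """
    for row in csv.DictReader(handle):
        row["gbc"] = int(row["gbc"])
        yield row


# Tiny in-memory stand-in for a real file; the values are made up.
sample = io.StringIO(
    "country,project,page_id,page_title,item_id,gbc\n"
    "France,fr.wikipedia,12345,Exemple,Q1,120\n"
)
rows = list(read_rows(sample))
```

Because (project, page_id) is the unique page identifier, joins across days are safer on that pair than on page_title.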


Data creation pipeline

  1. Using the client-side differential privacy cookie, collect a boolean flag representing whether a given pageview was one of the first 10 unique pageviews for a given device on a given day.
  2. For date YYYY-MM-DD, retrieve all pageviews where the differential privacy cookie is true and that come from pages that have received >150 global pageviews on that day.
  3. Create a key space of all possible country-project-page_ids (we will calculate noise for all parts of this ~125 million-row space)
  4. Do a group-by + count on the pageview data, adding Gaussian noise (zero-concentrated differential privacy; rho=1.505E-2 for lower risk countries, rho=6.166E-4 for medium risk countries, rho=1.546E-4 for higher risk countries; sensitivity=10 unique pageviews) to each row of the dataset
  5. Calculate internal error metrics to ensure that we don't have data drift
  6. Threshold the data so that only rows with >90 pageviews for lower risk countries, >550 pageviews for medium risk countries, and >1000 pageviews for higher risk countries are released.
  7. Share the final table with the world!
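
Steps 3, 4, and 6 can be illustrated with a toy sketch for a single risk tier. This is not the production pipeline: noisy_release is a hypothetical function, the input tuples are made up, sigma is an illustrative value, and Python's random module is not a suitable noise source for a real differentially private release. The point it shows is that noise is drawn for every key in the key space, including keys with no pageviews, before thresholding.

```python
import random
from collections import Counter


def noisy_release(pageviews, key_space, sigma, threshold, rng):
    """Toy version of steps 3, 4, and 6 for a single risk tier.

    pageviews: iterable of (country, project, page_id) tuples that already
               passed the cookie and ingestion filters (steps 1-2).
    key_space: every (country, project, page_id) key to consider,
               including keys with no pageviews at all.
    sigma:     standard deviation of the Gaussian noise.
    threshold: release threshold; only noisy counts above it are kept.
    rng:       a random.Random instance (not a secure noise source).
    """
    true_counts = Counter(pageviews)  # group-by + count
    released = {}
    for key in key_space:
        # Noise is added to every key, so zero-count keys can also appear.
        noisy = true_counts[key] + rng.gauss(0.0, sigma)
        if noisy > threshold:
            released[key] = round(noisy)
    return released


# Made-up example: two pages with traffic, one page with none.
views = [("FR", "fr.wikipedia", 1)] * 300 + [("FR", "fr.wikipedia", 2)] * 40
keys = [("FR", "fr.wikipedia", page_id) for page_id in (1, 2, 3)]
out = noisy_release(views, keys, sigma=18.2, threshold=90, rng=random.Random(0))
```

Iterating over the full key space rather than only the observed keys matters for privacy: if noise were added only to rows that exist, the mere presence of a row would reveal that someone viewed that page from that country.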

See the code for releasing this data on Wikimedia's GitLab instance.



Privacy parameters

  • Noise type: Gaussian zCDP
  • Sensitivity: 10 unique pageviews
  • Privacy budget: For lower risk, rho = 1.505E-2 (roughly equivalent to epsilon = 1, delta = 1e-07). For medium risk, rho = 6.166E-4 (roughly equivalent to epsilon = 0.2, delta = 1e-07). For higher risk, rho = 1.546E-4 (roughly equivalent to epsilon = 0.1, delta = 1e-07).
  • Ingestion threshold: 150 global pageviews (from Wikimedia REST API)
  • Release threshold: For lower risk, 90 pageviews. For medium risk, 550 pageviews. For higher risk, 1000 pageviews.
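
The "roughly equivalent" epsilon values can be checked with the standard conversion from zero-concentrated differential privacy to (epsilon, delta)-differential privacy, epsilon = rho + 2 * sqrt(rho * ln(1 / delta)). This page does not say which conversion was used to produce its figures, so treat the sketch below as one plausible reading; it does reproduce epsilon of about 1, 0.2, and 0.1 at delta = 1e-07. The function name zcdp_to_epsilon is hypothetical.

```python
import math

DELTA = 1e-7
# rho values quoted in this document, by risk level.
RHO = {"lower": 1.505e-2, "medium": 6.166e-4, "higher": 1.546e-4}


def zcdp_to_epsilon(rho: float, delta: float) -> float:
    """Standard bound: rho-zCDP implies (epsilon, delta)-DP with
    epsilon = rho + 2 * sqrt(rho * ln(1 / delta))."""
    return rho + 2.0 * math.sqrt(rho * math.log(1.0 / delta))


epsilons = {risk: zcdp_to_epsilon(rho, DELTA) for risk, rho in RHO.items()}
```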

For lower risk countries, pageview counts are 95% likely to be within 35.7 pageviews of the true value. For medium risk countries, the corresponding 95% bound is 176.5 pageviews; for higher risk countries, it is 352.5 pageviews.
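
These bounds can be reproduced from the privacy parameters. For the Gaussian mechanism under rho-zCDP, the noise standard deviation is sigma = (L2 sensitivity) / sqrt(2 * rho), and the 95% bound is about 1.96 * sigma. The sketch below assumes that "10 unique pageviews" translates to an L2 sensitivity of sqrt(10), that is, one device changes 10 distinct rows by 1 each. That reading is an assumption rather than something this page states, but it matches the 35.7, 176.5, and 352.5 figures to within rounding. The helper names are hypothetical.

```python
import math
from statistics import NormalDist

# rho values quoted in this document, by risk level.
RHO = {"lower": 1.505e-2, "medium": 6.166e-4, "higher": 1.546e-4}

# Assumption: 10 unique pageviews touch 10 distinct rows once each,
# so the L2 sensitivity is sqrt(10).
L2_SENSITIVITY = math.sqrt(10)


def gaussian_sigma(rho, l2_sensitivity=L2_SENSITIVITY):
    """Noise standard deviation for the Gaussian mechanism under rho-zCDP."""
    return l2_sensitivity / math.sqrt(2.0 * rho)


def error_bound(rho, confidence=0.95):
    """Half-width of the symmetric interval that contains the noise
    with the given probability."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return z * gaussian_sigma(rho)


bounds = {risk: error_bound(rho) for risk, rho in RHO.items()}
```

The bound is an absolute number of pageviews and does not depend on the size of the true count, so relative error shrinks as pages get more traffic.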



Utility and accuracy

This dataset was optimized to perform well across several utility metrics: median relative and absolute error; percentage of output rows with relative error <10%, <25%, and <50%; spurious rate; and drop rate. For more information about these metrics, you can consult the WMF DP error glossary.

For this data release:

  • Median relative error: ≤6% for lower risk, ≤7% for medium risk, ≤8% for higher risk. Half of all rows differ from their true value by no more than 6-8%, depending on risk level.
  • Median absolute error: ≤14 pageviews for lower risk, ≤70 pageviews for medium risk, ≤140 pageviews for higher risk. Half of all rows differ from their true value by no more than 14-140 pageviews, depending on risk level.
  • Percentage of output rows with relative error <10%: ≥60% across all risk levels. At least 60% of rows are within 10% of the true value.
  • Percentage of output rows with relative error <25%: ≥90% across all risk levels. At least 90% of rows are within 25% of the true value.
  • Percentage of output rows with relative error <50%: ≥95% across all risk levels. At least 95% of rows are within 50% of the true value.
  • Spurious rate: ≤0.05%. At most 1 in 2000 published rows has a true value of 0.
  • Drop rate: ≤0.5%. At most 1 in 200 rows that should have been published is missing.

These metrics are calculated on a daily basis, and are also calculated for continental and subcontinental regions. In order to achieve geographic equity, the vast majority of subcontinental regions must meet rigorous data quality standards (median relative error ≤6%, spurious rate ≤1%, drop rate ≤1%).
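
To make three of these metrics concrete, here is a minimal sketch that computes them from a set of true counts and a set of published noisy counts. The exact definitions live in the WMF DP error glossary; the ones below are a plain reading of the descriptions on this page (for instance, relative error is computed only over published rows with a nonzero true value, which is an assumption). The function name utility_metrics and the example data are made up.

```python
from statistics import median


def utility_metrics(true_counts, published, threshold):
    """true_counts: key -> true pageviews.
    published:   key -> released noisy count.
    Assumes both `published` and the set of rows above the
    threshold are non-empty."""
    rel_errors = [
        abs(noisy - true_counts[key]) / true_counts[key]
        for key, noisy in published.items()
        if true_counts.get(key, 0) > 0
    ]
    # Spurious: published even though the true value is 0.
    spurious = sum(1 for key in published if true_counts.get(key, 0) == 0)
    # Dropped: true value is above the threshold but the row is missing.
    should_publish = [k for k, true in true_counts.items() if true > threshold]
    dropped = sum(1 for key in should_publish if key not in published)
    return {
        "median_relative_error": median(rel_errors),
        "spurious_rate": spurious / len(published),
        "drop_rate": dropped / len(should_publish),
    }


# Made-up example with a release threshold of 90.
true_counts = {"A": 100, "B": 0, "C": 200, "D": 95}
published = {"A": 110, "B": 91, "C": 190}
metrics = utility_metrics(true_counts, published, threshold=90)
```

In this toy example, row B is spurious (published with a true value of 0) and row D is dropped (true value above the threshold, but noise pushed it below).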



Caveats

  • The privacy guarantee of this dataset is that the contribution of the first 10 unique pageviews on a given user's browser to the data will be obfuscated. If a user clears their cookies, uses multiple devices, or uses multiple browsers, they might incur additional privacy loss.
  • This dataset only considers the first 10 unique pageviews for each user, and only on pages that garner >150 pageviews. Because non-unique pageviews, unique pageviews beyond the first 10, bot traffic, and lesser-visited pages are excluded, total pageview counts are significantly lower than the real values. On a row-by-row basis, values are closer to the ground truth.
  • Differential privacy necessarily involves adding random noise to data outputs, which means that data in this dataset may not exactly mirror the truth, and some values may be spurious (i.e. that country-project-page_id tuple might not appear in the underlying dataset). We've introduced the release threshold to deal with this fact, but keep in mind that these values are not 100% exact and some rows may be incorrect.
    • Values that are closer to the release threshold of 90 (or 550, or 1000) are likelier to be spurious.
  • There are more page titles than page_ids (because of title changes, redirects, etc.). We calculate this aggregation on page_ids and join page titles after, so the title in the dataset might not be the canonical non-redirect name.
  • From 25 May 2023 on, a more aggressive filter was used to only retrieve human-contributed views, not bot/web indexer views. Data before this date may contain bot-influenced counts.
  • Data from 6 Feb 2023 may be incomplete.
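
The caveat about values near the release threshold can be illustrated numerically. For a row whose true count is 0, the published value is pure Gaussian noise, so the chance of it reaching a given value falls off steeply as that value moves above the threshold. The sketch below uses the lower risk parameters and assumes a noise scale of sqrt(10) / sqrt(2 * rho); the sqrt(10) L2 sensitivity is an assumption rather than something stated on this page, and prob_exceeds is a hypothetical helper.

```python
import math


def prob_exceeds(value, sigma, true_count=0):
    """P(true_count + N(0, sigma^2) > value), via the Gaussian tail."""
    z = (value - true_count) / (sigma * math.sqrt(2.0))
    return 0.5 * math.erfc(z)


# Assumption: lower risk noise scale with rho = 1.505E-2 and an
# L2 sensitivity of sqrt(10), giving sigma of roughly 18.2.
SIGMA_LOWER = math.sqrt(10) / math.sqrt(2 * 1.505e-2)

# Chance that a zero-count row lands above the release threshold of 90 ...
p_at_threshold = prob_exceeds(90, SIGMA_LOWER)
# ... versus above 120, well clear of the threshold.
p_well_above = prob_exceeds(120, SIGMA_LOWER)
```

Under these assumptions, a published lower risk value of 120 is far less likely to be pure noise than a published value of 91, which is why rows just above the threshold deserve the most caution.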


Other DP datasets

You can find data from 9 Feb 2017 - 5 Feb 2023 in the country_project_page_historical dataset, and data from 1 July 2015 - 8 Feb 2017 in the country_project_page_historical_pre_2017 dataset.