Wikimedia Privacy Engineering

Pageviews Differential Privacy — Historical

Welcome to the Wikimedia Foundation's differentially-private daily pageview data release!

This dataset uses differential privacy to safely facilitate the large-scale release of fine-grained pageview data, allowing users to conduct analysis on hundreds of thousands of pages per day at the country-project level.

Differential privacy injects a controlled amount of statistical noise into data before it is released. This noise impedes attempts to recover information about any single individual in the dataset without significantly changing the conclusions that can be drawn from the data.

You can find more information about this project on its metawiki homepage.

To download dataset files, go to the historical dataset homepage.

Dataset characteristics

  • Time range: 9 Feb 2017 - 5 Feb 2023
  • Time granularity: daily
  • Data features:
    • country (excluding countries on the country protection list)
    • project (e.g. “en.wikipedia”, “wikidata”, “zh.wikibooks”, etc.)
    • page_id (numerical ID for a given page — together with project, this forms a unique identifier)
    • page_title (the page title for a given page_id)
    • item_id (If existing, the cross-project Wikidata QID for the page. If not existing, an empty string.)
    • private_count (the differentially-private number of pageviews this page_id received)
  • Dataset structure:
country_project_page/
|-> 2017-2-9.csv
...
|-> <year>-<month>-<day>.csv
...
|-> 2023-2-5.csv
  • Hive table access (all days): differential_privacy.country_project_page_historical
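
As a minimal sketch of reading one day of data, the helper below builds the per-day file name using the unpadded `<year>-<month>-<day>.csv` pattern shown above. It assumes that the CSV columns appear in the same order as the feature list and that files have no header row unless stated otherwise; neither detail is confirmed here, so check a downloaded file first. The names `daily_filename` and `read_day` are illustrative only.

```python
import csv
from datetime import date
from pathlib import Path

# Assumed column order, taken from the feature list above (not confirmed).
COLUMNS = ["country", "project", "page_id", "page_title", "item_id", "private_count"]


def daily_filename(day: date) -> str:
    """Build the per-day file name, e.g. 2017-2-9.csv (no zero padding)."""
    return f"{day.year}-{day.month}-{day.day}.csv"


def read_day(root: Path, day: date, has_header: bool = False) -> list[dict]:
    """Read one daily file from <root>/country_project_page/ into a list of dicts."""
    path = root / "country_project_page" / daily_filename(day)
    with path.open(newline="", encoding="utf-8") as handle:
        # With a header row, let DictReader take the field names from the file.
        reader = csv.DictReader(handle, fieldnames=None if has_header else COLUMNS)
        rows = []
        for row in reader:
            # Noisy counts are parsed as floats in case they are not integers.
            row["private_count"] = float(row["private_count"])
            rows.append(row)
    return rows
```

For example, `read_day(Path("downloads"), date(2017, 2, 9))` would look for `downloads/country_project_page/2017-2-9.csv`, where `downloads` is a hypothetical local directory.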


Data creation pipeline

  1. For date YYYY-MM-DD, query the pageview_hourly table, retrieving all rows for page_ids that have received >150 global pageviews on that day
  2. Create a key space of all possible country-project-page_ids (we will calculate noise for all parts of this ~125 million-row space)
  3. Do a group-by + sum on the pageview data, adding Laplace noise (epsilon=1, noise scale=30) to each row of the dataset
  4. Calculate internal error metrics to ensure that we don't have data drift
  5. Threshold the data so that only rows with >450 pageviews are released.
  6. Share the final table with the world!

See the code for releasing this data on Wikimedia's GitLab instance.
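
The core of steps 2, 3 and 5 can be illustrated with a toy sketch. This is not the production code: it uses only the Python standard library, runs in memory, assumes the input has already been filtered by the ingestion threshold from step 1, and skips the error metrics of step 4. The function names and the input layout are assumptions made for illustration.

```python
import random
from collections import Counter
from itertools import product

NOISE_SCALE = 30.0          # Laplace noise scale from the pipeline description
RELEASE_THRESHOLD = 450.0   # only noisy counts above this value are released


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def release(pageviews, countries, pages, rng,
            scale=NOISE_SCALE, threshold=RELEASE_THRESHOLD):
    """Toy version of steps 2, 3 and 5.

    pageviews: iterable of (country, project, page_id, view_count) rows
    countries: iterable of country codes in the key space
    pages:     iterable of (project, page_id) pairs that passed ingestion
    """
    # Step 3 (first half): group-by + sum of the raw pageviews.
    true_counts = Counter()
    for country, project, page_id, views in pageviews:
        true_counts[(country, project, page_id)] += views

    released = {}
    # Step 2: iterate over the full key space, including rows with no views.
    for country, (project, page_id) in product(countries, pages):
        key = (country, project, page_id)
        # Step 3 (second half): add Laplace noise to every row, even true zeros.
        noisy = true_counts[key] + laplace_noise(scale, rng)
        # Step 5: keep only rows whose noisy count exceeds the release threshold.
        if noisy > threshold:
            released[key] = noisy
    return released
```

Adding noise to every key in the space, rather than only to rows that have data, is what prevents the presence or absence of a row from revealing whether anyone viewed that page from that country.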



Privacy parameters

  • Noise type: Laplace
  • Noise scale: 30
  • Privacy budget: epsilon = 1
  • Ingestion threshold: 150 global pageviews (from Wikimedia REST API)
  • Release threshold: 450 pageviews

Given these privacy parameters, pageview counts are 95% likely to be within 89.9 pageviews of the true value.
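
The 89.9 figure follows from the tail of the Laplace distribution: for noise with scale b, the probability that the absolute error exceeds t is exp(-t / b), so the 95% bound is b * ln(20). A short check, with an illustrative function name:

```python
import math


def laplace_error_bound(scale: float, confidence: float) -> float:
    """Return t such that |Laplace(0, scale)| <= t with the given probability."""
    # P(|X| > t) = exp(-t / scale); solve exp(-t / scale) = 1 - confidence for t.
    return -scale * math.log(1.0 - confidence)


print(round(laplace_error_bound(30, 0.95), 1))  # 89.9
```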



Utility and Accuracy

This dataset was optimized to perform well across several utility metrics: median absolute error; recall (defined as the % of the top up-to-1000 pages in the non-noised dataset that were actually published); spurious rate; and drop rate. For more information about these metrics, you can consult the WMF DP error glossary.

For this data release:

  • Median absolute error: ≤10. At least half of the published rows differ from their true value by no more than 10 pageviews.
  • Recall: ≥95%. At least 95% of the top-1000 pages each day are included in the data release. This is also calculated for individual countries, subcontinents, and continents.
  • Spurious rate: ≤0.1%. At most 1 in 1000 rows of published data actually has a true value of 0.
  • Drop rate: ≤0.1%. At most 1 in 1000 rows of data that should have been published were not.

These metrics were calculated on a daily basis, and were also calculated for subcontinental geographic regions.
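
As a rough illustration of how these four metrics could be computed for one day, the sketch below compares a dictionary of true counts with a dictionary of released noisy counts. The definitions follow the wording above; the WMF DP error glossary remains the authoritative source, and the function name and input layout are assumptions.

```python
from statistics import median


def utility_metrics(true_counts, released, threshold=450, top_n=1000):
    """Compare released noisy counts with true counts (dicts: row key -> count)."""
    published = set(released)

    # Absolute error of every published row; rows absent from the truth count as 0.
    abs_errors = [abs(released[k] - true_counts.get(k, 0)) for k in published]

    # The top up-to-1000 rows of the non-noised data.
    top = sorted(true_counts, key=true_counts.get, reverse=True)[:top_n]

    # Rows whose true value is above the release threshold.
    should_publish = {k for k, v in true_counts.items() if v > threshold}

    # Published rows whose true value is 0.
    spurious = [k for k in published if true_counts.get(k, 0) == 0]

    return {
        "median_absolute_error": median(abs_errors) if abs_errors else 0.0,
        "recall": len(published.intersection(top)) / len(top) if top else 1.0,
        "spurious_rate": len(spurious) / len(published) if published else 0.0,
        "drop_rate": (len(should_publish - published) / len(should_publish)
                      if should_publish else 0.0),
    }
```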



Caveats

  • The privacy guarantee of this dataset is that a user's contribution of up to 30 pageviews per day is obfuscated. A user who contributed >30 pageviews on a given day incurs a correspondingly larger privacy loss.
  • This dataset only considers pageviews on pages that garner >150 pageviews. Because bots and lesser-visited pages are excluded, the total number of pageviews is significantly lower than the real value. On a row-by-row basis, values are closer to the ground truth.
  • Differential privacy necessarily involves adding random noise to data outputs, which means that data in this dataset may not exactly mirror the truth, and some values may be spurious (i.e. the country-project-page_id tuple does not appear in the underlying dataset). We've introduced the release threshold to deal with this fact, but keep in mind that these values are not 100% exact and some rows may be incorrect.
    • Values that are closer to the release threshold of 450 are likelier to be spurious.
  • There are more page titles than page_ids (because of title changes, redirects, etc.). We calculate this aggregation on page_ids and join page titles after, so the title in the dataset might not be the canonical non-redirect name.
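
Two of the caveats above can be made concrete with a small calculation. The sketch assumes the standard Laplace-mechanism relation, in which privacy loss grows linearly with a user's contribution divided by the noise scale, and it treats the noise as continuous; the production pipeline may differ in details such as rounding. Function names are illustrative.

```python
import math

NOISE_SCALE = 30.0
RELEASE_THRESHOLD = 450.0


def effective_epsilon(user_pageviews, scale=NOISE_SCALE):
    """Privacy loss for a user contributing `user_pageviews` in one day (assumed linear)."""
    return user_pageviews / scale


def spurious_probability(threshold=RELEASE_THRESHOLD, scale=NOISE_SCALE):
    """Chance that a row with a true count of 0 exceeds the release threshold."""
    # Upper tail of Laplace(0, scale): P(noise > t) = 0.5 * exp(-t / scale) for t >= 0.
    return 0.5 * math.exp(-threshold / scale)


print(effective_epsilon(30))   # 1.0
print(effective_epsilon(60))   # 2.0
print(spurious_probability())  # roughly 1.5e-07
```

Under these assumptions, a true-zero row is released with probability of about 1.5 in 10 million, so even if every row of the ~125 million-row key space had a true count of 0, only around 20 spurious rows per day would be expected.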


Other DP datasets

You can find data from 6 Feb 2023 to the present in the country_project_page dataset, and data from 1 July 2015 to 8 Feb 2017 in the country_project_page_historical_pre_2017 dataset.