Pageviews Differential Privacy — Current
Dataset characteristics
- Time range: 6 Feb 2023 - present
- Time granularity: daily
- Data features:
- country (excluding countries on the country protection list)
- project (e.g. “en.wikipedia”, “wikidata”, “zh.wikibooks”, etc.)
- page_id (numerical ID for a given page — together with project, this forms a unique identifier)
- page_title (the page title for a given page_id)
- gbc (the differentially-private number of pageviews this page_id received)
- Dataset structure:
country_project_page/
|-> 2023-2-6.csv
|-> 2023-2-7.csv
...
|-> <year>-<month>-<day>.csv
- Hive table access (all days):
differential_privacy.country_project_page
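Note that the file names listed above do not zero-pad the month or day. A minimal sketch of building the relative path for one day's file, assuming that un-padded pattern holds for every date (the helper name `daily_csv_path` is hypothetical, not part of the dataset tooling):

```python
from datetime import date

DATASET_DIR = "country_project_page"  # directory name from the structure above


def daily_csv_path(day: date) -> str:
    """Return the relative path of the CSV for one day.

    Assumes the un-padded <year>-<month>-<day>.csv pattern holds throughout.
    """
    return f"{DATASET_DIR}/{day.year}-{day.month}-{day.day}.csv"


print(daily_csv_path(date(2023, 2, 6)))  # country_project_page/2023-2-6.csv
```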
Data creation pipeline
- Using the client-side differential privacy cookie, collect a boolean flag representing whether a given pageview was one of the first 10 unique pageviews for a given device on a given day.
- For date YYYY-MM-DD, retrieve all pageviews where the differential privacy flag is true and that come from pages that have received >150 global pageviews on that day.
- Create a key space of all possible country-project-page_ids (we will calculate noise for all parts of this ~125 million-row space)
- Do a group-by + count on the pageview data, adding Gaussian noise (zero-concentrated differential privacy, rho=0.015, noise scale=10) to each row of the dataset
- Calculate internal error metrics to ensure that we don't have data drift
- Threshold the data so that only rows with a noisy count of >90 pageviews are released.
- Share the final table with the world!
See the code for releasing this data on Wikimedia's gitlab instance
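The steps above can be illustrated with a small, self-contained sketch. This is not the production pipeline: the function name `release_rows`, the input shapes, and the choice to build the key space from countries crossed with pages that pass the ingestion threshold are all assumptions made for illustration, and the noise scale is treated here as the standard deviation of the Gaussian.

```python
import random
from collections import Counter
from itertools import product

INGESTION_THRESHOLD = 150  # global pageviews a page must exceed
RELEASE_THRESHOLD = 90     # noisy count a row must exceed
NOISE_SCALE = 10           # assumed to be the Gaussian standard deviation


def release_rows(pageviews, global_views, countries,
                 noise_scale=NOISE_SCALE, rng=None):
    """Sketch of group-by + count, Gaussian noise, and release thresholding.

    pageviews:    iterable of (country, project, page_id, dp_flag) tuples
    global_views: dict mapping (project, page_id) -> global pageviews that day
    countries:    countries to publish (protected countries already removed)
    """
    if rng is None:
        rng = random.Random()

    # Keep only pages above the ingestion threshold.
    eligible = sorted(p for p, n in global_views.items()
                      if n > INGESTION_THRESHOLD)
    eligible_set = set(eligible)

    # Group-by + count over flagged pageviews on eligible pages.
    counts = Counter(
        (country, project, page_id)
        for country, project, page_id, dp_flag in pageviews
        if dp_flag and (project, page_id) in eligible_set
    )

    # Add noise to every key in the key space, including keys with no views,
    # then keep only the rows whose noisy count clears the release threshold.
    released = {}
    for country, (project, page_id) in product(countries, eligible):
        key = (country, project, page_id)
        noisy = counts[key] + rng.gauss(0, noise_scale)
        if noisy > RELEASE_THRESHOLD:
            released[key] = noisy
    return released
```

Adding noise to every key, rather than only to keys that appear in the data, is what keeps the mere presence of a row from revealing that someone viewed the page.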
Privacy parameters
- Noise type: Gaussian zCDP
- Noise scale: 10
- Privacy budget: rho = 0.015 (roughly equivalent to epsilon = 1, delta = 1e-07)
- Ingestion threshold: 150 global pageviews (from Wikimedia REST API)
- Release threshold: 90 pageviews
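The stated correspondence between rho and (epsilon, delta) can be checked with the standard zCDP conversion bound, epsilon = rho + 2 * sqrt(rho * ln(1/delta)). The sketch below assumes that bound is the one intended; the function name is hypothetical.

```python
import math


def zcdp_to_epsilon(rho: float, delta: float) -> float:
    """Epsilon for which rho-zCDP implies (epsilon, delta)-DP (standard bound)."""
    return rho + 2 * math.sqrt(rho * math.log(1 / delta))


# rho = 0.015 and delta = 1e-07 give an epsilon just under 1.
print(round(zcdp_to_epsilon(0.015, 1e-7), 3))  # 0.998
```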
Caveats
- The privacy guarantee of this dataset is that the contribution of the first 10 unique pageviews from a given user's browser on a given day will be obfuscated. If a user clears their cookies, uses multiple devices, or uses multiple browsers, they might incur additional privacy loss.
- This dataset only considers each user's first 10 unique pageviews per day, and only on pages that garner >150 global pageviews that day. Because it excludes non-unique pageviews, unique pageviews beyond the first 10, bots, and lesser-visited pages, total pageview counts are significantly lower than the real values. On a row-by-row basis, values are closer to the ground truth.
- Differential privacy necessarily involves adding random noise to data outputs, which means that data in this dataset may not exactly mirror the truth, and some values may be spurious (i.e. that country-project-page_id tuple might not appear in the underlying dataset). We've introduced the release threshold to deal with this fact, but keep in mind that these values are not 100% exact and some rows may be incorrect.
- Values that are closer to the release threshold of 90 are likelier to be spurious.
- There are more page titles than page_ids (because of title changes, redirects, etc.). We calculate this aggregation on page_ids and join page titles afterward, so the title in the dataset might not be the canonical non-redirect name.
- From 25 May 2023 on, a more aggressive filter was used to only retrieve human-contributed views, not bot/web indexer views. Data before this date may contain bot-influenced counts.
- Data from 6 Feb 2023 may be incomplete
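To make the caveat about values near the release threshold more concrete, the sketch below computes how often a row with a given true count would clear the threshold of 90. It assumes the noise scale of 10 is the standard deviation of zero-mean Gaussian noise, which this page does not state explicitly; the function name is hypothetical.

```python
import math

NOISE_SCALE = 10        # assumed Gaussian standard deviation
RELEASE_THRESHOLD = 90  # release threshold from the privacy parameters


def release_probability(true_count: float) -> float:
    """P(true_count + N(0, NOISE_SCALE^2) > RELEASE_THRESHOLD)."""
    z = (RELEASE_THRESHOLD - true_count) / NOISE_SCALE
    # Upper tail of the standard normal distribution, via erfc.
    return 0.5 * math.erfc(z / math.sqrt(2))


for true_count in (70, 80, 90, 100, 110):
    print(true_count, round(release_probability(true_count), 4))
```

Under that assumption, a row whose true count is 70 would still be released roughly 2% of the time, and a row whose true count is exactly 90 about half the time, so released values just above 90 deserve the most caution.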
Other DP datasets
You can find data from 9 Feb 2017 - 5 Feb 2023 in the country_project_page_historical dataset, and data from 1 July 2015 - 8 Feb 2017 in the country_project_page_historical_pre_2017 dataset.
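For analyses that span several years, the date ranges above determine which dataset holds a given day. A minimal sketch, using only the dataset names and boundaries stated on this page (the helper `dataset_for` is hypothetical):

```python
from datetime import date


def dataset_for(day: date) -> str:
    """Return the name of the dataset that covers the given day."""
    if day >= date(2023, 2, 6):
        return "country_project_page"
    if day >= date(2017, 2, 9):
        return "country_project_page_historical"
    if day >= date(2015, 7, 1):
        return "country_project_page_historical_pre_2017"
    raise ValueError(f"no differentially private pageview data for {day}")


print(dataset_for(date(2020, 3, 15)))  # country_project_page_historical
```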