Pageviews Differential Privacy — Historical (pre-2017)
Dataset characteristics
- Time range: 1 July 2015 - 8 Feb 2017
- Time granularity: daily
- Data features:
- country (excluding countries on the country protection list)
- project (e.g. “en.wikipedia”, “wikidata”, “zh.wikibooks”, etc.)
- page_id (numerical ID for a given page — together with project, this forms a unique identifier)
- page_title (the page title for a given page_id)
- gbc (the differentially-private number of pageviews this page_id received)
- Dataset structure:
country_project_page/
|-> 2015-7-1.csv
...
|-> <year>-<month>-<day>.csv
...
|-> 2017-2-8.csv
- Hive table access (all days):
differential_privacy.country_project_page_historical_pre_2017
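The daily file names in the listing above do not appear to zero-pad the month or day (2015-7-1.csv rather than 2015-07-01.csv). A minimal sketch for building those names is shown here; it assumes the naming pattern in the listing holds for every day in the range, and the helper names are hypothetical.

```python
from datetime import date, timedelta

# Time range taken from "Dataset characteristics".
FIRST_DAY = date(2015, 7, 1)
LAST_DAY = date(2017, 2, 8)


def daily_filename(day):
    """Return the CSV name for a day, without zero padding (e.g. 2015-7-1.csv)."""
    if not FIRST_DAY <= day <= LAST_DAY:
        raise ValueError(f"{day} is outside the dataset's time range")
    return f"{day.year}-{day.month}-{day.day}.csv"


def all_daily_filenames():
    """List one file name per day from FIRST_DAY to LAST_DAY, inclusive."""
    n_days = (LAST_DAY - FIRST_DAY).days + 1
    return [daily_filename(FIRST_DAY + timedelta(days=i)) for i in range(n_days)]
```

For example, `daily_filename(date(2016, 12, 25))` yields `2016-12-25.csv`, which would sit under `country_project_page/`.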
Data creation pipeline
- For date YYYY-MM-DD, query the pageview_hourly table, retrieving all rows for page_ids that have received >150 global pageviews on that day
- Create a key space of all possible country-project-page_id combinations (noise is drawn for every key in this ~125 million-row space, including keys with no observed pageviews)
- Do a group-by + sum on the pageview data, adding Laplace noise (epsilon=1, noise scale=300) to each row of the dataset
- Calculate internal error metrics to ensure that we don't have data drift
- Threshold the data so that only rows with >3500 pageviews are released.
- Share the final table with the world!
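The steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the production pipeline: the function and variable names are hypothetical, the input is a plain list of tuples rather than the pageview_hourly table, the internal error-metrics step is omitted, and rounding the noisy value to an integer is an assumption of this sketch.

```python
import random
from collections import Counter

INGESTION_THRESHOLD = 150   # global pageviews per page per day (from the text)
NOISE_SCALE = 300           # Laplace scale (from the text)
RELEASE_THRESHOLD = 3500    # noisy pageviews needed for release (from the text)


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a zero-mean Laplace distribution with that scale.
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def dp_release(rows, key_space, rng=None):
    """Hypothetical sketch of one day's release.

    rows: list of (country, project, page_id, views) tuples.
    key_space: iterable of every (country, project, page_id) key to consider.
    """
    rng = rng or random.Random()

    # Ingestion: find pages with more than 150 global pageviews that day.
    global_views = Counter()
    for _country, project, page_id, views in rows:
        global_views[(project, page_id)] += views

    # Group-by + sum over the rows that pass the ingestion threshold.
    totals = Counter()
    for country, project, page_id, views in rows:
        if global_views[(project, page_id)] > INGESTION_THRESHOLD:
            totals[(country, project, page_id)] += views

    # Add noise to every key, including keys with a true count of zero,
    # then keep only the rows above the release threshold.
    released = {}
    for key in key_space:
        noisy = totals.get(key, 0) + laplace_noise(NOISE_SCALE, rng)
        if noisy > RELEASE_THRESHOLD:
            released[key] = round(noisy)  # rounding is an assumption
    return released
```

Adding noise to the whole key space, rather than only to observed keys, is what allows a key with no real pageviews to occasionally cross the release threshold.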
Why is this dataset separate from the other historical release? Prior to 9 Feb 2017, WMF's internal systems did not drop edit previews from the pageview_hourly table, and did not log page_ids.
See the code for releasing this data on Wikimedia's GitLab instance.
Privacy parameters
- Noise type: Laplace
- Noise scale: 300
- Privacy budget: epsilon = 1
- Ingestion threshold: 150 global pageviews (from Wikimedia REST API)
- Release threshold: 3500 pageviews
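These parameters are related as follows, assuming the standard Laplace mechanism in which the noise scale b equals the per-user contribution bound Δ (L1 sensitivity) divided by ε. The symbols b, Δ, k and ε_eff are introduced here for illustration and do not appear in the original description.

```latex
\[
b = \frac{\Delta}{\varepsilon}
\quad\Longrightarrow\quad
\Delta = b\,\varepsilon = 300 \times 1 = 300 \text{ pageviews},
\qquad
\varepsilon_{\text{eff}}(k) = \frac{k}{b} = \frac{k}{300}.
\]
```

Under this assumption, a user who contributes k = 300 pageviews in a day is covered at ε = 1, while a user who contributes k = 600 would face an effective ε of 2.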
Caveats
- The privacy guarantee of this dataset covers a user contribution of up to 300 pageviews on a given day. A user who contributed >300 pageviews on a given day incurs a correspondingly larger privacy loss.
- This dataset considers only pageviews on pages that garner >150 global pageviews on a given day. Because bots and lesser-visited pages are excluded, the total number of pageviews is significantly lower than the real value. On a row-by-row basis, values are closer to the ground truth.
- Differential privacy necessarily involves adding random noise to data outputs, which means that data in this dataset may not exactly mirror the truth, and some values may be spurious (i.e. the country-project-page_id tuple does not appear in the underlying dataset). We've introduced the release threshold to deal with this fact, but keep in mind that these values are not 100% exact and some rows may be incorrect.
- Values that are closer to the release threshold of 3500 are likelier to be spurious.
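The last two caveats can be made concrete with the Laplace tail formula. The sketch below assumes zero-mean Laplace noise with scale 300 and a release rule of "noisy value > 3500", as described above; the function names are hypothetical. It computes the chance that a row is released given its true count, and how far above the threshold a spurious row (true count 0) tends to land.

```python
import math

NOISE_SCALE = 300.0
RELEASE_THRESHOLD = 3500.0


def prob_released(true_count, threshold=RELEASE_THRESHOLD, scale=NOISE_SCALE):
    """P(true_count + Laplace(0, scale) > threshold)."""
    distance = threshold - true_count
    if distance >= 0:
        # Noise must exceed a non-negative distance: upper tail of the Laplace.
        return 0.5 * math.exp(-distance / scale)
    # True count is already above the threshold: only large negative noise hides it.
    return 1.0 - 0.5 * math.exp(distance / scale)


def spurious_excess_survival(excess, scale=NOISE_SCALE):
    """For a released row with true count 0, P(value > threshold + excess).

    The Laplace upper tail is exponential, so the amount by which a spurious
    row exceeds the threshold is exponentially distributed with mean `scale`.
    """
    return math.exp(-excess / scale)
```

Under these assumptions a key with no real pageviews is released with probability on the order of a few in a million, and a spurious row is unlikely to exceed the threshold by much: `spurious_excess_survival(1500)` is below 1%, so values well above 3500 are far less likely to be spurious than values just above it.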
Other DP datasets
You can find data from 6 Feb 2023 - present in the country_project_page dataset, and data from 9 Feb 2017 - 5 Feb 2023 in the country_project_page_historical dataset.