Pageviews Differential Privacy — Current
Dataset characteristics
- Time range: 6 Feb 2023 - present
- Time granularity: daily
- Data features:
- country (excluding countries with a "not published" risk classification on the Country and Territory Protection List)
- project (e.g. “en.wikipedia”, “wikidata”, “zh.wikibooks”, etc.)
- page_id (numerical ID for a given page — together with project, this forms a unique identifier)
- page_title (the page title for a given page_id)
- item_id (the cross-project Wikidata QID for the page if one exists, otherwise an empty string)
- gbc (the differentially-private number of pageviews this page_id received)
- Dataset structure:
country_project_page/
|-> 2023-2-6.csv
|-> 2023-2-7.csv
...
|-> <year>-<month>-<day>.csv
- Hive table access (all days):
differential_privacy.country_project_page
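As a minimal sketch of working with the daily files, the snippet below builds the file name for a given date (month and day are not zero-padded in the listing above) and parses rows into dictionaries. The helper names daily_csv_path and read_rows are hypothetical. It is also an assumption that each CSV carries a header row with the feature names listed above and uses commas as the delimiter. The sample row is made up.

```python
import csv
import io
from datetime import date

def daily_csv_path(day: date, root: str = "country_project_page") -> str:
    """Build the path for one day's file; month and day are not zero-padded."""
    return f"{root}/{day.year}-{day.month}-{day.day}.csv"

def read_rows(handle):
    """Yield one dict per row, converting the numeric columns.

    Assumes a header row naming the columns: country, project, page_id,
    page_title, item_id, gbc.
    """
    for row in csv.DictReader(handle):
        row["page_id"] = int(row["page_id"])
        row["gbc"] = float(row["gbc"])  # noisy pageview count
        yield row

# In-memory stand-in for a real file; the values are invented.
sample = io.StringIO(
    "country,project,page_id,page_title,item_id,gbc\n"
    "France,fr.wikipedia,12345,Exemple,Q42,137\n"
)
rows = list(read_rows(sample))
print(daily_csv_path(date(2023, 2, 6)), rows[0]["gbc"])
```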
Data creation pipeline
- Using the client-side differential privacy cookie, collect a boolean flag representing whether a given pageview was one of the first 10 unique pageviews for a given device on a given day.
- For date YYYY-MM-DD, retrieve all pageviews where the differential privacy cookie is true and that come from pages that received >150 global pageviews on that day.
- Create a key space of all possible country-project-page_id tuples (we will calculate noise for all parts of this ~125 million-row space).
- Do a group-by + count on the pageview data, adding Gaussian noise (zero-concentrated differential privacy; rho=1.505E-2 for lower risk countries, rho=6.166E-4 for medium risk countries, rho=1.546E-4 for higher risk countries; sensitivity=10 unique pageviews) to each row of the dataset
- Calculate internal error metrics to ensure that we don't have data drift
- Threshold the data so that only rows above the release threshold are published: >90 pageviews for lower risk countries, >550 pageviews for medium risk countries, and >1000 pageviews for higher risk countries.
- Share the final table with the world!
See the code for releasing this data on Wikimedia's GitLab instance.
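The core of the steps above (group-by and count, noise over the whole key space, release threshold) can be sketched as follows. This is an illustration under simplifying assumptions, not the production code. It handles a single risk tier, so one sigma and one threshold. The function name dp_pageview_counts and its parameters are hypothetical. It also draws noise with Python's standard random module, which is not suitable for a real differentially private release.

```python
import random
from collections import Counter
from itertools import product

def dp_pageview_counts(pageviews, countries, pages, sigma, release_threshold, rng):
    """Noisy group-by count over the full key space, then threshold.

    pageviews: iterable of (country, project, page_id) tuples, assumed to be
               already filtered to cookie-flagged views of pages above the
               ingestion threshold.
    pages:     iterable of (project, page_id) pairs.
    """
    true_counts = Counter(pageviews)
    released = {}
    # Noise is added to every key in the key space, including keys whose
    # true count is 0, not only to keys that appear in the data.
    for country, (project, page_id) in product(countries, pages):
        key = (country, project, page_id)
        noisy = true_counts[key] + rng.gauss(0.0, sigma)
        if noisy > release_threshold:
            released[key] = noisy
    return released

# Made-up example: one popular page and one rarely viewed page.
views = [("FR", "fr.wikipedia", 1)] * 500 + [("FR", "fr.wikipedia", 2)] * 3
out = dp_pageview_counts(
    views,
    countries=["FR"],
    pages=[("fr.wikipedia", 1), ("fr.wikipedia", 2)],
    sigma=18.2,
    release_threshold=90,
    rng=random.Random(0),
)
print(out)
```

In this example the rarely viewed page falls below the release threshold and is dropped, while the popular page is released with a perturbed count.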
Privacy parameters
- Noise type: Gaussian zCDP
- Sensitivity: 10 unique pageviews
- Privacy budget: For lower risk, rho = 1.505E-2 (roughly equivalent to epsilon = 1, delta = 1e-07). For medium risk, rho = 6.166E-4 (roughly equivalent to epsilon = 0.2, delta = 1e-07). For higher risk, rho = 1.546E-4 (roughly equivalent to epsilon = 0.1, delta = 1e-07).
- Ingestion threshold: 150 global pageviews (from Wikimedia REST API)
- Release threshold: For lower risk, 90 pageviews. For medium risk, 550 pageviews. For higher risk, 1000 pageviews.
For lower risk countries, pageview counts are 95% likely to be within 35.7 pageviews of the true value. For medium risk countries, they are 95% likely to be within 176.5 pageviews, and for higher risk countries, within 352.5 pageviews.
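The parameters above can be checked against each other with a short calculation. The sketch below assumes that the sensitivity of 10 unique pageviews corresponds to an L2 sensitivity of sqrt(10), on the reasoning that each of the 10 pageviews lands in a different row. It also assumes that "roughly equivalent" refers to the standard conversion from rho-zCDP to (epsilon, delta)-DP, epsilon = rho + 2*sqrt(rho*ln(1/delta)). Under those two assumptions it reproduces the 95% bounds and epsilon values quoted above to the precision shown. The function names are hypothetical.

```python
import math

DELTA = 1e-7
SENSITIVITY_L2 = math.sqrt(10)  # assumption: 10 unique pageviews, one per row
Z_95 = 1.96                     # two-sided 95% quantile of the standard normal

RHO = {"lower": 1.505e-2, "medium": 6.166e-4, "higher": 1.546e-4}

def gaussian_sigma(rho, sensitivity=SENSITIVITY_L2):
    """Noise scale of a Gaussian mechanism satisfying rho-zCDP."""
    return sensitivity / math.sqrt(2 * rho)

def approx_epsilon(rho, delta=DELTA):
    """Standard conversion from rho-zCDP to (epsilon, delta)-DP."""
    return rho + 2 * math.sqrt(rho * math.log(1 / delta))

for tier, rho in RHO.items():
    sigma = gaussian_sigma(rho)
    print(f"{tier}: sigma={sigma:.1f}, 95% bound={Z_95 * sigma:.1f}, "
          f"epsilon={approx_epsilon(rho):.2f}")
```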
Utility and accuracy
This dataset was optimized to perform well across several utility metrics: median relative and absolute error; percentage of output rows with relative error <10%, <25%, and <50%; spurious rate; and drop rate. For more information about these metrics, you can consult the WMF DP error glossary.
For this data release:
- Median relative error: ≤6% for lower risk, ≤7% for medium risk, ≤8% for higher risk. Half of the published rows differ from their true value by no more than 6-8%, depending on risk level.
- Median absolute error: ≤14 pageviews for lower risk, ≤70 pageviews for medium risk, ≤140 pageviews for higher risk. Half of the published rows differ from their true value by no more than 14-140 pageviews, depending on risk level.
- Percentage of output rows with relative error <10%: ≥60% across all risk levels. At least 60% of rows are within 10% of the true value.
- Percentage of output rows with relative error <25%: ≥90% across all risk levels. At least 90% of rows are within 25% of the true value.
- Percentage of output rows with relative error <50%: ≥95% across all risk levels. At least 95% of rows are within 50% of the true value.
- Spurious rate: ≤0.05%. At most 1 in 2000 published rows has a true value of 0.
- Drop rate: ≤0.5%. At most 1 in 200 rows that should have been published was not published.
These metrics are calculated on a daily basis, and are also calculated for continental and subcontinental regions. In order to achieve geographic equity, the vast majority of subcontinental regions must meet rigorous data quality standards (median relative error ≤6%, spurious rate ≤1%, drop rate ≤1%).
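To make the metric definitions concrete, here is a minimal sketch that computes three of them from a set of true counts and a set of released noisy counts. The definitions follow the one-line descriptions in the list above. The WMF DP error glossary may define edge cases differently, and the function name utility_metrics and the toy data are hypothetical.

```python
from statistics import median

def utility_metrics(true_counts, released, release_threshold):
    """Compute median relative error, spurious rate, and drop rate.

    true_counts: dict mapping row key -> true pageview count.
    released:    dict mapping row key -> published noisy count.
    """
    # Relative error is only defined where the true value is non-zero.
    rel_errors = [
        abs(noisy - true_counts[key]) / true_counts[key]
        for key, noisy in released.items()
        if true_counts.get(key, 0) > 0
    ]
    # Spurious rows: published, but the true value is 0.
    spurious = sum(1 for key in released if true_counts.get(key, 0) == 0)
    # Dropped rows: true value is above the threshold, but the row is missing.
    should_publish = [k for k, v in true_counts.items() if v > release_threshold]
    dropped = sum(1 for k in should_publish if k not in released)
    return {
        "median_relative_error": median(rel_errors) if rel_errors else None,
        "spurious_rate": spurious / len(released) if released else 0.0,
        "drop_rate": dropped / len(should_publish) if should_publish else 0.0,
    }

# Toy data: "c" is spurious, "d" should have been published but was dropped.
true_counts = {"a": 100, "b": 200, "c": 0, "d": 95}
released = {"a": 110, "b": 190, "c": 91}
metrics = utility_metrics(true_counts, released, release_threshold=90)
print(metrics)
```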
Caveats
- The privacy guarantee of this dataset is that the contribution of the first 10 unique pageviews per day from a given user's browser is obfuscated. If a user clears their cookies, uses multiple devices, or uses multiple browsers, they might incur additional privacy loss.
- This dataset only considers the first 10 unique pageviews for each user, and only on pages that garner >150 global pageviews that day. Because non-unique pageviews, unique pageviews beyond the tenth, bots, and lesser-visited pages are excluded, the total number of pageviews is significantly lower than the real value. On a row-by-row basis, values are more similar to the ground truth.
- Differential privacy necessarily involves adding random noise to data outputs, which means that data in this dataset may not exactly mirror the truth, and some values may be spurious (i.e. that country-project-page_id tuple might not appear in the underlying dataset). We've introduced the release threshold to deal with this fact, but keep in mind that these values are not 100% exact and some rows may be incorrect.
- Values that are closer to the release threshold of 90 (or 550, or 1000) are likelier to be spurious.
- There are more page titles than page_ids (because of title changes, redirects, etc.). We calculate this aggregation on page_ids and join page titles after, so the title in the dataset might not be the canonical non-redirect name.
- From 25 May 2023 on, a more aggressive filter was used to only retrieve human-contributed views, not bot/web indexer views. Data before this date may contain bot-influenced counts.
- Data from 6 Feb 2023 may be incomplete.
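The caveat about values near the release threshold can be illustrated with a short calculation. For Gaussian noise, the chance that a row with a true count of 0 ends up above a given value falls off quickly as that value grows. The sketch below evaluates this for the lower risk tier. It assumes an L2 sensitivity of sqrt(10) when deriving the noise scale from rho, and the function name exceed_probability is hypothetical.

```python
import math

def exceed_probability(threshold, true_count, sigma):
    """P(true_count + N(0, sigma^2) > threshold) for Gaussian noise."""
    z = (threshold - true_count) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2))

# Lower risk tier: rho = 1.505E-2, with an assumed L2 sensitivity of sqrt(10).
sigma_lower = math.sqrt(10) / math.sqrt(2 * 1.505e-2)

p_at_threshold = exceed_probability(90, 0, sigma_lower)  # empty row exceeds 90
p_well_above = exceed_probability(120, 0, sigma_lower)   # empty row exceeds 120
print(f"P(noise alone > 90)  = {p_at_threshold:.1e}")
print(f"P(noise alone > 120) = {p_well_above:.1e}")
```

The same function with a non-zero true_count gives the chance that a genuinely viewed row clears the threshold, which is why rows whose true count sits near the threshold are the ones most likely to be dropped.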
Other DP datasets
You can find data from 9 Feb 2017 - 5 Feb 2023 in the country_project_page_historical dataset, and data from 1 July 2015 - 8 Feb 2017 in the country_project_page_historical_pre_2017 dataset.