Wikimedia Privacy Engineering

Geoeditors Differential Privacy — Monthly

Welcome to the Wikimedia Foundation's differentially-private monthly geoeditor data release!

This dataset uses differential privacy to safely facilitate the release of granular editor and edit data partitioned by month, allowing users to conduct analysis on thousands of rows of editor and edit data on a country-project level.

Differential privacy injects a controlled amount of statistical noise into data before it is released. This noise impedes attempts to recover information about any single individual in the dataset without significantly changing the conclusions drawn from the data.

Unlike its sibling weekly data release, this release has a long-term history of pre-aggregated monthly editor counts by country, project, and activity level. This historical record goes back to January 2018 and lacks edit counts, so neither historical data nor country protection list data will have associated edit counts. WMF is actively working on releasing this data in a differentially private manner.

To conserve privacy budget and provide best performance on editor counts, edit counts are not published for medium and higher risk countries.

You can find more information about this project on its metawiki homepage.

To download dataset files, go to the monthly dataset or the weekly dataset.

Dataset characteristics

  • Time range: January 2018 – present
  • Time granularity: monthly
  • Data features:
    • wiki_db (corresponds to project)
    • project (e.g. “en.wikipedia”, “wikidata”, “zh.wikibooks”, etc.)
    • country name (excluding countries on the country protection list)
    • country code (ISO alpha-2 codes, including "--" for unknown)
    • editor activity level (either "1 to 4", "5 to 99", or "100 or more")
    • count epsilon (the epsilon value used for the count aggregation, either 1.1, 0.2, or 0.1 depending on the risk of the country)
    • sum epsilon (the epsilon value used for the sum aggregation, either 0.9 or null)
    • editor count (total number of editors in this group, differentially privatized)
    • edit sum (total number of edits made by this group, differentially privatized, or null)
    • month (in YYYY-MM form)
  • Dataset structure:
geoeditors_monthly/
|-> 2023-07.csv
|-> 2023-08.csv
...
|-> <year>-<month>.csv
  • Hive table access (all months): differential_privacy.geoeditors_monthly
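
To make the field list above concrete, here is a minimal sketch of a typed record for one row of a monthly CSV file. The column names (`country`, `editors`, `edits`, and so on) and the representation of null as an empty string or the text "null" are assumptions for illustration; check the header of the real files before relying on them.

```python
import re
from dataclasses import dataclass
from typing import Optional

# The three activity levels listed in the dataset characteristics.
ACTIVITY_LEVELS = ("1 to 4", "5 to 99", "100 or more")


def _optional_float(text):
    """Parse a numeric cell; empty or 'null' cells become None (assumed encoding)."""
    return None if text.strip().lower() in ("", "null") else float(text)


@dataclass(frozen=True)
class GeoeditorRow:
    # All field names below are hypothetical stand-ins for the real CSV header.
    wiki_db: str
    project: str
    country: str
    country_code: str
    activity_level: str
    count_epsilon: float
    sum_epsilon: Optional[float]
    editors: float
    edits: Optional[float]
    month: str

    @classmethod
    def from_csv_row(cls, row):
        """Build a record from a dict such as one yielded by csv.DictReader."""
        if row["activity_level"] not in ACTIVITY_LEVELS:
            raise ValueError("unexpected activity level: %r" % row["activity_level"])
        if not re.fullmatch(r"\d{4}-\d{2}", row["month"]):
            raise ValueError("month must be in YYYY-MM form: %r" % row["month"])
        return cls(
            wiki_db=row["wiki_db"],
            project=row["project"],
            country=row["country"],
            country_code=row["country_code"],
            activity_level=row["activity_level"],
            count_epsilon=float(row["count_epsilon"]),
            sum_epsilon=_optional_float(row["sum_epsilon"]),
            editors=float(row["editors"]),
            edits=_optional_float(row["edits"]),
            month=row["month"],
        )
```

Rows without a published edit sum (for example, medium and higher risk countries) come out with `sum_epsilon` and `edits` set to `None`, so downstream code has to handle missing sums explicitly.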


Data creation pipeline

  1. Query the editors daily table for the total number of edits made in a given month per editor
  2. Create a keyset of all projects, countries, and edit activity buckets for a given month
  3. Cast each edit count to an edit activity bucket (e.g. 3 edits --> 1 to 4 edits, 27 edits --> 5 to 99 edits, 143 edits --> 100 or more edits)
  4. Group by country, project, and edit activity bucket + count, adding Laplace noise (pure differential privacy, epsilon=1.1 for lower risk/0.2 for medium risk/0.1 for higher risk, sensitivity=1) to each row of the output dataset
  5. Drop all rows with fewer than 8/45/95 editors (for lower/medium/higher risk countries) to reduce spurious data
  6. Use released rows for the editor count as the keyset for the edit sum
  7. Group by country, project, and edit activity bucket + sum, adding Laplace noise (pure differential privacy, epsilon=0.9 for lower risk) with clamping bounds and noise scale depending on the bucket (1-4, 5-99, 100-101 [we cast all edit counts >101 to 101]) — since each editor can only contribute to one bucket, this is a parallel mechanism and doesn't introduce more privacy loss
  8. Calculate internal error metrics to ensure that we don't have data drift
  9. Share the final table with the world!
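
Steps 3 to 5 above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the production pipeline: the function names, the `risk_by_country` lookup, and the input shape are hypothetical, and we assume the release threshold is applied to the noisy count. The epsilon values, thresholds, and bucket boundaries come from the steps above.

```python
import random
from collections import Counter

# Epsilons and release thresholds per risk tier, as listed in the pipeline steps.
COUNT_EPSILON = {"lower": 1.1, "medium": 0.2, "higher": 0.1}
RELEASE_THRESHOLD = {"lower": 8, "medium": 45, "higher": 95}


def activity_bucket(edits):
    """Step 3: cast a monthly edit count to its activity bucket."""
    if edits < 1:
        raise ValueError("an editor row needs at least one edit")
    if edits >= 100:
        return "100 or more"
    if edits >= 5:
        return "5 to 99"
    return "1 to 4"


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def noisy_editor_counts(editor_rows, keyset, risk_by_country, rng):
    """Steps 4-5: count editors per key, add noise, drop rows under the threshold.

    editor_rows: iterable of (country, project, edits_this_month), one per editor.
    keyset: every (country, project, bucket) key to consider, including empty ones.
    risk_by_country: hypothetical lookup from country to "lower"/"medium"/"higher".
    """
    true_counts = Counter(
        (country, project, activity_bucket(edits))
        for country, project, edits in editor_rows
    )
    released = {}
    for key in keyset:
        tier = risk_by_country[key[0]]
        scale = 1 / COUNT_EPSILON[tier]  # sensitivity = 1
        noisy = true_counts[key] + laplace_noise(scale, rng)
        # Assumption: the threshold is compared against the noisy value.
        if noisy >= RELEASE_THRESHOLD[tier]:
            released[key] = noisy
    return released
```

Iterating over the full keyset, rather than only the keys that appear in the data, means empty groups also receive noise; the threshold is then what keeps most of those empty groups out of the output.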

See the code for releasing this data on Wikimedia's GitLab instance.



Privacy parameters

  • Noise type: Laplace pure DP
  • Noise scale: dependent on aggregation and activity bucket, but ranges from 0.9091 to 111.1
  • Privacy budget: at most epsilon = 2. For lower risk countries, epsilon is 1.1 for the count and 0.9 for the sum (2 in total). For medium and higher risk countries, only the count is released, with epsilon 0.2 and 0.1 respectively.
  • Release threshold: 8 editors for lower risk countries, 45 for medium risk countries, 95 for higher risk countries (n/a for edits)

For lower risk countries, editor counts are 95% likely to be within 2.7234 of the true value. For medium risk countries, the corresponding 95% bound is 14.979. For higher risk countries, it is 29.957.
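
The 95% figures above follow from the tail of the Laplace distribution: for noise with scale b = sensitivity / epsilon, the probability that the absolute noise exceeds t is exp(-t / b), so the 95% bound is b · ln(1 / 0.05). The short sketch below reproduces the numbers for the count epsilons with sensitivity 1. The helper names are ours, and the sensitivity of 100 used for the largest noise scale is inferred from the stated range rather than taken from the source.

```python
import math


def laplace_scale(epsilon, sensitivity=1.0):
    """Noise scale b of the Laplace mechanism."""
    return sensitivity / epsilon


def laplace_error_bound(epsilon, sensitivity=1.0, confidence=0.95):
    """Value t such that |noise| <= t with the given probability."""
    return laplace_scale(epsilon, sensitivity) * math.log(1 / (1 - confidence))


for tier, eps in [("lower", 1.1), ("medium", 0.2), ("higher", 0.1)]:
    print(tier, round(laplace_scale(eps), 4), round(laplace_error_bound(eps), 3))

# Smallest scale in the stated range: count for lower risk countries.
print(round(laplace_scale(1.1), 4))
# Largest scale in the stated range, assuming sensitivity 100 at epsilon 0.9.
print(round(laplace_scale(0.9, sensitivity=100), 1))
```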



Utility and accuracy

This dataset was optimized to perform well across several metrics: median relative error, share of rows with relative error over 50%, share of rows with relative error over 90%, spurious rate, and drop rate.

For this data release:

  • Median relative error: ≤10%. At least half of the country-project-activity level-month rows differ from their true value by no more than 10%.
  • Percentage of rows with relative error over 50%: ≤2%. At most 1 in 50 rows is more than 50% off from the true value.
  • Percentage of rows with relative error over 90%: ≤1%. At most 1 in 100 rows is more than 90% off from the true value.
  • Spurious rate: ≤2%. At most 1 in 50 published rows actually has a true value of 0.
  • Drop rate: ≤1%. At most 1 in 100 rows that should have been published was not published.
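
For readers who want to compute comparable metrics on their own simulations, here is one way the five metrics could be defined in code. It is a sketch, not the internal evaluation: the function name is hypothetical, relative error is measured only on published rows with a nonzero true value, and "rows that should have been published" is assumed to mean rows whose true value meets the release threshold.

```python
from statistics import median


def utility_metrics(true_values, published, threshold):
    """Compare published noisy values against true values.

    true_values: key -> true count (keys with no editors may be absent or 0).
    published: key -> released noisy count.
    Both inputs are assumed non-empty, with at least one eligible row.
    """
    rel_errors = [
        abs(noisy - true_values[key]) / true_values[key]
        for key, noisy in published.items()
        if true_values.get(key, 0) > 0
    ]
    # Published rows whose true value is 0.
    spurious = sum(1 for key in published if true_values.get(key, 0) == 0)
    # Assumption: a row "should have been published" if its true value
    # is at or above the release threshold.
    eligible = [key for key, value in true_values.items() if value >= threshold]
    dropped = sum(1 for key in eligible if key not in published)
    return {
        "median_relative_error": median(rel_errors),
        "share_over_50pct": sum(e > 0.5 for e in rel_errors) / len(rel_errors),
        "share_over_90pct": sum(e > 0.9 for e in rel_errors) / len(rel_errors),
        "spurious_rate": spurious / len(published),
        "drop_rate": dropped / len(eligible),
    }
```

Metrics like these need the true values, so they can only be computed by someone with access to the underlying data or on synthetic data.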


Caveats

  • The privacy guarantee of this dataset is that the privacy loss of an individual editor's contribution to the dataset will be limited to a worst case of epsilon = 2. If a person edits using multiple accounts, edits across multiple projects, or edits from multiple countries, their privacy loss will increase proportionally.
  • This dataset is released in conjunction with the geoeditors_weekly dataset, which induces additional privacy loss for each week overlapping with this month. In some cases, months and weeks will line up exactly, but more often there will be 5 or 6 weeks overlapping with each month.
  • Differential privacy necessarily involves adding random noise to data outputs, which means that data in this dataset may not exactly mirror the truth, and some values may be spurious (i.e. that country-project-activity bucket tuple might not appear in the underlying dataset). We've introduced the release threshold to deal with this fact, but keep in mind that these values are not 100% exact and some rows may be incorrect.
    • Counts that are closer to the release threshold (8, 45, or 95 editors, depending on country risk) are likelier to be spurious.
    • Sums have smaller epsilon, and may have a greater absolute error than counts.
  • This dataset only considers Wikipedia, Wiktionary, Wikibooks, Wikiquote, Wikisource, Wikinews, Wikivoyage, Wikiversity, Commons, Wikidata, Meta, and Mediawiki. Other projects are not included.


Other DP datasets

You can find data partitioned on a weekly basis for more granular analysis in the geoeditors weekly dataset.