Geoeditors Differential Privacy

Time range: January 2018 – present
Time granularity: monthly
Data features:
- wiki_db (corresponds to project)
- project (e.g. “en.wikipedia”, “wikidata”, “zh.wikibooks”, etc.)
- country name (excluding countries on the country protection list)
- country code (ISO alpha-2 codes, including "--" for unknown)
- editor activity level (either "1 to 4", "5 to 99", or "100 or more")
- count epsilon (the epsilon value used for the count aggregation, either 1.1, 0.2, or 0.1 depending on the risk of the country)
- sum epsilon (the epsilon value used for the sum aggregation, either 0.9 or null)
- editor count (total number of editors in this group, differentially privatized)
- edit sum (total number of edits made by this group, differentially privatized, or null)
- month (in YYYY-MM form)
Dataset structure:

geoeditors_monthly/
|-> 2023-07.csv
|-> 2023-08.csv
...
|-> <year>-<month>.csv

Hive table access (all months): differential_privacy.geoeditors_monthly

Query the editors daily table for the total number of edits made in a given month per editor
Create a keyset of all projects, countries, and edit activity buckets for a given month
Cast each edit count to an edit activity bucket (e.g. 3 edits --> 1 to 4 edits, 27 edits --> 5 to 99 edits, 143 edits --> 100 or more edits)
Group by country, project, and edit activity bucket and count editors, adding Laplace noise (pure differential privacy, sensitivity=1, epsilon=1.1 for lower risk countries, 0.2 for medium risk, 0.1 for higher risk) to each row of the output dataset
Drop all rows with fewer than 8 (lower risk), 45 (medium risk), or 95 (higher risk) editors to reduce spurious data
Use released rows for the editor count as the keyset for the edit sum
Group by country, project, and edit activity bucket and sum edits, adding Laplace noise (pure differential privacy, epsilon=0.9 for lower risk countries) with clamping bounds and noise scale depending on the bucket (1-4, 5-99, 100-101; all edit counts above 101 are cast to 101). Since each editor can only contribute to one bucket, this is a parallel mechanism and doesn't introduce more privacy loss
Calculate internal error metrics to ensure that we don't have data drift
Share the final table with the world!
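
The counting half of the steps above can be sketched roughly as follows. This is an illustrative sketch only, not the released pipeline code: the function names, the `risk_of` mapping, and the input shapes are hypothetical, the Laplace sampler is a plain standard-library stand-in for a vetted DP library, and it assumes the release threshold is applied to the noisy count. The edit sum step is not covered.

```python
import random
from collections import Counter

# (count epsilon, release threshold) per risk tier, taken from the steps above.
TIERS = {"lower": (1.1, 8), "medium": (0.2, 45), "higher": (0.1, 95)}


def activity_bucket(edits):
    """Cast a monthly edit count to its edit activity bucket."""
    if edits >= 100:
        return "100 or more"
    if edits >= 5:
        return "5 to 99"
    return "1 to 4"


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_editor_counts(editors, keyset, risk_of, rng):
    """Return noisy editor counts for the keys that pass the release threshold.

    editors: iterable of (country, project, monthly_edits), one entry per editor.
    keyset: every (country, project, bucket) key to consider, including empty ones.
    risk_of: hypothetical mapping from country to "lower", "medium", or "higher".
    """
    true_counts = Counter(
        (country, project, activity_bucket(edits))
        for country, project, edits in editors
    )
    released = {}
    for key in keyset:
        epsilon, threshold = TIERS[risk_of[key[0]]]
        # Sensitivity is 1, so the noise scale is 1 / epsilon.
        noisy = true_counts[key] + laplace_noise(1 / epsilon, rng)
        # Assumption: the threshold is checked against the noisy count.
        if noisy >= threshold:
            released[key] = noisy
    return released
```

Iterating over the full keyset rather than only the keys present in the data is what makes spurious rows possible: an empty key can receive enough noise to cross the threshold.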

See the code for releasing this data on Wikimedia's gitlab instance

Noise type: Laplace pure DP
Noise scale: dependent on aggregation and activity bucket, but ranges from 0.9091 to 111.1
Privacy budget: epsilon = 2 at most. For lower risk countries, epsilon is 1.1 for the count and 0.9 for the sum, for a total of 2. For medium risk countries, epsilon is 0.2, and for higher risk countries it is 0.1 (count only, with a null sum epsilon).
Release threshold: 8 editors for lower risk countries, 45 for medium risk countries, 95 for higher risk countries (n/a for edits)
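
As a rough illustration of how these parameters relate, the Laplace mechanism uses a noise scale equal to the sensitivity divided by epsilon, and the epsilons of the count and sum aggregations add up to the total budget. The sketch below uses only the values listed above; the names `EPSILONS`, `laplace_scale`, and `total_epsilon` are made up for this example.

```python
# (count epsilon, sum epsilon or None) per risk tier, as listed above.
EPSILONS = {
    "lower": (1.1, 0.9),
    "medium": (0.2, None),
    "higher": (0.1, None),
}

COUNT_SENSITIVITY = 1  # sensitivity stated for the count aggregation


def laplace_scale(sensitivity, epsilon):
    """Laplace mechanism noise scale: b = sensitivity / epsilon."""
    return sensitivity / epsilon


def total_epsilon(count_eps, sum_eps):
    """Sequential composition: the per-aggregation epsilons add up."""
    return count_eps + (sum_eps or 0.0)


for tier, (count_eps, sum_eps) in EPSILONS.items():
    scale = laplace_scale(COUNT_SENSITIVITY, count_eps)
    budget = total_epsilon(count_eps, sum_eps)
    print(f"{tier}: count noise scale {scale:.4f}, total epsilon {budget:.1f}")
```

The lower risk count scale, 1 / 1.1 ≈ 0.9091, matches the bottom of the stated range. The top of the range, 111.1, would be consistent with a sensitivity of 100 at epsilon 0.9, though the text does not spell that out.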

For lower risk countries, editor counts are 95% likely to be within 2.7234 of the true value. For medium risk countries, they are 95% likely to be within 14.979, and for higher risk countries, within 29.957.
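
These bounds follow from the tail of the Laplace distribution: for noise with scale b, the probability that its absolute value exceeds t is exp(-t / b), so the 95% bound is b × ln(20). A short sketch that reproduces the three figures, assuming sensitivity 1 and the count epsilons above (`laplace_error_bound` is a name invented here):

```python
import math


def laplace_error_bound(epsilon, sensitivity=1.0, confidence=0.95):
    """Half-width t such that P(|noise| <= t) = confidence for Laplace noise.

    For Laplace(0, b), P(|X| > t) = exp(-t / b),
    so t = b * ln(1 / (1 - confidence)).
    """
    scale = sensitivity / epsilon
    return scale * math.log(1 / (1 - confidence))


for tier, epsilon in [("lower", 1.1), ("medium", 0.2), ("higher", 0.1)]:
    print(tier, round(laplace_error_bound(epsilon), 4))
```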

This dataset was optimized to perform well across several metrics: median relative error, percentage of rows with relative error over 50%, percentage of rows with relative error over 90%, spurious rate, and drop rate.

For this data release:

Median relative error: ≤10%. At least half of all country-project-activity bucket-month rows differ from their true value by no more than 10%.
Percentage of rows with relative error over 50%: ≤2%. Fewer than 1 in 50 rows is more than 50% off from the true value.
Percentage of rows with relative error over 90%: ≤1%. Fewer than 1 in 100 rows is more than 90% off from the true value.
Spurious rate: ≤2%. Fewer than 1 in 50 published rows actually has a true value of 0.
Drop rate: ≤1%. Fewer than 1 in 100 rows that should have been published was not published.
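
For readers who want to see how metrics like these could be computed, here is a minimal sketch. It is not the internal error-metric code. It assumes access to the true values, treats a row as one that "should have been published" when its true value meets the release threshold, and leaves spurious rows (true value 0) out of the relative error figures because relative error is undefined there. All names are hypothetical.

```python
from statistics import median


def utility_metrics(true_counts, released, threshold):
    """Compute utility metrics for one group of rows.

    true_counts: key -> true value (0 or missing when the key is empty).
    released: key -> published noisy value.
    threshold: release threshold used for this group of rows.
    """
    rel_errors = [
        abs(noisy - true_counts[key]) / true_counts[key]
        for key, noisy in released.items()
        if true_counts.get(key, 0) > 0
    ]
    spurious = [key for key in released if true_counts.get(key, 0) == 0]
    # Assumption: a row should be published if its true value meets the threshold.
    publishable = [key for key, value in true_counts.items() if value >= threshold]
    dropped = [key for key in publishable if key not in released]

    def share(items, total):
        return len(items) / len(total) if total else None

    return {
        "median_relative_error": median(rel_errors) if rel_errors else None,
        "share_over_50pct": share([e for e in rel_errors if e > 0.5], rel_errors),
        "share_over_90pct": share([e for e in rel_errors if e > 0.9], rel_errors),
        "spurious_rate": share(spurious, released),
        "drop_rate": share(dropped, publishable),
    }
```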

The privacy guarantee of this dataset is that the privacy loss of an individual editor's contribution to the dataset will be limited to a worst-case of epsilon = 2. If a person edits using multiple accounts, edits across multiple projects, or edits from multiple countries, their privacy loss will increase proportionally.
This dataset is released in conjunction with the geoeditors_weekly dataset, which induces additional privacy loss for each week overlapping with this month. In some cases, months and weeks will line up exactly, but more often there will be 5 or 6 weeks overlapping with each month.
Differential privacy necessarily involves adding random noise to data outputs, which means that data in this dataset may not exactly mirror the truth, and some values may be spurious (i.e. that country-project-activity bucket tuple might not appear in the underlying dataset). We've introduced the release threshold to deal with this fact, but keep in mind that these values are not 100% exact and some rows may be incorrect.
- Counts that are closer to the release threshold (8, 45, or 95, depending on the country's risk level) are likelier to be spurious.
- Sums use a smaller epsilon than counts (0.9 versus 1.1), and may have a greater absolute error.
This dataset only considers Wikipedia, Wiktionary, Wikibooks, Wikiquote, Wikisource, Wikinews, Wikivoyage, Wikiversity, Commons, Wikidata, Meta, and MediaWiki. Other projects are not included.

You can find data partitioned on a weekly basis for more granular analysis in the geoeditors weekly dataset.
