Wikimedia Privacy Engineering

Geoeditors Differential Privacy — Weekly

Welcome to the Wikimedia Foundation's differentially-private weekly geoeditor data release!

This dataset uses a technique called differential privacy to safely release granular editor and edit data partitioned by week, allowing users to analyze thousands of rows of editor and edit data at the country-project level.

Differential privacy injects a controlled amount of statistical noise into data before it is released. This noise impedes attempts to recover information about any single individual in the dataset without significantly changing the conclusions drawn from the data.

Unlike its sibling monthly data release, this release has no long-term history of edits by week, so the first data release is from July 2023.

You can find more information about this project on its metawiki homepage.

To download dataset files, go to the monthly dataset or the weekly dataset.

Dataset characteristics

  • Time range: July 2023 – present
  • Time granularity: weekly
  • Data features:
    • wiki_db (corresponds to project)
    • project (e.g. “en.wikipedia”, “wikidata”, “zh.wikibooks”, etc.)
    • country name (excluding countries that are classified as too risky to publish on the country and territory protection list)
    • country code (ISO alpha-2 codes, including "--" for unknown)
    • editor activity level (either "1 to 4", "5 to 99", or "100 or more")
    • editor count (total number of editors in this group, differentially privatized)
    • edit count (total number of edits made by this group, differentially privatized)
    • week_start (Sundays in YYYY-MM-DD form)
  • Dataset structure:
geoeditors_weekly/
|-> 2023-07.tsv
|-> 2023-08.tsv
...
|-> <year>-<month>.tsv
  • Hive table access (all weeks): differential_privacy.geoeditors_weekly
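A minimal sketch of reading one of these TSV files in Python, using only the standard library. The column names and their order below are assumptions based on the feature list above, as is the absence of a header row and the use of integer counts; check a real file before relying on them. The sample row is made up purely for illustration.

```python
import csv
import io

# Assumed column order; the real files may differ or include a header row.
ASSUMED_COLUMNS = [
    "wiki_db", "project", "country", "country_code",
    "activity_level", "editor_count", "edit_count", "week_start",
]

def read_rows(stream, columns=ASSUMED_COLUMNS):
    """Parse tab-separated rows into dicts, converting the two counts to int."""
    reader = csv.DictReader(stream, fieldnames=columns, delimiter="\t")
    rows = []
    for row in reader:
        row["editor_count"] = int(row["editor_count"])
        row["edit_count"] = int(row["edit_count"])
        rows.append(row)
    return rows

# Hypothetical row, not taken from the published data.
sample = "enwiki\ten.wikipedia\tFrance\tFR\t5 to 99\t42\t613\t2023-07-02\n"
rows = read_rows(io.StringIO(sample))
```

To read a downloaded file instead, pass an open file handle, for example `read_rows(open("2023-07.tsv", newline="", encoding="utf-8"))`.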


Data creation pipeline

  1. Query the editors daily table for the total number of edits made in a given week per editor
  2. Create a keyset of all projects, countries, and edit activity buckets for a given week
  3. Cast each edit count to an edit activity bucket (e.g. 3 edits --> 1 to 4 edits, 27 edits --> 5 to 99 edits, 143 edits --> 100 or more edits)
  4. Group by country, project, and edit activity bucket + count, adding Laplace noise (pure differential privacy, epsilon=1.1, sensitivity=1, noise scale=0.9091) to each row of the output dataset
  5. Drop all rows with fewer than 8 editors to reduce spurious data
  6. Use released rows for the editor count as the keyset for the edit sum
  7. Group by country, project, and edit activity bucket + sum, adding Laplace noise (pure differential privacy, epsilon=0.9) with clamping bounds and noise scale depending on the bucket (1-4, 5-99, 100-101 [we cast all edit counts >101 to 101]) — since each editor can only contribute to one bucket, this is a parallel mechanism and doesn't introduce more privacy loss
  8. Calculate internal error metrics to ensure that we don't have data drift
  9. Share the final table with the world!
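
Steps 3 to 5 above can be sketched roughly as follows. This is an illustration under stated assumptions, not the production pipeline: the function names are hypothetical, the Laplace noise is sampled with Python's standard library rather than a vetted differential privacy framework, and the threshold is assumed to apply to the noisy count. The edit-sum noise of step 7 is left out because the per-bucket noise scales are not spelled out here; only the clamping bounds are shown.

```python
import random
from collections import Counter

# Clamping bounds per bucket for the edit sum (step 7).
CLAMP_BOUNDS = {"1 to 4": (1, 4), "5 to 99": (5, 99), "100 or more": (100, 101)}

def activity_bucket(edits):
    """Map a weekly edit count (assumed >= 1) to its activity bucket."""
    if edits >= 100:
        return "100 or more"
    if edits >= 5:
        return "5 to 99"
    return "1 to 4"

def clamp_edits(edits):
    """Clamp an editor's weekly edit count to the bounds of its bucket."""
    low, high = CLAMP_BOUNDS[activity_bucket(edits)]
    return max(low, min(high, edits))

def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is Laplace-distributed with location 0 and scale `scale`.
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def noisy_editor_counts(editor_rows, keyset, epsilon=1.1, threshold=8, rng=None):
    """editor_rows: iterable of (country, project, weekly_edits), one per editor.
    keyset: every (country, project, bucket) key that may be released."""
    rng = rng or random.Random()
    true_counts = Counter((c, p, activity_bucket(n)) for c, p, n in editor_rows)
    scale = 1 / epsilon  # sensitivity 1: each editor adds to exactly one key
    released = {}
    for key in keyset:
        noisy = true_counts.get(key, 0) + laplace_noise(scale, rng)
        if noisy >= threshold:  # assumption: threshold applies to the noisy count
            released[key] = noisy
    return released
```

Iterating over the full keyset rather than only the keys seen in the data matters: noise is added to empty keys too, so the set of released rows does not reveal which keys had editors.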

See the code for releasing this data on Wikimedia's GitLab instance.



Privacy parameters

  • Noise type: Laplace pure DP
  • Noise scale: dependent on aggregation and activity bucket, but ranges from 0.9091 to 111.1
  • Privacy budget: epsilon = 2
  • Release threshold: 8 editors (n/a for edits)

Editor counts are 95% likely to be within 2.7234 of the true value.
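
The 2.7234 figure follows from the Laplace distribution: for noise with scale b, P(|noise| ≤ t) = 1 − exp(−t/b), so the 95% bound is t = b · ln(20). A short check in Python, with hypothetical helper names:

```python
import math

def laplace_scale(sensitivity, epsilon):
    """Noise scale of the Laplace mechanism."""
    return sensitivity / epsilon

def laplace_error_bound(scale, confidence=0.95):
    # P(|X| <= t) = 1 - exp(-t / scale)  =>  t = scale * ln(1 / (1 - confidence))
    return scale * math.log(1 / (1 - confidence))

scale = laplace_scale(sensitivity=1, epsilon=1.1)
bound = laplace_error_bound(scale)
print(round(scale, 4), round(bound, 4))  # 0.9091 2.7234
```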



Utility and accuracy

This dataset was optimized to perform well across several metrics: median relative error, the share of rows with relative error over 50%, the share of rows with relative error over 90%, spurious rate, and drop rate.

For data release:

  • Median relative error: ≤12%. The median country-project-activity level-week row differs from its true value by no more than 12%.
  • Percentage of rows with relative error over 50%: ≤2%. At most 1 in 50 rows is more than 50% off from the true value.
  • Percentage of rows with relative error over 90%: ≤1%. At most 1 in 100 rows is more than 90% off from the true value.
  • Spurious rate: ≤5%. At most 1 in 20 published rows actually has a true value of 0.
  • Drop rate: ≤1%. At most 1 in 100 rows that should have been published was not published.
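
One way these metrics could be computed from a true table and a released table is sketched below. The definitions are our reading of the bullets above, not the project's internal code: in particular, "should have been published" is assumed to mean a true value at or above the release threshold, and the function and key names are hypothetical.

```python
from statistics import median

def _share(numerator, denominator):
    """Ratio that is 0.0 when the denominator is empty."""
    return numerator / denominator if denominator else 0.0

def utility_metrics(true_counts, released, threshold=8):
    """true_counts, released: dicts mapping a row key to a value.
    Keys missing from true_counts are treated as having a true value of 0."""
    rel_errors = [
        abs(released[key] - true) / true
        for key, true in true_counts.items()
        if key in released and true > 0
    ]
    spurious = sum(1 for key in released if true_counts.get(key, 0) == 0)
    # Assumption: a row "should have been published" if its true value
    # meets the release threshold.
    expected = [key for key, true in true_counts.items() if true >= threshold]
    dropped = sum(1 for key in expected if key not in released)
    return {
        "median_relative_error": median(rel_errors) if rel_errors else 0.0,
        "share_over_50pct": _share(sum(e > 0.5 for e in rel_errors), len(rel_errors)),
        "share_over_90pct": _share(sum(e > 0.9 for e in rel_errors), len(rel_errors)),
        "spurious_rate": _share(spurious, len(released)),
        "drop_rate": _share(dropped, len(expected)),
    }
```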


Caveats

  • The privacy guarantee of this dataset is that the privacy loss from an individual editor's contribution to the dataset will be limited to a worst case of epsilon = 2. If a person edits using multiple accounts, edits across multiple projects, or edits from multiple countries, their privacy loss will increase proportionally.
  • This dataset is released in conjunction with the geoeditors monthly dataset, which induces additional privacy loss for each month overlapping with a given week. In most cases, a single month will entirely contain a week (meaning a cumulative privacy loss of 4), but sometimes 2 months will overlap with a given week (yielding a cumulative privacy loss of 6).
  • Differential privacy necessarily involves adding random noise to data outputs, which means that data in this dataset may not exactly mirror the truth, and some values may be spurious (i.e. that country-project-activity bucket tuple might not appear in the underlying dataset). We've introduced the release threshold to deal with this fact, but keep in mind that these values are not 100% exact and some rows may be incorrect.
    • Counts that are closer to the release threshold of 8 are likelier to be spurious.
    • Sums have a smaller epsilon (0.9, versus 1.1 for counts) and may have a greater absolute error than counts.
  • This dataset only considers Wikipedia, Wiktionary, Wikibooks, Wikiquote, Wikisource, Wikinews, Wikivoyage, Wikiversity, Commons, Wikidata, Meta, and MediaWiki. Other projects are not included.
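
The cumulative privacy loss figures of 4 and 6 in the caveats above can be reproduced with a small calculation. This sketch assumes that the monthly release also spends epsilon = 2 per month (inferred from the totals quoted, not stated directly) and that a week runs for seven days from its week_start Sunday; the names are hypothetical.

```python
from datetime import date, timedelta

WEEKLY_EPSILON = 2.0
MONTHLY_EPSILON = 2.0  # assumption inferred from the stated totals of 4 and 6

def months_overlapping(week_start):
    """Set of (year, month) pairs touched by the 7 days starting at week_start."""
    days = (week_start + timedelta(days=i) for i in range(7))
    return {(d.year, d.month) for d in days}

def cumulative_epsilon(week_start):
    """Weekly budget plus one monthly budget per overlapping month."""
    return WEEKLY_EPSILON + MONTHLY_EPSILON * len(months_overlapping(week_start))

print(cumulative_epsilon(date(2023, 7, 2)))   # week inside July: 4.0
print(cumulative_epsilon(date(2023, 7, 30)))  # week spans July and August: 6.0
```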


Other DP datasets

You can find data partitioned on a monthly basis for more granular analysis in the geoeditors monthly dataset.