This file describes the contents of
https://analytics.wikimedia.org/datasets/wmde-analytics-engineering/Wikidata/WD_DataQuality/
The data sets and visualizations described here are produced in support of the Wikidata Quality Report.
Contact: Goran S. Milovanović, Data Scientist, WMDE, goran.milovanovic_ext@wikimedia.de
1. dataQuality_Stats.csv
- Field 'total_n_items': the total number of Wikidata items considered;
- Field 'total_n_items_used': the total number of Wikidata items used in Wikimedia projects;
- Field 'total_n_items_from_ORES': the total number of Wikidata items assessed by ORES;
- Field 'total_n_items_with_ORES_prediction': the total number of Wikidata items with ORES quality prediction;
- Field 'total_n_items_with_ORES_prediction_final': the total number of Wikidata items with ORES quality prediction following the merge of ORES and WDCM usage data sets;
- Field 'total_n_items_used_with_ORES_prediction_final': the total number of Wikidata items that are used in Wikimedia projects with ORES quality prediction;
- Field 'ORES_date': the actual ORES prediction update timestamp;
- Field 'WDCM date': the actual WDCM data set update timestamp.
2. dataQuality_oresScoresDistribution.csv
- The exact number of Wikidata items receiving an ORES item quality prediction in each of the quality categories: A, B, C, D, or E.
3. dataQuality_scoreUsage.csv
- WDCM summary usage statistics for all five item quality categories (A, B, C, D, and E) obtained from ORES predictions: mean usage, median usage, maximum and minimum usage.
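The per-category summary described above (mean, median, maximum, and minimum usage) can be sketched as follows. This is a minimal illustration, not the code that produced the file: the function name `summarize_usage`, the `(score, usage)` input shape, and the example values are all assumptions made for the sketch.

```python
from collections import defaultdict
from statistics import mean, median


def summarize_usage(rows):
    """Group (score, usage) pairs by score and compute summary statistics.

    The (score, usage) input shape is an assumption for this sketch.
    """
    by_score = defaultdict(list)
    for score, usage in rows:
        by_score[score].append(usage)
    return {
        score: {
            "mean": mean(values),
            "median": median(values),
            "max": max(values),
            "min": min(values),
        }
        for score, values in sorted(by_score.items())
    }


# Made-up illustrative rows, not real WDCM or ORES data.
example_rows = [("A", 10), ("A", 30), ("B", 5), ("B", 7), ("B", 9)]
summary = summarize_usage(example_rows)
```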
4. dataQuality_mostUsedItemsCategory.csv
- The 1,000 most used Wikidata items from all five item quality categories (A, B, C, D, and E) obtained from ORES predictions.
5. dataQuality_mostUsedItemsCategory_10000.csv
- The 10,000 most used Wikidata items alongside their item quality categories (A, B, C, D, and E) obtained from ORES predictions.
6. dataQuality_oresScoresDistribution_10000.csv
- The exact number of Wikidata items receiving an ORES item quality prediction in each of the quality categories (A, B, C, D, or E), for the 10,000 most used items.
7. dataQuality_scoreUsage_10000.csv
- WDCM summary usage statistics for all five item quality categories (A, B, C, D, and E) obtained from ORES predictions: mean usage, median usage, maximum and minimum usage, for the 10,000 most used items.
8. scoreUsage_BoxPlot.csv
- Data set for the boxplot visualization of the WDCM re-use statistics aggregated over the five categories of ORES data quality predictions:
- Field: 'usage': the WDCM item re-use statistics;
- Field: 'score': the ORES item quality prediction (A, B, C, D, or E);
9. scoreUsage_BoxPlot_ggplot2.png
- The .png file for the scoreUsage_BoxPlot (see: 8) visualization.
10. scoreUsageOutliers_1000.csv
- The 1,000 most used Wikidata items per quality category (A, B, C, D, or E) that are recognized as outliers (i.e., used more than Q3 + 1.5 * IQR(usage) or less than Q1 - 1.5 * IQR(usage), where Q1 and Q3 are the 1st and 3rd quartiles of usage) **per item quality category** (i.e., the outliers were detected for each quality group separately).
11. scoreUsage_BoxPlot_outliers_removed.csv
- Data set for the boxplot visualization of the WDCM re-use statistics aggregated over the five categories of ORES data quality predictions, with outliers removed separately from each of the ORES item quality categories (A, B, C, D, or E):
- Field: 'usage': the WDCM item re-use statistics;
- Field: 'score': the ORES item quality prediction (A, B, C, D, or E);
12. scoreUsage_BoxPlot_ggplot2_outliers_removed.png
- The .png file for the scoreUsage_BoxPlot (see: 8) visualization with outliers removed from each of the ORES item quality categories (A, B, C, D, or E) (see: 11).
13. scoreUsage_BoxPlot_ggplot2_outliers_removed_unique_values.png
- The .png file for the scoreUsage_BoxPlot (see: 8) visualization with outliers removed from each of the ORES item quality categories (A, B, C, D, or E) (see: 11); shows data points only, and only unique values of the re-use statistic.
14. positiveOutliers.csv
- The 1,000 most used Wikidata items per quality category (A, B, C, D, or E) that are recognized as outliers (i.e., used more than Q3 + 1.5 * IQR(usage) or less than Q1 - 1.5 * IQR(usage), where Q1 and Q3 are the 1st and 3rd quartiles of usage) **in general** (i.e., the outliers were detected from the data set as a whole, not quality group-wise).
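The difference between the group-wise detection of (10) and the whole-data-set detection of (14) can be sketched with the 1.5 * IQR rule. This is a minimal sketch under stated assumptions: the function names, the `(item, score, usage)` row shape, the item identifiers, and the usage values are hypothetical, and the quartile convention (`method="inclusive"`) is an assumption, since this file does not state which one was used.

```python
from collections import defaultdict
from statistics import quantiles


def iqr_fences(values):
    """Return (lower, upper) fences: Q1 - 1.5 * IQR and Q3 + 1.5 * IQR."""
    # The quartile method is an assumption; other conventions shift the fences.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def flag_outliers(rows, per_category):
    """Return the set of item ids whose usage falls outside the fences.

    rows: iterable of (item, score, usage) tuples (hypothetical shape).
    per_category=True computes fences within each quality category;
    per_category=False computes one pair of fences over all rows.
    """
    groups = defaultdict(list)
    for item, score, usage in rows:
        groups[score if per_category else "all"].append((item, usage))
    flagged = set()
    for members in groups.values():
        low, high = iqr_fences([usage for _, usage in members])
        flagged.update(
            item for item, usage in members if usage < low or usage > high
        )
    return flagged


# Made-up rows with hypothetical item ids, not real Wikidata items.
rows = [
    ("X1", "A", 1), ("X2", "A", 2), ("X3", "A", 3), ("X4", "A", 4),
    ("X5", "A", 100),
    ("X6", "B", 50), ("X7", "B", 52), ("X8", "B", 54), ("X9", "B", 56),
    ("X10", "B", 58),
]
group_wise = flag_outliers(rows, per_category=True)
overall = flag_outliers(rows, per_category=False)
```

In this made-up example, X5 is flagged only by the group-wise pass: its usage is extreme relative to the other category A items, but not relative to the data set as a whole.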
15. distribution_Quality_USage.png
- Distributions of the WDCM re-use statistics across the ORES predictions for quality category (A, B, C, D, or E).
16. revids_timeline_quality_classes.png
- Time series (YYYY-MM): how many latest revids the items from each quality category (A, B, C, D, or E) received, and when.
17. revids_timeline_quality_classes.csv
- The data set from which (16) was produced.
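A monthly time series of the kind described in (16) and (17) could be built by counting latest revisions per YYYY-MM and quality category. The sketch below is illustrative only: the function name `monthly_counts`, the `(score, timestamp)` row shape, and the ISO 8601 timestamp format are assumptions, not the documented layout of the CSV file.

```python
from collections import Counter
from datetime import datetime


def monthly_counts(rows):
    """Count rows per (YYYY-MM, score).

    rows: iterable of (score, timestamp) pairs, where timestamp is an
    ISO 8601 string without a time zone suffix (assumed format).
    """
    counts = Counter()
    for score, timestamp in rows:
        month = datetime.fromisoformat(timestamp).strftime("%Y-%m")
        counts[(month, score)] += 1
    return counts


# Made-up timestamps, not real revision data.
example_rows = [
    ("A", "2019-03-05T12:00:00"),
    ("A", "2019-03-20T08:30:00"),
    ("B", "2019-03-07T00:00:00"),
    ("A", "2019-04-01T10:15:00"),
]
timeline = monthly_counts(example_rows)
```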