A Comprehensive Dataset of Classified Citations with Identifiers from English Wikipedia (2024)
| Creators | |
|---|---|
| Publication date | 01-03-2024 |
| Description |
2024 (new!)
This is a dataset of 44.766.800 (+9.2%) citations extracted from the English Wikipedia February 2024 dump (https://dumps.wikimedia.org/enwiki/20240220/).
The same extraction and template harmonization pipeline was used as for the previous year's dataset, and the published dataset fields are the same as in that dataset. Each citation is assigned a classification label ('news', 'book', 'journal' or 'other') by a deterministic rule-based classifier that analyses the available identifiers (see the code documentation for details). This reveals the following citation subgroups:
1. Total number of news citations: 10.958.151 (+9.4%)
2. Total number of book citations*: 3.277.629 (+8.6%)
3. Total number of journal citations*: 2.248.748 (+8.7%)
* Please note that these numbers do not represent the overall number of book and journal citations: we count only citations with DOI, PMID, PMC or ISBN identifiers assigned by authors (that is, before the lookup process that augments citations with missing identifiers).
This dataset does not include identifiers found via the lookup process (there is no 'acquired_ID_list' field). If you are interested in such an augmented version, see the source code for instructions or contact the authors for assistance.
|
| Publisher | Zenodo |
| Organisations |
|
| Document type | Dataset |
| DOI | https://doi.org/10.5281/zenodo.10782978 |
| Other links | https://zenodo.org/records/10782978 |
| Permalink to this page | |
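
The subgroup counts in the description can be turned into shares of the whole dataset with a few lines of Python. This is only an illustrative sketch: the counts are copied from the description, the helper name `label_shares` is hypothetical, and it assumes that every citation not labelled 'news', 'book' or 'journal' falls into 'other', as implied by the four-label scheme.

```python
# Counts as reported in the description (English Wikipedia, February 2024 dump).
TOTAL_CITATIONS = 44_766_800
LABELLED = {
    "news": 10_958_151,
    "book": 3_277_629,
    "journal": 2_248_748,
}


def label_shares(total, labelled):
    """Return {label: (count, share_of_total)}.

    Assumption: 'other' is the residual class, i.e. every citation
    that did not receive one of the three listed labels.
    """
    counts = dict(labelled)
    counts["other"] = total - sum(labelled.values())
    return {label: (n, n / total) for label, n in counts.items()}


shares = label_shares(TOTAL_CITATIONS, LABELLED)
for label, (n, share) in shares.items():
    # Fixed-width columns: label, count with thousands separators, percentage.
    print(f"{label:<8}{n:>12,}{share:>8.1%}")
```

Under that assumption, news citations make up roughly 24.5% of the dataset and the residual 'other' class roughly 63.2%. As the footnote in the description says, the book and journal figures cover only citations with author-assigned DOI, PMID, PMC or ISBN identifiers, so their shares are lower bounds.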
