A Comprehensive Dataset of Classified Citations with Identifiers from English Wikipedia (2024)

doi:https://doi.org/10.5281/zenodo.10782978


Creators	Natallia Kokash, Giovanni Colavizza
Publication date	01-03-2024
Description	2024 (new!) This is a dataset of 44,766,800 (+9.2%) citations extracted from the English Wikipedia February 2024 dump (https://dumps.wikimedia.org/enwiki/20240220/). The same extraction and template harmonization pipeline was used as in the previous year, and the published dataset fields are the same as in the previous dataset. Each citation is assigned a classification label ('news', 'book', 'journal' or 'other') by a deterministic rule-based classifier that analyses the available identifiers (see the code documentation for details), revealing the following citation subgroups: 1. Total number of news citations: 10,958,151 (+9.4%). 2. Total number of book citations: 3,277,629 (+8.6%). 3. Total number of journal citations: 2,248,748 (+8.7%). Please note that these numbers do not represent the overall number of book and journal citations: only citations with DOI, PMID, PMC and ISBN identifiers assigned by authors are counted (prior to the lookup process that augments citations with missing identifiers). This dataset does not include identifiers located via the lookup process (there is no 'acquired_ID_list' field). If there is interest in such an augmented version, see the source code for instructions or contact the authors for assistance with this task.
Publisher	Zenodo
Organisations	Faculty of Science (FNWI) - Informatics Institute (IVI); Interfacultary Research - Institute for Logic, Language and Computation (ILLC)
Document type	Dataset
DOI	https://doi.org/10.5281/zenodo.10782978
Other links	https://zenodo.org10782978
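
The subgroup counts in the description can be put in proportion to the full dataset with a few lines of arithmetic. The following is a minimal sketch, not part of the published pipeline: the function name `label_shares` is hypothetical, and it assumes the four labels are mutually exclusive, so that the 'other' count is simply the total minus the three reported subgroups.

```python
# Counts copied from the dataset description; nothing here is read from the data files.
TOTAL_CITATIONS = 44_766_800
LABELLED = {
    "news": 10_958_151,
    "book": 3_277_629,
    "journal": 2_248_748,
}


def label_shares(total, labelled):
    """Return each label's share of the total, plus the 'other' remainder.

    Assumption: every citation carries exactly one label, so whatever is
    not counted in `labelled` falls under 'other'.
    """
    counts = dict(labelled)
    counts["other"] = total - sum(labelled.values())
    return {label: n / total for label, n in counts.items()}


if __name__ == "__main__":
    for label, share in label_shares(TOTAL_CITATIONS, LABELLED).items():
        print(f"{label:8s}{share:7.1%}")
```

As the description notes, the book and journal figures count only citations that already carry DOI, PMID, PMC or ISBN identifiers, so the resulting shares are lower bounds for those two categories.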

UvA-DARE

Digital Academic Repository