- Web-based text anonymization with Node.js: Introducing NETANOS (Named entity-based Text Anonymization for Open Science)
- The Journal of Open Source Software
- Volume | Issue number: 2 | 14
- Faculty of Social and Behavioural Sciences (FMG)
- Psychology Research Institute (PsyRes)
NETANOS (Named Entity-based Text ANonymization for Open Science) is natural language processing software that anonymizes texts by identifying and replacing named entities.
The key feature of NETANOS is that the anonymization preserves critical context that allows for secondary linguistic analyses on anonymized texts.
Consider the example string “Max and Ben spent more than 1000 hours on writing the software. They started in August 2016 in Amsterdam.” While coarse anonymization such as simple “XXX” replacement would suffice to mask the true content of the string, it discards text properties that secondary analyses require. For example, content-based deception detection approaches rely on the number of specific times and dates to differentiate between deceptive and truthful texts (Warmelink et al. 2013).
Specifically, text anonymization proceeds stepwise. First, the input string is analyzed by Stanford’s NER, which identifies organizations, locations, persons, and dates. Second, all identified entities are replaced with their context-preserving anonymized versions. Third, NLP-compromise’s named entity recognition tool is applied to identify any remaining, unrecognized entities.
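The stepwise procedure above can be illustrated with a minimal Node.js sketch. It is not the NETANOS implementation: the two recognizer functions are hypothetical stubs that stand in for Stanford's NER and NLP-compromise, and the names `anonymize`, `primaryRecognizer`, and `fallbackRecognizer` are assumptions made for illustration. Each recognizer is assumed to return a list of `{ text, type }` objects, and entities are assumed to start and end with a word character.

```javascript
// Sketch only: the recognizers are hypothetical stubs, not the Stanford NER
// or NLP-compromise APIs.

// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Run each recognizer in turn and replace every entity it reports with a
// placeholder of the form [TYPE_n], where n counts distinct entities per type.
function anonymize(text, recognizers) {
  const counters = {};     // last index handed out per entity type
  const seen = new Map();  // entity text -> placeholder already assigned
  let result = text;
  for (const recognize of recognizers) {
    for (const { text: entity, type } of recognize(result)) {
      if (!seen.has(entity)) {
        counters[type] = (counters[type] || 0) + 1;
        seen.set(entity, `[${type}_${counters[type]}]`);
      }
      const placeholder = seen.get(entity);
      // The "g" flag masks every occurrence of the entity in the string.
      const pattern = new RegExp(`\\b${escapeRegExp(entity)}\\b`, 'g');
      result = result.replace(pattern, () => placeholder);
    }
  }
  return result;
}

// Hypothetical first pass (stand-in for the main recognizer).
const primaryRecognizer = () => [
  { text: 'Max', type: 'PERSON' },
  { text: '1000 hours', type: 'DATE/TIME' },
  { text: 'August 2016', type: 'DATE/TIME' },
  { text: 'Amsterdam', type: 'LOCATION' },
];

// Hypothetical second pass that catches an entity the first pass missed.
const fallbackRecognizer = () => [{ text: 'Ben', type: 'PERSON' }];

const sample =
  'Max and Ben spent more than 1000 hours on writing the software. ' +
  'They started in August 2016 in Amsterdam.';

console.log(anonymize(sample, [primaryRecognizer, fallbackRecognizer]));
```

Because the per-type counters are shared across both passes, an entity found only by the second recognizer continues the numbering started by the first one.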
Besides its key feature of context-preserving text anonymization, NETANOS provides three alternative anonymization types:
• Context-preserving anonymization (key feature): Identified named entities are replaced with a composite string consisting of the entity type and its index of occurrence. “[PERSON_1] and [PERSON_2] spent more than [DATE/TIME_1] on writing the software. They started in [DATE/TIME_2] in [LOCATION_1].”
• Named entity-based replacement: Identified entities are replaced with a different, randomly chosen named entity of the same type. “Barry and Rick spent more than 997 hours on writing the software. They started in January 14 2016 in Odessa.”
• Non-context preserving anonymization: This replacement type is inspired by the anonymization procedure suggested by the UK Data Service (UK Data Service, n.d.). It replaces all strings having a capital first letter and all numeric values with XXX. “XXX and XXX spent more than XXX hours on writing the software. XXX started in XXX XXX in XXX.”
• Combined, non-context preserving anonymization: The context-preserving replacement is used to identify candidates for replacement, which are then replaced using the procedure of the non-context preserving replacement. “XXX and XXX spent more than XXX XXX on writing the software. XXX started in XXX XXX in XXX.”
Note that all replacements are applied globally across the input string.
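The non-context preserving replacement needs no entity recognition, so it can be approximated with a single regular expression. The sketch below is an assumption-laden illustration, not the NETANOS code: the function name `maskCapitalizedAndNumeric` is hypothetical, and the pattern only covers ASCII letters and plain digit runs (optionally with `.` or `,` separators).

```javascript
// Sketch only: approximates "replace every capitalized string and every
// numeric value with XXX" for ASCII text.
function maskCapitalizedAndNumeric(text) {
  // Either a word that starts with an uppercase letter, or a number that may
  // contain "." or "," between digit groups. The "g" flag applies the
  // replacement globally across the input string.
  const pattern = /\b(?:[A-Z][A-Za-z]*|\d+(?:[.,]\d+)*)\b/g;
  return text.replace(pattern, 'XXX');
}

const example =
  'Max and Ben spent more than 1000 hours on writing the software. ' +
  'They started in August 2016 in Amsterdam.';

console.log(maskCapitalizedAndNumeric(example));
// → XXX and XXX spent more than XXX hours on writing the software. XXX started in XXX XXX in XXX.
```

The output shows why this variant loses context: the sentence-initial “They” is masked just like a name, and nothing distinguishes persons, dates, and locations afterwards.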