- Using anchor text, spam filtering and Wikipedia for web search and entity ranking
- Nineteenth Text REtrieval Conference (TREC 2010), Gaithersburg, MD
- Book/source title
- The Nineteenth Text REtrieval Conference (TREC 2010) proceedings
- National Institute of Standards and Technology (NIST)
- Document type
- Conference contribution
- Interfacultary Research Institutes
- Institute for Logic, Language and Computation (ILLC)
In this paper, we document our participation in the TREC 2010 Entity Ranking and Web Tracks. We had multiple aims. For the Web Track, we wanted to compare the effectiveness of anchor text in the category A and category B collections, and to measure the impact of global document quality measures such as PageRank and spam scores. We find that documents in ClueWeb09 category B have a higher probability of being retrieved than other documents in category A. In ClueWeb09 category B, spam is mainly an issue for full-text retrieval; anchor text suffers little from spam. Spam scores can be used to filter spam, but also to find key resources: documents that are least likely to be spam tend to be high-quality results. For the Entity Ranking Track, we use Wikipedia as a pivot to find relevant entities on the Web. Using category information to retrieve entities within Wikipedia leads to large improvements over our baseline run, which does not use category information, although our best scores are still weak. Following the external links on Wikipedia pages to find the homepages of the entities in the ClueWeb collection works better than either searching an anchor text index or combining the external links with an anchor text search.