Link detection in XML documents: What about repeated links?
| Authors | |
|---|---|
| Publication date | 2008 |
| Host editors | |
| Book title | Proceedings of the SIGIR 2008 Workshop on Focused Retrieval: Held in Singapore, 24 July 2008 |
| ISBN | |
| Event | SIGIR 2008 Workshop on Focused Retrieval (Singapore) |
| Pages (from-to) | 59-66 |
| Publisher | Dunedin: University of Otago, Department of Computer Science |
| Organisations | |
| Abstract | Link detection is a special case of focused retrieval where potential links between documents have to be detected automatically. The use case, as studied at INEX's Link the Wiki track, is that of a new, orphaned page (here, a structured XML document) for which we need to detect relevant incoming and outgoing links to other pages (here, the INEX Wikipedia collection). We focus on outgoing links and investigate link density, and especially repeated occurrences of links with the same anchor text and destination. We provide an extensive analysis of link density and repetition, and look at parameters like the document's length, the distance between anchor text occurrences, and the frequency of the anchor text within an article. We also conduct experiments trying to determine what should be done with links that are repeated. We describe alternative approaches and compare them against two baselines: the first baseline is to link only once, and the second is to link all candidates. The performance is measured with precision and recall in terms of the total set of discovered links. Our main finding is that, although the overall impact of link repetition is modest, performance can increase by taking an informed approach to link repetition. |
| Document type | Conference contribution |
| Published at | http://www.cs.otago.ac.nz/sigirfocus2008/paper_14.pdf |
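
The abstract compares two baselines (link only once, link all candidates) and scores them with precision and recall over the set of discovered links. A minimal Python sketch of that comparison follows. It is an illustration only, not the authors' implementation: the `(anchor, destination, position)` tuple layout, the function names `link_once`, `link_all` and `precision_recall`, and the toy data are all assumptions introduced here.

```python
from typing import Iterable, Set, Tuple

# Hypothetical representation of one link occurrence:
# (anchor text, destination page, token position in the article).
Link = Tuple[str, str, int]


def link_once(candidates: Iterable[Link]) -> Set[Link]:
    """Baseline 1: keep only the first occurrence of each (anchor, destination) pair."""
    seen = set()
    kept = set()
    for anchor, dest, pos in sorted(candidates, key=lambda c: c[2]):
        if (anchor, dest) not in seen:
            seen.add((anchor, dest))
            kept.add((anchor, dest, pos))
    return kept


def link_all(candidates: Iterable[Link]) -> Set[Link]:
    """Baseline 2: link every candidate occurrence."""
    return set(candidates)


def precision_recall(predicted: Set[Link], gold: Set[Link]) -> Tuple[float, float]:
    """Precision and recall of predicted link occurrences against the gold set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy data (invented for illustration): the anchor "XML" occurs three times,
# and the gold article links it at two of those positions.
candidates = [("XML", "XML", 3), ("XML", "XML", 40), ("XML", "XML", 90), ("SIGIR", "SIGIR", 10)]
gold = {("XML", "XML", 3), ("XML", "XML", 90), ("SIGIR", "SIGIR", 10)}

for name, strategy in (("link once", link_once), ("link all", link_all)):
    p, r = precision_recall(strategy(candidates), gold)
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
# link once: precision=1.00 recall=0.67
# link all: precision=0.75 recall=1.00
```

On this toy input the two baselines trade precision against recall, which is the gap an informed repetition strategy would try to close.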
