
Diversification Based Static Index Pruning - Application to Temporal Collections

Zeynep Pehlivan 1 Benjamin Piwowarski 1 Stéphane Gançarski 1
1 BD - Bases de Données
LIP6 - Laboratoire d'Informatique de Paris 6
Abstract : Nowadays, web archives preserve the history of large portions of the web. As media shift from printed to digital editions, access to these huge information sources is drawing increasing attention from national and international institutions, as well as from the research community. These collections are intrinsically large, leading to index files that do not fit into memory and to increased query response time. Decreasing the index size is a direct way to decrease this query response time. Static index pruning methods reduce the size of indexes by removing part of the postings. In the context of web archives, it is necessary to remove postings while preserving the temporal diversity of the archive. None of the existing pruning approaches takes (temporal) diversification into account. In this paper, we propose a diversification-based static index pruning method. It differs from existing pruning approaches by integrating diversification into the pruning process. We aim to prune the index while preserving retrieval effectiveness and diversity, maximizing a given IR evaluation metric such as DCG during pruning. We show how to apply this approach in the context of web archives. Finally, we show on two collections that search effectiveness in temporal collections after pruning can be improved using our approach rather than diversity-oblivious approaches.
Contributor : Benjamin Piwowarski
Submitted on : Thursday, September 1, 2016 - 11:25:41 AM
Last modification on : Thursday, March 21, 2019 - 1:21:22 PM


  • HAL Id : hal-01358687, version 1


Zeynep Pehlivan, Benjamin Piwowarski, Stéphane Gançarski. Diversification Based Static Index Pruning - Application to Temporal Collections. [Research Report] Sorbonne Universités, UPMC Univ Paris 06, CNRS, LIP6 UMR 7606. 2013. ⟨hal-01358687⟩


