×

zbMATH — the first resource for mathematics

Improved compressed indexes for full-text document retrieval. (English) Zbl 1268.68075
Summary: We give new space/time tradeoffs for compressed indexes that answer document retrieval queries on general sequences. On a collection of \(D\) documents of total length \(n\), current approaches require at least \(|CSA|+O(n\frac {\lg D}{\lg \lg D})\) or \(2|CSA|+o(n)\) bits of space, where \(CSA\) is a full-text index. Using monotone minimal perfect hash functions (mmphfs), we give new algorithms for document listing with frequencies and top-\(k\) document retrieval using just \(|CSA|+O(n \lg \lg \lg D)\) bits.
We also improve current solutions that use \(2|CSA|+o(n)\) bits, and consider other problems such as colored range listing, top-\(k\) most important documents, and computing arbitrary frequencies. We give proof-of-concept experimental results that show that using mmphfs may provide relevant practical tradeoffs for document listing with frequencies.

MSC:
68P20 Information storage and retrieval of data
68P05 Data structures
68W05 Nonnumerical algorithms
PDF BibTeX Cite
Full Text: DOI