×

zbMATH — the first resource for mathematics

Succinct data structures for flexible text retrieval systems. (English) Zbl 1137.68360
Summary: We propose succinct data structures for text retrieval systems supporting document listing queries and ranking queries based on the tf*idf (term frequency times inverse document frequency) scores of documents. Traditional data structures for these problems support queries only for some predetermined keywords. Recently Muthukrishnan proposed a data structure for document listing queries for arbitrary patterns at the cost of data structure size. For computing the tf*idf scores there has been no efficient data structures for arbitrary patterns. Our new data structures support these queries using small space. The space is only \(2/\varepsilon\) times the size of compressed documents plus \(10n\) bits for a document collection of length \(n\), for any 0\(<\varepsilon \leqslant \)1. This is much smaller than the previous \(O(n\log n)\) bit data structures. Query time is \(O(m+q\log^\varepsilon n)\) for listing and computing tf*idf scores for all \(q\) documents containing a given pattern of length \(m\). Our data structures are flexible in a sense that they support queries for arbitrary patterns.

MSC:
68P05 Data structures
68P20 Information storage and retrieval of data
68T10 Pattern recognition, speech recognition
PDF BibTeX XML Cite
Full Text: DOI
References:
[1] S. Muthukrishnan, Efficient algorithms for document retrieval problems, in: Proc. ACM-SIAM SODA, 2002, pp. 657-666 · Zbl 1093.68588
[2] Blumer, A.; Blumer, J.; Haussler, D.; McConnell, R.; Ehrenfeucht, A., Complete inverted files for efficient text retrieval and analysis, Journal of the ACM, 34, 3, 578-595, (1987)
[3] Grossi, R.; Vitter, J.S., Compressed suffix arrays and suffix trees with applications to text indexing and string matching, SIAM journal on computing, 35, 2, 378-407, (2005) · Zbl 1092.68115
[4] R. Grossi, A. Gupta, J.S. Vitter, Higher order entropy analysis of compressed suffix arrays, in: DIMACS Workshop on Data Compression in Networks and Applications, 2003, pp. 841-850 · Zbl 1092.68584
[5] K. Sadakane, Succinct representations of lcp information and improvements in the compressed suffix arrays, in: Proc. ACM-SIAM SODA, 2002, pp. 225-232 · Zbl 1093.68578
[6] Sadakane, K., New text indexing functionalities of the compressed suffix arrays, Journal of algorithms, 48, 2, 294-313, (2003) · Zbl 1100.68563
[7] Ferragina, P.; Manzini, G., Indexing compressed texts, Journal of the ACM, 52, 4, 552-581, (2005) · Zbl 1323.68261
[8] Salton, G.; Wong, A.; Yang, C.S., A vector space model for automatic indexing, Communications of the ACM, 18, 11, 613-620, (1975) · Zbl 0313.68082
[9] Manber, U.; Myers, G., Suffix arrays: a new method for on-line string searches, SIAM journal on computing, 22, 5, 935-948, (1993) · Zbl 0784.68027
[10] P. Weiner, Linear pattern matching algorithms, in: Proceedings of the 14th IEEE Symposium on Switching and Automata Theory, 1973, pp. 1-11
[11] M. Farach, Optimal suffix tree construction with large alphabets, in: 38th IEEE Symp. on Foundations of Computer Science, 1997, pp. 137-143
[12] Gusfield, D., Algorithms on strings, trees, and sequences, (1997), Cambridge University Press · Zbl 0934.68103
[13] Munro, J.I.; Raman, V., Succinct representation of balanced parentheses and static trees, SIAM journal on computing, 31, 3, 762-776, (2001) · Zbl 1017.68037
[14] Munro, J.I.; Raman, V.; Rao, S.S., Space efficient suffix trees, Journal of algorithms, 39, 2, 205-222, (2001) · Zbl 0977.68069
[15] Ferragina, P.; Manzini, G.; Mäkinen, V.; Navarro, G., Succinct representation of sequences, (August 2004), Technical Report TR/DCC-2004-5, Dept. of Computer Science, Univ. of Chile
[16] Bender, M.; Farach-Colton, M., The LCA problem revisited, (), 88-94 · Zbl 0959.68133
[17] R. Raman, V. Raman, S.S. Rao, Succinct indexable dictionaries with applications to encoding k-array trees and multisets, in: Proc. ACM-SIAM SODA, 2002, pp. 233-242 · Zbl 1093.68582
[18] A. Andersson, T. Hagerup, S. Nilsson, R. Raman, Sorting in linear time?, in: ACM Symposium on Theory of Computing, 1995, pp. 427-436 · Zbl 0968.68509
[19] Hui, L., Color set size problem with applications to string matching, (), 227-240
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.