×

zbMATH — the first resource for mathematics

Compressed property suffix trees. (English) Zbl 1435.68399
Summary: Property matching is a biologically motivated problem where the task is to find those occurrences of an online pattern \(P\) in a string text \(T\) (of size \(n\)), such that the matched text part satisfies some conceptual property. The property of a string is a set \(\pi\) of (possibly overlapping) intervals \(\{(s_1,f_1)\), \((s_2,f_2),\dots\}\) corresponding to the part of text and an occurrence of a pattern \(P=T[i,\dots,(i+| P|-1)]\) is a valid output only if \(T[i,\dots,(i+| P|-1)]\) is completely contained in at least one interval \((s_j,f_j)\in\pi\). The indexing version of this problem was introduced by A. Amir et al. [Theor. Comput. Sci. 395, No. 2–3, 298–310 (2008; Zbl 1142.68066)], where the text is preprocessed in \(O(n\log \sigma+n\log\log n)\) time and an \(O(n\log n)\) bits index, named Property Suffix Tree (PST) is maintained. PST can perform property matching in \(O(| P|\log\sigma+\mathrm{occ}_\pi)\) time, where \(\mathrm{occ}_\pi\) is the number of occurrences of \(P\) in \(T\) satisfying the property. T. Kopelowitz [Lect. Notes Comput. Sci. 6129, 63–75 (2010; Zbl 1286.68528)] considered the dynamic version of this problem where intervals can be added or deleted. However, all these indexes take space linear to the size of text \((O(n\log n)\) bits), which can be much more than the size of the text \((n\log\sigma\) bits). In this paper, we propose the first index for property matching occupying space close to the entropy compressed space requirement of the text. Our compressed index takes \(|\mathrm{CSA}|+n(2+\epsilon+o(1))\) bits space and performs query answering in \(O(t(| P|)+\frac{1}{\epsilon}(1+\mathrm{occ}_\pi)t_{\mathrm{SA}})\) time, where \(|\mathrm{CSA}|\) is the size of compressed suffix array of \(T\), \(t(| P|)\) be the time for searching a pattern of length \(| P|\) in CSA, \(t_{\mathrm{SA}}\) is the time for computing the suffix array value and \(\epsilon>0\) is a constant. We also introduce a dynamic index, which takes \(|\mathrm{CSA}|+O(n+|\pi|\log n)\) bits space and performs query answering in \(O(t(| P|)+(1+\mathrm{occ}_\pi)\log n(t_{\mathrm{SA}}+\log n/\log\log n))\) time and can update (insert/delete) an interval \((s,f)\) in \(O((f-s)(\log n+t_{\mathrm{SA}}))\) time.

MSC:
68W32 Algorithms on strings
68P05 Data structures
68P30 Coding and information theory (compaction, compression, models of communication, encoding schemes, etc.) (aspects in computer science)
PDF BibTeX XML Cite
Full Text: DOI
References:
[1] Manber, U.; Myers, E. W., Suffix arrays: A new method for on-line string searches, SIAM J. Comput., 22, 5, 935-948, (1993) · Zbl 0784.68027
[2] Grossi, R.; Vitter, J. S., Compressed suffix arrays and suffix trees with applications to text indexing and string matching, SIAM J. Comput., 35, 2, 378-407, (2005), a preliminary version appears in STOCʼ00 · Zbl 1092.68115
[3] Ferragina, P.; Manzini, G., Indexing compressed text, J. ACM, 52, 4, 552-581, (2005) · Zbl 1323.68261
[4] Navarro, G.; Mäkinen, V., Compressed full-text indexes, ACM Comput. Surv., 39, 1, (2007) · Zbl 1321.68263
[5] Jurka, J., Human repetitive elements, (Meyers, R. A., Molecular Biology and Biotechnology, (1995)), 438-441
[6] Jurka, J., Origin and evolution of alu repetitive elements, (Maraia, R., The Impact of Short Interspersed Elements (SINEs) on the Host Genome, (1995)), 25-41
[7] Amir, A.; Chencinski, E.; Iliopoulos, C. S.; Kopelowitz, T.; Zhang, H., Property matching and weighted matching, Theor. Comput. Sci., 395, 2-3, 298-310, (2008) · Zbl 1142.68066
[8] Iliopoulos, C. S.; Makris, C.; Panagis, Y.; Perdikuri, K.; Theodoridis, E.; Tsakalidis, A., The weighted suffix tree: an efficient data structure for handling molecular weighted sequences and its applications, Fundam. Inform., 71, 259-277, (2006) · Zbl 1095.68029
[9] Zhang, H.; Guo, Q.; Iliopoulos, C. S., An algorithmic framework for motif discovery problems in weighted sequences, (International Conference on Algorithms and Complexity, (2010)), 335-346 · Zbl 1284.68712
[10] Iliopoulos, C. S.; Rahman, M. S., Faster index for property matching, Inf. Process. Lett., 105, 6, 218-223, (2008) · Zbl 1184.68353
[11] Juan, M. T.; Liu, J. J.; Wang, Y. L., Errata for “faster index for property matching”, Inf. Process. Lett., 109, 18, 1027-1029, (2009) · Zbl 1202.68273
[12] Kopelowitz, T., The property suffix tree with dynamic properties, (Proceedings of Symposium on Combinatorial Pattern Matching, (2010)), 63-75 · Zbl 1286.68528
[13] Zhao, H.; Lu, S., Compressed index for property matching, (Proceedings of Data Compression Conference, (2011)), 133-142
[14] Weiner, P., Linear pattern matching algorithms, (Proceedings of Symposium on Switching and Automata Theory, (1973)), 1-11
[15] Sadakane, K., Compressed suffix trees with full functionality, Theory Comput. Syst., 41, 4, 589-607, (2007) · Zbl 1148.68015
[16] Belazzougui, D.; Navarro, G., Alphabet-independent compressed text indexing, (Proceedings of European Symposium on Algorithms, (2011)), 748-759 · Zbl 1325.68307
[17] Chazelle, B., A functional approach to data structures and its use in multidimensional searching, SIAM J. Comput., 17, 3, 427-462, (1988) · Zbl 0679.68074
[18] Bender, M. A.; Farach-Colton, M., The LCA problem revisited, (Proceedings of Latin American Symposium on Theoretical Informatics, (2000)), 88-94 · Zbl 0959.68133
[19] Fischer, J.; Heun, V., Space-efficient preprocessing schemes for range minimum queries on static arrays, SIAM J. Comput., 40, 2, 465-492, (2011) · Zbl 1222.05024
[20] Jacobson, G., Succinct static data structures, (1989), Carnegie Mellon University, PhD thesis
[21] Clark, D. R.; Munro, J. I., Efficient suffix trees on secondary storage (extended abstract), (Proceedings of Symposium on Discrete Algorithms, (1996)), 383-391 · Zbl 0847.68030
[22] Pagh, R., Low redundancy in static dictionaries with o(1) worst case lookup time, (Proceedings of International Colloquium on Automata, Languages and Programming, (1999)), 595-604
[23] Raman, R.; Raman, V.; Satti, S. R., Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets, ACM Trans. Algorithms, 3, 4, (2007)
[24] Raman, R.; Raman, V.; Rao, S. S., Succinct dynamic data structures, (Proceedings of Workshop on Algorithms and Data Structures, (2001)), 426-437 · Zbl 0997.68520
[25] Hon, W.-K.; Sadakane, K.; Sung, W.-K., Succinct data structures for searchable partial sums with optimal worst-case performance, Theor. Comput. Sci., 412, 39, 5176-5186, (2011) · Zbl 1226.68032
[26] Mäkinen, V.; Navarro, G., Dynamic entropy-compressed sequences and full-text indexes, ACM Trans. Algorithms, 4, 3, (2008)
[27] Gupta, A.; Hon, W.-K.; Shah, R.; Vitter, J. S., A framework for dynamizing succinct data structures, (Proceedings of International Colloquium on Automata, Languages and Programming, (2007)), 521-532 · Zbl 1171.68435
[28] Kärkkäinen, J.; Sanders, P.; Burkhardt, S., Linear work suffix array construction, J. ACM, 53, 6, 918-936, (2006) · Zbl 1326.68111
[29] Hon, W.-K.; Sadakane, K.; Sung, W.-K., Breaking a time-and-space barrier in constructing full-text indices, (Proceedings of Symposium on Foundations of Computer Science, (2003)), 251-260
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.