zbMATH — the first resource for mathematics

Document retrieval with one wildcard. (English) Zbl 1339.68073
Summary: In this paper we extend several well-known document listing problems to the case when documents contain a substring that approximately matches the query pattern. We study the scenario when the query string can contain a wildcard symbol that matches any alphabet symbol; all documents that match a query pattern with one wildcard must be enumerated. We describe a linear space data structure that reports all documents containing a substring \(P\) in \(O(| P | + \sigma \sqrt{\log \log \log n} + {\mathtt docc})\) time, where \(\sigma\) is the alphabet size and is the number of listed documents. We also describe a succinct solution for this problem, as well as a solution for an extension of this problem. Furthermore our approach enables us to obtain an \(O(n \sigma)\)-space data structure that enumerates all documents containing both a pattern \(P_1\) and a pattern \(P_2\) in the special case when \(P_1\) and \(P_2\) differ in one symbol.
68P20 Information storage and retrieval of data
68P05 Data structures
68W32 Algorithms on strings
Full Text: DOI
[1] Lewenstein, M.; Munro, J. I.; Nekrich, Y.; Thankachan, S. V., Document retrieval with one wildcard, (Mathematical Foundations of Computer Science MFCS, (2014)), 529-540 · Zbl 1339.68074
[2] Boyer, R. S.; Moore, J. S., A fast string searching algorithm, Commun. ACM, 20, 10, 762-772, (1977) · Zbl 1219.68165
[3] Knuth, D. E.; Morris, J. H.; Pratt, V. B., Fast pattern matching in strings, SIAM J. Comput., 6, 2, 323-350, (1977) · Zbl 0372.68005
[4] Hon, W.-K.; Shah, R.; Thankachan, S. V.; Vitter, J. S., Space-efficient frameworks for top-k string retrieval, J. ACM, 61, 2, 9, (2014) · Zbl 1295.68230
[5] Shah, R.; Sheng, C.; Thankachan, S. V.; Vitter, J. S., Top-k document retrieval in external memory, (Bodlaender, H. L.; Italiano, G. F., ESA, Lecture Notes in Comput. Sci., vol. 8125, (2013), Springer), 803-814 · Zbl 1394.68129
[6] Navarro, G.; Nekrich, Y., Top-k document retrieval in optimal time and linear space, (Rabani, Y., SODA, (2012), SIAM), 1066-1077
[7] Navarro, G., Spaces, trees, and colors: the algorithmic landscape of document retrieval on sequences, ACM Comput. Surv., 46, 4, 52, (2013) · Zbl 1305.68078
[8] Hon, W.-K.; Patil, M.; Shah, R.; Thankachan, S. V.; Vitter, J. S., Indexes for document retrieval with relevance, (Brodnik, A.; López-Ortiz, A.; Raman, V.; Viola, A., Space-Efficient Data Structures, Streams, and Algorithms, Lecture Notes in Comput. Sci., vol. 8066, (2013), Springer), 351-362 · Zbl 1394.68127
[9] Amir, A.; Keselman, D.; Landau, G. M.; Lewenstein, M.; Lewenstein, N.; Rodeh, M., Text indexing and dictionary matching with one error, J. Algorithms, 37, 2, 309-325, (2000) · Zbl 0966.68062
[10] Bille, P.; Gørtz, I. L.; Vildhøj, H. W.; Vind, S., String indexing for patterns with wildcards, (SWAT, (2012)), 283-294 · Zbl 1357.68306
[11] Cole, R.; Gottlieb, L.-A.; Lewenstein, M., Dictionary matching and indexing with errors and don’t cares, (Proc. 36th Annual ACM Symposium on Theory of Computing (STOC 2004), (2004)), 91-100 · Zbl 1192.68818
[12] Lewenstein, M., Indexing with gaps, (SPIRE, (2011)), 135-143
[13] Tam, A.; Wu, E.; Lam, T. W.; Yiu, S.-M., Succinct text indexing with wildcards, (SPIRE, (2009)), 39-50
[14] Lewenstein, M.; Nekrich, Y.; Vitter, J. S., Space-efficient string indexing for wildcard pattern matching, (Proc. 31st International Symposium on Theoretical Aspects of Computer Science (STACS 2014), (2014)), 506-517 · Zbl 1359.68339
[15] Iliopoulos, C. S.; Rahman, M. S., Indexing factors with gaps, Algorithmica, 55, 1, 60-70, (2009) · Zbl 1180.68127
[16] Hon, W.-K.; Ku, T.-H.; Shah, R.; Thankachan, S. V.; Vitter, J. S., Compressed dictionary matching with one error, (Proc. 2011 Data Compression Conference (DCC 2011), (2011)), 113-122
[17] Lam, T. W.; Sung, W.-K.; Tam, S.-L.; Yiu, S.-M., Space efficient indexes for string matching with don’t cares, (ISAAC, (2007)), 846-857 · Zbl 1193.68293
[18] Rahman, M. S.; Iliopoulos, C. S., Pattern matching algorithms with don’t cares, (Proc. 33rd Conference on Current Trends in Theory and Practice of Computer Science (SOFSEM 2007), (2007)), 116-126
[19] Huynh, T. N.; Hon, W.-K.; Lam, T.-W.; Sung, W.-K., Approximate string matching using compressed suffix arrays, Theoret. Comput. Sci., 352, 1, 240-249, (2006) · Zbl 1086.68038
[20] Lewenstein, M., Orthogonal range searching for text indexing, (Space-Efficient Data Structures, Streams, and Algorithms - Papers in Honor of J. Ian Munro on the Occasion of His 66th Birthday, (2013)), 267-302 · Zbl 1394.68099
[21] Bille, P.; Gørtz, I. L., Substring range reporting, (Proc. 22nd Annual Symposium on Combinatorial Pattern Matching (CPM 2011), (2011)), 299-308 · Zbl 1339.68049
[22] Lewenstein, M.; Munro, J. I.; Raman, V.; Thankachan, S. V., Less space: indexing for queries with wildcards, (Algorithms and Computation - Proceedings of the 24th International Symposium, ISAAC 2013, Hong Kong, China, December 16-18, 2013, (2013)), 89-99 · Zbl 1329.68315
[23] Matias, Y.; Muthukrishnan, S.; Sahinalp, S. C.; Ziv, J., Augmenting suffix trees, with applications, (ESA ’98, (1998), Springer-Verlag London, UK), 67-78
[24] Muthukrishnan, S., Efficient algorithms for document retrieval problems, (Proc. 13th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2002), (2002)), 657-666 · Zbl 1093.68588
[25] Cohen, H.; Porat, E., Fast set intersection and two-patterns matching, Theoret. Comput. Sci., 411, 40-42, 3795-3800, (2010) · Zbl 1207.68270
[26] Hon, W.-K.; Shah, R.; Thankachan, S. V.; Vitter, J. S., String retrieval for multi-pattern queries, (String Processing and Information Retrieval - 17th International Symposium, (2010)), 55-66
[27] Fischer, J.; Gagie, T.; Kopelowitz, T.; Lewenstein, M.; Mäkinen, V.; Salmela, L.; Välimäki, N., Forbidden patterns, (LATIN, (2012)), 327-337 · Zbl 1353.68066
[28] Hon, W.-K.; Shah, R.; Thankachan, S. V.; Vitter, J. S., Document listing for queries with excluded pattern, (Proc. 23rd Annual Symposium on Combinatorial Pattern Matching (CPM 2012), (2012)), 185-195 · Zbl 1358.68093
[29] Fischer, J.; Heun, V., Space-efficient preprocessing schemes for range minimum queries on static arrays, SIAM J. Comput., 40, 2, 465-492, (2011) · Zbl 1222.05024
[30] Weiner, P., Linear pattern matching algorithms, (SWAT (FOCS), (1973), IEEE Computer Society), 1-11
[31] Manber, U.; Myers, E. W., Suffix arrays: a new method for on-line string searches, SIAM J. Comput., 22, 5, 935-948, (1993) · Zbl 0784.68027
[32] Belazzougui, D.; Navarro, G.; Valenzuela, D., Improved compressed indexes for full-text document retrieval, J. Discrete Algorithms, 18, 3-13, (2013) · Zbl 1268.68075
[33] Sadakane, K., Succinct data structures for flexible text retrieval systems, J. Discrete Algorithms, 5, 1, 12-22, (2007) · Zbl 1137.68360
[34] Sleator, D. D.; Tarjan, R. E., A data structure for dynamic trees, J. Comput. System Sci., 26, 3, 362-391, (1983) · Zbl 0509.68058
[35] Belazzougui, D.; Navarro, G., Alphabet-independent compressed text indexing, (Proc. 19th Annual European Symposium on Algorithms (ESA 2011), (2011)), 748-759 · Zbl 1325.68307
[36] Raman, R.; Raman, V.; Satti, S. R., Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets, ACM Trans. Algorithms, 3, 4, (2007)
[37] Munro, J. I.; Navarro, G.; Shah, R.; Thankachan, S. V., Ranked document selection, (SWAT, (2014)), 344-356 · Zbl 1416.68064
[38] Chan, T. M.; Larsen, K. G.; Patrascu, M., Orthogonal range searching on the ram, revisited, (Symposium on Computational Geometry, (2011)), 1-10 · Zbl 1283.68139
[39] Alstrup, S.; Brodal, G. S.; Rauhe, T., Optimal static range reporting in one dimension, (Proc. 33rd Annual ACM Symposium on Theory of Computing (STOC), (2001)), 476-482 · Zbl 1323.68536
[40] Lewenstein, M.; Nekrich, Y.; Vitter, J. S., Space-efficient string indexing for wildcard pattern matching, CoRR · Zbl 1359.68339
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.