Efficient index for retrieving top-\(k\) most frequent documents.

*(English)*Zbl 1215.68095Summary: In the document retrieval problem (Muthukrishnan, 2002), we are given a collection of documents (strings) of total length \(D\) in advance, and our target is to create an index for these documents such that for any subsequent input pattern \(P\), we can identify which documents in the collection contain \(P\). In this paper, we study a natural extension to the above document retrieval problem. We call this top-\(k\) frequent document retrieval, where instead of listing all documents containing \(P\), our focus is to identify the top-\(k\) documents having most occurrences of \(P\). This problem forms a basis for search engine tasks of retrieving documents ranked with TFIDF (Term Frequency-Inverse Document Frequency) metric.

A related problem was studied by S. Muthukrishnan [in: Proceedings of the thirteenth annual ACM-SIAM symposium on discrete algorithms, San Francisco, CA, USA, January 6–8, 2002. Philadelphia, PA: Society for Industrial and Applied Mathematics (SIAM). 657–665 (2002; Zbl 1093.68588)], where the emphasis was on retrieving all the documents whose number of occurrences of the pattern \(P\) exceeds some frequency threshold \(f\). However, from the information retrieval point of view, it is hard for a user to specify such a threshold value \(f\) and have a sense of how many documents will be reported as the output. We develop some additional building blocks which help the user overcome this limitation. These are used to derive an efficient index for top-\(k\) frequent document retrieval problem, answering queries in \(O(|P| + \log D \log \log D + k)\) time and taking \(O(D \log D)\) space. Our approach is based on a new use of the suffix tree called induced generalized suffix tree (IGST). The practicality of the proposed index is validated by the experimental results.

A related problem was studied by S. Muthukrishnan [in: Proceedings of the thirteenth annual ACM-SIAM symposium on discrete algorithms, San Francisco, CA, USA, January 6–8, 2002. Philadelphia, PA: Society for Industrial and Applied Mathematics (SIAM). 657–665 (2002; Zbl 1093.68588)], where the emphasis was on retrieving all the documents whose number of occurrences of the pattern \(P\) exceeds some frequency threshold \(f\). However, from the information retrieval point of view, it is hard for a user to specify such a threshold value \(f\) and have a sense of how many documents will be reported as the output. We develop some additional building blocks which help the user overcome this limitation. These are used to derive an efficient index for top-\(k\) frequent document retrieval problem, answering queries in \(O(|P| + \log D \log \log D + k)\) time and taking \(O(D \log D)\) space. Our approach is based on a new use of the suffix tree called induced generalized suffix tree (IGST). The practicality of the proposed index is validated by the experimental results.

PDF
BibTeX
XML
Cite

\textit{W.-K. Hon} et al., J. Discrete Algorithms 8, No. 4, 402--417 (2010; Zbl 1215.68095)

Full Text:
DOI

##### References:

[1] | M.A. Bender, M. Farach-Colton, The LCA problem revisited, in: Proceedings of Latin American Symposium on Theoretical Informatics, 2000, pp. 88-94. · Zbl 0959.68133 |

[2] | I. Bialynicka-Birula, R. Grossi, Rank-sensitive data structures, in: Proceedings of International Symposium on String Processing and Information Retrieval, 2005, pp. 79-90. |

[3] | Boyer, R.S.; Moore, J.S., A fast string searching algorithm, Communications of the ACM, 20, 10, 762-772, (1977) · Zbl 1219.68165 |

[4] | Cormen, T.H.; Leiserson, C.E.; Rivest, R.L.; Stein, C., Introduction to algorithms, (2001), MIT Press · Zbl 1047.68161 |

[5] | Human Mitochondria Genome Database, Department of Genetics and Pathology, Uppsala University, Sweden |

[6] | eMule, Wikipedia, the free encyclopedia (based on 28 July 2008 version) |

[7] | W.K. Hon, R. Shah, J.S. Vitter, Ordered pattern matching: Towards full-text retrieval, Technical Report TR-06-008, Department of CS, Purdue University, 2006. |

[8] | L.C.K. Hui, Color set size problem with applications to string matching, in: Proceedings of Symposium on Combinatorial Pattern Matching, 1992, pp. 230-243. |

[9] | R.M. Karp, M.O. Rabin, Efficient randomized pattern-matching algorithms, Technical Report TR-31-81, Aiken Computational Laboratory, Harvard University, 1981. · Zbl 0653.68054 |

[10] | Knuth, D.E.; Morris, J.H.; Pratt, V.B., Fast pattern matching in strings, SIAM journal on computing, 6, 2, 323-350, (1977) · Zbl 0372.68005 |

[11] | V. Makinen, G. Navarro, Position-restricted substring searching, in: Proceedings of Latin American Symposium on Theoretical Informatics, 2006, pp. 703-714. · Zbl 1145.68392 |

[12] | Y. Matias, S. Muthukrishnan, S.C. Sahinalp, J. Ziv, Augmenting suffix trees, with applications, in: Proceedings of European Symposium on Algorithms, 1998, pp. 67-78. |

[13] | McCreight, E.M., A space-economical suffix tree construction algorithm, Journal of the ACM, 23, 2, 262-272, (1976) · Zbl 0329.68042 |

[14] | S. Muthukrishnan, Efficient algorithms for document retrieval problems, in: Proceedings of Symposium on Discrete Algorithms, 2002, pp. 657-666. · Zbl 1093.68588 |

[15] | K. Sadakane, Succinct representations of lcp information and improvements in the compressed suffix arrays, in: Proceedings of Symposium on Discrete Algorithms, 2002, pp. 225-232. · Zbl 1093.68578 |

[16] | K. Sadakane, Compressed suffix trees with full functionality, in: Theory of Computing Systems, 2007, pp. 589-607. · Zbl 1148.68015 |

[17] | SourceForge, SourceForget.net: Open Source Software |

[18] | N. Valimaki, V. Makinen, Space-efficient algorithms for document retrieval, in: Proceedings of Symposium on Combinatorial Pattern Matching, 2007, pp. 205-215. · Zbl 1138.68401 |

[19] | P. Weiner, Linear pattern matching algorithms, in: Proceedings of Symposium on Switching and Automata Theory, 1973, pp. 1-11. |

[20] | Weisstein, E.W., Zipf’s law |

[21] | Willard, D.E., Log-logarithmic worst-case range queries are possible in space \(\Theta(N)\), Information processing letters, 17, 2, 81-84, (1983) · Zbl 0509.68106 |

[22] | Witten, I.; Moffat, A.; Bell, T., Managing gigabytes: compressing and indexing documents and images, (1999), Morgan Kaufmann Publishers Los Altos, CA, USA · Zbl 0821.68051 |

This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.