Computing discriminating and generic words. (English) Zbl 1330.68059

Calder√≥n-Benavides, Liliana (ed.) et al., String processing and information retrieval. 19th international symposium, SPIRE 2012, Cartagena de Indias, Colombia, October 21–25, 2012. Proceedings. Berlin: Springer (ISBN 978-3-642-34108-3/pbk). Lecture Notes in Computer Science 7608, 307-317 (2012).
Summary: We study the following three problems of computing generic or discriminating words for a given collection of documents. Given a pattern \(P\) and a threshold \(d\), we want to report (i) all longest extensions of \(P\) which occur in at least \(d\) documents, (ii) all shortest extensions of \(P\) which occur in less than \(d\) documents, and (iii) all shortest extensions of \(P\) which occur only in \(d\) selected documents. For these problems, we propose efficient algorithms based on suffix trees and using advanced data structure techniques. For problem (i), we propose an optimal solution with constant running time per output word.
For the entire collection see [Zbl 1253.68007].


68P20 Information storage and retrieval of data
68P05 Data structures
68W32 Algorithms on strings
Full Text: DOI Link