MiTexCube: MicroTextCluster cube for online analysis of text cells and its applications. (English) Zbl 07260364

Summary: A fundamental problem in multidimensional text database analysis is supporting various kinds of online applications efficiently and effectively, such as summarizing the content of a text cell or comparing the contents of multiple text cells. In this paper, we propose a new infrastructure called MicroTextCluster Cube (or MiTexCube) that supports efficient online text analysis on multidimensional text databases by introducing micro-clusters of text documents as a compact representation of text content. Experimental results on real multidimensional text databases show that (i) MiTexCube can be materialized efficiently with reasonable space overhead, and (ii) applications based on the materialized MiTexCube are more efficient than the baseline method of direct analysis based on document units in each cell, without sacrificing much quality of analysis. MiTexCube also naturally accommodates a flexible trade-off between efficiency and quality of analysis.
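
The summary does not spell out how micro-clusters are built or stored, so the following is only a minimal sketch of the general idea, under stated assumptions: each micro-cluster keeps additive statistics (a document count and summed term frequencies, loosely in the spirit of BIRCH clustering features [4]), documents are assigned greedily by cosine similarity, and a roll-up of two cells simply pools their micro-clusters. The names `MicroCluster`, `build_cell`, `top_terms`, and the `threshold` parameter are hypothetical and do not come from the paper.

```python
from collections import Counter
from dataclasses import dataclass, field
import math


@dataclass
class MicroCluster:
    """Additive summary of a group of similar documents (assumed layout)."""
    n_docs: int = 0
    term_sum: Counter = field(default_factory=Counter)

    def add(self, term_freqs):
        # Absorb one document given as a term -> frequency mapping.
        self.n_docs += 1
        self.term_sum.update(term_freqs)

    def merge(self, other):
        # Statistics are additive, so two micro-clusters combine in one step.
        return MicroCluster(self.n_docs + other.n_docs,
                            self.term_sum + other.term_sum)


def cosine(a, b):
    # Cosine similarity between two sparse term -> weight mappings.
    dot = sum(v * b.get(t, 0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def build_cell(docs, threshold=0.3):
    # Greedy single pass: join the most similar micro-cluster if it is
    # similar enough, otherwise open a new one. A lower threshold gives
    # fewer, coarser micro-clusters (faster analysis, less detail).
    clusters = []
    for doc in docs:
        tf = Counter(doc.lower().split())
        best = max(clusters, key=lambda c: cosine(tf, c.term_sum), default=None)
        if best is not None and cosine(tf, best.term_sum) >= threshold:
            best.add(tf)
        else:
            mc = MicroCluster()
            mc.add(tf)
            clusters.append(mc)
    return clusters


def top_terms(clusters, k=3):
    # Summarize a cell from its micro-clusters without rereading documents.
    total = Counter()
    for c in clusters:
        total.update(c.term_sum)
    return [t for t, _ in total.most_common(k)]


# Toy cells; the texts are invented for illustration only.
cell_a = build_cell(["engine failure on takeoff", "engine failure during climb"])
cell_b = build_cell(["runway incursion at night"])
rolled_up = cell_a + cell_b  # roll-up by pooling micro-clusters
summary_terms = top_terms(rolled_up, k=2)
```

In this sketch the `threshold` value is the knob that trades efficiency against quality: online operations such as `top_terms` touch only the micro-cluster statistics, so their cost depends on the number of micro-clusters rather than the number of documents in the cell.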

MSC:

62-XX Statistics
68-XX Computer science

Software:

PolyAnalyst

References:

[1] Aviation safety reporting system, http://asrs.arc.nasa.gov/, 2012.
[2] C. X. Lin, B. Ding, J. Han, F. Zhu, and B. Zhao, Text cube: computing IR measures for multidimensional text database analysis, ICDM, 2008, 905-910.
[3] D. Zhang, C. Zhai, and J. Han, Topic cube: topic modeling for OLAP on multidimensional text databases, SDM (2009).
[4] T. Zhang, R. Ramakrishnan, and M. Livny, BIRCH: an efficient data clustering method for very large databases, SIGMOD Rec 25(2) (1996), 103-114.
[5] D. L. Davies and D. W. Bouldin, A cluster separation measure, IEEE Trans Pattern Anal Mach Intell 1 (1979), 224-227.
[6] The dblp computer science bibliography, http://www.informatik.uni-trier.de/~ley/db/, 2012.
[7] S. Agarwal, R. Agrawal, P. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan, and S. Sarawagi, On the computation of multidimensional aggregates, VLDB’96, 506-521.
[8] S. Chaudhuri and U. Dayal, An overview of data warehousing and OLAP technology, SIGMOD Rec 26(1) (1997), 65-74.
[9] J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh, Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals, Data Mining and Knowledge Discovery, Vol. 1, 29-53, Kluwer Academic Publishers, Hingham, MA, USA, 1997.
[10] F. M.-F. Jiang, J. Pei, and A. W.-C. Fu, IX-cubes: iceberg cubes for data warehousing and OLAP on XML data, CIKM’07, Lisbon, Portugal, 2007, 905-908.
[11] E. Lo, B. Kao, W.-S. Ho, S. D. Lee, C. K. Chui, and D. W. Cheung, OLAP on sequence data, SIGMOD’08, Vancouver, Canada, 2008, 649-660.
[12] Y. Tian, R. A. Hankins, and J. M. Patel, Efficient aggregation for graph summarization, SIGMOD’08, Vancouver, Canada, 2008, 567-580.
[13] W. F. Cody, J. T. Kreulen, V. Krishna, and W. S. Spangler, The integration of business intelligence and knowledge management, IBM Syst J 41(4) (2002), 697-713.
[14] Megaputer’s polyanalyst, http://www.megaputer.com/, 2011.
[15] A. Simitsis, A. Baid, Y. Sismanis, and B. Reinwald, Multidimensional content exploration, Proc VLDB Endow 1(1) (2008), 660-671.
[16] J. Han and M. Kamber, Data Mining: Concepts and Techniques, San Francisco, CA, Morgan Kaufmann, 2000. · Zbl 1230.68018
[17] G. Salton and M. McGill, Introduction to Modern Information Retrieval, New York, McGraw-Hill, 1983. · Zbl 0523.68084
[18] J. Carbonell and J. Goldstein, The use of MMR, diversity-based reranking for reordering documents and producing summaries, SIGIR ’98, Melbourne, Australia, 1998, 335-336.
[19] E. Rendón, I. M. Abundez, C. Gutierrez, S. D. Zagal, A. Arizmendi, E. M. Quiroz, and H. E. Arzate, A comparison of internal and external cluster validation indexes, Proceedings of the 2011 American Conference on Applied Mathematics and the 5th WSEAS International Conference on Computer Engineering and Applications, AMERICAN-MATH’11/CEA’11, 2011, 158-163.