Categorization of text documents taking into account some structural features. (English. Russian original) Zbl 1442.68243

J. Comput. Syst. Sci. Int. 55, No. 1, 96-105 (2016); translation from Izv. Ross. Akad. Nauk, Teor. Sist. Upr. 2016, No. 1, 104-114 (2016).
Summary: This paper examines the possibility of upgrading the conventional “bag-of-words” model so that it reflects structural features of text documents and takes them into account during categorization by machine learning methods. It is proposed to use these features to characterize relationships within a set of tokens, and to treat the names of such relationships as features alongside the names of the tokens themselves. The resulting models differ from the traditional approach, which reflects only unary relations. The efficiency of the upgraded machine learning methods is tested in computer experiments on classes of the Reuters-21578 collection using eight common classifiers. The experiments demonstrate the relevance of this modernized approach to categorizing text documents with simple classifiers.
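
To make the idea concrete, the following Python sketch (an illustration written for this review, not the authors' code) augments an ordinary bag-of-words representation with the names of binary relations between tokens. As an assumption, simple token adjacency stands in for the structural relations considered in the paper, and a single basic classifier replaces the eight classifiers evaluated on Reuters-21578.

```python
# Minimal sketch: bag-of-words extended with relation-name features.
# Assumption: token adjacency is used as the binary relation; the paper's
# actual structural relations may be defined differently.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def tokens_and_relations(text):
    """Return unary features (token names) plus names of binary relations
    between neighbouring tokens, e.g. 'rel(oil,price)'."""
    tokens = text.lower().split()
    relations = [f"rel({a},{b})" for a, b in zip(tokens, tokens[1:])]
    return tokens + relations


# CountVectorizer with a custom analyzer builds the upgraded "bag of words":
# each document is represented by counts of token names and relation names.
model = make_pipeline(
    CountVectorizer(analyzer=tokens_and_relations),
    LogisticRegression(max_iter=1000),
)

# Toy documents stand in for the Reuters-21578 collection used in the paper.
docs = ["crude oil price rises", "oil price falls sharply", "grain harvest grows"]
labels = ["crude", "crude", "grain"]
model.fit(docs, labels)
print(model.predict(["oil price rises again"]))
```

Because the relation names are simply additional vocabulary entries, any classifier that accepts sparse count vectors can consume the upgraded representation unchanged, which is why the approach combines naturally with simple, off-the-shelf classifiers.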

MSC:

68T50 Natural language processing
68T05 Learning and adaptive systems in artificial intelligence

References:

[1] F. Sebastiani, “Machine learning in automated text categorization,” ACM Comput. Surv. 34, No. 1, 1-47 (2002). · doi:10.1145/505282.505283
[2] V. Vapnik, The Nature of Statistical Learning Theory (Springer, New York, 1995). · Zbl 0833.62008 · doi:10.1007/978-1-4757-2440-0
[3] R. Schapire, “The strength of weak learnability,” Machine Learning 5, 197-227 (1990).
[4] T. Hofmann and L. Cai, “Text categorization by boosting automatically extracted concepts,” pp. 182-189 (2003).
[5] T. Joachims, “Text categorization with support vector machines: learning with many relevant features,” pp. 137-142 (1998).
[6] C. Manning, P. Raghavan, and H. Schutze, Introduction to Information Retrieval (Cambridge Univ. Press, Cambridge, 2008). · Zbl 1160.68008 · doi:10.1017/CBO9780511809071
[7] Z. Harris, “Distributional structure,” Word 10, No. 2-3, 146-162 (1954). · doi:10.1080/00437956.1954.11659520
[8] B. Croft, D. Metzler, and T. Strohman, Search Engines: Information Retrieval in Practice (Addison Wesley, Boston, 2010).
[9] R. Baeza-Yates and G. Navarro, “Integrating contents and structure in text retrieval,” ACM SIGMOD Record 25, No. 1, 67-79 (1996). · Zbl 1088.68590 · doi:10.1145/381854.381890
[10] S. Scott and S. Matwin, “Feature engineering for text classification,” pp. 370-388 (1999).
[11] C. D. Manning and H. Schutze, Foundations of Statistical Natural Language Processing (MIT Press, London, 1999). · Zbl 0951.68158
[12] W. Cavnar and J. Trenkle, “N-gram-based text categorization,” pp. 161-175 (1994).
[13] G. Salton, A. Wong, and C. Yang, “A vector space model for automatic indexing,” Commun. ACM 18, No. 11, 613-620 (1975). · Zbl 0313.68082 · doi:10.1145/361219.361220
[14] S. Buttcher, C. Clarke, and G. Cormack, Information Retrieval: Implementing and Evaluating Search Engines (MIT Press, Cambridge, 2010). · Zbl 1211.68176
[15] V. V. Gulin, “A comparative analysis of text documents classification methods,” Vest. MEI, No. 4, 100-108 (2011).
[16] A. B. Frolov, “A finite topology principle in recognizing topological forms,” J. Comput. Syst. Sci. Int. 49, 65 (2010). · Zbl 1269.68086 · doi:10.1134/S1064230710010089
[17] A. Frolov, E. Jako, and P. Mezey, “Logical models of molecular shapes and their families,” J. Math. Chem. 30, No. 4, 389-409 (2001). · Zbl 1003.92034 · doi:10.1023/A:1015190410232
[18] A. Frolov, E. Jako, and P. Mezey, “Metric properties of factor space of molecular shapes,” J. Math. Chem. 30, No. 4, 411-428 (2001). · Zbl 1003.92035 · doi:10.1023/A:1015142527070
[19] P. G. Mezey, Shape in Chemistry: An Introduction to Molecular Shape Topology (Wiley, New York, 1993).
[20] K. V. Vorontsov, Machine Learning, Course of Lectures. http://shad.yandex.ru/lectures/machine_learning.xml
[21] C. J. van Rijsbergen, Information Retrieval (Butterworth, London, 1979). · Zbl 0227.68052
[22] V. N. Vapnik and A. Ya. Chervonenkis, Theory of Pattern Recognition (Nauka, Moscow, 1974) [in Russian]. · Zbl 0284.68070
[23] D. Lewis, Test Collection Reuters-21578. http://www.daviddlewis.com/resources/testcollections/reuters21578/
[24] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction Springer Series in Statistics (Springer, New York, 2009). · Zbl 1273.62005 · doi:10.1007/978-0-387-84858-7
[25] J. R. Quinlan, C4.5: Programs for Machine Learning (Morgan Kaufmann, San Francisco, 1993).
[26] Y. Freund and R. Schapire, “A decision-theoretic generalization of on-line learning and an application to boosting,” J. Comput. Syst. Sci. 55, No. 1, 119-139 (1997). · Zbl 0880.68103 · doi:10.1006/jcss.1997.1504
[27] L. Breiman, “Random forests,” Machine Learning 45, No. 1, 5-32 (2001). · Zbl 1007.68152 · doi:10.1023/A:1010933404324
[28] V. V. Gulin, “Study of the gradient boosting method on ‘careless’ decision trees in the problem of text document classification,” Vest. MEI, No. 6, 124-131 (2012).
[29] V. V. Gulin, The Library of Machine Learning Algorithms MLLibrary (2013).
[30] Chih-Chung Chang and Chih-Jen Lin, LIBSVM: A Library for Support Vector Machines. http://www.csie.ntu.edu.tw/~cjlin/libsvm/ · Zbl 0993.68080
[31] Yu. I. Zhuravlev, V. V. Ryazanov, and O. V. Sen’ko, Recognition. Mathematical Methods. The Software System. Practical Applications (Fazis, Moscow, 2006) [in Russian].
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.