Web page classification based on a simplified swarm optimization. (English) Zbl 1410.90269

Summary: The rapid growth of information on the World Wide Web creates a strong need for efficient web page classification so that useful information can be retrieved quickly. In this paper, we propose a novel simplified swarm optimization (SSO) that learns the best weight for every feature in the training dataset and then uses those weights to classify new web pages in the testing dataset. Because the parameter settings play an important role in the update mechanism of the SSO, we use a Taguchi method to determine them. To demonstrate the effectiveness of the algorithm, we compare its performance with that of the well-known genetic algorithm (GA), the Bayesian classifier, and the K-nearest neighbor (KNN) classifier on four datasets. The experimental results indicate that the SSO yields better performance than the other three approaches.
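
The summary does not spell out the update mechanism, so the following is only a minimal sketch of the generic SSO update rule as it is commonly described, not the authors' exact variant. The function name `sso_maximize`, the threshold values `cw`, `cp`, `cg`, and the weight bounds are illustrative assumptions; the paper tunes its parameters with a Taguchi method, and its fitness would presumably be classification quality on the training set.

```python
import random


def sso_maximize(fitness, dim, n_particles=20, n_iters=100,
                 cw=0.1, cp=0.4, cg=0.9, low=0.0, high=1.0, seed=0):
    """Sketch of simplified swarm optimization over a weight vector.

    fitness: callable mapping a list of `dim` weights to a score (higher is better).
    cw < cp < cg are cumulative thresholds (assumed values, not from the paper).
    Returns (best_weights, best_fitness).
    """
    rng = random.Random(seed)
    swarm = [[rng.uniform(low, high) for _ in range(dim)]
             for _ in range(n_particles)]
    pbest = [p[:] for p in swarm]               # best position seen by each particle
    pbest_fit = [fitness(p) for p in swarm]
    g = max(range(n_particles), key=pbest_fit.__getitem__)
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]  # best position seen by the swarm

    for _ in range(n_iters):
        for i in range(n_particles):
            particle = swarm[i]
            for j in range(dim):
                rho = rng.random()
                if rho < cw:
                    pass                          # keep the current value
                elif rho < cp:
                    particle[j] = pbest[i][j]     # copy from the personal best
                elif rho < cg:
                    particle[j] = gbest[j]        # copy from the global best
                else:
                    particle[j] = rng.uniform(low, high)  # random restart of this weight
            fit = fitness(particle)
            if fit > pbest_fit[i]:
                pbest[i], pbest_fit[i] = particle[:], fit
                if fit > gbest_fit:
                    gbest, gbest_fit = particle[:], fit
    return gbest, gbest_fit
```

As a usage illustration, `fitness` could be any function that scores a candidate weight vector, for example the accuracy of a weighted-feature classifier on held-out training pages; a toy stand-in such as the negative squared distance to a target vector is enough to exercise the loop.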

MSC:

90C59 Approximation methods and heuristics in mathematical programming
68T05 Learning and adaptive systems in artificial intelligence
68M11 Internet topics
68P20 Information storage and retrieval of data
