
Dropout training for SVMs with data augmentation. (English) Zbl 1405.68280

Summary: Dropout and other feature noising schemes have shown promise in controlling overfitting by artificially corrupting the training data. Though extensive studies have been performed for generalized linear models, little has been done for support vector machines (SVMs), one of the most successful approaches for supervised learning. This paper presents dropout training for both linear SVMs and a nonlinear extension with latent representation learning. For linear SVMs, to deal with the intractable expectation of the non-smooth hinge loss under corrupting distributions, we develop an iteratively re-weighted least squares (IRLS) algorithm by exploring data augmentation techniques. Our algorithm iteratively minimizes the expectation of a re-weighted least squares problem, where the re-weights are analytically updated. For nonlinear latent SVMs, we consider learning one layer of latent representations in SVMs and extend the data augmentation technique, in conjunction with a first-order Taylor expansion, to deal with the intractable expected hinge loss and the nonlinearity of latent representations. Finally, we apply similar data augmentation ideas to develop a new IRLS algorithm for the expected logistic loss under corrupting distributions, and we further develop a nonlinear extension of logistic regression by incorporating one layer of latent representations. Our algorithms offer insights into the connections and differences between the hinge loss and the logistic loss in dropout training. Empirical results on several real datasets demonstrate the effectiveness of dropout training in significantly boosting the classification accuracy of both linear and nonlinear SVMs.
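To make the IRLS idea concrete, here is a minimal, self-contained Python sketch, not the authors' implementation. It assumes blankout (dropout) corruption with rescaling, under which the expected squared margin residual has a closed form; the IRLS loop alternates solving the resulting weighted quadratic subproblem with an analytic re-weight update. The specific re-weight formula used here (inverse root of the expected squared residual) is a plausible assumption and may differ from the paper's exact update in constants.

```python
import numpy as np

def dropout_irls_svm(X, y, delta=0.5, reg=1.0, n_iter=50):
    """Illustrative IRLS sketch for an expected-hinge-loss linear SVM under dropout.

    Corruption model (blankout with rescaling): x_d -> x_d * z_d / (1 - delta),
    z_d ~ Bernoulli(1 - delta), so corrupted features are unbiased and each
    coordinate adds variance x_d**2 * delta / (1 - delta).  Labels y are in {-1, +1}.
    """
    n, d = X.shape
    var_scale = delta / (1.0 - delta)
    omega = np.ones(n)        # per-example re-weights from the augmentation
    w = np.zeros(d)
    for _ in range(n_iter):
        # Weighted quadratic subproblem: minimize over w
        #   sum_i E[zeta_i] + (omega_i / 2) * E[zeta_i^2] + (reg / 2) * ||w||^2,
        # where zeta_i = 1 - y_i * w^T x_corrupted_i and the expectation over
        # dropout noise is available in closed form.
        A = (X * omega[:, None]).T @ X
        A[np.diag_indices(d)] += var_scale * (omega @ (X ** 2)) + reg
        b = X.T @ ((1.0 + omega) * y)
        w = np.linalg.solve(A, b)
        # Analytic re-weight update (assumed form) from the expected squared residual.
        mean_resid = 1.0 - y * (X @ w)
        expected_sq = mean_resid ** 2 + (X ** 2) @ (w ** 2) * var_scale
        omega = 1.0 / np.sqrt(expected_sq + 1e-12)
    return w
```

Each iteration only requires forming a d-by-d system and a vector of per-example expectations, which is what makes the expected (otherwise intractable) hinge loss tractable in this scheme.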

MSC:

68T05 Learning and adaptive systems in artificial intelligence
62J12 Generalized linear models (logistic models)

References:

[1] Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 2014, 15: 1929-1958 · Zbl 1318.68153
[2] Wager, S.; Wang, S.; Liang, P., Dropout training as adaptive regularization (2013)
[3] Maaten, L. V.; Chen, M.; Tyree, S.; Weinberger, K. Q., Learning with marginalized corrupted features, 410-418 (2013)
[4] Wang, S.; Wang, M. Q.; Wager, S.; Liang, P.; Manning, C. D., Feature noising for log-linear structured prediction, 1170-1179 (2013)
[5] Wang, S.; Manning, C., Fast dropout training, 777-785 (2013)
[6] Wang, H.; Shi, X. J.; Yeung, D. Y., Relational stacked denoising autoencoder for tag recommendation, 3052-3058 (2015)
[7] Vapnik V. The Nature of Statistical Learning Theory. New York: Springer-Verlag, 1995 · Zbl 0833.62008 · doi:10.1007/978-1-4757-2440-0
[8] Burges, C. J. C.; Schölkopf, B., Improving the accuracy and speed of support vector machines, 375-381 (1997)
[9] Globerson, A.; Roweis, S., Nightmare at test time: robust learning by feature deletion, 353-360 (2006)
[10] Dekel, O.; Shamir, O., Learning to classify with missing and corrupted features, 149-178 (2008) · Zbl 1470.68095
[11] Teo, C. H.; Globerson, A.; Roweis, S. T.; Smola, A. K., Convex learning with invariances, 1489-1496 (2008)
[12] Polson N G, Scott S L. Data augmentation for support vector machines. Bayesian Analysis, 2011, 6(1): 1-24 · Zbl 1330.62258 · doi:10.1214/11-BA601
[13] Polson N G, Scott J G, Windle J. Bayesian inference for logistic models using Pólya-Gamma latent variables. Journal of the American Statistical Association, 2013, 108(504): 1339-1349 · Zbl 1283.62055 · doi:10.1080/01621459.2013.829001
[14] Rosasco L, De Vito E, Caponnetto A, Piana M, Verri A. Are loss functions all the same? Neural Computation, 2004, 16(5): 1063-1076 · Zbl 1089.68109 · doi:10.1162/089976604773135104
[15] Globerson, A.; Koo, T. Y.; Carreras, X.; Collins, M., Exponentiated gradient algorithms for log-linear structured prediction, 305-312 (2007)
[16] Baldi P, Sadowski P. The dropout learning algorithm. Artificial Intelligence, 2014, 210(5): 78-122 · Zbl 1333.68225 · doi:10.1016/j.artint.2014.02.004
[17] Srivastava N, Hinton G E, Krizhevsky A, Sutskever I, Salakhutdinov R. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 2014, 15: 1929-1958 · Zbl 1318.68153
[18] Srivastava N. Improving neural networks with dropout. Master's thesis. Toronto: University of Toronto, 2013
[19] Huang G, Song S J, Gupta J N D, Wu C. Semi-supervised and unsupervised extreme learning machines. IEEE Transactions on Cybernetics, 2014, 44(12): 2405-2417 · doi:10.1109/TCYB.2014.2307349
[20] Van Erven T, Kotlowski W, Warmuth M K. Follow the leader with dropout perturbations. Proceedings of Machine Learning Research, 2014, 35: 949-974
[21] Xu, P. Y.; Sarikaya, R., Targeted feature dropout for robust slot filling in natural language understanding, 258-262 (2014)
[22] Rashmi, R. K.; Gilad-Bachrach, R., DART: dropouts meet multiple additive regression trees, 489-497 (2015)
[23] Chen, M. M.; Xu, Z. X.; Weinberger, K.; Sha, F., Marginalized denoising autoencoders for domain adaptation, 767-774 (2012)
[24] Chen, M. M.; Weinberger, K.; Sha, F.; Bengio, Y., Marginalized denoising autoencoders for nonlinear representation, 3342-3350 (2014)
[25] Chen, Z.; Chen, M. M.; Weinberger, K. Q.; Zhang, W. X., Marginalized denoising for link prediction and multi-label learning, 1707-1713 (2015)
[26] Chen, Z.; Zhang, W. X., A marginalized denoising method for link prediction in relational data, 298-306 (2014)
[27] Chen, M. M.; Zheng, A.; Weinberger, K., Fast image tagging, 2311-2319 (2013)
[28] Qian, Q.; Hu, J. H.; Jin, R.; Pei, J.; Zhu, S. H., Distance metric learning using dropout: a structured regularization approach, 323-332 (2014)
[29] Wager, S.; Fithian, W.; Wang, S.; Liang, P. S., Altitude training: strong bounds for single-layer dropout, 100-108 (2014)
[30] Bachman, P.; Alsharif, O.; Precup, D., Learning with pseudo-ensembles, 3365-3373 (2014)
[31] Helmbold D P, Long P M. On the inductive bias of dropout. Journal of Machine Learning Research, 2015, 16: 3403-3454 · Zbl 1351.68213
[32] Maeda S. A Bayesian encourages dropout. 2014, arXiv:1412.7003v3
[33] Gal, Y.; Ghahramani, Z., Dropout as a Bayesian approximation: representing model uncertainty in deep learning, 1651-1660 (2016)
[34] Chen, N.; Zhu, J.; Chen, J. F.; Zhang, B., Dropout training for support vector machines, 1752-1759 (2014)
[35] Vincent, P.; Larochelle, H.; Bengio, Y.; Manzagol, P. A., Extracting and composing robust features with denoising autoencoders, 1096-1103 (2008)
[36] Saul L K, Jaakkola T, Jordan M I. Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research, 1996, 4: 61-76 · Zbl 0900.68379
[37] Zhu J, Chen N, Perkins H, Zhang B. Gibbs max-margin topic models with data augmentation. Journal of Machine Learning Research, 2014, 15: 1073-1110 · Zbl 1318.68161
[38] Devroye L. Non-Uniform Random Variate Generation. New York: Springer-Verlag, 1986 · Zbl 0593.65005 · doi:10.1007/978-1-4613-8643-8
[39] Liu D C, Nocedal J. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 1989, 45(3): 503-528 · Zbl 0696.90048 · doi:10.1007/BF01589116
[40] Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer, 2009 · Zbl 1273.62005 · doi:10.1007/978-0-387-84858-7
[41] Bengio Y, Courville A, Vincent P. Representation learning: a review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 35(8): 1798-1828 · doi:10.1109/TPAMI.2013.50
[42] Guo J, Che W X, Yarowsky D, Wang H F, Liu T. A distributed representation-based framework for cross-lingual transfer parsing. Journal of Artificial Intelligence Research, 2016, 55: 995-1023
[43] Smola A J, Schölkopf B. A tutorial on support vector regression. Statistics and Computing, 2003, 14(3): 199-222 · doi:10.1023/B:STCO.0000035301.49549.88
[44] Chen, N.; Zhu, J.; Xia, F.; Zhang, B., Generalized relational topic models with data augmentation, 1273-1279 (2013)
[45] Blitzer, J.; Dredze, M.; Pereira, F., Biographies, Bollywood, boom-boxes and blenders: domain adaptation for sentiment classification, 440-447 (2007)
[46] Torralba A, Fergus R, Freeman W. A large dataset for non-parametric object and scene recognition. IEEE Transaction on Pattern Analysis and Machine Intelligence, 2008, 30(11): 1958-1970 · doi:10.1109/TPAMI.2008.128
[47] Krizhevsky, A., Learning multiple layers of features from tiny images (2009)
[48] Zhu, J.; Xing, E. P., Conditional topic random fields, 1239-1246 (2010)
[49] Rifkin R, Klautau A. In defense of one-vs-all classification. Journal of Machine Learning Research, 2004, 5: 101-141 · Zbl 1222.68287
[50] Blei, D.; McAuliffe, J. D., Supervised topic models (2007)
[51] Tang, Y., Deep learning with linear support vector machines (2013)
[52] Kingma, D. P.; Welling, M., Efficient gradient-based inference through transformations between Bayes nets and neural nets, 3791-3799 (2014)
[53] Bacon, P. L.; Bengio, E.; Pineau, J.; Precup, D., Conditional computation in neural networks using a decision-theoretic approach (2015)