
On discriminative Bayesian network classifiers and logistic regression. (English) Zbl 1101.68785

Summary: Discriminative learning of the parameters in the naive Bayes model is known to be equivalent to a logistic regression problem. Here we show that the same fact holds for much more general Bayesian network models, as long as the corresponding network structure satisfies a certain graph-theoretic property. The property holds not only for naive Bayes but also for more complex structures such as tree-augmented naive Bayes, as well as for mixed diagnostic-discriminative structures. Our results imply that for networks satisfying our property, the conditional likelihood cannot have local maxima, so that the global maximum can be found by simple local optimization methods. We also show that if this property does not hold, then in general the conditional likelihood can have local, non-global maxima. We illustrate our theoretical results by empirical experiments with local optimization in a conditional naive Bayes model. Furthermore, we provide a heuristic strategy for pruning the number of parameters and relevant features in such models. For many data sets, we obtain good results with heavily pruned submodels containing many fewer parameters than the original naive Bayes model.
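To make the stated equivalence concrete, the following is a minimal sketch for the plain naive Bayes case; the notation (class value \(c\), feature vector \(x=(x_1,\dots,x_n)\), and reparametrized weights \(w\)) is illustrative and not taken from the paper. Writing \(w_{c,0}=\log P(c)\) and \(w_{c,i}(x_i)=\log P(x_i\mid c)\), the naive Bayes conditional distribution becomes
\[
P(c \mid x)
= \frac{P(c)\prod_{i=1}^{n} P(x_i \mid c)}{\sum_{c'} P(c')\prod_{i=1}^{n} P(x_i \mid c')}
= \frac{\exp\bigl(w_{c,0} + \sum_{i=1}^{n} w_{c,i}(x_i)\bigr)}{\sum_{c'} \exp\bigl(w_{c',0} + \sum_{i=1}^{n} w_{c',i}(x_i)\bigr)},
\]
which is a multinomial logistic (softmax) model in the weights. Treating the weights as unconstrained parameters, the conditional log-likelihood is concave in \(w\), so simple local optimization (e.g. gradient ascent) cannot get stuck in a non-global local maximum; this is the naive Bayes instance of the general graph-theoretic condition discussed above.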

MSC:

68T05 Learning and adaptive systems in artificial intelligence

Software:

UCI-ml

References:

[1] Barndorff-Nielsen, O. (1978). Information and exponential families in statistical theory. New York, NY: John Wiley & Sons. · Zbl 0387.62011
[2] Bernardo, J. & Smith, A. (1994). Bayesian theory. New York, NY: John Wiley & Sons. · Zbl 0796.62002
[3] Bishop, C. (1995). Neural networks for pattern recognition. Oxford, UK: Oxford University Press. · Zbl 0868.68096
[4] Blake, C., & Merz, C. (1998). UCI repository of machine learning databases. http://www.ics.uci.edu/~mlearn/MLRepository.html. University of California, Irvine, Department of Information and Computer Science.
[5] Buntine, W. (1994). Operations for learning with graphical models. Journal of Artificial Intelligence Research, 2, 159–225.
[6] Cowell, R. (2001). On searching for optimal classifiers among Bayesian networks. In T. Jaakkola, & T. Richardson (Eds.), Proceedings of the eighth international workshop on artificial intelligence and statistics (pp. 175–180). San Francisco, CA: Morgan Kaufmann.
[7] Dawid, A. (1976). Properties of diagnostic data distributions. Biometrics, 32:3, 647–658. · Zbl 0332.62078 · doi:10.2307/2529753
[8] Dempster, A., Laird, N., & Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1, 1–38. · Zbl 0364.62022
[9] Fayyad, U., & Irani, K. (1993). Multi-interval discretization of continuous-valued attributes for classification learning. In R. Bajcsy (Ed.), Proceedings of the thirteenth international joint conference on artificial intelligence (pp. 1022–1027). San Francisco, CA: Morgan Kaufmann.
[10] Friedman, N., Geiger, D., & Goldszmidt, M. (1997). Bayesian network classifiers. Machine Learning, 29:2, 131–163. · Zbl 0892.68077 · doi:10.1023/A:1007465528199
[11] Greiner, R., Grove, A., & Schuurmans, D. (1997). Learning Bayesian nets that perform well. In D. Geiger, & P. P. Shenoy (Eds.), Proceedings of the thirteenth annual conference on uncertainty in artificial intelligence (pp. 198–207). San Francisco, CA: Morgan Kaufmann.
[12] Greiner, R., & Zhou, W. (2002). Structural extension to logistic regression: discriminant parameter learning of belief net classifiers. In R. Dechter, M. Kearns, & R. Sutton (Eds.), Proceedings of the eighteenth national conference on artificial intelligence (pp. 167–173). Cambridge, MA: MIT Press.
[13] Grossman, D., & Domingos, P. (2004). Learning Bayesian network classifiers by maximizing conditional likelihood. In C. E. Brodley (Ed.), Proceedings of the twenty-first international conference on machine learning (pp. 361–368). Madison, WI: Omnipress.
[14] Grünwald, P., Kontkanen, P., Myllymäki, P., Roos, T., Tirri, H., & Wettig, H. (2002). Supervised posterior distributions. Presented at the seventh Valencia international meeting on Bayesian statistics, Tenerife, Spain.
[15] Heckerman, D. (1996). A tutorial on learning with Bayesian networks. Technical Report MSR-TR-95-06, Microsoft Research, Redmond, WA. · Zbl 0886.62016
[16] Heckerman, D., & Meek, C. (1997a). Embedded Bayesian network classifiers. Technical Report MSR-TR-97-06, Microsoft Research, Redmond, WA.
[17] Heckerman, D., & Meek, C. (1997b). Models and selection criteria for regression and classification. In D. Geiger, & P.P. Shenoy (Eds.), Proceedings of the thirteenth annual conference on uncertainty in artificial intelligence (pp. 198–207). San Francisco, CA: Morgan Kaufmann.
[18] Jebara, T. (2003). Machine learning: Discriminative and generative. Boston, MA: Kluwer Academic Publishers. · Zbl 1030.68073
[19] Keogh, E., & Pazzani, M. (1999). Learning augmented Bayesian classifiers: A comparison of distribution-based and classification-based approaches. In D. Heckerman, & J. Whittaker (Eds.), Proceedings of the seventh international workshop on artificial intelligence and statistics (pp. 225–230). San Francisco, CA: Morgan Kaufmann.
[20] Kontkanen, P., Myllymäki, P., Silander, T., & Tirri, H. (1999). On supervised selection of Bayesian networks. In K. B. Laskey, & H. Prade (Eds.), Proceedings of the fifteenth international conference on uncertainty in artificial intelligence (pp. 334–342). San Francisco, CA: Morgan Kaufmann Publishers.
[21] Kontkanen, P., Myllymäki, P., Silander, T., Tirri, H., & Grünwald, P. (2000). On predictive distributions and Bayesian networks. Statistics and Computing, 10:1, 39–54. · doi:10.1023/A:1008984400380
[22] Kontkanen, P., Myllymäki, P., & Tirri, H. (2001). Classifier learning with supervised marginal likelihood. In J. S. Breese, & D. Koller (Eds.), Proceedings of the seventeenth conference on uncertainty in artificial intelligence (pp. 277–284). San Francisco, CA: Morgan Kaufmann.
[23] Lauritzen, S. (1996). Graphical models. Oxford, UK: Oxford University Press. · Zbl 0907.62001
[24] Little, R. & Rubin, D. (1987). Statistical analysis with missing data. New York, NY: John Wiley & Sons. · Zbl 0665.62004
[25] Madden, M. (2003). The performance of Bayesian network classifiers constructed using different techniques. In Working notes of the ECML/PKDD-03 workshop on probabilistic graphical models for classification (pp. 59–70).
[26] McLachlan, G. (1992). Discriminant analysis and statistical pattern recognition. New York, NY: John Wiley & Sons. · Zbl 1108.62317
[27] McLachlan, G. & Krishnan, T. (1997). The EM algorithm and extensions. New York, NY: John Wiley & Sons. · Zbl 0882.62012
[28] Minka, T. (2001). Algorithms for maximum-likelihood logistic regression. Technical Report 758, Carnegie Mellon University, Department of Statistics. Revised Sept. 2003.
[29] Myllymäki, P., Silander, T., Tirri, H., & Uronen, P. (2002). B-Course: a web-based tool for Bayesian and causal data analysis. International Journal on Artificial Intelligence Tools, 11:3, 369–387. · Zbl 05421441 · doi:10.1142/S0218213002000940
[30] Ng, A. & Jordan, M. (2001). On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14 (pp. 605–610). Cambridge, MA: MIT Press.
[31] Pearl, J. (1987). Evidential reasoning using stochastic simulation of causal models. Artificial Intelligence, 32:2, 245–258. · Zbl 0642.68177 · doi:10.1016/0004-3702(87)90012-9
[32] Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. San Mateo, CA: Morgan Kaufmann. · Zbl 0746.68089
[33] Raina, R., Shen, Y., Ng, A. Y., & McCallum, A. (2003). Classification with hybrid generative/discriminative models. In S. Thrun, L. Saul , & B. Schölkopf (Eds.), Advances in neural information processing systems 16. Cambridge, MA: MIT Press.
[34] Rissanen, J. (1996). Fisher information and stochastic complexity. IEEE Transactions on Information Theory, 42:1, 40–47. · Zbl 0856.94006 · doi:10.1109/18.481776
[35] Russell, S., Binder, J., Koller, D., & Kanazawa, K. (1995). Local learning in probabilistic networks with hidden variables. In S. Thrun, & T. M. Mitchell (Eds.), Proceedings of the fourteenth international joint conference on artificial intelligence (pp. 1146–1152). San Francisco, CA: Morgan Kaufmann Publishers.
[36] Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6:2, 461–464. · Zbl 0379.62005 · doi:10.1214/aos/1176344136
[37] Shen, B., Su, X., Greiner, R., Musilek, P., & Cheng, C. (2003). Discriminative parameter learning of general Bayesian network classifiers. In Proceedings of the fifteenth IEEE international conference on tools with artificial intelligence (pp. 296–305). Los Alamitos, CA: IEEE Computer Society Press.
[38] Thiesson, B. (1995). Accelerated quantification of Bayesian networks with incomplete data. In U. M. Fayyad, & R. Uthurusamy (Eds.), Proceedings of the first international conference on knowledge discovery and data mining (pp. 306–311). Cambridge, MA: MIT Press.
[39] Wettig, H., Grünwald, P., Roos, T., Myllymäki, P., & Tirri, H. (2002). Supervised naive Bayes parameters. In P. Ala-Siuru, & S. Kaski (Eds.), Proceedings of the tenth Finnish artificial intelligence conference (pp. 72–83). Oulu, Finland: Finnish Artificial Intelligence Society.
[40] Wettig, H., Grünwald, P., Roos, T., Myllymäki, P., & Tirri, H. (2003). When discriminative learning of Bayesian network parameters is easy. In G. Gottlob, & T. Walsh (Eds.), Proceedings of the eighteenth international joint conference on artificial intelligence (pp. 491–496). San Francisco, CA: Morgan Kaufmann Publishers.