zbMATH — the first resource for mathematics

Hierarchical mixtures-of-experts for exponential family regression models: Approximation and maximum likelihood estimation. (English) Zbl 0957.62032
Summary: We consider hierarchical mixtures-of-experts (HME) models where exponential family regression models with generalized linear mean functions of the form \(\psi(\alpha+x^T\beta)\) are mixed. Here \(\psi(\cdot)\) is the inverse link function. Suppose the true response \(y\) follows an exponential family regression model with mean function belonging to a class of smooth functions of the form \(\psi(h(x))\), where \(h(\cdot)\in W^\infty_{2;K_0}\) (a Sobolev class over \([0,1]^s\)). It is shown that the HME probability density functions can approximate the true density at a rate of \(O(m^{-2/s})\) in Hellinger distance and at a rate of \(O(m^{-4/s})\) in Kullback-Leibler divergence, where \(m\) is the number of experts and \(s\) is the dimension of the predictor \(x\). We also provide conditions under which the mean-square error of the estimated mean response obtained from the maximum likelihood method converges to zero as the sample size and the number of experts both increase.
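For concreteness, a one-level mixture-of-experts density of the kind being mixed can be sketched as follows; this is standard notation for such models, not a formula taken verbatim from the paper (the HME version nests such mixtures hierarchically, and the parameter names \(\gamma_j\), \(\alpha_j\), \(\beta_j\) here are illustrative):
\[
  f_m(y \mid x) \;=\; \sum_{j=1}^{m} g_j(x)\,
     \pi\!\bigl(y;\, \psi(\alpha_j + x^T\beta_j)\bigr),
  \qquad
  g_j(x) \;=\; \frac{\exp(\gamma_{j0} + x^T\gamma_j)}
       {\sum_{k=1}^{m}\exp(\gamma_{k0} + x^T\gamma_k)},
\]
where \(\pi(y;\mu)\) denotes the exponential family density with mean \(\mu\), and the multinomial-logit gating weights \(g_j(x)\) sum to one. The approximation results of the paper then bound, uniformly over \(h\in W^\infty_{2;K_0}\), the distance between the true density \(\pi(y;\psi(h(x)))\) and the best such \(m\)-expert mixture.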

62G08 Nonparametric regression and quantile regression
62J12 Generalized linear models (logistic models)
41A25 Rate of convergence, degree of approximation
Full Text: DOI