Bayesian model averaging: A tutorial. (with comments and a rejoinder).

*(English)* Zbl 1059.62525
Stat. Sci. 14, No. 4, 382-417 (1999); correction ibid. 15, No. 3, 193-195 (2000).

Summary: Standard statistical practice ignores model uncertainty. Data analysts typically select a model from some class of models and then proceed as if the selected model had generated the data. This approach ignores the uncertainty in model selection, leading to over-confident inferences and decisions that are more risky than one thinks they are. Bayesian model averaging (BMA) provides a coherent mechanism for accounting for this model uncertainty. Several methods for implementing BMA have recently emerged. We discuss these methods and present a number of examples. In these examples, BMA provides improved out-of-sample predictive performance. We also provide a catalogue of currently available BMA software.
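As a rough illustration of the averaging described in the summary: BMA weights each model's posterior summary of a quantity Δ by the posterior model probability pr(M_k | D), and the averaged variance includes a between-model term. The sketch below is minimal and makes assumptions not found in the record: the function name `bma_mean_and_variance` and all numbers are hypothetical.

```python
from typing import Sequence, Tuple


def bma_mean_and_variance(
    post_probs: Sequence[float],
    means: Sequence[float],
    variances: Sequence[float],
) -> Tuple[float, float]:
    """Combine per-model posterior means and variances into BMA summaries.

    post_probs[k] plays the role of pr(M_k | D); means[k] and variances[k]
    are the posterior mean and variance of the quantity of interest under M_k.
    """
    if not (len(post_probs) == len(means) == len(variances)):
        raise ValueError("all inputs must have the same length")
    total = sum(post_probs)
    if total <= 0:
        raise ValueError("posterior model probabilities must sum to a positive value")
    # Renormalize in case the probabilities are only known up to a constant.
    weights = [p / total for p in post_probs]
    mean = sum(w * m for w, m in zip(weights, means))
    # Mixture variance: average second moment minus squared average mean.
    second_moment = sum(w * (v + m * m) for w, m, v in zip(weights, means, variances))
    return mean, second_moment - mean * mean


# Hypothetical three-model example (numbers invented for illustration).
mean, var = bma_mean_and_variance([0.5, 0.3, 0.2], [1.0, 2.0, 0.0], [0.1, 0.2, 0.0])
print(mean, var)
```

The averaged variance here exceeds the probability-weighted within-model variance because the models disagree about the mean. That extra spread is the model uncertainty that conditioning on a single selected model would ignore.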

##### MSC:

62F15 | Bayesian inference |

62-01 | Introductory exposition (textbooks, tutorial papers, etc.) pertaining to statistics |

##### Keywords:

Bayesian model averaging; Bayesian graphical models; learning; model uncertainty; Markov chain Monte Carlo

##### Software:

alr3

*J. A. Hoeting* et al., Stat. Sci. 14, No. 4, 382-417 (1999; Zbl 1059.62525)

##### References:

[1] | Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In Second International Symposium on Information Theory (B. N. Petrov and F. Csáki, eds.) 267. · Zbl 0283.62006 |

[2] | Barnard, G. A. (1963). New methods of quality control. J. Roy. Statist. Soc. Ser. A 126 255. · Zbl 0118.14402 |

[3] | Bates, J. M. and Granger, C. W. J. (1969). The combination of forecasts. Operational Research Quarterly 20 451-468. · Zbl 0174.21901 |

[4] | Berger, J. O. and Delampady, M. (1987). Testing precise hypotheses. Statist. Sci. 2 317-352. · Zbl 0955.62545 |

[5] | Berger, J. O. and Sellke, T. (1987). Testing a point null hypothesis (with discussion). J. Amer. Statist. Assoc. 82 112-122. · Zbl 0612.62022 |

[6] | Bernardo, J. and Smith, A. (1994). Bayesian Theory. Wiley, Chichester. |

[7] | Besag, J. E., Green, P., Higdon, D. and Mengersen, K. (1995). Bayesian computation and stochastic systems. Statist. Sci. 10 3-66. · Zbl 0955.62552 |

[8] | Breiman, L. (1996). Bagging predictors. Machine Learning 24 123-140. · Zbl 0858.68080 |

[9] | Breiman, L. and Friedman, J. H. (1985). Estimating optimal transformations for multiple regression and correlation (with discussion). J. Amer. Statist. Assoc. 80 580-619. · Zbl 0594.62044 |

[10] | Brozek, J., Grande, F., Anderson, J. and Keys, A. (1963). Densitometric analysis of body composition: revision of some quantitative assumptions. Ann. New York Acad. Sci. 110 113-140. |

[11] | Buckland, S. T., Burnham, K. P. and Augustin, N. H. (1997). Model selection: an integral part of inference. Biometrics 53 275-290. · Zbl 0885.62118 |

[12] | Buntine, W. (1992). Learning classification trees. Statist. Comput. 2 63-73. |

[13] | Carlin, B. P. and Chib, S. (1993). Bayesian model choice via Markov chain Monte Carlo. J. Roy. Statist. Soc. Ser. B 55 473-484. · Zbl 0827.62027 |

[14] | Carlin, B. P. and Polson, N. G. (1991). Inference for nonconjugate Bayesian models using the Gibbs sampler. Canad. J. Statist. 19 399-405. · Zbl 0850.62285 |

[15] | Chan, P. K. and Stolfo, S. J. (1996). On the accuracy of metalearning for scalable data mining. J. Intelligent Integration of Information 8 5-28. |

[16] | Chatfield, C. (1995). Model uncertainty, data mining, and statistical inference (with discussion). J. Roy. Statist. Soc. Ser. A 158 419-466. |

[17] | Chib, S. and Greenberg, E. (1995). Understanding the Metropolis-Hastings algorithm. Amer. Statist. 49 327-335. |

[18] | Clemen, R. T. (1989). Combining forecasts: a review and annotated bibliography. Internat. J. Forecasting 5 559-583. |

[19] | Clyde, M., DeSimone, H. and Parmigiani, G. (1996). Prediction via orthogonalized model mixing. J. Amer. Statist. Assoc. 91 1197-1208. · Zbl 0880.62026 |

[20] | Cox, D. R. (1972). Regression models and life tables (with discussion). J. Roy. Statist. Soc. Ser. B 34 187-220. · Zbl 0243.62041 |

[21] | Dawid, A. P. (1984). Statistical theory: the prequential approach. J. Roy. Statist. Soc. Ser. A 147 278-292. · Zbl 0557.62080 |

[22] | Dickinson, J. P. (1973). Some statistical results on the combination of forecasts. Operational Research Quarterly 24 253-260. |

[23] | Dijkstra, T. K. (1988). On Model Uncertainty and Its Statistical Implications. Springer, Berlin. · Zbl 1114.62303 |

[24] | Draper, D. (1995). Assessment and propagation of model uncertainty. J. Roy. Statist. Soc. Ser. B 57 45-97. · Zbl 0812.62001 |

[25] | Draper, D., Gaver, D. P., Goel, P. K., Greenhouse, J. B., Hedges, L. V., Morris, C. N., Tucker, J. and Waternaux, C. (1993). Combining information: National Research Council Panel on Statistical Issues and Opportunities for Research in the Combination of Information. National Academy Press, Washington, DC. Draper, D., Hodges, J. S., Leamer, E. E., Morris, C. N. and Rubin, D. B. (1987). A research agenda for assessment and propagation of model uncertainty. Technical Report Rand Note N-2683-RC, RAND Corporation, Santa Monica, California. |

[26] | Edwards, W., Lindman, H. and Savage, L. J. (1963). Bayesian statistical inference for psychological research. Psychological Review 70 193-242. · Zbl 0173.22004 |

[27] | Fernández, C., Ley, E. and Steel, M. F. (1997). Statistical modeling of fishing activities in the North Atlantic. Technical report, Dept. Econometrics, Tilburg Univ., The Netherlands. |

[28] | Fernández, C., Ley, E. and Steel, M. F. (1998). Benchmark priors for Bayesian model averaging. Technical report, Dept. Econometrics, Tilburg Univ., The Netherlands. · Zbl 1091.62507 |

[29] | Fleming, T. R. and Harrington, D. H. (1991). Counting Processes and Survival Analysis. Wiley, New York. · Zbl 0727.62096 |

[30] | Freedman, D. A., Navidi, W. and Peters, S. C. (1988). On the impact of variable selection in fitting regression equations. In On Model Uncertainty and Its Statistical Implications (T. K. Dijkstra, ed.) 1-16. Springer, Berlin. |

[31] | Freund, Y. (1995). Boosting a weak learning algorithm by majority. Inform. and Comput. 121 256-285. · Zbl 0833.68109 |

[32] | Fried, L. P., Borhani, N. O., Enright, P., Furberg, C. D., Gardin, J. M., Kronmal, R. A., Kuller, L. H., Manolio, T. A., Mittelmark, M. B., Newman, A., O’Leary, D. H., Psaty, B., Rautaharju, P., Tracy, R. P. and Weiler, P. G. (1991). The cardiovascular health study: design and rationale. Annals of Epidemiology 1 263-276. |

[33] | Furnival, G. M. and Wilson, R. W. (1974). Regression by leaps and bounds. Technometrics 16 499-511. · Zbl 0285.05110 |

[34] | Geisser, S. (1980). Discussion on sampling and Bayes’ inference in scientific modeling and robustness (by GEPB). J. Roy. Statist. Soc. Ser. A 143 416-417. · Zbl 0471.62036 |

[35] | George, E. and McCulloch, R. (1993). Variable selection via Gibbs sampling. J. Amer. Statist. Assoc. 88 881-889. George, E. I. (1986a). Combining minimax shrinkage estimators. J. Amer. Statist. Assoc. 81 437-445. George, E. I. (1986b). A formal Bayes multiple shrinkage estimator. Commun. Statist. Theory Methods (Special issue on Stein-type multivariate estimation) 15 2099-2114. George, E. I. (1986c). Minimax multiple shrinkage estimation. Ann. Statist. 14 188-205. |

[36] | George, E. I. (1999). Bayesian model selection. In Encyclopedia of Statistical Sciences Update 3. Wiley, New York. · Zbl 1059.62525 |

[37] | Good, I. J. (1950). Probability and the weighing of evidence. Griffin, London. · Zbl 0036.08402 |

[38] | Good, I. J. (1952). Rational decisions. J. Roy. Statist. Soc. Ser. B 14 107-114. |

[39] | Grambsch, P. M., Dickson, E. R., Kaplan, M., LeSage, G., Fleming, T. R. and Langworthy, A. L. (1989). Extramural cross-validation of the Mayo primary biliary cirrhosis survival model establishes its generalizability. Hepatology 10 846-850. |

[40] | Green, P. J. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82 711-732. · Zbl 0861.62023 |

[41] | Heckerman, D., Geiger, D. and Chickering, D. M. (1994). Learning Bayesian networks: the combination of knowledge and statistical data. In Uncertainty in Artificial Intelligence, Proceedings of the Tenth Conference (B. L. de Mantaras and D. Poole, eds.) 293-301. Morgan Kaufmann, San Francisco. · Zbl 0831.68096 |

[42] | Hodges, J. S. (1987). Uncertainty, policy analysis, and statistics. Statist. Sci. 2 259-291. |

[43] | Hoeting, J. A. (1994). Accounting for model uncertainty in linear regression. Ph.D. dissertation, Univ. Washington, Seattle. |

[44] | Hoeting, J. A., Raftery, A. E. and Madigan, D. (1996). A method for simultaneous variable selection and outlier identification in linear regression. J. Comput. Statist. 22 251-271. · Zbl 0900.62352 |

[45] | Hoeting, J. A., Raftery, A. E. and Madigan, D. (1999). Bayesian simultaneous variable and transformation selection in linear regression. Technical Report 9905, Dept. Statistics, Colorado State Univ. Available at www.stat.colostate.edu. · Zbl 0900.62352 |

[46] | Ibrahim, J. G. and Laud, P. W. (1994). A predictive approach to the analysis of designed experiments. J. Amer. Statist. Assoc. 89 309-319. · Zbl 0791.62080 |

[47] | Johnson, R. W. (1996). Fitting percentage of body fat to simple body measurements. J. Statistics Education 4. |

[48] | Kass, R. E. and Raftery, A. E. (1995). Bayes factors. J. Amer. Statist. Assoc. 90 773-795. · Zbl 0846.62028 |

[49] | Kass, R. E. and Wasserman, L. (1995). A reference Bayesian test for nested hypotheses with large samples. J. Amer. Statist. Assoc. 90 928-934. · Zbl 0851.62020 |

[50] | Katch, F. and McArdle, W. (1993). Nutrition, Weight Control, and Exercise, 4th ed. Williams and Wilkins, Philadelphia. |

[51] | Kearns, M. J., Schapire, R. E. and Sellie, L. M. (1994). Toward efficient agnostic learning. Machine Learning 17 115-142. · Zbl 0938.68797 |

[52] | Kincaid, D. and Cheney, W. (1991). Numerical Analysis. Brooks/Cole, Pacific Grove, CA. · Zbl 0745.65001 |

[53] | Kuk, A. Y. C. (1984). All subsets regression in a proportional hazards model. Biometrika 71 587-592. |

[54] | Kwok, S. and Carter, C. (1990). Multiple decision trees. In Uncertainty in Artificial Intelligence (R. Shachter, T. Levitt, L. Kanal and J. Lemmer, eds.) 4 323-349. North-Holland, Amsterdam. |

[55] | Lauritzen, S. L. (1996). Graphical Models. Clarendon Press, Oxford. · Zbl 0907.62001 |

[56] | Lauritzen, S. L., Thiesson, B. and Spiegelhalter, D. J. (1994). Diagnostic systems created by model selection methods: a case study. In Uncertainty in Artificial Intelligence (P. Cheeseman and W. Oldford, eds.) 4 143-152. Springer, Berlin. |

[57] | Lawless, J. and Singhal, K. (1978). Efficient screening of nonnormal regression models. Biometrics 34 318-327. |

[58] | Leamer, E. E. (1978). Specification Searches. Wiley, New York. · Zbl 0384.62089 |

[59] | Lohman, T. (1992). Advance in Body Composition Assessment, Current Issues in Exercise Science. Human Kinetics Publishers, Champaign, IL. Madigan, D., Andersson, S. A., Perlman, M. and Volinsky, C. T. (1996a). Bayesian model averaging and model selection for Markov equivalence classes of acyclic digraphs. Comm. Statist. Theory Methods 25 2493-2520. Madigan, D., Andersson, S. A., Perlman, M. D. and Volinsky, C. T. (1996b). Bayesian model averaging and model selection for Markov equivalence classes of acyclic digraphs. Comm. Statist. Theory Methods 25 2493-2519. |

[60] | Madigan, D., Gavrin, J. and Raftery, A. E. (1995). Eliciting prior information to enhance the predictive performance of Bayesian graphical models. Comm. Statist. Theory Methods 24 2271-2292. · Zbl 0937.62576 |

[61] | Madigan, D. and Raftery, A. E. (1991). Model selection and accounting for model uncertainty in graphical models using Occam’s window. Technical Report 213, Univ. Washington, Seattle. · Zbl 0814.62030 |

[62] | Madigan, D. and Raftery, A. E. (1994). Model selection and accounting for model uncertainty in graphical models using Occam’s window. J. Amer. Statist. Assoc. 89 1535-1546. Madigan, D., Raftery, A. E., York, J. C., Bradshaw, J. M. and Almond, R. G. (1994). Strategies for graphical model selection. In Selecting Models from Data: Artificial Intelligence and Statistics (P. Cheeseman and W. Oldford, eds.) 4 91-100. Springer, Berlin. · Zbl 0814.62030 |

[63] | Madigan, D. and York, J. (1995). Bayesian graphical models for discrete data. Internat. Statist. Rev. 63 215-232. · Zbl 0834.62003 |

[64] | Markus, B. H., Dickson, E. R., Grambsch, P. M., Fleming, T. R., Mazzaferro, V., Klintmalm, G., Weisner, R. H., Van Thiel, D. H. and Starzl, T. E. (1989). Efficacy of liver transplantation in patients with primary biliary cirrhosis. New England J. Medicine 320 1709-1713. |

[65] | Matheson, J. E. and Winkler, R. L. (1976). Scoring rules for continuous probability distributions. Management Science 22 1087-1096. · Zbl 0349.62080 |

[66] | McCullagh, P. and Nelder, J. (1989). Generalized Linear Models, 2nd ed. Chapman & Hall, London. · Zbl 0744.62098 |

[67] | Miller, A. J. (1990). Subset Selection in Regression. Chapman and Hall, London. · Zbl 0702.62057 |

[68] | Penrose, K., Nelson, A. and Fisher, A. (1985). Generalized body composition prediction equation for men using simple measurement techniques (abstract). Medicine and Science in Sports and Exercise 17 189. |

[69] | Philips, D. B. and Smith, A. F. M. (1994). Bayesian model comparison via jump diffusions. Technical Report 94-20, Imperial College, London. |

[70] | Raftery, A. E. (1993). Bayesian model selection in structural equation models. In Testing Structural Equation Models (K. Bollen and J. Long, eds.) 163-180. Sage, Newbury Park, CA. |

[71] | Raftery, A. E. (1995). Bayesian model selection in social research (with discussion). In Sociological Methodology 1995 (P. V. Marsden, ed.) 111-195. Blackwell, Cambridge, MA. |

[72] | Raftery, A. E. (1996). Approximate Bayes factors and accounting for model uncertainty in generalised linear models. Biometrika 83 251-266. · Zbl 0864.62049 |

[73] | Raftery, A. E., Madigan, D. and Hoeting, J. (1997). Bayesian model averaging for linear regression models. J. Amer. Statist. Assoc. 92 179-191. · Zbl 0888.62026 |

[74] | Raftery, A. E., Madigan, D. and Volinsky, C. T. (1996). Accounting for model uncertainty in survival analysis improves predictive performance (with discussion). In Bayesian Statistics 5 (J. Bernardo, J. Berger, A. Dawid and A. Smith, eds.) 323-349. Oxford Univ. Press. |

[75] | Rao, J. S. and Tibshirani, R. (1997). The out-of-bootstrap method for model averaging and selection. Technical report, Dept. Statistics, Univ. Toronto. |

[76] | Regal, R. and Hook, E. B. (1991). The effects of model selection on confidence intervals for the size of a closed population. Statistics in Medicine 10 717-721. |

[77] | Roberts, H. V. (1965). Probabilistic prediction. J. Amer. Statist. Assoc. 60 50-62. · Zbl 0134.35802 |

[78] | Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist. 6 461-464. · Zbl 0379.62005 |

[79] | Smith, A. F. M. and Roberts, G. O. (1993). Bayesian computation via the Gibbs sampler and related Markov chain Monte Carlo methods (with discussion). J. Roy. Statist. Soc. Ser. B 55 3-23. · Zbl 0779.62030 |

[80] | Spiegelhalter, D. J. (1986). Probabilistic prediction in patient management and clinical trials. Statistics in Medicine 5 421-433. |

[81] | Spiegelhalter, D. J., Dawid, A., Lauritzen, S. and Cowell, R. (1993). Bayesian analysis in expert systems (with discussion). Statist. Sci. 8 219-283. · Zbl 0955.62523 |

[82] | Spiegelhalter, D. J. and Lauritzen, S. (1990). Sequential updating of conditional probabilities on directed graphical structures. Networks 20 579-605. · Zbl 0697.90045 |

[83] | Stewart, L. (1987). Hierarchical Bayesian analysis using Monte Carlo integration: computing posterior distributions when there are many possible models. The Statistician 36 211-219. |

[84] | Taplin, R. H. (1993). Robust likelihood calculation for time series. J. Roy. Statist. Soc. Ser. B 55 829-836. · Zbl 0800.62542 |

[85] | Thompson, E. A. and Wijsman, E. M. (1990). Monte Carlo methods for the genetic analysis of complex traits. Technical Report 193, Dept. Statistics, Univ. Washington, Seattle. |

[86] | Tierney, L. and Kadane, J. B. (1986). Accurate approximations for posterior moments and marginal densities. J. Amer. Statist. Assoc. 81 82-86. · Zbl 0587.62067 |

[87] | Volinsky, C. T. (1997). Bayesian model averaging for censored survival models. Ph.D. dissertation, Univ. Washington, Seattle. |

[88] | Volinsky, C. T., Madigan, D., Raftery, A. E. and Kronmal, R. A. (1997). Bayesian model averaging in proportional hazard models: assessing the risk of a stroke. J. Roy. Statist. Soc. Ser. C 46 433-448. · Zbl 0903.62093 |

[89] | Weisberg, S. (1985). Applied Linear Regression, 2nd ed. Wiley, New York. · Zbl 0646.62058 |

[90] | Wolpert, D. H. (1992). Stacked generalization. Neural Networks 5 241-259. |

[91] | York, J., Madigan, D., Heuch, I. and Lie, R. T. (1995). Estimating a proportion of birth defects by double sampling: a Bayesian approach incorporating covariates and model uncertainty. J. Roy. Statist. Soc. Ser. C 44 227-242. · Zbl 0821.62092 |

[92] | and Kooperberg, 1999). Markov chain Monte Carlo (MCMC) methods provide a stochastic method of obtaining samples from the posterior distributions f(M_k | Y) and f(β_{M_k} | M_k, Y), and many of the algorithms that the authors mention can be viewed as special cases of reversible jump MCMC algorithms. |

[93] | . Sampling models and σ² in conjunction with the use of Rao-Blackwellized estimators does appear to be more efficient in terms of mean squared error when there is substantial uncertainty in the error variance (i.e., small sample sizes or low signal-to-noise ratio) or important prior information. Recently, Holmes and Mallick (1998) adapted perfect sampling (Propp and Wilson, 1996) to the context of orthogonal regression. While more computationally intensive per iteration, this may prove to be more efficient for estimation than SSVS or MC³ in problems where the method is applicable and sampling is necessary. While Gibbs and MCMC sampling has worked well in high-dimensional orthogonal problems, Wong, Hansen, Kohn and Smith (1997) found in high-dimensional problems such as nonparametric regression using nonorthogonal basis functions that Gibbs samplers were unsuitable, both from a computational efficiency standpoint and for numerical reasons, because the sampler tends to get stuck in local modes. Their proposed sampler “focuses” on variables that are more “active” at each iteration and in simulation studies provided better MSE performance than other classical nonparametric approaches or Bayesian approaches using Gibbs or reversible jump (Holmes and Mallick, 1997) sampling. With the exception of a deterministic search, most methods for implementing BMA rely on algorithms that sample models with replacement and use ergodic averages to compute expectations, as in (7). In problems, such as linear models, where posterior model probabilities are known up to the normalizing constant, it may be more efficient to devise estimators using renormalized posterior model probabilities (Clyde, DeSimone and Parmigiani, 1996; Clyde, 1999a) and to devise algorithms based on sampling models without replacement. Based on current work with M. Littman, this appears to be a promising direction for implementation of BMA.
While many recent developments have greatly advanced the class of problems that can be handled using BMA, implementing BMA in high-dimensional problems with correlated variables, such as nonparametric regression, is still a challenge from both a computational standpoint and the choice of prior distributions. |

[94] | AIC, BIC, and RIC (Clyde and George, 1998, 1999; George and Foster, 1997; Hanson and Yu, 1999) for both model selection and BMA. |

[95] | MODEL AVERAGING, MAYBE. This paper offers a good review of one approach to dealing with statistical model uncertainty, an important topic and one which has only begun to come into focus for us as a profession in this decade (largely because of the availability of Markov chain Monte Carlo computing methods). The authors, who together might be said to have founded the Seattle school of model uncertainty, are to be commended for taking this issue forward so vigorously over the past five years. I have eight comments on the paper, some general and some specific to the body-fat example (Jennifer Hoeting kindly sent me the data, which are well worth looking at; the data set, and a full description of it, may be obtained by emailing the message send jse/v4n1/datasets.johnson to archive@jse.stat.ncsu.edu). |

[96] | Draper and Fouskakis, 1999). 5. What characteristics of a statistical example predict when BMA will lead to large gains? The only obvious answer I know is the ratio n/p of observations to predictors (with tens of thousands of observations and only dozens of predictors to evaluate, intuitively the price paid for shopping around in the data for a model should be small). Are the authors aware of any other simple answers to this question? As an instance of the n/p effect, in regression-style problems like the cirrhosis example where p is in the low dozens and n is in the hundreds, the effect of model averaging on the predictive scale can be modest. HMRV are stretching a bit when they say, in this example, that “the people assigned to the high risk group by BMA had a higher death rate than did those assigned high risk by other methods; similarly those assigned to the low and medium risk groups by BMA had a lower total death rate”; this can be seen by attaching uncertainty bands to the estimates in Table 5. Over the single random split into build and test data reported in that table, and assuming (at least approximate) independence of the 152 yes/no classifications aggregated in the table, death rates in the high-risk group, with binomial standard errors, are 81% ± 5%, 75% ± 6% and 72% ± 6% for the BMA, stepwise, and top PMP methods, and combining the low and medium risk groups yields 18% ± 4%, 19% ± 4% and 17% ± 4% for the three methods, respectively, hardly a rousing victory for BMA. It is probable that by averaging over many random build-test splits a “statistically significant” difference would emerge, but the predictive advantage of BMA in this example is not large in practical terms. 6. Following on from item (4) above, now that the topic of model choice is on the table, why are we doing variable selection in regression at all?
People who think that you have to choose a subset of the predictors typically appeal to vague concepts like “parsimony,” while neglecting to mention that the “full model” containing all the predictors may well have better out-of-sample predictive performance than many models based on subsets of the x_j. With the body-fat data, for instance, on the same build-test split used by HMRV, the model that uses all 13 predictors in the authors’ Table 7 (fitted by least squares, i.e., Gaussian maximum likelihood) has actual coverage of nominal 90% predictive intervals of 95.0 ± 1.8 |
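The uncertainty bands quoted in the comment above are binomial standard errors, sqrt(p(1 - p)/n), for an observed proportion p among n classified subjects. The following is a minimal sketch of that arithmetic. The group size `n = 60` is a hypothetical value chosen only for illustration; the record does not state the sizes of the risk groups in Table 5.

```python
import math


def binomial_se(p_hat: float, n: int) -> float:
    """Standard error of an observed proportion p_hat from n independent trials."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 <= p_hat <= 1.0:
        raise ValueError("p_hat must lie in [0, 1]")
    return math.sqrt(p_hat * (1.0 - p_hat) / n)


# Hypothetical group size; the true high-risk group size is not given in the record.
n_high_risk = 60
for label, rate in [("BMA", 0.81), ("stepwise", 0.75), ("top PMP", 0.72)]:
    se = binomial_se(rate, n_high_risk)
    print(f"{label}: {rate:.0%} +/- {se:.1%}")
```

With standard errors of this size, the three intervals overlap substantially, which is the point the discussant makes about the single build-test split.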

[97] | ). For the purpose of approximating BMA*, I am less sanguine about Occam’s window, which is fundamentally a heuristic search algorithm. By restricting attention to the “best” models, the subset of models selected by Occam’s Window is unlikely to be representative, and may severely bias the approximation away from BMA*. For example, suppose substantial posterior probability was diluted over a large subset of similar models, as discussed earlier. Although MCMC methods would tend to sample such subsets, they would be entirely missed by Occam’s Window. A possible correction for this problem might be to base selection on a uniform prior, i.e. Bayes factors, but then use a dilution prior for the averaging. However, in spite of its limitations as an approximation to BMA*, the heuristics which motivate Occam’s Window are intuitively very appealing. Perhaps it would simply be appropriate to treat and interpret BMA under Occam’s Window as a conditional Bayes procedure. |
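The dilution concern raised in this comment can be illustrated with a small sketch. It implements only the first Occam's window rule (drop any model whose posterior probability is more than a factor C below that of the best model) and omits the second rule about simpler nested models. The function name, the cutoff `c = 20`, and the probabilities are assumptions made for illustration, not values taken from the record.

```python
from typing import Dict


def occams_window(post_probs: Dict[str, float], c: float = 20.0) -> Dict[str, float]:
    """Keep models within a factor c of the best model, then renormalize.

    Only the first Occam's window rule is sketched here; c = 20 is an
    illustrative cutoff.
    """
    if not post_probs:
        raise ValueError("need at least one model")
    best = max(post_probs.values())
    if best <= 0:
        raise ValueError("at least one model must have positive probability")
    kept = {m: p for m, p in post_probs.items() if p * c >= best}
    total = sum(kept.values())
    return {m: p / total for m, p in kept.items()}


# Hypothetical dilution scenario: one model holds 0.30 of the posterior mass,
# while 70 similar models hold 0.01 each (0.70 in total).
probs = {"best": 0.30}
probs.update({f"similar_{i}": 0.01 for i in range(70)})
window = occams_window(probs)
discarded_mass = sum(p for m, p in probs.items() if m not in window)
print(len(window), round(discarded_mass, 2))
```

In this invented scenario the window keeps a single model even though most of the posterior mass sits on the discarded group, which is the kind of unrepresentative subset the comment warns about.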

[98] | DiCiccio et al., 1997; Oh, 1999). For BMA, it is desirable that the prior on the parameters be spread out enough that it is relatively flat over the region of parameter space where the likelihood is substantial (i.e., that we be in the “stable estimation” situation described by Edwards, |

[99] | Lindman and Savage, 1963). It is also desirable that the prior not be much more spread out than is necessary to achieve this. This is because the integrated likelihood for a model declines roughly as -d as els by Raftery, Madigan and Hoeting (1997). A second such proposal is the unit information prior (UIP), which is a multivariate normal prior centered at the maximum likelihood estimate with variance matrix equal to the inverse of the mean observed Fisher information in one observation. Under regularity conditions, this yields the simple BIC approximation given by equation (13) in our paper |

[100] | . The unit information prior, and hence BIC, have been criticized as being too conservative (i.e., too likely to favor simple models). Cox (1995) suggested that the prior standard deviation should decrease with sample size. Weakliem (1999) gave sociological examples where the UIP is clearly too spread out, and Viallefont et al. (1998) have shown how a more informative prior can lead to better performance of BMA in the analysis of epidemiological case-control studies. The UIP is a proper prior but seems to provide a conservative solution. This suggests that if BMA based on BIC favors an “effect,” we can feel on solid ground in asserting that the data provide evidence for its existence (Raftery, 1999). Thus BMA results based on BIC could be routinely reported as a baseline reference analysis, along with results from other priors if available. A third approach is to allow the data to estimate the prior variance of the parameters. Lindley and Smith (1972) showed that this is essentially what ridge regression does for linear regression, and Volinsky (1997) pointed out that ridge regression has consistently outperformed other estimation methods in simulation studies. Volinsky (1997) proposed combining BMA and ridge regression by using a “ridge regression prior” in BMA. This is closely related to empirical Bayes BMA, which Clyde and George (1999) have shown to work well for wavelets, a special case of orthogonal regression. Clyde, Raftery, Walsh and Volinsky (2000) show that this good performance of empirical Bayes BMA extends to (nonorthogonal) linear regression. |
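The BIC-based baseline analysis suggested here can be sketched as follows. The sketch assumes the usual form of the BIC approximation, log pr(D | M_k) ≈ (maximized log-likelihood) - (d_k / 2) log n, and converts it into approximate posterior model probabilities. The function name and the example numbers are hypothetical, and equal prior model probabilities are an assumed default.

```python
import math
from typing import List, Optional, Sequence


def bic_posterior_probs(
    max_logliks: Sequence[float],
    dims: Sequence[int],
    n: int,
    prior_probs: Optional[Sequence[float]] = None,
) -> List[float]:
    """Approximate posterior model probabilities from the BIC approximation.

    max_logliks[k] is the maximized log-likelihood of model k, dims[k] its
    number of free parameters, and n the sample size.
    """
    k = len(max_logliks)
    if k == 0 or len(dims) != k:
        raise ValueError("max_logliks and dims must be non-empty and equal in length")
    if prior_probs is None:
        prior_probs = [1.0 / k] * k  # assumed default: equal prior probabilities
    if len(prior_probs) != k or any(p <= 0 for p in prior_probs):
        raise ValueError("prior_probs must be positive and match the number of models")
    log_post = [
        ll - 0.5 * d * math.log(n) + math.log(p)
        for ll, d, p in zip(max_logliks, dims, prior_probs)
    ]
    shift = max(log_post)  # subtract the maximum for numerical stability
    unnorm = [math.exp(lp - shift) for lp in log_post]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Hypothetical example: two models with equal fit; the second has one extra parameter.
print(bic_posterior_probs([-120.0, -120.0], [1, 2], n=100))
```

When two models fit equally well, the penalty term alone decides the weights, so the extra parameter costs a factor of sqrt(n) in the approximate posterior odds. This is one way to see why the approximation is described as conservative toward simple models.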

[101] | income (Featherman and Hauser, 1977). X1 and X2 are highly correlated, but the mechanisms by which they might impact Y are quite different, so all four models are plausible a priori. The posterior model probabilities are saying that at least one of X1 and a LISREL-type model (Bollen, 1989). BMA and Bayesian model selection can still be applied in this context (e.g., Hauser and Kuo, 1998). |

[102] | ). Draper says that model choice is a decision problem, and that the use to which the model is to be put should be taken into account explicitly in the model selection process. This is true, of course, but in practice it seems rather difficult to implement. This was first advocated by Kadane and Dickey (1980) but has not been done much in practice, perhaps because specifying utilities and carrying out the full utility maximization is burdensome, and also introduces a whole new set of sensitivity concerns. We do agree with Draper’s suggestion that the analysis of the body fat data would be enhanced by a cost-benefit analysis which took account of both predictive accuracy and data collection costs. In practical decision-making contexts, the choice of statistical model is often not the question of primary interest, and the real decision to be made is something else. Then the issue is decision-making in the presence of model uncertainty, and BMA provides a solution to this. In equation (1) of our article, let Δ be the utility of a course of action, and choose the action for which E[Δ | D] is maximized. Draper does not like our Figure 4. However, we see it as a way of depicting on the same graph the answers to two separate questions: is wrist circumference associated with body fat after controlling for the other variables? and if so, how strong is the association? The posterior distribution of β₁₃ has two components corresponding to these two questions. The answer to the first question is “no” (i.e., the effect is zero or small) with probability 38%, represented by the solid bar in Figure 4. The answer to the second question is summarized by the continuous curve. Figure 4 shows double shrinkage, with both discrete and continuous components. The posterior distribution of β₁₃, given that β₁₃ ≠ 0, is shrunk continuously towards zero via its prior distribution. Then the posterior is further shrunk (discretely this time) by taking account of the probability that β₁₃ = 0.
The displays in Clyde (1999b) convey essentially the same information, and some may find them more appealing than our Figure 4. Draper suggests the use of a practical significance caliper and points out that for one choice, this gives similar results to BMA. Of course the big question here is how the caliper is chosen. BMA can itself be viewed as a significance caliper, where the choice of caliper is based on the data. Draper’s Table 1 is encouraging for BMA, because it suggests that BMA does coincide with practical significance. It has often been observed that P values are at odds with “practical” significance, leading to strong distinctions being made in textbooks between statistical and practical significance. This seems rather unsatisfactory for our discipline: if statistical and practical significance do not at least approximately coincide, what is the use of statistical testing? We have found that BMA often gives results closer to the practical significance judgments of practitioners than do P-values. |

[103] | Bollen, K. A. (1989). Structural Equations with Latent Variables. Wiley, New York. · Zbl 0731.62159 |

[104] | Browne, W. J. (1995). Applications of Hierarchical Modelling. M.Sc. dissertation, Dept. Mathematical Sciences, Univ. Bath, UK. · Zbl 0846.90012 |

[105] | Chipman, H., George, E. I. and McCulloch, R. E. (1998). Bayesian CART model search (with discussion). J. Amer. Statist. Assoc. 93 935-960. Clyde, M. (1999a). Bayesian model averaging and model search strategies (with discussion). In Bayesian Statistics 6 (J. M. Bernardo, A. P. Dawid, J. O. Berger and A. F. M. Smith, eds.) 157-185. Oxford Univ. Press. Clyde, M. (1999b). Model uncertainty and health effect studies for particulate matter. ISDS Discussion Paper 99-28. Available at www.isds.duke.edu. |

[106] | Clyde, M. and DeSimone-Sasinowska, H. (1997). Accounting for model uncertainty in Poisson regression models: does particulate matter particularly matter? ISDS Discussion Paper 97-06. Available at www.isds.duke.edu. |

[107] | Clyde, M. and George, E. I. (1998). Flexible empirical Bayes estimation for wavelets. ISDS Discussion Paper 98-21. Available at www.isds.duke.edu. · Zbl 0957.62006 |

[108] | Clyde, M. and George, E. I. (1999). Empirical Bayes estimation in wavelet nonparametric regression. In Bayesian Inference in Wavelet-Based Models (P. Muller and B. Vidakovic, eds.) 309-322. Springer, Berlin. Clyde, M. and George, E. I. (1999a). Empirical Bayes estimation in wavelet nonparametric regression. In Bayesian Inference in Wavelet Based Models (P. Muller and B. Vidakovic, eds.) Springer, Berlin. To appear. Clyde, M. and George, E. I. (1999b). Flexible empirical Bayes estimation for wavelets. Technical Report, ISDS, Duke Univ. · Zbl 0936.62008 |

[109] | Clyde, M., Parmigiani, G. and Vidakovic, B. (1998). Multiple shrinkage and subset selection in wavelets. Biometrika 85 391-402. Clyde, M., Raftery, A. E., Walsh, D. and Volinsky, C. T. JSTOR: · Zbl 0938.62021 |

[110] | . Technical report. Available at www.stat.washington.edu/tech.reports. URL: |

[111] | Copas, J. B. (1983). Regression, prediction, and shrinkage (with discussion). J. Roy. Statist. Soc. Ser. B 45 311-354. JSTOR: · Zbl 0532.62048 |

[112] | Cox, D. R. (1995). The relation between theory and application in statistics (disc: P228-261). Test 4 207-227. · Zbl 0844.62001 |

[113] | de Finetti, B. (1931). Funzione caratteristica di un fenomeno aleatorio. Atti Acad. Naz. Lincei 4 86-133. · JFM 57.0610.01 |

[114] | de Finetti, B. (1974, 1975). Theory of Probability 1 and 2. (Trans. by A. F. M. Smith and A. Machi). Wiley, New York. · Zbl 0328.60002 |

[115] | Dellaportas, P. and Forster, J. J. (1996). Markov chain Monte Carlo model determination for hierarchical and graphical log-linear models. Technical Report, Faculty of Mathematics, Southampton Univ. UK. DiCiccio, T. J., Kass, R. E., Raftery, A. E. and Wasserman, L. · Zbl 0949.62050 |

[116] | . Computing Bayes factors by combining simulation and asymptotic approximations. J. Amer. Statist. Assoc. 92 903-915. Draper, D. (1999a). Discussion of ”Decision models in screening for breast cancer” by G. Parmigiani. In Bayesian Statistics 6 (J. M. Bernardo, J. Berger, P. Dawid and A. F. M. Smith, eds.) 541-543. Oxford Univ. Press. Draper, D. (1999b). Hierarchical modeling, variable selection, and utility. Technical Report, Dept. Mathematical Sciences, Univ. Bath, UK. JSTOR: · Zbl 1050.62520 |

[117] | Draper, D. and Fouskakis, D. (1999). Stochastic optimization methods for cost-effective quality assessment in health. Unpublished manuscript. |

[118] | Featherman, D. and Hauser, R. (1977). Opportunity and Change. Academic Press, New York. |

[119] | Gelfand, A. E., Dey, D. K. and Chang, H. (1992). Model determination using predictive distributions, with implementation via sampling-based methods (with discussion). In Bayesian Statistics 4 (J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith, eds.) 147-167. Oxford Univ. Press. |

[120] | Gelman, A., Meng, X.-L. and Stern, H. (1996). Posterior predictive assessment of model fitness via realized discrepancies. Statist. Sinica 6 733-760. · Zbl 0859.62028 |

[121] | George, E. I. (1987). Multiple shrinkage generalizations of the James-Stein estimator. In Contributions to the Theory and Applications of Statistics A Volume in Honor of Herbert Solomon (A. E. Gelfand, ed.) 397-428. Academic Press, New York. · Zbl 0709.62750 |

[122] | George, E. I. (1999). Discussion of ”Model averaging and model search strategies” by M. Clyde. In Bayesian Statistics 6 (J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith, eds.) 157-185. Oxford University Press. · Zbl 0973.62022 |

[123] | George, E. I. (1999). Discussion of ”Model averaging and model search” by M. Clyde. In Bayesian Statistics 6 (J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith, eds.) Oxford University Press. |

[124] | George, E. I. and Foster, D. P. (1997). Calibration and empirical Bayes variable selection. Technical Report, Dept. MSIS, Univ. Texas, Austin. · Zbl 1029.62008 |

[125] | George, E. I., and McCulloch, R. E. (1997). Approaches for Bayesian variable selection. Statist. Sinica 7 339-373. · Zbl 0884.62031 |

[126] | Godsill, S. (1998). On the relationship between MCMC model uncertainty methods. Technical report, Univ. Cambridge. |

[127] | Good, I. J. (1983). Good Thinking: The Foundations of Probability and Its Applications. Univ. Minnesota Press, Minneapolis. · Zbl 0583.60001 |

[128] | Granger, C. W. J. and Newbold, P. (1976). The use of R² to determine the appropriate transformation of regression variables. J. Econometrics 4 205-210. · Zbl 0333.62043 |

[129] | Greenland, S. (1993). Methods for epidemiologic analyses of multiple exposures-a review and comparative study of maximum-likelihood, preliminary testing, and empirical Bayes regression. Statistics in Medicine 12 717-736. |

[130] | Hacking, I. (1975). The Emergence of Probability. Cambridge University Press. · Zbl 0311.01004 |

[131] | Hansen, M. and Kooperberg, C. (1999). Spline adaptation in extended linear models. Bell Labs Technical Report. Available at cm.bell-labs.com/who/cocteau/papers. |

[132] | Hansen, M. and Yu, B. (1999). Model selection and the principle of minimum description length. Bell Labs Technical Report. Available at cm.bell-labs.com/who/cocteau/papers. |

[133] | Hauser, R. and Kuo, H. (1998). Does the gender composition of sibships affect women’s educational attainment? Journal of Human Resources 33 644-657. |

[134] | Holmes, C. C. and Mallick, B. K. (1997). Bayesian radial basis functions of unknown dimension. Dept. Mathematics technical report, Imperial College, London. |

[135] | Holmes, C. C. and Mallick, B. K. (1998). Perfect simulation for orthogonal model mixing. Dept. Mathematics technical report, Imperial College, London. |

[136] | Kadane, J. B. and Dickey, J. M. (1980). Bayesian decision theory and the simplification of models. In Evaluation of Econometric Models (J. Kmenta and J. Ramsey, eds.) Academic Press, New York. |

[137] | Key, J. T., Pericchi, L. R. and Smith, A. F. M. (1999). Bayesian model choice: what and why? (with discussion). In Bayesian Statistics 6 (J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith, eds.) 343-370. Oxford Univ. Press. · Zbl 0956.62007 |

[138] | Lindley, D. V. and Smith, A. F. M. (1972). Bayes estimates for the linear model (with discussion). J. Roy. Statist. Soc. Ser. B 34 1-41. JSTOR: · Zbl 0246.62050 |

[139] | Mosteller, F. and Tukey, J. W. (1977). Data Analysis and Regression. Addison-Wesley, Reading, MA. |

[140] | Oh, M.-S. (1999). Estimation of posterior density functions from a posterior sample. Comput. Statist. Data Anal. 29 411-427. · Zbl 1042.65507 |

[141] | Propp, J. G. and Wilson, D. B. (1996). Exact sampling with coupled Markov chains and applications to statistical mechanics. Random Structures Algorithms 9 223-252. Raftery, A. E. (1996a). Approximate Bayes factors and accounting for model uncertainty in generalised linear models. Biometrika 83 251-266. Raftery, A. E. (1996b). Hypothesis testing and model selection. In Markov Chain Monte Carlo in Practice (W. R. Gilks and D. Spiegelhalter, eds.) 163-188. Chapman and Hall, London. |

[142] | Raftery, A. E. (1999). Bayes factors and BIC: Comment on ”A Critique of the Bayesian information criterion for model selection.” Sociological Methods and Research 27 411-427. |

[143] | Sclove, S. L., Morris, C. N. and Radhakrishna, R. (1972). Nonoptimality of preliminary-test estimators for the mean of a multivariate normal distribution. Ann. Math. Statist. 43 1481-1490. · Zbl 0249.62029 |

[144] | Viallefont, V., Raftery, A. E. and Richardson, S. (1998). Variable selection and Bayesian Model Averaging in case-control studies. Technical Report 343, Dept. Statistics, Univ. Washington. |

[145] | Wasserman, L. (1998). Asymptotic inference for mixture models using data dependent priors. Technical Report 677, Dept. Statistics, Carnegie-Mellon Univ. · Zbl 0976.62028 |

[146] | Weakliem, D. L. (1999). A critique of the Bayesian information criterion for model selection. Sociological Methods and Research 27 359-397. |

[147] | Western, B. (1996). Vague theory and model uncertainty in macrosociology. Sociological Methodology 26 165-192. |

[148] | Wong, F., Hansen, M. H., Kohn, R. and Smith, M. (1997). Focused sampling and its application to nonparametric and robust regression. Bell Labs technical report. Available at cm.bell-labs.com/who/cocteau/papers. |

This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.