Clustering gene expression time course data using mixtures of multivariate \(t\)-distributions. (English) Zbl 1236.62068

Summary: Clustering gene expression time course data is an important problem in bioinformatics because understanding which genes behave similarly can lead to the discovery of important biological information. Statistically, the problem of clustering time course data is a special case of the more general problem of clustering longitudinal data. In this paper, a very general and flexible model-based technique is used to cluster longitudinal data. Mixtures of multivariate t-distributions are utilized, with a linear model for the mean and a modified Cholesky-decomposed covariance structure. Constraints are placed upon the covariance structure, leading to a novel family of mixture models, including parsimonious models. In addition to model-based clustering, these models are also used for model-based classification, i.e., semi-supervised clustering. Parameters, including the component degrees of freedom, are estimated using an expectation-maximization algorithm, and two different approaches to model selection are considered. The models are applied to simulated data to illustrate their efficacy; this includes a comparison with their Gaussian analogues. The use of these Gaussian analogues with a linear model for the mean is novel in itself. Our family of multivariate t mixture models is then applied to two real gene expression time course data sets and the results are discussed. We conclude with a summary, suggestions for future work, and a discussion about constraining the degrees of freedom parameter.
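The modified Cholesky-decomposed covariance structure mentioned in the summary can be illustrated numerically. The sketch below is not the authors' implementation; it assumes the usual form of the decomposition, in which a covariance matrix Sigma satisfies T Sigma T' = D with T unit lower triangular and D diagonal, so that the inverse of Sigma equals T' D^{-1} T. The function name `modified_cholesky` is hypothetical and chosen only for this example.

```python
import numpy as np

def modified_cholesky(sigma):
    """Return (T, D) with T @ sigma @ T.T == D (illustrative sketch).

    T is unit lower triangular and D is diagonal with positive entries.
    Assumes sigma is symmetric positive definite.
    """
    # Standard Cholesky factor: sigma = L @ L.T, L lower triangular.
    L = np.linalg.cholesky(sigma)
    s = np.diag(L)
    # Scale column j of L by 1 / s[j] to obtain a unit lower triangular C,
    # so that sigma = C @ diag(s**2) @ C.T.
    C = L / s
    # T is the inverse of C; the inverse of a unit lower triangular
    # matrix is again unit lower triangular.
    T = np.linalg.inv(C)
    D = np.diag(s ** 2)
    return T, D
```

For a covariance matrix over ordered time points, the below-diagonal entries of T and the diagonal of D are the quantities on which constraints of the kind described in the summary would typically be placed; which constraints the paper actually uses is not stated in this record.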

MSC:

62H30 Classification and discrimination; cluster analysis (statistical aspects)
92C40 Biochemistry, molecular biology
62H10 Multivariate distribution of statistics
62H12 Estimation in multivariate analysis
65C60 Computational problems in statistics (MSC2010)
62P10 Applications of statistics to biology and medical sciences; meta analysis

Software:

PGMM
