## Greedy Gaussian segmentation of multivariate time series.(English)Zbl 1459.62170

Summary: We consider the problem of breaking a multivariate (vector) time series into segments over which the data is well explained as independent samples from a Gaussian distribution. We formulate this as a covariance-regularized maximum likelihood problem, which can be reduced to a combinatorial optimization problem of searching over the possible breakpoints, or segment boundaries. This problem can be solved using dynamic programming, with complexity that grows with the square of the time series length. We propose a heuristic method that approximately solves the problem in linear time with respect to this length, and always yields a locally optimal choice, in the sense that no change of any one breakpoint improves the objective. Our method, which we call greedy Gaussian segmentation (GGS), easily scales to problems with vectors of dimension over 1000 and time series of arbitrary length. We discuss methods that can be used to validate such a model using data, and also to automatically choose appropriate values of the two hyperparameters in the method. Finally, we illustrate our GGS approach on financial time series and Wikipedia text data.

### MSC:

 62M10 Time series, auto-correlation, regression, etc. in statistics (GARCH) 62H30 Classification and discrimination; cluster analysis (statistical aspects) 62P05 Applications of statistics to actuarial sciences and financial mathematics

### Software:

GitHub; rarhsmm; scout
Full Text:

### References:

 [1] Abonyi, J.; Feil, B.; Nemeth, S.; Arva, P., Modified Gath-Geva clustering for fuzzy segmentation of multivariate time-series, Fuzzy Sets Syst, 149, 39-56, (2005) · Zbl 1071.68543 [2] Alexander C (2000) A primer on the orthogonal GARCH model. Unpublished manuscript, ISMA Center, University of Reading, U.K [3] Ang, A.; Timmermann, A., Regime changes and financial markets, Annu Rev Fin Econ, 4, 313-337, (2012) [4] Bakis, R., Continuous speech recognition via centisecond acoustic states, J Acoust Soc Am, 59, s97, (1976) [5] Bellman, R., On the approximation of curves by line segments using dynamic programming, Commun ACM, 4, 284, (1961) · Zbl 0100.12901 [6] Bickel, PJ; Levina, E., Regularized estimation of large covariance matrices, Ann Stat, 36, 199-227, (2008) · Zbl 1132.62040 [7] Basseville M, Nikiforov IV (1993) Detection of abrupt changes: theory and application, vol 104. Prentice Hall, Englewood Cliffs · Zbl 1407.62012 [8] Bauwens, L.; Rombouts, J., On marginal likelihood computation in change-point models, Comput Stat Data Anal, 56, 3415-3429, (2012) · Zbl 1255.62234 [9] Booth, NB; Smith, AFM, A Bayesian approach to retrospective identification of change-points, J Econom, 19, 7-22, (1982) · Zbl 0492.62077 [10] Borenstein, E.; Ullman, S., Combined top-down/bottom-up segmentation, IEEE Trans Pattern Anal Mach Intell, 30, 2109-2125, (2008) [11] Bulla, J., Hidden Markov models with $$t$$ components. Increased persistence and other aspects, Quant Fin, 11, 459-475, (2011) [12] Bleakley K, Vert J-P (2011) The group fused lasso for multiple change-point detection. arXiv preprint arXiv:1106.4199 [13] Chouakria-Douzal, A.; Berthold, MR (ed.); Lenz, H-J (ed.); Bradley, E. (ed.); Kruse, R. (ed.); Borgelt, C. (ed.), Compression technique preserving correlations of a multivariate temporal sequence, 566-577, (2003), Berlin [14] Cheon, S.; Kim, J., Multiple change-point detection of multivariate mean vectors with the Bayesian approach, Comput Stat Data Anal, 54, 406-415, (2010) · Zbl 1464.62046 [15] Cappé O, Moulines E, Rydén T (2005) Inference in hidden Markov models. Springer, New York · Zbl 1080.62065 [16] Crosier, RB, Multivariate generalizations of cumulative sum quality-control schemes, Technometrics, 30, 291-303, (1988) · Zbl 0651.62095 [17] Candès, EJ; Wakin, MB; Boyd, S., Enhancing sparsity by reweighted $$\ell _1$$ minimization, J Fourier Anal Appl, 14, 877-905, (2008) · Zbl 1176.94014 [18] Gooijer, J., Detecting change-points in multidimensional stochastic processes, Comput Stat Data Anal, 51, 1892-1903, (2006) · Zbl 1157.62481 [19] Douglas, DH; Peucker, TK, Algorithms for the reduction of the number of points equired to represent a digitized line or its caricature, Cartogr Int J Geogr Inf Geovis, 10, 112-122, (1973) [20] Esling, P.; Agon, C., Time-series data mining, ACM Comput Surv, 45, 12, (2012) · Zbl 1293.68104 [21] Fiecas, M.; Franke, J.; Sachs, R.; Kamgaing, JT, Shrinkage estimation for multivariate hidden Markov models, J Am Stat Assoc, 112, 424-435, (2017) [22] Fragkou, P.; Petridis, V.; Kehagias, A., A dynamic programming algorithm for linear text segmentation, J Intell Inf Syst, 23, 179-197, (2004) · Zbl 1088.68593 [23] Fenn, DJ; Porter, MA; Williams, S.; McDonald, M.; Johnson, NF; Jones, NS, Temporal evolution of financial-market correlations, Phys Rev E, 84, 026109, (2011) [24] Guralnik V, Srivastava J (1999) Event detection from time series data. In: Proceedings of the fifth ACM SIGKDD international conference on knowledge discovery and data mining, pp 33-42 [25] Ge, X.; Smyth, P., Segmental semi-Markov models for endpoint detection in plasma etching, IEEE Trans Semicond Eng, 259, 201-209, (2001) [26] Gustafsson F (2000) Adaptive filtering and change detection. Wiley, West Sussex [27] Galeano, P.; Wied, D., Multiple break detection in the correlation structure of random variables, Comput Stat Data Anal, 76, 262-282, (2014) · Zbl 06983978 [28] Huang, JZ; Liu, N.; Pourahmadi, M.; Liu, L., Covariance matrix selection and estimation via penalised normal likelihood, Biometrika, 93, 85-98, (2006) · Zbl 1152.62346 [29] Hu, B.; Rakthanmanon, T.; Hao, Y.; Evans, S.; Lonardi, S.; Keogh, E., Using the minimum description length to discover the intrinsic cardinality and dimensionality of time series, Data Min Knowl Discov, 29, 358-399, (2015) · Zbl 1403.68190 [30] Hallac D, Sharang A, Stahmann R, Lamprecht A, Huber M, Roehder M, Sosič R, Leskovec J (2016) Driver identification using automobile sensor data from a single turn. In IEEE 19th international conference on intelligent transport systems, pp 953-958 [31] Hastie T, Tibshirani R, Friedman J (2009) The elements of statistical learning, 2nd edn. Springer, New York · Zbl 1273.62005 [32] Katz I, Crammer K (2014) Outlier-robust convex segmentation. arXiv preprint arXiv:1411.4503 [33] Keogh, E.; Chu, S.; Hart, D.; Pazzani, M.; Last, M. (ed.); Kandel, A. (ed.); Bunke, H. (ed.), Segmenting time series: a survey and novel approach, (2004), Singapore [34] Kim, S-J; Koh, K.; Boyd, S.; Gorinevsky, D., $$\ell _1$$ trend filtering, SIAM Rev, 51, 339-360, (2009) · Zbl 1171.37033 [35] Kehagias, A.; Nidelkou, E.; Petridis, V., A dynamic programming segmentation procedure for hydrological and environmental time series, Stoch Environ Res Risk Assess, 20, 77-94, (2006) · Zbl 1096.62084 [36] Lee, C-B, Bayesian analysis of a change-point in exponential families with applications, Comput Stat Data Anal, 27, 195-208, (1998) · Zbl 1042.62526 [37] Lee, J.; Hastie, T., Learning the structure of mixed graphical models, J Comput Graph Stat, 24, 230-253, (2015) [38] Li, J., Nonparametric multivariate statistical process control charts: a hypothesis testing-based approach, J Nonparametr Stat, 27, 383-400, (2015) · Zbl 1330.62203 [39] Ledoit, O.; Wolf, M., A well-conditioned estimator for large-dimensional covariance matrices, J Multivar Anal, 88, 365-411, (2004) · Zbl 1032.62050 [40] Meucci, A., Managing diversification, Risk, 22, 74-79, (2009) [41] Nystrup, P.; Hansen, BW; Larsen, HO; Madsen, H.; Lindström, E., Dynamic allocation or diversification: a regime-based approach to multiple assets, J Portf Manag, 44, 62-73, (2017) [42] Nystrup, P.; Hansen, BW; Madsen, H.; Lindström, E., Regime-based versus static asset allocation: letting the data speak, J Portf Manag, 42, 103-109, (2015) [43] Nystrup, P.; Hansen, BW; Madsen, H.; Lindström, E., Detecting change points in VIX and S&P 500: a new approach to dynamic asset allocation, J Asset Manag, 17, 361-374, (2016) [44] Nystrup, P.; Madsen, H.; Lindström, E., Long memory of financial time series and hidden Markov models with time-varying parameters, J Forecast, 36, 989-1002, (2017) · Zbl 1397.60104 [45] Partovi, MH; Caputo, M., Principal portfolios: recasting the efficient frontier, Econ Bull, 7, 1-10, (2004) [46] Picard, F.; Lebarbier, É.; Budinská, E.; Robin, S., Joint segmentation of multivariate Gaussian processes using mixed linear models, Comput Stat Data Anal, 55, 1160-1170, (2011) · Zbl 1284.62058 [47] Rajagopalan, V.; Ray, A., Symbolic time series analysis via wavelet-based partitioning, Signal Process, 86, 3309-3320, (2006) · Zbl 1172.94532 [48] Rydén, T.; Teräsvirta, T.; Åsbrink, S., Stylized facts of daily return series and the hidden Markov model, J Appl Econometr, 13, 217-244, (1998) [49] Samé, A.; Chamroukhi, F.; Govaert, G.; Aknin, P., Model-based clustering and segmentation of time series with changes in regime, Adv Data Anal Classif, 5, 301-321, (2011) · Zbl 1274.62427 [50] Son, YS; Kim, S., Bayesian single change point detection in a sequence of multivariate normal observations, Statistics, 39, 373-387, (2005) · Zbl 1089.62029 [51] Sheikh, A.; Sun, J., Regime change: Implications of macroeconomic shifts on asset class and portfolio performance, J Invest, 21, 36-54, (2012) [52] Tansey W, Padilla OHM, Suggala AS, Ravikumar P (2015) Vector-space Markov random fields via exponential families. In: Proceedings of the 32nd international conference on machine learning, volume 1, pp 684-692 [53] Tibshirani, R.; Saunders, M.; Rosset, S.; Zhu, J.; Knight, K., Sparsity and smoothness via the fused lasso, J R Stat Soc Ser B Stat Methodol, 67, 91-108, (2005) · Zbl 1060.62049 [54] Venter, JH; Steel, SJ, Finding multiple abrupt change points, Comput Stat Data Anal, 22, 481-504, (1996) · Zbl 0900.62230 [55] Verbeek, J.; Vlassis, N.; Kröse, B., Efficient greedy learning of Gaussian mixture models, Neural Comput, 15, 469-485, (2003) · Zbl 1047.68114 [56] Wahlberg, B.; Boyd, S.; Annergren, M.; Wang, Y., An ADMM algorithm for a class of total variation regularized estimation problems, IFAC Proc Vol, 45, 83-88, (2012) [57] Welford, BP, Note on a method for calculating corrected sums of squares and products, Technometrics, 4, 419-420, (1962) [58] Wahlberg B, Rojas C, Annergren M (2011) On $$\ell _1$$ mean and variance filtering. In: Proceedings of the forty fifth Asilomar conference on signals, systems and computers, pp 1913-1916 [59] Witten, D.; Tibshirani, R., Covariance-regularized regression and classification for high dimensional problems, J R Stat Soc Ser B Stat Methodol, 71, 615-636, (2009) · Zbl 1250.62033 [60] Xu Z, Liu Y (2017) Regularized autoregressive hidden semi Markov model. https://github.com/cran/rarhsmm [61] Xu, N., A survey of sensor network applications, IEEE Commun Mag, 40, 102-114, (2002) [62] Zangwill WI, Garcia CB (1981) Pathways to solutions, fixed points, and equilibria. Prentice Hall, Englewood Cliffs · Zbl 0512.90070
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.