A fragmented-periodogram approach for clustering big data time series. (English) Zbl 1474.62214

Summary: We propose and study a new frequency-domain procedure for characterizing and comparing large sets of long time series. Instead of using all the information available from the data, which would be computationally very expensive, we propose some regularization rules to select and summarize the most relevant information for clustering purposes. Essentially, we suggest using a fragmented periodogram computed around the driving cyclical components of interest and comparing the resulting estimates. This procedure is computationally simple, yet able to condense the relevant information of the time series. A simulation exercise shows that the smoothed fragmented periodogram generally works better than the non-smoothed one and no worse than the complete periodogram for medium to large sample sizes. We illustrate this procedure in a study of the evolution of several stock market indices. We further show the effect of recent financial crises on the behaviour of these indices.
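
The idea in the summary can be illustrated with a minimal Python sketch. It is not the authors' implementation: the summary does not specify the regularization rules, the smoother, or the dissimilarity measure, so the band half-width, the moving-average smoother, the variance normalisation, and the Euclidean distance between log-periodogram ordinates are all assumptions made here for illustration. The function names (`periodogram`, `fragmented_periodogram`, `fragment_distance`) are hypothetical, and both series are assumed to have the same length.

```python
import numpy as np


def periodogram(x):
    """Raw periodogram at the Fourier frequencies 2*pi*j/n, j = 1..floor(n/2)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    dft = np.fft.rfft(x - x.mean())
    freqs = 2 * np.pi * np.arange(dft.size) / n
    pgram = np.abs(dft) ** 2 / (2 * np.pi * n)
    return freqs[1:], pgram[1:]  # drop the zero frequency


def fragmented_periodogram(x, centers, half_width, smooth_span=1):
    """Periodogram ordinates kept only in bands around chosen frequencies.

    Assumptions (not taken from the source): ordinates are divided by the
    sample variance, smoothed with a simple moving average of `smooth_span`
    points, and kept when within `half_width` radians of a band center.
    """
    freqs, pgram = periodogram(x)
    pgram = pgram / np.var(x)
    if smooth_span > 1:
        kernel = np.ones(smooth_span) / smooth_span
        pgram = np.convolve(pgram, kernel, mode="same")
    mask = np.zeros(freqs.size, dtype=bool)
    for c in centers:
        mask |= np.abs(freqs - c) <= half_width
    return pgram[mask]


def fragment_distance(x, y, centers, half_width, smooth_span=1):
    """Euclidean distance between log fragmented periodograms (an assumed metric)."""
    if len(x) != len(y):
        raise ValueError("this sketch assumes series of equal length")
    fx = fragmented_periodogram(x, centers, half_width, smooth_span)
    fy = fragmented_periodogram(y, centers, half_width, smooth_span)
    return float(np.sqrt(np.sum((np.log(fx) - np.log(fy)) ** 2)))


# Toy data: a and b share a 16-step cycle, c has an 8-step cycle.
rng = np.random.default_rng(0)
n = 512
t = np.arange(n)
a = np.sin(2 * np.pi * t / 16) + 0.5 * rng.standard_normal(n)
b = np.sin(2 * np.pi * t / 16 + 1.0) + 0.5 * rng.standard_normal(n)
c = np.sin(2 * np.pi * t / 8) + 0.5 * rng.standard_normal(n)

centers = [2 * np.pi / 16, 2 * np.pi / 8]  # cycles of interest (assumed known)
d_ab = fragment_distance(a, b, centers, half_width=0.05, smooth_span=5)
d_ac = fragment_distance(a, c, centers, half_width=0.05, smooth_span=5)
print("d(a, b) =", d_ab)
print("d(a, c) =", d_ac)
```

In this toy setting the two series sharing a cycle should come out closer to each other than to the third. The resulting pairwise distances could then be passed to any standard clustering routine, and only the ordinates inside the chosen bands need to be stored per series, which is where the computational saving comes from.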


62H30 Classification and discrimination; cluster analysis (statistical aspects)
62M10 Time series, auto-correlation, regression, etc. in statistics (GARCH)
62M15 Inference from stochastic processes and spectral analysis
62R07 Statistical aspects of big data and data science
62P05 Applications of statistics to actuarial sciences and financial mathematics

