×

Non-parametric change-point estimation using string matching algorithms. (English) Zbl 1330.62322

Summary: Given the output of a data source taking values in a finite alphabet, we wish to estimate change-points, that is times when the statistical properties of the source change. Motivated by ideas of match lengths in information theory, we introduce a novel non-parametric estimator which we call CRECHE (CRossings Enumeration CHange Estimator). We present simulation evidence that this estimator performs well, both for simulated sources and for real data formed by concatenating text sources. For example, we show that we can accurately estimate the point at which a source changes from a Markov chain to an IID source with the same stationary distribution. Our estimator requires no assumptions about the form of the source distribution, and avoids the need to estimate its probabilities. Further, establishing a fluid limit and using martingale arguments.

MSC:

62L10 Sequential statistical analysis
62M09 Non-Markovian processes: estimation
68W32 Algorithms on strings
PDFBibTeX XMLCite
Full Text: DOI arXiv

References:

[1] Aggarwal R, Inclan C, Leal R (1999) Volatility in emerging stock markets. J Finan Quant Anal 34(1):33-55 · doi:10.2307/2676245
[2] Algoet PH, Cover TM (1988) A sandwich proof of the Shannon-McMillan-Breiman theorem. Ann Probab 16:899-909 · Zbl 0653.28013 · doi:10.1214/aop/1176991794
[3] Alzaid AA, Al-Osh M (1990) An integer-valued pth-order autoregressive structure (INAR(p)) process. J Appl Probab 27(2):314-324 · Zbl 0704.62081 · doi:10.2307/3214650
[4] Arratia R, Waterman MS (1985) Critical phenomena in sequence matching. Ann Probab 13(4):1236-1249 · Zbl 0576.60058 · doi:10.1214/aop/1176992808
[5] Arratia R, Waterman MS (1989) The Erdős-Rényi strong law for pattern matching with a given proportion of mismatches. Ann Probab 17(3):1152-1169 · Zbl 0688.62019 · doi:10.1214/aop/1176991262
[6] Barnett TP, Pierce DW, Schnur R (2001) Detection of anthropogenic climate change in the world’s oceans. Science 292(5515):270-274 · doi:10.1126/science.1058304
[7] Bell C, Gordon L, Pollak M (1994) An efficient nonparametric detection scheme and its application to surveillance of a Bernoulli process with unknown baseline. Lect Notes Monogr Ser 23:7-27 · Zbl 1157.62394 · doi:10.1214/lnms/1215463111
[8] Ben Hariz S, Wylie JJ, Zhang Q (2007) Optimal rate of convergence for nonparametric change-point estimators for nonstationary sequences. Ann Stat 35(4):1802-1826 · Zbl 1147.62043 · doi:10.1214/009053606000001596
[9] Braun V, Braun RK, Muller HG (2000) Multiple changepoint fitting via quasilikelihood, with application to DNA sequence segmentation. Biometrika 87(2):301-314 · Zbl 0963.62067 · doi:10.1093/biomet/87.2.301
[10] Brodsky, BE; Darkhovsky, BS, Nonparametric methods in change-point problems (1993), Dordrecht · Zbl 0779.62031
[11] Brodsky, BE; Darkhovsky, BS, Non-parametric statistical diagnosis (2000), Dordrecht · Zbl 0995.62031
[12] Cai H, Kulkarni SR, Verdú S (2006) Universal divergence estimation for finite-alphabet sources. IEEE Trans Inform Theory 52(8):3456-3475 · Zbl 1309.94053 · doi:10.1109/TIT.2006.878182
[13] Carlstein E (1988) Nonparametric change-point estimation. Ann Stat 16(1):188-197 · Zbl 0637.62041 · doi:10.1214/aos/1176350699
[14] Cover TM, Thomas JA (1991) Elements of information theory. Wiley, New York · Zbl 0762.94001 · doi:10.1002/0471200611
[15] Darling RWR (2002) Fluid limits of pure jump markov processes: a practical guide. arXiv:math/0210109 · Zbl 0801.94004
[16] Dümbgen L (1991) The asymptotic behavior of some nonparametric change-point estimators. Ann Stat 19(3):1471-1495 · Zbl 0776.62032 · doi:10.1214/aos/1176348257
[17] Frisén M, Maré JD (1991) Optimal surveillance. Biometrika 78(2):271-280 · Zbl 0746.62083 · doi:10.1093/biomet/78.2.271
[18] Gao Y, Kontoyiannis I, Bienenstock E (2008) Estimating the entropy of binary time series: methodology, some theory and a simulation study. Entropy 10(2):71-99 · Zbl 1179.94045 · doi:10.3390/entropy-e10020071
[19] Girón J, Ginebra J, Riba A (2005) Bayesian analysis of a multinomial sequence and homogeneity of literary style. Am Stat 59(1):19-30 · doi:10.1198/000313005X21311
[20] Goldenshluger A, Tsybakov A, Zeevi A (2006) Optimal change-point estimation from indirect observations. Ann Stat 34(1):350-372 · Zbl 1091.62021 · doi:10.1214/009053605000000750
[21] Gordon L, Pollak M (1994) An efficient sequential nonparametric scheme for detecting a change of distribution. Ann Stat 22(2):763-804 · Zbl 0816.62066 · doi:10.1214/aos/1176325495
[22] Grassberger P (1989) Estimating the information content of symbol sequences and efficient codes. IEEE Trans Inform Theory 35:669-675 · doi:10.1109/18.30993
[23] Horváth L (1993) The maximum likelihood method for testing changes in the parameters of normal observations. Ann Stat 21(2):671-680 · Zbl 0778.62016 · doi:10.1214/aos/1176349143
[24] Killick R, Fearnhead P, Eckley IA (2012) Optimal detection of changepoints with a linear computational cost. J Am Stat Assoc 107(500):1590-1598 · Zbl 1258.62091 · doi:10.1080/01621459.2012.737745
[25] Kim H, Rozovskii BL, Tartakovsky AG (2004) A nonparametric multichart CUSUM test for rapid detection of DOS attacks in computer networks. Int J Comput Inform Sci 2(3):149-158
[26] Kontoyiannis, I.; Suhov, YM; Kelly, FP (ed.), Prefixes and the entropy rate for long-range sources, 89-98 (1993), New York · Zbl 0855.60034
[27] Nguyen X, Wainwright M, Jordan M (2005) Nonparametric decentralized detection using kernel methods. IEEE Trans Signal Process 53(11):4053-4066 · Zbl 1370.94337 · doi:10.1109/TSP.2005.857020
[28] Ornstein DS, Weiss B (1990) How sampling reveals a process. Ann Probab 18:905-930 · Zbl 0709.60036 · doi:10.1214/aop/1176990729
[29] Ornstein DS, Weiss B (1993) Entropy and data compression schemes. IEEE Trans Inform Theory 39:78-83 · Zbl 0764.94003 · doi:10.1109/18.179344
[30] Pollak M (1985) Optimal detection of a change in distribution. Ann Stat 13(1):206-227 · Zbl 0573.62074 · doi:10.1214/aos/1176346587
[31] Poor HV, Hadjiliadis O (2009) Quickest detection. Cambridge University Press, Cambridge · Zbl 1271.62015
[32] Quas AN (1999) An entropy estimator for a class of infinite processes. Theory Probab Appl 43(3):496-507 · Zbl 1047.28500 · doi:10.1137/S0040585X97977100
[33] Rényi A (1956) A characterization of poisson processes. Magyar Tud Akad Mat Kutató Int Közl 1:519-527 · Zbl 0103.11503
[34] Riba A, Ginebra J (2005) Change-point estimation in a multinomial sequence and homogeneity of literary style. J Appl Stat 32(1):61-74 · Zbl 1121.62476 · doi:10.1080/0266476052000330295
[35] Scott AJ, Knott M (1974) A cluster analysis method for grouping means in the analysis of variance. Biometrics 30(3):507-512 · Zbl 0284.62044 · doi:10.2307/2529204
[36] Shields PC (1992) Entropy and prefixes. Ann Probab 20:403-409 · Zbl 0765.28016 · doi:10.1214/aop/1176989934
[37] Shields PC (1997) String matching bounds via coding. Ann Probab 25:329-336 · Zbl 0873.60029 · doi:10.1214/aop/1024404290
[38] Wellner JA (1977) A martingale inequality for the empirical process. Ann Probab 5(2):303-308 · Zbl 0369.60051 · doi:10.1214/aop/1176995856
[39] Williams D (1991) Probability with martingales. Cambridge University Press, Cambridge · Zbl 0722.60001 · doi:10.1017/CBO9780511813658
[40] Ziv J, Merhav N (1993) A measure of relative entropy between individual sequences with application to universal classification. IEEE Trans Inform Theory 39(4):1270-1279 · Zbl 0801.94004 · doi:10.1109/18.243444
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.