# zbMATH — the first resource for mathematics

Distribution of the length of the longest common subsequence of two multi-state biological sequences. (English) Zbl 1147.92010
Summary: The length of the longest common subsequence (LCS) among two biological sequences has been used as a measure of similarity, and the application of this statistic is of importance in genomic studies. Even for the simple case of two sequences of equal length and composed of binary elements with equal state probabilities, the exact distribution of the length of the LCS remains an open question. This problem is also known as an NP-hard problem in computer science. Apart from combinatorial analysis, using the finite Markov chain embedding technique, we derive the exact distribution for the length of the LCS between two multi-state sequences of different lengths. Numerical results are provided to illustrate the theoretical results.

##### MSC:
 92C40 Biochemistry, molecular biology 62E15 Exact distribution theory in statistics 60J20 Applications of Markov chains and discrete-time Markov processes on general state spaces (social mobility, learning theory, industrial processes, etc.) 60J10 Markov chains (discrete-time Markov processes on discrete state spaces)
Full Text:
##### References:
  Aki, S.; Hirano, K., Sooner and later waiting time problems for runs in Markov dependent bivariate trials, Ann. inst. statist. math., 51, 17-29, (1999) · Zbl 0952.60071  Alexander, K.S., The rate of convergence of the Mean length of the longest common subsequence, Ann. appl. probab., 4, 1074-1082, (1994) · Zbl 0812.60014  Booth, H.S.; MacNamara, S.F.; Nielsen, O.M.; Wilson, S.R., An iterative approach to determining the length of the longest common subsequence of two strings, Methodology comput. appl. probab., 6, 401-421, (2004) · Zbl 1056.92044  Bundschuh, R., High precision simulations of the longest common subsequence, European phys. J. B, 22, 533-541, (2001)  Chvátal, V.; Sankoff, D., Longest common subsequences of two random sequences, J. appl. probab., 12, 306-315, (1975) · Zbl 0313.60008  Ebneshahrashoob, M.; Sobel, M., Sooner and later waiting time problems for Bernoulli trials: frequency and run quotas, Statist. probab. lett., 9, 5-11, (1990) · Zbl 0695.60016  Fu, J.C., Distribution theory of runs and patterns associated with a sequence of multi-state trials, Statist. sinica, 6, 957-974, (1996) · Zbl 0857.60068  Fu, J.C.; Chang, Y.M., On probability generating functions for waiting time distributions of compound patterns in a sequence of multistate trials, J. appl. probab., 39, 70-80, (2002) · Zbl 1008.60031  Fu, J.C.; Koutras, M.V., Distribution theory of runs: a Markov chain approach, J. amer. statist. assoc., 89, 1050-1058, (1994) · Zbl 0806.60011  Fu, J.C.; Lou, W.Y.W., Distribution theory of runs and patterns and its applications: A finite Markov chain imbedding approach, (2003), World Scientific Publishing Co River Edge, NJ · Zbl 1030.60063  Fu, J.C.; Wang, L.; Lou, W.Y.Y., On exact and large deviation approximation for the distribution of the longest run in a sequence of two-state Markov dependent trials, J. appl. probab., 40, 346-360, (2003) · Zbl 1028.60068  Han, Q.; Hirano, K., Sooner and later waiting time problems for patterns in Markov dependent trials, J. appl. probab., 40, 73-86, (2003) · Zbl 1023.60017  Jiang, T.; Li, M., On the approximation of the shortest common subsequences and longest common subsequence, SIAM J. comput., 24, 1122-1139, (1995) · Zbl 0853.68112  Koutras, M.V., On a waiting time distribution in a sequence of Bernoulli trials, Ann. inst. statist. math., 48, 789-806, (1996) · Zbl 1002.60517  Koutras, M.V., Waiting time distributions associated with runs of fixed length in two-state Markov chains, Ann. inst. statist. math., 49, 123-139, (1997) · Zbl 0913.60019  Koutras, M.V.; Alexandrou, V.A., Runs, scans and urn model distributions: a unified Markov chain approach, Ann. inst. statist. math., 47, 743-766, (1995) · Zbl 0848.60021  Koutras, M.V.; Alexandrou, V.A., Sooner waiting time problems in a sequence of trinary trials, J. appl. probab., 34, 593-609, (1997) · Zbl 0891.60020  Lou, W.Y.W., On runs and longest run tests: a method of finite Markov chain imbedding, J. amer. statist. assoc., 91, 1595-1601, (1996) · Zbl 0881.62086  Sankoff, D., Matching sequences under deletion/insertion constraints, Proc. nat. acad. sci., 69, 4-6, (1972) · Zbl 0235.05013  Sankoff, D.; Cedergren, R.J., A test for nucleotide sequence homology, J. molecular biol., 77, 159-164, (1973)  Seneta, E., Nonnegative matrices and Markov chains, (1981), Springer New York · Zbl 0471.60001  Steele, J.M., Long common subsequences and the proximity of two random strings, SIAM J. appl. math., 42, 731-737, (1982) · Zbl 0501.60063  Waterman, M.S., Introduction to computational biology, (1995), Chapman & Hall London · Zbl 0831.92011
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.