Distribution of the length of the longest common subsequence of two multi-state biological sequences.

*(English)*Zbl 1147.92010Summary: The length of the longest common subsequence (LCS) among two biological sequences has been used as a measure of similarity, and the application of this statistic is of importance in genomic studies. Even for the simple case of two sequences of equal length and composed of binary elements with equal state probabilities, the exact distribution of the length of the LCS remains an open question. This problem is also known as an NP-hard problem in computer science. Apart from combinatorial analysis, using the finite Markov chain embedding technique, we derive the exact distribution for the length of the LCS between two multi-state sequences of different lengths. Numerical results are provided to illustrate the theoretical results.

##### MSC:

92C40 | Biochemistry, molecular biology |

62E15 | Exact distribution theory in statistics |

60J20 | Applications of Markov chains and discrete-time Markov processes on general state spaces (social mobility, learning theory, industrial processes, etc.) |

60J10 | Markov chains (discrete-time Markov processes on discrete state spaces) |

##### Keywords:

multi-state trials; runs and patterns; finite Markov chain imbedding; transition probability matrix
PDF
BibTeX
XML
Cite

\textit{J. C. Fu} and \textit{W. Y. W. Lou}, J. Stat. Plann. Inference 138, No. 11, 3605--3615 (2008; Zbl 1147.92010)

Full Text:
DOI

##### References:

[1] | Aki, S.; Hirano, K., Sooner and later waiting time problems for runs in Markov dependent bivariate trials, Ann. inst. statist. math., 51, 17-29, (1999) · Zbl 0952.60071 |

[2] | Alexander, K.S., The rate of convergence of the Mean length of the longest common subsequence, Ann. appl. probab., 4, 1074-1082, (1994) · Zbl 0812.60014 |

[3] | Booth, H.S.; MacNamara, S.F.; Nielsen, O.M.; Wilson, S.R., An iterative approach to determining the length of the longest common subsequence of two strings, Methodology comput. appl. probab., 6, 401-421, (2004) · Zbl 1056.92044 |

[4] | Bundschuh, R., High precision simulations of the longest common subsequence, European phys. J. B, 22, 533-541, (2001) |

[5] | Chvátal, V.; Sankoff, D., Longest common subsequences of two random sequences, J. appl. probab., 12, 306-315, (1975) · Zbl 0313.60008 |

[6] | Ebneshahrashoob, M.; Sobel, M., Sooner and later waiting time problems for Bernoulli trials: frequency and run quotas, Statist. probab. lett., 9, 5-11, (1990) · Zbl 0695.60016 |

[7] | Fu, J.C., Distribution theory of runs and patterns associated with a sequence of multi-state trials, Statist. sinica, 6, 957-974, (1996) · Zbl 0857.60068 |

[8] | Fu, J.C.; Chang, Y.M., On probability generating functions for waiting time distributions of compound patterns in a sequence of multistate trials, J. appl. probab., 39, 70-80, (2002) · Zbl 1008.60031 |

[9] | Fu, J.C.; Koutras, M.V., Distribution theory of runs: a Markov chain approach, J. amer. statist. assoc., 89, 1050-1058, (1994) · Zbl 0806.60011 |

[10] | Fu, J.C.; Lou, W.Y.W., Distribution theory of runs and patterns and its applications: A finite Markov chain imbedding approach, (2003), World Scientific Publishing Co River Edge, NJ · Zbl 1030.60063 |

[11] | Fu, J.C.; Wang, L.; Lou, W.Y.Y., On exact and large deviation approximation for the distribution of the longest run in a sequence of two-state Markov dependent trials, J. appl. probab., 40, 346-360, (2003) · Zbl 1028.60068 |

[12] | Han, Q.; Hirano, K., Sooner and later waiting time problems for patterns in Markov dependent trials, J. appl. probab., 40, 73-86, (2003) · Zbl 1023.60017 |

[13] | Jiang, T.; Li, M., On the approximation of the shortest common subsequences and longest common subsequence, SIAM J. comput., 24, 1122-1139, (1995) · Zbl 0853.68112 |

[14] | Koutras, M.V., On a waiting time distribution in a sequence of Bernoulli trials, Ann. inst. statist. math., 48, 789-806, (1996) · Zbl 1002.60517 |

[15] | Koutras, M.V., Waiting time distributions associated with runs of fixed length in two-state Markov chains, Ann. inst. statist. math., 49, 123-139, (1997) · Zbl 0913.60019 |

[16] | Koutras, M.V.; Alexandrou, V.A., Runs, scans and urn model distributions: a unified Markov chain approach, Ann. inst. statist. math., 47, 743-766, (1995) · Zbl 0848.60021 |

[17] | Koutras, M.V.; Alexandrou, V.A., Sooner waiting time problems in a sequence of trinary trials, J. appl. probab., 34, 593-609, (1997) · Zbl 0891.60020 |

[18] | Lou, W.Y.W., On runs and longest run tests: a method of finite Markov chain imbedding, J. amer. statist. assoc., 91, 1595-1601, (1996) · Zbl 0881.62086 |

[19] | Sankoff, D., Matching sequences under deletion/insertion constraints, Proc. nat. acad. sci., 69, 4-6, (1972) · Zbl 0235.05013 |

[20] | Sankoff, D.; Cedergren, R.J., A test for nucleotide sequence homology, J. molecular biol., 77, 159-164, (1973) |

[21] | Seneta, E., Nonnegative matrices and Markov chains, (1981), Springer New York · Zbl 0471.60001 |

[22] | Steele, J.M., Long common subsequences and the proximity of two random strings, SIAM J. appl. math., 42, 731-737, (1982) · Zbl 0501.60063 |

[23] | Waterman, M.S., Introduction to computational biology, (1995), Chapman & Hall London · Zbl 0831.92011 |

This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.