On the probability of pattern matching in nonaligned DNA sequences: A finite Markov chain imbedding approach.

*(English)* Zbl 0941.62114
Glaz, Joseph (ed.) et al., Scan statistics and applications. Boston: Birkhäuser. Statistics for Industry and Technology. 287-302 (1999).

Summary: Mathematically, a DNA segment can be viewed as a sequence of four-state \((A,C,G,T)\) trials, and a perfect match of size \(M\) occurs when two DNA sequences have at least one identical subsequence (or pattern) of length \(M\). Pattern matching probabilities are crucial for statistically rigorous comparisons of DNA (and other) sequences, and many bounds and approximations of such probabilities have recently been developed. There are few results on exact probabilities, especially for trials with unequal state probabilities, and no exact analytical formulae for the pattern matching probability involving arbitrarily long nonaligned sequences.
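
To make the definition of a perfect match of size \(M\) concrete, the following is a minimal sketch and not code from the chapter. It assumes that "identical subsequence" means a contiguous run of length \(M\) occurring anywhere in each sequence (the nonaligned case). The function names `has_perfect_match` and `match_probability_bruteforce` are hypothetical, and the brute-force enumeration is only feasible for very short sequences; it serves as a reference value rather than an efficient method.

```python
from itertools import product

ALPHABET = "ACGT"


def has_perfect_match(s1, s2, m):
    """Return True if s1 and s2 share at least one identical contiguous run of length m."""
    # Collect every length-m window of s1, then look for any of them in s2.
    windows = {s1[i:i + m] for i in range(len(s1) - m + 1)}
    return any(s2[j:j + m] in windows for j in range(len(s2) - m + 1))


def match_probability_bruteforce(n1, n2, m, probs):
    """Exact P(perfect match of size m) for two independent i.i.d. sequences
    of lengths n1 and n2, by enumerating all 4**(n1 + n2) sequence pairs.

    probs maps each letter of ALPHABET to its state probability.
    """
    def seq_prob(s):
        p = 1.0
        for ch in s:
            p *= probs[ch]
        return p

    total = 0.0
    for t1 in product(ALPHABET, repeat=n1):
        s1 = "".join(t1)
        p1 = seq_prob(s1)
        for t2 in product(ALPHABET, repeat=n2):
            s2 = "".join(t2)
            if has_perfect_match(s1, s2, m):
                total += p1 * seq_prob(s2)
    return total
```

For example, `has_perfect_match("ACGTA", "TTCGT", 3)` is true because both sequences contain `CGT`, even though it starts at different positions in each.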

Here, a simple and efficient method based on the finite Markov chain imbedding technique is developed to obtain the exact probability of perfect matching for i.i.d. four-state trials with either equal or unequal state probabilities. A large deviation approximation is derived for very long sequences, and numerical examples are given to illustrate the results.
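
The general idea of finite Markov chain imbedding can be illustrated on a deliberately simpler problem than the one treated in the chapter: the probability that one fixed pattern occurs somewhere in a single i.i.d. four-state sequence. The sketch below is an assumption-laden illustration, not the authors' construction for two nonaligned sequences. The chain's state is the length of the longest pattern prefix matched so far, the full-match state is absorbing, and the function name `pattern_occurrence_probability` is hypothetical.

```python
def pattern_occurrence_probability(pattern, n, probs):
    """P(pattern occurs at least once in an i.i.d. sequence of length n).

    State k (0 <= k <= m) means the last k letters seen equal the first k
    letters of the pattern; state m (full match) is absorbing.
    probs maps each letter to its state probability (equal or unequal).
    """
    m = len(pattern)
    alphabet = sorted(probs)

    def step(k, c):
        # Longest prefix of the pattern that is a suffix of pattern[:k] + c.
        s = pattern[:k] + c
        for length in range(min(m, len(s)), 0, -1):
            if s.endswith(pattern[:length]):
                return length
        return 0

    # dist[k] is the probability of being in state k after the trials so far.
    dist = [0.0] * (m + 1)
    dist[0] = 1.0
    for _ in range(n):
        new = [0.0] * (m + 1)
        new[m] = dist[m]  # absorbing state keeps its mass
        for k in range(m):
            if dist[k] == 0.0:
                continue
            for c in alphabet:
                new[step(k, c)] += dist[k] * probs[c]
        dist = new
    return dist[m]
```

The cost is proportional to the sequence length times the number of states, so unequal state probabilities need no special handling: they only change the transition weights.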

For the entire collection see [Zbl 0919.00015].

##### MSC:

62P10 | Applications of statistics to biology and medical sciences; meta analysis |

92C40 | Biochemistry, molecular biology |

60J20 | Applications of Markov chains and discrete-time Markov processes on general state spaces (social mobility, learning theory, industrial processes, etc.) |