
Significance levels for biological sequence comparison using non-linear similarity functions. (English) Zbl 0637.92006

Summary: A class of nonlinear similarity functions \(s_1\) has been proposed for comparing subalignments of biological sequences. The distribution of maximal \(s_1\)-similarities is well approximated by the extreme value distribution. The significance levels of \(s_1\) are studied for a variety of nucleotide frequency distributions as well as for several matrices of amino acid substitution costs. The significance levels of \(s_1\) are also explored for comparisons of three biological sequences. Several previously described subalignments of bovine proenkephalin and porcine prodynorphin are shown to be highly significant.
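
As an illustration of how a significance level can be read off an extreme value approximation, the following minimal sketch assumes the type I (Gumbel) form with distribution function \(F(x) = \exp(-e^{-(x-u)/\beta})\) and estimates \(u\) and \(\beta\) by the method of moments from maximal similarities of randomized sequences. The function names, the moment-based fit, and the sample scores are assumptions made for illustration; the summary does not state how the paper estimates its parameters.

```python
import math
import statistics

# Euler-Mascheroni constant, used by the method-of-moments fit.
EULER_GAMMA = 0.5772156649015329


def fit_gumbel_moments(scores):
    """Estimate Gumbel location u and scale beta from sample moments.

    Uses the standard identities mean = u + gamma * beta and
    variance = (pi * beta)^2 / 6 for the type I extreme value distribution.
    """
    mean = statistics.fmean(scores)
    sd = statistics.stdev(scores)
    scale = sd * math.sqrt(6.0) / math.pi
    location = mean - EULER_GAMMA * scale
    return location, scale


def gumbel_tail(score, location, scale):
    """Return P(S >= score) = 1 - exp(-exp(-(score - u) / beta))."""
    z = (score - location) / scale
    # -expm1(-t) equals 1 - exp(-t) and stays accurate when t is tiny.
    return -math.expm1(-math.exp(-z))


if __name__ == "__main__":
    # Hypothetical maximal similarities from shuffled sequence pairs.
    null_scores = [11.2, 12.8, 10.9, 13.5, 12.1, 11.7, 14.0, 12.4]
    u, beta = fit_gumbel_moments(null_scores)
    observed = 18.0  # hypothetical similarity of the real subalignment
    print(f"u = {u:.3f}, beta = {beta:.3f}")
    print(f"P(S >= {observed}) = {gumbel_tail(observed, u, beta):.3g}")
```

A small tail probability for the observed similarity would mark the subalignment as significant under this approximation; in practice the null sample would need to be far larger than the eight values used here.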

MSC:

92Cxx Physiological, cellular and medical topics
92F05 Other natural sciences (mathematical treatment)
62P10 Applications of statistics to biology and medical sciences; meta analysis
62P99 Applications of statistics
