Alignment-free comparison of genome sequences by a new numerical characterization. (English) Zbl 1397.92196

Summary: In order to compare different genome sequences, an alignment-free method is proposed. First, we present a new graphical representation of DNA sequences without degeneracy, which is conducive to intuitive comparison of sequences. Then, a new numerical characterization based on this representation is introduced to quantitatively depict the intrinsic nature of genome sequences; it is treated as a 10-dimensional vector. Alignment-free comparison of sequences is performed by computing the distances between the vectors of the corresponding numerical characterizations, and these distances are used to define the evolutionary relationship. Two data sets of DNA sequences were constructed to assess the performance of the method on sequence comparison. The results illustrate the validity of the method. The new numerical characterization provides a powerful tool for genome comparison.
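
The distance step described in the summary can be sketched as follows. This is only an illustration under stated assumptions: the summary does not say which distance is used, so Euclidean distance is assumed here, and it does not define the 10 components of the numerical characterization, so the vectors below are made-up placeholders rather than values from the paper. The names `euclidean_distance`, `distance_matrix`, and `toy_vectors` are hypothetical.

```python
import math


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length numeric vectors.

    Assumption: the summary only says "distances between vectors";
    Euclidean distance is one common choice, not necessarily the paper's.
    """
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distance_matrix(vectors):
    """Pairwise distances for a dict mapping sequence name -> vector."""
    names = list(vectors)
    return {
        (a, b): euclidean_distance(vectors[a], vectors[b])
        for a in names
        for b in names
    }


# Placeholder 10-dimensional vectors (NOT data from the paper); in the
# method described above, each would be the numerical characterization
# computed from one genome sequence.
toy_vectors = {
    "seq_A": [0.0] * 10,
    "seq_B": [1.0] * 10,
    "seq_C": [0.0] * 9 + [2.0],
}

distances = distance_matrix(toy_vectors)
for (a, b), d in sorted(distances.items()):
    if a < b:  # print each unordered pair once
        print(f"{a} vs {b}: {d:.4f}")
```

A matrix of this kind is the usual input to a tree-building procedure; a smaller distance is read as a closer relationship between the two sequences.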

MSC:

92C40 Biochemistry, molecular biology
92D20 Protein sequences, DNA sequences
