×

Computing the tandem duplication distance is NP-hard. (English) Zbl 1483.68504

Summary: In computational biology, tandem duplication is an important biological phenomenon which can occur either at the genome or at the DNA level. A tandem duplication takes a copy of a genome segment and inserts it right after the segment-this can be represented as the string operation \(AXB\Rightarrow AXXB\). Tandem exon duplications have been found in many species such as human, fly, and worm and have been largely studied in computational biology. The tandem duplication (TD) distance problem we investigate in this paper is defined as follows: given two strings \(S\) and \(T\) over the same alphabet \(\Sigma\), compute the smallest sequence of TDs required to convert \(S\) to \(T\). The natural question of whether the TD distance can be computed in polynomial time was posed in [Lect. Notes Comput. Sci. 2950, 297–308 (2004; Zbl 1200.68146)] by P. Leupold et al. and had remained open, despite the fact that TDs have received much attention ever since. In this paper, we focus on the special case when all characters of \(S\) are distinct. This is known as the exemplar TD distance, which is of special relevance in bioinformatics. We first prove that this problem is NP-hard when the alphabet size is unbounded, settling the 16-year-old open problem. We then show how to adapt the proof to \(|\Sigma|=4\), hence proving the NP-hardness of the TD problem for any \(|\Sigma|\geq 4\). One of the tools we develop for the reduction is a new problem called Cost-Effective Subgraph, for which we obtain W[1]-hardness results that might be of independent interest. We finally show that computing the exemplar TD distance between \(S\) and \(T\) is fixed-parameter tractable. Our results open the door to many other questions, and we conclude with several open problems.

MSC:

68W32 Algorithms on strings
68Q17 Computational difficulty of problems (lower bounds, completeness, difficulty of approximation, etc.)
68Q27 Parameterized complexity, tractability and kernelization
68Q45 Formal languages and automata
92D10 Genetics and epigenetics

Citations:

Zbl 1200.68146
PDFBibTeX XMLCite
Full Text: DOI arXiv

References:

[1] N. Alon, J. Bruck, F. F. Hassanzadeh, and S. Jain, Duplication distance to the root for binary sequences, IEEE Trans. Inform. Theory, 63 (2017), pp. 7793-7803, https://doi.org/10.1109/TIT.2017.2744608. · Zbl 1390.94811
[2] G. Benson and L. Dong, Reconstructing the duplication history of a tandem repeat, in Proceedings of the 17th International Conference on Intelligent Systems for Molecular Biology, ISMB’99, AAAI, 1999, pp. 44-53.
[3] D. P. Bovet and S. Varricchio, On the regularity of languages on a binary alphabet generated by copying systems, Inform. Process. Lett., 44 (1992), pp. 119-123, https://doi.org/10.1016/0020-0190(92)90050-6. · Zbl 0759.68049
[4] L. Bulteau, G. Fertin, and I. Rusu, Sorting by transpositions is difficult, SIAM J. Discrete Math., 26 (2012), pp. 1148-1180, https://doi.org/10.1137/110851390. · Zbl 1256.05004
[5] B. Charlesworth, P. Sniegowski, and W. Stephan, The evolutionary dynamics of repetitive DNA in eukaryotes, Nature, 371 (1994), pp. 215-220, https://doi.org/10.1038/371215a0.
[6] K. Chaudhuri, K. Chen, R. Mihaescu, and S. Rao, On the tandem duplication-random loss model of genome rearrangement, in Proceedings of 17th ACM-SIAM Symposium on Discrete Algorithms, SODA’06, SIAM, 2006, pp. 564-570. · Zbl 1192.92017
[7] Z.-Z. Chen, L. Wang, and Z. Wang, Approximation algorithms for reconstructing the duplication history of tandem repeats, Algorithmica, 54 (2009), pp. 501-529, https://doi.org/10.1007/s00453-008-9209-8. · Zbl 1185.68852
[8] D.-J. Cho, Y.-S. Han, and H. Kim, Bound-decreasing duplication system, Theoret. Comput. Sci., 793 (2019), pp. 152-168, https://doi.org/10.1016/j.tcs.2019.06.018. · Zbl 1430.68141
[9] J. Dassow, V. Mitrana, and G. Paun, On the regularity of the duplication closure, Bull. Eur. Assoc. Theor. Comput. Sci. EATCS, 69 (1999), pp. 133-136. · Zbl 0941.68605
[10] R. G. Downey and M. R. Fellows, Fixed-parameter tractability and completenessII: On completeness for W[1], Theoret. Comput. Sci., 141 (1995), pp. 109-131, https://doi.org/10.1016/0304-3975(94)00097-3. · Zbl 0873.68059
[11] A. Ehrenfeucht and G. Rozenberg, On regularity of languages generated by copying systems, Discrete Appl. Math., 8 (1984), pp. 313-317, https://doi.org/10.1016/0166-218X(84)90129-X. · Zbl 0549.68075
[12] O. Gascuel, M. D. Hendy, A. Jean-Marie, and R. McLachlan, The combinatorics of tandem duplication trees, Syst. Biol., 52 (2003), pp. 110-118, https://doi.org/10.1080/10635150390132821.
[13] D. Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, Cambridge University Press, Cambridge, UK, 1997. · Zbl 0934.68103
[14] D. Gusfield and J. Stoye, Linear time algorithms for finding and representing all the tandem repeats in a string, J. Comput. System Sci., 69 (2004), pp. 525-546, https://doi.org/10.1016/j.jcss.2004.03.004. · Zbl 1076.68111
[15] S. Hannenhalli and P. A. Pevzner, Transforming men into mice (polynomial algorithm for genomic distance problem), in Proceedings of IEEE 36th Symposium on Foundations of Computer Science, FOCS’95, IEEE, 1995, pp. 581-592, https://doi.org/10.1109/SFCS.1995.492588. · Zbl 0938.68939
[16] F. F. Hassanzadeh, M. Schwartz, and J. Bruck, The capacity of string-duplication systems, IEEE Trans. Inform. Theory, 62 (2016), pp. 811-824, https://doi.org/10.1109/TIT.2015.2505735. · Zbl 1359.94317
[17] M. Ito, P. Leupold, and K. Shikishima-Tsuji, Closure of languages under bounded duplications, in Proceedings of the 10th International Conference on Developments in Language Theory, Lecture Notes in Comput. Sci. 4036, Spring-Verlag, Cham, 2006, pp. 238-247, https://doi.org/10.1007/11779148_22. · Zbl 1227.68058
[18] S. Jain, F. F. Hassanzadeh, and J. Bruck, Capacity and expressiveness of genomic tandem duplication, IEEE Trans. Inform. Theory, 63 (2017), pp. 6129-6138, https://doi.org/10.1109/TIT.2017.2728079. · Zbl 1390.92088
[19] R. M. Karp, Reducibility among combinatorial problems, in Complexity of Computer Computations, R. E. Miller, J. W. Thatcher, and J. D. Bohlinger, eds., The IBM Research Symposia Series, Springer, Boston, MA, 1972, pp. 85-103, https://doi.org/10.1007/978-1-4684-2001-2_9. · Zbl 1467.68065
[20] G. M. Landau, J. P. Schmidt, and D. Sokol, An algorithm for approximate tandem repeats, J. Comput. Biol., 8 (2001), pp. 1-18, https://doi.org/10.1089/106652701300099038.
[21] J. Leech, A problem on strings of beads, Math. Gaz., 41 (1957), pp. 277-278. · Zbl 0079.01101
[22] I. Letunic, R. R. Copley, and P. Bork, Common exon duplication in animals and its role in alternative splicing, Hum. Mol. Genet., 11 (2002), pp. 1561-1567, https://doi.org/10.1093/hmg/11.13.1561.
[23] P. Leupold, C. Martin-Vide, and V. Mitrana, Uniformly bounded duplication languages, Discrete Appl. Math., 146 (2005), pp. 301-310, https://doi.org/10.1016/j.dam.2004.10.003. · Zbl 1077.68047
[24] P. Leupold, V. Mitrana, and J. M. Sempere, Formal languages arising from gene repeated duplication, in Aspects of Molecular Computing, G. P. Natasa Jonoska and G. Rozenberg, eds., Lecture Notes in Comput. Sci. 2950, Springer-Verlag, Berlin, 2004, pp. 297-308, https://doi.org/10.1007/978-3-540-24635-0_22. · Zbl 1200.68146
[25] M. E. Macdonald, C. M. Ambrose, M. P. Duyao, et al., A novel gene containing a trinucleotide repeat that is expanded and unstable on Huntington’s disease, Cell, 72 (1993), pp. 971-983, https://doi.org/10.1016/0092-8674(93)90585-E.
[26] L. Oesper, A. M. Ritz, S. J. Aerni, R. Drebin, and B. J. Raphael, Reconstructing cancer genomes from paired-end sequencing data, BMC Bioinform., 13 (2012), p. S10, https://doi.org/10.1186/1471-2105-13-S6-S10.
[27] L. Qingge, X. He, Z. Liu, and B. Zhu, On the minimum copy number generation problem in cancer genomics, in Proceedings of the 10th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, BCB’18, New York, NY, 2018, ACM Press, pp. 260-269, https://doi.org/10.1145/3233547.3233586.
[28] D. Sankoff, Gene and genome duplication, Curr. Opinion Genet. Dev., 11 (2001), pp. 681-684, https://doi.org/10.1016/S0959-437X(00)00253-7.
[29] A. J. Sharp, D. P. Locke, S. D. McGrath, Z. Cheng, J. A. Bailey, R. U. Vallente, L. M. Pertz, R. A. Clark, S. Schwartz, R. Segraves, V. V. Oseroff, D. G. Albertson, D. Pinkel, and E. E. Eichler, Segmental duplications and copy-number variation in the human genome, Am. J. Hum. Genet., 77 (2005), pp. 78-88, https://doi.org/10.1086/431652.
[30] J. W. Szostak and R. Wu, Unequal crossing over in the ribosomal DNA of Saccharomyces cerevisiae, Nature, 284 (1980), pp. 426-430, https://doi.org/10.1038/284426a0.
[31] O. Tremblay-Savard, D. Bertrand, and N. El-Mabrouk, Evolution of orthologous tandemly arrayed gene clusters, BMC Bioinform., 12 (2011), p. S2, https://doi.org/10.1186/1471-2105-12-S9-S2.
[32] M.-W. Wang, On the irregularity of the duplication closure, Bull. Eur. Assoc. Theor. Comput. Sci. EATCS, 70 (2000), pp. 162-163. · Zbl 0983.68111
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.