
A new challenge for compression algorithms: Genetic sequences. (English) Zbl 0813.92019

Summary: Universal data compression algorithms fail to compress genetic sequences. This is due to the specificity of this particular kind of “text”. We analyze in some detail the properties of the sequences that cause the failure of classical algorithms. We then present a lossless algorithm, biocompress-2, to compress the information contained in DNA and RNA sequences, based on the detection of regularities, such as the presence of palindromes. The algorithm combines substitutional and statistical methods and, to the best of our knowledge, leads to the highest compression of DNA. The results, although not satisfactory, give insight into the necessary correlation between compression and comprehension of genetic sequences.
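
The kind of regularity named in the summary can be illustrated with a minimal sketch. It assumes that "palindrome" is meant in the molecular-biology sense of a reverse-complemented repeat (a factor whose reverse complement occurs earlier in the sequence). The function names, the `min_len` threshold, and the brute-force search are hypothetical choices made for illustration; this is not the biocompress-2 implementation, whose details are not given in this record.

```python
# Illustrative sketch only: names and the min_len threshold are assumptions,
# not taken from biocompress-2.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A<->T, C<->G)."""
    return seq.translate(COMPLEMENT)[::-1]


def longest_palindrome_match(seq: str, pos: int, min_len: int = 4):
    """Find the longest factor starting at `pos` whose reverse complement
    occurs entirely inside seq[:pos].

    Returns (start, length) of the earlier occurrence, or None if no
    factor of at least `min_len` bases qualifies.
    """
    prefix = seq[:pos]
    best = None
    length = min_len
    while pos + length <= len(seq):
        target = reverse_complement(seq[pos:pos + length])
        start = prefix.find(target)
        if start == -1:
            # A longer factor cannot match if this shorter one does not.
            break
        best = (start, length)
        length += 1
    return best


if __name__ == "__main__":
    # The second half is the reverse complement of the first half.
    print(longest_palindrome_match("GATTACATGTAATC", 7))
```

A substitutional coder could replace such a factor with a (start, length) reference to the earlier occurrence whenever the reference costs fewer bits than writing the bases out directly, which is the general idea behind combining regularity detection with a statistical fallback.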

MSC:

92D20 Protein sequences, DNA sequences
68P25 Data encryption (aspects in computer science)
92-04 Software, source code, etc. for problems pertaining to biology
68P99 Theory of data
92C40 Biochemistry, molecular biology