At the roots of dictionary compression: string attractors | Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing

Abstract

A well-known fact in the field of lossless text compression is that high-order entropy is a weak model when the input contains long repetitions. Motivated by this fact, decades of research have generated myriads of so-called dictionary compressors: algorithms able to reduce the text's size by exploiting its repetitiveness. Lempel-Ziv 77 is one of the most successful and well-known tools of this kind, followed by straight-line programs, run-length Burrows-Wheeler transform, macro schemes, collage systems, and the compact directed acyclic word graph. In this paper, we show that these techniques are different solutions to the same, elegant, combinatorial problem: to find a small set of positions capturing all of the text's distinct substrings. We call such a set a string attractor. We first show reductions between dictionary compressors and string attractors. This gives the approximation ratios of dictionary compressors with respect to the smallest string attractor and allows us to uncover new asymptotic relations between the output sizes of different dictionary compressors. We then show that the k-attractor problem (deciding whether a text has a size-t set of positions capturing all substrings of length at most k) is NP-complete for k ≥ 3. This, in particular, includes the full string attractor problem. We provide several approximation techniques for the smallest k-attractor, show that the problem is APX-complete for constant k, and give strong inapproximability results. To conclude, we provide matching lower and upper bounds for the random access problem on string attractors. The upper bound is proved by showing a data structure supporting queries in optimal time. Our data structure is universal: by our reductions to string attractors, it supports random access on any dictionary-compression scheme. In particular, it also matches the lower bound on LZ77, straight-line programs, collage systems, and macro schemes, and therefore essentially closes (at once) the random access problem for all these compressors.
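
The definition in the abstract can be made concrete with a small brute-force sketch. It assumes that a set of positions "captures" a substring when at least one occurrence of that substring spans a position in the set, and it uses 0-based positions. The function names `is_k_attractor` and `smallest_attractor` are illustrative and do not come from the paper; the exhaustive search is exponential and is only meant for tiny inputs, not as one of the paper's approximation techniques.

```python
from itertools import combinations


def is_k_attractor(text, positions, k=None):
    """Check whether `positions` (0-based) captures every distinct
    substring of `text` of length at most k (all lengths if k is None).

    Assumption: a substring is captured when at least one of its
    occurrences text[start:start+length] contains a chosen position.
    """
    n = len(text)
    max_len = n if k is None else min(k, n)
    chosen = set(positions)

    all_substrings = set()
    captured = set()
    for length in range(1, max_len + 1):
        for start in range(n - length + 1):
            sub = text[start:start + length]
            all_substrings.add(sub)
            # Does this particular occurrence cross a chosen position?
            if any(start <= p < start + length for p in chosen):
                captured.add(sub)
    return all_substrings <= captured


def smallest_attractor(text, k=None):
    """Exhaustively search for a minimum-size (k-)attractor.

    Exponential time; for illustration on very short strings only.
    """
    n = len(text)
    for size in range(n + 1):
        for candidate in combinations(range(n), size):
            if is_k_attractor(text, candidate, k):
                return list(candidate)


if __name__ == "__main__":
    print(is_k_attractor("abab", [1, 2]))   # both letters and all longer substrings are crossed
    print(is_k_attractor("abab", [1]))      # no occurrence of "a" crosses position 1
    print(smallest_attractor("aaaa"))
```

Under this reading, any attractor needs at least one position per distinct character, which is why a single position cannot work for `"abab"` while one position suffices for `"aaaa"`.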
