DEIM Forum 2013 C10-2 Re-Pair {k sekine, sasakawa, syoshid, 1999 Larsson Moffat Re-Pair Re-Pair Re-Pair

Save this PDF as:
 WORD  PNG  TXT  JPG

Size: px
Start display at page:

Download "DEIM Forum 2013 C10-2 Re-Pair {k sekine, sasakawa, syoshid, 1999 Larsson Moffat Re-Pair Re-Pair Re-Pair"

Transcription

1 DEIM Forum 2013 C10-2 Re-Pair k sekine, sasakawa, syoshid, 1999 Larsson Moffat Re-Pair Re-Pair Re-Pair VF Variable-to-Fixed-Length Encoding for Large Texts Using Re-Pair Algorithm with Efficient Shared Dictionaries Kei SEKINE, Hirohito SASAKAWA, Satoshi YOSHIDA, and Takuya KIDA Grad. School of Inform. Sci. and Tech., Hokkaido University, N14 W9, Sapporo , JAPAN k sekine, sasakawa, syoshid, Abstract A Re-Pair algorithm proposed by Larsson and Moffat in 1999 is a simple grammar-based compression method, which achieves an extremely high compression ratio. However, since the Re-Pair algorithm is offline and very space consuming, we need to divide a very large text into smaller blocks to apply the algorithm on the very large text. In such case, it is expected that we can improve the compression speed and ratio by sharing a part of the dictionaries among all the blocks. In this paper, we show empirically how the compression speed and ratio vary with constructing the shared dictionary on the text constructed by sampling uniformly from the whole input text. Key words Lossless Data Compression, Re-Pair, Grammer-Based Compression, VF Coding 1. Burrows-Wheeler [1] Lemple-Ziv [14], [15] Bzip2 Burrows-Wheeler Wan Moffat [13] Larsson Moffat Re-Pair [6] Re-Pair Wan Moffat Re-Pair 1

2 [13] 3 Re-Pair Re-Pair Blocked- Re-Pair-VF [16] Blocked-Re-Pair-VF Re-Pair-VF Blocked-Re-Pair-VF Re-Pair Re-Pair variablelengh-to-fixed-length VF VF VF VF VF [2], [5] VF VF bzip2 (27 ) 2. LZ78 [15] LZW [14] Bisection [3] straight line program CFG [4], [6], [8], [9]. Re-Pair [6] SEQUITUR [8] Maruyama [7] VF Klein Shapira [5] Kida [2] [12] VF STVF STVF STVF VF Tunstall [10] gzip VF Uemura [11] Gzip [11] Tunstall 100 Wan Moffat [13] Re-Merge Re-Pair Re-Merge Re-Pair Re-Merge WSJ508 20% WSJ MB SGML GHz Intel Xeon 2GB Debian GNU/Linux Re-Pair [6] 3. 1 Re-Pair Re-Pair 2

3 (Σ, V, σ, R) Σ = a 1, a 2,, a Σ } V = α 1, α 2,, α V } σ V R V (Σ V ) R Σ V Re-Pair σ α i1 α i2 α im ( i k 1,, Σ + V 1}), a i if 1 < = i < = Σ, α i α j α k (1 < = j, k < i) if i > Σ. 2-gram Re-Pair 2-gram 2-gram R 2-gram σ R σ 3. 2 Re-Pair VF B, S, L, C T 0,..., Σ 1} B, S, L, C, T T D T B T N N/B c c = C/ N/B S 1 2-gram (α, β) 2 2-gram (α, β) D 3 2-gram (α, β) 2-gram S 1 Input Text Text1 Text2 Text3 Text4 Shared Dic Block Separation Sampling & Merge Re-Pair-VF Re-Pair-VF Re-Pair S 2 L 2-gram 1 2-gram 2 2-gram D 3 2-gram 2-gram 2 L 2-gram L L L 4. C GCC version

4 Shared Dic Text1 Text2 Text3 Text4 replace and so on... Text1' Text2' Text3' Re-Pair-VF Re-Pair-VF Re-Pair-VF Local Dic1 Local Local Comp1 Comp2 Comp3 Dic2 Dic3 2 2-gram Re-Pair-VF 4. 1 CPU : Intel Xeon processor 3.6GHz : 32GB OS : Ubuntu OS Pizza & Chili corpus 1 DNA dna XML dblp.xml english 2GB Blocked-Re-Pair-VF 4. 2 B L S C Pizza & Chili corpus : texts.html 3 B L S NEW (xm) C = x (MB) OLD S 2 L B = 128 L = 19 S = 1/ % 4. 3 B L S C 4 4

5 5 C (MB) L (bit) B = 128 (MB) S = 3/8 6 C(MB) L (bit) B = 128(MB) S = 3/8 4 sec B L S NEW (xm) C = x (MB) OLD C gram B

6 7 B (MB) 9 S B S C = 128 (MB) L = 19 (bit) (MB) C = 128 (MB) L = 19 (bit) 8 B (MB) 10 S S C = 128 (MB) L = 19 (bit) B C = 128 (MB) L = 19 (bit) S gram S S = 0/ L

7 11 L S C = OLD NEW (MB) B = 128 (MB) 14 OLD NEW 12 L S C = 128 (MB) B = 128 (MB) 4. 4 gzip 2, bzip2 3, LZMA (7-Zip) 4, 4 Ubuntu gzip 1.4 bzip LZMA alpha bzip2 14 bzip zip : 3 bzip2 : 4 LZMA : 15 OLD NEW 15 1/10 bzip2 VF 5. VF Blocked-Re-pair-VF 7

8 bzip2 Re-Merge [13] Re-Merge Re-Merge [1] M. Burrows and D. J. Wheeler. A block-sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, Palo Alto, California, [2] T. Kida. Suffix tree based VF-coding for compressed pattern matching. In Proc. of Data Compression Conference 2009 (DCC 2009), p. 449, Mar [3] J. C. Kieffer, G. Nelson E.-H. Yang, and P. Cosman. Universal lossless compression via multilevel pattern matching. IEEE Trans. Inform. Theory, Vol. 46, No. 4, pp , [4] J. C. Kieffer and E.-H. Yang. Grammar-based codes: a new class of universal lossless source codes. IEEE Trans. on Inform. Theory, Vol. 46, No. 3, pp , [5] S. T. Klein and D. Shapira. Improved variable-to-fixed length codes. In Proc. of the 15th International Symposium on String Processing and Information Retrieval (SPIRE 2008), pp , [6] N. J. Larsson and A. Moffat. Off-line dictionary-based compression. Proceedings of the IEEE, Vol. 88, No. 11, pp , [7] S. Maruyama, Y. Tanaka, H. Sakamoto, and M. Takeda. Context-sensitive grammar transform: Compression and pattern matching. In Proc. of 15th International Symposium on String Processing and Information Retrieval (SPIRE 2008), pp , Nov [8] C. Nevill-Manning, I. Witten, and D. Maulsby. Compression by induction of hierarchical grammars. In Proc. of the Data Compression Conference 1994 (DCC 94), pp IEEE, [9] H. Sakamoto, T. Kida, and S. Shimozono. A space-saving linear-time algorithm for grammar-based compression. In String Processing and Information Retrieval, Vol of Lecture Notes in Computer Science, pp Springer Berlin / Heidelberg, [10] B. P. Tunstall. Synthesis of noiseless compression codes. PhD thesis, Georgia Institute of Technology, Atlanta, GA, [11] T. Uemura, S. Yoshida, T. Kida, T. Asai, and S. Okamoto. Training parse trees for efficient VF coding. In Proc. of the 17th international conference on String processing and information retrieval (SPIRE 2010), pp , [12] E. Ukkonen. On-line construction of suffix trees. Algorithmica, Vol. 14, No. 3, pp , [13] R. Wan and A. Moffat. Block merging for off-line compression. J. Am. Soc. Inf. Sci. Technol., Vol. 58, No. 1, pp. 3 14, January [14] T. A. Welch. A technique for high performance data compression. IEEE Comput., Vol. 17, pp. 8 19, June [15] J. Ziv and A. Lempel. Compression of individual sequences via variable-length coding. IEEE Trans. on Inform. Theory, Vol. 24, No. 5, pp , Sep [16],,,.. 156,