DEIM Forum 2013 C10-2 Re-Pair {k sekine, sasakawa, syoshid, 1999 Larsson Moffat Re-Pair Re-Pair Re-Pair

Similar documents
共有辞書を用いた 効率の良い圧縮アルゴリズム

透過的データアクセスを実現する新しいデータ圧縮法

, IT.,.,..,.. i

2. CABAC CABAC CABAC 1 1 CABAC Figure 1 Overview of CABAC 2 DCT 2 0/ /1 CABAC [3] 3. 2 値化部 コンテキスト計算部 2 値算術符号化部 CABAC CABAC

17 Proposal of an Algorithm of Image Extraction and Research on Improvement of a Man-machine Interface of Food Intake Measuring System

(a) 1 (b) 3. Gilbert Pernicka[2] Treibitz Schechner[3] Narasimhan [4] Kim [5] Nayar [6] [7][8][9] 2. X X X [10] [11] L L t L s L = L t + L s

Run-Based Trieから構成される 決定木の枝刈り法

O(m) CPU 1 CPU Xeon LZEnd L1 1 L2 1.5 MiB 40 [10] CPU on-memory 2 2 LZEnd LZEnd 1 Xeon 5670 Xeon 5670 L1 data L2 LL Last Level 129 KiB 1

DEIM Forum 2009 C8-4 QA NTT QA QA QA 2 QA Abstract Questions Recomme

2). 3) 4) 1.2 NICTNICT DCRA Dihedral Corner Reflector micro-arraysdcra DCRA DCRA DCRA 3D DCRA PC USB PC PC ON / OFF Velleman K8055 K8055 K8055

1 Fig. 1 Extraction of motion,.,,, 4,,, 3., 1, 2. 2.,. CHLAC,. 2.1,. (256 ).,., CHLAC. CHLAC, HLAC. 2.3 (HLAC ) r,.,. HLAC. N. 2 HLAC Fig. 2

1 7.35% 74.0% linefeed point c 200 Information Processing Society of Japan

THE INSTITUTE OF ELECTRONICS, INFORMATION AND COMMUNICATION ENGINEERS TECHNICAL REPORT OF IEICE.

IPSJ SIG Technical Report Vol.2011-EC-19 No /3/ ,.,., Peg-Scope Viewer,,.,,,,. Utilization of Watching Logs for Support of Multi-

1 Table 1: Identification by color of voxel Voxel Mode of expression Nothing Other 1 Orange 2 Blue 3 Yellow 4 SSL Humanoid SSL-Vision 3 3 [, 21] 8 325

Input image Initialize variables Loop for period of oscillation Update height map Make shade image Change property of image Output image Change time L

Web Web Web Web Web, i

258 5) GPS 1 GPS 6) GPS DP 7) 8) 10) GPS GPS ) GPS Global Positioning System

21 Key Exchange method for portable terminal with direct input by user

Duplicate Near Duplicate Intact Partial Copy Original Image Near Partial Copy Near Partial Copy with a background (a) (b) 2 1 [6] SIFT SIFT SIF

IPSJ SIG Technical Report 1, Instrument Separation in Reverberant Environments Using Crystal Microphone Arrays Nobutaka ITO, 1, 2 Yu KITANO, 1

Fig. 3 3 Types considered when detecting pattern violations 9)12) 8)9) 2 5 methodx close C Java C Java 3 Java 1 JDT Core 7) ) S P S

& Vol.5 No (Oct. 2015) TV 1,2,a) , Augmented TV TV AR Augmented Reality 3DCG TV Estimation of TV Screen Position and Ro

IPSJ SIG Technical Report Pitman-Yor 1 1 Pitman-Yor n-gram A proposal of the melody generation method using hierarchical pitman-yor language model Aki

DEIM Forum 2010 A Web Abstract Classification Method for Revie

23 Fig. 2: hwmodulev2 3. Reconfigurable HPC 3.1 hw/sw hw/sw hw/sw FPGA PC FPGA PC FPGA HPC FPGA FPGA hw/sw hw/sw hw- Module FPGA hwmodule hw/sw FPGA h

[2] OCR [3], [4] [5] [6] [4], [7] [8], [9] 1 [10] Fig. 1 Current arrangement and size of ruby. 2 Fig. 2 Typography combined with printing

P2P P2P peer peer P2P peer P2P peer P2P i

(MIRU2008) HOG Histograms of Oriented Gradients (HOG)

IT i

,4) 1 P% P%P=2.5 5%!%! (1) = (2) l l Figure 1 A compilation flow of the proposing sampling based architecture simulation

IPSJ SIG Technical Report Vol.2014-EIP-63 No /2/21 1,a) Wi-Fi Probe Request MAC MAC Probe Request MAC A dynamic ads control based on tra

kut-paper-template2.dvi

IPSJ SIG Technical Report Vol.2011-MUS-91 No /7/ , 3 1 Design and Implementation on a System for Learning Songs by Presenting Musical St

IPSJ SIG Technical Report Vol.2010-GN-74 No /1/ , 3 Disaster Training Supporting System Based on Electronic Triage HIROAKI KOJIMA, 1 KU

DPA,, ShareLog 3) 4) 2.2 Strino Strino STRain-based user Interface with tacticle of elastic Natural ObjectsStrino 1 Strino ) PC Log-Log (2007 6)

IPSJ SIG Technical Report Secret Tap Secret Tap Secret Flick 1 An Examination of Icon-based User Authentication Method Using Flick Input for

IPSJ SIG Technical Report Vol.2010-SLDM-144 No.50 Vol.2010-EMB-16 No.50 Vol.2010-MBL-53 No.50 Vol.2010-UBI-25 No /3/27 Twitter IME Twitte

BOK body of knowledge, BOK BOK BOK 1 CC2001 computing curricula 2001 [1] BOK IT BOK 2008 ITBOK [2] social infomatics SI BOK BOK BOK WikiBOK BO

[1] [3]. SQL SELECT GENERATE< media >< T F E > GENERATE. < media > HTML PDF < T F E > Target Form Expression ( ), 3.. (,). : Name, Tel name tel

3D UbiCode (Ubiquitous+Code) RFID ResBe (Remote entertainment space Behavior evaluation) 2 UbiCode Fig. 2 UbiCode 2. UbiCode 2. 1 UbiCode UbiCode 2. 2

xx/xx Vol. Jxx A No. xx 1 Fig. 1 PAL(Panoramic Annular Lens) PAL(Panoramic Annular Lens) PAL (2) PAL PAL 2 PAL 3 2 PAL 1 PAL 3 PAL PAL 2. 1 PAL

Core1 FabScalar VerilogHDL Cache Cache FabScalar 1 CoreConnect[2] Wishbone[3] AMBA[4] AMBA 1 AMBA ARM L2 AMBA2.0 AMBA2.0 FabScalar AHB APB AHB AMBA2.0

12) NP 2 MCI MCI 1 START Simple Triage And Rapid Treatment 3) START MCI c 2010 Information Processing Society of Japan

1., 1 COOKPAD 2, Web.,,,,,,.,, [1]., 5.,, [2].,,.,.,, 5, [3].,,,.,, [4], 33,.,,.,,.. 2.,, 3.., 4., 5., ,. 1.,,., 2.,. 1,,

1 1 tf-idf tf-idf i

IPSJ SIG Technical Report Vol.2010-NL-199 No /11/ treebank ( ) KWIC /MeCab / Morphological and Dependency Structure Annotated Corp

ID 3) 9 4) 5) ID 2 ID 2 ID 2 Bluetooth ID 2 SRCid1 DSTid2 2 id1 id2 ID SRC DST SRC 2 2 ID 2 2 QR 6) 8) 6) QR QR QR QR

4. C i k = 2 k-means C 1 i, C 2 i 5. C i x i p [ f(θ i ; x) = (2π) p 2 Vi 1 2 exp (x µ ] i) t V 1 i (x µ i ) 2 BIC BIC = 2 log L( ˆθ i ; x i C i ) + q

DEIM Forum 2009 E

Table 1. Reluctance equalization design. Fig. 2. Voltage vector of LSynRM. Fig. 4. Analytical model. Table 2. Specifications of analytical models. Fig

( )

自然言語処理16_2_45

IPSJ SIG Technical Report Vol.2014-DBS-159 No.6 Vol.2014-IFAT-115 No /8/1 1,a) 1 1 1,, 1. ([1]) ([2], [3]) A B 1 ([4]) 1 Graduate School of Info

IT,, i

IPSJ SIG Technical Report NetMAS NetMAS NetMAS One-dimensional Pedestrian Model for Fast Evacuation Simulator Shunsuke Soeda, 1 Tomohisa Yam

The 15th Game Programming Workshop 2010 Magic Bitboard Magic Bitboard Bitboard Magic Bitboard Bitboard Magic Bitboard Magic Bitboard Magic Bitbo

28 Horizontal angle correction using straight line detection in an equirectangular image

12 DCT A Data-Driven Implementation of Shape Adaptive DCT

IPSJ SIG Technical Report Vol.2016-CE-137 No /12/ e β /α α β β / α A judgment method of difficulty of task for a learner using simple

第62巻 第1号 平成24年4月/石こうを用いた木材ペレット

Microsoft Word - toyoshima-deim2011.doc

,,,,., C Java,,.,,.,., ,,.,, i

IPSJ SIG Technical Report iphone iphone,,., OpenGl ES 2.0 GLSL(OpenGL Shading Language), iphone GPGPU(General-Purpose Computing on Graphics Proc

IPSJ SIG Technical Report Vol.2012-HCI-149 No /7/20 1 1,2 1 (HMD: Head Mounted Display) HMD HMD,,,, An Information Presentation Method for Weara

1 3DCG [2] 3DCG CG 3DCG [3] 3DCG 3 3 API 2 3DCG 3 (1) Saito [4] (a) 1920x1080 (b) 1280x720 (c) 640x360 (d) 320x G-Buffer Decaudin[5] G-Buffer D

kut-paper-template.dvi

1034 IME Web API Web API 1 IME Fig. 1 Suitable situations for context-aware IME. IME IME IME IME 1 GPS Web API Web API Web API Web )

20 Method for Recognizing Expression Considering Fuzzy Based on Optical Flow

fiš„v8.dvi

1 Web [2] Web [3] [4] [5], [6] [7] [8] S.W. [9] 3. MeetingShelf Web MeetingShelf MeetingShelf (1) (2) (3) (4) (5) Web MeetingShelf

[2] 2. [3 5] 3D [6 8] Morishima [9] N n 24 24FPS k k = 1, 2,..., N i i = 1, 2,..., n Algorithm 1 N io user-specified number of inbetween omis

IPSJ SIG Technical Report Vol.2009-DPS-141 No.23 Vol.2009-GN-73 No.23 Vol.2009-EIP-46 No /11/27 t-room t-room 2 Development of

The copyright of this material is retained by the Information Processing Society of Japan (IPSJ). The material has been made available on the website

IPSJ SIG Technical Report Vol.2012-CG-148 No /8/29 3DCG 1,a) On rigid body animation taking into account the 3D computer graphics came

Vol. 42 No. SIG 8(TOD 10) July HTML 100 Development of Authoring and Delivery System for Synchronized Contents and Experiment on High Spe

IPSJ SIG Technical Report Vol.2009-DPS-141 No.20 Vol.2009-GN-73 No.20 Vol.2009-EIP-46 No /11/27 1. MIERUKEN 1 2 MIERUKEN MIERUKEN MIERUKEN: Spe

1: A/B/C/D Fig. 1 Modeling Based on Difference in Agitation Method artisoc[7] A D 2017 Information Processing

IPSJ SIG Technical Report Vol.2011-IOT-12 No /3/ , 6 Construction and Operation of Large Scale Web Contents Distribution Platfo

16_.....E...._.I.v2006

(3.6 ) (4.6 ) 2. [3], [6], [12] [7] [2], [5], [11] [14] [9] [8] [10] (1) Voodoo 3 : 3 Voodoo[1] 3 ( 3D ) (2) : Voodoo 3D (3) : 3D (Welc

Vol. 42 No MUC-6 6) 90% 2) MUC-6 MET-1 7),8) 7 90% 1 MUC IREX-NE 9) 10),11) 1) MUCMET 12) IREX-NE 13) ARPA 1987 MUC 1992 TREC IREX-N

1: 2: 3: 4: 2. 1 Exploratory Search [4] Exploratory Search 2. 1 [7] [8] [9] [10] Exploratory Search

1 2. Nippon Cataloging Rules NCR [6] (1) 5 (2) 4 3 (3) 4 (4) 3 (5) ISSN 7 International Standard Serial Number ISSN (6) (7) 7 16 (8) ISBN ISSN I

23 Study on Generation of Sudoku Problems with Fewer Clues

修士論文

04.™ƒ”R/’Ô”�/’Xfl©

Vol.55 No (Jan. 2014) saccess 6 saccess 7 saccess 2. [3] p.33 * B (A) (B) (C) (D) (E) (F) *1 [3], [4] Web PDF a m

<95DB8C9288E397C389C88A E696E6462>

09_加藤_紀要_2007

untitled

MDD PBL ET 9) 2) ET ET 2.2 2), 1 2 5) MDD PBL PBL MDD MDD MDD 10) MDD Executable UML 11) Executable UML MDD Executable UML

A Feasibility Study of Direct-Mapping-Type Parallel Processing Method to Solve Linear Equations in Load Flow Calculations Hiroaki Inayoshi, Non-member

IPSJ SIG Technical Report Vol.2010-CVIM-170 No /1/ Visual Recognition of Wire Harnesses for Automated Wiring Masaki Yoneda, 1 Ta

1 OpenCL OpenCL 1 OpenCL GPU ( ) 1 OpenCL Compute Units Elements OpenCL OpenCL SPMD (Single-Program, Multiple-Data) SPMD OpenCL work-item work-group N

B HNS 7)8) HNS ( ( ) 7)8) (SOA) HNS HNS 4) HNS ( ) ( ) 1 TV power, channel, volume power true( ON) false( OFF) boolean channel volume int

,,.,.,,.,.,.,.,,.,..,,,, i

3_39.dvi


2. Twitter Twitter 2.1 Twitter Twitter( ) Twitter Twitter ( 1 ) RT ReTweet RT ReTweet RT ( 2 ) URL Twitter Twitter 140 URL URL URL 140 URL URL

Transcription:

DEIM Forum 2013 C10-2 Re-Pair 060 0814 14 9 E-mail: k sekine, sasakawa, syoshid, kida}@ist.hokudai.ac.jp 1999 Larsson Moffat Re-Pair Re-Pair Re-Pair VF Variable-to-Fixed-Length Encoding for Large Texts Using Re-Pair Algorithm with Efficient Shared Dictionaries Kei SEKINE, Hirohito SASAKAWA, Satoshi YOSHIDA, and Takuya KIDA Grad. School of Inform. Sci. and Tech., Hokkaido University, N14 W9, Sapporo 060 0814, JAPAN E-mail: k sekine, sasakawa, syoshid, kida}@ist.hokudai.ac.jp Abstract A Re-Pair algorithm proposed by Larsson and Moffat in 1999 is a simple grammar-based compression method, which achieves an extremely high compression ratio. However, since the Re-Pair algorithm is offline and very space consuming, we need to divide a very large text into smaller blocks to apply the algorithm on the very large text. In such case, it is expected that we can improve the compression speed and ratio by sharing a part of the dictionaries among all the blocks. In this paper, we show empirically how the compression speed and ratio vary with constructing the shared dictionary on the text constructed by sampling uniformly from the whole input text. Key words Lossless Data Compression, Re-Pair, Grammer-Based Compression, VF Coding 1. Burrows-Wheeler [1] Lemple-Ziv [14], [15] Bzip2 Burrows-Wheeler Wan Moffat [13] Larsson Moffat Re-Pair [6] Re-Pair Wan Moffat Re-Pair 1

[13] 3 Re-Pair Re-Pair Blocked- Re-Pair-VF [16] Blocked-Re-Pair-VF Re-Pair-VF Blocked-Re-Pair-VF Re-Pair Re-Pair variablelengh-to-fixed-length VF VF VF VF VF [2], [5] VF VF bzip2 (27 ) 2. LZ78 [15] LZW [14] Bisection [3] straight line program CFG [4], [6], [8], [9]. Re-Pair [6] SEQUITUR [8] Maruyama [7] VF Klein Shapira [5] Kida [2] [12] VF STVF STVF STVF VF Tunstall [10] gzip VF Uemura [11] Gzip [11] Tunstall 100 Wan Moffat [13] Re-Merge Re-Pair Re-Merge Re-Pair Re-Merge WSJ508 20% WSJ508 508MB SGML 1987 1992 2.8GHz Intel Xeon 2GB Debian GNU/Linux 4400 3. Re-Pair [6] 3. 1 Re-Pair Re-Pair 2

(Σ, V, σ, R) Σ = a 1, a 2,, a Σ } V = α 1, α 2,, α V } σ V R V (Σ V ) R Σ V Re-Pair σ α i1 α i2 α im ( i k 1,, Σ + V 1}), a i if 1 < = i < = Σ, α i α j α k (1 < = j, k < i) if i > Σ. 2-gram Re-Pair 2-gram 2-gram R 2-gram σ R σ 3. 2 Re-Pair VF B, S, L, C T 0,..., Σ 1} B, S, L, C, T T D 1 2 2 T B T N N/B c c = C/ N/B S 1 2-gram (α, β) 2 2-gram (α, β) D 3 2-gram (α, β) 2-gram S 1 Input Text Text1 Text2 Text3 Text4 Shared Dic Block Separation Sampling & Merge Re-Pair-VF Re-Pair-VF Re-Pair S 2 L 2-gram 1 2-gram 2 2-gram D 3 2-gram 2-gram 2 L 2-gram L L L 4. C GCC version 4.6.3 3

Shared Dic Text1 Text2 Text3 Text4 replace and so on... Text1' Text2' Text3' Re-Pair-VF Re-Pair-VF Re-Pair-VF Local Dic1 Local Local Comp1 Comp2 Comp3 Dic2 Dic3 2 2-gram Re-Pair-VF 4. 1 CPU : Intel Xeon processor 3.6GHz : 32GB OS : Ubuntu OS 12.04 Pizza & Chili corpus 1 DNA dna XML dblp.xml english 2GB Blocked-Re-Pair-VF 4. 2 B L S C 4 3 4 4 3 1 Pizza & Chili corpus : http://pizzachili.dcc.uchile.cl/ texts.html 3 B L S NEW (xm) C = x (MB) OLD S 2 L B = 128 L = 19 S = 1/8 27.78 % 4. 3 B L S C 4 4

5 C (MB) L (bit) B = 128 (MB) S = 3/8 6 C(MB) L (bit) B = 128(MB) S = 3/8 4 sec B L S NEW (xm) C = x (MB) OLD 4. 3. 1 C 5 6 5 6 2-gram 4. 3. 2 B 7 8 7 8 5

7 B (MB) 9 S B S C = 128 (MB) L = 19 (bit) (MB) C = 128 (MB) L = 19 (bit) 8 B (MB) 10 S S C = 128 (MB) L = 19 (bit) B C = 128 (MB) L = 19 (bit) 4. 3. 3 S 9 10 9 10 2-gram S S = 0/8 4. 3. 4 L 11 12 11 12 6

11 L S C = 128 13 OLD NEW (MB) B = 128 (MB) 14 OLD NEW 12 L S C = 128 (MB) B = 128 (MB) 4. 4 gzip 2, bzip2 3, LZMA (7-Zip) 4, 4 Ubuntu 12.04 gzip 1.4 bzip2 1.0.6 LZMA 5.1.0 alpha 13 14 15 13 bzip2 14 bzip2 2 3 2 zip : http://www.gzip.org/ 3 bzip2 : http://www.bzip.org/ 4 LZMA : http://www.7-zip.org/ 15 OLD NEW 15 1/10 bzip2 VF 5. VF Blocked-Re-pair-VF 7

bzip2 Re-Merge [13] Re-Merge Re-Merge [1] M. Burrows and D. J. Wheeler. A block-sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, Palo Alto, California, 1994. [2] T. Kida. Suffix tree based VF-coding for compressed pattern matching. In Proc. of Data Compression Conference 2009 (DCC 2009), p. 449, Mar. 2009. [3] J. C. Kieffer, G. Nelson E.-H. Yang, and P. Cosman. Universal lossless compression via multilevel pattern matching. IEEE Trans. Inform. Theory, Vol. 46, No. 4, pp. 1227 1245, 2000. [4] J. C. Kieffer and E.-H. Yang. Grammar-based codes: a new class of universal lossless source codes. IEEE Trans. on Inform. Theory, Vol. 46, No. 3, pp. 737 754, 2000. [5] S. T. Klein and D. Shapira. Improved variable-to-fixed length codes. In Proc. of the 15th International Symposium on String Processing and Information Retrieval (SPIRE 2008), pp. 39 50, 2008. [6] N. J. Larsson and A. Moffat. Off-line dictionary-based compression. Proceedings of the IEEE, Vol. 88, No. 11, pp. 1722 1732, 2000. [7] S. Maruyama, Y. Tanaka, H. Sakamoto, and M. Takeda. Context-sensitive grammar transform: Compression and pattern matching. In Proc. of 15th International Symposium on String Processing and Information Retrieval (SPIRE 2008), pp. 27 38, Nov. 2008. [8] C. Nevill-Manning, I. Witten, and D. Maulsby. Compression by induction of hierarchical grammars. In Proc. of the Data Compression Conference 1994 (DCC 94), pp. 244 253. IEEE, 1994. [9] H. Sakamoto, T. Kida, and S. Shimozono. A space-saving linear-time algorithm for grammar-based compression. In String Processing and Information Retrieval, Vol. 3246 of Lecture Notes in Computer Science, pp. 218 229. Springer Berlin / Heidelberg, 2004. [10] B. P. Tunstall. Synthesis of noiseless compression codes. PhD thesis, Georgia Institute of Technology, Atlanta, GA, 1967. [11] T. Uemura, S. Yoshida, T. Kida, T. Asai, and S. Okamoto. Training parse trees for efficient VF coding. In Proc. of the 17th international conference on String processing and information retrieval (SPIRE 2010), pp. 179 184, 2010. [12] E. Ukkonen. On-line construction of suffix trees. Algorithmica, Vol. 14, No. 3, pp. 249 260, 1995. [13] R. Wan and A. Moffat. Block merging for off-line compression. J. Am. Soc. Inf. Sci. Technol., Vol. 58, No. 1, pp. 3 14, January 2007. [14] T. A. Welch. A technique for high performance data compression. IEEE Comput., Vol. 17, pp. 8 19, June 1984. [15] J. Ziv and A. Lempel. Compression of individual sequences via variable-length coding. IEEE Trans. on Inform. Theory, Vol. 24, No. 5, pp. 530 536, Sep 1978. [16],,,.. 156, 2012 12. 8