gengo.dvi


Japanese word segmentation by Adaboost using the decision list as the weak learner

Hiroyuki Shinnou
Department of Systems Engineering, Faculty of Engineering, Ibaraki University

In this paper, we propose a new method of Japanese word segmentation by Adaboost using the decision list as the weak learner. Word segmentation is regarded as the classification problem of judging whether or not a word boundary exists between two characters. By solving this problem with the decision list method, we can conduct Japanese word segmentation. Our method has the advantage of not suffering from the unknown word problem, because we do not use dictionary information as an attribute of our decision list. Moreover, this approach lets us use Adaboost, which has recently been actively researched in the machine learning domain. Adaboost improves the precision of our decision list. In experiments, we built the decision list from the Kyoto University Corpus (about 40K sentences). The precision of this decision list was 97.52%. This value was much higher than the precision of the character-based tri-gram model, 92.76%. By using the Adaboost method, our precision improved to 98.49%. Furthermore, our word segmentation system was excellent at detecting unknown words.

KeyWords: Word segmentation, classification problem, decision list, Adaboost

Vol. 8 No. 2 Apr. 2001

1

HMM (Hidden Markov Model) HMM HMM a b c tri-gram 2 HMM ( 1997; Tsuji and Kageura 1997; 1998) HMM $+1$ $-1$ n-gram (Yarowsky 1994) (Freund and Schapire 1997)

tri-gram HMM

2

2.1

A sentence of $n$ characters is written $s = c_1 c_2 \cdots c_n$. Let $b_i$ be the position between $c_i$ and $c_{i+1}$. Each $b_i$ ($i = 1, 2, \ldots, n-1$) is assigned one of two classes, $+1$ or $-1$, and segmenting the sentence amounts to assigning these classes.

Fig. 1 Word segmentation by class assignment

2.2

Vol. 8 No. 2 Apr. 2001 step 1 step 2 step 3 n att 1, att 2,, att n att a C (att, a) C ((att, a), C) 1 ((att, a), C) f C f C Ĉ (att, a) pw(att, a) pw((att, a)) = log fĉ C Ĉ f C step 4 2.3 b i b i 1 7 1 1 Setting attributes att 1 c i 1c ic i+1 att 2 c ic i+1c i+2 att 3 c i 1c i att 4 c ic i+1 att 5 c i+1c i+2 att 6 1 ((c i ), (c i+1 )) att 7 2 ((c i ), (c i+1 )) 6 7 6 7 2 9 6

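Table 2 Classification of character types

default default default default

6

2.4

The class ($+1$ or $-1$) of $b_5$, between the 5th and 6th characters, is judged from its 7 attribute values:

$(att_1, )$ $(att_2, )$ $(att_3, )$ $(att_4, )$ $(att_5, )$ $(att_6, )$ $(att_7, )$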
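The judgement of Section 2.4 can be sketched as a lookup: among the rules whose attribute value is observed at the boundary, take the one with the largest identification strength. The strengths and classes below are the ones listed in Table 3 for $b_5$. The attribute values themselves did not survive extraction, so `"v3"` to `"v7"` are placeholders, and the function name `judge` and its `default` class are assumptions for illustration only.

```python
def judge(rules, observed, default=-1):
    """Return (class, strength) of the strongest rule that matches.

    rules: list of (strength, (att, value), cls).
    observed: set of (att, value) pairs seen at the boundary.
    default: class returned when no rule matches (assumption).
    """
    best = None
    for strength, pair, cls in rules:
        if pair in observed and (best is None or strength > best[0]):
            best = (strength, pair, cls)
    if best is None:
        return default, None
    return best[2], best[0]


# Strengths and classes from Table 3; attribute values are placeholders.
table3_rules = [
    (2.74377, ("att3", "v3"), 1),
    (5.83188, ("att4", "v4"), 1),
    (1.64565, ("att5", "v5"), 1),
    (6.33293, ("att6", "v6"), 1),
    (8.64488, ("att7", "v7"), 1),
]
# Table 3 lists no strength for att1 and att2, so they match no rule here.
observed_b5 = {
    ("att1", "v1"), ("att2", "v2"), ("att3", "v3"), ("att4", "v4"),
    ("att5", "v5"), ("att6", "v6"), ("att7", "v7"),
}
result = judge(table3_rules, observed_b5)
```

With these inputs the strongest matching rule is the one for $att_7$, so `result` carries the class $+1$.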

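Table 3 Example of class judgement

Attribute value    Class    Identification strength
$(att_1, )$
$(att_2, )$
$(att_3, )$        $+1$     2.74377
$(att_4, )$        $+1$     5.83188
$(att_5, )$        $+1$     1.64565
$(att_6, )$        $+1$     6.33293
$(att_7, )$        $+1$     8.64488

The strongest rule is the one for $(att_7, )$, with class $+1$, so $b_5$ is judged $+1$.

3

Consider a two-class problem with the label set $Y = \{+1, -1\}$. The training data are $(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)$, where $y_i$ is the class ($+1$ or $-1$) of $x_i$. A weak hypothesis $h_1$ is learned first. The weights of the $x_i$ whose class $y_i$ is misjudged by $h_1$ are then raised, and $h_2$ is learned from the reweighted $(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)$. Repeating this $T$ times yields $h_1, h_2, \ldots, h_T$. For example, with $T = 3$, suppose $x$ is judged $+1$ by $h_1$, $-1$ by $h_2$ and $+1$ by $h_3$, with weights 1, 2.0 and 2.2. The weighted sum is $+1.2$, so the final class is $+1$.

Given: $(x_1, y_1), \ldots, (x_m, y_m)$ where $x_i \in X$, $y_i \in Y = \{-1, +1\}$
Initialize $D_1(i) = 1/m$
For $t = 1, \ldots, T$:
    Train weak learner using distribution $D_t$
    Get weak hypothesis $h_t : X \to Y$ with error $\epsilon_t = \Pr_{i \sim D_t}[h_t(x_i) \neq y_i]$
    Choose $\alpha_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$
    Update:
    $$D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times \begin{cases} e^{-\alpha_t} & \text{if } h_t(x_i) = y_i \\ e^{\alpha_t} & \text{if } h_t(x_i) \neq y_i \end{cases}$$
    where $Z_t$ is a normalization factor
Output the final hypothesis: $H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$

Fig. 2 AdaBoost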
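The AdaBoost procedure of Fig. 2 can be sketched in Python as follows. This is an illustrative sketch only: the weak learner here is a one-dimensional threshold stump standing in for the decision list, the toy data are invented for the example, and the handling of $\epsilon_t = 0$ and $\epsilon_t \geq 0.5$ is an assumption, since Fig. 2 does not specify it. All names are hypothetical.

```python
import math


def adaboost(xs, ys, train_weak, rounds):
    """Run AdaBoost as in Fig. 2; return a list of (alpha_t, h_t)."""
    m = len(xs)
    dist = [1.0 / m] * m                      # D_1(i) = 1/m
    ensemble = []
    for _ in range(rounds):
        h = train_weak(xs, ys, dist)
        # epsilon_t: total weight of the misclassified examples.
        err = sum(d for x, y, d in zip(xs, ys, dist) if h(x) != y)
        if err >= 0.5:                        # no better than chance: stop (assumption)
            break
        err = max(err, 1e-10)                 # avoid division by zero (assumption)
        alpha = 0.5 * math.log((1.0 - err) / err)
        # Lower the weight of correct examples, raise that of wrong ones.
        dist = [d * math.exp(-alpha if h(x) == y else alpha)
                for x, y, d in zip(xs, ys, dist)]
        z = sum(dist)                         # Z_t, the normalization factor
        dist = [d / z for d in dist]
        ensemble.append((alpha, h))
    return ensemble


def predict(ensemble, x):
    """H(x) = sign(sum_t alpha_t * h_t(x)); a zero sum is mapped to +1."""
    score = sum(alpha * h(x) for alpha, h in ensemble)
    return 1 if score >= 0 else -1


def train_stump(xs, ys, dist):
    """Stand-in weak learner: best threshold rule under the weights dist."""
    best = None
    for threshold in sorted(set(xs)):
        for sign in (1, -1):
            def h(x, t=threshold, s=sign):
                return s if x >= t else -s
            err = sum(d for x, y, d in zip(xs, ys, dist) if h(x) != y)
            if best is None or err < best[0]:
                best = (err, h)
    return best[1]


# Toy data that no single stump classifies perfectly.
toy_x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
toy_y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
toy_ensemble = adaboost(toy_x, toy_y, train_stump, rounds=3)
toy_predictions = [predict(toy_ensemble, x) for x in toy_x]
```

To boost a decision list instead, `train_weak` would build the list from frequencies weighted by `dist` rather than from raw counts.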

/ / / /

4 5 $b_4$

$(att_1, )$ $(att_2, )$ $(att_3, )$ $(att_4, )$ $(att_5, )$ $(att_6, )$ $(att_7, )$

1 step 2 1

$((att_1, ), -1)$ $((att_2, ), -1)$ $((att_3, ), -1)$ $((att_4, ), -1)$ $((att_5, ), -1)$ $((att_6, ), -1)$ $((att_7, ), -1)$

1 $h_k$ 4 5 $+1$ $h_{k+1}$ 1 step 2 2

4

4.1

n-gram n-gram ( 1998) n-gram n-gram Viterbi HMM ( 4 )

950117.KNP 1,234 (*1) 35,717 1,234 56,411

tri-gram: CMU-Cambridge Toolkit (*2), Witten-Bell discounting, 0 ( 1999). Of 56,411, the tri-gram model got 52,328 right and 4,083 wrong: 92.76%.

Decision list: 7, 136,114. Of 56,411, the decision list got 55,015 right and 1,396 wrong: 97.52%, against 92.76% for tri-gram.

4.2

% 3 3 3. Of 56,411, 55,560 right and 851 wrong: 98.49%.

4.3

3 35,717 1,234

*1 EOS
*2 CMU-Cambridge Toolkit http://svr-www.eng.cam.ac.uk/~prc14/toolkit.html
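The precision figures of Sections 4.1 and 4.2 can be checked with a few lines of arithmetic. The sketch below assumes that, for each method, the two numbers following 56,411 are the counts of correct and incorrect decisions; that reading is inferred from the fact that each pair sums to 56,411, since the column labels did not survive extraction. The method names used as dictionary keys are descriptive labels chosen for this example.

```python
TOTAL_DECISIONS = 56411

# (correct, incorrect) per method, as read from Sections 4.1 and 4.2.
counts = {
    "tri-gram": (52328, 4083),
    "decision list": (55015, 1396),
    "decision list + Adaboost": (55560, 851),
}


def precision_percent(correct, incorrect):
    """Share of correct decisions, in percent."""
    return 100.0 * correct / (correct + incorrect)


precisions = {}
for method, (correct, incorrect) in counts.items():
    # Each pair should account for every decision in the test set.
    assert correct + incorrect == TOTAL_DECISIONS
    precisions[method] = precision_percent(correct, incorrect)
    print(f"{method}: {precisions[method]:.3f}%")
```

All three ratios agree with the reported 92.76%, 97.52% and 98.49% to within 0.01 percentage point.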

Fig. 3 Precision by boosting (vertical axis: precision, 97.4 to 98.6; horizontal axis: 1 to 7)

914,392 41,890 32,764 6,479 1,024 832 1,024 832

688 of 1,024 (67.2%), 562 of 832 (67.5%)

9

(1) 124 123
(2) 94 91 (1)
(3) 44 41
(4) 7 3
(5) 210 156
(6) 38 32
(7) 21 17
(8) 426 310 3
(9) 64 59

9 4 3 (7) (8)

Table 4 Detection of unknown words

(1)      124      101            124
(2)       94       57              0
(3)       44       40             44
(4)        7        5              7
(5)      210      188            210
(6)       38       19              0
(7)       21        4             21
(8)      426      246              0
(9)       64       28              0
       1,024      688 (67.2%)    406 (39.6%)

4 39.6% 67.2% (1),(3),(4),(5),(7) 83.3%

5

2 56,411 0 1 83 57.83% 1 2 2 3 4 (Quinlan 1993) (Ratnaparkhi 1998) 7 bi-gram tri-gram 7 14

Fig. 4 Identification strength and precision (vertical axis: 0.55 to 1; horizontal axis: 0 to 14)

bi-gram tri-gram 50%

4 3 1 (,, 1999) 2 3 ( 1998; 1999) ( 2000) (Shinnou 2000)

6

n-gram K11 IV 71. 4 336 851 7

Freund, Y. and Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55 (1), 119-139.

Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers.

Ratnaparkhi, A. (1998). Maximum Entropy Models for Natural Language Ambiguity Resolution. PhD thesis, University of Pennsylvania.

Shinnou, H. (2000). Deterministic Japanese Word Segmentation by Decision List Method. In PRICAI-2000 (poster session), pp. 822-822.

Tsuji, K. and Kageura, K. (1997). An HMM-based Method for Segmenting Japanese Terms and Keywords based on Domain-Specific Bilingual Corpora. In The 4th Natural Language Processing Pacific Rim Symposium, pp. 557-560.

Yarowsky, D. (1994). Decision Lists for Lexical Ambiguity Resolution: Application to Accent Restoration in Spanish and French. In 32nd Annual Meeting of the Association for Computational Linguistics, pp. 88-95.

(1999). Suffix array., NL-131-7.

(1998). PPM., NL-128-2.

,, (1999).., 6 (7), 93-108.

(1999)...

(2000).., NL-140-1.

(1998).. 4, pp. 524-527.

(1997).. 3, pp. 421-424.

: 36 60. 62., 5 4 9 10

( )
(2000 8 26)
(2000 10 6)
(2001 1 12)