JUMAN++ version 1.01
September 2016
Morphological Analysis System JUMAN++ 1.01
Copyright 2016 Kyoto University. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Version 1.01 September 2016
Version 1.00 September 2016
1 JUMAN Chasen MeCab JUMAN++ RNN(recurrent neural network) RNN [6, 7] JUMAN++ JUMAN 3 Wikipedia JUMAN++ JUMAN++ JUMAN++ CREST ( : ) CREST JUMAN++ JUMAN JUMAN 28 9 Email: nl-resource@nlp.ist.i.kyoto-u.ac.jp 1
2 JUMAN++

2.1 System requirements

OS: Linux (CentOS 6.7)
: 4GB
: 2GB
gcc (4.9 or later)
Boost C++ Libraries (1.57 or later) [1]
gperftool [2]
libunwind [3] (needed by gperftool on 64bit systems)

2.2 Installing JUMAN++

% wget http://lotus.kuee.kyoto-u.ac.jp/nl-resource/jumanpp/jumanpp-1.01.tar.xz    (600MB)
% tar xJvf jumanpp-1.01.tar.xz
% cd jumanpp-1.01
% ./configure
% make
% sudo make install

By default JUMAN++ is installed under /usr/local/. To install it elsewhere, pass --prefix to configure:

% ./configure --prefix=/path/to/somewhere/

[1] http://www.boost.org/
[2] https://github.com/gperftools/gperftools
[3] http://www.nongnu.org/libunwind/
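2.3

JUMAN++ is run with the jumanpp command. Input is UTF-8 text. A line beginning with # is treated as a comment, and a line beginning with ##JUMAN++ as an in-text option (Section 2.7).

% cat cake.txt
# S-ID: 00000000-01

% cat cake.txt | jumanpp
# S-ID: 00000000-01 JUMAN++:1.01
6 1 * 0 * 0 " : / : - : "
9 1 * 0 * 0 NIL
2 * 0 1 2 " : / : "

Options:

-s, --specifics N      N-best output (Section 2.4)
-B, --beam width       Beam width (Section 4.1) (default: width = 5)
--partial              Partial annotation input (Section 5.3)
--force-single-path    See Section 4.1
-v, --version          Version
--debug                Debug output
-h, --help             Help

2.4

By default, output is in the JUMAN format: one morpheme per line, with the following space-separated fields.

surface form, reading, base form, part of speech, part-of-speech ID, part-of-speech subcategory, subcategory ID, conjugation type, conjugation type ID, conjugation form, conjugation form ID, semantic information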
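The default JUMAN-format line (twelve space-separated fields, the last one being quoted semantic information or NIL) can be read with a few lines of code. The following is a minimal sketch, not part of JUMAN++: the `Morpheme` type and `parse_juman_line` are hypothetical names, the sample strings in the usage note are placeholders rather than real analyzer output, and surface forms that contain an escaped space are not handled.

```python
from typing import NamedTuple


class Morpheme(NamedTuple):
    """One morpheme line of JUMAN-format output (hypothetical helper type)."""
    surface: str
    reading: str
    base: str
    pos: str
    pos_id: int
    subpos: str
    subpos_id: int
    conj_type: str
    conj_type_id: int
    conj_form: str
    conj_form_id: int
    semantics: str        # "NIL" or the text between the double quotes
    is_alternative: bool  # True when the line starts with "@ "


def parse_juman_line(line: str) -> Morpheme:
    """Split one output line into its twelve fields.

    Assumption: the first eleven fields contain no spaces, so escaped-space
    surface forms are out of scope for this sketch.
    """
    line = line.rstrip("\n")
    is_alternative = line.startswith("@ ")
    if is_alternative:
        line = line[2:]
    # The semantic information may itself contain spaces, so split at most 11 times.
    fields = line.split(" ", 11)
    if len(fields) != 12:
        raise ValueError("expected 12 fields, got %d" % len(fields))
    semantics = fields[11]
    if semantics != "NIL":
        semantics = semantics.strip('"')
    return Morpheme(
        fields[0], fields[1], fields[2],
        fields[3], int(fields[4]),
        fields[5], int(fields[6]),
        fields[7], int(fields[8]),
        fields[9], int(fields[10]),
        semantics, is_alternative,
    )
```

For example, `parse_juman_line('cake keeki cake POS 6 SUBPOS 1 * 0 * 0 "k1:v1 k2:v2"')` yields a `Morpheme` whose `pos_id` is 6 and whose `semantics` is `k1:v1 k2:v2`.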
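Example:

6 1 * 0 * 0 " : / : : ; - : "
9 1 * 0 * 0 NIL
9 2 * 0 * 0 NIL
2 * 0 1 2 " : / "
@ 2 * 0 1 2 " : / "
@ 2 * 0 1 2 " : / "

A field that does not apply is output as * with ID 0, and empty semantic information as NIL. Lines beginning with @ give alternative analyses of the preceding morpheme. A space in the input is escaped with a backslash:

\ \ \ 1 6 * 0 * 0 " : / "

With the -s option, the N-best analyses are output as a lattice, in a \t-separated format based on the JUMAN format. Lines beginning with # are comments and lines beginning with - are lattice nodes. Each node line carries the node ID, the IDs of the preceding nodes (separated by ;), the start and end positions, the JUMAN-format fields with their IDs, and finally three scores and the ranks of the N-best paths that contain the node. In the example below the third score is the sum of the first two.

# MA-SCORE rank1:-9.52349 rank2:-9.59653 rank3:-9.72774 rank4:-9.75929 rank5:-10.6167
- 21 0 0 1 / 7 1 * 0 * 0 :-1.92711 :-0.74683 :-2.67394 :1;2;3;4;5
- 44 21 2 2 / 9 3 * 0 * 0 :-0.294647 :-0.286072 :-0.580719 :5
- 43 21 2 2 / 9 1 * 0 * 0 :0.755723 :-0.286072 :0.469651 :1;2;3;4
- 93 44;43 3 5 / 2 * 0 10 2 : : / : : / :-0.741122 :-1.41253 :-2.15365 :3
- 70 44;43 3 3 / 9 2 * 0 * 0 :0.368752 :-0.264521 :0.104231 :1;2;4;5
- 137 70 4 5 / 14 7 1 2 :-1.35178 :-1.27911 :-2.63089 :4
- 136 70 4 5 / 2 * 0 1 2 :-1.01211 :-0.991401 :-2.00351 :1;5
- 135 70 4 5 / 2 * 0 1 2 :-1.01211 :-0.991401 :-2.00351 :1;5
- 134 70 4 5 / 2 * 0 1 2 :-1.01211 :-0.991401 :-2.00351 :1;5
- 133 70 4 5 / 2 * 0 10 2 :-1.11446 :-0.991401 :-2.10586 :2
- 132 70 4 5 / 2 * 0 10 2 : :-1.11446 :-0.991401 :-2.10586 :2

2.5

JUMAN++ can be run as a server with the scripts script/server.rb and script/client.rb in the script directory.

server.rb takes the JUMAN++ command line through --cmd. The server listens on TCP port 12000 by default; --port selects another port (1234 in this example).

$ ruby script/server.rb --cmd "jumanpp -B 5" --host host.name --port 1234

client.rb connects to the server given by --host <hostname>, on port 12000 by default or on the port given by --port.

$ echo " " | ruby script/client.rb --host host.name --port 1234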
6 1 * 0 * 0 " : / : - : "
9 1 * 0 * 0 NIL
2 * 0 1 2 " : / : "

2.6 Python

With the python binding pyknp, JUMAN++ can be used from python.

% wget http://nlp.ist.i.kyoto-u.ac.jp/nl-resource/knp/pyknp-0.3.tar.gz
% tar xvf pyknp-0.3.tar.gz
% cd pyknp-0.3
% sudo python setup.py install [--prefix=path]

The sample below is written for python 2.7; pyknp itself covers python 2 and python 3.

sample/python juman.py

    #-*- encoding: utf-8 -*-
    from pyknp import Jumanpp
    import sys
    import codecs

    # Read and write UTF-8 on standard input and output (python 2)
    sys.stdin = codecs.getreader('utf_8')(sys.stdin)
    sys.stdout = codecs.getwriter('utf_8')(sys.stdout)

    # Use Juman++ in subprocess mode
    jumanpp = Jumanpp()
    result = jumanpp.analysis(u" ")
    # Print the surface form (midasi) of each morpheme
    for mrph in result.mrph_list():
        print u" :%s" % (mrph.midasi)

For details of pyknp, see its Readme. A sample for KNP is in sample/python knp.py.
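2.7

Analysis options can be changed inside the input text with lines that begin with ##JUMAN++.

##JUMAN++ set-lattice N                N-best output with the given N
##JUMAN++ set-beam width               Beam width set to width
##JUMAN++ set-force-single-path
##JUMAN++ unset-force-single-path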
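The command-line options of Section 2.3 (-s, -B, --partial, --force-single-path) appear to correspond to the in-text commands of Section 2.7. As a rough illustration of driving jumanpp from a program, the sketch below assembles the argument list and pipes UTF-8 text to the command. `build_jumanpp_command` and `analyze` are hypothetical helper names, and `analyze` assumes that jumanpp is installed and on the PATH.

```python
import subprocess
from typing import List, Optional


def build_jumanpp_command(nbest: Optional[int] = None,
                          beam: Optional[int] = None,
                          partial: bool = False,
                          force_single_path: bool = False) -> List[str]:
    """Assemble a jumanpp argument list from the options listed in Section 2.3."""
    cmd = ["jumanpp"]
    if nbest is not None:
        if nbest < 1:
            raise ValueError("nbest must be a positive integer")
        cmd += ["-s", str(nbest)]          # N-best lattice output
    if beam is not None:
        if beam < 1:
            raise ValueError("beam must be a positive integer")
        cmd += ["-B", str(beam)]           # beam width (default 5)
    if partial:
        cmd.append("--partial")            # partially annotated input
    if force_single_path:
        cmd.append("--force-single-path")
    return cmd


def analyze(text: str, **options) -> str:
    """Send UTF-8 text to jumanpp on standard input and return its output.

    Assumption: the jumanpp executable is available on the PATH.
    """
    completed = subprocess.run(
        build_jumanpp_command(**options),
        input=text.encode("utf-8"),
        stdout=subprocess.PIPE,
        check=True,
    )
    return completed.stdout.decode("utf-8")
```

For instance, `build_jumanpp_command(nbest=5, beam=10)` returns `["jumanpp", "-s", "5", "-B", "10"]`.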
3 3.1 JUMAN++ JUMAN [1] 1. (cf. JUMAN.grammar) ( ) 14 2. (cf. JUMAN.kankei) 3. (cf. JUMAN.katuyou) 21 7 3.2 3.3 5 3.2 JUMAN++ dic.bin, dic.da 5 3.2.2 3.2.4 5 JUMAN 8
3.2.1 dic ; BNF ::= ( # ) ( # ( # )) ::= ( ) ::= ( ) ::= # # ::= ( # ) ::= ( # ) NIL ::= ( # ) NIL # # # # # # # (") \" : 2 ; : : ; : : - ; 3.2.2 Wikipedia Wiktionary Web 9
JUMAN ContentW.dic 3 Noun.koyuu.dic 8 Postp.dic Suffix.dic Rendaku.dic 3 ( : ) % echo " " jumanpp 6 4 * 0 * 0 " : / : : : " 6 1 * 0 * 0 " : / : ; - : " Onomatopeia.dic ( ) Wikipedia Wikipedia.dic Wikipedia 2016/06/01 83 JUMAN Wikipedia Wikipedia :Wikipedia Wikipedia Wikipedia 10
Wiktionary Wiktionary.dic Wiktionary 2016/06/01 Wikipedia 2,000 Wiktionary Wiktionary :Wiktionary Web Web.dic Web 1 1 : https://github.com/murawaki/lebyr 3.2.3 B.1 22 B.2 12 ( ) : : : : B.3 B.4 11
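3 : / : / B.5 ) : : : :

3.2.4

To add a user dictionary, place dictionary files in the format of Section 3.2.1 under dict-build/userdic/ and run the following commands in dict-build. To change the install location, run install.sh with --prefix /path/to/somewhere/.

% make
% sudo ./install.sh

3.3

2.2 5

1. 6 1995 CD-ROM Readme EUC-jp UTF-8

6 http://nlp.ist.i.kyoto-u.ac.jp/index.php?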
2. ( )3 7. 3. Web Web 1,000 JUMAN++ 7 http://nlp.ist.i.kyoto-u.ac.jp/index.php?kwdlc 13
1: : 4 JUMAN++ Recurrent Neural Network (RNN) JUMAN MeCab [6] 4.1 1 : 1-3 gram : RNN RNN RNN B N-best N N-best --force-single-path 14
4.2 4.2.1 6 2 * 0 * 0 " : / : " 2 * 0 16 8 " : / : : / " 14 7 31 2 " : / " 2 * 0 3 8 " : / : : : / " 14 5 18 2 " : / " 12 * 0 * 0 * 0 " : / " 14 7 31 2 " : / " 15
12 * 0 * 0 * 0 " : / " 14 7 31 2 " : / " 4.2.2 ( ) 6 7 * 0 * 0 " : " 6 7 * 0 * 0 " : " 6 7 * 0 * 0 " : " 1 2 * 0 * 0 NIL 6 7 * 0 * 0 " : " 1 2 * 0 * 0 NIL 6 7 * 0 * 0 " : " 4.2.3 8 * 0 * 0 * 0 " " 2 * 0 1 2 " : / : " 16
4.2.4 ( ) : 8 15 1 * 0 * 0 " : " 9 1 * 0 * 0 NIL 2 * 0 3 2 " : / : : : / " 4.3 JUMAN JUMAN JUMAN++ JUMAN JUMAN++ JUMAN++ 9 v ( : / v ) : JUMAN KNP : JUMAN % echo " " juman 6 1 * 0 * 0 " : / : : " @ 6 1 * 0 * 0 " : / : : " 9 1 * 0 * 0 NIL 2 * 0 2 8 " : / " 9 1 * 0 * 0 NIL 8 5.4 9 17
3 * 0 21 7 " : / : : / " 14 7 16 2 " : / " : JUMAN++ % echo " " jumanpp 6 1 * 0 * 0 " : / : : " @ 6 1 * 0 * 0 " : / : : " 9 1 * 0 * 0 NIL 6 1 * 0 * 0 " : / v : " 9 1 * 0 * 0 NIL 3 * 0 21 7 " : / : : / " 14 7 16 2 " : / " JUMAN JUMAN JUMAN++ JUMAN : JUMAN % echo " " juman 2 * 0 1 2 " : / " 5 * 0 30 3 " " 9 2 * 0 * 0 " " 14 5 18 2 " " : JUMAN++ % echo " " jumanpp 2 * 0 1 2 " : / " 5 * 0 30 3 NIL 9 2 * 0 * 0 NIL 14 5 18 2 " : / " 18
JUMAN JUMAN++ : JUMAN % echo " " juman 6 7 * 0 * 0 " : " : JUMAN++ % echo " " jumanpp 6 7 * 0 * 0 " : " 19
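Figure 2: (diagram labels: analyzed Web text, training, base model, RNNLM, tagged corpus, retraining, this system, RNN language model (retraining))

5 JUMAN++

The JUMAN++ model is trained as outlined in Figure 2, using analyzed Web text and the corpora of Section 3.3. Training uses Exact Soft Confidence-Weighted Learning [5]. See [5, 6, 7] for details.

5.1

Training data is created from KNP-format files (.knp) with the conversion script in the script directory.

$ cat xxxx.knp ... yyyy.knp | ruby script/corpus2train.rb > train.fmrp

The result, train.fmrp, is the training data.

5.2

JUMAN++ is trained with jumanpp --train. The following options apply.

-t, --train                   : training data
-i                            : number of iterations (default: 10)
-o, --outputmodel             : output model file (default: output.mdl)
-C                            : Exact Soft Confidence-Weighted parameter C (default: 1.0)
-P                            : Exact Soft Confidence-Weighted parameter ϕ (default: 1.65)
-B, --beam                    : beam width (default: 5)
--output-intermediate-model   : output intermediate models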
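To give a feel for what the -C and -P parameters control, the following is a minimal sketch of the binary SCW-I update as I understand it from [5]. It is not the structured learner implemented in JUMAN++: the diagonal covariance is a simplifying assumption, and the class name `DiagonalSCW1` is hypothetical. C caps the step size, and ϕ scales the confidence term in the margin condition.

```python
import math
from typing import List


class DiagonalSCW1:
    """Binary SCW-I with a diagonal covariance (illustrative simplification)."""

    def __init__(self, dim: int, C: float = 1.0, phi: float = 1.65):
        self.C = C
        self.phi = phi
        self.psi = 1.0 + phi * phi / 2.0
        self.zeta = 1.0 + phi * phi
        self.mu = [0.0] * dim      # mean of the weight distribution
        self.sigma = [1.0] * dim   # diagonal of the covariance

    def margin(self, x: List[float], y: int) -> float:
        """Signed margin y * (mu . x) for a label y in {-1, +1}."""
        return y * sum(m * xi for m, xi in zip(self.mu, x))

    def update(self, x: List[float], y: int) -> float:
        """Apply one update and return the step size alpha (0 if no update)."""
        m = self.margin(x, y)
        v = sum(s * xi * xi for s, xi in zip(self.sigma, x))  # confidence term
        # No update when the loss max(0, phi * sqrt(v) - m) is zero.
        if v == 0.0 or self.phi * math.sqrt(v) - m <= 0.0:
            return 0.0
        phi = self.phi
        alpha = (-m * self.psi
                 + math.sqrt(m * m * phi ** 4 / 4.0 + v * phi * phi * self.zeta)
                 ) / (v * self.zeta)
        alpha = min(self.C, max(0.0, alpha))  # C bounds the step size
        u = 0.25 * (-alpha * v * phi
                    + math.sqrt(alpha * alpha * v * v * phi * phi + 4.0 * v)) ** 2
        beta = alpha * phi / (math.sqrt(u) + v * alpha * phi)
        for i, xi in enumerate(x):
            sx = self.sigma[i] * xi
            self.mu[i] += alpha * y * sx      # move the mean toward the example
            self.sigma[i] -= beta * sx * sx   # shrink variance on seen features
        return alpha
```

A smaller C makes each update more conservative, which is the usual reason to lower it on noisy training data.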
To use the trained model, replace the installed weight.mdl with it.

% jumanpp --train train.fmrp --outputmodel trained.mdl
% sudo rm /usr/local/share/jumanpp/weight.mdl.map
% sudo cp trained.mdl /usr/local/share/jumanpp/weight.mdl
(when JUMAN++ is installed under /usr/local/)

Progress is reported during training in the following form.

ITERATION:0 50475/50476 avg:0.0202897 loss:0

See also the --output-intermediate-model option.

5.3

With the --partial option, JUMAN++ reads partially annotated input, in which \t is used as the annotation mark.

$ echo " " | jumanpp
7 2 * 0 * 0 NIL
8 * 0 * 0 * 0 " : / "
6 2 * 0 * 0 " : / : "
2 * 0 16 8 " : / : : / "
14 5 18 2 " : / "

$ echo " \t " | jumanpp --partial
7 2 * 0 * 0 NIL
6 1 * 0 * 0 " : / : : - "
9 2 * 0 * 0 NIL
6 2 * 0 * 0 " : / : "
2 * 0 16 8 " : / : : / "
14 5 18 2 " : / "

Partially annotated text such as sample/part-sample.txt can be converted to training data and used for training as in Section 5.2.

% cat sample/part-sample.txt | jumanpp --partial | ruby script/corpus2train.rb > partial.fmrp
% cat train.fmrp partial.fmrp > part_train.fmrp
% jumanpp --train part_train.fmrp --outputmodel part_trained.mdl

5.4

The RNN language model is trained with the Faster RNNLM (HS/NCE) toolkit [10].

[10] https://github.com/yandex/faster-rnnlm
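The RNN language model is trained with the Faster RNNLM (HS/NCE) toolkit. Each line of the training text for the Faster RNNLM toolkit is a sequence of items joined by underscores:

_ _ _ _ _ _ _ _ _ _ _ _ _

This text is produced from JUMAN++ output with the scripts in the script directory.

% cat corpus.txt | jumanpp | ruby script/corpus2train.rb | ruby script/fullmrp2basep.rb > data_for_lm.txt

The Faster RNNLM toolkit is run with the --nce and --direct options. Split data_for_lm.txt into training data data_for_lm.train and validation data data_for_lm.valid. Training produces lang.mdl and lang.mdl.nnet.

% faster-rnnlm --rnnlm lang.mdl --train data_for_lm.train --valid data_for_lm.valid --nce 22 --hidden 100 --direct 100 --direct-order 3 -bptt 4 --use-cuda 1 -independent

For details of the Faster RNNLM toolkit, see the toolkit's readme.

6

Windows (Cygwin) ( )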
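Section 5.4 uses data_for_lm.train and data_for_lm.valid, both derived from data_for_lm.txt, but the manual text available here does not say how the split is made. The sketch below is one possible way to do it, under stated assumptions: a random line-level split, a 10% validation share, and the helper names `split_train_valid` and `split_file` are all choices made for this illustration only.

```python
import random
from typing import List, Tuple


def split_train_valid(lines: List[str],
                      valid_ratio: float = 0.1,
                      seed: int = 0) -> Tuple[List[str], List[str]]:
    """Shuffle lines reproducibly and split them into (train, valid)."""
    if not 0.0 < valid_ratio < 1.0:
        raise ValueError("valid_ratio must be between 0 and 1")
    if len(lines) < 2:
        raise ValueError("need at least two lines to split")
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    n_valid = max(1, int(len(shuffled) * valid_ratio))
    return shuffled[n_valid:], shuffled[:n_valid]


def split_file(src: str = "data_for_lm.txt",
               train: str = "data_for_lm.train",
               valid: str = "data_for_lm.valid",
               valid_ratio: float = 0.1,
               seed: int = 0) -> None:
    """Write the train and validation files used by the command in Section 5.4."""
    with open(src, encoding="utf-8") as f:
        lines = f.readlines()
    train_lines, valid_lines = split_train_valid(lines, valid_ratio, seed)
    with open(train, "w", encoding="utf-8") as f:
        f.writelines(train_lines)
    with open(valid, "w", encoding="utf-8") as f:
        f.writelines(valid_lines)
```

A fixed seed keeps the split reproducible, so that retrained language models are compared on the same validation lines.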
[1] , 1989.
[2] 42, 1991.
[3] NL-101, 1994.
[4] JUMAN 2, 1996.
[5] Jialei Wang, Peilin Zhao and Steven C.H. Hoi. Exact Soft Confidence-Weighted Learning. Proceedings of 29th International Conference on Machine Learning, 2012.
[6] Hajime Morita, Daisuke Kawahara and Sadao Kurohashi. Morphological Analysis for Unsegmented Languages using Recurrent Neural Network Language Model. Proceedings of EMNLP 2015: Conference on Empirical Methods in Natural Language Processing, 2015.
[7] RNN 78, 2016.
A BNF # # NIL A.1 (JUMAN.grammar) A.1.1 ::= ( ) ( ( ) ) ::= ( # ) ( # %) ::= ::= ( # ) ( # %) A.1.2 (( %)) ; (( ) (( ) ( ) ( ) ( ) ( ) ( ))) ; ;... ; (( ) (( ) ( ) ( ) ( %) ( %) ( %))) ; 25
A.2 (JUMAN.kankei) A.2.1 ::= ( ( ) ) ::= ( # ) ( # # ) ::= # # A.2.2 (( ) ( )) ; (( ) ( )) (( ) ( )) ; ; ; 26
A.3 (JUMAN.katuyou) A.3.1 ::= ( # ( ) ) ::= ::= ( # ) ::= # * A.3.2 ( (( * ) ( ) ( * ) ( ) ( ) ( ) ( ) ( * ) ( ) ( ) ( ) ( )) ) ( (( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( )) ) 27
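The grammar files in this appendix (JUMAN.grammar, JUMAN.kankei, JUMAN.katuyou) are written as parenthesized lists with ; comments. The following is a minimal sketch of a reader for that kind of notation, not the parser used by JUMAN++. The function names are hypothetical, double-quoted strings with \" escapes are kept as single atoms, and markers such as % and * are treated as ordinary atoms.

```python
from typing import List, Union

SExpr = Union[str, List["SExpr"]]


def tokenize(text: str) -> List[str]:
    """Split text into parentheses, quoted strings and atoms, skipping ; comments."""
    tokens: List[str] = []
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch == ";":                      # comment runs to the end of the line
            while i < n and text[i] != "\n":
                i += 1
        elif ch in "()":
            tokens.append(ch)
            i += 1
        elif ch == '"':                    # quoted string, \" allowed inside
            j = i + 1
            while j < n and text[j] != '"':
                j += 2 if text[j] == "\\" else 1
            if j >= n:
                raise ValueError("unterminated string")
            tokens.append(text[i:j + 1])
            i = j + 1
        elif ch.isspace():
            i += 1
        else:                              # bare atom
            j = i
            while j < n and not text[j].isspace() and text[j] not in '();"':
                j += 1
            tokens.append(text[i:j])
            i = j
    return tokens


def parse(text: str) -> List[SExpr]:
    """Parse every top-level expression in text into nested Python lists."""
    tokens = tokenize(text)
    pos = 0

    def read() -> SExpr:
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of input")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            items: List[SExpr] = []
            while True:
                if pos >= len(tokens):
                    raise ValueError("missing closing parenthesis")
                if tokens[pos] == ")":
                    pos += 1
                    return items
                items.append(read())
        if tok == ")":
            raise ValueError("unexpected closing parenthesis")
        return tok

    result: List[SExpr] = []
    while pos < len(tokens):
        result.append(read())
    return result
```

With placeholder names, `parse("(pos-b ((sub-1) (sub-2 %)))")` gives `[["pos-b", [["sub-1"], ["sub-2", "%"]]]]`.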
B B.1 / / ( ) ID ID 28
B.2 - - - - - - - 29
- - - - - - - - - - - - - 30
- - ( ) ( ) ( ) ( ) ( ) ( ) - ( ) - 31
- - - - - - B.3 32
B.4 ( 4,000 ) Web 1,500 2,000 : : :7:0.00607 : : :45:0.00106 Web 150 : ( 50,000 ) : : : : : : : : : : : : : : : : : : : : : 35
( 700 ) : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : B.5 : : : : : : : : 36
: : : : : : : : : : ( ) : 37