JUMAN++ version


JUMAN++ version 1.0, September 2016

Morphological Analysis System JUMAN++ 1.0

Copyright 2016 Kyoto University. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Version 1.0, September 2016


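1 Introduction

JUMAN++ is a morphological analyzer in the line of JUMAN, ChaSen, and MeCab. It differs from them in using an RNN (recurrent neural network) language model to score candidate analyses [6, 7]. Its dictionary extends the JUMAN dictionary with vocabulary drawn from sources such as Wikipedia.

JUMAN++ was developed with support from CREST.

September 2016
Email: nl-resource@nlp.ist.i.kyoto-u.ac.jp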

2 Using JUMAN++

2.1 System requirements

  OS: Linux (checked on CentOS 6.7)
  Memory: 4GB
  Disk: 2GB
  gcc (4.9 or later)
  Boost C++ Libraries (1.57 or later) [1]
  gperftool [2]
  libunwind [3] (needed by gperftool in a 64bit environment)

2.2 Installation

Download and unpack the JUMAN++ archive:

    % wget http://lotus.kuee.kyoto-u.ac.jp/nl-resource/jumanpp/jumanpp-1.0.tar.xz
      (about 1.3GB)
    % tar xJf jumanpp-1.0.tar.xz
    % cd jumanpp-1.0
    % ls
    jumanpp-src  jumanpp-resource  jumanpp-manual-1.0.pdf  README.md  README_ja.md

[1] http://www.boost.org/
[2] https://github.com/gperftools/gperftools
[3] http://www.nongnu.org/libunwind/

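Set the JUMAN++ resource directory in $JPPRC, move the resources there, then build and install:

    % export JPPRC=/usr/local/share/jumanpp-resource/
      (when the resources go in /usr/local/share/jumanpp-resource/)
    % sudo mv jumanpp-resource $JPPRC
    % cd jumanpp-src
    % ./configure --enable-default-resource-path=$JPPRC
    % make
    % sudo make install

By default JUMAN++ is installed under /usr/local/. To install it elsewhere, run ./configure with --prefix=/path/to/somewhere/.

2.3 Basic usage

The jumanpp command reads UTF-8 text from standard input and writes the analysis to standard output. A line that starts with # is a comment; it is copied to the output with the JUMAN++ version appended. A line that starts with ##JUMAN++ changes options during analysis (Section 2.7).

    % cat cake.txt
    # S-ID: 00000000-01

    % cat cake.txt | jumanpp
    # S-ID: 00000000-01 JUMAN++:1.00
    6 1 * 0 * 0 " : / : - : "
    9 1 * 0 * 0 NIL
    2 * 0 1 2 " : / : "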

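Command-line options:

    -D, --dir path          resource directory
    -s, --specifics N       N-best lattice output (Section 2.4)
    -B, --beam width        beam width (Section 4.1) (default: width = 5)
    --partial               partially annotated input (Section 5.3)
    --force-single-path     see Section 4.1
    -v, --version           print the version
    --debug                 debug output
    -h, --help              print help

Instead of passing -D, --dir each time, the resource directory can be written in $HOME/.jumanpprc:

    % echo [ ] > $HOME/.jumanpprc

The resource directory is looked up in this order:

    1. the command-line option (-D, --dir)
    2. $HOME/.jumanpprc
    3. the path given to --enable-default-resource-path at configure time
    4.

2.4 Output format

JUMAN format (default)

Each morpheme is printed on one line with the following space-separated fields: surface form, reading, base form, part of speech, part-of-speech ID, part-of-speech subcategory, subcategory ID, conjugation type, conjugation type ID, conjugation form, conjugation form ID, semantic information.

Example:

    6 1 * 0 * 0 " : / : : ; - : "
    9 1 * 0 * 0 NIL
    9 2 * 0 * 0 NIL
    2 * 0 1 2 " : / "
    @ 2 * 0 1 2 " : / "
    @ 2 * 0 1 2 " : / "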

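A field that does not apply is printed as * with ID 0, and missing semantic information is printed as NIL. A line that starts with @ is an alternative analysis of the preceding morpheme. A whitespace token is escaped with a backslash:

    \  \  \  1 6 * 0 * 0 " : / "

Lattice format (-s)

With -s, the N-best analyses are printed as a lattice. Unlike the JUMAN format, fields are separated by \t. The "# MA-SCORE" line gives the score of each ranked path. Each line that starts with - is one lattice node: its node ID, the IDs of the nodes that can precede it (separated by ;), its start and end positions, the morpheme fields with their IDs, three scores (the third is the sum of the first two), and the ranks of the N-best paths that pass through the node.

    # - :
    # MA-SCORE rank1:-9.52349 rank2:-9.59653 rank3:-9.72774 rank4:-9.75929 rank5:-10.6167
    - 21 0 0 1 / 7 1 * 0 * 0 :-1.92711 :-0.74683 :-2.67394 :1;2;3;4;5
    - 44 21 2 2 / 9 3 * 0 * 0 :-0.294647 :-0.286072 :-0.580719 :5
    - 43 21 2 2 / 9 1 * 0 * 0 :0.755723 :-0.286072 :0.469651 :1;2;3;4
    - 93 44;43 3 5 / 2 * 0 10 2 : : / : : / :-0.741122 :-1.41253 :-2.15365 :3
    - 70 44;43 3 3 / 9 2 * 0 * 0 :0.368752 :-0.264521 :0.104231 :1;2;4;5
    - 137 70 4 5 / 14 7 1 2 :-1.35178 :-1.27911 :-2.63089 :4
    - 136 70 4 5 / 2 * 0 1 2 :-1.01211 :-0.991401 :-2.00351 :1;5
    - 135 70 4 5 / 2 * 0 1 2 :-1.01211 :-0.991401 :-2.00351 :1;5
    - 134 70 4 5 / 2 * 0 1 2 :-1.01211 :-0.991401 :-2.00351 :1;5
    - 133 70 4 5 / 2 * 0 10 2 :-1.11446 :-0.991401 :-2.10586 :2
    - 132 70 4 5 / 2 * 0 10 2 : :-1.11446 :-0.991401 :-2.10586 :2

2.5 Server mode

JUMAN++ can be run as a server with the scripts script/server.rb and script/client.rb under jumanpp-src.

Start the server with server.rb, passing the JUMAN++ command line in --cmd. The server listens on TCP port 12000 unless another port is given, for example --port 1234:

    $ ruby script/server.rb --cmd "jumanpp -B 5" --host host.name --port 1234

Send text to the server with client.rb. Specify the server with --host <hostname>; the port is 12000 unless given, for example --port 1234:

    $ echo " " | ruby script/client.rb --host host.name --port 1234
    6 1 * 0 * 0 " : / : - : "
    9 1 * 0 * 0 NIL
    2 * 0 1 2 " : / : "

2.6 Calling from Python

The python module pyknp lets python programs call JUMAN++.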

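Install pyknp as follows:

    % wget http://nlp.ist.i.kyoto-u.ac.jp/nl-resource/knp/pyknp-0.3.zip
    % unzip pyknp-0.3.zip
    % cd pyknp-0.3
    % sudo python setup.py install [--prefix=path]

pyknp was checked with python 2.7. pyknp is available for python 2 and python 3.

A sample program, juman.py, is in jumanpp-src/sample/python:

    #-*- encoding: utf-8 -*-
    from pyknp import Jumanpp
    import sys
    import codecs

    # Read and write UTF-8 on the standard streams (python 2)
    sys.stdin = codecs.getreader('utf_8')(sys.stdin)
    sys.stdout = codecs.getwriter('utf_8')(sys.stdout)

    # Use Juman++ in subprocess mode
    jumanpp = Jumanpp()
    result = jumanpp.analysis(u" ")
    # Print the surface form (midasi) of each morpheme
    for mrph in result.mrph_list():
        print u" :%s" % (mrph.midasi)

See the pyknp Readme for details. A sample that uses KNP, knp.py, is also in jumanpp-src/sample/python.

2.7 Changing options during analysis

A line in the input that starts with ##JUMAN++ changes an option for the lines that follow it.

    ##JUMAN++ set-lattice N
        Output the N-best lattice with the given N.

    ##JUMAN++ set-beam width
        Set the beam width to width.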

    ##JUMAN++ set-force-single-path
        Turn on --force-single-path.

    ##JUMAN++ unset-force-single-path
        Turn off --force-single-path.
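As a rough illustration of the in-stream option lines of Section 2.7, the Python sketch below interleaves ##JUMAN++ lines with input sentences before the text is piped to jumanpp. The helper name option_line and the placeholder sentences are hypothetical. The separator between the ##JUMAN++ prefix and the command is an assumption (a tab by default here), so it is exposed as a parameter; only the four command names come from this manual.

```python
# Sketch only: builds an input stream containing in-stream option lines.
PREFIX = "##JUMAN++"
KNOWN_COMMANDS = {
    "set-lattice",               # takes N
    "set-beam",                  # takes a beam width
    "set-force-single-path",     # no argument
    "unset-force-single-path",   # no argument
}


def option_line(command, value=None, sep="\t"):
    """Format one option line; `sep` (assumed to be a tab) follows the prefix."""
    if command not in KNOWN_COMMANDS:
        raise ValueError("unknown command: %s" % command)
    body = command if value is None else "%s %s" % (command, value)
    return PREFIX + sep + body


# Placeholder sentences; real input would be UTF-8 Japanese text.
lines = [
    option_line("set-beam", 10),
    "first sentence",
    option_line("set-lattice", 3),
    "second sentence",
    option_line("unset-force-single-path"),
    "third sentence",
]
stream = "\n".join(lines) + "\n"
```

The resulting string can be written to the standard input of a jumanpp process, for example through subprocess, so that each sentence is analyzed under the options set before it.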

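3 Grammar, dictionaries, and corpora

3.1 Grammar

JUMAN++ uses the grammar of JUMAN [1]. It is defined by three files, whose formats are given in Appendix A:

    1. Parts of speech (cf. JUMAN.grammar)
    2. Conjugation relations (cf. JUMAN.kankei)
    3. Conjugation types and forms (cf. JUMAN.katuyou)

The dictionaries are described in Section 3.2, the corpora in Section 3.3, and training in Section 5.

3.2 Dictionaries

JUMAN++ reads its dictionary from the compiled files dic.bin and dic.da. Section 3.2.2 lists the source dictionaries and Section 3.2.4 explains how to rebuild the compiled files.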

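3.2.1 Dictionary format

Dictionary source files have the extension dic. Text from ; to the end of a line is a comment. The format in BNF:

    ::= ( # ) ( # ( # ))
    ::= ( )
    ::= ( )
    ::= # #
    ::= ( # )
    ::= ( # ) NIL
    ::= ( # ) NIL

Strings are enclosed in double quotes ("); a double quote inside a string is written \".

3.2.2 Bundled dictionaries

The dictionary is built from the JUMAN dictionary and from dictionaries derived from Wikipedia, Wiktionary, and Web text.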

JUMAN dictionary

    ContentW.dic
    Noun.koyuu.dic
    Postp.dic
    Suffix.dic
    Rendaku.dic
    Onomatopeia.dic

Example:

    % echo " " | jumanpp
    6 4 * 0 * 0 " : / : : : "
    6 1 * 0 * 0 " : / : ; - : "

Wikipedia dictionary

    Wikipedia.dic, built from the Wikipedia data of 2016/06/01. These entries add vocabulary that the JUMAN dictionary does not cover, and their semantic information marks them as coming from Wikipedia.

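Wiktionary dictionary

    Wiktionary.dic, built from the Wiktionary data of 2016/06/01. Their semantic information marks these entries as coming from Wiktionary.

Web dictionary

    Web.dic, built from vocabulary acquired automatically from Web text [1].

[1] https://github.com/murawaki/lebyr

3.2.3 Semantic information

The semantic information attached to dictionary entries is described in Appendix B (B.1 to B.4).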

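Further semantic information is described in Appendix B.5.

3.2.4 Adding entries to the dictionary

The dictionary sources are in $JPPRC/source. To add entries, write them in the format of Section 3.2.1 in a file under $JPPRC/source/userdic/, then rebuild the dictionary in $JPPRC/source and copy the result to $JPPRC:

    % make
    % sudo cp jumanpp_dic/* $JPPRC/

3.3 Corpora

The models distributed in Section 2.2 were trained (Section 5) on the following corpora.

    1. A corpus [6] that requires the 1995 CD-ROM to restore; see its Readme. The corpus must be converted from EUC-jp to UTF-8.

[6] http://nlp.ist.i.kyoto-u.ac.jp/index.php?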

    2. A corpus of the first 3 sentences of Web documents [7].

    3. Web text annotated for JUMAN++ (1,000).

[7] http://nlp.ist.i.kyoto-u.ac.jp/index.php?kwdlc
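Returning to the dictionary source format of Section 3.2.1: since entries are parenthesised expressions with ; comments and double-quoted strings, a small reader is enough to inspect them. The Python sketch below is a minimal, non-authoritative illustration. The function names are hypothetical, and the sample entry is an assumption modelled on JUMAN dictionary conventions rather than a line copied from the distributed files.

```python
import re

# One token per match: whitespace, a ; comment, a parenthesis,
# a double-quoted string (with \" escapes), or a bare atom.
TOKEN = re.compile(r'''
    \s+
  | ;[^\n]*
  | (?P<paren>[()])
  | "(?P<string>(?:\\.|[^"\\])*)"
  | (?P<atom>[^\s()";]+)
''', re.VERBOSE)


def tokenize(text):
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError("unexpected character at offset %d" % pos)
        pos = m.end()
        if m.group("paren"):
            yield m.group("paren")
        elif m.group("string") is not None:
            # Undo the \" escape used inside quoted strings.
            yield ("str", m.group("string").replace('\\"', '"'))
        elif m.group("atom"):
            yield ("atom", m.group("atom"))
        # Whitespace and comments produce no token.


def parse(text):
    """Return the top-level expressions as nested Python lists."""
    stack = [[]]
    for tok in tokenize(text):
        if tok == "(":
            stack.append([])
        elif tok == ")":
            if len(stack) == 1:
                raise ValueError("unbalanced ')'")
            finished = stack.pop()
            stack[-1].append(finished)
        else:
            stack[-1].append(tok[1])
    if len(stack) != 1:
        raise ValueError("unbalanced '('")
    return stack[0]


# Illustrative entry (assumed shape, not taken from the distributed dictionary).
entry = '''
; sample user-dictionary entry
(名詞 (普通名詞 ((読み けーき)(見出し語 ケーキ)(意味情報 "代表表記:ケーキ/けーき"))))
'''
tree = parse(entry)
```

A reader like this can be used to check a file under $JPPRC/source/userdic/ for unbalanced parentheses before running make.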

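[Figure 1]

4 Analysis with JUMAN++

JUMAN++ uses a Recurrent Neural Network (RNN) language model in addition to the kind of model used by JUMAN and MeCab [6].

4.1 Analysis model

Figure 1 outlines the analysis. Candidate analyses are scored with two components:

    Linear model: features such as word 1-3 grams
    Language model: RNN

Because the RNN score depends on the preceding words, JUMAN++ searches the lattice with a beam of width B. The N-best output lists the N highest-scoring analyses. With --force-single-path, a single path is output.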
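To make the role of the beam width B concrete, here is a minimal Python sketch of beam search over a word lattice. It is an illustration under stated assumptions, not the JUMAN++ implementation: the node tuples, the toy lattice, and the lm_score callback (a stand-in for an RNN language model that sees the whole history) are all hypothetical.

```python
from collections import defaultdict


def beam_search(length, nodes, lm_score, beam_width=5):
    """Return up to `beam_width` best (score, path) pairs covering [0, length).

    nodes: iterable of (start, end, surface, feature_score) with end > start.
    lm_score(history, surface): score of `surface` given the words so far.
    """
    by_start = defaultdict(list)
    for start, end, surface, feat in nodes:
        if end <= start:
            raise ValueError("node must cover at least one position")
        by_start[start].append((end, surface, feat))

    beams = defaultdict(list)      # position -> partial paths ending there
    beams[0] = [(0.0, ())]
    for pos in range(length):
        # Prune: keep only the best `beam_width` partial paths at this position.
        beams[pos] = sorted(beams[pos], key=lambda h: h[0], reverse=True)[:beam_width]
        for score, path in beams[pos]:
            for end, surface, feat in by_start[pos]:
                total = score + feat + lm_score(path, surface)
                beams[end].append((total, path + (surface,)))
    return sorted(beams[length], key=lambda h: h[0], reverse=True)[:beam_width]


# Toy lattice over the 3-character string "abc".
toy_nodes = [
    (0, 1, "a", -1.0), (0, 2, "ab", -0.5),
    (1, 2, "b", -2.0), (1, 3, "bc", -1.0),
    (2, 3, "c", -0.5),
]
constant_lm = lambda history, surface: -0.5   # stand-in for an RNN language model
n_best = beam_search(3, toy_nodes, constant_lm, beam_width=5)
best_score, best_path = n_best[0]
```

Each complete path's score is the sum of its feature scores and language-model scores, which matches how the third score of a lattice node in Section 2.4 equals the sum of the first two. A wider beam keeps more partial paths per position, at the cost of more language-model evaluations.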

4.2

4.2.1

    6 2 * 0 * 0 " : / : "
    2 * 0 16 8 " : / : : / "
    14 7 31 2 " : / "

    2 * 0 3 8 " : / : : : / "
    14 5 18 2 " : / "

    12 * 0 * 0 * 0 " : / "
    14 7 31 2 " : / "

    12 * 0 * 0 * 0 " : / "
    14 7 31 2 " : / "

4.2.2

    6 7 * 0 * 0 " : "

    6 7 * 0 * 0 " : "

    6 7 * 0 * 0 " : "
    1 2 * 0 * 0 NIL
    6 7 * 0 * 0 " : "
    1 2 * 0 * 0 NIL
    6 7 * 0 * 0 " : "

4.2.3

    8 * 0 * 0 * 0 " "
    2 * 0 1 2 " : / : "

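4.2.4

    15 1 * 0 * 0 " : "
    9 1 * 0 * 0 NIL
    2 * 0 3 2 " : / : : : / "

4.3 Differences from JUMAN

The output of JUMAN++ follows JUMAN, with the differences below.

In JUMAN++, some entries carry a v at the end of their representative form ( : / v ). This is information that JUMAN leaves to KNP.

Example: JUMAN

    % echo " " | juman
    6 1 * 0 * 0 " : / : : "
    @ 6 1 * 0 * 0 " : / : : "
    9 1 * 0 * 0 NIL
    2 * 0 2 8 " : / "
    9 1 * 0 * 0 NIL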

    3 * 0 21 7 " : / : : / "
    14 7 16 2 " : / "

Example: JUMAN++

    % echo " " | jumanpp
    6 1 * 0 * 0 " : / : : "
    @ 6 1 * 0 * 0 " : / : : "
    9 1 * 0 * 0 NIL
    6 1 * 0 * 0 " : / v : "
    9 1 * 0 * 0 NIL
    3 * 0 21 7 " : / : : / "
    14 7 16 2 " : / "

The semantic information field also differs between JUMAN and JUMAN++.

Example: JUMAN

    % echo " " | juman
    2 * 0 1 2 " : / "
    5 * 0 30 3 " "
    9 2 * 0 * 0 " "
    14 5 18 2 " "

Example: JUMAN++

    % echo " " | jumanpp
    2 * 0 1 2 " : / "
    5 * 0 30 3 NIL
    9 2 * 0 * 0 NIL
    14 5 18 2 " : / "

A further difference between JUMAN and JUMAN++:

Example: JUMAN

    % echo " " | juman
    6 7 * 0 * 0 " : "

Example: JUMAN++

    % echo " " | jumanpp
    6 7 * 0 * 0 " : "
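The output lines shown throughout this section follow the default format of Section 2.4: twelve space-separated fields, an optional leading @ for alternative analyses, NIL for missing semantic information, and a backslash before a literal space. The Python sketch below parses one such line. The class and function names are hypothetical, and the sample lines in the usage are illustrative assumptions chosen to match the numeric fields of the examples, not text recovered from this manual.

```python
import re
from typing import NamedTuple, Optional


class Morpheme(NamedTuple):
    surface: str
    reading: str
    base: str
    pos: str
    pos_id: int
    subpos: str
    subpos_id: int
    conj_type: str
    conj_type_id: int
    conj_form: str
    conj_form_id: int
    info: Optional[str]       # None when the field is NIL
    is_alternative: bool      # True for lines that start with "@ "


# Split on spaces that are not escaped with a backslash.
_FIELD_SEP = re.compile(r"(?<!\\) ")


def _unescape(field):
    return field.replace("\\ ", " ")


def parse_line(line):
    line = line.rstrip("\n")
    is_alt = line.startswith("@ ")
    if is_alt:
        line = line[2:]
    f = _FIELD_SEP.split(line, maxsplit=11)
    if len(f) != 12:
        raise ValueError("expected 12 fields, got %d" % len(f))
    info = None if f[11] == "NIL" else f[11].strip('"')
    return Morpheme(
        _unescape(f[0]), _unescape(f[1]), _unescape(f[2]),
        f[3], int(f[4]), f[5], int(f[6]),
        f[7], int(f[8]), f[9], int(f[10]),
        info, is_alt,
    )


# Illustrative lines (assumed, not copied from the manual).
noun = parse_line('ケーキ けーき ケーキ 名詞 6 普通名詞 1 * 0 * 0 "代表表記:ケーキ/けーき"')
particle = parse_line("を を を 助詞 9 格助詞 1 * 0 * 0 NIL")
```

Keeping the IDs as integers makes it easy to compare analyses from juman and jumanpp field by field, as in Section 4.3.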

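[Figure 2: training flow. Labels in the figure: analyzed Web text; training; base model; RNNLM; tagged corpus; retraining; this system; RNN language model (retraining).]

5 Training JUMAN++

Figure 2 outlines how the JUMAN++ models are trained. The analysis model is trained on the tagged corpora of Section 3.3 with Exact Soft Confidence-Weighted Learning [5], and the RNN language model is trained on automatically analyzed Web text. See [5, 6, 7] for details.

5.1 Preparing training data

Training data is made from corpus files in knp format with the script in jumanpp-src/script:

    $ cat xxxx.knp ... yyyy.knp | ruby jumanpp-src/script/corpus2train.rb > train.fmrp

The training data is written to train.fmrp.

5.2 Training the analysis model

Train the JUMAN++ model with jumanpp --train. The options are:

    -t, --train                   : training data file
    -i                            : number of iterations (default: 10)
    -o, --outputmodel             : output model file (default: output.mdl)
    -C                            : Exact Soft Confidence-Weighted parameter C (default: 1.0)
    -P                            : Exact Soft Confidence-Weighted parameter ϕ (default: 1.65)
    -B, --beam                    : beam width (default: 5)
    --output-intermediate-model   : also output intermediate models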
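The training options listed above can be assembled programmatically when several runs with different settings are needed. The Python sketch below only builds the argument list and does not run anything. The function name train_command is hypothetical, the defaults mirror the ones listed above, and treating --output-intermediate-model as a switch without an argument is an assumption.

```python
import shlex


def train_command(train_file, output_model="output.mdl", iterations=10,
                  c=1.0, phi=1.65, beam=5, intermediate=False):
    """Build an argv list for `jumanpp --train` from the options of Section 5.2."""
    argv = [
        "jumanpp",
        "--train", train_file,
        "--outputmodel", output_model,
        "-i", str(iterations),
        "-C", str(c),          # Exact Soft Confidence-Weighted parameter C
        "-P", str(phi),        # Exact Soft Confidence-Weighted parameter phi
        "--beam", str(beam),
    ]
    if intermediate:
        # Assumed to be a bare switch; the manual does not show its arity.
        argv.append("--output-intermediate-model")
    return argv


argv = train_command("train.fmrp", output_model="trained.mdl")
command_line = " ".join(shlex.quote(a) for a in argv)
```

The list form can be passed to subprocess.run as is, and the quoted string form is convenient for logging which settings produced which model file.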

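The model that JUMAN++ uses for analysis is weight.mdl in the resource directory. Train a model and install it as follows:

    % jumanpp --train train.fmrp --outputmodel trained.mdl
    % sudo cp trained.mdl $JPPRC/weight.mdl

During training, progress is reported in lines such as:

    ITERATION:0 50475/50476 avg:0.0202897 loss:0

With --output-intermediate-model, intermediate models are also written out.

5.3 Training from partially annotated data

With --partial, JUMAN++ accepts input in which some morpheme boundaries are marked with \t and analyzes the rest of the sentence under that constraint.

Without the option:

    % echo " " | jumanpp
    7 2 * 0 * 0 NIL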

    8 * 0 * 0 * 0 " : / "
    6 2 * 0 * 0 " : / : "
    2 * 0 16 8 " : / : : / "
    14 5 18 2 " : / "

With a boundary marked by \t:

    $ echo " \t " | jumanpp --partial
    7 2 * 0 * 0 NIL
    6 1 * 0 * 0 " : / : : - "
    9 2 * 0 * 0 NIL
    6 2 * 0 * 0 " : / : "
    2 * 0 16 8 " : / : : / "
    14 5 18 2 " : / "

A sample of partially annotated input is in jumanpp-src/sample/part-sample.txt. Convert it to training data, add it to the data of Section 5.1, and train as in Section 5.2:

    % cat sample/part-sample.txt | jumanpp --partial | ruby script/corpus2train.rb > partial.fmrp
    % cat train.fmrp partial.fmrp > part_train.fmrp
    % jumanpp --train part_train.fmrp --outputmodel part_trained.mdl

5.4 Training the RNN language model

The RNN language model is trained with the Faster RNNLM (HS/NCE) toolkit [10].

[10] https://github.com/yandex/faster-rnnlm

The input to the Faster RNNLM toolkit is text in which each word is written with its fields joined by _. Make it from raw text with JUMAN++ and the scripts in jumanpp-src/script:

    % cat corpus.txt | jumanpp | ruby script/corpus2train.rb | ruby script/fullmrp2basep.rb > data_for_lm.txt

Train with the --nce and --direct options of the Faster RNNLM toolkit. Split data_for_lm.txt into training data, data_for_lm.train, and validation data, data_for_lm.valid. Training produces lang.mdl and lang.mdl.nnet:

    % faster-rnnlm --rnnlm lang.mdl --train data_for_lm.train --valid data_for_lm.valid --nce 22 --hidden 100 --direct 100 --direct-order 3 -bptt 4 --use-cuda 1 -independent

For the other options of the Faster RNNLM toolkit, see the toolkit readme.

6 Future work

    Windows (Cygwin)

References

[1] 1989.
[2] 42, 1991.
[3] NL-101, 1994.
[4] JUMAN 2, 1996.
[5] Jialei Wang, Peilin Zhao and Steven C.H. Hoi. Exact Soft Confidence-Weighted Learning. Proceedings of the 29th International Conference on Machine Learning, 2012.
[6] Hajime Morita, Daisuke Kawahara and Sadao Kurohashi. Morphological Analysis for Unsegmented Languages using Recurrent Neural Network Language Model. Proceedings of EMNLP 2015: Conference on Empirical Methods in Natural Language Processing, 2015.
[7] RNN 78, 2016.

A Grammar definition files

The formats in this appendix are given in BNF.

A.1 (JUMAN.grammar)

A.1.1 Format

    ::= ( ) ( ( ) )
    ::= ( # ) ( # %)
    ::=
    ::= ( # ) ( # %)

A.1.2 Example

    (( %)) ;
    (( ) (( ) ( ) ( ) ( ) ( ) ( ))) ;
    ;...
    ;
    (( ) (( ) ( ) ( ) ( %) ( %) ( %))) ;

A.2 (JUMAN.kankei)

A.2.1 Format

    ::= ( ( ) )
    ::= ( # ) ( # # )
    ::= # #

A.2.2 Example

    (( ) ( )) ;
    (( ) ( ))
    (( ) ( )) ;
    ;
    ;

A.3 (JUMAN.katuyou)

A.3.1 Format

    ::= ( # ( ) )
    ::=
    ::= ( # )
    ::= # *

A.3.2 Example

    ( (( * ) ( ) ( * ) ( ) ( ) ( ) ( ) ( * ) ( ) ( ) ( ) ( )) )
    ( (( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( )) )

B Semantic information

B.1

B.2

B.3

B.4

    ( 4,000 ) Web 1,500 2,000
    :7:0.00607
    :45:0.00106
    Web 150
    ( 50,000 )
    ( 700 )

B.5