
JUMAN++ version 1.01
September 2016

Morphological Analysis System JUMAN++ 1.01

Copyright 2016 Kyoto University
All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Version 1.01  September 2016
Version 1.00  September 2016


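1 Introduction

Existing Japanese morphological analyzers include JUMAN, ChaSen, and MeCab. JUMAN++ is a morphological analyzer that uses an RNN (recurrent neural network) language model [6, 7]. It builds on JUMAN and extends the JUMAN dictionary with vocabulary drawn from sources such as Wikipedia. JUMAN++ was developed with support from CREST.

September 2016
Email: nl-resource@nlp.ist.i.kyoto-u.ac.jp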

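2 Using JUMAN++

2.1 Requirements

JUMAN++ runs in the following environment.

  OS: Linux (CentOS 6.7)
  Memory: 4GB
  Disk space: 2GB
  gcc (4.9 or later)
  Boost C++ Libraries (1.57 or later), http://www.boost.org/
  gperftool, https://github.com/gperftools/gperftools
  libunwind, http://www.nongnu.org/libunwind/ (needed when gperftool is used on 64bit systems)

2.2 Installation

Download the JUMAN++ archive (about 600MB), unpack it, then build and install:

  % wget http://lotus.kuee.kyoto-u.ac.jp/nl-resource/jumanpp/jumanpp-1.01.tar.xz
  % tar xJvf jumanpp-1.01.tar.xz
  % cd jumanpp-1.01
  % ./configure
  % make
  % sudo make install

By default JUMAN++ is installed under /usr/local/. To install it elsewhere, pass a prefix to configure:

  % ./configure --prefix=/path/to/somewhere/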

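2.3 Usage

jumanpp reads UTF-8 text from standard input and writes the analysis to standard output. Input lines that begin with "# " are comments, and lines that begin with ##JUMAN++ are dynamic options (Section 2.7).

  % cat cake.txt
  # S-ID: 00000000-01
  ケーキを食べる
  % cat cake.txt | jumanpp
  # S-ID: 00000000-01 JUMAN++:1.01
  ケーキ けーき ケーキ 名詞 6 普通名詞 1 * 0 * 0 "代表表記:ケーキ/けーき カテゴリ:人工物-食べ物 ドメイン:料理・食事"
  を を を 助詞 9 格助詞 1 * 0 * 0 NIL
  食べる たべる 食べる 動詞 2 * 0 母音動詞 1 基本形 2 "代表表記:食べる/たべる ドメイン:料理・食事"

Options:

  -s, --specifics N      output the N-best analyses in lattice format (Section 2.4)
  -B, --beam width       beam width (Section 4.1) (default: width = 5)
  --partial              accept partially annotated input (Section 5.3)
  --force-single-path    see Section 4.1
  -v, --version          show the version
  --debug                show debug output
  -h, --help             show help

2.4 Output format

The default output is the JUMAN format. Each line describes one morpheme with the following space-separated fields: surface form, reading, base form, part of speech, part-of-speech ID, part-of-speech subcategory, subcategory ID, conjugation type, conjugation type ID, conjugated form, conjugated form ID, and semantic information.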

When the analysis of a morpheme is ambiguous, each alternative analysis follows on a line that begins with "@ ". Fields that do not apply to a morpheme are output as * with ID 0, and a morpheme with no semantic information has NIL in the last field. A space in the input is escaped with a backslash in the surface form, reading, and base form.

With the -s option, JUMAN++ outputs the N-best analyses in lattice format. Unlike the JUMAN format, the fields are separated by tabs (\t). The output contains a comment line that gives the score of each ranked analysis:

  # MA-SCORE rank1:-9.52349 rank2:-9.59653 rank3:-9.72774 rank4:-9.75929 rank5:-10.6167

Each following line begins with - and describes one node of the lattice: the node ID, the IDs of the preceding nodes (separated by ;, for example 44;43), the start and end positions, the morpheme fields with their IDs, three scores, and the ranks of the N-best analyses that pass through the node (for example 1;2;3;4;5). In the sample output the third score is the sum of the first two (for example -1.92711 + -0.74683 = -2.67394). See Section 4 for details of the analysis.

2.5 Server mode

JUMAN++ can be run as a server with the scripts script/server.rb and script/client.rb.

Start server.rb and give the JUMAN++ command line with --cmd. The server listens on TCP port 12000 by default; use --port to change it (1234 in this example).

  $ ruby script/server.rb --cmd "jumanpp -B 5" --host host.name --port 1234

Run client.rb with --host <hostname> to send text to the server. The client also uses port 12000 unless --port is given.

  $ echo "ケーキを食べる" | ruby script/client.rb --host host.name --port 1234

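The client prints the analysis returned by the server:

  ケーキ けーき ケーキ 名詞 6 普通名詞 1 * 0 * 0 "代表表記:ケーキ/けーき カテゴリ:人工物-食べ物 ドメイン:料理・食事"
  を を を 助詞 9 格助詞 1 * 0 * 0 NIL
  食べる たべる 食べる 動詞 2 * 0 母音動詞 1 基本形 2 "代表表記:食べる/たべる ドメイン:料理・食事"

2.6 Use from Python

The Python module pyknp provides a Python interface to JUMAN++. Install it as follows:

  % wget http://nlp.ist.i.kyoto-u.ac.jp/nl-resource/knp/pyknp-0.3.tar.gz
  % tar xvf pyknp-0.3.tar.gz
  % cd pyknp-0.3
  % sudo python setup.py install [--prefix=path]

pyknp has been tested with Python 2.7. The sample below (sample/python_juman.py) is written for Python 2; the input sentence is illustrative.

    #-*- encoding: utf-8 -*-
    from pyknp import Jumanpp
    import sys
    import codecs

    # Read and write UTF-8 on the standard streams (Python 2)
    sys.stdin = codecs.getreader('utf_8')(sys.stdin)
    sys.stdout = codecs.getwriter('utf_8')(sys.stdout)

    # Use Juman++ in subprocess mode
    jumanpp = Jumanpp()
    result = jumanpp.analysis(u"ケーキを食べる")
    # Print the surface form (midasi) of each morpheme
    for mrph in result.mrph_list():
        print u"surface form:%s" % (mrph.midasi)

See the pyknp Readme for details. For use together with KNP, see sample/python_knp.py.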
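When pyknp is not available, the default output format of Section 2.4 can be parsed directly. The following is a minimal sketch in Python 3, not part of JUMAN++ or pyknp. The function name, the dictionary keys, and the sample lines are illustrative, and the sketch assumes that each sentence ends with an EOS line as in JUMAN output.

```python
# Sketch of a parser for the default (JUMAN-format) output.
# Field names are illustrative labels for the 12 space-separated fields.
FIELDS = (
    "surface", "reading", "base_form",
    "pos", "pos_id", "sub_pos", "sub_pos_id",
    "conj_type", "conj_type_id", "conj_form", "conj_form_id",
    "semantic_info",
)

# Illustrative sample; the EOS terminator is an assumption.
SAMPLE = "\n".join([
    "# S-ID: 00000000-01 JUMAN++:1.01",
    'ケーキ けーき ケーキ 名詞 6 普通名詞 1 * 0 * 0 "代表表記:ケーキ/けーき"',
    "を を を 助詞 9 格助詞 1 * 0 * 0 NIL",
    '食べる たべる 食べる 動詞 2 * 0 母音動詞 1 基本形 2 "代表表記:食べる/たべる"',
    "EOS",
])


def parse_juman_output(text):
    """Return a list of sentences; each sentence is a list of morpheme dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        # Skip blank lines and comment lines such as "# S-ID: ...".
        # (A morpheme whose surface form is "#" is not handled here.)
        if not line or line.startswith("#"):
            continue
        if line == "EOS":
            sentences.append(current)
            current = []
            continue
        is_alternative = line.startswith("@ ")
        if is_alternative:
            line = line[2:]
        # Split into at most 12 fields: the semantic information
        # is quoted and may itself contain spaces.
        morpheme = dict(zip(FIELDS, line.split(" ", 11)))
        if is_alternative and current:
            # "@" lines are alternative analyses of the previous morpheme.
            current[-1]["alternatives"].append(morpheme)
        else:
            morpheme["alternatives"] = []
            current.append(morpheme)
    if current:
        sentences.append(current)
    return sentences


if __name__ == "__main__":
    for morpheme in parse_juman_output(SAMPLE)[0]:
        print(morpheme["surface"], morpheme["pos"], morpheme["conj_form"])
```

Keeping the alternatives attached to the preceding morpheme mirrors how the "@" lines are laid out in the output, so ambiguity is preserved rather than flattened.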

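2.7 Dynamic options

Input lines that begin with ##JUMAN++ change options while JUMAN++ is running. The following commands are available:

  ##JUMAN++ set-lattice N               output the N-best analyses in lattice format
  ##JUMAN++ set-beam width              set the beam width to width
  ##JUMAN++ set-force-single-path       turn on --force-single-path
  ##JUMAN++ unset-force-single-path     turn off --force-single-path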
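As an illustration of the dynamic options above, the following Python 3 sketch builds an input text in which option lines precede a sentence. The helper name is hypothetical, and the separator between ##JUMAN++ and the command is an assumption (a tab by default here); adjust it if your JUMAN++ build expects a space.

```python
def with_dynamic_options(sentence, *commands, sep="\t"):
    """Prefix a sentence with ##JUMAN++ option lines.

    `commands` are strings such as "set-beam 10"; `sep` is the separator
    placed after "##JUMAN++" (assumed to be a tab, not confirmed here).
    """
    lines = ["##JUMAN++" + sep + command for command in commands]
    lines.append(sentence)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # The resulting text can be piped to jumanpp on standard input.
    print(with_dynamic_options("ケーキを食べる", "set-beam 10", "set-lattice 3"), end="")
```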

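3 Grammar, Dictionaries, and Corpora

3.1 Grammar

JUMAN++ uses the grammar of JUMAN, which is based on [1]. The grammar is defined by three files:

1. the part-of-speech system (cf. JUMAN.grammar)
2. the relation between parts of speech and conjugation types (cf. JUMAN.kankei)
3. the conjugation types and conjugated forms (cf. JUMAN.katuyou)

The dictionaries are described in Section 3.2 and the corpora in Section 3.3.

3.2 Dictionaries

JUMAN++ loads its dictionary from the compiled files dic.bin and dic.da. Sections 3.2.2 to 3.2.4 describe the dictionary sources and how to build them. The dictionary is based on the JUMAN dictionary.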

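3.2.1 Dictionary format

Dictionary source files have the extension dic. Text that follows ; on a line is a comment. Each entry is a parenthesized expression that gives the part of speech, the subcategory, the reading, the headword, the conjugation type where applicable, and the semantic information; a field with no value is written as NIL. Inside the semantic information, a double quote (") is written as \".

3.2.2 Dictionary sources

In addition to the JUMAN dictionary, JUMAN++ uses dictionaries built from Wikipedia, Wiktionary, and Web text.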

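JUMAN dictionary
  ContentW.dic       content words
  Noun.koyuu.dic     proper nouns
  Postp.dic          particles
  Suffix.dic         suffixes
  Rendaku.dic        rendaku (sequential voicing) forms
  Onomatopeia.dic    onomatopoeia

Wikipedia
  Wikipedia.dic      entries extracted from the Wikipedia data of 2016/06/01. Their semantic information is marked as originating from Wikipedia.

Wiktionary
  Wiktionary.dic     entries extracted from the Wiktionary data of 2016/06/01. Their semantic information is marked as originating from Wiktionary.

Web text
  Web.dic            entries acquired automatically from Web text; see https://github.com/murawaki/lebyr

3.2.3 Semantic information

Dictionary entries carry semantic information in the last field. The 22 categories are listed in Appendix B.1 and the 12 domains in Appendix B.2. Further kinds of semantic information are listed in Appendices B.3 and B.4.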

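The remaining kinds of semantic information are listed in Appendix B.5.

3.2.4 Building the dictionary

The dictionary is built in the dict-build directory. To add entries, write them in the format of Section 3.2.1 and place the file under dict-build/userdic/, then run the following in dict-build. If JUMAN++ was installed with a prefix, pass the same prefix to install.sh with --prefix /path/to/somewhere/.

  % make
  % sudo ./install.sh

3.3 Corpora

The model distributed with JUMAN++ (Section 2.2) is trained on the following corpora; training is described in Section 5.

1. A corpus annotated on newspaper text. The text itself comes from the 1995 CD-ROM, which must be obtained separately; see the Readme of the corpus. The corpus is encoded in EUC-jp and must be converted to UTF-8. (http://nlp.ist.i.kyoto-u.ac.jp/index.php?)
2. KWDLC, a corpus annotated on the first 3 sentences of Web documents. (http://nlp.ist.i.kyoto-u.ac.jp/index.php?kwdlc)
3. Web text analyzed automatically, used in addition to the annotated corpora.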

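[Figure 1: overview of the analysis]

4 Analysis

JUMAN++ scores candidate analyses with a Recurrent Neural Network (RNN) language model in addition to the kind of features used by analyzers such as JUMAN and MeCab [6].

4.1 Analysis algorithm

As shown in Figure 1, each candidate analysis is scored with two components:

  Feature-based score: word 1-gram to 3-gram features
  Language model score: the RNN language model

Because the RNN score of a word depends on the words that precede it, JUMAN++ searches with a beam: at each step only the best B partial analyses are kept, where B is the beam width. With N-best output, the N highest-scoring analyses are returned. The --force-single-path option restricts the output to a single path.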
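The beam search described above can be illustrated with a toy example. This is a minimal Python 3 sketch, not the JUMAN++ implementation: the lattice, the scores, and the function names are made up for illustration, and the scoring function stands in for the combined feature and RNN language model score by taking the whole history of preceding words as an argument. Pruning per character position is a simplification.

```python
from collections import defaultdict


def beam_search(lattice, length, score, beam_width=5):
    """Find a high-scoring segmentation of positions [0, length).

    lattice: dict mapping a start position to a list of (word, end) pairs.
    score:   function (history, word) -> float, where history is the tuple
             of words chosen so far (so the score may depend on all of them).
    Returns (total_score, words) or None if no path reaches `length`.
    """
    beams = defaultdict(list)          # position -> [(total_score, history)]
    beams[0].append((0.0, ()))
    for pos in range(length):
        # Keep only the beam_width best partial analyses ending at pos.
        survivors = sorted(beams[pos], key=lambda item: item[0], reverse=True)
        for total, history in survivors[:beam_width]:
            for word, end in lattice.get(pos, []):
                new_total = total + score(history, word)
                beams[end].append((new_total, history + (word,)))
    finals = beams[length]
    return max(finals, key=lambda item: item[0]) if finals else None


# Toy lattice over the 3-character input "abc" (hypothetical data).
LATTICE = {0: [("a", 1), ("ab", 2)], 1: [("b", 2), ("bc", 3)], 2: [("c", 3)]}
UNIGRAM = {"a": -1.0, "ab": -1.5, "b": -1.0, "c": -1.0, "bc": -0.5}


def toy_score(history, word):
    # A context-dependent bonus stands in for the language model score.
    bonus = 0.5 if history[-1:] == ("a",) and word == "bc" else 0.0
    return UNIGRAM[word] + bonus


if __name__ == "__main__":
    print(beam_search(LATTICE, 3, toy_score, beam_width=2))
```

A larger beam width explores more partial analyses at the cost of speed, which is the trade-off controlled by the -B option.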

4.2 4.2.1 6 2 * 0 * 0 " : / : " 2 * 0 16 8 " : / : : / " 14 7 31 2 " : / " 2 * 0 3 8 " : / : : : / " 14 5 18 2 " : / " 12 * 0 * 0 * 0 " : / " 14 7 31 2 " : / " 15

12 * 0 * 0 * 0 " : / " 14 7 31 2 " : / " 4.2.2 ( ) 6 7 * 0 * 0 " : " 6 7 * 0 * 0 " : " 6 7 * 0 * 0 " : " 1 2 * 0 * 0 NIL 6 7 * 0 * 0 " : " 1 2 * 0 * 0 NIL 6 7 * 0 * 0 " : " 4.2.3 8 * 0 * 0 * 0 " " 2 * 0 1 2 " : / : " 16

4.2.4 Unknown words

A word that is not in the dictionary is output with part-of-speech ID 15.

4.3 Differences from JUMAN

The output of JUMAN++ is compatible with that of JUMAN, so tools such as KNP can use either analyzer. The output differs in the following respects, which can be seen by piping the same sentence to juman and to jumanpp.

Nominalized verbs. Where JUMAN analyzes a nominalized verb as a verb in its conjugated form (fields 2 * 0 2 8 in the sample output), JUMAN++ outputs a common noun (fields 6 1 * 0 * 0) and appends v to the representative form in the semantic information (see Section 5.4 of this manual for related details).

Semantic information. Where JUMAN outputs an empty semantic information field (" "), JUMAN++ outputs NIL, and JUMAN++ outputs a representative form for some morphemes for which JUMAN outputs none.

Numerals. Both analyzers output numerals as nouns with subcategory ID 7 (fields 6 7 * 0 * 0), but the content of the semantic information differs.

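[Figure 2: training flow. Analyzed Web text is used to train the base model and the RNNLM; the tagged corpus is then used to retrain them into the model of this system and its RNN language model.]

5 Training JUMAN++

JUMAN++ is trained from Web text and the corpora of Section 3.3, as outlined in Figure 2. The model weights are learned with Exact Soft Confidence-Weighted Learning [5]. See [5, 6, 7] for details.

5.1 Preparing the training data

Training data is created from corpus files in knp format with the conversion script in the script directory:

  $ cat xxxx.knp ... yyyy.knp | ruby script/corpus2train.rb > train.fmrp

The resulting train.fmrp is the training file.

5.2 Training

Train a model by running jumanpp with --train. The training options are:

  -t, --train file               training data
  -i N                           number of iterations (default: 10)
  -o, --outputmodel file         output model file (default: output.mdl)
  -C value                       parameter C of Exact Soft Confidence-Weighted learning (default: 1.0)
  -P value                       parameter ϕ of Exact Soft Confidence-Weighted learning (default: 1.65)
  -B, --beam width               beam width (default: 5)
  --output-intermediate-model    write intermediate models during training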
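To make the roles of the C and ϕ parameters concrete, the following Python 3 sketch implements one update step of the SCW-I variant from [5] for plain binary classification. It is an illustration only: JUMAN++ applies the learner to morphological analysis, which is not reproduced here, and whether it uses this exact variant is an assumption. The function and variable names are chosen for this example.

```python
import math


def scw1_update(w, sigma, x, y, C=1.0, phi=1.65):
    """One SCW-I update for a binary example (x, y) with y in {+1, -1}.

    w:     weight vector (mean), list of floats
    sigma: covariance matrix, list of lists (symmetric)
    Returns the updated (w, sigma); inputs are not modified.
    """
    n = len(x)
    sx = [sum(sigma[i][j] * x[j] for j in range(n)) for i in range(n)]  # Sigma x
    v = sum(x[i] * sx[i] for i in range(n))      # confidence: x^T Sigma x
    m = y * sum(w[i] * x[i] for i in range(n))   # signed margin
    # No update when the example is classified correctly with enough confidence.
    if phi * math.sqrt(v) - m <= 0:
        return w, sigma
    psi = 1.0 + phi ** 2 / 2.0
    zeta = 1.0 + phi ** 2
    alpha = (-m * psi + math.sqrt(m * m * phi ** 4 / 4.0 + v * phi ** 2 * zeta)) / (v * zeta)
    alpha = min(C, max(0.0, alpha))              # C caps the step size
    u = 0.25 * (-alpha * v * phi + math.sqrt(alpha ** 2 * v ** 2 * phi ** 2 + 4.0 * v)) ** 2
    beta = alpha * phi / (math.sqrt(u) + v * alpha * phi)
    new_w = [w[i] + alpha * y * sx[i] for i in range(n)]
    # Sigma - beta * (Sigma x)(Sigma x)^T, valid because Sigma is symmetric.
    new_sigma = [[sigma[i][j] - beta * sx[i] * sx[j] for j in range(n)] for i in range(n)]
    return new_w, new_sigma


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    w, sigma = scw1_update([0.0, 0.0], identity, [1.0, 2.0], +1)
    print(w, sigma)
```

A larger C allows larger corrections per example, while ϕ sets how confidently an example must be classified before the learner stops updating on it.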

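To use a newly trained model, replace weight.mdl in the JUMAN++ data directory with it and remove the cached weight.mdl.map (the paths below assume JUMAN++ is installed under /usr/local/):

  % jumanpp --train train.fmrp --outputmodel trained.mdl
  % sudo rm /usr/local/share/jumanpp/weight.mdl.map
  % sudo cp trained.mdl /usr/local/share/jumanpp/weight.mdl

During training, JUMAN++ prints progress lines such as:

  ITERATION:0 50475/50476 avg:0.0202897 loss:0

With --output-intermediate-model, intermediate models are also written while training runs.

5.3 Partial annotation

With the --partial option, JUMAN++ accepts partially annotated input, in which word boundaries are marked with tab characters (\t). JUMAN++ then outputs an analysis that respects the marked boundaries.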

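In the example of this section, one span of the sentence is analyzed as an adverb (part-of-speech ID 8) when the sentence is piped to jumanpp without annotation. When the same sentence is given with \t at the intended boundaries and piped to jumanpp --partial, that span is analyzed as a common noun (6 1) followed by a particle (9 2), and the rest of the analysis is unchanged.

A sample of partially annotated text is provided in sample/part-sample.txt. To use it for training, convert the analysis to training data, concatenate it with the training data of Section 5.2, and train as usual:

  % cat sample/part-sample.txt | jumanpp --partial | ruby script/corpus2train.rb > partial.fmrp
  % cat train.fmrp partial.fmrp > part_train.fmrp
  % jumanpp --train part_train.fmrp --outputmodel part_trained.mdl

5.4 Training the RNN language model

The RNN language model is trained with the Faster RNNLM (HS/NCE) toolkit (https://github.com/yandex/faster-rnnlm).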

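The training data for the Faster RNNLM toolkit is text in which each word is written as a sequence of fields joined with underscores (_). It is created from raw text by analyzing the text with JUMAN++ and converting the result with the scripts in the script directory:

  % cat corpus.txt | jumanpp | ruby script/corpus2train.rb | ruby script/fullmrp2basep.rb > data_for_lm.txt

Train the model with the Faster RNNLM toolkit, specifying --nce and --direct. Split data_for_lm.txt into training data (data_for_lm.train) and validation data (data_for_lm.valid) beforehand. Training produces lang.mdl and lang.mdl.nnet.

  % faster-rnnlm --rnnlm lang.mdl --train data_for_lm.train --valid data_for_lm.valid --nce 22 --hidden 100 --direct 100 --direct-order 3 -bptt 4 --use-cuda 1 -independent

For the options of the Faster RNNLM toolkit, see the readme of the toolkit.

6 Platform Notes

JUMAN++ targets Linux (Section 2.1). Windows (Cygwin) is not covered by this version.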

References

[1] 1989.
[2] 42, 1991.
[3] NL-101, 1994.
[4] JUMAN, 2, 1996.
[5] Jialei Wang, Peilin Zhao, and Steven C.H. Hoi. Exact Soft Confidence-Weighted Learning. Proceedings of the 29th International Conference on Machine Learning, 2012.
[6] Hajime Morita, Daisuke Kawahara, and Sadao Kurohashi. Morphological Analysis for Unsegmented Languages using Recurrent Neural Network Language Model. Proceedings of EMNLP 2015: Conference on Empirical Methods in Natural Language Processing, 2015.
[7] RNN, 78, 2016.

A BNF # # NIL A.1 (JUMAN.grammar) A.1.1 ::= ( ) ( ( ) ) ::= ( # ) ( # %) ::= ::= ( # ) ( # %) A.1.2 (( %)) ; (( ) (( ) ( ) ( ) ( ) ( ) ( ))) ; ;... ; (( ) (( ) ( ) ( ) ( %) ( %) ( %))) ; 25

A.2 (JUMAN.kankei) A.2.1 ::= ( ( ) ) ::= ( # ) ( # # ) ::= # # A.2.2 (( ) ( )) ; (( ) ( )) (( ) ( )) ; ; ; 26

A.3 (JUMAN.katuyou) A.3.1 ::= ( # ( ) ) ::= ::= ( # ) ::= # * A.3.2 ( (( * ) ( ) ( * ) ( ) ( ) ( ) ( ) ( * ) ( ) ( ) ( ) ( )) ) ( (( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( )) ) 27

B B.1 / / ( ) ID ID 28

B.2 - - - - - - - 29

- - - - - - - - - - - - - 30

- - ( ) ( ) ( ) ( ) ( ) ( ) - ( ) - 31

- - - - - - B.3 32


B.4 ( 4,000 ) Web 1,500 2,000 : : :7:0.00607 : : :45:0.00106 Web 150 : ( 50,000 ) : : : : : : : : : : : : : : : : : : : : : 35

( 700 ) : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : B.5 : : : : : : : : 36

: : : : : : : : : : ( ) : 37