nwjc2vec: Word Embedding Data Constructed from NINJAL Web Japanese Corpus

Hiroyuki Shinnou, Masayuki Asahara, Kanako Komiya and Minoru Sasaki

Abstract
We constructed word embedding data (named nwjc2vec) using the NINJAL Web Japanese Corpus and the word2vec software, and released it publicly. In this report, nwjc2vec is introduced, and the results of two types of experiments conducted to evaluate its quality are shown. The first experiment is an evaluation based on word similarity: using a word similarity dataset, we calculate Spearman's rank correlation coefficient. The second experiment is a task-based evaluation: as tasks, we consider word sense disambiguation (WSD) and language model construction using a Recurrent Neural Network (RNN). The results obtained with nwjc2vec are compared with those obtained with word embeddings constructed from seven years of newspaper article data, and nwjc2vec is shown to be of high quality.

Key Words: Word Embedding, NINJAL Web Japanese Corpus, word2vec

Department of Computer and Information Sciences, Ibaraki University / National Institute for Japanese Language and Linguistics

1 Introduction

The simplest way to treat a word as a vector is the one-hot representation: with a vocabulary of N words, the i-th word w_i is represented by an N-dimensional vector whose i-th element is 1 and all other elements are 0. One-hot vectors are high-dimensional and sparse, and any two distinct words are equally dissimilar under this representation, so it cannot capture relatedness between words. Word embeddings instead represent each word as a dense, low-dimensional real-valued vector learned from a corpus. The best-known tool for constructing such embeddings is word2vec, proposed by Mikolov et al. (Mikolov, Sutskever, Chen, Corrado, and Dean 2013b; Mikolov, Chen, Corrado, and Dean 2013a) [2]; GloVe [3] is another widely used tool.

The quality of word embeddings depends strongly on the size and quality of the corpus used to train them. In this paper we describe nwjc2vec [4], word embedding data constructed by applying word2vec to the word-segmented text [1] of the NINJAL Web Japanese Corpus (NWJC) (Asahara, Maekawa, Imada, Kato, and Konishi 2014), a web corpus of roughly 25.8 billion tokens, and released publicly.

[1] Segmentation with MeCab (-Owakati), used as input to word2vec.
[2] https://github.com/svn2github/word2vec
[3] https://nlp.stanford.edu/projects/glove/
[4] http://nwjc-data.ninjal.ac.jp/
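As a concrete illustration of the contrast described above, the following sketch (a hypothetical five-word vocabulary and random values, not taken from nwjc2vec) builds a one-hot vector and a dense embedding lookup with NumPy.

    import numpy as np

    # Hypothetical vocabulary; index i identifies word w_i.
    vocab = ["猫", "犬", "走る", "食べる", "静かだ"]
    N = len(vocab)

    def one_hot(i, n):
        """n-dimensional vector with 1 at position i and 0 elsewhere."""
        v = np.zeros(n)
        v[i] = 1.0
        return v

    # Dense embeddings: each word is a low-dimensional real-valued vector
    # (4 random dimensions here purely for illustration; nwjc2vec uses
    # 200 dimensions learned by word2vec).
    rng = np.random.default_rng(0)
    E = rng.normal(size=(N, 4))

    i = vocab.index("猫")
    print(one_hot(i, N))   # sparse; says nothing about word similarity
    print(E[i])            # dense; similar words can receive similar vectors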

To evaluate the quality of nwjc2vec, we conduct two types of experiments: an evaluation based on word similarity, and a task-based evaluation using word sense disambiguation (WSD) and language model construction with a Recurrent Neural Network (RNN). In both cases the results are compared with word embeddings built from seven years of newspaper articles. The rest of this paper is organized as follows: Section 2 describes how nwjc2vec was constructed, Section 3 reports the evaluation experiments, Section 4 discusses the results, and Section 5 concludes.

2 Construction of nwjc2vec

2.1 NWJC

The NINJAL Web Japanese Corpus (NWJC) is collected by crawling roughly 100 million URLs with Heritrix-3.1.1 [6] in three-month cycles. The collected pages are cleaned and converted to plain text with nwc-toolkit-0.0.2 [7], morphologically analyzed with MeCab-0.996 [8] and the UniDic-2.1.2 dictionary [9], and dependency-parsed with CaboCha-0.69 [10] built for the UniDic part-of-speech set [11]. BonTen, a corpus concordance system for NWJC, is described in (Asahara, Kawahara, Takei, Masuoka, Ohba, Torii, Morii, Tanaka, Maekawa, Kato, and Konishi 2016). In this work we use the portion crawled from October to December 2014, referred to as NWJC-2014-4Q; its size is given in Table 1.

[6] http://webarchive.jira.com/wiki/display/heritrix/heritrix/
[7] https://github.com/xen/nwc-toolkit
[8] https://taku910.github.io/mecab/
[9] http://unidic.ninjal.ac.jp/
[10] https://taku910.github.io/cabocha/
[11] CaboCha was built with ./configure --with-posset=unidic to use the UniDic POS set.
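The corpus preparation above relies on MeCab for word segmentation. As a rough illustration only (not the authors' exact pipeline; the mecab-python3 binding is an assumption, and the dictionary actually used depends on the local installation), a sentence can be split into space-separated words as follows.

    import MeCab  # mecab-python3; assumes MeCab and a dictionary are installed

    # -Owakati outputs the sentence as space-separated surface forms,
    # which is the input format expected by word2vec.
    tagger = MeCab.Tagger("-Owakati")

    sentence = "国語研日本語ウェブコーパスから単語の分散表現を構築した。"
    print(tagger.parse(sentence).strip())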

2.2 Construction with word2vec

We constructed the embeddings by running word2vec [12] on the NWJC-2014-4Q data summarized in Table 1, using the CBOW model with the parameters listed in Table 2; these follow the settings of the demo-word.sh script distributed with word2vec [13]. Two versions of the data were built: a "word" version keyed by surface form, and a "mrph" version keyed by the full morphological analysis output of unidic-mecab-kana-accent-2.1.2 (the 26 fields defined in its dicrc) [14].

Table 1  Size of NWJC-2014-4Q
  URLs                                  83,992,556  (about 84 million)
  Sentences                          3,885,889,575  (about 3.8 billion)
  Sentences after removing duplicates 1,463,142,939  (about 1.4 billion)
  Tokens                            25,836,947,421  (about 25.8 billion)

Table 2  word2vec parameters
  CBOW or skip-gram      -cbow 1 (CBOW)
  Vector size            -size 200
  Window size            -window 8
  Negative sampling      -negative 25
  Hierarchical softmax   -hs 0 (not used)
  Subsampling            -sample 1e-4
  Iterations             -iter 15

2.3 Format of nwjc2vec

Each line of nwjc2vec consists of an entry followed by the 200 values e_1 e_2 ... e_200 of its embedding.

[12] https://github.com/svn2github/word2vec
[13] The parameter settings of demo-word.sh in the word2vec distribution.
[14] unidic-mecab-kana-accent-2.1.2 (dicrc, 26 output fields).
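For reference, the Table 2 settings can be mapped onto other word2vec implementations. The following sketch uses gensim as an assumed equivalent of the original C tool (the corpus path, output path, and the use of gensim itself are not part of the paper).

    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    # corpus.txt (hypothetical path): one MeCab-segmented sentence per line,
    # words separated by spaces, as produced for NWJC-2014-4Q.
    sentences = LineSentence("corpus.txt")

    model = Word2Vec(
        sentences,
        sg=0,             # -cbow 1   -> CBOW model
        vector_size=200,  # -size 200
        window=8,         # -window 8
        negative=25,      # -negative 25
        hs=0,             # -hs 0     (no hierarchical softmax)
        sample=1e-4,      # -sample 1e-4
        epochs=15,        # -iter 15
    )

    # Save in the same text format as nwjc2vec: a header line followed by
    # one entry and its 200 values per line.
    model.wv.save_word2vec_format("nwjc2vec-like.txt", binary=False)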

Each e_i is the i-th of the 200 dimensions. Figure 1 shows sample lines of the released data: in the mrph version each entry is a complete morphological analysis result (all UniDic fields), followed by the raw word2vec output values such as -10.491043 -2.121982 -3.084628 ... 4.024705 3.570072 12.781445.

nwjc2vec contains 1,738,455 entries [16]; collapsing the entries to surface forms gives 1,541,651 distinct forms. Table 3 shows the breakdown of the 1,738,455 entries.

Table 3  Breakdown of the entries (count and percentage)
  1,570,477   90.34
    129,167    7.43
     12,507    0.71
      7,083    0.41
      4,884    0.28
      3,761    0.21
      3,614    0.21
      1,496    0.08
      1,163    0.07
        971    0.05
        390    0.02
        366    0.02
        330    0.02
        125    0.01
        100    0.01
  etc.        2,021    0.12
  Total   1,738,455  100.00

[16] 1,738,456 lines including the header line.
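Because the file follows the word2vec text format just described (a header line, then one entry and its 200 values per line), it can be read with standard tools. A minimal sketch using gensim; the file name and the example key (a surface form from the word version) are placeholders.

    from gensim.models import KeyedVectors

    # Header line gives <number of entries> <number of dimensions>.
    wv = KeyedVectors.load_word2vec_format("nwjc2vec.txt", binary=False)

    print(wv.vector_size)                  # 200
    print(wv["言語"][:5])                   # first 5 of the 200 dimensions
    print(wv.most_similar("言語", topn=3))  # nearest neighbours by cosine similarity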

Table 4  Example entries of nwjc2vec.

3 Evaluation of nwjc2vec

We evaluate nwjc2vec, reported earlier in (2017), in two ways, comparing it in both cases with word embeddings constructed from seven years of newspaper articles.

3.1 mai2vec

For comparison we built mai2vec, word embeddings trained on seven years (1993 to 1999) of Mainichi newspaper articles, 6,791,403 sentences in total. The articles were segmented with MeCab-0.996 and UniDic-2.1.2, and word2vec was run with the same parameters as for nwjc2vec (Table 2). mai2vec contains 132,509 entries.

3.2 Evaluation based on word similarity

For the word-similarity evaluation we use the Japanese Word Similarity Dataset (https://github.com/tmu-nlp/japanesewordsimilaritydataset), which covers four parts of speech; the similarity of each word pair is rated by ten annotators on an eleven-point scale from 0 to 10.
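The evaluation measure is Spearman's rank correlation between the human ratings and the cosine similarities given by the embeddings. A minimal sketch follows; the CSV layout and file names are assumptions, not the actual dataset format.

    import csv
    from gensim.models import KeyedVectors
    from scipy.stats import spearmanr

    wv = KeyedVectors.load_word2vec_format("nwjc2vec.txt", binary=False)

    human, system = [], []
    # pairs.csv (hypothetical): word1, word2, mean human rating in [0, 10].
    with open("pairs.csv", encoding="utf-8") as f:
        for w1, w2, rating in csv.reader(f):
            if w1 in wv and w2 in wv:          # only pairs covered by the embeddings
                human.append(float(rating))
                system.append(wv.similarity(w1, w2))  # cosine similarity

    rho, _ = spearmanr(human, system)
    print(f"Spearman's rho = {rho:.3f} over {len(human)} pairs")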

Table 5 shows, for each of the four parts of speech, the number of word pairs in the dataset and the number of pairs used in the evaluation, i.e. pairs whose two words are contained in both mai2vec and nwjc2vec. Table 6 shows the resulting Spearman rank correlation coefficients. As Table 6 shows, nwjc2vec gives a higher correlation than mai2vec for every part of speech.

Table 5  Number of word pairs per part of speech: in the dataset / used in the evaluation
  959 / 431      901 / 190      1,102 / 793      1,463 / 152

Table 6  Spearman rank correlation coefficients
  mai2vec    0.293   0.313   0.197   0.223
  nwjc2vec   0.342   0.464   0.206   0.345

3.3 Task-based evaluation

3.3.1 Word sense disambiguation

For WSD we adopt the context representation with word embeddings proposed by Sugawara et al. (Sugawara, Takamura, Sasano, and Okumura 2015): for a target word V in a segmented sentence, the feature vector is built from the embeddings of the words surrounding V, using the two words before and the two words after the target (four context words in total).

The target words and data are those of the SemEval-2010 Japanese WSD task (Okumura, Shirai, Komiya, and Yokono 2011): 50 target words, each with 50 training instances and 50 test instances. The baseline uses conventional features without word embeddings; mai2vec and nwjc2vec add the embedding-based context features, and mai2vec-0 and nwjc2vec-0 are variants of these two settings. The classifier is an SVM (LIBSVM) [19]. Table 7 shows the accuracy of each setting: nwjc2vec achieves the highest accuracy.

Table 7  WSD accuracy (%)
  baseline     76.92
  mai2vec      77.07
  nwjc2vec     77.71
  mai2vec-0    76.51
  nwjc2vec-0   76.35

[19] https://www.csie.ntu.edu.tw/~cjlin/libsvm/
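A sketch of this setup is given below; scikit-learn's LinearSVC stands in for LIBSVM, and the concatenation of the four context embeddings, the window size, and the data-loading code are assumptions rather than the authors' exact configuration.

    import numpy as np
    from gensim.models import KeyedVectors
    from sklearn.svm import LinearSVC

    wv = KeyedVectors.load_word2vec_format("nwjc2vec.txt", binary=False)
    DIM, WINDOW = wv.vector_size, 2          # two words on each side of the target

    def context_features(tokens, target_idx):
        """Concatenate the embeddings of the WINDOW words before and after the
        target word; positions outside the sentence or the embedding
        vocabulary contribute a zero vector."""
        feats = []
        for off in list(range(-WINDOW, 0)) + list(range(1, WINDOW + 1)):
            j = target_idx + off
            if 0 <= j < len(tokens) and tokens[j] in wv:
                feats.append(wv[tokens[j]])
            else:
                feats.append(np.zeros(DIM))
        return np.concatenate(feats)

    # train / test: lists of (segmented sentence, target position, sense label),
    # taken per target word from the SemEval-2010 Japanese WSD data (not shown).
    def evaluate(train, test):
        X = np.array([context_features(t, i) for t, i, _ in train])
        y = [s for _, _, s in train]
        clf = LinearSVC().fit(X, y)
        X_test = np.array([context_features(t, i) for t, i, _ in test])
        return clf.score(X_test, [s for _, _, s in test])   # accuracy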

3.3.2 Language model construction with an RNN

The second task is language modeling: given the word sequence w_0, ..., w_t, the model predicts the next word w_{t+1}. As the RNN we use a Long Short-Term Memory network (LSTM) (Gers, Schmidhuber, and Cummins 2000). Figure 2 shows the network at time step t: the input word w_t is mapped to its embedding, the LSTM updates its hidden state h_t and cell state c_t, and the output y_t, obtained by projecting h_t with a matrix W onto the vocabulary, is used to predict w_{t+1}. We compare three models that differ only in the embedding layer: base-lm, whose embeddings are initialized randomly; mai2vec-lm, initialized with mai2vec; and nwjc2vec-lm, initialized with nwjc2vec. In mai2vec-lm and nwjc2vec-lm the pretrained embeddings are kept fixed during training.

Figure 2  The LSTM language model at time step t.
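A minimal PyTorch sketch of this architecture follows; the authors' own implementation is not given here, and the framework, layer sizes, and parameter names are illustrative assumptions.

    import torch
    import torch.nn as nn

    class LSTMLanguageModel(nn.Module):
        def __init__(self, vocab_size, emb_dim=200, hidden_dim=200,
                     pretrained=None, freeze=True):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            if pretrained is not None:                    # mai2vec-lm / nwjc2vec-lm
                self.embed.weight.data.copy_(torch.as_tensor(pretrained))
                self.embed.weight.requires_grad = not freeze
            self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
            self.proj = nn.Linear(hidden_dim, vocab_size)  # the matrix W in Figure 2

        def forward(self, word_ids, state=None):
            # word_ids: (batch, time) indices of w_0 ... w_t
            emb = self.embed(word_ids)          # embedding of each w_t
            h, state = self.lstm(emb, state)    # hidden states h_t (c_t kept in `state`)
            return self.proj(h), state          # scores y_t used to predict w_{t+1}

    # Training minimizes the cross-entropy between y_t and the actual next word w_{t+1}.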

The training and test data are taken from the Yahoo! Answers and Yahoo! Blogs registers of the Balanced Corpus of Contemporary Written Japanese (Maekawa, Yamazaki, Ogiso, Maruyama, Ogura, Kashino, Koiso, Yamaguchi, Tanaka, and Den 2014): 7,330 sentences in total, of which 7,226 are used for training and 104 for evaluation. Each model is trained for 10 epochs, and the test-set perplexity is measured after every epoch. Table 8 shows the results, and Figure 3 plots them.

Table 8  Test-set perplexity per epoch
  epoch   base-lm   mai2vec-lm   nwjc2vec-lm
    1      148.13     195.41       212.52
    2      126.98     146.07       151.45
    3      124.33     129.34       129.82
    4      125.93     123.98       120.84
    5      130.35     124.72       118.68
    6      136.17     130.37       122.79
    7      143.96     135.43       128.49
    8      150.31     142.84       136.91
    9      159.09     150.90       147.10
   10      167.91     159.91       160.29
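Here perplexity over the N words of the test data is defined in the usual way from the model's next-word probabilities:

    \mathrm{PPL} = \exp\left(-\frac{1}{N}\sum_{t=1}^{N}\log p(w_t \mid w_0,\ldots,w_{t-1})\right)

Lower values indicate that the model assigns higher probability to the observed test text.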

In the language model experiment, base-lm reaches its best perplexity early (124.33 at epoch 3), while mai2vec-lm and nwjc2vec-lm continue to improve for a few more epochs and reach lower minima (123.98 at epoch 4 and 118.68 at epoch 5, respectively). nwjc2vec-lm thus yields the best language model.

4 Discussion

Both experiments indicate that nwjc2vec is of higher quality than mai2vec. For WSD, however, the differences in Table 7 are small: a previously proposed method achieves 77.28% on the same data (2015), which nwjc2vec exceeds by 0.43 points, and Yamaki et al., using embeddings built from Wikipedia, report 77.10% (Yamaki, Shinnou, Komiya, and Sasaki 2016). nwjc2vec outperforms mai2vec by 0.64 points. A large part of this difference can be attributed to coverage: the SemEval-2 data contain 175,302 tokens (15,082 types); 7,424 tokens (3,204 types) of these are not covered by mai2vec, whereas only 404 tokens (324 types) are not covered by nwjc2vec.
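Coverage figures of this kind can be obtained by counting the task-data tokens and types missing from an embedding's vocabulary; a small sketch with hypothetical file names:

    from collections import Counter
    from gensim.models import KeyedVectors

    wv = KeyedVectors.load_word2vec_format("nwjc2vec.txt", binary=False)

    # semeval_tokens.txt (hypothetical): the segmented task text,
    # one space-separated sentence per line.
    with open("semeval_tokens.txt", encoding="utf-8") as f:
        tokens = [w for line in f for w in line.split()]
    types = Counter(tokens)

    oov_tokens = sum(c for w, c in types.items() if w not in wv)
    oov_types = sum(1 for w in types if w not in wv)
    print(f"{len(tokens)} tokens / {len(types)} types; "
          f"not covered: {oov_tokens} tokens / {oov_types} types")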

We further examined fine-tuning: instead of keeping the pretrained embeddings fixed, the embedding layer is initialized with nwjc2vec [21] and then updated together with the other LSTM parameters during training. Table 9 compares the test-set perplexity of nwjc2vec-lm without and with fine-tuning, and Figure 4 plots the results. Fine-tuning lowers the perplexity at every epoch; the best value improves from 118.68 to 112.82 (both at epoch 5).

Table 9  Effect of fine-tuning on test-set perplexity
  epoch   nwjc2vec-lm   with fine-tuning
    1       212.52          194.72
    2       151.45          137.16
    3       129.82          118.32
    4       120.84          113.40
    5       118.68          112.82
    6       122.79          115.78
    7       128.49          121.34
    8       136.91          127.69
    9       147.10          133.37
   10       160.29          140.86

[21] window 5, negative sampling 20, skip-gram.
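In terms of the earlier LSTM sketch, this corresponds simply to constructing the model with freeze=False so that the embedding weights receive gradients; all names below carry over from that sketch and are illustrative placeholders, not the authors' code.

    # Without fine-tuning: pretrained nwjc2vec vectors stay fixed.
    frozen_lm = LSTMLanguageModel(vocab_size, pretrained=nwjc2vec_matrix, freeze=True)

    # With fine-tuning: the embedding layer is updated together with the LSTM.
    finetuned_lm = LSTMLanguageModel(vocab_size, pretrained=nwjc2vec_matrix, freeze=False)

    # Only parameters with requires_grad=True are handed to the optimizer.
    import torch
    optimizer = torch.optim.SGD(
        (p for p in finetuned_lm.parameters() if p.requires_grad), lr=1.0)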

Figure 4  Effect of fine-tuning (test-set perplexity per epoch).

5 Conclusion

We introduced nwjc2vec, word embedding data constructed by applying word2vec to the NINJAL Web Japanese Corpus and released publicly. Two types of evaluation experiments, a word-similarity evaluation and a task-based evaluation on WSD and LSTM language modeling, showed that nwjc2vec is of higher quality than embeddings built from seven years of newspaper data. We also confirmed that fine-tuning the nwjc2vec embeddings during language model training improves perplexity further.

Acknowledgments  This work was supported in part by projects conducted in 2011-2015 and 2016-2021 and by a project on all-words WSD (2016-2017).

References

Asahara, M., Kawahara, K., Takei, Y., Masuoka, H., Ohba, Y., Torii, Y., Morii, T., Tanaka, Y., Maekawa, K., Kato, S., and Konishi, H. (2016). BonTen: Corpus Concordance System for NINJAL Web Japanese Corpus. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations, pp. 25-29.
Asahara, M., Maekawa, K., Imada, M., Kato, S., and Konishi, H. (2014). Archiving and Analysing Techniques of the Ultra-large-scale Web-based Corpus Project of NINJAL, Japan. Alexandria: The Journal of National and International Library and Information Issues, 25 (1-2), pp. 129-148.
(2017). nwjc2vec (in Japanese). 23, pp. 94-97.
Gers, F. A., Schmidhuber, J., and Cummins, F. (2000). Learning to Forget: Continual Prediction with LSTM. Neural Computation, 12 (10), pp. 2451-2471.
Maekawa, K., Yamazaki, M., Ogiso, T., Maruyama, T., Ogura, H., Kashino, W., Koiso, H., Yamaguchi, M., Tanaka, M., and Den, Y. (2014). Balanced Corpus of Contemporary Written Japanese. Language Resources and Evaluation, 48 (2), pp. 345-371.
Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013a). Efficient Estimation of Word Representations in Vector Space. In ICLR Workshop Paper.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013b). Distributed Representations of Words and Phrases and Their Compositionality. In Advances in Neural Information Processing Systems, pp. 3111-3119.
(2016). (in Japanese). 31 (2), pp. 189-201.
Okumura, M., Shirai, K., Komiya, K., and Yokono, H. (2011). On SemEval-2010 Japanese WSD Task. Journal of Natural Language Processing, 18 (3), pp. 293-307.
(2016). Chainer (in Japanese).
(2015). (in Japanese). 21, pp. 59-62.
Sugawara, H., Takamura, H., Sasano, R., and Okumura, M. (2015). Context Representation with Word Embeddings for WSD. In PACLING-2015, pp. 149-155.
Yamaki, S., Shinnou, H., Komiya, K., and Sasaki, M. (2016). Supervised Word Sense Disambiguation with Sentences Similarities from Context Word Embeddings. In PACLIC-30, pp. 115-121.

(Received June 1, 2017; Revised August 4, 2017; Accepted September 5, 2017)