nwjc2vec: Word Embedding Data Constructed from the NINJAL Web Japanese Corpus

Hiroyuki Shinnou, Masayuki Asahara, Kanako Komiya and Minoru Sasaki

We constructed word embedding data (named nwjc2vec) from the NINJAL Web Japanese Corpus using the word2vec software, and have released it publicly. This report introduces nwjc2vec and presents two types of experiments conducted to evaluate its quality. The first is an evaluation based on word similarity: using a word similarity dataset, we compute Spearman's rank correlation coefficient between human similarity ratings and similarities derived from the embeddings. The second is a task-based evaluation, using word sense disambiguation (WSD) and language model construction with a recurrent neural network (RNN). In both cases the results obtained with nwjc2vec are compared with those obtained with word embeddings trained on seven years of newspaper articles, and nwjc2vec is shown to be of high quality.

Key Words: Word Embedding, NINJAL Web Japanese Corpus, word2vec

Department of Computer and Information Sciences, Ibaraki University / National Institute for Japanese Language and Linguistics
Vol. 24 No. 5 December 2017

1 Introduction

In a one-hot representation over a vocabulary of N words, each word w_i is an N-dimensional vector whose i-th element is 1 and all other elements are 0. Such vectors are high-dimensional, sparse, and carry no information about the relatedness of words. Distributed representations (word embeddings) instead assign each word a dense, low-dimensional real-valued vector, and they spread rapidly after Mikolov et al. released the word2vec software (Mikolov, Sutskever, Chen, Corrado, and Dean 2013b; Mikolov, Chen, Corrado, and Dean 2013a) (see also (2016)).^1 With tools such as word2vec^2 and GloVe,^3 anyone can train word embeddings, but the quality of the result depends strongly on the size and quality of the training corpus.

We therefore trained word embeddings on the NINJAL Web Japanese Corpus (NWJC) (Asahara, Maekawa, Imada, Kato, and Konishi 2014), an ultra-large-scale web corpus of about 25.8 billion tokens, and have released the result as nwjc2vec.^4

1 Japanese text must first be segmented into words (for example with mecab -Owakati) before word2vec can be applied.
2 https://github.com/svn2github/word2vec
3 https://nlp.stanford.edu/projects/glove/
4 http://nwjc-data.ninjal.ac.jp/
This paper is organized as follows. Section 2 describes how nwjc2vec was constructed. Section 3 evaluates nwjc2vec through word similarity and through two tasks, word sense disambiguation and language modeling with a recurrent neural network (RNN). Section 4 discusses the results, and Section 5 concludes.

2 Construction of nwjc2vec

2.1 NWJC

NWJC was built by crawling the web with Heritrix-3.1.1,^6 starting from a seed URL list. The collected pages were converted to text and normalized with nwc-toolkit-0.0.2,^7 morphologically analyzed with MeCab-0.996^8 and the UniDic-2.1.2 dictionary,^9 and dependency-parsed with CaboCha-0.69^10,11 (Asahara, Kawahara, Takei, Masuoka, Ohba, Torii, Morii, Tanaka, Maekawa, Kato, and Konishi 2016). In this work we use the portion crawled from October to December 2014, denoted NWJC-2014-4Q; its size is summarized in Table 1.

6 http://webarchive.jira.com/wiki/display/heritrix/heritrix/
7 https://github.com/xen/nwc-toolkit
8 https://taku910.github.io/mecab/
9 http://unidic.ninjal.ac.jp/
10 https://taku910.github.io/cabocha/
11 To obtain UniDic part-of-speech tags, CaboCha is built with ./configure --with-posset=unidic.
2.2 Training with word2vec

We trained word2vec^12 on the NWJC-2014-4Q text with the parameters shown in Table 2.^13

Table 1  Size of NWJC-2014-4Q

  URLs                    83,992,556
  sentences            3,885,889,575
  distinct sentences   1,463,142,939
  tokens              25,836,947,421

Table 2  word2vec training parameters

  model (CBOW or skip-gram)   CBOW (-cbow 1)
  vector dimensionality       -size 200
  context window              -window 8
  negative samples            -negative 25
  hierarchical softmax        off (-hs 0)
  subsampling threshold       -sample 1e-4
  iterations                  -iter 15

2.3 Format of nwjc2vec

Each entry of nwjc2vec is one line consisting of a word key followed by the 200 values of its vector. nwjc2vec is provided in two keyings: by surface form (word) and by morphological analysis (mrph).^14

  e_1 e_2 ... e_200

12 https://github.com/svn2github/word2vec
13 The parameters follow those of the demo-word.sh script distributed with word2vec.
14 The mrph keys use the output format defined in the dicrc of unidic-mecab kana-accent-2.1.2 (26 fields).
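Under the settings of Table 2, the training corresponds to an invocation of the word2vec tool like the following; the input and output file names are illustrative, and -binary 0 is assumed so that the output matches the released text format:

```shell
# Train CBOW embeddings with the Table 2 parameters.
# nwjc-2014-4q.txt: one segmented sentence per line (hypothetical path).
./word2vec -train nwjc-2014-4q.txt -output nwjc2vec.txt \
  -cbow 1 -size 200 -window 8 -negative 25 -hs 0 \
  -sample 1e-4 -iter 15 -binary 0
```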
Here e_i is the i-th element of the vector. In the mrph keying, each key is the full 26-field UniDic feature string (surface form, part of speech, conjugation type, reading, and so on, separated by commas), so that homographs with different analyses receive different vectors; the vector values follow the key, for example:

  [surface],[pos],...,*,*,*,...  -10.491043 -2.121982 -3.084628 ... 4.024705 3.570072 12.781445

nwjc2vec contains 1,738,455 words.^15 Table 3 shows the distribution of the vocabulary over parts of speech.

Table 3  Part-of-speech distribution of the nwjc2vec vocabulary

  part of speech       count      (%)
  [label lost]     1,570,477    90.34
  [label lost]       129,167     7.43
  [label lost]        12,507     0.71
  [label lost]         7,083     0.41
  [label lost]         4,884     0.28
  [label lost]         3,761     0.21
  [label lost]         3,614     0.21
  [label lost]         1,496     0.08
  [label lost]         1,163     0.07
  [label lost]           971     0.05
  [label lost]           390     0.02
  [label lost]           366     0.02
  [label lost]           330     0.02
  [label lost]           125     0.01
  [label lost]           100     0.01
  etc.                 2,021     0.12
  total            1,738,455   100.00

15 The released file has 1,738,456 lines; the first line is the header output by word2vec.
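The released text format (a header line giving the vocabulary size and dimensionality, then one entry per line with a key followed by 200 values) can be read with a few lines of Python. This is a minimal sketch, not an official loader; it assumes the key and the values are separated by single spaces, as in word2vec's text output:

```python
def load_embeddings(lines):
    """Parse word2vec text format: a 'vocab_size dim' header line,
    then one line per word: key followed by dim float values."""
    it = iter(lines)
    vocab_size, dim = map(int, next(it).split())
    vectors = {}
    for line in it:
        parts = line.rstrip().split(" ")
        key, values = parts[0], parts[1:]
        assert len(values) == dim
        vectors[key] = [float(v) for v in values]
    assert len(vectors) == vocab_size
    return vectors

# Usage with a toy 2-word, 3-dimensional file in place of the real data:
toy = ["2 3", "cat 0.1 0.2 0.3", "dog -0.1 0.0 0.4"]
emb = load_embeddings(toy)
print(emb["dog"])  # [-0.1, 0.0, 0.4]
```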
Table 4  Example entries from the nwjc2vec vocabulary.
3 Evaluation of nwjc2vec

We evaluate nwjc2vec in two ways: by word similarity and by performance on tasks. For comparison, we also evaluate word embeddings trained on seven years of newspaper articles. A preliminary version of this evaluation appeared in (2017).

3.1 mai2vec

The comparison embeddings were trained on seven years (1993 to 1999) of Mainichi newspaper articles, 6,791,403 sentences in total. The articles were morphologically analyzed with MeCab-0.996 and UniDic-2.1.2, and word2vec was run with the same parameters as for nwjc2vec (Table 2). We call the resulting embeddings mai2vec. The mai2vec vocabulary contains 132,509 words.

3.2 Evaluation by word similarity

We use the Japanese Word Similarity Dataset, which contains word pairs of four parts of speech, each pair rated for similarity by multiple annotators on a scale from 0 to 10. For each pair we compute the cosine similarity of the two embedding vectors and then compute Spearman's rank correlation coefficient between these similarities and the human ratings.
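This evaluation pairs cosine similarity with Spearman's rank correlation; the following self-contained sketch illustrates both steps, with toy two-dimensional vectors standing in for real embeddings (it ignores tied ranks, which a real evaluation should average):

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def spearman(xs, ys):
    """Spearman's rank correlation: Pearson correlation of the ranks.
    Assumes no ties (tied values should receive averaged ranks)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Toy data: human ratings vs. embedding similarities for 4 word pairs.
human = [9.0, 7.5, 3.0, 1.0]
pairs = [([1, 0], [1, 0.1]), ([1, 0], [0.8, 0.5]),
         ([1, 0], [0.2, 1]), ([1, 0], [-1, 0.1])]
model = [cosine(u, v) for u, v in pairs]
print(round(spearman(human, model), 3))  # 1.0: orderings agree perfectly
```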
Table 5 shows the number of word pairs, and Table 6 the resulting correlation coefficients. The dataset is available at https://github.com/tmu-nlp/japanesewordsimilaritydataset.

Table 5  Word pairs per part of speech (first row: pairs in the dataset; second row: pairs used in the evaluation)

        959   901   1,102   1,463
        431   190     793     152

Table 6  Spearman's rank correlation coefficients (columns as in Table 5)

  mai2vec    0.293   0.313   0.197   0.223
  nwjc2vec   0.342   0.464   0.206   0.345

In every category, nwjc2vec achieves a higher correlation than mai2vec.

3.3 Task-based evaluation

3.3.1 Word sense disambiguation

Following Sugawara et al. (Sugawara, Takamura, Sasano, and Okumura 2015), we use word embeddings as additional features for supervised WSD. From the words surrounding the target word, Sugawara et al. derive embedding-based features in two ways: by concatenating the embedding vectors of the surrounding words, and by averaging them into a single vector V. Here we add the averaged vector V to the conventional WSD features.
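In the averaged variant, each training or test instance is thus represented by the mean of the embeddings of the words around the target word. A minimal sketch with toy 3-dimensional vectors in place of real embeddings (a real system would feed such features, together with the conventional features, to an SVM):

```python
import numpy as np

def context_feature(tokens, target_idx, emb, window=2, dim=3):
    """Average the embeddings of the words within +/-window of the target.
    Words missing from the embedding vocabulary are skipped."""
    lo = max(0, target_idx - window)
    vecs = [emb[w]
            for i, w in enumerate(tokens[lo:target_idx + window + 1], start=lo)
            if i != target_idx and w in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# Toy embeddings standing in for nwjc2vec entries.
emb = {"bank":  np.array([1.0, 0.0, 0.0]),
       "river": np.array([0.0, 1.0, 0.0]),
       "money": np.array([0.0, 0.0, 1.0])}
tokens = ["the", "river", "bank", "money", "flows"]
feat = context_feature(tokens, target_idx=2, emb=emb)
print(feat)  # mean of the "river" and "money" vectors: [0.  0.5 0.5]
```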
Table 7  WSD accuracy (%)

  baseline   mai2vec   nwjc2vec   mai2vec-0   nwjc2vec-0
   76.92      77.07      77.71      76.51       76.35

As the evaluation data we use the SemEval-2010 Japanese WSD task (Okumura, Shirai, Komiya, and Yokono 2011): 50 target words, each with 50 training instances and 50 test instances. The baseline system uses conventional WSD features only; the mai2vec and nwjc2vec systems add the embedding-based features to the baseline features; mai2vec-0 and nwjc2vec-0 use the embedding-based features alone. The classifier is an SVM (libsvm^19). As Table 7 shows, adding the nwjc2vec features yields the best accuracy, 77.71%.

3.3.2 Language modeling with an RNN

An RNN language model reads the word w_t of a sentence s at time t and predicts the next word

19 https://www.csie.ntu.edu.tw/~cjlin/libsvm/
w_{t+1}. As the RNN we use an LSTM (Long Short-Term Memory) (Gers, Schmidhuber, and Cummins 2000). Figure 2 shows the model at time t: the input word w_t is mapped through the embedding layer and fed to the LSTM, which updates its hidden state h_t and cell state c_t; from h_t the output layer produces y_t, a probability distribution over the vocabulary W obtained with a softmax over one-hot word positions, and the highest-probability word is the prediction of w_{t+1}. The same LSTM is applied from w_0 through w_t, carrying h and c forward to time t + 1 at each step.

We compare three language models: mai2vec-lm and nwjc2vec-lm, whose embedding layers are initialized with mai2vec and nwjc2vec respectively, and base-lm, whose embedding layer is initialized randomly.

Figure 2  The LSTM language model at time t.
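One step of the LSTM of Figure 2 can be sketched in NumPy as follows; the weight names and shapes are illustrative, not those of any particular toolkit, and the 200-dimensional input matches the nwjc2vec dimensionality:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b, Wy):
    """One LSTM time step: input embedding x, previous hidden h, cell c.
    W, U, b hold the stacked input/forget/output/candidate parameters;
    Wy maps the new hidden state to vocabulary logits (softmax -> y_t)."""
    d = h.shape[0]
    z = W @ x + U @ h + b          # (4d,) pre-activations
    i = sigmoid(z[0:d])            # input gate
    f = sigmoid(z[d:2*d])          # forget gate
    o = sigmoid(z[2*d:3*d])        # output gate
    g = np.tanh(z[3*d:4*d])        # candidate cell value
    c_new = f * c + i * g          # cell state update
    h_new = o * np.tanh(c_new)     # hidden state update
    logits = Wy @ h_new
    y = np.exp(logits - logits.max())
    y /= y.sum()                   # softmax over the vocabulary
    return h_new, c_new, y

rng = np.random.default_rng(0)
emb_dim, hid, vocab = 200, 100, 50
x = rng.standard_normal(emb_dim)        # embedding of w_t
h = np.zeros(hid); c = np.zeros(hid)
W = rng.standard_normal((4*hid, emb_dim)) * 0.01
U = rng.standard_normal((4*hid, hid)) * 0.01
b = np.zeros(4*hid)
Wy = rng.standard_normal((vocab, hid))
h, c, y = lstm_step(x, h, c, W, U, b, Wy)
print(y.shape)  # (50,): a probability distribution over the vocabulary
```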
As training and evaluation data we use the Yahoo! Chiebukuro (Q&A) portion of the Balanced Corpus of Contemporary Written Japanese (Maekawa, Yamazaki, Ogiso, Maruyama, Ogura, Kashino, Koiso, Yamaguchi, Tanaka, and Den 2014): 7,330 sentences for training and 7,226 sentences for evaluation. Each model is trained epoch by epoch, and the perplexity on the evaluation data is measured after every epoch. Table 8 and Figure 3 show the perplexity over the first 10 epochs.

Table 8  Perplexity on the evaluation data per epoch

  epoch   base-lm   mai2vec-lm   nwjc2vec-lm
    1      148.13     195.41       212.52
    2      126.98     146.07       151.45
    3      124.33     129.34       129.82
    4      125.93     123.98       120.84
    5      130.35     124.72       118.68
    6      136.17     130.37       122.79
    7      143.96     135.43       128.49
    8      150.31     142.84       136.91
    9      159.09     150.90       147.10
   10      167.91     159.91       160.29

Figure 3  Perplexity per epoch for the three models.
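Perplexity, as reported in Table 8, is the exponential of the average negative log-probability the model assigns to the evaluation tokens; a minimal illustration:

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-probability of the test tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model that assigns probability 1/4 to each of four test tokens is
# exactly as uncertain as a uniform choice among 4 words:
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # approximately 4.0
```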
In the early epochs base-lm has the lowest perplexity, but from epoch 4 onward nwjc2vec-lm is the best of the three models, and mai2vec-lm also overtakes base-lm. Initializing the embedding layer with pretrained vectors, and with nwjc2vec in particular, is clearly effective.

4 Discussion

In the word similarity evaluation, nwjc2vec outperformed mai2vec in all four categories, although the correlations are modest for both embeddings, as also reported in (2017).

In the WSD evaluation, the gains over the baseline are small. On the same SemEval-2010 data, (2015) reported an accuracy of 77.28%; nwjc2vec's 77.71% is 0.43 points higher. Yamaki et al., using embeddings trained on Wikipedia, reported 77.10% (Yamaki, Shinnou, Komiya, and Sasaki 2016). Although mai2vec also improves on the baseline, nwjc2vec is 0.64 points better than mai2vec.

One reason for this 0.64-point difference is vocabulary coverage. The SemEval-2010 data contains 175,302 tokens (15,082 types). mai2vec has no vector for 7,424 of these tokens (3,204 types), whereas nwjc2vec lacks only 404 tokens (324 types). The far broader coverage of nwjc2vec thus contributes directly to the WSD task.
Table 9  Perplexity of nwjc2vec-lm with and without fine-tuning

  epoch   nwjc2vec-lm   with fine-tuning
    1        212.52          194.72
    2        151.45          137.16
    3        129.82          118.32
    4        120.84          113.40
    5        118.68          112.82
    6        122.79          115.78
    7        128.49          121.34
    8        136.91          127.69
    9        147.10          133.37
   10        160.29          140.86

In the experiments of Section 3.3.2, the embedding layer initialized with nwjc2vec was kept fixed during LSTM training. If we instead allow it to be updated (fine-tuning), the perplexity improves at every epoch, as Table 9 shows; the best value drops from 118.68 (epoch 5) to 112.82. Fine-tuning nwjc2vec for the target task is therefore effective.^21

5 Conclusion

We constructed word embedding data, nwjc2vec, by applying word2vec to the NINJAL Web Japanese Corpus, and have released it publicly. To evaluate its quality, we conducted two types of experiments.

21 Settings for this comparison: window 5, 20 negative samples, skip-gram.
In the first, we measured Spearman's rank correlation against a word similarity dataset; in the second, we used nwjc2vec for WSD and for RNN language modeling, comparing against embeddings trained on newspaper articles (mai2vec). nwjc2vec outperformed mai2vec in both, and fine-tuning nwjc2vec during language model training improved perplexity further. In future work we plan to apply nwjc2vec to all-words WSD.

Acknowledgments

This work was supported by research projects running 2011 to 2015 and 2016 to 2021, including work on all-words WSD (2016 to 2017).

References

Asahara, M., Kawahara, K., Takei, Y., Masuoka, H., Ohba, Y., Torii, Y., Morii, T., Tanaka, Y., Maekawa, K., Kato, S., and Konishi, H. (2016). "BonTen: Corpus Concordance System for NINJAL Web Japanese Corpus." In Proceedings of COLING 2016, the 26th International
nwjc2vec: Conference on Computational Linguistics: System Demonstrations, pp. 25 29. Asahara, M., Maekawa, K., Imada, M., Kato, S., and Konishi, H. (2014). Archiving and Analysing Techniques of the Ultra-large-scale Web-based Corpus Project of NINJAL, Japan. Alexandria: The Journal of National and International Library and Information Issues, 25 (1 2), pp. 129 148. (2017). nwjc2vec:. 23, pp. 94 97. Gers, F. A., Schmidhuber, J., and Cummins, F. (2000). Learning to Forget: Continual Prediction with LSTM. Neural Computation, 12 (10), pp. 2451 2471. Maekawa, K., Yamazaki, M., Ogiso, T., Maruyama, T., Ogura, H., Kashino, W., Koiso, H., Yamaguchi, M., Tanaka, M., and Den, Y. (2014). Balanced Corpus of Contemporary Written Japanese. Language Resources and Evaluation, 48 (2), pp. 345 371. Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013a). Efficient Estimation of Word Representations in Vector Space. In ICLR Workshop Paper. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013b). Distributed Representations of Words and Phrases and Their Compositionality. In Advances in Neural Information Processing Systems, pp. 3111 3119. (2016).., 31 (2), pp. 189 201. Okumura, M., Shirai, K., Komiya, K., and Yokono, H. (2011). On SemEval-2010 Japanese WSD Task., 18 (3), pp. 293 307. (2016). Chainer.. (2015).. 21, pp. 59 62. Sugawara, H., Takamura, H., Sasano, R., and Okumura, M. (2015). Context Representation with Word Embeddings for WSD. In PACLING-2015, pp. 149 155. Yamaki, S., Shinnou, H., Komiya, K., and Sasaki, M. (2016). Supervised Word Sense Disambiguation with Sentences Similarities from Context Word Embeddings. In PACLIC-30, pp. 115 121. 1985 1987 1993 719
(Received June 1, 2017; Revised August 4, 2017; Accepted September 5, 2017)