RMeCab

June 16, 2008

Contents

1 MeCab RMeCab
1.1
1.2
1.3 MeCab
1.4 RMeCab
2 RMeCab
2.1 RMeCab
2.2 MeCab
2.3
2.4 (term-document matrix)
2.5
2.6 N-gram
2.7 (collocation)

1 MeCab RMeCab

1.1

MeCab R MeCab RMeCab

1.2

ishida-m@ias.tokushima-u.ac.jp

() (MeCab) *1 Juman ChaSen *2 MeCab ChaSen

1.3 MeCab

MeCab OS Windows MeCab [] [Binary package for MS-Windows] [] sourceforge.net [mecab-win32] Mac OS X Linux [mecab] Mac OS X Linux [mecab-ipadic]

Windows: Windows MeCab 0.97 mecab-0.97.exe [OK] Shift JIS MeCab C:\Program Files

Mac OS X, Unix: Downloads Terminal Mac OS X DVD

*1 http://mecab.sourceforge.net/
*2 http://mecab.sourceforge.net/feature.html

1-1 MeCab

# ** is a placeholder for the version number
$ cd Downloads
$ tar zxvf mecab-0.**.tar.gz
$ cd mecab-0.**
$ ./configure --with-charset=utf-8
$ make
$ sudo make install
# dictionary
$ tar zxf mecab-ipadic-2.7.0-20070****.tar.gz
$ cd mecab-ipadic-2.7.0-20070****
$ ./configure --with-charset=utf-8
$ make
$ sudo make install

Windows MeCab [Enter] C work test.txt res.txt

C:\Program Files\MeCab\bin> mecab c:\work\test.txt > c:\work\res.txt

test.txt res.txt 1-1

[1-1: MeCab output, ending with EOS]

In 1-1, each line of MeCab output gives the surface form followed by its part of speech, part-of-speech subcategories 1, 2, and 3, inflection type, inflection form, base form, reading, and pronunciation. EOS marks the end of sentence. (token) (type) MeCab 1-2

[1-2: MeCab output, ending with EOS]

1-2 9

8 MeCab CSV () () MeCab R MeCab R R MeCab RMeCab

1.4 RMeCab

RMeCab R MeCab R RMeCab

1.4.1 RMeCab

R MeCab R MeCab 2 RMeCab *1 OS RMeCab

RMeCab 0.50: RMeCab_0.50.zip, RMeCab_0.50.tgz, RMeCab_0.50.tar.gz (.zip: Windows; .tgz: Mac OS X; .tar.gz: Unix)

Windows: RMeCabInstall.txt Windows R RMeCab *2 [1] Windows R *3 R R [] - [ zip ] RMeCab_***.zip (1-2) *** RMeCabInstall.txt RMeCabInstall.bat (1-3) 1

[1-2: RMeCab]

[1-3]

Mac OS X: R [] - [] [CRAN] [] [install] RMeCab_***.tgz ***

Linux: R R R getwd() **

> install.packages("RMeCab_0.**.tar.gz", destdir=".", repos = NULL)

*1 http://groups.google.co.jp/group/rmecab/
*2
*3 RMeCabInstall.txt R MeCab RMeCabInstall.bat MeCab bin libmecab.dll R library RMeCab libs libmecab.dll

2 RMeCab

RMeCab R RMeCab Windows R [] - [ ] RMeCab (2-1) R library(RMeCab) [Enter] R

[2-1: RMeCab]

RMeCab 2-1 *1 2-1 2-1 RMeCab *2

Windows: data.zip. Mac OS X, Unix: data.tar.gz.

Windows data zip [] [] [] zip data data

*1
*2 http://groups.google.co.jp/group/rmecab

RMeCabC
RMeCabText
RMeCabDF
RMeCabFreq
docmatrix
collocate
collscores
Ngram: N, N-gram
NgramDF: N, N-gram
docngram: N N-gram
rmsign

2-1: RMeCab

C (C:) R R getwd()

2.1 RMeCab

RMeCab RMeCabText() RMeCabFreq() MeCab

2.1.1 RMeCabC()

RMeCabC() MeCab

> res <- RMeCabC("")
> res
[[1]]
""
[[2]]
""
[[3]]
""
[[4]]
""
#...
> res[[1]]
""
> unlist(res)
... "" "" "" "" ...
> x <- ""
> res <- RMeCabC(x)
> unlist(res)

The result is an R list, so its elements are accessed with [[ ]], as in res[[1]]; unlist() converts the list to a vector. A string stored in a variable (x above) can also be passed to RMeCabC().

With 1 as the second argument, RMeCabC() returns base forms; with 0 it returns surface forms.

> res <- RMeCabC("", 1)
> unlist(res)
"" "" "" ""
> res <- RMeCabC("", 0)
> unlist(res)
"" "" "" ""

2 1 2 0 ( 2 ) ()

> res <- RMeCabC("")
> res2 <- unlist(res)
> res2
"" "" "" "" "" "" ""
> res2[names(res2) == ""]
"" "" "" ""
> names(res2) == ""
[1] TRUE FALSE TRUE FALSE TRUE FALSE TRUE

RMeCabC() res2: comparing names(res2) with == yields TRUE or FALSE for each element, and indexing res2 with [ ] by that logical vector keeps only the elements that are TRUE. which() TRUE any()

> res3 <- names(res2) == ""
> res3
[1] TRUE FALSE TRUE FALSE TRUE FALSE TRUE
> which(res3)
[1] 1 3 5 7
> any(res3)
[1] TRUE

which() returns the positions of the TRUE elements; any() returns TRUE if at least one element is TRUE.

2.1.2 RMeCabText()

RMeCabText() 1 10 MeCab RMeCab data data yukiguni.txt

> res <- RMeCabText("yukiguni.txt")
> res
[[1]]
 [1] "" "" "" "*" "*"
 [6] "*" "*" "" "" ""
[[2]]
 [1] "" "" "" "" "*" "*" "*" ""
 [9] "" ""
[[3]]
 [1] "" "" "" "*"
 [5] "*" "" "" ""
#...

2.1.3 RMeCabFreq()

RMeCabFreq() Windows Linux Mac OS X Windows

> res <- RMeCabFreq("yukiguni.txt")
length = 13
> res
  Term Info1 Info2 Freq
1                     3
2                     1
3                     1
#...

res Term Info1 Info2 Freq 1 3 1 data kumo.txt

> pt1 <- proc.time()
> res <- RMeCabFreq("kumo.txt")
length = 447
> pt2 <- proc.time()
> pt2 - pt1
0.008 0.008 1.703

MeCab Windows Mac OS X Linux RMeCabFreq() length = 447 () () 447 Linux Mac OS X 446 Linux Mac OS X MeCab OS

MeCab: [MeCab output]

Windows: [MeCab output]

Windows Mac OS X Linux OS OS MeCab MeCab MeCab *1 Unix OS Windows MeCab

*1 http://mecab.sourceforge.net/dic.html

2.2 MeCab

Windows MeCab Mac OS X Linux MeCab

C:\Program Files\MeCab\bin> mecab
[MeCab output]
EOS

CSV

,-1,-1,1000,,,,,*,*,,,

MeCab ID ID 1 2 3 motohiro.csv C data ("C:\data") ID ID -1 MeCab

Windows []-[]-[ ]-[] cd MeCab bin MeCab C:\Program Files\MeCab mecab-dict-index.exe MeCab motohiro.csv \ ()

C:\data> cd C:\Program Files\MeCab\bin
C:\Program Files\MeCab\bin> mecab-dict-index.exe \
    -d c:\Program Files\MeCab\dic\ipadic \
    -u ishida.dic -f shift-jis -t shift-jis \
    c:\data\motohiro.csv
reading c:\data\mecabdic.csv... 1
emitting double-array: 100% ###########################################
done!

done mecab-dict-index.exe ishida.dic C:\data MeCab C:\Program Files\MeCab\dict\dicrc Windows ([]-[]-[]-[])

userdic = C:\data\ishida.dic

MeCab

C:\Program Files\MeCab\bin> mecab
[MeCab output]
EOS

*1

*1 http://mecab.sourceforge.net/dic.html

2.3 R

2.3.1 RMeCabDF()

RMeCabDF() 2-2 1 2 3 3 RMeCabDF() 1 2 3 1 data photo.csv

ID, Sex, Reply
1, F,
2, M,
3, F,
4, F,
5, M,

2-2: CSV

> dat <- read.csv("photo.csv")
> res <- RMeCabDF(dat, 3)      # analyze column 3 (Reply)
> res <- RMeCabDF(dat, 3, 1)   # same column, with 1 as the third argument

RMeCabDF() res length(res) [[ ]] res res[[1]] res[[1]] 5

> res[[1]]
"" "" "" "" ""

2.4 (term-document matrix)

Term  doc1  doc2  doc3
         1     1     1
         1     1     0
         1     0     0
         0     1     1
         0     0     1
         0     0     1
         0     0     1

doc1, doc2, doc3
doc1:
doc2:
doc3:

RMeCabText() doc1 doc2 doc3 16

2.4.1 docmatrix()

docmatrix() 1 data doc doc1.txt, doc2.txt, doc3.txt

> res <- docmatrix("doc", pos = c("","",""))
> res
                  docs
terms              doc1.txt doc2.txt doc3.txt
  [[LESS-THAN-1]]         0        0        0
  [[TOTAL-TOKENS]]        4        4        8
                          1        1        1
                          1        1        0
                          1        0        0
                          0        1        1
                          0        0        1
                          0        0        1
                          0        0        1

docmatrix() 1 2 pos docmatrix() minfreq 1 2 2 1 2 0 2 [[LESS-THAN-2]] 2 docmatrix() () () 2 [[LESS-THAN-1]] 1 1 minfreq minfreq 2 [[TOTAL-TOKENS]] pos () pos

> res <- docmatrix("doc", pos = c("",""))
> res
                  docs
terms              doc1.txt doc2.txt doc3.txt
  [[LESS-THAN-1]]         0        0        0
  [[TOTAL-TOKENS]]        4        4        8
                          1        1        0
                          1        0        0
                          0        1        1
                          0        0        1

[[TOTAL-TOKENS]] 2 minfreq 1

> res <- docmatrix("doc", pos = c("",""), minfreq = 2)
> res
                  docs
terms              doc1.txt doc2.txt doc3.txt
  [[LESS-THAN-2]]         2        2        2
  [[TOTAL-TOKENS]]        4        4        8

2 2 [[LESS-THAN-2]] 2 1 morikita

> res <- docmatrix("morikita", pos = c("",""), minfreq = 2)
> nrow(res)
[1] 11
> res
                  docs
terms              morikita1.txt morikita2.txt morikita3.txt
  [[LESS-THAN-2]]             18            19            21
  [[TOTAL-TOKENS]]            42            60            77
                               2             0             0
                               2             0             0
                               0             5             2
                               0             2             0
                               0             2             0
                               0             0             2
                               0             0             2
                               0             0             2
                               0             0             2

() 42, 60, 77; 2; 18, 19, 21; morikita1.txt 2 0 morikita3.txt

1 [[LESS-THAN-2]]

2.5

() 1 3 100 3 CPU (local weight) (global weight) (normalization) 3 (term frequency; TF) IDF (inverse document frequency) (2002) (1999)

2.5.1 docmatrix()

docmatrix() weight:

Local weight: tf (), tf2 (logarithmic TF), tf3 (binary weight)
Global weight: idf (), idf2 ( IDF), idf3 ( IDF), idf4 ()
Normalization: norm ()

weight * tf idf

> res <- docmatrix("doc", pos = c("","",""), weight = "tf*idf")
> res
      docs
terms  doc1.txt doc2.txt doc3.txt
       1.000000 1.000000 1.000000
       1.584963 1.584963 0.000000
       2.584963 0.000000 0.000000
       0.000000 2.584963 0.000000
       0.000000 1.584963 1.584963
       0.000000 1.584963 1.584963
       0.000000 0.000000 2.584963
       0.000000 0.000000 2.584963

doc1.txt 1 tf idf

$idf = \log_2 (N / n_i) + 1$

N is the number of documents and $n_i$ the number of documents that contain term $w_i$. With three documents, idf takes the following values:

log2(3/3) + 1 = 1
log2(3/2) + 1 = 1.584963
log2(3/1) + 1 = 2.584963

tf weight *norm

> res <- docmatrix("doc", pos = c("","",""), weight = "tf*idf*norm")
> res
      docs
terms   doc1.txt  doc2.txt  doc3.txt
       0.3132022 0.2563399 0.2271069
       0.4964137 0.4062891 0.0000000
       0.8096159 0.0000000 0.0000000
       0.0000000 0.6626290 0.0000000
       0.0000000 0.4062891 0.3599560
       0.0000000 0.4062891 0.3599560
       0.0000000 0.0000000 0.5870629
       0.0000000 0.0000000 0.5870629

1 ()
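The weights printed above can be checked by hand in base R. The following is a minimal sketch, not RMeCab's implementation: it assumes the term-frequency matrix implied by the tf*idf output, uses hypothetical row labels t1 to t8 in place of the terms, and applies idf = log2(N / n_i) + 1 followed by column-wise length normalization.

```r
# Term frequencies implied by the tf*idf output above.
# Row names t1..t8 are hypothetical placeholders for the eight terms.
tf <- matrix(
  c(1, 1, 1,
    1, 1, 0,
    1, 0, 0,
    0, 1, 0,
    0, 1, 1,
    0, 1, 1,
    0, 0, 1,
    0, 0, 1),
  ncol = 3, byrow = TRUE,
  dimnames = list(terms = paste0("t", 1:8),
                  docs  = c("doc1.txt", "doc2.txt", "doc3.txt"))
)

N   <- ncol(tf)            # number of documents
n_i <- rowSums(tf > 0)     # number of documents containing each term
idf <- log2(N / n_i) + 1   # global weight, one value per term

# idf has one value per row, so it is recycled down each column
tfidf <- tf * idf

# Normalization: divide each document column by its Euclidean length
col_len    <- sqrt(colSums(tfidf^2))
tfidf_norm <- sweep(tfidf, 2, col_len, "/")

print(round(tfidf, 6))
print(round(tfidf_norm, 7))
```

After normalization every document column has length 1, so documents of different sizes become comparable.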

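For normalization, docmatrix() takes, for each document, the square root of the sum of the squared tf*idf values, $\sqrt{\sum_i (tf_i \cdot idf_i)^2}$. For doc1.txt:

$\sqrt{1^2 + 1.584963^2 + 2.584963^2} = 3.192827$

Dividing each tf*idf value by this figure gives a document vector of length 1.

2.6 N-gram

N-gram N N 2 2-3 [ - ] [ - ] [ - ] N 2 bi-gram () 2-3 N 2 2-4 2-4 bi-gram bi-gram 2-5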
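To make the idea of an N-gram concrete, here is a minimal base-R sketch that counts character N-grams in a string. It does not use RMeCab: the helper name char_ngram() and the sample string are hypothetical, and the bracketed labels only imitate the [ - ] notation used in the text.

```r
# Split a string into overlapping character N-grams, labelled like "[a-b]".
char_ngram <- function(x, n = 2) {
  chars <- strsplit(x, "")[[1]]
  if (length(chars) < n) return(character(0))
  starts <- seq_len(length(chars) - n + 1)
  vapply(
    starts,
    function(i) paste0("[", paste(chars[i:(i + n - 1)], collapse = "-"), "]"),
    character(1)
  )
}

bigrams  <- char_ngram("abcab", n = 2)   # N = 2: bi-grams
trigrams <- char_ngram("abcab", n = 3)   # N = 3: tri-grams

# Frequency table with the same column names as the Ngram() output
freq <- as.data.frame(table(Ngram = bigrams), stringsAsFactors = FALSE)
print(freq)
```

A string of k characters yields k - N + 1 overlapping N-grams, so repeated sequences show up as frequencies greater than 1.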

2-5

2.6.1 Ngram()

Ngram() N bi-gram N-gram Ngram() R bi-gram

> res <- Ngram("yukiguni.txt")
file = yukiguni.txt Ngram = 2
length = 38
> nrow(res)
[1] 38
> res
   Ngram Freq
1  [-]      1
2  [-]      1
3  [-]      1
4  [-]      1
5  [-]      1
6  [-]      1
#...
34 [-]      1
35 [-]      1
36 [-]      1
37 [-]      1
38 [-]      1

bi-gram

> res <- Ngram("yukiguni.txt", type = 1, N = 2)
file = yukiguni.txt Ngram = 2
length = 25
> nrow(res)
[1] 25
> res
   Ngram Freq
1  [-]      1
2  [-]      1
3  [-]      1
4  [-]      1
5  [-]      1
#..
20 [-]      1
21 [-]      1
22 [-]      1
23 [-]      1
24 [-]      1
25 [-]      1

bi-gram tri-gram tri-gram N 3 3-gram

> # bi-gram
> res <- Ngram("yukiguni.txt", type = 2, N = 2)
file = yukiguni.txt Ngram = 2
length = 13
> nrow(res)
[1] 13
> res
   Ngram Freq
1  [-]      2
2  [-]      3
3  [-]      2
4  [-]      3
5  [-]      2
6  [-]      2
7  [-]      1
8  [-]      1
9  [-]      6
10 [-]      1
11 [-]      1
12 [-]      1
13 [-]      2
>
> # tri-gram
> res <- Ngram("yukiguni.txt", type = 2, N = 3)
file = yukiguni.txt Ngram = 3
length = 20
> nrow(res)
[1] 20
> res
   Ngram Freq
1  [--]     1
2  [--]     1
3  [--]     2
4  [--]     1
5  [--]     1
#...
16 [--]     1
17 [--]     1
18 [--]     1
19 [--]     1
20 [--]     1

Ngram() type 1 N-gram

> res <- Ngram("yukiguni.txt", type = 1, N = 2, pos = "")
file = Ngram = 2
length = 7
> res
  Ngram Freq
1 [-]      1
2 [-]      1
3 [-]      1
4 [-]      1
5 [-]      1
6 [-]      1
7 [-]      1

pos = "" N-gram

N-gram N-gram RMeCab 5 1

2.6.2 NgramDF()

NgramDF() Ngram() N-gram

> kekkadf <- NgramDF("yukiguni.txt", type = 1, N = 2, pos = "")
file = yukiguni.txt Ngram = 2
> kekkadf
  Ngram1 Ngram2 Freq
1                  1
2                  1
3                  1
4                  1
5                  1
6                  1
7                  1

bi-gram (Freq) 1 Ngram() [ - ] 1

2.6.3 docngram()

docngram() Ngram() 1 type N Ngram() data doc

> res <- docngram("doc")
> nrow(res)
[1] 16
> res
       Text
Ngram   doc1.txt doc2.txt doc3.txt
  [-]          0        0        1
  [-]          0        0        1
  [-]          0        1        0
#

9 Ngram() docngram() N-gram %in% 9

2.7 (collocation)

() (node) ()

2.7.1 collocate()

RMeCab collocate() 1 node () span span 3

> res <- collocate("kumo.txt", node = "", span = 3)
> nrow(res)
[1] 33
> res[25:33,]
           Term Span Total
25                10    10
26                 2     7
27                 4     4
28                 2    14
29                 1     4
30                 2     7
31                 1     3
32 [[MORPHEMS]]   31   413
33   [[TOKENS]]   70  1808
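As an illustration of what node and span mean, the sketch below counts the tokens that fall within span positions on either side of each occurrence of a node word. It is a base-R approximation, not RMeCab's code: window_counts() and the toy token vector are hypothetical, and counting the node itself inside each window is an assumption made to mirror the [[TOKENS]] Span value of 70 = 10 × (3 + 1 + 3) in the output above.

```r
# Count tokens inside a window of `span` positions around each node occurrence.
# Assumption: the node itself is counted as part of its own window.
window_counts <- function(tokens, node, span = 3) {
  hits <- which(tokens == node)
  if (length(hits) == 0) {
    return(data.frame(Term = character(0), Span = integer(0),
                      Total = integer(0), stringsAsFactors = FALSE))
  }
  # Positions covered by each window, clipped to the ends of the text
  idx <- unlist(lapply(hits, function(i) {
    seq(max(1, i - span), min(length(tokens), i + span))
  }))
  span_freq  <- table(tokens[idx])                 # frequency inside windows
  total_freq <- table(tokens)[names(span_freq)]    # frequency in whole text
  data.frame(Term  = names(span_freq),
             Span  = as.integer(span_freq),
             Total = as.integer(total_freq),
             stringsAsFactors = FALSE)
}

# Toy token sequence (hypothetical); "x" plays the role of the node
tokens <- c("a", "b", "x", "c", "d", "x", "e", "a")
res <- window_counts(tokens, node = "x", span = 1)
print(res)
```

Here Span is the frequency inside the windows and Total the frequency in the whole text, matching the two columns of the collocate() output.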

Span Total 2 [[MORPHEMS]] [[TOKENS]] Span 70 10 3 60 1808 413

collocate() T MI

T (Barnbrook, 1996, p.97): $T = (O - E) / \sqrt{O}$, where O is the observed co-occurrence frequency (Span) and E is the expected frequency. Church et al. (1991)

For a term that occurs 4 times in a text of 1808 tokens, and 4 times within the span, the expected frequency is (4/1808) × 3 × 2 × 10: the span of 3 on both sides (× 2) of each of the 10 occurrences of the node.

T T T 2 1.65 (Church et al., 1991)

MI: $MI = \log_2 (O / E)$. 4 4 1808 3 2 10 R

> log2( 4 / ((4/1808) * 10 * 3 * 2))

MI MI 1.58 (Barnbrook, 1996) MI T T MI

RMeCab T MI collocate() (res) 1 collocate() node span collscores()

> res2 <- collscores(res, node = "", span = 3)
> res2[25:33,]
           Term Span Total         T       MI
25                10    10        NA       NA
26                 2     7 1.2499520 3.105933
27                 4     4 1.9336283 4.913288
28                 2    14 1.0856905 2.105933
29                 1     4 0.8672566 2.913288
30                 2     7 1.2499520 3.105933
31                 1     3 0.9004425 3.328326
32 [[MORPHEMS]]   31   413        NA       NA
33   [[TOKENS]]   70  1808        NA       NA

NA [[MORPHEMS]] [[TOKENS]] NA T 1.9 2 MI 4.9 1.58
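The T and MI columns can be recomputed from the Span and Total counts with a few lines of base R. This is a sketch, not the collscores() source: the function name t_mi() is hypothetical, and the formulas (expected = Total / tokens × node frequency × span × 2, T = (observed − expected) / sqrt(observed), MI = log2(observed / expected)) are inferred from the worked example and agree with the values printed above.

```r
# Recompute T and MI from co-occurrence counts.
#   span_freq  : observed frequency of each term within the span (Span column)
#   total_freq : frequency of each term in the whole text (Total column)
#   node_freq  : frequency of the node word
#   tokens     : total number of tokens in the text
#   span       : window size on each side of the node
t_mi <- function(span_freq, total_freq, node_freq, tokens, span) {
  expected <- total_freq / tokens * node_freq * span * 2
  data.frame(
    Span  = span_freq,
    Total = total_freq,
    T     = (span_freq - expected) / sqrt(span_freq),
    MI    = log2(span_freq / expected)
  )
}

# Span and Total values of rows 26 to 31 in the output above
out <- t_mi(span_freq  = c(2, 4, 2, 1, 2, 1),
            total_freq = c(7, 4, 14, 4, 7, 3),
            node_freq  = 10, tokens = 1808, span = 3)
print(round(out, 6))
```

The second row reproduces the example in the text: 4 co-occurrences against an expected frequency of about 0.13.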

Barnbrook, Geoff (1996) Language and Computers: Edinburgh University Press.
Church, K. W., W. Gale, P. Hanks, and D. Hindle (1991) Using statistics in lexical analysis, in Using On-line Resources to Build a Lexicon: Lawrence Erlbaum, pp. 115-164.
(2007) R S-PLUS 1
(1999) 5 -
(2002)
(2006) R 3