RMeCab
2008-06-16

Contents
1 MeCab RMeCab
  1.1
  1.2
  1.3 MeCab
  1.4 RMeCab
2 RMeCab
  2.1 RMeCab
  2.2 MeCab
  2.3
  2.4 (term-document matrix)
  2.5
  2.6 N-gram
  2.7 (collocation)

1 MeCab RMeCab

1.1
RMeCab is a package for calling MeCab from R.

1.2
Contact: ishida-m@ias.tokushima-u.ac.jp
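Morphological analysis is carried out with MeCab *1. Juman and ChaSen are alternative analyzers; *2 compares MeCab with ChaSen.

1.3 MeCab

MeCab is installed differently on each OS. Windows users follow [Binary package for MS-Windows] on the MeCab site, which leads to [mecab-win32] on sourceforge.net. Mac OS X and Linux users download the source archive [mecab] and the dictionary [mecab-ipadic].

Windows
For MeCab 0.97, run mecab-0.97.exe and confirm each dialog with [OK], choosing Shift JIS as the character set. MeCab is installed under Program Files on drive C.

Mac OS X and Unix
Save the archives in the Downloads folder and build them from the Terminal. On Mac OS X the development tools on the OS X DVD are needed.

*1 http://mecab.sourceforge.net/
*2 http://mecab.sourceforge.net/feature.html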
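1-1 Installing MeCab (** stands for the version number)

$ cd Downloads
$ tar zxvf mecab-0.**.tar.gz
$ cd mecab-0.**
$ ./configure --with-charset=utf-8
$ make
$ sudo make install
# dictionary
$ tar zxf mecab-ipadic-2.7.0-20070****.tar.gz
$ cd mecab-ipadic-2.7.0-20070****
$ ./configure --with-charset=utf-8
$ make
$ sudo make install

On Windows, MeCab is run from the command prompt: type the command and press [Enter]. The example below reads test.txt in the work folder on drive C and writes the result to res.txt.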
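C:\Program Files\MeCab\bin> mecab c:\work\test.txt > c:\work\res.txt

The analysis of test.txt is saved in res.txt (1-1).

1-1 MeCab output
[MeCab output: one morpheme per line followed by comma-separated features, with * for empty fields, terminated by EOS]

Each line of MeCab output gives a morpheme followed by its features separated by commas: part of speech, subcategory 1, subcategory 2, subcategory 3, and so on. EOS stands for end of sentence. Each occurrence of a morpheme is a token, and each distinct morpheme is a type.

1-2 MeCab output
[MeCab output for a second sentence, in the same format, terminated by EOS]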
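MeCab writes its result in CSV form. RMeCab passes text from R to MeCab and returns the result to R, so that the output can be processed in R directly.

1.4 RMeCab

1.4.1 Installing RMeCab

R and MeCab must both be installed before RMeCab. RMeCab is distributed at *1, with one file per OS. For RMeCab 0.50 the files are RMeCab_0.50.zip, RMeCab_0.50.tgz and RMeCab_0.50.tar.gz: the .zip file is for Windows, the .tgz file for Mac OS X and the .tar.gz file for Unix. Windows users also need RMeCabInstall.txt.

Windows
Start R and choose [Packages] - [Install package(s) from local zip files...] from the menu, then select RMeCab_***.zip (1-2), where *** is the version number. Then rename RMeCabInstall.txt to RMeCabInstall.bat *3.

*1 http://groups.google.co.jp/group/rmecab/
*3 RMeCabInstall.bat copies libmecab.dll from the MeCab bin folder into the libs folder of RMeCab in the R library.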
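Run RMeCabInstall.bat (1-3).

1-2 Installing RMeCab
1-3 RMeCabInstall.bat

Mac OS X
From the R menu choose [Packages & Data] - [Package Installer], change [CRAN] to the local package option, press [install] and select RMeCab_***.tgz, where *** is the version number.

Linux
Place the archive in the working directory of R, which getwd() reports, and run the following in R (** is the version number):

> install.packages("rmecab_0.**.tar.gz", destdir=".", repos = NULL)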
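2 RMeCab

RMeCab must be loaded into R before it is used. On Windows, choose [Packages] - [Load package...] from the R menu and select RMeCab (2-1). Alternatively, type library(rmecab) in the R console and press [Enter].

2-1 Loading RMeCab

The functions of RMeCab are listed in 2-1. The sample data used in the examples are distributed at *2: data.zip for Windows, data.tar.gz for Mac OS X and Unix. On Windows, extract the zip file to obtain the data folder.

*2 http://groups.google.co.jp/group/rmecab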
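2-1 RMeCab functions

Function     Purpose
RMeCabC      analyses a character string
RMeCabText   analyses a text file
RMeCabDF     analyses a column of a data frame
RMeCabFreq   counts morpheme frequencies in a file
docmatrix    builds a term-document matrix
collocate    counts the collocates of a node word
collscores   computes collocation scores
Ngram        counts N-grams for a given N
NgramDF      counts N-grams for a given N, one element per column
docngram     builds an N-gram by document matrix
rmsign

The examples assume that the data folder is on drive C (C:) and is the working directory of R, which can be checked with getwd().

2.1 RMeCab

RMeCabText() and RMeCabFreq() pass a file to MeCab, whereas RMeCabC() passes a character string.

2.1.1 RMeCabC()

RMeCabC() sends a string to MeCab and returns the result as a list.

> res <- RMeCabC("")
> res
[[1]]

""

[[2]]

""

[[3]]

""

[[4]]

""

#...
> res[[1]] # first element

""
> unlist(res)
...
"" "" "" "" ...
> x <- "" # store the string in an object
> res <- RMeCabC(x)
> unlist(res)

R prints each element of a list under [[ ]], and res[[1]] extracts the first element. unlist() converts the list into a vector. The string may also be stored in an object (here x) and passed to RMeCabC().

RMeCabC() takes a second argument. With 1, each morpheme is returned in its base (dictionary) form; with 0, the default, it is returned in its surface form.

> res <- RMeCabC("", 1)
> unlist(res) # base forms

"" "" "" ""
> res <- RMeCabC("", 0)
> unlist(res) # surface forms

"" "" "" ""

The second argument may be omitted when it is 0.

Each element of the result is named with its part of speech, so morphemes of one part of speech can be selected by name.

> res <- RMeCabC("")
> res2 <- unlist(res)
> res2

"" "" "" "" "" "" ""
> res2[names(res2) == ""]

"" "" "" ""
> names(res2) == "" # logical vector
[1] TRUE FALSE TRUE FALSE TRUE FALSE TRUE

The output of RMeCabC() is flattened into res2. Comparing names(res2) with a part-of-speech label using == gives TRUE where the label matches and FALSE elsewhere. Placing this logical vector inside [ ] extracts the elements of res2 that are TRUE. which() returns the positions of the TRUE values, and any() reports whether at least one value is TRUE.

> res3 <- names(res2) == ""
> res3
[1] TRUE FALSE TRUE FALSE TRUE FALSE TRUE
> which(res3)
[1] 1 3 5 7
> any(res3)
[1] TRUE

2.1.2 RMeCabText()

RMeCabText() analyses a file and returns a list in which each morpheme is a vector of up to 10 elements, following the fields of the MeCab output. The example uses yukiguni.txt in the data folder distributed with RMeCab.

> res <- RMeCabText("yukiguni.txt")
> res
[[1]]
 [1] "" "" "" "*" "*"
 [6] "*" "*" "" "" ""

[[2]]
 [1] "" "" "" "" "*" "*" "*" ""
 [9] "" ""

[[3]]
[1] "" "" "" "*"
[5] "*" "" "" ""

#...

2.1.3 RMeCabFreq()

RMeCabFreq() counts the frequency of each morpheme in a file. Results can differ between Windows and Linux / Mac OS X; the output below is from Windows.

> res <- RMeCabFreq("yukiguni.txt")
length = 13
> res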
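  Term Info1 Info2 Freq
1                     3
2                     1
3                     1
#...

The result res is a data frame with the columns Term (the morpheme), Info1 (part of speech), Info2 (part-of-speech subcategory) and Freq (frequency). The morpheme in row 1 occurs 3 times, the others once.

A longer text, kumo.txt in the data folder, is analysed below, with the elapsed time measured by proc.time().

> pt1 <- proc.time() # start time
> res <- RMeCabFreq("kumo.txt")
length = 447
> pt2 <- proc.time()
> # elapsed time
> pt2 - pt1
   user  system elapsed
  0.008   0.008   1.703

MeCab does not give identical results on Windows and on Mac OS X or Linux. RMeCabFreq() reports length = 447, the number of types, on Windows, but 446 on Linux and Mac OS X. The difference arises because MeCab segments some strings differently depending on the OS, as the two outputs below show.

Mac OS X / Linux
[MeCab output, one morpheme per line with comma-separated features]

Windows
[MeCab output for the same string, segmented into more morphemes]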
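Results on Windows and on Mac OS X / Linux can therefore differ, depending on the MeCab build and dictionary installed on each OS *1.

2.2 MeCab

This section adds a word to the MeCab dictionary on Windows; the procedure for Mac OS X and Linux is described on the MeCab site. Without a user dictionary, MeCab splits the example word into several morphemes:

C:\Program Files\MeCab\bin> mecab
[MeCab output, terminated by EOS]

To register the word, prepare a CSV file containing a line of the following form:

,-1,-1,1000,,,,,*,*,,,

The first field is the surface form. It is followed by the left context ID, the right context ID and the cost, and then by the part-of-speech fields (part of speech, subcategories 1, 2 and 3, and so on). The file is saved as motohiro.csv in the data folder on drive C ("C:\data"). When both IDs are set to -1, MeCab assigns them automatically.

On Windows, open the command prompt ([Start]-[All Programs]-[Accessories]-[Command Prompt]) and use cd to move to the MeCab bin folder.

*1 http://mecab.sourceforge.net/dic.html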
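mecab-dict-index.exe in the MeCab folder under C:\Program Files compiles motohiro.csv into a MeCab dictionary. In the command below, the backslash (\) at the end of a line only marks a continuation for display; type the whole command on one line. The dictionary path is quoted because it contains a space.

C:\data> cd C:\Program Files\MeCab\bin
C:\Program Files\MeCab\bin> mecab-dict-index.exe \
    -d "c:\Program Files\MeCab\dic\ipadic" \
    -u ishida.dic -f shift-jis -t shift-jis \
    c:\data\motohiro.csv
reading c:\data\mecabdic.csv ... 1
emitting double-array: 100% |###########################################|
done!

mecab-dict-index.exe creates ishida.dic, which is then placed in C:\data. To make MeCab use it, open the file C:\Program Files\MeCab\dict\dicrc in a text editor (on Windows, Notepad: [Start]-[All Programs]-[Accessories]-[Notepad]) and add the following line:

userdic = C:\data\ishida.dic

MeCab now returns the registered word as a single morpheme:

C:\Program Files\MeCab\bin> mecab
[MeCab output, terminated by EOS]

See *1 for details.

*1 http://mecab.sourceforge.net/dic.html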
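2.3 R

2.3.1 RMeCabDF()

RMeCabDF() analyses one column of a data frame such as the one in 2-2, whose first column is an ID, second column the sex of the respondent and third column a free-text reply. The first argument of RMeCabDF() is the data frame and the second is the number of the column to analyse; an optional third argument of 1 returns the morphemes in base form. The file photo.csv in the data folder has this layout.

2-2 CSV file
ID, Sex, Reply
1, F,
2, M,
3, F,
4, F,
5, M,

> # read the file
> dat <- read.csv("photo.csv")
> res <- RMeCabDF(dat, 3)    # surface forms (default)
> res <- RMeCabDF(dat, 3, 1) # base forms

RMeCabDF() returns a list, here res, whose length, length(res), equals the number of rows. Elements are accessed with [[ ]]: res[[1]] holds the analysis of the first reply, which consists of 5 morphemes.

> res[[1]]

"" "" "" "" ""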
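The list returned by RMeCabDF() can be combined with the other columns of the data frame. The following is a minimal sketch in base R that does not call RMeCab: the object res is a hand-made stand-in with the structure described above (one named character vector per row, names being part-of-speech labels), and the English tokens and labels in it are invented placeholders, not real MeCab output.

```r
# Stand-in for the first three rows of the data frame (ID and Sex only)
dat <- data.frame(ID = 1:3, Sex = c("F", "M", "F"))

# Stand-in for RMeCabDF(dat, 3): one named character vector per row.
# Tokens and part-of-speech labels are invented for illustration.
res <- list(
  c(noun = "photo", particle = "of", noun = "family"),
  c(noun = "trip", verb = "take"),
  c(noun = "photo", verb = "see", noun = "friend")
)

# Number of morphemes in each reply
dat$n_tokens <- sapply(res, length)

# Keep only the nouns of each reply, selecting by name as in section 2.1.1
nouns <- lapply(res, function(x) unname(x[names(x) == "noun"]))

# Cross-tabulate the nouns against the Sex column:
# each respondent's Sex is repeated once per noun in the reply
noun_by_sex <- table(
  Sex  = rep(dat$Sex, sapply(nouns, length)),
  Noun = unlist(nouns)
)

print(dat)
print(noun_by_sex)
```

With real data, res would come from RMeCabDF() and the label "noun" would be replaced by the part-of-speech label that MeCab actually returns.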
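2.4 (term-document matrix)

A term-document matrix has one row per term and one column per document; each cell is the frequency of the term in the document.

Term   doc1 doc2 doc3
          1    1    1
          1    1    0
          1    0    0
          0    1    1
          0    0    1
          0    0    1
          0    0    1

doc1, doc2 and doc3 are the following three texts.

doc1:
doc2:
doc3:

2.4.1 docmatrix()

docmatrix() takes as its first argument the name of a folder containing the documents. The doc folder in the data folder contains doc1.txt, doc2.txt and doc3.txt.

> res <- docmatrix("doc", pos = c("","",""))
> res
                  docs
terms              doc1.txt doc2.txt doc3.txt
  [[LESS-THAN-1]]         0        0        0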
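  [[TOTAL-TOKENS]]        4        4        8
                          1        1        1
                          1        1        0
                          1        0        0
                          0        1        1
                          0        0        1
                          0        0        1
                          0        0        1

The first argument of docmatrix() is the folder and the second, pos, gives the parts of speech to extract. docmatrix() also accepts minfreq, the minimum frequency for a term to be counted; the default is 1. With minfreq = 2, a term that occurs fewer than 2 times in a document is recorded as 0 for that document, and its tokens are added to the row [[LESS-THAN-2]].

The first two rows of the output are summary rows. [[LESS-THAN-1]] counts the tokens below minfreq, which is always 0 when minfreq is 1. [[TOTAL-TOKENS]] is the total number of tokens in each document, whatever is given in pos.

With pos restricted to two parts of speech:

> res <- docmatrix("doc", pos = c("",""))
> res
                  docs
terms              doc1.txt doc2.txt doc3.txt
  [[LESS-THAN-1]]         0        0        0
  [[TOTAL-TOKENS]]        4        4        8
                          1        1        0
                          1        0        0
                          0        1        1
                          0        0        1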
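[[TOTAL-TOKENS]] is unchanged by the choice of pos. Next, minfreq is raised from 1 to 2.

> res <- docmatrix("doc", pos = c("",""), minfreq = 2)
> res
                  docs
terms              doc1.txt doc2.txt doc3.txt
  [[LESS-THAN-2]]         2        2        2
  [[TOTAL-TOKENS]]        4        4        8

No term occurs 2 or more times in any document, so only the summary rows remain; [[LESS-THAN-2]] shows that each document contains 2 tokens of terms that occur only once.

The same call on the morikita folder gives:

> res <- docmatrix("morikita", pos = c("",""), minfreq = 2)
> nrow(res)
[1] 11
> res
                  docs
terms              morikita1.txt morikita2.txt morikita3.txt
  [[LESS-THAN-2]]             18            19            21
  [[TOTAL-TOKENS]]            42            60            77
                               2             0             0
                               2             0             0
                               0             5             2
                               0             2             0
                               0             2             0
                               0             0             2
                               0             0             2
                               0             0             2
                               0             0             2

The three documents contain 42, 60 and 77 tokens in total, of which 18, 19 and 21 belong to terms occurring fewer than 2 times. A term shown with 2 for morikita1.txt and 0 for morikita3.txt may still occur in morikita3.txt: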
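a single occurrence is not listed under the term but is counted in [[LESS-THAN-2]].

2.5

Raw frequencies are often weighted before documents are compared. A weight has three components: the local weight, the global weight and normalization. The usual local weight is the term frequency (TF) and the usual global weight is the IDF (inverse document frequency). See (2002) and (1999) for details.

2.5.1 docmatrix()

docmatrix() accepts the following weights.

Local weight: tf (term frequency), tf2 (logarithmic TF), tf3 (binary weight)
Global weight: idf, idf2 (global frequency IDF), idf3 (probabilistic IDF), idf4 (entropy)
Normalization: norm

Weights are joined with * in the weight argument. tf and idf are combined as follows.

> res <- docmatrix("doc", pos = c("","",""), weight = "tf*idf")
> res
                  docs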
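terms   doc1.txt doc2.txt doc3.txt
        1.000000 1.000000 1.000000
        1.584963 1.584963 0.000000
        2.584963 0.000000 0.000000
        0.000000 2.584963 0.000000
        0.000000 1.584963 1.584963
        0.000000 1.584963 1.584963
        0.000000 0.000000 2.584963
        0.000000 0.000000 2.584963

Every term in doc1.txt occurs once, so its tf is 1 and the weight shown equals the idf. The idf is defined as

\[ \mathrm{idf} = \log_2 \frac{N}{n_i} + 1 \]

where N is the number of documents and n_i is the number of documents that contain the term w_i. The logarithm has base 2, so the idf of a term found in all three documents is log2(3/3) + 1 = 1, that of a term found in two documents is log2(3/2) + 1 = 1.584963, and that of a term found in one document is log2(3/1) + 1 = 2.584963.

To normalize the tf-idf weights, append *norm to weight.

> res <- docmatrix("doc", pos = c("","",""), weight = "tf*idf*norm")
> res
        docs
terms    doc1.txt  doc2.txt  doc3.txt
        0.3132022 0.2563399 0.2271069
        0.4964137 0.4062891 0.0000000
        0.8096159 0.0000000 0.0000000
        0.0000000 0.6626290 0.0000000
        0.0000000 0.4062891 0.3599560
        0.0000000 0.4062891 0.3599560
        0.0000000 0.0000000 0.5870629
        0.0000000 0.0000000 0.5870629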
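Both weighted tables above can be reproduced by hand. The sketch below uses base R only and does not call docmatrix(); the 0/1 term frequencies are read off from the nonzero pattern of the tf*idf table, and the row names term1 to term8 are placeholders. The normalization step assumes that norm scales each document column to unit Euclidean length, which agrees with the printed values.

```r
# Term frequencies implied by the tf*idf table (rows = terms, columns = documents).
# Row names are placeholders; only the counts matter here.
tf <- matrix(
  c(1, 1, 1,
    1, 1, 0,
    1, 0, 0,
    0, 1, 0,
    0, 1, 1,
    0, 1, 1,
    0, 0, 1,
    0, 0, 1),
  ncol = 3, byrow = TRUE,
  dimnames = list(paste0("term", 1:8),
                  c("doc1.txt", "doc2.txt", "doc3.txt"))
)

N   <- ncol(tf)            # number of documents
n_i <- rowSums(tf > 0)     # number of documents containing each term
idf <- log2(N / n_i) + 1   # idf as defined in the text

# tf * idf: each row i of tf is multiplied by idf[i]
tfidf <- tf * idf

# norm: divide every column by its Euclidean length
normed <- sweep(tfidf, 2, sqrt(colSums(tfidf^2)), "/")

print(round(tfidf, 6))
print(round(normed, 7))
```

After normalization every column satisfies sum(x^2) = 1, so documents of different length become comparable.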
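With norm, docmatrix() divides each tf-idf weight by the square root of the sum of the squared weights of the same document, so that every document vector has length 1. For doc1.txt the divisor is

\[ \sqrt{1^2 + 1.584963^2 + 2.584963^2} = 3.192827 \]

and each tf*idf value of doc1.txt is divided by it; the first weight, 1, becomes 1 / 3.192827 = 0.3132022.

2.6 N-gram

An N-gram is a sequence of N adjacent units of a text. With N = 2 the text is cut into overlapping pairs, written [ - ] [ - ] [ - ]. An N-gram with N = 2 is called a bi-gram. The unit may be the character (2-3) or the morpheme (2-4); bi-grams can also be formed from parts of speech (2-5).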
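The idea of cutting a sequence into overlapping N-grams can be illustrated without MeCab. The following base R sketch defines a hypothetical helper, make_ngrams(), which is not part of RMeCab, and applies it to characters and to an invented, already tokenized English sentence. The bracketed [a-b] notation imitates the output format of Ngram().

```r
# Hypothetical helper (not part of RMeCab): build N-grams from a vector of
# units and format them as "[a-b]" or "[a-b-c]"
make_ngrams <- function(units, N = 2) {
  if (length(units) < N) return(character(0))
  starts <- seq_len(length(units) - N + 1)
  vapply(
    starts,
    function(i) paste0("[", paste(units[i:(i + N - 1)], collapse = "-"), "]"),
    character(1)
  )
}

# Character units: split a string into single characters
chars <- strsplit("abcab", "")[[1]]
char_bigrams <- make_ngrams(chars, N = 2)

# Word units: an invented sentence that is already tokenized
words <- c("the", "cat", "saw", "the", "cat")
word_bigrams  <- make_ngrams(words, N = 2)
word_trigrams <- make_ngrams(words, N = 3)

# Frequency table with the columns Ngram and Freq
freq <- as.data.frame(table(Ngram = word_bigrams),
                      responseName = "Freq", stringsAsFactors = FALSE)

print(char_bigrams)
print(word_trigrams)
print(freq)
```

A sequence of k units yields k - N + 1 N-grams, which is why the number of distinct N-grams changes with both the unit and N.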
2-5 Part-of-speech bi-grams

2.6.1 Ngram()

Ngram() counts the N-grams in a file. By default the unit is the character and N is 2, so character bi-grams are returned as an R data frame.

> res <- Ngram("yukiguni.txt")
file = yukiguni.txt Ngram = 2
length = 38
> nrow(res)
[1] 38
> res # character bi-grams
   Ngram Freq
1    [-]    1
2    [-]    1
3    [-]    1
4    [-]    1
5    [-]    1
6    [-]    1
#...
34   [-]    1
35   [-]    1
36   [-]    1
37   [-]    1
38   [-]    1

With type = 1 the unit is the morpheme, giving morpheme bi-grams.

> res <- Ngram("yukiguni.txt", type = 1, N = 2)
file = yukiguni.txt Ngram = 2
length = 25
> nrow(res)
[1] 25
> res
   Ngram Freq
1    [-]    1
2    [-]    1
3    [-]    1
4    [-]    1
5    [-]    1
#..
20   [-]    1
21   [-]    1
22   [-]    1
23   [-]    1
24   [-]    1
25   [-]    1

With type = 2 the unit is the part of speech. The first call below gives part-of-speech bi-grams and the second tri-grams; a tri-gram is an N-gram with N = 3, also written 3-gram.

> # bi-gram
> res <- Ngram("yukiguni.txt", type = 2, N = 2)
file = yukiguni.txt Ngram = 2
length = 13
> nrow(res)
[1] 13
> res
   Ngram Freq
1    [-]    2
2    [-]    3
3    [-]    2
4    [-]    3
5    [-]    2
6    [-]    2
7    [-]    1
8    [-]    1
9    [-]    6
10   [-]    1
11   [-]    1
12   [-]    1
13   [-]    2
>
> # tri-gram
> res <- Ngram("yukiguni.txt", type = 2, N = 3)
file = yukiguni.txt Ngram = 3
length = 20
> nrow(res)
[1] 20
> res
   Ngram Freq
1   [--]    1
2   [--]    1
3   [--]    2
4   [--]    1
5   [--]    1
#...
16  [--]    1
17  [--]    1
18  [--]    1
19  [--]    1
20  [--]    1

When type is 1, Ngram() can restrict the morpheme N-grams to given parts of speech through pos.

> res <- Ngram("yukiguni.txt", type = 1, N = 2, pos = "")
file = Ngram = 2
length = 7
> res
  Ngram Freq
1   [-]    1
2   [-]    1
3   [-]    1
4   [-]    1
5   [-]    1
6   [-]    1
7   [-]    1

With pos = "", only morphemes of the named part of speech are used to form the N-grams.
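2.6.2 NgramDF()

NgramDF() does the same job as Ngram() but returns each element of the N-gram in a column of its own.

> kekkadf <- NgramDF("yukiguni.txt", type = 1, N = 2, pos = "")
file = yukiguni.txt Ngram = 2
> kekkadf
  Ngram1 Ngram2 Freq
1                  1
2                  1
3                  1
4                  1
5                  1
6                  1
7                  1

Each row holds the two elements of a bi-gram and its frequency (Freq), whereas Ngram() joins the elements into a single column in the form [-].

2.6.3 docngram()

docngram() applies Ngram() to every file in the folder named by its first argument. The arguments type and N are used as in Ngram(). The example uses the doc folder in the data folder.

> res <- docngram("doc")
> nrow(res)
[1] 16
> res
       Text
Ngram   doc1.txt doc2.txt doc3.txt
  [-]          0        0        1
  [-]          0        0        1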
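  [-]          0        1        0
# ...

Particular N-grams can be selected from the output of Ngram() or docngram() with %in%.

2.7 (collocation)

A collocation is a combination of words that tend to occur close together. The word of interest is called the node, and the words around it are examined.

2.7.1 collocate()

collocate() in RMeCab takes a file name as its first argument, the node word as node and the window size as span. With span = 3, the three words before and the three words after each occurrence of the node are counted.

> res <- collocate("kumo.txt", node = "", span = 3)
> nrow(res)
[1] 33
> res[25:33,]
            Term Span Total
25                 10    10
26                  2     7
27                  4     4
28                  2    14
29                  1     4
30                  2     7
31                  1     3
32 [[MORPHEMS]]    31   413
33   [[TOKENS]]    70  1808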
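Span is the frequency of the term within the span around the node, and Total is its frequency in the whole text. The last 2 rows are summaries: [[MORPHEMS]] gives the number of types and [[TOKENS]] the number of tokens. The Span value of [[TOKENS]] is 70 because the node occurs 10 times with 3 words on each side, giving 10 × 3 × 2 = 60 words, to which the 10 occurrences of the node itself are added. The whole text contains 1808 tokens and 413 types.

The output of collocate() is the basis for two measures of collocation, the T score and the MI score.

The T score (Barnbrook, 1996, p.97) is (observed value - expected value) divided by the square root of the observed value; see also Church et al. (1991). Take the term in row 27, which occurs 4 times in the text of 1808 tokens, all 4 times within the span. Its relative frequency in the text is 4/1808. The node occurs 10 times with 3 words on each of its 2 sides, so the expected frequency of the term within the span is (4/1808 × 3 × 2 × 10), and

\[ T = \frac{4 - (4/1808) \times 3 \times 2 \times 10}{\sqrt{4}} \]

A T score of 2 or more, or of 1.65 or more (Church et al., 1991), is taken to indicate a collocation.

The MI score is the base-2 logarithm of the observed value divided by the expected value. For the same term, with 4 occurrences in the span, 4 in the text, 1808 tokens, span 3 on 2 sides and 10 occurrences of the node, it is computed in R as follows.

> log2( 4 / ((4/1808) * 10 * 3 * 2))

An MI score of 1.58 or more is taken to indicate a collocation (Barnbrook, 1996). MI and T emphasize different aspects of the data, so the two scores are usually read together.

In RMeCab, T and MI are computed by collscores(). Its first argument is the output of collocate() (here res), and node and span are given as in the call to collocate().

> res2 <- collscores(res, node = "", span = 3)
> res2[25:33,]
            Term Span Total         T       MI
25                 10    10        NA       NA
26                  2     7 1.2499520 3.105933
27                  4     4 1.9336283 4.913288
28                  2    14 1.0856905 2.105933
29                  1     4 0.8672566 2.913288
30                  2     7 1.2499520 3.105933
31                  1     3 0.9004425 3.328326
32 [[MORPHEMS]]    31   413        NA       NA
33   [[TOKENS]]    70  1808        NA       NA

NA is returned for the node itself (row 25) and for the summary rows [[MORPHEMS]] and [[TOKENS]]. The term in row 27 has a T score of about 1.9, just below 2, while its MI score of about 4.9 is well above 1.58.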
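The T and MI values reported by collscores() in section 2.7.1 can be checked by hand. The sketch below is base R only and does not call RMeCab; collocation_scores() is a hypothetical helper written for this illustration, using the formulas described in that section together with the node frequency (10), span (3) and token count (1808) of the kumo.txt example.

```r
node_freq <- 10    # occurrences of the node word in the text
span      <- 3     # words counted on each side of the node
total     <- 1808  # number of tokens in the text

# Hypothetical helper (not part of RMeCab).
# observed:   frequency of the term within the span (column Span)
# word_total: frequency of the term in the whole text (column Total)
collocation_scores <- function(observed, word_total) {
  expected <- (word_total / total) * node_freq * span * 2
  c(expected = expected,
    T  = (observed - expected) / sqrt(observed),
    MI = log2(observed / expected))
}

s26 <- collocation_scores(observed = 2, word_total = 7)  # row 26
s27 <- collocation_scores(observed = 4, word_total = 4)  # row 27

print(round(s26, 6))
print(round(s27, 6))
```

Row 27 gives T = 1.9336283 and MI = 4.913288, and row 26 gives T = 1.2499520 and MI = 3.105933, matching the collscores() output.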
References

Barnbrook, Geoff (1996) Language and Computers: Edinburgh.
Church, K. W., W. Gale, P. Hanks, and D. Hindle (1991) Using statistics in lexical analysis, in Using On-line Resources to Build a Lexicon: Lawrence Erlbaum, pp. 115-164.
(2007) R S-PLUS 1
(1999) 5 -
(2002)
(2006) R 3