RMeCab
2008-11-08

Contents

1 MeCab and RMeCab
  1.1 Overview
  1.2 Contact
  1.3 Installing MeCab
  1.4 Installing RMeCab
2 Using RMeCab
  2.1 Basic analysis functions
  2.2 Adding words to the MeCab dictionary
  2.3 Analyzing data frames
  2.4 Term-document matrices
  2.5 docMatrix2()
  2.6 docMatrixDF()
  2.7 Term weighting
  2.8 N-grams
  2.9 Collocations

1 MeCab and RMeCab

1.1 Overview

MeCab is a morphological analyzer for Japanese; RMeCab is an R package that makes MeCab's analysis available from within R. This document describes how to install both and how to use the functions that RMeCab provides.

1.2 Contact

Questions and comments: ishida-m@ias.tokushima-u.ac.jp

Several morphological analyzers are available for Japanese, among them MeCab*1, Juman, and ChaSen; a comparison of MeCab with ChaSen is given at *2.

1.3 Installing MeCab

How MeCab is installed depends on the operating system. For Windows, download the binary package for MS-Windows [mecab-win32] from sourceforge.net. For Mac OS X and Linux, download the source package [mecab] together with the dictionary package [mecab-ipadic].

On Windows, run the installer (mecab-0.97.exe at the time of writing) and accept the defaults with [OK]; the dictionary character code is Shift JIS and MeCab is installed under C:\Program Files\MeCab.

On Mac OS X and Unix, save the downloaded archives to the Downloads folder and build them from the Terminal as shown below (on Mac OS X the developer tools from the installation DVD must be installed first).

*1 http://mecab.sourceforge.net/
*2 http://mecab.sourceforge.net/feature.html

Building and installing MeCab and the IPA dictionary on Mac OS X / Unix (** stands for the version numbers of the downloaded archives):

# build and install MeCab itself
$ cd Downloads
$ tar zxvf mecab-0.**.tar.gz
$ cd mecab-0.**
$ ./configure --with-charset=utf-8
$ make
$ sudo make install

# build and install the IPA dictionary
$ tar zxf mecab-ipadic-2.7.0-20070****.tar.gz
$ cd mecab-ipadic-2.7.0-20070****
$ ./configure --with-charset=utf-8
$ make
$ sudo make install
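To confirm that the installation succeeded, MeCab can be called directly; if R is already installed, the same check can be made from the R console with system(). This is only a minimal sketch, assuming that mecab is on the search path (the sample sentence is arbitrary):

> system("mecab --version")                            # print the installed MeCab version
> system('echo "すもももももももものうち" | mecab')    # analyze one sentence (Mac OS X / Linux)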

On Windows, MeCab can be tried out interactively by starting it from the Start menu: type a sentence and press [Enter] to see its analysis. To analyze a whole file, first create a working folder such as C:\work and save the text there as test.txt. The result can then be written to a second file, res.txt, by running MeCab from its bin folder:

C:\Program Files\MeCab\bin> mecab c:\work\test.txt > c:\work\res.txt

Opening res.txt shows one morpheme per line: each line gives the surface form followed by its features, separated by commas, namely the part of speech, three finer part-of-speech classes, the conjugation type, the conjugation form, the base form, the reading, and the pronunciation; fields that do not apply contain *. The line EOS (end of sentence) marks the end of each sentence. Note the distinction between tokens (running words, counted every time they occur) and types (distinct words, counted once): MeCab outputs one line per token. Analyzing a second sample sentence produces output of exactly the same form.

Since MeCab's output is plain CSV, it could in principle be read into R by hand, but the RMeCab package calls MeCab from R directly.

1.4 Installing RMeCab

RMeCab is an R package, so both R and MeCab (see section 1.3) must already be installed.

1.4.1 Downloading and installing RMeCab

The package is distributed from the RMeCab site*1. Choose the file that matches your operating system; at the time of writing the current version was 0.59: RMeCab_0.59.zip for Windows, RMeCab_0.59.tgz for Mac OS X, and RMeCab_0.59.tar.gz for Unix.

For Windows, the archive also contains RMeCabInstall.txt, which describes the installation in detail*2. The package is installed into R's library folder, typically C:/PROGRA~1/R/R-2.*.*/library (start R and run getwd() to check where R is working). In the R menu choose [Packages] - [Install package(s) from local zip files] and select the downloaded zip file.

*1 http://groups.google.co.jp/group/rmecab/
*2 As RMeCabInstall.txt explains, R needs access to MeCab's library: RMeCabInstall.bat copies libmecab.dll from MeCab's bin folder into the libs folder of the installed RMeCab package under R's library directory.

Extract RMeCab_***.zip (*** is the version number); the archive contains, besides the package itself, RMeCabInstall.txt and RMeCabInstall.bat. Run RMeCabInstall.bat to finish the installation; separate batch files are provided for the different Windows versions, RMeCabInstallXP.bat for Windows XP and RMeCabInstallVista.bat for Vista.

On Mac OS X, open R's package installer from the menu and install the downloaded archive as a local package:

select RMeCab_***.tgz (*** is the version number) and press [Install]. On Linux, install the source package from within R, running the command in the directory that holds the downloaded file (** is the version number):

> install.packages("RMeCab_0.**.tar.gz", destdir = ".", repos = NULL)

2 Using RMeCab

Before RMeCab's functions can be used, the package has to be loaded. On Windows this can be done from the menu [Packages] - [Load package...], selecting RMeCab (Figure 2-1); on Mac OS X use the corresponding package manager entry. Alternatively, simply type the following and press [Enter]:

> library(RMeCab)

Table 2-1 summarizes the functions that RMeCab provides.

Table 2-1: the main functions of RMeCab

  RMeCabC      analyze a character string given directly in R
  RMeCabText   analyze a text file and return the full MeCab output
  RMeCabDF     analyze a column of a data frame
  RMeCabFreq   term and part-of-speech frequencies for a text file
  docMatrix, docMatrix2, docMatrixDF
               build term-document matrices
  collocate    collocates around a node word
  collScores   T score and MI score for collocate() output
  Ngram        N-grams of characters, morphemes, or parts of speech
  NgramDF      N-grams returned as a data frame, one element per column
  NgramDF2     N-grams as a data frame for several files at once
  docNgram     N-grams per document
  docNgram2    N-grams per document, with control over part of speech and frequency

The examples below use a small collection of sample files*1: download data2.zip (for Windows) or data2.tar.gz (for Mac OS X / Unix), extract the data2 folder, for example directly under C:\ on Windows, and set R's working directory to that folder (the current working directory can be checked with getwd()).

2.1 Basic analysis functions

This section introduces the functions that return MeCab's analysis more or less directly, from RMeCabC() to RMeCabText() and RMeCabFreq().

*1 http://groups.google.co.jp/group/rmecab

2.1.1 RMeCabC()

RMeCabC() hands a character string to MeCab and returns the analysis to R. (On Windows, the current line of a script can be sent to R with [Ctrl] + [R].)

> res <- RMeCabC("すもももももももものうち")
> res
[[1]]
  名詞
"すもも"

[[2]]
 助詞
 "も"
#  ... one list element per morpheme

The result is an R list with one element per morpheme; each element is a character vector of length one whose value is the morpheme and whose name is its part of speech. Individual morphemes are reached with res[[1]], res[[2]], and so on, and unlist() flattens the whole result into a named vector:

> unlist(res)
   名詞    助詞    名詞    助詞    名詞    助詞    名詞
"すもも"    "も"  "もも"    "も"  "もも"    "の"  "うち"

The string can of course also be passed through a variable:

> x <- "すもももももももものうち"
> res <- RMeCabC(x)
> unlist(res)

RMeCabC() takes a second argument that controls the form in which morphemes are returned: with 1, conjugated words are replaced by their base (dictionary) form, while with 0, the default, the surface form found in the text is kept. For a sentence containing the conjugated verb 食べた, for instance, the two calls differ only in that word:

> res <- RMeCabC("ご飯を食べた", 1)
> unlist(res)      # base forms: the verb appears as 食べる
> res <- RMeCabC("ご飯を食べた", 0)
> unlist(res)      # surface forms, exactly as in the text

Because the names of the flattened vector are the part-of-speech tags, morphemes of a particular part of speech can be pulled out by comparing the names. To extract the nouns (名詞):

> res <- RMeCabC("すもももももももものうち")
> res2 <- unlist(res)
> res2
   名詞    助詞    名詞    助詞    名詞    助詞    名詞
"すもも"    "も"  "もも"    "も"  "もも"    "の"  "うち"
> res2[names(res2) == "名詞"]
   名詞    名詞    名詞    名詞
"すもも"  "もも"  "もも"  "うち"
> names(res2) == "名詞"      # the logical vector used for the selection
[1]  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE
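The extracted morphemes are an ordinary character vector, so standard R functions apply directly; as a minimal sketch (using the res2 built above), a frequency table of the nouns can be obtained with table():

> nouns <- res2[names(res2) == "名詞"]
> table(nouns)     # もも occurs twice, すもも and うち once each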

On Mac OS X and Linux with R 2.8.0, the encoding of the result may have to be declared explicitly before such string comparisons behave as expected:

> Encoding(names(res2)) <- "UTF-8"   # the part-of-speech tags
> Encoding(res2) <- "UTF-8"          # the morphemes themselves

The comparison names(res2) == "名詞" returns TRUE for every noun and FALSE otherwise, and indexing res2 with this logical vector keeps only the TRUE positions. Two related functions are often useful: which() gives the positions of the TRUE values, and any() reports whether there is at least one TRUE.

> res3 <- names(res2) == "名詞"
> res3
[1]  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE
> which(res3)
[1] 1 3 5 7
> any(res3)
[1] TRUE

2.1.2 RMeCabText()

RMeCabText() takes the name of a text file and returns MeCab's full analysis. The examples below use yukiguni.txt from the data2 sample folder.

<- RMeCabText("yukiguni.txt") [[1]] [1] "" "" "" "*" "*" [6] "*" "*" "" "" "" [[2]] [1] "" "" "" "" "*" "*" "*" "" [9] "" "" [[3]] [1] "" "" "" "*" [5] "*" "" "" "" #... 2.1.3 RMeCabFreq() RMeCabFreq() Windows Linux Mac OS X Windows <- RMeCabFreq("yukiguni.txt") length = 13 Term Info1 Info2 Freq 1 3 2 1 3 1 #... res Term Info1 Info2 Freq 1 3 1 R data2 kumo.txt > pt1 <- proc.time() # <- RMeCabFreq("kumo.txt") length = 447 > pt2 <- proc.time() 12

The figures above were obtained on Windows. MeCab's output differs slightly between operating systems: RMeCabFreq() reports length = 447 distinct terms for kumo.txt on Windows but 446 on Linux and Mac OS X, because the MeCab dictionaries shipped for the different platforms do not segment and tag every expression identically; comparing the raw MeCab output of the same sentence on Windows and on Mac OS X / Linux shows small differences for a few words. Such differences come from MeCab and its dictionaries*1, not from RMeCab, and should be kept in mind when results produced on different operating systems are compared.

2.2 Adding words to the MeCab dictionary

Words that MeCab does not know are segmented into pieces or tagged with the wrong part of speech. Such words can be registered in a user dictionary; the procedure below is described for Windows, but it is analogous on Mac OS X and Linux.

*1 http://mecab.sourceforge.net/dic.html

Running MeCab interactively shows the problem: start it from its bin folder, enter a sentence containing the unknown word, and the word is split up or given the wrong part of speech.

C:\Program Files\MeCab\bin> mecab
(type the sentence and press [Enter]; the analysis, ending in EOS, follows)

To register the word, prepare a CSV file with one line per word. Each line holds the surface form, the left and right context IDs, the cost, and the part-of-speech, base-form, reading, and pronunciation fields used by the IPA dictionary; the context IDs may be left as -1, in which case they are assigned automatically, and a cost of about 1000 is a reasonable starting value (an illustrative line is shown after the command below). Save the file, here motohiro.csv, in a folder such as C:\data.

The CSV file is then compiled into a dictionary with mecab-dict-index.exe in MeCab's bin folder. Open a command prompt ([Start] - [All Programs] - [Accessories] - [Command Prompt]), change to that folder, and run the command; the \ at the end of a line below only marks a line break on the page, so type everything as a single line:

C:\data> cd C:\Program Files\MeCab\bin
C:\Program Files\MeCab\bin> mecab-dict-index.exe \
    -d "c:\Program Files\MeCab\dic\ipadic" \
    -u ishida.dic -f shift-jis -t shift-jis \
    c:\data\motohiro.csv
reading c:\data\motohiro.csv ... 1
emitting double-array: 100% |###########################################|
done!

This produces the user dictionary ishida.dic; move it to C:\data.
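For reference, a single dictionary entry in the IPA format consists of the surface form, the two context IDs, the cost, and the nine feature fields. The line below is only an illustration: the word, its part-of-speech details, reading, and cost are placeholders, not the entry used in the original example.

本広,-1,-1,1000,名詞,固有名詞,人名,姓,*,*,本広,モトヒロ,モトヒロ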

To make MeCab load the new dictionary, open the dicrc file in MeCab's dictionary folder with a text editor ([Start] - [All Programs] - [Accessories] - [Notepad]) and add the line

userdic = C:\data\ishida.dic

Start mecab from the command prompt again and analyze the same sentence; the registered word is now recognized as a single morpheme with the intended part of speech.*1

2.3 Analyzing data frames

2.3.1 RMeCabDF()

RMeCabDF() analyzes a column of a data frame that contains Japanese text. Its first argument is the data frame, the second the index (or name) of the column to analyze, and an optional third argument of 1 requests base forms instead of surface forms, just as in RMeCabC(). The examples use photo.csv from the sample data folder:

> # read the survey data
> dat <- read.csv("photo.csv")
> res <- RMeCabDF(dat, 3)             # analyze the third column (surface forms)
> res <- RMeCabDF(dat, 3, 1)          # the same column, base forms
> res <- RMeCabDF(dat, "Reply", 1)    # the column can also be given by name

*1 http://mecab.sourceforge.net/dic.html

Table 2-2: the structure of photo.csv

  ID, Sex, Reply
   1,   F, (free-text answer)
   2,   M, (free-text answer)
   3,   F, (free-text answer)
   4,   F, (free-text answer)
   5,   M, (free-text answer)

The Reply column holds free-text answers to a questionnaire. RMeCabDF() returns a list res with one element per row of the data frame, so length(res) equals the number of respondents; res[[1]] is the analysis of the first answer, here a named vector of five morphemes, with the parts of speech as names just as in RMeCabC().
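The list returned by RMeCabDF() can be post-processed with ordinary R tools. As a minimal sketch, pooling the morphemes of all answers and counting them gives an overall frequency table (using the dat and Reply column defined above):

> res <- RMeCabDF(dat, "Reply", 1)
> all.terms <- unlist(res)                    # pool the morphemes of every answer
> sort(table(all.terms), decreasing = TRUE)   # overall term frequencies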

2.4 Term-document matrices

A term-document matrix records how often each term occurs in each document: the rows are terms, the columns are documents, and each cell holds the frequency of that term in that document, for example

            doc1  doc2  doc3
  term 1       1     1     1
  term 2       1     1     0
  term 3       1     0     0
  term 4       0     1     1
  term 5       0     0     1
  term 6       0     0     1
  term 7       0     0     1

The documents used below are three one-sentence sample texts stored as doc1.txt, doc2.txt, and doc3.txt in the doc folder of the sample data. A term-document matrix is built from such a folder with docMatrix(); its first argument is the folder and pos selects the parts of speech to count, here nouns and adjectives:

> res <- docMatrix("doc", pos = c("名詞","形容詞"))
file = doc/doc1.txt
file = doc/doc2.txt
file = doc/doc3.txt
Term Document Matrix includes 2 information rows!
whose names are [[LESS-THAN-1]] and [[TOTAL-TOKENS]]
if you remove these rows, run
result[ row.names(result) != "[[LESS-THAN-1]]", ]
result[ row.names(result) != "[[TOTAL-TOKENS]]", ]
> res
                    docs
terms                doc1.txt doc2.txt doc3.txt
  [[LESS-THAN-1]]           0        0        0
  [[TOTAL-TOKENS]]          4        6        8
  (term rows follow)

[[LESS-THAN-1]] counts, per document, the terms whose frequency fell below the minFreq threshold (here the default of 1, so nothing was dropped), and [[TOTAL-TOKENS]] is the total number of tokens in each document. These two rows carry bookkeeping information, not term frequencies, and can be removed as the message suggests:

> res <- res[ row.names(res) != "[[LESS-THAN-1]]", ]
> res <- res[ row.names(res) != "[[TOTAL-TOKENS]]", ]
> res
          docs
terms      doc1.txt doc2.txt doc3.txt
  (term)          1        1        1
  (term)          1        1        0
  (term)          1        0        0
  (term)          0        1        0
  (term)          0        1        1
  (term)          0        1        1
  (term)          0        0        1
  (term)          0        0        1

The matrix can now be filtered like any other matrix. To keep only the terms that occur at least twice in the collection as a whole, filter on the row sums:

> res <- res[rowSums(res) >= 2, ]   # keep terms with a total frequency of at least 2
          docs
terms      doc1.txt doc2.txt doc3.txt
  (term)          1        1        1
  (term)          1        1        0
  (term)          0        1        1
  (term)          0        1        1

This is different from the minFreq argument of docMatrix(), which is applied within each document: with minFreq = 3, for instance, a term that occurs three times in document A but once in document B is counted as 3 in A's column and 0 in B's, and the occurrences dropped this way are tallied in a [[LESS-THAN-3]] row. For a threshold on the total frequency, filter with rowSums() as above or use docMatrix2() (section 2.5). Applying the per-document threshold to the same folder:

> res <- docMatrix("doc", pos = c("名詞","形容詞"), minFreq = 2)

                    docs
terms                doc1.txt doc2.txt doc3.txt
  [[LESS-THAN-2]]           2        3        2
  [[TOTAL-TOKENS]]          4        6        8
  ...

Only terms that occur at least twice within a single document remain; in this small example that leaves almost nothing, and [[LESS-THAN-2]] reports, per document, how many terms fell below the threshold. The behaviour is easier to see with a larger example, the three files in the morikita folder of the sample data:

> res <- docMatrix("morikita", pos = c("名詞","形容詞"))
file = morikita/morikita1.txt
file = morikita/morikita2.txt
file = morikita/morikita3.txt
Term Document Matrix includes 2 information rows!
whose names are [[LESS-THAN-1]] and [[TOTAL-TOKENS]]
if you remove these rows, run
result[ row.names(result) != "[[LESS-THAN-1]]", ]
result[ row.names(result) != "[[TOTAL-TOKENS]]", ]
> res
                    docs
terms                morikita1.txt morikita2.txt morikita3.txt
  [[LESS-THAN-1]]                0             0             0
  [[TOTAL-TOKENS]]              42            61            77
  (term)                         1             0             0
  (term)                         1             0             0
  (term)                         1             1             0
  (term)                         1             0             2
  (term)                         1             0             1
  (term)                         1             1             1
#  ...

Removing the information rows and keeping the terms with a total frequency of at least two:

> res <- res[ row.names(res) != "[[LESS-THAN-1]]", ]
> res <- res[ row.names(res) != "[[TOTAL-TOKENS]]", ]
> res <- res[rowSums(res) >= 2, ]   # terms occurring at least twice in total
                    docs
terms                morikita1.txt morikita2.txt morikita3.txt
  (term)                         1             1             0
  (term)                         1             0             2
  (term)                         1             0             1
  (term)                         1             1             1
  (term)                         1             5             2
#  ...

With the per-document threshold instead:

> res <- docMatrix("morikita", pos = c("名詞","形容詞"), minFreq = 2)
file = morikita/morikita1.txt
file = morikita/morikita2.txt
file = morikita/morikita3.txt
Term Document Matrix includes 2 information rows!
whose names are [[LESS-THAN-2]] and [[TOTAL-TOKENS]]
if you remove these rows, run
result[ row.names(result) != "[[LESS-THAN-2]]", ]
result[ row.names(result) != "[[TOTAL-TOKENS]]", ]
> res
                    docs
terms                morikita1.txt morikita2.txt morikita3.txt
  [[LESS-THAN-2]]               18            19            21
  [[TOTAL-TOKENS]]              42            61            77
  (term)                         2             0             0
  (term)                         2             0             0
  (term)                         0             5             2
  (term)                         0             2             0
  (term)                         0             2             0
  (term)                         0             0             2
  (term)                         0             0             2
  (term)                         0             0             2
  (term)                         0             0             2

Of the 42, 61, and 77 tokens in the three files, 18, 19, and 21 terms respectively occur only once in their file; they are dropped from the counts and tallied in [[LESS-THAN-2]]. Note that a term kept for one document may still show 0 for another, for example a term occurring twice in morikita1.txt but only once in morikita3.txt is counted as 2 and 0.

By default punctuation marks and other symbols (記号) are not counted; this is controlled by the sym argument. With sym = 1 they are included, which also raises [[TOTAL-TOKENS]]; with the default sym = 0 they are ignored. Naming "記号" in pos implies sym = 1 and additionally lists the symbols as terms of their own:

> res <- docMatrix("doc", pos = c("名詞","形容詞"), sym = 1)
#  ...
                    docs
terms                doc1.txt doc2.txt doc3.txt
  [[LESS-THAN-1]]           0        0        0
  [[TOTAL-TOKENS]]          5        7        9
  ...

> res <- docMatrix("doc", pos = c("名詞","形容詞"))
#  ...
                    docs
terms                doc1.txt doc2.txt doc3.txt
  [[LESS-THAN-1]]           0        0        0
  [[TOTAL-TOKENS]]          4        6        8    # the symbols are no longer counted
  ...

> res <- docMatrix("doc", pos = c("名詞","形容詞","記号"))
#  ...
                    docs
terms                doc1.txt doc2.txt doc3.txt
  [[LESS-THAN-1]]           0        0        0
  [[TOTAL-TOKENS]]          5        7        9    # the same totals as with sym = 1
  (symbol)                  1        1        1    # the sentence-final symbol appears as a term
  ...

2.5 docMatrix2()

docMatrix2() also builds a term-document matrix, but with slightly different behaviour. Its arguments are directory, pos, minFreq, sym, and weight. directory names the folder holding the documents (a single file may be given instead, see below); pos and sym work as in docMatrix(), with "記号" in pos again implying sym = 1. Unlike docMatrix(), the minFreq threshold is applied to the total frequency over all documents, so minFreq = 2 removes every term that occurs less than twice in the whole collection, which is what the rowSums() filter achieved above, and no [[LESS-THAN-1]] or [[TOTAL-TOKENS]] information rows are added.

> res <- docMatrix2("doc")       # default settings
to open doc
f_count=3
doc2.txt doc3.txt doc1.txt
to close dir
file_name = doc/doc2.txt opened
file_name = doc/doc3.txt opened
file_name = doc/doc1.txt opened
number of extracted terms = 4
to make matrix now
         doc1.txt doc2.txt doc3.txt
(term)          1        1        0
(term)          1        0        0
(term)          0        1        1
(term)          0        1        1

> # other parts of speech (including symbols) can be requested with pos
> res <- docMatrix2("doc", pos = c("名詞","記号"))

docMatrix2() also accepts a single file: the result is then simply the term frequency table of that file, returned as a one-column matrix. Keeping only the terms of kumo.txt that occur at least five times:

> res <- docMatrix2("kumo.txt", minFreq = 5)
file_name = kumo.txt opened
number of extracted terms = 21
to make matrix now
         texts
(term)      12
(term)      18
(term)      13
#  ...
(term)       5
(term)      17

2.6 docMatrixDF()

docMatrixDF() corresponds to docMatrix2(), but it takes a column of a data frame rather than a folder of files, so that survey answers such as the Reply column of photo.csv can be turned into a term-document matrix directly.

> dat <- read.csv("photo.csv", head = T) <- docmatrixdf(dat[,"reply"]) OBS.1 OBS.2 OBS.3 OBS.4 OBS.5 0 1 0 1 0 1 0 0 0 0 1 1 1 1 1 1 1 1 1 1 OBS. (NA ) 0 2.7 () 1 3 100 3 CPU (local weight) (global weight) (normalization) 3 TF (term frequency) IDF (inverse document frequency) (2002) (1999) 24

2.7.1 The weight argument

docMatrix(), docMatrix2(), and docMatrixDF() accept a weight argument in which a local weight, a global weight, and a normalization are combined with "*". The local weights are tf (the raw term frequency), tf2 (logarithmic TF), and tf3 (binary weight); the global weights are idf (inverse document frequency) and its variants idf2, idf3, and idf4; norm normalizes each document vector to unit length. For example, weight = "tf*idf" multiplies each term frequency by the term's inverse document frequency:

> res <- docMatrix("doc", pos = c("名詞","形容詞","動詞"), weight = "tf*idf")
> res
          docs
terms      doc1.txt doc2.txt doc3.txt
  (term)   1.000000 1.000000 1.000000
  (term)   1.584963 1.584963 0.000000
  (term)   2.584963 0.000000 0.000000
  (term)   0.000000 2.584963 0.000000
  (term)   0.000000 1.584963 1.584963
  (term)   0.000000 1.584963 1.584963
  (term)   0.000000 0.000000 2.584963
  (term)   0.000000 0.000000 2.584963

Every term here occurs exactly once in the documents that contain it, so each non-zero cell is simply the idf of its term, which docMatrix() computes as

  idf = log2(N / n_i) + 1

where N is the number of documents and n_i the number of documents containing term i. A term appearing in all three documents gets log2(3/3) + 1 = 1, a term appearing in two of them log2(3/2) + 1 = 1.584963, and a term appearing in only one log2(3/1) + 1 = 2.584963.

Adding *norm divides each document column by its Euclidean length, so that long and short documents become comparable:

> res <- docMatrix("doc", pos = c("名詞","形容詞","動詞"), weight = "tf*idf*norm")
> res
          docs
terms       doc1.txt  doc2.txt  doc3.txt
  (term)   0.3132022 0.2563399 0.2271069
  (term)   0.4964137 0.4062891 0.0000000
  (term)   0.8096159 0.0000000 0.0000000
  (term)   0.0000000 0.6626290 0.0000000
  (term)   0.0000000 0.4062891 0.3599560
  (term)   0.0000000 0.4062891 0.3599560
  (term)   0.0000000 0.0000000 0.5870629
  (term)   0.0000000 0.0000000 0.5870629

For doc1.txt, for example, the length of the tf*idf column is sqrt(1^2 + 1.584963^2 + 2.584963^2) = 3.192827, and dividing each entry by this value gives 1/3.192827 = 0.3132022 for the first term, and so on.
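The idf values and the normalization can be reproduced directly in R; the following is only a sketch of the arithmetic used above, not an RMeCab function:

> log2(3/3) + 1                              # idf of a term found in all 3 documents
[1] 1
> log2(3/2) + 1                              # found in 2 of the 3 documents
[1] 1.584963
> log2(3/1) + 1                              # found in only 1 document
[1] 2.584963
> v <- c(1, log2(3/2) + 1, log2(3/1) + 1)    # the tf*idf column of doc1.txt
> sqrt(sum(v^2))                             # its Euclidean length
[1] 3.192827
> v / sqrt(sum(v^2))                         # the normalized column
[1] 0.3132022 0.4964137 0.8096159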

2.8 N-grams

An N-gram is a sequence of N consecutive units, where the unit may be a character, a morpheme, or a part-of-speech tag. With N = 2 the sequences are called bi-grams: a four-character word, for instance, yields three character bi-grams, and consecutive morphemes or part-of-speech tags form morpheme and POS bi-grams in the same way. Counting which N-grams occur, and how often, characterizes a text by its local sequences rather than by isolated terms.

2.8.1 Ngram()

Ngram() extracts N-grams from a text file and counts them. By default it works on characters with N = 2, i.e. character bi-grams:

> res <- Ngram("yukiguni.txt")
file = yukiguni.txt Ngram = 2
length = 38
> nrow(res)
[1] 38

The result is a data frame with the bi-gram, written in the form [X-Y], in the column Ngram and its frequency in Freq; yukiguni.txt contains 38 distinct character bi-grams, almost all of which occur only once.

The type argument selects the unit: type = 0 (the default) uses characters, type = 1 morphemes, and type = 2 part-of-speech tags, and N sets the length of the sequence. Morpheme bi-grams of the same file:

> res <- Ngram("yukiguni.txt", type = 1, N = 2)
file = yukiguni.txt Ngram = 2
length = 25
> nrow(res)
[1] 25

There are 25 distinct morpheme bi-grams, again almost all with frequency 1. Part-of-speech bi-grams are far less varied, so their frequencies are higher:

> # part-of-speech bi-grams
> res <- Ngram("yukiguni.txt", type = 2, N = 2)
file = yukiguni.txt Ngram = 2
length = 13
> nrow(res)
[1] 13

Only 13 distinct POS bi-grams occur, the most frequent of them 6 times. Longer sequences are obtained by raising N; POS tri-grams of the same file:

> # part-of-speech tri-grams
> res <- Ngram("yukiguni.txt", type = 2, N = 3)
file = yukiguni.txt Ngram = 3
length = 20
> nrow(res)
[1] 20

With type = 1 the pos argument restricts the N-grams to particular parts of speech. Bi-grams built from the nouns of yukiguni.txt only:

> res <- Ngram("yukiguni.txt", type = 1, N = 2, pos = "名詞")
file = yukiguni.txt Ngram = 2
length = 7
> res
  Ngram Freq
  ...

With pos = "名詞" the bi-grams are formed from consecutive nouns, skipping whatever stands between them; here seven noun bi-grams are found, each occurring once. The [X-Y] strings returned by Ngram() are easy to read but awkward for further processing, and comparing several texts requires matching their N-grams by hand; NgramDF() below returns the elements separately, and docNgram2() (section 2.8.5) handles several documents at once.

2.8.2 NgramDF()

NgramDF() takes the same arguments as Ngram() but returns the elements of each N-gram in separate columns of a data frame:

> kekkaDF <- NgramDF("yukiguni.txt", type = 1, N = 2, pos = "名詞")
file = yukiguni.txt Ngram = 2
> kekkaDF
  Ngram1 Ngram2 Freq
  ...

The result has one row per bi-gram, with the first element in Ngram1, the second in Ngram2, and the count in Freq; these are the same seven noun bi-grams as above, each with frequency 1. To build such a data frame for several files at once, use NgramDF2().

2.8.3 NgramDF2()

NgramDF2() extends NgramDF() to a whole folder of files (a single file may also be given). Its arguments are directory, type, pos, minFreq, N, and sym: type chooses the unit (0 = characters, 1 = morphemes, 2 = part-of-speech tags), pos selects parts of speech, for example pos = c("名詞","形容詞"), minFreq = 2 keeps only the N-grams that occur at least twice, N sets the length of the N-gram, and sym controls whether symbols are counted (sym = 0 excludes them, sym = 1 includes them, and naming "記号" in pos implies sym = 1).

> # a single file: like NgramDF(), but the frequency column is named after the file
> res <- NgramDF2("yukiguni.txt", type = 1, N = 2, pos = "名詞")
file_name = yukiguni.txt opened
number of extracted terms = 7
> res
  Ngram1 Ngram2 yukiguni.txt
  ...

> # nouns and adjectives together
> res <- NgramDF2("yukiguni.txt", type = 1, N = 2, pos = c("名詞","形容詞"))
file_name = yukiguni.txt opened
number of extracted terms = 10

Applied to a folder, one frequency column per file is produced:

> targetDir <- "doc"
> res <- NgramDF2(targetDir)                                       # defaults: character bi-grams
> res <- NgramDF2(targetDir, type = 1, pos = c("名詞","形容詞"))   # morpheme bi-grams
> res <- NgramDF2(targetDir, type = 2)                             # POS bi-grams
> res <- NgramDF2(targetDir, type = 2, minFreq = 2)                # only bi-grams occurring at least twice

Each of these returns a data frame with the columns Ngram1, Ngram2, and one frequency column per document (doc1.txt, doc2.txt, doc3.txt); with minFreq = 2 the rarer bi-grams are dropped.

2.8.4 docNgram()

docNgram() applies Ngram() to every file in a folder and assembles the results into a single matrix with one column per document; the arguments type and N have the same meaning as in Ngram().

> res <- docNgram("doc")
file = doc/doc1.txt Ngram = 2
length = 1
file = doc/doc2.txt Ngram = 2
length = 2
file = doc/doc3.txt Ngram = 2
length = 1
> res
            Text
Ngram        doc1.txt doc2.txt doc3.txt
  (bi-gram)         1        0        0
  (bi-gram)         0        1        0
  (bi-gram)         0        1        1

Comparing the Ngram() results of individual files by hand would mean matching the N-grams of one result against those of another, for instance with %in%; docNgram() does this bookkeeping automatically. Finer control over parts of speech and frequencies is provided by docNgram2() (section 2.8.5).
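Because the result has one column per document, shared and document-specific N-grams can be read off with ordinary logical indexing. A minimal sketch, assuming only the column names shown in the output above:

> res <- docNgram("doc")
> res[res[, "doc2.txt"] > 0 & res[, "doc3.txt"] > 0, , drop = FALSE]   # bi-grams common to doc2 and doc3
> rownames(res)[res[, "doc1.txt"] > 0]                                 # bi-grams that occur in doc1.txt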

2.8.5 docNgram2()

docNgram2() combines docNgram() with the options of NgramDF2(). Its arguments are directory, type, pos, minFreq, N, and sym, with the same meanings as before: type = 0 for characters, 1 for morphemes, 2 for part-of-speech tags; pos to select parts of speech, e.g. pos = c("名詞","形容詞"); minFreq = 2 to keep only N-grams with a total frequency of at least two; N for the length of the N-gram; and sym for the treatment of symbols (naming "記号" in pos implies sym = 1). Some examples for the doc folder:

> targetDir <- "doc"
> res <- docNgram2(targetDir, pos = c("名詞","形容詞"))          # bi-grams of nouns and adjectives
> res <- docNgram2(targetDir, type = 1, pos = c("名詞","形容詞"))
> res <- docNgram2(targetDir, type = 2)                          # POS bi-grams
> res <- docNgram2(targetDir, type = 2, N = 5)                   # POS 5-grams
> res <- docNgram2(targetDir, type = 2, minFreq = 2, N = 5)      # only 5-grams occurring at least twice

In every case the result is a matrix with one row per N-gram, written as [A-B-...-E], and one column per document. With minFreq = 2 only the sequences that reach the threshold remain; for the POS 5-grams of this small example that is a single row, shared by doc1.txt and doc2.txt.

2.9 Collocations

A collocation analysis asks which words tend to occur near a given word. The word of interest is called the node, and the window examined on either side of it is the span: with a span of 3, the three morphemes to the left and the three to the right of every occurrence of the node are collected.

2.9.1 collocate()

RMeCab provides collocate() for this purpose. Its first argument is the file to analyze, node gives the node word, and span the window size. For kumo.txt, with the noun 極楽 as the node and a span of 3:

> res <- collocate("kumo.txt", node = "極楽", span = 3)
> nrow(res)
[1] 33
> res[25:33, ]
            Term Span Total
25          極楽   10    10
26        (term)    2     7
27        (term)    4     4
28        (term)    2    14
29        (term)    1     4
30        (term)    2     7
31        (term)    1     3
32  [[MORPHEMS]]   31   413
33    [[TOKENS]]   70  1808

Each ordinary row is a term found inside the span around some occurrence of the node; Span is the number of times it occurs within the spans and Total its frequency in the whole text. The node itself appears as one of the rows, and the last two rows are information rows.

[[TOKENS]] reports, in the Span column, the total number of tokens inside all spans: the node occurs 10 times and 3 morphemes are taken on each side, giving 10 x 3 x 2 = 60 tokens, plus the 10 occurrences of the node itself, i.e. 70; its Total column, 1808, is the number of tokens in the whole text. [[MORPHEMS]] gives the corresponding numbers of distinct types: 31 types inside the spans and 413 in the whole text.

collocate() only counts co-occurrences. To judge whether a term co-occurs with the node more often than chance would predict, the T score and the MI (mutual information) score are commonly used (Barnbrook, 1996, p. 97; Church et al., 1991). Both compare the observed frequency of a term inside the spans with the frequency expected if the term and the node were independent. Take the term in row 27 above, which occurs 4 times in the whole text (Total = 4) and 4 times inside the spans (Span = 4). Its relative frequency in the text is 4/1808, and the spans offer 10 x 3 x 2 = 60 token positions, so its expected frequency inside the spans is (4/1808) x 10 x 3 x 2, about 0.13.

The T score is the difference between the observed and the expected frequency, divided by the square root of the observed frequency; values above roughly 1.65 are taken to indicate a genuine collocation (Church et al., 1991). The MI score is the base-2 logarithm of the ratio of the observed to the expected frequency, which can be computed directly in R:

> log2( 4 / ((4/1808) * 10 * 3 * 2))
[1] 4.913288

MI values above about 1.58 are usually regarded as meaningful (Barnbrook, 1996). Both measures have known weaknesses, the T score depending on the sample size and the MI score overstating the importance of rare words, so they are best read together. RMeCab computes both with collScores(), which takes the result of collocate() as its first argument together with the same node and span:

> res2 <- collScores(res, node = "極楽", span = 3)
> res2[25:33, ]
            Term Span Total         T       MI
25          極楽   10    10        NA       NA
26        (term)    2     7 1.2499520 3.105933
27        (term)    4     4 1.9336283 4.913288
28        (term)    2    14 1.0856905 2.105933
29        (term)    1     4 0.8672566 2.913288
30        (term)    2     7 1.2499520 3.105933
31        (term)    1     3 0.9004425 3.328326
32  [[MORPHEMS]]   31   413        NA       NA
33    [[TOKENS]]   70  1808        NA       NA

The node itself and the two information rows receive NA. The term in row 27 has T = 1.93 and MI = 4.91, above the thresholds of 1.65 and 1.58, and can therefore be regarded as a collocate of the node. On Mac OS X and Linux with R 2.8.0 the encoding of the Term column may again have to be set explicitly:

> Encoding(res$Term) <- "UTF-8"
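The T and MI values in row 27 can be checked by hand from the quantities in the table; the following is only a sketch of the arithmetic, not an RMeCab function:

> observed <- 4                              # Span: occurrences inside the spans
> total    <- 4                              # Total: occurrences in the whole text
> expected <- (total / 1808) * 10 * 3 * 2    # expected frequency under independence
> (observed - expected) / sqrt(observed)     # T score
[1] 1.933628
> log2(observed / expected)                  # MI score
[1] 4.913288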


References

Barnbrook, Geoff (1996) Language and Computers: A Practical Introduction to the Computer Analysis of Language, Edinburgh: Edinburgh University Press.
Church, K. W., W. Gale, P. Hanks, and D. Hindle (1991) "Using statistics in lexical analysis," in Using On-line Resources to Build a Lexicon, Lawrence Erlbaum, pp. 115-164.
(1999) [in Japanese]
(2002) [in Japanese]
(2006) R [in Japanese]
(2007) R, S-PLUS [in Japanese]