Latent Dirichlet Allocation W707 s-taiji@is.titech.ac.jp 1 / 37
LDA (Latent Dirichlet Allocation): LDA on Wikipedia
Outline
1 LDA: Latent Dirichlet Allocation
2 LDA on Wikipedia
3 Report assignments
DF 4 / 37
Bag of Words
Bag of words: a document is represented only by the number of times each word occurs in it (e.g. "please", "credit", "money"); word order is ignored.
Bag of Words: count matrix (many entries are 0).

       1   2   3  ...   N
  1    4   8   0  ...   2
  2    2   0   2  ...   1
  3    2   4   0  ...   0
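As a small illustration of the count matrix, the sketch below builds a bag-of-words representation from tokenised documents. The toy documents and the function name `bag_of_words` are made up for this example, and the orientation (one row per document, one column per word) is an assumption.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, count matrix) for a list of tokenised documents.

    Assumed orientation: one row per document, one column per vocabulary word.
    """
    vocab = sorted({w for doc in docs for w in doc})
    index = {w: i for i, w in enumerate(vocab)}
    matrix = []
    for doc in docs:
        row = [0] * len(vocab)
        for word, count in Counter(doc).items():
            row[index[word]] = count
        matrix.append(row)
    return vocab, matrix

# Hypothetical toy documents (not from the lecture data).
docs = [
    "please credit money please".split(),
    "money money credit".split(),
]
vocab, X = bag_of_words(docs)
```

Word order is discarded: any permutation of the words of a document gives the same row.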
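Multinomial distribution: among n words, word i occurs x_i times.

$$(x_1,\dots,x_k) \sim \mathrm{Mult}(\beta; n), \qquad P(x_1,\dots,x_k \mid \beta) = \frac{n!}{x_1!\cdots x_k!}\,\beta_1^{x_1}\cdots\beta_k^{x_k},$$

where $\beta = (\beta_1,\dots,\beta_k)$, $\sum_{i=1}^{k}\beta_i = 1$, $\beta_i \ge 0$.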
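The multinomial probability can be checked numerically. The following is a minimal sketch; the function name `multinomial_pmf` and the example numbers are chosen here for illustration.

```python
from math import factorial, prod

def multinomial_pmf(x, beta):
    """P(x_1,...,x_k | beta) for counts x and probability vector beta."""
    n = sum(x)
    coef = factorial(n)
    for xi in x:
        coef //= factorial(xi)  # each partial quotient is an integer
    return coef * prod(b ** xi for b, xi in zip(beta, x))

# 3 words in total: word 1 twice, word 2 once, word 3 never.
p = multinomial_pmf([2, 1, 0], [0.5, 0.3, 0.2])
```

Here the coefficient is 3!/(2! 1! 0!) = 3 and the product of powers is 0.5^2 * 0.3 = 0.075, so p = 0.225.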
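Dirichlet distribution: a distribution on the parameter β (a probability vector), used as its prior.

$$\beta \sim \mathrm{Diri}(\alpha), \qquad p(\beta \mid \alpha) = \frac{\Gamma\!\left(\sum_{i=1}^{k}\alpha_i\right)}{\prod_{i=1}^{k}\Gamma(\alpha_i)} \prod_{i=1}^{k}\beta_i^{\alpha_i-1},$$

where $\alpha = (\alpha_1,\dots,\alpha_k)$, $\alpha_i > 0$.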
Dirichlet distribution (figure: Wikipedia)
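Posterior: the Dirichlet distribution is the conjugate prior of the multinomial distribution.

$X = (X_1,\dots,X_k)$ follows a multinomial distribution with parameter $\pi$, the observation is $x = (x_1,\dots,x_k)$ with $\sum_{i=1}^{k} x_i = n$, and the prior is $\pi \sim \mathrm{Diri}(\alpha)$. The posterior is again Dirichlet:

$$p(\pi \mid x) = \frac{p(x \mid \pi)\,p(\pi \mid \alpha)}{\int p(x \mid \pi)\,p(\pi \mid \alpha)\,d\pi} \propto \underbrace{\left(\pi_1^{x_1}\cdots\pi_k^{x_k}\right)}_{\text{multinomial}}\ \underbrace{\left(\pi_1^{\alpha_1-1}\cdots\pi_k^{\alpha_k-1}\right)}_{\text{Dirichlet}} = \pi_1^{x_1+\alpha_1-1}\cdots\pi_k^{x_k+\alpha_k-1} \propto \mathrm{Diri}\big((x_1+\alpha_1,\dots,x_k+\alpha_k)\big).$$

Prior parameter: $(\alpha_1,\dots,\alpha_k)$. Posterior parameter: $(x_1+\alpha_1,\dots,x_k+\alpha_k)$.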
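The conjugate update amounts to adding the observed counts to the prior parameter. A minimal sketch follows; the function names and the numbers are illustrative only, and the posterior mean is shown as one possible summary of the posterior.

```python
def dirichlet_posterior(alpha, x):
    """Posterior Dirichlet parameter after observing multinomial counts x."""
    return [a + c for a, c in zip(alpha, x)]

def dirichlet_mean(alpha):
    """Mean of Diri(alpha): alpha_i / sum_j alpha_j."""
    total = sum(alpha)
    return [a / total for a in alpha]

alpha = [1.0, 1.0, 1.0]   # uniform prior (an example choice)
x = [4, 8, 0]             # observed counts
post = dirichlet_posterior(alpha, x)
mean = dirichlet_mean(post)
```

With these numbers the posterior parameter is (5, 9, 1); the third category keeps a positive posterior mean of 1/15 even though it was never observed.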
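LDA: Latent Dirichlet Allocation

Setting: K topics, M words, N documents.

Topic k has a word distribution $\beta^{(k)} = (\beta^{(k)}_1,\dots,\beta^{(k)}_M)$ $(k = 1,\dots,K)$.

Document d has topic proportions $\pi^{(d)} = (\pi^{(d)}_1,\dots,\pi^{(d)}_K)$ $(d = 1,\dots,N)$.

Each word of document d is drawn from the mixture $\sum_{k=1}^{K} \pi^{(d)}_k\,\underbrace{\mathrm{Mult}(\beta^{(k)}; 1)}_{\text{topic } k}$, so the word counts of document d follow

$$(x^{(d)}_1,\dots,x^{(d)}_M) \sim \mathrm{Mult}\!\left(\sum_{k=1}^{K}\pi^{(d)}_k\,\beta^{(k)};\ n^{(d)}\right),$$

where $n^{(d)}$ is the number of words in document d. Both $\{\pi^{(d)}\}$ and $\{\beta^{(k)}\}$ are estimated from the data.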
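The generative model can be simulated directly: for every word, first pick a topic from the document's topic proportions, then pick a word from that topic's word distribution. The sketch below assumes toy values for π and β; `sample_document` is a name introduced here.

```python
import random

def sample_document(pi, beta, n, rng):
    """Draw word counts (x_1,...,x_M) of one document with n words.

    pi:   topic proportions of the document (length K)
    beta: beta[k][i] = probability of word i under topic k (K x M)
    """
    K, M = len(pi), len(beta[0])
    x = [0] * M
    for _ in range(n):
        k = rng.choices(range(K), weights=pi)[0]       # topic of this word
        i = rng.choices(range(M), weights=beta[k])[0]  # word given the topic
        x[i] += 1
    return x

rng = random.Random(0)
beta = [[0.7, 0.3, 0.0, 0.0],   # toy topic 1: mostly words 1 and 2
        [0.0, 0.0, 0.5, 0.5]]   # toy topic 2: words 3 and 4
x = sample_document([0.8, 0.2], beta, n=50, rng=rng)
```

Marginally each word is drawn from the mixed distribution sum_k π_k β^(k), which is why the count vector is multinomial with that parameter.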
LDA 12 / 37
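Likelihood: $X = (x^{(d)}_1,\dots,x^{(d)}_M)_{d=1}^{N}$ (an N × M matrix), where $x^{(d)}_i$ is the number of times word i occurs in document d.

$$p\big(X \mid \{\beta^{(k)}\}_{k=1}^{K}, \{\pi^{(d)}\}_{d=1}^{N}\big) = \prod_{d=1}^{N} p\!\left(x^{(d)} \,\Big|\, \sum_{k=1}^{K}\beta^{(k)}\pi^{(d)}_k\right).$$

Here $p\big(x^{(d)} \mid \sum_{k=1}^{K}\beta^{(k)}\pi^{(d)}_k\big)$ is the probability of $x^{(d)}$ under $\mathrm{Mult}\big(\sum_{k=1}^{K}\pi^{(d)}_k\beta^{(k)};\ n^{(d)}\big)$.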
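The likelihood above is a product of multinomial probabilities, so its logarithm is a sum over documents. The following sketch evaluates it with log-gamma functions for numerical stability; the function names are introduced here, and it assumes that every word with a positive count has positive mixed probability.

```python
from math import lgamma, log

def log_multinomial_pmf(x, p):
    """log P(x | p) for counts x and probability vector p."""
    out = lgamma(sum(x) + 1)              # log n!
    for xi, pi_ in zip(x, p):
        if xi > 0:                        # terms with x_i = 0 contribute 0
            out += xi * log(pi_) - lgamma(xi + 1)
    return out

def lda_log_likelihood(X, beta, pi):
    """log p(X | {beta^(k)}, {pi^(d)}): sum over documents d."""
    K = len(beta)
    total = 0.0
    for x_d, pi_d in zip(X, pi):
        mixed = [sum(pi_d[k] * beta[k][i] for k in range(K))
                 for i in range(len(x_d))]
        total += log_multinomial_pmf(x_d, mixed)
    return total

# Toy check: one document, one topic, two equally likely words.
ll = lda_log_likelihood([[1, 1]], [[0.5, 0.5]], [[1.0]])
```

In the toy check the probability is 2 * 0.5 * 0.5 = 0.5, so ll equals log 0.5.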
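Likelihood:

$$p\big(X \mid \{\beta^{(k)}\}_{k=1}^{K}, \{\pi^{(d)}\}_{d=1}^{N}\big) = \prod_{d=1}^{N} p\!\left(x^{(d)} \,\Big|\, \sum_{k=1}^{K}\beta^{(k)}\pi^{(d)}_k\right).$$

LDA places Dirichlet priors on $\{\beta^{(k)}\}_{k=1}^{K}$ and $\{\pi^{(d)}\}_{d=1}^{N}$, and works with the posterior distribution

$$p\big(\{\beta^{(k)}\}_{k=1}^{K}, \{\pi^{(d)}\}_{d=1}^{N} \mid X\big).$$

The estimates $\{\hat\beta^{(k)}\}_{k=1}^{K}$ and $\{\hat\pi^{(d)}\}_{d=1}^{N}$ are obtained from this posterior.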
LDA 1 1 LDA 2 3 15 / 37
Estimation methods:
Gibbs sampling
Collapsed Gibbs sampling
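To make the collapsed Gibbs idea concrete, here is a minimal sketch of the standard collapsed sampler, in which β and π are integrated out and only the topic assignment of each word is sampled. It is an illustration only, not the implementation used by any R package; the symmetric hyperparameters `alpha` and `eta`, the function name and the toy data are assumptions made for this example.

```python
import random

def collapsed_gibbs(docs, K, M, alpha=0.1, eta=0.01, iters=200, seed=0):
    """docs: list of documents, each a list of word ids in range(M)."""
    rng = random.Random(seed)
    n_dk = [[0] * K for _ in docs]      # words of document d assigned to topic k
    n_kw = [[0] * M for _ in range(K)]  # times word w is assigned to topic k
    n_k = [0] * K                       # total words assigned to topic k
    z = []                              # topic assignment of every word
    for d, doc in enumerate(docs):
        z.append([rng.randrange(K) for _ in doc])
        for w, k in zip(doc, z[d]):
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for j, w in enumerate(doc):
                k = z[d][j]             # remove the current assignment
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # full conditional of the topic given all other assignments
                weights = [(n_dk[d][t] + alpha) * (n_kw[t][w] + eta)
                           / (n_k[t] + M * eta) for t in range(K)]
                k = rng.choices(range(K), weights=weights)[0]
                z[d][j] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1
    # point estimates from the final counts
    pi_hat = [[(n_dk[d][k] + alpha) / (len(doc) + K * alpha) for k in range(K)]
              for d, doc in enumerate(docs)]
    beta_hat = [[(n_kw[k][w] + eta) / (n_k[k] + M * eta) for w in range(M)]
                for k in range(K)]
    return pi_hat, beta_hat

docs = [[0, 0, 1], [2, 3, 3], [0, 1, 0]]   # toy corpus of word ids
pi_hat, beta_hat = collapsed_gibbs(docs, K=2, M=4)
```

The estimates here use only the last sample; averaging over several samples after a burn-in period is a common refinement.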
LDA in R
Packages: library(topicmodels), library(lda)
With the bag-of-words matrix X and the number of topics K (LDA() is provided by topicmodels):
> LDA(X, K)
Data: Japanese Wikipedia dump of 2014
http://dumps.wikimedia.org/jawiki/20140624/
jawiki-20140624-pages-articles1.xml.bz2
Preprocessing with Python
The Python library Gensim is used to convert the articles to bag-of-words form.
Environment: Python 3.4.1 + NumPy 1.8.1 + SciPy 0.14.0, Windows 7, 64 bit
JaWikiCorpus, based on Gensim's WikiCorpus, calls MeCab from Python.
> python jawikicorpus_make.py jawiki-20140624-pages-articles1.xml.bz2 jawiki1
MeCab: example of morphological analysis output
Word list, first 10 lines (ID, word, frequency):

62158  aa      202
31510  aaa     132
10543  aab      65
15293  aac      48
25269  aaron    42
19714  ab      212
32430  aba      93
19037  abba     21
45622  abbey    24
19673  abc     706

Article titles, first 10 lines (IDs 0 to 9), e.g. 4: EU ( )
Bag of words in Matrix Market format:

%%MatrixMarket matrix coordinate real general
59749 62999 4970557
1 867 2
1 1577 1
1 6045 1
1 9144 1
1 9393 1
1 10498 2
1 11234 3
1 11705 1

59,749 documents, 62,999 words, 4,970,557 non-zero entries.
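The coordinate format is simple enough to read by hand: a header line, a size line (rows, columns, non-zeros), then one `row column value` triple per line with 1-based indices. The reader below is a minimal sketch for this layout only (no symmetric or pattern matrices); `read_matrix_market` is a name introduced here.

```python
def read_matrix_market(lines):
    """Parse Matrix Market 'coordinate' text into (rows, cols, nnz, entries).

    entries maps 0-based (row, col) to the stored value.
    """
    it = iter(lines)
    header = next(it)
    if not header.startswith("%%MatrixMarket matrix coordinate"):
        raise ValueError("not a Matrix Market coordinate file")
    for line in it:                       # skip comments and blank lines
        line = line.strip()
        if line and not line.startswith("%"):
            break
    else:
        raise ValueError("missing size line")
    n_rows, n_cols, nnz = (int(t) for t in line.split())
    entries = {}
    for line in it:
        parts = line.split()
        if parts:
            r, c, v = int(parts[0]), int(parts[1]), float(parts[2])
            entries[(r - 1, c - 1)] = v   # convert to 0-based indices
    return n_rows, n_cols, nnz, entries

sample = """%%MatrixMarket matrix coordinate real general
59749 62999 4970557
1 867 2
1 1577 1
1 6045 1
""".splitlines()
n_rows, n_cols, nnz, entries = read_matrix_market(sample)
```

For the full file, passing an open file object instead of a list of lines works the same way, since the function only iterates over lines.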
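LDA with 20 topics, estimated by Gibbs sampling:
K = 20
wiki.lda <- LDA(ssx, K, method = "Gibbs", control = list(burnin = 2000, iter = 5000))
In list(burnin = 2000, iter = 5000), iter = 5000 is the number of Gibbs iterations and burnin = 2000 the number of burn-in iterations.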
      Topic 1  Topic 2  Topic 3  Topic 4    Topic 5
 [1,] ""       "de"     ""       "windows"  ""
 [2,] ""       "la"     ""       "gt"       "mm"
 [3,] ""       ""       ""       "pc"       ""
 [4,] ""       "cc"     " cd"    "lt"       "km"
 [5,] ""       "file"   "vol"    ""         ""
 [6,] ""       ""       ""       "os"       ""
 [7,] ""       ""       ""       "ms"       ""
 [8,] ""       ""       ""       "ii"       ""
 [9,] ""       ""       "one"    "mhz"      ""
[10,] ""       ""       ""       "mb"       ""
[11,] ""       "le"     ""       "vs"       ""
[12,] ""       ""       ""       "minus"    "cm"
[13,] ""       ""       ""       "for"      ""
[14,] ""       "image"  ""       "mac"      ""
[15,] ""       ""       ""       "system"   "dd"
      Topic 6  Topic 7  Topic 8  Topic 9
 [1,] ""       ""       "nbsp"   ""
 [2,] ""       "tbs"    "km"     ""
 [3,] ""       "nhk"    ""       ""
 [4,] ""       ""       ""       ""
 [5,] ""       ""       ""       ""
 [6,] ""       ""       ""       ""
 [7,] ""       ""       ""       ""
 [8,] ""       ""       ""       ""
 [9,] ""       ""       ""       ""
[10,] ""       ""       ""       ""
[11,] ""       ""       ""       ""
[12,] ""       ""       ""       ""
[13,] ""       ""       ""       ""
[14,] ""       ""       ""       ""
[15,] ""       ""       ""       ""
      Topic 10  Topic 11  Topic 12  Topic 13  Topic 14
 [1,] "bs"      ""        ""        ""        "en"
 [2,] "hd"      ""        ""        ""        ""
 [3,] ""        ""        ""        ""        ""
 [4,] ""        "op"      ""        ""        ""
 [5,] ""        ""        "jr "     ""        ""
 [6,] "com"     ""        ""        ""        ""
 [7,] "sports"  ""        ""        ""        "right"
 [8,] ""        ""        ""        ""        ""
 [9,] "tv"      ""        ""        ""        ""
[10,] "sup"     ""        ""        ""        ""
[11,] ""        ""        ""        ""        ""
[12,] ""        ""        ""        "kg"      ""
[13,] ""        ""        ""        ""        ""
[14,] ""        ""        ""        ""        "png"
[15,] ""        ""        ""        ""        ""
      Topic 15      Topic 16  Topic 17  Topic 18
 [1,] "and"         "km"      "ch"      ""
 [2,] "in"          "text"    ""        ""
 [3,] "file"        "style"   ""        ""
 [4,] "to"          ""        ""        ""
 [5,] "university"  ""        ""        ""
 [6,] "new"         ""        ""        ""
 [7,] "on"          "center"  ""        ""
 [8,] "by"          "align"   ""        ""
 [9,] "with"        "bar"     ""        ""
[10,] "press"       ""        "kw"      ""
[11,] "for"         ""        ""        ""
[12,] "at"          ""        ""        ""
[13,] "en"          ""        ""        ""
[14,] "white"       ""        ""        ""
[15,] "black"       "bull"    "fm"      ""
      Topic 19  Topic 20
 [1,] "th"      ""
 [2,] "love"    ""
 [3,] "live"    ""
 [4,] ""        ""
 [5,] "in"      ""
 [6,] "dvd"     ""
 [7,] "you"     ""
 [8,] "cd"      ""
 [9,] "best"    ""
[10,] "to"      ""
[11,] "music"   ""
[12,] ""        ""
[13,] "my"      "cm"
[14,] "on"      ""
[15,] "go"      ""
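Tables like the ones above list, for each topic, the words with the largest estimated probability. Given an estimated word distribution for one topic, such a ranking can be sketched as follows; `top_terms`, the vocabulary and the probabilities are made up for this example.

```python
def top_terms(beta_k, vocab, n=15):
    """Return the n words with the largest probability under one topic."""
    order = sorted(range(len(vocab)), key=lambda i: beta_k[i], reverse=True)
    return [vocab[i] for i in order[:n]]

# Hypothetical vocabulary and topic-word probabilities.
vocab = ["windows", "km", "pc", "love", "os"]
beta_k = [0.40, 0.05, 0.30, 0.05, 0.20]
terms = top_terms(beta_k, vocab, n=3)
```

For this toy topic the three highest-probability words are "windows", "pc" and "os", in that order.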
Topic 3:
Topic 4:
Topic 5:
Topic 6:
Topic 18:

"Topic 3 :"  ""
"Topic 4 :"  "Xeon PC-9821 ThinkCentre Safari Microsoft X68000 Unicode E0000-E0FFF MC68000 .NET Framew
"Topic 18 :" ""
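Word cloud: wordcloud(vocs, freq) (requires library(wordcloud))
vocs: the words; freq: the weight of each word (a value between 0 and 1).
Words with larger freq are drawn larger.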
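Report assignment (gam)
Analyze the data with gam (files are in For_report.zip), starting from mackgam.r.
mackgam.r uses the variables lon, lat, b.depth and c.dist.
Use UBRE (Unbiased Risk Estimator) for evaluation; the UBRE value is stored in mack.gamadd$gcv.ubre (UBRE is a criterion of the same kind as GCV).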
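Report assignment (clustering)
Data: ken-kankyo-kakou.csv (in For_report.zip).
Cluster the data with hierarchical clustering (hclust) and with Gaussian mixture clustering (Mclust).
Display the results: plot(hc) for hclust, heatmap(result$z, col=redgreen(256)) for Mclust.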
Report assignment (LDA): (optional)
Apply LDA to the jawiki2 data (jawiki2_short_bow.mm, jawiki2_short_titles_tmp.txt, jawiki2_short_wordids_tmp.txt).
Submission: use R; submit the report as a pdf (e.g. produced with tex). Date: 8/8.
http://www.is.titech.ac.jp/~s-taiji/lecture/dataanalysis/dataanalysis.html