An Efficient Algorithm for Variable-Order Linear-Chain CRFs

Hiroshi Manabe

February 8, 2011

Summary

Conditional Random Fields (CRFs) are among the most popular models for sequence labeling, and the Linear-Chain CRF, which usually makes a first-order Markov assumption over the labels, is their most widely used form. Extending the model to higher Markov orders has been impractical because the computational cost grows exponentially in the order. Ye et al. (2009) gave an algorithm that computes the model expectations of higher-order features in polynomial time, but its cost remains high unless the higher-order features are sparse. This thesis proposes a more efficient algorithm for variable-order Linear-Chain CRFs and shows that, in English part-of-speech tagging, it yields higher accuracy than the conventional first-order model.

Abstract

An Efficient Algorithm for Variable-Order Linear-Chain CRFs

Hiroshi MANABE

Conditional Random Fields (CRFs) are one of the most popular models for sequence labeling. By applying the maximum entropy model to the whole sequence, they can consistently describe the relationship between the observations and the labels. The Linear-Chain CRF is the most prevalent and practical form of CRF; it restricts attention to linear-chain graphs and usually makes a first-order Markov assumption, which allows the model expectations of the feature functions to be computed efficiently with the forward-backward algorithm. Extending the model to higher Markov orders, however, has been problematic because the computational complexity grows exponentially in the order. Ye et al. (2009) presented an algorithm that computes the model expectations in polynomial time, but because its complexity is still high, it is practical only when the higher-order features are sparse. This thesis presents a novel algorithm that computes the model expectations of the feature functions more efficiently. Applied to English POS tagging, the resulting model yields higher accuracy than the traditional first-order Markov model.

Contents

1 Introduction 1
2 Sequence labeling models 2
  2.1 Hidden Markov Model (HMM) 2
  2.2 Maximum Entropy Markov Model (MEMM) 3
  2.3 Linear-Chain Conditional Random Fields (Linear-Chain CRF) 4
3 Training and decoding of Linear-Chain CRFs 5
  3.1 Parameter estimation 5
  3.2 Decoding 7
4 Higher-order Linear-Chain CRFs 9
  4.1 Variable-order Linear-Chain CRFs 9
  4.2 Related work: Ye (2009) 11
  4.3 Related work: Qian (2009) 11
5 The proposed algorithm 12
  5.1 Definitions 12
  5.2 Computing the feature expectations 14
    5.2.1 Pattern sets 15
    5.2.2 Pattern weights 19
    5.2.3 Forward scores 20
    5.2.4 Backward scores 21
    5.2.5 Marginals 23
    5.2.6 The complete procedure 24
  5.3 Decoding 27
    5.3.1 Forward decoding 27
    5.3.2 Backward decoding 30
6 Experiments 33
7 Conclusion 36
Acknowledgments 37
References 38


1 Introduction

Sequence labeling, the task of assigning a label to every element of an input sequence, is fundamental to natural language processing; part-of-speech (POS) tagging is a representative instance. Many models have been applied to it, including the Hidden Markov Model (HMM), the Support Vector Machine (SVM), the Maximum Entropy Markov Model (MEMM) [1], and Conditional Random Fields (CRFs) [2]. The Linear-Chain CRF combines the rich observation features of the MEMM with globally normalized training, avoiding the label bias problem, and high-accuracy taggers such as the Stanford Tagger [3] likewise rely on rich features. The Linear-Chain CRF is, however, usually restricted to a first-order Markov assumption over the labels. Ye et al. [4] proposed an algorithm for CRFs with higher-order features, but its computational cost limits its practicality. This thesis proposes an efficient algorithm for variable-order Linear-Chain CRFs, in which features over label patterns of different lengths can be mixed freely.

2 Sequence labeling models

Throughout, x and y denote the observation and label sequences, x_t and y_t their elements at position t, T the sequence length, and L the number of labels. Positions 0 and T+1 carry the special labels l_BOS and l_EOS. A generative model defines the joint distribution P(x, y), whereas a discriminative model defines the conditional distribution P(y | x) directly. This chapter reviews Linear-Chain Conditional Random Fields (Linear-Chain CRF) together with their two main predecessors, the Hidden Markov Model (HMM) and the Maximum Entropy Markov Model (MEMM).

2.1 Hidden Markov Model (HMM)

The Hidden Markov Model (HMM) is a generative model that factors the joint probability as

    P(x, y) = ∏_t P(y_t | y_{t-1}) P(x_t | y_t).    (1)

Like the Linear-Chain CRF, the HMM admits efficient exact inference: marginals are computed with the forward-backward algorithm and the most probable label sequence with the Viterbi algorithm. Unlike the Linear-Chain CRF, however, the HMM cannot easily incorporate rich, overlapping features of the observations.
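As a concrete illustration of Eq. (1) (my own sketch, not from the thesis; the probability tables below are hypothetical), the joint probability of a labeled sequence is just a product of transition and emission terms:

```python
# Toy evaluation of Eq. (1): P(x, y) = prod_t P(y_t | y_{t-1}) P(x_t | y_t).
# The tables are made up for the example; "BOS" plays the role of l_BOS.
trans = {("BOS", "N"): 0.6, ("BOS", "V"): 0.4,
         ("N", "V"): 0.7, ("N", "N"): 0.3,
         ("V", "N"): 0.8, ("V", "V"): 0.2}
emit = {("N", "dog"): 0.5, ("N", "walk"): 0.1,
        ("V", "dog"): 0.1, ("V", "walk"): 0.6}

def hmm_joint(xs, ys):
    p = 1.0
    prev = "BOS"
    for x, y in zip(xs, ys):
        p *= trans[(prev, y)] * emit[(y, x)]
        prev = y
    return p

print(hmm_joint(["dog", "walk"], ["N", "V"]))  # 0.6*0.5*0.7*0.6 = 0.126
```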

2.2 Maximum Entropy Markov Model (MEMM)

Like the CRF, the Maximum Entropy Markov Model (MEMM) models the conditional distribution P(y | x) directly, decomposing it into local distributions over each label y_t given the previous label and the observations,

    P(y_t | y_{t-1}, x, t),    (2)

each of which takes the maximum-entropy (log-linear) form

    P(y_t | y_{t-1}, x, t) = (1 / Z(x, t, y_{t-1})) exp( Σ_i λ_i f_i(x, t, y_{t-1}, y_t) ),    (3)

where Z(x, t, y_{t-1}) is the local normalizer that makes the probabilities sum to 1, the f_i are feature functions, and the λ_i (-∞ < λ_i < +∞) are their weights. The probability of the whole sequence is the product

    P(y | x) = ∏_t P(y_t | y_{t-1}, x, t).    (4)

The weights λ_i are estimated by maximum likelihood with a numerical optimizer such as L-BFGS [5]. The MEMM can use arbitrary features of the observations, which the HMM cannot, but because of its local normalization it suffers from the label bias problem; the Linear-Chain CRF described next avoids this.
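To make the local normalization of Eq. (3) concrete, here is a small Python sketch (my own illustration, with hypothetical features and weights): a softmax over the next label given the previous label and the current observation.

```python
import math

LABELS = ["N", "V"]

def features(x, prev_y, y):
    # Hypothetical binary features f_i(x, t, y_{t-1}, y_t).
    return [(("word", x, y), 1.0), (("trans", prev_y, y), 1.0)]

def memm_local(weights, x, prev_y):
    # Eq. (3): locally normalized distribution over y_t.
    scores = {y: math.exp(sum(weights.get(k, 0.0) * v
                              for k, v in features(x, prev_y, y)))
              for y in LABELS}
    z = sum(scores.values())            # Z(x, t, y_{t-1})
    return {y: s / z for y, s in scores.items()}

w = {("word", "dog", "N"): 1.5, ("trans", "BOS", "N"): 0.5}
print(memm_local(w, "dog", "BOS"))     # P(y_1 | y_0 = BOS, x, t = 1)
```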

2.3 Linear-Chain Conditional Random Fields (Linear-Chain CRF)

Conditional Random Fields (CRFs) were proposed to overcome the shortcomings of the MEMM. The most prevalent and practical CRF is the Linear-Chain Conditional Random Fields (Linear-Chain CRF), defined over a linear-chain graph:

    P(y | x) = (1 / Z(x)) exp( Σ_t Σ_i λ_i f_i(x, t, y_{t-1}, y_t) ),    (5)

where Z(x) is the normalizer that makes the probabilities sum to 1 and the f_i are feature functions. A feature that depends only on y_t is called a unigram feature, and one that depends on both y_{t-1} and y_t a bigram feature. The weights λ_i (-∞ < λ_i < +∞) are again estimated by maximum likelihood with, e.g., L-BFGS [5].

3 Training and decoding of Linear-Chain CRFs

This chapter reviews how the parameters of a Linear-Chain CRF are estimated and how the best label sequence is found.

3.1 Parameter estimation

Given a training set T, the parameters are chosen to maximize the L2-regularized conditional log-likelihood

    L_T = Σ_{(x̃, ỹ) ∈ T} log P(ỹ | x̃) - Σ_i λ_i² / (2 σ_reg²),    (6)

where σ_reg controls the strength of the regularization on the weights λ_i. Writing Ẽ[f_i] for the empirical expectation of f_i and E[f_i] for its expectation under the model, the gradient of L_T is

    ∂L_T / ∂λ_i = Ẽ[f_i] - E[f_i] - λ_i / σ_reg².    (7)

Training drives (7) to 0 for every i. The model expectations are obtained with the forward-backward algorithm. For a position t and a label y, define

    α(y, t) = Σ_{y_{0:t-1}} exp( Σ_{t'=1}^{t-1} Σ_i λ_i f_i(x, t', y_{t'-1}, y_{t'}) + Σ_i λ_i f_i(x, t, y_{t-1}, y) )    (8)

    β(y, t) = Σ_{y_{t+1:T+1}} exp( Σ_{t'=t+2}^{T+1} Σ_i λ_i f_i(x, t', y_{t'-1}, y_{t'}) + Σ_i λ_i f_i(x, t+1, y, y_{t+1}) )    (9)

with α(l_BOS, 0) = 1 and β(l_EOS, T+1) = 1.

α(y, t) and β(y, t) can be computed from α(·, t-1) and β(·, t+1) by the recurrences

    α(y, t) = Σ_{y'} α(y', t-1) exp( Σ_i λ_i f_i(x, t, y', y) )    (10)

    β(y, t) = Σ_{y'} β(y', t+1) exp( Σ_i λ_i f_i(x, t+1, y, y') )    (11)

so α and β for all positions are obtained in O(L²T) time (Algorithms 1 and 2).

Algorithm 1 Forward-backward: forward part
  α(l_BOS, 0) ← 1
  for t = 1 to T+1 do
    for all y_t do
      α(y_t, t) ← Σ_{y'} ( α(y', t-1) exp( Σ_i λ_i f_i(x, t, y', y_t) ) )

Algorithm 2 Forward-backward: backward part
  β(l_EOS, T+1) ← 1
  for t = T downto 0 do
    for all y_t do
      β(y_t, t) ← Σ_{y'} ( β(y', t+1) exp( Σ_i λ_i f_i(x, t+1, y_t, y') ) )

From α and β, the probability that positions t-1 and t carry labels y' and y is

    P(y_{t-1} = y', y_t = y) = (1 / Z(x)) α(y', t-1) exp( Σ_i λ_i f_i(x, t, y', y) ) β(y, t),    (12)

where Z(x) is the normalizer.
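As a concrete (and deliberately naive) illustration of Algorithms 1-2 and Eq. (12), here is a self-contained Python sketch; it is my own, not code from the thesis, and the `score` function below is a hypothetical stand-in for Σ_i λ_i f_i(x, t, y', y).

```python
import math

def forward_backward(score, labels, T):
    """Algorithms 1-2: score(t, y_prev, y) = sum_i lambda_i f_i(x, t, y_prev, y).
    Positions run 0..T+1; "BOS"/"EOS" are the special boundary labels."""
    alpha = [{y: 0.0 for y in labels} for _ in range(T + 2)]
    beta = [{y: 0.0 for y in labels} for _ in range(T + 2)]
    alpha[0]["BOS"] = 1.0
    for t in range(1, T + 2):                      # forward part
        for y in labels:
            alpha[t][y] = sum(alpha[t - 1][yp] * math.exp(score(t, yp, y))
                              for yp in labels)
    beta[T + 1]["EOS"] = 1.0
    for t in range(T, -1, -1):                     # backward part
        for y in labels:
            beta[t][y] = sum(beta[t + 1][yn] * math.exp(score(t + 1, y, yn))
                             for yn in labels)
    Z = alpha[T + 1]["EOS"]                        # Eq. (13): equals beta[0]["BOS"]
    def pair_marginal(t, yp, y):                   # Eq. (12)
        return alpha[t - 1][yp] * math.exp(score(t, yp, y)) * beta[t][y] / Z
    return Z, pair_marginal

labels = ["BOS", "N", "V", "EOS"]

def score(t, yp, y):
    # Hypothetical scores; -inf makes a transition impossible (exp gives 0.0).
    if t == 3:                                     # position T+1 must be EOS
        return 0.0 if y == "EOS" and yp in ("N", "V") else float("-inf")
    if y in ("BOS", "EOS") or yp == "EOS":
        return float("-inf")
    return 0.5 if y == "N" else 0.2

Z, pair_marginal = forward_backward(score, labels, T=2)
print(Z, pair_marginal(1, "BOS", "N"))
```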

The normalizer is obtained as a by-product:

    Z(x) = α(l_EOS, T+1) = β(l_BOS, 0).    (13)

3.2 Decoding

Given an input x, decoding a Linear-Chain CRF finds the most probable label sequence y*:

    y* = argmax_y (1 / Z(x)) exp( Σ_t Σ_i λ_i f_i(x, t, y_{t-1}, y_t) ) = argmax_y exp( Σ_t Σ_i λ_i f_i(x, t, y_{t-1}, y_t) ).    (14)

This is done with the Viterbi algorithm. Define

    p(y_t, t) = max_{y_{t-1}} p(y_{t-1}, t-1) exp( Σ_i λ_i f_i(x, t, y_{t-1}, y_t) )    (15)

    q(y_t, t) = argmax_{y_{t-1}} p(y_{t-1}, t-1) exp( Σ_i λ_i f_i(x, t, y_{t-1}, y_t) )    (16)

p and q are filled in by Algorithm 3, and backtracking through q (Algorithm 4) recovers y*. The time complexity is O(L²T).

Algorithm 3 Viterbi algorithm (1)
  p(l_BOS, 0) ← 1
  for t = 1 to T+1 do
    for all y_t do
      p(y_t, t) ← max_{y_{t-1}} ( p(y_{t-1}, t-1) exp( Σ_i λ_i f_i(x, t, y_{t-1}, y_t) ) )
      q(y_t, t) ← argmax_{y_{t-1}} ( p(y_{t-1}, t-1) exp( Σ_i λ_i f_i(x, t, y_{t-1}, y_t) ) )

Algorithm 4 Viterbi algorithm (2)
  for t = T to 1 do
    y_t ← q(y_{t+1}, t+1)
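A minimal Python sketch of Algorithms 3-4 (again my own illustration, reusing the hypothetical `score` and `labels` from the previous sketch):

```python
import math

def viterbi(score, labels, T):
    """Algorithm 3: p holds the best prefix scores, q the backpointers."""
    p = [{y: 0.0 for y in labels} for _ in range(T + 2)]
    q = [{y: None for y in labels} for _ in range(T + 2)]
    p[0]["BOS"] = 1.0
    for t in range(1, T + 2):
        for y in labels:
            best = max(labels,
                       key=lambda yp: p[t - 1][yp] * math.exp(score(t, yp, y)))
            p[t][y] = p[t - 1][best] * math.exp(score(t, best, y))
            q[t][y] = best
    # Algorithm 4: backtrack from EOS at position T+1.
    seq, y = [], "EOS"
    for t in range(T + 1, 0, -1):
        y = q[t][y]
        seq.append(y)
    seq.reverse()
    return seq[1:]                     # y_1 .. y_T (seq[0] is BOS)

# With the toy `score` and `labels` from the previous sketch:
# viterbi(score, labels, T=2)  ->  ['N', 'N']
```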

4 Higher-order Linear-Chain CRFs

A first-order Linear-Chain CRF captures only dependencies between adjacent labels. An n-th order model conditions each label on the n preceding labels, so its features refer to label (n+1)-grams; there are L^n label n-grams in total. For example, over the label set {a, b, c} there are 9 bigrams, and a trigram such as acb is one of 27; the trigrams cba, cbb, cbc all share the 2-label history cb. A naive extension of the forward-backward algorithm to order n keeps L^n states per position and costs O(L^{n+1} T) time, which is exponential in n and quickly becomes impractical.

4.1 Variable-order Linear-Chain CRFs

In practice, long label histories matter only for some label patterns, so forcing every feature to a single order n wastes both parameters and computation. A variable-order model instead lets features attach to label patterns of different lengths, mixing unigram, bigram, and longer patterns in one model.

For example, over the label set {a, b, c} (ordered a > b > c) the possible bigrams are aa, ab, ac, ba, bb, bc, ca, cb, cc. Suppose the model has unigram features for a, b, and c but bigram features only for ca and cc. Then no feature distinguishes the remaining bigrams from their unigram suffixes, and the lattice used for inference does not need to distinguish them either: only the histories that some feature pattern actually refers to must be kept apart. (In a comparison run with CRFSuite [6], the corresponding accuracies were 96.06% and 96.00%.) Models of this kind are called variable-order Linear-Chain CRFs. Ye et al. [4] proposed an algorithm for CRFs with high-order features, and Qian et al. [7] proposed Sparse Higher Order CRFs (SHO-CRFs); both are reviewed below.

4.2 Related work: Ye (2009)

Ye et al. [4] showed that the model expectations of high-order features in a Linear-Chain CRF can be computed in polynomial time. Their algorithm runs in O(M³K⁴X̄) time, where M, K, and X̄ are as defined in [4], compared with O(MKX̄) for the first-order case, so it is practical only when the high-order features are sparse.

4.3 Related work: Qian (2009)

Qian et al. proposed Sparse Higher Order CRFs (SHO-CRFs) [7], which also exploit the sparsity of the high-order features. Their computation works with products of per-position normalizers of the form Z_{s:t} = Z_s Z_{s+1} ⋯ Z_t. Qian et al. report that SHO-CRFs are efficient when the feature patterns are sparse.

5 The proposed algorithm

This chapter presents the proposed training and decoding algorithms for variable-order Linear-Chain CRFs.

5.1 Definitions

Let x and y be the observation and label sequences, both of length T, and let Y = {l_1, ..., l_L} be the label set. Positions 0 and T+1 carry the special labels l_BOS and l_EOS. For each position t, define

    Y_t = {l_BOS} if t = 0;  {l_EOS} if t = T+1;  Y otherwise.    (17)

A label pattern z over positions n, ..., m (written z ∈ Y_{n:m}) satisfies |z| = m - n + 1 and z_t ∈ Y_{n+t-1} for 1 ≤ t ≤ m - n + 1. For example, when L = 3 and T = 2 we have Y_0 = {l_BOS}, Y_1 = Y_2 = {l_1, l_2, l_3}, Y_3 = {l_EOS}, and

{z | z ∈ Y_{0:3}} = {l_BOS l_1 l_1 l_EOS, l_BOS l_1 l_2 l_EOS, l_BOS l_1 l_3 l_EOS, l_BOS l_2 l_1 l_EOS, l_BOS l_2 l_2 l_EOS, l_BOS l_2 l_3 l_EOS, l_BOS l_3 l_1 l_EOS, l_BOS l_3 l_2 l_EOS, l_BOS l_3 l_3 l_EOS}.

For a pattern z, z_{i:j} denotes the subsequence from its i-th through j-th elements; when j < i, z_{i:j} is the empty pattern.

The empty pattern is written ϵ. For patterns z¹ and z², z³ = z¹ + z² denotes their concatenation. z¹ ≤_s z² means that z¹ is a suffix of z² (in particular, z ≤_s z and ϵ ≤_s z for every z), and z¹ <_s z² means that z¹ is a proper suffix of z² (z¹ ≤_s z² and z¹ ≠ z²). Equivalently,

    z¹ <_s z²  iff  (z¹ = ϵ and z² ≠ ϵ), or ( z¹_{|z¹|} = z²_{|z²|} and z¹_{1:|z¹|-1} <_s z²_{1:|z²|-1} ).    (18)

The suffix relation is a partial order, but any two suffixes of a common pattern are comparable:

    z¹ <_s z³ and z² <_s z³  implies  z¹ <_s z² or z² ≤_s z¹.    (19)

For a pattern z¹ and a pattern set S, s(z¹, S) denotes the longest proper suffix of z¹ contained in S:

    s(z¹, S) = z²  iff  z² ∈ S, z² <_s z¹, and for all z ∈ S: z <_s z¹ implies z ≤_s z².    (20)

Likewise, S(z¹, S) denotes the longest (not necessarily proper) suffix of z¹ in S:

    S(z¹, S) = z²  iff  z² ∈ S, z² ≤_s z¹, and for all z ∈ S: z ≤_s z¹ implies z ≤_s z².    (21)

In particular, S(z, S) = z whenever z ∈ S.
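To make definitions (20)-(21) concrete, here is a direct (unoptimized) Python sketch of suffix testing and longest-suffix lookup; it is my own illustration, and a real implementation would index the patterns, e.g., in a trie over reversed patterns. By (19) the suffixes of z within S are totally ordered, so taking the longest candidate is well defined.

```python
def is_suffix(z1, z2):
    # z1 <=_s z2: z2 ends with z1; the empty tuple is a suffix of everything.
    return len(z1) <= len(z2) and z2[len(z2) - len(z1):] == z1

def s_strict(z, S):
    # s(z, S), Eq. (20): longest *proper* suffix of z contained in S.
    cands = [p for p in S if p != z and is_suffix(p, z)]
    return max(cands, key=len) if cands else None

def S_longest(z, S):
    # S(z, S), Eq. (21): like s(z, S), but z itself is allowed.
    cands = [p for p in S if is_suffix(p, z)]
    return max(cands, key=len) if cands else None

S = {(), ("b",), ("a", "b"), ("c",)}
print(s_strict(("a", "b"), S))    # ('b',)
print(S_longest(("a", "b"), S))   # ('a', 'b')
```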

Note that s(ϵ, S) does not exist, since ϵ has no proper suffix. Each feature function f_i (i = 1, ..., m) is assumed to factor into an observation part b_i(x, t) and a label part L_i(y, t):

    f_i(x, y, t) = b_i(x, t) L_i(y, t),    (22)

where L_i fires exactly when the labels ending at position t match the pattern z^i associated with f_i:

    L_i(y, t) = 1 if y_{t-|z^i|+1:t} = z^i, and 0 otherwise.    (23)

For example, if z^i = l_1 l_2, then L_i(y, t) = 1 exactly when y_{t-1} = l_1 and y_t = l_2. Every feature is assumed to have |z^i| ≥ 1.

5.2 Computing the feature expectations

The model expectation of a feature f_i is

    E[f_i] = Σ_{(x̃, ỹ) ∈ T} Σ_y P(y | x̃) Σ_{t=|z^i|-1}^{T_x̃+1} f_i(x̃, y, t),    (24)

where T_x̃ is the length of x̃; the inner sum starts at t = |z^i| - 1 because a pattern of length |z^i| can end no earlier, position 0 included.

These expectations are required by the L-BFGS [5] optimizer at every iteration. Figure 1 shows the running example used throughout this chapter: an input of length T = 3, with positions 0 through T+1 and l_BOS, l_EOS at positions 0 and T+1. Each observation is annotated with the feature templates (the b_i) active at it; a0 is active at every position, while a1 through a4 fire only where shown (in the example, a1 fires on words ending in "s" and a3 on words ending in "d"). Each node of the lattice is a label pattern (an L_i), annotated with the values exp(w), exp(W), σ, α, γ, β, and δ defined below; the feature annotations give exp(λ_i), and "exp" is omitted in the figure. For example, a3:[YZ]:2.80 denotes the feature with template a3 and label pattern YZ, with exp(λ_i) = 2.80. Here Y = {X, Y, Z}, l_BOS = B, and l_EOS = E.

5.2.1 Pattern sets

For each position t, define the pattern set

    P_t = Y_t ∪ {ϵ} ∪ { z^k_{1:|z^k|-(t'-t)} | t' ≥ t, b_k(x, t') = 1, |z^k| > t' - t }.    (25)

That is, besides the empty pattern and the single labels of Y_t, P_t contains every truncation of a feature pattern z^k obtained by cutting the last t' - t labels off, for every feature active at a position t' ≥ t.

P_t is the set of label patterns that inference must distinguish at position t. In the running example of Figure 1, the templates active at t = 3 are a0 and a3, contributing the twelve features a0:[X]:0.10, a0:[Y]:0.20, a0:[Z]:0.30, a0:[XY]:0.70, a0:[XYZ]:0.80, a0:[YX]:0.90, a0:[YZ]:1.00, a0:[ZY]:1.10, a3:[YYY]:2.50, a3:[Y]:2.60, a3:[Z]:2.70, and a3:[YZ]:2.80, whose patterns are the nine distinct X, Y, Z, YX, XY, YYY, ZY, YZ, XYZ; together with ϵ, P_3 therefore has the ten members shown at t = 3 in Figure 1. A single pattern can carry several features: at t = 3, YZ carries both a0:[YZ]:1.00 and a3:[YZ]:2.80. By (25), P_t also receives truncated patterns from later positions: YY belongs to P_2 even though no feature active at t = 2 (templates a0 and a1) has pattern YY, because the feature a3:[YYY]:2.50 active at t = 3 contributes its first two labels YY to P_2.

The pattern sets have two key properties. First,

    z ∈ P_t, z ≠ ϵ, t > 0  implies  z_{1:|z|-1} ∈ P_{t-1},    (26)

which follows directly from (25). Second, continuations cannot distinguish a pattern from its longest suffix in P_t: for any z ending at position t,

    Σ_{t'=t}^{T+1} Σ_i f_i(x, z + z', t') = Σ_{t'=t}^{T+1} Σ_i f_i(x, S(z, P_t) + z', t')  for every z' ∈ Y_{t+1:T+1},    (27)

where f_i(x, z + z', t') abbreviates the value of f_i on any label sequence that ends with z + z'; by (23) this value depends only on the last |z^i| labels. To see (27), suppose a feature f_i with b_i(x, t') = 1 fires on z + z' at position t'; the part of z^i that overlaps z is a suffix of z and, by (25), belongs to P_t, hence by (21) it is also a suffix of S(z, P_t), so the same feature fires on S(z, P_t) + z'. The converse direction is symmetric, using (19) and the fact that S(z, P_t) ≤_s z.

[Figure 1 shows the pattern lattice for the example input "(BOS) This/a0,a1,a2 is/a0,a1 good/a0,a3 (EOS)/a0,a4" under the feature set a0:[X]:0.10, a0:[Y]:0.20, a0:[Z]:0.30, a0:[E]:0.40, a0:[BX]:0.50, a0:[BXY]:0.60, a0:[XY]:0.70, a0:[XYZ]:0.80, a0:[YX]:0.90, a0:[YZ]:1.00, a0:[ZY]:1.10, a0:[XE]:1.20, a0:[YE]:1.30, a0:[ZE]:1.40, a0:[ZYE]:1.50, a0:[YZE]:1.60, a1:[X]:1.70, a1:[Y]:1.80, a1:[Z]:1.90, a1:[BX]:2.00, a1:[XY]:2.10, a2:[X]:2.20, a2:[Y]:2.30, a2:[Z]:2.40, a3:[YYY]:2.50, a3:[Y]:2.60, a3:[Z]:2.70, a3:[YZ]:2.80, a4:[ZE]:2.90 (all values are exp(λ_i)). Each node is a pattern z ∈ P_t annotated with its exp(w), exp(W), σ, α, γ, β, and δ values; for instance, at t = 4 the ϵ node has γ = Z(x) = 9.24.]

Figure 1: The pattern lattice for the running example.

5.2.2 Pattern weights

For a pattern z_p ∈ P_t, define its weight at t as the total weight of the features whose pattern is exactly z_p and which are active at t:

    w(z_p, t) = Σ_{i : z^i = z_p, b_i(x,t)=1} λ_i.    (28)

Figure 1 shows exp(w) at each node: at t = 2, for instance, XY carries a0:[XY]:0.70 and a1:[XY]:2.10, so exp(w) = 0.70 × 2.10 = 1.47 (recall that the figure lists exp(λ_i), so the weights multiply). Define also the cumulative weight, which sums over all features whose pattern is a suffix of z_p:

    W(z_p, t) = Σ_{i : z^i ≤_s z_p, b_i(x,t)=1} λ_i.    (29)

(Figure 1 likewise shows exp(W).) Comparing (29) with (28),

    W(z_p, t) = Σ_{z^i ∈ P_t, z^i ≤_s z_p} w(z^i, t).    (30)

At t = 3, for example, YX has exp(w) = 0.90 and X has exp(w) = 0.10, so YX has exp(W) = 0.90 × 0.10 = 0.09. Instead of evaluating (29) directly, W can be computed incrementally: for z_p ≠ ϵ,

    W(z_p, t) = W(s(z_p, P_t), t) + w(z_p, t),

because by (20) every pattern of P_t that is a proper suffix of z_p is also a suffix of s(z_p, P_t).
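A small sketch of this incremental computation (my own, reusing s_strict from the earlier sketch and working in log space, so weights add rather than multiply): patterns are processed shortest-first, which guarantees that the longest proper suffix has already been handled.

```python
def cumulative_weights(P_t, w):
    """W(z) = w(z) + W(s(z, P_t)) for z != (), cf. Eq. (30).
    P_t: set of patterns (tuples, containing ()); w: pattern -> weight."""
    W = {}
    for z in sorted(P_t, key=len):          # shortest first: suffixes precede z
        s = s_strict(z, P_t)                # None only for the empty pattern
        W[z] = w.get(z, 0.0) + (W[s] if s is not None else 0.0)
    return W

P3 = {(), ("X",), ("Y", "X")}
print(cumulative_weights(P3, {("X",): 0.1, ("Y", "X"): 0.9}))
# {(): 0.0, ('X',): 0.1, ('Y', 'X'): 1.0}
```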

5.2.3 Forward scores

For z_p ∈ P_t with z_p ≠ ϵ, define

    α(z_p, t) = Σ_{z ∈ Y_{0:t}, z_p ≤_s z} exp( Σ_{t'=1}^{t-1} Σ_{i: f_i(x,z,t')=1} λ_i ) - Σ_{z ∈ Y_{0:t}, ∃z'∈P_t (s(z',P_t)=z_p and z' ≤_s z)} exp( Σ_{t'=1}^{t-1} Σ_{i: f_i(x,z,t')=1} λ_i ),    (31)

and α(ϵ, t) = 0. In words, α(z_p, t) is the total weight, counting feature firings up to position t-1 only, of the label sequences over positions 0..t whose longest suffix in P_t is z_p itself; the subtracted term removes the sequences that also match a longer pattern of P_t. In Figure 1, at t = 1 the node X has α = 0.00: every sequence ending in X at t = 1 also matches the longer pattern BX, since position 0 always carries B. At t = 2, X has α = 1.74, the weight of the sequences ending in XX or ZX, i.e., ending in X but not in the longer pattern YX ∈ P_2.

For z_p ∈ P_t, define also

    γ(z_p, t) = Σ_{z ∈ Y_{0:t}, z_p ≤_s z} exp( Σ_{t'=1}^{t} Σ_{i: f_i(x,z,t')=1} λ_i ),    (32)

the total weight, now including position t, of all label sequences over 0..t that end with z_p, with γ(ϵ, 0) = γ(l_BOS, 0) = 1. At t = 2, X has γ = 0.42, the weight of all sequences ending in X.

α and γ satisfy the mutual recurrences

    α(z_p, t) = γ(z_p_{1:|z_p|-1}, t-1) - Σ_{z ∈ P_t : s(z, P_t) = z_p} γ(z_{1:|z|-1}, t-1)    (33)

    γ(z_p, t) = Σ_{z ∈ P_t : z_p ≤_s z} α(z, t) exp(W(z, t)).    (34)

In (33), the argument z_p_{1:|z_p|-1} of the first term lies in P_{t-1} by (26), and the subtracted terms remove exactly the sequences whose longest match in P_t is some longer pattern z. In (34), each α(z, t) carries the factor exp(W(z, t)), which adds the weights of all features at t whose pattern is a suffix of z; γ can therefore be accumulated from longer patterns to shorter ones as γ(z, t) = α(z, t) exp(W(z, t)) + Σ_{z' ∈ P_t : s(z', P_t) = z} γ(z', t). In Figure 1, at t = 2: γ(YX) = α(YX) exp(W(YX)) = 0.83 × 0.15 ≈ 0.13, and γ(X) = γ(YX) + α(X) exp(W(X)) = 0.13 + 1.74 × 0.17 ≈ 0.42. The sequences ending in YX are handled entirely at the node YX, whose W already includes the features it shares with the shorter pattern X.

z, t ) z p P t t T δ(z p, t) δ(z p, t) def = T +1 exp λ i z Y t+1:t +1 t =t+1 i:f i (x,z p +z,t )=1 T +1 exp λ i t =t+1 i:f i (x,s(z p,p t)+z,t )=1 (36) t + 1 T + 1 z p z p δ(ϵ, T + 1) = 1 δ(z p, t) = β(z p, t) = z P t+1,z 1: z 1 =z p (β(z, t + 1) exp(w (z, t + 1)) β(s(z, P t+1 ), t + 1) exp(w (s(z, P t+1 ), t + 1))) (37) δ(z, t) z P t,z sz p (38) (37) (36) T +1 exp λ i exp λ i z Y t+2:t +1 z Y t+1:t+1 t =t+2 i:f i (x,z p +z +z,t )=1 i:f i (x,z p +z,t+1) T +1 exp λ i exp λ i (39) t =t+2 i:f i (x,s(z p,p t )+z +z,t )=1 i:f i (x,s(z p,p t )+z +z,t) (21) (39) exp T +1 exp λ i t =t+2 i:f i (x,s(z p +z,p t+1 )+z,t )=1 T +1 λ i t =t+2 i:f i (x,s(s(z p,p t)+z,p t+1 )+z,t )=1 exp exp z Y t+2:t +1 z Y t+1:t+1 i:f i (x,s(z p +z,p t+1 ),t+1) i:f i (x,s(s(z p,p t)+z,p t+1 ),t) λ i λ i (40) s(z p + z, P t+1 ) s s(z p, P t ) + z (41) 22

Indeed, writing u = s(z_p + z', P_{t+1}): u <_s z_p + z', so u_{1:|u|-1} <_s z_p; by (26), u_{1:|u|-1} ∈ P_t, hence u_{1:|u|-1} ≤_s s(z_p, P_t) by (20), which gives (41). From (41) and (20) applied at position t+1,

    S(s(z_p, P_t) + z', P_{t+1}) = s(z_p + z', P_{t+1}),    (42)

and when z_p + z' ∈ P_{t+1}, by (21)

    S(z_p + z', P_{t+1}) = z_p + z'.    (43)

Substituting these into the split sums shows that the contributions of z_p + z' and s(z_p, P_t) + z' cancel unless z_p + z' ∈ P_{t+1}; collecting the remaining terms with (35), (29), and the definition of W yields (37), and (38) follows analogously (44). In Figure 1, at t = 3 the node X has δ = 0.08: the only z ∈ P_4 with prefix X is XE, and β(XE, 4) exp(W(XE, 4)) - β(E, 4) exp(W(E, 4)) = 0.48 - 0.40 = 0.08 (both β values are 1.0).

5.2.5 Marginals

For z_p ∈ P_t, define

    θ(z_p, t) = α(z_p, t) exp(W(z_p, t)) β(z_p, t)    (45)

    σ(z_p, t) = Σ_{z ∈ P_t : z_p ≤_s z} θ(z, t).    (46)

θ(z_p, t) is the total weight of the complete label sequences (positions 0 through T+1) whose longest suffix-match in P_t at position t is exactly z_p. In Figure 1, at t = 2 the node X has θ = 0.46 (θ itself is not shown in the figure): the weight of the complete sequences that end in X but not in YX at t = 2, i.e., that pass through XX or ZX at positions 1-2. σ(z_p, t) is the total weight of all complete sequences in which z_p ends at position t; at t = 2, X has σ = 0.66. σ is accumulated from longer patterns to shorter ones: once σ(z_p, t) = θ(z_p, t) + Σ_{z : s(z, P_t) = z_p} σ(z, t) is complete, it is added into σ(s(z_p, P_t), t) (cf. Algorithm 6). The normalizer is available from the forward pass as

    Z(x) = γ(ϵ, T+1),    (47)

and the marginal probability that the labels ending at position t match z_p is

    P(z_p ≤_s y_{1:t} | x) = σ(z_p, t) / Z(x).    (48)

5.2.6 The complete procedure

For t = 1, ..., T+1 and z ∈ P_t, let F_{z,t} denote the set of features with pattern z that are active at t. Algorithm 5 builds the pattern sets: every feature f_i active at position t contributes its pattern z^i to P_t and the truncations z^i_{1:|z^i|-j} to P_{t-j}. Each P_t is kept ordered by the suffix relation, so that a pattern follows all of its proper suffixes (ascending order; descending is the reverse), and each z ≠ ϵ in P_t stores links to z_{1:|z|-1} ∈ P_{t-1} (cf. (26)) and to s(z, P_t).

Algorithm 5 Make paths
  for t = 1 to T+1 do
    F ← {f_k | b_k(x, t) = 1}
    for all f_i ∈ F do
      F_{z^i,t} ← F_{z^i,t} ∪ {f_i}
      for j = 0 to |z^i| do
        P_{t-j} ← P_{t-j} ∪ {z^i_{1:|z^i|-j}}

Algorithm 6 then performs the whole computation; all quantities (w, α, γ, β, δ, σ, and the E[f_i]) start at 0 except where the algorithm sets them. Note how (33) is realized: processing the patterns in descending order, each z both adds γ(z_{1:|z|-1}, t-1) to α(z, t) and subtracts it from α(s(z, P_t), t), so no pattern ever enumerates the set {z : s(z, P_t) = z_p} explicitly and every pattern is touched a constant number of times. The total cost of one pass is

    O( Σ_{t=1}^{T+1} Σ_{z_p ∈ P_t} |F_{z_p,t}| + Σ_{t=1}^{T+1} |P_t| ),    (49)

linear in the number of active features plus the total size of the pattern sets; the γ and δ values are exchanged only between adjacent positions, and exp is applied once per node.
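A Python sketch of Algorithm 5 (my illustration; `active_features(t)`, yielding (pattern, feature-id) pairs with b_i(x, t) = 1, is a hypothetical interface):

```python
from collections import defaultdict

def make_paths(active_features, T):
    """Algorithm 5: build the pattern sets P_t and feature buckets F_{z,t}."""
    P = defaultdict(set)    # P[t]: patterns needed at position t
    F = defaultdict(list)   # F[(z, t)]: ids of features with pattern z active at t
    for t in range(1, T + 2):
        for z, i in active_features(t):
            F[(z, t)].append(i)
            for j in range(len(z) + 1):
                P[t - j].add(z[:len(z) - j])  # truncate j labels off the right end
    # Per Eq. (25), each P[t] (t = 0..T+1) would additionally be seeded with
    # Y_t and the empty pattern (); this sketch leaves that to the caller.
    return P, F
```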

Algorithm 6 Sum-difference
  {cumulative weights}
  for t = 1 to T+1 do
    for all z (≠ ϵ) ∈ P_t in ascending order do
      for all f_i ∈ F_{z,t} do
        w(z, t) ← w(z, t) + λ_i
      W(z, t) ← W(s(z, P_t), t) + w(z, t)
  {forward}
  γ(ϵ, 0) ← 1, γ(l_BOS, 0) ← 1
  for t = 1 to T+1 do
    for all z (≠ ϵ) ∈ P_t in descending order do
      α(s(z, P_t), t) ← α(s(z, P_t), t) - γ(z_{1:|z|-1}, t-1)
      α(z, t) ← α(z, t) + γ(z_{1:|z|-1}, t-1)
    α(ϵ, t) ← 0
    for all z (≠ ϵ) ∈ P_t in descending order do
      γ(z, t) ← γ(z, t) + α(z, t) exp(W(z, t))
      γ(s(z, P_t), t) ← γ(s(z, P_t), t) + γ(z, t)
  Z(x) ← γ(ϵ, T+1)
  {backward}
  δ(ϵ, T+1) ← 1
  for t = T+1 downto 1 do
    β(ϵ, t) ← δ(ϵ, t)
    for all z (≠ ϵ) ∈ P_t in ascending order do
      β(z, t) ← β(s(z, P_t), t) + δ(z, t)
    for all z (≠ ϵ) ∈ P_t do
      δ(z_{1:|z|-1}, t-1) ← δ(z_{1:|z|-1}, t-1) + β(z, t) exp(W(z, t)) - β(s(z, P_t), t) exp(W(s(z, P_t), t))
  {marginals and expectations}
  for t = 1 to T+1 do
    for all z (≠ ϵ) ∈ P_t in descending order do
      θ(z, t) ← α(z, t) exp(W(z, t)) β(z, t)
      σ(z, t) ← σ(z, t) + θ(z, t)
      σ(s(z, P_t), t) ← σ(s(z, P_t), t) + σ(z, t)
      for all f_i ∈ F_{z,t} do
        E[f_i] ← E[f_i] + σ(z, t) / Z(x)
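For concreteness, here is a Python sketch of the first two phases of Algorithm 6 (weights and forward pass), returning Z(x). It is my own illustration built on make_paths and s_strict from the sketches above, and it assumes every P[t] contains the empty pattern. Sorting by pattern length stands in for the suffix order, which is all these accumulations need, since s(z, P_t) is always strictly shorter than z.

```python
import math
from collections import defaultdict

def forward_pass(P, F, lam, T, BOS=("B",)):
    """Weights and forward phases of Algorithm 6; returns Z(x) = gamma(eps, T+1)."""
    W = {}
    for t in range(1, T + 2):
        for z in sorted(P[t], key=len):                # ascending: suffixes first
            w = sum(lam[i] for i in F.get((z, t), ()))
            W[(z, t)] = w if z == () else w + W[(s_strict(z, P[t]), t)]
    alpha = defaultdict(float)
    gamma = defaultdict(float)
    gamma[((), 0)] = gamma[(BOS, 0)] = 1.0
    for t in range(1, T + 2):
        for z in sorted(P[t], key=len, reverse=True):  # descending order
            if z == ():
                continue
            g = gamma[(z[:-1], t - 1)]                 # gamma(z_{1:|z|-1}, t-1)
            alpha[(s_strict(z, P[t]), t)] -= g         # Eq. (33), subtraction
            alpha[(z, t)] += g
        alpha[((), t)] = 0.0
        for z in sorted(P[t], key=len, reverse=True):
            gamma[(z, t)] += alpha[(z, t)] * math.exp(W[(z, t)])
            if z != ():
                gamma[(s_strict(z, P[t]), t)] += gamma[(z, t)]   # Eq. (34)
    return gamma[((), T + 1)]                           # Z(x), Eq. (47)
```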

5.3 Decoding

Decoding finds

    y* = argmax_y (1 / Z(x)) exp( Σ_t Σ_i λ_i f_i(x, y, t) ) = argmax_y exp( Σ_t Σ_i λ_i f_i(x, y, t) ).    (50)

This section gives two decoding algorithms for the variable-order model: a forward algorithm with time complexity O( Σ_{t=1}^{T+1} Σ_{i : b_i(x,t)=1} 1 + L Σ_{t=1}^{T+1} Σ_{z_p ∈ P_t} 1 ) and a backward algorithm with time complexity O( Σ_{t=1}^{T+1} Σ_{i : b_i(x,t)=1} 1 + log L Σ_{t=1}^{T+1} Σ_{z_p ∈ P_t} 1 ). Both first compute the cumulative weights W with Algorithm 7, which is identical to the first phase of Algorithm 6.

5.3.1 Forward decoding

For z_p ∈ P_t, define

    p(z_p, t) = max_z p(z, t-1) · exp(W(z_p, t))    (51)

    q(z_p, t) = argmax_z p(z, t-1),    (52)

where z ranges over the patterns of P_{t-1} from which appending the last label of z_p leads exactly to the state z_p: that is, z_p_{1:|z_p|-1} ≤_s z, and no z' ∈ P_t with z_p <_s z' satisfies z'_{1:|z'|-1} ≤_s z.

Algorithm 7 Calculate weights
  for t = 1 to T+1 do
    for all z (≠ ϵ) ∈ P_t in ascending order do
      for all f_i ∈ F_{z,t} do
        w(z, t) ← w(z, t) + λ_i
      W(z, t) ← W(s(z, P_t), t) + w(z, t)

Here p(z_p, t) is the best total weight of a partial labeling y_{0:t} whose state at position t, i.e., its longest suffix in P_t, is z_p, and q(z_p, t) is the state at t-1 from which that best score is reached. Algorithm 8 computes p and q, sharing the running maxima among the states of P_{t-1} through the auxiliary tables r and u; it assumes the structures built by Algorithms 5 and 7. Algorithm 9 then backtracks through q to output y*. The sweep over P_{t-1} in Algorithm 8 costs O(L) per pattern in the worst case, giving the first complexity bound above.

Algorithm 8 Decode-forward (1)
  p(ϵ, 0) ← 1, p(l_BOS, 0) ← 1
  for t = 1 to T+1 do
    for all z (≠ ϵ) ∈ P_t in descending order do
      if this is the first iteration or the last label in z has changed then
        z' ← the first element of P_{t-1} in descending order
        for all z'' ∈ P_{t-1} do
          r(z'', t-1) ← p(z'', t-1)
          u(z'', t-1) ← z''
      end if
      while z' ≠ z_{1:|z|-1} do
        if r(z', t-1) > r(s(z', P_{t-1}), t-1) then
          r(s(z', P_{t-1}), t-1) ← r(z', t-1)
          u(s(z', P_{t-1}), t-1) ← u(z', t-1)
        end if
        z' ← the next element of P_{t-1} in descending order
      end while
      p(z, t) ← r(z_{1:|z|-1}, t-1) · exp(W(z, t))
      q(z, t) ← u(z_{1:|z|-1}, t-1)

Algorithm 9 Decode-forward (2)
  z ← argmax_{z' ∈ P_{T+1}} p(z', T+1)
  for t = T+1 downto 1 do
    y_t ← the last label of z
    z ← q(z, t)
  y_0 ← l_BOS

5.3.2 Backward decoding

Symmetrically, for z_p ∈ P_t define

    p(z_p, t) = max_z p(z, t+1) · exp(W(z_p, t))    (53)

    q(z_p, t) = argmax_z p(z, t+1),    (54)

where z ranges over the patterns of P_{t+1} reachable from the state z_p: z_{1:|z|-1} ≤_s z_p, and no z' ∈ P_{t+1} with z <_s z' satisfies z'_{1:|z'|-1} ≤_s z_p. Thus p(z_p, t) is the best weight of a continuation from state z_p at position t through T+1, and q(z_p, t) is the state at t+1 that attains it. Algorithm 10 computes p and q using auxiliary tables r, u, v, a label l, and a heap H with the operations Heap-insert, Heap-delete, and Heap-max; Heap-insert and Heap-delete take O(log L) time and Heap-max O(1). For each t it uses the prefix set

    Q_t = { z_{1:|z|-1} | z ∈ P_t }.    (55)

For a prefix z¹ ∈ Q_t and a label l, the pattern z² = z¹ + l may belong to P_t; the heap maintains, for every label l, the best p value among the candidate patterns currently compatible with the prefix being processed. Algorithm 10 assumes the structures built by Algorithms 5 and 7, and Algorithm 11 backtracks through q to output y*.

Algorithm 10 Decode-backward (1)
  v(ϵ, T+1) ← 1
  for t = T+1 downto 1 do
    for all z (≠ ϵ) ∈ P_t in ascending order do
      if v(z, t) = 0 then
        v(z, t) ← v(s(z, P_t), t)
        q(z, t) ← q(s(z, P_t), t)
      end if
      p(z, t) ← v(z, t) · exp(W(z, t))
    H ← empty heap
    for all z ∈ P_t with |z| = 1 do
      Heap-insert(H, z_1, p(z, t))
      u(z_1) ← p(z, t)
    z ← ϵ
    for all z_p (≠ ϵ) ∈ Q_t in ascending order do
      while z ≠ s(z_p, Q_t) do
        for all z' ∈ P_t with z'_{1:|z'|-1} = z do
          Heap-delete(H, z'_{|z'|})
          Heap-insert(H, z'_{|z'|}, r(z'))
        z ← s(z, Q_t)
      end while
      for all z' ∈ P_t with z'_{1:|z'|-1} = z_p do
        r(z') ← u(z'_{|z'|})
        u(z'_{|z'|}) ← p(z', t)
        Heap-delete(H, z'_{|z'|})
        Heap-insert(H, z'_{|z'|}, p(z', t))
      l ← Heap-max(H)
      q(z_p, t) ← s(z_p + l, P_{t+1})
      v(z_p, t) ← u(l)
      z ← z_p

Algorithm 11 Decode-backward (2)
  if v(l_BOS, 0) = 0 then
    q(l_BOS, 0) ← q(ϵ, 0)
  end if
  z ← l_BOS
  for t = 0 to T do
    y_t ← the last label of z
    z ← q(z, t)
  y_{T+1} ← l_EOS

Whereas forward decoding spends O(L) per pattern on the sweep over P_{t-1}, backward decoding spends O(log L) per pattern on heap operations, which gives the log L factor in the complexity bound of Section 5.3; the prefix sets Q_t are obtained as a by-product of Algorithm 5.

6 Experiments

The proposed method was evaluated on English POS tagging against CRFSuite [6]. The data is the Wall Street Journal portion of Penn Treebank 3.0: sections 0-18 for training (38,219 sentences, 912,344 tokens) and sections 22-24 for testing (5,462 sentences, 129,654 tokens), the same split used by the Stanford Tagger [3]. Table 1 lists the feature templates, grouped by the Markov order of their label patterns; the first-order templates form the baseline, and the order-2, order-3, and order-4 groups add progressively longer label patterns. Experiments ran on a 2.9 GHz CPU with 128 GiB of memory. Table 2 reports the number of features, training statistics, and test accuracies, together with the published result of the Stanford Tagger [3], an MEMM-based system [1].

Order 1:  y_i;  y_i, x_i;  y_i, x_{i-1};  y_i, x_{i+1};  y_i, x_{i-1}, x_i;  y_i, x_i, x_{i+1};
          y_i, x_{i-2}, x_{i-1};  y_i, x_{i-2}, x_{i-1}, x_i;  y_i, x_{i-3}, x_{i-2}, x_{i-1};
          y_i with each prefix of x_i up to 10 characters;  y_i with each suffix of x_i up to 10 characters;
          three further word-shape templates on y_i and the surrounding words
Order 2:  y_i, y_{i-1};  y_i, y_{i-1}, x_i;  y_i, y_{i-1}, x_{i-1}
Order 3:  y_i, y_{i-1}, y_{i-2}, x_{i-1};  y_i, y_{i-1}, y_{i-2}, x_i, x_{i-1};  y_i, y_{i-1}, y_{i-2}, x_{i-1}, x_{i-2}
Order 4:  y_i, y_{i-1}, y_{i-2}, y_{i-3}, x_{i-1}, x_{i-2};  y_i, y_{i-1}, y_{i-2}, y_{i-3}, x_i, x_{i-1}, x_{i-2};
          y_i, y_{i-1}, y_{i-2}, y_{i-3}, x_{i-1}, x_{i-2}, x_{i-3};  y_i, y_{i-1}, y_{i-2}, y_{i-3}, y_{i-4}, x_{i-1}, x_{i-2}, x_{i-3}

Table 1: Feature templates (x_i: word at position i; y_i: its label).
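As an illustration of how such templates expand into concrete features (my own sketch; the template inventory here is a small subset of Table 1):

```python
def expand_templates(words, tags, i):
    """Instantiate a few of the Table 1 templates at position i.
    Each feature is a tuple (template-name, observation part, label part)."""
    x = lambda j: words[j] if 0 <= j < len(words) else "<PAD>"
    y = lambda j: tags[j] if j >= 0 else "<BOS>"
    return [
        ("y",         (),                (y(i),)),                    # order 1
        ("y,x",       (x(i),),           (y(i),)),
        ("y,x-1,x",   (x(i - 1), x(i)),  (y(i),)),
        ("y,y-1",     (),                (y(i - 1), y(i))),           # order 2
        ("y,y-1,x",   (x(i),),           (y(i - 1), y(i))),
        ("3gram,x-1", (x(i - 1),),       (y(i - 2), y(i - 1), y(i))), # order 3
    ]

for f in expand_templates(["This", "is", "good"], ["DT", "VBZ", "JJ"], 2):
    print(f)
```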

Model                               #Features                     Accuracy
baseline (first-order CRF)          3,113,069      26      202     96.99%
first-order CRF (proposed system)   3,113,069      75    1,408     97.12%
up to order 2                       3,618,876      87    1,277     97.18%
up to order 3                       5,163,811      87    1,313     97.19%
up to order 4                       7,329,406      90    1,154     97.19%
Stanford Tagger [3]                   460,552       -        -     97.24%

Table 2: WSJ POS-tagging results (number of features, training statistics, and test accuracy).

7 Conclusion

This thesis proposed an efficient algorithm for variable-order Linear-Chain CRFs. Like the algorithm of Ye et al. [4], it computes exact model expectations for higher-order features, but it does so more efficiently by working on the pattern sets P_t instead of on all label n-grams. In English POS tagging, the resulting models achieve higher accuracy than the conventional first-order Linear-Chain CRF, approaching the accuracy of the feature-rich Stanford Tagger [3], while remaining practical to train. An implementation is available at http://vocrf.net/.

Acknowledgments

References

[1] McCallum, A., Freitag, D. and Pereira, F. C. N.: Maximum Entropy Markov Models for Information Extraction and Segmentation, Proceedings of the Seventeenth International Conference on Machine Learning (ICML 2000), San Francisco, CA, USA, Morgan Kaufmann, pp. 591-598 (2000).

[2] Lafferty, J., McCallum, A. and Pereira, F. C. N.: Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data, Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001), pp. 282-289 (2001).

[3] Toutanova, K., Klein, D., Manning, C. D. and Singer, Y.: Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network, Proceedings of HLT-NAACL 2003, pp. 252-259 (2003).

[4] Ye, N., Lee, W. S., Chieu, H. L. and Wu, D.: Conditional Random Fields with High-Order Features for Sequence Labeling, Advances in Neural Information Processing Systems 22 (Bengio, Y., Schuurmans, D., Lafferty, J., Williams, C. K. I. and Culotta, A., eds.), pp. 2196-2204 (2009).

[5] Liu, D. C. and Nocedal, J.: On the Limited Memory BFGS Method for Large Scale Optimization, Mathematical Programming B, Vol. 45, No. 3, pp. 503-528 (1989).

[6] Okazaki, N.: CRFsuite: a fast implementation of Conditional Random Fields (CRFs) (2007).

[7] Qian, X., Jiang, X., Zhang, Q., Huang, X. and Wu, L.: Sparse Higher Order Conditional Random Fields for Improved Sequence Labeling, Proceedings of ICML 2009, p. 107 (2009).