Introduction to Support Vector Machines

Takio Kurita
Neuroscience Research Institute,
National Institute of Advanced Industrial Science and Technology (AIST)
takio-kurita@aist.go.jp

Abstract: This tutorial introduces the support vector machine (SVM), currently one of the most successful pattern recognition methods, and relates it to classical statistical learning techniques such as the perceptron and logistic regression.

1 Introduction

The support vector machine (SVM) is a two-class pattern recognition method. Its core idea, the optimal separating hyperplane, was proposed by Vapnik and colleagues in the 1960s; combined with kernel-based nonlinear extensions, it attracted renewed attention in the 1990s and has since shown excellent recognition performance on many tasks.
2 Support Vector Machines

2.1 Pattern recognition

In statistical pattern recognition, an input pattern is represented by an M-dimensional feature vector x^T = (x_1, \ldots, x_M), and the task is to decide which of K given classes C_1, \ldots, C_K the pattern belongs to. In the following we consider the two-class case.

2.2 Linear support vector machines

The SVM in its basic form is a linear threshold element that computes an output y from the input vector x through a weight vector w (Figure 1).

[Figure 1: A linear threshold element for two-class discrimination: inputs x, weights w, output y.]
The output of the linear threshold element is

    y = \mathrm{sign}(w^T x - h)   (1)

where w is the weight vector, h is a threshold, and \mathrm{sign}(u) = 1 for u > 0 and -1 for u \le 0. A pattern is assigned to class C_1 if y = 1 and to class C_2 if y = -1. Suppose we are given N training samples x_1, \ldots, x_N with teacher labels t_1, \ldots, t_N \in \{1, -1\}, and suppose the two classes are linearly separable. Then the parameters can be rescaled so that

    t_i (w^T x_i - h) \ge 1, \quad i = 1, \ldots, N.   (2)

[Figure 2: The optimal separating hyperplane and the margin hyperplanes H1 and H2 (teacher labels 1 and -1).]

The training samples closest to the separating hyperplane then lie on the two hyperplanes H1: w^T x - h = 1 and H2: w^T x - h = -1, and the distance (margin) between H1 and H2 is 2 / \|w\|. Maximizing this margin is therefore equivalent to minimizing \|w\| with respect to w and h subject to the constraints

    t_i (w^T x_i - h) \ge 1, \quad (i = 1, \ldots, N).   (3)
Equivalently, we minimize

    L(w) = \frac{1}{2} \|w\|^2.   (4)

Introducing Lagrange multipliers \alpha_i (\ge 0), i = 1, \ldots, N, for the constraints gives

    L(w, h, \alpha) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{N} \alpha_i \{ t_i (w^T x_i - h) - 1 \}.   (5)

Setting the partial derivatives with respect to w and h to zero yields

    w = \sum_{i=1}^{N} \alpha_i t_i x_i   (6)

and

    \frac{\partial L}{\partial h} = \sum_{i=1}^{N} \alpha_i t_i   (7)

    \sum_{i=1}^{N} \alpha_i t_i = 0.   (8)

Substituting these back into (5) gives the dual problem: maximize, subject to (8) and

    0 \le \alpha_i, \quad i = 1, \ldots, N,   (9)

the objective

    L_D(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j t_i t_j x_i^T x_j.   (10)

At the optimum, the multipliers \alpha_i^* are nonzero only for training samples x_i that lie exactly on the hyperplanes w^T x - h = 1 or w^T x - h = -1; these samples are called support vectors. Writing S for the index set of the support vectors, the optimal weight vector is

    w^* = \sum_{i \in S} \alpha_i^* t_i x_i.   (11)

The optimal threshold h^* follows from the fact that any support vector x_s, s \in S, satisfies w^{*T} x_s - h^* = t_s:

    h^* = w^{*T} x_s - t_s.   (12)

The resulting decision function is

    y = \mathrm{sign}(w^{*T} x - h^*) = \mathrm{sign}\Big( \sum_{i \in S} \alpha_i^* t_i x_i^T x - h^* \Big).   (13)

Training samples that are not support vectors have \alpha_i^* = 0 and do not influence the decision function; only the samples with \alpha_i^* > 0 matter.
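To make Eqs. (11)-(13) concrete, here is a minimal Python sketch that recovers w* and h* from the support vectors of a trained linear SVM. The toy data, the use of scikit-learn, and the large C used to approximate the hard margin are assumptions for illustration, not part of the original text.

```python
# Recovering w* and h* from Eqs. (11)-(12) on linearly separable toy data.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two well-separated Gaussian blobs with labels t in {-1, +1}.
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(+2, 0.5, (20, 2))])
t = np.array([-1] * 20 + [+1] * 20)

# A very large C approximates the hard-margin problem of Eq. (4).
svm = SVC(kernel="linear", C=1e6).fit(X, t)

# dual_coef_ holds alpha_i * t_i for the support vectors i in S,
# so Eq. (11), w* = sum_{i in S} alpha_i t_i x_i, becomes:
w = svm.dual_coef_ @ svm.support_vectors_          # shape (1, 2)

# Eq. (12): h* = w*^T x_s - t_s for any support vector x_s.
s = svm.support_[0]
h = (w @ X[s]).item() - t[s]

# Eq. (13): y = sign(w*^T x - h*) should reproduce the labels.
y = np.sign(X @ w.ravel() - h)
print(np.all(y == t))   # True on this separable toy data
```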
2.3 Soft margin

[Figure 3: The soft margin: slack variables measure violations of the margin hyperplanes H1 and H2 (teacher labels 1 and -1).]

In practice the training samples are rarely linearly separable, so the constraints (3) cannot all be satisfied. The soft-margin formulation introduces slack variables \xi_i (\ge 0) and allows each sample to violate its margin hyperplane by the distance

    \frac{\xi_i}{\|w\|},   (14)

subject to

    \xi_i \ge 0, \quad t_i (w^T x_i - h) \ge 1 - \xi_i, \quad (i = 1, \ldots, N).   (15)

The objective is then to maximize the margin while keeping the total violation small:

    L(w, \xi) = \frac{1}{2} \|w\|^2 + \gamma \sum_{i=1}^{N} \xi_i.   (16)
The parameter \gamma controls the trade-off between the two terms: a large margin versus few margin violations. Introducing Lagrange multipliers \alpha_i and \nu_i for the two sets of constraints gives

    L(w, h, \xi, \alpha, \nu) = \frac{1}{2} \|w\|^2 + \gamma \sum_{i=1}^{N} \xi_i - \sum_{i=1}^{N} \alpha_i \{ t_i (w^T x_i - h) - (1 - \xi_i) \} - \sum_{i=1}^{N} \nu_i \xi_i.   (17)

Setting the partial derivatives with respect to w, h, and \xi_i to zero yields

    w = \sum_{i=1}^{N} \alpha_i t_i x_i   (18)

    \frac{\partial L}{\partial h} = \sum_{i=1}^{N} \alpha_i t_i   (19)

    \alpha_i = \gamma - \nu_i   (20)

    \sum_{i=1}^{N} \alpha_i t_i = 0.   (21)

The dual problem is to maximize, subject to (21) and the box constraints

    0 \le \alpha_i \le \gamma, \quad i = 1, \ldots, N,   (22)

the same objective as before:

    L_D(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j t_i t_j x_i^T x_j.   (23)

The optimal multipliers \alpha_i^* now classify the training samples into three groups: samples with \alpha_i^* = 0 are correctly classified and lie outside the band between H1 and H2; samples with 0 < \alpha_i^* < \gamma lie exactly on H1 or H2; and samples with \alpha_i^* = \gamma have \xi_i \ge 0 and lie inside the band or on the wrong side of the separating hyperplane.
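The following sketch illustrates the box constraint (22): with scikit-learn's SVC, whose parameter C plays the role of \gamma here, the recovered multipliers split into samples exactly on the margin (0 < \alpha_i < \gamma) and margin violators (\alpha_i = \gamma). The overlapping toy data are an illustrative assumption.

```python
# Inspecting the soft-margin multipliers alpha_i of Eq. (22).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 1.0, (50, 2)), rng.normal(+1, 1.0, (50, 2))])
t = np.array([-1] * 50 + [+1] * 50)

gamma_param = 1.0                        # the gamma of Eq. (16)
svm = SVC(kernel="linear", C=gamma_param).fit(X, t)

alpha = np.abs(svm.dual_coef_.ravel())   # |alpha_i t_i| = alpha_i
at_bound = np.isclose(alpha, gamma_param)
print(f"{len(alpha)} support vectors: "
      f"{at_bound.sum()} with alpha_i = gamma (margin violators), "
      f"{(~at_bound).sum()} with 0 < alpha_i < gamma (exactly on H1/H2)")
```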
2.4 Nonlinear extension by kernels

To obtain a nonlinear classifier, the input x is first mapped to a feature vector \phi(x) by a nonlinear map \phi, and the linear SVM is constructed in the feature space. The key observation is that the dual objective L_D depends on the mapped samples \phi(x_1), \phi(x_2) only through their inner products, so it suffices to know the kernel function

    \phi(x_1)^T \phi(x_2) = K(x_1, x_2)   (24)

for pairs of inputs x_1, x_2; the map \phi itself never has to be evaluated explicitly. Commonly used kernels K include the polynomial kernel

    K(x_1, x_2) = (1 + x_1^T x_2)^p,   (25)

the Gaussian kernel

    K(x_1, x_2) = \exp\Big( -\frac{\|x_1 - x_2\|^2}{2\sigma^2} \Big),   (26)

and the sigmoid kernel

    K(x_1, x_2) = \tanh(a x_1^T x_2 - b).   (27)

Replacing the inner products in (10) and (23) by the kernel, the dual objective L_D becomes

    L_D(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j t_i t_j \phi(x_i)^T \phi(x_j)
                = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j t_i t_j K(x_i, x_j),   (28)

and the decision function (13) becomes

    y = \mathrm{sign}(w^{*T} \phi(x) - h^*) = \mathrm{sign}\Big( \sum_{i \in S} \alpha_i^* t_i \phi(x_i)^T \phi(x) - h^* \Big)
      = \mathrm{sign}\Big( \sum_{i \in S} \alpha_i^* t_i K(x_i, x) - h^* \Big).   (29)

With the Gaussian kernel, the resulting classifier has the same form as a radial basis function (RBF) network whose centers are the support vectors (Figure 4).

[Figure 4: Network interpretation of the kernel SVM: inputs x, kernel units weighted by \alpha_i^* t_i, output y.]

2.5 Model selection

The generalization performance of the SVM depends on hyperparameters such as the soft-margin parameter \gamma and, for the Gaussian kernel, the width \sigma. These are usually chosen by estimating the generalization error with cross-validation (CV) and selecting the values that minimize it.
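As a sketch of the cross-validation procedure described above, the following grid search tunes \gamma (scikit-learn's C) and the Gaussian width \sigma of Eq. (26) (scikit-learn's gamma = 1/(2\sigma^2)). The data set, the grids, and the 5-fold CV are illustrative assumptions.

```python
# Hyperparameter selection by cross-validation for a Gaussian-kernel SVM.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, u = make_moons(n_samples=200, noise=0.2, random_state=0)
t = 2 * u - 1                                  # labels in {-1, +1}

sigmas = np.array([0.1, 0.3, 1.0, 3.0])
param_grid = {
    "C": [0.1, 1.0, 10.0, 100.0],              # soft-margin gamma of Eq. (16)
    "gamma": 1.0 / (2.0 * sigmas**2),          # kernel width sigma of Eq. (26)
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, t)
print(search.best_params_, search.best_score_)
```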
3 Perceptron and Related Statistical Methods

The SVM is closely related to classical techniques such as the perceptron, multiple regression, logistic regression, and the multinomial logit model. This section reviews these methods and their learning rules.

3.1 Perceptron

Rosenblatt's perceptron computes, from the input vector x = (x_1, \ldots, x_M)^T, an output

    y = f(\eta), \quad \eta = w^T x - h = \tilde{w}^T \tilde{x},   (30)

where w_i is the weight of the i-th input, h is a threshold, and the augmented vectors \tilde{w} = (h, w_1, \ldots, w_M)^T and \tilde{x} = (-1, x_1, \ldots, x_M)^T absorb the threshold into the weight vector. Typical choices for the output function f are Rosenblatt's threshold function

    f(\eta) = \mathrm{sign}(\eta) = \begin{cases} 1 & \text{if } \eta \ge 0 \\ 0 & \text{otherwise,} \end{cases}   (31)

the linear (identity) function

    f(\eta) = \eta,   (32)

and the logistic function

    f(\eta) = \frac{\exp(\eta)}{1 + \exp(\eta)}.   (33)
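For concreteness, the three output functions (31)-(33) can be written as follows; the NumPy vectorization is an implementation choice, not part of the original text.

```python
# The output functions f of Eqs. (31)-(33).
import numpy as np

def threshold(eta):
    """Eq. (31): 1 if eta >= 0, else 0 (Rosenblatt's perceptron)."""
    return np.where(eta >= 0, 1.0, 0.0)

def linear(eta):
    """Eq. (32): the identity, f(eta) = eta."""
    return eta

def logistic(eta):
    """Eq. (33): f(eta) = exp(eta) / (1 + exp(eta))."""
    return 1.0 / (1.0 + np.exp(-eta))

eta = np.linspace(-4, 4, 9)
print(threshold(eta), linear(eta), logistic(eta), sep="\n")
```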
[Figure 5: Output functions f(\eta): (a) threshold, (b) linear, (c) logistic.]

3.2 Learning of the perceptron

Rosenblatt's perceptron is trained by error-correction learning: whenever a training sample is misclassified, the weights are moved in the direction that reduces the error, and this is repeated until all samples are classified correctly (which is possible only for linearly separable data).

3.3 Widrow-Hoff learning rule

With the linear output function (32), the parameters can instead be fitted by least squares. Given N training pairs \{(x_i, t_i) \mid i = 1, \ldots, N\}, where x_i is an input vector and t_i its teacher signal, we minimize the empirical squared error:
    \varepsilon^2_{\mathrm{emp}} = \sum_{i=1}^{N} (t_i - y_i)^2 = \sum_{i=1}^{N} \varepsilon^2_{\mathrm{emp}}(i).   (34)

For gradient descent we need the partial derivatives of \varepsilon^2_{\mathrm{emp}} with respect to the parameters:

    \frac{\partial \varepsilon^2_{\mathrm{emp}}}{\partial w_j} = -2 \sum_{i=1}^{N} (t_i - y_i) x_{ij} = -2 \sum_{i=1}^{N} \delta_i x_{ij}   (35)

    \frac{\partial \varepsilon^2_{\mathrm{emp}}}{\partial h} = -2 \sum_{i=1}^{N} (t_i - y_i)(-1) = -2 \sum_{i=1}^{N} \delta_i (-1),   (36)

where \delta_i = (t_i - y_i). The update rules are therefore

    w_j \leftarrow w_j + \alpha \Big( \sum_{i=1}^{N} \delta_i x_{ij} \Big)   (37)

    h \leftarrow h + \alpha \Big( \sum_{i=1}^{N} \delta_i (-1) \Big),   (38)

where \alpha is the learning rate. This is known as the Widrow-Hoff learning rule and, because the update is driven by the error \delta_i = t_i - y_i, also as the delta rule.

The Widrow-Hoff solution can also be written in closed form. With the N \times (M+1) matrix \tilde{X} = (\tilde{x}_1, \ldots, \tilde{x}_N)^T of augmented input vectors and the teacher vector t = (t_1, \ldots, t_N)^T, the squared error becomes

    \varepsilon^2_{\mathrm{emp}} = \sum_{i=1}^{N} (t_i - y_i)^2 = \| t - \tilde{X} \tilde{w} \|^2.   (39)

Setting the derivative with respect to \tilde{w} to zero,

    \frac{\partial \varepsilon^2_{\mathrm{emp}}}{\partial \tilde{w}} = -2 \tilde{X}^T (t - \tilde{X} \tilde{w}) = 0,   (40)

gives the normal equations, whose solution is

    \tilde{w} = (\tilde{X}^T \tilde{X})^{-1} \tilde{X}^T t.   (41)

This is exactly multiple regression analysis with x as the explanatory variables and t as the criterion variable.
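Below is a minimal sketch of the delta rule (37)-(38) in the augmented form of Eq. (30), checked against the closed-form solution (41); the random data, learning rate, and iteration count are assumptions chosen so that gradient descent converges on this example.

```python
# Widrow-Hoff (delta) rule vs. the normal equations of Eq. (41).
import numpy as np

rng = np.random.default_rng(0)
N, M = 100, 3
X = rng.normal(size=(N, M))
t = X @ np.array([1.0, -2.0, 0.5]) - 0.3 + rng.normal(0, 0.01, N)

# Augmented form: x~ = (-1, x_1, ..., x_M), w~ = (h, w_1, ..., w_M),
# so that eta = w~^T x~ = w^T x - h.
Xa = np.hstack([-np.ones((N, 1)), X])

# Iterative delta rule: w~ <- w~ + alpha * sum_i delta_i x~_i (averaged).
w = np.zeros(M + 1)
alpha = 0.05
for _ in range(2000):
    delta = t - Xa @ w                   # delta_i = t_i - y_i
    w += alpha * Xa.T @ delta / N

# Closed form, Eq. (41): w~ = (X~^T X~)^{-1} X~^T t.
w_ls = np.linalg.solve(Xa.T @ Xa, Xa.T @ t)
print(np.allclose(w, w_ls, atol=1e-3))   # both reach (nearly) the same solution
```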
3.4 Model selection in multiple regression

For multiple regression analysis, many techniques have been developed for choosing a model of appropriate complexity; the two main families are variable (feature) selection and shrinkage (regularization) methods such as ridge regression.

(1) Variable selection. Here a subset of the explanatory variables x is chosen. Since evaluating all subsets is usually infeasible, greedy procedures are used: forward stepwise selection starts from the empty model and adds the single most useful variable at a time, while backward stepwise selection starts from the full model and removes the least useful variable one at a time.

To compare candidate models, the generalization error can be estimated by resampling, most simply by leave-one-out cross-validation: each of the N samples is held out in turn, the model is trained on the remaining N - 1 samples and tested on the held-out sample, and the N test errors are averaged.
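A sketch of leave-one-out cross-validation as just described; the generic fit/predict interface and the least-squares model plugged into it are illustrative assumptions.

```python
# Leave-one-out cross-validation for a generic fit/predict pair.
import numpy as np

def loo_mse(X, t, fit, predict):
    """Hold out each sample once, fit on the rest, average the test errors."""
    N = len(t)
    errs = []
    for i in range(N):
        keep = np.arange(N) != i
        model = fit(X[keep], t[keep])
        errs.append((t[i] - predict(model, X[i])) ** 2)
    return np.mean(errs)

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
t = X @ np.array([1.0, -1.0]) + rng.normal(0, 0.1, 30)

fit = lambda X, t: np.linalg.lstsq(X, t, rcond=None)[0]   # least squares
predict = lambda w, x: x @ w
print(loo_mse(X, t, fit, predict))
```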
Related resampling methods include the jackknife [9, 10] and the bootstrap [11, 12, 13]. Resampling-based estimates are general but computationally expensive, since the model must be refit many times.

Alternatively, candidate models can be compared with information criteria such as Akaike's AIC (An Information Criterion) [14, 15] and Rissanen's MDL (Minimum Description Length) [16, 17], which penalize the maximized log-likelihood by the number of free parameters. With J the number of free parameters of the model, AIC is defined as

    \mathrm{AIC} = -2 \, (\text{maximum log-likelihood}) + 2 J,   (42)

and MDL as

    \mathrm{MDL} = -(\text{maximum log-likelihood}) + \frac{J}{2} \log N.   (43)

Among the candidate models, the one minimizing AIC or MDL is selected.
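Eqs. (42)-(43) are straightforward to compute once the maximized log-likelihood is known; the Gaussian residual likelihood used in this sketch is a common choice for regression models but is an assumption here, not specified in the text.

```python
# Computing AIC, Eq. (42), and MDL, Eq. (43).
import numpy as np

def aic(max_log_lik, J):
    """Eq. (42): AIC = -2 (max log-likelihood) + 2 J."""
    return -2.0 * max_log_lik + 2.0 * J

def mdl(max_log_lik, J, N):
    """Eq. (43): MDL = -(max log-likelihood) + (J / 2) log N."""
    return -max_log_lik + 0.5 * J * np.log(N)

# Example: Gaussian residual log-likelihood from a residual sum of
# squares rss of a linear model with J free parameters (an assumption).
N, J, rss = 100, 4, 12.3
log_lik = -0.5 * N * (np.log(2 * np.pi * rss / N) + 1.0)
print(aic(log_lik, J), mdl(log_lik, J, N))
```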
(2) Shrinkage. Instead of discarding variables, shrinkage methods keep all of them but penalize large weights. In ridge regression, the empirical error

    \varepsilon^2_{\mathrm{emp}} = \sum_{i=1}^{N} (t_i - y_i)^2 = \sum_{i=1}^{N} \Big( t_i - \Big( \sum_{j=1}^{M} w_j x_{ij} - h \Big) \Big)^2   (44)

is augmented with the penalty

    \sum_{j=1}^{M} w_j^2,   (45)

giving the criterion

    Q(w, h) = \sum_{i=1}^{N} \Big( t_i - \Big( \sum_{j=1}^{M} w_j x_{ij} - h \Big) \Big)^2 + \lambda \sum_{j=1}^{M} w_j^2.   (46)

The regularization parameter \lambda controls the balance between the two terms; for \lambda = 0 ridge regression reduces to ordinary least squares. Since Q(w, h) is quadratic in h, setting its derivative with respect to h to zero,

    \frac{\partial Q(w, h)}{\partial h} = 2N \Big( \bar{t} - \sum_{j=1}^{M} w_j \bar{x}_j + h \Big) = 0,   (47)

gives

    h = -\bar{t} + \sum_{j=1}^{M} w_j \bar{x}_j,   (48)

where \bar{t} = \frac{1}{N} \sum_{i=1}^{N} t_i and \bar{x}_j = \frac{1}{N} \sum_{i=1}^{N} x_{ij}. Substituting (48) back into Q(w, h) leaves a criterion in w alone:

    Q(w) = \sum_{i=1}^{N} \Big\{ (t_i - \bar{t}) - \sum_{j=1}^{M} w_j (x_{ij} - \bar{x}_j) \Big\}^2 + \lambda \sum_{j=1}^{M} w_j^2
         = (\tilde{t} - \tilde{X} w)^T (\tilde{t} - \tilde{X} w) + \lambda w^T w,   (49)

where \tilde{t} is the vector with elements (t_i - \bar{t}) and \tilde{X} the matrix with elements (x_{ij} - \bar{x}_j). Setting the derivative with respect to w to zero,

    \frac{\partial Q(w)}{\partial w} = -2 \tilde{X}^T \tilde{t} + 2 (\tilde{X}^T \tilde{X} + \lambda I) w = 0,   (50)

gives

    w = (\tilde{X}^T \tilde{X} + \lambda I)^{-1} \tilde{X}^T \tilde{t}.   (51)

Adding \lambda to the diagonal of \tilde{X}^T \tilde{X} makes the matrix invertible even when \tilde{X}^T \tilde{X} itself is singular, so ridge regression also stabilizes the computation numerically.
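A sketch of ridge regression following (47)-(51): centre the variables, solve the regularized normal equations, and recover the threshold h from (48). The data and the grid of \lambda values are illustrative assumptions.

```python
# Ridge regression via Eqs. (48) and (51).
import numpy as np

def ridge(X, t, lam):
    xbar, tbar = X.mean(axis=0), t.mean()
    Xc, tc = X - xbar, t - tbar                       # centred variables
    M = X.shape[1]
    # Eq. (51): w = (X~^T X~ + lambda I)^{-1} X~^T t~
    w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(M), Xc.T @ tc)
    h = w @ xbar - tbar                               # Eq. (48)
    return w, h

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
t = X @ np.array([2.0, 0.0, -1.0]) + 1.0 + rng.normal(0, 0.1, 50)
for lam in (0.0, 1.0, 100.0):
    w, h = ridge(X, t, lam)
    print(lam, w, h)                                  # weights shrink as lambda grows
```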
3.5 Logistic regression

Consider two-class training data \{(x_i, u_i) \mid i = 1, \ldots, N\}, where the teacher signal u_i takes the value 0 or 1, and model the probability of class 1 by the logistic output y of (33). The likelihood of the training set is then

    L = \prod_{i=1}^{N} y_i^{u_i} (1 - y_i)^{(1 - u_i)},   (52)

and maximizing it is equivalent to maximizing the log-likelihood

    l = \sum_{i=1}^{N} \{ u_i \log y_i + (1 - u_i) \log(1 - y_i) \}
      = \sum_{i=1}^{N} \Big\{ u_i \log \frac{\exp(\eta_i)}{1 + \exp(\eta_i)} + (1 - u_i) \log \frac{1}{1 + \exp(\eta_i)} \Big\}
      = \sum_{i=1}^{N} \{ u_i \eta_i - \log(1 + \exp(\eta_i)) \}.   (53)

The partial derivatives with respect to the weights w_j and the threshold h are

    \frac{\partial l}{\partial w_j} = \sum_{i=1}^{N} (u_i - y_i) x_{ij} = \sum_{i=1}^{N} \delta_i x_{ij},   (54)

with \delta_i = (u_i - y_i), and

    \frac{\partial l}{\partial h} = \sum_{i=1}^{N} (u_i - y_i)(-1) = \sum_{i=1}^{N} \delta_i (-1).   (55)

Gradient ascent on the log-likelihood therefore gives the update rules

    w_j \leftarrow w_j + \alpha \Big( \sum_{i=1}^{N} \delta_i x_{ij} \Big)   (56)

    h \leftarrow h + \alpha \Big( \sum_{i=1}^{N} \delta_i (-1) \Big),   (57)

which have exactly the same form as the Widrow-Hoff learning rule; only the definition of the output y_i differs.

The maximum likelihood estimate can also be computed by Fisher's scoring method. For a model f(y, \theta_1, \ldots, \theta_M) with parameters \theta_1, \ldots, \theta_M, the Fisher information matrix F = [F_{ij}] has elements

    F_{ij} = -E \Big( \frac{\partial^2 \log f(y, \theta_1, \ldots, \theta_M)}{\partial \theta_i \, \partial \theta_j} \Big),   (58)

and Fisher's scoring method uses F in place of the negative Hessian in Newton's method. For the log-likelihood (53), the second derivatives are

    \frac{\partial^2 l}{\partial w_k \, \partial w_j} = -\sum_{i=1}^{N} \omega_i x_{ik} x_{ij},   (59)
where \omega_i = y_i (1 - y_i).

[Figure 6: \omega = y(1 - y) as a function of y; it attains its maximum 0.25 at y = 0.5 and vanishes at y = 0 and y = 1.]

In matrix form, with \tilde{X}^T = [\tilde{x}_1, \ldots, \tilde{x}_N], W = \mathrm{diag}(\omega_1, \ldots, \omega_N), and \delta = (\delta_1, \ldots, \delta_N)^T, the gradient and Hessian of l with respect to \tilde{w} are

    \nabla l = \sum_{i=1}^{N} \delta_i \tilde{x}_i = \tilde{X}^T \delta,   (60)

    \nabla^2 l = -\sum_{i=1}^{N} \omega_i \tilde{x}_i \tilde{x}_i^T = -\tilde{X}^T W \tilde{X}.

Since this Hessian does not depend on the teacher signals, it coincides with its expectation, and the Fisher information matrix is

    F = E(-\nabla^2 l) = \tilde{X}^T W \tilde{X}.   (61)

As Figure 6 shows, the weights \omega_i are largest for samples whose predicted probability y_i is near 0.5, i.e. near the decision boundary. For the logistic model, Fisher's scoring method thus coincides with Newton's method, since the Fisher information matrix equals the negative Hessian [18]. One scoring step updates the current estimate \tilde{w} to

    \tilde{w}_{\mathrm{new}} = \tilde{w} + \delta \tilde{w},   (62)

where \delta \tilde{w} solves

    F \, \delta \tilde{w} = \nabla l.   (63)

Multiplying (62) by F gives

    F \tilde{w}_{\mathrm{new}} = F \tilde{w} + F \, \delta \tilde{w} = F \tilde{w} + \nabla l,   (64)

and the first term can be written as

    F \tilde{w} = \tilde{X}^T W \eta,   (65)
where \eta = (\eta_1, \ldots, \eta_N)^T. Combining (64) and (65), the updated estimate becomes

    \tilde{w}_{\mathrm{new}} = F^{-1} (F \tilde{w} + \nabla l)
                             = (\tilde{X}^T W \tilde{X})^{-1} (\tilde{X}^T W \eta + \tilde{X}^T \delta)
                             = (\tilde{X}^T W \tilde{X})^{-1} \tilde{X}^T W (\eta + W^{-1} \delta),   (66)

with \delta = (\delta_1, \ldots, \delta_N)^T. This is a weighted least-squares regression of the working target \eta + W^{-1} \delta on \tilde{X}, so maximum likelihood estimation proceeds by iteratively reweighted least squares. A convenient initial value is obtained by starting from \tilde{w} = 0, for which \eta = 0, W = \frac{1}{4} I, and \delta = u - \frac{1}{2} \mathbf{1}; substituting into (66) gives

    \tilde{w}_0 = 4 (\tilde{X}^T \tilde{X})^{-1} \tilde{X}^T \Big( u - \frac{1}{2} \mathbf{1} \Big).   (67)

Up to scaling, this is the ordinary least-squares solution with shifted teacher signals t_i = u_i - \frac{1}{2}, i.e. essentially a linear regression toward labels \pm 1.
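A sketch of the resulting iteratively reweighted least-squares procedure, using the update (66) and the initial value (67); the synthetic data and the fixed iteration count are assumptions, and no safeguard against saturated weights \omega_i \approx 0 is included.

```python
# Fisher scoring / IRLS for logistic regression, Eqs. (66)-(67).
import numpy as np

rng = np.random.default_rng(0)
N, M = 200, 2
X = rng.normal(size=(N, M))
# Labels drawn from a logistic model with w = (2, -1), h = 0.5.
p = 1.0 / (1.0 + np.exp(-(X @ np.array([2.0, -1.0]) - 0.5)))
u = (rng.random(N) < p).astype(float)

Xa = np.hstack([-np.ones((N, 1)), X])        # x~ = (-1, x_1, ..., x_M)

# Eq. (67): w~_0 = 4 (X~^T X~)^{-1} X~^T (u - 1/2).
w = 4.0 * np.linalg.solve(Xa.T @ Xa, Xa.T @ (u - 0.5))

for _ in range(20):
    eta = Xa @ w
    y = 1.0 / (1.0 + np.exp(-eta))
    omega = y * (1.0 - y)                    # weights of Eq. (59)
    z = eta + (u - y) / omega                # working target eta + W^{-1} delta
    # Eq. (66): w~ = (X~^T W X~)^{-1} X~^T W z  (weighted least squares)
    XtW = Xa.T * omega
    w = np.linalg.solve(XtW @ Xa, XtW @ z)

print(w)   # (h, w_1, w_2); should come out roughly (0.5, 2.0, -1.0)
```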
3.6 Model selection in logistic regression

As in multiple regression, the complexity of a logistic regression model can be controlled by variable selection with resampling-based or information-criterion-based evaluation, or by shrinkage.

(1) Weight decay. The shrinkage approach penalizes the negative log-likelihood with the squared weights:

    Q(\tilde{w}) = -l + \lambda \sum_{j=1}^{M} w_j^2
                 = \sum_{i=1}^{N} \{ \log(1 + \exp(\eta_i)) - u_i \eta_i \} + \lambda \sum_{j=1}^{M} w_j^2.   (68)

The partial derivative of Q(\tilde{w}) with respect to w_j is

    \frac{\partial Q}{\partial w_j} = -\frac{\partial l}{\partial w_j} + 2 \lambda w_j = -\sum_{i=1}^{N} (u_i - y_i) x_{ij} + 2 \lambda w_j,   (69)

and with respect to h (which is not penalized),

    \frac{\partial Q}{\partial h} = -\frac{\partial l}{\partial h} = -\sum_{i=1}^{N} (u_i - y_i)(-1).   (70)

Gradient descent on Q gives the weight-decay update rules

    w_j \leftarrow w_j + \alpha \Big( \sum_{i=1}^{N} (u_i - y_i) x_{ij} \Big) - 2 \alpha \lambda w_j   (71)

    h \leftarrow h + \alpha \Big( \sum_{i=1}^{N} (u_i - y_i)(-1) \Big).   (72)

The extra term -2 \alpha \lambda w_j decays each weight toward 0 at every step, hence the name weight decay.

3.7 Relation to the support vector machine

The soft-margin criterion (16) of the SVM can be rewritten in the same penalized-loss form. Dividing by \gamma and writing \lambda for the resulting regularization constant,

    L(w, \xi) = \sum_{i=1}^{N} \xi_i + \lambda \sum_{j=1}^{M} w_j^2
              = \sum_{i=1}^{N} [1 - t_i \eta_i]_+ + \lambda \sum_{j=1}^{M} w_j^2,   (73)

where [x]_+ = x if x > 0 and 0 otherwise. The first term is the hinge loss [1 - t_i \eta_i]_+ shown in Figure 7 (a): it is 0 when the sample satisfies the margin condition t_i \eta_i \ge 1 (outside the band between H1 and H2) and grows linearly with the violation otherwise. If the hinge loss is replaced by the squared error with teacher signals t_i \in \{-1, 1\},

    Q = \sum_{i=1}^{N} (1 - t_i \eta_i)^2 + \lambda \sum_{j=1}^{M} w_j^2,   (74)

we recover ridge-regression-style training with the loss (1 - t_i \eta_i)^2 of Figure 7 (b), which, unlike the hinge loss, also penalizes samples classified correctly with t_i \eta_i > 1.
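The three losses compared in this subsection can be evaluated side by side as functions of x = t_i \eta_i: the hinge loss of (73), the squared loss of (74), and the logistic loss that appears in Eq. (75) below. The small numeric table in this sketch stands in for the curves of Figure 7.

```python
# The margin losses of Eqs. (73)-(75) as functions of x = t_i * eta_i.
import numpy as np

def hinge(x):
    """Eq. (73): [1 - x]_+ = max(0, 1 - x)."""
    return np.maximum(0.0, 1.0 - x)

def squared(x):
    """Eq. (74): (1 - x)^2."""
    return (1.0 - x) ** 2

def logistic_loss(x):
    """Eq. (75): log(1 + exp(-x))."""
    return np.log1p(np.exp(-x))

for x in (-2.0, 0.0, 1.0, 2.0):
    print(f"x={x:+.0f}  hinge={hinge(x):.3f}  "
          f"squared={squared(x):.3f}  logistic={logistic_loss(x):.3f}")
```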
[Figure 7: Loss functions as a function of x = t_i \eta_i: (a) hinge loss, (1 - x) for x < 1 and 0 otherwise; (b) squared loss, (1 - x)^2; (c) logistic loss, \log(1 + \exp(-x)) (weight decay).]
Weight decay for logistic regression fits in the same picture. Converting the teacher signals u_i \in \{0, 1\} to t_i \in \{-1, 1\}, the criterion (68) becomes

    Q = \sum_{i=1}^{N} \log\{ 1 + \exp(-t_i \eta_i) \} + \lambda \sum_{j=1}^{M} w_j^2.   (75)

The logistic loss \log\{1 + \exp(-t_i \eta_i)\} of Figure 7 (c) is a smooth function of the margin t_i \eta_i: like the hinge loss it grows roughly linearly for badly violated samples (t_i \eta_i \ll 1), but unlike the hinge loss it never becomes exactly zero, decaying smoothly to 0 as t_i \eta_i grows. The three methods thus differ essentially only in the loss applied to the margin t_i \eta_i, combined with the same squared-weight regularizer.

4 Concluding Remarks

This tutorial reviewed the support vector machine, from the optimal separating hyperplane through the soft margin and the kernel-based nonlinear extension, and related it to classical methods such as the perceptron, least squares and ridge regression, and logistic regression with weight decay.

References

[1] V. N. Vapnik, Statistical Learning Theory, John Wiley & Sons (1998).
[2] (Japanese-language article; author and title garbled in the source), No. 444, pp. 52-58 (2000).
[3] (Japanese-language article; author and title garbled in the source), Vol. 42, No. 7, pp. 676-683 (2001).
[4] B. Scholkopf, C. J. C. Burges, and A. J. Smola, Advances in Kernel Methods - Support Vector Learning, The MIT Press (1999).
[5] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods, Cambridge University Press (2000).
[6] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning - Data Mining, Inference, and Prediction, Springer-Verlag (2001).
[7] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification (Second Edition), John Wiley & Sons (2001).
[8] K. R. Muller, S. Mika, G. Ratsch, K. Tsuda, and B. Scholkopf, An introduction to kernel-based learning algorithms, IEEE Trans. on Neural Networks, Vol. 12, No. 2, pp. 181-201 (2001).
[9] R. G. Miller, The jackknife - a review, Biometrika, Vol. 61, No. 1, pp. 1-15 (1974).
[10] M. Stone, Cross-validatory choice and assessment of statistical predictions, Journal of the Royal Statistical Society, Vol. B36, pp. 111-147 (1974).
[11] B. Efron, Bootstrap methods: another look at the jackknife, The Annals of Statistics, Vol. 7, No. 1, pp. 1-26 (1979).
[12] B. Efron, Estimating the error rate of a prediction rule: improvement on cross-validation, Journal of the American Statistical Association, Vol. 78, pp. 316-331 (1983).
[13] B. Efron, The bootstrap method for assessing statistical accuracy, Behaviormetrika, Vol. 17, pp. 1-35 (1985).
[14] H. Akaike, A new look at the statistical model identification, IEEE Trans. on Automatic Control, Vol. AC-19, No. 6, pp. 716-723 (1974).
[15] (Japanese-language book; authors and title garbled in the source) (1983).
[16] J. Rissanen, A universal prior for integers and estimation by minimum description length, The Annals of Statistics, Vol. 11, No. 2, pp. 416-431 (1983).
[17] J. Rissanen, Stochastic complexity and modeling, The Annals of Statistics, Vol. 14, No. 3, pp. 1080-1100 (1986).
[18] P. McCullagh and J. A. Nelder, Generalized Linear Models, Chapman and Hall (1989).