kubostat2018d p.1 I 2018 (d) model selection and kubo@ees.hokudai.ac.jp http://goo.gl/76c4i 2018 06 25 : 2018 06 21 17:45 1 2 3 4 :? AIC : deviance model selection misunderstanding kubostat2018d (http://goo.gl/76c4i) 2018 (d) 2018 06 25 1 / 44 kubostat2018d (http://goo.gl/76c4i) 2018 (d) 2018 06 25 2 / 44 number of parameters? 4 GLM 5 GLM : : 2012 05 18 http://goo.gl/ufq2 seed number 2 4 6 8 10 12 14 (A) k = 1 (B) k = 7 Too few parametes? 7 8 9 10 11 12 bod size x 2 4 6 8 10 12 14 Too man parameters? 7 8 9 10 11 12 bod size x How man parameters do ou need for the best prediction? kubostat2018d (http://goo.gl/76c4i) 2018 (d) 2018 06 25 3 / 44 kubostat2018d (http://goo.gl/76c4i) 2018 (d) 2018 06 25 4 / 44 :? :? k? 1. :? seed number 2 4 6 8 10 12 14 (A) k = 1 (B) k = 7 Too few parametes? 2 4 6 8 10 12 14 Too man parameters? 7 8 9 10 11 12 7 8 9 10 11 12 bod size x bod size x? number of parameters k? kubostat2018d (http://goo.gl/76c4i) 2018 (d) 2018 06 25 5 / 44 kubostat2018d (http://goo.gl/76c4i) 2018 (d) 2018 06 25 6 / 44
kubostat2018d p.2 :? bod size x and fertilization f change seed number? : a statistical model for this example? i response variable seed number : { i } explanator variable : bod size {x i } fertilization {f i } sample size f i C: T: control (f i = C): 50 sample (i {1, 2, 50}) fertilization (f i = T): 50 sample (i {51, 52, 100}) i x i probabilit distribution Poisson distribution : linear predictor : β 1 + β 2 x i + β 3 f i link function : log link function kubostat2018d (http://goo.gl/76c4i) 2018 (d) 2018 06 25 7 / 44 kubostat2018d (http://goo.gl/76c4i) 2018 (d) 2018 06 25 8 / 44 :? 4 candidate models 4 : (A) constant λ :? 4 candidate models 4 : (B) f model λ i = exp(β 1 ) λ i = exp(β 1 + β 3 f i ) (log likelihood) (log likelihood) > loglik(glm( ~ 1, data = d, famil = poisson)) log Lik. 237.64 (df=1) > loglik(glm( ~ f, data = d, famil = poisson)) log Lik. 237.63 (df=2) kubostat2018d (http://goo.gl/76c4i) 2018 (d) 2018 06 25 9 / 44 kubostat2018d (http://goo.gl/76c4i) 2018 (d) 2018 06 25 10 / 44 :? 4 candidate models 4 : (C) x model :? 4 candidate models 4 : (D) x + f model λ i = exp(β 1 + β 2 x i ) λ i = exp(β 1 + β 2 x i + β 3 f i ) (log likelihood) (log likelihood) > loglik(glm( ~ x, data = d, famil = poisson)) log Lik. 235.39 (df=2) > loglik(glm( ~ x + f, data = d, famil = poisson)) log Lik. 235.29 (df=3) kubostat2018d (http://goo.gl/76c4i) 2018 (d) 2018 06 25 11 / 44 kubostat2018d (http://goo.gl/76c4i) 2018 (d) 2018 06 25 12 / 44
kubostat2018d p.3 :? k increases log L increases AIC : deviance (A) constant λ k = 1 237.6 (C) x model k = 2 235.4 (B) f model k = 2 237.6 fertilization Control (D) x + f model k = 3 235.3 Control fertilization 2. AIC : deviance badness of prediction : AIC kubostat2018d (http://goo.gl/76c4i) 2018 (d) 2018 06 25 13 / 44 kubostat2018d (http://goo.gl/76c4i) 2018 (d) 2018 06 25 14 / 44 AIC output R glm() deviance : deviance AIC deviance D = 2 log L : deviance > glm( ~ x + f, data = d, famil = poisson) Call: glm(formula = ~ x + f, famil = poisson, data = d) Maximum log likelihood log L : goodness of fit Deviance D = 2 log L : Coefficients: (Intercept) x ft 1.2631 0.0801 0.0320 Degrees of Freedom: 99 Total (i.e. Null); Null Deviance: 89.5 Residual Deviance: 84.8 AIC: 477 97 Residual model k log L Deviance Residual 2 log L deviance constant λ 1 237.6 475.3 89.5 f 2 237.6 475.3 89.5 x 2 235.4 470.8 85.0 x + f 3 235.3 470.6 84.8 saturation 100 192.9 385.8 0.0 Residual Deviance? Null Deviance? AIC? kubostat2018d (http://goo.gl/76c4i) 2018 (d) 2018 06 25 15 / 44 kubostat2018d (http://goo.gl/76c4i) 2018 (d) 2018 06 25 16 / 44 AIC : deviance Null deviance, Residual deviance,... AIC : deviance badness of prediction : AIC = 2 log L + 2k Max deviance 475.3 470.8 constant λ x model Look for a model of the smallest AIC AIC Deviance 2 log L () 89.5 (Null Deviance) 85.0 (Residual Deviance) model k log L Deviance Residual 2 log L deviance AIC constant λ 1 237.6 475.3 89.5 477.3 f 2 237.6 475.3 89.5 479.3 x 2 235.4 470.8 85.0 474.8 x + f 3 235.3 470.6 84.8 476.6 saturation 100 192.9 385.8 0.0 585.8 Min deviance 385.8 saturation model AIC: A (or Akaike) information criterion kubostat2018d (http://goo.gl/76c4i) 2018 (d) 2018 06 25 17 / 44 kubostat2018d (http://goo.gl/76c4i) 2018 (d) 2018 06 25 18 / 44
AIC : deviance (estimation)? constant λ ˆβ 1 = 2.04 β 1 = 2.08 parameter estimation kubostat2018d (http://goo.gl/76c4i) 2018 (d) 2018 06 25 19 / 44 AIC : deviance Is it OK? Goodness of fit is evaluated b using the SAME data set...? constant λ ˆβ 1 = 2.04 log L () biased goodness of fit! kubostat2018d (http://goo.gl/76c4i) 2018 (d) 2018 06 25 20 / 44 AIC : deviance : constant λ ˆβ 1 = 2.04 E(log L) β 1 = 2.08 ( ) (200 ) kubostat2018d (http://goo.gl/76c4i) 2018 (d) 2018 06 25 21 / 44 AIC : deviance 1.95 2.00 2.05 2.10 2.15 2.20 140 135 130 125 120 115 110 log likelihood 1.95 2.00 2.05 2.10 2.15 2.20 140 135 130 125 120 115 110 1.95 2.00 2.05 2.10 2.15 2.20 140 135 130 125 120 115 110 β 1 β 1 β 1 (200 ) ( ) (A) (B) (A) (C) log L = 120.6 E(log L) = 122.9 ˆβ 1 = 2.04 β 1 = 2.08 kubostat2018d (http://goo.gl/76c4i) 2018 (d) 2018 06 25 22 / 44 AIC : deviance 1 2 2 kubostat2018d (http://goo.gl/76c4i) 2018 (d) 2018 06 25 23 / 44 3. likelihood ratio test kubostat2018d (http://goo.gl/76c4i) 2018 (d) 2018 06 25 24 / 44 kubostat2018d p.4
kubostat2018d p.5 Although their procedures are similar... the are total different! AIC ( ) () AIC model selection totall different in their objectives kubostat2018d (http://goo.gl/76c4i) 2018 (d) 2018 06 25 25 / 44 kubostat2018d (http://goo.gl/76c4i) 2018 (d) 2018 06 25 26 / 44 Objective? model selection : Look for a model of better prediction : rejection of null hpothesis kubostat2018d (http://goo.gl/76c4i) 2018 (d) 2018 06 25 27 / 44 統計学的な検定 (NemanPearson framework) statistical test Null Alternative hpothesis hpothesis 帰無仮説対立仮説 glm( ~ 1) どうでもいい 興味ない VS glm( ~ x) 重要! これを主張したい! 非対称性 asmmetricit? kubostat2018d (http://goo.gl/76c4i) 2018 (d) 2018 06 25 28 / 44 統計学的な検定 (NemanPearson framework) statistical test Null Alternative hpothesis hpothesis 帰無仮説対立仮説 (if...) glm( ~ 1) test! reject 棄却 VS glm( ~ x) support 支持 非対称性 asmmetricit? kubostat2018d (http://goo.gl/76c4i) 2018 (d) 2018 06 25 29 / 44 統計学的な検定 (NemanPearson framework) statistical test (if...) Null hpothesis 帰無仮説 glm( ~ 1) test! NOT reject VS Alternative hpothesis 対立仮説 glm( ~ x) Sa Nothing!? 非対称性 asmmetricit? kubostat2018d (http://goo.gl/76c4i) 2018 (d) 2018 06 25 30 / 44
kubostat2018d p.6 The same example, again test statistics D 1,2 i i seed number i D: deviance x i bod size x i neglect fertilization treatment (!) x model D 2 = 470.8 constant λ D 1 = 475.3 difference in deviance D 1,2 = D 1 D 2 = 4.51 4.5 likelihood ratio? log L 1 L = log L 1 log L 2 2 model k log L Deviance 2 log L constant λ 1 237.6 D 1 = 475.3 x 2 235.4 D 2 = 470.8 null hpothesis alternative hpothesis asmmetricit in test Null hpothesis is junk :... et we are focousing onl on null hpothesis kubostat2018d (http://goo.gl/76c4i) 2018 (d) 2018 06 25 31 / 44 kubostat2018d (http://goo.gl/76c4i) 2018 (d) 2018 06 25 32 / 44 How to make null model objective : null hpothesis rejection Null hpothesis is included in Alt hpothesis this is a nested model ( ) observerd D 1,2 = 4.5 () ( ) () () { i } λ i alternative hpothesis : log λ i = β 1 + β 2 x i null hpothesis : log λ i = β 1 ( ) significant not significant is... (Reject ) (Not reject ) TRUE Tpe I error (no problem) NOT true (no problem) Tpe II error asmmetricit in test evaluating onl TpeI error : kubostat2018d (http://goo.gl/76c4i) 2018 (d) 2018 06 25 33 / 44 kubostat2018d (http://goo.gl/76c4i) 2018 (d) 2018 06 25 34 / 44 generate D 1,2 distribution D 1,2 : Suppose null hpothesis is TRUE! bootstrap likelihood test How to generate D 1,2 under is TRUE?! constant λ x model D 1,2 ( ˆβ 1 = 2.06 ) D 1,2 D 1,2 D 1,2 > d$.rnd < rpois(100, lambda = mean(d$)) > fit1 < glm(.rnd ~ 1, data = d, famil = poisson) > fit2 < glm(.rnd ~ x, data = d, famil = poisson) > fit1$deviance fit2$deviance generation of random numbers virtual data rpois() ( ) fitting GLM to the virtual data glm() kubostat2018d (http://goo.gl/76c4i) 2018 (d) 2018 06 25 35 / 44 kubostat2018d (http://goo.gl/76c4i) 2018 (d) 2018 06 25 36 / 44
kubostat2018d p.7 You must define rejection region in advance sa, 5%? 5%? NOT significant significant (5%) A random D 1,2 generator in R get.dd < function(d) # { n.sample < nrow(d) #.mean < mean(d$) # d$.rnd < rpois(n.sample, lambda =.mean) fit1 < glm(.rnd ~ 1, data = d, famil = poisson) fit2 < glm(.rnd ~ x, data = d, famil = poisson) fit1$deviance fit2$deviance # } pb < function(d, n.bootstrap) { replicate(n.bootstrap, get.dd(d)) } kubostat2018d (http://goo.gl/76c4i) 2018 (d) 2018 06 25 37 / 44 kubostat2018d (http://goo.gl/76c4i) 2018 (d) 2018 06 25 38 / 44 Generated distribution of D 1,2 = D 1 D 2 Probabilit{ D 1,2 4.5} = 38 1000 = 0.038 observed D 1,2 D 1,2 = 4.5 constant λ x model D 1,2 (R code is in the next page) > source("pb.r") # reading "pb.r" text file > dd12 < pb(d, n.bootstrap = 1000) > hist(dd12, 100) # to plot histogram > abline(v = 4.5, lt = 2) > sum(dd12 >= 4.5) [1] 38 socalled P value is 0.038. kubostat2018d (http://goo.gl/76c4i) 2018 (d) 2018 06 25 39 / 44 kubostat2018d (http://goo.gl/76c4i) 2018 (d) 2018 06 25 40 / 44 In this case, null hpothesis is rejected In case that P > 0.05...? So we can state that alternative hpothesis x model is better than constant λ. i i seed number i can be accepted. D: deviance x i bod size x i x model D 2 = 470.8 constant λ D 1 = 475.3 You can conclude NOTHING! You can NOT state that constant λ (Null hpothesis) is better λ asmmetricit in stattest : Null hpothesis is never accepted kubostat2018d (http://goo.gl/76c4i) 2018 (d) 2018 06 25 41 / 44 kubostat2018d (http://goo.gl/76c4i) 2018 (d) 2018 06 25 42 / 44
kubostat2018d p.8 model selection misunderstanding model selection misunderstanding 4. model selection misunderstanding FAQ http://hosho.ees.hokudai.ac.jp/~kubo/ce/faqmodelselection.html kubostat2018d (http://goo.gl/76c4i) 2018 (d) 2018 06 25 43 / 44 kubostat2018d (http://goo.gl/76c4i) 2018 (d) 2018 06 25 44 / 44