kubostat2017c p.1 2017 (c), a generalized linear model (GLM) : kubo@ees.hokudai.ac.jp http://goo.gl/76c4i 2017 11 14 : 2017 11 07 15:43 kubostat2017c (http://goo.gl/76c4i) 2017 (c) 2017 11 14 1 / 47 agenda I 1 response variable y 2 :? 3 GLM 4 R GLM 5 GLM kubostat2017c (http://goo.gl/76c4i) 2017 (c) 2017 11 14 2 / 47 agenda II Normal distribution and identity link function Poisson distribution and log link function log 3 (GLM) http://goo.gl/ufq2 y y : : 2012 05 18 kubostat2017c (http://goo.gl/76c4i) 2017 (c) 2017 11 14 3 / 47 kubostat2017c (http://goo.gl/76c4i) 2017 (c) 2017 11 14 4 / 47 response variable y? Generalized Linear Model (GLM) () (logistic regression) (linear regression) 1. response variable y kubostat2017c (http://goo.gl/76c4i) 2017 (c) 2017 11 14 5 / 47 kubostat2017c (http://goo.gl/76c4i) 2017 (c) 2017 11 14 6 / 47
kubostat2017c p.2 statistaical models appeared in the class response variable y suppose that you have a count data set... 0, 1, 2 response variable y The development of linear models Hierarchical Bayesian Model Be more fleible Generalized Linear Mied Model (GLMM) Incoporating random effects such as individuality parameter estimation MCMC MLE Generalized Linear Model (GLM) Always normal distribution? That's non-sense! MSE Linear model Kubo Doctrine: Learn the evolution of linear-model family, firstly! kubostat2017c (http://goo.gl/76c4i) 2017 (c) 2017 11 14 7 / 47 (y {0, 1, 2, 3, } ) response variable y e.g. egg number () e.g. body size () y? kubostat2017c (http://goo.gl/76c4i) 2017 (c) 2017 11 14 8 / 47 the normal distribution... is NOT this one!? response variable y the Poisson disribution approimates data?! response variable y response variable y? y 0? NO! kubostat2017c (http://goo.gl/76c4i) 2017 (c) 2017 11 14 9 / 47 response variable y fair distribution non-negative mean YES! bye-bye, the normal distribution kubostat2017c (http://goo.gl/76c4i) 2017 (c) 2017 11 14 10 / 47 :? :? body size and fertilization f change seed number y? 2. :? Modeling number of seeds of plants using GLM response variable seed number : {y i } : body size { i } fertilization {f i } f i C: T: i sample size control (f i = C): 50 sample (i {1, 2, 50}) treated (f i = T): 50 sample (i {51, 52, 100}) y i i kubostat2017c (http://goo.gl/76c4i) 2017 (c) 2017 11 14 11 / 47 kubostat2017c (http://goo.gl/76c4i) 2017 (c) 2017 11 14 12 / 47
kubostat2017c p.3 : Reading data file? :? data frame d : data: http://hosho.ees.hokudai.ac.jp/~kubo/ce/eeslecture2017.html#toc4 data3a.csv CSV (comma separated value) format file R : > d <- read.csv("data3a.csv") d data frame ( ) data frame d > d y f 1 6 8.31 C 2 6 9.44 C 3 6 9.50 C...... 99 7 10.86 T 100 9 9.97 T > d$ [1] 8.31 9.44 9.50 9.07 10.16 8.32 10.61 10.06 [9] 9.93 10.43 10.36 10.15 10.92 8.85 9.42 11.11...... [97] 8.52 10.24 10.86 9.97 > d$y [1] 6 6 6 12 10 4 9 9 9 11 6 10 6 10 11 8 [17] 3 8 5 5 4 11 5 10 6 6 7 9 3 10 2 9...... [97] 6 8 7 9 kubostat2017c (http://goo.gl/76c4i) 2017 (c) 2017 11 14 13 / 47 kubostat2017c (http://goo.gl/76c4i) 2017 (c) 2017 11 14 14 / 47 :? data frame d : :? data type and class R f > d$f [1] C C C C C C C C C C C C C C C C C C C C C C C C C [26] C C C C C C C C C C C C C C C C C C C C C C C C C [51] T T T T T T T T T T T T T T T T T T T T T T T T T [76] T T T T T T T T T T T T T T T T T T T T T T T T T Levels: C T data type: factor levels : levels C T 2 > class(d) # d data.frame [1] "data.frame" > class(d$y) # y integer [1] "integer" > class(d$) # numeric [1] "numeric" > class(d$f) # f factor [1] "factor" kubostat2017c (http://goo.gl/76c4i) 2017 (c) 2017 11 14 15 / 47 kubostat2017c (http://goo.gl/76c4i) 2017 (c) 2017 11 14 16 / 47 :? :? data frame summary()! Generate Data Plots! Always! > summary(d) y f Min. : 2.00 Min. : 7.190 C:50 1st Qu.: 6.00 1st Qu.: 9.428 T:50 Median : 8.00 Median :10.155 Mean : 7.83 Mean :10.089 3rd Qu.:10.00 3rd Qu.:10.685 Ma. :15.00 Ma. :12.400 > plot(d$, d$y, pch = c(21, 19)[d$f]) > legend("topleft", legend = c("c", "T"), pch = c(21, 19)) kubostat2017c (http://goo.gl/76c4i) 2017 (c) 2017 11 14 17 / 47 kubostat2017c (http://goo.gl/76c4i) 2017 (c) 2017 11 14 18 / 47
kubostat2017c p.4 :? GLM f (bo-whisker plot) > plot(d$f, d$y) # note that d$f is! 3. GLM log link kubostat2017c (http://goo.gl/76c4i) 2017 (c) 2017 11 14 19 / 47 kubostat2017c (http://goo.gl/76c4i) 2017 (c) 2017 11 14 20 / 47 GLM GLM how to specify linear regression model, a GLM GLM Generalized Linear Model (GLM) probability distribution?? link function? probability distribution : Gaussian distribution : e.g., β 1 + β 2 i link function : : ( ) + ( ) i identity link function kubostat2017c (http://goo.gl/76c4i) 2017 (c) 2017 11 14 21 / 47 kubostat2017c (http://goo.gl/76c4i) 2017 (c) 2017 11 14 22 / 47 GLM (?) GLM how to specify model, a GLM GLM : (response variable) : () (): ( ) = (, intercept) + ( 1) ( 1) + ( 2) ( 2) + ( 3) ( 3) probability distribution Poisson distribution : : e.g., β 1 + β 2 i link function log link function : + kubostat2017c (http://goo.gl/76c4i) 2017 (c) 2017 11 14 23 / 47 kubostat2017c (http://goo.gl/76c4i) 2017 (c) 2017 11 14 24 / 47
kubostat2017c p.5 GLM how to specify logistic regression model, a GLM GLM logistic GLM R (GLM) probability distribution binomial distribution : : e.g., β 1 + β 2 i link function : logit yi i probability distribution random number generation GLM fitting GLM ( ) rbinom() glm(family = binomial) rbinom() glm(family = binomial) rpois() glm(family = poisson) rnbinom() glm.nb() in library(mass) ( ) rgamma() glm(family = gamma) rnorm() glm(family = gaussian) glm() GLM kubostat2017c (http://goo.gl/76c4i) 2017 (c) 2017 11 14 25 / 47 kubostat2017c (http://goo.gl/76c4i) 2017 (c) 2017 11 14 26 / 47 GLM GLM eponential function? seed number y i follows the Poisson distribution y i λ i p(y i λ i ) = λyi i ep( λ i ) y i! mean i λ i? λ i = ep(β 1 + β 2 i ) parameter coefficient β 1 β 2 ( ) body size no f i, for simplicity i i f i i λi λ i = ep(β 1 + β 2 i ) {β 1, β 2} = { 2, 0.8} {β 1, β 2} = { 1, 0.4} i i kubostat2017c (http://goo.gl/76c4i) 2017 (c) 2017 11 14 27 / 47 kubostat2017c (http://goo.gl/76c4i) 2017 (c) 2017 11 14 28 / 47 GLM GLM ( ) GLM a statistical model for this eample mean i λ i λ i = ep(β 1 + β 2 i ) log link function log(λ i ) log link function log( ) = = β 1 + β 2 i probability distribution Poisson distribution : : β 1 + β 2 i link function log link function : log kubostat2017c (http://goo.gl/76c4i) 2017 (c) 2017 11 14 29 / 47 kubostat2017c (http://goo.gl/76c4i) 2017 (c) 2017 11 14 30 / 47
kubostat2017c p.6 R GLM R GLM glm() function 4. R GLM > d y f 1 6 8.31 C 2 6 9.44 C 3 6 9.50 C...... 99 7 10.86 T 100 9 9.97 T Is that all?! > fit <- glm(y ~, data = d, family = poisson) kubostat2017c (http://goo.gl/76c4i) 2017 (c) 2017 11 14 31 / 47 kubostat2017c (http://goo.gl/76c4i) 2017 (c) 2017 11 14 32 / 47 R GLM R GLM glm() output glm() > fit <- glm(y ~, data = d, family = poisson) all: glm(formula = y ~, family = poisson, data = d) Coefficients: (Intercept) 1.2917 0.0757 ( z):? link : z (y)? family:? Degrees of Freedom: 99 Total (i.e. Null); Null Deviance: 89.5 Residual Deviance: 85 AIC: 475 98 Residual kubostat2017c (http://goo.gl/76c4i) 2017 (c) 2017 11 14 33 / 47 kubostat2017c (http://goo.gl/76c4i) 2017 (c) 2017 11 14 34 / 47 R GLM R GLM glm() > summary(fit) Call: glm(formula = y ~, family = poisson, data = d) Deviance Residuals: Min 1Q Median 3Q Ma -2.368-0.735-0.177 0.699 2.376 Coefficients: Estimate Std. Error z value Pr(> z ) (Intercept) 1.2917 0.3637 3.55 0.00038 0.0757 0.0356 2.13 0.03358 ( ) kubostat2017c (http://goo.gl/76c4i) 2017 (c) 2017 11 14 35 / 47 ( ) β 2 (Estimate 0.0757, SE 0.0356) (Estimate 1.29, SE 0.364) β 1 0.0 0.5 1.0 1.5 p p ˆβ p 0.5 ˆβ ( : ) kubostat2017c (http://goo.gl/76c4i) 2017 (c) 2017 11 14 36 / 47
kubostat2017c p.7 R GLM (?) β 2 (Estimate 0.0757, SE 0.0356) (Estimate 1.29, SE 0.364) β 1 0.0 0.5 1.0 1.5 95%? kubostat2017c (http://goo.gl/76c4i) 2017 (c) 2017 11 14 37 / 47 model prediction R GLM > fit <- glm(y ~, data = d, family = poisson)... Coefficients: (Intercept) 1.2917 0.0757 > plot(d$, d$y, pch = c(21, 19)[d$f]) # data > p <- seq(min(d$), ma(d$), length = 100) > lines(p, ep(1.2917 + 0.0757 * p)) the figure shows the relationship between model prediction and data kubostat2017c (http://goo.gl/76c4i) 2017 (c) 2017 11 14 38 / 47 GLM incorporate the fertilization effects in GLM f i GLM 5. GLM + seed number y i follows the Poisson distribution y i λ i mean i λ i fertilization effects β 3 dummy variable f i p(y i λ i ) = λyi i ep( λ i ) y i! λ i = ep(β 1 + β 2 i + β 3 d i ) d i = coefficient { 0 (f i = C ) 1 (f i = T ) kubostat2017c (http://goo.gl/76c4i) 2017 (c) 2017 11 14 39 / 47 kubostat2017c (http://goo.gl/76c4i) 2017 (c) 2017 11 14 40 / 47 output glm(y + f,...) GLM + f model prediction GLM > summary(glm(y ~ + f, data = d, family = poisson))...( )... Coefficients: Estimate Std. Error z value Pr(> z ) (Intercept) 1.2631 0.3696 3.42 0.00063 0.0801 0.0370 2.16 0.03062 ft -0.0320 0.0744-0.43 0.66703 ( ) > plot(d$, d$y, pch = c(21, 19)[d$f]) # data > p <- seq(min(d$), ma(d$), length = 100) > lines(p, ep(1.2631 + 0.0801 * p), col = "blue", lwd = 3) # C > lines(p, ep(1.2631 + 0.0801 * p - 0.032), col = "red", lwd = 3) # T kubostat2017c (http://goo.gl/76c4i) 2017 (c) 2017 11 14 41 / 47 kubostat2017c (http://goo.gl/76c4i) 2017 (c) 2017 11 14 42 / 47
kubostat2017c p.8 GLM multiple s GLM model interpretation depends on link function λi f i = C: λ i = ep(1.26 + 0.0801 i ) f i = T: λ i = ep(1.26 + 0.0801 i 0.032) control = ep(1.26 + 0.0801 i ) ep( 0.032) fertilization i ep( 0.032)! kubostat2017c (http://goo.gl/76c4i) 2017 (c) 2017 11 14 43 / 47 λi (A) log link function i (B) identity link function λ = ep(β 1 + β 2 + ) λ = β 1 + β 2 + multiplicative additive i kubostat2017c (http://goo.gl/76c4i) 2017 (c) 2017 11 14 44 / 47 probability distribution GLM: GLM link function GLM statistaical models appeared in the class y log y 階層ベイズモデル もっと自由な統計モデリングを! 線形モデルの発展 (HBM) 一般化線形混合モデル 個体差 場所差といった変量効果をあつかいたい 一般化線形モデル 正規分布以外の確率分布をあつかいたい (GLMM) 推定計算方法 MCMC 最尤推定法 (GLM) 最小二乗法 線形モデル kubostat2017c (http://goo.gl/76c4i) 2017 (c) 2017 11 14 45 / 47 kubostat2017c (http://goo.gl/76c4i) 2017 (c) 2017 11 14 46 / 47 GLM statistaical models appeared in the class The development of linear models Hierarchical Bayesian Model Be more fleible Generalized Linear Mied Model (GLMM) Incoporating random effects such as individuality parameter estimation MCMC MLE Generalized Linear Model (GLM) Always normal distribution? That's non-sense! MSE Linear model Kubo Doctrine: Learn the evolution of linear-model family, firstly! kubostat2017c (http://goo.gl/76c4i) 2017 (c) 2017 11 14 47 / 47 y 2 4 6 8 10 12 14 Too simple? 7 8 9 10 11 12 GLM The net topic (A) k = 1 (B) k = 7 2 4 6 8 10 12 14 Too comple? 7 8 9 10 11 12 Model selection and statistical test kubostat2017c (http://goo.gl/76c4i) 2017 (c) 2017 11 14 48 / 47