バイオインフォマティクス特論3

藤博幸

1-2-1. 平均と標準偏差を推測する 1-2-2. 7 の科学者 1-2-3. 7 の科学者のコードの改造

ベイズ統計で実践モデリング 4.1 平均と標準偏差を推測する第 4 章ガウス分布を使った推論

データは n 個の独な観測値 x 1, x 2, x n 例 : ある細胞の遺伝の発現量を温度時間などの条件を全く同じ状態で独に n 回計測ガウス分布に従うと仮定このデータが従うガウス分布の平均と標準偏差を推測しよう注意 :Jags ではガウス分布は平均と精度 ( 精度は標準偏差の逆数 ) の 2 つにパラメータで表される

μ σ x 1 x 2 x 3... x n

プレート記法によるグラフィカルモデル μ σ x i i data

連続値離散値観測変数 ( 確率変数 ) 観測変数依存関係

μ σ 尤度仮定からデータ x i は同じ正規分布に従うので n 尤度 =Πdnorm(x i, µ, 1/σ 2 ) i=1 x i i data モデルとしては x i ~ Gaussian(µ, 1/σ 2 ) と表現

事前分布 μ σ µ は平均 0 で精度が小さい (= 分散が大きい ) 正規分布に従うとする µ ~ Gaussian(0, 0.001) x i i data σ は上限を 10 に設定して一様分布で表現 σ ~ Uniform(0, 10) 無情報事前分布あるいはそれに近い形で表現

x <- seq(-10, 10, 0.01) plot(x, dnorm(x, 0, 1000.0),ty='l', ylim=c(0,10^(-10000)))

Jags でのモデルの記述 Gaussian.txt μ x i σ i data # Inferring the Mean and Standard Deviation # of a Gaussian model{ # Data Come From A Gaussian for (i in 1:n){ x[i] ~ dnorm(mu,lambda) } # Priors mu ~ dnorm(0,.001) sigma ~ dunif(0,10) lambda <- 1/pow(sigma,2) }

Gaussian_jags.R 変数のクリア # clears workspace: rm(list=ls()) Rate_1.txt と Rate_1_jags.R のあるディレクトリへの移動 # sets working directories: setwd("/users/toh/desktop/code/parameterestimation/gaussian") library(r2jags) パッケージの読み込み x <- c(1.1, 1.9, 2.3, 1.8) n <- length(x) 観測変数の設定 data <- list("x", "n") # to be passed on to JAGS

μ と σ の初期値の設定 myinits <- list( list(mu = 0, sigma = 1)) # parameters to be monitored: parameters <- c("mu", "sigma") # The following command calls JAGS with specific options. # For a detailed description see the R2jags documentation. samples <- jags(data, inits=myinits, parameters, model.file="gaussian.txt", n.chains=1, n.iter=1000, n.burnin=1, n.thin=1, DIC=T) # Now the values for the monitored parameters are in the "samples" object, # ready for inspection. mu <- samples$bugsoutput$sims.list$mu sigma <- samples$bugsoutput$sims.list$sigma モニターするパラメータの設定 Jags を呼び出して μ と σ をサンプリング

traceplot(samples) par(cex.main = 1.5, mar = c(5, 6, 4, 5) + 0.1, mgp = c(3.5, 1, 0), cex.lab = 1.5, font.lab = 2, cex.axis = 1.3, bty = "n", las=1) Nbreaks <- 80 y <- hist(mu, Nbreaks, plot=f) plot(c(y$breaks, max(y$breaks)), c(0,y$density,0), type="s", lwd=2, lty=1, xlim=c(0,15), ylim=c(0,3), xlab="rate", ylab="posterior Density") par(cex.main = 1.5, mar = c(5, 6, 4, 5) + 0.1, mgp = c(3.5, 1, 0), cex.lab = 1.5, font.lab = 2, cex.axis = 1.3, bty = "n", las=1) Nbreaks <- 80 y <- hist(sigma, Nbreaks, plot=f) plot(c(y$breaks, max(y$breaks)), c(0,y$density,0), type="s", lwd=2, lty=1, xlim=c(0,8), ylim=c(0,5), xlab="rate", ylab="posterior Density")

μ のトレース σ のトレース

μ の事後分布 σ の事後分布

μ の要約 σ の要約 summary(mu) V1 Min. :-7.035 1st Qu.: 1.531 Median : 1.773 Mean : 1.784 3rd Qu.: 2.021 Max. : 8.734 summary(sigma) V1 Min. :0.2040 1st Qu.:0.5188 Median :0.7184 Mean :0.9790 3rd Qu.:1.0996 Max. :9.5059 density(mu)$x[which(density(mu)$y==max(density(mu)$y))] [1] 1.777701 density(sigma)$x[which(density(sigma)$y==max(density(sigma)$y))] [1] 0.5296268

ベイズ統計で実践モデリング 4.2 7 の科学者第 4 章ガウス分布を使った推論

実験スキルのきく異なる 7 の科学者が全員同じ量について測定をう直感的には最初の 2 は適性をいた研究者この量の真の値は 10 をわずかに下回る程度と考えられる 27.020 3.570 8.191 9.808 9.603 9.945 10.056 この測定量についての事後分布を求めて測定される量について推論する副次的には 7 の科学者の測定スキルをついて推論する

σ i 仮定 (1) 全ての科学者の測定はガウス分布に従う (2) 全員同じ量を測定しているのでそれぞれのガウス分布は同じ平均値を持つ λ i (3) 標準偏差の違いで実験スキルの違いを表現 μ x i i 番のデータプレート表記でのグラフィカルモデル

σ 1 σ 2 σ 3 σ 4 σ 5 σ 6 σ 7 λ 1 λ 2 λ 3 λ 4 λ 5 λ 6 λ 7 x 1 x 2 x 3 x 4 x 5 x 6 x 7 μ

# The Seven Scientists model{ # Data Come From Gaussians With Common Mean But Different Precisions for (i in 1:n){ x[i] ~ dnorm(mu,lambda[i]) 尤度の要素の設定 } # Priors mu ~ dnorm(0,.001) for (i in 1:n){ 事前分布の設定 lambda[i] ~ dgamma(.001,.001) sigma[i] <- 1/sqrt(lambda[i]) } }

事前分布の形 x <- seq(0, 1.0, 0.001) plot(x, dgamma(x, 0.001, 0.001), ty='l') 先の例では標準偏差の事前分布として様分布を使ったが今回は精度の事前分布としてガンマ分布を使

SevenScientists_jags.R # clears workspace: rm(list=ls()) 変数のクリア Rate_1.txt と Rate_1_jags.R のあるディレクトリへの移動 # sets working directories: setwd("/users/toh/desttop/code/parameterestimation/gaussian") library(r2jags) パッケージの読み込み x <- c(-27.020,3.570,8.191,9.898,9.603,9.945,10.056) n <- length(x) data <- list("x", "n") # to be passed on to JAGS 観測変数の設定

myinits <- list( list(mu = 0, lambda = rep(1,n))) # parameters to be monitored: parameters <- c("mu", "sigma") μ と λ の初期値の設定モニターするパラメータの設定 Jags を呼び出して μ と σ をサンプリング # The following command calls JAGS with specific options. # For a detailed description see the R2jags documentation. samples <- jags(data, inits=myinits, parameters, model.file ="SevenScientists.txt", n.chains=1, n.iter=1000, n.burnin=1, n.thin=1, DIC=T) # Now the values for the monitored parameters are in the "samples" object, # ready for inspection.

MCMC のサンプリングのトレース traceplot(samples) σ は 7 分出てくる

μ の事後分布のヒストグラム par(cex.main = 1.5, mar = c(5, 6, 4, 5) + 0.1, mgp = c(3.5, 1, 0), cex.lab = 1.5, font.lab = 2, cex.axis = 1.3, bty = "n", las=1) Nbreaks <- 80 y <- hist(mu, Nbreaks, plot=f) plot(c(y$breaks, max(y$breaks)), c(0,y$density,0), type="s", lwd=2, lty=1, xlim=c(0,15), ylim=c(0,3), xlab="rate", ylab="posterior Density")

μ の要約 summary(mu) V1 Min. : 3.446 1st Qu.: 9.851 Median : 9.922 Mean : 9.860 3rd Qu.: 9.979 Max. :10.658

σ の事後分布のヒストグラム Nbreaks <- 80 hist(sigma[,1], Nbreaks) hist(sigma[,7], Nbreaks)

σ の要約 summary(sigma) V1 V2 V3 V4 V5 V6 V7 Min. : 11.38 Min. : 0.091 Min. : 0.0893 Min. : 0.01756 Min. : 0.01935 Min. : 0.01721 Min. : 0.01861 1st Qu.: 32.33 1st Qu.: 5.347 1st Qu.: 1.4573 1st Qu.: 0.07211 1st Qu.: 0.22762 1st Qu.: 0.06734 1st Qu.: 0.11319 Median : 54.58 Median : 8.719 Median : 2.5830 Median : 0.14569 Median : 0.40984 Median : 0.13662 Median : 0.22778 Mean : 336.26 Mean : 64.568 Mean : 11.3078 Mean : 0.90485 Mean : 1.42081 Mean : 0.55492 Mean : 1.49933 3rd Qu.: 115.78 3rd Qu.: 16.879 3rd Qu.: 5.1573 3rd Qu.: 0.35877 3rd Qu.: 0.89008 3rd Qu.: 0.32470 3rd Qu.: 0.60704 Max. :169014.80 Max. :14080.411 Max. :1861.6139 Max. :156.96397 Max. :101.48750 Max. :27.16168 Max. :270.28688 とだけでなく 3 もスキルに問題がありそう

モード ( 最頻値 ) を出 for (i in 1:n) { print(density(sigma[,i])$x[which(density(sigma[,i])$y==max(density(sigma[,i])$y))]) } [1] -27.84334 [1] 6.507084 [1] 2.14607 [1] 0.05037881 [1] 2.226777 [1] 0.1280672 [1] -0.152803 結果が平均やメジアン ( 中央値 ) ほど安定していない MCMC のステップ数 ( サンプル数 ) が少ないせいか?

ステップ数 ( サンプル数 ) が少ないバーンインやシンニングが考慮されていない正確な不変分布 ( 事後分布 ) が得られていないのでは?

burn in を100 ß--- n.burnin thinningを50 ß--- n.thin MCMCの全ステップ数を100000 ß--- n.iter chainを2にして初期値を与える ß-- n.chains SevenScientistsv2_jags.R (=SevenScientists_jags.R をコピーしたもの ) を書き換える 2 回目の Rate_1_jags.R を参考にできるモデルは同じなので SevenScientists.txt は変更しなくて良い

(1) 実部分を書き換える samples <- jags(data, inits=myinits, parameters, model.file ="SevenScientists.txt", n.chains=1, n.iter=1000, n.burnin=1, n.thin=1, DIC=T) samples <- jags(data, inits=myinits, parameters, model.file ="SevenScientists.txt", n.chains=2, n.iter=100000, n.burnin=100, n.thin=50, DIC=T)

(2) 2 つの連鎖の初期値を与える myinits <- list( list(mu = 0, lambda = rep(1,n))) myinits <- list( list(mu = 0, lambda = rep(1,n)), list(mu = 10.0, lamda = rep(10, n)))

結果の要約 print(samples) Inference for Bugs model at "SevenScientists.txt", fit using jags, 2 chains, each with 1e+05 iterations (first 100 discarded), n.thin = 50 n.sims = 3996 iterations saved mu.vect sd.vect 2.5% 25% 50% 75% 97.5% Rhat n.eff mu 9.914 0.155 9.581 9.877 9.930 9.982 10.115 1.008 4000 sigma[1] 289.514 4335.256 16.386 31.753 55.113 115.877 1157.041 1.001 4000 sigma[2] 72.155 2657.668 2.765 5.471 9.256 19.637 169.932 1.001 4000 sigma[3] 13.493 203.082 0.747 1.465 2.488 5.293 39.622 1.002 4000 sigma[4] 2.879 140.652 0.027 0.067 0.131 0.310 3.534 1.001 2200 sigma[5] 2.354 22.821 0.079 0.256 0.454 0.964 11.013 1.001 4000 sigma[6] 0.748 5.373 0.026 0.063 0.130 0.311 4.453 1.001 4000 sigma[7] 1.386 15.690 0.037 0.106 0.210 0.476 5.245 1.001 4000 deviance 23.240 5.306 15.045 19.415 22.644 26.268 35.342 1.001 4000 For each parameter, n.eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor (at convergence, Rhat=1). DIC info (using the rule, pd = var(deviance)/2) pd = 14.1 and DIC = 37.3 DIC is an estimate of expected predictive error (lower deviance is better).

mixing の確認 traceplot traceplot(samples)

MCMC の 2 つの連鎖をそれぞれ取り出す mu <- samples$bugsoutput$sims.array[,,2] max(mu) # [1] 13.00318 min(mu) # [1] 7.929883 plot(mu[,1],ylim=c(7,14),ty='l',col=2) par(new=t) plot(mu[,2],ylim=c(7,14),ty='l',col=4)

summary(mu[,1]) Min. 1st Qu. Median Mean 3rd Qu. Max. 7.930 9.876 9.929 9.912 9.983 11.701 summary(mu[,2]) Min. 1st Qu. Median Mean 3rd Qu. Max. 8.360 9.877 9.931 9.916 9.980 13.003 density(mu[,1])$x[which(density(mu[,1])$y==max(density(mu[,1])$y))] [1] 9.932694 density(mu[,2])$x[which(density(mu[,2])$y==max(density(mu[,2])$y))] [1] 9.935837

Nbreaks <- 80 x <- hist(mu[,1],nbreaks,plot=f) y <- hist(mu[,2],nbreaks,plot=f) plot(c(x$breaks, max(x$breaks)), c(0,x$density,0), type="s", lwd=2, lty=1, xlim=c(7,12), ylim=c(0,7), xlab="mu", ylab="posterior Density", col=2) par(new=t) plot(c(y$breaks, max(y$breaks)), c(0,y$density,0), type="s", lwd=2, lty=1, xlim=c(7,12), ylim=c(0,7), xlab="mu", ylab="posterior Density", col=4) 2 つの MCMC で得られた μ の事後分布がほぼ同じ形に収束していることがわかる

sigma <- samples$bugsoutput$sims.array[,,3:9] max(sigma[,,7]) # [1] 786.2002 min(sigma[,,7]) # [1] 0.01580951 plot(sigma[,1,7],ylim=c(0,790),ty='l',col=2) par(new=t) plot(sigma[,2,7],ylim=c(0,790),ty='l',col=4)

for (i in 1:7) { print(i) for (j in 1:2) { print(summary(sigma[,j,i])) print(density(sigma[,j,i])$x[which(density(sigma[,j,i])$y==max(density(sigma[,j,i])$y))]) } print("//") } [1] 1 Min. 1st Qu. Median Mean 3rd Qu. Max. 10.64 32.12 56.28 268.19 117.30 107734.56 [1] 184.0602 Min. 1st Qu. Median Mean 3rd Qu. Max. 9.31 31.60 54.29 310.83 115.19 240874.91 [1] -27.53077 [1] "//" [1] 2 Min. 1st Qu. Median Mean 3rd Qu. Max. 1.83 5.42 9.19 114.71 19.70 167799.81 [1] -4.466137 Min. 1st Qu. Median Mean 3rd Qu. Max. 1.547 5.477 9.333 29.599 19.480 1684.698 [1] 5.329852 [1] "//"

[1] 3 Min. 1st Qu. Median Mean 3rd Qu. Max. 0.025 1.452 2.473 18.294 5.141 11397.367 [1] -1.60123 Min. 1st Qu. Median Mean 3rd Qu. Max. 0.2213 1.4916 2.4904 8.6934 5.5065 2902.6972 [1] 4.138908 [1] "//" [1] 4 Min. 1st Qu. Median Mean 3rd Qu. Max. 0.01543 0.06639 0.12601 0.61530 0.29927 114.12277 [1] 0.1365077 Min. 1st Qu. Median Mean 3rd Qu. Max. 0.015 0.069 0.133 5.143 0.320 8888.641 [1] -0.09578296 [1] "//" [1] 5 Min. 1st Qu. Median Mean 3rd Qu. Max. 0.0173 0.2503 0.4404 2.4581 0.9620 970.9479 [1] 1.604904 Min. 1st Qu. Median Mean 3rd Qu. Max. 0.0180 0.2621 0.4644 2.2492 0.9635 495.8900 [1] 0.6804895 [1] "//"

[1] 6 Min. 1st Qu. Median Mean 3rd Qu. Max. 0.01388 0.06223 0.13098 0.76116 0.31199 231.79103 [1] 0.3578206 Min. 1st Qu. Median Mean 3rd Qu. Max. 0.01375 0.06430 0.12904 0.73421 0.30944 115.64845 [1] 0.132436 [1] "//" [1] 7 Min. 1st Qu. Median Mean 3rd Qu. Max. 0.0158 0.1050 0.2094 1.5741 0.5039 786.2002 [1] -0.1599903 Min. 1st Qu. Median Mean 3rd Qu. Max. 0.01779 0.10748 0.21135 1.19853 0.44613 214.29994 [1] 0.2884735 [1] "//"