2017/11/2 Python-statistics4: Learning Statistics with Python (4). This material is based on Yamada, Sugisawa, and Murai (2008), R によるやさしい統計学.


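Learning Statistics with Python (4)

This material is based on Yamada, Sugisawa, and Murai (2008), R によるやさしい統計学.

This lecture covers statistical hypothesis testing. The aim is to understand the procedure of a statistical hypothesis test and to become fluent in its terminology. We also study four representative tests: the test based on the standard normal distribution, the test based on the t distribution, the test of no correlation, and the chi-square test.

Topics:
- The need for statistical hypothesis testing
- Procedure and terminology of statistical hypothesis testing
- Test using the standard normal distribution (test of one mean: variance known)
- Test using the t distribution (test of one mean: variance unknown)
- Test of the correlation coefficient (test of no correlation)
- Test of independence (chi-square test)
- Summary of functions
- Exercises

The need for statistical hypothesis testing

Look at the scatter plot below (the circle is drawn only to indicate how the points are distributed):

(Figure: scatter plot of b against a, with a circle of radius 25 centered at (50, 50))

From this plot, the two variables a and b appear to be uncorrelated. In fact corrcoef(a, b) = 0.034, so we can say there is no correlation. However, data xa and xb obtained by randomly sampling 30 points each from a and b (their scatter plot is shown below) sometimes show a weak correlation (the code below searches for a sample with |r| > 0.30).

Reference: creating correlated data from an uncorrelated population

The following is how the correlated data were produced from the uncorrelated population: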


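In [185]:
    import matplotlib.pyplot as plt
    import numpy.random as random            # needed for random.normal below

    fig = plt.figure()
    ax = fig.add_subplot(111)
    a = random.normal(50, 10, 500)
    b = random.normal(50, 10, 500)
    ax.scatter(a, b, c='w', edgecolors='k')  # white markers with black edges so they stay visible
    ax.set_xlabel('a')
    ax.set_ylabel('b')
    # draw a circle of radius 25 centered at (50, 50)
    circle = plt.Circle((50, 50), 25, fill=False, color='b')
    ax.add_patch(circle)
    aspect = (ax.get_xlim()[1] - ax.get_xlim()[0]) / (ax.get_ylim()[1] - ax.get_ylim()[0])
    ax.set_aspect(aspect)                    # adjust the scale of the two axes
    plt.show()

In [34]:
    %matplotlib inline
    import matplotlib.pyplot as plt
    import numpy as np
    import numpy.random as random

    a = random.normal(50, 10, 500)
    b = random.normal(50, 10, 500)
    # resample up to 100 times, stopping at the first sample with |r| > 0.30
    for _ in range(100):
        xa = random.choice(a, 30)
        xb = random.choice(b, 30)
        if abs(np.corrcoef(xa, xb)[0, 1]) > 0.30:
            break
    plt.scatter(xa, xb)
    plt.xlabel('xa')
    plt.ylabel('xb')
    x = np.linspace(10, 90, 10000)
    lm = np.polyfit(xa, xb, 1)               # least-squares regression line
    plt.plot(x, lm[0]*x + lm[1], "g")
    print(np.corrcoef(xa, xb)[0, 1])

These data were, in fact, produced deliberately. But suppose you write a paper showing the effectiveness of a device you invented or the effect of a drug you developed. You may face the suspicion that you collected only convenient data, or, even if nothing was manipulated, that the data are a pure coincidence and the graph would look different with many more observations. To answer such doubts you must show two things. First (as learned in the previous chapter), the sample was drawn at random, so you did not pick convenient data. Second (as learned in this chapter), if the population had no correlation at all, the probability of obtaining such a result from a sample would be very small, so a similar result is likely to appear again when more data are collected. Statistical hypothesis testing is the probability-based method for making the second claim (the first, random sampling, is a precondition for any statistical analysis).

Procedure and terminology of statistical hypothesis testing

The general procedure of a statistical hypothesis test is shown in the following table:

| Step | What to do |
|------|------------|
| 1 | Set up the null hypothesis and the alternative hypothesis about the population |
| 2 | Choose the test statistic |
| 3 | Decide the significance level α |
| 4 | After collecting the data, compute the realized value of the test statistic from the data |
| 5 | Conclusion: if the realized value falls in the rejection region, reject the null hypothesis and adopt the alternative hypothesis; otherwise retain the null hypothesis |

1. Null hypothesis and alternative hypothesis

Null hypothesis: the hypothesis that the proposed method does not differ from the conventional method, or that the proposed method has no effect. It is the opposite of what you actually want to claim, and the test is carried out with the goal of rejecting it. Concretely, it is a hypothesis such as: mean $\mu = 0$ (the mean is 0), correlation coefficient $\rho = 0$ (no correlation), or difference of means $\mu_1 - \mu_2 = 0$ (no difference).

Alternative hypothesis: the hypothesis adopted when the null hypothesis is rejected. It is the opposite of the null hypothesis and expresses what you want to show or claim with the experiment. Concretely, it is a hypothesis such as: mean $\mu \neq 0$ (the mean is not 0), correlation coefficient $\rho \neq 0$ (there is a correlation), or difference of means $\mu_1 - \mu_2 \neq 0$ (there is a difference).

Depending on how the alternative hypothesis is set, the test is carried out in one of the following two ways (the two-sided test is the stricter condition and is the usual choice):
- One-sided test: the alternative hypothesis is mean $\mu > 0$ (or $\mu < 0$), correlation coefficient $\rho > 0$ (or $\rho < 0$), or $\mu_1 > \mu_2$ (or $\mu_1 < \mu_2$).
- Two-sided test: the alternative hypothesis is mean $\mu \neq 0$, correlation coefficient $\rho \neq 0$, or $\mu_1 - \mu_2 \neq 0$. In other words, to examine $\mu \neq 0$ a two-sided test has to examine both $\mu > 0$ and $\mu < 0$.

The analysis proceeds as follows:
- The analysis is carried out assuming that the null hypothesis is true.
- Whether to reject is judged from the value of the test statistic computed from the data actually obtained.
- If, assuming the null hypothesis is true, the test statistic takes a value that could hardly occur (a sufficiently extreme value), the null hypothesis is rejected (that is, the alternative hypothesis expressing the original claim is adopted).
- Otherwise (if the value could well occur by chance), the null hypothesis is retained (in this case the alternative hypothesis you wanted to claim is not supported).

2. Test statistic

Test statistic: a sample statistic used for a statistical hypothesis test. Typical examples are $t$, $\chi^2$, and $F$.

Realized value of the test statistic: the concrete value computed from the actual data (the sample at hand). The better the data fit the alternative hypothesis, the farther the realized value lies from 0.

3. Significance level and rejection region

The significance level (written α) is the criterion for deciding whether to adopt the alternative hypothesis. It is usually set to 5% or 1% (α = 0.05 or α = 0.01). The reasoning is: if the sample is of a kind that appears at most 5 times in 100 (for the 5% level) when the null hypothesis is true, such a thing is unlikely to be a coincidence, so it is reasonable to conclude that the null hypothesis does not hold.

- Null distribution: the sampling distribution of the test statistic when the null hypothesis is assumed true. Probabilities are computed from it.
- Rejection region: the range of values of the test statistic that are very unlikely under the null hypothesis. It is the region in which the null hypothesis is rejected (so this is where you hope the statistic falls).
- Acceptance region: everything outside the rejection region, where the null hypothesis is retained.
- Critical value: the boundary between the rejection region and the acceptance region.

If the realized value of the test statistic falls in the rejection region, the null hypothesis is rejected, and what you wanted to claim is adopted.

(Figure: rejection region when the null distribution is a normal distribution)

4 & 5. Reporting the result of a statistical hypothesis test

If the realized value of the test statistic falls in the rejection region, reject the null hypothesis of "no difference" and adopt the alternative hypothesis of "a difference". Report this as "the result is significant at the 5% (or 1%) level" or "a significant difference was found at p < .05 (or p < .01)".

If the null hypothesis cannot be rejected, write "the difference was not significant" or "no significant difference was found".

Exercise 4-1
You have built a device that detects randomly placed objects (for example land mines, oil, or buried deposits) from satellite sensor data within a limited time (for example one hour). On 100 data sets its detection rate was 0.70. You want to show by statistical hypothesis testing that it outperforms a conventional product (advertised detection rate 0.60). What null and alternative hypotheses should you set up? Should the test be one-sided or two-sided, and what significance level would you choose? State your reasoning.

In [ ]:

p-value

p-value: the probability, assuming the null hypothesis is true, of obtaining a value at least as extreme as the realized value of the test statistic computed from the sample. The null hypothesis is rejected when the p-value is smaller than the significance level.

[Reference: what a small p-value means] The size of the p-value is what decides whether the alternative hypothesis is adopted (the null hypothesis rejected). A small p-value means that, if the null hypothesis were true, something that almost never happens has happened (at most 5 times in 100 at the 5% level, at most 1 time in 100 at the 1% level). Conversely, a large p-value means that something quite ordinary has happened, so this result does not allow us to say there is a difference.

Type I and Type II errors

Type I error (α): rejecting the null hypothesis when it is true. The probability of this error is the significance level, also called the risk level.

Type II error (β): retaining (failing to reject) the null hypothesis when it is false, that is, concluding there is no difference when there really is one.

Power

Power: when the null hypothesis is false, the total probability 1 minus the probability β of a Type II error, that is, $1 - \beta$. It is the probability of not committing a Type II error, in other words the probability of correctly rejecting a false null hypothesis.

Test using the standard normal distribution (test of one mean: variance known)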
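As a warm-up for this section, the following minimal sketch illustrates by simulation what the significance level and the power mean for a two-sided test on a standardized sample mean. The function name `rejection_rate`, the seed, the number of trials, and the alternative mean 14 are illustrative assumptions and do not come from the original notebook; the values μ0 = 12, σ² = 10, and n = 20 mirror the worked example of this section.

```python
import numpy as np
import scipy.stats as st

def rejection_rate(true_mu, mu0=12.0, sigma2=10.0, n=20, alpha=0.05,
                   trials=20000, seed=0):
    """Fraction of simulated samples for which a two-sided z test rejects H0: mu = mu0."""
    rng = np.random.default_rng(seed)
    critical = st.norm.ppf(1 - alpha / 2)            # upper critical value of N(0, 1)
    samples = rng.normal(true_mu, np.sqrt(sigma2), size=(trials, n))
    # standardize each sample mean using the known population variance
    z = (samples.mean(axis=1) - mu0) / np.sqrt(sigma2 / n)
    return np.mean(np.abs(z) > critical)

type1_rate = rejection_rate(true_mu=12.0)  # H0 is true: rejections are Type I errors
power = rejection_rate(true_mu=14.0)       # H0 is false: rejections are correct decisions
print("Type I error rate: %.3f" % type1_rate)
print("Power at mu = 14:  %.3f" % power)
```

When the null hypothesis is true, the rejection rate should come out close to α; when it is false, the rejection rate estimates the power 1 − β for that particular alternative.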

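When a random sample of size $n$ is drawn from a normal population $N(\mu, \sigma^2)$, the distribution of the sample mean is also normal. The mean of the sample mean is [A] and its variance is [B] (question: write the symbols that belong in A and B; see Exercise 4-2). The standardized sample mean is used as the test statistic ($\bar{X}$ is the mean of the sample data):

$$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$$

Exercise 4-2
When a random sample is drawn from a normal population $N(\mu, \sigma^2)$, write how the mean and the variance of the sample mean are expressed in theory (that is, fill in [A] and [B] above). Also explain why the test statistic obtained by standardizing it has the form of $Z$.
[Hint] Reread the section on obtaining the sampling distribution (Rstatistics-03.html#makingSample). For standardization, see the section on standardization (Rstatistics-01.html#RS01:normalization).

Hands-on with Python

Example: suppose scores on a psychology test follow the normal distribution N(12, 10). Decide whether the following data (called the "teaching-method data") can be regarded as a random sample from this population.

In [36]:
    from __future__ import division
    import numpy as np
    SampleData = np.array([13,14,7,12,10,6,8,15,4,14,9,6,10,12,5,12,8,8,12,15])

Proceed in the following steps:

1. Set up the null and alternative hypotheses: the null hypothesis is that the data can be regarded as a random sample, that is $\mu = 12$; the alternative hypothesis is that they are not a random sample, that is $\mu \neq 12$.
2. Choose the test statistic: the standardized sample mean (written Z).
3. Decide the significance level: two-sided test at the 5% level, that is α = 0.05.
4. Compute the realized value of the test statistic:

In [38]:
    z = (np.mean(SampleData) - 12) / (10.0/len(SampleData))**0.5   # standardization
    z

5. Decide whether to reject or retain the null hypothesis: under the null hypothesis the statistic follows the standard normal distribution. Use the ppf function of the scipy.stats module to find the critical values, or the cdf function to find the p-value.

- Lower-tail probability: for a random variable Z following the standard normal distribution, the probability that Z is at most a given value α, $\mathrm{Prob}(Z \le \alpha)$.
- Upper-tail probability: for the same Z, the probability that Z exceeds α, $\mathrm{Prob}(Z > \alpha)$.

In [45]:
    import scipy.stats as st
    st.norm.ppf(0.025)   # z value whose lower-tail probability is 0.05/2 = 0.025
    # lower tail: H0 is rejected if the Z value is smaller than this

In [46]:
    # z value whose upper-tail probability is 0.05/2 = 0.025 (lower-tail probability 0.975)
    st.norm.ppf(0.975)
    # upper tail: H0 is rejected if the Z value is larger than this

In [4]:
    help(st.norm.ppf)

    Help on method ppf in module scipy.stats._distn_infrastructure:

    ppf(self, q, *args, **kwds) method of scipy.stats._continuous_distns.norm_gen instance
        Percent point function (inverse of `cdf`) at q of the given RV.

        Parameters
        q : array_like
            lower tail probability
        arg1, arg2, arg3,... : array_like
            The shape parameter(s) for the distribution (see docstring of the
            instance object for more information)
        loc : array_like, optional
            location parameter (default=0)
        scale : array_like, optional
            scale parameter (default=1)

        Returns
        x : array_like
            quantile corresponding to the lower tail probability q.

As a result, the rejection region is Z < -1.96 or Z > 1.96, so the value of Z falls in the rejection region. The conclusion is therefore: at the 5% significance level, the teaching-method data cannot be regarded as a random sample from the psychology test population.

The p-value can also be obtained directly with the cdf function:

In [47]:
    st.norm.cdf(z)       # lower-tail probability
    # taken as a lower-tail probability, the p-value is 0.0023, a small value (< 0.05)

In [48]:
    1 - st.norm.cdf(z)   # upper-tail probability Prob(Z > z)

In [49]:
    # two-sided test, so double the lower-tail probability
    2*st.norm.cdf(z)
    # even after doubling for the two-sided test, the p-value is small (< 0.05)

In [3]:
    help(st.norm.cdf)

    Help on method cdf in module scipy.stats._distn_infrastructure:

    cdf(self, x, *args, **kwds) method of scipy.stats._continuous_distns.norm_gen instance
        Cumulative distribution function of the given RV.

        Parameters
        x : array_like
            quantiles
        arg1, arg2, arg3,... : array_like
            The shape parameter(s) for the distribution (see docstring of the
            instance object for more information)
        loc : array_like, optional
            location parameter (default=0)
        scale : array_like, optional
            scale parameter (default=1)

        Returns
        cdf : ndarray
            Cumulative distribution function evaluated at `x`

Exercise 4-3
Draw the graph of the standard normal distribution, mark the rejection region for the 5% significance level, and overlay the position of the Z value from the example.
[Hint] This extends and modifies the function introduced in the previous chapter's section on the normal distribution (the function that shades a region on the normal distribution graph). Shading the region at or below the Z value in orange gives a figure like the following:

(Figure: standard normal density with the rejection region marked and the region below the Z value shaded in orange)

Test using the t distribution (test of one mean: variance unknown)

Even for a random sample from a normal population, the previous method cannot be used when the population variance $\sigma^2$ is unknown, because the test statistic used there cannot be computed. Instead of the square root $\sigma$ of the population variance, we use the square root $\hat{\sigma}$ of the unbiased variance computed from the sample, and take the following as the test statistic. It follows the t distribution with $n - 1$ degrees of freedom (df):

$$t = \frac{\bar{X} - \mu}{\hat{\sigma} / \sqrt{n}}$$

- t distribution: a symmetric probability distribution, similar in shape to the normal distribution, that is widely used in statistics.
- Degrees of freedom (df): the parameter that determines the shape of the t distribution.

(Figure: t distributions for several degrees of freedom)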
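To see how the degrees of freedom shape the t distribution, the following minimal sketch compares the two-sided 5% critical value of the t distribution with that of the standard normal distribution. The list of degrees of freedom is an arbitrary illustrative choice; only `st.norm.ppf` and `st.t.ppf`, which the notebook already uses, are called.

```python
import scipy.stats as st

alpha = 0.05
z_crit = st.norm.ppf(1 - alpha / 2)   # upper critical value of N(0, 1)

# upper critical value of the t distribution for several degrees of freedom
t_crits = {df: st.t.ppf(1 - alpha / 2, df) for df in (2, 5, 19, 100, 1000)}

for df, t_crit in t_crits.items():
    print("df = %4d: t critical = %.3f (normal: %.3f)" % (df, t_crit, z_crit))
```

With few degrees of freedom the tails are heavier, so the critical value is larger; as df grows, the critical value approaches the normal one.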

Hands-on with Python

Example: suppose scores on a psychology test follow a normal distribution with mean 12 (the variance is unknown!). Decide whether the teaching-method data (SampleData) given in the previous section can be regarded as a random sample from this population.

Proceed in the following steps:

1. Set up the null and alternative hypotheses: the null hypothesis is that the data can be regarded as a random sample, that is $\mu = 12$; the alternative hypothesis is that they are not a random sample, that is $\mu \neq 12$.
2. Choose the test statistic: using the square root $\hat{\sigma}$ of the unbiased sample variance, take $t = \dfrac{\bar{X} - \mu}{\hat{\sigma} / \sqrt{n}}$ as the test statistic.
3. Decide the significance level: two-sided test at the 5% level, that is α = 0.05.
4. Compute the realized value of the test statistic:

    t = (np.mean(SampleData) - 12) / (np.var(SampleData, ddof=1)/len(SampleData))**0.5   # test statistic

5. Decide whether to reject or retain the null hypothesis: under the null hypothesis this statistic follows the t distribution with df = n - 1 = 19 degrees of freedom.

    st.t.ppf(0.025, 19)   # df=19: t value whose lower-tail probability is 0.05/2 = 0.025 (ppf of scipy.stats.t)
    # lower tail: H0 is rejected if the t value is smaller than this
    st.t.ppf(0.975, 19)   # df=19: t value whose upper-tail probability is 0.05/2 = 0.025 (lower-tail 0.975)
    # upper tail: H0 is rejected if the t value is larger than this

   As a result, the rejection region is t < -2.093 or t > 2.093, so the value of t falls in the rejection region. The p-value can also be obtained directly with the cdf function:

    st.t.cdf(t, 19)            # lower-tail probability (cdf of scipy.stats.t)
    # taken as a lower-tail probability, the p-value is small (< 0.05)
    print(1 - st.t.cdf(t, 19)) # upper-tail probability Prob(T > t)
    print(2*st.t.cdf(t, 19))   # two-sided test, so double the lower-tail probability
    # doubled for the two-sided test, the p-value is 0.017, a small value (< 0.05)

6. Conclusion: at the 5% significance level, the teaching-method data cannot be regarded as a random sample from the psychology test population.

In [186]:
    # running the steps above
    t = (np.mean(SampleData) - 12) / (np.var(SampleData, ddof=1)/len(SampleData))**0.5   # test statistic
    print("t = %f" % t)

In [187]:
    import scipy.stats as st
    st.t.ppf(0.025, 19)   # df=19: lower critical value

In [188]:
    st.t.ppf(0.975, 19)   # df=19: upper critical value

In [189]:
    st.t.cdf(t, 19)       # lower-tail probability; the p-value is small (< 0.05)

In [190]:
    print(1 - st.t.cdf(t, 19))   # upper-tail probability Prob(T > t)
    print(2*st.t.cdf(t, 19))     # two-sided p-value: 0.017 (< 0.05)

In [5]:
    help(st.t.ppf)

    Help on method ppf in module scipy.stats._distn_infrastructure:

    ppf(self, q, *args, **kwds) method of scipy.stats._continuous_distns.t_gen instance
        Percent point function (inverse of `cdf`) at q of the given RV.

        Parameters
        q : array_like
            lower tail probability
        arg1, arg2, arg3,... : array_like
            The shape parameter(s) for the distribution (see docstring of the
            instance object for more information)
        loc : array_like, optional
            location parameter (default=0)
        scale : array_like, optional
            scale parameter (default=1)

        Returns
        x : array_like
            quantile corresponding to the lower tail probability q.

In [6]:
    help(st.t.cdf)

    Help on method cdf in module scipy.stats._distn_infrastructure:

    cdf(self, x, *args, **kwds) method of scipy.stats._continuous_distns.t_gen instance
        Cumulative distribution function of the given RV.

        Parameters
        x : array_like
            quantiles
        arg1, arg2, arg3,... : array_like
            The shape parameter(s) for the distribution (see docstring of the
            instance object for more information)
        loc : array_like, optional
            location parameter (default=0)
        scale : array_like, optional
            scale parameter (default=1)

        Returns
        cdf : ndarray
            Cumulative distribution function evaluated at `x`

A Python function for the t test: the function that does all of the above in one call is ttest_1samp in the scipy.stats module.
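As a quick check that the manual steps and `ttest_1samp` agree, here is a minimal sketch that computes the t statistic and the two-sided p-value by hand and compares them with the function's result. The variable names `t_manual`, `p_manual`, and `result` are illustrative; the data are the SampleData array of the example.

```python
import numpy as np
import scipy.stats as st

SampleData = np.array([13,14,7,12,10,6,8,15,4,14,9,6,10,12,5,12,8,8,12,15])
mu0 = 12.0
n = len(SampleData)

# manual computation: t = (mean - mu0) / sqrt(unbiased variance / n)
t_manual = (np.mean(SampleData) - mu0) / np.sqrt(np.var(SampleData, ddof=1) / n)
# two-sided p-value: twice the tail probability beyond |t| with df = n - 1
p_manual = 2 * st.t.cdf(-abs(t_manual), n - 1)

# the same test in a single call
result = st.ttest_1samp(SampleData, mu0)

print("manual:      t = %f, p = %f" % (t_manual, p_manual))
print("ttest_1samp: t = %f, p = %f" % (result.statistic, result.pvalue))
```

Using `-abs(t_manual)` makes the doubling correct whichever sign the statistic has, which the plain `2*st.t.cdf(t, 19)` form only does for a negative t.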

In [70]:
    import scipy.stats as st
    help(st.ttest_1samp)

    Help on function ttest_1samp in module scipy.stats.stats:

    ttest_1samp(a, popmean, axis=0, nan_policy='propagate')
        Calculates the T-test for the mean of ONE group of scores.

        This is a two-sided test for the null hypothesis that the expected value
        (mean) of a sample of independent observations `a` is equal to the given
        population mean, `popmean`.

        Parameters
        a : array_like
            sample observation
        popmean : float or array_like
            expected value in null hypothesis, if array_like than it must have the
            same shape as `a` excluding the axis dimension
        axis : int or None, optional
            Axis along which to compute test. If None, compute over the whole
            array `a`.
        nan_policy : {'propagate', 'raise', 'omit'}, optional
            Defines how to handle when input contains nan. 'propagate' returns nan,
            'raise' throws an error, 'omit' performs the calculations ignoring nan
            values. Default is 'propagate'.

        Returns
        statistic : float or array
            t-statistic
        pvalue : float or array
            two-tailed p-value

        Examples
        >>> from scipy import stats
        >>> np.random.seed( )  # fix seed to get the same result
        >>> rvs = stats.norm.rvs(loc=5, scale=10, size=(50,2))

        Test if mean of random sample is equal to true mean, and different mean.
        We reject the null hypothesis in the second case and don't reject it in
        the first case.

        >>> stats.ttest_1samp(rvs,5.0)
        >>> stats.ttest_1samp(rvs,0.0)

        Examples using axis and non-scalar dimension for population mean.

        >>> stats.ttest_1samp(rvs,[5.0,0.0])
        >>> stats.ttest_1samp(rvs.T,[5.0,0.0],axis=1)
        >>> stats.ttest_1samp(rvs,[[5.0],[0.0]])

Its use is shown with the teaching-method data (SampleData): st.ttest_1samp(data, μ)

In [72]:
    st.ttest_1samp(SampleData, 12.0)

Out[72]:
    Ttest_1sampResult(statistic= , pvalue= )

From this output we can read off the t value and the p-value (two-sided test).

Test of the correlation coefficient (test of no correlation)

Test of no correlation: a test carried out under the assumption that the correlation in the population is 0. To test the population correlation coefficient, compute the following from the sample correlation coefficient $r$ and use it as the test statistic:

$$t = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}}$$

Hands-on with Python

Example: test the correlation coefficient between the scores of statistics test 1 (StatTest1) and statistics test 2 (StatTest2) given below. Use a significance level of 5%.

In [73]:
    import numpy as np
    StatTest1 = np.array([6,10,6,10,5,3,5,9,3,3,11,6,11,9,7,5,8,7,7,9])
    StatTest2 = np.array([10,13,8,15,8,6,9,10,7,3,18,14,18,11,12,5,7,12,7,7])

Proceed in the following steps:

1. Set up the null and alternative hypotheses: the null hypothesis is $\rho = 0$, that is, the population correlation is 0; the alternative hypothesis is $\rho \neq 0$, that is, the population correlation is not 0.
2. Choose the test statistic: $t = \dfrac{r\sqrt{n - 2}}{\sqrt{1 - r^2}}$
3. Decide the significance level: two-sided test at the 5% level, that is α = 0.05.
4. Compute the realized value of the test statistic:

    SampleCorr = np.corrcoef(StatTest1, StatTest2)[0,1]
    print("Sample Correlation = %f" % SampleCorr)   # sample correlation
    SampleSize = len(StatTest1)
    tdividend = SampleCorr * (SampleSize - 2.0)**0.5
    tdivider = (1.0 - SampleCorr**2)**0.5
    t = tdividend/tdivider
    print("t statistics = %f" % t)

5. Decide whether to reject or retain the null hypothesis: under the null hypothesis this statistic follows the t distribution with df = n - 2 = 18 degrees of freedom.

    import scipy.stats as st
    print(st.t.ppf(0.025, 18))   # df=18: t value whose lower-tail probability is 0.05/2 = 0.025
    # lower tail: H0 is rejected if the t value is smaller than this
    print(st.t.ppf(0.975, 18))   # df=18: t value whose upper-tail probability is 0.05/2 = 0.025 (lower-tail 0.975)
    # upper tail: H0 is rejected if the t value is larger than this

6. As a result, the rejection region is t < -2.101 or t > 2.101, so the value of t falls in the rejection region. Conclusion: at the 5% significance level, statistics test 1 (StatTest1) and statistics test 2 (StatTest2) show a statistically significant, strong correlation (correlation coefficient 0.75).

In [83]:
    from __future__ import division
    import numpy as np
    SampleCorr = np.corrcoef(StatTest1, StatTest2)[0,1]
    print("Sample Correlation = %f" % SampleCorr)   # sample correlation
    SampleSize = len(StatTest1)
    tdividend = SampleCorr * (SampleSize - 2.0)**0.5
    tdivider = (1.0 - SampleCorr**2)**0.5
    t = tdividend/tdivider
    print("t statistics = %f" % t)

In [86]:
    import scipy.stats as st
    print(st.t.ppf(0.025, 18))   # df=18: lower critical value
    print(st.t.ppf(0.975, 18))   # df=18: upper critical value

The p-value can also be obtained directly with the cdf function of scipy.stats.t:

In [93]:
    print(st.t.cdf(t, 18))                  # lower-tail (cumulative) probability Prob(T <= t)
    print((1.0 - st.t.cdf(t, 18))*2.0)      # upper-tail probability, doubled for the two-sided test
    # the doubled p-value is small (< 0.05)

A Python function for the test of no correlation: scipy.stats.pearsonr

In [92]:
    help(st.pearsonr)

    Help on function pearsonr in module scipy.stats.stats:

    pearsonr(x, y)
        Calculates a Pearson correlation coefficient and the p-value for testing
        non-correlation.

        The Pearson correlation coefficient measures the linear relationship
        between two datasets. Strictly speaking, Pearson's correlation requires
        that each dataset be normally distributed, and not necessarily zero-mean.
        Like other correlation coefficients, this one varies between -1 and +1
        with 0 implying no correlation. Correlations of -1 or +1 imply an exact
        linear relationship. Positive correlations imply that as x increases, so
        does y. Negative correlations imply that as x increases, y decreases.

        The p-value roughly indicates the probability of an uncorrelated system
        producing datasets that have a Pearson correlation at least as extreme
        as the one computed from these datasets. The p-values are not entirely
        reliable but are probably reasonable for datasets larger than 500 or so.

        Parameters
        x : (N,) array_like
            Input
        y : (N,) array_like
            Input

        Returns
        r : float
            Pearson's correlation coefficient
        p-value : float
            2-tailed p-value

In [95]:
    import scipy.stats as st
    SampleCorr = st.pearsonr(StatTest1, StatTest2)
    print(SampleCorr)

The first element of the output of pearsonr is the sample correlation coefficient (0.75); the second element is the p-value of the two-sided test.
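The relation between the t statistic built from r and the p-value reported by `pearsonr` can be checked directly. The following is a minimal sketch using the StatTest1 and StatTest2 data of the example; the names `p_manual`, `r_scipy`, and `p_scipy` are illustrative.

```python
import numpy as np
import scipy.stats as st

StatTest1 = np.array([6,10,6,10,5,3,5,9,3,3,11,6,11,9,7,5,8,7,7,9])
StatTest2 = np.array([10,13,8,15,8,6,9,10,7,3,18,14,18,11,12,5,7,12,7,7])

n = len(StatTest1)
r = np.corrcoef(StatTest1, StatTest2)[0, 1]       # sample correlation coefficient
t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)        # test statistic with df = n - 2
p_manual = 2 * (1 - st.t.cdf(abs(t), n - 2))      # two-sided p-value from the t distribution

r_scipy, p_scipy = st.pearsonr(StatTest1, StatTest2)

print("manual:   r = %f, t = %f, p = %g" % (r, t, p_manual))
print("pearsonr: r = %f, p = %g" % (r_scipy, p_scipy))
```

Both routes give the same correlation coefficient and, up to rounding, the same two-sided p-value.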

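Test of independence (chi-square test)

This test checks whether two qualitative variables are independent. Independence means that there is no association between the two qualitative variables.

- Test of independence: a test of the significance of the association between two qualitative variables.
- Expected frequency: under the null hypothesis that the two variables are not associated (are independent), the frequency each cell would be expected to take if the null hypothesis were true.

Expected frequency of a cell in a cross table = (marginal frequency of the row containing the cell × marginal frequency of the column containing the cell) ÷ total frequency

Because the test uses the probability distribution called $\chi^2$ (chi-square), it is also called the chi-square test.

Test statistic for the test of independence:

$$\chi^2 = \frac{(O_1 - E_1)^2}{E_1} + \frac{(O_2 - E_2)^2}{E_2} + \cdots + \frac{(O_k - E_k)^2}{E_k}$$

where $O_1, \ldots, O_k$ are the observed frequencies and $E_1, \ldots, E_k$ are the expected frequencies.

(Figure: chi-square distribution)

Hands-on with Python

Example: 20 students were asked whether they like or dislike mathematics (Math) and statistics (Stat). The responses are the arrays Math and Stat defined in In [144] below. Test at the 5% significance level whether, in general, there is a significant association between liking mathematics and liking statistics.

The cross table for these data is:

|               | Likes statistics | Dislikes statistics | Total |
|---------------|------------------|---------------------|-------|
| Likes math    | Expectation_11   | Expectation_12      | 6     |
| Dislikes math | Expectation_21   | Expectation_22      | 14    |
| Total         | 8                | 12                  | 20    |

Each box is called a cell, the number written in a cell is the observed frequency, the sums of the observed frequencies over each row and each column are the marginal frequencies, and the sum of the marginal frequencies is the total frequency.

Degrees of freedom: df = (number of rows - 1) × (number of columns - 1)

In [144]:
    # "好き" = like, "嫌い" = dislike
    Math = np.array(["嫌い","嫌い","好き","好き","嫌い","嫌い","嫌い","嫌い","嫌い","好き","好き",
                     "嫌い","好き","嫌い","嫌い","好き","嫌い","嫌い","嫌い","嫌い"])
    Stat = np.array(["好き","好き","好き","好き","嫌い","嫌い","嫌い","嫌い","嫌い","嫌い","好き",
                     "好き","好き","嫌い","好き","嫌い","嫌い","嫌い","嫌い","嫌い"])
    import pandas as pd
    data = pd.DataFrame({'Stat': Stat, 'Math': Math})
    table = pd.crosstab(data.Math, data.Stat, margins=True)   # build the cross table
    table

Out[144]:
    | Math \ Stat | 好き | 嫌い | All |
    |-------------|------|------|-----|
    | 好き        | 4    | 2    | 6   |
    | 嫌い        | 4    | 10   | 14  |
    | All         | 8    | 12   | 20  |

Proceed in the following steps:

1. Set up the null and alternative hypotheses: the null hypothesis H0 is that the two variables, mathematics and statistics, are independent (no association); the alternative hypothesis H1 is that the two variables are not independent (there is an association).
2. Choose the test statistic: $\chi^2 = \dfrac{(O_1 - E_1)^2}{E_1} + \dfrac{(O_2 - E_2)^2}{E_2} + \cdots + \dfrac{(O_k - E_k)^2}{E_k}$
3. Decide the significance level: 5% (a one-sided test, since the chi-square test has only one rejection region).
4. Compute the realized value of the test statistic: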

In [145]:
    # expected frequency = row marginal * column marginal / total frequency
    Expectaion_11 = 8*6/20.0
    Expectaion_12 = 12*6/20.0
    Expectaion_21 = 8*14/20.0
    Expectaion_22 = 12*14/20.0
    ExpectedFrequency = np.array([Expectaion_11, Expectaion_21, Expectaion_12, Expectaion_22])
    ObservedFrequency = np.array([4,4,2,10])   # same cell order as ExpectedFrequency
    ChiSqElements = (ObservedFrequency - ExpectedFrequency)**2 / ExpectedFrequency
    ChiSq = np.sum(ChiSqElements)   # test statistic
    print(ChiSq)

5. Decide whether to reject or retain the null hypothesis: under the null hypothesis this statistic follows the $\chi^2$ distribution with df = 1 degree of freedom.

In [104]:
    import scipy.stats as st
    help(st.distributions.chi2.ppf)

    Help on method ppf in module scipy.stats._distn_infrastructure:

    ppf(self, q, *args, **kwds) method of scipy.stats._continuous_distns.chi2_gen instance
        Percent point function (inverse of `cdf`) at q of the given RV.

        Parameters
        q : array_like
            lower tail probability
        arg1, arg2, arg3,... : array_like
            The shape parameter(s) for the distribution (see docstring of the
            instance object for more information)
        loc : array_like, optional
            location parameter (default=0)
        scale : array_like, optional
            scale parameter (default=1)

        Returns
        x : array_like
            quantile corresponding to the lower tail probability q.

In [8]:
    st.distributions.chi2.ppf(0.95, 1)
    # chi-square value at which the cumulative probability is 0.95 for df = 1
    # this defines the rejection region: H0 is rejected if the chi-square value is larger
    # the chi-square test uses the upper tail only

As a result, the rejection region is $\chi^2 > 3.84$, and the value of $\chi^2$ in this example (= 2.54) does not fall in the rejection region. The null hypothesis is therefore not rejected but retained. Conclusion: at the 5% significance level, we cannot say that the two variables, mathematics and statistics, are not independent (no association is demonstrated).

The p-value can also be obtained directly with the cdf function:

In [107]:
    help(st.distributions.chi2.cdf)

    Help on method cdf in module scipy.stats._distn_infrastructure:

    cdf(self, x, *args, **kwds) method of scipy.stats._continuous_distns.chi2_gen instance
        Cumulative distribution function of the given RV.

        Parameters
        x : array_like
            quantiles
        arg1, arg2, arg3,... : array_like
            The shape parameter(s) for the distribution (see docstring of the
            instance object for more information)
        loc : array_like, optional
            location parameter (default=0)
        scale : array_like, optional
            scale parameter (default=1)

        Returns
        cdf : ndarray
            Cumulative distribution function evaluated at `x`

In [108]:
    1 - st.distributions.chi2.cdf(ChiSq, 1)   # upper-tail probability = p-value (cdf alone gives the lower tail)
    # the p-value is larger than the significance level 0.05, so the null hypothesis is not rejected

A Python function for the chi-square test: chisquare (scipy.stats module):

In [109]:
    help(st.chisquare)

    Help on function chisquare in module scipy.stats.stats:

    chisquare(f_obs, f_exp=None, ddof=0, axis=0)
        Calculates a one-way chi square test.

        The chi square test tests the null hypothesis that the categorical data
        has the given frequencies.

        Parameters
        f_obs : array_like
            Observed frequencies in each category.
        f_exp : array_like, optional
            Expected frequencies in each category. By default the categories are
            assumed to be equally likely.
        ddof : int, optional
            "Delta degrees of freedom": adjustment to the degrees of freedom for
            the p-value. The p-value is computed using a chi-squared distribution
            with ``k - 1 - ddof`` degrees of freedom, where `k` is the number of
            observed frequencies. The default value of `ddof` is 0.
        axis : int or None, optional
            The axis of the broadcast result of `f_obs` and `f_exp` along which to
            apply the test. If axis is None, all values in `f_obs` are treated as
            a single data set. Default is 0.

        Returns
        chisq : float or ndarray
            The chi-squared test statistic. The value is a float if `axis` is None
            or `f_obs` and `f_exp` are 1-D.
        p : float or ndarray
            The p-value of the test. The value is a float if `ddof` and the return
            value `chisq` are scalars.

        See Also
        power_divergence
        mstats.chisquare

        Notes
        This test is invalid when the observed or expected frequencies in each
        category are too small. A typical rule is that all of the observed and
        expected frequencies should be at least 5.

        The default degrees of freedom, k-1, are for the case when no parameters
        of the distribution are estimated. If p parameters are estimated by
        efficient maximum likelihood then the correct degrees of freedom are
        k-1-p. If the parameters are estimated in a different way, then the dof
        can be between k-1-p and k-1. However, it is also possible that the
        asymptotic distribution is not a chisquare, in which case this test is
        not appropriate.

        References
        [1] Lowry, Richard. "Concepts and Applications of Inferential Statistics".
            Chapter 8.
        [2] "Chi-squared test"

        Examples
        When just `f_obs` is given, it is assumed that the expected frequencies
        are uniform and given by the mean of the observed frequencies.

        >>> from scipy.stats import chisquare
        >>> chisquare([16, 18, 16, 14, 12, 12])
        (2.0, )

        With `f_exp` the expected frequencies can be given.

        >>> chisquare([16, 18, 16, 14, 12, 12], f_exp=[16, 16, 16, 16, 16, 8])
        (3.5, )

        When `f_obs` is 2-D, by default the test is applied to each column.

        >>> obs = np.array([[16, 18, 16, 14, 12, 12], [32, 24, 16, 28, 20, 24]]).T
        >>> obs.shape
        (6, 2)
        >>> chisquare(obs)

        By setting ``axis=None``, the test is applied to all data in the array,
        which is equivalent to applying the test to the flattened array.

        >>> chisquare(obs, axis=None)
        >>> chisquare(obs.ravel())

        `ddof` is the change to make to the default degrees of freedom.

        >>> chisquare([16, 18, 16, 14, 12, 12], ddof=1)
        (2.0, )

        The calculation of the p-values is done by broadcasting the chi-squared
        statistic with `ddof`.

        >>> chisquare([16, 18, 16, 14, 12, 12], ddof=[0,1,2])

        `f_obs` and `f_exp` are also broadcast. In the following, `f_obs` has
        shape (6,) and `f_exp` has shape (2, 6), so the result of broadcasting
        `f_obs` and `f_exp` has shape (2, 6). To compute the desired chi-squared
        statistics, we use ``axis=1``:

        >>> chisquare([16, 18, 16, 14, 12, 12],
        ...           f_exp=[[16, 16, 16, 16, 16, 8], [8, 20, 20, 16, 12, 12]],
        ...           axis=1)
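The hand calculation for the mathematics/statistics example can also be written directly in terms of the 2×2 table of observed frequencies. The following minimal sketch derives the expected frequencies from the marginal totals and cross-checks the result against `scipy.stats.chi2_contingency`. That function is not used in the original notebook and is brought in here only as a check; `correction=False` switches off the Yates continuity correction so that the statistic matches the formula in the text. The variable names are illustrative.

```python
import numpy as np
import scipy.stats as st

# observed frequencies: rows = likes / dislikes math, columns = likes / dislikes statistics
observed = np.array([[4, 2],
                     [4, 10]])

row_totals = observed.sum(axis=1, keepdims=True)      # marginal frequencies of the rows
col_totals = observed.sum(axis=0, keepdims=True)      # marginal frequencies of the columns
expected = row_totals * col_totals / observed.sum()   # row total * column total / total frequency

chi_sq = ((observed - expected) ** 2 / expected).sum()
df = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_value = 1 - st.chi2.cdf(chi_sq, df)                 # upper-tail probability

# cross-check without the continuity correction
chi_sq_scipy, p_scipy, df_scipy, expected_scipy = st.chi2_contingency(observed, correction=False)

print("manual: chi2 = %.4f, df = %d, p = %.4f" % (chi_sq, df, p_value))
print("scipy:  chi2 = %.4f, df = %d, p = %.4f" % (chi_sq_scipy, df_scipy, p_scipy))
```

Working from the table keeps the degrees of freedom tied to its shape, so no `ddof` adjustment has to be worked out by hand.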

In [146]:
    ExpectedFrequency = np.array([Expectaion_11, Expectaion_21, Expectaion_12, Expectaion_22])
    ObservedFrequency = np.array([4,4,2,10])
    st.chisquare(ObservedFrequency, f_exp=ExpectedFrequency, ddof=2)
    # note the value of ddof used for the degrees of freedom: 4 cells - 1 - ddof = 1

Out[146]:
    Power_divergenceResult(statistic= , pvalue= )

Effect of sample size

Even when the strength of the association in the sample is exactly the same, the result of the test can change with the sample size. The larger the sample size, the more easily a result becomes significant. This is a property of statistical hypothesis tests in general.

In [160]:
    import pandas as pd
    data = {'Mastered': [16,12], 'NotMastered': [4,8]}   # completed a course vs not completed
    df = pd.DataFrame(data)
    df.index = ['Humanities','Technicals']               # humanities vs science students
    df                                                    # cross table

Out[160]:
    |            | Mastered | NotMastered |
    |------------|----------|-------------|
    | Humanities | 16       | 4           |
    | Technicals | 12       | 8           |

In [165]:
    Exp_11 = sum(df['Mastered'])*sum(df.loc['Humanities',:])/40.0
    Exp_12 = sum(df['NotMastered'])*sum(df.loc['Humanities',:])/40.0
    Exp_21 = sum(df['Mastered'])*sum(df.loc['Technicals',:])/40.0
    Exp_22 = sum(df['NotMastered'])*sum(df.loc['Technicals',:])/40.0

In [166]:
    ExpectedFrequency = np.array([Exp_11, Exp_12, Exp_21, Exp_22])
    ObservedFrequency = np.array([16,4,12,8])
    st.chisquare(ObservedFrequency, f_exp=ExpectedFrequency, ddof=2)
    # note the value of ddof used for the degrees of freedom
    # p-value = 0.17, so the null hypothesis is not rejected: no association shown

Out[166]:
    Power_divergenceResult(statistic= , pvalue= )

In [167]:
    data10 = {'Mastered': [160,120], 'NotMastered': [40,80]}   # same table, 10 times larger
    df = pd.DataFrame(data10)
    df.index = ['Humanities','Technicals']                      # humanities vs science students
    df                                                           # cross table

Out[167]:
    |            | Mastered | NotMastered |
    |------------|----------|-------------|
    | Humanities | 160      | 40          |
    | Technicals | 120      | 80          |

In [168]:
    Exp_11 = sum(df['Mastered'])*sum(df.loc['Humanities',:])/400.0
    Exp_12 = sum(df['NotMastered'])*sum(df.loc['Humanities',:])/400.0
    Exp_21 = sum(df['Mastered'])*sum(df.loc['Technicals',:])/400.0
    Exp_22 = sum(df['NotMastered'])*sum(df.loc['Technicals',:])/400.0

In [169]:
    ExpectedFrequency = np.array([Exp_11, Exp_12, Exp_21, Exp_22])
    ObservedFrequency = np.array([160,40,120,80])
    st.chisquare(ObservedFrequency, f_exp=ExpectedFrequency, ddof=2)
    # note the value of ddof used for the degrees of freedom
    # the p-value is very small (on the order of 1e-05), so the null hypothesis is rejected: association present

Out[169]:
    Power_divergenceResult(statistic= , pvalue= e-05)

Summary of functions

Note: numpy is abbreviated as np, np.random as random, matplotlib.pyplot as plt, pandas as pd, and scipy.stats as st.

| Purpose | Function and module | Usage |
|---------|---------------------|-------|
| Random sampling from a given range | random.choice(array, count) | random.choice(range(10),5) |
| Standard normal: value corresponding to a lower-tail probability | st.norm.ppf(p) | st.norm.ppf(0.025)  # q such that Prob(Z < q) = 0.025 |
| Standard normal: lower-tail probability (p-value) | st.norm.cdf(z) | st.norm.cdf(1.96)  # Prob(Z < 1.96) (p-value) |
| t distribution: value corresponding to a lower-tail probability | st.t.ppf(p, df) | st.t.ppf(0.025,19)  # q such that Prob(Z < q) = 0.025 for df = 19 |
| t distribution: lower-tail probability (p-value) | st.t.cdf(z, df) | st.t.cdf(1.96,19)  # Prob(Z < 1.96) for df = 19 (p-value) |
| t test | st.ttest_1samp(data, μ) | st.ttest_1samp(np.array([13,14,7,12,10,6,8,15,4,14,9,6,8,8,12,15]),12.0)  # test of μ = 12 |
| Test of no correlation | st.pearsonr(data1, data2) | st.pearsonr(StatTest1, StatTest2)  # first element: sample correlation coefficient, second: two-sided p-value |
| Chi-square distribution: probability density function | st.distributions.chi2.pdf(x, df) | plt.plot(x,st.distributions.chi2.pdf(x,3))  # plot the chi-square density for df = 3 |
| Chi-square distribution: critical value (pass the lower-tail probability p = 1 - α for an upper-tail probability α) | st.distributions.chi2.ppf(p, df) | st.distributions.chi2.ppf(0.95,2)  # q such that Prob(Z < q) = 0.95 for df = 2 |
| Chi-square distribution: upper-tail probability | 1-st.distributions.chi2.cdf(z, df) | 1-st.distributions.chi2.cdf(3.5,1)  # Prob(Z >= 3.5) for df = 1 |
| Chi-square test (test of independence) | st.chisquare(observed list, f_exp=expected list, ddof=n)  # n = number of observed cells - 1 - df | st.chisquare(ObservedFrequency, f_exp=ExpectedFrequency, ddof=2) |

Exercises 4

Exercise 4-1
Test whether the following data (in cm) can be regarded as a random sample from a population of 20-year-old men whose heights follow a normal distribution with mean 170 cm.

In [171]:
    import numpy as np
    Height = np.array([165,150,170,168,159,170,167,178,155,159,161,162,166,171,155,160,168,172,155,167])

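Exercise 4-2
For the data shown below, carry out a test of no correlation on the correlation coefficient between study time (StudyHours) and the regular examination score (ExamResult).

In [170]:
    import numpy as np
    StudyHours = np.array([1, 3, 10, 12, 6, 3, 8, 4, 1, 5])
    ExamResult = np.array([20, 40, 100, 80, 50, 50, 70, 50, 10, 60])

Exercise 4-3
For the data of Exercise 4-2, compute Pearson's correlation coefficient and Spearman's rank correlation coefficient, and also carry out the test of no correlation for each.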

In [174]:
    import scipy.stats as st
    help(st.spearmanr)

    Help on function spearmanr in module scipy.stats.stats:

    spearmanr(a, b=None, axis=0, nan_policy='propagate')
        Calculates a Spearman rank-order correlation coefficient and the p-value
        to test for non-correlation.

        The Spearman correlation is a nonparametric measure of the monotonicity
        of the relationship between two datasets. Unlike the Pearson correlation,
        the Spearman correlation does not assume that both datasets are normally
        distributed. Like other correlation coefficients, this one varies between
        -1 and +1 with 0 implying no correlation. Correlations of -1 or +1 imply
        an exact monotonic relationship. Positive correlations imply that as x
        increases, so does y. Negative correlations imply that as x increases, y
        decreases.

        The p-value roughly indicates the probability of an uncorrelated system
        producing datasets that have a Spearman correlation at least as extreme
        as the one computed from these datasets. The p-values are not entirely
        reliable but are probably reasonable for datasets larger than 500 or so.

        Parameters
        a, b : 1D or 2D array_like, b is optional
            One or two 1-D or 2-D arrays containing multiple variables and
            observations. When these are 1-D, each represents a vector of
            observations of a single variable. For the behavior in the 2-D case,
            see under ``axis``, below. Both arrays need to have the same length
            in the ``axis`` dimension.
        axis : int or None, optional
            If axis=0 (default), then each column represents a variable, with
            observations in the rows. If axis=1, the relationship is transposed:
            each row represents a variable, while the columns contain
            observations. If axis=None, then both arrays will be raveled.
        nan_policy : {'propagate', 'raise', 'omit'}, optional
            Defines how to handle when input contains nan. 'propagate' returns nan,
            'raise' throws an error, 'omit' performs the calculations ignoring nan
            values. Default is 'propagate'.

        Returns
        correlation : float or ndarray (2-D square)
            Spearman correlation matrix or correlation coefficient (if only 2
            variables are given as parameters. Correlation matrix is square with
            length equal to total number of variables (columns or rows) in a and
            b combined.
        pvalue : float
            The two-sided p-value for a hypothesis test whose null hypothesis is
            that two sets of data are uncorrelated, has same dimension as rho.

        Notes
        Changes in scipy 0.8.0: rewrite to add tie-handling, and axis.

        References
        [1] Zwillinger, D. and Kokoska, S. (2000). CRC Standard Probability and
            Statistics Tables and Formulae. Chapman & Hall: New York.
            Section 14.7

        Examples
        >>> from scipy import stats
        >>> stats.spearmanr([1,2,3,4,5], [5,6,7,8,7])
        >>> np.random.seed( )
        >>> x2n = np.random.randn(100, 2)
        >>> y2n = np.random.randn(100, 2)
        >>> stats.spearmanr(x2n)
        >>> stats.spearmanr(x2n[:,0], x2n[:,1])
        >>> rho, pval = stats.spearmanr(x2n, y2n)
        >>> rho
        >>> pval
        >>> rho, pval = stats.spearmanr(x2n.T, y2n.T, axis=1)
        >>> rho
        >>> stats.spearmanr(x2n, y2n, axis=None)
        >>> stats.spearmanr(x2n.ravel(), y2n.ravel())
        >>> xint = np.random.randint(10, size=(100, 2))
        >>> stats.spearmanr(xint)

In [ ]:

Exercise 4-4
Carry out a chi-square test on the data of Exercise 2-2, shown below.

In [179]:
    import numpy as np
    # "洋食" = Western food, "和食" = Japanese food
    FoodTendency = np.array(["洋食","和食","和食","洋食","和食","洋食","洋食","和食","洋食","洋食","和食",
                             "洋食","和食","洋食","和食","和食","洋食","洋食","和食","和食"])
    # "甘党" = prefers sweet, "辛党" = prefers spicy
    TasteTendency = np.array(["甘党","辛党","甘党","甘党","辛党","辛党","辛党","辛党","甘党","甘党","甘党",
                              "甘党","辛党","辛党","甘党","辛党","辛党","甘党","辛党","辛党"])

Exercise 4-5
Carry out a test of no correlation for each of the following data sets.

In [ ]:
    # 5-1
    import numpy as np
    Kokugo = np.array([60,40,30,70,55])
    Shakai = np.array([80,25,35,70,50])

In [ ]:
    # simply the data of (5-1) repeated twice
    Kokugo = np.array([60,40,30,70,55,60,40,30,70,55])
    Shakai = np.array([80,25,35,70,50,80,25,35,70,50])

Exercise 4-6
badmington.csv (SampleData/badmington.csv) is a comma-separated CSV file containing a table of the weight x and the hardness y of badminton rackets (source: 内田 (2010) すぐに使える R による統計解析とグラフの応用, 東京図書). Read the data into a data frame, compute the correlation coefficient between hardness (y) and weight (x), and carry out a test of no correlation.
[Reference] To read a comma-separated CSV file and load its contents as a data frame, use the read_csv function of the pandas module.

In [ ]:
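The reference above mentions `read_csv`; the following minimal sketch shows the general pattern of reading comma-separated text into a data frame and passing two columns to `st.pearsonr`. To stay self-contained it reads from an in-memory string instead of the real file. The numbers are made up for illustration and are not the contents of badmington.csv, and the header names `x` and `y` are an assumption about how that file is laid out.

```python
import io
import pandas as pd
import scipy.stats as st

# hypothetical stand-in for a CSV file with a header row (NOT the real racket data)
csv_text = """x,y
85.2,63.1
86.0,64.8
84.7,61.9
88.3,66.0
87.1,64.2
85.9,63.5
"""

# pd.read_csv accepts a file path or any file-like object
frame = pd.read_csv(io.StringIO(csv_text))
print(frame.head())

# test of no correlation between the two columns
r, p = st.pearsonr(frame["x"], frame["y"])
print("r = %f, two-sided p = %f" % (r, p))
```

For a file on disk, the `io.StringIO(...)` argument would be replaced by the file's path.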
