ex1: TopHat キイロショウジョウバエ Drosophila melanogaster の RNA-seq を行ったライブラリは 2 種類それぞれ single end( イン

http://133.48.62.157/wiki5/doku.php?id=git2014sopen:ex1 ex1: TopHat キイロショウジョウバエ Drosophila melanogaster の RNA-seq を行ったライブラリは 2 種類それぞれ single end( インサートの片側だけ読む ) で 75bp シークエンスした 10 万リード得られたこれらのリードを D. melanogaster のゲノムにマッピングしたい TopHat を用いて splice-aware なマッピングを行う Data Input reads (~/data/ex/ 以下にある ) C1_10k_Read1.fq C2_10k_Read1.fq Reference D. mel genome and annotation (Ensembl BDGP5.25) をiGenomes (http://tophat.cbcb.umd.edu/igenomes.html [http://tophat.cbcb.umd.edu/igenomes.html]) からダウンロードする ftp://igenome:g3nom3s4u@ussd-ftp.illumina.com/drosophila_melanogaster /Ensembl/BDGP5.25/Drosophila_melanogaster_Ensembl_BDGP5.25.tar.gz [ftp://igenome:g3nom3s4u@ussd-ftp.illumina.com/drosophila_melanogaster/ensembl /BDGP5.25/Drosophila_melanogaster_Ensembl_BDGP5.25.tar.gz] ただファイルサイズが比較的大きくダウンロードに時間がかかると思われる演習用のMacに同じファイルが置いてある (~/data/ex/ 以下にある ) Drosophila_melanogaster_Ensembl_BDGP5.25.tar.gz Setup Setup environment ex1 ディレクトリをつくり以下の解析はその下で作業しよう $ mkdir ex1 $ cd ex1 Sequence reads less などのコマンドで C1_10k_Read1.fq の内容を確認する注 ) 本番の解析ではリード数の確認フォーマットの確認クオリティの確認などを行う必要であればアダプター配列の除去低クオリティ部位のトリムも行う Reference sequence and annotation files Drosophila_melanogaster_Ensembl_BDGP5.25.tar.gz を解凍する 1

http://133.48.62.157/wiki5/doku.php?id=git2014sopen:ex1 $tar xzvf Drosophila_melanogaster_Ensembl_BDGP5.25.tar.gz Run tophat TopHat を実行 C1_10k_Read1.fq $ tophat -p 4 -G genes.gtf -o C1_tophat_out genome C1_10k_Read1.fq gene.gtf は known transcript が記録された gtf ファイルファイルパスを指定することダウンロードした Drosophila_melanogaster_Ensembl_BDGP5.25 の中のどこかにある genome には bowtie2 用のゲノムのインデックスファイルの base name を指定するダウンロードした Drosophila_melanogaster_Ensembl_BDGP5.25 の中のどこかにある -p は使う CPU core を指定するオプション使用するコンピュータのスペックに合わせてオススメ : transcriptome-index オプションは指定した方が良い初回に作製した bowtie2 index が 2 回目以降使い回せる複数ライブラリを解析する際は大幅に時間の節約になる同様に C2_10k_Read1.fq をマッピング $ tophat -p 4 -G genes.gtf -o C2_tophat_out genome C2_10k_Read1.fq Inspect Results 計算が終わったらどのようなファイルが生成されたか確認する $ ls -l C1_tophat_out/ total 1048 -rw-r--r-- 1 shige staff 1028268 Mar 6 23:10 accepted_hits.bam -rw-r--r-- 1 shige staff 52 Mar 6 23:10 deletions.bed -rw-r--r-- 1 shige staff 54 Mar 6 23:10 insertions.bed -rw-r--r-- 1 shige staff 25506 Mar 6 23:10 junctions.bed drwxr-xr-x 17 shige staff 578 Mar 6 23:04 logs -rw-r--r-- 1 shige staff 66 Mar 6 23:04 prep_reads.inf prep_reads.info の中身を less で確認しよう accepted_hits.bam がアライメント結果だ中身を samtools で確認しよう $ samtools view C1_tophat_out/accepted_hits.bam less IGV IGV で可視化しよう IGV で bam ファイルを読むためにはインデクシングをしなければいけない sort indexing の段階をふむ $ samtools sort accepted_hits.bam accepted_hits.sorted # => accepted_hits.sorted.bam ができる $ samtools index accepted_hits.sorted.bam # => accepted_hits.sorted.bam.bai ができる 1. IGV を立上げる 2

http://133.48.62.157/wiki5/doku.php?id=git2014sopen:ex1 2. 左上のプルダウンメニューから Drosophila melanogaster のゲノムを選ぶ ( 今回使っているリファレンスと同一のバージョンの Dmel ゲノムがないが今回の練習では r5.33 を選んで問題はない ) 3. メニュー File > Load from File accepted_hits.sorted.bam を選択 4. 適当な染色体の適当な場所を指定し適当にズームアップする X:830,000 近辺 git2014sopen/ex1.txt Last modified: 2014/03/04 20:03 by shige Except where otherwise noted, content on this wiki is licensed under the following license:cc Attribution-Share Alike 3.0 Unported [http://creativecommons.org/licenses/by-sa/3.0/] 3

http://133.48.62.157/wiki5/doku.php?id=git2014sopen:ex2 ex2: Transcript-based Mapping with Bowtie2 マウス Mus musculus の RNA-seq を行ったライブラリは 1 種類のみで single end ( 片側 Read1 のみ ) 75bp シークエンスを行ったこれらのリードをマウス mrna リファレンスにマッピングさせたい戦略 :bowtie2 で mrna リファレンスにマッピング Data データファイルは ~/data/ss 以下に保存してある Input reads IlluminaReads1.fq Reference minimouse_mrna.fa Setup Setup environment ex2 ディレクトリをつくり以下の解析はその下で作業しよう Sequence reads less などのコマンドでシーケンスファイル (IlluminaReads1.fq) の内容を確認する注 ) 本番の解析ではリード数の確認フォーマットの確認クオリティの確認などを行う必要であればアダプター配列の除去低クオリティ部位のトリムも行う Reference sequence and annotation files minimouse_mrna.fa の内容を less などで確認する Create index of reference $ bowtie2-build reference.fasta output_basename たとえば reference.fasta : referenceのfastaファイル今回の場合はRefSeq.MM9.cds.nr.fasta( のパス ) output_basename : 生成されるインデックスファイル群のbase name bowtie2-build Data/RefSeq.MM9.cds.nr.fasta myref を実行すると 4

http://133.48.62.157/wiki5/doku.php?id=git2014sopen:ex2 myref.1.bt2 myref.4.bt2 myref.2.bt2 myref.rev.1.bt2 myref.3.bt2 myref.rev.2.bt2 の 6 つのファイルができる Run Bowtie2 bowtie2 でマッピングしよう Usage: bowtie2 [options]* -x <bt2-idx> {-1 <m1> -2 <m2> -U <r>} [-S <sam>] bowtie2には様々なオプションがあるが今回は最低限のオプションだけを設定して実行するどのようなオプションが利用可能かは bowtie2 -h で確認できるまた開発者ホームページに詳細な解説がある本番の解析では適切なオプションを適切なパラメータで実行しなければいけない実際はいくつかパラメータを振って試行錯誤することになる $ bowtie2 -p 4 -x RefSeq.MM9.cds.nr -U mouse_200k.left.fq -S out.sam out.sam がマッピング結果 SAM format -p は使うCPUコア数使用するコンピュータにあわせて設定するコマンドを実行するとしばらくして 200000 reads; of these: 200000 (100.00%) were unpaired; of these: 114740 (57.37%) aligned 0 times 68238 (34.12%) aligned exactly 1 time 17022 (8.51%) aligned >1 times 42.63% overall alignment rate のようなレポートが表示されて終了するマッピング率など有用な情報なのでテキストファイルにコピー & ペーストして保存しておくと良い Inspect Results 計算が終わったらどのようなファイルが生成されたか確認する (ls -l など ) out.sam の内容を確認しよう (less, head, tail など ). 最初の約 2 万行はヘッダでアライメントはそのあとに続く SAM to BAM mapping 結果を可視化したりカウントしたり様々な下流解析を行うために SAM ファイルを sort 済の BAM に変換するそしてインデクシングする SAM BAM の変換は NGS 研究ではよく行う作業なので必ず身に付けること $ samtools view -bs out.sam > out.bam $ samtools sort out.bam out.sorted # => out.sorted.bam が生成される $ samtools index out.sorted.bam 5

http://133.48.62.157/wiki5/doku.php?id=git2014sopen:ex2 # => out.sorted.bam.bai が生成される Count by transcript samtools を使って transcript ごとにカウントする簡易な方法を紹介する amtools のサブコマンド idxstats は reference sequence のエントリー毎にマップされたリード数を集計する今回は各シークエンスエントリーが各トランスクリプトに相当するのでこれを利用すると transcript ごとのカウント情報が得られる $ samtools idxstats out.sorted.bam ( やや難 ) 最もカウント数が多いのはどの遺伝子だろうか sort コマンドで手っ取り早く確認することが出来る git2014sopen/ex2.txt Last modified: 2014/03/04 20:10 by shige Except where otherwise noted, content on this wiki is licensed under the following license:cc Attribution-Share Alike 3.0 Unported [http://creativecommons.org/licenses/by-sa/3.0/] 6

http://133.48.62.157/wiki5/doku.php?id=git2014sopen:ex3 ex3: Data Import and Visualization arab2.txt は 6 libraries (2 groups x 3 biological replicates) のシロイヌナズナ RNA-seq のデータであるすでにマッピング済みで遺伝子毎のリードカウントがタブ区切りテキストとして提供されているこの exercise ではテーブルの中身を確認する基本テクニックを練習する Data (~/data/ss/ 以下にある ) arab2.txt : count table Inspect table with MS Excel 1) 表計算ソフト MS Excel を使って arab2.txt の中身を確認しよう 2)MS Excel で m1 と m2 の scatter plot( 散布図 ) を書いてみようこのふたつは同一コンディションの biological replicate なので発現パターンは両者で良く似ているはずである次に m1 と h1 を比較しようこのふたつはコントロールと実験群の比較なので有る程度の発現パターンの違い ( すくなくとも m1 vs m2 よりも大きい違い ) が期待されるヒント :xy 軸ともに対数をとることコメント : ノーマライズなどを施していない生データでもこれだけ豊富な情報が得られることを認識して欲しい ex-1b: Inspect table with R こんどは R でテーブルの確認と scatter plot を書いてみよう Data import > dat <- read.delim("arab2.txt", row.names=1, head=t) # read tab-delimited text > head(dat) m1 m2 m3 h1 h2 h3 AT1G01010 35 77 40 46 64 60 AT1G01020 43 45 32 43 39 49 AT1G01030 16 24 26 27 35 20 AT1G01040 72 43 64 66 25 90 AT1G01050 49 78 90 67 45 60 AT1G01060 0 15 2 0 21 8 Inspect table > dim(dat) [1] 26221 6 Q1:dim コマンドは何をするものですか? Q2: またその結果得られた 26221 と 6 は何を意味しますか? 7

http://133.48.62.157/wiki5/doku.php?id=git2014sopen:ex3 Inspect data by column それぞれのライブラリのリードカウント合計は重要な基礎情報である計算してみよう Q3: それぞれのライブラリのリードカウント合計を求めなさい例としてm1カラムの合計を計算する > sum(dat$m1) [1] 1902032 他のカラムも合計を計算しようまたこれらは基礎情報として重要なので記録しておこう Q4 ( やや難 ): 約 2 万 5 千遺伝子にはカウント 0 のものから非常にたくさんのカウントをもつものがあるカウントの i) 最大値最小値平均値中央値はいくつか調べよう ii) ヒストグラムを書きなさい m1 as an example: > dim(dat) [1] 26221 6 > sum(dat$m1) [1] 1902032 > sum(dat$m3) [1] 3259705 > max(dat$m1) [1] 61791 > min(dat$m1) [1] 0 > mean(dat$m1) [1] 72.5385 > median(dat$m1) [1] 9 コメント : ただ計算するだけでなくこれらの基礎統計量を見てカウントデータの分布にどのような特徴があるか考えることが大切です > hist(dat$m1) しかしあんまり特徴がつかめないので > hist(log10(dat$m1 + 1), breaks="scott") Scatter plot Q5: m1 vs m2 を scatter plot で比較しよう 8

http://133.48.62.157/wiki5/doku.php?id=git2014sopen:ex3 > plot(dat$m1 + 1, dat$m2 + 1, log="xy") それぞれ +1 しているのは log0 は計算できないため +1 して下駄を履かせている Q6: ほかのライブラリどうしも scatter plot を描いて比較しようコメント :plot などのグラフィックス関数には様々な引数を与えることによって非常に多くの描画パラメータを変更できグラフの見栄えを変更することが出来る余裕があれば plot の色形などを変更する練習をしてみると良いだろう git2014sopen/ex3.txt Last modified: 2014/03/04 20:13 by shige Except where otherwise noted, content on this wiki is licensed under the following license:cc Attribution-Share Alike 3.0 Unported [http://creativecommons.org/licenses/by-sa/3.0/] 9

http://133.48.62.157/wiki5/doku.php?id=git2014sopen:ex4 ex4: MA plot MA plot 2 RNAseq M = log(intensityb / intensitya) # log of the ratio = A = log(sqrt(intensitya * intensityb)) # intensity average of log intensity = arab2.txt MA plot (ex3 dat ) Q1) m1 vs m2 MA-plot M <- log2(dat$m2 / dat$m1) A <- log2(sqrt(dat$m1 * dat$m2)) plot(a, M) log0 M <- log2(dat$m2 + 1) - log2(dat$m1 + 1) A <- 1/2 * (log2(dat$m2 + 1) + log2(dat$m1 + 1)) plot(a, M, col="gray", pch=16, cex=0.4, ylim=c(-8,8)) edger MAplot RNAseq MA git2014sopen/ex4.txt Last modified: 2014/03/04 20:15 by shige Except where otherwise noted, content on this wiki is licensed under the following license:cc Attribution-Share Alike 3.0 Unported [http://creativecommons.org/licenses/by-sa/3.0/] 10

http://133.48.62.157/wiki5/doku.php?id=git2014sopen:ex5&#de_testing ex5: edger arab2 のデータを使い 2 群間比較を edger で行う edger は複雑なパッケージである開発者が詳細のユーザーガイドやマニュアルを用意しているのでこれらを活用して欲しい ( リンクは下記参照 ) Import library library(edger) Import data > dat <- read.delim("arab2.txt", row.names=1) #... dat 中身の確認作業... 2 グループ各 3 繰り返し実験という実験デザインを定義する > grp <- c("m", "M", "M", "H", "H", "H") > grp [1] "M" "M" "M" "H" "H" "H" edger の DGEList 関数でカウントデータを読み込む > D <- DGEList(dat, group=grp) > head(d)... Normalization TMM 法でノーマライズする calcnormfactors を使う > D <-calcnormfactors(d, nethod="tmm") 計算結果の確認 > D$samples group lib.size norm.factors m1 M 1902032 1.0399197 m2 M 1934029 1.0611305 m3 M 3259705 0.8841923 h1 H 2129854 1.0266944 h2 H 1295304 1.1412144 h3 H 3526579 0.8747345 DE testing estimate dispersion > D <- estimatecommondisp(d) > D$common.dispersion 11

http://133.48.62.157/wiki5/doku.php?id=git2014sopen:ex5&#de_testing [1] 0.342609 > D <- estimatetagwisedisp(d) > summary(d$tagwise.dispersion) Min. 1st Qu. Median Mean 3rd Qu. Max. 0.1173 0.1834 0.4728 1.0540 1.7400 3.7390 DE test > de.tagwise <- exacttest(d, pair=c("m", "H")) Multiple comparison correction and View results > toptags(de.tagwise) Comparison of groups: H-M logfc logcpm PValue FDR AT5G48430 6.233066 6.706315 3.281461e-21 8.604319e-17 AT3G46280 5.078716 8.120404 1.110955e-19 1.456517e-15 AT2G19190 4.620707 7.381817 1.710816e-19 1.495310e-15 AT4G12500 4.334870 10.435847 4.689616e-19 3.074161e-15 AT2G44370 5.514376 5.178263 9.902189e-18 5.192906e-14 AT2G39380 5.012163 5.765848 2.010501e-17 8.786223e-14 AT3G55150 5.809677 4.871425 3.065826e-17 1.148414e-13 AT4G12490 3.901996 10.198755 8.068822e-17 2.455369e-13 AT1G51820 4.476647 6.369685 8.490613e-17 2.455369e-13 AT2G39530 4.366709 6.710299 9.364131e-17 2.455369e-13 Dump the table into a text file > write.table(de.tagwise$table, "de.tagwise.txt", sep="\t", quote=f) or > tmp <- toptags(de.tagwise, n=nrow(de.tagwise$table)) > write.table(tmp$table, "de.tagwise2.txt", sep="\t", quote=f) MA plot edger 提供の plotsmear 関数を使うと便利 plotsmear(d) 有意に発現差のある遺伝子を赤で表示することもできる > de.names <- row.names(dat[decidetestsdge(de.tagwise, p.value=0.05)!=0, ]) > plotsmear(d, de.tags=de.names) 12

http://133.48.62.157/wiki5/doku.php?id=git2014sopen:ex5&#de_testing Inspect DE result example: fold-change > 10 を抽出カウント detab <- tmp$table ## get fold-change > 10 detab[detab$logfc > log2(10),] nrow(detab[detab$logfc > log2(10),]) example: FC > 5 AND FDR < 0.01 > detab[(detab$logfc > log2(2) & detab$fdr < 0.05), ] edger に組み込まれている decidetestdge 関数も便利 > summary(decidetestsdge(de.tagwise, p.value=0.05)) [,1] -1 49 0 25903 1 269 dump したタブ区切りテキストを MS Excel で読み込んでフィルタ機能やソート機能を駆使してデータを探索するのも良いだろう Links http://www.bioconductor.org/packages/release/bioc/html/edger.html [http://www.bioconductor.org/packages/release/bioc/html/edger.html] edger edger User's Guide [http://www.bioconductor.org/packages/release/bioc/vignettes/edger /inst/doc/edgerusersguide.pdf] git2014sopen/ex5.txt Last modified: 2014/03/05 20:55 by shige Except where otherwise noted, content on this wiki is licensed under the following license:cc Attribution-Share Alike 3.0 Unported [http://creativecommons.org/licenses/by-sa/3.0/] 13

http://133.48.62.157/wiki5/doku.php?id=git2014sopen:ex6 Clustering 問 1 データセット 080106_mutant_group00-04_Pst-aR2_6h_q0.05_ExpRatio_22g.txt を使ってユークリッド距離を使った場合とコサイン係数を使った距離でクラスタリングした時の違いを調べなさいなおコサイン係数での距離クラスタリングは下記のカスタム関数を用いてよいまた heatmap および heatmap.2(gregmisc ライブラリー ) では引数に Rowv=as.dendrogram( 行のクラスタリング結果 ), Colv=as.dendrogram( 列のクラスタリング結果 ) と指定することで任意のクラスタリング結果でヒートマップを描くことができる # コサイン係数 ( ベクトルの角度 ) でクラスタリング # コサイン係数は関数が無いので自作する cosine.coef <- function(x,y) { a <- sum(na.omit(x * y)) / sqrt( sum(na.omit(x)^2) * sum(na.omit(y)^2) ) return(a) } # making a distance table between columns using uncentered Pearson correlation cosine.table <- function(x) { numberofpoints <- ncol(x) columnnames <- colnames(x) distancetable <- matrix(data = NA, nrow = numberofpoints, ncol = numberofpoints, dimnames = list( columnnames, columnnames ) ) for ( i in 1:(numberOfPoints-1) ) { for ( j in (i+1):numberofpoints ) { v1 <- x[, i] v2 <- x[, j] d <- 1 - cosine.coef(v1, v2) distancetable[i, j] <- d distancetable[j, i] <- d } } } for ( i in 1:numberOfPoints ) { distancetable[i, i] <- 1 } # fill the diagonal return(distancetable) # コサイン係数の距離行列 # cosinedistancetable <- as.dist(cosine.table(as.matrix((dat)))) # 行のクラスタリング rowclusters <- hclust(as.dist(cosine.table(as.matrix((t(dat)))))) # 列のクラスタリング colclusters <- hclust(as.dist(cosine.table(as.matrix((dat))))) # 問 1 解法例 dat <- read.delim(file= /Users/nibb2/Desktop/080106_mutant_group00-04_Pst-aR2_6h_q0.05_ExpRatio_22g.txt, header=true, row.name=1) library(gregmisc) # ユークリッド距離 ( ベクトル間の距離 ) でクラスタリング png(filename=paste("080106_mutant_group00-04_pst-ar2_6h_q0.05_expratio_22g", "_Euclidian.png", sep="")) heatmap.2(as.matrix(dat), trace="none") dev.off() png(filename=paste("080106_mutant_group00-04_pst-ar2_6h_q0.05_expratio_22g", "_cosine.png", sep="")) heatmap.2(as.matrix(dat), Rowv=as.dendrogram(rowClusters), Colv=as.dendrogram(colClusters), trace="none") dev.off() 問 2 データセット 080106_mutant_group00-04_Pst-aR2_6h_q0.05_ExpRatio_22g.txt を使ってユークリッド距離を使って得られたクラスタリング結果がどれだけ頑健な結果であるか leave-one out 法を使って調べなさい今回は for 関数などを使って計 21 個の leave-one-out 検証した結果のヒートマップを書きその結果を比べなさい # 問 2 解法例 14

http://133.48.62.157/wiki5/doku.php?id=git2014sopen:ex6 for (i in 1:ncol(dat)){ LOOdata <- dat[,-i] png(filename=paste("080106_mutant_group00-04_pst-ar2_6h_q0.05_expratio_22g", "_Euclid_", colnames(dat)[i], "-LOO.png", sep=""), width=480*2, height=480*2) heatmap.2(as.matrix(loodata),, trace="none") dev.off() } 注 : ファイルのパスは環境にあわせて変更すること git2014sopen/ex6.txt Last modified: 2014/03/04 20:18 by shige Except where otherwise noted, content on this wiki is licensed under the following license:cc Attribution-Share Alike 3.0 Unported [http://creativecommons.org/licenses/by-sa/3.0/] 15

http://133.48.62.157/wiki5/doku.php?id=git2014sopen:prac_genome_based Genome-based RNA-Seq pipeline (Arabidopsis thaliana) RNA-seq 2D sample (2days dark condition ) 2D2L sample ( 2days light condition sampling duplicate paired-end 101bp pre-processing Arabidopsis thaliana TopHat splice-aware Data Input reads ( ~/data/ky/tophat/ ) condition Dark, rep#1: 2D_1_R1.fastq, 2D_1_R2.fastq condition Dark, rep#2: 2D_2_R1.fastq, 2D_2_R2.fastq condition Dark, rep#3: 2D_3_R1.fastq, 2D_3_R2.fastq condition Light, rep#1: 2D2L_1_R1.fastq, 2D2L_1_R2.fastq condition Light, rep#2: 2D2L_2_R1.fastq, 2D2L_2_R2.fastq condition Light, rep#3: 2D2L_3_R1.fastq, 2D2L_3_R2.fastq Reference Arabidopsis thaliana genome and annotation (Ensembl) igenomes (http://tophat.cbcb.umd.edu /igenomes.html [http://tophat.cbcb.umd.edu/igenomes.html]) ftp://ussd-ftp.illumina.com/arabidopsis_thaliana/ensembl/tair10 /Arabidopsis_thaliana_Ensembl_TAIR10.tar.gz [ftp://ussd-ftp.illumina.com/arabidopsis_thaliana /Ensembl/TAIR10/Arabidopsis_thaliana_Ensembl_TAIR10.tar.gz] Chr4 15.6% genome (genome.fa) (genes.gtf) bowtie2 index (genome.fa.*.bt2) A C G T N a c g t n other Sum_ACGT Sum_acgt Count_All cg_ratio Chr1 9709674 5435374 5421151 9697113 163958 0 0 0 0 0 401 30263312 0 30427671 35.8 Chr2 6315641 3542973 3520766 6316348 2506 0 0 0 0 0 55 19695728 0 19698289 35.8 Chr3 7484757 4258333 4262704 7448059 5966 0 0 0 0 0 11 23453853 0 23459830 36.3 Chr4 5940546 3371349 3356091 5914038 3030 0 0 0 0 0 2 18582024 0 18585056 36.2 Chr5 8621974 4832253 4858759 8652238 10278 0 0 0 0 0 0 26965224 0 26975502 35.9 Mt 102464 82661 81609 100190 0 0 0 0 0 0 0 366924 0 366924 44.7 Pt 48546 28496 27570 49866 0 0 0 0 0 0 0 154478 0 154478 36.2 total 38223602 21551439 21528650 38177852 185738 0 0 0 0 0 469 119481543 0 119667750 36 Software tophat (installed) cufflinks (installed) bowtie2 (installed) samtools (installed) Setup 16

http://133.48.62.157/wiki5/doku.php?id=git2014sopen:prac_genome_based Setup environment top_cuff $ mkdir top_cuff $ cd top_cuff Sequence reads less 2D_1_R1.fastq Pre-processing Run tophat TopHat 2D_1_R1.fastq 2D_1_R2.fastq $ tophat -p 4 -G genes.gtf -o 2D_1 genome.fa 2D_1_R1.fastq 2D_1_R2.fastq -p CPU core transcriptome-index bowtie2 index 2 paired-end single read Inspect Results $ ls -l C1_tophat_out/ $ $ ls -la 2D_1 total 92072 -rw-r--r-- 1 kyamaguc staff 3761633 3 2 17:14 accepted_hits.bam -rw-r--r-- 1 kyamaguc staff 557 3 2 17:14 align_summary.txt -rw-r--r-- 1 kyamaguc staff 5372 3 2 17:14 deletions.bed -rw-r--r-- 1 kyamaguc staff 2850 3 2 17:14 insertions.bed -rw-r--r-- 1 kyamaguc staff 407733 3 2 17:14 junctions.bed drwxr-xr-x 30 kyamaguc staff 1020 3 2 17:14 logs -rw-r--r-- 1 kyamaguc staff 176 3 2 17:12 prep_reads.info -rw-r--r-- 1 kyamaguc staff 33862783 3 2 17:14 unmapped.bam prep_reads.info align_summary.txt less accepted_hits.bam samtools $ samtools view 2D_1/accepted_hits.bam less IGV IGV IGV bam sort indexing 17

http://133.48.62.157/wiki5/doku.php?id=git2014sopen:prac_genome_based $ samtools sort accepted_hits.bam accepted_hits.sorted # => accepted_hits.sorted.bam $ samtools index accepted_hits.sorted.bam # => accepted_hits.sorted.bam.bai 1. IGV 2. A.thaliana(TAIR10) 3. File > Load from File accepted_hits.sorted.bam 4. X:1.14kb Statistics ( ) samtools Run cufflinks $ cufflinks -o 2D_1 -G genes.gtf accepted_hits.bam * tophat less -rw-r--r-- 1 kyamaguc staff 3761633 3 2 17:14 accepted_hits.bam -rw-r--r-- 1 kyamaguc staff 557 3 2 17:14 align_summary.txt -rw-r--r-- 1 kyamaguc staff 5372 3 2 17:14 deletions.bed -rw-r--r-- 1 kyamaguc staff 422318 3 2 17:24 genes.fpkm_tracking -rw-r--r-- 1 kyamaguc staff 2850 3 2 17:14 insertions.bed -rw-r--r-- 1 kyamaguc staff 577476 3 2 17:24 isoforms.fpkm_tracking -rw-r--r-- 1 kyamaguc staff 407733 3 2 17:14 junctions.bed drwxr-xr-x 30 kyamaguc staff 1020 3 2 17:14 logs -rw-r--r-- 1 kyamaguc staff 176 3 2 17:12 prep_reads.info -rw-r--r-- 1 kyamaguc staff 0 3 2 17:24 skipped.gtf -rw-r--r-- 1 kyamaguc staff 8075446 3 2 17:24 transcripts.gtf -rw-r--r-- 1 kyamaguc staff 33862783 3 2 17:14 unmapped.bam genges.fpkm_tracking, isoforms.fpkm_tracking 2D_2, 2D_3, 2D2L_1, 2D2L_2, 2D2L_3 6sample Run cuffmerge 18

http://133.48.62.157/wiki5/doku.php?id=git2014sopen:prac_genome_based merge gtf assemble.txt ~/top_cuff/2d_1/transcripts.gtf ~/top_cuff/2d_2/transcripts.gtf ~/top_cuff/2d_3/transcripts.gtf ~/top_cuff/2d2l_1/transcripts.gtf ~/top_cuff/2d2l_2/transcripts.gtf ~/top_cuff/2d2l_3/transcripts.gtf cuffmerge $ cuffmerge -p 4 -s genome.fa -g genes.gtf assemblies.txt *genome.fa, genes.gtf merged_asm merged.gtf less Run cuffdiff GTF Run cufflinks, Run cuffmerge cuffdiff -p 4 merged.gtf -o 2D_vs_2D2L \ ~/top_cuff/2d_1/accepted_hits.bam,~/top_cuff/2d_2/accepted_hits.bam,~/top_cuff/2d_3/accepted_hits.bam \ ~/top_cuff/2d2l_1/accepted_hits.bam,~/top_cuff/2d2l_2/accepted_hits.bam,~/top_cuff/2d2l_3/accepted_hits.bam 2D_vs_2D2L Explore the results gene level gene_exp.diff genes.fpkm_tracking tab Excel Excel Q: How many genes are differentially expressed? Q: Scatter plot MA plot Links http://tophat.cbcb.umd.edu/ [http://tophat.cbcb.umd.edu/] TopHat http://cufflinks.cbcb.umd.edu/ [http://cufflinks.cbcb.umd.edu/] CuffLinks 19

http://133.48.62.157/wiki5/doku.php?id=git2014sopen:prac_genome_based Notes Trapnell, C. et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc 7, 562 578 (2012). git2014sopen/prac_genome_based.txt Last modified: 2014/03/05 14:49 by kyamaguc Except where otherwise noted, content on this wiki is licensed under the following license:cc Attribution-Share Alike 3.0 Unported [http://creativecommons.org/licenses/by-sa/3.0/] 20

http://133.48.62.157/wiki5/doku.php?id=git2014sopen:prac_denovo&#compile_results de novo RNA-seq pipeline genome sequence に依存しない transcript base の RNA-seq 解析のパイプラインを学ぶシロイヌナズナを明条件 (L) と暗条件 (D) で育てそれぞれ 3 個体から RNA を抽出して標準的な方法で RNAseq ライブラリを作製しイルミナシーケンサーでシーケンスした ( 片側 76base シーケンス ) 明暗の条件の差で発現の異なる遺伝子を同定したいモデル植物シロイヌナズナはゲノムシーケンスは既知であるがここではあえてゲノム情報は使わず de novo RNA-seq のアプローチで解析する Data Input reads ( ファイルは ~/data/data/ky/tophat/ にある ) condition Light, rep#1: 2D2L_1_R1.fastq condition Light, rep#2; 2D2L_2_R1.fastq condition Light, rep#3; 2D2L_3_R1.fastq condition Dark, rep#1; 2D_1_R1.fastq condition Dark, rep#2; 2D_2_R1.fastq condition Dark, rep#3; 2D_3_R1.fastq 本来は read quality check and cleaning を行なうべきであるが上記シーケンスデータは処理済み Overview 戦略 1. de novo assembly to build transcriptome reference tool: Trinity 2. mapping reads to transcriptome reference tool: bowtie2 3. abundance estimation tool: express 4. differential expression analysis tool: edger 5. annotation of reference sequences tool: blastx Setup practice_denovo ディレクトリを作製し以降の解析はそのディレクトリの下で作業しよう $ mkdir practice_denovo $ cd practice_denovo Sequence reads 解析に使うシーケンスデータをカレントディレクトリにコピーするオリジナルは ~/data/ky 21

http://133.48.62.157/wiki5/doku.php?id=git2014sopen:prac_denovo&#compile_results /tophat/ にある $ cp ~/data/ky/tophat/*_r1.fastq./ Tops: 6 つのファイルをひとつひとつコピーしても良いがワイルドカードを使うと便利 Read QC and pre-precessing 本来は read quality check and cleaning を行なうべきであるが上記シーケンスデータは処理済み ( 参考 ) QC のためのソフトウェア : FastQC, Low quality base trimming, adaptor trimming を行なうためのソフトウェア : cutadapt, de novo assembly Trinity で de novo assembly を行なうただこの実習では行なわない (Trinity は Mac をサポートしていないことハイスペックのマシンを要すること時間がかかることやや高度なインフォマティクススキルを要することが理由 ) ( 参考 ) Trinity を実行する例 input read の準備 $cat 2D2L_1_R1.fastq \ 2D2L_2_R1.fastq \ 2D2L_3_R1.fastq \ 2D_1_R1.fastq \ 2D_2_R1.fastq \ 2D_3_R1.fastq \ > seq_all.fq 実行ファイル run_trinity.sh の作製 run_trinity.sh!/bin/sh #=== configu === SEQ=seq_all.fq OUT=trinity_out MIN_KMER_COV=1 NCPU=24 JM=256G #=== TRINITY_BASE=~/bio/Applications/trinityrnaseq_r2013-02-25 ulimit -s unlimited 22

http://133.48.62.157/wiki5/doku.php?id=git2014sopen:prac_denovo&#compile_results $TRINITY_BASE/Trinity.pl \ --JM $JM \ --seqtype fq \ --min_kmer_cov $MIN_KMER_COV \ --single $SEQ --CPU $NCPU \ --output $OUT \ コマンドラインから実行 $ sh run_trinity.sh 今回は transcripts.fa を de novo assembly の結果得られたコンティグシーケンスセットとして解析を進める transcripts.fa は以下からダウンロードしてください http://133.48.62.157/download/git2014s/transcripts.fa.gz [http://133.48.62.157/download /git2014s/transcripts.fa.gz] Mapping bowtie2 を使ってリードを transcripts.fa にマッピングするまずは reference を indexing. $ bowtie2-index transcripts.fa transcripts マッピング ( 例 ) $ bowtie2 -x transcripts -U 2D_3_R1.fastq -p 8 -a -S 2D_3_R1.sam point: -a オプション express で mutihit を考慮した abundance estimation を行なうために必須もしくは -k オプションを使うここでは 2D_3_R1.fastq を mapping してその結果を D3.sam に保存している 6 つのファイルすべてで同様の処理を行なう結果ファイルは自分なりにルールを決めて規則正しく命名するとよい今回は 2D_3_R1.fastq => 2D_3_R1.sam 2D2L_1_R1.fastq => 2D2L_1_R1.sam... のルールで sam ファイルを生成しよう 2D2L_1_R1.sam 2D2L_2_R1.sam 2D2L_3_R1.sam 2D_1_R1.sam 2D_2_R1.sam 2D_3_R1.sam の 6 つのファイルができるはず Abundance estimation express を使って abundance estimation をおこなう ( 例 ) 23

http://133.48.62.157/wiki5/doku.php?id=git2014sopen:prac_denovo&#compile_results $express transcripts.fa -o 2D2L_2_R1 2D2L_2_R1.sam 上の場合 2D2L_2_R1.sam の解析結果が 2D2L_2_R1 ディレクトリ以下に保存されるその中の results.xprs にカウントデータが格納されている results.xprs の中身をみてみよう results.xprs の est_counts カラムを以下のedgeR 解析に使う各 results.xprsのest_countsのカラムだけ抽出してひとつのテーブルにまとめたいそのためには簡単なプログラミングが必要である今回は以下のRuby scriptで処理してください download merge_express_result.rb [http://133.48.62.157/download/git2014 /merge_express_result.rb] (show code) ( 使い方 ) $ruby merge_express_result.rb directory1 directory2 directory3... ( 例 ) $ruby merge_express_result.rb 2D2L_1_R1 2D2L_2_R1 2D2L_3_R1 2D_1_R1 2D_2_R1 2D_3_R1 > expr_est_count_ 上の例ではマージしたテーブルをシェルのリダイレクトの機能を使って expr_est_count_merged.txt に保存している中身は以下のとおり #=== express Summary Table === # # source: # 2D_1_R1/express_out/results.xprs # 2D_2_R1/express_out/results.xprs # 2D_3_R1/express_out/results.xprs # 2D2L_1_R1/express_out/results.xprs # 2D2L_2_R1/express_out/results.xprs # 2D2L_3_R1/express_out/results.xprs # target column: 6 # script: merge_express_result.rb # date: Tue Mar 04 14:06:44 +0900 2014 # # id 2D_1_R1 2D_2_R1 2D_3_R1 2D2L_1_R1 2D2L_2_R1 2D2L_3_R1 0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1 3.070183 0.003086 2.122748 1.575432 3.311783 4.431309 10 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 100 0.000169 0.000000 0.000000 0.500000 0.500000 0.500000... Differential expression analysis 複雑な統計計算で発現変動解析をあれこれ行なう前に得られたカウントデータの概要を把握しておくことが大事エクセルにインポートしてデータをながめてみよう R にインポートしてデータをながめてみよう data import > dat <- read.delim("expr_est_count_merged.txt", comment="#", head=f, row.name=1) > libs <- c("d1", "D2", "D3", "L1", "L2", "L3") > colnames(x) <- libs > dim(d) [1] 6384 6 > colsums(d) D1 D2 D3 L1 L2 L3 24

http://133.48.62.157/wiki5/doku.php?id=git2014sopen:prac_denovo&#compile_results 37511 34462 38288 53185 46775 49838 scatter plot > plot(x$d1+1, x$d2+1, log="xy") > pairs(x+1, log="xy") MA plot 同一 condition 内のサンプルのばらつきに比べて Dark vs Light 間のデータのばらつきほうが大きいことが確認できるだろう edger による differential expression analysis > library(edger) >group <- c("d", "D", "D", "L", "L", "L") > D <- DGEList(dat, group=group) # import table into edger > D <- calcnormfactors(d, method="tmm") # TMM normalization > D$samples group lib.size norm.factors D1 D 37511 1.0101512 D2 D 34462 1.0537642 D3 D 38288 1.0160306 L1 L 53185 0.9552525 L2 L 46775 0.9780374 L3 L 49838 0.9896684 >D <- estimatecommondisp(d) # estimate common dispersion > D$common.dispersion [1] 0.05574236 >D <- estimatetagwisedisp(d) # estimate tagwise dispersion > summary(d$tagwise.dispersion) Min. 1st Qu. Median Mean 3rd Qu. Max. 0.000871 0.000871 0.026900 0.059340 0.067680 1.029000 >de <- exacttest(d, pair=c("d", "L")) >toptags(de) # significance test to find differentially expressed genes # view the most significant genes edger 解析結果のファイルへの書き出し # dump normalized count data > D.cpm.tmm <- cpm(d, normalized.lib.size=t) > write.table(d.cpm.tmm, file="cpm.tmm.txt", sep="\t", quote=f) # dump DE analysis result > tmp <- toptags(de, n=nrow(de$table)) > write.table(tmp$table, "de.txt", sep="\t", quote=f) このコードで normalized countデータ cpm.tmm.txt DE significance de.txt に保存される ( 補足 :edgerに含まれる便利なコマンド) plotsmear: edgerに含まれるma 描画ツール 25

http://133.48.62.157/wiki5/doku.php?id=git2014sopen:prac_denovo&#compile_results de.names <- row.names(d[decidetestsdge(de, p.value=0.01)!=0, ]) plotsmear(d, de.tags=de.names) plotmds: MDS plot を行なうとライブラリ間の発現パターンの類似性をおおざっぱにとらえることができる plotmds(d) DE 解析結果の吟味 de.txt を MS Excel や R を使って吟味しよう Q1: differentially expressed genes は何個あるか? q-valueを0.01, 0.05の閾値に設定したときで数値はどうかわるか? Q2: Q1 のうちDarkで有意に上がっている遺伝子 Lightで有意に上がっている遺伝子にわけるとそれぞれ何個ずつか? Q3: 上位の遺伝子はどのような遺伝子か現時点ではIDと配列しかわからないので BLASTサーチをしてみよう ( 参考 ) Q2に答えるためには edger command decidetestsdgeも便利 > summary(decidetestsdge(de.tagwise, p.value=0.05)) [,1] -1 49 0 25903 1 269 Quick annotation using BLAST transcripts.fa は ID とシーケンス情報しかないのでそれぞれがどのような遺伝子をコードしているかは不明でありアノテーションを行なう必要がある BLAST による相同性検索はおおまかなアノテーションを行なうのに便利な手法であるここではシロイヌナズナのタンパク質データベースを検索することにより各コンティグがシロイヌナズナのどのタンパク質に対応するかを調べる今回はシロイヌナズナのシーケンスが query となるのでシロイヌナズナのタンパク質データベースに対して検索をかけるとほぼ 100% ヒットする非モデル生物の de novo RNAseq では de novo アセンブリで得られたコンティグを query にモデル生物のタンパク質データベースや nr データベースに対して検索をかけることになるシロイヌナズナのタンパク質アミノ酸配列セットを取得する ftp://ftp.arabidopsis.org/home/tair/sequences/blast_datasets/tair10_blastsets /TAIR10_pep_20101214_updated [ftp://ftp.arabidopsis.org/home/tair/sequences/blast_datasets /TAIR10_blastsets/TAIR10_pep_20101214_updated] $ makeblastdb -in TAIR10_pep_20101214_updated -dbtype prot -parse_seqids # build blastdb ( 検索例 ) blastx -query transcripts.fa -db TAIR10_pep_20101214_updated -num_threads 8 -max_target_seqs 1 \ -evalue 1.0e-8 -outfmt 6 > transcripts.blastx.tair10.txt 上の例ではトップヒットだけをテーブル形式で出力しているこれで transcripts.fa の ID と coding gene の対応がとれた 26

http://133.48.62.157/wiki5/doku.php?id=git2014sopen:prac_denovo&#compile_results Compile results model 生物の場合大半の遺伝子に詳細なアノテーションがついているのでそれらの情報と紐づけするとさらに利便性は上がるシロイヌナズナの場合各遺伝子の functional annotation は以下のファイルにまとめられており TAIR のウェブサイトからダウンロードすることができる ftp://ftp.arabidopsis.org/home/tair/genes/tair10_genome_release /TAIR10_functional_descriptions [ftp://ftp.arabidopsis.org/home/tair/genes /TAIR10_genome_release/TAIR10_functional_descriptions] これらの DE analysis, annotation data, をひとつのテーブルにまとめると見やすくまたさらなる下流解析を行なうのにも便利であるその処理には簡単なプログラミングが必要である今回は compile_results.rb を使って下さい使い方 download compile_results.rb [http://133.48.62.157/download/git2014/compile_results.rb] (view source) $ruby compile_results.rb > result_merged.txt 結果を MS Excel で吟味しよう Dark vs Light でどのような遺伝子が変動するか biological interpretation が可能になったはず git2014sopen/prac_denovo.txt Last modified: 2014/03/05 21:06 by shige Except where otherwise noted, content on this wiki is licensed under the following license:cc Attribution-Share Alike 3.0 Unported [http://creativecommons.org/licenses/by-sa/3.0/] 27

ex1: TopHat キイロショウジョウバエ Drosophila melanogaster の RNA-seq を行った ライブラリは 2 種類 それぞれ single end( イン

ex1: TopHat キイロショウジョウバエ Drosophila melanogaster の RNA-seq を行ったライブラリは 2 種類それぞれ single end( イン