NGS_tools_top

Similar documents
<4D F736F F F696E74202D D D E C815B836A F B83582E >

Maeda140303

Introduction Purpose This training course describes the configuration and session features of the High-performance Embedded Workshop (HEW), a key tool

PowerPoint プレゼンテーション

fx-9860G Manager PLUS_J


Introduction Purpose This training course demonstrates the use of the High-performance Embedded Workshop (HEW), a key tool for developing software for

バクテリアゲノム解析

Introduction Purpose This course explains how to use Mapview, a utility program for the Highperformance Embedded Workshop (HEW) development environmen

AJACS18_ ppt

評論・社会科学 84号(よこ)(P)/3.金子

25 II :30 16:00 (1),. Do not open this problem booklet until the start of the examination is announced. (2) 3.. Answer the following 3 proble

28 Docker Design and Implementation of Program Evaluation System Using Docker Virtualized Environment

_念3)医療2009_夏.indd

はじめに

GPGPU

Microsoft Word - PCM TL-Ed.4.4(特定電気用品適合性検査申込のご案内)

soturon.dvi

Page 1 of 6 B (The World of Mathematics) November 20, 2006 Final Exam 2006 Division: ID#: Name: 1. p, q, r (Let p, q, r are propositions. ) (10pts) (a

Studies of Foot Form for Footwear Design (Part 9) : Characteristics of the Foot Form of Young and Elder Women Based on their Sizes of Ball Joint Girth


,,,,., C Java,,.,,.,., ,,.,, i

AtCoder Regular Contest 073 Editorial Kohei Morita(yosupo) A: Shiritori if python3 a, b, c = input().split() if a[len(a)-1] == b[0] and b[len(

<95DB8C9288E397C389C88A E696E6462>

1 I EViews View Proc Freeze

Kyushu Communication Studies 第2号

How to read the marks and remarks used in this parts book. Section 1 : Explanation of Code Use In MRK Column OO : Interchangeable between the new part

How to read the marks and remarks used in this parts book. Section 1 : Explanation of Code Use In MRK Column OO : Interchangeable between the new part

How to read the marks and remarks used in this parts book. Section 1 : Explanation of Code Use In MRK Column OO : Interchangeable between the new part

浜松医科大学紀要

How to read the marks and remarks used in this parts book. Section 1 : Explanation of Code Use In MRK Column OO : Interchangeable between the new part


(Keiichiro Ono) UC, San Diego Trey Ideker Lab Research Associate/ Software Engineer 2

LC304_manual.ai

fiš„v8.dvi

29 jjencode JavaScript


L3 Japanese (90570) 2008

open / window / I / shall / the? something / want / drink / I / to the way / you / tell / the library / would / to / me

elemmay09.pub

生研ニュースNo.132



1 # include < stdio.h> 2 # include < string.h> 3 4 int main (){ 5 char str [222]; 6 scanf ("%s", str ); 7 int n= strlen ( str ); 8 for ( int i=n -2; i

206“ƒŁ\”ƒ-fl_“H„¤‰ZŁñ

„h‹¤.05.07

A Nutritional Study of Anemia in Pregnancy Hematologic Characteristics in Pregnancy (Part 1) Keizo Shiraki, Fumiko Hisaoka Department of Nutrition, Sc

00_1512_SLIMLINE_BOOK.indb

Vol. 48 No. 4 Apr LAN TCP/IP LAN TCP/IP 1 PC TCP/IP 1 PC User-mode Linux 12 Development of a System to Visualize Computer Network Behavior for L

SPSS

Microsoft Word - Meta70_Preferences.doc

1..FEM FEM 3. 4.

_Y05…X…`…‘…“†[…h…•

1 1 tf-idf tf-idf i


Motivation and Purpose There is no definition about whether seatbelt anchorage should be fixed or not. We tested the same test conditions except for t

According to Nihon no doyo, kodomo no uta no genjo to bunseki, a joint research conducted by the National Association of College Music Education, many

L1 What Can You Blood Type Tell Us? Part 1 Can you guess/ my blood type? Well,/ you re very serious person/ so/ I think/ your blood type is A. Wow!/ G

ñ{ï 01-65

Introduction Purpose The course describes library configuration and usage in the High Performance Embedded Workshop (HEW), which speeds development of

JJ-90

NKK NEWS 2012

Journal of Geography 116 (6) Configuration of Rapid Digital Mapping System Using Tablet PC and its Application to Obtaining Ground Truth

自分の天職をつかめ

Sequencher 4.9 Confidence score Clustal Clustal ClustalW Sequencher ClustalW Windows Macintosh motif confidence Sequencher V4.9 Trim Ends Without Prev

The copyright of this material is retained by the Information Processing Society of Japan (IPSJ). The material has been made available on the website

Fig. 1 Schematic construction of a PWS vehicle Fig. 2 Main power circuit of an inverter system for two motors drive

How to read the marks and remarks used in this parts book. Section 1 : Explanation of Code Use In MRK Column OO : Interchangeable between the new part

NSR-500 Create USB Installer Procedures


25 D Effects of viewpoints of head mounted wearable 3D display on human task performance

2 The Bulletin of Meiji University of Integrative Medicine 3, Yamashita 10 11

1 Web [2] Web [3] [4] [5], [6] [7] [8] S.W. [9] 3. MeetingShelf Web MeetingShelf MeetingShelf (1) (2) (3) (4) (5) Web MeetingShelf

国際恋愛で避けるべき7つの失敗と解決策

Microsoft Word - MetaFluor70取扱説明.doc

日本版 General Social Surveys 研究論文集[2]

Visual Evaluation of Polka-dot Patterns Yoojin LEE and Nobuko NARUSE * Granduate School of Bunka Women's University, and * Faculty of Fashion Science,

千葉県における温泉地の地域的展開

Introduction to Information and Communication Technology (a)

Microsoft Word - Win-Outlook.docx

nagasaki_GMT2015_key09

JOURNAL OF THE JAPANESE ASSOCIATION FOR PETROLEUM TECHNOLOGY VOL. 66, NO. 6 (Nov., 2001) (Received August 10, 2001; accepted November 9, 2001) Alterna

RX600 & RX200シリーズ アプリケーションノート RX用仮想EEPROM


tikeya[at]shoin.ac.jp The Function of Quotation Form -tte as Sentence-final Particle Tomoko IKEYA Kobe Shoin Women s University Institute of Linguisti

Microsoft Word - j201drills27.doc

高等学校 英語科


10-渡部芳栄.indd

BS・110度CSデジタルハイビジョンチューナー P-TU1000JS取扱説明書

IPSJ SIG Technical Report Vol.2009-BIO-17 No /5/26 DNA 1 1 DNA DNA DNA DNA Correcting read errors on DNA sequences determined by Pyrosequencing

NSR-500 Create DVD Installer Procedures

「プログラミング言語」 SICP 第4章 ~超言語的抽象~ その6

untitled



Virtual Window System Virtual Window System Virtual Window System Virtual Window System Virtual Window System Virtual Window System Social Networking

先端社会研究 ★5★号/4.山崎

卒業論文2.dvi

Transcription:

"#$"%&"'()*+,(-.// 0122,(3456547 / 89:7(87;"<=&9 (>(?5$@9@87(A5%5;9B87 (>(149=(CB87D5%5 1

"#$% % &'()**+,$% % -./)**+,$% % % 2

次 世 代 DNAシーケンシングデータの 可 視 化 法 お 手 軽 ツールIntegrative Genomics Viewer(IGV) 2011.09.09 基 礎 生 物 学 研 究 所 生 物 機 能 解 析 センター 山 口 勝 司 NIBB CORE RESEARCH FACILITIES FUNCTIONAL GENOMICS FACILITY NIBB CORE RESEARCH FACILITIES FUNCTIONAL GENOMICS FACILITY NIBB CORE RESEARCH FACILITIES FUNCTIONAL GENOMICS FACILITY 可 視 化 ツールに 求 められるもの 膨 大 なデータを 如 何 に 直 感 的 に 理 解 できるようにするか 遺 伝 子 発 現 の 数 値 情 報 位 置 情 報 SNPの 位 置 情 報 頻 度 情 報 様 々なデータの 精 度 情 報 gene model / gene annotationと 並 べて 比 較 複 数 のデータセットを 並 べて 比 較 色 々なデータを 比 較 統 合 的 に 解 釈 できるようにしたい ゲノムviewerに 自 分 のデータを 乗 せ 統 合 的 直 感 的 に 解 釈 できること 3

可 視 化 ツールの 取 捨 選 択 基 準 を 考 える 1. お 金 を 出 す 気 があるか 無 料 / 有 料 / 基 本 無 料 2. 誰 が 使 うか 個 人 レベル/ コミュニティーレベル 3. 利 用 深 度 データを 見 るだけ/ 自 分 から 色 々 工 夫 4. 利 用 しやすさ 導 入 に 必 要 なコンピュータスペック マニュアルの 分 かりやすさ 利 用 の 簡 便 さ 利 用 者 が 多 いか/ 情 報 の 多 さ お 手 軽 ツール Integrative Genomics Viewer(IGV) アカデミックウェアで 無 料 コミュニティーでの 利 用 者 が 多 いから 情 報 も 多 い javaのプログラムなので オールプラットフォーム 対 応 マニュアルは 親 切 サンプルデータのある WEBサーバーではなく ではなく PCレベルでできる 共 同 利 用 に 供 するには 誰 もが 簡 便 に 使 える 必 要 がある 4

IGV(Integrative genome viewer)によるデータ 可 視 化 次 世 代 シーケンサーのデータを 見 るためには BAMファイルをロードするだけ http://www.broadinstitute.org/igv/ Produced by Broad Institute 5

6

簡 易 説 明 7

ユーザーガイド 8

9

ゲノムViewerなので 次 世 代 DNAシーケンサーのデータに 限 定 されない マイクロアレイの 結 果 や ゲノムアノテーションの 情 報 も 随 時 表 示 できる 対 応 するファイル 形 式 に 応 じて 表 示 方 法 が 決 まる WEBサイト 説 明 より 10

.gctファイル サンプル 間 遺 伝 子 間 の 発 現 プロファイル WEBサイト 説 明 より Gene List View Loading/Defining Gene Lists My Lists WEBサイト 説 明 より 11

Nature Biotech. 29:24 26 (2011) Supplement figureからの 抜 粋 Nature Biotech. 29:24 26 (2011) Supplement figureからの 抜 粋 12

公 開 情 報 のviewerとして Human body map project 13

さあ 実 際 使 ってみましょう 14

登 録 されていない 生 物 種 配 列 でも 自 分 でimportすればOK 今 回 のデータはChr2:1-2,000,000のみです2 000のみです スケール 変 更 15

BAMファイルを 読 み 込 む File->Load from file で 用 意 されているファイルの 中 からLer.bamを 読 み 込 む 次 世 代 シーケンサーデータの 神 髄 (リード 配 列 の 情 報 とゲノム 上 のマッピングに 関 する 情 報 ) mutファイルを 読 み 込 む (File-> Load from fileからler.mutをロード) Chr1 711 711 Ler test Chr1 892 892 Ler test Chr1 956 956 Ler test Chr1 10904 10904 Ler test Chr1 32210 32210 Ler test Chr1 37388 37388 Ler test Chr1 49438 49438 Ler test Chr1 71326 71326 Ler test Chr1 71348 71348 Ler test Chr1 88300 88300 Ler test 16

mut bam トラックの 移 動 による 移 動 マウスのドラッグも 移 動 popupで 情 報 を 見 ることが 可 能 gene modelのtrackを 指 定 して Ctr+F gene model 単 位 で 右 に 移 動 Ctr+B gene model 単 位 で 左 に 移 動 17

データを 全 部 消 し 別 のデータセットへ 変 異 体 のSNPを 確 認 してみましょう 1.VCFファイルを 読 み 込 ませてみる File-> Load from file からSummary.vcfをロード 2.BEDファイルを 読 み 込 ませてみる 3.WIGファイルを 読 み 込 ませてみる 4.Chr2: 388,854に 移 動 5.BAMファイルを 複 数 並 べてみる VCF file (Variant call file) 個 々のサンプルのSNP 位 置 と 頻 度 確 からさが 記 載 されている (ここではcontに 対 し mutant3 株 のデータが 含 まれている) 青 色 がheteroSNP 水 色 がhomoSNPを 意 味 する Samtools, Vcftoolsで 作 成 可 能 SNPを 確 認 してみましょう 1.VCFファイルを 読 み 込 ませてみる 2.BEDファイルを 読 み 込 ませてみる File-> Load from file からtranspozon.bedをロード 3.WIGファイルを 読 み 込 ませてみる 4.Chr1:28211145に 移 動 4.BAMファイルを 複 数 並 べてみる 18

BEDファイルの 描 画 ここではトランスポゾン 領 域 が 分 かるようにします SNPを 確 認 してみましょう 1.VCFファイルを 読 み 込 ませてみる 2.BEDファイルを 読 み 込 ませてみる 3.WIGファイルを 読 み 込 ませてみる File-> Load from file からcg_ratio.wig.tdfをロード 4.Chr2:28211145に 移 動 5.BAMファイルを 複 数 並 べてみる マウス 右 クリックでtrack 表 示 の 設 定 変 更 が 可 能 Type of Graph -> Line plot Windowing functuib -> Set datqa range -> min 0 max 100 Change track height 19

WIGファイル 単 純 なベースごとの 数 値 データを 扱 う 時 などに 用 いる 一 定 幅 での 塩 基 CG ratioのデータを デ 表 示 させてみましょう テキスト 形 式 の 大 きなファイルは 描 画 に 時 間 がかかる バイナリー 化 すべき WIG hdf (IGV toolで 可 能 今 回 は 処 理 済 み 説 明 は 省 略 File-> Run igvtoolsで 起 動 可 能 tileを 選 択 ) variablestep chrom=chr1 52 36 53 36 54 34 55 34 56 34 57 32 58 32 59 32 60 34 61 36 62 36 63 34 64 32 65 32 SNPを 確 認 してみましょう 1.VCFファイルを 読 み 込 ませてみる 2.BEDファイルを 読 み 込 ませてみる 3.WIGファイルを 読 み 込 ませてみる 4.Chr2:388854に 移 動 5.BAMファイルを 複 数 並 べてみて 目 的 領 域 のリード 状 況 をみてみましょう File-> Load from file から cont.bam mutant1.bam mutant2.bam mutant3.bam をロード BAMファイルやVCFファイルの 描 画 にはindexファイルも 必 要 igv toolsで 作 成 可 能 今 回 はすでに 作 成 してあります 20

リード 配 列 が 離 れているものにも 対 応 Split Screen View メイトペアデータの 場 合 のみ Paired view 21

その 他 の 便 利 機 能 セッションの 保 存 表 示 しているデータの 読 み 込 み 状 況 を それごと 保 存 セッションをロードすることで 意 図 した 画 面 を 表 示 できる データセットが 揃 っていること フォルダー 構 造 が 同 一 である 必 要 がある バッチ 処 理 重 要 領 域 の 画 面 スナップショットを 自 動 で 取 ったりできる new load myfile.bam snapshotdirectory mysnapshotdirectory genome hg18 goto chr1:65,289,335-65,309,335 sort position collapse snapshot goto chr1:113,144,120-113,164,120 sort base collapse snapshot IGV 紹 介 のまとめ 可 視 化 ツールとして 十 分 な 機 能 を 持 つ 無 料 比 較 的 簡 単 お 手 軽 次 世 代 DNAシーケンサー 解 析 の 標 準 的 ツールになりつつある 自 分 で 見 るためにも 良 し 人 に 見 せるためにも 良 し 利 用 範 囲 は 次 世 代 DNAシーケンサーに に 限 定 しない 広 くゲノミクスの 解 析 に 有 用 ウェブサイトを 見 ながら 復 習 して 頂 けたら もっと 良 く 分 かるはず 22

samtools "#$%%&' "#$%&'(&)*&$"+,-).&)/0#123$456.1/$,7$1$-&)&6,*$456.1/$ 456$7/56,)-$+16-&$)(*+&589&$7&'(&)*&$1+,-).&)/7:$$ "#$;55+7$265<,9&$<16,5(7$(8+,8&7$456$.1),2(+18)-$ 1+,-).&)/7$,)$/=&$"#$456.1/>$,)*+(9,)-$7568)->$.&6-,)->$,)9&?,)-$1)9$-&)&618)-$1+,-).&)/7$,)$1$2&6@257,85)$456.1/:$ References Li et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics (2009) vol. 25 (16) pp. 2078-9 http://samtools.sourceforge.net/ 23

"#$%%&'( "#0A"# $.122,)- A"# $ $ "# A"# $ A"# 756/$B$,)9&?,)-$%&?C$ DEF3$ $.122,)-$ $ GH$*1++ $ SAMtools NGS "#$%%&' samtools$,7$1$*5..1)9@+,)&$/55+:$ Usage: samtools <command> [options] IGDJ$K1L$ M6,N&)$,)$O:$O5.2,+&$1)9$,)7/1++$PL$make command line 6&19$95*(.&)/7$*16&4(++L$ K&P7,/&$95*(.&)/7$ =&+2$95*(.&)/$ QR"S#R$ 24

=N2C0071./55+7:75(6*&456-&:)&/0 =N2C0071./55+7:75(6*&456-&:)&/071./55+7:7=/.+ 25

$ samtools Program: samtools (Tools for alignments in the SAM format) Version: 0.1.18 (r982:295) Usage: samtools <command> [options] Command: view SAM<->BAM conversion sort sort alignment file mpileup multi-way pileup depth compute the depth faidx index/extract FASTA tview text alignment viewer index index alignment idxstats BAM index stats (r595 or later) fixmate fix mate information flagstat simple stats calmd recalculate MD/NM tags and '=' bases merge merge sorted alignments rmdup remove PCR duplicates reheader replace BAM header cat concatenate BAMs targetcut cut fosmid regions (for fosmid pool only) phase phase heterozygotes )'*('+,-.%//012 $ samtools view Usage: samtools view [options] <in.bam> <in.sam> [region1 [...]] Options: -b output BAM -h print header for the SAM output -H print header only (no alignments) -S input is SAM -u uncompressed BAM output (force -b) -x output FLAG in HEX (samtools-c specific) -X output FLAG in string (samtools-c specific) -c print only the count of matching records -t FILE list of reference names and lengths (force -S) [null] -T FILE reference sequence file (force -S) [null] -o FILE output file name [stdout] -R FILE list of read groups to be outputted [null] -f INT required flag, 0 for unset [0] -F INT filtering flag, 0 for unset [0] -q INT minimum mapping quality [0] -l STR only output reads in library STR [null] -r STR only output reads in read group STR [null] -? longer help 26

31'$0&&0$4%1 D)7/1++185) $ $ # download samtools-0.1.18.tar.bz2 $ tar xjvf samtools-0.1.18.tar.bz2 # $ cd samtools-0.1.18 $ less INSTALL # Type `make' to compile samtools. $ make # path samtools H61*8*&$ $ $ $ http://133.48.62.157/download/git2011a/samtools_practice.tgz $ $ $ 27

5%16*7$("#($%(8"# $ samtools view -Sb ex1.sam -o ex1.bam Try - Q1. less ex1.sam - Q2. less Run1.bam - Q3. samtools view ex1.sam bam - Q3. ls command sam => bam view9(:4*;(8"#(<4&* NA12878.chr16p.bam 1000 Genomes Project CEU ( ) chromosome 16 : 48,000,000 50,000,000 bam file $ samtools view NA12878.chr16p.bam $ samtools view NA12878.chr16p.bam less Try - Q1. samtools view NA12878.chr16p.bam 28

view9(31'=*.$(>*02*7(=07$( T&19&6$216/$54$A"#$U+&$,)*+(9&7$(7&4(+$,)456.185) $ samtools view H NA12878.chr16p.bam @HD VN:1.0 GO:none SO:coordinate @SQ SN:1 LN:249250621 AS:NCBI37 UR:file:/home/shige/hg19.fasta \ M5:1b22b98cdeb4a9304cb5d48026a85128 @SQ SN:2 LN:243199373 AS:NCBI37 UR:file:/home/shige/hg19.fasta \ M5:a0d9851da00400dec1098a9255ac712e... @RG ID:ERR001268 PL:ILLUMINA LB:NA12878.1 PI:200 DS:SRP000032 \ SM:NA12878 CN:MPIMG @RG ID:ERR001269 PL:ILLUMINA LB:NA12878.1 PI:200 DS:SRP000032 \ SM:NA12878 CN:MPIMG... @PG ID:bwa VN:0.5.5 - Each header line begins with character @ followed by a two-letter record type code. - Each line is TAB-delimited. - each data field follows a format TAG:VALUE where TAG is a two-letter string view9(5%16*7$(8"#($%("# $ samtools view -h NA12878.chr16p.bam > NA12878.chr16p.sam Try - Q1. NA12878.chr16p.bam sam - Q2. h 29

view9(:4*;(0('=*.4<4.(7*?4%1( 16:49,000,000 -- 49,000,100 $ samtools view NA12878.chr16p.bam 16:49,000,000-49,000,100 [bam_index_load] fail to load BAM index. # <== Error message [main_samview] random alignment retrieval only works for indexed BAM files. ## build index $ samtools index NA12878.chr16p.bam # => NA12878.chr16p.bam.bai ## try again $samtools view NA12878.chr16p.bam 16:49,000,000-49,000,100... - Q1: - Q2: - Q3: 16:49,000,001 ( ) - Q4: 16:49,900,000-50,000,100 bam file merge9(/*7?*(8"#'( Run1.bam, Run2.bam bam file ## inspect Run1.bam & Run2.bam # Run1.bam Run2.bam? ## merge $ samtools merge out.bam Run1.bam Run2.bam ## check # Run1+Run2 [ ] Run1, Run2 samtools merge h samtools reheader 30

42@'$0$'9(/0==*2(7*02' Retrieve and print stats in the index file. $ samtools idxstats NA12878.chr16p.bam... 15 102531392 0 0 16 90354753 2175422 78412 17 81195210 0 0... VW$;"A$9&+,.,/&9$/&?/$ XC$6&4&6&)*&$7&'(&)*&$)1.&$ YC$6&4&6&)*&$7&'(&)*&$+&)-/=$ ZC$[$.122&9$6&197$ \C$[$().122&9$6&197:$ <&0?'$0$A(2*=$> flagstat: Collect some statistics about alignment $ samtools flagstat NA12878.chr16p.bam 2253834 + 0 in total (QC-passed reads + QC-failed reads) 131828 + 0 duplicates 2175422 + 0 mapped (96.52%:nan%) 1907026 + 0 paired in sequencing 953675 + 0 read1 953351 + 0 read2 1589213 + 0 properly paired (83.33%:nan%) 1750199 + 0 with itself and mate mapped 78415 + 0 singletons (4.11%:nan%) 47076 + 0 with mate mapped to a different chr 27432 + 0 with mate mapped to a different chr (mapq>=5) depth: compute the depth 1 coverage (depth) $ samtools depth NA12878.chr16p.bam head 16 47999937 1 16 47999938 1 31

tview9(:4'+0&4b*(0&4?1/*1$ IGV alignment visualize ## you need reference sequence $ samtools tview NA12878.chr16p.bam human_chr16_partial.fasta # type g for go to the region of your interest (ex, 16:49000000) # type? for display help ]56$19<1)*&9$(7&6 view9(c4&$*741?(d02601.*2e samtools view f BAM_FILE - Q1: sample = SRR010937 Ans : -r SRR010937 - Q2: PE Ans : -f 3 32

]56$19<1)*&9$(7&6 F%7G(%1(0('$7*0/ 1./55+7$,7$9&7,-)&9$/5$K56^$5)$1$7/6&1.:$D/$6&-1697$1)$,)2(/$U+&$_@`$17$/=&$7/1)9169$,)2(/$%7/9,)3$1)9$1)$ 5(/2(/$U+&$_@`$17$/=&$7/1)9169$5(/2(/$%7/95(/3:$&<&61+$*5..1)97$*1)$/=(7$P&$*5.P,)&9$K,/=$I),?$2,2&7:$ 1./55+7$1+K1L7$5(/2(/$K16),)-$1)9$&6656$.&771-&7$/5$/=&$7/1)9169$&6656$5(/2(/$%7/9&663:$$ a$71./55+7$<,&k$@($1+):p1.$jcx>bbb>bbb@x>xbb>bbb$c$71./55+7$2,+&(2$@$ $$ a71./55+7$<,&k$@=$,):p1.$c$-6&2$@<$defqecgcrqqbbbbxewd$c$71./55+7$<,&k$@p$@$w$5(/:p1.$ $$$ ]56$19<1)*&9$(7&6 H*/%$*(0..*''(%6*7($>*(41$*71*$ 1./55+7$,7$1P+&$/5$52&)$1$A"#$%)5/$"#3$U+&$5)$1$6&.5/&$];H$56$T;;H$7&6<&6$,4$/=&$A"#$U+&$)1.&$7/16/7$ K,/=$_h2C00`$56$_=N2C00`:$1./55+7$*=&*^7$/=&$*(66&)/$K56^,)-$9,6&*/56L$456$/=&$,)9&?$U+&$1)9$K,++$95K)+519$ /=&$,)9&?$(25)$1P7&)*&:$1./55+7$95&7$)5/$6&/6,&<&$/=&$&)86&$1+,-).&)/$U+&$()+&77$,/$,7$17^&9$/5$95$75:$ $ samtools view http://133.48.62.157/download/git2010/ NA12878.chr16p.bam $ samtools view http://s3.amazonaws.com/1000genomes/pilots_bam/na06984/ NA06984.454.MOSAIK.SRP000033.2009_11.chr22_1_49691432.bam O+5(9$*5.2(8)-$54$GE$,7$&17L$K,/=$71./55+7:$ 33

"#/55+7 GE $ $ 34

"#$%%&' "#$%%&' "#$%%&'()'(*('%+,*-.('/)$.(0%-($1.(2%34*-)'%56(3*5)4/&*7%5( *58(*55%$*7%5()5("#(*58(9::(0%-3*$;(( "#$%%&'(*&'%('/44%-$'($1.(2%34*-)'%5(%0('.</.52.( *&)=53.5$'()5(>?(0%-3*$($%(@%$1("#(*58(9::(0.*$/-.';( References Quinlan, A.R. & Hall, I.M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841 842 (2010). http://code.google.com/p/bedtools 35

Aaron R. Quinlan and Ira M. Hall features can be custom annotations that an individual lab or researcher defines (e.g., my novel gene or 1 variant). Department of Biochemistry and Molecular Genetics, University of Virginia S Public Health Genomics, University of Virginia, Charlottesville, VA 22908, USA The basic characteristics of a genome feature are the chromosome or scaffold on which the feature resides, the base pair on which the feature starts (i.e. the start ), the base pair on which feature Associate Editor: Martin Bishop ends (i.e. the end ), the strand on which the feature exists (i.e. + or - ), and the name of the feature if one is applicable. ABSTRACT The two most widely used formats for representing genome features are the BED (Browser Extensible analyses often requi Data) and GFF (General Feature Format) formats. BEDTools was originally written to work Motivation: Testing for correlations between different sets of faster and more flex exclusively with genome features described using the BED format, but it has been recently genomic extended to features seamlessly work is with a fundamental BED, GFF and VCF taskfiles. in genomics research. and more diverse s However, searching for overlaps between features with existing webbased format from methods the UCSC Genome is complicated Browser s Table Browser by the(http://genome.ucsc.edu/cgi-bin/hgtables? massive datasets that are technologies. In an acute by the data Existing annotations for the genomes of many species can be easily downloaded in BED and GFF command=start) or from the Bulk Downloads page (http://hgdownload.cse.ucsc.edu/downloads.html). routinely produced with current sequencing technologies. Fast and BEDTools, a fast an In addition, the Ensemble Genome Browser contains annotations in GFF/GTF format for many species flexible (http://www.ensembl.org/info/data/ftp/index.html) tools are therefore required to ask complex questions of these on genomic feature data Section in4 an of this efficient manual describes manner. BED and GFF formats in detail and (Quinlan illustrates how et to define al your 2010) Results: own annotations. This article introduces a new software suite for the comparison, manipulation and annotation of genomic features 2 FEATURES in1.3.2 Browser Overlapping Extensible / intersecting Data features. (BED) and General Feature Format 2.1 Common sc (GFF) Two genome format. features BEDTools (henceforth referred alsoto supports as features ) are thesaid comparison to overlap or intersect of sequence if they share at least one base in common. In the figure below, Feature A intersects/overlaps Feature B, but it does alignments in BAM format to both BED and GFF features. The tools Genomic analyses o not intersect/overlap Feature C. are extremely efficient and allow the user to compare large datasets in an experiment to "#$%&'#( "#$%&'#* (e.g. next-generation sequencing data) with both public and custom genomic features fr genome annotation tracks. BEDTools can be combined with one in common, they ar "#$%&'#) another as well as with standard UNIX commands, thus facilitating example, a typical routine genomics tasks as well as pipelines that can (BEDtools quickly answer Manual) variants overlap w intricate questions of large genomic datasets. identify overlappin 9 Availability and implementation: BEDTools was written in C++. set A and repeatedl Source code and a comprehensive user manual are freely available set B. While effect at http://code.google.com/p/bedtools screening for over Contact: aaronquinlan@gmail.com; imh4y@virginia.edu sequence alignment "#$%%&'( Supplementary information: Supplementary data are available at track for the human Bioinformatics online. asking more compli genomic features. B such questions wit Received on November 24, 2009; revised on January 11, 2010; accepted ( on January 21, 2010 ( 1 INTRODUCTION Determining ( whether distinct sets of genomic features (e.g. aligned sequence A1BCD'.< &%2/' reads, gene annotations, ESTs, genetic polymorphisms, mobile ( elements, etc.) overlap or are associated with one another is a fundamental task in genomics research. Such comparisons serve ( to characterize experimental results, infer causality or coincidence (or lack thereof) and assess the biological impact of genomic discoveries. Genomic features are commonly represented by the Browser Extensible Data (BED) or General Feature Format (GFF) BEDtools NGS formats and are typically compared using either the UCSC Genome Browser s (Kent et al., 2002) Table Browser or using the Galaxy (Giardine et al., 2005) interface. While these tools offer a convenient and reliable method for such analyses, they are not amenable to large and/or ad hoc datasets owing to the inherent need to interact with a remote or local web site installation. 36 Moreover, complicated Galaxy browsers. T environment and wo grep, awk, sort, etc conducted with a si 2.2 Language a BEDTools incorpor UCSC Genome Bro uses a hierarchical to discrete bins chromosome. This since one must o share the same (or Figure 1, calculat millions of sequenc the tools available is written in C++

1E4FGG'*3$%%&';'%/-2.0%-=.;5.$G 1E4FGG2%8.;=%%=&.;2%3G4G @.8$%%&'G "#$%%&' bedtools )'(*(2%33*58D&)5.($%%&;( Usage: <command> [options] HIBJ(,*K( L-)E.5()5(AMM;(A%34)&.(*58()5'$*&&(@K(make command line stream / pipe -.*8(8%2/3.5$'(2*-.0/&&K(,.@')$.(8%2/3.5$'( 1.&4(8%2/3.5$( N">#?"( 37

BEDTools is intended to run in a command line environment on UNIX, LINUX and Apple OS X operating systems. Installing BEDTools involves downloading the latest source code archive followed by compiling the source code into binaries on your local system. The following commands will install BEDTools in a local directory on a *NIX or OS X machine. Note that the <version> refers to the latest posted version number on http://bedtools.googlecode.com/. )*'$+&& Note: The BEDTools makefiles use the GCC compiler. One should edit the Makefiles accordingly if one wants to use a different compiler. curl http://bedtools.googlecode.com/files/bedtools.<version>.tar.gz > BEDTools.tar.gz tar -zxvf BEDTools.tar.gz cd BEDTools-<version> make clean make all ls bin At this point, one should copy the binaries in BEDTools/bin/ to either usr/local/bin/ or some other repository for commonly used UNIX tools in your environment. You will typically require administrator (e.g. root or sudo ) privileges to copy to usr/local/bin/. If in doubt, contact you system administrator for help. 3. Quick start guide 3.1 Install BEDTools curl http://bedtools.googlecode.com/files/bedtools.<version>.tar.gz > BEDTools.tar.gz tar -zxvf BEDTools.tar.gz cd BEDTools make clean make all sudo cp bin/* /usr/local/bin/ 3.2 Use BEDTools Below are examples of typical BEDTools usage. Additional usage examples are described in section 6 of this manual. Using the -h option with any BEDTools will report a list of all command line options. A. Report the base-pair overlap between the features in two BED files. $ intersectbed -a reads.bed -b genes.bed B. Report those entries in A that overlap NO entries in B. Like "grep -v" $ intersectbed -a reads.bed -b genes.bed v C-*272.( ( 17 ( ( http://133.48.62.157/download/git2011a/bedtools_practice.tgz ( ( ( 38

1.2 Summary of available tools BEDTools support a wide range of operations for interrogating and manipulating genomic features. The table below summarizes the tools available in the suite (tools that support BAM file are indicated).,-+.&+/&0($%%&' Utility Description intersectbed Returns overlapping features between two BED/GFF/VCF files. Also supports BAM format as input and output. windowbed Returns overlapping features between two BED/GFF/VCF files within a window. Also supports BAM format as input and output. closestbed Returns the closest feature to each entry in a BED/GFF/VCF file. coveragebed Summarizes the depth and breadth of coverage of features in one BED/GFF file (e.g., aligned reads) relative to another (e.g., user-defined windows). Also supports BAM format as input and output. genomecoveragebed Histogram or a per base report of genome coverage. Also supports BAM format as input and output. pairtobed Returns overlaps between a BEDPE file and a regular BED/GFF/VCF file. Also supports BAM format as input and output. pairtopair Returns overlaps between two BEDPE files. bamtobed Converts BAM alignments to BED and BEDPE formats. Also supports BAM format as input and output. bedtobam Converts BED/GFF/VCF features (both blocked and unblocked) to BAM format. bedtoigv Creates a batch script to create IGV images at each interval defined in a BED/GFF/ VCF file. bed12tobed6 Splits BED12 features into discrete BED6 features. subtractbed Removes the portion of an interval that is overlapped by another feature. mergebed Merges overlapping features into a single feature. fastafrombed Creates FASTA sequences from BED/GFF intervals. maskfastafrombed Masks a FASTA file based upon BED/GFF coordinates. shufflebed Permutes the locations of features within a genome. slopbed Adjusts features by a requested number of base pairs. sortbed Sorts BED/GFF files in useful ways. linksbed Creates an HTML links from a BED/GFF file. complementbed Returns intervals not spanned by features in a BED/GFF file. overlap Computes the amount of overlap (positive values) or distance (negative values) between genome features and reports the result at the end of the same line. groupby Summarizes a dataset column based upon common column groupings. Akin to the SQL "group by" command. unionbedgraphs Combines multiple BedGraph files into a single file, allowing coverage/other comparisons between them. annotatebed Annotates one BED/VCF/GFF file with overlaps from many others. /48*$. ( 8 ( ( ( ( 39

"#(1%23+$ 3*58*$%-K %47%5*& 21-%3 '$*-$.58 5*3. '2%-. '$-*58,-.+,-.),-./,-.0,-.1,-.2 chr22 1000 5000 genea 960 + chr22 2000 6000 geneb 900 - chrx 0 1000 genec 80 - $*@(OIPQ('4*2.R 5. The BEDTools suite This section covers the functionality and default / optional usage for each of the available BEDTools. intersectbed4('5200*(%-02&+6(10+$720' Example figures are provided some cases an effort to convey the purpose of the tool. The behavior of each available parameter is discussed for each tool in abstract terms. More concrete usage examples are provided in Section 6. 0-%3(?*5/*& 5.1 intersectbed By far, the most common question asked of two sets of genomic features is whether or not any of the features in the two sets overlap with one another. This is known as feature intersection. intersectbed allows one to screen for overlaps between two sets of genomic features. Moreover, it allows one to have fine control as to how the intersections are reported. intersectbed works with both BED/GFF/VCF and BAM files as input. 5.1.1 Usage and option summary (ex) Usage: SNP (bt1.bed) $ intersectbed [OPTIONS] [-a <BED/GFF/VCF> -abam <BAM>] -b <BED/GFF/VCF> GeneA (bt2.bed ) Option Description [file: -a bt1.bed] BED/GFF/VCF file A. Each feature in A is compared to B in search of overlaps. Use stdin if chr1 100 passing A with 101 a UNIX pipe. A/G chr1 -b 200 BED/GFF/VCF 201 file B. Use C/G stdin if passing B with a UNIX pipe. chr1 -abam 205 BAM file A. 206 Each BAM C/G alignment in A is compared to B in search of overlaps. Use stdin if passing chrx 300 A with a UNIX 301 pipe: For C/T example: samtools view b <BAM> intersectbed abam stdin b genes.bed [file: -ubam bt2.bed] Write uncompressed BAM output. The default is write compressed BAM output. chr1 -bed 150 When using 251 BAM input GeneA (-abam), write 1 output as BED. + The default is to write output in BAM when using -abam. For example: intersectbed abam reads.bam b genes.bed bed $ -wa intersectbed Write the original -a entry bt1.bed in A for each -b overlap. bt2.bed chr1 -wb Write 200 the original 201 entry in B for C/G each overlap. Useful for knowing what A overlaps. Restricted by -f chr1 and 205 -r. 206 C/G -wo Write the original A and B entries plus the number of base pairs of overlap between the two features. Only A features with overlap are reported. Restricted by -f and -r. -wao Write the original A and B entries plus the number of base pairs of overlap between the two features. However, A features w/o overlap are also reported with a NULL B feature and overlap = 0. Restricted by -f and -r. -u Write original A entry once if any overlaps found in B. In other words, just report the fact at least one overlap was found in B. Restricted by -f and -r. 40 -c For each entry in A, report the number of hits in B while restricting to -f. Reports 0 for A entries

intersectbed4('5200*(%-02&+6(10+$720' (ex) illumina NA12878.chr16p.bam ABCC11.exons.mod.gtf [file: NA12878.chr16p.bam] # mapping bam file [file: ABCC11.exons.gtf] # exon annotation in GTF format 16 hg19_refgene exon 48200822 48201277 0.0 -. gene_id "NM_032583"; transcript_id "NM_032583";... $ intersectbed abam NA12878.chr16p.bam -b ABCC11.exons.gtf > result.bam $ samtools view result.bam # visualize on IGV [genome: Human (b37)] 41

5&%'0'$084( 5.6 closestbed 0-%3(?*5/*& Similar to intersectbed, closestbed searches for overlapping features in A and B. In the event that no feature in B overlaps the current feature in A, closestbed will report the closest (that is, least genomic distance from the start or end of A) feature in B. For example, one might want to find which is the closest gene to a significant GWAS polymorphism. Note that closestbed will report an overlapping feature as the closest---that is, it does not restrict to closest non-overlapping feature. 5.6.1 Usage and option summary (ex) SNP (ex-bedtools-1.bed) SNP Usage: $ closestbed [OPTIONS] -a <BED/GFF/VCF> -b <BED/GFF/VCF> ex-bedtools-3.bed Option Description -s Force strandedness. That is, find the closest feature in B overlaps A on the same strand. By default, this is disabled. [file: -d bt1.bed] In addition to the closest feature in B, report [file: its distance bt3.bed] to A as an extra column. The reported chr1 100 distance for 101 overlapping features A/G will be 0. chr1 50 80 GeneA chr1 200 201 C/G chr1 190 250 GeneB -t How ties for closest feature should be handled. This occurs when two features in B have exactly the chr1 205 206 C/G chr1 280 350 GeneC same overlap with a feature in A. By default, all such features in B are reported. chrx 300 301 C/T Here are the other choices controlling how ties are handled: all Report all ties (default). first Report the first tie that occurred in the B file. last Report the last tie that occurred in the B file. $ $ closestbed -a bt1.bed b bt3.bed -d chr1 100 101 A/G chr1 50 80 GeneA 20 chr1 200 201 C/G chr1 190 250 GeneB 0 chr1 205 206 C/G chr1 190 250 GeneB 0 chrx 5.6.2 Default 300 behavior 301 C/T. -1-1. -1 closestbed first searches for features in B that overlap a feature in A. If overlaps are found, the feature in B that overlaps the highest fraction of A is reported. If no overlaps are found, closestbed looks for the feature in B that is closest (that is, least genomic distance to the start or end of A) to A. For example, in the figure below, feature B1 would be reported as the closest feature to A1. Chromosome ================================================================ BED File A ============= 5&%'0'$084( BED File B ======== ====== Result ====== (ex) ChIP-seq (bt4.bed) hg19.exons.gtf 50 $ closestbed a bt4.bed -b hg19.exons.gtf -d 42

"#$%&'( 43

)*+,-.'/0(12-2%2-*3 )4516*'675' 874.&'9):;9<<)*=7>6(71=-&8'-4&8&-'=4<*7(='6?12)<)*+,< =121@18'<.'/0(12-2%2-*3.'/0(12/7.5129 ):;9<<*'675'-4&8&-'=4<0AB<0AB/7.512-)25(C/7.512, "#$%&'( &75516=(D6' ('88 "#$%&'( EFG8;.'1=8)''287H>1.'I'-*-"#$%&'(J FKGL&75516=(D6' 8&.D;2IM4@NO?'.(O?N2)76'2&J MI821P8P&1(161(N8D8J 44

" M4@N ):;9<<*77-*(<#Q0>@ GER #A"277(8 S$T277(8 21@ FKGL &75;D('16=D6821(( )'(;O51641( 45