Big Data Analytics - Lecture 3: Distributed Processing and Apache Spark


Lecture 3: Distributed Processing and Apache Spark
2017-10-20

2011: 1.8 ZB
2020: 35 ZB
1 ZB = 10^21 bytes = 1,000,000,000,000 GB
Word, Excel, XML, CSV, JSON, text, ...

CPU                   SPECfp
Pentium G3420           77.6    8,946
Xeon Gold 6128         1,470       22
Xeon Platinum 8180     1,770      130

PRIMERGY RX1330         64GB       19
PRIMERGY RX2530        768GB       45
PRIMERGY RX4770       1536GB      180

1TB
SATA                 200MB/s     2500
SAS                  300MB/s      5.1
SSD                  600MB/s       10
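To put the disk figures above in perspective, here is a rough sketch of how long one sequential pass over 1 TB would take at each listed transfer rate. It assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and sustained throughput at the rated speed; `scan_minutes` is a name made up for this illustration.

```python
# Rough sequential-read time for 1 TB at the transfer rates listed above.
# Assumption: decimal units and sustained throughput at the rated speed.
TB = 10**12  # bytes
MB = 10**6   # bytes

RATES_MB_PER_S = {"SATA": 200, "SAS": 300, "SSD": 600}


def scan_minutes(size_bytes, rate_mb_per_s):
    """Minutes needed to read size_bytes once at rate_mb_per_s."""
    return size_bytes / float(rate_mb_per_s * MB) / 60


for name, rate in RATES_MB_PER_S.items():
    print("%s: %.1f min" % (name, scan_minutes(TB, rate)))
```

Even on the fastest device in the list, a single pass takes tens of minutes, which is the kind of cost that spreading the data over many disks is meant to reduce.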

CPU

R Excel Excel 105 2650 1

MapReduce
MapReduce: Hadoop/Spark
Map:
Reduce:
Map
Hadoop/Spark 1.x
Spark 2.x
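The slide's own descriptions of Map and Reduce did not survive extraction. As a stand-in, the following is a minimal single-machine sketch of the MapReduce model as it is commonly described: a map step emits (key, value) pairs, the pairs are grouped by key, and a reduce step combines each group. It is not Hadoop or Spark code; the function names and the sample records are invented for illustration.

```python
from collections import defaultdict


def map_phase(records):
    """Map: emit one (key, 1) pair per input record."""
    for record in records:
        yield record["lang"], 1


def shuffle(pairs):
    """Group emitted values by key (a framework would do this step)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values of each key into a single result."""
    return dict((key, sum(values)) for key, values in groups.items())


# Made-up sample input.
records = [{"lang": "ja"}, {"lang": "en"}, {"lang": "ja"}]
counts = reduce_phase(shuffle(map_phase(records)))
print(counts)
```

In a real cluster the map and reduce steps run in parallel on different nodes; only the grouping step needs to move data between them.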

Apache Spark

Twitter data (JSON)
/project/bigdata-lab/bda/tweet_20171004.json
150GB (10/15)
2017.10.4 to 2017.10.13
1% of Twitter

Without Spark:
    % time cat /project/bigdata-lab/bda/tweet_20171004.json | wc -l
    26511626
    cat tweet_20171004.json  0.13s user 46.46s system 4% cpu 18:29.32 total
    wc -l  19.21s user 29.22s system 4% cpu 18:29.32 total
18 to 20 minutes; I/O (CPU usage is only 4%)
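The timing above implies an effective read rate, which the short calculation below works out. It assumes the file size is about 150 GB in decimal units (10^9 bytes per GB); the variable names are made up for this sketch.

```python
# Effective read rate implied by the timing on the slide.
# Assumption: the file is about 150 GB (10**9 bytes per GB).
size_bytes = 150 * 10**9
elapsed_s = 18 * 60 + 29.32  # "18:29.32 total" as reported by time

mb_per_s = size_bytes / elapsed_s / 10**6
print("%.0f MB/s" % mb_per_s)
```

The result is of the same order as the transfer rate of a single disk, which is consistent with the low CPU usage in the timing output.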

JSON
CSV, XML
JavaScript
    {"id":915179801954025472, "text":" \n ","place":null,"lang":"ja"}
    {key: value, key: value, key: value, key: value, ...}
    {key: value, key: {key: {key: value, key: value, ...}, key: value, ...}}
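As a small illustration of the record format above, the sketch below parses one such line with Python's standard `json` module. The record is modeled on the slide's example, but the "text" value is made up because the original text was lost.

```python
import json

# Sample record modeled on the slide; the "text" value here is made up.
line = '{"id":915179801954025472,"text":"hello\\nworld","place":null,"lang":"ja"}'

tweet = json.loads(line)
print(tweet["lang"])           # a JSON string becomes a Python str
print(tweet["place"] is None)  # JSON null becomes Python None
print(tweet["text"])           # the \n escape becomes a real newline
```

Each line of the tweet file that holds a record can be parsed independently in this way, which is what makes the file easy to split across workers.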

    {...,"lang":"ja",...}
lang: ja
    % cat /project/bigdata-lab/bda/tweet_20171004.json | grep \"lang\":\"ja\" | w...
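A substring match like the grep above counts any line that contains "lang":"ja" anywhere, including inside nested objects. A minimal sketch of a stricter count, which parses each line and looks only at the top-level "lang" field, might look like this. The function name `count_lang` and the sample lines are invented; the nested "user" object with its own "lang" is an assumption about the record layout.

```python
import json


def count_lang(lines, lang):
    """Count JSON records whose top-level "lang" equals lang."""
    n = 0
    for line in lines:
        try:
            record = json.loads(line)
        except ValueError:  # not JSON, e.g. a connection log line
            continue
        if isinstance(record, dict) and record.get("lang") == lang:
            n += 1
    return n


# Made-up sample: one log line and three records.
sample = [
    "[Tue Oct 03 20:40:42 JST 2017]Connection established.",
    '{"id":1,"lang":"ja","user":{"lang":"ja"}}',
    '{"id":2,"lang":"en","user":{"lang":"ja"}}',
    '{"id":3,"lang":"ja","user":{"lang":"en"}}',
]
substring_hits = sum('"lang":"ja"' in s for s in sample)
print(count_lang(sample, "ja"), substring_hits)
```

On this sample the two counts differ, because the second record matches the substring only through its nested field.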

Spark
MLlib, GraphX
Spark: Scala, Java, Python, R
Hadoop
HDFS, Hive, Text, CSV, JSON

Hadoop
Hadoop, Spark
bda1node0x (x = 01 to 18)
bda1node01 to 04
ssh: bda1node05
    $ ssh bda1node05
    Last login: Mon Oct 16 16:00:40 2017 from mm-dhcp-128-012.naist.jp
    2017 10 19 12:12:14 JST
    [ysuzuki@bda1node05 ~]$

    mandara $ cat /project/bigdata-lab/bda/tweet_20171004.json | head
    [Tue Oct 03 20:40:40 JST 2017]Establishing connection.
    [Tue Oct 03 20:40:42 JST 2017]Connection established.
    [Tue Oct 03 20:40:42 JST 2017]Receiving status stream.
    {"in_reply_to_status_id_str":null,"in_reply_to_status_id":null,"created_at":"Tue Oct 03 11:40:42 +0000 2017","in_reply_to_user_id_str":null,"source":"<a href=\"http://twitter.com/download/andro

HDFS
Hadoop, Spark
Linux, HDFS
HDFS:
    % hadoop fs -put /project/bigdata-lab/bda/tweet_20171004.json
HDFS:
    % hadoop fs -ls -h
    Found 4 items
    drwx------   - ysuzuki is-staff          0 2017-10-18 09:00 .Trash
    drwxr-xr-x   - ysuzuki is-staff          0 2017-10-16 18:59 .sparkStaging
    drwx------   - ysuzuki is-staff          0 2017-01-02 19:32 .staging
    -rw-r--r--   3 ysuzuki is-staff    152.1 G 2017-10-16 13:42 tweet_20171004.json

Spark
Spark: Scala, Java, Python, R
Python: spark
    % pyspark2 --master yarn
    % pyspark2
    % spark2-shell --master yarn        (scala)

    % pyspark2 --driver-memory 16g --executor-memory 16g --master yarn
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /__ / .__/\_,_/_/ /_/\_\   version 2.0.0.cloudera2
          /_/

    Using Python version 2.6.6 (r266:84292, Aug 18 2016 08:36:59)
    SparkSession available as spark.
    >>>

http://bda1node03.naist.jp:8088

DataFrame
Spark (DataFrame)
(select, filter, groupby)
(join)
(read, show)
https://spark.apache.org/docs/2.0.0/api/python/index.html

    >>> df = spark.read.json("/user/ysuzuki/tweet_20171004.json")
df: tweet_20171004.json
    >>> df.count()
    33573798

    >>> df.filter("lang = 'ja'").count()
    6600510
ja, en, es: groupby

    >>> df.groupby("lang").count().sort("count", ascending=False).show()
    +----+-------+
    |lang|  count|
    +----+-------+
    |  en|7057160|
    |  ja|4458359|
    |  es|2002657|
    |  ar|1731097|
    | und|1447410|
    |  ko|1307743|
    |  pt|1280456|
    |  th| 759078|
    ...
    +----+-------+
    only showing top 20 rows
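For readers without a cluster at hand, the following plain-Python sketch mirrors what the groupby, count, and descending sort above compute, using `collections.Counter` on a handful of made-up rows. It is a local analogy only, not how Spark executes the query.

```python
from collections import Counter

# Made-up sample rows; only the "lang" field matters here.
rows = [
    {"lang": "en"}, {"lang": "ja"}, {"lang": "en"},
    {"lang": "es"}, {"lang": "en"}, {"lang": "ja"},
]

# Equivalent of groupby("lang").count()
counts = Counter(row["lang"] for row in rows)

# Equivalent of sort("count", ascending=False)
ranked = counts.most_common()
for lang, n in ranked:
    print("%-4s %d" % (lang, n))
```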

1) 2) Spark

count.py

count.py:
    from pyspark.sql import SparkSession

    spark = SparkSession \
        .builder \
        .appName("app example") \
        .config("master", "yarn") \
        .getOrCreate()
    df = spark.read.json("tweet_20171004.json")
    print(df.count())

Running count.py:
    % spark2-submit count.py | tee output.txt
    % cat output.txt
    33573798

    >>> from pyspark.sql.functions import explode, split
    >>> df.filter("lang = 'en'").select("text").distinct().select(explode(split("text", " "))).groupby("col").count().sort("count", ascending=False).show()
    +----+-------+
    | col|  count|
    +----+-------+
    |  RT|1735023|
    | the|1218371|
    |  to|1136336|
    |   a| 883388|
    |   I| 737669|
    ...
    |  be| 248032|
    |  me| 239061|
    +----+-------+
    only showing top 20 rows
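The pipeline above chains several steps: drop duplicate texts, split each text on spaces, turn every word into its own row, then count. A plain-Python sketch of the same logic on made-up sample texts is shown below. It is only a local analogy, and it uses a literal space as separator, whereas Spark's `split` takes a regular-expression pattern.

```python
from collections import Counter

# Made-up sample texts; the duplicate stands in for a repeated retweet.
texts = ["RT the cat", "the dog", "RT the cat"]

unique_texts = set(texts)  # distinct()

# explode(split("text", " ")): one entry per word of every text
words = [word for text in unique_texts for word in text.split(" ")]

word_counts = Counter(words)  # groupby("col").count()
print(word_counts.most_common(1))
```

Removing duplicates first keeps heavily retweeted messages from dominating the word counts.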

    >>> df.groupby("text").count().show()
    +--------------------+-----+
    |                text|count|
    +--------------------+-----+
    |Big economic call...|    1|
    |@TomiLahren reall...|    1|
    |RT @PossumPastor:...|    2|
    |      RT @dsmesk:...|   13|

Sorted by count:
    >>> df.groupby("text").count().sort("count", ascending=False).show()
    +--------------------+-----+
    |                text|count|
    +--------------------+-----+
    |RT @akiko_lawson:...| 4284|
    | RT @Kaepernick7:...| 3747|
    |RT @RodriguezDaGo...| 3419|
    | RT @TheRealNyha:...| 3305|

    >>> df.groupby("text").count().sort("count", ascending=False).show(20, False)
    +---------------------------------------------------------------------------
    text
    +---------------------------------------------------------------------------
    RT @akiko_lawson: #L 10/15 1 #L (^^) 3 10/13 10:59 #        4284
    RT @Kaepernick7: I appreciate you @Eminem https://t.co/nwavbwsokq
Eminem

    >>> df.groupby("user.name").count().sort("count", ascending=False).show(20, False)
    +----+-----+
    |name|count|
    +----+-----+
    |.   |53716|
    |    |17986|
    |    |15129|
    |-   |14615|
    |;   |11380|
    |    |10926|
    |    | 9044|
ID

ID
    >>> a = df.groupby("user.id").count().sort("count", ascending=False)
    >>> a.show()
    +------------------+-----+
    |                id|count|
    +------------------+-----+
    |         115639376| 9038|
    |        4823945834| 3736|
    |        1662830792| 3445|
    |856385582401966080| 3242|
    |        2669983818| 3015|
    |796251890908434432| 2663|
    |         104120518| 1592|

ID
    >>> b = df.select("user.id", "user.name").distinct()
    >>> b.show()
    +------------------+--------------------+
    |                id|                name|
    +------------------+--------------------+
    |767396295665299456|                    |
    |859974072834310144|                [6 ]|
    |         535819067|               Vibes|
    |         517553112|  Liseth Valencia R.|

a, b
    >>> c = a.join(b, a.id == b.id, "inner").select("name", "count")
    >>> c.show()
    +------------------+-----+
    |              name|count|
    +------------------+-----+
    |                  | 9038|
    |   McDonalds Japan| 3736|
    |                  | 3445|
    |                  | 3242|
    |     Test Account1| 3015|
    |        (NESCAF ) | 2663|
    |                  | 1592|
    |                  | 1496|
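The inner join above pairs each (id, count) row of a with every (id, name) row of b that has the same id. A plain-Python sketch of that matching rule on made-up data follows; it is a local analogy, and the ids, names, and counts are invented.

```python
# Made-up sample data standing in for a (id, count) and b (id, name).
a = [(101, 9), (102, 7), (103, 4)]
b = [(101, "alice"), (102, "bob"), (102, "bobby"), (104, "carol")]

# Index b by id; one id may map to several names.
names_by_id = {}
for user_id, name in b:
    names_by_id.setdefault(user_id, []).append(name)

# Inner join: keep only ids present on both sides, one row per match.
c = [
    (name, count)
    for user_id, count in a
    for name in names_by_id.get(user_id, [])
]
print(c)
```

Because b holds distinct (id, name) pairs, an account that appears under two names produces two output rows with the same count, as id 102 does here.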

text
    >>> from pyspark.sql.functions import col
    >>> df.where(col("text").like("% %")).select("text").show()
    +--------------------+
    |                text|
    +--------------------+
    |                 ...|
    |   RT @keigomi29:...|
    |RT @shunchoukatsu...|
    |       RT @tkq12:...|
    | RT @pentabutabu:...|
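In a SQL LIKE pattern, `%` matches any run of characters and `_` matches exactly one character, so a pattern of the form %keyword% selects texts that contain the keyword. The sketch below translates such a pattern into a Python regular expression to make the rule concrete. The helper `like_to_regex` is invented for illustration, ignores escape characters, and uses "spark" as a stand-in keyword because the original one was lost.

```python
import re


def like_to_regex(pattern):
    """Translate a SQL LIKE pattern (% and _ only, no escapes) to a regex."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")   # any run of characters, possibly empty
        elif ch == "_":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


contains_spark = like_to_regex("%spark%")
print(bool(contains_spark.fullmatch("RT @x: spark is fast")))
print(bool(contains_spark.fullmatch("no match here")))
```

The pattern must match the whole string, which is why the leading and trailing `%` are needed for a "contains" search.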

CSV
    >>> c.write.csv("output_csv")
    >>> exit()
    $ hadoop fs -get output_csv output_csv
Excel: utf-8
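The slide's note about Excel and utf-8 is incomplete in this transcript. Assuming the concern is that Excel may misread a UTF-8 CSV file that has no byte order mark, one common workaround is to rewrite the file with a BOM. The Python 3 sketch below does this with the standard `csv` module and the "utf-8-sig" codec; the file name and rows are made up.

```python
import csv
import os
import tempfile

# Made-up sample rows standing in for the exported name/count pairs.
rows = [("name", "count"), ("example user", 3)]

path = os.path.join(tempfile.mkdtemp(), "output_for_excel.csv")

# "utf-8-sig" writes a byte order mark (BOM) before the UTF-8 data.
with open(path, "w", encoding="utf-8-sig", newline="") as f:
    csv.writer(f).writerows(rows)

# Check that the file really starts with the 3-byte UTF-8 BOM.
with open(path, "rb") as f:
    head = f.read(3)
print(head == b"\xef\xbb\xbf")
```

Note that `c.write.csv("output_csv")` produces a directory of part files rather than a single file, so the fetched output may need to be concatenated before a step like this.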


3 20 1

Apache Spark
100GB