Accelerating Large-Scale Graph Processing by Taking the Memory Hierarchy into Account


1 Yasui (Chuo University, CREST), ERATO

2 Outline
  NETAL (NETwork Analysis Library) on NUMA systems
  BFS for Graph500 and Green Graph500
    Kronecker graph
    Level-synchronized parallel BFS
    Hybrid algorithm for parallel BFS
    Hybrid parallel BFS on NUMA systems
  Results: 5th Graph500 / (1st) Green Graph500

3 GraphCT and NETAL
  GraphCT (CRAY XMT): betweenness centrality (BC) only
    USA-road-d.LKS.gr (n = 2.76M, m = 6.89M): 0.6 days
    cit-patents (n = 3.77M, m = 16.5M): .6 h
  NETAL: {unweighted, weighted} CC, GC, SC, BC on NUMA systems (Intel/AMD)
    USA-road-d.LKS.gr (n = 2.76M, m = 6.89M): 9.4 hours
    cit-patents (n = 3.77M, m = 16.5M): .5 h

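4 Brandes' algorithm*

Centrality indices of a vertex $v$ in a graph $G = (V, E)$, where $d_G(v,t)$ is the shortest-path distance from $v$ to $t$, $\sigma_{st}$ is the number of shortest paths from $s$ to $t$, and $\sigma_{st}(v)$ is the number of those paths that pass through $v$:

  closeness (CC):   $C_C(v) = \dfrac{1}{\sum_{t \in V} d_G(v,t)}$
  graph (GC):       $C_G(v) = \dfrac{1}{\max_{t \in V} d_G(v,t)}$
  stress (SC):      $C_S(v) = \sum_{s \neq v \neq t \in V} \sigma_{st}(v)$
  betweenness (BC): $C_B(v) = \sum_{s \neq v \neq t \in V} \dfrac{\sigma_{st}(v)}{\sigma_{st}}$

The shortest-path search is multipathBFS for unweighted graphs and multipathSSSP for weighted graphs. Line 6 is the ShortestPath phase; lines 8-20 are the Update phase.

  Input:  G = (V, E)
  Output: C_C[v], C_G[v], C_S[v], C_B[v] for all v ∈ V (initialized to 0)
   1: for s ∈ V parallel do
   2:   /* σ[v]: number of shortest paths from s to v */
   3:   /* S: stack of visited vertices */
   4:   /* P[v]: predecessors of v on shortest paths */
   5:   /* d[v]: distance from s to v */
   6:   σ, S, P, d ← multipathBFS(G, s)
   7:   δ_S[v] ← 0, δ_B[v] ← 0 for all v ∈ V
   8:   C_C[s] ← 1 / Σ_{t ∈ V} d(s,t)
   9:   C_G[s] ← 1 / max_{t ∈ V} d(s,t)
  10:   while S ≠ ∅ do
  11:     pop w ← S
  12:     for v ∈ P[w] do
  13:       δ_S[v] ← δ_S[v] + (1 + δ_S[w])
  14:       δ_B[v] ← δ_B[v] + σ[v]/σ[w] · (1 + δ_B[w])
  15:     end for
  16:     if w ≠ s then
  17:       C_S[w] ← C_S[w] + σ[w] · δ_S[w]
  18:       C_B[w] ← C_B[w] + δ_B[w]
  19:     end if
  20:   end while
  21: end for

* U. Brandes, A Faster Algorithm for Betweenness Centrality (2001)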
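The pseudocode above can be tried out on small graphs with the following minimal Python sketch. It is sequential and unweighted, and it is not the NETAL implementation: the function names `multipath_bfs` and `centralities` and the adjacency-list input are assumptions made for illustration. Vertices that cannot be reached from s are simply left out of the closeness and graph centrality sums, which is also an assumption.

```python
from collections import deque

def multipath_bfs(adj, s):
    """ShortestPath phase: BFS from s that records all shortest-path predecessors."""
    n = len(adj)
    sigma = [0] * n                  # sigma[v]: number of shortest s-v paths
    dist = [-1] * n                  # dist[v]: distance from s (-1 = unreached)
    preds = [[] for _ in range(n)]   # preds[v]: predecessors of v
    order = []                       # visit order, later popped like the stack S
    sigma[s], dist[s] = 1, 0
    queue = deque([s])
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in adj[v]:
            if dist[w] < 0:          # first time w is seen
                dist[w] = dist[v] + 1
                queue.append(w)
            if dist[w] == dist[v] + 1:   # v lies on a shortest path to w
                sigma[w] += sigma[v]
                preds[w].append(v)
    return sigma, order, preds, dist

def centralities(adj):
    """Return (closeness, graph, stress, betweenness) lists for all vertices."""
    n = len(adj)
    cc, cg, cs, cb = ([0.0] * n for _ in range(4))
    for s in range(n):               # independent per source, hence parallelizable
        sigma, order, preds, dist = multipath_bfs(adj, s)
        reached = [d for d in dist if d > 0]
        if reached:
            cc[s] = 1.0 / sum(reached)
            cg[s] = 1.0 / max(reached)
        delta_s = [0.0] * n
        delta_b = [0.0] * n
        while order:                 # Update phase: farthest vertices first
            w = order.pop()
            for v in preds[w]:
                delta_s[v] += 1 + delta_s[w]
                delta_b[v] += sigma[v] / sigma[w] * (1 + delta_b[w])
            if w != s:
                cs[w] += sigma[w] * delta_s[w]
                cb[w] += delta_b[w]
    return cc, cg, cs, cb
```

For an undirected graph every pair is counted in both directions (s to t and t to s), so stress and betweenness values are twice the unordered-pair values.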

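5 Shortest path problems
Graph $G = (V, E)$ with $n = |V|$, $m = |E|$ and edge length $l : E \to \mathbb{R}_+$.
  SSSP: single-source shortest paths (unweighted case: BFS)
  MSSP: multiple-source shortest paths, β SSSPs solved together
  APSP: all-pairs shortest paths over n x n vertex pairs, solved as n SSSPs or n/β MSSPs
Solver variants by output: distanceSSSP (distances only), singlepathSSSP (distances and one shortest path), multipathSSSP (distances and all shortest paths).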
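As an illustration of the distance-only variant, here is a minimal sketch of Dijkstra's algorithm with a binary heap, and of APSP as n independent SSSP runs. The names `distance_sssp` and `distance_apsp` and the `(neighbor, length)` adjacency format are assumptions for this example, not NETAL's interface.

```python
import heapq
import math

def distance_sssp(adj, s):
    """Distances from s; adj[v] is a list of (w, length) with length >= 0."""
    dist = [math.inf] * len(adj)
    dist[s] = 0.0
    heap = [(0.0, s)]                    # binary heap keyed by tentative distance
    while heap:
        d, v = heapq.heappop(heap)
        if d > dist[v]:                  # stale entry: v was already settled
            continue
        for w, length in adj[v]:
            nd = d + length
            if nd < dist[w]:
                dist[w] = nd
                heapq.heappush(heap, (nd, w))
    return dist

def distance_apsp(adj):
    """APSP as n SSSP runs; each run is independent and could use its own thread."""
    return [distance_sssp(adj, s) for s in range(len(adj))]
```

The heap uses lazy deletion (outdated entries are skipped when popped) instead of a decrease-key operation, which keeps the sketch short.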

6 2-HEAP: Dijkstra's algorithm
MLB is a solver from the 9th DIMACS Implementation Challenge.
singlepathSSSP, n = 23.95M, m = 58.3M, 2-way Xeon X5460 3.16GHz (4 cores x 2)

                         SSSP CPU time [ms] (speedup)   memory [GB]
  2-HEAP* (sequential)   (.00)
  2-HEAP* (2 threads)    74.4 (.94)                     .7
  2-HEAP* (4 threads)    45. (.65)                      .00
  2-HEAP* (8 threads)    0.7 (5.4)                      .46
  MLB (sequential)

7 Thread placement and bottlenecks
Placements of two threads:
  diff-procs: different processors
  diff-L2:    same processor, different L2 caches
  same-L2:    same processor, same L2 cache

  bottleneck                   diff-procs   diff-L2   same-L2
  processor bandwidth          down         down      down
  processor inside bandwidth   -            down      down
  L2 cache sharing             -            -         down

SSSP (USA-road-d.USA.gr) on the 2-way Xeon X5460:

                   sequential          diff-procs          diff-L2             same-L2
  2-HEAP*          5.4 s (± 0.00%)     5.44 s (-.9%)       5.6 s (-5.05%)      6.6 s (-8.94%)
  d-heap           7. s (± 0.00%)      7.6 s (-0.4%)       7.59 s (-4.74%)     8.79 s (-7.75%)
  Fib-heap         5.95 s (± 0.00%)    6.09 s (-0.87%)     6.56 s (-.68%)      8.7 s (-.%)
  Dial's           4.8 s (± 0.00%)     4.54 s (-.5%)       5.0 s (-.58%)       6.8 s (-.5%)
  double buckets   4.65 s (± 0.00%)    4.88 s (-4.7%)      5.5 s (-.4%)        6.64 s (-9.97%)
  MLB              5.69 s (± 0.00%)    5.85 s (-.74%)      6.7 s (-7.78%)      7.7 s (-6.9%)
  Δ-stepping       .74 s (± 0.00%)     .06 s (-.66%)       .55 s (-6.4%)       6.49 s (-8.76%)

8 NETAL (NETwork Analysis Library)
APSP and centrality (CC, GC, SC, BC) solvers for NUMA systems.

APSP solvers:
  n-bfs:        BFS (multipathBFS), n runs
  n-dijkstra:   Dijkstra's algorithm with binary heap (multipathSSSP), n runs
  n/β-mlsc:     MLSC with binary heap (distanceMSSP), n/β runs

Centrality solvers, each computing the four indices CC, GC, SC and BC from multipath searches run in parallel:
  n-bfs:        BFS, unweighted CC, GC, SC, BC
  n-dijkstra:   Dijkstra's algorithm with binary heap, weighted CC, GC, SC, BC

Y. Yasui et al.: NETAL: High-Performance Implementation of Network Analysis Library Considering Computer Memory Hierarchy, 2011.

9 NUMA affinity
4-way Opteron 6174 (12 cores x 4 sockets). APSP solvers are compared under the worst and the best CPU/memory affinity for the graph data.

USA-road-d.NY.gr (n = 264K, m = 734K): TEPS (speedup)

  threads      affinity   n-bfs            n-dijkstra       n/β-mslc
  sequential   -          0.5 M (.0)       0.8 M (.0)       .4 M (.0)
  12           worst      9.4 M (.7)       0. M (.)         6.6 M (9.)
  12           best       99. M (4.6)      4. M (.4)        40.9 M (.7)
  24           worst      87.8 M (8.9)     .0 M (9.7)       M (6.8)
  24           best       M (6.8)          49.8 M (.)       64.0 M (.5)
  48           worst      M (7.5)          5.0 M (.6)       M (.)
  48           best       M (46.)          47.7 M (4.6)     47.5 M (5.7)

10 Figure: CPU time [seconds] per trial on the 4-way Opteron 6174 (12 cores x 4 sockets) for n-bfs, n-sssp and n/β-mssp, each with the best and the worst CPU/memory affinity.

11 All-pairs shortest paths (APSP)
Instances: USA-road-d.USA.gr (n = 23.95M, m = 58.3M) and soc-livejournal (n = 4.85M, m = 68.99M).

                USA-road-d.USA.gr          soc-livejournal
  n-bfs         70 days                    7.5 days
  n-dijkstra    99 days                    9.6 days
  n/β-mslc      7.75 days (β = 6)          .78 days
  LS-BFS        557 days (= 1.5 years)     79.5 days
  MLB           1774 days (= 4.9 years)    0.55 days
  Δ-stepping    5 days (= 9. years)        88. days

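12 Centrality indices (USA-road-d.LKS.gr, n = 2.76M, m = 6.89M)
NETAL computes four centrality indices, $C_C$, $C_G$, $C_S$ and $C_B$: unweighted centrality (upper row, 9.4 hours with n-bfs) and weighted centrality (lower row, .8 hours with n-dijkstra). GraphCT takes 0.6 days for BC alone, without taking edge lengths into account (4-way Opteron 6174).

  closeness:   $C_C(v) = \dfrac{1}{\sum_{t \in V} d_G(v,t)}$
  graph:       $C_G(v) = \dfrac{1}{\max_{t \in V} d_G(v,t)}$
  stress:      $C_S(v) = \sum_{s \neq v \neq t \in V} \sigma_{st}(v)$
  betweenness: $C_B(v) = \sum_{s \neq v \neq t \in V} \dfrac{\sigma_{st}(v)}{\sigma_{st}}$

Yasui (Chuo University, CREST), ERATO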

13 Comparison with GraphCT
n-bfs and n-dijkstra compute all of C_C, C_G, C_S and C_B; GraphCT computes C_B only. SP and UP give the shares of the ShortestPath and Update phases.

  instance            n      m       n-bfs                        n-dijkstra                   GraphCT
                                     (C_C, C_G, C_S, C_B)         (weighted C_C, C_G, C_S, C_B)   (C_B)
  USA-road-d.LKS.gr   2.8M   6.9M    9.5 h (SP 55 %, UP 45 %)     .84 h (SP 69 %, UP 31 %)     49.8 h
  cit-patents         3.8M   16.5M   .87 h (SP 7 %, UP 7 %)       .5 h (SP 40 %, UP 60 %)      .6 h

Comparison including SSCA#2 on an R-MAT graph (n = 16.78M, m = 134.2M):

               n-bfs          n-dijkstra     GraphCT        SSCA#2
               (all four)     (weighted)     (C_B)          (C_B)
  time         6.0 seconds    60.5 seconds   60.0 seconds   error
  performance  0.8 MTEPS      .9 MTEPS       48.5 MTEPS

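14 Graph500 and Green Graph500
Graph500 ranks systems by BFS performance, measured in TEPS (Traversed Edges Per Second), on a Kronecker graph. The problem size is set by SCALE and edgefactor (= 16): $n = 2^{\mathrm{SCALE}}$ and $m = \mathrm{edgefactor} \cdot n$. BFS is run 64 times; each run is validated and its TEPS is computed, and the median TEPS over the 64 runs is the score.
Green Graph500 ranks by BFS TEPS/kW, with power measured during the BFS energy loop. Remote PDU: Omron RC3008.
Benchmark steps: graph generation, graph construction, BFS, validation.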
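The size and scoring rules on this slide reduce to a few lines of arithmetic. The sketch below is illustrative only; the function names and the `(traversed_edges, seconds)` input format are assumptions, not part of the benchmark code.

```python
import statistics

def graph500_size(scale, edgefactor=16):
    """Return (n, m) with n = 2**scale and m = edgefactor * n."""
    n = 1 << scale
    return n, edgefactor * n

def teps(traversed_edges, seconds):
    """Traversed edges per second for one BFS run."""
    return traversed_edges / seconds

def median_teps(runs):
    """Median TEPS over a list of (traversed_edges, seconds) pairs."""
    return statistics.median(teps(m, t) for m, t in runs)

def teps_per_kw(teps_value, watts):
    """Energy efficiency as on the slide: TEPS divided by power in kW."""
    return teps_value / (watts / 1000.0)
```

For example, `graph500_size(26)` gives n = 67,108,864 and m = 1,073,741,824 with the default edgefactor of 16.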


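15 Kronecker graph
A Kronecker graph of a given SCALE is obtained from the SCALE-fold Kronecker product of a 2 x 2 initiator matrix $A$:

$$A^{\otimes \mathrm{SCALE}} = \underbrace{A \otimes A \otimes \cdots \otimes A}_{\mathrm{SCALE}}$$

Graph500 uses

$$A = \begin{pmatrix} 0.57 & 0.19 \\ 0.19 & 0.05 \end{pmatrix}.$$

Example: Kronecker graph with SCALE 26 and edgefactor 16, $n = 2^{26} = 67.1\mathrm{M}$, m = 47.5M.
Figure: degree distribution (number of nodes against node degree).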
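One common way to sample edges from such a Kronecker product is to pick one quadrant of the initiator matrix per bit of the vertex ID, as in R-MAT. The sketch below follows that idea with the Graph500 probabilities; it is a simplified illustration, not the Graph500 reference generator (which, among other things, also shuffles vertex labels), and the function name and parameters are assumptions.

```python
import random

def kronecker_edges(scale, edgefactor=16, a=0.57, b=0.19, c=0.19, seed=0):
    """Sample edgefactor * 2**scale edges; the fourth probability is 1 - a - b - c."""
    rng = random.Random(seed)        # fixed seed so the sketch is reproducible
    m = edgefactor * (1 << scale)
    edges = []
    for _ in range(m):
        u = v = 0
        for _ in range(scale):       # one quadrant choice per bit of the vertex IDs
            r = rng.random()
            u <<= 1
            v <<= 1
            if r < a:                # top-left quadrant: both bits stay 0
                pass
            elif r < a + b:          # top-right quadrant
                v |= 1
            elif r < a + b + c:      # bottom-left quadrant
                u |= 1
            else:                    # bottom-right quadrant
                u |= 1
                v |= 1
        edges.append((u, v))
    return edges
```

Because the top-left entry dominates, low-numbered vertices collect many more edges than the rest, which is what produces the skewed degree distribution.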

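16 Level-synchronized parallel BFS
Parallel BFS that advances one level at a time: all threads expand the current frontier together and synchronize before the next level starts. Updates to shared data by concurrent threads require atomic operations.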
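The level-by-level structure can be sketched as follows. This version is sequential Python meant only to show the control flow; the function name and the adjacency-list input are assumptions, and the comments mark where a multithreaded implementation would need atomic operations.

```python
def level_synchronized_bfs(adj, root):
    """Return the BFS parent array; parent[root] == root, -1 means unreached."""
    parent = [-1] * len(adj)
    parent[root] = root
    frontier = [root]
    while frontier:
        next_frontier = []
        # In a parallel version the frontier is split among threads here.
        for v in frontier:
            for w in adj[v]:
                # Check-and-set on parent[w] must be atomic (e.g. compare-and-swap)
                # when several threads can reach w in the same level.
                if parent[w] == -1:
                    parent[w] = v
                    next_frontier.append(w)   # shared queue: atomic append needed
        # Implicit barrier: the next level starts only after the whole frontier is done.
        frontier = next_frontier
    return parent
```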

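17 Hybrid algorithm for parallel BFS [Beamer, 2012]
Direction-Optimizing Breadth-First Search: the search direction is switched according to the size of the frontier.
  forward search (top-down step): frontier vertices look for unvisited neighbors
  backward search (bottom-up step): unvisited vertices look for a neighbor in the frontier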

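18 Hybrid algorithm: switching rule
  Top-down to bottom-up: when the frontier is large, $m_f > m_u / \alpha$
  Bottom-up to top-down: when the frontier is small, $n_f < n / \beta$
Here $m_f$ is the number of edges incident to frontier vertices, $m_u$ the number of edges incident to unvisited vertices, $n_f$ the number of frontier vertices and $n$ the number of vertices; $\alpha$ and $\beta$ are tuning parameters.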
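A minimal sequential sketch of this switching between top-down and bottom-up steps is given below. It assumes an undirected graph stored as symmetric adjacency lists; the function names are hypothetical and the default values of `alpha` and `beta` are illustrative placeholders rather than the settings used in the talk.

```python
def top_down_step(adj, frontier, parent, level, depth):
    """Frontier vertices claim their unvisited neighbors."""
    nxt = []
    for v in frontier:
        for w in adj[v]:
            if parent[w] == -1:
                parent[w], level[w] = v, depth
                nxt.append(w)
    return nxt

def bottom_up_step(adj, frontier, parent, level, depth):
    """Unvisited vertices search for any neighbor that is in the frontier."""
    in_frontier = [False] * len(adj)
    for v in frontier:
        in_frontier[v] = True
    nxt = []
    for w in range(len(adj)):
        if parent[w] == -1:
            for v in adj[w]:
                if in_frontier[v]:
                    parent[w], level[w] = v, depth
                    nxt.append(w)
                    break            # one parent is enough; skip remaining edges
    return nxt

def hybrid_bfs(adj, root, alpha=14, beta=24):
    """Return (parent, level); unreached vertices keep -1 in both lists."""
    n = len(adj)
    parent, level = [-1] * n, [-1] * n
    parent[root], level[root] = root, 0
    frontier, depth, top_down = [root], 0, True
    m_u = sum(len(a) for a in adj) - len(adj[root])   # edges at unvisited vertices
    while frontier:
        m_f = sum(len(adj[v]) for v in frontier)
        if top_down and m_f > m_u / alpha:
            top_down = False
        elif not top_down and len(frontier) < n / beta:
            top_down = True
        depth += 1
        step = top_down_step if top_down else bottom_up_step
        frontier = step(adj, frontier, parent, level, depth)
        m_u -= sum(len(adj[v]) for v in frontier)
    return parent, level
```

Both step types assign the same BFS levels; only the chosen parents may differ. The bottom-up step pays off when the frontier is large, because each unvisited vertex can stop scanning as soon as it finds one parent.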

19 Hybrid algorithm on NUMA systems
Figure: performance of the hybrid algorithm (parameters α and β) on an Intel Xeon system with HT enabled. Two panels plot Traversed Edges Per Second (TEPS), one against the number of threads at a fixed SCALE and one against SCALE at a fixed number of threads, each with series for edgefactor = 8, 16, 32 and 64.

20 Graph500 / Green Graph500 results
Table columns: machine, CPU, SCALE, n, m, BFS performance (TEPS), power (W), TEPS/kW.
Machines listed:
  WestmereEX (80): Xeon E7-4870 @ 2.40GHz (10 cores) x 4
  SandyBridgeEP: Xeon E5-2690 @ 2.90GHz
  MagnyCours (48): Opteron 6174 @ 2.20GHz (12 cores) x 4
  WestmereEP: Xeon X5670 @ 2.93GHz
  Core i7-3820QM @ 2.70GHz

21 Graph500 / Green Graph500 results (CPU, single node)
Same table as on the previous slide: machine, CPU, SCALE, n, m, BFS performance (TEPS), power (W), TEPS/kW.


22 1st Green Graph500 list
GraphCREST entries announced at ISC. GraphCREST-Tegra3: ASUS Pad TF700T (NVIDIA Tegra 3, 1.7GHz, 4 cores), 1 node.

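23 Summary
NETAL (NETwork Analysis Library)
  Shortest-path and centrality solvers for NUMA systems (Intel/AMD)
  APSP on USA-road-d.USA.gr with n/β-mslc: 7.75 days
  CC, GC, SC, BC on USA-road-d.LKS.gr: 9.4 hours (GraphCT, BC only: 0.6 days)
  CC, GC, SC, BC on cit-patents: .5 h (GraphCT, BC only: .6 h)
Graph500
  BFS on a Kronecker graph (small-world, scale-free)
  Single CPU node, 80 threads with HT: 0.90 GTEPS
  Green Graph500 entry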
