Vol. 47 No. SIG 7 (ACS 14)   May 2006

Precise Software Pacing Method Using Gap Packets

Ryousei Takano, Tomohiro Kudoh, Yuetsu Kodama, Motohiko Matsuda, Yutaka Ishikawa and Fumihiro Okazaki

In this paper, we propose a precise software pacing method which achieves accurate network bandwidth control and smooths bursty traffic without requiring special-purpose hardware. The proposed method controls the inter-packet gap by transmitting an additional packet (a gap packet) between adjacent packets. To realize a gap packet, the IEEE 802.3x PAUSE packet is employed. With this method, it is possible to provide bandwidth control and smoothing for each of 100 flows using a commodity PC. In the case of Gigabit Ethernet, the transmission bandwidth can be set in a range from 8 Kbps to 930 Mbps for each IP flow. Furthermore, it is shown that the bandwidth of a TCP/IP path over a high bandwidth-delay product network with a 200 ms RTT is almost fully utilized by the method.

Affiliations: Grid Technology Research Center, National Institute of Advanced Industrial Science and Technology (AIST); AXE, Inc.; Graduate School of Information Science and Technology, The University of Tokyo.

1. Introduction

Pacing of TCP/IP traffic has been studied in prior work 1),10),12). This work is motivated in part by our experience with long-distance data transfer at the SC2003 Bandwidth Challenge 4).
The evaluation uses GtrcNET-1 3), a hardware network testbed also used in our Grid experiments 4), and the IEEE 802.3x PAUSE frame as the gap packet. The rest of this paper is organized as follows. Section 2 describes the burstiness of timer-based software pacing. Section 3 presents the proposed gap-packet pacing method, which Section 4 evaluates. Section 5 evaluates TCP/IP performance over an emulated 200 ms RTT network. Sections 6 and 7 give a discussion and related work, and Section 8 concludes.

2. Burstiness of Timer-Based Pacing

On Gigabit Ethernet (GbE), transmitting a 1,500-byte packet takes 12 µs. To pace a stream at 500 Mbps, packets must therefore be sent at 24 µs intervals, as in Fig. 1 (a). Timer-based software pacing instead transmits in an ON-OFF pattern: a burst of packets at each timer interrupt, then silence until the next one.

Fig. 1 Pacing.

The granularity of the Linux interval timer is 1/HZ, e.g., 10 ms when HZ = 100. Even with a 1 ms timer, pacing at 500 Mbps requires sending 62.5 KB (about 42 packets of 1,500 bytes) back to back at every interrupt, resulting in the bursty pattern of Fig. 1 (b). Raising the timer frequency further increases the CPU load 13).

3. Pacing Using Gap Packets

3.1 Requirements

The proposed method is designed to meet two requirements: (1) it must run on a commodity PC without special-purpose hardware, and (2) it must control bandwidth for each IP flow independently.

3.2 Gap Packets
A gap packet must satisfy two conditions: it must not affect the behavior of the receiving host, and it must be discarded at the input port of the adjacent switch or the receiver's NIC (Network Interface Card), so that it consumes bandwidth only on the first link. The IEEE 802.3x PAUSE frame satisfies both conditions. A PAUSE frame carries a pause time; a pause time of 0 (equivalent to XON in XON/XOFF flow control) does not stop the receiver, and PAUSE frames are processed and discarded by the NIC hardware without being delivered to the host. The proposed method therefore uses an IEEE 802.3x PAUSE frame with pause time 0 as the gap packet.

3.3 Gap Size

Figure 2 illustrates how gap packets control the inter-packet gap.

Fig. 2 Inter packet gap control using gap packets.

The inter-packet gap ipg needed to achieve a target rate follows from the ratio of packet transmission time to total time:

    pkt_size / (pkt_size + ipg) = target_rate / max_rate        (1)

Solving for ipg:

    ipg = (max_rate / target_rate - 1) * pkt_size               (2)

The NIC itself inserts a hardware gap hw_gap (the IFG (Inter Frame Gap) and related per-frame overhead) after each frame, so the gap packet is shortened accordingly:

    gappkt_size = ipg - (hw_gap * #pkts)                        (3)

where #pkts is the number of frames whose hardware gaps fall within the interval. Since the minimum Ethernet frame is 64 bytes, the maximum controllable rate on GbE with MTU-size (1,500-byte) packets is 935.2 Mbps. The minimum rate is 8 Kbps, which corresponds to a gap of about 190 MB per packet, realized by a sequence of maximum-size gap packets.

3.4 Packet Scheduling Using Gap Packets

To satisfy requirement (2), per-IP-flow rate control, the packets of multiple classes must be scheduled on a single NIC.
When multiple classes share one NIC, the scheduler maintains a byte-based global clock, which counts all bytes transmitted on the NIC, and a class clock for each class, which indicates the byte offset at which the class may transmit its next packet. Scheduling proceeds as follows: (1) select the class with the smallest class clock; (2) if that clock does not exceed the global clock, transmit one packet of the class, advance the class clock by pkt_size + ipg of the class, and advance the global clock by the packet size; (3) otherwise, transmit a gap packet whose size equals the difference between the smallest class clock and the global clock, and advance the global clock to that class clock.

Fig. 3 Packet scheduling using gap packets: global shows a global clock. P1 and P2 show class clocks. Target bandwidths of P1 and P2 are 500 Mbps and 250 Mbps, respectively.

Figure 3 shows an example with two classes, P1 (500 Mbps) and P2 (250 Mbps), sending 1,500-byte packets; by Eq. (2), P1's gap is 1,500 bytes and P2's gap is 4,500 bytes. (1) At offset 0, P1 transmits; its class clock advances to 3,000. (2) At offset 1,500, P2 transmits; its class clock advances to 6,000. (3) At offset 3,000, P1 transmits; its class clock advances to 6,000. (4) Since neither class is due before 6,000, a 1,500-byte gap packet is transmitted at offset 4,500; at offset 6,000 both classes become due and the pattern repeats. A single gap packet thus realizes the remaining gaps of the two classes simultaneously.

3.5 Implementation

The method is implemented on Linux within the iproute2 framework.
We implemented the proposed method as PSPacer 1),2), a queuing discipline (Qdisc) for the Linux iproute2 QoS (Quality of Service) framework. A Qdisc provides enqueue and dequeue operations; PSPacer performs the scheduling of Section 3.4 in its dequeue operation, returning either a queued packet or a PAUSE gap packet (Fig. 4). A maximum-size (MTU) gap packet is prepared in advance as an sk_buff; when a shorter gap packet is needed, it is duplicated with skb_clone and shortened with skb_trim.

Fig. 4 Implementation of PSPacer.

3.6 Configuration

PSPacer is configured with the tc command of iproute2. The following example attaches a PSPacer Qdisc to eth0 with two rate-controlled classes (500 Mbps and 250 Mbps) and a default class in normal (unpaced) mode:

  # tc qdisc add dev eth0 root handle 1: \
      psp default 3
  # tc class add dev eth0 parent 1: \
      classid 1:1 psp rate 500mbit
  # tc class add dev eth0 parent 1: \
      classid 1:2 psp rate 250mbit
  # tc class add dev eth0 parent 1: \
      classid 1:3 psp mode normal

Each class is given a FIFO queue (pfifo):

  # tc qdisc add dev eth0 parent 1:1 \
      handle 10: pfifo
  # tc qdisc add dev eth0 parent 1:2 \
      handle 20: pfifo
  # tc qdisc add dev eth0 parent 1:3 \
      handle 30: pfifo

Filters classify packets into the classes by destination IP address:

  # tc filter add dev eth0 parent 1: \
      protocol ip pref 1 u32 match ip \
      dst 192.168.2.0/24 classid 1:1
  # tc filter add dev eth0 parent 1: \
      protocol ip pref 1 u32 match ip \
      dst 192.168.3.0/24 classid 1:2

4. Evaluation of Pacing Accuracy

4.1 Burstiness Metric

Following the queue-based view of burstiness used in ATM teletraffic engineering 9), we quantify burstiness relative to the average bandwidth BW_avg.
Table 1 Host PC specifications.

             Type 1 (16 nodes)         Type 2 (16 nodes)
CPU          Intel Xeon 2.8 GHz dual   Intel Xeon 2.8 GHz dual
Chipset      Intel E7501               ServerWorks GC-LE
Memory       1 GB                      1 GB
NIC          Intel 82546EB             Intel 82546EB
I/O Bus      PCI-X 133 MHz/64 bit      PCI-X 133 MHz/64 bit
OS           Fedora Core 3, kernel 2.6.11.12
NIC Driver   e1000 5.6.10.1-k2-NAPI
TCP          BIC TCP

Fig. 5 Burstiness.

We define the burstiness of a stream as the maximum queue length Q that builds up at a virtual bottleneck queue drained at the average bandwidth BW_avg (Fig. 5). An ideally paced stream has a burstiness of 1 MTU, and the proposed method bounds it at 2 MTU (Section 4.4). For an ON-OFF stream with period t, the ON duration is t (Avg/Max), so the burstiness is t (Avg/Max)(Max − Avg), which grows with the timer period.

4.2 Experimental Setup

The experiments use the 16 + 16 PCs listed in Table 1, connected through Catalyst 2970 and Catalyst 6506 switches (denoted C2970 and C6506). GtrcNET-1 3) is inserted on the path to measure traffic and emulate a wide-area link. The hosts run Linux 2.6.11.12, whose default congestion control is BIC TCP 16); unless otherwise noted we use BIC TCP rather than Linux Reno. To mitigate known SACK (Selective ACK) processing overhead in the Linux TCP stack 5), the SACK-tag patch 17) by the author of Scalable TCP is applied, and the transmit queue length is set to 10,000. PSPacer is compared with the Linux TBF (Token Bucket Filter). UDP and TCP bandwidth are measured with Iperf.

4.3 Pacing Accuracy

The accuracy of UDP and TCP pacing is measured with GtrcNET-1.
Table 2 Single target rate, 1-to-1 communication: bandwidth per packet and max burstiness (the unit of bandwidth is bps).

Target  Min     Max     Avg     Max burstiness
8 K     7.96 K  7.96 K  7.96 K  1 MTU
10 M    9.95 M  9.95 M  9.95 M  1 MTU
500 M   495 M   500 M   498 M   2 MTU
930 M   918 M   931 M   926 M   2 MTU

Fig. 6 Effective bandwidth while varying target bandwidth.

Fig. 7 Bandwidth control in the case of multiple target rates (target bandwidth: 500, 300, 100, 50, 20 Mbps).

Figure 6 shows the effective bandwidth measured with Iperf while varying the target bandwidth; PSPacer controls the bandwidth as specified over the range from 8 Kbps to 930 Mbps. Figure 7 shows five concurrent flows paced at 500, 300, 100, 50 and 20 Mbps, measured with Iperf at 10 ms intervals.

4.4 Burstiness

Let t_i be the time at which the NIC finishes transmitting the i-th packet and pkt_size_i the size of that packet. The bandwidth over n packets is defined as

    ( Σ_{k=i+1}^{i+n} pkt_size_k ) / (t_{i+n} − t_i)

so that n packets are measured using n + 1 timestamps. GtrcNET-1 timestamps each packet with a resolution of 1/2^24 s = 59.6 ns; at 930 Mbps the resulting measurement error for n = 1 is about 2 Mbps.

4.4.1 Single Target Rate, 1-to-1 Communication

Table 2 shows the per-packet bandwidth and the maximum burstiness for 1-to-1 communication. The average bandwidth is within 0.5% of the target in all cases (e.g., 926 Mbps against a 930 Mbps target). The burstiness is 1 MTU at 8 Kbps and 10 Mbps, and 2 MTU at 500 Mbps and 930 Mbps.
Table 3 Single target rate, 1-to-many communication: bandwidth per packet and max burstiness (the unit of bandwidth is bps).

Target  Min     Max     Avg     Max burstiness
10 M    9.89 M  9.92 M  9.91 M  1 MTU

Table 4 Multiple target rates: bandwidth per n packets and burstiness (the unit of bandwidth is bps).

n   Target  Min     Max     Avg     Max burstiness
1   20 M    14.4 M  30.3 M  20.5 M  2 MTU
1   50 M    30.0 M  98.4 M  33.1 M  2 MTU
1   100 M   61.9 M  246 M   103 M   2 MTU
1   300 M   168 M   494 M   332 M   2 MTU
1   500 M   345 M   990 M   506 M   2 MTU
2   20 M    16.6 M  24.1 M  19.9 M  2 MTU
2   50 M    39.8 M  74.2 M  49.9 M  2 MTU
2   100 M   68.0 M  143 M   100 M   2 MTU
2   300 M   205 M   493 M   304 M   2 MTU
2   500 M   405 M   658 M   500 M   2 MTU

4.4.2 Single Target Rate, 1-to-many Communication

One sender transmits to 16 receivers with 6 flows per receiver (96 IP flows in total), each flow paced at 10 Mbps. As Table 3 shows, the variation in per-flow bandwidth stays within about 40 Kbps, demonstrating that on the order of 100 flows can be controlled simultaneously with a single commodity PC.

4.4.3 Multiple Target Rates

Figure 7 (seconds 25 to 45) and Table 4 show the five flows of Section 4.3 in more detail. For n = 1, the instantaneous bandwidth of the 500 Mbps class reaches 990 Mbps, because two packets of different classes may be transmitted back to back; this is exactly the 2 MTU burstiness bound. For n = 2, the bandwidth over two packets stays close to the target rate.

Table 5 Iperf throughput and the number of packet losses on the router (the unit of bandwidth is Mbps).

Router FIFO   TBF             no pacing       PSPacer
size          bw     losses   bw     losses   bw    losses
16 KB         29.4   219      26.9   131      474   0
64 KB         210    257      191    402      474   0
256 KB        223    394      379    261      473   0
1024 KB       256    1196     419    12       474   0
4096 KB       459    1283     471    0        474   0

5. TCP/IP Performance over a Long-Distance Network

5.1 Setup

Using GtrcNET-1, we emulate a wide-area path with a bottleneck bandwidth of 500 Mbps and an RTT (Round Trip Time) of 200 ms 18). The emulated router uses a Drop Tail FIFO queue whose size can be set up to 25 MB. Both 1-to-1 and 2-to-2 communication patterns are examined.

5.2 1-to-1 Communication

We compare TBF (Token Bucket Filter) pacing, PSPacer, and no pacing, varying the router FIFO size from 16 KB to 4 MB. Table 5 shows the Iperf throughput and the number of packets dropped at the router. With PSPacer set to the bottleneck rate of 500 Mbps, the flow achieves 474 Mbps with no packet losses regardless of the FIFO size. TBF suffers losses caused by its bursty transmission even with a 4 MB FIFO, and without pacing a large FIFO is required to approach the bottleneck bandwidth.
Figure 8 shows the behavior of the slow start phase in the 1-to-1 case with a 1 MB FIFO. During slow start, packets are transmitted in bursts clocked by the returning ACKs 1),2). With TBF (Fig. 8 (a)), transmission shows an ON-OFF pattern over the 200 ms RTT, with bursts at roughly 500 µs granularity, while PSPacer spaces packets evenly and avoids the resulting queue buildup even with a small (16 KB) FIFO.

Fig. 8 Behavior of slow start phase on 1-to-1 communication (bottleneck bandwidth 500 Mbps, RTT 200 ms, FIFO size 1 MB).

5.3 2-to-2 Communication

Two senders transmit to two receivers through the shared 500 Mbps bottleneck, with the router FIFO set to 32 KB. Flow B starts 5 seconds after flow A, and each measurement runs for 120 seconds. Four pacing combinations are compared: PSPacer + PSPacer, TBF + TBF, PSPacer + TBF, and TBF + PSPacer. With PSPacer + PSPacer (Fig. 9 (a)), the two flows share the bottleneck stably. With TBF + TBF (Fig. 9 (b)), throughput stays around 200 Mbps because of the burst-induced losses observed in Section 5.2. In the mixed cases (Fig. 9 (c), (d)), the PSPacer flow obtains higher and more stable throughput than the TBF flow.

6. Discussion

6.1 Hardware Bottlenecks

PSPacer assumes that the NIC transmits at the wire rate. When a GbE NIC is attached to a 33 MHz/32 bit PCI bus, the bus (about 1 Gbps of theoretical bandwidth) becomes the bottleneck; the actual inter-packet spacing is then determined by the PCI bus rather than by the wire rate, and the pacing accuracy of PSPacer degrades.
Fig. 9 Bandwidth of 2-to-2 communication (bottleneck bandwidth 500 Mbps, RTT 200 ms, FIFO size 32 KB).

6.2 CPU Overhead

PSPacer adds CPU load because gap packets are built and DMA-transferred to the NIC like ordinary packets. In our measurements, unpaced 1 Gbps UDP transmission consumes about 40% of the CPU, while pacing at 500 Mbps costs about 10% with TBF and about 15% with PSPacer. The overhead of the gap-packet DMA transfers could be reduced with NIC support 1).

6.3 Applications

In GridMPI experiments 7) connecting two 16-node clusters, pacing improved the performance of the NAS Parallel Benchmarks IS kernel by a factor of 1.6. Two further application scenarios are: (1) long-distance TCP transfer, where pacing mitigates burst-induced losses and RTT unfairness 1);
and (2) QoS control on consumer broadband access. On an FTTH link of 40 Mbps with an RTT of 6 ms, 40 Mbps UDP streaming showed a packet loss rate of 0.35% with PSPacer against 1.6% with TBF 8).

7. Related Work

Timer-based software pacing has been applied to TCP/IP, e.g., to restart idle connections on WEB servers 10),12), and to MPI communication 6). With a timer granularity of 1 ms to 10 ms, such methods can pace accurately only at rates of some Mbps; at 1 Gbps a 10 ms granularity is far too coarse. On IA32 platforms, finer-grained timers are available through the local APIC and the HPET (High Precision Event Timer) 13), and µs-order pacing using such timers has been reported 14), at the cost of increased interrupt load. Hardware approaches include GtrcNET-1 3), which paces in hardware at the MAC level; NICs whose IPG (inter packet gap) setting can be tuned 15); and the Chelsio T110 TOE (TCP Offloading Engine) NIC, which performs rate control on the NIC. In contrast, the proposed method needs no special hardware. Packet schedulers such as WFQ (Weighted Fair Queuing) 19) allocate bandwidth shares among flows at a router, but do not smooth the burstiness of an individual flow as the proposed method does.

8. Conclusion

We proposed a precise software pacing method that controls the inter-packet gap by inserting IEEE 802.3x PAUSE frames as gap packets, and implemented it as PSPacer on a commodity PC. The method controls transmission bandwidth from 8 Kbps to 930 Mbps for each of about 100 IP flows while bounding burstiness at 2 MTU. Experiments over an emulated network with a 200 ms RTT showed that the method enables TCP/IP to utilize a long-distance path almost fully.
PSPacer is distributed under the GNU GPL at http://www.gridmpi.org/.

Acknowledgments This work was supported in part by the NAREGI (National Research Grid Initiative) project.

References

1) Takano, R., Kudoh, T., Kodama, Y., Matsuda, M., Tezuka, H. and Ishikawa, Y.: Design and Evaluation of Precise Software Pacing Mechanisms for Fast Long-Distance Networks, PFLDnet2005 (Feb. 2005).
2) (In Japanese) (Oct. 2004).
3) Kodama, Y., Kudoh, T., Takano, R., Sato, H., Tatebe, O. and Sekiguchi, S.: GNET-1: Gigabit Ethernet Network Testbed, IEEE Cluster 2004 (Sep. 2004).
4) Tatebe, O., Ogawa, H., Kodama, Y., Kudoh, T., Sekiguchi, S., Matsuoka, S., Aida, K., Boku, T., Sato, M., Morita, Y., Kitatsuji, Y., Williams, J. and Hicks, J.: The 2nd Trans-Pacific Grid Datafarm Testbed and Experiments for SC2003, IEEE/IPSJ SAINT 2004 Workshops, pp.26–30 (Jan. 2004).
5) (In Japanese) (Oct. 2003).
6) (In Japanese), IPSJ Transactions, Vol.46, No.SIG12 (ACS11) (2005).
7) (In Japanese): GridMPI Version 1.0, SWoPP 2005 (Aug. 2005).
8) Takano, R., Kodama, Y., Kudoh, T., Matsuda, M., Okazaki, F. and Ishikawa, Y.: Realtime Burstiness Measurement, PFLDnet2006 (Feb. 2006).
9) Michiel, H. and Leavens, K.: Teletraffic engineering in a broad-band era, Proc. IEEE, Vol.85, No.12, pp.2007–2033 (Dec. 1997).
10) Visweswaraiah, V. and Heidemann, J.: Improving Restart of Idle TCP Connections, USC TR 97-661 (Nov. 1997).
11) Aggarwal, A., Savage, S. and Anderson, T.: Understanding the performance of TCP pacing, IEEE INFOCOM, pp.1157–1165 (Mar. 2000).
12) Aron, M. and Druschel, P.: TCP: Improving Startup Dynamics by Adaptive Timers and Congestion Control, Technical Report TR98-318, Rice Univ. (1998).
13) Antony, A., Blom, J., de Laat, C., Lee, J. and Sjouw, W.: Microscopic Examination of TCP flows over transatlantic Links, iGrid2002 special issue, Future Generation Computer Systems, Vol.19, Issue 6 (2003).
14) Kamezawa, H., Nakamura, M., Tamatsukuri, J., Aoshima, N., Inaba, M., Hiraki, K., Shitami, J., Jinzaki, A., Kurusu, R., Sakamoto, M.
and Ikuta, Y.: Inter-layer coordination for parallel TCP streams on Long Fat pipe Networks, SC2004 (Nov. 2004).
15) Nakamura, M., Kurusu, R., Marti, F., Sakamoto, M., Ikuta, Y., Tamatsukuri, J., Sugawara, Y., Aoshima, N., Inaba, M. and Hiraki, K.: Experimental Results of inter-layer cooperative hardware for FRC-TCP on 10 Gbps Ethernet WANPHY 18,500 km Network, PFLDnet2005 (Feb. 2005).
16) Xu, L., Harfoush, K. and Rhee, I.: Binary Increase Congestion Control for Fast Long-Distance Networks, IEEE INFOCOM 2004 (Mar. 2004).
17) Kelly, T.: SACK-tag patch. http://www-lce.eng.cam.ac.uk/~ctk21/code/
18) Floyd, S.: HighSpeed TCP for Large Congestion Windows, RFC 3649 (Dec. 2003).
19) Parekh, A.K. and Gallager, R.G.: A Generalized Processor Sharing Approach to Flow Control in Integrated Services Networks: The Single-Node Case, IEEE/ACM Trans. Networking, Vol.1, No.3, pp.344–357 (June 1993).
(Received October 4, 2005)
(Accepted January 20, 2006)