Performance Evaluation of RDMA over IP: A Case Study with Ammasso Gigabit Ethernet NIC
This presentation evaluates the performance of Remote Direct Memory Access (RDMA) over IP using the Ammasso Gigabit Ethernet NIC. It describes a WAN emulator for cluster-of-clusters setups and compares the Sockets interface with the Cluster Core Interface Language (CCIL) on basic communication latency, computation and communication overlap, communication progress, CPU resource requirements, and bandwidth.
Presentation Transcript
NETWORK BASED COMPUTING LABORATORY Performance Evaluation of RDMA over IP: A Case Study with the Ammasso Gigabit Ethernet NIC H.-W. Jin, S. Narravula, G. Brown, K. Vaidyanathan, P. Balaji, and D.K. Panda Network-Based Computing Laboratory Department of Computer Science and Engineering The Ohio State University { jinhy, narravul, browngre, vaidyana, balaji, panda}@cse.ohio-state.edu
Contents: Introduction; WAN Emulator for Cluster-of-Clusters; Performance Evaluation of RDMA over IP; Conclusions and Future Work
Introduction: Sockets over TCP/IP. RDMA over LAN: InfiniBand, Myrinet, Quadrics; HPC middleware (MPI) and file systems (PVFS). RDMA over WAN: iWARP, RDDP; Grid and Internet applications. RDMA-enabled Gigabit Ethernet NIC: Ammasso.
Ammasso Gigabit Ethernet NIC [Diagram labels: Applications; Sockets Interface; CCIL (Cluster Core Interface Language); Operating System; Sockets; TCP; IP; Device Driver; Ammasso Gigabit Ethernet NIC; RDMA; TOE (TCP/IP Offload Engine); Gigabit Ethernet]
Problem Statement: There have been no comprehensive quantitative evaluations of RDMA over WAN environments. How to emulate the WAN environment? What kind of performance metrics? Sockets vs. CCIL.
Experimental WAN Setup [Diagram labels: IP Network A; IP Network B; GigE Switch (two); Linux workstation-based router with interfaces eth0 and eth1; WAN Emulation; IP; Device Driver]
WAN Emulator for Cluster-of-Clusters: Characteristics of WAN environments: high network delay, packet loss, etc. User-level or kernel-level emulator? Blocking or queueing-based delay adding?
WAN Emulator for Cluster-of-Clusters: Degen (delay generator) [Diagram labels: Degen daemon; IP; Degen kernel module; routing decision; Degen Netfilter; timestamp; delay queue; reinjection; device drivers for eth0 and eth1]
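The timestamp, delay queue, and reinjection labels on this slide point to a queueing-based way of adding delay. The following is a minimal user-space Python sketch of that idea, not the Degen kernel module itself: the class name `DelayQueue`, its method names, and the integer millisecond clock are all assumptions made for illustration.

```python
from collections import deque


class DelayQueue:
    """Hypothetical queueing-based delay: hold each packet for a fixed time."""

    def __init__(self, delay_ms):
        self.delay_ms = delay_ms
        # With one fixed delay, release order equals arrival order,
        # so a FIFO of (release_time_ms, packet) pairs is enough.
        self._pending = deque()

    def enqueue(self, packet, now_ms):
        # Timestamp the packet on arrival and record when it may leave.
        self._pending.append((now_ms + self.delay_ms, packet))

    def release_due(self, now_ms):
        # "Reinject" every packet whose release time has passed.
        released = []
        while self._pending and self._pending[0][0] <= now_ms:
            released.append(self._pending.popleft()[1])
        return released


if __name__ == "__main__":
    q = DelayQueue(delay_ms=4)
    q.enqueue(b"first", now_ms=0)
    q.enqueue(b"second", now_ms=1)
    for t in (3, 4, 5):
        print(t, q.release_due(t))
```

In this sketch, new packets can be accepted while earlier ones are still waiting, which is the property that separates a queueing approach from one that blocks for the delay of each packet in turn.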
Kernel Patch for CCIL WAN Communication: Ammasso setup: Ammasso 1100; Ammasso software version amso1100-1.2-ga2. Packet drops for CCIL WAN communication: timeout, retransmission. Kernel patch on router.
Contents: Introduction; WAN Emulator for Cluster-of-Clusters; Performance Evaluation of RDMA over IP (basic communication latency, computation and communication overlap, communication progress, CPU resource requirements, unification of communication interface, bandwidth (throughput)); Conclusions and Future Work
Basic Communication Latency [Figure: two panels comparing Sockets and CCIL. One plots latency (us) against message size (4 to 16384 bytes); the other, titled "1KB Message Size", plots latency (us) against network delay (0, 1, 2, 4, 8 ms).] No impact of zero-copy on the basic communication latency. Basic communication latency is not an important metric.
Computation and Communication Overlap [Diagram labels: nodes n0 and n1; Switch, Router, Switch; Send; Computation (t1); Receive; Total Time (t2)] Overlap Ratio = t1 / t2
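The overlap ratio defined on this slide can be written as a one-line calculation. The Python sketch below is only an illustration: the function name `overlap_ratio` and the timing values in the example are assumptions, not measurements from the study.

```python
def overlap_ratio(computation_time, total_time):
    """Overlap Ratio = t1 / t2, with both times in the same unit."""
    if total_time <= 0:
        raise ValueError("total_time must be positive")
    return computation_time / total_time


if __name__ == "__main__":
    # Hypothetical timings in milliseconds, chosen only for illustration.
    t1 = 242.0  # time spent in computation
    t2 = 250.0  # total time for send, computation, and receive
    print(f"overlap ratio = {overlap_ratio(t1, t2):.3f}")
```

A ratio close to 1 means the total time is dominated by the computation, so the communication was mostly hidden behind it. A lower ratio means more of the total time went to communication that did not overlap.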
Computation and Communication Overlap [Figure: two panels comparing Sockets and CCIL. One, titled "1KB Message Size", plots overlap ratio (0 to 1) against computation time (0 to 422 ms); the other, titled "242ms Computation", plots overlap ratio against network delay (0, 1, 2, 4, 8 ms). Annotated values: 1098% and 114%.] RDMA can achieve better computation and communication overlap. Its benefit decreases as the network delay increases.
Communication Progress [Diagram labels: nodes n0 and n1; Switch, Router, Switch; Request; Response; Response Delay by Load; Data Fetching Latency]
Communication Progress [Figure: two panels comparing Sockets and CCIL, with latency (us) on a logarithmic axis from 1 to 100000. One, titled "1KB Message Size", plots latency against response delay by load (0, 1, 4, 16, 64 ms); the other, titled "16ms Response Delay", plots latency against network delay (0, 1, 2, 4, 8 ms). Annotated values: 65% and 98%.] RDMA can achieve better communication progress. Its benefit decreases as the network delay increases.
CPU Resource Requirements [Diagram labels: nodes n0 and n1; Switch, Router, Switch; Application; 40 Streams] Application execution time?
CPU Resource Requirements [Figure: two panels comparing Sockets and CCIL. One plots execution time (sec) against message size (1K to 16K bytes); the other, titled "16KB Message Size", plots execution time (sec) against network delay (0, 1, 2, 4, 8 ms).] RDMA-based communication does not affect the application execution time. RDMA has strong potential for saving CPU resources.
Unification of Communication Interface [Figure: latency (us) against message size (4 to 16384 bytes) for Sockets and CCIL, with diagram labels for inter-cluster and intra-cluster communication through switches. Annotated value: 38%.] RDMA over IP can provide a unified communication interface. RDMA can achieve lower latency for intra-cluster communication.
Bandwidth [Figure: two panels comparing Sockets and CCIL. One plots bandwidth (Mbps) against message size (4 to 16384 bytes); the other, titled "16KB Message Size", plots bandwidth (Mbps) against network delay (0, 1, 2, 4, 8 ms).] Where is the bottleneck? Ethernet devices on the router. TCP window size.
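One way to see why the TCP window size can become a bottleneck as delay grows: a sender can keep at most one window of unacknowledged data in flight per round trip, so throughput is bounded by the window size divided by the round-trip time. The Python sketch below computes that bound. The function name, the 64 KiB window, the round-trip times, and the 1000 Mbps link rate are illustrative assumptions, not values reported in the study.

```python
def window_limited_mbps(window_bytes, rtt_ms, link_mbps=1000.0):
    """Upper bound on throughput: min(link rate, window / round-trip time)."""
    if rtt_ms <= 0:
        # With no measurable delay, only the link rate limits throughput.
        return link_mbps
    bits_per_window = window_bytes * 8
    mbps = bits_per_window / (rtt_ms / 1000.0) / 1e6
    return min(mbps, link_mbps)


if __name__ == "__main__":
    window = 64 * 1024  # assumed 64 KiB window, for illustration only
    for rtt in (0, 2, 4, 8, 16):  # assumed round-trip times in ms
        bound = window_limited_mbps(window, rtt)
        print(f"rtt={rtt:>2} ms  bound={bound:.1f} Mbps")
```

Under these assumptions the bound falls well below the link rate once the round-trip time reaches a few milliseconds, which is why a larger window is typically needed on high-delay paths.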
Conclusions: The first quantitative study of RDMA over IP on a WAN setup. WAN emulator for Cluster-of-Clusters: Degen. RDMA over IP can save CPU resources on the server side, even in a high-delay WAN environment; achieve better computation and communication overlap, communication progress, and peak bandwidth; and provide a unified interface.
Future Work: Performance evaluations: other performance factors (impact of address exchange, bandwidth); application-level performance. WAN emulator for Cluster-of-Clusters: delay model; other components. RDMA-aware middleware for widely distributed systems over WAN.
Acknowledgements: Our research is supported by the following organizations. Current funding support by: [logos not preserved]. Current equipment donations by: [logos not preserved] Foundry Networks.
Thank You. {jinhy, narravul, browngre, vaidyana, balaji, panda}@cse.ohio-state.edu Network-Based Computing Laboratory http://nowlab.cse.ohio-state.edu/