
High Performance Sockets and InfiniBand Architecture Overview
This presentation explores whether the Sockets Direct Protocol (SDP) over InfiniBand is beneficial in cluster environments, based on research by P. Balaji, S. Narravula, K. Vaidyanathan, S. Krishnamoorthy, J. Wu, and D. K. Panda at the Network Based Computing Laboratory, The Ohio State University. It covers an introduction and background, SDP, multi-tier data centers, PVFS, experimental evaluation, and conclusions and future work. Both generic and network-specific optimizations for high-performance application networking are discussed, along with a comparison of traditional Berkeley sockets and high-performance sockets in user space and kernel space. The InfiniBand architecture is examined as an industry-standard interconnect providing high performance and low latency for compute and I/O node connectivity.
Presentation Transcript
Sockets Direct Protocol over InfiniBand in Clusters: Is it Beneficial?
P. Balaji, S. Narravula, K. Vaidyanathan, S. Krishnamoorthy, J. Wu and D. K. Panda
Network Based Computing Laboratory, The Ohio State University
Presentation Layout
- Introduction and Background
- Sockets Direct Protocol (SDP)
- Multi-Tier Data-Centers
- Parallel Virtual File System (PVFS)
- Experimental Evaluation
- Conclusions and Future Work
Introduction
- Advent of high performance networks, e.g. InfiniBand, Myrinet, 10-Gigabit Ethernet
- High performance protocols (VAPI / IBAL, GM, EMP): good for building new applications, but not so beneficial for existing applications built around portability (should run on all platforms)
- TCP/IP based sockets: a popular choice; application performance depends on the performance of sockets
- Several GENERIC optimizations for sockets to provide high performance:
  - Jacobson optimization: integrated checksum-copy [Jacob89]
  - Header prediction for single stream data transfer [Jacob89]
[Jacob89] "An Analysis of TCP Processing Overhead", D. Clark, V. Jacobson, J. Romkey and H. Salwen, IEEE Communications
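The portability argument rests on the ubiquity of the Berkeley sockets API: the same code runs on any platform with a sockets stack, regardless of the network underneath. As a point of reference only (the host name and port below are placeholders, not values from the presentation), a minimal TCP client against that API looks like this:

```c
/* Minimal Berkeley sockets TCP client -- a generic illustration of the
 * portable API the slides refer to. "server.example.com" and port 5000
 * are placeholders, not values from the presentation. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <netdb.h>
#include <sys/socket.h>

int main(void)
{
    struct addrinfo hints, *res;
    memset(&hints, 0, sizeof(hints));
    hints.ai_family   = AF_INET;      /* plain TCP/IP; SDP can transparently replace this path */
    hints.ai_socktype = SOCK_STREAM;

    if (getaddrinfo("server.example.com", "5000", &hints, &res) != 0) {
        fprintf(stderr, "getaddrinfo failed\n");
        return EXIT_FAILURE;
    }

    int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (fd < 0 || connect(fd, res->ai_addr, res->ai_addrlen) < 0) {
        perror("socket/connect");
        return EXIT_FAILURE;
    }

    const char msg[] = "hello";
    send(fd, msg, sizeof(msg), 0);    /* same calls regardless of the underlying network */

    char buf[128];
    ssize_t n = recv(fd, buf, sizeof(buf), 0);
    if (n > 0)
        printf("received %zd bytes\n", n);

    close(fd);
    freeaddrinfo(res);
    return EXIT_SUCCESS;
}
```

Because SDP preserves these call semantics, a program written this way can be carried over to InfiniBand without source changes.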
Network-Specific Optimizations
- Generic optimizations are insufficient: unable to saturate high performance networks
- Sockets can utilize some network features:
  - Interrupt coalescing (can be considered generic)
  - Checksum offload (the TCP stack has to be modified)
- Still insufficient! Can we do better?
  - High Performance Sockets
  - TCP Offload Engines (TOE)
High Performance Sockets
[Figure: two protocol stacks compared. Traditional Berkeley sockets: Application or Library (user space); Sockets, TCP, IP (kernel); NIC (hardware). High performance sockets: Application or Library and a pseudo sockets layer over the native protocol (user space); OS agent (kernel); high performance network (hardware)]
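The pseudo sockets layer is commonly realized by intercepting the standard socket calls in user space and redirecting managed connections to the native protocol, while everything else falls through to the kernel. The following is a minimal sketch of that interception technique; the native_* helpers are hypothetical stubs standing in for a real user-level transport, and this is not the specific library evaluated in the talk.

```c
/* Sketch of a user-space "pseudo sockets layer": intercept send() and hand
 * the data to a native high-performance transport when the descriptor is
 * managed by it. The native_* helpers are hypothetical stand-ins for a real
 * user-level protocol (e.g. VAPI or GM); they are not from the presentation. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <sys/types.h>
#include <sys/socket.h>

static int native_owns_fd(int fd) { (void)fd; return 0; }           /* stub */
static ssize_t native_send(int fd, const void *buf, size_t len)     /* stub */
{
    (void)fd; (void)buf;
    return (ssize_t)len;
}

ssize_t send(int sockfd, const void *buf, size_t len, int flags)
{
    /* Resolve the libc implementation once, so unmanaged sockets still work. */
    static ssize_t (*libc_send)(int, const void *, size_t, int);
    if (!libc_send)
        libc_send = (ssize_t (*)(int, const void *, size_t, int))
                        dlsym(RTLD_NEXT, "send");

    if (native_owns_fd(sockfd))
        return native_send(sockfd, buf, len);   /* bypass kernel TCP/IP */

    return libc_send(sockfd, buf, len, flags);  /* fall back to kernel sockets */
}
```

Built as a shared object and loaded with LD_PRELOAD, such a layer gives unmodified applications access to the high-performance path.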
InfiniBand Architecture Overview
- Industry-standard interconnect for connecting compute and I/O nodes
- Provides high performance: latency of less than 5 us, over 840 MBps uni-directional bandwidth
- Provides one-sided communication (RDMA, remote atomics)
- Becoming increasingly popular
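One-sided RDMA is the capability that SDP's zero-copy modes and the native PVFS transport build on: the initiator reads or writes remote memory directly, without involving the remote CPU. As a rough illustration (using the modern libibverbs API rather than the VAPI interface used in this work, and assuming the queue pair, memory registration and the peer's address/rkey were exchanged during connection setup), posting one RDMA write looks like this:

```c
/* Sketch: post one RDMA write over an already-connected queue pair.
 * qp, mr, remote_addr and rkey are assumed to come from earlier setup;
 * nothing here is specific to the stack used in the presentation. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *mr,
                    void *local_buf, size_t len,
                    uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,  /* local registered buffer */
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };

    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_WRITE;   /* one-sided: no receive posted remotely */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;   /* ask for a completion */
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(qp, &wr, &bad_wr);       /* 0 on success */
}
```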
Sockets Direct Protocol (SDP*)
- IBA-specific protocol for data streaming
- Defined to serve two purposes:
  - Maintain compatibility for existing applications
  - Deliver the high performance of IBA to the applications
- Two approaches for data transfer: copy-based and Z-Copy
  - Z-Copy specifies Source-Avail and Sink-Avail messages
  - Source-Avail allows the destination to RDMA Read from the source
  - Sink-Avail allows the source to RDMA Write to the destination
- Current implementation limitations:
  - Only supports the copy-based implementation
  - Does not support Source-Avail and Sink-Avail
* SDP implementation from the Voltaire Software Stack
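The compatibility goal means existing sockets applications can be pointed at SDP with little or no source change. One common mechanism in later SDP stacks is a separate address family; the constant below is an assumption shown only for illustration, its value is stack-specific, and other stacks achieve the same effect by preloading a library so that no source change is needed at all. Whether the Voltaire stack used in this study exposed exactly this interface is not stated in the slides.

```c
/* Sketch: the only change an application needs when an SDP stack exposes a
 * separate address family. AF_INET_SDP is NOT a standard constant -- its
 * value is stack-specific and defined here purely for illustration; consult
 * the SDP stack's own headers. */
#include <sys/socket.h>

#ifndef AF_INET_SDP
#define AF_INET_SDP 27   /* assumption: replace with the stack's real value */
#endif

int open_sdp_stream_socket(void)
{
    /* Same SOCK_STREAM semantics and the same bind/connect/send/recv calls
     * as TCP; only the address family differs, so existing code is reused. */
    return socket(AF_INET_SDP, SOCK_STREAM, 0);
}
```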
Multi-Tier Data-Centers (courtesy Mellanox Corporation)
- Client requests come over the WAN (TCP based + Ethernet connectivity)
- Traditional TCP based requests are forwarded to the inner tiers
- Performance is limited due to TCP
- Can we use SDP to improve data-center performance?
- SDP is not compatible with traditional sockets: requires TCP termination! (see the relay sketch below)
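TCP termination means the proxy tier completes the client's TCP connection itself and opens a separate connection (here, an SDP one) toward the inner tiers, copying payload between the two. Below is a bare-bones sketch of that relay loop for a single connection with blocking I/O and no HTTP parsing; a real proxy such as Apache does considerably more.

```c
/* Sketch of TCP termination at the proxy tier: bytes arriving on the
 * client-facing TCP socket are re-sent on a separate back-end socket
 * (which may be SDP). Single connection, blocking I/O, no parsing. */
#include <sys/socket.h>
#include <sys/types.h>

/* Relay data from client_fd (terminated TCP connection) to backend_fd
 * (fresh connection to the inner tier) until the client closes. */
int relay(int client_fd, int backend_fd)
{
    char buf[8192];
    ssize_t n;

    while ((n = recv(client_fd, buf, sizeof(buf), 0)) > 0) {
        ssize_t off = 0;
        while (off < n) {                       /* handle short writes */
            ssize_t w = send(backend_fd, buf + off, (size_t)(n - off), 0);
            if (w < 0)
                return -1;
            off += w;
        }
    }
    return (n == 0) ? 0 : -1;                   /* 0: clean client close */
}
```

Here client_fd would be the accepted TCP socket and backend_fd could be an SDP socket opened as in the earlier snippet.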
3-Tier Data-Center Test-bed at OSU
[Figure: clients send requests over the WAN to Tier 0 (proxy nodes), which forward them to Tier 1 (Apache application/web servers) and Tier 2 (database servers running MySQL or DB2)]
- Clients generate requests for both web servers and database servers
- Proxy tier: TCP termination, load balancing, caching
- Application tier: Apache, PHP, dynamic content caching, persistent connections
- Aspects studied: file system evaluation, caching schemes
Parallel Virtual File System (PVFS)
[Figure: compute nodes connected over the network to a meta-data manager (meta data) and several I/O server nodes (data)]
- Relies on striping of data across different nodes (see the striping sketch after this slide)
- Tries to aggregate I/O bandwidth from multiple nodes
- Utilizes the local file system on the I/O server nodes
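Striping is essentially round-robin arithmetic over fixed-size stripe units. The following sketch shows the offset-to-server mapping under that assumption; the 64 KB stripe size and four-server count are illustrative, not the values used in the evaluation.

```c
/* Sketch of round-robin file striping as used by parallel file systems
 * like PVFS: a file offset maps to (I/O server index, offset on that
 * server). Stripe size and server count are illustrative assumptions. */
#include <stdio.h>
#include <stdint.h>

#define STRIPE_SIZE (64 * 1024)   /* bytes per stripe unit (assumed) */
#define NUM_IOD     4             /* number of I/O server nodes (assumed) */

struct stripe_loc {
    int      server;        /* which I/O server holds this byte */
    uint64_t server_offset; /* offset within that server's local file */
};

static struct stripe_loc locate(uint64_t file_offset)
{
    uint64_t stripe_idx    = file_offset / STRIPE_SIZE;   /* global stripe number */
    uint64_t within_stripe = file_offset % STRIPE_SIZE;

    struct stripe_loc loc;
    loc.server        = (int)(stripe_idx % NUM_IOD);      /* round-robin placement */
    loc.server_offset = (stripe_idx / NUM_IOD) * STRIPE_SIZE + within_stripe;
    return loc;
}

int main(void)
{
    uint64_t off = 300 * 1024;                  /* example: byte 300K of the file */
    struct stripe_loc loc = locate(off);
    printf("offset %llu -> server %d, local offset %llu\n",
           (unsigned long long)off, loc.server,
           (unsigned long long)loc.server_offset);
    return 0;
}
```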
Parallel I/O in Clusters via PVFS
[Figure: applications on compute nodes access libpvfs through POSIX or MPI-IO interfaces; control and data flow over the network to the manager (mgr) and I/O daemons (iod), which store data on local file systems]
- PVFS: Parallel Virtual File System
  - Parallel: stripe/access data across multiple nodes
  - Virtual: exists only as a set of user-space daemons
  - File system: common file access methods (open, read/write)
- Designed by ANL and Clemson
- "PVFS over InfiniBand: Design and Performance Evaluation", Jiesheng Wu, Pete Wyckoff and D. K. Panda, International Conference on Parallel Processing (ICPP), 2003
Presentation Layout
- Introduction and Background
- Sockets Direct Protocol (SDP)
- Multi-Tier Data-Centers
- Parallel Virtual File System (PVFS)
- Experimental Evaluation
  - Micro-Benchmark Evaluation
  - Data-Center Performance
  - PVFS Performance
- Conclusions and Future Work
Experimental Test-bed
- Eight dual 2.4 GHz Xeon processor nodes; 64-bit 133 MHz PCI-X interfaces; 512 KB L2 cache and 400 MHz front side bus
- Mellanox InfiniHost MT23108 dual-port 4x HCAs; MT43132 eight-port 4x switch
- SDK version 0.2.0, firmware version 1.17
Latency and Bandwidth Comparison
[Figure: latency and CPU utilization, and bandwidth and CPU utilization, for SDP vs IPoIB across message sizes, with VAPI send/recv and VAPI RDMA write shown for reference]
- SDP achieves 500 MBps bandwidth compared to 180 MBps for IPoIB
- Latency of 27 us compared to 31 us for IPoIB
- Improved CPU utilization
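The presentation does not reproduce its benchmark code, but one-way latency figures like these conventionally come from a sockets ping-pong loop: time many round trips and report half the average round-trip time. A generic sketch of that kernel follows; the buffer size cap and iteration count are arbitrary choices, not the study's parameters.

```c
/* Sketch of a sockets ping-pong latency micro-benchmark: half the average
 * round-trip time is reported as one-way latency. fd is an already-connected
 * SOCK_STREAM socket (TCP, IPoIB or SDP), and the peer echoes each message. */
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>

double pingpong_latency_us(int fd, size_t msg_size, int iters)
{
    char buf[4096];
    if (msg_size > sizeof(buf))
        msg_size = sizeof(buf);
    memset(buf, 'a', msg_size);

    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    for (int i = 0; i < iters; i++) {
        send(fd, buf, msg_size, 0);                        /* ping */
        size_t got = 0;
        while (got < msg_size) {                           /* pong (handle short reads) */
            ssize_t n = recv(fd, buf + got, msg_size - got, 0);
            if (n <= 0)
                return -1.0;
            got += (size_t)n;
        }
    }
    gettimeofday(&t1, NULL);

    double total_us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
    return total_us / (2.0 * iters);                       /* one-way latency */
}
```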
Hotspot Latency
[Figure: hot-spot latency and CPU utilization for SDP vs IPoIB at 1K, 4K and 16K message sizes as the number of nodes grows from 1 to 7]
- SDP is more scalable in hot-spot scenarios
Data-Center Response Time
[Figure: web server delay and client response time for IPoIB vs SDP across message sizes from 32K to 2M bytes]
- SDP shows very little improvement in client response time: the client network (Fast Ethernet) becomes the bottleneck
- The client network bottleneck is reflected in the web server delay: up to 3 times improvement with SDP
Data-Center Response Time (Fast Clients)
[Figure: client response time for IPoIB vs SDP, message sizes 32K to 2M bytes]
- SDP performs well for large files, but not very well for small files
Data-Center Response Time Split-up
[Figure: pie charts breaking the proxy response time into components for IPoIB and SDP: Init + Qtime, Request Read, URL Manipulation, Cache Update, Back-end Connect, Request Write, Reply Read, Response Write, Core Processing, and Proxy End. Back-end Connect and Response Write are the largest components, and the Back-end Connect share grows noticeably under SDP, foreshadowing the connection-time overhead discussed next]
Data-Center Response Time without Connection Time Overhead
[Figure: client response time for IPoIB vs SDP with connection time excluded, message sizes 32K to 2M bytes]
- Without the connection time, SDP would perform well for all file sizes (see the timing sketch below)
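Isolating the connection-time overhead amounts to timing connect() separately from the data transfer. A small generic sketch of that measurement (this is not the presentation's instrumentation, just the kind of split behind a "without connection time" number):

```c
/* Sketch: time connection establishment separately from data transfer.
 * addr/addrlen describe the back-end server; generic illustration only. */
#include <sys/socket.h>
#include <sys/time.h>

double timed_connect_ms(int fd, const struct sockaddr *addr, socklen_t addrlen)
{
    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    if (connect(fd, addr, addrlen) < 0)
        return -1.0;
    gettimeofday(&t1, NULL);
    return (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_usec - t0.tv_usec) / 1e3;
}
```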
PVFS Performance using ramfs
[Figure: read and write bandwidth (MBps) for IPoIB, SDP and VAPI with 3 and 4 I/O daemons (IODs) as the number of clients increases]
PVFS Performance with sync (ext3fs)
[Figure: aggregate bandwidth (MBytes/s) for IPoIB, SDP and VAPI]
- Clients can push data faster to the IODs using SDP; de-stage bandwidth remains the same
Conclusions
- User-level sockets were designed with two motives: compatibility for existing applications and high performance for modern networks
- SDP was proposed recently along similar lines
- Sockets Direct Protocol: is it beneficial? Evaluated using micro-benchmarks and real applications: Multi-Tier Data-Centers and PVFS
- Shows benefits in the environments it is good for: communication-intensive environments such as PVFS
- Demonstrates environments it has yet to mature for: environments dominated by connection overhead, such as data-centers
Future Work
- Addressing the connection-time bottleneck in SDP: dynamic registered buffer pools, FMR techniques, etc.; QP pools (see the buffer-pool sketch below)
- Power-law networks
- Other applications: streaming and transaction processing
- Comparison with other high performance sockets
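As a rough illustration of the "dynamic registered buffer pools" idea, memory registration is expensive, so buffers are registered once and recycled to amortize the cost. The sketch below uses the modern libibverbs API; the pool structure and sizes are assumptions, not the authors' design.

```c
/* Rough sketch of a pre-registered buffer pool: registration (ibv_reg_mr) is
 * costly, so buffers are registered once up front and reused. Pool size and
 * buffer size are arbitrary; this is not the authors' design. */
#include <infiniband/verbs.h>
#include <stdlib.h>

#define POOL_BUFS 32
#define BUF_SIZE  (64 * 1024)

struct reg_buf {
    void          *data;
    struct ibv_mr *mr;      /* registration handle; mr->lkey goes into SGEs */
    int            in_use;
};

static struct reg_buf pool[POOL_BUFS];

/* Register all buffers once, against the given protection domain. */
int pool_init(struct ibv_pd *pd)
{
    for (int i = 0; i < POOL_BUFS; i++) {
        pool[i].data = malloc(BUF_SIZE);
        if (!pool[i].data)
            return -1;
        pool[i].mr = ibv_reg_mr(pd, pool[i].data, BUF_SIZE,
                                IBV_ACCESS_LOCAL_WRITE);
        if (!pool[i].mr)
            return -1;
        pool[i].in_use = 0;
    }
    return 0;
}

/* O(n) grab/release; a real implementation would keep a free list. */
struct reg_buf *pool_get(void)
{
    for (int i = 0; i < POOL_BUFS; i++)
        if (!pool[i].in_use) { pool[i].in_use = 1; return &pool[i]; }
    return NULL;   /* pool exhausted */
}

void pool_put(struct reg_buf *b) { b->in_use = 0; }
```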
Thank You!
For more information, please visit the NBC home page: http://nowlab.cis.ohio-state.edu
Network Based Computing Laboratory, The Ohio State University
Backup Slides
TCP Termination in SDP
[Figure: a browser on a personal notebook/computer speaks HTTP/HTML over Ethernet through the conventional sockets/TCP/IP stack to a proxy in the network service tier; the proxy terminates the TCP connection and forwards the request over SDP, with OS bypass, across InfiniBand to the blade web servers]