Interactions of 10-Gigabit Ethernet Protocol Stacks in Multi-core Environments


This presentation examines high-end computing trends, multicore architectures, and protocol interactions in the context of 10-Gigabit Ethernet: how different protocol stacks interact with multicore systems, and how that interaction affects application performance and network-processing efficiency.




Presentation Transcript


  1. An Analysis of 10-Gigabit Ethernet Protocol Stacks in Multi-core Environments
     G. Narayanaswamy (Dept. of Computer Science, Virginia Tech), P. Balaji (Mathematics and Computer Science, Argonne National Laboratory), W. Feng (Dept. of Computer Science, Virginia Tech)

  2. High-end Computing Trends
     • High-end Computing (HEC) systems continue to increase in scale and capability
     • Multicore architectures are a significant driving force for this trend: quad-core processors from Intel/AMD; IBM Cell, Sun Niagara, Intel Terascale processor
     • High-speed network interconnects: 10-Gigabit Ethernet (10GE), InfiniBand, Myrinet, Quadrics; different stacks use different amounts of hardware support
     • How do these two components interact with each other?

  3. Multicore Architectures
     • Multi-processor vs. multicore systems: not all of the processor hardware is replicated for multicore systems; hardware units such as the cache might be shared between the different cores
     • Multiple processing units are embedded on the same processor die, so inter-core communication is faster than inter-processor communication
     • On most architectures (Intel, AMD, Sun), all cores are equally powerful, which makes scheduling easier

  4. Interactions of Protocols with Multicores
     • Depending on how the stack works, different protocols have different interactions with multicore systems
     • Study based on host-based TCP/IP and iWARP
     • TCP/IP has significant interaction with multicore systems, with large impacts on application performance
     • The iWARP stack itself does not interact directly with multicore systems, but software libraries built on top of iWARP DO interact (buffering of data, copies)
     • Interaction similar to other high-performance protocols (InfiniBand, Myrinet MX, QLogic PSM)

  5. TCP/IP Interaction vs. iWARP Interaction
     • [Diagram: with TCP/IP, packet processing happens in the host stack, independent of the application process and statically tied to a single core; with iWARP, packet processing is offloaded to the network, and the library-level host processing is closely tied to the application process]
     • TCP/IP is in some ways more asynchronous or centralized with respect to host processing than iWARP (or other high-performance software stacks)
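A simple way to see the "statically tied to a single core" behavior on a typical Linux host (an assumption about the test setup, not something shown on the slides) is to look at /proc/interrupts: the row for the 10GE NIC reveals which core services its interrupts, and therefore where much of the host-side TCP/IP processing lands, regardless of the core the application process runs on.

```c
/* Hedged sketch, assuming a Linux host: dump /proc/interrupts so the
 * per-core interrupt counts for the 10GE NIC (and hence the core doing
 * most of the TCP/IP receive processing) can be inspected. */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/interrupts", "r");
    char line[1024];

    if (!f) {
        perror("/proc/interrupts");
        return 1;
    }
    /* Each row lists per-core interrupt counts for one IRQ source;
     * the row naming the NIC driver shows where its packets are handled. */
    while (fgets(line, sizeof line, f))
        fputs(line, stdout);
    fclose(f);
    return 0;
}
```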

  6. Presentation Layout
     • Introduction and Motivation
     • Treachery of Multicore Architectures
     • Application Process to Core Mapping Techniques
     • Conclusions and Future Work

  7. MPI Bandwidth over TCP/IP
     • [Charts: MPI bandwidth (Mbps) vs. message size (1 byte to 4 MB) on the Intel and AMD platforms, one curve per core (Core 0 to Core 3)]

  8. MPI Bandwidth over iWARP
     • [Charts: MPI bandwidth (Mbps) vs. message size (1 byte to 4 MB) on the Intel and AMD platforms, one curve per core (Core 0 to Core 3)]
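For context, the bandwidth curves on the last two slides come from an MPI-level microbenchmark. Below is a hedged sketch of a generic windowed bandwidth test of that kind (a burst of non-blocking sends acknowledged by the receiver), not necessarily the exact benchmark the authors ran; the message size and window size are illustrative only.

```c
/* Generic two-rank MPI bandwidth sketch: rank 0 streams a window of
 * non-blocking sends, rank 1 receives them and sends back a short ack.
 * Run with exactly 2 ranks, e.g. one per machine. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define WINDOW 64                       /* illustrative window size */

int main(int argc, char **argv)
{
    int rank;
    int size = (argc > 1) ? atoi(argv[1]) : (1 << 20);  /* bytes per message */
    MPI_Request req[WINDOW];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    char *buf = malloc(size);

    if (rank == 0) {
        double t0 = MPI_Wtime();
        for (int i = 0; i < WINDOW; i++)
            MPI_Isend(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &req[i]);
        MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
        /* Wait for the receiver's ack so the whole transfer is timed. */
        MPI_Recv(buf, 1, MPI_CHAR, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        double secs = MPI_Wtime() - t0;
        printf("%d bytes: %.2f Mbps\n", size,
               (double)size * WINDOW * 8.0 / secs / 1e6);
    } else if (rank == 1) {
        for (int i = 0; i < WINDOW; i++)
            MPI_Irecv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &req[i]);
        MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
        MPI_Send(buf, 1, MPI_CHAR, 0, 1, MPI_COMM_WORLD);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}
```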

  9. TCP/IP Interrupts and Cache Misses
     • [Charts: hardware interrupts per message (log scale) and percentage difference in L2 cache misses vs. message size, one curve per core (Core 0 to Core 3)]

  10. MPI Latency over TCP/IP (Intel Platform)
     • [Charts: small-message and large-message MPI latency vs. message size, one curve per core (Core 0 to Core 3)]

  11. Presentation Layout
     • Introduction and Motivation
     • Treachery of Multicore Architectures
     • Application Process to Core Mapping Techniques
     • Conclusions and Future Work

  12. Application Behavior Pre-analysis
     • A four-core system is effectively a 3.5-core system: a part of a core has to be dedicated to communication (interrupts, cache misses)
     • How do we schedule 4 application processes on 3.5 cores?
     • If the application is exactly synchronized, there is not much we can do; otherwise, we have an opportunity!
     • Study with GROMACS and LAMMPS
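One concrete way to act on this opportunity, assuming Linux and a standard MPI launcher, is for each rank to bind itself to a core from an explicit map. The sketch below is illustrative only; the core_map values are hypothetical and are not one of the combinations studied here.

```c
/* Hedged sketch of explicit process-to-core mapping: each MPI rank pins
 * itself to a core chosen from a user-supplied map, so communication-heavy
 * ranks can be placed on (or away from) the core doing TCP/IP processing. */
#define _GNU_SOURCE
#include <sched.h>
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    const int cores_per_node = 4;
    const int core_map[4] = {0, 2, 1, 3};   /* illustrative mapping only */
    cpu_set_t set;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int core = core_map[rank % cores_per_node];
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0)   /* 0 = this process */
        perror("sched_setaffinity");
    printf("rank %d bound to core %d\n", rank, core);

    /* ... application computation and MPI communication follow ... */

    MPI_Finalize();
    return 0;
}
```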

  13. GROMACS Overview
     • Developed by Groningen University
     • Simulates the molecular dynamics of biochemical particles
     • The root distributes a topology file corresponding to the molecular structure
     • Simulation time broken down into a number of steps; processes synchronize at each step
     • Performance reported as the number of nanoseconds of molecular interactions that can be simulated each day
     • Core mapping combinations:

       Core            0  1  2  3  4  5  6  7
       Combination A   0  4  2  6  7  3  5  1
       Combination B   0  2  4  6  5  1  3  7

  14. GROMACS: Random Scheduling
     • [Charts: GROMACS LZM performance (ns/day) for Combinations A and B over TCP/IP and iWARP; and per-core breakdown (%) of application computation, MPI_Wait, and other MPI calls on Machine 1 and Machine 2]

  15. GROMACS: Selective Scheduling
     • [Charts: GROMACS LZM performance (ns/day) for Combinations A, B, A', and B' over TCP/IP and iWARP; and per-core breakdown (%) of application computation, MPI_Wait, and other MPI calls on Machine 1 and Machine 2]

  16. LAMMPS Overview
     • Molecular dynamics simulator developed at Sandia
     • Uses spatial decomposition techniques to partition the simulation domain into smaller 3-D subdomains; each subdomain is allotted to a different process
     • Interaction is required only between neighboring subdomains, which improves scalability
     • Used the Lennard-Jones liquid simulation within LAMMPS
     • [Diagram: two four-core machines (Core 0 to Core 3 each) connected over the network]
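As an illustration of the spatial-decomposition idea (not LAMMPS's actual code), the hedged sketch below builds a 3-D Cartesian process grid with MPI and reports which subdomain each rank owns; neighbors for the boundary exchange would then be found per axis.

```c
/* Hedged sketch: decompose the simulation domain across ranks as a 3-D
 * Cartesian grid, so each rank owns one subdomain and talks only to
 * its neighboring subdomains. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int nprocs, rank;
    int dims[3] = {0, 0, 0}, periods[3] = {1, 1, 1}, coords[3];
    MPI_Comm cart;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Let MPI factor the process count into a balanced 3-D grid. */
    MPI_Dims_create(nprocs, 3, dims);
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1, &cart);

    MPI_Comm_rank(cart, &rank);
    MPI_Cart_coords(cart, rank, 3, coords);
    printf("rank %d owns subdomain (%d,%d,%d) of a %dx%dx%d grid\n",
           rank, coords[0], coords[1], coords[2], dims[0], dims[1], dims[2]);

    /* MPI_Cart_shift(cart, axis, 1, &lo, &hi) would give the neighbor
     * ranks used in the boundary exchange. */
    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}
```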

  17. LAMMPS: Random Scheduling
     • [Charts: LAMMPS communication time (seconds) for Combinations A and B over TCP/IP and iWARP; and per-core breakdown (%) of application time, MPI_Wait, MPI_Send, and other MPI calls on Machine 1 and Machine 2]

  18. LAMMPS: Intended Communication Pattern
     • [Diagram: in each step, every process posts MPI_Irecv() calls, issues MPI_Send() calls to its neighbors, waits on the receives with MPI_Wait(), performs its computation, and then begins the next exchange]
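To make the intended pattern concrete, here is a hedged sketch of one exchange step, written for a simple 1-D ring of neighbors rather than the full 3-D decomposition; buffer sizes, tags, and the neighbor layout are illustrative and not taken from LAMMPS.

```c
/* Hedged sketch of the intended exchange: post receives, send to the
 * neighbors, wait for the receives, then compute. Run with 2+ ranks. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int N = 1024;                 /* illustrative boundary size */
    int rank, nprocs, left, right;
    MPI_Request recv_req[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    left  = (rank - 1 + nprocs) % nprocs;
    right = (rank + 1) % nprocs;

    double *send_l = calloc(N, sizeof(double)), *send_r = calloc(N, sizeof(double));
    double *recv_l = calloc(N, sizeof(double)), *recv_r = calloc(N, sizeof(double));

    /* Post receives first so incoming boundary data has a landing place... */
    MPI_Irecv(recv_l, N, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &recv_req[0]);
    MPI_Irecv(recv_r, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &recv_req[1]);

    /* ...then send boundary data to both neighbors... */
    MPI_Send(send_r, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD);
    MPI_Send(send_l, N, MPI_DOUBLE, left,  0, MPI_COMM_WORLD);

    /* ...and wait for the receives before this step's computation. */
    MPI_Waitall(2, recv_req, MPI_STATUSES_IGNORE);

    /* computation on the subdomain would go here */

    free(send_l); free(send_r); free(recv_l); free(recv_r);
    MPI_Finalize();
    return 0;
}
```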

  19. LAMMPS: Actual Communication Pattern
     • [Diagram: slower and faster cores exchanging data via the MPI buffer, socket send buffer, socket receive buffer, and application receive buffer; the MPI_Send(), MPI_Wait(), and computation phases on the two cores drift out of step]
     • Out-of-sync communication between processes

  20. LAMMPS: Selective Scheduling
     • [Charts: LAMMPS communication time (seconds) for Combinations A, B, A', and B' over TCP/IP and iWARP; and per-core breakdown (%) of application time, MPI_Wait, MPI_Send, and other MPI calls on Machine 1 and Machine 2]

  21. Presentation Layout
     • Introduction and Motivation
     • Treachery of Multicore Architectures
     • Application Process to Core Mapping Techniques
     • Conclusions and Future Work

  22. Concluding Remarks and Future Work
     • Multicore architectures and high-speed networks are becoming prominent in high-end computing systems; the interaction of these components is important and interesting!
     • For TCP/IP, scheduling order drastically impacts performance; for iWARP, scheduling order has no overhead
     • Scheduling processes more intelligently allows significantly improved application performance, and it does not impact iWARP and other high-performance stacks, making the approach portable as well as efficient
     • Future work: dynamic process-to-core scheduling!

  23. Thank You
     Contacts: Ganesh Narayanaswamy: cnganesh@cs.vt.edu | Pavan Balaji: balaji@mcs.anl.gov | Wu-chun Feng: feng@cs.vt.edu
     For more information: http://synergy.cs.vt.edu | http://www.mcs.anl.gov/~balaji

  24. Backup Slides

  25. MPI Latency over TCP/IP (AMD Platform)
     • [Charts: small-message and large-message MPI latency vs. message size, one curve per core (Core 0 to Core 3)]
