Mellanox is renowned for furthering development toward a more effective and efficient interconnect, and today we’re weaving intelligence into the interconnect fabric, improving our acceleration engines, and adding improved capabilities that further remove communication tasks from the CPU and dramatically increase system efficiency.
Historically, performance gains have been pursued within a CPU-centric mindset: individual hardware devices, drivers, middleware, and software applications were each developed separately to improve scalability and maximize throughput. That model is reaching its limits. As a new era of Co-Design moves the industry toward Exascale-class computing, creating synergies among all system elements is the only approach that can deliver significant performance improvements.
As Mellanox advances the Co-Design approach, we strive for more CPU offload capabilities and acceleration techniques while maintaining forward and backward compatibility between new and existing infrastructures. The result is nothing less than the world's most advanced interconnect, which continues to yield the most powerful and efficient supercomputers ever deployed.
ONLOAD technology, such as that attempted years ago in PathScale's InfiniPath and QLogic's TrueScale products, has long since been abandoned by its originators; the underlying IP has in fact been sold off twice. ONLOAD, or the "dumb NIC" approach, was developed to take advantage of additional CPU cores that the ecosystem of the time, not yet mature enough to exploit emerging multi-core processors, was leaving idle. For a short window, it made sense to tax those unused cores and extract some benefit from a very simple ONLOAD host channel adapter design.
Today, a system built on an ONLOAD network architecture cannot support as many applications and simultaneous end users, because CPU compute time is spent running the network stack and managing application communication overhead. Intel continues moving in a direction that reuses the acquired ONLOAD technology, which requires the CPU to perform the network operations. This approach actually demands even more CPUs to offset the processor cycles lost to the application's computational needs.
By offloading communication tasks from the CPU to the Mellanox interconnect with its OFFLOAD architecture, you free up CPU cycles for the computation tasks the application demands, which increases overall system efficiency and performance. This greatly increases the overall ROI of the system investment: more parallel tasks execute with greater efficiency, reducing job run times and increasing end-user productivity.
In a recently published paper from Sandia and the University of New Mexico, the effect of ONLOAD versus OFFLOAD capabilities was studied. Look closely at the left graph below; it shows how the network bandwidth you are able to achieve depends on the CPU.
Most HPC installations use mainstream CPUs and cannot budget for top-bin parts. The value that CPU offload brings to such a system is easy to see: achievable bandwidth has no dependency on CPU clock frequency. Note also how the left graph shows that, with ONLOAD, the CPU architecture plays a role in effective bandwidth at various message sizes. A CPU-centric ONLOAD architecture does not cope well with the obstacles of application scalability, because everything must be executed on the CPU.
Not only does Mellanox provide the right network hardware architecture for high performance computing, but additional application acceleration is achieved with optimized software, such as CORE-Direct™, PeerDirect™ and GPUDirect RDMA. Mellanox delivers ultra-low latency, high bandwidth, resilient solutions that enable higher cluster efficiency and provide scalability to tens-of-thousands of nodes.
The network is now a powerful co-processor, extending OFFLOAD technology beyond the endpoint with the Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)™.
Mellanox EDR 100Gb/s InfiniBand technology builds on over 15 years of experience building high-performance networks. Until now, the optimization efforts have mostly been focused on changes at the endpoints, e.g., HCA enhancements and host-based software changes. Today, such efforts involve switch hardware and software changes, including network management software, user interface, and enhanced communication libraries.
Mellanox “SHARP” technology improves the performance of collective operations by processing the data as it traverses the network, eliminating the need to send data multiple times between endpoints. This innovative approach will decrease the amount of data traversing the network as aggregation nodes are reached. Implementing the collective communication algorithms in the network also has additional benefits, including freeing up valuable CPU resources for computation rather than using them to process communication.
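To make the data-reduction effect concrete, here is a small illustrative sketch (not Mellanox's implementation, and the fanout and values are arbitrary assumptions): endpoints send their partial results up a tree of aggregation nodes, and each node reduces its children's data before forwarding, so every link carries a single reduced payload instead of one payload per endpoint.

```python
def in_network_reduce(endpoint_data, fanout=2, op=sum):
    """Reduce per-endpoint values through a tree of aggregation nodes.

    Returns (result, link_payloads): the reduced value and the total
    number of payloads that crossed network links, summed per tree level.
    """
    level = list(endpoint_data)   # leaves: one payload per endpoint
    link_payloads = len(level)    # endpoints -> first aggregation level
    while len(level) > 1:
        # Each aggregation node reduces up to `fanout` children in-switch,
        # forwarding one reduced payload instead of `fanout` raw ones.
        level = [op(level[i:i + fanout]) for i in range(0, len(level), fanout)]
        link_payloads += len(level)
    return level[0], link_payloads

values = [1, 2, 3, 4, 5, 6, 7, 8]        # one partial sum per endpoint
total, payloads = in_network_reduce(values)
print(total, payloads)                    # -> 36 15
```

The key point the sketch captures is that payload count shrinks at every level of the tree, rather than all endpoint data converging on a single root.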
Mellanox Switch-IB 2 includes features to improve fabric scalability and resilience, above and beyond its improvements to the base InfiniBand performance. These improvements include support for adaptive routing and adaptive routing notification, InfiniBand routing capabilities, and several other enhanced reliability features.
Maximizing system level performance with Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)™
The performance of collective operations, such as global synchronization and reduction, has a large effect on the performance and scalability of many large-scale MPI- and PGAS-based applications. “SHARP” extends the CORE-Direct capabilities with switch-level support for collective operations. This in-network aggregation minimizes the data’s network path, resulting in extremely low latency and reduced network operations.
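The "reduced network operations" claim can be illustrated with a hedged back-of-the-envelope model (these are illustrative step counts, not measured SHARP figures): a host-based recursive-doubling allreduce performs roughly log2(N) exchange steps through the endpoints, while an in-network reduction traverses the switch tree once up and once down. The radix of 36 below reflects a 36-port switch and is an assumption of the model.

```python
import math

def host_allreduce_steps(n_endpoints):
    """Recursive-doubling allreduce: about log2(N) host exchange steps."""
    return math.ceil(math.log2(n_endpoints))

def in_network_steps(n_endpoints, switch_radix=36):
    """One pass up and one pass down a switch tree of the given radix."""
    tree_depth = math.ceil(math.log(n_endpoints, switch_radix))
    return 2 * tree_depth

for n in (64, 1024, 16384):
    print(n, host_allreduce_steps(n), in_network_steps(n))
```

Under this model the host-based step count keeps growing with endpoint count, while the in-network traversal grows only with the (much shallower) depth of the switch tree.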
MiniFE is a finite-element mini-application that implements kernels representative of implicit finite-element applications. MiniFE is available from NERSC (the National Energy Research Scientific Computing Center).
The figure below shows the performance improvements to the MiniFE application that can be expected and achieved by leveraging the SHARP capabilities of Switch-IB 2.
For applications that use collective operations, the performance of those operations is often critical to overall performance, and a key limiting factor for scalability. This occurs because all communication endpoints implicitly interact with each other, with serialized data exchange taking place between endpoints.
Additionally, the explicit coupling between communication endpoints tends to magnify the effects of system noise on the parallel applications that use them, by delaying one or more data exchanges, resulting in further application scalability challenges.
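A small simulation makes the noise-amplification effect visible (the 1% delay probability and 1 ms delay are illustrative assumptions, not measured data): a synchronizing collective finishes only when its slowest rank arrives, so the chance that *some* rank was delayed grows with the endpoint count.

```python
import random

def collective_delay(n_ranks, p_noise=0.01, delay_ms=1.0, rng=None):
    """Completion delay of one collective: the max of per-rank delays."""
    rng = rng or random.Random(0)
    return max(delay_ms if rng.random() < p_noise else 0.0
               for _ in range(n_ranks))

def fraction_delayed(n_ranks, trials=2000):
    """Fraction of collectives hit by noise on at least one rank."""
    rng = random.Random(42)
    hits = sum(collective_delay(n_ranks, rng=rng) > 0 for _ in range(trials))
    return hits / trials

for n in (8, 128, 2048):
    print(n, fraction_delayed(n))
```

With 8 ranks only a few percent of collectives are delayed; with thousands of ranks virtually every collective absorbs the worst single-rank delay, which is exactly the coupling effect described above.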
Because of the large effect of collective operations on overall application performance and scalability, Mellanox Technologies has invested considerable effort in optimizing the performance of such operations. This includes incremental hardware enhancements to the Host Channel Adapter (HCA) with the CORE-Direct technology, as well as algorithmic and implementation work performed in the context of the Fabric Collective Accelerator (FCA) software library. "SHARP" resides within the switch and optimizes collective operations beyond the endpoint, improving the performance of HPC applications by up to 10X.
Switch-IB 2 is the first InfiniBand switch to employ this technology, which fundamentally enables the switch to operate as a co-processor and introduces more intelligence and offloading capabilities to the network fabric. Only Mellanox’s offload architecture will be able to implement such capabilities as “SHARP” to influence the performance of operations such as collectives. Another critical step in paving the road to Exascale – delivered by Mellanox.
If you’re thinking about using, moving, storing or retrieving data, Mellanox already has it optimized and accelerated! Check out http://www.mellanox.com to learn more!