These are exciting times for Mellanox, especially with Spectrum Ethernet switching. We are experiencing extreme momentum across many different verticals and use-cases, whether it’s a cloud company or a bank or any enterprise deploying open, scale-out infrastructure.
Ethernet Storage Fabrics (ESF) provide the fastest and most efficient networking solution for storage. ESF leverages the speed, flexibility, and cost efficiencies of Ethernet with the best switching hardware and software packaged in ideal form factors to provide performance, scalability, intelligence, high availability, and simplified management for storage.
An extreme use-case of ESF is High Performance Computing (HPC) and Artificial Intelligence (AI)/Deep Learning (DL). We have recently gained significant wins with customers and partners in HPC and AI. This is a testament to the fact that our customers realize the value that our deep experience and history in HPC bring to Ethernet.
In this blog, I’ll highlight our recent deployment with DownUnder GeoSolutions. I’ll also discuss our recently announced AI Cloud reference architecture with Nutanix and NVIDIA. In future blogs, I’ll highlight some of our other HPC/AI customers and partners who are enjoying the benefits of Mellanox Spectrum Ethernet switching.
DownUnder GeoSolutions McCloud Service
DownUnder GeoSolutions (DUG) recently announced their selection of Mellanox end-to-end Ethernet for their massive, exascale-focused HPC facility for seismic processing. This facility will scale to over 40,000 compute nodes, leveraging Mellanox high-throughput, low-latency 100G Ethernet switches and adapters.
You can find details about the DUG McCloud deployment in the Further reading list at the end of this post.
One of the most distinctive aspects of the DUG network is its combination of Mellanox Multi-Host ConnectX adapters and Spectrum switches.
Mellanox’s multi-host solution
The Multi-Host solution provides the following advantages –
- Up to 256 servers connected to a single 1U Spectrum SN2700 switch
- 4 servers share a single 50GbE connection through one ConnectX-5 100G Ethernet Multi-Host adapter
- Each server can burst up to 30 Gbps bandwidth
- Each server is guaranteed 12.5 Gbps bandwidth
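The numbers in this list fit together with some simple arithmetic. Here is a minimal sketch, assuming the guaranteed bandwidth is simply an even split of the shared 50GbE connection among the four servers on one adapter. The function and constant names are illustrative only and do not come from any Mellanox tool.

```python
# Figures taken from the list above.
SHARED_LINK_GBPS = 50      # one 50GbE connection per Multi-Host adapter
SERVERS_PER_ADAPTER = 4    # servers sharing that connection
SERVERS_PER_SWITCH = 256   # servers attached to a single SN2700


def guaranteed_gbps(link_gbps, servers):
    """Even share of the link when every server transmits at once (assumption)."""
    return link_gbps / servers


def links_per_switch(servers_per_switch, servers_per_adapter):
    """Number of shared connections needed to attach all servers to one switch."""
    return servers_per_switch // servers_per_adapter


share = guaranteed_gbps(SHARED_LINK_GBPS, SERVERS_PER_ADAPTER)
links = links_per_switch(SERVERS_PER_SWITCH, SERVERS_PER_ADAPTER)

print(f"Guaranteed per server: {share} Gbps")      # 12.5 Gbps
print(f"50GbE connections per switch: {links}")    # 64
```

When the other servers on the adapter are idle, a server is no longer limited to its even share, which is where the higher burst figure comes in.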
Efficiency was one of the main attractions of the solution for DUG: 50% fewer switches and 75% fewer cables. This saves DUG money on the network. More importantly, it allows DUG to pack the most network and server capacity possible into their data center footprint.
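To see where savings of this kind can come from, here is a rough sketch that counts server-facing cables and switches for a small group of servers. The baseline is an assumption made purely for illustration: one cable per server and a hypothetical 128 servers per switch. Neither baseline figure, nor the 1,024-server group size, comes from the DUG deployment. The sketch also ignores uplinks and any spine layer.

```python
def units_needed(servers, servers_per_unit):
    """Ceiling division: how many cables or switches are needed for the servers."""
    return -(-servers // servers_per_unit)


def percent_fewer(baseline, actual):
    """Reduction relative to the baseline, as a percentage."""
    return 100 * (baseline - actual) / baseline


SERVERS = 1024                      # hypothetical group of four 256-server pods

# Cables: one per server (assumed baseline) vs. one per four servers (Multi-Host).
cables_baseline = units_needed(SERVERS, 1)
cables_multihost = units_needed(SERVERS, 4)

# Switches: 128 servers per switch is an ASSUMED baseline, not a quoted figure.
switches_baseline = units_needed(SERVERS, 128)
switches_multihost = units_needed(SERVERS, 256)

print(f"Cables:   {percent_fewer(cables_baseline, cables_multihost):.0f}% fewer")     # 75% fewer
print(f"Switches: {percent_fewer(switches_baseline, switches_multihost):.0f}% fewer") # 50% fewer
```

The cable saving follows directly from four servers sharing one connection. The switch saving depends entirely on the baseline design you compare against.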
In addition, network performance is critical for the McCloud service at DUG. The Spectrum SN2700 provides performance that is not available in any other Ethernet switch. DUG is able to leverage these performance advantages by keeping each HPC workload local to a 256-node pod connected to a single Spectrum switch. The advantages include –
- Fair Traffic Distribution – all flows get fair bandwidth across the network
- Superior Microburst Absorption – especially critical for incast traffic to the storage nodes
- Lowest Latency – consistent 300ns latency, no matter the packet size
- Zero Packet Loss – full line-rate forwarding at all packet sizes
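To give a sense of what "line rate at all packet sizes" demands from a switch, the sketch below computes the theoretical maximum frame rate of a single 100GbE port. It uses the standard Ethernet per-frame overhead of 20 bytes on the wire (preamble, start-of-frame delimiter, and inter-frame gap). The function name is illustrative, and the frame sizes are just common reference points, not measurements from the DUG network.

```python
# Per-frame overhead on the wire: 7 B preamble + 1 B SFD + 12 B inter-frame gap.
ETHERNET_OVERHEAD_BYTES = 20


def line_rate_pps(link_gbps, frame_bytes):
    """Theoretical maximum frames per second on a link of the given speed."""
    bits_on_wire = (frame_bytes + ETHERNET_OVERHEAD_BYTES) * 8
    return link_gbps * 1e9 / bits_on_wire


for size in (64, 512, 1518):
    mpps = line_rate_pps(100, size) / 1e6
    print(f"{size:>5}-byte frames: {mpps:.1f} Mpps per 100GbE port")
# 64-byte frames work out to about 148.8 Mpps per port.
```

Small frames are the hard case: the smaller the packet, the more forwarding decisions per second the switch must make to keep the port full.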
The bottom line – DUG chose Mellanox End-to-End Ethernet for their McCloud HPC deployment because we are HPC experts. We provide huge efficiencies in the data center deployment and ensure the best network performance for the McCloud cluster, allowing DUG to provide a superior service while maximizing efficiencies and minimizing risk.
Artificial Intelligence Enterprise Cloud with Nutanix and NVIDIA
Moving on to another exciting development, Mellanox recently partnered with Nutanix and NVIDIA to provide an Enterprise Cloud for AI Reference Architecture. Given the massive amount of distributed processing needed for AI/DL, it’s clear that Mellanox’s HPC-ready Spectrum Ethernet switches bring value in this environment.
Any enterprise that wants to stay relevant in the 21st century is investing in AI/DL capabilities. The AI Cloud solution from Nutanix, NVIDIA, and Mellanox makes it easy for enterprises to quickly deploy and operate shared infrastructure for AI/DL. Advantages of our joint solution include –
- Simplified Operations and Troubleshooting – making it easy to deploy and operate
- Enterprise-grade Uptime, Backup/Restore, and Disaster Recovery
- Distributed Architecture with Linear Scaling – meet the needs of today and tomorrow
- Built-in Security – protect your data while allowing many users of the infrastructure
- Less Rack Space – consolidated platform for business-critical applications and AI
- Simplified Networking – automated provisioning and full network visibility
Mellanox provides a Simplified Networking environment for the AI Cloud. Even prior to the joint AI Cloud solution, Mellanox had won Elevate Partner Awards from Nutanix for our Nutanix Ready and Calm Blueprint solutions due to our integration with AHV, purpose-built HCI switches, and ability to support any workload.
Simplified Networking is critical for the AI Cloud. AI/DL infrastructure is expensive, so it must be simple for enterprises to provision services on it. That requires a network solution where all network provisioning is done automatically. Furthermore, the ability to troubleshoot and identify sub-optimal network operation is required in order to meet service-level agreements (SLAs) and maximize the utilization of the HCI and GPU investment. Mellanox provides a complete automation and visibility solution with its NEO plug-in for Nutanix Prism, as well as unique telemetry features in its Spectrum switching hardware.
Mellanox Ethernet switching on Nutanix
Beyond simplicity, the Mellanox Spectrum switches provide additional unique advantages to the AI Cloud, including –
- Best-in-class Performance – required for AI/DL workloads (see advantages in previous section)
- Accelerated Flash Storage – leveraging end-to-end RoCE – Remote Direct Memory Access (RDMA) over Converged Ethernet
- Easy Scaling – unique switch form factors to support any cluster size today or tomorrow
AI/DL environments require accelerated hardware to maximize their investments – whether it’s the GPUs in the NVIDIA DGXs or the end-to-end Ethernet solution from Mellanox. The accelerated hardware minimizes the time needed by Data Scientists and Deep Learning Engineers to train their models, significantly increasing the productivity of the infrastructure and the Data Science and Deep Learning teams. Furthermore, an enterprise-ready solution is required for simplified operation and always-on availability. The Nutanix AI Cloud solution with Mellanox and NVIDIA meets these needs by leveraging technology from market and technology leaders.
Mellanox’s Ethernet solutions are a perfect fit for HPC and AI Cloud solutions. We provide the performance, automation, and efficiencies required, making it easy to deploy and operate while ensuring you get the most out of your high-end infrastructure.
We expect the exciting times to continue for a long time. Please reach out if you want to learn more about our Spectrum Ethernet switching solutions.
Also, this article touched on only two examples of our HPC and AI/DL Cloud deployments. Stay tuned for future blogs highlighting more of our customers and partners!
Further reading –
- DUG supercharges massive HPC cloud service with Mellanox multi-host adapter
- Mellanox Powers Massive HPC Cloud Service for DownUnder Geosolutions
- Nutanix Enterprise Cloud for AI (with Mellanox and NVIDIA)
- Mellanox-Nutanix Partner Site
- Mellanox Ethernet Storage Fabrics
- Mellanox Spectrum Ethernet Switches
- Mellanox Multi-host Solutions
- Mellanox NEO Cloud Networking Orchestration and Management Software
- Tolly Performance Report – Mellanox
- Mellanox RDMA and RoCE for Ethernet Network Efficiency Performance
- Recommended Network Configuration Examples for RoCE Deployment