Enabling HPC and AI Cloud with Ethernet Switching

Why DownUnder GeoSolutions, NVIDIA, and Nutanix are choosing Mellanox Ethernet switches for their HPC and AI/DL cloud needs.

These are exciting times for Mellanox, especially with Spectrum Ethernet switching.  We are experiencing extreme momentum across many different verticals and use-cases, whether it’s a cloud company or a bank or any enterprise deploying open, scale-out infrastructure.

Ethernet Storage Fabrics (ESF) provide the fastest and most efficient networking solution for storage.  ESF leverages the speed, flexibility, and cost efficiencies of Ethernet with the best switching hardware and software packaged in ideal form factors to provide performance, scalability, intelligence, high availability, and simplified management for storage.

An extreme use-case of ESF is High Performance Computing (HPC) and Artificial Intelligence (AI)/Deep Learning (DL).  We have recently gained significant wins with customers and partners in HPC and AI.  This is a testament to the fact that our customers realize the value that our deep experience and history in HPC bring to Ethernet.

In this blog, I’ll highlight our recent deployment with DownUnder GeoSolutions.  I’ll also discuss our recently announced AI Cloud reference architecture with Nutanix and NVIDIA.  In future blogs, I’ll highlight some of our other HPC/AI customers and partners who are enjoying the benefits of Mellanox Spectrum Ethernet switching.

DownUnder GeoSolutions McCloud Service

DownUnder GeoSolutions (DUG) recently announced their selection of Mellanox end-to-end Ethernet for their massive, exascale-focused HPC facility dedicated to seismic processing.  This facility will scale to over 40,000 compute nodes, leveraging Mellanox high-throughput, low-latency 100G Ethernet switches and adapters.

You can find details about the DUG McCloud deployment here –

DUG supercharges massive HPC cloud service with Mellanox multi-host adapter

Mellanox Powers Massive HPC Cloud Service for DownUnder Geosolutions

One of the most distinctive aspects of the DUG network is its combination of Mellanox Multi-Host ConnectX adapters and Spectrum switches.

Mellanox’s multi-host solution

The Multi-Host solution provides the following advantages –

Efficiency was one of the main things that attracted DUG to the solution: 50% fewer switches and 75% fewer cables.  This gives DUG cost savings on the network.  More importantly, it allows DUG to pack the most network and server capacity possible into their data center footprint.
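As a rough sanity check on the cable figure, the sketch below counts switch-facing cables for a cluster with and without a shared adapter. It assumes that four hosts share one Multi-Host adapter and one cable; that ratio and the function name `cables_needed` are illustrative assumptions, not details of the DUG deployment. The switch reduction is not modeled, since it depends on port counts this post does not give.

```python
import math

def cables_needed(nodes: int, hosts_per_adapter: int = 1) -> int:
    """Switch-facing cables when `hosts_per_adapter` hosts share one adapter and cable."""
    return math.ceil(nodes / hosts_per_adapter)

nodes = 40_000  # facility scale quoted in the post
baseline = cables_needed(nodes)                     # one adapter and cable per node
shared = cables_needed(nodes, hosts_per_adapter=4)  # ASSUMPTION: 4 hosts per adapter
reduction = 1 - shared / baseline
print(f"{baseline} cables -> {shared} cables ({reduction:.0%} fewer)")
```

Under that assumed 4:1 sharing ratio, the cable count drops by three quarters, which is consistent with the 75% figure quoted above.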

In addition, network performance is critical for the McCloud service at DUG.  The Spectrum SN2700 provides performance that is not available in any other Ethernet switch.  DUG is able to leverage these performance advantages by creating HPC workloads local to the 256-node pods connected to a single Spectrum switch.  The advantages include:

  • Fair Traffic Distribution – all flows get fair bandwidth across the network
  • Superior Microburst Absorption – especially critical for incast traffic to the storage nodes
  • Lowest Latency – consistent 300ns latency, no matter the packet size
  • Zero Packet Loss – full line-rate forwarding at all packet sizes
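To give a sense of what line-rate forwarding at all packet sizes demands, the sketch below computes the theoretical maximum frame rate of one 100G Ethernet port for a few frame sizes. It uses the standard 20 bytes of per-frame overhead on the wire (preamble, start-of-frame delimiter, and inter-frame gap). The function name and the chosen frame sizes are illustrative, and this is a calculation, not a measurement of any switch.

```python
PER_FRAME_OVERHEAD_BYTES = 20  # 7 B preamble + 1 B SFD + 12 B inter-frame gap

def max_frames_per_second(link_bps: float, frame_bytes: int) -> float:
    """Theoretical maximum frames per second on an Ethernet link for a given frame size."""
    return link_bps / ((frame_bytes + PER_FRAME_OVERHEAD_BYTES) * 8)

# Illustrative frame sizes: minimum, mid-size, and maximum standard Ethernet frames.
for size in (64, 512, 1518):
    mpps = max_frames_per_second(100e9, size) / 1e6
    print(f"{size:>5}-byte frames: {mpps:7.2f} Mpps")
```

Small frames are the demanding case: at 64 bytes, a single 100G port must forward roughly 148.8 million frames per second to avoid dropping traffic.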

The bottom line: DUG chose Mellanox End-to-End Ethernet for their McCloud HPC deployment because we are HPC experts.  We provide huge efficiencies in the data center deployment as well as ensure the best network performance for the McCloud cluster, allowing DUG to provide a superior service while maximizing efficiencies and minimizing risk.

Artificial Intelligence Enterprise Cloud with Nutanix and NVIDIA

Moving on to another exciting development, Mellanox recently partnered with Nutanix and NVIDIA to provide an Enterprise Cloud for AI Reference Architecture.  Due to the massive amount of distributed processing needed for AI/DL, it’s clear that Mellanox’s HPC-ready Spectrum Ethernet switches bring value in this environment.

Any enterprise that wants to stay relevant in the 21st century is investing in AI/DL capabilities.  The AI Cloud solution from Nutanix, NVIDIA, and Mellanox makes it easy for enterprises to quickly deploy and operate shared infrastructure for AI/DL.  Advantages of our joint solution include:

  • Simplified Operations and Troubleshooting – making it easy to deploy and operate
  • Enterprise-grade Uptime, Backup/Restore, and Disaster Recovery
  • Distributed Architecture with Linear Scaling – meet the needs of today and tomorrow
  • Built-in Security – protect your data while allowing many users of the infrastructure
  • Less Rack Space – consolidated platform for business-critical applications and AI
  • Simplified Networking – automated provisioning and full network visibility

Mellanox provides a Simplified Networking environment for the AI Cloud.  Even prior to the joint AI Cloud solution, Mellanox had won Elevate Partner Awards from Nutanix for our Nutanix Ready and Calm Blueprint solutions due to our integration with AHV, purpose-built HCI switches, and ability to support any workload.

Simplified Networking is critical for the AI Cloud.  AI/DL infrastructure is expensive.  It must be simple for enterprises to provision services on the infrastructure, which requires a network solution where all network provisioning is done automatically.  Furthermore, the ability to troubleshoot and identify sub-optimal network operation is required in order to meet service-level agreements (SLAs) and maximize the utilization of the HCI and GPU investment.  Mellanox provides a complete automation and visibility solution with its NEO plug-in for Nutanix Prism, as well as unique telemetry features in its Spectrum switching hardware.

Mellanox Ethernet switching on Nutanix

Beyond simplicity, the Mellanox Spectrum switches provide additional unique advantages to the AI Cloud, including:

  • Best-in-class Performance – required for AI/DL workloads (see advantages in previous section)
  • Accelerated Flash Storage – leveraging end-to-end RoCE – Remote Direct Memory Access (RDMA) over Converged Ethernet
  • Easy Scaling – unique switch form factors to support any cluster size today or tomorrow

AI/DL environments require accelerated hardware to maximize their investments, whether it’s the GPUs in the NVIDIA DGXs or the end-to-end Ethernet solution from Mellanox.  The accelerated hardware minimizes the time needed by Data Scientists and Deep Learning Engineers to train their models, significantly increasing the productivity of the infrastructure and the Data Science and Deep Learning teams.  Furthermore, an enterprise-ready solution is required for simplified operation and always-on availability.  The Nutanix AI Cloud solution with Mellanox and NVIDIA meets these needs by leveraging technology from market and technology leaders.

Conclusion

Mellanox’s Ethernet solutions are a perfect fit for HPC and AI Cloud solutions.  We provide the performance, automation, and efficiencies required, making it easy to deploy and operate while ensuring you get the most out of your high-end infrastructure.

We expect the exciting times to continue for a long time.  Please reach out if you want to learn more about our Spectrum Ethernet switching solutions.

Also, this article touched on only two of our examples of HPC and AI/DL Cloud.  Stay tuned for future blogs highlighting more of our customers and partners!

About Bill Webb

Bill Webb is Director of Ethernet Switching – Americas at NVIDIA Networking. In this role, he evangelizes the benefits of NVIDIA’s Mellanox Ethernet switch portfolio in scale-out data center, storage, cloud computing, and AI/accelerated deep learning environments. Bill has spent over 20 years in the networking industry in a variety of sales, engineering, and management roles. Prior to NVIDIA, Bill worked at Concurrent (now Vecima), where he introduced a Ceph-based scale-out storage product for media streaming, and at Ciena, where he led a team developing first-generation Software Defined Networking applications. Bill started his career at Nortel Networks and later worked at several start-ups building fiber-to-the-premises technology.
