Deep Learning Infrastructure built right

Why Mellanox Spectrum Ethernet switches are the ideal interconnect for Deep Learning infrastructure.

In any good system design, it is important to maximize the performance of the most critical (and often the most expensive) component. In the case of Deep Learning infrastructures, the performance of the specialized compute elements such as GPUs must be maximized.

GPUs have improved compute performance 300-fold! Even so, deep learning training workloads are so resource-intensive that they need to be distributed and scaled out across multiple GPUs. In such a distributed environment, the network is the critical part of the infrastructure that determines overall system performance. Using legacy networks for Deep Learning workloads is like trying to drive a race car through a traffic jam. A race car requires a highly tuned, banked race track to run at top speed!

What attributes make an interconnect ideal for Deep Learning Infrastructure?

Consistent Performance

Distributed Deep Learning workloads are characterized by a huge thirst for data and the need to communicate intermediate results between all nodes on a regular basis to keep the applications from stalling. As such, significant performance gains are possible using high-bandwidth, hardware-accelerated RDMA over Converged Ethernet (RoCE) based GPU-to-GPU communications that support broadcast, scatter, gather, reduce, and all-to-all patterns. GPUs also need to read and process enormous volumes of training data from storage endpoints. The interconnect fabric that glues the distributed system together should reliably and quickly transport packets between GPUs and between GPUs and storage.

Mellanox Spectrum Ethernet switches deliver high bandwidth and consistent performance with:

  • Fully shared monolithic packet buffer that absorbs traffic bursts better, without dropping packets
  • Consistent cut-through performance and low latency
  • Intelligent Explicit Congestion Notification (ECN) based congestion management control loop that regulates traffic at a granular flow level to mitigate congestion
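
To illustrate the general idea behind ECN-based congestion management, here is a minimal sketch of threshold-based ECN marking in the style of RED/WRED. It is a generic illustration, not Spectrum's actual algorithm: the function name, the thresholds `k_min_kb` and `k_max_kb`, and the ceiling `p_max` are all hypothetical values chosen for the example.

```python
def ecn_mark_probability(queue_kb: float,
                         k_min_kb: float = 150.0,   # hypothetical lower threshold
                         k_max_kb: float = 1500.0,  # hypothetical upper threshold
                         p_max: float = 0.1) -> float:
    """Probability that a packet is marked Congestion Experienced (CE).

    Below k_min_kb nothing is marked, above k_max_kb everything is marked,
    and in between the probability rises linearly from 0 to p_max.
    """
    if queue_kb <= k_min_kb:
        return 0.0
    if queue_kb >= k_max_kb:
        return 1.0
    return p_max * (queue_kb - k_min_kb) / (k_max_kb - k_min_kb)


if __name__ == "__main__":
    for depth in (100, 825, 2000):
        print(depth, ecn_mark_probability(depth))
```

Senders that receive marked packets reduce their rate before the queue overflows, which is why the marking decision needs an accurate view of queue occupancy.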

Commodity off-the-shelf merchant silicon-based switches use fragmented packet buffers made of small buffer slices that cannot absorb high-bandwidth traffic bursts. Congestion management mechanisms break down in switches with these fragmented buffers. Additionally, without a tight ECN congestion management mechanism, such switches aggravate congestion by sending pause frames prematurely and blocking the network. With an unregulated flow of traffic and packet drops, commodity switches are unable to deliver the consistent low latencies required to maximize Deep Learning cluster performance.
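
The difference in burst absorption can be sketched with a toy model. This is a deliberate simplification under stated assumptions: the function `dropped_kb` is hypothetical, the fragmented case assumes one equal-sized slice per port, and the buffer and burst sizes are made-up numbers, not measurements of any real switch.

```python
def dropped_kb(burst_per_port_kb, total_buffer_kb, shared):
    """KB dropped when every port receives its burst at the same instant.

    shared=True  : all ports draw from one pool of total_buffer_kb.
    shared=False : the buffer is split into equal slices, one per port
                   (an assumption made for this illustration).
    """
    if shared:
        return max(0, sum(burst_per_port_kb) - total_buffer_kb)
    slice_kb = total_buffer_kb // len(burst_per_port_kb)
    return sum(max(0, burst - slice_kb) for burst in burst_per_port_kb)


if __name__ == "__main__":
    bursts = [12_000, 1_000, 1_000, 1_000]   # one hot port, three quiet ports
    total = 16_000                           # hypothetical 16 MB buffer
    print("shared drops    :", dropped_kb(bursts, total, shared=True))
    print("fragmented drops:", dropped_kb(bursts, total, shared=False))
```

In this model the shared pool absorbs the whole 12 MB burst because the quiet ports leave room, while the sliced buffer drops everything beyond the hot port's 4 MB slice even though most of the buffer sits idle.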

Intelligent Load Balancing

Distributed Deep Learning systems should be well balanced to deliver best-in-class scale-out performance. Leaf-Spine networks leverage Layer-3 Equal Cost Multi-Path (ECMP) to balance traffic and deliver the high cross-sectional bandwidth necessary for scaling out. Mellanox Spectrum Ethernet switches enable high cross-sectional bandwidth:

  • Spectrum-based switches use their high-performance packet buffer architecture to share available switch bandwidth fairly across ports
  • Spectrum implements flexible packet header hashing algorithms that enable it to distribute traffic flows evenly across Layer-3 ECMP paths
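
As a rough illustration of how header hashing spreads flows across equal-cost paths, the sketch below hashes a flow's 5-tuple and takes the result modulo the number of paths. It is a generic example, not Spectrum's hashing implementation: the function `ecmp_path`, the choice of CRC32, and the addresses are assumptions made for the illustration. The destination port 4791 is the UDP port used by RoCEv2.

```python
import zlib
from collections import Counter


def ecmp_path(src_ip, dst_ip, src_port, dst_port, protocol, num_paths):
    """Pick a path index for a flow by hashing its 5-tuple.

    All packets of one flow hash to the same path, which keeps them in order.
    """
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{protocol}".encode()
    return zlib.crc32(key) % num_paths


if __name__ == "__main__":
    # Hypothetical flows: same endpoints, varying UDP source port for entropy.
    counts = Counter(
        ecmp_path("10.0.0.1", "10.0.1.1", sport, 4791, 17, num_paths=4)
        for sport in range(49152, 49152 + 4000)
    )
    for path in sorted(counts):
        print(f"path {path}: {counts[path]} flows")
```

A hash that uses too few header fields, or fields that rarely vary, sends many flows down the same path, so which fields feed the hash matters as much as the hash function itself.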

Commodity off-the-shelf merchant silicon-based switches have fairness issues that can result in traffic imbalance. For example, in a simple 3:1 oversubscription test with three senders sending traffic to the same destination, one of the senders often hogs 50% of the bandwidth, leaving each of the other nodes with only ~17%, where a fair split would give each sender about 33%. These performance variations caused by traffic imbalance can, in turn, degrade overall distributed system performance.
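
One way to put a number on the imbalance in the 3:1 example is Jain's fairness index, which equals 1.0 when all senders get the same share and approaches 1/n when one sender takes everything. The index is a metric brought in for this illustration and is not part of the test described above; the function name is hypothetical.

```python
def jain_fairness(shares):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2)."""
    n = len(shares)
    total = sum(shares)
    squares = sum(x * x for x in shares)
    return (total * total) / (n * squares)


if __name__ == "__main__":
    observed = [50, 17, 17]   # bandwidth shares (%) from the 3:1 example
    fair = [100 / 3] * 3      # equal split between three senders
    print("observed:", round(jain_fairness(observed), 3))
    print("fair    :", round(jain_fairness(fair), 3))
```

For the shares quoted in the example the index comes out at roughly 0.76, against 1.0 for an even split.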

Comprehensive Visibility for Deep Learning Infrastructure

It is critical to keep Deep Learning Infrastructure up and running to get the most out of it. Having native and built-in telemetry in the interconnect will also help with capacity planning and improve resource utilization.

With Mellanox What Just Happened™ (WJH), network operators can dramatically improve mean time to issue resolution and increase uptime. Mellanox Spectrum Ethernet switches provide rich contextual and event-based telemetry data that can help quickly drill down into application performance issues. With Mellanox WJH, operators can monitor infrastructure utilization, remove performance bottlenecks and plan resource capacity.

Commodity off-the-shelf merchant silicon-based switches are not designed to provide granular network visibility. As a result, network operators are forced to collect data centrally and apply predictive methods that can only guess at the root cause of issues. This creates a centralized choke point, and such solutions cannot scale efficiently to support 25/100GbE speeds.

The Bottom Line

The network is the critical element that unleashes the power of specialized Deep Learning infrastructure. Mellanox Spectrum Ethernet switches, with consistent performance, intelligent load balancing, and comprehensive visibility, are the ideal interconnect for Deep Learning applications. Use Mellanox Spectrum Ethernet switches to build your Deep Learning Infrastructure.

About Karthik Mandakolathur

Karthik is a Senior Director of Product Marketing at Mellanox. Karthik has been in the networking industry for over 15 years. Before joining Mellanox, he held product management and engineering positions at Cisco, Broadcom and Brocade. He holds multiple U.S. patents in the area of high performance switching architectures. He earned an MBA from The Wharton School, MSEE from Stanford and BSEE from Indian Institute of Technology.