In any good system design, it is important to maximize the performance of the most critical (and often the most expensive) component. In the case of Deep Learning infrastructures, the performance of the specialized compute elements such as GPUs must be maximized.
GPUs have improved compute performance 300-fold! Even so, deep learning training workloads are so resource-intensive that they need to be distributed and scaled out across multiple GPUs. In such a distributed environment The Network is the critical part of the infrastructure that determines the overall system performance. Using legacy networks for Deep Learning workloads is like trying to drive a race car through a traffic jam. A race car requires a highly tuned, banked race track to run at top speed!
Distributed Deep Learning workloads are characterized by a huge thirst for data and the need to communicate intermediate results between all nodes on a regular basis to keep the applications from stalling. As such, significant performance gains are possible using high bandwidth, hardware accelerated RDMA over Converged Ethernet(RoCE) based GPU-GPU communications that support broadcast, scatter, gather, reduce, and all-to-all patterns. GPUs also need to read and process enormous volumes of training data from storage endpoints. The interconnect fabric that glues the distributed system together should reliably and quickly transport the communication packets between the GPUs and between GPUs and storage.
Mellanox Spectrum Ethernet switches deliver high bandwidth and consistent performance with:
Commodity off-shelf merchant silicon-based switches use fragmented packet buffers that are made of small packet buffer slices that are unable to absorb high bandwidth traffic bursts. The congestion management mechanisms are broken in switches that have these fragmented buffers. Additionally, without a tight ECN congestion management mechanism, such switches aggravate congestion by sending pause frames prematurely and blocking the network. With an unregulated flow of traffic and packet drops, commodity switches are unable to deliver the consistent low latencies required to maximize the Deep Learning cluster performance.
Distributed Deep Learning systems should be well balanced to bring forth best in class scale-out performance. Leaf-Spine networks leverage Layer-3 Equal Cost Multi-Path (ECMP) to balance and deliver high cross-sectional bandwidth necessary for scaling out. Mellanox Spectrum Ethernet switches enable high cross-sectional bandwidth:
Commodity off-shelf merchant silicon-based switches have fairness issues that can result in traffic imbalance. For example, in a simple 3:1 oversubscription test with three senders sending traffic to the same destination, one of the senders often hogs 50% of the bandwidth, leaving each of the other nodes with only ~17%. These performance variations caused by traffic imbalance, in turn, can deteriorate overall distributed system performance.
It is critical to keep Deep Learning Infrastructure up and running to get the most out of it. Having native and built-in telemetry in the interconnect will also help with capacity planning and improve resource utilization.
With Mellanox What Just Happened™ (WJH), network operators can dramatically improve mean time to issue resolution and increase uptime. Mellanox Spectrum Ethernet switches provide rich contextual and event-based telemetry data that can help quickly drill down into application performance issues. With Mellanox WJH, operators can monitor infrastructure utilization, remove performance bottlenecks and plan resource capacity.
Commodity off-shelf merchant silicon-based switches are not designed to provide granular network visibility. As a result, networks operators are forced to collect data centrally and apply predictive methods to only guess the root-cause of issues. This creates a centralized choke point and such solutions cannot efficiently scale to support 25/100GbE speeds.
The network is the critical element that unleashes the power of specialized Deep Learning infrastructure. Mellanox Spectrum Ethernet switches with consistent performance, intelligent load balancing, and comprehensive visibility is the ideal interconnect of choice for Deep Learning applications. Use Mellanox Spectrum Ethernet switches to build your Deep Learning Infrastructure.