Scaling Out the Deep Learning Cloud Efficiently

The Duchess of Windsor famously said that you can never be too rich or too thin. A similar observation is true when trying to match deep learning applications and compute resources: You can never have too much horsepower.

Intractable problems in fields as diverse as finance, security, medical research, resource exploration, self-driving vehicles, and defense are being solved today by training a complex neural network how to behave rather than programming a more traditional computer to take explicit steps. Even though the discipline is still relatively young, the results have been astonishing.

A typical deep learning process must start a large data repository for training and models to begin training exercises which are used to develop a trained model. The next stage, Inference is then conducted on the test data to determine the results. — *Figure 1. Typical deep learning process.*

The training process required to take a deep learning model and turn it into a computerized savant is extremely resource intensive. The basic building blocks for the necessary operations are GPUs. Though they are already powerful and getting more so all the time, the kinds of applications that I previously identified take whatever you can throw at them and ask for more.

To achieve the necessary horsepower, the GPUs must be used in parallel, and therein lies the rub. The easiest way to bring more GPUs to bear is to simply add them to the system. However, this scale-up approach has some real-world limitations. You can only get a certain number of these powerful beasts into a single system within reasonable physical, electrical, and power dissipation constraints. If you want more than that—and you do—you must scale out.

This means providing multiple nodes in a cluster. For this to be useful, the GPUs must be shared among the nodes. Problem solved, right? It can be, but this approach brings with it a new set of challenges. The basic challenge is that just combining a whole bunch of compute nodes into a large cluster—and making them work together seamlessly—is not simple. In fact, if it is not done properly, the performance could become worse as you increase the number of GPUs, and the cost could become unattractive.

NVIDIA has partnered with One Convergence to solve the problems associated with efficiently scaling on-premises or bare metal cloud deep learning systems. NVIDIA end-to-end Ethernet solutions exceed the most demanding criteria and leave the competition in the dust. For example, you can easily see the performance advantages with TensorFlow over an NVIDIA 100 GbE–network versus a 10 GbE–network, both taking advantage of RDMA (Figure 2).

Chart shows 6.5x faster training with 100 GbE on VGG16 Training Synthetic data, 32 GPUs, over 10 GbE and 25 GbE. — *Figure 2. NVIDIA ConnectX-5 acceleration of TensorFlow training workloads.*

RDMA over Converged Ethernet (RoCE) is a standard protocol that enables RDMA’s efficient data transfer over Ethernet networks. This allows transport offload with hardware RDMA engine implementation and superior performance. While distributed TensorFlow takes full advantage of RDMA to eliminate processing bottlenecks, even with large-scale images, the NVIDIA 100GbE network delivers the expected performance and exceptional scalability from the 32 NVIDIA P100 GPUs. For both 25 GbE and 100 GbE, it’s evident that people still using 10 GbE are falling short of any return on investment they might have thought they were achieving.

Diagram shows architecture of initiator server and target server over RDMA, with the RoCE NIC and driver. — *Figure 3. RDMA over Converged Ethernet lowers latency and improves CPU utilization, removing bottlenecks to improve training and inferencing times.*

Beyond the performance advantages, the economic benefits of running AI workloads over NVIDIA 25, 50, or 100 GbE are substantial. NVIDIA Spectrum switches and NVIDIA ConnectX SmartNICs deliver an unbeatable performance at an even more unbeatable price point, yielding an outstanding ROI. With flexible port counts and cable options allowing up to 64 fully redundant 10/25/50/100 GbE ports in a 1U rack space, NVIDIA end-to-end Ethernet solutions are a game changer for state-of-the-art data centers that wish to maximize the value of their data.

The performance and cost of the system are the ticket in the door, but an attractive on-premises deep learning platform must go further, much further:

Provide a mechanism to share the GPUs without users or applications monopolizing the resources or getting starved.
Ensure that the GPUs are fully used as much as possible. These beasts are expensive!
Automate the process of identifying the resources on each node and making them immediately useful without a lot of manual intervention or tuning.
Make the system easy to use overall, as experts are always the minority in any population.
Base the system on best-in-class open platforms to enhance ease of use and to enable rapid integration of better frameworks and algorithms as they become available.

Many of these challenges have been overcome for organizations that could offload compute needs to the public cloud. For those companies that cannot take this path for regulatory, competitive, security, bandwidth, or cost reasons, there has not been a satisfactory solution.

To achieve the goals highlighted earlier, One Convergence has created a full-stack, deep-learning-as-a-service application called DKube that addresses these challenges for on-premises and bare metal cloud users.

The One Convergence DKube system dashboard displays statistics of the current state of GPU utilization. — *Figure 4. One Convergence DKube system dashboard.*

DKube provides a variety of valuable and integrated capabilities:

Network hardware abstraction: It abstracts the underlying networked hardware consisting of compute notes, GPUs, and storage, and allows them to be accessed and managed in a fully distributed, composable manner. The complexity of the topology, device details, and utilization are all handled in a way that makes on-premises cloud operation as simple as—and in some ways simpler than—the public cloud. The system can be scaled out by adding nodes to the cluster, and the resources on the nodes are automatically recognized and made useful immediately.
Efficient sharing: It allows the resources, especially the expensive GPU resources, to be efficiently shared among the users of the application. The GPUs are allocated on-demand and are available when not actively being used by a job.
Open platforms: It is based on best-in-class open platforms, which make the components familiar to data scientists. DKube is based on the containerized standard Kubernetes, and is compatible with Kubeflow, supporting Jupyter, TensorFlow, and PyTorch—with more to come. It integrates the components into a unified system, guiding the workflow and ensuring that the pieces work together in a robust and predictable way.
Ease of use: It provides an out-of-the-box, UI-driven deep learning package for model experimentation, training, and deployment. The deep learning workflow coordinates with the hardware platform to remove the operational headaches normally associated with deploying on-premises. You can focus on the problems at hand—which are difficult enough—rather than the plumbing. It is simple to get the application installed on an on-premises cluster and be working with models within four hours.

Conclusion

A One Convergence DKube system using NVIDIA high speed network adapters can enable deep-learning-as-a-service for on-premises platforms or the bare-metal deep learning cloud. For more information, see the following resources: