This post was co-written by Mellanox’s Ramnath Sagar and Nisha Talagala of Parallel Machines.
Tesla’s semi-autonomous Autopilot system has drawn a lot of attention in the automotive industry. Tesla’s ability to ship a smarter Autopilot with each Over-The-Air (OTA) update lets it maintain a competitive edge in the autonomous vehicle era. But using AI, and especially high-performance Deep Learning (DL), to gain a competitive edge is not just relevant for Tesla — it matters for any enterprise looking to build intelligent software.
In today’s world, where rapid innovation is no longer optional but mandatory, DevOps becomes critical, bringing software developers and operations staff together. However, in a DL-powered software world, effectively deploying a DevOps-style application to production remains an enormous challenge. This is due to the complexities of configuration, the need for efficient hardware to scale training and inference performance, and the difficulty of continuously managing and supporting deep learning in production.
Mellanox and ParallelM have teamed up to solve this challenge using MLOps (DevOps for Machine Learning) and have defined a reference architecture for a production-scale, high-performance deep learning solution. We demonstrate how our technologies (Mellanox for high-performance deep learning and ParallelM for production DL management), coupled with state-of-the-art technologies from the open source community, can enable AI-first enterprises to maintain their competitive edge.
For our reference design, we chose TensorFlow, one of the most popular ML/DL frameworks, but the solution can easily be extended to other frameworks such as SparkML, Caffe, Torch, and others.
This reference design accomplishes two key objectives:
For more details, refer to our reference design: https://community.mellanox.com/docs/DOC-3001
Nisha Talagala is CTO and vice president of engineering at Parallel Machines, where she focuses on production machine learning and deep learning solutions from the edge to the cloud. Nisha has more than 15 years of expertise in software development, distributed systems, I/O solutions, persistent memory, and flash. Previously, Nisha was a fellow at SanDisk; a fellow and lead architect at Fusion-io, where she drove innovation in nonvolatile memory, including the industry’s first persistent memory solution; technology lead for server flash at Intel, where she led server platform nonvolatile memory technology development, storage-memory convergence, and technical partner engagements; and CTO of Gear6, where she designed and built clustered computing caches for high-performance I/O environments. Nisha holds 48 patents in distributed systems, networking, storage, performance, and nonvolatile memory. She has authored many technical and research publications and serves on multiple academic and industry conference program committees. Nisha holds a PhD from UC Berkeley, where her research focused on software clustering and distributed storage.