Mellanox Accelerates Apache Spark Performance with RDMA and RoCE Technologies

 

Looking back over the last decade, Apache Spark has disrupted big data processing and analytics in many ways. With its vibrant ecosystem, Spark, a high-performance analytics engine for big data processing, is the most active Apache open-source project. The key factors driving Spark's enterprise adoption are unmatched performance, a simple programming model and general-purpose analytics over massive amounts of data.

Spark performance benchmarks indicate that Spark runs big data workloads up to 100x faster than Hadoop MapReduce for both batch and streaming data. This performance gain is primarily attributed to Spark's in-memory approach to data processing and analysis, which is fast and efficient enough to enable large-scale machine learning and data analytics. Spark also introduces a data abstraction called resilient distributed datasets (RDDs): data structures that are kept in memory while being computed, eliminating expensive intermediate disk writes.

Figure 1: Apache Spark

 

To handle data processing and analysis at scale, Spark relies on an operation known as the shuffle: a mechanism for re-distributing data so that it is grouped differently across partitions. Because it copies data across executors and machines, the shuffle is a complex and costly operation, involving disk I/O, data serialization and network I/O. Data scientists and software professionals therefore use various techniques to avoid shuffling as much as possible in their application designs. Still, shuffle operations are a necessity for most workloads, and their cost compromises performance.
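One widely used technique for limiting shuffle cost is map-side combining: aggregating within each partition before any records cross the network (Spark's reduceByKey does this, whereas groupByKey does not). The following pure-Python sketch (no Spark required; the partition data and helper function are hypothetical, for illustration only) shows how combining shrinks the number of records a shuffle must move:

```python
from collections import Counter

# Hypothetical word-count input, split across two "executors" (partitions).
partitions = [
    ["spark", "rdma", "spark", "shuffle"],
    ["rdma", "spark", "shuffle", "shuffle"],
]

def shuffled_records(parts, combine):
    """Count how many (key, value) records would cross the network."""
    if combine:
        # Map-side combine: aggregate per partition first (reduceByKey-style),
        # so at most one record per distinct key leaves each partition.
        outgoing = [list(Counter(p).items()) for p in parts]
    else:
        # No combine: every raw record is shuffled (groupByKey-style).
        outgoing = [[(word, 1) for word in p] for p in parts]
    return sum(len(records) for records in outgoing)

print(shuffled_records(partitions, combine=False))  # 8 raw records cross the network
print(shuffled_records(partitions, combine=True))   # only 6 pre-aggregated records
```

Even in this toy case the combiner cuts shuffled records from 8 to 6; on real workloads with highly repetitive keys, the reduction is far larger, which is why avoiding or shrinking shuffles is such a common design goal.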

An Introduction to Remote Direct Memory Access (RDMA)

Remote Direct Memory Access (RDMA) is a network technology that allows one computer to access the memory of another directly, without involving either one's operating system or CPU. RDMA is especially useful in massively parallel computer clusters, as it permits high-throughput, low-latency networking. When an application issues an RDMA Read or Write request, the data is delivered directly to the network (zero-copy, fully offloaded by the network adapter), reducing latency and enabling fast message transfer. RDMA over Converged Ethernet (RoCE) is a network protocol that allows RDMA to run over an Ethernet network.

Mellanox recently announced a release of the open-source SparkRDMA plugin, geared towards accelerating Spark's shuffle operations.

Figure 2: Illustration of RDMA zero copy network transport

 

Mellanox Technologies has been a leading pioneer of the popular RDMA and RDMA over Converged Ethernet (RoCE) networking technologies, starting in the high-performance computing (HPC) industry. In fact, Mellanox has just released its 8th generation of RDMA/RoCE-capable products, including the intelligent ConnectX adapter cards and BlueField SmartNICs, both of which have built-in RDMA and RoCE capabilities and deliver best-in-class performance and usability.

How does RDMA accelerate Spark workloads?

RDMA today is integrated into the mainstream code of popular machine learning (ML) and artificial intelligence (AI) frameworks, namely TensorFlow, MXNet and Caffe2. Recently, Mellanox announced the v3 release of its Spark-compliant open-source SparkRDMA software plugin, which leverages RDMA communication technology to accelerate Spark’s shuffle operations. The plugin neither changes the mainstream Spark code nor impacts its functionality, making it a perfect fit for existing deployments.
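Because the plugin works as a drop-in shuffle manager, enabling it is a configuration change rather than a code change. As a sketch, a cluster-wide setup in spark-defaults.conf might look like the following (the jar path and file name are illustrative; consult the SparkRDMA project's README for the exact artifact for your Spark version):

```
spark.driver.extraClassPath    /opt/sparkrdma/spark-rdma-jar-with-dependencies.jar
spark.executor.extraClassPath  /opt/sparkrdma/spark-rdma-jar-with-dependencies.jar
spark.shuffle.manager          org.apache.spark.shuffle.rdma.RdmaShuffleManager
```

With these settings in place, existing Spark applications run unmodified; only the shuffle transport changes underneath them.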

Figure 3: Mellanox ConnectX and BlueField adapter cards

 

Figure 4 below illustrates how SparkRDMA reuses the Unsafe and Sort Shuffle Writer implementations of mainstream Spark (shown in light green). While shuffle data is written and stored identically to the original implementation, the all-new ShuffleReader and ShuffleBlockResolver provide an optimized RDMA transport when blocks are read over the network (shown in light blue).

Figure 4: Illustration of SparkRDMA software architecture

 

The following diagrams describe the shuffle read protocol in the original implementation (upper diagram) and when using RDMA (lower diagram). As illustrated, using RDMA for Spark's shuffle operations both shortens and speeds up the process considerably.

Figure 5: Illustrations of the Shuffle Read protocol

 

Spark over RDMA has shown substantial improvements in block transfer times (both latency and total transfer time), memory consumption and CPU utilization, compared to standard Spark's implementation, which runs over TCP. Moreover, the SparkRDMA plugin is designed with ease of use in mind and supports per-job operation, allowing for incremental deployments and selective use for shuffle-intensive jobs.
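Per-job operation means RDMA shuffle can be switched on for an individual shuffle-heavy job without touching cluster-wide defaults. A hypothetical invocation might look like this (the jar path, application class and jar name are placeholders, not part of the SparkRDMA distribution):

```
spark-submit \
  --conf spark.shuffle.manager=org.apache.spark.shuffle.rdma.RdmaShuffleManager \
  --conf spark.driver.extraClassPath=/opt/sparkrdma/spark-rdma.jar \
  --conf spark.executor.extraClassPath=/opt/sparkrdma/spark-rdma.jar \
  --class com.example.ShuffleHeavyJob \
  my-application.jar
```

Jobs submitted without these flags continue to use the stock TCP-based shuffle, which is what makes incremental rollout straightforward.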

Finally, the performance benefits of running Spark over RDMA are tremendous! Here are a few data points showing SparkRDMA in action:

  • 2.6x performance improvement with Terasort compared to non-accelerated Spark

Figure 6: Spark performance improvements using Spark over RDMA

 

  • 4.4x faster shuffles compared to non-accelerated Spark (9.3 min compared to 2.1 min with RDMA accelerated Spark)

Figure 7: Faster Shuffles

 

  • 1,000x faster transfers compared to non-accelerated Spark (2 seconds compared to 2 milliseconds with RDMA accelerated Spark)

Figure 8: Faster transfers

 

  • Zero shuffle read time in RDMA accelerated Spark

Figure 9: Shuffle Read times

Mellanox RDMA/RoCE NICs – the way to go for Spark

Apache Spark is today's fastest-growing Big Data analysis platform. The Mellanox team is excited to partner with large-scale enterprises and Cloud and AI solution providers to unlock scalable, faster and highly efficient big data analytics and machine learning for a wide range of commercial and research use cases.

Learn more about RDMA and RoCE.

To learn more about Mellanox’s fully-featured end-to-end InfiniBand and Ethernet product lines visit our website.

About Itay Ozery

Itay Ozery is a Senior Product Manager at Mellanox Technologies, driving strategic product management and product marketing initiatives for Mellanox's cloud networking solutions. Before joining Mellanox, Itay was a Senior Sales Engineer at NICE Systems Ltd., a Nasdaq-listed corporation, where he led large-scale business engagements and projects in the fields of cyber security and intelligence. Prior to that, Itay held various positions for more than a decade in IT systems and networking with data centers and telecom service providers, where he acquired extensive experience in IT system and network engineering. Itay holds a B.A. in Marketing and Information Systems from the College of Management Academic Studies, Israel.
