Network File System (NFS) is a ubiquitous component of most modern clusters. It was originally designed as a work-group filesystem, making a central file store available to, and shared among, a number of client servers. Later, as NFS grew in popularity, mission-critical applications began running over it and high-speed access to storage became paramount, so higher-performance networking started to be used for client-to-NFS communication. In addition to higher networking speeds (today 100GbE and soon 200GbE), the industry has been looking for technologies that offload stateless networking functions from the CPU to the IO subsystem. This leaves more CPU cycles free to run business applications and maximizes data center efficiency.
One of the more popular networking offload technologies is RDMA (Remote Direct Memory Access). RDMA makes data transfers more efficient and enables fast data movement between servers and storage without involving the server's CPU. Throughput is increased, latency is reduced, and CPU power is freed up for the applications. RDMA technology is already widely used for efficient data transfer in render farms and in large cloud deployments such as Microsoft Azure, in HPC solutions (including machine/deep learning), in iSER- and NVMe-oF-based storage, and in mission-critical SQL database solutions such as Oracle's RAC (Exadata), IBM DB2 pureScale, Microsoft SQL solutions, and Teradata, as well as many others.
The figure above illustrates why IT managers have been deploying RoCE (RDMA over Converged Ethernet). RoCE utilizes advances in Ethernet to enable more efficient implementations of RDMA over Ethernet and enables widespread deployment of RDMA technologies in mainstream data center applications.
The growing deployment of RDMA-enabled networking solutions in public and private clouds, like RoCE, which enables running RDMA over Ethernet, plus the recent NFS protocol extensions, enable NFS communication over RoCE. (For more details, please watch the Open Source NFS/RDMA Roadmap presentation given at the OpenFabrics Workshop in March 2017 by Chuck Lever, upstream Linux contributor and a Linux Kernel Architect at Oracle.) For a detailed description of how to run NFS over RoCE, please read How to Configure NFS over RDMA (RoCE) at the Mellanox community site.
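As a rough illustration of the setup described in that guide, enabling NFS over RDMA on a Linux server and client typically looks like the following sketch. The export path, mount point, and server IP address are placeholders; 20049 is the standard NFS/RDMA port.

```shell
# On the NFS server: load the server-side RDMA transport module and tell
# the in-kernel NFS server to listen for RDMA connections on port 20049,
# then export a directory as usual.
modprobe svcrdma
echo "rdma 20049" > /proc/fs/nfsd/portlist
exportfs -o rw,insecure,no_root_squash "*:/mnt/share"    # /mnt/share is a placeholder

# On the client: load the client-side RDMA transport module and mount
# with the rdma option so NFS traffic runs over RoCE instead of TCP.
modprobe xprtrdma
mount -t nfs -o rdma,port=20049 192.168.1.10:/mnt/share /mnt/nfs    # IP is a placeholder
```

With the mount in place, a quick check of `/proc/mounts` should show `proto=rdma` for the mount point, confirming that the TCP transport is not in use.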
In order to evaluate the performance boost that RoCE enables versus TCP, we ran a set of iozone tests at Mellanox and measured the IOPS and throughput of multi-threaded read and write workloads. The tests were performed on a single client against a Linux NFS server exporting a tmpfs, so storage latencies are removed from the picture and the transport behavior is clearly exposed.
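The exact iozone parameters were not published with the results; a representative multi-threaded run of the kind described, with illustrative thread count, file size, and record size, might look like this:

```shell
# Multi-threaded iozone throughput test against the NFS mount:
#   -i 0  write/rewrite test        -i 1  read/reread test
#   -t 4  four threads              -s 1g 1 GiB file per thread
#   -r 128k  128 KiB record size    -I    use O_DIRECT to bypass the page cache
# File paths (one per thread) are placeholders under the NFS mount point.
iozone -i 0 -i 1 -t 4 -s 1g -r 128k -I \
       -F /mnt/nfs/f1 /mnt/nfs/f2 /mnt/nfs/f3 /mnt/nfs/f4
```

Running the same command once with the mount configured for TCP and once with the `rdma` mount option gives a direct transport-to-transport comparison on identical hardware.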
The client server was built around an Intel(R) Core(TM) i5-3450S CPU @ 2.80GHz (one socket, four cores, HT disabled) with 16GB of non-ECC 1333MHz DDR3 RAM, together with a Mellanox ConnectX-5 100GbE NIC (firmware version 16.20.1010) plugged into a PCIe 3.0 x16 slot.
The NFS server was built around an Intel(R) Xeon(R) CPU E5-1620 v4 @ 3.50GHz (one socket, four cores, HT disabled) with 64GB of 2400MHz DDR4 RAM, together with a Mellanox ConnectX-5 100GbE NIC (firmware version 16.20.1010) plugged into a PCIe 3.0 x16 slot.
The client and the NFS server were connected over a single 100GbE Mellanox LinkX® copper cable to a Mellanox Spectrum™ SN2700 switch with 32 x 100GbE ports, which is the lowest-latency Ethernet switch available on the market today, making it ideal for running latency-sensitive applications over Ethernet.
Below are the bandwidth and IOPS that were measured over RoCE vs. TCP while running the iozone tests.
Running NFS over RDMA-enabled networks such as RoCE, which offload the data communication work from the CPU, generates a significant performance boost. As a result, Mellanox expects that NFS over RoCE will eventually replace NFS over TCP and become the leading transport technology in data centers.
Thanks to Chuck Lever for sharing his performance results and for his guidance.