Everybody likes a good analogy, and it’s even better if it’s a football analogy. First, to clarify: by football I mean soccer. I know that in some places that’s an important clarification. In my neighborhood, we played football and we used our legs 😊.
Now that we have settled that, I would like to present my view on NVMe-oF.
Recently, Pure Storage announced their NVMe solution, FlashArrayX. By doing so, Pure joined a select group of vendors offering this fast storage solution, the fastest production-level storage solution today.
Mellanox Technologies partnered with Pure Storage to provide NVMe-oF. It’s interesting that Mellanox is almost always the network partner for vendors providing an NVMe-oF solution. Why is that?
Simply put, NVMe-oF stands for NVMe over Fabrics, which is the way to extend NVMe storage over the network.
NVMe over Fabrics is essentially NVMe over RDMA, and RDMA is basically Mellanox. So, to summarize: NVMe-oF means NVMe over a Mellanox network.
To clarify, RDMA is not a proprietary Mellanox protocol – RDMA is a standard. NVMe-oF solutions can be implemented over any network, but the fact is – Mellanox does RDMA best, simply because Mellanox has been doing RDMA for 20 years.
So, what is RDMA in a nutshell?
RDMA stands for Remote Direct Memory Access. On Ethernet, it is also called RoCE (RDMA over Converged Ethernet).
NVMe cuts out legacy storage protocols, such as SCSI, on the local storage node to enable fast storage. Similarly, as Diagram 1 below shows, RDMA cuts out the legacy network stack on the server to enable the fastest way to copy memory between nodes over the network.
RDMA, known as RoCE (RDMA over Converged Ethernet) in Ethernet environments, gives the application direct access to the Network Interface Card (NIC). By bypassing the kernel and the TCP/IP stack on the client and storage nodes, RDMA improves both speed and CPU utilization. Speed improves because the data transfer is offloaded to the NIC hardware, and CPU is freed because it is no longer needed for a simple memory copy over the network.
What does it mean? The equation is simple: less CPU for networking means more CPU for storage and compute, and that means more IOPS, more processing, and faster applications.
What are the RDMA prerequisites?
Why do we need a switch that is aware of RDMA?
As I mentioned above, RDMA bypasses the TCP/IP stack. But wait a minute: TCP has a simple, yet important, role in the Ethernet fabric, namely retransmission. This capability is required to recover when packets are lost for some reason.
Let’s look at Diagram 1 again and note the red arrow between the NICs. This red arrow is a network, with at least three switches in the path between the compute and storage nodes:
Top of Rack <-> Aggregation <-> Top of Rack
What will happen if packets are dropped in this network? This can easily happen due to congestion or other reasons. So what will force the retransmission of packets when TCP is not being used?
Here comes the football analogy: think of each player as a switch or a NIC, and of the ball as a packet, in an NVMe-oF match.
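To put a rough number on that question, here is a minimal sketch in Python. It assumes the simplified picture described above: every switch on the path drops packets independently with the same probability, and nothing retransmits. The function names and the drop rate in the usage lines are hypothetical, chosen only for illustration.

```python
def delivery_probability(per_hop_drop: float, hops: int = 3) -> float:
    """Probability that one packet crosses every hop without being dropped.

    Assumes each hop drops independently with the same probability.
    """
    if not 0.0 <= per_hop_drop <= 1.0:
        raise ValueError("per_hop_drop must be between 0 and 1")
    if hops < 0:
        raise ValueError("hops must be non-negative")
    return (1.0 - per_hop_drop) ** hops


def message_success_probability(per_hop_drop: float, packets: int, hops: int = 3) -> float:
    """Probability that every packet of a message arrives when nothing retransmits."""
    if packets < 0:
        raise ValueError("packets must be non-negative")
    return delivery_probability(per_hop_drop, hops) ** packets


if __name__ == "__main__":
    # Hypothetical numbers: a 0.1% drop chance per switch on the
    # Top of Rack <-> Aggregation <-> Top of Rack path, and a 1000-packet transfer.
    print(delivery_probability(0.001, hops=3))
    print(message_success_probability(0.001, packets=1000, hops=3))
```

Even a small per-hop drop rate compounds quickly across three switches and many packets, which is why the network itself has to prevent the drops.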
We need to make sure that in the NVMe-oF match, any pass between two players is completed successfully. For this, we need the best players in the world on the pitch, just like we would want in a football match. How can we make sure of that?
This is exactly the reason we need the DCB (Data Center Bridging) features. We need the network itself to take responsibility for the traffic and make sure it flows without any drops.
We need football players that can make the pass fast and 100% of the time – so that we have a perfect match.
Wow … We need 11 Lionel Messis on our pitch.
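To make the "no dropped passes" idea concrete, here is a toy Python model of pause-based flow control, the idea behind PFC. It is only a sketch under stated assumptions, not how any switch implements it: a single queue, fixed arrival and drain rates per tick, and hypothetical `xoff`/`xon` thresholds at which the receiver tells the sender to stop and resume. Real PFC works per priority class and uses pause frames on the wire.

```python
def simulate(ticks, arrivals_per_tick, drain_per_tick, buffer_size, xoff=None, xon=None):
    """Toy single-queue model. Returns (delivered, dropped, paused_ticks).

    With xoff/xon set, the sender is paused once the queue reaches xoff and
    resumes once it has drained down to xon. With both left as None, there is
    no flow control and packets arriving at a full buffer are dropped.
    """
    if xoff is not None:
        if xon is None or not 0 <= xon < xoff:
            raise ValueError("need 0 <= xon < xoff")
        # Headroom: a full burst arriving just below xoff must still fit.
        if xoff - 1 + arrivals_per_tick > buffer_size:
            raise ValueError("not enough headroom above xoff for one burst")

    queue = delivered = dropped = paused_ticks = 0
    paused = False
    for _ in range(ticks):
        # Receiver checks its queue depth and signals pause or resume.
        if xoff is not None:
            if queue >= xoff:
                paused = True
            elif queue <= xon:
                paused = False

        if paused:
            paused_ticks += 1          # sender holds the ball instead of passing
        else:
            for _ in range(arrivals_per_tick):
                if queue < buffer_size:
                    queue += 1
                else:
                    dropped += 1       # buffer full: the pass is lost

        sent = min(queue, drain_per_tick)
        queue -= sent
        delivered += sent
    return delivered, dropped, paused_ticks


if __name__ == "__main__":
    # Hypothetical rates: the sender is twice as fast as the receiver can drain.
    print("no flow control:", simulate(100, 2, 1, buffer_size=8))
    print("with pause     :", simulate(100, 2, 1, buffer_size=8, xoff=6, xon=2))
```

In this model the paused run trades sender idle time for zero drops, while the receiver stays just as busy. That is the kind of behavior a lossless profile is meant to achieve.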
It’s true that many network vendors implement the basic required features in their switches. But only Mellanox, as THE RDMA company, has a switch that is aware of RDMA needs. Mellanox Spectrum switches have a best-in-class buffer architecture with the lowest latency for NVMe-oF, and the buffer settings for the RDMA profile are configured automatically. Furthermore, Mellanox Spectrum switches offer easy end-to-end provisioning and monitoring for RDMA, which simplifies NVMe-oF provisioning and monitoring in the network.
So basically, you can think of the Mellanox Spectrum switch as the Messi of switches 😊
Here is an example of a lossless network profile configuration, where the buffers are configured according to the profile and PFC is configured across the switch –
And here is an example of a show command output that presents the buffer allocation –
Our Spectrum switches have many advantages in addition to extreme performance, which is critical for NVMe-oF environments. Spectrum switches are Open Ethernet switches, as they offer a choice of Network Operating Systems (NOS). It can be Onyx, it can be Cumulus Linux, and the performance will be the same. When it comes to performance, it’s all about the ASIC in the switch. The Spectrum line of products is powered by the Mellanox Spectrum ASIC, with low latency, zero packet loss and best-in-class buffers.
How great is that? Same top performance, and our Spectrum switches can be Onyx Messi or Cumulus Ronaldo. We can choose our favorite; I will leave it to you to decide. I have another favorite player, one legendary number 10, but this is for another blog…