Uncompromised COTS Performance for Network Function Virtualization


When I explain Network Function Virtualization (NFV) and why it is a great technology that can revolutionize Communication Service Provider (CSP) operational and business models, I often use the smartphone analogy. In the not-so-distant past, we used to carry a lot of gadgets and accessories: a GPS unit, a camera, a cell phone, a Walkman, a Game Boy, and the list goes on.

[Image: yesterday's gadgets as today's smartphone apps]

But now people carry only smartphones, and all of the above have become apps running on a generic piece of hardware, with an operating system on top providing the necessary platform services to the software applications. The number of apps in both the Apple and Android app stores is well over a million, and at the 2014 Worldwide Developers Conference Apple said it had paid $13 billion to developers.



NFV aims to do the same for CSPs, moving their services from purpose-built hardware platforms to Commercial Off-the-Shelf (COTS) compute, storage, and networking infrastructure. The benefits are obvious: agility in service creation, automation in operations, and dynamic scalability. But anybody who has designed and operated a telco system wonders: what about performance?



Indeed, you are fine running your GPS, video streaming, and other apps on generic smartphone hardware and software because they provide acceptable performance, comparable to their purpose-built predecessors. However, if it takes 10 seconds to load an app, or you see constant jitter and buffering while watching the World Cup on your phone, you won't be a happy camper.


Similarly, with NFV, a large variety of telco and cable applications are going to run over a common infrastructure built on generic hardware and virtualization. Some of these applications demand high throughput. A typical Evolved Packet Core (EPC) blade can provide around 30Gb/s of throughput, and a CSP's customers expect no less from its virtual counterpart. So for many applications, raw bandwidth is vitally important.


However, in many other NFV use cases, it is really the packet performance, measured in millions of packets per second (pps), that matters more than raw throughput. Still other applications are latency sensitive, such as Session Border Controllers (SBCs) for Voice over LTE use cases, which must stay within the desired maximum of 150ms one-way end-to-end latency to achieve high-quality voice communications.


Latency is directly associated with processor (CPU) load, so ideally we want latency improvement without a significant impact on CPU load. Interestingly, storage performance has largely been neglected in the NFV discussion, but for Virtual Network Functions (VNFs) to be truly cloud-native and scale out, storage is an important piece of the puzzle to figure out.


In this post, I will mostly cover throughput and latency, and will address storage in a subsequent post.


In a typical virtualized compute environment, a virtual switch or router resides in the host operating system or hypervisor kernel to handle packets in and out of the server and between Virtual Machines (VMs) on the same server. The virtual switches, along with an SDN controller, play a big role in traffic steering and service chaining. Open vSwitch (OVS) is a good example that is widely adopted in NFV trials and Proofs of Concept (PoCs), but it is known to be inefficient, with poor packet performance and high latency. You pay the virtualization penalty.


So what does it take to improve throughput and latency performance between VMs? Here are the key factors:

How fast can your system transmit and receive bit streams?

First is the raw speed at which the servers, and the switches connecting them, can push packets in and out. There is really no magic there. In a server, packet I/O is typically handled by a Network Interface Card (NIC) or adapter. 10Gb/s NICs are very common, but for a high-performance NFV cloud, 40Gb/s or even 100Gb/s will be needed.


Mellanox is a leader in high-performance NICs, with a dominant market share in 40Gb/s Ethernet NIC shipments, and was first to market with end-to-end 100Gb/s (NICs + cables + switches) for InfiniBand. The Mellanox ConnectX-4 family of NICs supports not only InfiniBand but also 10Gb/s, 40Gb/s, and 100Gb/s Ethernet, as well as the new 25G and 50G standards, giving customers maximum flexibility.


How fast can your system handle packets?

As I mentioned, a critically important metric here is packet performance, especially when your VNFs handle a lot of small packets. As it turns out, lossless small-packet handling is much harder to achieve. Let's take a 10G link as an example and compute the theoretical maximum packet rate for 46-byte and 1500-byte payloads, with 38 bytes of per-packet framing overhead (14-byte Ethernet header, 4-byte FCS, 8-byte preamble, and 12-byte inter-frame gap).


Packet Data Unit (PDU)   Wire Size    Packet Handling Rate (Millions of packets/sec)
46 Bytes                 84 Bytes     10^10 b/s / (84 B × 8 b/B)   ≈ 14.88
1500 Bytes               1538 Bytes   10^10 b/s / (1538 B × 8 b/B) ≈ 0.81
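The arithmetic behind the table can be sketched in a few lines of Python (the 38-byte overhead figure comes from the framing breakdown above):

```python
# Theoretical maximum packet rate on a 10 Gb/s link.
# Each packet carries 38 bytes of overhead on the wire:
# 14 (Ethernet header) + 4 (FCS) + 8 (preamble) + 12 (inter-frame gap).
LINK_RATE_BPS = 10e9          # 10 Gb/s
OVERHEAD_BYTES = 38

def max_pps(payload_bytes: int) -> float:
    """Maximum packets per second for a given payload size."""
    wire_bytes = payload_bytes + OVERHEAD_BYTES
    return LINK_RATE_BPS / (wire_bytes * 8)

print(f"46-byte payload:   {max_pps(46) / 1e6:.2f} Mpps")    # ~14.88 Mpps
print(f"1500-byte payload: {max_pps(1500) / 1e6:.2f} Mpps")  # ~0.81 Mpps
```

Note the roughly 18x gap between the two rates: the same link demands 18 times more packet-handling work per second when it is saturated with minimum-size packets.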



Small packets create much more stress for your system and are a good test of how fast it can really handle packets. With Mellanox's ConnectX-4 100Gb/s Ethernet NIC, a whopping 75 million pps of raw performance has been achieved over a single NIC, which represents a huge step forward in boosting VNF performance.


And it is not just about the NIC, but also the switches connecting the servers. A recent benchmarking report highlights the ability of Mellanox Ethernet switches to deliver:


  • Zero-loss, wire-speed throughput at all frame sizes tested, from 64 bytes through 9212-byte jumbo frames, compared to up to 20% loss and latency of up to 97,980 ns for equivalent Broadcom-based switches
  • True cut-through switching, while the Broadcom-based switches fall back to store-and-forward for 10G-to-10G traffic, with up to 96% longer latencies within the same rack for typical ToR topologies
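A back-of-the-envelope calculation shows why store-and-forward hurts: such a switch must receive the entire frame before it starts forwarding, so each hop adds roughly the frame's serialization time on the ingress link. This sketch computes that penalty for the frame sizes mentioned above (it is a first-order approximation that ignores internal pipeline delays):

```python
# Extra per-hop latency a store-and-forward switch adds versus cut-through:
# the whole frame must arrive before forwarding begins, so the penalty is
# roughly the frame's serialization time on the ingress link.
def store_and_forward_penalty_ns(frame_bytes: int, link_gbps: float) -> float:
    # frame_bytes * 8 bits / (link_gbps * 1e9 b/s) seconds == bits / Gbps in ns
    return frame_bytes * 8 / link_gbps

for size in (64, 1538, 9212):
    ns = store_and_forward_penalty_ns(size, 10)
    print(f"{size:>5}-byte frame on 10G: +{ns:.0f} ns per hop")
```

At 10G a jumbo frame costs over 7 microseconds of serialization per store-and-forward hop, which is why cut-through switching matters for latency-sensitive east-west traffic.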


How fast can your system handle packets in a virtualized environment?

Even with a good foundation of raw throughput and packet performance, you can still suffer from the virtualization penalty: the performance degradation in a virtualized environment. This degradation can be caused by the components that sit between the VM interface and the server's NIC: the OS, the hypervisor, and device emulation.


In the following figure, you can see that with the virtual switch residing in kernel space, receiving packets relies on kernel interrupts and store-and-forward mechanisms. Multiple context switches and memory copies can happen before a received packet is delivered to the destination application, resulting in sub-optimal packet performance. In some environments, fewer than 1 million pps can be achieved on 10Gb/s links.


[Figure 1]


There are multiple remedies to this issue. Here we pick two representative ones: a pure software acceleration scheme, and a solution with hardware assist.


[Figure 2]


The left side of the above figure depicts a solution that uses the Data Plane Development Kit (DPDK) to change the way packets are received from push to poll. Applications that link with the DPDK library and call its APIs keep polling for packets, eliminating the interrupt on packet arrival. This can significantly enhance packet performance while preserving hardware independence. But doing all packet processing in user space still imposes overhead on the CPU, which adversely impacts packet performance, latency, and the performance of other applications on the same host.
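To make the push-to-poll idea concrete, here is a conceptual sketch of a poll-mode receive loop. This is illustrative Python against a hypothetical in-memory ring, not the real DPDK C API: instead of sleeping until an interrupt announces a packet, the thread busy-polls the RX ring and drains packets in bursts, trading CPU cycles for throughput and latency.

```python
# Conceptual sketch of a poll-mode (DPDK-style) receive loop.
# Hypothetical queue-based ring, not the actual DPDK interface.
from collections import deque

BURST_SIZE = 32  # real poll-mode drivers also pull packets in bursts

def poll_loop(rx_ring: deque, process, max_idle_polls: int = 1000):
    """Busy-poll rx_ring, handing up to BURST_SIZE packets at a time to process().

    A real poll-mode driver spins forever on a dedicated core; this demo
    exits after max_idle_polls consecutive empty polls.
    """
    idle = 0
    while idle < max_idle_polls:
        burst = []
        while rx_ring and len(burst) < BURST_SIZE:
            burst.append(rx_ring.popleft())
        if burst:
            process(burst)   # no interrupt, no context switch on arrival
            idle = 0
        else:
            idle += 1

# Example: 100 packets arrive and are processed in bursts of at most 32.
ring = deque(range(100))
received = []
poll_loop(ring, received.extend)
print(len(received))  # 100
```

The burst structure amortizes per-call overhead across many packets, which is a large part of where DPDK's speedup comes from; the cost is one or more CPU cores pinned at 100% just polling.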


The right side of the figure depicts a solution that leverages hardware assist to bypass the kernel, using Single Root I/O Virtualization (SR-IOV) to virtualize a single Ethernet port into multiple lightweight Virtual Functions (VFs) that can be assigned to VMs. Communication between VMs and their corresponding VFs is through Direct Memory Access (DMA), eliminating lengthy copy operations. Mellanox ConnectX supports this solution by implementing an embedded virtual switch, called "eSwitch", inside the NIC. The eSwitch can perform accelerated switching of both inter- and intra-host packets.


This approach accelerates packet processing without any CPU overhead; the valuable CPU resources can be used for other tasks instead of packet I/O. The eSwitch supports various overlay tunnels, such as VXLAN, NVGRE, and GENEVE, and can be extended to more protocols. The eSwitch also shares its forwarding table with the virtual switch in user space, so the forwarding table size is limited only by host memory. In a subsequent post, I will further explore the benefits of the eSwitch.
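Offloading overlay tunnels to the NIC matters partly because encapsulation is not free on the wire either. As a rough illustration of the bandwidth side of the cost, VXLAN wraps every inner frame in 50 bytes of outer headers (outer Ethernet 14 + outer IPv4 20 + UDP 8 + VXLAN 8); the CPU-side encapsulation work is what the eSwitch takes off the host:

```python
# VXLAN encapsulation adds 50 bytes of outer headers to every frame:
# outer Ethernet (14) + outer IPv4 (20) + UDP (8) + VXLAN header (8).
VXLAN_OVERHEAD = 14 + 20 + 8 + 8  # = 50 bytes

def goodput_fraction(inner_frame_bytes: int) -> float:
    """Fraction of wire bytes that carry the original (inner) frame."""
    return inner_frame_bytes / (inner_frame_bytes + VXLAN_OVERHEAD)

print(f"1500-byte frame: {goodput_fraction(1500):.1%} goodput")  # ~96.8%
print(f"  64-byte frame: {goodput_fraction(64):.1%} goodput")    # ~56.1%
```

As with packet rates, small frames are hit hardest: nearly half the wire bandwidth of a VXLAN-encapsulated minimum-size frame is tunnel overhead.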


NFV demands packet performance in virtualized environments, and Mellanox's end-to-end interconnect solution delivers the unprecedented packet throughput and low latency that make NFV over COTS an obvious choice.



