Low latency in electronic trading has long since moved from being a competitive advantage for an exotic few to being a baseline requirement for participation. A common misconception is that, from a server I/O technology perspective, the industry has reached a saturation point, and that further competitive advantages will come mostly from the higher software layers. More specifically, a programmer writing the communication layers for an electronic trading application might think the methodologies of 10 years ago are just as valid today, since the application still runs on the same Xeon processors (similar in terms of clock frequency) with the same 10Gb/s Ethernet connectivity.
A closer examination, however, reveals many advances in server I/O hardware that many application developers have yet to leverage. When properly accessed by application software, these innovations can drive the next wave of competitive advantage for electronic trading firms.
Here is a very brief recap of some of these areas of innovation:
While CPU clock speeds have not increased dramatically, overall CPU performance has continued to increase significantly, mainly through higher core counts per system. Typical core counts on servers used by electronic traders have increased 10x over the past 10 years, and yet many of these cores remain either idle or unused for communication, because software is unable to take proper advantage of them.
To maximize the utilization of their servers and get the full return on the hardware investment, traders need to be able to bring data into all cores with the exact same latency, jitter, accuracy and reliability. In the past, bottlenecks in network adapter cards and the PCI bus did not allow this. Modern network adapter cards, such as the Mellanox ConnectX®-4, include parallel pipelines for message handling, PCI connections that are 10x faster than the 10GbE network link (PCIe 3.0 x16 > 100Gb/s), and hardware-based steering tables that allow direct steering of different flows to different CPU cores.
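To make the idea of a steering table concrete, here is a minimal software model of one, written in Python purely for illustration. It is not the ConnectX API or any vendor's interface: the class name, the (destination IP, destination port) flow key, the example addresses and the hash fallback are all assumptions. Real adapters perform this lookup in hardware.

```python
import zlib
from typing import Dict, List, Tuple

# Hypothetical flow key: (destination IP, destination UDP port).
Flow = Tuple[str, int]


class SteeringTable:
    """Toy model of a steering table that maps flows to CPU cores."""

    def __init__(self, cores: List[int]) -> None:
        if not cores:
            raise ValueError("at least one core is required")
        self.cores = list(cores)
        self.rules: Dict[Flow, int] = {}

    def add_rule(self, flow: Flow, core: int) -> None:
        """Pin one flow to one specific core."""
        if core not in self.cores:
            raise ValueError(f"core {core} is not available")
        self.rules[flow] = core

    def core_for(self, flow: Flow) -> int:
        """Return the pinned core, or spread unknown flows by hash."""
        if flow in self.rules:
            return self.rules[flow]
        key = f"{flow[0]}:{flow[1]}".encode()
        return self.cores[zlib.crc32(key) % len(self.cores)]


# Example: pin one (made-up) multicast feed to core 3.
table = SteeringTable(cores=[2, 3, 4, 5])
table.add_rule(("239.1.1.1", 5000), 3)
```

Explicit rules keep a latency-critical feed on a known core, while the hash fallback spreads every other flow deterministically, so packets of one flow always land on the same core.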
Leveraging these mechanisms will allow better utilization of existing server estates, and better readiness to deal with tomorrow’s growth potential.
Bursts and Message Rates
Feeds still arrive on 10Gb/s Ethernet links, and the bandwidth actually consumed on most of these links is typically well below 10Gb/s. The real challenge, however, is measured not in Gb/s but in messages per second, and not on average but at peak: not just the 1 second peak, but the 10 millisecond peak. Market data peaks on all popular feeds show consistent, ongoing growth over time. Moreover, the number of messages per second that must be handled during 10 millisecond peak bursts is very challenging, far more so for the CPU and software stack than for the network.
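The gap between average and peak rates is easy to see in a small sketch. The Python below counts the largest number of messages that fall inside any sliding window of a given length. The function names and the synthetic timestamps are assumptions made for illustration, not real feed data.

```python
from typing import Sequence


def peak_messages_in_window(timestamps_ns: Sequence[int], window_ns: int) -> int:
    """Largest number of messages inside any window of length window_ns.

    timestamps_ns must be sorted in ascending order. Each window covers
    the half-open interval (t - window_ns, t] ending at a message time t.
    """
    peak = 0
    start = 0
    for end, t in enumerate(timestamps_ns):
        # Drop messages that are a full window (or more) older than t.
        while t - timestamps_ns[start] >= window_ns:
            start += 1
        peak = max(peak, end - start + 1)
    return peak


def as_rate_per_second(count: int, window_ns: int) -> float:
    """Express a per-window count as an equivalent messages-per-second rate."""
    return count * 1_000_000_000 / window_ns


# Synthetic one-second trace: a steady message every 1 ms, plus a
# 500-message burst spaced 10 microseconds apart starting at 500 ms.
background = [i * 1_000_000 for i in range(1000)]
burst = [500_000_000 + j * 10_000 for j in range(500)]
trace = sorted(background + burst)

peak_1s = peak_messages_in_window(trace, 1_000_000_000)
peak_10ms = peak_messages_in_window(trace, 10_000_000)
```

In this made-up trace the one-second view sees only the total message count, while the 10 millisecond view isolates the burst. Converting `peak_10ms` with `as_rate_per_second` shows the rate the receiving core actually has to sustain during the burst.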
The table below shows the latest capacity projections of OPRA, the popular consolidated options feed. From these numbers, it’s easy to see that the peaks continue growing, and systems that have been designed for average rates of 1 or 2 million packets per second will not be able to sustain them.
Dealing with these microbursts requires a tight combination of hardware and software functionality. On the hardware side, network adapter cards need to have much more buffering than standard 10GbE would require, and be able to steer traffic directly to CPU cores with minimal overhead.
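A rough way to reason about buffering: if packets arrive faster than the application drains them, the backlog grows by the difference in rates multiplied by the burst duration. The Python sketch below works through that arithmetic. The function names and the example rates are assumptions chosen for illustration, not measurements of any feed or adapter.

```python
def required_buffer_packets(arrival_pps: int, drain_pps: int, burst_us: int) -> int:
    """Backlog (in packets) built up during a burst.

    arrival_pps: packet arrival rate during the burst (packets/second)
    drain_pps:   rate at which the application empties the buffer
    burst_us:    burst duration in microseconds
    """
    if arrival_pps <= drain_pps:
        return 0  # the application keeps up, so no backlog accumulates
    excess = (arrival_pps - drain_pps) * burst_us
    return -(-excess // 1_000_000)  # integer ceiling division


def per_packet_budget_ns(pps: int) -> float:
    """Time available to handle each packet at a given rate."""
    return 1e9 / pps


# Hypothetical example: a 10 ms burst at 10 million packets/second
# hitting a system designed to drain 2 million packets/second.
backlog = required_buffer_packets(10_000_000, 2_000_000, 10_000)
budget_at_burst = per_packet_budget_ns(10_000_000)
```

Under these assumed numbers, the receive path must either absorb the whole backlog in buffers or process each packet within the much smaller per-packet budget that the burst rate implies.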
On the software side, applications need to be ready to receive these microbursts, which were previously dropped in hardware, and must be carefully designed to empty receive buffers and act upon the data as efficiently as possible.
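One common pattern for this is to drain the receive buffer in batches before doing any per-message work, so that buffer slots are freed as early as possible. The following Python sketch models that pattern with an in-memory queue standing in for a receive ring. The names `drain_batch` and `poll_once` and the batch size of 64 are assumptions, and a production implementation would poll the adapter's own ring rather than a `deque`.

```python
from collections import deque
from typing import Callable, Deque, List


def drain_batch(rx_ring: Deque[bytes], max_batch: int) -> List[bytes]:
    """Remove up to max_batch packets from the ring, oldest first."""
    batch: List[bytes] = []
    while rx_ring and len(batch) < max_batch:
        batch.append(rx_ring.popleft())
    return batch


def poll_once(rx_ring: Deque[bytes],
              handler: Callable[[bytes], None],
              max_batch: int = 64) -> int:
    """Free ring slots first, then act on the drained packets."""
    batch = drain_batch(rx_ring, max_batch)
    for packet in batch:
        handler(packet)
    return len(batch)
```

Capping the batch size bounds how long any single poll can run, which helps keep jitter predictable when a burst arrives.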
Round trip times, measured from application user space to application user space, have always been somewhat higher than what the hardware is actually capable of, due to the multiple layers of software involved. This applies to both user and kernel space. The closest most firms have come to hardware performance is by using socket-API-based kernel bypass libraries (e.g., VMA, OpenOnload), which, while fairly transparent and easy to use, leave much to be desired when trying to keep up with hardware speedups.
Today, multiple vendors are exposing low-level APIs that allow developers to program directly to the hardware. This includes control of buffer sizes and allocations, steering and filtering rules, and more. Many of these APIs are also fully open source, which leaves much flexibility to the developer and alleviates concerns about vendor lock-in.
Modern network adapter cards have capabilities that did not exist in the past, and can help deal with large bursts, parallel message streams, and inefficient access to hardware resources.
Join us on July 28th at 11AM EST for a webcast titled: “Latest Trends in Low Latency Programming”. In this webinar, Eitan Rabin of Mellanox and Jeremy Eder of Red Hat (sharing many years of hands-on low latency programming and tuning) will go over various new software tools and APIs that can be leveraged to reach the next level of latency and jitter.