We generally assume that faster network interconnects maximize endpoint performance. In this blog, I will examine key factors and considerations when choosing the right speed for your leaf-spine data center networks.
To establish a common ground and terminology, I’ve listed below the 5 building blocks that comprise a standard leaf-spine networking infrastructure.
Let’s start by reviewing the 2020 trends in datacenter leaf-spine network deployments and describing the main eco-system that lies behind it all. The following illustration is divided into two main connectivity parts, each of which takes different factors into consideration when picking the deployment rate:
1) Switch-to-Switch (also applies when using a super-spine level)
2) NIC-to-Switch
The following is an overview of Leaf-Spine network connectivity:
Together these parts comprise an eco-system, which I will now analyze in depth.
New leaf-spine data center deployments in 2020 revolve around four IEEE-approved speeds: 40, 100, 200 and 400GbE.
Different combinations of switches support each speed. For example, building a network with 400GbE leaf-spine connectivity requires the network owner to pick switches and cables that support that rate.
Like every other product in the world, each speed generation demonstrates a unique product life cycle (PLC, figure below) while each PLC stage comes with its own attributes.
As a rough categorization, during the Introduction stage, product adoption is concentrated within a small group of innovators who are afraid neither of taking risks nor of suffering birth pangs; in networking, these are usually the networking giants (a.k.a. hyperscalers). Growth occurs as leaders and decision makers start adopting a new generation. The Maturity stage is characterized by adoption among the more conservative customers, while Decline occurs when a speed generation is used only to connect legacy equipment.
The main questions that pop up in my mind are “Why do generations change?” and “What drives the ambition for faster Switch-to-Switch connectivity?” The answer is surprisingly simple: $MONEY. When you constantly optimize your production process and, at the same time, enable bigger scale (bigger ASIC switching capacity), the result is a lower cost of connectivity.
This price reduction does not happen at once; it takes time to reach maturity. Hyperscalers can benefit from cost reduction even when a generation is in its Introduction stage, because being big allows them to get better prices (economy of scale offers better buying power), often much lower than the Manufacturer’s Suggested Retail Price. In some sense, you could say that hyperscalers are paving the way for the rest of the market to use new generations.
Armed with this new knowledge, we are ready to do some analysis.
Before focusing on the present, let’s rewind a decade, back to 2010-11, when 10GbE was approaching maturity and the industry was hyped about moving its switch-switch speed from 10GbE to 100GbE. At the time, the 100GbE leaf-spine eco-system had many caveats. To name one: spine switches based on 100GbE NRZ technology did not have the right radix for scale. They provided only 12 ports of 100GbE, meaning only 12 racks could be connected in a leaf-spine cluster.
At the same time, 40GbE switch-switch connectivity started to gain traction even though it was slower, thanks to mature SerDes technology, a reliable ASIC process, better scale and lower overall cost for most of the market.
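To make the radix limit concrete, here is a minimal Python sketch. It assumes a simple two-tier fabric with one leaf switch per rack, where every leaf takes the same number of links to every spine, so the spine port count caps the number of racks. The function name and the link-count parameter are illustrative assumptions, not part of any vendor tool.

```python
def max_racks(spine_ports: int, links_per_leaf_per_spine: int = 1) -> int:
    """Upper bound on leaf switches (racks) in a two-tier leaf-spine fabric.

    Assumes every leaf connects to every spine with the same number of
    parallel links, so each spine dedicates that many ports to each leaf.
    """
    if spine_ports <= 0 or links_per_leaf_per_spine <= 0:
        raise ValueError("port counts must be positive")
    return spine_ports // links_per_leaf_per_spine


# A 12-port 100GbE spine, as described above, tops out at 12 racks.
print(max_racks(12))
```

Under the same assumption, doubling the links from each leaf to each spine halves the number of racks a given spine can serve.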
Put yourself in the shoes of a decision maker who needs to deploy a new cluster in 2011—what switch-switch speed would you pick? Hard dilemma, right?
Fortunately, since this was a decade ago, we have accumulated lots of data about what happened. Take a moment to analyze the graph below (note that the 10/40GbE generation is a perfect example of a PLC curve).
From 2011 until 2015, most of the industry picked 40GbE as its leaf-spine network speed. When asked in retrospect about the benefits of 40GbE, businesses typically mention improved application performance and better ROI. Only at the end of 2015, roughly 4 years after the advent of 40GbE, did the 100GbE leaf-spine eco-system begin its rise and come to be seen as reliable and cost-effective. Some deployments did benefit from 100GbE earlier, since picking “the latest and the greatest” fits some use cases, even at higher prices.
Fast forward to 2020. New datacenter deployments enjoy a set of wonderful new options for switch-switch rates, ranging from 40GbE to 400GbE. Most current deployments use 100GbE connections, a speed that is mature at this point. With the continuous drive to lower costs, the demand for faster network speeds isn’t easing up, as the newer 200GbE and 400GbE technologies are deployed.
The following table presents attributes currently associated with each switch-to-switch speed generation:
We can conclude that each generation has its own pros and cons, and that the choice depends on your own needs and preferences. Now we will continue our journey in understanding the dynamics taking place in the data center speed eco-system, and try to answer which switch-to-switch speed generation will fit you best: 100, 200 or 400GbE?
As mentioned before, new switch-to-switch datacenter deployments in 2020 revolve around four IEEE-approved speeds: 40, 100, 200 and 400GbE. Each one is at a different PLC stage, summarized in the table below:
Let me share with you the reasons I view the market in this way:
To begin with, 400GbE is the current latest and greatest, and no doubt it will account for a major part of future deployments by offering the fastest connectivity with the lowest projected cost per GbE. At present, however, it has not yet reached the maturity required to gain the benefits of commoditization.
A small number of hyperscalers, known for innovation, compute-intense applications, engineering capabilities and most importantly, those which enjoy economy of scale, are deploying clusters at that speed. To mitigate technical issues with 400GbE native connections, some have shifted to 2x200GbE or pure 200GbE deployments. The reason is that with 200GbE leaf-spine connections, hyperscalers can rely on a more resilient infrastructure, leveraging both cheaper optics and switch radix that allows for scaling a fabric.
At present, non-hyperscalers trying to move to 400GbE switch-switch connectivity will come to realize that the cables and transceivers are still very expensive and produced in low volumes. Moreover, the 7nm ASIC process for creating high-capacity switches is not yet optimized.
At the opposite side of the curve lies 40GbE, a generation in decline. You should consider 40GbE if you are deploying a legacy cluster, with legacy equipment that cannot work at faster speeds.
Most of the market is not getting caught up in the hype and doesn’t waste money on unnecessary bandwidth. It is focused on the mature 100GbE eco-system, which exhibits textbook characteristics when it comes to cost reduction, market availability and reliability. 100GbE is not going away; it is here to stay.
This is a great opportunity to mention the other part of our story: the NIC-switch speed. At this point, it might seem that the two speeds co-exist orthogonally, but in fact they are entwined and affect one another.
Whether your application is in the field of intense compute, storage or AI, the NIC is the heart of it. In practice, the NIC speed determines the optimal choice of the surrounding network infrastructure, as it connects your compute and storage to the network. I’ll explain: while deciding which switch-to-switch speed to pick, consider also what kind of traffic, generated from the compute nodes, is going to run between the switches. Different applications have different traffic patterns. Nowadays, most of the traffic in a datacenter is east-west traffic, from one NIC to another.
To get the best application performance, opt for a leaf switch that has the appropriate blocking factor (ideally fully non-blocking) to avoid congestion, by deploying enough uplink ports for the downlink ports it serves.
Datacenter deployments frequently use NICs at one of the following speeds: 10GbE (NRZ), 25GbE (NRZ), 50GbE (PAM-4) or 100GbE (PAM-4). There are also 50GbE and 100GbE NRZ NICs, but these are less common.
This is where the complete eco-system builds up: the point where switch-to-switch and NIC-to-switch complement each other. After reviewing dozens of different datacenter deployments, I noticed a clear pattern in overall costs when the switch-to-switch speed is chosen with the NIC-switch speed in mind. The math just works that way. There is an optimal point where a specific switch-switch speed generation allows the NIC-switch speed to maximize application performance, both in terms of bandwidth utilization and ROI.
Taking into consideration the application, the desired blocking factor and the price per GbE, if your choice is based on the NIC speed, you would probably want to use the switch-switch speed shown in this table:
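The blocking factor can be expressed as the ratio of server-facing bandwidth to spine-facing bandwidth on a leaf switch. The Python sketch below illustrates that ratio; the function name and the example port counts are hypothetical and are not taken from any specific product.

```python
def blocking_factor(downlinks: int, downlink_gbe: int,
                    uplinks: int, uplink_gbe: int) -> float:
    """Ratio of downlink (NIC-facing) to uplink (spine-facing) bandwidth.

    1.0 means non-blocking; a value above 1.0 means the uplinks are
    oversubscribed by that factor.
    """
    if uplinks <= 0 or uplink_gbe <= 0:
        raise ValueError("uplink count and speed must be positive")
    return (downlinks * downlink_gbe) / (uplinks * uplink_gbe)


# Hypothetical leaf: 48 x 25GbE downlinks and 4 x 100GbE uplinks.
print(blocking_factor(48, 25, 4, 100))  # 3.0, i.e. 3:1 oversubscribed
```

Whether a ratio above 1.0 is acceptable depends on the traffic pattern: heavy east-west traffic tends to call for a ratio at or near 1.0.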
Of course, other combinations might be better, depending on the prices you get from your vendor, but on average, this is how I view the market.
If you’ve made it this far, then you must have realized that 200GbE leaf-spine speed is also an option to consider.
In December 2017, the IEEE approved a standard that contains the specifications for 200 and 400GbE. As discussed earlier, a small number of hyperscalers are upgrading their deployments from 100GbE directly to 400GbE. Practically speaking, the industry has acknowledged that 200GbE can serve as an intermediate step, just as 40GbE served as an intermediate step in the transition from 10 to 100GbE.
So, what’s in it for you?
200GbE switch-to-switch deployments enjoy a comprehensive set of benefits:
In preparation for the 200/400GbE era, Mellanox has optimized its 200GbE switch portfolio. It allows the fabric to scale the radix with better ROI than 400GbE, by using a 64×200GbE (12.8Tbps) spine and a 12×200GbE + 48×50GbE (6.4Tbps) non-blocking leaf switch.
Compared with its competition, Mellanox offers an optimized non-blocking leaf switch (Top-of-Rack) for 50G PAM-4 NICs.
Mellanox Spectrum®-2 based platforms provide a capacity of 6.4Tbps, 50G PAM-4 SerDes and a feature set that suits the virtualized datacenter environment.
Using a competitor’s 12.8Tbps switch as a leaf switch is overkill for today’s deployments, because the majority of top-of-rack switches have 48 downlink ports of 50GbE. Doing the math for a non-blocking ratio, such a switch needs 6 uplink ports of 400GbE or 12 uplink ports of 200GbE, for a total of 4.8Tbps of downlink plus uplink bandwidth. There is no added value in paying for unutilized switching capacity.
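The arithmetic in the paragraph above can be reproduced with a short Python sketch. It assumes that non-blocking means uplink bandwidth equals downlink bandwidth; the function names are illustrative only.

```python
import math


def uplinks_for_non_blocking(downlinks: int, downlink_gbe: int,
                             uplink_gbe: int) -> int:
    """Number of uplink ports needed to match total downlink bandwidth."""
    return math.ceil(downlinks * downlink_gbe / uplink_gbe)


def required_capacity_tbps(downlinks: int, downlink_gbe: int) -> float:
    """Switching capacity used by a non-blocking leaf, in Tbps.

    Downlink and uplink bandwidth are equal, hence the factor of 2.
    """
    return 2 * downlinks * downlink_gbe / 1000


print(uplinks_for_non_blocking(48, 50, 400))  # 6
print(uplinks_for_non_blocking(48, 50, 200))  # 12
print(required_capacity_tbps(48, 50))         # 4.8
```

By this estimate, a 48×50GbE non-blocking leaf would use 4.8 of 12.8Tbps, or 37.5% of a 12.8Tbps switch.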
By the way, Mellanox offers a 200GbE development kit for people who want to take our SN3700 Ethernet switch for a test drive.
Deploying or upgrading a datacenter in 2020? Make sure to take into consideration the following:
Disagree with the need for 200GbE, or anything else in this blog? Feel free to reach out to me. I would love to have a discussion.