Building a Simple, Scalable MPLS / Segment Routing Network Just Got Easier


In recent years, IP networking has undergone a massive transformation, driven by the need of hyperscale data centers for better performance, density and scale, and for lower CapEx and OpEx. By contrast, MPLS architectures and solutions lag far behind, leaving operators with antiquated, inadequate and expensive options for developing their networks and preventing them from taking full advantage of newer architectures such as Segment Routing.

But… why??

Essentially, there are four critical reasons why commercial switching ASICs have underdelivered, causing MPLS networks to languish and underperform:

  1. Failure to deliver scalable forwarding capabilities
  2. Inflexibility of forwarding table resources limiting scalability
  3. Inability to provide adequate entropy for multi-pathing
  4. Failure to consider the most basic tag switching primitives

Failure of MPLS Scalability – Consider the Transistors

A major reason for the availability of scalable IP networking elements is the use of modern algorithms to implement Longest Prefix Match (“LPM”) forwarding, which is a major element of IP routing. Legacy implementations rely on TCAMs (Ternary Content Addressable Memories) to perform LPM lookups, which results in lower density and performance, and in higher power and cost. Modern switching and routing ASICs implement algorithmic LPM lookups that use simpler SRAM memories, resulting in better performance and scalability at lower power.

While this transformation has benefitted IP networks, most modern ASICs lack similar algorithmic capabilities for MPLS lookups. This is ironic, given that the MPLS architecture was designed to allow the use of SRAMs in the first place. However, implementations have focused first on IP networks, and MPLS capabilities have lagged behind. As a result, MPLS networks built from commercial ASICs are not competitive with IP networks in scalability, leaving MPLS operators lagging in innovation and without the right building blocks to scale their networks.
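To make the contrast concrete, here is a minimal Python sketch of the two lookup types. It does not describe how any ASIC implements its tables; the routes, labels, next-hop names and helper functions are all hypothetical. The point is only that an IP lookup has to search for the longest matching prefix, while an MPLS lookup is an exact match on a label, which maps naturally onto a plain hash table or direct index in SRAM.

```python
import ipaddress

# Hypothetical IP routing table: prefix -> next hop.
IP_ROUTES = {
    ipaddress.ip_network("10.0.0.0/8"): "nh-a",
    ipaddress.ip_network("10.1.0.0/16"): "nh-b",
    ipaddress.ip_network("0.0.0.0/0"): "nh-default",
}

# Hypothetical MPLS table: incoming label -> next hop.
MPLS_LABELS = {16001: "nh-a", 16002: "nh-b"}


def lpm_lookup(dst):
    """Longest Prefix Match: pick the most specific prefix containing dst."""
    addr = ipaddress.ip_address(dst)
    matches = [net for net in IP_ROUTES if addr in net]
    best = max(matches, key=lambda net: net.prefixlen)
    return IP_ROUTES[best]


def label_lookup(label):
    """MPLS lookup: a single exact-match probe, no prefix search needed."""
    return MPLS_LABELS.get(label)


print(lpm_lookup("10.1.2.3"))   # nh-b (the /16 wins over the /8)
print(label_lookup(16002))      # nh-b
```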

Flexibility of Resource Allocation

Every network architecture and use case has different requirements. Unfortunately, most switching ASICs today do not allow operators to choose and optimize the available resources to suit their specific needs. For example, in some switches the lookup tables are dedicated to specific types of operations (e.g. L2, L3 or ACL operations). This can leave the network architect with a shortage of one type of lookup resource and an unused abundance of another. This resource inflexibility is certainly problematic for standard IP networks, but MPLS architectures suffer even more from hard-coded, dedicated tables. Most ASICs have fixed resources and thus cannot trade off IP tables against MPLS tables. Beyond this, they cannot flexibly divide entries between the ILM (Incoming Label Map) and the NHLFE (Next Hop Label Forwarding Entry) tables. These design constraints lock operators into very restrictive models of using MPLS in their networks. However, the constraints are not dictated by the networking protocols themselves; they are a consequence of the switch ASIC architecture.
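As a rough illustration of what flexible allocation means, the Python sketch below splits one shared pool of table entries between IP, ILM and NHLFE tables according to a per-deployment profile. The pool size, the weights and the `allocate` helper are assumptions made up for this example and do not reflect any specific switch or SDK.

```python
def allocate(total_entries, weights):
    """Split a shared table budget according to per-table weights."""
    weight_sum = sum(weights.values())
    return {name: total_entries * w // weight_sum for name, w in weights.items()}


TOTAL = 100_000  # hypothetical shared entry budget

# Two hypothetical profiles drawn from the same pool.
ip_heavy = allocate(TOTAL, {"ip": 8, "ilm": 1, "nhlfe": 1})
mpls_heavy = allocate(TOTAL, {"ip": 2, "ilm": 5, "nhlfe": 3})

print(ip_heavy)    # {'ip': 80000, 'ilm': 10000, 'nhlfe': 10000}
print(mpls_heavy)  # {'ip': 20000, 'ilm': 50000, 'nhlfe': 30000}
```

With fixed, dedicated tables only one such split exists, whatever the deployment actually needs.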

Entropy can Never Decrease…?

One major enabler of scalable IP networks is the ability to distribute a massive amount of traffic and flows across multiple hierarchies of switches. Ideally, data flows are completely uncorrelated and therefore are distributed well, flowing smoothly across the multiple available network links.

This method relies on modern hash and multi-path algorithms, enabling operators to benefit from high-radix switches in hyperscale data centers, cloud, big data, machine learning and artificial intelligence clusters, content distribution networks, and more.

However, in the real world, traffic patterns can be bursty, and when the amount of entropy is limited, multiple flows often collide. This is particularly the case when only limited header information is hashed to determine the route through the network. Multiple flows can then align on a single link, creating an oversubscribed ‘hot spot’ or microburst that results in congestion, increased latency and, ultimately, packet loss and retransmission. Essentially, there is not enough entropy in the limited header information being hashed. What is needed is more entropy, so that flows are distributed more randomly.

To achieve that, ASICs need to look deeper into the packet, beyond the IP header, and gain entropy from L4 flow information. When forwarding MPLS packets this becomes even more important, since many flows may be encapsulated under just a few MPLS labels. Unfortunately, most ASICs lack the ability to read IP and L4 fields and hash them once they are encapsulated within MPLS headers. Some work has been done to use edge devices to “feed” entropy into the MPLS labels[1], but it ends up with cumbersome architectures.
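The effect is easy to reproduce in a small Python sketch. Everything here is hypothetical: the label values, the addresses, the number of links and the use of SHA-256 as a stand-in for whatever hash function a switch really uses. The sketch hashes 1000 flows that share one label stack, first on the labels alone and then on the inner IP and L4 fields.

```python
import hashlib

NUM_LINKS = 8  # hypothetical number of equal-cost links


def pick_link(fields):
    """Hash the given header fields and map the result onto a link index."""
    key = "|".join(str(f) for f in fields).encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % NUM_LINKS


# 1000 hypothetical flows that share one MPLS label stack but differ in
# their inner source address and L4 source port.
LABEL_STACK = (16001, 24005)
flows = [
    {"labels": LABEL_STACK, "src": f"10.0.{i // 250}.{i % 250 + 1}",
     "dst": "10.9.9.9", "proto": 6, "sport": 1024 + i, "dport": 443}
    for i in range(1000)
]

# Hashing only the label stack: identical input for every flow.
links_labels_only = {pick_link(f["labels"]) for f in flows}

# Hashing the inner 5-tuple: each flow contributes its own entropy.
links_inner_fields = {
    pick_link((f["src"], f["dst"], f["proto"], f["sport"], f["dport"]))
    for f in flows
}

print(len(links_labels_only))   # 1: every flow lands on the same link
print(len(links_inner_fields))  # number of distinct links actually used
```

When only the labels are hashed, all 1000 flows pile onto one link, which is exactly the hot spot described above. Hashing the inner fields spreads them across the available links.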

In addition to entropy, ASICs that can read and use data beyond the MPLS stack enable better security, filtering, and policy-based forwarding or routing. Ultimately, this gives operators visibility into their own networks.


Keep it simple, and scalable!

Looking at the original MPLS actions (swap, push and pop), it turns out that the simplest action is actually missing: forward. In modern MPLS architectures, the simple ‘forward’ primitive can be the norm, rather than swapping labels and then forwarding. Unfortunately, to accomplish this rather fundamental operation, most current switches actually implement swap-to-the-same-label. Interestingly enough, this is even what the specification of Segment Routing / SPRING suggests…[2] This technique causes massive scalability issues for MPLS networks that rely on multi-path techniques for load balancing and traffic management. The swap-to-same-label approach to implementing a simple forward requires a table entry for each combination of ingress label and egress interface. Forwarding table consumption therefore grows as the product of labels and interfaces, which scales poorly, particularly as switch port radix continues to grow. A simpler, better solution implements the forward primitive directly, which allows table entries to scale only as the sum of ingress labels and egress interfaces.[3] Even in modestly sized data centers, the scalability advantages become significant.
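A quick back-of-the-envelope calculation in Python shows the gap between the two scaling models. The label and interface counts are hypothetical and chosen only for illustration; the two helper functions simply encode the product and the sum described above.

```python
def swap_to_same_label_entries(num_labels, num_interfaces):
    """One entry per (ingress label, egress interface) combination."""
    return num_labels * num_interfaces


def forward_entries(num_labels, num_interfaces):
    """Labels and interfaces are stored independently of each other."""
    return num_labels + num_interfaces


labels, interfaces = 1000, 64  # hypothetical switch: 1000 labels, 64 egress interfaces

print(swap_to_same_label_entries(labels, interfaces))  # 64000
print(forward_entries(labels, interfaces))             # 1064
```

Doubling the interface count doubles the first number but adds only 64 entries to the second.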

To Summarize

Mellanox Spectrum switches are designed for scale. Modern scale-out networks can benefit from better architectures that enable flexibility, visibility and scalability for the operator:

  • Relying on fast and efficient SRAM memories, scale your MPLS / Segment Routing tables as you scale your IP tables!
  • BYOP – Build your own profile, flexible and dynamic allocation of resources for different use cases
  • Exploits visibility and entropy for MPLS multi-pathing, just as used for IP networks, looking through 10s of labels
  • Simple and scalable opcodes – enable efficient use of MPLS tables for multi-layer ECMP-based MPLS networks
  • Pushes 2-3x more MPLS labels to the packet
  • Enables disaggregated model for MPLS networks with the highest density in terms of Gbps/Watt and Gbps/RU

Supporting Resources:


[2] “CONTINUE: the active segment is not completed and hence remains active.  The CONTINUE instruction is implemented as the SWAP instruction in the MPLS dataplane…”

[3] Some suggestions to change this have already been made in the past; see for example:

About Barak Gafni

Barak Gafni is a Staff Architect at Mellanox Technologies, focusing on enabling the most scalable, agile and simple networks of tomorrow. He joined Mellanox in 2009 and has 12 years of experience in the networking industry. Barak holds a B.Sc. in EE from the University of Tel Aviv (Cum Laude), has co-authored multiple IETF RFCs and holds several patents in the networking space.
