All posts by Barak Gafni

About Barak Gafni

Barak Gafni is a Staff Architect at Mellanox Technologies, focusing on enabling the most scalable, agile and simple networks of tomorrow. He joined Mellanox in 2009 and has 12 years of experience in the networking industry. Barak holds a B.Sc. in EE from the University of Tel Aviv (Cum Laude), has co-authored multiple IETF RFCs and holds several patents in the space of networking.

On Segment(ed) Routing

Traffic Engineering On Segment(ed) Routing Using MPLS and IPv6

Segment Routing is one of the hottest technologies being considered to transform the way packets are handled in critical networking infrastructures. Whether it is in the core of the internet, within data centers, or between data centers, Segment Routing is seen as a simpler way to control networks, program packet routes and processing, and implement traffic engineering policies.

A closer look shows that essentially Segment Routing implementations are segmented depending on the underlying technology. Segment Routing architecture has been specified by the IETF (RFC 8402), but the actual implementation in the data plane, which carries the traffic eventually, has been done over two fundamentally distinct technologies – MPLS and IPv6. Thus, one architecture, but a choice between two data planes. The interesting part is that not only do these two data planes use different definitions of the segments, but they keep them in opposite orders (!!!).

Segment Routing Differences: To Put Things Visually

The MPLS data plane architecture allows a packet to carry multiple “labels”, acting as segment identifiers, or SIDs if you’d like. So does IPv6, through a Segment Routing Header (SRH), which is an extension to the IPv6 header; the SRH carries multiple segments as well.

And this is where the fun stuff begins:

The MPLS data plane places the next segment (represented as a label) in the stack closest to the MAC header (i.e. the beginning of the packet), and “pops” (removes) the segment after it is used to determine the next hop. Thus, the “current” segment and the next segment are always in the same “place” in the packet relative to the MAC header:

MPLS segment routing keeps the segment labels in order with the current/next label at the front of the packet, closest to the MAC header
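The pop behavior described above can be sketched in a few lines of Python. This is only an illustrative model, not any vendor's implementation: the `MplsPacket` class, the `consume_active_segment` helper and the label values are hypothetical, and the sketch only models what happens at each segment endpoint.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MplsPacket:
    # labels[0] is the top of the stack: the label closest to the MAC header.
    labels: List[int] = field(default_factory=list)
    payload: bytes = b""


def consume_active_segment(pkt: MplsPacket) -> int:
    """Return the active segment (top label) and pop it off the stack."""
    if not pkt.labels:
        raise ValueError("label stack is empty")
    return pkt.labels.pop(0)


# Hypothetical SIDs for a three-segment path.
pkt = MplsPacket(labels=[16001, 16002, 16003], payload=b"data")
visited = []
depths = []
while pkt.labels:
    depths.append(len(pkt.labels))   # the stack shrinks at every segment endpoint
    visited.append(consume_active_segment(pkt))
```

The active segment is always read from index 0, which is why a parser never has to search for it.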

On the other hand, IPv6 Segment Routing (also known as SRv6) carries the “current” segment as an IPv6 destination address, with the rest of the segments in the SRH, such that the last segment is the closest to the MAC header, and the next segment in the stack furthest from the MAC header, at least to begin with. Its location in the SRH varies and can be found using a pointer (called “Segments Left”) that indicates the next segment location; as the packet traverses the network, this pointer is updated without removing any of the segments from the stack:

Segment Routing with IPv6, or SRv6, keeps the segments in the reverse order compared with MPLS, and it doesn’t drop segments as they are used.
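For comparison, here is a minimal Python sketch of the SRv6 behavior, following the Segment List and Segments Left semantics of the SRH (RFC 8754). The `Srv6Packet` class, the helper names and the `2001:db8::/32` documentation addresses are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Srv6Packet:
    dst: str                  # IPv6 destination address = the active segment
    segment_list: List[str]   # SRH Segment List; index 0 holds the LAST segment
    segments_left: int        # index of the active segment in segment_list


def encapsulate(path: List[str]) -> Srv6Packet:
    """Build the packet at the source: segments are stored in reverse path order."""
    if not path:
        raise ValueError("path must contain at least one segment")
    return Srv6Packet(dst=path[0],
                      segment_list=list(reversed(path)),
                      segments_left=len(path) - 1)


def endpoint_process(pkt: Srv6Packet) -> bool:
    """At a segment endpoint: move the pointer and rewrite the destination.

    Returns False when the final segment has been reached. Nothing is
    removed from segment_list.
    """
    if pkt.segments_left == 0:
        return False
    pkt.segments_left -= 1
    pkt.dst = pkt.segment_list[pkt.segments_left]
    return True


srv6_path = ["2001:db8::1", "2001:db8::2", "2001:db8::3"]
srv6_pkt = encapsulate(srv6_path)
trace = [(srv6_pkt.dst, srv6_pkt.segments_left)]
while endpoint_process(srv6_pkt):
    trace.append((srv6_pkt.dst, srv6_pkt.segments_left))
```

Unlike the MPLS case, the list length never changes; only the pointer and the destination address move.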

How Would MPLS and IPv6 Affect My Network?

These two different underlying technologies make a big difference when considering high-speed data plane silicon to support segment routing. For the MPLS data plane, most switch ASICs were not built with segment routing in mind, while others, such as the Mellanox Spectrum switch, have much better capabilities. These include an additional set of Segment Routing oriented actions, deep parsing that enables the use of entropy from above the MPLS stack, unique data structures to scale segment routing for data centers, and enormous label table scalability. For more details see my previous blog: “Building a Simple, Scalable MPLS Segment Routing Network”.

For Segment Routing over IPv6 (SRv6), most switch ASIC implementations simply lack the ability to look deep enough into the packet, see the multiple segments, and use a segment that is located at an arbitrary offset within the SRH. The good news is that new data plane architectures are evolving and making their way into the market, enabling efficient, high-throughput implementations that support these requirements and that will further drive market adoption of both MPLS-SR and SRv6. The ability to deploy these products in data centers will enable network architects and administrators to apply traffic engineering and program the network inside IPv6-supported networks. Segment Routing, with all of its benefits, will no longer be a secret tool accessible only to those who run MPLS.
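To make the “arbitrary offset” concrete, here is a small Python calculation of where the active segment sits in the packet. It assumes the SRH immediately follows the fixed IPv6 header (no other extension headers) and uses the SRH layout from RFC 8754; the function name is hypothetical.

```python
IPV6_HEADER_LEN = 40   # fixed IPv6 header, in bytes
SRH_FIXED_LEN = 8      # SRH fields that precede the Segment List
SID_LEN = 16           # each segment is a 128-bit IPv6 address


def active_segment_offset(segments_left: int) -> int:
    """Byte offset, from the start of the IPv6 header, of Segment List[segments_left]."""
    if segments_left < 0:
        raise ValueError("segments_left must be non-negative")
    return IPV6_HEADER_LEN + SRH_FIXED_LEN + SID_LEN * segments_left


# A three-segment path starts with Segments Left = 2 and counts down.
offsets = [active_segment_offset(sl) for sl in (2, 1, 0)]
```

With MPLS the active label is always the first entry of the stack, so the equivalent offset is constant; here it depends on a field the parser must read first.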

I’d be glad to get your input, so talk to me if you’re interested in hearing more!


Building a Simple, Scalable MPLS / Segment Routing Network Just Got Easier

In recent years, IP networking has experienced a massive transformation, driven by the requirement of hyperscale data centers to achieve improved performance, density and scale, and to reduce CAPEX and OPEX spending. By contrast, MPLS architectures and solutions are lagging far behind, leaving operators with antiquated, inadequate and expensive options to develop their networks, and thus preventing them from taking full advantage of newer architectures, such as Segment Routing.

But… why??

Essentially, there are four critical reasons why commercial switching ASICs have underdelivered, causing MPLS networks to languish and underperform:

  1. Failure to deliver scalable forwarding capabilities
  2. Inflexibility of forwarding table resources limiting scalability
  3. Inability to provide adequate entropy for multi-pathing
  4. Failure to consider the most basic tag switching primitives

Failure of MPLS Scalability – Consider the Transistors

A major reason for the availability of scalable IP networking elements is the use of modern algorithms in order to implement Longest Prefix Match (“LPM”) forwarding, which is a major element of IP routing. Legacy implementations rely on TCAMs (Ternary Content Addressable Memories) to perform LPM lookups, which results in lower density and performance, and higher power and cost. Modern switching and routing ASICs available today implement algorithmic LPM lookups that utilize simpler SRAM memories – resulting in better performance and scalability and lower power solutions.

While this transformation has benefitted IP networks, most modern ASICs lack similar algorithmic capabilities to perform MPLS lookups. This is ironic given that the MPLS architecture was built in order to allow the use of SRAMs to start with… however, implementations have focused first on IP networks and thus MPLS capabilities have lagged behind. As a result, MPLS networks built from commercial ASICs are not competitive with IP networks in scalability, leaving MPLS operators lagging in innovation and without the right building blocks to scale their networks.
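To illustrate the difference between the two lookup types, here is a hedged Python sketch. It uses one well-known software technique for algorithmic LPM (one exact-match hash table per prefix length, probed from longest to shortest); it is not a description of how any particular ASIC works, and the class name, prefixes, labels and next-hop names are all made up for the example.

```python
import ipaddress
from typing import Dict, Optional


class HashLpm:
    """IPv4 longest prefix match built from exact-match tables (SRAM-friendly)."""

    def __init__(self) -> None:
        # prefix length -> {masked address as int -> next hop}
        self.tables: Dict[int, Dict[int, str]] = {}

    def add(self, prefix: str, next_hop: str) -> None:
        net = ipaddress.ip_network(prefix)
        self.tables.setdefault(net.prefixlen, {})[int(net.network_address)] = next_hop

    def lookup(self, addr: str) -> Optional[str]:
        ip = int(ipaddress.ip_address(addr))
        for plen in sorted(self.tables, reverse=True):   # longest prefix first
            mask = ((1 << plen) - 1) << (32 - plen)
            hit = self.tables[plen].get(ip & mask)
            if hit is not None:
                return hit
        return None


lpm = HashLpm()
lpm.add("0.0.0.0/0", "default")
lpm.add("10.0.0.0/8", "A")
lpm.add("10.1.0.0/16", "B")

# An MPLS lookup, by contrast, is a single exact match on the label.
ilm = {16001: "A", 16002: "B"}
mpls_next_hop = ilm.get(16002)
```

The IP lookup may need several probes, while the label lookup needs exactly one, which is why MPLS maps so naturally onto plain SRAM tables.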

Flexibility of Resource Allocation

Every network architecture and use case has different requirements. Unfortunately, most switching ASICs today do not allow operators to choose and optimize the available resources in order to suit their specific needs. For example, in some switches the lookup tables are dedicated to specific types of operations (e.g. L2, L3, ACL operations). This can leave the network architect with a shortage of one type of lookup resource and with an unused abundance of another. This resource inflexibility is certainly problematic for standard IP networks, but MPLS architectures suffer even more from hard-coded and dedicated tables. Most ASICs have fixed resources and thus are unable to trade off IP tables against MPLS tables. But even beyond this, they are unable to flexibly divide resources between ILM (Incoming Label Map) entries and NHLFE (Next Hop Label Forwarding Entry) entries. These design constraints lock operators into very restrictive models of using MPLS in their networks. However, these constraints are not dictated by the networking protocols themselves, but are rather a consequence of the switch ASIC architecture.
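A toy Python comparison shows why dedicated tables hurt. All capacities and demands below are invented numbers, and the sketch assumes, as a simplification, that every entry type costs the same amount of memory.

```python
from typing import Dict


def fits_fixed(partition: Dict[str, int], demand: Dict[str, int]) -> bool:
    """Dedicated tables: every table type has its own hard-wired capacity."""
    return all(n <= partition.get(kind, 0) for kind, n in demand.items())


def fits_shared(total_entries: int, demand: Dict[str, int]) -> bool:
    """Flexible allocation: all table types draw from one common pool."""
    return sum(demand.values()) <= total_entries


# Hypothetical switch with 200 entries, hard-partitioned by table type.
fixed_partition = {"ipv4": 100, "ilm": 50, "nhlfe": 50}

# Hypothetical MPLS-heavy use case that needs 190 entries in total.
mpls_heavy = {"ipv4": 10, "ilm": 120, "nhlfe": 60}

fixed_ok = fits_fixed(fixed_partition, mpls_heavy)
shared_ok = fits_shared(sum(fixed_partition.values()), mpls_heavy)
```

The same silicon budget rejects the profile when partitioned and accepts it when pooled, even though 90 IPv4 entries sit unused in the partitioned case.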

Entropy can Never Decrease…?

One major enabler of scalable IP networks is the ability to distribute a massive amount of traffic and flows across multiple hierarchies of switches. Ideally, data flows are completely uncorrelated and therefore are distributed well, flowing smoothly across the multiple available network links.

This method relies on modern hash and multi-path algorithms, enabling operators to benefit from high radix switches in hyperscale data centers, cloud, big data, machine learning, and artificial intelligence clusters, content distribution networks, and more.

However, in the real world, traffic patterns can be bursty and if the amount of entropy is limited, often multiple flows collide. This is particularly the case when only limited header information is hashed to determine the route through the network. In this case, multiple flows can align to travel on a single link, thereby creating an oversubscribed ‘hot spot’ or microburst, resulting in congestion, increased latency, and ultimately, packet loss and retransmission. Essentially, there is not enough entropy in the limited header information being hashed. What is therefore needed is to increase the entropy available to distribute the flows more randomly.

In order to achieve that, ASICs need to look deeper in the packet, beyond the IP header, and gain entropy from L4 flow information. While forwarding MPLS packets, this becomes even more important since multiple flows may get encapsulated with few MPLS labels. Unfortunately, most ASICs lack the ability to read IP and L4 fields and hash them once they are encapsulated within MPLS headers. Some work has been done in order to use edge devices to “feed” entropy into the MPLS labels[1], ending up with cumbersome architectures.

In addition to entropy, ASICs that are capable of reading and using data beyond the MPLS stack enable better security, filtering, and policy-based forwarding or routing. Ultimately, this gives operators visibility into their own networks.
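The effect of hashing through the label stack can be sketched in Python. This is an illustrative model only: the frame builders, the CRC32-based `ecmp_link` hash, the label values and the addresses are assumptions, and guessing IPv4 from the first nibble after the bottom-of-stack label is a common heuristic rather than a guaranteed rule.

```python
import socket
import struct
import zlib
from typing import List, Optional, Tuple


def mpls_entry(label: int, bottom: bool, ttl: int = 64) -> bytes:
    """Encode one 32-bit label stack entry: label(20) | TC(3) | S(1) | TTL(8)."""
    return struct.pack("!I", (label << 12) | (int(bottom) << 8) | ttl)


def ipv4_udp(src: str, dst: str, sport: int, dport: int) -> bytes:
    """Minimal IPv4 + UDP headers (checksums left at zero; enough for parsing)."""
    ip = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 28, 0, 0, 64, 17, 0,
                     socket.inet_aton(src), socket.inet_aton(dst))
    return ip + struct.pack("!HHHH", sport, dport, 8, 0)


def parse(frame: bytes) -> Tuple[List[int], Optional[tuple]]:
    """Walk the label stack to the S bit, then peek at the inner IPv4/L4 headers."""
    labels, off = [], 0
    while True:
        (word,) = struct.unpack_from("!I", frame, off)
        off += 4
        labels.append(word >> 12)
        if word & 0x100:               # bottom-of-stack bit
            break
    flow = None
    if off < len(frame) and frame[off] >> 4 == 4:   # heuristic: looks like IPv4
        ihl = (frame[off] & 0x0F) * 4
        proto = frame[off + 9]
        src, dst = frame[off + 12:off + 16], frame[off + 16:off + 20]
        ports = struct.unpack_from("!HH", frame, off + ihl) if proto in (6, 17) else (0, 0)
        flow = (src, dst, proto) + ports
    return labels, flow


def ecmp_link(key, n_links: int) -> int:
    """Pick an ECMP member from a hash of the chosen key (CRC32 as a stand-in)."""
    return zlib.crc32(repr(key).encode()) % n_links


# Two different flows that were encapsulated with the same two labels.
stack = mpls_entry(16001, False) + mpls_entry(16002, True)
flow_a = stack + ipv4_udp("10.0.0.1", "10.0.0.2", 1111, 4789)
flow_b = stack + ipv4_udp("10.0.0.1", "10.0.0.2", 2222, 4789)
labels_a, inner_a = parse(flow_a)
labels_b, inner_b = parse(flow_b)

shallow = (ecmp_link(labels_a, 4), ecmp_link(labels_b, 4))
deep = (ecmp_link((labels_a, inner_a), 4), ecmp_link((labels_b, inner_b), 4))
```

A labels-only hash has identical input for both flows, so they are forced onto the same link; a hash that includes the inner 5-tuple has distinct input and therefore at least has the chance to spread them.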


Keep it simple, and scalable!

Looking at the original MPLS actions (swap / push / pop), it turns out that the simplest action is actually missing: forward. In modern MPLS architectures, the simple ‘forward’ MPLS primitive can be the norm, rather than swapping labels and forwarding. Unfortunately, to accomplish this rather fundamental operation, most current switches actually implement swap-to-the-same-label in order to forward. Interestingly enough, this is even what the specification of Segment Routing / SPRING suggests…[2] This technique results in massive scalability issues for MPLS networks which rely on multi-path techniques for load balancing and traffic management. The swap-to-same-label approach for implementing a simple forward requires a table entry for each combination of ingress label and egress interface. This consumes forwarding table entries as the product of labels and interfaces, which scales poorly, particularly as switch port radix continues to grow. A simpler, better solution implements the forward primitive directly, which allows table entries to scale only as the sum of ingress labels and egress interfaces.[3] In even modestly sized data centers the scalability advantages become significant.
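The product-versus-sum difference is easy to check with a small Python model. The table layouts below are hypothetical simplifications, the label range and port count are made-up numbers, and the sum only holds under the stated assumption that all labels can share one ECMP group of egress interfaces.

```python
from typing import Dict, List, Tuple


def build_swap_tables(labels: List[int], interfaces: List[str]) -> Dict[Tuple[int, str], dict]:
    """Swap-to-same-label: one entry per (ingress label, egress interface) pair."""
    return {(label, ifc): {"out_label": label, "out_if": ifc}
            for label in labels for ifc in interfaces}


def build_forward_tables(labels: List[int], interfaces: List[str]):
    """Forward primitive: each label points at a shared ECMP group (assumed shared)."""
    ecmp_group = list(interfaces)               # interfaces are listed once
    ilm = {label: "group0" for label in labels} # label is left untouched
    return ilm, ecmp_group


# Hypothetical switch: 1000 labels, 64 equal-cost egress interfaces.
sr_labels = list(range(16000, 17000))
egress = [f"eth{i}" for i in range(64)]

swap_entries = len(build_swap_tables(sr_labels, egress))
fwd_ilm, fwd_group = build_forward_tables(sr_labels, egress)
forward_entries = len(fwd_ilm) + len(fwd_group)
```

With these numbers the swap-based tables need 64,000 entries while the forward-based tables need 1,064, and the gap widens as either the label count or the port radix grows.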

To Summarize

Mellanox Spectrum switches are designed for scale. Modern scale-out networks can benefit from better architectures that enable flexibility, visibility and scalability for the operator:

  • Relying on fast and efficient SRAM memories, scale your MPLS / Segment Routing tables as you scale your IP tables!
  • BYOP – Build your own profile, flexible and dynamic allocation of resources for different use cases
  • Exploits visibility and entropy for MPLS multi-pathing, just as used for IP networks, looking through tens of labels
  • Simple and scalable opcodes – enable efficient use of MPLS tables for multi-layer ECMP-based MPLS networks
  • Pushes 2-3x more MPLS labels to the packet
  • Enables disaggregated model for MPLS networks with the highest density in terms of Gbps/Watt and Gbps/RU

Supporting Resources:


[2] “CONTINUE: the active segment is not completed and hence remains active.  The CONTINUE instruction is implemented as the SWAP instruction in the MPLS dataplane…”

[3] Some suggestions to change that have already been made in the past; see for example: