All posts by David Iles

What Just Happened? Mellanox’s New Advanced Streaming Telemetry with Real Time Network Visibility

Let’s talk about network streaming telemetry and why you need it. If you have ever had problems trying to recreate issues, or a tough time figuring out why you were seeing packet loss, or if you are a network admin who has ever been blamed for an application outage or a server or storage performance issue, you need good network telemetry. Because the network enables applications to be accessed, to share data, and to connect to storage, good network streaming telemetry is also good application telemetry.

Some of you may be asking, what exactly is telemetry?

What is Telemetry? 

If you were driving a car, telemetry is the speedometer, tachometer, gas gauge, oil pressure gauge, engine temperature, and dashboard warning lights – the data you need to safely get you where you want to go and to know how the car is doing along the way.  Whether you are driving a car or flying a plane, you need good telemetry, and the faster you travel the more critical it is.  So too, if you are running a data center, deploying VMs and containers, or managing a storage deployment, you need visibility into what’s going on inside the network fabric. And the faster your network runs, or the more critical network performance is to your business, the more important that becomes. Switch streaming telemetry can give you that crucial visibility.

Shift from Protocols to Streaming Telemetry

The legacy position on network management has been that more is better: more protocols, more packets captured, and—in case of a problem—more deep digging through captured packets to find the cause and then the fix. But over the last several years there’s been a trend in data center networks towards simplification. The larger, or more advanced, the data center, the fewer the protocols they like to run. Back in my Tech Support days, we used to have a saying that “the smarter the customer, the shorter the config file.” This saying came from the fact that the guys who always had trouble were those who enabled every possible feature and protocol. You could estimate the number of problems by the length of the config files. We’ve seen this trend in the move to more L3 and away from L2 and all the versions of spanning tree, along with band-aids like root guard, loop guard, BPDU guard, and the rest.

The main exception to the simplification trend is the need for more visibility, as smart folks want to see what’s going on inside their network. As networks get larger and faster, savvy administrators are using fewer protocols but aiming for more network telemetry to achieve better visibility.
Some network admins want better streaming telemetry to improve their “mean time to innocence” – to speed up the time to find the root cause of issues so they can rule out whatever is NOT causing the problem and prove whether it’s really the server team’s fault (or maybe the storage team’s fault). Others are trying to get more out of their network. Most network teams don’t really know if their networks are under-utilized or over-utilized because they have poor visibility into what’s really going on. Without that understanding, it’s impossible to run the network efficiently or to grow it properly.

WJH is a switch-level monitoring solution where the switch ASIC monitors flows at line rate and alerts you when you have performance problems due to packet drops, congestion events, routing loops, and more:

For example, if you’re dropping packets because of a bad cable or bad optic, WJH will let you see those dropped packets and tell you why they were dropped. WJH will let you know if you’ve got congestion or buffer problems, or even security issues. For example, if you’re hitting a bunch of ACLs and they’re dropping packets, you’d like to know why, because you might have a corrupted server or VM. Or you might have a poorly configured ACL that’s causing problems.

In lossless environments, like NVMe over Fabrics (NVMe-oF) running on RoCE, you might have performance problems even though you are not dropping packets.  The performance issues could be due to congestion issues or excessive pause frames or latency issues.  It’s very common to find out the root cause is uneven load balancing across a LAG or ECMP group.   Whether your problem is packet drops or poor performance without packet drops, WJH was built to get to the bottom of those things and give you the best streaming telemetry for superior network visibility.

Pretty much every network in the world is going to have some packet drops. Sometimes it’s for bad reasons and sometimes it’s for good reasons. Many of the other switch telemetry solutions don’t deliver enough data to diagnose and solve problems. When a non-Mellanox switch drops a packet, that packet is sent to Bit Heaven, never to be seen again. The packet, and all that useful diagnostic information, will just disappear, and the most those switches will do is increment a vague counter. When you check that counter, the switch will say, “Oh, you’ve now dropped 504 packets, due to a bad VLAN.” But that’s it; those switches don’t tell you anything about the packet that was dropped, when it was dropped, or why it was dropped – just that it was dropped. So you don’t know if the packet was dropped because the switch was misconfigured, or because the server was misconfigured, or something else completely.

Other switch or network management solutions perform statistical sampling of packets from every port on every switch. This produces a staggeringly large boatload of packets, but not all the problem packets, so it doesn’t record when, why, or how a packet was lost. It also doesn’t properly explain how congestion started, what caused unacceptably high latency, or why traffic might have become unbalanced or been misrouted. When a problem is suspected, you need to sort through huge piles of saved packets and try to extrapolate (or guess) what really happened and why.

In these cases, you simultaneously have too much data (too many sampled packets) and yet not enough information (not enough details about the problem packets).  Everything on the network becomes suspect and determining what really happened can take many hours.  But, there is a better way!

How does WJH Work?

How Mellanox’s accelerated streaming telemetry technology finds dropped packets in real time

Mellanox’s What Just Happened (WJH) is a hardware-accelerated telemetry technology where the switch ASIC holds onto important parts of dropped packets. The switch won’t keep the whole packet, or all the normal packets, as that would consume a lot of space for little benefit. Instead, the switch keeps the important parts of the problem packet – like the source and destination IP and MAC, port numbers, etc. – along with very detailed descriptions of why, when, and where it was dropped. Because the switch is involved, it knows which packets to save and why those packets were dropped, or too slow, or misrouted. And with hardware acceleration, the switch can record all the relevant packets along with the important details, even while driving many ports of 25, 40, 50, or 100 (soon 200) Gigabit Ethernet.

For small deployments, you can log into the switch and quickly see what’s going wrong in your network. But for larger deployments, WJH can stream these packets out to a centralized database using gRPC. This works with turnkey solutions, like Mellanox NEO, and because the data lands in a standard database, it works with open source tools like Kibana and Grafana. If you are a network expert, or have been to Sniffer University, and want to look at the actual packet capture, the switch can generate a pcap file of all the dropped packets so you can examine it in Wireshark.
WJH helps you get to the bottom of problems by showing who’s being impacted – which applications and which servers – along with what’s causing the problem and when and where it is in your network.
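For larger deployments that stream events to a database, a collector can summarize the drop records to surface the worst offenders. Here is a minimal Python sketch of that idea – the event fields (`sip`, `drop_reason`, `ingress_port`, etc.) are illustrative placeholders, not the actual WJH schema:

```python
from collections import Counter

# Hypothetical WJH drop events, as they might arrive from the switch's
# gRPC stream (field names here are illustrative, not the real schema).
events = [
    {"sip": "10.0.0.5", "dip": "10.0.1.9", "dport": 4791,
     "drop_reason": "Ingress VLAN filter", "ingress_port": "Eth1/12"},
    {"sip": "10.0.0.5", "dip": "10.0.1.9", "dport": 4791,
     "drop_reason": "Ingress VLAN filter", "ingress_port": "Eth1/12"},
    {"sip": "10.0.2.7", "dip": "10.0.3.3", "dport": 443,
     "drop_reason": "ACL deny", "ingress_port": "Eth1/20"},
]

# Tally drops by reason and by ingress port to spot the worst offenders.
by_reason = Counter(e["drop_reason"] for e in events)
by_port = Counter(e["ingress_port"] for e in events)

print(by_reason.most_common(1))  # [('Ingress VLAN filter', 2)]
```

A real collector would do the same grouping continuously as events stream in, then feed the rollups to a dashboard.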

A New Hope in Network Telemetry

WJH is a new way of monitoring a network. Traditional network monitoring tools collect tons of innocent data and counters. They may even use sFlow to sample random packets, the idea being that you collect all this information to extrapolate, or guess, what went wrong in your network:

For some reason, the trickiest network problems usually occur at night or on the weekend, and then you’ve got to leave the football game or dinner to sift through a mountain of data to find the root cause. You try to guess the culprit that’s causing all the trouble. There are even predictive analytics tools that say, hey, we’ll look at that mountain of data for you and give you maybe 60–70% confidence that we’ve found the root cause. They’ll do that guesswork for you, but at the end of the day it’s still just guesswork. The problem is you have too much data (from packet sampling) but often not the most important data (the What, Where, When, and Why).

Advanced streaming telemetry technology built for real-time visibility

WJH is a new way of monitoring a network, focusing on data plane anomalies and built to give you back your weekends. WJH quickly shows you both the victims and the packet troublemakers or bandwidth bullies in your network. You can keep collecting those huge piles of data about innocent devices and events and try to crunch them, but WJH will give you the smoking gun – the actual root cause at the scene of the crime, recorded first-hand by the switch that had to drop the packet.

No more Problem-Recreation Drama!

WJH also breaks the Problem-Recreate cycle:


The old approach was to guess when a problem would re-occur and set up a recreate scenario on a test bed or packet trace, only to have the problem not reveal itself – so you retry the next week… and the week after…. This was the impetus for Mellanox’s What Just Happened advanced telemetry technology. With WJH, because we hold onto those packets that are getting dropped, and we report on them, we’re able to help you get to the root cause and give you network visibility without needing to reproduce issues to resolve them.

So how do I deploy WJH?

Now I know some of you are thinking, “this sounds amazing, but I can’t replace my entire network with Mellanox switches”. The great thing about WJH is that it works independently of the rest of the network. WJH running on one switch can report errors that are likely happening on other switches in that tier of the network doing a similar function. This is very different from in-band telemetry, which works best with all switches from the same vendor.

It’s super simple to get started with WJH:

WJH Deployment Procedure

Step 1 – Most people start using WJH by doing a Network Scan, which is done by enabling WJH on a switch they have plugged into their production network. People are almost always surprised at the kinds of errors they learn about, and the Network Admin is super happy to learn what’s going on. So step one is just to turn on WJH and see what’s really going on in your network.
Step 2 – Next is the clean-up phase, where people resolve the network, server, and storage issues that WJH found.

Step 3 – This is where you personalize WJH for your network and your management needs:

  • You might set some filters because you don’t need to report certain kinds of “normal” errors, or even log or store them
  • You might set the WJH agent to aggregation mode if your issues tend to show up as thousands of copies of the exact same packet. Aggregation mode stores just one copy of that problem packet instead of 1000 identical problem packets.
  • You might set the severity level of issues that matter to you. Some might be critical and require immediate notification, while others you could check later, or even ignore.
  • You can set actions for the severity levels; for example, you might want a text sent on critical issues, an email on significant issues, and no alerts on minor issues.
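Aggregation mode is easy to picture in code. A minimal sketch, where identical problem packets are collapsed into one stored copy plus a count (the event tuples are made up for illustration):

```python
from collections import Counter

# A sketch of WJH-style "aggregation mode": instead of storing thousands
# of identical problem packets, keep one copy per unique
# (src, dst, drop reason) key plus a hit count. Values are illustrative.
raw_events = [("10.0.0.5", "10.0.1.9", "ACL deny")] * 1000 + \
             [("10.0.2.7", "10.0.3.3", "Bad checksum")] * 3

aggregated = Counter(raw_events)

# 1003 raw events collapse into just two stored records.
assert len(aggregated) == 2
assert aggregated[("10.0.0.5", "10.0.1.9", "ACL deny")] == 1000
```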

WJH is a great tool for advanced network-heads as well as the network novice who just wants a simple way to distinguish network issues from server and storage issues. With WJH, you don’t have to be a network expert to very quickly find the root causes of performance problems.

Advanced streaming telemetry technology is good for your business – it will help you get more performance, uptime, and productivity out of the networks you’ve paid for.


How to Make Your Leaf/Spine Network Hum

In my last blog, I wrote about why you want a leaf/spine network.

In this blog, I will discuss how to tune up your Leaf/Spine network:

  1. Better fabric performance through ECMP tuning
  2. Non-disruptive failover
  3. Hitless upgrades
  4. Automated provisioning
  5. Stretching VLANs & multi-tenant security
  6. Simple IP Mobility
  7. Decrease optics spend
  8. Eliminate licenses for basic features

The way modern leaf/spine topologies work is that all switches operate at Layer 3 and use a routing protocol, like BGP or OSPF, to advertise the reachability of their local networks.  All traffic from a leaf gets load balanced across a number of spine switches.

The key technology that enables leaf/spine topologies is Equal Cost Multi Pathing (ECMP), which the leaf switches use to send traffic evenly across multiple spine switches.  ECMP is a standard, non-proprietary, feature available on all modern switches.

However, not all switches are created equal, so here are some ways to tune up your leaf/spine network:

Tune up #1: Better Fabric Performance through ECMP Tuning

When a switch load balances packets across multiple paths, it does not simply round-robin each packet out a different link, because that would allow packets of a flow to arrive out of order, which hosts handle poorly. So, instead of round-robin forwarding, switches distribute packets across equal paths in a way that forces all packets from a given flow to always take the same path through the network. This deterministic forwarding is accomplished by making the path selection using packet fields that stay consistent for the entire session. For example, a switch could hash on just the source IP address, and then all the packets from a given server would always take the same path. Every data center is a bit unique, with unique traffic patterns, address schemes, and applications, so if traffic isn’t load balanced well, the ECMP parameters used for load balancing may need to change, and it is considered a best practice to use as many tuples (or variables) as your switches allow.
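The deterministic path selection described above can be sketched in a few lines of Python. Real switches do this in the ASIC; the hash function and field choice here are illustrative:

```python
import hashlib

def ecmp_path(sip, dip, sport, dport, num_paths):
    """Pick a path by hashing fields that stay constant for the flow."""
    key = f"{sip}|{dip}|{sport}|{dport}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % num_paths

# Every packet of a given flow hashes to the same path, so packets
# never arrive out of order.
p1 = ecmp_path("10.0.0.5", "10.0.1.9", 33333, 443, 4)
p2 = ecmp_path("10.0.0.5", "10.0.1.9", 33333, 443, 4)
assert p1 == p2
```

Adding more fields (more tuples) to the hash key gives finer-grained flow distribution, which is why the best practice is to use as many as the switch supports.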

Solution: better hash keys

With larger networks, it is considered best practice to use different ECMP parameters for the spine versus what is used on the leaf or super-spine switches.  Using the same ECMP parameters at multiple tiers can lead to polarized traffic flows (which is a bad thing), where all traffic gets sent on a single path leaving remaining paths unused.  For example, you may want to use SIP+DIP at the Leaf, and use SIP+DIP+DPORT on the Spine switches.

That said, Mellanox switches use a unique hash seed value per switch to avoid polarization without needing to change your hash parameters.

Solution: Use Mellanox switches, or use different ECMP parameters at each tier of the leaf/spine network
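A toy simulation shows why reusing the same hash at two tiers polarizes traffic, and why a per-switch hash seed fixes it. The flow strings and seed values below are made up for illustration:

```python
import hashlib

def path(flow, num_paths, seed=0):
    """Hash-based ECMP path choice, optionally salted with a switch seed."""
    key = f"{seed}|{flow}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % num_paths

flows = [f"10.0.0.{i}->10.0.1.{i % 7}:{5000 + i}" for i in range(200)]

# Flows the leaf sends to spine 0 (leaf hashes across 4 spines):
at_spine0 = [f for f in flows if path(f, 4, seed=0) == 0]

# Same hash (same seed) at the spine tier -> polarization: every one of
# these flows picks the same spine uplink out of 2.
uplinks_same = {path(f, 2, seed=0) for f in at_spine0}
assert uplinks_same == {0}

# A per-switch hash seed de-correlates the tiers: both uplinks get used.
uplinks_seeded = {path(f, 2, seed=42) for f in at_spine0}
assert uplinks_seeded == {0, 1}
```

Changing the hashed fields per tier (SIP+DIP at the leaf, SIP+DIP+DPORT at the spine) breaks the correlation the same way the seed does.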

Trouble with old switches:

To get line-rate performance, ECMP forwarding decisions are executed in hardware – inside the switch ASIC. Some switches do this very well, but older switches only work well if the number of paths is a power of 2. Meaning, with 2, 4, or 8 paths they get good distribution, but with 3 or 5 paths they get really bad distribution, where one path gets overloaded while others go unused:

The root cause is too few ECMP hash buckets in the switch ASIC.  The Mellanox Spectrum switch ASIC has thousands of hash buckets, so it gets great traffic distribution – regardless of the number of links.

Solution: Use modern switching hardware – like Mellanox Spectrum switches
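A quick simulation illustrates the hash-bucket effect. The bucket counts below are illustrative, not actual ASIC values:

```python
import hashlib
from collections import Counter

def bucket(flow, num_buckets):
    h = hashlib.sha256(flow.encode()).digest()
    return int.from_bytes(h[:4], "big") % num_buckets

flows = [f"flow-{i}" for i in range(30000)]
num_paths = 3

# Old ASIC: only 4 hash buckets statically mapped onto 3 paths, so one
# path owns two buckets and carries roughly half the traffic.
bucket_to_path = {0: 0, 1: 1, 2: 2, 3: 0}   # path 0 gets a double share
old = Counter(bucket_to_path[bucket(f, 4)] for f in flows)

# Modern ASIC: thousands of buckets spread over the same 3 paths, so the
# leftover-bucket effect becomes negligible and distribution is near-even.
new = Counter(bucket(f, 4096) % num_paths for f in flows)

assert max(old.values()) > 1.4 * min(old.values())   # badly skewed
assert max(new.values()) < 1.1 * min(new.values())   # near-even
```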


Tune-up #2: Non-Disruptive Failover

When a path fails in traditional ECMP, all flows in the network can get disrupted. The reason is that when a path fails, all the flows in the ECMP group get redistributed across the remaining paths – even flows on the non-failed links are rebalanced and get disrupted. This rebalancing of flows leads to out-of-order packets and retransmissions, which momentarily disrupts services across the entire IP fabric. Then, when the path is restored, this disruption gets repeated as all the flows get rebalanced again, further upsetting the applications flowing over this infrastructure.


Resilient Hashing solves this problem by rebalancing only those flows that were assigned to the path that failed – the flows on the un-impacted paths are left undisturbed:

Solution: Resilient Hashing
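Resilient hashing is easy to sketch: a fixed table of hash buckets maps flows to paths, and when a path fails, only that path's buckets get re-pointed. The switch names below are illustrative:

```python
import hashlib

NUM_BUCKETS = 64

def bucket_of(flow):
    h = hashlib.sha256(flow.encode()).digest()
    return int.from_bytes(h[:4], "big") % NUM_BUCKETS

# Fixed bucket table: each bucket points at a path.
paths = ["spine1", "spine2", "spine3", "spine4"]
table = {b: paths[b % len(paths)] for b in range(NUM_BUCKETS)}

flows = [f"10.0.0.{i}->10.0.9.{i}" for i in range(100)]
before = {f: table[bucket_of(f)] for f in flows}

# spine3 fails: re-point ONLY its buckets onto the survivors.
survivors = ["spine1", "spine2", "spine4"]
for b, p in table.items():
    if p == "spine3":
        table[b] = survivors[b % len(survivors)]

after = {f: table[bucket_of(f)] for f in flows}

# Every flow that wasn't on spine3 keeps its original path.
assert all(after[f] == before[f] for f in flows if before[f] != "spine3")
```

Classic ECMP would recompute every flow's assignment on failure; the fixed bucket table is what confines the disruption to the failed path's flows.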

What if your switches don’t support resilient hashing?

Some folks will use Layer 2 Link Aggregation (LAG) on a pair of links between each leaf and spine in an attempt to minimize this problem. This workaround allows a single link to fail without any traffic being rebalanced. The downside is that hashing first across the ECMP group and then across each LAG gives you less than ideal traffic distribution, and the workaround provides no benefit if a spine switch goes offline.

Tune-up #3: Hitless upgrades

In the old days of networking, updating a big modular switch meant taking down half the network, so all upgrades had to be scheduled around specific maintenance windows. Now, the first time you update a big switch, you might trust a vendor’s In Service Software Upgrade (ISSU) feature – that is, until you discover it frequently won’t work for “major” software upgrades, which commonly include SDK upgrades, changes in the ISSU functionality itself, or an FPGA/CPLD firmware re-flash. With the leaf/spine approach, updating a switch becomes a trivial event:

Step 1: change the route cost on the switch and watch flows move away from it until all traffic has been redirected to other switches

Step 2: upgrade the firmware & restore the route cost

This upgrade process can easily be automated. In fact, it should be automated, as another data center best practice is to automate anything that needs to be done more than once. Which brings us to the next topic:
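The two steps above can be sketched as a simple simulation: traffic follows the lowest-cost paths, so raising one switch's route cost drains it before you touch the firmware. The switch names and cost values are illustrative:

```python
# Traffic follows the lowest-cost paths; raising one switch's route cost
# moves flows off it before the firmware upgrade. Values are illustrative.
costs = {"spine1": 10, "spine2": 10, "spine3": 10, "spine4": 10}

def active_paths(costs):
    """Return the switches currently carrying traffic (lowest cost wins)."""
    best = min(costs.values())
    return sorted(p for p, c in costs.items() if c == best)

assert active_paths(costs) == ["spine1", "spine2", "spine3", "spine4"]

# Step 1: raise the route cost on spine3 and watch flows move away.
costs["spine3"] = 100
assert "spine3" not in active_paths(costs)

# Step 2: upgrade spine3's firmware, then restore the route cost.
costs["spine3"] = 10
assert "spine3" in active_paths(costs)
```

In practice the cost change would be pushed with your routing protocol's configuration (and ideally driven by an automation tool), but the drain-upgrade-restore sequence is exactly this simple.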


Tune-up #4: Automated Network Provisioning

There is a long held axiom that for every $1 spent on CAPEX, another $3 gets spent on OPEX.  The key to lowering that OPEX to CAPEX ratio is through automation.

Automation is needed for any large network and even more so with leaf/spine topologies.  Server automation tools like Puppet, Chef, and Ansible are now being applied to networks.  Ansible, in particular, has gained in popularity and is a great tool for enabling Zero Touch Provisioning (ZTP) for your network.  With ZTP, you can RMA a switch and the replacement switch is automatically provisioned with the correct IP addresses, configuration, and routing profile.  We provide sample Ansible scripts to make it easy for folks to start automating their networks.

One way to make it really easy to automate is to use a feature called IP unnumbered, which, besides conserving IP addresses, makes the configuration very simple – in some cases, the only config difference from one switch to the next is the loopback address. Another automation trick is to reuse the same pair of ASNs on all your ToRs to further simplify the switch configs.

Automation pays additional dividends by making your network more reliable. The majority of data center outages can be attributed to a lack of automation – either a simple fat-finger mistake where someone misconfigured a switch, or a security policy that was not followed. When everything is automated, there are no fat-finger mistakes, and security policies get implemented every time a switch is configured.


Tune-up #5: Stretching VLANs between racks and multitenancy

With most leaf/spine designs, each rack has a unique subnet or broadcast domain – which is a good thing since that limits how far trouble can spread.  However, it also keeps VMs from being able to live migrate from one rack to another because their VLAN is effectively trapped to a single rack.  The usual solution for this has been to stretch the VLAN across Layer 3 boundaries using an overlay technology like VXLAN and, historically, this was done with Software VTEPs on the servers.  These Software VTEPs were orchestrated by a centralized controller that would corral all the VTEPs and share the MAC reachability between the servers.

But now there is a controller-free VXLAN solution, called EVPN, which is great for putting a VLAN anywhere in the data center, or even stretching VLANs between data centers in some cases. One of the key benefits of using EVPN is that the VXLAN tunnels terminate on the switches, which makes it easy to deploy with bare-metal servers. Here’s a recent webinar I gave on EVPN: Controller-Free VXLAN for your data center


Tune-up #6: Simple IP Mobility without Overlays

A major trend in larger data centers is to use Routing on the Host (RoH) to provide IP mobility, allowing containers to move anywhere in the data center. RoH is particularly useful for Software as a Service (SaaS) data centers because, in many cases, there is just one “tenant” on the network. As such, they don’t need VXLAN to carve up their network, but they do need to move containers anywhere in the data center. RoH solves this simply by running a routing protocol on the server: every container gets a unique /32 IP address, which the host server advertises to the rest of the network using a dynamic routing protocol.

This is a very simple approach that eliminates the need for Controllers and complex NAT schemes.  It further simplifies the network as it also eliminates the need for a slew of protocols like Spanning tree, Loop-guard, Root guard, BPDU Guard, Uplink-Fast, Assured Forwarding, GVRP, or VTP.  Your datacenter becomes almost a one-protocol environment, which minimizes configuration and troubleshooting efforts.
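The RoH mobility model can be sketched as a tiny routing table simulation: moving a container just means the new host advertises its /32 while the old host withdraws it. Host names and addresses below are made up for illustration:

```python
# A minimal sketch of Routing on the Host mobility: the network's view
# is just "which host advertises this /32 right now". Names illustrative.
rib = {}  # prefix -> advertising host

def advertise(host, prefix):
    rib[prefix] = host

def withdraw(prefix):
    rib.pop(prefix, None)

advertise("host-rack1", "10.99.0.7/32")   # container starts in rack 1
assert rib["10.99.0.7/32"] == "host-rack1"

# Move the container: rack 1 withdraws, rack 7 advertises the same /32.
withdraw("10.99.0.7/32")
advertise("host-rack7", "10.99.0.7/32")
assert rib["10.99.0.7/32"] == "host-rack7"  # same IP, new location
```

No overlay, no controller, no NAT – the dynamic routing protocol does all the work of propagating the /32 everywhere it needs to go.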


With RoH, the servers also support ECMP, so there is no need for LACP/Bonding on the servers or proprietary features like VPC or MLAG on the switches.  This also frees up the MLAG ISL ports on the Top of Rack switches to use for forwarding traffic.  Another benefit is that you are no longer limited to 2 TOR switches per rack, and could dual/triple/quadruple-home your servers.

However, as simple as it sounds, RoH is usually reserved for more modern DevOps environments where there is no division between the server and network teams.


Tune-up #7: Reduce your spend on Optics

Don’t believe people telling you that connecting servers to an End of Row chassis switch will reduce costs because it eliminates the need for a ToR. Just add up the optics and cabling costs and you’ll be very happy with the price of the ToR – not to mention your DC design will be nice and clean. Other things you can do to lower your optics spend are:

  1. Avoid buying optics from a network vendor that doesn’t actually make optics. Most network vendors just relabel generic optics and then charge a huge markup while providing no additional value
  2. Use Active Optical Cables (AOCs) instead of transceivers. This is a cost-saving tool used in the largest supercomputers in the world.  AOCs are basically fiber cables with the transceivers permanently attached which reduces costs.
  3. In large topologies, place the spine and super-spine close together so you can use inexpensive copper DAC cables for those connections

Tune-up #8: Eliminate Licenses for basic features

Some network vendors use bait-and-switch tactics where they quote you a great price for the switch hardware, win your business, and then later explain that you need to pay extra for basic features like BGP or ZTP.  Nobody pays extra to PXE boot their server, why would you pay extra to network boot a switch?

Reduce what you pay for “table stakes” features like VXLAN, BGP, EVPN, ZTP, Monitoring, and Mirroring by using one of two options:

Option 1: The best way to reduce those license costs is to use a network vendor that doesn’t play those tricks

Option 2: Introduce a 2nd vendor into your network and split your IT spend. Those licenses have zero actual cost, so use your second vendor as leverage to get the “bad” vendor to throw the licenses in for free. Or, you know, you could just increase your spend with the “good” vendor (see option 1)

You shouldn’t have to pay exorbitant license fees for basic features like ZTP, BGP, or VXLAN.  These aren’t exotic features.  They are basic features that are expected in modern data center class switches.  These licenses are like car-salesman tricks, trying to get you to pay extra for “undercoating” and “headlight fluid”.



Leaf/Spine topologies are the best way to build data center networks.

If you want to try out a Leaf/Spine POC, we have an “easy button” where you can get a recommended topology, a BOM, sample configuration files, and sample Ansible scripts to ZTP the entire fabric. We even have a test plan you can use to verify the performance as well as the expected behavior: Easy POC


Why Leaf/Spine Networks are Taking Off

Unless you’ve been stranded on a deserted island, you’ve probably noticed that leaf/spine networks have started taking over data centers in the last few years. It’s no secret that people prefer scale-out over scale-up solutions, and for networking, the old scale-up approach was to use massive monolithic modular switches, a.k.a. Big Frickin’ Switches (BFS), whenever high port counts were needed. These BFS switches have gone the way of mainframes and Pokémon Go; some people still play with them, but all the cool kids have moved on to leaf/spine networks. The market data clearly shows this trend: years ago the majority of switches were modular, but now around 75% of all data center switch ports belong to fixed-port switches:

Source: Crehan Research, January 2017

Now, you shouldn’t look down on old companies still buying the BFS approach from Cisco/Arista any more than you should judge an elderly person too harshly for smoking cigarettes. They didn’t know any better when they started, and now they are too old to change. But whenever you see a young person smoking, you can’t help thinking “really?” I get the same feeling whenever I see networks designed with big modular switches. I feel like asking, “haven’t you seen the warning labels?” (i.e., the price tags and power consumption figures).
Now, I’m not saying BFS are evil, but they *are* deployed like Sith Lords from Star Wars “always two there are, no more, no less”, while scale-out leaf/spine architectures spread traffic across many small fixed port switches:

Modern data centers use fixed port switches in leaf/spine topologies for all the right reasons:

Leaf/Spine networks scale very simply, just adding switches incrementally as growth is needed, but they do have some natural sweet spots. For example, since a 32 port spine switch can connect to 32 leaf switches, a natural pod size might be 32 racks of servers with 2 tiers of switching, serving around 1500 10/25GbE servers:

If you need a larger network, you would deploy these leaf/spine switches in “Pods” that represent a logical building block you can easily replicate as needed. In each Pod, you would reserve half of the spine ports for connecting to a super-spine, which allows for non-blocking connectivity between Pods. A best practice in leaf/spine topologies with a lot of east/west traffic is to keep everything non-blocking above the leaf switches. This makes a natural pod size of up to 16 server racks, or 768 servers per pod, and you could easily have up to 32 pods for around 24,000 servers:
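The pod arithmetic above is easy to check, assuming 48-port leaf switches and 32-port spines (illustrative port counts):

```python
# Sanity-checking the leaf/spine pod sizing, assuming 48 server-facing
# ports per leaf and 32-port spine switches (illustrative port counts).
leaf_server_ports = 48

# Single-tier pod: a 32-port spine fans out to 32 leaf switches.
assert 32 * leaf_server_ports == 1536            # ~1500 servers

# Reserving half the spine ports for the super-spine leaves room for
# 16 leaves (16 racks) per pod:
servers_per_pod = 16 * leaf_server_ports
assert servers_per_pod == 768

# 32 such pods behind a 32-port super-spine:
assert 32 * servers_per_pod == 24576             # ~24,000 servers
```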

There is no one-size-fits-all solution, so these are just examples to show what is possible, but whether you have 300 nodes or 30,000 nodes, chances are a leaf/spine network will work better for you than the old scale up model.
Now, I know there is resistance to adopting new approaches, and some of you are looking at a group of super-spine switches and thinking that a couple of BFS switches would be easier than all those little spine switches. The thing to remember is that they take up around the same amount of rack space, consume more power, and…
Guess which one is going to perform best?
Guess which one is going to cost less?
I’ll give you a hint: just turn it around and ask whether you should buy 16 small fixed-port switches, or 2 modular chassis + 16 line cards + 4 supervisor modules + 8 fabric cards.
You can see where Facebook is using 1U fixed port switches where their largest modular chassis aren’t big enough: Facebook Fabric Aggregator

In my next blog, I will write about how to make your Leaf and Spine network hum:

  1. Better fabric performance with ECMP tuning
  2. Non-disruptive failover
  3. Hitless upgrades
  4. Automated provisioning
  5. Stretching VLANs & multi-tenant security
  6. Simple IP Mobility
  7. Decrease optics spend
  8. Eliminate licenses for basic features

Who’s Tapping Your Lines and Snooping On Your Apps?

Spoiler alert – it should be you!

No one would argue against the importance of good vision for a surgeon, a welder, or an Uber driver. In technology, whether you’re a Cloud Architect or in Network Operations, you really need good visibility into what is going on inside your data center. To sleep soundly at night, you have to actively monitor your network performance and your application performance, and be on the lookout for security breaches. There are analyzers that specialize in each of these three distinct monitoring disciplines: Network Performance, Application Performance, and Security.

How do you get the right traffic to the right analyzers?

You need to “tap your own lines” by placing TAPs at key points in your network. These TAPs will copy all the data traversing the links they are attached to. Then, you need to aggregate those TAPs, consolidating all the flows into a few high-bandwidth links on the analyzers. The modern, scaled-out approach for consolidating TAPs is to use a Software Defined TAP Aggregation Fabric, which amounts to a bunch of Ethernet switches that are specialized only in that they don’t run normal Layer 2/3 protocols. Instead, they steer specific flows to specific analyzers.

TAP Aggregation Fabric

You might want the TAP Aggregation fabric to do more than just steer the right flows to the right analyzers. You may want your TAP Aggregators to do some of the following:

  • Filter out unwanted flows, which saves bandwidth to the analyzers and increases their utilization
  • Truncate packets – to remove unneeded payload data – especially if your analyzers only look at the packet headers
  • Source tagging – to identify where packets came from by changing the MAC address or pushing on a VLAN tag
  • Time-stamping – to identify exactly when packets hit the wire
  • Matching inside tunnels – to forward the right tunneled traffic to the right analyzer, while preserving the MPLS or VXLAN tunnel headers
  • Centralized management – to configure all the TAP Aggregation switches from a single control point. The per-flow filtering and forwarding rules can be configured a number of ways, but most people like to use an OpenFlow controller which is almost purpose built for this type of application. An added bonus is that it makes automation super easy since the individual switches configs are dead simple.
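Per-flow steering is easy to picture as an ordered rule table – most-specific match first – which is essentially what an OpenFlow controller programs into each TAP aggregation switch. A minimal Python sketch (the rule fields and analyzer names are illustrative):

```python
# Ordered steering rules, most-specific first; the last rule is the
# default "filter everything else". Fields and names are illustrative.
rules = [
    {"match": {"dport": 443},         "action": "security-analyzer"},
    {"match": {"sip": "10.1.0.0/16"}, "action": "app-perf-analyzer"},
    {"match": {},                     "action": "drop"},
]

def ip_in_prefix(ip, prefix):
    """True if dotted-quad ip falls inside the CIDR prefix."""
    net, bits = prefix.split("/")
    shift = 32 - int(bits)
    to_int = lambda a: int.from_bytes(bytes(map(int, a.split("."))), "big")
    return to_int(ip) >> shift == to_int(net) >> shift

def steer(pkt):
    """Return the action of the first rule that matches the packet."""
    for rule in rules:
        m = rule["match"]
        if "dport" in m and pkt["dport"] != m["dport"]:
            continue
        if "sip" in m and not ip_in_prefix(pkt["sip"], m["sip"]):
            continue
        return rule["action"]

assert steer({"sip": "10.9.9.9", "dport": 443}) == "security-analyzer"
assert steer({"sip": "10.1.2.3", "dport": 80}) == "app-perf-analyzer"
assert steer({"sip": "192.168.1.1", "dport": 80}) == "drop"
```

Because every switch holds only a flat rule table like this, the per-switch configs stay dead simple and automation becomes trivial.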

Where do you TAP your network?

There is no universal consensus on where to place your TAPs, but there are some very common models:

Financial Services organizations frequently TAP every Tier of their network, so they can measure the latency as packets traverse the network while they also implement security monitoring.

Many Cloud Providers TAP every Rack in their data centers for their own monitoring purposes, as well as offering Application Performance reports to their customers.


How do you know what traffic to send for analysis?

If you have ever enabled too many debug features on a Cisco/Arista switch, you are rightfully a bit cautious. (Friendly advice: don’t do it unless you also want a switch reboot.)

TAP Aggregation switches are the ideal place to implement heavy duty Telemetry features because they cannot impact your production network.

One technique for determining which flows need to be analyzed is to start monitoring your traffic with sFlow. sFlow can give you a picture of the busiest flows, top talkers, top protocols, most flows, and various traffic anomalies. It can help you detect and diagnose network problems. It can also provide a glimpse into which applications are using the network most.
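To make the sampling idea concrete, here is a toy Python sketch of how “top talkers” can be estimated from sFlow-style samples. The record fields and 1-in-1000 sampling rate are hypothetical; real sFlow exports sampled packet headers plus interface counters, but the scaling logic is the same.

```python
from collections import Counter

# sFlow samples 1-in-N packets, so scaling each sampled flow's bytes by the
# sampling rate gives an estimate of the true per-flow traffic volume.

SAMPLING_RATE = 1000  # hypothetical 1-in-1000 packet sampling

samples = [
    {"src": "10.0.0.5", "dst": "10.0.1.9", "bytes": 1400},
    {"src": "10.0.0.5", "dst": "10.0.1.9", "bytes": 1400},
    {"src": "10.0.0.7", "dst": "10.0.1.2", "bytes": 64},
]

def top_talkers(samples, n=3):
    """Estimate the n busiest (src, dst) flows from sampled records."""
    estimated = Counter()
    for s in samples:
        estimated[(s["src"], s["dst"])] += s["bytes"] * SAMPLING_RATE
    return estimated.most_common(n)

top_talkers(samples, n=1)  # the 10.0.0.5 -> 10.0.1.9 flow dominates
```

An analyzer like sFlow-RT does this kind of aggregation continuously and in real time; the point of the sketch is just that a trickle of samples is enough to rank flows and decide which ones deserve a full TAP feed.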

sFlow can also show you when something changes, and point out which flows should be sent on for further analysis.

Some of the best monitoring, analytics, and graphing tools are Open Source. Recently, folks have been well served by sending their sFlow data to sFlow-RT for analysis and then monitoring the state of their data center with Grafana.

What to look out for when considering TAP aggregation solutions

  • Go with an open multi-vendor solution – don’t get locked into a proprietary one-of-a-kind closed solution. In the data center business, we call these “Unicorns” because they are single-vendor focused, single-vendor sourced, and cannot be easily replaced. Beware – Unicorns are expensive!

  • Be sure to make “apples to apples” cost comparisons. Don’t just look at the switch hardware costs, but also look at the per-switch licensing and controller costs.
  • Consider best-of-breed Open Source Tools which were developed for hyperscale data centers and scale better than expensive vendor-specific solutions
  • TAPs are preferred to SPAN as some switches are not able to mirror every packet.
  • Make sure your TAP Aggregation switches have sufficient packet rates (PPS) to be able to forward every packet sent by the TAPs
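That last point is easy to sanity-check with arithmetic. The worst case is minimum-size (64-byte) frames arriving at line rate, and each frame on the wire also occupies a 7-byte preamble, a 1-byte start-of-frame delimiter, and a 12-byte inter-frame gap:

```python
# Back-of-the-envelope check that a TAP aggregation switch can keep up with
# its TAPs: packets-per-second at line rate for a given frame size.

def line_rate_pps(link_gbps, frame_bytes=64):
    """Worst-case PPS for an Ethernet link of the given speed."""
    overhead = 7 + 1 + 12                      # preamble + SFD + inter-frame gap
    bits_per_frame = (frame_bytes + overhead) * 8
    return int(link_gbps * 1e9 // bits_per_frame)

line_rate_pps(10)   # ~14.88 million packets per second per 10GbE TAP feed
line_rate_pps(100)  # ~148.8 million packets per second at 100GbE
```

If the aggregation switch’s forwarding rate falls short of the sum of these numbers across its TAP-facing ports, it will silently drop packets before they ever reach the analyzers, which defeats the purpose of tapping in the first place.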


Supporting Resources

How the Space Race for Data Centers Helps Everyone

A lot of the household products that we take for granted and use every day were created as by-products of the space race. Well-known products that NASA claims as spin-offs include memory foam (originally named temper foam), freeze-dried food, firefighting equipment, emergency “space blankets”, Dustbusters, and cochlear implants, to name just a few. Each was created out of necessity in order to further the space race. These happy side-products were only possible because of the massive government investments involved, and now they benefit our everyday lives. Interestingly, NASA didn’t invent Velcro, Tang, or Teflon, but as of 2012, NASA claimed nearly 1,800 spin-off products in the fields of computer technology, environment and agriculture, health and medicine, public safety, transportation, recreation, and industrial productivity. And everyone knows that Star Trek was indirectly responsible for inspiring cell phone technology. Sorry, but who can resist? Beam me up, Scotty.

Similarly, a sort of Internet “space race” of data centers has been quietly underway for years now. As the hyperscalers have built increasingly massive data centers to better serve the needs and scale of Internet users, they have, out of necessity, also created a number of innovations that are applicable to server deployments of every size. Well-known Webscale IT innovations include MapReduce/Hadoop, Mesos (Borg), and containerization. These happy side-projects were only possible because of massive hyperscale investments, and now they benefit our everyday data center lives.

Just as consumers don’t need to work at NASA or aboard the Enterprise to appreciate Dustbusters or their beloved cell phones, IT professionals don’t need to run a social media giant to appreciate network automation. All data centers can benefit from automation. They also benefit from higher-speed networks.

Cloud computing and 25/100GbE

Cloud computing is constantly evolving. In 2012, cloud-based servers demanded 10 Gigabit Ethernet connectivity because the 1 Gigabit NICs so common at that time were hindering performance. Fast forward to 2016, and 10 Gigabit Ethernet can now be a bottleneck for modern server platforms, which have leapt forward in performance and in the number of CPU cores and VMs they can support. Cloud Service Providers are leveraging these new server platforms to increase the VM density per server, which brings a corresponding increase in their profits.

These modern servers, with their higher core count, higher VM density, and flash-based storage, are now bottlenecked by 10GbE connections and need high-speed 25 Gigabit Ethernet connectivity. So too, the data center switch interconnects are moving from 40 Gigabit Ethernet to a technology with similar cost structures but 2.5 times the bandwidth: 100 Gigabit Ethernet.

Cloud Computing is not just about speeds and feeds or about where workloads are located. Cloud computing requires a scalable provisioning framework that is automatic in nature. The best practices for network automation developed for the largest data centers in the world apply to every cloud based network.

Get up to speed on Cloud innovations, now!

Join my upcoming webinar, 25/100GbE and Network Automation for the Cloud, with Dinesh Dutt from Cumulus where we will discuss the tips and tricks to automating a data center as well as the Webscale data-plane innovations that drive server bandwidth to 25GbE, including OVS offload, RDMA, and VXLAN acceleration. I promise to keep my space references to a minimum.

See you on September 14, 2016 at 10:00 a.m. PT!





Why We Decided To Partner with Cumulus Networks

It is no secret that cloud computing has changed the IT infrastructure model forever. It has transformed the landscape of the data center, including the way data centers are used, how they are designed, and how they are managed. The cloud has altered the buying behavior of the Enterprise, with every new project now going through a “buy vs. lease” evaluation to decide whether to build the infrastructure in-house or to deploy in the cloud. A side-effect of this new model is that a subtle shift in the expectations of data center admins has quietly taken hold. Whether we are talking about public clouds, private clouds, or a hybrid of the two, those managing the infrastructure have grown to expect a plug-and-play experience from their data center. Time was, when you added a peripheral to your PC you needed to manually configure a slew of settings. That was then and this is now. Today we expect things to just work ‘automagically’. This auto-provisioning mentality is moving to the data center, where there is now an expectation that when you add an application, virtual machine, or container, or deploy a Hadoop cluster, you get the same plug-and-play experience with the entire data center as you do with a laptop. As new workloads are deployed, the servers are automatically provisioned; so too must the storage, firewall rules, load balancers, and physical network infrastructure be automatically provisioned.

The automated provisioning and monitoring practices first developed for Web Scale IT have permeated even the smallest data center footprints. Whether your data center footprint covers two football fields or two rack units with a hyperconverged solution, people desire the force-multiplying benefits of automation. Think about it: when you have hundreds or thousands of servers (virtual or physical) and you need to change some security setting, or change where the SYSLOG alerts are sent, you use a tool like Puppet or Ansible to update all the server endpoints with a single command. Folks are now accustomed to making mass configuration changes in an automated, scriptable manner, for all of the data center infrastructure, not just servers. This shift becomes a significant OPEX differentiator for Cloud providers and larger enterprises that measure manually entered CLI key-strokes in terms of headcount.
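As a toy illustration of that “one command updates every endpoint” workflow, here is a Python sketch of a mass SYSLOG-target change. The inventory, setting names, and apply function are all hypothetical stand-ins for what a real tool like Ansible or Puppet manages for you:

```python
# Toy model of fleet-wide configuration: one desired-state change is applied
# to every device in the inventory, instead of one CLI session per box.

inventory = ["leaf01", "leaf02", "spine01", "spine02"]
device_state = {host: {"syslog_server": "10.0.0.1"} for host in inventory}

def set_everywhere(state, key, value):
    """Apply one setting change to every device in the inventory."""
    for host in state:
        state[host][key] = value
    return state

set_everywhere(device_state, "syslog_server", "10.0.0.99")
all(cfg["syslog_server"] == "10.0.0.99" for cfg in device_state.values())  # True
```

The real tools add the hard parts (transport, idempotency, error handling, reporting), but the operational model is exactly this: declare the change once, and the tool fans it out to the whole fleet.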

At Mellanox, we are continually adding new automation features to our home-grown Network Operating System, MLNX-OS, with support for OpenStack, Puppet, NEO, REST, Neutron, and more every year. We have not stopped investing in MLNX-OS, which offers an industry-standard interface familiar to most networking professionals. However, there is a growing class of customers who are not satisfied with this approach, a flourishing rank of digerati who have embraced the DevOps approach and now treat their infrastructure as code. These technology boat-rockers started by adding network functions to their Linux servers and now want their Ethernet switches to offer the same programmable Linux interface as their servers. They have figured out that, because it is Linux, they can load their own applications on their switches just like they do on servers. If they ever need some network visibility feature that didn’t come with the switch, they create a simple script to monitor the particular counter they are interested in and then have the switch automatically send an alert when appropriate.



The Mellanox leadership team thoughtfully considered whom to partner with in order to create the best solution for this new market, and we found there was one clear choice: Cumulus Networks. Cumulus Networks is *the* leader for automating the network. Besides offering a native Linux interface that enables a switch to behave exactly like a Linux server, they have already integrated with every major Cloud Orchestration solution, including VMware EVO-SDDC, Nutanix, and OpenStack, as well as Network Overlays like NSX, Nuage, PLUMgrid, and Midokura. They offer native support for server automation tools like Ansible, SaltStack, Chef, Puppet, and CFEngine. Beyond that, Cumulus has enhanced the networking in Linux with the purpose of streamlining the provisioning of switches by reducing the number of unique network configuration parameters needed per switch. In many cases, all the Cumulus Linux switches in a data center POD will have nearly identical configurations, with the only difference being the loopback address:
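A Python sketch of what that buys you: one template, with the loopback address as the only per-switch variable. The config snippet inside the template is a simplified stand-in, not literal Cumulus Linux syntax.

```python
# When every switch config is identical except the loopback, provisioning a
# whole POD reduces to rendering one template per device name.

TEMPLATE = """\
interface lo
  address {loopback}/32
router bgp 65000
  bgp router-id {loopback}
"""

def render_configs(loopbacks):
    """Render a config per switch from a shared template."""
    return {name: TEMPLATE.format(loopback=ip) for name, ip in loopbacks.items()}

configs = render_configs({"leaf01": "10.255.0.1", "leaf02": "10.255.0.2"})
# Every rendered config is identical except for the loopback-derived lines.
```

Fewer unique parameters per switch means less to template, less to review, and less to get wrong, which is precisely why this design streamlines provisioning at scale.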


A key benefit of offering a third-party Operating System is that it allows Mellanox to compete with Broadcom-based switches in “apples to apples” comparison tests in a way that highlights the hardware performance differences. Testing two switches with the same OS is like the old Pepsi Challenge in that it removes testing bias and shows how much better one hardware platform is than the other. We relish the opportunity to compete, especially when performance is the yardstick. When it comes to 100GbE-capable switches, our Spectrum switch is the clear performance leader, as documented by The Tolly Group here, and recorded in a webinar here:


At Mellanox, we have been investing in Open Ethernet for many years. We contributed multiple Ethernet switch designs to the Open Compute Project (OCP). We open-sourced our Multi-Chassis Link Aggregation (MLAG) solution and contributed the code to the community. We spearheaded the Switch Abstraction Interface (SAI), which aims to make it easy to port Network Operating Systems to many different switch ASICs from any vendor. We are a founding member of the OpenSwitch Linux Foundation project, and we are leading the Open Optics initiative, which is aimed at unlocking 100G, 400G, and higher-speed technologies. This partnership with Cumulus is the logical culmination of this effort.

Albert Einstein famously conducted “thought experiments” to consider new theories and ideas. I would challenge you to a different kind of thought experiment: think about what you could do if your switches were as easy to automate as your servers. But you can do more than just thought experiments. If you are a hands-on kind of person with a penchant for Linux, do yourself a favor and download Cumulus VX, a fully featured software-only version of Cumulus Linux that is free and runs as a virtual machine. Build a virtual network of five or six routers inside your laptop and see how well it works with your favorite server configuration management tool. Then you will experience, first-hand, why Mellanox decided to partner with Cumulus Networks.