Let’s talk about network streaming telemetry and why you need it. If you have ever had problems trying to recreate issues, or had a tough time figuring out why you were seeing packet loss, or if you are a network admin who has ever been blamed for an application outage or a server or storage performance issue, you need good network telemetry. Because the network is what lets applications be accessed, share data, and connect to storage, good network streaming telemetry is also good application telemetry.
Some of you may be asking, what exactly is telemetry?
What is Telemetry?
If you are driving a car, telemetry is the speedometer, tachometer, gas gauge, oil pressure gauge, engine temperature, and dashboard warning lights: the data you need to get safely where you want to go and to know how the car is doing along the way. Whether you are driving a car or flying a plane, you need good telemetry, and the faster you travel the more critical it is. So too, if you are running a data center, deploying VMs and containers, or managing a storage deployment, you need visibility into what’s going on inside the network fabric. And the faster your network runs, or the more critical network performance is to your business, the more important that becomes. Switch streaming telemetry can give you that crucial visibility.
Shift from Protocols to Streaming Telemetry
The legacy position on network management has been that more is better: more protocols, more packets captured, and, in case of a problem, more deep digging through captured packets to find the cause and then the fix. But over the last several years there’s been a trend in data center networks towards simplification. The larger or more advanced the data center, the fewer protocols its operators like to run. Back in my Tech Support days, we used to have a saying that “the smarter the customer, the shorter the config file.” This saying came out of the fact that the guys who always had trouble were those who enabled every possible feature and protocol. You could estimate the number of problems by the length of the config files. We’ve seen this trend in the move to more L3 and away from L2 and all the versions of spanning tree, and band-aids like root guard, loop guard, BPDU guard, and the rest.
The main exception to the simplification trend is the need for more visibility, as smart folks want to see what’s going on inside their network. As networks get larger and faster, savvy administrators are using fewer protocols but aiming for more network telemetry to achieve better visibility.
Some Network Admins want better streaming telemetry to improve their “mean time to innocence”, that is, to speed up the time to find the root cause of issues so they can rule out whatever is NOT causing the problem and prove whether it’s really the server team’s fault (or maybe the storage team’s fault). Others are trying to get more out of their network. Most network teams don’t really know if their networks are under-utilized or over-utilized because they have poor visibility into what’s really going on. Without that understanding, it’s impossible to run the network efficiently or to grow it properly.
What Just Happened (WJH) is a switch-level monitoring solution where the switch ASIC monitors flows at line rate and alerts you to performance problems caused by packet drops, congestion events, routing loops, and so on.
For example, if you’re dropping packets because of a bad cable or bad optic, WJH will let you see those dropped packets and tell you why they were dropped. WJH will let you know if you’ve got congestion or buffer problems, or even security issues. For example, if traffic is hitting a bunch of ACLs that are dropping packets, you’d like to know why, because you might have a corrupted server or VM. Or you might have a poorly configured ACL that’s causing problems.
In lossless environments, like NVMe over Fabrics (NVMe-oF) running on RoCE, you might have performance problems even though you are not dropping packets. The performance issues could be due to congestion issues or excessive pause frames or latency issues. It’s very common to find out the root cause is uneven load balancing across a LAG or ECMP group. Whether your problem is packet drops or poor performance without packet drops, WJH was built to get to the bottom of those things and give you the best streaming telemetry for superior network visibility.
Pretty much every network in the world is going to have some packet drops. Sometimes, it’s for bad reasons and sometimes it’s for good reasons. Many of the other switch telemetry solutions don’t deliver enough data to diagnose and solve problems. When a non-Mellanox switch drops a packet, that packet is sent to Bit Heaven, never to be seen again. The packet, and all that useful diagnostic information, will just disappear and the most those switches will do is increment a vague counter. When you check that counter, the switch will say, “Oh, you’ve now dropped 504 packets, due to a bad VLAN”. But that’s it, those switches don’t tell you anything about the packet that was dropped, when it was dropped, or why it was dropped–just that it was dropped. So you don’t know if the packet was dropped because the switch was misconfigured or because the server was misconfigured, or something else completely.
Other switch or network management solutions perform statistical sampling of packets from every port on every switch. This produces a staggeringly large boatload of packets but not all the problem packets, so it doesn’t record when, why or how a packet was lost. It also doesn’t properly explain how congestion started, what caused unacceptably high latency, or why traffic might have become unbalanced or been misrouted. When a problem is suspected, you need to sort through huge piles of saved packets and try to extrapolate (or guess) what really happened and why.
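To put a rough number on why sampling tends to miss the problem packets, here is a small back-of-the-envelope sketch. It assumes each packet is sampled independently with probability 1 in N, and the 1-in-4096 rate is just an illustrative value, not a recommendation or a default from any particular product.

```python
def chance_of_sampling_a_drop(dropped_packets: int, sampling_rate: int) -> float:
    """Probability that 1-in-N random sampling catches at least one dropped packet.

    Assumes every packet is sampled independently with probability 1/N.
    """
    miss_one = 1.0 - 1.0 / sampling_rate      # chance a given packet is NOT sampled
    return 1.0 - miss_one ** dropped_packets  # chance at least one of them is sampled


# Illustrative sampling rate (an assumption for this sketch).
SAMPLING_RATE = 4096

for drops in (1, 10, 100, 1000):
    p = chance_of_sampling_a_drop(drops, SAMPLING_RATE)
    print(f"{drops:>5} dropped packets: {p:.1%} chance the sampler saw any of them")
```

Even when the sampler does catch one of the dropped packets, the sample by itself does not say that the packet was dropped, or why.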
In these cases, you simultaneously have too much data (too many sampled packets) and yet not enough information (not enough details about the problem packets). Everything on the network becomes suspect and determining what really happened can take many hours. But, there is a better way!
How does WJH Work?
Mellanox’s What Just Happened (WJH) is a hardware-accelerated telemetry technology where the switch ASIC holds onto important parts of dropped packets. The switch won’t keep the whole packet or all the normal packets, as that would consume a lot of space for little benefit. Instead, the switch keeps the important parts of the problem packet, like the source and destination IP and MAC, port numbers, etc., along with very detailed descriptions of why, when, and where it was dropped. Because the switch is involved, it knows which packets to save and why those packets were dropped, or too slow, or misrouted. And with hardware acceleration, the switch can record all the relevant packets along with important details, even while driving many ports of 25, 40, 50 or 100 (soon 200) Gigabit Ethernet.
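As a rough picture of what “the important parts of the problem packet” could look like once they leave the switch, here is a minimal sketch of a drop record. The class name, field names, and sample values are all made up for illustration; this is not the actual WJH record format.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class DropRecord:
    """Hypothetical record for one dropped packet (not the real WJH schema)."""
    timestamp: datetime   # when it was dropped
    switch: str           # where: which switch...
    ingress_port: str     # ...and which port
    src_mac: str
    dst_mac: str
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    reason: str           # why it was dropped

    def summary(self) -> str:
        """One-line, human-readable description of the drop."""
        return (
            f"{self.timestamp.isoformat()} {self.switch}/{self.ingress_port}: "
            f"{self.src_ip}:{self.src_port} -> {self.dst_ip}:{self.dst_port} "
            f"dropped ({self.reason})"
        )


# Sample values are invented; addresses come from documentation ranges.
record = DropRecord(
    timestamp=datetime(2020, 1, 1, 3, 15, tzinfo=timezone.utc),
    switch="leaf-1",
    ingress_port="swp7",
    src_mac="00:00:5e:00:53:01",
    dst_mac="00:00:5e:00:53:02",
    src_ip="192.0.2.10",
    dst_ip="192.0.2.20",
    src_port=49152,
    dst_port=443,
    reason="ACL deny",
)
print(record.summary())
```

Compare that with a bare counter, which could only tell you that one more packet was dropped.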
For small deployments, you can log into the switch and quickly see what’s going wrong in your network. But for larger deployments, WJH can stream these packets out to a centralized database using gRPC. This works with turnkey solutions, like Mellanox NEO, and because the data lands in a standard database, it also works with open source tools like Kibana and Grafana. If you are a network expert, or have been to Sniffer University, and want to look at the actual packet capture, the switch can generate a pcap file of all the dropped packets so you can look at it using Wireshark.
WJH helps get to the bottom of problems by showing who’s being impacted (which applications, which servers), what’s causing the problem, and when and where in your network it is happening.
A New Hope in Network Telemetry
WJH is a new way of monitoring a network. Traditional network monitoring tools collect tons of innocent data and counters. They may even use sFlow to sample random packets, with the idea that you’re collecting all this information to use to extrapolate, or guess, what went wrong in your network.
For some reason, the trickiest network problems usually occur at night or on the weekend, and then you’ve got to leave the football game or dinner to sift through a mountain of data to find the root cause. You try to guess which culprit is causing all the trouble. There are even predictive analytics tools whose vendors say, hey, we’ll look at that mountain of data for you, and we’ll give you something like 60-70% confidence that we have found the root cause. They’ll do that guesswork for you, but at the end of the day, it’s still just guesswork. The problem is you have too much data (from packet sampling) but often not the most important data (the What, Where, When and Why).
WJH is a new way of monitoring a network that focuses on data plane anomalies, built to give you back your weekends. WJH quickly shows you both the victims and the packet troublemakers or bandwidth bullies in your network. You can keep collecting those huge piles of data about innocent devices and events and try to crunch them, but WJH will give you the smoking gun, the actual root cause at the scene of the crime, recorded first-hand by the switch that had to drop the packet.
No more Problem-Recreation Drama!
The old approach was to guess when a problem would recur and set up a recreate scenario on a test bed or packet trace, only to have the problem not reveal itself, so you retry the next week… and the week after…. This was the impetus for Mellanox’s What Just Happened advanced telemetry technology. With WJH, because we hold onto those packets that are getting dropped, and we report on them, we’re able to help you get to the root cause and give you network visibility without needing to reproduce issues to resolve them.
So how do I deploy WJH?
Now I know some of you are thinking, “this sounds amazing, but I can’t replace my entire network with Mellanox switches”. The great thing about WJH is that it works independently of the rest of the network. WJH running on one switch can report errors that are likely happening on other switches in that tier of the network doing a similar function. This is very different from in-band telemetry, which works best with all switches from the same vendor.
It’s super simple to get started with WJH:
WJH Deployment Procedure
Step 1 – Most people start using WJH by doing a Network Scan, which is done by enabling WJH on a switch they have plugged into their production network. People are almost always surprised at the kinds of errors that they learn about. The Network Admin is super happy to learn what’s going on in the network. So step one is just to turn on WJH and see what’s really going on in your network.
Step 2 – Next is the clean-up phase where people resolve the network issues that WJH found as well as the server issues and the storage issues that WJH found.
Step 3 – This is where you personalize WJH for your network and your management needs:
- You might set some filters because you don’t need to report certain kinds of “normal” errors, or even want to log or store them.
- You might set the WJH agent to aggregation mode if your issues tend to show up as thousands of copies of the exact same packet. Aggregation mode stores just one copy of that problem packet instead of 1000 identical problem packets.
- You might set the severity level of issues that matter to you. Some might be critical and require immediate notification, while others you could check later, or even ignore.
- You can set actions for the severity levels. For example, you might want a text sent on critical issues, an email on significant issues, and no alerts on minor issues.
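To make the Step 3 options a bit more concrete, here is a minimal sketch of the logic behind filtering, aggregation, and severity-based actions. It is not the WJH agent’s configuration or API. The drop reasons, severity names, and action strings are all hypothetical, chosen only to mirror the bullets above.

```python
from collections import Counter

# Everything below is hypothetical: the reason strings, severity levels, and
# action names are illustrative and are not actual WJH settings.
IGNORED_REASONS = {"TTL expired"}      # "normal" errors you chose to filter out
SEVERITY_BY_REASON = {
    "bad cable or optic": "critical",
    "ACL deny": "significant",
    "VLAN mismatch": "minor",
}
ACTION_BY_SEVERITY = {
    "critical": "send text",
    "significant": "send email",
    "minor": None,                     # no alert on minor issues
}


def summarize(drops):
    """Filter, aggregate, and classify (flow, reason) drop events."""
    # Filter: skip the reasons you decided not to report or store.
    kept = [d for d in drops if d[1] not in IGNORED_REASONS]
    # Aggregate: keep one entry per identical event, plus a count.
    counts = Counter(kept)
    report = []
    for (flow, reason), count in counts.items():
        # Reasons without an explicit level default to "minor" in this sketch.
        severity = SEVERITY_BY_REASON.get(reason, "minor")
        report.append({
            "flow": flow,
            "reason": reason,
            "count": count,
            "severity": severity,
            "action": ACTION_BY_SEVERITY[severity],
        })
    return report


# Invented sample events: 1000 identical ACL drops, 3 cable drops, 50 filtered.
drops = (
    [("192.0.2.10 -> 192.0.2.20", "ACL deny")] * 1000
    + [("192.0.2.11 -> 192.0.2.20", "bad cable or optic")] * 3
    + [("192.0.2.12 -> 192.0.2.20", "TTL expired")] * 50
)
for entry in summarize(drops):
    print(entry)
```

The thousand identical ACL drops collapse into a single entry with a count, which is the whole idea of aggregation mode, and the filtered reason never reaches the report at all.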
WJH is a great tool for advanced network-heads as well as the network novice who just wants a simple way to tell network issues apart from server and storage issues. With WJH, you don’t have to be a network expert to very quickly find root causes of performance problems.
Advanced streaming telemetry technology is good for your business: it will help you get more performance, uptime, and productivity out of the networks that you’ve paid for.