All posts by Amit Katz

About Amit Katz

Amit Katz is Vice President of the Ethernet Switch business at Mellanox. He served as Senior Director of Worldwide Ethernet Switch Sales from 2014, and before that held Product Management roles at Mellanox starting in 2011. Prior to joining Mellanox, Mr. Katz held various Product Management positions at Voltaire and at RAD Data Communications. He holds a BA in Computer Science from the Academic College of Tel Aviv-Yaffo and an MBA from Bar-Ilan University.

Network Telemetry Will Never Be the Same

1990-2019

You get “the call”: a service is down, and you know the drill. “Is it a networking problem, or am I being blamed for some server/storage failure?” The race begins. You know the network is probably fine, but until you prove it, you will be the one responsible for your company losing money, losing customers and damaging its reputation.

“Time To Innocence” is the key. How fast can you prove it is not the network’s fault?

Let’s look at the tools you’ve got to solve your problem:

  • SNMP MIBs
  • Counters
  • Syslog
  • Performance Monitoring Tools gathering the above data

So now you have to hunt for the piece of data that will solve the mystery, and usually it goes: “TX counters minus RX counters do not equal zero – yes, we found packet loss!!!”

Great progress – now you’re just a few hours (!) away from identifying the issue. You know you’ve dropped packets, so now you can start investigating.
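For readers who have never had to do it by hand, the counter math looks roughly like the following minimal Python sketch. It assumes you already have some way to read interface counters (SNMP IF-MIB, CLI scraping, a monitoring tool’s API…); the fetch_counters() helper is hypothetical, not a real API.

    # A hedged sketch of the classic "TX minus RX" counter math.
    # fetch_counters(switch, port) is a hypothetical helper that returns a
    # dict of interface counters for one port.

    import time

    def packets_lost_on_link(fetch_counters, tx_side, rx_side, interval=10):
        """Compare packets sent on one end of a link with packets received on
        the other end over the same interval. tx_side/rx_side are (switch, port)."""
        tx0 = fetch_counters(*tx_side)["tx_unicast_packets"]
        rx0 = fetch_counters(*rx_side)["rx_unicast_packets"]
        time.sleep(interval)
        tx1 = fetch_counters(*tx_side)["tx_unicast_packets"]
        rx1 = fetch_counters(*rx_side)["rx_unicast_packets"]

        sent, received = tx1 - tx0, rx1 - rx0
        # Real tools allow a small tolerance (control-plane traffic, sampling
        # skew); a persistent positive gap is the "we found packet loss!" moment.
        return sent - received

    # Example with hypothetical switch/port names:
    # lost = packets_lost_on_link(fetch_counters, ("leaf1", "swp49"), ("spine1", "swp1"))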

You know the packet drop came from a specific port on a specific switch. But why? The switch won’t tell you why… it just drops the packet.

So now you want to reproduce the issue. For that, you need it to happen again so you can debug it yourself, or have the vendor debug it remotely or on site. The problem is that this packet drop doesn’t happen when you want it to happen – it happens when it happens… and you’re stuck.

Mellanox Introduces “What Just Happened,” Advanced Streaming Telemetry Technology


2019: Mellanox introduces “What Just Happened,” our new Advanced Streaming Telemetry Technology. WJH tells you why the packet was dropped, when, in which protocol and more.

Let’s rethink this statement: “Switches don’t tell us why, they just drop the packet”.

Does it have to be this way? Why wouldn’t switches tell you why a packet was dropped and save everyone the hassle? When switches drop packets, they do so for a reason. It’s not a bug, but rather what they are designed to do. Here are a few examples of scenarios in which a switch is asked to drop packets (a small illustrative sketch follows the list):

  1. ACL action: “Drop the packets sent from a specific IP”
  2. “Drop the packet if TTL=0” (e.g. the packet has traveled too long in the network, hence there is probably a loop)
  3. “Drop the packet if SMAC=DMAC” (in other words, you sent the packet to yourself)
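To make the point concrete, here is a deliberately simplified sketch of the kind of checks a forwarding pipeline applies. This is illustrative Python, not switch firmware and not a Mellanox API, and the field names are assumptions – the interesting part is that every drop already comes with a reason attached:

    # Deliberately simplified, illustrative Python. Field names are assumptions
    # made for the sake of the example.

    def drop_reason(packet, acl_denied_sources):
        """Return None to forward the packet, or a string explaining why it is dropped."""
        if packet["src_ip"] in acl_denied_sources:
            return "ACL deny: packets from this source IP are blocked"
        if packet["ttl"] == 0:
            return "TTL expired: the packet probably hit a forwarding loop"
        if packet["src_mac"] == packet["dst_mac"]:
            return "SMAC equals DMAC: the packet was sent to itself"
        return None  # no drop rule matched - forward normally

    pkt = {"src_ip": "10.0.0.7", "ttl": 0,
           "src_mac": "b8:59:9f:00:00:01", "dst_mac": "b8:59:9f:00:00:02"}
    print(drop_reason(pkt, acl_denied_sources={"192.0.2.1"}))
    # -> "TTL expired: the packet probably hit a forwarding loop"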

So… if the switch knows why it dropped the packet, why doesn’t it report that information?

Well, now it does.

Mellanox switches introduce “What Just Happened!”

Mellanox “What Just Happened” tells you why the packet was dropped, when it happened, who sent the packet, to whom, in which protocol and VLAN, and more. WJH can even tell you whether the issue was related to the network or rather to the server or the storage. It provides recommended actions to ease troubleshooting, and it is available on three different network operating systems.
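To give a feel for what this looks like on the receiving side, here is a hedged sketch of a consumer of such drop events. The field names below are hypothetical placeholders, not the actual WJH record format – check the WJH documentation for the real schema and transport.

    # Hedged sketch of consuming streamed drop events. The field names are
    # hypothetical placeholders, not the real WJH record format.

    def summarize_drop(event):
        return (f"{event['timestamp']} {event['switch']}:{event['port']} dropped a "
                f"{event['protocol']} packet {event['src_ip']} -> {event['dst_ip']} "
                f"(VLAN {event['vlan']}): {event['reason']}")

    sample_event = {
        "timestamp": "2019-03-01T10:42:13Z", "switch": "leaf3", "port": "swp12",
        "protocol": "TCP", "vlan": 20, "src_ip": "10.1.20.5", "dst_ip": "10.1.30.9",
        "reason": "TTL expired: possible forwarding loop",
    }
    print(summarize_drop(sample_event))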

Your time is precious. Don’t waste it, talk to me: https://www.mellanox.com/products/what-just-happened/lets-talk/

Mellanox “What Just Happened”, available TODAY, taking network telemetry to the next level.


Deploying a 19-Inch, 48-Port Switch for Your Modern Storage or Hyperconverged Solution?

Hyperconverged and modern storage systems typically use 8-16 nodes in a rack.

If that is the case, then why are so many deployments using standard 48-port data center ToR switches to build their modern HCI/storage solutions today?

 

Let’s review the options in the market today and the major considerations when choosing a switch to connect your scale-out HCI/storage solution.

So, you deploy a modern solution, and a system integrator or storage vendor proposes a 48-port switch, because that is what they typically sell for data centers. What does that really mean?

  1. Many ports for future growth?
    1. No, not really. The rack will never need more ports: storage/HCI scales by adding racks, and there is not enough power or space for more nodes in the rack anyway.
  2. Advanced features, many of them purchased with licenses…
    1. Most of these solutions use only three features on the switch: VLAN, LAG and mLAG.
    2. L3? VXLAN? OpenFlow? More? All supported and included on the half-width 19-inch switches, but typically not used – so there is no need to pay extra for unused capabilities.
  3. Many of those switches are based on Broadcom silicon and inherit its performance characteristics (latency, packet rate…)
    1. Sometimes that is good enough; with modern storage, it may not be sufficient. Is there any reason to compromise on performance when the price and power are higher?
  4. Price…
    1. Paying more for the switches than you are paying for the HCI/storage solution? Sounds crazy, but we have seen it happen…

Summary

Mellanox has built a dedicated HCI/storage switch. It comes at the right port count, form factor and price, and delivers better performance. Many have realized there is no reason to use switches from legacy vendors where you pay more and get less – time to do the math and choose a switch that was built for your storage/HCI solution.
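As a back-of-the-envelope example of that math (all prices below are made-up placeholders, not Mellanox or anyone else’s list prices), compare the cost per port you actually use when a rack hosts only 8-16 nodes:

    # Back-of-the-envelope "do the math" sketch. All prices are made-up
    # placeholders - plug in your own quotes.

    rack_nodes = 12                      # typical HCI/storage rack: 8-16 nodes
    uplinks = 4
    used_ports = rack_nodes + uplinks    # 16 ports actually in use

    big_dc_switch_price = 20_000         # hypothetical 48-port data center switch
    half_width_price = 9_000             # hypothetical half-width HCI/storage switch

    for name, price in [("48-port DC switch", big_dc_switch_price),
                        ("half-width switch", half_width_price)]:
        print(f"{name}: ${price / used_ports:,.0f} per port actually used")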

Hate Compromising on Performance and Scale when Running VXLAN?

I guess many of you have been disappointed that you cannot get what you really need when using VXLAN:

  • You can do VXLAN routing, but only at 10/40GbE, while you want to run 25/100GbE (without using a loopback cable…)
  • You use VXLAN because you need a scalable network, but you cannot support more than 128 ToRs because some ASICs are limited to 128 remote VTEPs per VNI
  • You like the SFP form factor and would rather use a switch with SFP28 ports than use breakout cables
  • You need more bandwidth going up: after dedicating 2 ports to mLAG, 4 uplink ports are not enough when the servers use 25GbE NICs (see the quick uplink math sketch after this list)
  • You want a switch running at full line rate with zero packet loss
  • You want a switch with fair buffering, because all cloud/data center tenants should get the same service unless defined otherwise, and you want to deliver QoS to customers
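Here is the uplink math referenced above, as a minimal sketch – the port counts are illustrative and do not describe any particular switch model:

    # Minimal uplink-math sketch; port counts are illustrative only.

    def oversubscription(server_ports, server_gbe, hundred_gig_ports,
                         uplink_gbe=100, mlag_peer_ports=2):
        downstream = server_ports * server_gbe
        upstream = (hundred_gig_ports - mlag_peer_ports) * uplink_gbe
        return downstream / upstream

    # 48 x 25GbE server ports, 6 x 100GbE ports, 2 of them burned on the mLAG peer link:
    print(oversubscription(48, 25, 6))   # 3.0 -> the "4 ports are not enough" case
    # The same rack with 8 x 100GbE ports available:
    print(oversubscription(48, 25, 8))   # 2.0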

With Mellanox Spectrum switches running Cumulus Linux 3.4, all these issues are solved – no need to choose between features, performance and scale. One ASIC can do it all, and you don’t have to worry about the hidden licenses you see from some vendors: everything is included, transparent and well tested.

Contact me for more information.


The Top 7 Data Center Networking Misconceptions

  1. “Adding more 40GbE links is less expensive than using 100GbE links”
    1. The price of a 40GbE port is roughly 2X that of a 10GbE port, and similarly 2 x 40GbE ports end up more expensive than a single 100GbE port (a quick per-gigabit arithmetic sketch follows this list)
  2. Vendor C: “You get the best price in the entire country, more than 80% discount”
    1. Really? I have met many customers proud of getting an 80% discount; it is worth doing an apples-to-apples comparison that includes 3-year OpEx, licenses, support, transceivers…
  3. “I wish I could move to white box, but I don’t have tons of developers like Google and Facebook have…”
    1. There is no need for developers. Try these solutions and you will see: automation is built into the product and is easy to deploy.
  4. “L2 is simple, L3 is complicated and expensive”
    1. STP? PVRST? Root Guard? BPDU Guard? mLAG? Broadcast storms? In fact, there is a huge amount of complexity in building a reliable and scalable L2 LAN.

Much of this complexity simply goes away in an L3 environment, because BGP/ECMP is very simple to use and to debug, especially with the right automation. And the price is the same when buying an L2/L3 switch from the right switch vendors.

  5. “Nobody ever got fired for buying X”
    1. No one ever got promoted for doing that either. Once upon a time some of these brands really were better – not anymore. It may seem like the safe bet, but paying more for less makes your company less competitive, and could jeopardize its very future…
  6. “You can automate servers, storage, the billing system, the order system, and pretty much everything in the infrastructure except the network”
    1. Today’s networks can be easily and fully automated using standard tools, integrated with compute and storage, and monitored with commercial and open source tools. Check out this short video to see how simple automation can be.
  7. “Telemetry is such a special feature that you must buy a special license ($) to enable it”
    1. Why should you pay extra just for the switch to give you real-time visibility, while regular counters are free? The same question applies to ZTP, VXLAN, EVPN, Tap Aggregation and BGP. Ask your vendor: what makes feature X so much more complicated that you get to charge extra for it? Why isn’t it a standard feature on a switch that costs over $10K?
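Here is the per-gigabit arithmetic behind misconception #1 above, with deliberately made-up prices – substitute your own quotes:

    # Deliberately made-up prices to illustrate the per-gigabit comparison.

    price_40g_port = 600       # hypothetical cost per 40GbE port (switch + optics)
    price_100g_port = 1000     # hypothetical cost per 100GbE port (switch + optics)

    two_by_40 = 2 * price_40g_port
    one_100 = price_100g_port

    print(f"2 x 40GbE: ${two_by_40} for 80 Gb/s  -> ${two_by_40 / 80:.0f} per Gb/s")
    print(f"1 x 100GbE: ${one_100} for 100 Gb/s -> ${one_100 / 100:.0f} per Gb/s")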

Why are Baidu, Tencent, Medallia, Sys11 and others using Mellanox Spectrum?

In 2016, we saw a significant shift in our business. We started seeing our Ethernet switches being deployed in some of the largest data centers of the world. I’d like to share with you some of those that I can talk about.

Here at Mellanox, we have seen a lot of traction lately in the areas of Analytics and “Something as a Service”, which people typically refer to as cloud.

Data Analytics

Baidu, Tencent, and many others have started investing in 25/100GbE infrastructure for their Analytics and Machine Learning deployments. After doing their homework and running many in-depth performance benchmarks, they identified the Mellanox NIC as the key to running these performance-hungry workloads, simply because RoCE provides significant application performance benefits. Earlier this year, Tencent participated in the TeraSort Benchmark’s annual global computing competition. They utilized the Mellanox Ethernet NIC + switch solution as a key component, which enabled them to achieve impressive improvements over last year’s benchmarks.

Baidu experienced similar benefits when they adopted our Ethernet technology. When asking why they chose Mellanox they told us they view the Mellanox solution as the most efficient platform on the market for their applications.

When iFLYTEK from China decided they needed 25GbE server connectivity, they chose the Mellanox Spectrum SN2410 as their ToR, running 25GbE to the servers with 100GbE uplinks to the Spectrum SN2700. They told us our solution enabled them to leverage the scalability of our Ethernet solutions and thereby grow their compute and storage in the most efficient manner.

These organizations have seen the value of a tested, validated, high-performing end-to-end 100GbE Mellanox solution, and even more importantly, they have done the math – and every single one of them came to the same clear and inevitable conclusion.

Cloud – Anything as a Service

So, let’s talk about Cloud, and review why people choose Mellanox Spectrum to run their businesses.

I’d start by noting that all these solutions labeled “as a Service” are typically driven by efficiency and ROI. This means that when people build them they typically “do the math”, as the entire business model is based on cost efficiency, taking into account OpEx and CapEx.

Sys11 is one of the largest clouds in Germany and they needed a fully automated and very efficient cloud infrastructure.

Harald Wagener, Sys11 CTO, told us that they chose Mellanox switches because they allowed Sys11 to fully automate its state-of-the-art cloud data center. He said they also appreciated the cost effectiveness of our solution, which allowed Sys11 to leverage the industry’s best bandwidth together with the flexibility of the OpenStack open architecture.

Sys11 was one of our first deployments with Cumulus, but instead of using the more standard Quagga routing stack, they decided to use BIRD. Initially we were a little concerned, because that is not what everyone else does, but then we tested it, ran some 35K routes, and it worked like a charm. We used the SN2410 at 25GbE as the ToR and the SN2700 running 100GbE as the spine switch.

One of the most interesting recent deployments has been with Medallia, a great example of a SaaS company that is growing fast and needed a fully automated solution that could effectively serve its various data center needs, such as high-speed Ceph (50GbE), short-lived containers and scale. They wanted IP mobility without the hassle of tunnels, which they got by adopting a fully routed network, all the way to the server.

Medallia deployed a Spectrum-Cumulus solution, running 50GbE to the servers with 100GbE uplinks, to replace their all-40GbE network. With the all-40GbE network, they needed 2 ToR switches and 1 spine switch per server rack. When they moved to 50/100GbE, they cut the number of switches needed per rack by a whopping 50 percent.

What’s really cool about Medallia is that they are so open-minded: while looking for the right solution, they made their hardware decision only after they chose Cumulus Linux. It was only then that they picked the vendor that provided the best ROI – another great example of people who “did the math” and didn’t follow incumbent vendors, who typically focus on confusing customers with what they can do, making sure the customer will not “do the math”. So, what’s next?

Here’s my view of what’s coming in 2017:

The first change will impact latency-sensitive High Frequency Trading environments. Around the world, 10G server connections are being replaced with 25 and 50GbE, because 10G can bottleneck the performance of fast new servers (Intel Skylake) with their high-bandwidth, flash-based (NVMe) storage. And with HFT, the bandwidth for market data increases every year – the OPRA options feed from NYSE recently crossed the 12Gbps barrier.

There is currently a gap in the HFT switch market, because the low-latency switches from Cisco and Arista are capped at 10GbE and there is no silicon available for them to build new low-latency switches. Their over-10GbE switches have average latencies roughly ten times higher than their 10G switches, which makes them irrelevant for trading.

Mellanox has the solution for the HFT market: we built a super-low-latency switch for these 25/50/100GbE connection speeds, and it has 10-20 times lower latency than Cisco and Arista’s new (25-100GbE) switches. We have also added a suite of new features important to HFT shops: PIM, hardware-based telemetry on packet rates, buffer usage and congestion alerts, fine-tuned buffer controls, and slow-receiver multicast protection.

So, we expect another busy and successful year, making sure organizations are not bottlenecked by their networks and most importantly – we all do the math!!!

100GbE Switches – Have You Done The Math?

100GbE switches – sounds futuristic? Not really. 100GbE is here and is being deployed by those who do the math…

100GbE is not just about performance, it is about saving money. For many years, the storage market has been “doing the math”: $/IOPS is a very common metric used to measure storage efficiency and make buying decisions. Ethernet switches are no different – when designing your network, $/GbE is the way to measure efficiency.

While better performance is always welcome, 100GbE is also about using fewer components to achieve better data center efficiency, CapEx and OpEx. Whether a server should run 10, 25, 50 or 100GbE is a question of performance, but for switch-to-switch links, 100GbE simply means better return on investment.

Building a 100GbE switch does not cost 2.5X more than building a 40GbE switch, and in today’s competitive market, vendors can no longer charge exorbitant prices for their switches. Those days are over.

With 25GbE being adopted on more servers simply to get more out of the server you have already paid for, 100GbE is the way to connect the switches.


Today, when people do the math, they minimize the number of links between switches by using 100GbE. When a very large POD (Performance Optimized Datacenter) is needed, we sometimes see 50GbE used as the uplink to increase spine-switch fan-out and thus the number of servers connected to the same POD. In other cases, people simply use the fastest speed available – it used to be 40GbE, and today it is 100GbE.

Who are these customers migrating to 100GbE? They are the ones who consider data center efficiency critical to the success of their business. A few examples:

Medallia recently deployed 32 x Mellanox SN2700 running Cumulus Linux. Thorvald Natvig, Medallia’s lead architect, told us that the math is simply about cost effectiveness, especially when the switches are deployed with zero touch and run simple L3 protocols, eliminating the old-fashioned complications of STP and other unnecessary protocols. QoS? It is needed when the pipes are insufficient, not when you run 100GbE with enough bandwidth coming from each rack. Buffers? Scale? The Mellanox Spectrum ASIC provides everything a data center needs today and tomorrow.

University of Cambridge has also done the math and selected the Mellanox end-to-end Ethernet interconnect solution, including Spectrum SN2700 Ethernet switches, for its OpenStack-based scientific research cloud. Why? 100GbE is there to unleash the capabilities of the NexentaEdge Software Defined Storage solution, which can easily stress a 10/40GbE network.

Enter has been running Mellanox Ethernet switches for a few years now. 100GbE is coming soon: Enter will deploy Mellanox Spectrum SN2700 switches with Cumulus Linux because they did the math! Enter, as a cloud service provider, cannot afford to wait for 100GbE to be everywhere before adopting it. Waiting means losing money. In today’s competitive world, standing still is like walking backwards – 100GbE is here, it works and it is priced right!

Cloudalize was about to deploy a 10/40GbE solution. After they did the math, they went directly to 100GbE with Mellanox Spectrum SN2700 running Cumulus Linux.

To summarize: if data center efficiency is important for your business, it is time to do the math (a small sketch of that math follows the checklist):

  1. Check the cost of any 10/40/100GbE solution vs. Mellanox Spectrum 100GbE. Cost must include all components: cables, support, licenses (no additional licenses with Mellanox).
  2. Note that even when 10GbE at the server is enough, 100GbE uplinks still make sense.
  3. A breakout cable always costs less than 4 single-speed cables.
  4. Pay attention to hidden costs (feature licenses, extra support…).
  5. What is the value of being free, running 100% standard protocols with no “vendor specific” – which is a nicer way to say “proprietary” – protocols?
  6. If 100GbE turns out to be more cost effective, it is time to review the differences between the various 100GbE switch solutions on the market; the following performance analysis provides a pretty good view of the available options.
  7. How much money do you spend on QoS vs. the alternative of throwing bandwidth at the problem?
  8. $/GbE is the best way to measure network efficiency.
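A hedged sketch of the $/GbE math from item 8, with placeholder numbers only – the point is the formula, not the figures:

    # Placeholder numbers only - the point is the $/GbE formula, not the figures.

    def dollars_per_gbe(switch_price, cables, licenses, support_3yr, total_gbe):
        return (switch_price + cables + licenses + support_3yr) / total_gbe

    # Hypothetical 32 x 100GbE switch, no feature licenses:
    hundred_gbe_option = dollars_per_gbe(20_000, cables=3_200, licenses=0,
                                         support_3yr=4_000, total_gbe=32 * 100)
    # Hypothetical 48 x 10GbE + 6 x 40GbE switch, with feature licenses:
    ten_forty_option = dollars_per_gbe(15_000, cables=2_700, licenses=5_000,
                                       support_3yr=4_000, total_gbe=48 * 10 + 6 * 40)

    print(f"100GbE option: ${hundred_gbe_option:.1f} per GbE")
    print(f"10/40GbE option: ${ten_forty_option:.1f} per GbE")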
Feel free to contact me at amitka@mellanox.com – I would be happy to help you “do the math” and compare any 10/40/100GbE solution to Mellanox Spectrum.

What Happened to the Good Old RFC2544?

Compromising on the basics has resulted in broken data centers…

After the Spectrum vs. Tomahawk Tolly report was published, people asked me:

“Why was this great report commissioned to Tolly? Isn’t there an industry benchmark where multiple switch vendors participate?”

So, the simple answer is: No, unfortunately there isn’t…

Up until about 3 years ago, Nick Lippis and Network World ran an “Industry Performance Benchmark”.

These reports were conducted by a neutral third party, and different switch vendors used to participate and publish reports showing how great their switches were, how they passed RFC 2544, 2889 and 3918, etc.


Time to check which switch you plan to use in your data center!!!

Since the release of Trident2, which failed to pass the very basic RFC 2544 test (it lost 19.8% of packets when tested with small packets), these industry reports seem to have vanished. It is as if no one wants to publish an RFC 2544 benchmark anymore – no wonder, when the tests would all fail.

The questions you really need to ask are the following:

  • Why is it that RFC 2544, which was established to test switches and verify they don’t lose packets, is all of a sudden being “forgotten”?
  • Is the Ethernet community lowering its standards because it has become too hard to keep up with the technology?
  • Has it become difficult to build 40GbE and 100GbE switches running at wire speed for all packet sizes and based on modern, true cut-through technology?

The answer to all these questions is simple: RFC 2544 is as important as ever and is still the best way to test a data center switch. Sure, it is hard to build a state-of-the-art switch, which is exactly why RFC 2544 matters more than ever. There are more small packets in the network than before (request packets, control packets, cellular messages, SYN attacks…), and zero packet loss was and still is essential for your Ethernet switches.
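For readers who have never run it, the core of an RFC 2544 throughput trial is conceptually simple: for each frame size, find the highest offered rate the device forwards with zero frame loss. Below is a hedged sketch of that search logic in Python; the send_and_count() traffic-generator hook is hypothetical, not a real tester API.

    # Hedged sketch of an RFC 2544-style throughput search. send_and_count() is
    # a hypothetical traffic-generator hook: it offers traffic at rate_percent
    # of line rate for a fixed trial and returns (frames_sent, frames_received).

    def rfc2544_throughput(send_and_count, frame_size, resolution=0.1):
        """Binary-search the highest rate (% of line rate) with zero frame loss."""
        lo, hi, best = 0.0, 100.0, 0.0
        while hi - lo > resolution:
            rate = (lo + hi) / 2
            sent, received = send_and_count(frame_size, rate)
            if received == sent:      # zero loss at this rate: try higher
                best, lo = rate, rate
            else:                     # loss observed: back off
                hi = rate
        return best

    # RFC 2544 repeats this for a sweep of frame sizes, classically
    # 64, 128, 256, 512, 1024, 1280 and 1518 bytes.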

Here is how Arista defined RFC 2544 before abandoning it:

“RFC 2544 is the industry leading network device benchmarking test specification since 1999, established by the Internet Engineering Task Force (IETF). The standard outlines methodologies to evaluate the performance of network devices using throughput, latency and frame loss. Results of the test provide performance metrics for the Device Under Test (DUT). The test defines bi-directional traffic flows with varying frame size to simulate real world traffic conditions.”

And indeed the older Fulcrum-based 10GbE switch passed these tests: http://www.arista.com/media/system/pdf/LatencyReport.pdf

A simple web search will provide you with numerous articles defining the importance of running RFC 2544 before choosing a switch.

While working on this blog I ran into a performance report, sponsored by one of the big server vendors, for a switch using the Broadcom Tomahawk ASIC. They worked hard to make a failed RFC 2544 look okay: using a very specific port connectivity and specific packet sizes, RFC 2544 failed only with 64-byte packets using 8 ports (out of 32), and even the mesh test passed. What is a data center customer supposed to conclude from this report? That one should buy a 32-port switch and use only 8 of its ports? Sponsoring such a report clearly shows that RFC 2544, 2889 and 3918 are still important when making a switch decision. I definitely agree: these tests were established to help customers buy the best switches for their data centers.

So, how has the decline in RFC 2544 testing resulted in unfair clouds?

Not surprisingly, once the market accepted the packet-loss first introduced by Trident2, things have not improved. In fact, they’ve gotten worse.

Building a 100GbE switch is harder than building a 40GbE switch, and the compromises are growing worse: the 19.8% packet loss has soared to 30%, and the sizes of packets being lost have increased.

Moreover, a single switch ASIC is now comprised of multiple cores, which introduces a new type of compromise. When an ASIC is built out of multiple cores, not all ports are equal. What does this actually mean? It means that nothing is predictable any longer. The behavior depends on which ports are allocated to which buffers (yes, it is no longer a single shared buffer), and on the hardware layout, which defines which ports are assigned to which switch cores. To make it simple: two users connected to two different ports do not get the same bandwidth… for more details, read the Spectrum Tolly report.

The latest Broadcom-based-switch Tolly report was released three weeks after the original Tolly report. It attempted to “answer” the RFC 2544 failure, but nowhere did it refute the fairness issue. It is hard to explain why two ports connected to the same switch provide different service levels: in one test, the results showed 3% vs. 50% of the available bandwidth. That means one customer is very happy and another is very unhappy – but only if the unhappy customer knows the facts, right? Has anyone told them that their SLA is broken? Probably not.

Bottom line:

Has compromising on the basics truly benefitted end customers and proven worthwhile? Are they really happy with the additional, worsening compromises they have had to accept in order to build on faster switch ASICs – ASICs which, as everyone can see, are going through multiple revisions and at the end of the day yielding delayed, compromised, packet-losing 100GbE data centers? One should think not!

Meanwhile…

Mellanox Spectrum™ runs at line rate at all packet sizes. The solution supports true cut through switching; has a single shared buffer; consists of a single, symmetrically balanced switch core; provides the world’s lowest power; and runs MLNX-OS® and Cumulus Linux, with more network operating systems coming…

So, stop compromising and get Mellanox Spectrum for your data center today!!!

 

A Final Word About Latency

Note that this report also uses a methodology to measure latency that is unusual at best, and bordering on deceptive. It is standard industry practice to measure latency from the first bit into the switch to the first bit out (FIFO). By contrast here they took the unusual approach of using a last in first out (LIFO) latency measurement methodology. Using LIFO measurements has the effect of dramatically reducing reported latencies. But unlike the normal FIFO measurements the results are not particularly enlightening or useful. For example you cannot just add latencies and get the results through a multi-hop environment. Additionally for a true cut-through switch such as the Mellanox Spectrum, using LIFO measurements would actually result in negative latency measurements – which clearly doesn’t make sense. The only reason to use these non-standard LIFO measurements is to obscure the penalty caused by switches not able to perform cut-through switching and to reduce the otherwise very large reported latencies that result from store and forward switching.