All posts by Stav Sitnikov

About Stav Sitnikov

Stav Sitnikov is a Networking Specialist at Mellanox Technologies. Stav has been in the networking industry for over 10 years. For the past 7 years he has worked at Mellanox in various roles focusing on the Ethernet switch product line.

Kubernetes for Network Engineers

As a network engineer, why should you care about what the developers are doing with Kubernetes? Isn’t it just another application consuming network resources?

Kubernetes is quickly becoming the new standard for deploying and managing containers in the hybrid cloud. Using the same orchestration on-premises and in the public cloud allows a high level of agility and ease of operations: the same API is used across bare metal and public clouds. Kubernetes (K8s) is an open-source container-orchestration system for automating the deployment, scaling and management of containerized applications. It was originally designed by Google and is now maintained by the Cloud Native Computing Foundation.

The building blocks of Kubernetes:

Node

A node is the smallest unit of compute in Kubernetes: a representation of a single machine in a cluster. In most production systems, a node will be either a physical server or a virtual machine hosted on-premises or in the cloud.

Cluster

A cluster is a pool of nodes whose combined resources Kubernetes manages as a single unit. When applications are deployed onto the cluster, it intelligently distributes the work to the individual nodes. If any nodes are added or removed, the cluster shifts workloads around as necessary. It should not matter to the application, or the developer, which individual nodes are actually running the code.

Persistent Volume

Since applications running on the cluster are not guaranteed to run on a specific node, data cannot be saved to any arbitrary place in the file system. If an application tries to save data for later usage but is then relocated onto a new node, the data will no longer be where the application expects it to be. For this reason, the traditional local storage associated with each node is treated as a temporary cache to hold applications, but any data saved locally cannot be expected to persist.

To store data permanently, Kubernetes uses Persistent Volumes. While the CPU and RAM resources of all nodes are effectively pooled and managed by the cluster, persistent file storage is not. Instead, local or cloud storage can be attached to the cluster as a Persistent Volume.
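
As a rough sketch of how that storage is requested, the snippet below uses the official Kubernetes Python client to create a Persistent Volume Claim; the claim name, size and namespace are placeholders, and the cluster needs a matching volume or a dynamic provisioner for the claim to bind.

    from kubernetes import client, config

    config.load_kube_config()  # cluster credentials from ~/.kube/config

    # Ask the cluster for 10 GiB of persistent storage; pods later mount the
    # claim by name, regardless of which node they are scheduled on.
    pvc = {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": "app-data"},  # hypothetical claim name
        "spec": {
            "accessModes": ["ReadWriteOnce"],
            "resources": {"requests": {"storage": "10Gi"}},
        },
    }

    client.CoreV1Api().create_namespaced_persistent_volume_claim(
        namespace="default", body=pvc
    )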

Container

Applications running on Kubernetes are packaged as Linux containers. Containers are a widely accepted standard, so there are already many pre-built images that can be deployed on Kubernetes.

Containerization allows the creation of self-contained Linux execution environments. Any application and all its dependencies can be bundled up into a single file. Containers allow powerful CI (continuous integration) and CD (continuous deployment) pipelines to be formed as each container can hold a specific part of an application. Containers are the underlying infrastructure for Microservices.

Microservices are a software development technique, an architectural style that structures an application as a collection of loosely coupled services. The benefit of decomposing an application into different smaller services is that it improves modularity. This makes the application easier to understand, develop, test, and deploy.

Pod

Kubernetes doesn’t run containers directly. Instead, it wraps one or more containers into a higher-level structure called a pod. Any containers in the same pod will share the same Node and local network. Containers can easily communicate with other containers in the same pod as though they were on the same machine while maintaining a degree of isolation from others.

Pods are used as the unit of replication in Kubernetes. If your application becomes too heavy and a single pod instance can’t carry the load, Kubernetes can be configured to deploy new replicas of your pod to the cluster as necessary. Even when not under heavy load, it is standard to have multiple copies of a pod running at any time in a production system to allow load balancing and failure resistance.
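
A minimal sketch of a two-container pod, using the official Kubernetes Python client (names and images are placeholders): both containers share the pod's IP and network namespace, so the web server can reach the cache on localhost.

    from kubernetes import client, config

    config.load_kube_config()

    # Two containers in one pod: one IP, one localhost, one scheduling unit.
    pod = {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "web-with-cache", "labels": {"app": "web"}},
        "spec": {
            "containers": [
                {"name": "web", "image": "nginx:1.25",
                 "ports": [{"containerPort": 80}]},
                # Hypothetical helper; the web container reaches it at localhost:6379.
                {"name": "cache", "image": "redis:7"},
            ]
        },
    }

    client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)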

Deployments

The “deployment” manages the pods.

Although pods are the basic unit of computation in Kubernetes, they are not typically launched directly on a cluster. Instead, pods are usually managed by one more layer of abstraction: the deployment. A deployment's purpose is to declare how many replicas of a pod should be running at a time. When a deployment is added to the cluster, it will automatically spin up the requested number of pods and then monitor them. If a pod dies, the deployment will automatically re-create it. Using a deployment, you don't have to deal with pods manually; you can just declare the desired state of the system, and it will be managed for you automatically.
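
For example, a Deployment that keeps three replicas of a small web pod running could be created as sketched below with the official Kubernetes Python client; the name, image and replica count are illustrative.

    from kubernetes import client, config

    config.load_kube_config()

    # Declare the desired state: three replicas of a pod labeled app=web.
    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": "web"},
        "spec": {
            "replicas": 3,
            "selector": {"matchLabels": {"app": "web"}},
            "template": {
                "metadata": {"labels": {"app": "web"}},
                "spec": {
                    "containers": [
                        {"name": "web", "image": "nginx:1.25",
                         "ports": [{"containerPort": 80}]}
                    ]
                },
            },
        },
    }

    client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)

If a node holding one of the replicas fails, the Deployment controller notices the missing pod and schedules a replacement elsewhere; scaling up or down is just a change to the replicas field.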

Example of a web application deployment over Kubernetes

Service Mesh

Service/Micro-service: A Kubernetes Service is an abstraction which defines a logical set of pods and a policy by which to access them. Services enable loose coupling between dependent pods.
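
A minimal sketch of such a Service, again with the official Kubernetes Python client (name and port numbers are placeholders): it gives the set of pods labeled app=web a single stable virtual IP and load-balances traffic across them.

    from kubernetes import client, config

    config.load_kube_config()

    # A stable ClusterIP in front of every pod carrying the app=web label.
    service = {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": "web"},
        "spec": {
            "selector": {"app": "web"},
            "ports": [{"port": 80, "targetPort": 80}],
        },
    }

    client.CoreV1Api().create_namespaced_service(namespace="default", body=service)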

What is a service mesh?

The term service mesh describes the network of microservices that make up a distributed application and the interactions between them. As a service mesh grows in size and complexity, it can become harder to understand and manage. Its requirements can include discovery, load balancing, failure recovery, metrics, and monitoring. A service mesh also often has more complex operational requirements, like A/B testing, canary releases, rate limiting, access control, and end-to-end authentication.

One of the most popular plugins for controlling a service mesh is Istio, an open-source, platform-independent service mesh that provides the fundamentals you need to successfully run a distributed microservices architecture.

Istio, one of the most popular plugins to control a service mesh.

Istio provides behavioral insights and operational control over the service mesh as a whole, offering a complete solution to satisfy the diverse requirements of microservice applications.

With Istio, each instance of an application has its own sidecar container. This sidecar acts as a service proxy for all outgoing and incoming network traffic.
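
One common way to get those sidecars is automatic injection: labeling a namespace with istio-injection=enabled tells Istio's admission webhook to add the proxy container to every new pod in that namespace. A sketch with the Kubernetes Python client, assuming Istio is already installed and using a placeholder namespace name:

    from kubernetes import client, config

    config.load_kube_config()

    # Every pod created in the "demo" namespace after this patch gets an
    # Istio sidecar proxy injected automatically.
    client.CoreV1Api().patch_namespace(
        name="demo",
        body={"metadata": {"labels": {"istio-injection": "enabled"}}},
    )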

Networking

At its core, Kubernetes Networking has one important fundamental design philosophy:

Every Pod has a unique IP.

The Pod IP is shared by all the containers inside, and it’s routable from all the other Pods. A huge benefit of this IP-per-pod model is there are no IP or port collisions with the underlying host. There is no need to worry about what port the applications use.

With this in place, the only requirement Kubernetes has is that Pod IPs are routable/accessible from all the other pods, regardless of what node they’re on.

In the Kubernetes networking model, a few rules are enforced as fundamental requirements in order to reduce complexity and make application porting seamless (a short sketch after the list shows how to inspect the resulting per-pod IPs):

  • Containers can communicate with all other containers without NAT.
  • Nodes can communicate with all containers without NAT, and vice-versa.
  • The IP that a container sees itself as is the same IP that others see it as.
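
For a network engineer, a practical consequence is that every pod IP is visible through the Kubernetes API. The sketch below uses the official Kubernetes Python client to print each pod's IP and the node hosting it (kubectl get pods -o wide shows the same information):

    from kubernetes import client, config

    config.load_kube_config()

    # Each pod has its own routable IP; print it next to the node that hosts it.
    for pod in client.CoreV1Api().list_pod_for_all_namespaces().items:
        print(f"{pod.metadata.namespace}/{pod.metadata.name}  "
              f"pod IP: {pod.status.pod_ip}  node: {pod.spec.node_name}")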

There is a vast number of network implementations for Kubernetes. Among them, Flannel and Calico are probably the most popular, and both are used as network plugins through the Container Network Interface (CNI). CNI can be seen as the simplest possible interface between container runtimes and network implementations, with the goal of creating a generic, plugin-based networking solution for containers.

Flannel, a popular network implementation for Kubernetes

Flannel can run using several encapsulation backends with VXLAN being the recommended one.

L2 connectivity is required between the Kubernetes nodes when using Flannel with VXLAN.

Due to this requirement, the size of the fabric might be limited: if a pure L2 network is deployed, the number of racks connected is limited by the number of ports on the spine switches.

To overcome this issue, it is possible to deploy an L3 fabric with VXLAN/EVPN at the leaf level. L2 connectivity is then provided to the nodes on top of a BGP-routed fabric that can scale easily. VXLAN packets coming from the nodes are encapsulated into VXLAN tunnels running between the leaf switches.

 

Mellanox Spectrum

The Mellanox Spectrum ASIC provides huge value when it comes to VXLAN throughput, latency and scale. While most switches can support up to 128 remote VTEPs, meaning up to 128 racks in a single fabric, the Mellanox Spectrum ASIC supports up to 750 remote VTEPs, allowing up to 750 racks in a single fabric.

Learn more about Mellanox Spectrum EVPN VXLAN Differentiators

Calico, a pure IP Networking fabric in Kubernetes clusters

Calico is not really an overlay network but can be seen as a pure IP networking fabric (leveraging BGP) in Kubernetes clusters across the cloud.

A typical Calico deployment looks as follows:

Calico AS Design options:

In a Calico network, each endpoint is a route. Hardware networking platforms are constrained by the number of routes they can learn, usually in the range of tens to hundreds of thousands of routes. Route aggregation can help, but that is usually dependent on the capabilities of the scheduler used by the orchestration software (e.g. OpenStack).

When choosing a switch for your Kubernetes deployment, make sure its routing table size allows a scale that will not limit your Kubernetes compute scale.

Mellanox Spectrum

The Mellanox Spectrum ASIC provides a fully flexible table size, enabling up to 176,000 IP route entries with Spectrum1 and up to 512,000 with Spectrum2, supporting the largest Kubernetes clusters run by the biggest enterprises worldwide.

Routing stack consistency across the physical network and Kubernetes

There are two common routing stacks used with Calico: BIRD and FRR.

 

When working with Cumulus Linux OS on the switch layer, you would probably want to use FRR as the routing stack on your nodes, leveraging BGP unnumbered.

If you are looking for a purely open-source solution, you should check out the Mellanox Linux Switch, which supports both FRR and BIRD as the routing stack.

Network Visibility challenges when working with Kubernetes

Containers are automatically spun up and destroyed as needed on any server in the cluster. Since the containers live inside a host, they can be invisible to network engineers, who may never know where they are located or when they are created and destroyed.

Operating modern agile data centers is notoriously difficult with limited network visibility and changing traffic patterns.

By using Cumulus NetQ on top of Mellanox Spectrum switches running the Cumulus OS, network engineers can get wide visibility into Kubernetes deployments and operate in these fast-changing dynamic environments.


Mellanox Spectrum Linux Switch Powered by SwitchDev

Spectrum Linux Switch enables users to natively install and use any standard Linux distribution as the switch operating system on the Open Ethernet Mellanox Spectrum™ switch platforms and ASIC.

The Spectrum Linux Switch is enabled by switchdev, a Linux kernel driver model for Ethernet switches. It breaks the dependency on vendor-specific, closed-source software development kits (SDKs).

The open-source Linux driver is developed and maintained in the Linux kernel, replacing proprietary APIs with standard Linux kernel interfaces to control the switch hardware. This allows off-the-shelf Linux-based networking applications to operate on the Spectrum switch, including L2 switching, L3 routing, and IP tables (ACLs) at hardware-accelerated speeds.

On top of the above, switchdev enables native control over temperature, LEDs and fans directly through the Linux user interface.

The combination of the Open Ethernet Spectrum switch and Switchdev driver provides users with the flexibility to choose the best hardware platform and software solution for their needs, resulting in optimized data center performance, lower cost of ownership and higher return on investment.

Installing network switches with a standard Linux distribution turns them into yet another server in the data center. This greatly reduces management efforts, as the same configuration and monitoring tools can be used for both servers and switches.

 

Linux application, OS and kernel driver on Mellanox Spectrum

 

The Mellanox Spectrum ASIC based Switches

The Mellanox Open Ethernet Switch portfolio is fully based on the Spectrum ASIC, providing the lowest latency for 25G/100G in the market, zero packet loss and a fully shared buffer: the ideal combination for cloud networking demands.

The Mellanox Spectrum switch systems are an ideal Spine and Top of Rack solution, allowing flexibility, with port speeds ranging from 10Gb/s to 100Gb/s per port, and port density that enables full rack connectivity to every server at any speed.

Check out this report by the Tolly Group and get details about our unmatched ASIC performance. Read it to understand the fundamental differences between Mellanox Spectrum and Broadcom Tomahawk based switches.

By using the Mellanox switches as your building blocks, you will be able to build a high performing leaf/spine data center.

The Mellanox Spectrum switch systems

 

Read more about leaf/spine design best practices.

The use of the Linux Switch aligns well with Linux-based servers and containerized workload deployments

The drive for Containers in the data center requires technologies such as RoH (Routing on the Host), meaning that the same routing stack can run on both the Servers and the Switches by using SwitchDev.

Mellanox Technologies is the first hardware vendor to use the Switchdev API to offload the kernel’s forwarding plane to a real ASIC, allowing full line rate performance of Bridge, Router, ACLs, Tunnels and OVS without traffic going via the Kernel (CPU).

As an example, two servers are connected to ports Eth1 and Eth2 and L2 connectivity is needed between them, so a bridge is created in Linux user space with the ports in VLAN 10. In the image below we can see the difference between rules that are offloaded and rules that are not; a command-level sketch of the bridge setup follows the image.

The difference between Linux user space forwarding with SwitchDev offload on Spectrum and without
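
As a rough sketch of the user-space side of that example, the snippet below runs standard iproute2/bridge commands (wrapped in Python for consistency with the earlier examples) to build a VLAN-aware bridge over the two ports. The port names are placeholders, and the rules are offloaded to the Spectrum ASIC only when the ports are backed by a switchdev driver such as mlxsw.

    import subprocess

    def run(cmd: str) -> None:
        """Run one iproute2/bridge command, failing loudly on error."""
        subprocess.run(cmd.split(), check=True)

    PORTS = ["swp1", "swp2"]  # placeholder names for the Eth1/Eth2 front-panel ports

    # VLAN-aware bridge in Linux user space; switchdev offloads the FDB/VLAN
    # entries to the switch hardware.
    run("ip link add name br0 type bridge vlan_filtering 1")
    for port in PORTS:
        run(f"ip link set dev {port} master br0")
        run(f"bridge vlan add dev {port} vid 10 pvid untagged")  # untagged member of VLAN 10
        run(f"ip link set dev {port} up")
    run("ip link set dev br0 up")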

 

Mellanox’s current switchdev-based solution is focused on the 100Gb/s Spectrum ASIC switches (SN2000 Series).

This is achieved by using an upstream driver in the Linux kernel. A user can simply buy a switch, install Linux on it like any other server and benefit from the underlying hardware.

SwitchDev offloaded features on Mellanox Spectrum Switches:

Visibility and Maintainability:
  • [ER]SPAN
  • Temperature
  • Fans
  • LED Control
  • ethtool (port counter, FW version, transceiver data)
  • Resource queries
  • RIF counters
  • sFlow

 

Protocols (L2/L3):

  • Bridge – 802.1D
  • VLAN   – 802.1Q
  • LAG
  • LLDP
  • IGMP snooping
  • Unicast v4/v6 router
  • ECMP
  • DCB
  • QoS
  • IGMP flood control
  • 256 VRFs
  • GRE tunnelling
  • Multicast v4/v6 router
  • IPv4/IPv6 weighted ECMP
  • VRRP
  • VxLAN
  • ECN: RED and PRIO
  • OVS

 

ACL:

  • tc-flower offload (a minimal offloaded rule is sketched after this list)
  • Actions: Drop, Forward, Counters, Trap, TC_ACT_OK
  • TC chain template
  • Keys: Port, DMAC, SMAC, Ethertype, IP proto, SIP/DIP (IPv4/6), TCP/UDP L4 port, VLAN-ID, PCP, DSCP, VLAN valid, TCP flags
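
As a minimal sketch of the tc-flower item above (the interface name and match are placeholders, wrapped in Python for consistency with the earlier examples), the snippet installs a single flower rule with skip_sw, which requests hardware-only offload and fails if the ASIC cannot accept the rule.

    import subprocess

    def run(cmd: str) -> None:
        subprocess.run(cmd.split(), check=True)

    PORT = "swp1"  # placeholder front-panel port name

    # Attach a clsact qdisc, then drop IPv4 traffic to 192.0.2.0/24 in hardware only.
    run(f"tc qdisc add dev {PORT} clsact")
    run(f"tc filter add dev {PORT} ingress protocol ip flower skip_sw "
        f"dst_ip 192.0.2.0/24 action drop")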

 

A detailed configuration guide for Linux-based protocols can be found here.

 

As an example, by using Free Range Routing (FRR), you will be able to run a full routing stack on top of SwitchDev.

FRR is an IP routing protocol suite for Linux and Unix platforms which includes protocol daemons for BGP, OSPF, PIM, and many other protocols.

FRR's seamless integration with the native Linux/Unix IP networking stacks makes it applicable to a wide variety of use cases, including connecting hosts, VMs and containers to the network, advertising network services, LAN switching and routing, Internet access routers, and Internet peering.

For detailed documentation of FRR go here!

Mellanox Spectrum for Microsoft Azure SONiC

Everyone agrees that open solutions are the best solutions, but for Ethernet switches there are very few truly open operating systems – until now. Microsoft has open-sourced the network operating system (NOS) they use in Azure, SONiC (Software for Open Networking in the Cloud), created a community around it, and posted it on GitHub. SONiC is a NOS designed to work on many different Ethernet switch ASICs from many vendors. At Mellanox, we have embraced Open Ethernet and, besides just supporting SONiC, have contributed a number of innovations to this important community project.

What is SONiC?

SONiC is an open source Networking OS initiative driven by Microsoft. Running one of the largest clouds in the world, Microsoft has gained a lot of insight into building and managing a global, high performance, highly available, and secure network. Experience has taught them that there is a need for a change in the way they operate and deploy their data centers.

The main requirements were:

  • Use best-of-breed switching hardware for the various tiers of the network.
  • Deploy new features without impacting end users.
  • Roll out updates securely and reliably across the fleet in hours instead of weeks.
  • Utilize cloud-scale deep telemetry and fully automated failure mitigation.
  • Enable Software-Defined Networking software to easily control all hardware elements in the network using a unified structure to eliminate duplication and reduce failures.

To address these requirements, Microsoft pioneered SONiC and open-sourced it to the community, making it available on SONiC GitHub Repository.

SONiC is built on the Switch Abstraction Interface (SAI), which defines a standardized API. Network hardware vendors can use it to develop innovative hardware platforms that can achieve great speeds while keeping the programming interface to ASIC (application-specific integrated circuit) consistent.

The SONiC OS Design

SONiC is the first solution to break monolithic switch software into multiple containerized components. SONiC enables fine-grained failure recovery and in-service upgrades with zero downtime. Instead of replacing the entire switch image for a bug fix, you can now upgrade the flawed container with the new code, including protocols such as Border Gateway Protocol (BGP), without data plane downtime. This capability is a key element in the serviceability and scalability of the SONiC platform.

Containerization also enables SONiC to be extremely extensible. At its core, SONiC is aimed at cloud networking scenarios, where simplicity and managing at scale are the highest priority. Operators can plug in new components, third-party, proprietary, or open sourced software, with minimum effort, and tailor SONiC to their specific scenarios.

The above was written by Yousef Khalidi, CVP Azure Networking, in his blog about SONiC.

 

Full Architecture of SONiC – https://github.com/Azure/SONiC/wiki/Architecture

 

Features SONiC currently supports:

Additional features are in the oven; check out the SONiC Roadmap.

It’s highly recommended to watch the following video to understand how Alibaba uses SONiC in their Data Centers. 

Why should you use Mellanox Spectrum Switch with SONiC?

When choosing a switch to run SONiC on, you should look at two main factors:

  1. Is the switch vendor capable of supporting your deployment at the ASIC, SAI and software levels?
  2. What are the capabilities of the ASIC running under the hood?

Mellanox is the only company participating in all levels of the SONiC development community. We were one of the first companies to develop and adopt SAI, all of our Spectrum family switches are fully supported by SONiC, and we are a major and active contributor to the SONiC OS feature set.

 

The Mellanox Spectrum ASIC based Switches

The Mellanox Open Ethernet Switch portfolio is fully based on the Spectrum ASIC, providing the lowest latency for 25G/100G in the market, zero packet loss and a fully shared buffer: the ideal combination for cloud networking demands.

All of the Mellanox platforms support port splitting via the SONiC OS; they are currently the only platforms that support this feature.

Check out this report by the Tolly Group and get details about our unmatched ASIC performance. Read it to understand the fundamental differences between Mellanox Spectrum and Broadcom Tomahawk based switches.

SONiC can be deployed on any switch in our Ethernet portfolio.

Mellanox Spectrum switch systems are an ideal Spine and Top of Rack solution, allowing flexibility, with port speeds ranging from 10Gb/s to 100Gb/s per port, and port density that enables full rack connectivity to every server at any speed. These ONIE-based switch platforms support multiple operating systems, including SONiC, and leverage the advantages of Open Network disaggregation and the Mellanox Spectrum ASIC capabilities.

 

 

By using the Mellanox switches as your building blocks, you will be able to build a high performing CLOS data center.

Typical leaf spine POD design with BGP as the routing protocol:

 

Scaling to multiple PODs:

 

Read more about leaf/spine design best practices.

 

Interested to learn more about SONiC? Watch this webinar hosted by Mellanox, Apstra and Microsoft.

 

 

Ready to deploy SONiC on Spectrum? Check out this community post.