All posts by Elad Wind

About Elad Wind

Elad Wind is currently Director of Solutions Engineering, promoting the adoption of Mellanox interconnect solutions by hyperscalers. Since 2010, Elad has served in various technical and sales roles at Mellanox, including Product Sales and Project Management. Elad was also a founding member of Mellanox Singapore, the APAC head office. Elad holds an MBA from Tel-Aviv University and ESSEC Business School Paris, and a Bachelor of Science degree in Electrical Engineering from the Technion, Israel.

OpenBMC Automates Cloud Operations

By Elad Wind and Yuval Itkin

There’s a saying in the IT industry that when you go cloud, everything must scale to maximum levels. Server design, power consumption, network architecture, compute density, component costs, heat dissipation and all their related aspects must be designed for efficiency when one administrator might oversee thousands of servers in a datacenter. Likewise, all hardware and software operations must be automated as much as possible: setup, installation, upgrades, replacements, repair, migration and monitoring. Each network or server admin is responsible for so many machines, connections and VMs or containers that it’s impossible to manually manage, or even monitor, each individual system or process, such as a firmware upgrade or a stuck server that needs rebooting.

The Baseboard Management Controller (BMC) is key to automation

The BMC is a specialized chip in the center of the server handling all “inside the box” management and monitoring operations. BMCs interface with server hardware and monitor input from sensors including temperature, fan speed, and power status. They do not rely on the main operating system (OS), so a BMC can send alerts or accept commands even before the OS has loaded or in case the OS crashes. BMCs can send network alerts to the administrator or operations software. They are also used to remotely reset or power cycle a stuck system or to facilitate a firmware upgrade for a network interface card (NIC).
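The out-of-band logic described above can be sketched in a few lines. This is a minimal illustration only: the sensor names, thresholds, and heartbeat timeout below are our own assumptions, not any vendor's actual BMC firmware behavior.

```python
# Minimal sketch of BMC-style out-of-band monitoring logic.
# Sensor names, thresholds, and the heartbeat timeout are
# illustrative assumptions, not real BMC firmware values.

SENSOR_THRESHOLDS = {
    "cpu_temp_c": 95.0,   # assumed critical CPU temperature
    "fan_rpm": 1000.0,    # assumed minimum fan speed
}

def evaluate_sensors(readings: dict) -> list:
    """Return a list of alert strings for out-of-range sensors."""
    alerts = []
    if readings.get("cpu_temp_c", 0) > SENSOR_THRESHOLDS["cpu_temp_c"]:
        alerts.append("ALERT: CPU temperature critical")
    if readings.get("fan_rpm", 10**9) < SENSOR_THRESHOLDS["fan_rpm"]:
        alerts.append("ALERT: fan speed below minimum")
    return alerts

# Because the BMC does not depend on the host OS, a check like this
# keeps running (and can trigger a remote power cycle) even when the
# OS has crashed or has not yet loaded.
def needs_power_cycle(os_heartbeat_age_s: float, timeout_s: float = 300.0) -> bool:
    return os_heartbeat_age_s > timeout_s
```

The key point the sketch captures is independence from the host: the decision to alert or power-cycle uses only data the BMC gathers itself.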

Traditionally, the BMC runs on a proprietary chip with a proprietary OS. Communication with the BMC relied on the IPMI protocol, where each server platform has a different set of commands and tools matching its specific hardware architecture.

OpenBMC – BMC software for all hardware

At the hyperscale level, architects and administrators don’t want different server brands and models to use different BMC commands or different implementations of IPMI. They might want each server to use the same BMC setup, or need to review/modify the source code for the BMC. Thus a much better solution for large-scale cloud operations is the open source Linux-based OpenBMC, a project pioneered by Facebook that standardizes the interface with BMC hardware. This was the impetus for our implementation.

Mellanox is first in the industry with OpenBMC support for NIC FW upgrades

To support large-scale cloud operations, Mellanox implemented the first uniform NIC OpenBMC interface that solves the complexities around firmware upgrades. OCP server platforms based on OpenBMC can now pull each NIC's firmware image and perform firmware updates on Mellanox NICs.

Easier management for hyperscale – Is this a big deal?

Provisioning has always been a challenge for hyperscale data centers. So far, it has been addressed with vendor tools running on the servers. Hyperscalers, with their massive numbers of servers and network connections, are dealing with the painful inconsistency of APIs and interfaces, which leads to endless effort porting their tool sets to match the different components and drivers across their many different servers and NICs.

Bringing uniformity across devices and event/error logs is tremendously valuable, as it removes the challenge of integrating many multi-vendor, proprietary protocols. With OpenBMC, one set of operating processes, scripts, and monitoring tools can monitor, reboot, and upgrade firmware across many different types of servers.
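The uniformity argument can be made concrete with a short sketch: one update loop, one code path, regardless of server brand. The `Server` class and its methods below are hypothetical stand-ins for the real BMC transport calls.

```python
# Sketch of fleet-wide firmware updates through a uniform interface.
# The Server class is a hypothetical stand-in: in a real deployment the
# update would flow through the BMC, but the tooling no longer needs
# per-vendor branches because every server speaks the same interface.

class Server:
    def __init__(self, name: str, model: str, nic_fw_version: str):
        self.name = name
        self.model = model            # brand/model no longer matters to the tooling
        self.nic_fw_version = nic_fw_version

    def update_nic_firmware(self, image_version: str) -> None:
        # Real code would push a signed image via the BMC;
        # here we only record the resulting version.
        self.nic_fw_version = image_version

def fleet_update(servers: list, target_version: str) -> list:
    """One loop for the whole heterogeneous fleet; returns updated names."""
    updated = []
    for s in servers:
        if s.nic_fw_version != target_version:
            s.update_nic_firmware(target_version)
            updated.append(s.name)
    return updated
```

Without a uniform interface, the body of that loop would need a branch per vendor and per IPMI dialect; with OpenBMC it does not.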

Automated NIC firmware upgrades on Facebook’s OpenBMC

“We are pleased with Mellanox’s implementation of ConnectX NIC support for firmware updates through the OpenBMC. The results are impressive,” said Ben Wei, software engineer on the Facebook OpenBMC development team. “The joint development was based on a draft (Work-In-Progress) DMTF spec that specifically required close, timely cooperation with Mellanox.”

Mellanox OCP NIC – First to support PLDM over RMII Base Transport

Mellanox is a pioneer in OCP server support: our first OCP mezzanine (mezz.) 0.5 NIC, later the OCP mezz. 2.0, and now OCP NIC 3.0 cards are deployed with hyperscalers, enterprises and OEMs. Mellanox is also a leading contributor to OCP system management, system security, and NIC hardware and mechanical designs.

The OCP Hardware Management Project specifies interoperable manageability for OCP platforms and OCP NICs, leveraging DMTF standards to communicate into the BMC.

The NICs serve as the BMCs’ gateway to the outside world, through NCSI, MCTP, PLDM messaging layers and underlying physical transports (see illustration below). Mellanox is actively contributing to the DMTF system management MCTP and PLDM specs to enable more flexible and efficient remote reporting and management for OCP servers.

NICs serve as the BMCs’ gateway to the outside world through NCSI, MCTP, PLDM messaging layers


Standardizing “inside the box” management

The DMTF defines standard protocols for monitoring and control (DSP0248) and for firmware update (DSP0267). The DMTF has declared its intention to introduce standard support for Platform Level Data Model (PLDM) protocols over RMII Based Transport (RBT) as part of NC-SI (DSP0222) revision 1.2. That revision is planned for release this year; we used the published Work-In-Progress document in our implementation.
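To give a feel for what these protocols look like on the wire, here is a simplified encoder for the common three-byte PLDM message header defined in the DMTF base specification (DSP0240): a request bit, an instance ID, the PLDM type (for example, DSP0267 firmware update messages carry their own type code), and a command code. This is a sketch, not a conformant implementation; the datagram bit and header-version field are left at zero.

```python
import struct

def encode_pldm_header(request: bool, instance_id: int,
                       pldm_type: int, command: int) -> bytes:
    """Pack the 3-byte PLDM message header (simplified per DMTF DSP0240:
    the datagram bit and header-version field are kept at zero)."""
    assert 0 <= instance_id < 32      # 5-bit instance ID
    assert 0 <= pldm_type < 64        # 6-bit PLDM type
    assert 0 <= command < 256         # 8-bit command code
    byte0 = (0x80 if request else 0x00) | instance_id  # Rq bit + instance ID
    byte1 = pldm_type & 0x3F                           # hdr version 00 + type
    return struct.pack("BBB", byte0, byte1, command)
```

Whether these three bytes ride over MCTP or over RBT is invisible to the layer above, which is exactly why adding RBT as a transport (next section) required no change to the PLDM payloads themselves.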

RBT is the main management interface in OCP platforms. With PLDM over RBT, OCP platforms leverage the higher throughput of RBT (over RMII) compared to MCTP over SMBus. RBT is also more widely deployed than MCTP over PCIe, and it offers higher availability: the PCIe interface is unavailable when systems are placed into low-power states such as S3, while RBT can still operate in those states. And of course, RBT can operate over Ethernet.

Initially, PLDM protocols could only be supported over MCTP transport. By introducing support for PLDM protocols over RBT – PLDM protocols can now be used in standard OCP platforms. Mellanox supports this new ability to send PLDM information over RBT.

How is Mellanox keeping management and firmware updates secured?

The ability to automate firmware updates over the network using the BMC often raises security concerns. How does one prevent unauthorized firmware from sneaking into a NIC? Fortunately, Mellanox products support Secure Firmware Update, which ensures that only properly signed firmware images, created by Mellanox, can be programmed into Mellanox devices. Secure Firmware Update prevents malicious firmware from being loaded onto Mellanox NICs and is fully supported even when using PLDM over RBT.
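The gating logic is easy to illustrate, with one big caveat: a real secure firmware update verifies an RSA/ECDSA signature against a vendor public key anchored in the device, not a digest list. The SHA-256 allow-list below is a deliberately simplified stand-in for that cryptographic check.

```python
import hashlib

# Simplified illustration of the admission check only. In real hardware
# the device verifies a vendor signature (RSA/ECDSA) with a key it
# trusts; here a SHA-256 digest allow-list stands in for that step.

TRUSTED_DIGESTS = set()  # digests of vendor-signed images (demo assumption)

def register_signed_image(image: bytes) -> None:
    """Record an image the 'vendor' has blessed (stand-in for signing)."""
    TRUSTED_DIGESTS.add(hashlib.sha256(image).hexdigest())

def can_flash(image: bytes) -> bool:
    """Refuse any image whose digest is not on the trusted list."""
    return hashlib.sha256(image).hexdigest() in TRUSTED_DIGESTS
```

The property to take away is the same as in the real mechanism: the decision to flash is made from the image contents alone, so a tampered image is rejected no matter which transport (including PLDM over RBT) delivered it.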

Next steps in system security

The next steps being defined in DMTF and OCP will enable system security protocols for device authentication:

  • Device identification and authentication
  • Providing firmware measurement to platform RoT
  • Securing management protocols (future)

These new standards, along with Mellanox’s plans to support them, will bring further automation and standardization to large-scale datacenter operations.

That’s a lot of acronyms

DMTF – Distributed Management Task Force

For over a decade, the DMTF has been creating open manageability standards to improve the interoperable management of information technologies in servers and desktops.

PMCI – Platform Management Components Intercommunications

The DMTF working group that develops standards to address “inside the box” communication interfaces.

PLDM – Platform Level Data Model

A data model for efficient communications between platform management components. Allows access to low-level platform inventory, monitoring, and control functions including firmware update.

MCTP – Management Component Transport Protocol

A transport-layer protocol that allows running the various “inside the box” management protocols over physical interfaces other than RBT, such as SMBus and PCIe.

Mellanox Joins ODCC—Speeding Up China’s Next Generation Data Center Networks

Mellanox Joins ODCC

Mellanox has joined ODCC to form strategic partnerships with other ODCC members interested in developing powerful and cost-efficient networks, to meet China’s booming demand for data-center capacity.

The Open Data Center Committee (ODCC) is an industry-led non-profit consortium formed by China’s leading technology providers: the Web 2.0 giants Baidu, Alibaba and Tencent (BAT) and the telecom giants China Telecom, China Mobile and China Unicom, backed by Chinese government agencies.

ODCC promotes open hyperscale data center specifications for building an ecosystem that shares proven best practices and designs, and leverages economy-of-scale efficiencies among China’s large players.

As a dedicated supporter and solution developer, Mellanox is leveraging and integrating our differentiated technologies and designs into some of the world’s largest cloud data centers.

Mellanox is a leader in Ethernet networking and looks forward to collaborating with other ODCC members to advance data center architectures specifically by contributing to the Phoenix project, through RoCE enablement, and advancing the Open Ethernet Vision.


Phoenix Adopts Mellanox Spectrum 100/25G Switches

Launched in August 2017 by the ODCC Network Work Group, the goal of the Phoenix Project is to promote whitebox switching, with SONiC running as an open network operating system (NOS).

The Phoenix community is giving its blessing to a known and stable Community SONiC version and packages, which can run on approved switch platforms. Indeed, Phoenix has standardized on Mellanox SN2410 100/25G as the preferred Top-of-Rack (ToR) solution.

Mellanox is committed to sharing the 25GbE + 100GbE (SN2410) switch hardware design, and is the first switch OEM to do so. Spectrum silicon building blocks are an exciting new contribution we provide to China’s industry leaders for high-performing, scaled-out and multi-tenant network systems.


ODCC Adopts RoCE to Boost Data Center Efficiency

ODCC has recognized the crucial role RoCE plays in data center interconnects handling the data storm generated by machine learning, big data analytics and storage applications.

Mellanox is sharing with ODCC members extensive experience gathered on RoCE compatibility testing, construction of lossless networks and application level performance enhancements.


A Shared Open Ethernet Vision – Mellanox and ODCC

  1. Open Source Ecosystem – Mellanox offers a choice of Network Operating Systems to unleash differentiated Spectrum switch systems. Adopting the whitebox open principles from the server world delivers an open-source, fully interoperable and software-based ecosystem.
    Mellanox is one of the major contributors to ONIE, SAI and SONiC community software projects. The Mellanox SONiC contributions span features, code infrastructure and test frameworks; in addition, Mellanox maintains multiple repositories.
  2. High Performance – Data growth shifts the performance bottleneck to the network. Mellanox Spectrum silicon has set the record for packet rate, throughput and latency in the data center.
  3. RoCE Everywhere – Storage, deep learning and big data analytics applications are running natively on RoCE, assuring that the investment made into GPUs and CPUs is put to good use. Mellanox RoCE solutions include hardware-based offloads and intelligent congestion handling to maximize system bandwidth and CPU utilization.
  4. Simple – Modern data centers are easier to deploy and monitor. A single admin can manage thousands of nodes using scripts and networking configuration templates that are easy to scale and manage, reducing the time to troubleshoot and fall back from hours or days to minutes.
  5. Lower Power Consumption – The Chinese government incentivizes data centers to go green. Major power savings come from reducing the compute power in use: RoCE yields almost 100% CPU efficiency, further reducing the need for compute in distributed systems. In addition, Spectrum 100/25G switches hold the record for the lowest Watt per port today; the Mellanox SN2010’s average power consumption is just 57 Watts.


A Bright Future Collaborating with ODCC

As China continues to grow and shape the future of cloud and edge computing landscapes, we are very excited to collaborate with the ODCC on developing high-performance, scale-out and multi-tenant data center architectures.

Autonomous Networking For Real

Self-Driving Data Center

by Phil Clegg (Nutanix) and Elad Wind (Mellanox)

Nutanix is designed from the ground up to simplify datacenter deployment and management.

Business applications are deployed in enterprise clouds in minutes. The simple, intuitive interface allows users to easily create, modify and delete virtual workloads in a cloud-like fashion. The Nutanix PRISM interface is an HTML5 page and/or set of APIs that are already secured and locked down to a production-ready standard.

NEO – You don’t have to be a network guru to build infrastructure

In keeping with this trend, Mellanox partnered with Nutanix to provide network APIs that allow the Mellanox NEO Management and Monitoring Tool to fully understand the virtual networking and run autonomous processes for network configurations.

4 hours a day reduced to just a few minutes a week

Consider the example of a New Zealand-based service provider who wanted a secured platform to deliver network function virtualization. Here, the firewall VMs are spun up via a portal that creates a VM, provides its configuration and assigns it to a customer’s VLAN.

The NEO plugin listens to Nutanix events and reacts so that VLANs are provisioned and terminated transparently.
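The event-driven pattern the plugin follows can be sketched as below. The event shapes and the `TorSwitch` model are our hypothetical simplifications; the real plugin consumes Nutanix events and configures Mellanox switch ports through NEO's APIs.

```python
# Sketch of event-driven VLAN automation in the spirit of the NEO plugin.
# Event dictionaries and the TorSwitch model are hypothetical stand-ins
# for the real Nutanix event stream and NEO switch-configuration APIs.

class TorSwitch:
    def __init__(self):
        self.port_vlans = {}  # port name -> set of allowed VLAN IDs

    def allow_vlan(self, port: str, vlan: int) -> None:
        self.port_vlans.setdefault(port, set()).add(vlan)

    def revoke_vlan(self, port: str, vlan: int) -> None:
        self.port_vlans.get(port, set()).discard(vlan)

def handle_event(switch: TorSwitch, event: dict) -> None:
    """Provision or terminate VLANs transparently as VMs come and go."""
    if event["type"] == "vm_created":
        switch.allow_vlan(event["port"], event["vlan"])
    elif event["type"] == "vm_deleted":
        switch.revoke_vlan(event["port"], event["vlan"])
```

Because the handler reacts to the orchestration events themselves, the physical ToR configuration can never drift out of sync with the virtual topology, which is what eliminates the e-mail-and-wait workflow described next.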

Prior to this automation, VLANs were manually configured via an e-mail request triggered by the orchestration engine. These VLAN requests had a three-day SLA and often took in excess of six days to provision and to resolve any issues.

Simply moving to Mellanox switches and enabling the Mellanox NEO plugin changed this process: VLANs are automatically created, recorded and deployed at both the virtual and physical layers, removing the risk of misconfiguration and the hours employees spent doing this at the top-of-rack infrastructure.

Because everything is logged, it is easy to track the compliance needs of VLAN creation as well as the movement of VLAN settings as VMs move around the Nutanix cluster.

This is estimated to reduce the effort of provisioning VLANs from 4 hours per day to less than 1 hour per week, removing over 90 percent of the rework associated with manual VLAN creation.

NEO and Prism Dashboard

PRISM is an HTML5 interface, with no Flash or Java plugins, used to administer a Nutanix cluster. This intuitive interface allows users to analyze and alert, as well as create, modify and delete all facets of virtualization, including virtual machines, CPU, RAM, storage, networking, snapshots, replication and self-service portals.

The networking portion of Nutanix is not only virtual-switch and VM-based; it also integrates with switch hardware like Mellanox’s and can surface hardware-based networking stats within the PRISM interface. Similarly, NEO adds another window with insights into what’s running inside the fabric, simplifying application development and accelerating application delivery.

About Nutanix One-Click Infrastructure

Hypervisor and VM management provide a consumer-grade experience. Nutanix Prism gives administrators an easy way to manage virtual environments running on Acropolis. It simplifies and streamlines common workflows for hypervisor and virtual machine (VM) management; from VM creation and migration to virtual network setup and hypervisor upgrades. Rather than replicating the full set of features found in other virtualization solutions, virtualization management in Prism has been designed for an uncluttered, consumer-grade experience.

About Mellanox Ethernet Switches

Mellanox’s Spectrum switch technology offers data centers unparalleled performance, letting providers and customers focus only on their applications: Mellanox networks scale easily, offer consistently low latency, run at full wire speed at all packet sizes with zero packet loss, and have up to 15X better microburst resiliency.


The New Mellanox/Cumulus SN2100: A Revolutionary Approach to Top-of-Rack Switching

Does your Data Center / Cloud run racks with 40 or more servers? Then you are probably paying more than you should for your network, and you are probably consuming too much real estate and power. With Mellanox’s SN2100 Top of Rack (ToR) switch you can change all that.

Web-scale companies have created major shifts in data centers by migrating to modular solutions comprised of flexible, dense and economical building blocks. The latest addition to Mellanox’s Spectrum family is part of this wave: an amazing half-width switch carrying sixteen (16) QSFP28 ports of 10/25/40/50/100GbE.

For this collection of storage and cloud example environments, the SN2100 ToR sets new standards for flexibility, efficiency and price performance. Prices are taken from public listings so customers can draw their own conclusions and drive the discussion.

Example 1: Popular 48+4-port ToR configuration: the SN2100 ToR with split ports saves thousands of dollars per rack compared to leading solution providers.

Why use port splits? SN2100 ports split into quad SFP28 ports with Mellanox LinkX® breakout cables. This configuration connects up to 48 nodes running at 10G/25G speeds while keeping four (4) uplink ports of 40G/100G. Splitting ports brings Web-scale innovation and savings into your design, as already adopted by hyperscalers: 1) simpler cable administration, 2) a clear and tidy rack, and 3) 25 percent savings on cables.

Each breakout cable replaces four 10G DAC cables and saves $50. Total cable savings for a common highly available rack of 40+ nodes is more than $1,000. DAC breakouts are commonplace in the data center, with MTBFs exceeding 2,000 years, so customers will rarely, if ever, need to replace them.
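The savings math above can be worked through explicitly. One assumption is ours: each node in a highly available rack is dual-homed to two ToR switches, which is how 40 nodes yield the figure quoted.

```python
# Worked version of the cable-savings arithmetic from the text.
# The dual-homing assumption (two links per node for HA) is ours;
# the $50-per-breakout saving and 4:1 replacement come from the article.

def breakout_savings(nodes: int, links_per_node: int = 2,
                     ports_per_breakout: int = 4,
                     saving_per_breakout: float = 50.0) -> float:
    links = nodes * links_per_node            # total server-facing links
    breakouts = links // ports_per_breakout   # each breakout replaces 4 DACs
    return breakouts * saving_per_breakout

# A highly available rack of 40 dual-homed nodes:
# 80 links -> 20 breakout cables -> $1,000 saved on cabling alone.
```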

Example 2: Storage / SDS / hyperconverged systems: Smaller compute and storage applications typically fill 3-4 rack units (RU). The half-width SN2100 is uniquely packaged: two adjacent switches, mLAG’d (logically link-aggregated), fit in just one additional RU for the best IOPS/RU ratio and maximum Gbps/dollar performance.

A compact network need only occupy a fraction of the prevailing 48+4-port switch. Why pay 3x for ports you don’t need? Why pay for fancy L3 licenses when state-of-the-art L2+L3 MLNX_OS caters for all your network needs? Conclusion: two SN2100 switches mLAG’d address an entire storage solution, with no wasted ports or rack space, saving $30,000 on networking gear per appliance.

Example 3: Leaf and spine with 100G aggregation layer: The SN2100 ToR leverages 100G bandwidth in the aggregation layer to save money over prevailing 40G networks. Do the math: 100G means pipes run 2.5x faster than 40G. Fast pipes translate into fewer optical cables toward the spine, fewer spines and less rack space used overall.
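The "fewer cables" claim reduces to simple arithmetic. The 400G-per-rack uplink requirement below is an assumed example figure for illustration, not taken from the article.

```python
import math

# How many uplinks does a given aggregate bandwidth require?
# The 400G-per-rack requirement is an assumed example, not a spec.

def uplinks_needed(required_gbps: float, link_gbps: float) -> int:
    """Uplink cables needed to carry required_gbps toward the spine."""
    return math.ceil(required_gbps / link_gbps)

# For an assumed 400G of uplink bandwidth per rack:
#   40G links:  10 cables toward the spine
#   100G links:  4 cables (2.5x the speed means 2.5x fewer links)
```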

A fast, reduced network saves thousands of dollars per rack in gear, and saves again in operating costs. The math is as compelling for small 10G networks as it is for scaled spine-and-leaf architectures.

Other integration ideas for the SN2100 include deployments outside the traditional confines of brick-and-mortar buildings, such as field or mobile data center operations.

Mellanox’s SN2100 Open Ethernet switch comes with two pre-configured ONIE-based Network Operating System (NOS) options: Cumulus Linux and MLNX-OS. Open-source standards and cloud management platforms help organizations reduce vendor lock-in and are a viable option for software-defined data centers. Speed can be limited to 40G for an even better price.

Register now for the webinar on June 28 to do the math with the new breed of ToR that delivers network flexibility and efficiency while reducing costs.