In my first blog on Ceph, I explained what it is and why it’s hot. But what does Mellanox, a networking company, have to do with Ceph, a software-defined storage solution? The answer lies in the Ceph scale-out design. And some empirical results are found in the new “Red Hat Ceph Storage Clusters on Supermicro storage servers” reference architecture published August 10th.
Ceph has two logical networks, the client-facing (public) and the cluster (private) networks. Communication with clients or application servers is via the former while replication, heartbeat, and reconstruction traffic run on the latter. You can run both logical networks on one physical network or separate the networks if you have a large cluster or lots of activity.
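As a rough illustration of that split, the sketch below renders a minimal `[global]` section of a ceph.conf file with one subnet per logical network. The `public_network` and `cluster_network` option names are Ceph's own; the subnets and the helper name `render_ceph_networks` are assumptions made up for this example.

```python
import configparser
import io


def render_ceph_networks(public_cidr, cluster_cidr=None):
    """Return a ceph.conf [global] fragment for the two logical networks.

    If cluster_cidr is None, only the public network is set, so both
    logical networks share the same physical network.
    """
    config = configparser.ConfigParser()
    config["global"] = {"public_network": public_cidr}
    if cluster_cidr is not None:
        config["global"]["cluster_network"] = cluster_cidr
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


# Hypothetical subnets, for illustration only.
print(render_ceph_networks("192.168.10.0/24", "192.168.20.0/24"))
```

Leaving out the second argument models the single-network case, where client and replication traffic share one physical network.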
Figure 1: Logical diagram of the two Ceph networks
How Big Is Your Network Pipe?
Ceph’s data protection supports both simple replication for higher performance and erasure coding for greater space efficiency with large objects. Simple replication makes 3 copies of the original data, while erasure coding makes approximately 1.5 copies (depending on the erasure coding scheme used). So 1TB of data written into Ceph on the public network generates either 3TB or 1.5TB of write traffic on the cluster network, whereas 1TB of reads generates only 1TB of cluster traffic.
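The 3x and roughly 1.5x figures above can be reproduced with a few lines of arithmetic. This is only a sketch that follows the simplification used in this post (every stored copy is counted as cluster traffic). The 4+2 scheme is an assumption picked because it yields exactly 1.5x; the function names are hypothetical.

```python
def replication_traffic_tb(data_tb, copies=3):
    """Cluster write traffic when every object is stored `copies` times."""
    return data_tb * copies


def erasure_coding_traffic_tb(data_tb, k, m):
    """Cluster write traffic for a k+m erasure coding scheme.

    Each object is split into k data chunks plus m coding chunks,
    so the stored volume is (k + m) / k times the original.
    """
    return data_tb * (k + m) / k


print("3x replication:", replication_traffic_tb(1.0), "TB per TB written")
print("4+2 erasure coding:", erasure_coding_traffic_tb(1.0, 4, 2), "TB per TB written")
print("6+2 erasure coding:", round(erasure_coding_traffic_tb(1.0, 6, 2), 2), "TB per TB written")
```

A wider scheme such as 6+2 lowers the overhead further, which is why the "1.5 copies" figure depends on the scheme chosen.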
If a node fails, reconstruction generates a large amount of traffic to copy or reconstruct the lost data onto another node. Failure of a node with 72TB (e.g. 24 x 3TB drives) means 72TB of data must be copied. Assuming replication is limited by the network, at 10GbE that reconstruction takes about 16 hours, while at 40GbE it only takes 4 hours. (Reconstruction of one 3TB drive would speed up from 40 minutes to 10 minutes.) Even if reconstruction only needs 10GbE of bandwidth, using a 40GbE cluster network prevents reconstruction traffic from slowing down normal Ceph performance to support clients and applications.
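The reconstruction times quoted above follow from dividing the data volume by the link rate. The sketch below makes that explicit, assuming decimal terabytes, a link running at its full nominal rate, and no protocol overhead; real rebuilds will be slower. The function name `transfer_hours` is hypothetical.

```python
def transfer_hours(data_tb, link_gbps):
    """Hours needed to move data_tb terabytes over a link_gbps link.

    Assumes 1 TB = 1e12 bytes and a link used at full nominal rate.
    """
    bits = data_tb * 1e12 * 8
    seconds = bits / (link_gbps * 1e9)
    return seconds / 3600


for link_gbps in (10, 40):
    node_hours = transfer_hours(72, link_gbps)
    drive_minutes = transfer_hours(3, link_gbps) * 60
    print(f"{link_gbps}GbE: 72TB node in {node_hours:.1f} h, "
          f"3TB drive in {drive_minutes:.0f} min")
```

Under these assumptions the results match the figures in the text: 16 hours versus 4 hours for the node, and 40 minutes versus 10 minutes for a single drive.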
Figure 2: Big network pipes can speed up Ceph throughput
But Size Isn’t Everything
Even when bandwidth is sufficient, latency also matters. If a network were a water pipe, bandwidth would be how much water can go through at once, while latency would determine how quickly you can start or stop the water flow or shift delivery from one pipe to another. Latency doesn’t matter much for big sequential data flows but can make a big difference for small random I/O, as well as for metadata, control, or monitoring traffic. Testing done by Mellanox showed that larger Ceph clusters benefit from running the public and cluster networks on separate physical networks so client and monitor traffic don’t interfere with replication and heartbeat traffic, resulting in higher throughput and lower latency for both. Alternatively, if the switches support enough bandwidth and lossless Ethernet (as Mellanox switches do using DCBx features like Priority Flow Control), one shared 40GbE network may be enough for a smaller Ceph cluster.
Testing by Red Hat and Supermicro showed that using one Mellanox 40GbE network can lower latency by as much as 50% compared to two 10GbE networks. So even when the extra bandwidth isn’t needed, a faster network can still improve Ceph performance in other ways.
When Do I Need a Faster Network for Ceph Using Hard Drives?
A general rule of thumb is that with more than 20 HDDs per Ceph server, a single 10GbE link per server is not enough to support sequential read throughput. Of course this depends on the type of CPU, HDD, Ceph version, and drive controller/HBA, and whether you use simple replication or erasure coding. Customers deploying performance-optimized Ceph clusters with 20+ HDDs per Ceph OSD server should seriously consider upgrading to 40GbE. Using 3x simple replication, Supermicro found a server with 72 HDDs could sustain 2000 MB/s (16 Gb/s) of read throughput, and the same server with 60 HDDs + 12 SSDs sustained 2250 MB/s (18 Gb/s).
A different test (with a different server configuration) by Scalable Informatics showed a Ceph server with 60 HDDs could sustain 3000 MB/s (24 Gb/s) of read throughput using simple replication and 2500 MB/s (20 Gb/s) using 6+2 erasure coding. Even Cisco published an architecture showing a Ceph server generating almost 20 Gb/s of read throughput with 28 HDDs and 30 Gb/s with 56 HDDs; in that situation they bonded 4x10GbE ports together to get the necessary bandwidth. Best-case sequential write performance per server for both Scalable Informatics and Cisco also slightly exceeded what one 10GbE link can carry.
So remember—lots of hard drives in a Ceph server means you need a faster network for the best throughput.
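To see how the measured read rates above compare with link capacity, the sketch below converts MB/s to Gb/s and counts how many 10GbE links each result would need. It assumes decimal units and ignores protocol overhead and bonding inefficiency, so it is a lower bound rather than a sizing rule. The function names are hypothetical.

```python
import math


def mbps_to_gbps(megabytes_per_s):
    """Convert MB/s to Gb/s using decimal units (1 Gb = 1000 Mb)."""
    return megabytes_per_s * 8 / 1000


def links_needed(megabytes_per_s, link_gbps=10):
    """Minimum number of links of link_gbps each to carry the throughput."""
    return math.ceil(mbps_to_gbps(megabytes_per_s) / link_gbps)


# Per-server read throughput figures quoted in the text, in MB/s.
measured = {
    "Supermicro, 72 HDDs": 2000,
    "Supermicro, 60 HDDs + 12 SSDs": 2250,
    "Scalable Informatics, 60 HDDs, replication": 3000,
    "Scalable Informatics, 60 HDDs, 6+2 erasure coding": 2500,
}

for name, mb_s in measured.items():
    print(f"{name}: {mbps_to_gbps(mb_s):.0f} Gb/s, "
          f"{links_needed(mb_s)} x 10GbE or {links_needed(mb_s, 40)} x 40GbE")
```

Every one of these results needs at least two bonded 10GbE links but fits within a single 40GbE link.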
What If My Ceph Node Is All-Flash?
Traditionally Ceph has been used for high-capacity object storage (video, photos, log files, etc.) with the highest priority on low cost/GB, or for streaming workloads (like video serving) with high sequential throughput. Hard drives can be very good at both. But now customers want to use Ceph for transactional or “high IOPS” workloads, where the goal is many input/output operations per second (IOPS). Solid state/flash drives (SSDs) can offer 2x to 5x more throughput and 20x to 100x more IOPS than hard drives. While it might take >20 HDDs per Ceph server to exceed what a 10GbE network can do, it only takes 2-6 SSDs per server to hit that limit. One of the Red Hat-Supermicro tests showed that just 2 fast PCIe SSDs could support 1800 MB/s (14.4 Gb/s) of read throughput per server, or close to what the same server could support with 72 HDDs. Another test using 48 SATA SSDs was able to hit up to 70 Gb/s of Ceph read throughput using a Mellanox 100GbE network.
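A back-of-the-envelope way to check how few SSDs it takes to fill a link is to divide the link rate by the per-drive throughput. In the sketch below, the 900 MB/s per PCIe SSD is derived from the 1800 MB/s measured with 2 drives above; the 500 MB/s SATA SSD figure is purely an assumption for illustration, and the function name is hypothetical.

```python
import math


def drives_to_fill_link(link_gbps, per_drive_mb_s):
    """Smallest drive count whose combined throughput reaches the link rate."""
    link_mb_s = link_gbps * 1000 / 8  # decimal units, no protocol overhead
    return math.ceil(link_mb_s / per_drive_mb_s)


pcie_ssd_mb_s = 1800 / 2   # from the Red Hat-Supermicro result in the text
sata_ssd_mb_s = 500        # assumed value, not taken from the tests above

for link_gbps in (10, 40):
    print(f"{link_gbps}GbE: "
          f"{drives_to_fill_link(link_gbps, pcie_ssd_mb_s)} PCIe SSDs or "
          f"{drives_to_fill_link(link_gbps, sata_ssd_mb_s)} SATA SSDs")
```

Actual Ceph throughput per drive depends on CPU, replication scheme, and workload, so treat the drive counts as rough indicators only.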
Figure 3: Want faster Ceph performance? Call the Flash!
Better Price Performance
Customers often worry that 40GbE networking is too expensive for storage networking. But in today’s competitive market, 40GbE networking carries 4x as much data as 10GbE yet costs only about 2x as much. Often Mellanox 40GbE switches cost only 1.5x as much per port as popular 10GbE switches. Getting 4x the performance at only 1.5x or 2x the cost is cost-effective and gives you room to grow Ceph performance and cluster size. And even if your Ceph deployment only needs 10, 15 or 20 Gb/s of bandwidth per server, one 40GbE network can be simpler to set up and less expensive than bonding multiple 10GbE ports.
|Feature|Typical 10GbE|Mellanox 40GbE|
|---|---|---|
|Bandwidth|10 Gb/s nominal|40 Gb/s nominal|
|Ceph read throughput|1x per server|Up to 2x per server|
|Relative Ceph latency|1x|As low as 0.5x|
|Packet forwarding|Often store-and-forward|Almost always cut-through|
|Unified Ceph network|Only for tiny clusters|OK for small/medium clusters|
|Setup complexity|May need to bond links|One link per network|
|Support all-flash throughput|Only with multiple links|Yes|
|Relative cost per port|1x|1.5x to 2x|
|Upgrade path to 25/40/50/56/100GbE|No|Yes (Spectrum switch for 25/50/100GbE)|
|RDMA option|Adapters no, switch maybe|Yes, adapters and switches|
Figure 4: Comparing network options for Ceph
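The price/performance argument comes down to cost per unit of bandwidth. The short sketch below works that out from the relative cost per port (1.5x to 2x) and the 4x bandwidth ratio given above; the function name is hypothetical, and real pricing varies by vendor and over time.

```python
def relative_cost_per_gbps(cost_multiple, bandwidth_multiple=4.0):
    """Cost per Gb/s of 40GbE relative to 10GbE (10GbE = 1.0)."""
    return cost_multiple / bandwidth_multiple


for cost_multiple in (1.5, 2.0):
    ratio = relative_cost_per_gbps(cost_multiple)
    print(f"{cost_multiple}x port cost -> {ratio:.3f}x the cost per Gb/s of 10GbE")
```

At these ratios each Gb/s on 40GbE costs between roughly 0.375x and 0.5x what it costs on 10GbE.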
Faster Storage Needs Faster Networks
We see with Ceph something we’re seeing across the storage world. As storage gets faster, you need a faster network. Flash is making the storage hardware faster, Ceph software is also getting faster, and going to 40GbE is more affordable than most people think. So look for 40GbE networks (and later 25/50/100GbE networks) to play a growing role in Ceph deployments.
What’s Next for Ceph and Networking
Mellanox is working with Red Hat, Samsung, SanDisk, Scalable Informatics, Supermicro, and others to do more Ceph testing using all-flash OSD servers. Some of these results were recently presented at the 2015 Flash Memory Summit, and more will be covered in this blog in the next few months!