In my previous “Call Me Stupid …” blog, I outlined why I think Moore’s Law is done, effectively ended as the driving force behind an exponential improvement in semiconductors every two years. So if Moore’s Law is slowing down and we can’t count on ever-faster processors every year, what does it mean for technologists and data center architects?
One direction that is already under way is multi-chip packaging, where multiple silicon dies are assembled in an advanced package that looks like a single chip from the outside. For example, lift the lid on AMD’s new EPYC CPU and you’ll see eight 7nm CPU chiplets plus a ninth 14nm I/O chiplet. In another example, TSMC recently published an interesting blog that showcased a giant interposer as proof that Moore’s Law is not dead but actually continues.
But as I’ve already stipulated, technology innovation is alive and well. If you redefine Moore’s Law in some contorted way that allows progress at all costs and omits the economic, power, transistor density, and performance improvements, you can make the case that it will continue.
Even as various types of advanced chip packaging technologies are cited as proof that Moore’s Law lives, they are really an indication that future chip progress will come more from vectors other than steady progress in shrinking semiconductor processes. Packaging innovation, however, simply doesn’t deliver all the performance, cost, and power benefits of Moore’s Law and Dennard Scaling. So creative packaging will be an interesting but short-lived innovation vector. The more interesting vectors to pursue will be in two areas:
I’m excited about the prospects of #1, but it is both further out in time and beyond my pay grade to add much to this discussion. So for the remainder of this blog, I’ll focus on architectural innovation.
Interestingly in the original 1965 paper, Gordon Moore observed a doubling of transistors in a chip every 12 months and predicted it would continue, but adjusted this to every two years in 1975, citing the “end of cleverness.” Specifically, he noted that most of the “clever” process innovations such as oxide isolation had already been implemented and therefore the time period for the doubling of device density would stretch out. Fast forward half a century and the classical period of Moore’s Law is finally ending. Ironically, I believe this will lead to the return of cleverness. Cleverness not in the sense of semiconductor process innovation, but rather in design writ large at the system level. Put another way, as the advance of semiconductor processing slows, system and software architecture matter again.
Looking at things from this broader perspective raises the question: what will this mean for computers and even whole data centers? My answer is that the return of cleverness will come not in the field of semiconductor processing that Gordon was referring to in 1975, but in a much higher, system-level form that requires a holistic view of the data center. This new form of cleverness will require that we optimize technology not at the transistor or even chip level, but rather at the data center level – across the entire compute, storage, networking, and software stack. Thus, this return to cleverness is at a fundamental system architecture level and represents tremendous innovation opportunities for the future.
So what sort of cleverness is available? First off, we need to address the reality that the progression of ever faster and denser computers is slowing, while data growth continues unabated. The result is a “Compute-Data” gap that will widen over time.
Without faster computers, the solution is more of them, clustered together with high-performance networks. This scale-out computing is already happening in a massive way, originally driven by the cloud and social media companies who first faced the challenges of mining their ever-larger data sets for business value. These hyperscalers were the first to adopt higher speed networks of 25, 40, and 100 Gb/s and beyond. A good example of clustered computing at massive global scale is the Microsoft Azure cloud data centers.
The reason that data center level innovation is important to the cloud hyperscalers is that the class of workloads, and the data sets they need to process, are truly enormous. They simply can’t be performed on a single computer or even a rack of servers and require operating at cloud scale. Mellanox CTO, Michael Kagan, wrote about this when he suggested the lowly ant as a metaphor in his article: “Think Outside the Computer.”
The scale of these modern workloads demands massive parallelism. And once you start to compute problems at scale in parallel on many different computers, the response time of the slowest machine becomes more important than the average latency. This in turn means that deterministic latencies become vital, which demands hardware implementations. I wrote about the rise of East-West traffic, the tail at scale, and the requirement for deterministic latencies in another blog: “In the Data Center, the Latency Tail Wags the Dog.”
As microservices and distributed computing become the norm, optimizing network latency becomes a critical element of data center architectural innovation. Scale-out computing works extremely well for workloads that are easy to run in parallel, but inevitably there are portions of the task that are serialized, whether to periodically synchronize data between nodes or simply to get more data to continue processing. Here another law comes into play, one that imposes a fundamental limit on how fast a task can be performed. And in this challenge lies the opportunity for cleverness.
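A rough way to see why the slowest machine matters more than the average: suppose a request fans out to many machines and must wait for all of them. The sketch below assumes each machine independently has a 1% chance of responding slowly (that is, of exceeding its own 99th-percentile latency). The function name, the independence assumption, and the fan-out values are illustrative choices, not measurements from any real data center.

```python
def prob_any_slow(fanout: int, slow_prob: float = 0.01) -> float:
    """Probability that at least one of `fanout` machines responds slowly.

    Assumes (for illustration only) that each machine is slow
    independently of the others, with probability `slow_prob`.
    """
    if fanout < 0:
        raise ValueError("fanout must be non-negative")
    if not 0.0 <= slow_prob <= 1.0:
        raise ValueError("slow_prob must be between 0 and 1")
    # P(all machines fast) = (1 - slow_prob) ** fanout; take the complement.
    return 1.0 - (1.0 - slow_prob) ** fanout


if __name__ == "__main__":
    # Hypothetical fan-out sizes, chosen only to show the trend.
    for fanout in (1, 10, 100, 1000):
        share = prob_any_slow(fanout)
        print(f"{fanout:>5} machines: {share:6.1%} of requests wait on a slow machine")
```

Under these assumptions, a request that touches 100 machines waits on at least one slow response roughly 63% of the time, even though each individual machine is slow only 1% of the time.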
Around the exact same time that Gordon Moore first stated his law, another great of the computing world noticed that parallel computing was limited by the parts of the task that had to run sequentially. In 1967 Gene Amdahl presented his eponymous law that run time could be accelerated only up to a certain point. Amdahl’s Law has to do with the speedup of a computing task that consists of a portion that can be accelerated by running in parallel using multi-node or clustered computing and the remaining portion that must be run in a serial fashion. His conclusion was that the benefits of parallel computing are limited by the serial portion of the task as follows:
P=75% meaning 3/4 of the task can be accelerated by running in parallel on multiple nodes
S=2 with two nodes, meaning that the accelerated portion of task runs twice as fast
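For reference, these two parameters plug into the usual textbook form of Amdahl’s Law. The symbols $T_1$ (original run time) and $T_S$ (accelerated run time) are notation assumed here for illustration:

$$
\frac{T_S}{T_1} = (1 - P) + \frac{P}{S} = 0.25 + \frac{0.75}{2} = 0.625,
\qquad
\text{Speedup} = \frac{T_1}{T_S} = \frac{1}{(1 - P) + \frac{P}{S}} = \frac{1}{0.625} = 1.6
$$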
So the overall run time is reduced to 62.5% of the original: the 25% serial portion is unchanged, while the 75% parallel portion is halved to 37.5%. Instead of a 2X speedup, only 1/0.625 = 1.6X is achieved. Bottom line: doubling the number of processors yields an overall speedup of less than 2, because most workloads have only a portion that can be accelerated with parallel processing.
So instead of a constant linear speedup, the performance gain rolls off, and each new node provides less acceleration benefit than the previously added node.
The speedup rolls off because, as more compute nodes are added, the parallelizable portion of the task runs faster while the serialized portion becomes a larger and larger share of the total run time. As more and more nodes are added, the serialized portion dominates run time. In the example above, the serialized portion starts out as just 25% of the single-node run time, but with just 8 nodes it rises to about 73% of the overall run time, limiting future scaling benefits.
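The roll-off is easy to reproduce. The sketch below tabulates the speedup and the serial share of run time for the 75%-parallel example, assuming (as the example does) that N nodes run the parallel portion exactly N times faster. The function names and the list of node counts are illustrative.

```python
def run_time_fraction(parallel_fraction: float, nodes: int) -> float:
    """Run time on `nodes` nodes, relative to a single node (Amdahl's Law)."""
    serial = 1.0 - parallel_fraction
    return serial + parallel_fraction / nodes


def speedup(parallel_fraction: float, nodes: int) -> float:
    """Overall speedup compared with running on one node."""
    return 1.0 / run_time_fraction(parallel_fraction, nodes)


def serial_share(parallel_fraction: float, nodes: int) -> float:
    """Fraction of the accelerated run time spent in the serial portion."""
    serial = 1.0 - parallel_fraction
    return serial / run_time_fraction(parallel_fraction, nodes)


if __name__ == "__main__":
    P = 0.75  # parallelizable fraction from the example above
    previous = 1.0
    for nodes in (1, 2, 4, 8, 16, 32):
        s = speedup(P, nodes)
        # "gain" is the improvement over the previous row of the table.
        print(f"{nodes:>3} nodes: speedup {s:5.2f}x, "
              f"gain {s - previous:+.2f}, "
              f"serial share {serial_share(P, nodes):4.0%}")
        previous = s
```

Each doubling of the node count buys a smaller gain than the one before, while the serial share climbs from 25% on one node to about 73% on eight.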
This has an impact well beyond computing because, to achieve scalability and robustness, modern workloads (including storage, database, Big Data, artificial intelligence, and machine learning) have been designed to run as distributed applications.
So Amdahl’s Law represents a practical limit to the benefit of parallelization. But this is not the same as the fundamental limit that the laws of physics impose on semiconductors, which is slowing Moore’s Law to a crawl. Continued acceleration is possible simply by redefining the overall task so as to make previously serialized operations parallelizable.
Simple to say, but not as simple to do. It requires a deep understanding of distributed applications and wholesale rethinking of complex algorithms and data flows. When a workload is looked at deeply and holistically, there are tremendous opportunities for innovation and speedup along the compute, storage, and networking vectors of the system. For example, operations that were previously performed sequentially on a CPU can be parallelized on domain-specific compute engines such as GPUs, which are extremely well suited to running graphics and machine learning tasks in parallel.
In addition, many of the serialized tasks are related to periodically sharing data between nodes at synchronization points using collective operations that allow the parallel computing to continue. These serialized tasks are precisely the ones that the network can accelerate! This requires an intelligent network that processes data as it moves and heralds a new class of domain-specific processors called IPUs, or I/O Processing Units. A good example of the IPU is our new BlueField-2 SOC (System-On-a-Chip), which we announced at the VMworld 2019 conference.
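To see why moving work out of the serial portion pays off, note that Amdahl’s Law caps the speedup at 1/(1 - P) no matter how many nodes are added. The sketch below compares that ceiling for the 75% example with two hypothetical redesigned workloads (90% and 99% parallelizable); those two percentages and the 64-node cluster size are assumptions chosen purely for illustration.

```python
def speedup(parallel_fraction: float, nodes: int) -> float:
    """Amdahl speedup when `nodes` nodes share the parallel portion evenly."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / nodes)


def speedup_ceiling(parallel_fraction: float) -> float:
    """Limit of the Amdahl speedup as the node count grows without bound."""
    if parallel_fraction >= 1.0:
        return float("inf")  # a fully parallel task has no serial ceiling
    return 1.0 / (1.0 - parallel_fraction)


if __name__ == "__main__":
    NODES = 64  # hypothetical cluster size
    for p in (0.75, 0.90, 0.99):
        print(f"P = {p:.0%}: {speedup(p, NODES):6.2f}x on {NODES} nodes, "
              f"ceiling {speedup_ceiling(p):6.1f}x")
```

With 75% of the task parallelizable, no number of nodes gets past 4X; raising that to 90% lifts the ceiling to 10X, which is why shrinking the serialized portion matters more than adding nodes.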
It’s pretty clear that as Moore’s Law ends we will see a return of cleverness. This time, though, the cleverness will be not in semiconductor processing but at the system architecture level. And it will manifest itself no longer just at chip scale or server level but at data center scale, including smart networking. That is exactly why the innovation of the next decade will come from those who understand the problems that need to be solved holistically – encompassing compute, storage, networking, and software.
Moore’s Law is dead and Amdahl’s Law still applies, but the return of cleverness means innovation and the development of faster data processing can continue at the server, network, application and data center level.