This is an excerpt of a post published today on the Cisco HPC Networking blog by Joshua Ladd, Mellanox:
At some point in the process of pondering this blog post I noticed that my subconscious had, much to my annoyance, registered a snippet of the chorus to Paul Simon’s timeless classic “50 Ways to Leave Your Lover” with my brain’s internal progress thread. Seemingly, endlessly repeating, billions of times over (well, at least ten times over) the catchy hook that offers one, of presumably 50, possible ways to leave one’s lover – “Hop on the bus, Gus.” Assuming Gus does indeed wish to extricate himself from a passionate predicament, this seems a reasonable suggestion. But, supposing Gus has a really jilted lover; his response to Mr. Simon’s exhortation might be “Just how many hops to that damn bus, Paul?”
HPC practitioners may find themselves asking a similar question, though in a somewhat less contentious context (pun intended.) Given the complexity of modern HPC systems with their increasingly stratified memory subsystems and myriad ways of interconnecting memory, networking, computing, and storage components such as NUMA nodes, computational accelerators, host channel adapters, NICs, VICs, JBODs, Target Channel Adapters, etc., reasoning about process placement has become a much more complex task with much larger performance implications between the “best” and the “worst” placement policies. To compound this complexity, the “best” and “worse” placement necessarily depends upon the specific application instance and its communication and I/O pattern. Indeed, an in-depth discussion on Open MPI’s sophisticated process affinity system is far beyond the scope of this humble blog post and I refer the interested reader to the deep dive talk Jeff Squyres (Cisco) gave at Euro MPI on this topic.
In this posting I’ll only consider the problem framed by Gus’ hypothetical query; How can one map MPI processes as close to an I/O device as possible thereby minimizing data movement or ‘hops’ through the intranode interconnect for those processes? This is a very reasonable request but the ability to automate this process has remained mostly absent in modern HPC middleware. Fortunately, powerful tools such as “hwloc” are available to help us with just such a task. Hwloc usually manipulates processing units and memory, but it can also discover I/O devices and report their locality as well. In simplest terms, this can be leveraged to place I/O intensive applications on cores near the I/O devices they use. Whereas Gus probably didn’t have the luxury to choose his locality so as to minimize the number of hops necessary to get on his bus, Open MPI, with the help of hwloc, now provides a mechanism for mapping MPI processes to NUMA nodes “closest” to an I/O device.
Read the full text of the blog here.
Joshua Ladd is an Open MPI developer & HPC algorithms engineer at Mellanox Technologies. His primary interests reside in algorithm design and development for extreme-scale high performance computing systems. Prior to joining Mellanox Technologies, Josh was a staff research scientist at the Oak Ridge National Lab where he was engaged in R&D on high-performance communication middleware. Josh holds a B.S., M.S., and Ph.D. all in applied mathematics.