Many HPC applications suffer from bandwidth and latency constraints as well as from a lack of message progression. These problems become particularly severe for sophisticated scientific codes employing multi-scale methods, adaptive meshes, multi-physics, multi-numerics, and so forth, as they require non-deterministic, non-homogeneous data exchange. Task-based programming can ride to our rescue: in principle, it gives us the opportunity to bring tasks that trigger communication forward and to switch to ready tasks whenever another task would have to wait for incoming data. Tasking is primarily an on-node technique, yet it allows us to mitigate problems of message exchange. In theory, it even facilitates the migration of tasks between ranks. In practice, tasking often fails to deliver good MPI performance, because MPI progression interrupts the computations, task orchestration is expensive, and task migration between ranks is far from trivial.
As part of the NVIDIA HPC weeks 2021 in Japan (https://events.nvidia.com/hpcweek), researchers from the task parallelism cross-cutting project and DiRAC presented two different tasking approaches that use novel intelligent network devices (SmartNICs) to transfer tasks. One approach offloads the complete task scheduling and production into the network, while the other keeps the scheduling on the CPU yet throws ready tasks into the intelligent network. In the former case, the network itself becomes the scheduler, which uses the CPUs as compute servers. In the latter case, the smart network becomes a compute server into which we can throw work. The researchers' presentation discusses the lessons learned and the pros and cons of both approaches.