Status | Available |
Access arrangements | Sign up at https://safe.epcc.ed.ac.uk/dirac (project do009) |
Organisations | Durham University, Rockport Networks, Dell |
Project linkage | ExaHype, MPPHEA, Exploiting Task Parallelism |
This project is based around adapting a large-scale HPC cluster (up to 226 nodes) to use a novel high-performance 100G switchless networking technology developed by Rockport Networks.
Based today on a 6D torus, and extensible to other topologies, the network technology is highly modular, scales linearly to hundreds of thousands of nodes (and probably beyond), and is well suited to Exascale.
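To give a rough feel for why a direct torus network scales in this way, the sketch below computes the node count, worst-case hop distance, and per-node link count for a torus of given dimensions. It is only an illustration of torus geometry in general; the per-dimension sizes used are assumptions, not the actual configuration of the Rockport testbed.

```python
def torus_properties(dims):
    """Return node count and worst-case hop distance for a wrapped torus
    with the given number of nodes along each dimension."""
    nodes = 1
    diameter = 0
    for size in dims:
        nodes *= size
        diameter += size // 2  # farthest point along a wrapped ring of this size
    return nodes, diameter

if __name__ == "__main__":
    # Illustrative example only: a 6D torus with 3 nodes per dimension.
    dims = [3] * 6
    nodes, diameter = torus_properties(dims)
    print(f"{len(dims)}D torus {dims}: {nodes} nodes, "
          f"worst case {diameter} hops, {2 * len(dims)} links per node")
```

The point of the sketch is that the number of links per node is fixed by the dimensionality, so adding nodes grows the network without growing any central switching element.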
By distributing the network switching function into each device endpoint, the nodes themselves become the network. This gives consistently low latency under load, and therefore predictable workload performance, on networks of any size.
In classic HPC networks, congestion can introduce significant delays for messages, substantially reducing code performance. With the network proposed here, packets are split into small FLITs (flow control units), so small messages are not delayed behind larger messages in the pipeline, delivering consistently low network latency.
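The toy model below illustrates the idea, not Rockport's actual implementation: it assumes a single shared link, one time unit per FLIT, and fair round-robin interleaving, and compares the latency seen by a small message when it must wait behind a whole large packet versus when FLITs from both messages are interleaved.

```python
def whole_packet_finish(large_flits, small_flits):
    """The small message waits until the entire large packet has crossed the link."""
    return large_flits + small_flits

def interleaved_finish(large_flits, small_flits):
    """Round-robin one FLIT from each message per step; return the time at which
    the small message's last FLIT has crossed the link."""
    time, large_left, small_left = 0, large_flits, small_flits
    while small_left > 0:
        if large_left > 0:      # large message sends one FLIT
            large_left -= 1
            time += 1
        small_left -= 1         # small message sends one FLIT
        time += 1
    return time

large, small = 1000, 4          # FLIT counts, illustrative values only
print("whole-packet latency:", whole_packet_finish(large, small))  # 1004 time units
print("interleaved latency: ", interleaved_finish(large, small))   # 8 time units
```

Under these assumptions the small message completes in roughly twice its own FLIT count rather than waiting for the full large packet, which is the behaviour behind the "consistently low latency under load" claim.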
Switchless networking removes network complexity by replacing centralised switch fabrics with a distributed, highly reliable interconnect, providing an intelligent, adaptable and self-healing network that is simple to operate. The testbed Rockport system is large enough to be useful for scientific simulations, and will be invaluable in determining whether this technology and approach are appropriate for future Exascale-class systems.