See SHRAYonUCX#1
Memory registration is still a bottleneck, but this fix mitigates it somewhat. On Snellius, the full multigrid benchmark runs 6x faster than before the fix, but remains 5x slower than a pure MPI implementation.