GASNet is too high-level to implement Shray efficiently. Its algorithm for pinning memory on the HCA does not work properly in the SEGMENT_EVERYTHING configuration, which severely limits GASNet's performance on systems that support RDMA.
Both MPICH and Open MPI can use UCX as their communication layer; UCX is a very good low-level library for distributed memory. SHRAYonUCX is a re-implementation of Shray built on top of UCX. Its shortcoming is that only one rank can run per physical machine, which makes it hard to debug. For this reason, this merge request supports both implementations through the targets distmem_shray and distmem_ucx.
The main difference is that distmem_ucx is embedded in an MPI application (MPI is used for the out-of-band (OOB) connection setup and for routines like MPI_Bcast), so the generated code has the form:
MPI_Init_thread
Shray_Init
...
Shray_Finalize
MPI_Finalize