Merged Thomas Koopman requested to merge thomas/sac2c:distmem-minimal into develop Apr 04, 2024

Features

Parallel genarray, modarray, foldarray
Parallel homogeneous multioperator with-loops
Correct (though inefficient) handling of side-effects
In-place reshape

Testing

The CFAL benchmarks compile, run, and compute the correct result, with one exception: the quickselect in the initialisation of MG exhausts kernel resources, as it performs tens of thousands of ShrayMallocs. Quickselect is a horrible algorithm for distributed memory anyway, so I do not think this is a problem.

Gaussian blur and nbody show reasonable speedups on the cluster.

The fancy 2D stencil code (blocked with overlapping tiles) computes the correct result.

The blocked matmul verifies, but needs to run with SHRAY_CACHELINE=10 in order not to exhaust kernel resources due to the sheer number of segfaults. This suggests we may want to make SHRAY_CACHELINE allocation-dependent and set the default higher than 1 in Shray, e.g. by turning the local part of an allocation into a fixed number of chunks. On my laptop I get the expected speedups for Shray: none from 1 -> 2, 2x from 2 -> 4.
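A rough sketch of the "fixed number of chunks" idea, to make the arithmetic concrete. It is not Shray code: it assumes that SHRAY_CACHELINE counts pages fetched per fault and that pages are 4096 bytes, and the names `cacheline_pages` and `fetches_for_part` are hypothetical.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def ceil_div(a, b):
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)


def cacheline_pages(local_bytes, chunks, page_size=PAGE_SIZE):
    """Pages per fetch such that one node's part splits into at most `chunks` fetches.

    Hypothetical helper: an allocation-dependent replacement for a global
    SHRAY_CACHELINE value, under the assumptions stated above.
    """
    if chunks < 1:
        raise ValueError("chunks must be at least 1")
    local_pages = ceil_div(local_bytes, page_size)
    # Never go below one page per fetch, even for tiny allocations.
    return max(1, ceil_div(local_pages, chunks))


def fetches_for_part(local_bytes, pages_per_fetch, page_size=PAGE_SIZE):
    """Number of fetches needed to read one node's entire part."""
    local_pages = ceil_div(local_bytes, page_size)
    return ceil_div(local_pages, pages_per_fetch)


if __name__ == "__main__":
    part = 4 * 1024 * 1024  # a 4 MiB local part, i.e. 1024 pages
    per_fetch = cacheline_pages(part, chunks=64)
    print(per_fetch, fetches_for_part(part, per_fetch), fetches_for_part(part, 1))
```

With a fixed chunk count, the number of fetches per part stays bounded regardless of allocation size, whereas one page per fetch grows linearly with the allocation.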

We also need to turn off phm, as intercepting malloc interferes with the GASNet initialisation function in some way.

Edited Jun 27, 2024 by Thomas Koopman

Shray-based distributed memory backend
