Features
- Parallel genarray, modarray, foldarray
- Parallel homogeneous multioperator with-loops
- Correct (though inefficient) handling of side-effects
- In-place reshape
Testing
The CFAL benchmarks compile, run, and compute the correct result with one
exception: the quickselect in the initialisation of MG exhausts kernel
resources, as it makes tens of thousands of ShrayMalloc calls. Quickselect is a
horrible algorithm for distributed memory anyway, so I do not think this is a
problem.
Gaussian blur and nbody show reasonable speedups on the cluster.
The fancy 2D stencil code (blocked with overlapping tiles) computes the correct result.
The blocked matmul verifies, but needs to run with SHRAY_CACHELINE=10 in order not to exhaust kernel resources due to the sheer number of segfaults. This suggests we may want to make SHRAY_CACHELINE allocation-dependent and set the default higher than 1 in Shray, e.g. by turning the local part of an allocation into a fixed number of chunks. On my laptop I get the expected speedups for Shray: none from 1 -> 2, 2x from 2 -> 4.
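A rough sketch of what "a fixed number of chunks" could mean, not a proposal for the actual Shray change. It assumes the cache line size is counted in pages, and the function `pages_per_chunk` and all of its parameters are hypothetical names that do not exist in Shray.

```c
#include <stddef.h>

/* Hypothetical helper, not part of Shray: pick a per-allocation cache line
 * size (assumed to be in pages) so that the local part of an allocation is
 * covered by at most target_chunks chunks. */
static size_t pages_per_chunk(size_t local_bytes, size_t page_size,
                              size_t target_chunks)
{
    /* Number of pages in the local part, rounded up. */
    size_t local_pages = (local_bytes + page_size - 1) / page_size;

    /* Pages per chunk, rounded up so we never exceed target_chunks chunks. */
    size_t pages = (local_pages + target_chunks - 1) / target_chunks;

    /* Never go below one page, which matches the current default of 1. */
    return pages > 0 ? pages : 1;
}
```

With this choice the number of faults per allocation is bounded by the chunk count rather than growing with the allocation size, at the cost of fetching more data per fault for large allocations.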
We also need to turn off phm, as intercepting malloc interferes with the GASNet initialisation function in some way.